Snowflake Servers
Note: The following is an extract of the transcript from an interview I gave in 2018, after my talk at the Centiq Experts Forum that year. It’s also published on the Centiq blog and more recently reproduced on my LinkedIn.
What are snowflake servers and why does this happen?
This is a term that comes from the Visible Ops Handbook (you can read more about it on Martin Fowler’s blog), and essentially it’s the problem where, over time, the processes around your IT infrastructure become inconsistent. The reasons that choices were made become lost as different people manage different systems, and keeping those systems alive becomes a very manual, and therefore error-prone, process.
No two snowflakes are alike — and essentially, it’s the same situation here, with snowflake servers.
What are the cost and consequences of this?
The configuration of these servers starts to diverge over time, so you start to see differences in infrastructure that you typically want to be identical, especially across environments.

Imagine you run a project through both Dev and Test environments and it all works fine, then in the critical Production release a problem arises. Quite often, the root cause is an inconsistency introduced by manual management, so you’re putting critical deadlines at risk, and it takes both time and effort to fix.

Then there are also unplanned events where you might need to solve challenges in Production right away. Making sure that those changes are then reflected back into your Pre-production environments is crucial, or you end up with Pre-production configuration drift, which increases both project and service risk.
Worse still, this is latent risk. It can creep up and hit you some time in the future, increasing costs, pushing back deadlines, introducing downtime and more.
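To make that drift idea concrete, a first step many teams take is simply comparing the rendered configuration of two environments. The sketch below is a minimal illustration, assuming each environment’s settings can be read into a flat dictionary (the keys and values here are made up):

```python
# Minimal drift-check sketch: compare the rendered configuration of two
# environments and report any keys whose values have diverged.
# Assumes each environment's settings are available as a flat dictionary
# (the example data below is hypothetical).

def find_drift(reference: dict, candidate: dict) -> dict:
    """Return {key: (reference_value, candidate_value)} for every difference."""
    all_keys = set(reference) | set(candidate)
    return {
        key: (reference.get(key), candidate.get(key))
        for key in all_keys
        if reference.get(key) != candidate.get(key)
    }

production = {"kernel.shmmax": "137438953472", "hana.version": "2.00.045", "swap": "2G"}
preprod    = {"kernel.shmmax": "68719476736",  "hana.version": "2.00.045", "swap": "2G"}

for key, (prod_value, preprod_value) in find_drift(production, preprod).items():
    print(f"DRIFT {key}: production={prod_value} pre-production={preprod_value}")
```

A real tool would pull these values from the live systems or from your configuration management tooling, but the principle is the same: make the differences visible before they become incidents.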
Where should people start in trying to fix this?
Traditionally, people turn toward change control mechanisms. These get associated with heavyweight process and change advisory boards, which meet infrequently and actually start to slow changes down.

So the need to control changes really comes from trying to maintain stability and remove risk. The flip side is that those controls inhibit innovation and prevent business-driven changes from going through.
The following extract from the 2014 State of DevOps report shows what we mean here:
“Peer-reviewed change approval process. We found that when external approval (e.g., change approval boards) was required in order to deploy to production, IT performance decreased. But when the technical team held itself accountable for the quality of its code through peer review, performance increased. Surprisingly, the use of external change approval processes had no impact on restore times, and had only a negligible effect on reducing failed changes. In other words, external change approval boards had a big negative impact on throughput, with negligible impact on stability.”
The more forward-thinking and agile way of looking at this is through automation. Ultimately, consistency is repetition by definition, and repetition is exactly what computers are great at.

This requires understanding, expertise and strategy, so it’s not easy if you haven’t done it before, but it helps deliver the same outcome at scale and at reduced operational cost.

This is especially important in a cloud world, where you are potentially scaling to massive numbers of systems. Doing that manually just isn’t feasible. You also want to remove human intervention where possible, which necessarily means some level of automation.
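As a rough illustration of why consistency and repetition suit computers so well, automated steps are usually written to be idempotent: they describe a desired state and only act when reality differs, so running them once or a hundred times gives the same result. This is a simplified, tool-agnostic sketch (the file path and content are invented for the example):

```python
# Idempotent "ensure" step, in the spirit of configuration management tools:
# declare the desired state, change the system only if it does not match,
# and report whether anything actually changed. Path and content are illustrative.
from pathlib import Path

def ensure_file_content(path: Path, desired: str) -> bool:
    """Make sure `path` holds exactly `desired`; return True if a change was made."""
    if path.exists() and path.read_text() == desired:
        return False                      # already in the desired state: do nothing
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(desired)              # converge to the desired state
    return True

changed = ensure_file_content(Path("/tmp/example-sysctl.conf"), "vm.swappiness = 10\n")
print("changed" if changed else "already consistent")
```

Tools such as Ansible, Puppet, Chef and Salt package this ensure-the-desired-state pattern into reusable, declarative building blocks.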
What are the key technologies for fixing this?
The first thing is to be clear that you can’t deliver this outcome looking at technology alone. It’s about People, Process and Technology.
But once you do get down to the technology level, key tools typically include Chef, Puppet, Ansible and Salt.

Open source tooling is what we lean toward at Centiq because they’re open technologies, which helps when working with cutting-edge technologies such as SAP HANA: you are free to develop those bits yourself and contribute them back to the wider open source community.

Another reason is that they have huge communities contributing to them. They’re built by people who are passionate enough about making them work that they will even do it in their spare time.
What’s the role of software thinking in hardware work?
Infrastructure as code is much more than just having automation. The software-engineering mindset brings not just how to write code, but how to manage it, how to test it, how to release it and how to integrate it into your processes.

When you apply that to an infrastructure world, you move away from relying on a subject matter expert (SME) to get stuff done; instead, the knowledge is in the code itself. When you put that into the software, those long feedback loops reliant on SMEs start to disappear. So if things go wrong, or you need to fix them, all that information is in one place, version-controlled, with no need to rely on experts for business-as-usual activities.
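For example, once configuration knowledge lives in code, it can be checked like code. The snippet below is a hypothetical illustration of a baseline check that could run automatically on every change, rather than sitting in an expert’s head:

```python
# Hypothetical example: treat configuration as data and assert the agreed
# baseline in an automated check, so the knowledge lives in version control
# rather than in one expert's head. Keys and values are invented.
EXPECTED_BASELINE = {"ntp.enabled": True, "selinux.mode": "enforcing", "swap": "2G"}

def check_baseline(config: dict) -> list[str]:
    """Return a list of human-readable violations of the agreed baseline."""
    return [
        f"{key}: expected {expected!r}, got {config.get(key)!r}"
        for key, expected in EXPECTED_BASELINE.items()
        if config.get(key) != expected
    ]

rendered = {"ntp.enabled": True, "selinux.mode": "permissive", "swap": "2G"}
for violation in check_baseline(rendered):
    print("FAIL", violation)
```

Run as part of a pipeline, a check like this turns tribal knowledge into a repeatable, version-controlled test.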
So it’s almost about people making themselves less crucial for maintaining the basics, to free them up to solve the real problems?
Absolutely, you’ve hit the nail on the head. It’s very natural for people to want to protect their own jobs a bit, but the mindset needs to be that we’re here to do the real thinking that computers can’t do, the complex thinking, and to move anything repeatable into the work the computer does.
And in both cases it’s about centralisation?
That’s right — some of what comes from the DevOps world is transparency, making sure there aren’t blockers in the way.
If you are moving that info down into the code, and making that investment something that can be shared much more easily, that’s when you’re winning.