Configuration Should Be Treated Like Code

Ian Nowland
9 min read · Oct 9, 2022


Amazon.com 2007 — My introduction to “servware”

“Configuration is Code” was a rant I wrote in late 2007, after I had been at Amazon for 18 months. Before that I had been doing embedded development, where you had to accept that everything you shipped could never be changed. Amazon was my first experience of a “you build it, you own it, you run it” world, which opened up a lot of flexibility for iterating on both product and system. This is a no-brainer for everyone today, but Steve Yegge’s old “it’s not software” essay captures how revolutionary it was. He used the term “Servware”, which never caught on, though no other term for the artifact ever really caught on either.

So, looking at configuration in “Servware”: what I noticed was that my new teammates would spend a lot of time intensely arguing about the right abstractions for code, and the right way to architect data structures and systems for reliability and scale, but then totally lowered their bar when it came to configuration, the magic numbers and strings that affected system behavior. Here they just accepted, without thinking, a bunch of things they would never accept of code:

  • A lot of copy and paste in the definition of values (i.e. the YAML hell you see with k8s today)
  • Almost no automated testing of system behavior under likely values
  • In some cases, values kept in a system with no version control at all
  • In all cases, extremely fast rollouts of changes to production, bypassing safety checks like confirming the new values are behaving well on some nodes before deploying to more nodes

To fresh eyes on the “Servware” world, my teammates seemed to be unnecessarily carrying over a “software”-world configuration file/registry mindset. My argument was that since we were operating all instances of the developed system and could deploy at will, “configuration” was an outdated notion. If we needed a new configuration value in production, we could change it in the codebase, run it through tests, and build and deploy in minutes. Thus we should move all the definitions of configuration values into a type-safe, Turing-complete language. Since we were a Java shop, that meant treating it all as Java code (i.e. rather than a Java constants file).
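
To make that concrete, here is a minimal sketch of the kind of thing I was arguing for; the class name, fields, and values are hypothetical, not anything from Amazon’s codebase. Once the values are ordinary code, derived values are computed rather than copy-pasted, and invariants can be checked before anything ships:

    // Hypothetical sketch only: configuration expressed as ordinary Java code
    // rather than an untyped properties file. Names and values are illustrative.
    public final class FrontendFleetConfig {

        // Typed values: a typo or a wrong type fails compilation, not production.
        public static final int REQUEST_TIMEOUT_MILLIS = 2_000;
        public static final int MAX_RETRIES = 3;

        // Derived values are computed, not copy-pasted, so they cannot drift.
        public static final int WORST_CASE_LATENCY_MILLIS =
                REQUEST_TIMEOUT_MILLIS * (MAX_RETRIES + 1);

        static {
            // Invariants are checked like any other logic, long before production.
            if (WORST_CASE_LATENCY_MILLIS > 10_000) {
                throw new IllegalStateException(
                        "Retry policy exceeds the 10s client-facing latency budget");
            }
        }

        private FrontendFleetConfig() {}
    }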

This was influential within my organization, which was made up of teams generally running ~100 hosts worldwide, well served by Amazon’s build system (Brazil) and deployment system (Apollo); i.e. you could go from checking in code to a canary-tested full-fleet deployment in ~15 minutes, staggered across data centers. I then joined AWS EMR in 2008 with a similar setup and had to drop the type safety for some of our Ruby code, but once again we saw good outcomes from treating arbitrary constant values the same way we treated arbitrary logic constructs.

So all simple, right — configuration is code? This idealist mindset was challenged when I moved to EC2 in 2010.

EC2 2010 — Configuration in the World of Servware Infrastructure

One thing to capture up front, since in 2022 people have a hard time believing it, is exactly how poor the quality of EC2 software was, not only by modern standards but even by the Amazon.com servware standards of the time. This was the period when Andy Jassy would give AWS projects a codename containing a number, which was the number of days you were expected to launch in, and the numbers were less than 200. In hindsight it was likely rational: a land grab to determine what the right building blocks for a massive software infrastructure business were. But from an engineering perspective, it encouraged and empowered a “we don’t have time to think about tomorrow” mindset in engineers, which over EC2’s 4 years of rapid growth as an infrastructure business had built up a massive pile of hard-to-change tech debt. At the point I joined, it was widely known internally as a place you went to burn out, between getting paged 30 times a week and the difficulty of changing the systems that were doing the paging. Indeed, the year I joined there were _zero_ other internal transfers; my reason for joining was simple curiosity: “how bad could it be?”

The answer was “pretty bad” —

  • The >30-pages-a-week page rate and the difficulty of changing systems were real.
  • EC2 was already at over 20,000 hypervisor nodes, and would be at over 100,000 within 2 years.
  • On the hypervisor, five different languages were used to write production code (Bash, Ruby, Python, Perl, C++), and they needed to interact. That is, there was no easy code-based way to share configuration values across systems, so my whole “pick one Turing-complete language for config” idea was bunk (see the sketch after this list).
  • On top of this, Xen virtual machine behavior was tied to various system limits you could only program through sysctls, which literally means old-school config files. Anything trying to work around this with something more “modern” was going to run into race conditions in the case of a system reboot.
  • EC2’s build system was not the Amazon standard; instead it was a bespoke mess that failed to scale to a larger codebase as the EC2 org grew, taking an hour to produce a binary (the TLDR: lots of Ruby, some very slow unit tests, and no ability to bypass them selectively).
  • EC2’s deployment system, to use the word “system” kindly, was a parallel SSH mechanism for laying down RPMs, rate-limited by the number of SSH connections a single deployer host could sustain, which was not high. This meant that by the time we hit 100,000 hosts, deployments took over an hour at the very fastest.
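
To make that cross-language constraint concrete, here is a hypothetical sketch (not EC2 code; the names, values, and path are made up) of the sort of glue you end up needing just to get a typed value out of Java and in front of the Bash, Python, Perl, and C++ agents on a hypervisor. The type safety ends at the file boundary, and sysctl-managed limits still need their own old-school config files regardless:

    // Hypothetical illustration of the cross-language problem (not EC2 code):
    // a typed Java value is only useful to the Bash, Python, Perl, and C++
    // agents on the host after it has been rendered back into an untyped
    // key=value file, at which point type safety and invariants are gone for
    // every consumer. Sysctl-managed limits would still need their own config
    // files under /etc/sysctl.d, outside this path entirely.
    import java.nio.file.*;

    public final class ExportSharedConfig {
        // Typed source of truth (illustrative names and values only).
        static final int MAX_VIFS_PER_HOST = 120;
        static final int DHCP_LEASE_SECONDS = 3_600;

        public static void main(String[] args) throws Exception {
            String rendered =
                    "max_vifs_per_host=" + MAX_VIFS_PER_HOST + "\n"
                  + "dhcp_lease_seconds=" + DHCP_LEASE_SECONDS + "\n";
            // A plain file every language can read; none of them can check it.
            Files.writeString(Paths.get("/etc/shared/hypervisor.env"), rendered);
        }
    }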

The worst aspect of this world was that a few teams had decided to work around the limitations of the deployment system and build their own ad-hoc configuration deployment mechanisms, all with the same idiom: a cron-based agent fetching config files from S3, which were applied as soon as the next job ran. These had exactly _zero_ of the controls my “configuration is code” screed talked about. But in the hands of the experts who built them, they were incredibly useful for quickly solving business problems, leading to quick promotions for said experts.
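
For concreteness, the idiom looked roughly like the hypothetical sketch below (bucket, key, and paths are made up; this is not any team’s actual code), run from cron every few minutes on every host. Note everything it does not have: no canary, no staged rollout, no health check, no rollback.

    // Hypothetical sketch of the "cron-based agent fetching config from S3"
    // idiom, using the AWS SDK for Java v1. Bucket, key, and paths are made up.
    // All it does is fetch, compare, and overwrite; there is no canary, no
    // staged rollout, no health check, and no rollback.
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import java.nio.file.*;

    public final class ConfigSyncer {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            String latest = s3.getObjectAsString("team-config-bucket", "hypervisor/limits.conf");

            Path local = Paths.get("/etc/team/limits.conf");
            String current = Files.exists(local) ? Files.readString(local) : "";

            // If the object changed, overwrite the local file; the consumer picks
            // it up on its next run. Every host in the fleet runs this on the same
            // cron schedule, so a bad value reaches everywhere within minutes.
            if (!latest.equals(current)) {
                Files.writeString(local, latest);
            }
        }
    }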

Like a lot of EC2 at the time, these were standing but fragile houses of cards, set up to start falling apart as soon as the experts who built them moved on, which, between burnout and looking for the next promotion, they did. However, between the inherent challenges of the system and the tech debt, there was no silver bullet for getting to a better place.

AWS EC2 2013 — Configuration Working Around an Idealistic Deployment System

Progress was a lot of rewriting of the individual systems, and in doing so thinking hard about which aspects needed to be dynamic (i.e. fleetwide in minutes) with a lot of safety checks that limited flexibility, and which aspects should be static and go through the from-scratch rewritten code deployment system. This turned out to be a 6-year mission. For instance, when we released the c3 instance type in November 2013 and I did a “night before the announcement” configuration deployment to fix how we were doing EBS throttling, I used the old SSH system, as the new system had been written with a “one AZ per day” enforcement baked in, with no flexibility to override it for pre-production hosts like mine.

The implementation of that deployment system points to a general challenge: in a complicated, high-entropy world with humans trying to get stuff done, systems that are too ideal in their design often end up causing more damage than pragmatic ones with more obvious flaws. So beyond the one-AZ-per-day rule being baked in, that system had a worse problem: between creator idealism and compliance regulation, it was not written with any kind of API to automate deployments. It required a human to push a whole bunch of buttons to make the deployment go, and to push more buttons to react to a whole bunch of “looks suspicious and triggered monitors” events during the deployment. Given the entropy of a 100,000+ server fleet, this meant manually looking at the logs and clicking a “resume” button for O(tens) of hosts every deployment. Combine that with the “one AZ a day” policy, and doing a deployment was about half a week of toil, with the rest of the week spent struggling to focus, waiting for the page so you could unblock things as soon as possible, or else the whole thing slipped a week and your leadership chain yelled at you for slipping a date.

This was especially painful for one of the EC2 networking teams, which managed the physical network firewall rules of the hypervisors; since the network was constantly expanding, they were deploying config changes about once a month. They tried to work with the deployment team to cover their case, but due to the combination of idealism and the difficulty of revisiting idealistic design assumptions in an already complex system, the deployment team was not able to accommodate them.

As a result, the EC2 networking team decided to build a “better” version of the “fetch a file from S3” mechanism, literally “but we will build in safeties!” The safeties were exactly two: command-line switches that someone had to include as more risk was taken, and a “fall back to last” behavior in any primary consumer of the configuration. Now, I am both a natural idealist and had lived the white-knuckle terror of making EC2-wide changes with the earlier S3 syncer systems, so I strongly argued these safeties were inadequate and this was a massive mistake. But I was told the toil was too high and this was the only solution that could fix it soon, and so I needed to do the Amazon “disagree and commit” (i.e. shut up), so I did.
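
For illustration, my reading of that second safety looks roughly like the hypothetical sketch below (again, made-up names, not the actual EC2 agents): the primary consumer validates a newly fetched file and, if anything about it looks wrong, keeps running on the last config it successfully applied.

    // Hypothetical sketch of a "fall back to last" consumer: validate newly
    // fetched config, promote it to last-known-good on success, and keep using
    // the previous config on any failure. Paths and checks are made up.
    import java.nio.file.*;

    public final class FirewallRuleLoader {
        private static final Path NEW_CONFIG = Paths.get("/var/config/firewall-rules.new");
        private static final Path LAST_GOOD  = Paths.get("/var/config/firewall-rules.last-good");

        public static String loadRules() {
            try {
                String candidate = Files.readString(NEW_CONFIG);
                validate(candidate);                      // sanity checks: syntax, rule count, etc.
                Files.writeString(LAST_GOOD, candidate);  // promote to last known good
                return candidate;
            } catch (Exception e) {
                // Fall back: any failure means we keep the previously applied rules.
                try {
                    return Files.readString(LAST_GOOD);
                } catch (Exception fallbackFailure) {
                    throw new IllegalStateException("No usable firewall config", fallbackFailure);
                }
            }
        }

        private static void validate(String rules) {
            if (rules.isBlank()) {
                throw new IllegalArgumentException("Empty rule set");
            }
            // ... further checks would go here ...
        }

        private FirewallRuleLoader() {}
    }

The limit, as the incident below made clear, is that this protects only the consumer that implements it; an adjacent agent reading the same data gets no such protection.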

Fast forward about 18 months and I inherited this system. Now, I knew it was flawed, but it had been 18 months and I didn’t want to ask an engineer on my team to do the high-toil thing (especially since, with the pressure removed, the deployment team had built nothing in the meantime), and so I normalized the deviance and gave the system to a strong but junior engineer.

And in karmic fashion, about 3 months in, he had an emergency change he needed to roll out globally, quickly: a product manager had dropped the ball on executing network changes, so we were literally going to run out of IPs for new instances. So the engineer plugged in the needed command-line safety overrides and deployed a change to over 1 million hypervisors worldwide in less than 5 minutes. And while the system he deployed to picked up the change fine, as it was “just more config in the list”, it tripped a bug in an adjacent agent running on each and every hypervisor. Luckily this just caused a crash of that agent, and when it restarted it could use the new config, so apart from a very brief blip in some connectivity at the crash, there was no further impact. But given what this other agent controlled, this was just luck; we were very close to a global outage of new network connectivity for every EC2 instance in the world. Obviously the safety was then massively restricted, and we now had the impetus to get the deployment system team to prioritize features that better suited our needs. But what remains in my memory is the downstream bad outcomes that happen when systems are built too much to an ideal.

Takeaways

So here is where I came to on configuration: in the way the values of its numbers and strings have a direct effect on system behavior, configuration has little that distinguishes it from code. So it should be treated like code.

  • Ideally all configuration can be written in a Turing-complete language that minimizes copy and paste
  • Ideally all configuration is statically checked for type safety
  • Ideally all configuration is regression tested (see the sketch after this list)
  • Ideally all configuration is in a version control system
  • Ideally all configuration is deployed to production with the exact same safety checks as non-configuration, through the exact same system
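
As a sketch of what “regression tested” can mean in practice (hypothetical, and assuming the illustrative FrontendFleetConfig class sketched earlier in this post), configuration values go through the same unit tests as everything else, so a config-only change that breaks an invariant fails CI rather than the fleet:

    // Hypothetical JUnit 5 sketch: configuration values exercised by the same
    // regression tests as the rest of the codebase. Assumes the illustrative
    // FrontendFleetConfig class from earlier in this post.
    import static org.junit.jupiter.api.Assertions.assertTrue;
    import org.junit.jupiter.api.Test;

    class FrontendFleetConfigTest {

        @Test
        void retryPolicyStaysWithinLatencyBudget() {
            // A config-only change that breaks this invariant fails CI, not production.
            assertTrue(FrontendFleetConfig.WORST_CASE_LATENCY_MILLIS <= 10_000);
        }

        @Test
        void timeoutsArePositive() {
            assertTrue(FrontendFleetConfig.REQUEST_TIMEOUT_MILLIS > 0);
            assertTrue(FrontendFleetConfig.MAX_RETRIES >= 0);
        }
    }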

On top of this though, I repeat “Ideally” in the above because you need to be thoughtful in all cases; I learned the hard way to be wary of the combination of idealist and monolithic approaches to engineering tooling that force all use cases into the presumed ideal usage. They end up causing more harm than something less “pure” would have, as in the real world engineers will just work around a non-functional ideal; in the particular case of configuration, by using a central store as a means of deploying the numbers and strings that change system behavior. That is a fast propagation mechanism, but not a safe one. So in your idealism, you have empowered the development of experts’ “quick and dirty” mechanisms that become the massive outage foot guns of the future.
