Managing Systems Engineers
(or devops engineers, or production engineers, or systems development engineers, or SRE-SE, or cloud engineers, etc)
1. Introduction
The purpose of this post is to talk about how I have learned to appreciate the need for Systems Engineers, and what I have learned about managing them well (both hiring and motivating). I do this because I think they are mismanaged throughout our industry, and a big part of this is that people from software engineering backgrounds tend to look at them as glorified toilers/scripters who can’t write production quality software but are useful for doing all the stuff software engineers complain about doing. I know this personally, as it is exactly what I thought before I worked with two great Systems Engineers in the early years of the Nitro project and saw what game changers they were for our outcomes. They very literally did things in minutes that would take my software engineers hours, and could do so at the moments when it mattered most to the business to have it done in minutes.
Now an important thing to get out of the way first: “why not just hire software engineers?” I actually agree, for as long as it works for you. In my last post I talked about minimizing functional specialist hierarchies, but I have a stronger opinion, which is to minimize specialist engineering job functions for as long as possible. There are just so many advantages to role fungibility when you need to be highly agile to the engineering needs of the day. In fact, in my first thirteen years in industry before Nitro, I believed in hiring only generalist Software Engineers who could range all over the stack doing all the work needed: front end, back end, distributed systems, database administration, ops scripts. I still think that’s where you should start, and you should hold that line as long as possible for any individual team, i.e. push back when your software engineers start complaining “I don’t want to do this type of work” until it’s clear a specialist in that type of work is worth the costs.
However, as per the Nitro example, once a company reaches a certain size, and with it a certain complexity of problems being solved, specialized knowledge allows better solutions, delivered faster. So as engineering orgs grow big it does make sense to create specialties, although:
- doing so should be delayed as long as possible,
- even then, specialists should only be hired in places where you really benefit from them,
- and as per the last post, you should minimize hiring them into a hierarchy around function.
2. Work Process Definitions
I will come back to the limitations of these definitions, but in terms of types of work, I will simply restate the great definitions from Google’s SRE community.
Software Engineering: involves writing or modifying code, in addition to any associated design and documentation work. Examples include writing automation scripts, creating tools or frameworks, adding service features for scalability and reliability, or modifying infrastructure code to make it more robust.
Systems Engineering: involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting improvements from a one-time effort. Examples include monitoring setup and updates, load balancing configuration, server configuration, tuning of OS parameters, and load balancer setup.
Toil: manual, repetitive, automatable, tactical hands-on system changes such as manual releases and manual regression testing.
3. Doesn’t “the Cloud” Address Systems Engineering Needs?
Nitro was of course in a space with unique systems complexity: building planet-scale cloud infrastructure. In theory, for its users, “the cloud” was supposed to make such complexity go away. I am now 4 years into using AWS rather than building it, and have learned what everyone in the same situation knows: given the complexity of the cloud offerings combined with the complexity of the open source software that developers leverage, there is if anything more systems complexity than there used to be.
Now, there are many bad decisions that companies and their software engineers make that needlessly but greatly add to systems complexity. The types of rules that avoid this are:
- Prefer Serverless to Containers
- Prefer Containers to VMs
- Prefer vendor hosted offerings to running it yourself
- Particularly for the last one when it involves running open source distributed systems
- For the love of god, if you have 100+ engineers writing stateless web services, don’t make them configure Kubernetes themselves.
The last one is an important one, as more generally you can often find ways for all the systems engineering problems to go to one org, an Infrastructure (or Platform) org, which builds nice shims that abstract away the systems complexity for a bunch of common problems. This often leads to all the Systems Engineers being in that org, with a slightly different hiring process, and the rest of the company moves on, treating all systems problems as ones the Infrastructure org needs to solve for the commons. Further, it makes sense not to hire Systems Engineers outside that org for one-offs, as it can encourage bad decisions around shadow infrastructure, particularly by inexperienced developers/managers who chase the shiny new open-source/vendor toy of the day and then find 12 months later that their Software Engineers don’t want to operate it anymore.
The flip side of this is that if you (like my employer, Datadog) are trying to solve some super hard large-scale systems problems while running on the cloud, then how your software handles that complexity is often not shimmable (in fact you usually see the shims make it worse). And that is when you struggle the most with a mindset of “Systems Engineers should only be in the Infra org” (or, as an adjunct, in the SRE/DevOps org). As per my last post, I think that as long as your management learns how to manage them, you should hire them as close as possible to where the other engineers are working on other aspects of these systems and problems.
4. A Brief History of Systems Engineers
I want to summarize some history about how much this role has been re-named, as I do think its history of being undervalued is part of that. It actually starts with a different term, Systems Administration, which has a long history in computer systems, being the work of making software reliably solve real problems for real humans on real computers. As any old school systems administrator will tell you, this did not mean they did not write code. But that code did tend to take the form of shell and later Perl scripts, more focussed on harnessing the generalized software for a specific task, as opposed to being a larger piece of “software”.
Across the early aughts, 3 things happened: software got more complex, software got delivered to computers more often, and the number of computers it was delivered to got much larger. As a result the role tended to get renamed to Systems Engineering, reflecting that success in the face of increased complexity required being more proactive in building ahead of problems.
Across the late aughts, in the face of further growth, two variants of the role gained success: DevOps Engineering (particularly coming out of fast moving companies using lots of open source), and Site Reliability Engineering (SRE) coming out of Google. In essence these were taking systems administration to more scale; in the words of Ben Treynor, the founder of Google’s Site Reliability Team, SRE is “what happens when a software engineer is tasked with what used to be called operations.” Both DevOps and SRE have suffered through ambiguity in role and function.
DevOps: has a blurring of whether it means taking a software-centered approach to systems/operational engineering all done in a separate “devops” team, or combining both functions in teams of the “you build it, you own it, you run it” style (the latter some now call No-Ops).
SRE: has the challenge of whether it is just a bunch of good practices Google had refined, or whether it is additionally a distinct discipline/function of software development, one that needs a structural org hierarchy to weaponize being oncall and so empower the right tradeoffs against feature/product oriented engineers and their management chain.
Both: have struggled with how much software engineering, and of what complexity, is important for outcomes. Is success 1,000-line scripts or 30,000-line programs? And how does that turn into hiring bars, promotion decisions and the like? In fact, for a long time Google was deliberately ambiguous about having answered this with 2 different hiring processes, both under the “SRE” label (i.e. SRE-SE vs SRE-SWE).
After these roles, in 2010+ you see some new names, with Facebook introducing Production Engineering, and Amazon introducing Systems Development Engineering. In my mind this is just rebranded Systems Engineering. But I know first hand in the case of Amazon at least, it was not at all a cynical rebranding, rather it was a deliberate decision to ensure Systems Engineers were appropriately recognized for how important their specialization was for solving complex systems problems at scale. Finally in the last few years you see some specialities within the speciality, like Cloud Engineers or Data Reliability Engineers.
5. Understanding Systems Engineer Motivations
As I have said elsewhere, the most important part of successfully managing people is understanding their motivations so you can manage to them. This is where most managers from a software development background struggle, just like I did: because systems engineers do tend to code less, you assume they will be motivated by working on simpler problems. But the opposite is the case. To understand that, I need to talk about categories of work again, but along a different dimension: how motivating they are.
While I like Google’s definitions around distinct groupings of types of work, I do not believe this work difference maps to the major differences between Systems Engineers and Software Engineers in terms of ability, skills, education, or training. Rather, I define these categories of work that modern agile “develop software and operate it” companies need:
Toil: as above, manual, repetitive, automatable, tactical hands-on system changes such as manual releases and manual regression testing.
Support: on-call debugging of production issues, dealing with user issues. The main difference from toil is that it is inherently non-automatable; it requires either software or systems engineering judgment at least some of the time.
Automation: Work to reduce toil. Since it reduces toil, this category includes release engineering, i.e. investment in automated CI/CD pipelines (including coding the CI tests), and investment in building active monitoring canaries.
Re-engineering: design/architecture/coding/delivery of changes to existing systems so they run better in production, improving reliability at increasing scale.
Features: design/architecture/coding/delivery of new features.
Importantly, all but Toil require both Systems Engineering and Software Engineering work. However, while it’s a spectrum and so I am talking about differences on average, I believe there are two broad groups of engineers with very different motivations regarding which items on this list they want to work on. Starting with Software Engineers, I have found most draw much of their motivation from the “flow” experience of writing challenging code, and so they get unhappy unless they get to write some amount of it, which usually comes from building features and re-engineering (although large scale automation has it as well).
Whereas those who specialize as Systems Engineers are much more motivated by three other things: bringing things into order, understanding broadly and deeply how complex systems work, and seeing their work have an impact. This means Systems Engineers are more motivated by re-engineering and automation projects, and even some of the hard support cases. But just like software engineers, they don’t like the following.
Toil: Engineers are builders. Toil is not building. No engineer likes large amounts of toil. Systems Engineers will do their share, but toil is no more interesting or motivating to them than it is to Software Engineers.
Simple Systems Engineering: Systems Engineers need growth in their work. If 5 years ago a Systems Engineer could have written this simple YAML you want churned out across 40 similar but slightly different microservices, expect them to do it out of duty or expediency, but that’s different from being motivated by it. I bring this specific example up because Automation is the one area that software development managers hiring Systems Engineers for the first time expect to be 100% of the job. It can be if it has complexity, but that is usually not the case if you need it to be 100% of the role.
Growing technical debt: No matter the challenge of the work, Systems Engineers are demotivated if their work makes long term systems complexity worse. Now this gets into a discussion of technical debt, with engineers viscerally arguing it is inherently bad as it immediately makes their job harder, whereas most managers know that if it is taken on judiciously it can be right for the business, especially in a fast headcount-growth environment. While the managers are right, it is hard to make the future costs of technical debt visible, and so management finds it hard to get the balance right, particularly when paying down the balance depends on future headcount which may not arrive. Now this is all true of Software Engineers as well, but the reasons this is more acute for Systems Engineers are:
- Managers from a Software Engineering background often lose their visceral understanding of the tradeoffs when it applies to Systems Engineering work.
- Systems Engineers are often asked to do things like region buildouts, new vendor buildouts etc., which are often asked for on extremely tight timelines, and so corners must be cut.
- Systems Engineers have a different view of systems complexity from Software Engineers: “let’s introduce a new distributed open source storage system for a 25% performance win” can sound really motivating to the Software Engineer building it, whereas the Systems Engineer knows what the maintenance story is going to look like in 12 months.
In any case, just like Software Engineers, Systems Engineers will spend some amount of time doing projects in all three of these categories; just expect them to be ground down if it becomes the majority of the role for a long period of time. (Also expect it to be hard to close candidates unless you can show you understand this in the interview process.)
6. Hiring Systems Engineers
Stating a well known but rarely discussed point: a lot of Systems Engineers struggle to pass coding questions done in the interactive style (i.e. whiteboard, but even a full IDE if under a time limit with people watching them), even though they can write code on the job. So the thing I want to cover here is why I think that is, and how to evaluate their coding ability in light of it. One thing to get out of the way first is that you clearly need to evaluate coding ability in the hiring process. There are a lot of people who are extremely competent at doing complex work in their profession, but cannot learn to write anything beyond trivial code. I myself have a Mechanical Engineering degree, and was perplexed watching otherwise very strong peers who couldn’t write 10 lines of MATLAB from scratch. But it is very clearly true. So since there’s a need for it and some people won’t be able to do it, you need to evaluate it.
But how to do so? I think the first thing in answering is taking a step back and looking at why exactly Systems Engineers are less motivated by the “flow” experience of coding that Software Engineers love so much. In my discussions I have heard 3 answers, which I think are somewhat related:
Dopamine-seeking: I won’t say ADHD as it’s a clinical term, but some struggle to focus on one thing for the 3–4 hours that a lot of deep software development work requires, particularly on large codebases where you have to build and then hold a lot of detail in your head.
Perfectionism: The best Systems Engineers I know have an encyclopedic knowledge of detail that they bring to a problem, but it means when they come to transforming it into code, they recoil at the tradeoffs they have to make and so get stuck doing yak shaving activities rather than shipping and iterating.
Dislike of abstraction: the usual complaint Systems Engineers make about Software Engineers is their over-use of abstraction. This is fair. On the flip side, the only way you build large software is with abstractions, which even as they leak (and I think it’s those leaks that Systems Engineers hate the most) allow layering to build bigger things.
So what’s the point of this in answering how to interview them? The point is whiteboard coding interviews take all 3 and artificially ramp them up in a manner completely unrelated to the job:
- Interviewer looking over your shoulder distracting you from thinking deep
- Toy problems where you’re supposed to intuit from the context a solution complex enough to show that you are smart and can code, but not so complex that you get stuck in a rathole and don’t write much code.
- Heavily focussed on pure-CS abstractions unrelated to day to day work
With this in mind, I think best practice is:
- Small practical take home problem which should take 1–2 hours to solve
- Since there is some risk of cheating, ideally validate using open source contributions.
- If they don’t have that, at the in-person stage, validate with a ~15 minute non-abstract, simple coding question (< 10 lines).
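To make “non-abstract and simple” concrete, here is a sketch of the kind of question I have in mind. The task, the log format (`host status path`), and the function name are all hypothetical, made up purely for illustration: count server-error (5xx) responses per host.

```python
from collections import Counter


def count_5xx_by_host(lines):
    """Count 5xx responses per host from lines shaped like 'host status path'.

    The log format is an assumption made for this example only.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(parts) >= 2 and parts[1].startswith("5"):
            counts[parts[0]] += 1
    return counts


# Hypothetical sample input.
sample = [
    "web-1 200 /health",
    "web-2 503 /checkout",
    "web-2 500 /checkout",
    "web-1 502 /cart",
]
print(count_5xx_by_host(sample).most_common())
```

The point of a question like this is that it resembles day to day work (parsing, counting, handling a malformed line) and has no trick to intuit, so 15 minutes is enough to see whether someone can actually write code.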
Finally, I don’t want to say Systems Engineers are worse in this area than Software Engineers, as I have seen plenty of bad both ways, but I will say that in terms of knowledge-type questions, be very wary of mid-career interviewers assuming that particular things they know are must-haves for someone coming into the role, particularly at senior levels. I have heard plenty of narcissism-of-small-differences stuff on why, given a candidate’s current knowledge, they were not a fit for a specific Linux userspace vs Linux kernel vs Postgres vs Networking vs Reliability practices vs Containers vs Cloud vs Distributed Storage vs In-house distributed systems vs etc role. If someone is strong in other areas (i.e. has shown they can go deep) where they should be strong, and has good general aptitude otherwise, and will have a good mentor, none of this past knowledge stuff matters much after about 6 months. In fact they may be more effective as they are more motivated by learning.
7. Why Not Call them Software Engineers?
This is an important one for me to cover as it comes up with some of the engineers at my current company. I understand where this is coming from, especially given the history of the role being treated as a second-class citizen, made worse because any individual is a blend and not as black-and-white one or the other as I have portrayed in the rest of this document. As stated in the intro, I think you should avoid specifying functional specialities as long as possible. However, that fungibility comes with a lot of difficulties in hiring and promoting people who aren’t perfectly fungible and have unique specialties. Now in the short term, you can get away with generalizing your job levels as much as possible, and then letting one high need org have a slightly different bar and use its local trust to keep the bar. However, outside that org the informal bar can cause a lot of bad blood around standards, things like:
- “That other org is easier to get promoted than ours”
- “That other org has shitty coders”
- “That other org is a bunch of middleware coders who aren’t deep enough in systems knowledge for the problems they are solving and are going to create a mess”
A classic example of this is hiring a systems engineering expert with 15 years of experience into a “mid” level role because they didn’t crush the coding interview the way someone 5 years out of their CS degree, whom you made “senior”, did, even though the latter treats everything outside the JVM as someone else’s problem (or not promoting said veteran because “the complexity of code in their delivered solutions does not hit our senior engineering bar”).
“Calibration” is one of those words that sounds like shitty corporate-speak, but if you have ever been in an organization that struggles with shared language and understanding of hiring and promotion bars, you realize how essential some formality around expectations is; otherwise you end up with pretty bad infighting over it. In fact the only thing I have found that is more “political” for both engineering and management is questions of fairness around headcount and recruiting access. So there is no avoiding that as companies get large, calibration of role expectations across larger and larger groups becomes important, and it greatly helps to be able to talk about different expectations of people working in different functional specializations. Which means a different name.
None of that is to say you need to use “Systems Engineer” as a name; in fact it has a downside, in that we have already seen that Software Engineers should do Systems Engineering work, and vice versa. As I have said, Amazon uses “Systems Development Engineer” to acknowledge the role’s creative aspect. Facebook of course uses “Production Engineering”. My prior company used “Platform Software Engineer”. I have also heard “Infrastructure Engineer”, although that has the downside of usually also being an org name. The main one I suggest staying well away from is variations of “Reliability Engineer”. This is because it creates a needless division, implying firstly that the job is always about reliability, and secondly that reliability isn’t an important part of Software Engineering’s mission as well. This is deeply ironic because Google, the creator of SRE, itself split up the function within its org. This is something I will go deeper into in my next post.