jgandrews.com


DevOps != NoOps

Introduction

I have been mulling over writing an article on some observations about DevOps for some time, in particular around some common misunderstandings about how to go about it. One of these is mistaking DevOps to mean NoOps.

Since I seem to keep having the same conversations about this with friends and colleagues, I thought I’d finally write down my thoughts and observations and get them out of my head. Some of you may think this is stating the bleeding obvious, but let’s not let that stop us!

I have seen the ‘DevOps can fix Ops’ mistake a few times. It is either driven by development teams getting frustrated with operations, or driven organisationally by a misunderstanding that DevOps is a simple way to optimise operations.

The mistake is almost always motivated by infrastructure and operations appearing to be an unnecessary blocker, with DevOps seen as the solution. But when DevOps is implemented in an unstructured way, coupled with a misunderstanding of what ‘infrastructure as code’ means, it nearly always goes bad.

Let’s dig into this.

‘Traditional’ Infrastructure and Operations Isn’t Agile enough

From the outside looking in this seems self-evident. How can a function that is often constrained by the availability of physical hardware, and all the messy detail that entails, adapt quickly enough to the fast-paced, changing needs of a development organisation?

Often public cloud is seen as the saviour here. And it can be a huge win. Obviously public cloud divorces us from the hardware concerns. But even if they aren’t our concerns, the hardware concerns are still there; they are simply abstracted away by the Cloud Service Provider (stating the obvious here!).

But there is no reason, as of today, why this abstraction can’t apply everywhere. The same tools and methods of abstraction are available to us everywhere, be it public cloud or our own data centres.

But here is perhaps the controversial part… Because the tooling has got so good and the hardware vendors have learnt from the cloud providers, the messy hardware details are now a significantly smaller fraction of the overall operational burden. This makes the calculus of which route to take even more difficult. That could be a whole topic on its own, so let’s park it for now.

Regardless, hardware procurement or compute provisioning should not be a limiting factor. If hardware and provisioning lead times are on your critical path you are doing it wrong.

Compute needs to be decoupled, consumed as a service, on-demand and defined as code. It should make little functional difference whether those services are provided via public or private cloud platforms.

Provisioning and deploying to any sort of compute anywhere in less than an hour is an achievable goal.
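To make the ‘defined as code, consumed as a service’ idea concrete, here is a minimal sketch. Everything in it is invented for illustration (the `ComputeSpec` shape, the `FakeCloud` stand-in); in practice this role is played by tools like Terraform or Pulumi against a real cloud API. The point is that the desired infrastructure is data, and provisioning is an idempotent function of that data.

```python
from dataclasses import dataclass

# Hypothetical illustration: infrastructure described as data ("as code"),
# provisioned idempotently on demand. All names are invented for the sketch.

@dataclass(frozen=True)
class ComputeSpec:
    name: str
    cpus: int
    memory_gb: int

class FakeCloud:
    """Stands in for a public or private cloud provisioning API."""
    def __init__(self):
        self.instances = {}

    def ensure(self, spec: ComputeSpec) -> str:
        # Idempotent: re-running with the same spec changes nothing.
        if self.instances.get(spec.name) != spec:
            self.instances[spec.name] = spec
        return f"{spec.name}: {spec.cpus} cpu / {spec.memory_gb} GiB"

cloud = FakeCloud()
desired = [ComputeSpec("web-1", 2, 4), ComputeSpec("web-2", 2, 4)]
for spec in desired:
    print(cloud.ensure(spec))
```

Because `ensure` is idempotent, the same definition can be applied on demand, in a self-service pipeline, without anyone caring whether it ran once or ten times.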

Yes, public cloud allows us to do away with hardware procurement and all those messy details. It also allows us to programmatically define our infrastructure. But neither of these frees us from operations.

In this day and age there simply is no reason why infrastructure can’t be provisioned on-demand, as soon as it’s required, in an automated and self-service style. We have at our disposal an embarrassment of well-supported tooling riches that allow us to build modern platforms on modern infrastructure anywhere. This is yet another whole topic on its own, but we’ll boldly leave it as a statement of fact for now.

Moving on to another related thing…

Virtualise Like It’s 1999

If more than a minority of development teams are waiting on bare metal to deploy to, you are doing it wrong.

95% of workloads today (I made that number up based on what I have seen, but I’d wager it’s not far wrong) can and should be provisioned on virtualised compute on public or private cloud platforms.

Is this even an argument we need to have any more? Answers on a postcard.

Hypervisors have got so good now that in some cases they can exploit the hardware better than you can directly. Not to mention your instant ability to take advantage of better hardware faster.

Even if we aren’t doing it wrong with ‘traditional’ operations teams, there are still seemingly unnecessary lead times. Why?

Throw it Over the Fence and Us and Them

To operate at the pace and scale required today there can’t be a sharp divide between infrastructure and operations on one side and development on the other.

If there is, then two big things will slow us down. I bet lots of technology folk have seen both of these happen.

First, we get the ‘throw it over the fence’ deployment model, where development teams ship artifacts to the operations teams without the whole platform in mind, and the operations teams need to fill the gaps.

This is actually intentional, because there is a natural divide between development and operations at the point of deployment to production. It works really well when there are homogeneous artifacts and a stable production runtime platform. But trouble creeps in when things change or deployment cadences need to increase.

Another big issue is when something needs patching or debugging after the fact. The route to the deployed app is via the operations team, so they become a blocker (see how it happens!). Ops add lead time. Devs are less empowered to fix things themselves directly. Again, this works OK at small scale but causes headaches when scale and pace increase.

Secondly, we get the ‘us and them’ adversarial model between operations and development. Neither team quite understands the other’s context and concerns. Over time this discordance gets worse, which leads to unnecessary points of conflict and unnecessarily complex deployment and debug paths involving multiple teams with different motivations.

However, we don’t want to eliminate the Dev vs Ops friction completely. There should always be a healthy tension between development and operations. Development teams want to get code deployed fast; operations need to protect production. This tension can slow things down, but this isn’t pointless lead time. In the language of value streams, this is actually value add time. We’ll return to this later.

Regardless, both development and operations should be motivated by the same goal: realising value faster. But both need the same definition of value. It’s only any good if we can all deliver a working and robust set of new features and functionality at pace and at scale.

Enter the DevOps Team

So how do we get the operations and development teams to work more closely? Well… You create another team that spans operations and development and call them the ‘DevOps Team’. Right? No! You are doing it wrong.

I have seen this happen in a couple of ways. Either a new team is created, which inevitably just adds another layer of ‘us and them’ tension into the mix and makes it worse. Or the operations team is arbitrarily renamed the DevOps team, with a plan to change the operating model along the way. Like some sort of strangler pattern to transition traditional operations to the mythical new way of doing things called DevOps.

These approaches are wrong and miss the point. Is it cliché yet to say DevOps isn’t a team? Rather, it’s an umbrella term for a bunch of things we will get to later. Needless to say, you can’t remove the unnecessary complexity, lead time and tension by creating or renaming a team that will ultimately still sit on one side of the great Development and Operations divide.

That’s not to say you can’t solve this by creating new teams or re-focusing existing teams. You just need to know where to put them and what they should do.

The Devs Will Do It

So if creating a DevOps team is doing it wrong, does that mean we should simply remove operations and have the devs do it? No! You are doing it wrong.

We said above that infrastructure should be on-demand, self-service and defined as code. Infrastructure as Code. Code is something devs understand! Therefore devs can now do infrastructure, right? No!

Before my developer friends come at me with pitchforks: I am not saying devs are incapable. Of course, just as many operations people can code, many devs can do infrastructure. But do you want your day job to be development or infrastructure? Just because we make infrastructure code doesn’t mean it’s no longer infrastructure, and doesn’t mean the job of operating it goes away.

Infrastructure as code no more turns a developer into an infrastructure person than auto-complete (and/or AI assistants) and an IDE turns an infrastructure person into a developer. If we said to our infrastructure teams “Here is some template code and an IDE, now you can write the applications and we don’t need developers”, we’d be led out of the building. So why is it seemingly ok to do the reverse?

To me, this is the fundamental misunderstanding and where attempts at DevOps go wrong. It rests on the bad assumptions that either DevOps means you can get rid of, or minimise, messy operations to the point where it’s just another dev task, or that DevOps means devs will be able to ‘do’ infrastructure and ops quicker if they do it all themselves and get the ops people out of the loop.

Getting operations out of the loop is the right idea, but making operations a development concern is the wrong solution. To overuse another turn of phrase, this approach is adding cognitive load for no gain.

Developer Alchemy

As well as cognitive load, there are a couple of other assumptions here that are equally, if not more, problematic.

The first is the assumption that we can turn developers into DevOps engineers without sacrificing too much in the way of developer productivity, or that it will, in fact, improve developer productivity. As if turning infrastructure into code just naturally blends it into a developer’s day job. We need to have some target when we push all the operational concerns left. That target should be a bona fide DevOps engineer.

The second bad assumption is that all developers want to do DevOps. In fact, all developers love DevOps, right? No! Many developers just want to develop. Should they know about operational and platform concerns? Sure. But should it be a large part of their day job? No.

Turning a great developer into a poor DevOps engineer is a terrible waste of talent on so many levels.

Shhh… It’s always been code

Here’s a dirty secret. In a modern enterprise it’s highly likely most of your infrastructure is already automated and turned into some sort of code, even before it’s considered a ‘proper’ infrastructure as code platform. In fact, if your infrastructure and operations teams haven’t been using code and automation to run the infrastructure and platforms for years already, then (you know what’s coming) you are really doing it wrong.

This being the case, isn’t it obvious that simply changing the way we do our infrastructure as code will not remove the need for an infrastructure and operations function? Spoiler… Nothing does. But it can be done better.

Day One Dream, Day Two Nightmare

Let’s say we do just decide to rid ourselves of messy operations functions in favour of this infrastructure as code dreamland, and it’s wonderful. And it will be. Everything coded up! Infrastructure is a simple terraform apply away. Development teams can have what they need when they need it. Nothing more to do, right? No! (Yes, this is getting boring…) You are doing it wrong.

There is a temptation to bridge the Development and Ops divide by having infrastructure and operations teams construct the code for the development teams to consume. While I feel this approach can work, it’s dangerous.

I shall explain by way of anecdote! I once worked on an infrastructure platform that used software defined networking (SDN). To the consumer, it looked like all the networking primitives were hidden behind an API and I’d never have to learn any low-level networking ever again. The network engineers could deal with that; all I had to do was deal with the API and related code. We were provided really good template code we could use to get everything running. I had to do the bare minimum to adapt the templates and could ignore the bulk of the rest of the config I didn’t need to change.

The problem was the networking primitives didn’t go away, and it was important to understand how the SDN config mapped to those primitives. Or at least it mattered when it broke. I needed to understand all that config I was allowed to ignore. But because I am not a network engineer I was immediately in over my head. I didn’t have a clue what the bulk of the SDN config did because it wasn’t my area of expertise.

The point here is that assuming engineers can deploy template code without really knowing what the templates are doing is dangerous when things go wrong.

It’s dangerous if you don’t have a plan for days 2, 3, 4…

So this dream can quickly become a nightmare. Yes, it’s quite easy to spin up all the infrastructure you require and deploy something. Nice and quick. But unfortunately, as the second law of thermodynamics states, entropy will increase with time.

Or, what I shall call ‘Andrews’ first law of infrastructure’: any platform complex enough to run any useful workload, when left unattended, will eventually fail.

And when it does fail you need to have people that really understand it at more than a cursory level in order to debug it.

Not just fail. Infrastructure requires patching, upgrading, configuration changes etc. All of which need to be done by the team deploying and running it, because there are myriad combinations of ways the systems will need to be run and maintained after day one.
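The day-two work boils down to continually answering one question: how does what is actually running differ from what we declared? Here is a hedged sketch of that drift check. The function name and dictionary shapes are invented for illustration; real tooling does this via `terraform plan`, Kubernetes controllers and the like.

```python
# Illustrative drift detection: compare declared (desired) state with what
# is actually running and compute what must change. Names are invented.

def diff_state(desired: dict, actual: dict) -> dict:
    """Return the plan needed to bring `actual` back in line with `desired`."""
    to_create = {k: v for k, v in desired.items() if k not in actual}
    to_update = {k: v for k, v in desired.items()
                 if k in actual and actual[k] != v}
    to_delete = [k for k in actual if k not in desired]
    return {"create": to_create, "update": to_update, "delete": to_delete}

desired = {"web-1": {"image": "v2"}, "web-2": {"image": "v2"}}
actual  = {"web-1": {"image": "v1"}, "db-old": {"image": "v1"}}
plan = diff_state(desired, actual)
# web-2 must be created, web-1 upgraded, db-old removed
```

Someone has to own running this loop, reading its output and acting on it safely, patch windows, upgrade ordering, rollback plans. That ownership is the operations work that does not disappear just because the input is code.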

You need to have a serious plan for day two. Running the application and the platform are distinct concerns. This is where I have seen ill-prepared development teams come unstuck. The infrastructure is never ‘set and forget’; it requires management and (ahem) operation. Without a plan, this quickly becomes a burden for a pure development team. Welcome back, our old friend cognitive load.

Bridging this Dev-Ops gap with template infrastructure as code is one reason why internal developer portals have become such a hot topic in the last 12 to 18 months: a low-code option to help ease the cognitive load of having to completely get to grips with the nuts and bolts of infrastructure as code.

Infrastructure and Operations Dark Matter

What about Platform as a Service (PaaS) and serverless, I hear you cry. Those things obviate the need for operations because there is nothing to operate! No! (I’m not going to write it this time… You know the drill.)

There is no denying that using PaaS and serverless services removes a lot of operations headache by taking away many of the infrastructure underlay concerns. However, even though the infrastructure disappears behind the veil of the PaaS or serverless API, there still exists what I am calling ‘Operations Dark Matter’.

The majority of matter in the universe is unseen dark matter. So it is with any mildly complex distributed system. In a way we can think of the hardware as the stuff you can see, but as I hypothesised above, the hardware is the smaller part of the universe.

The network overlay, routing, security and observability, etc., that are required to hold the system together. The elements that turn an isolated serverless thing into a useful part of a whole. This is the dark matter: the things that often get under-specified, overlooked or swept under the carpet in the rush to get to day one.

Often we only optimise for the part of the universe we can see, the hardware, even though the vast majority of the complexity lies in the unseen. Waiting to collapse us into a black hole if left unmanaged (ok. That analogy stops now…)

These things slowly add cognitive load to a team that is required to develop, deploy and run an application and the supporting services. They strip away valuable development time. Worse still, shortcuts are often taken to make it easier (‘We’ll just leave that storage bucket policy open for now’; ‘It’s easier if we just allow anyone to SSH in with a username and password to do runtime debugging’; ‘We don’t need to worry about the dead letter queue yet’. These aren’t made up examples.). It’s horrible. And these shortcuts only ever increase in number; they never go down.
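Shortcuts like those are exactly the kind of thing a platform-minded engineer would catch with an automated audit. As a purely illustrative sketch (the config shape and rule strings are invented, not any real cloud provider’s schema), such a check might look like:

```python
# Illustrative only: a tiny policy check that flags the kinds of
# "we'll fix it later" shortcuts quoted above. All names are invented.

def audit(config: dict) -> list[str]:
    findings = []
    if config.get("bucket_policy") == "public":
        findings.append("storage bucket is world-readable")
    if config.get("ssh_password_auth", False):
        findings.append("password SSH auth is enabled")
    if not config.get("dead_letter_queue"):
        findings.append("no dead letter queue configured")
    return findings

shortcuts = {"bucket_policy": "public", "ssh_password_auth": True}
for finding in audit(shortcuts):
    print("WARNING:", finding)
```

Writing, running and acting on checks like this is operations work. It is dark matter made visible, and somebody has to own it.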

Dealing with the dark matter is where the DevOps value add time lives.

Ok smarty pants. How should we do this?

Creating a DevOps team doesn’t work. At best it’s a band-aid.

Asking developers to do all the operations tasks doesn’t work. It just adds cognitive load and reduces developer productivity at best. At worst it wrecks their job.

Having a separated Infrastructure and Operations team in the loop doesn’t work. It just adds lead time and creates an ‘us and them’ culture.

So if we don’t want to do the things that don’t work, what is left?

It might be obvious, but the first thing we need to do is ensure teams have both developers and infrastructure and operations people in them, both of whom do different things but are equally ‘DevOps Engineers’.

To ‘do DevOps’ these roles should overlap. Meaning you need Developers that develop most of the time and do operations some of the time, and you need Infrastructure and Operations people that do infra and ops most of the time and development some of the time.

But that doesn’t mean we get rid of teams that don’t operate this way, because we still need hardcore developers and development teams, and we still need hardcore infrastructure and operations teams. But those teams should not be on the critical path for delivery. In the parlance of Team Topologies, these are roughly a ‘complicated-subsystem’ team and the components of a ‘platform team’.

To my way of thinking, the natural interface between a DevOps-enabled team, a complicated-subsystem team and the platform team is the DevOps engineers.

Platform Teams and Build Run Operate

Let’s dig into platform teams for a moment. The platform team concept is really important when transitioning from a ’traditional’ infrastructure and operations model to a DevOps type model.

The Platform Team terminology comes from the Skelton/Pais Team Topologies book. At least, that’s the definition I am skewing towards, along with aspects of some of the topologies talked about in the book.

There are two really important aspects of a platform team in my way of looking at it. There are the services that are ‘above the line’, which have public interfaces that external teams can see and consume, and then there are those that are ‘below the line’, which have private interfaces that only the platform team sees and consumes.

The below the line services are all the core infrastructure services that are used to build the infrastructure platform. This is where a lot of the dark matter lives.

Platform teams change the way development teams consume platforms and infrastructure. By implementing the above/below the line model, they enable development teams with DevOps engineers to build, run and operate the application and the contextually important parts of the infrastructure and platform while minimising cognitive load.
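The above/below the line split can be sketched in code. In this hypothetical example (all names invented), the public `provision_service` method is the only thing a consuming development team sees, while the network and monitoring wiring, the dark matter, stays private to the platform team:

```python
# Sketch of the above/below the line split. Only the public interface is
# visible to consuming teams; the private helpers model the hidden
# platform concerns. All names and values are invented for illustration.

class Platform:
    # --- below the line: private, platform-team-only concerns ---
    def _allocate_network(self, name: str) -> str:
        return f"net-{name}"

    def _attach_monitoring(self, name: str) -> str:
        return f"dashboards/{name}"

    # --- above the line: the public, self-service interface ---
    def provision_service(self, name: str) -> dict:
        """Everything a development team needs, in one call."""
        return {
            "endpoint": f"https://{name}.internal.example",
            "network": self._allocate_network(name),
            "monitoring": self._attach_monitoring(name),
        }

svc = Platform().provision_service("orders")
```

The design point is the boundary: development teams consume the public interface on demand, and the platform team is free to rework everything below the line without breaking its consumers.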

I feel this is where the mistakes get made: when we assume that build, run, operate means the development team takes on the stack top to bottom without the need for a platform team. Rather, a platform team is vital to enabling a build, run, operate model.

Instead of attempting to fold raw infrastructure and operations into development teams and calling it DevOps, we need to transform the way we do infrastructure and operations in line with transforming the way we do development and operations. Then we have done DevOps.

However, just like we can’t do Developer Alchemy and turn development teams into full-stack infrastructure teams, we can’t do Operations Alchemy and turn Infrastructure and Operations into a platform team. In addition, we can’t just invent a new team in between development and infrastructure and call it a platform team (see DevOps team above). The platform team needs to operate in the same DevOps engineering way as well, i.e. we need engineers that do above the line engineering most of the time and below the line engineering some of the time, and vice versa.

Perhaps most importantly, there should be a continuum of engineering practice throughout. Both platform and development teams can operate autonomously, providing and consuming services in ways that fit the respective operating models.

The engineering continuum will make it less jarring for people to move between teams and roles. In fact, this should be encouraged. It’s a great way to spread the culture and a great way to enable mobility within tech.

The Two Transformations

So now we see that two transformations need to happen when adopting a DevOps approach. Firstly, adding ops engineers to development teams, who can develop and operate the infrastructure in line with the application development. This is the obvious step.

Secondly, and perhaps less obviously, transitioning infrastructure teams to platform teams composed of ‘traditional’ infrastructure capabilities (i.e. networks, storage, monitoring etc.) and a ‘public interface capability’ which presents the infrastructure outwardly as consumable services.

Both of these larger scale transformations are equally important and need to be done in step with each other. Done in isolation they won’t work. You just end up with a different arrangement of the old ‘us and them’ tune.

The DevOps transformation is as much about transforming infrastructure and operations capabilities as it is about transforming development capabilities. Sounds obvious? I think so, but I feel it’s often seen as largely a development-centric transformation.

A Multi Year Overnight Success

One final thing to call out is that it is a mistake to think that a dramatic transformation to a different way of working like a DevOps aligned model will yield rapid positive results. I think people know this, but somehow it doesn’t stop us expecting magic.

The years of DevOps Research and Assessment (DORA) reports have established that there is huge benefit in these approaches and amazing results can be had. But we need to be realistic about what results we expect and when.

There is a cognitive trap we can fall into where we reason backwards from someone else’s results and filter out inconvenient facts. Sometimes we hear what we want to hear; we see the product of the process and forget the process. I expect this happens all over the place, but I can only speak to my experience, and we fall into this trap in tech transformations all the time.

We hear: “Company X got amazing results by changing to this new-fangled way of doing Y! If we copy it we’ll get the same amazing results tomorrow.”

But what was said: “Company X got amazing results by changing to this new-fangled way of doing Y over a number of years.”

I have read many books and attended many talks whose theme is ‘How we did DevOps’, and pretty much all of them describe a multi-year journey of ongoing continuous improvement, often involving false starts and course corrections.

It takes time; it’s an ongoing journey. There is no ‘one size fits all’ pattern that will work for everyone. We need to take the journey, and Company X’s journey isn’t our journey. But the value is in the journey.

This doesn’t mean we can’t learn from Company X. The reverse sentiment is also bad: “Yes I know Google does it this way, but we aren’t Google so it doesn’t apply to us”. I think history has taught us that there is a lot of good stuff we’ll miss if we take that approach.

Constant improvement means we are always improving the way we build and deliver things but also always improving how we work. I am sure all engineers will agree that both of these things make for a more fulfilling day job.

The good news is that every tech organisation probably already has the intellectual capital to do it right. As DORA shows, these approaches will certainly accelerate tech delivery and make it more reliable. But I hope I have made the case that there is no side benefit where we get to remove operations concerns by folding them in as developer tasks.