The Survival Metric
A Ticking Time-Bomb
Earlier this year I was standing on the shore of Lake Rotorua in New Zealand staring at a flock of black swans. I got to thinking about the metaphorical black swan events I have seen in my career in technology. In the roughly 30 years I have been at this gig I counted six major events I would call black swans that have impacted the organisations I worked for. Turns out they aren’t as rare as you might think. Metaphorical or real.
After the cyber attacks on M&S, the Co-op, Heathrow Airport, Harrods and Jaguar Land Rover (JLR), which in some cases took the respective businesses out of the market for days, weeks or even months, leaders across the board must be taking a good hard look at their IT estate and security posture and asking some important questions.
It’s not just a cyber attack, but any event which could take down critical systems.
We can’t bury our heads. We may be sitting on a ticking time-bomb. The thing about time-bombs, though, is that they have a timer. As long as you defuse them before the timer runs out you can avoid getting blown up.
I fear many of us recognise the time-bomb exists but we aren’t doing much to defuse it. Maybe it’s time to go do that before it explodes?
Three Big Questions
I think there are three big questions that frame everything else:
- How long can your business survive if all your computer systems are down or offline?
We all probably have a fairly good idea what the size of the ‘Survival Window’ is for our respective organisations. For those in highly competitive markets it will be surprisingly small.
Clearly, this is the limiting factor for everything else.
Here’s the key point though: the size of the Survival Window is a fixed value. Once you determine its size, it’s likely there isn’t much you can do to widen it. Even if you could, it’s not the right value to try to change. It’s also the case that the more of the window that is used to recover, the more money and reputation the business loses. But even if there is a massive war chest reserved for such an event, the recovery of any commercial business depends on it being commercially viable afterwards. Reputational damage can’t easily be fixed with money.
For reference, a BBC report states that analysts estimate the JLR incident cost the company £50 million per week.
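To make the arithmetic concrete, here is a minimal sketch with purely illustrative figures; the weekly cost echoes the analyst estimate above, but the window size is an assumption, not any company’s real number:

```python
# Illustrative only: both figures are assumptions, not real company data.
WEEKLY_COST_GBP = 50_000_000   # the analyst estimate quoted above
SURVIVAL_WINDOW_WEEKS = 8      # assumed size of the fixed survival window

def downtime_report(weeks_down: int) -> str:
    cost = weeks_down * WEEKLY_COST_GBP
    remaining = SURVIVAL_WINDOW_WEEKS - weeks_down
    status = "still inside the window" if remaining > 0 else "window exhausted"
    return f"{weeks_down} week(s) down: ~£{cost:,} lost, {status}"

for weeks in (1, 4, 8):
    print(downtime_report(weeks))
```

The point of the toy numbers is the shape of the problem: the losses scale linearly with time, but the window does not move.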
- Is it possible to recover the business within the ‘Survival Window’?
Every good business has disaster recovery (DR) and business continuity (BC) plans which are evaluated and tested regularly. But do the plans properly consider a total outage, including of the DR site? Do they test recovering the business from zero to minimum viable company within the window? How up to date is that plan?
To anyone working in tech these are obvious and normal questions about DR and BC to which there must be answers. But underpinning each of them is a lot of complexity. Because of that complexity you have to be very careful of shortcuts like hand-waving away seemingly unlikely situations or solving only for the obvious path.
Ask yourself why Jaguar Land Rover’s DR plan couldn’t get the business back up and running in a couple of days, or a week at worst. I don’t have any inside information on this, but I am guessing the answer is that the DR plan and DR site were probably useless for the situation they found themselves in. Or worse, they couldn’t access the systems which stored the plan itself.
If you can’t put your hands on an up-to-date plan for dealing with a major outage, framed around recovering the business from zero to minimum viable company within your survival window… The bomb is still ticking.
If your survival window plan isn’t testable and isn’t tested as a matter of business as usual operations… The bomb is still ticking.
- Who would you rely on to respond and can you identify them?
I mean specifically. Not just ’the technology org’ or ‘cyber security team’. There should be identifiable chains from the top of the company down to the person or people who can recover every critical system in the business.
You should look closely at those chains. If too many of them go through or converge on a single person or small team, or, worse, converge on no one… The bomb is still ticking.
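As a rough illustration of how you might surface those convergence points, here is a minimal sketch; the systems and names are entirely made up. It flags critical systems whose recovery depends on one person, or on no one, and people on whom several recovery chains converge:

```python
from collections import defaultdict

# Hypothetical mapping: critical system -> people who can actually recover it.
recovery_owners = {
    "payments": ["alice"],
    "warehouse": ["alice"],
    "identity": ["bob", "carol"],
    "erp": [],  # no one identified: the most dangerous case
}

for system, owners in recovery_owners.items():
    if not owners:
        print(f"{system}: NO identified recovery owner")
    elif len(owners) == 1:
        print(f"{system}: single point of failure ({owners[0]})")

# People who are the sole recovery owner of several systems.
load = defaultdict(list)
for system, owners in recovery_owners.items():
    if len(owners) == 1:
        load[owners[0]].append(system)
for person, systems in load.items():
    if len(systems) > 1:
        print(f"{person} is the only person who can recover: {', '.join(systems)}")
```

Even a spreadsheet version of this exercise tends to be revealing.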
The First Technical Debt Problem
The case where there are dependencies on single people or small teams is particularly dangerous when a business is carrying a lot of technical debt. Debt in the form of quick fixes, workarounds and delayed modernisation.
The interest due on technical debt is paid by ensuring the knowledge of the systems which carry that debt is retained. This is a compounding problem. Key people may not wholly enjoy labouring under the burden of heavy tech debt, or being boxed in by it, and may be more likely to move on.
To this end, organisations can’t afford to be blind to the people risk that tech debt poses. I feel this is often a hidden risk that may not be uncovered until it’s too late.
Having to work around specific complexities introduced by carrying technical debt will have a direct impact on how long it takes to recover within the survival window. This then is clearly a variable that can be managed to mitigate the risk.
Specifically Cyber Security
Being effective at, proactive about, and aware of cyber security is another key way to reduce the time you spend in the survival window if the worst were to happen.
With this in mind, here is a bonus question:
- Who is responsible for securing your systems and software?
Hint. It’s not your security org.
It’s a cliche, but when dealing with today’s complex and highly distributed systems, the answer has to be everyone.
We often talk about ‘shifting security left’, making sure it’s a core part of the systems development lifecycle. But how often do we put off implementing inconvenient security features in favour of shipping quickly? How often do we leave security concerns too late in the delivery lifecycle? How often do we miss a patching cycle to avoid inconvenient clashes with software releases or downtime? … The bomb is still ticking.
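One concrete, low-effort way to shift left is to fail the build when a dependency is behind a known-patched version. Here is a minimal sketch; the advisory table is hard-coded and hypothetical, and in a real pipeline you would feed it from a vulnerability database or a scanner rather than a literal:

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Hypothetical advisories: package -> first patched version.
ADVISORIES = {
    "requests": (2, 31, 0),
    "urllib3": (1, 26, 18),
}

def parse(v: str) -> tuple:
    # Naive parsing: good enough for a sketch, not for release candidates etc.
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

failed = False
for package, fixed_in in ADVISORIES.items():
    try:
        installed = parse(version(package))
    except PackageNotFoundError:
        continue  # package not used in this project
    if installed < fixed_in:
        print(f"FAIL: {package} {installed} predates patched {fixed_in}")
        failed = True

sys.exit(1 if failed else 0)  # a non-zero exit fails the CI stage
```

The value is not the ten lines of code; it is that the check runs on every build, so the inconvenient conversation happens before shipping, not after.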
The Inevitable Breach
Cyber security is a buzzword- and cliche-laden domain; the very name of the discipline conjures up images of dystopian sci-fi. Here is another cyber security cliche: defenders need to be right every time, attackers only need to be right once.
If your mental image of an attacker is some hooded Hollywood hacker who can ‘crack the firewall’ then you need to have a rethink. They may indeed wear hoodies, but they aren’t trying to come in via a well fortified front door. The favoured and most successful malware delivery mechanism today is good old e-mail.
I am sure every corporate citizen has been subject to many internal ‘phishing’ campaigns run by internal security teams. These are very important: they raise awareness of e-mail borne malware and of the need to be vigilant about unusual content, and they likely prevent a large number of otherwise successful attacks. However, you can’t combat basic human nature, and with the rise of AI tooling, phishing e-mails can be hyper-personalised and cleverly disguised in more and more ways. It’s not if one will sneak past, it’s when. They only need to be right once.
It’s the inevitability of a breach that should dictate the internal posture the security organisation takes. Having strong walls is meaningless if a well-aimed e-mail can drop a bomb right in the middle of your castle. The key is early detection, by constantly looking for ‘indicators of compromise’, and preventing the bomb from destroying everything if it does go off, or (to use another cliche) limiting the ‘blast radius’.
There are two main ways to limit the internal blast radius of a malware detonation:
- Ensure there is good isolation, with firebreaks between key internal systems (a sketch follows this list). This is hard, inconvenient and disruptive. It is where a good security organisation earns its keep: absorbing as much of the complexity and burden as possible of making the security walls effective, while keeping them as transparent to the rest of the organisation as it can.
- Ensure exploitable pathways are removed by having good internal security hygiene. See above, security is everyone’s responsibility. See below, ‘The Second Technical Debt Problem’.
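To illustrate the first point, here is a minimal, purely illustrative sketch of a default-deny segmentation policy; the services and allowed flows are made up, and in practice this lives in firewalls, network policy or a service mesh rather than application code:

```python
# Hypothetical default-deny policy: source service -> destinations it may reach.
# Anything not listed is denied, so a compromised host can't roam freely.
ALLOWED_FLOWS = {
    "web-frontend": {"order-api"},
    "order-api": {"payments-db", "audit-log"},
    "batch-jobs": {"warehouse-db"},
}

def is_allowed(src: str, dst: str) -> bool:
    """Default deny: a flow is permitted only if explicitly listed."""
    return dst in ALLOWED_FLOWS.get(src, set())

# A compromised web frontend can still reach the order API...
assert is_allowed("web-frontend", "order-api")
# ...but cannot jump straight to the payments database: that is the firebreak.
assert not is_allowed("web-frontend", "payments-db")
```

Default deny is the design choice that matters here: the firebreak exists because nothing is reachable unless someone deliberately opened it.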
Security and AI
I briefly mentioned AI above. Attackers are getting more and more sophisticated as the stakes get higher. With AI tools to assist, the time to exploit will likely get shorter in the near term.
AI tooling being used to build more effective exploits and to craft hyper personalised phishing attacks is only one vector. The rise of AI tooling and agents within organisations presents another.
How closely are you looking at what AI tools have access to within your org, and what controls are in place? They are safely inside your walls, right? What could go wrong!? Think about the ‘simplest’ attack: a phishing mail is used to discover locally available AI tooling on the victim’s computer, which is then used to rapidly discover details about the internal workings of the organisation for later use. This attack might be hard to detect if it looks like normal use of the AI tooling. Agents reaching out around your internal systems and bulk retrieving information may not look like unusual behaviour any more.
I mentioned that it’s not common for an attacker to come through the front door. However, how many organisations have attached AI tools to externally facing systems? How much access do they have to backend systems, either directly or indirectly? Is it possible that AI tools can be chained together to effectively build a bypass through security walls, jumping firebreaks and air gaps? For example, a chatbot that has access to an internal knowledge base that in turn uses a database system that is using an AI agent for housekeeping tasks. Could that be discovered and interacted with via the chatbot? It’s a naive and contrived example, but how aware are we of the overlapping pathways between powerful autonomously operating tooling? It’s a brave new world! We may be good at locking down human users, but are we blind to the bots?
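One way to reason about those overlapping pathways is as a reachability problem. If you can map which tools and agents can directly reach which systems, a simple traversal shows what an attacker who owns the chatbot can reach indirectly; the access graph below is entirely made up:

```python
from collections import deque

# Hypothetical access graph: which tool/system can directly reach which.
ACCESS = {
    "chatbot": {"knowledge-base"},
    "knowledge-base": {"database"},
    "database": {"housekeeping-agent"},
    "housekeeping-agent": {"backup-store", "database"},
}

def reachable_from(start: str) -> set:
    """Breadth-first traversal: everything transitively reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in ACCESS.get(queue.popleft(), set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# The external-facing chatbot indirectly reaches far more than it was
# ever meant to touch.
print(sorted(reachable_from("chatbot")))
# ['backup-store', 'database', 'housekeeping-agent', 'knowledge-base']
```

The graph is contrived, but the exercise is not: enumerate the edges you have actually granted and see where they compose.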
These powerful tools are a new exploit vector that many of us may indeed be blind to right now. Attackers aren’t. ‘Prompt injection’, ‘jailbreaks’ and ‘personality changes’ are all becoming part of the security vernacular. We need to adapt fast. … The bomb is still ticking.
The Second Technical Debt Problem
I mention above that a key way to mitigate security risk is to limit exploitable pathways between systems. This is done by the laborious and thankless task of keeping systems up to date and patched. General maintenance of systems is indeed laborious and often gets in the way of delivery. But it’s an insidious beast. It has no outward and immediate benefit to the core business and it’s easy to de-prioritise in favour of delivery. The truly nasty part is that the longer you leave it, the harder it becomes to address.
Unpatched systems represent exploitable pathways attackers can use to move laterally, ‘east/west’, within your systems, burrowing ever deeper and increasing the blast radius.
Leaving systems unmaintained and unpatched is the single most dangerous thing you can do in terms of ensuring organisational survivability within a fixed survival window. If you haven’t patched systems regularly and a security breach exploits an unpatched vulnerability, you are in double trouble: during the survival window, not only do you have to recover systems, but you have to patch them and make your software work on systems that are in an untested state.
Here is another question: why haven’t you patched your systems? If the answer is something like ‘Because our software only works with version X of Java and we haven’t had time to modernise’, then you really are sitting on that ticking time-bomb. How can you recover anything if you not only have to upgrade a key dependency but also refactor your software to even work with that dependency?
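A cheap way to keep that particular bomb visible is to make the runtime floor an explicit, continuously tested assertion rather than tribal knowledge. A minimal sketch, where the version floor is an arbitrary example:

```python
import sys

# Hypothetical policy: the oldest runtime this software is supported on.
# Keeping this floor current forces the modernisation conversation early,
# instead of discovering the gap in the middle of a recovery.
MINIMUM_SUPPORTED = (3, 10)

if sys.version_info[:2] < MINIMUM_SUPPORTED:
    sys.exit(
        f"Runtime {sys.version_info.major}.{sys.version_info.minor} is below "
        f"the supported floor {MINIMUM_SUPPORTED}. Modernise before it compounds."
    )
print("Runtime floor satisfied")
```

Run in CI, a check like this turns ‘we haven’t had time to modernise’ into a visible, failing build rather than a surprise during recovery.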
As tempting as it might be to put off security remediation or cut corners in favour of delivery… In today’s climate… Don’t do it! But if you really, really must (you mustn’t!), have a watertight, unbreakable and clear plan to mitigate as soon as possible, and have everyone at DEFCON 1 until you do.
The Survival Metric
With all this said, and at the risk of stating the obvious, it seems to me that: every business is only as viable as the size of its fixed survival window, and wholly depends on the set of people that can restore it before the window closes.
You can’t change your survival window size and you never want to have to use it all. The one goal should be to be able to recover your business using as little of the survival window as possible. And that you can change. By at least:
- Having up-to-date, achievable and testable business continuity processes as part of business as usual operations
- Having a healthy approach to tech debt
- Keeping on top of security concerns by having everyone own them
- Avoiding falling into the trap of losing sight of the value in keeping on top of all of the above as a core business concern
Maybe it is hyperbolic to state the viability of a business purely in terms of this ‘Survival Metric’. But we have seen so many cases in recent months where companies have indeed found themselves inside the survival window. It seems silly not to consider it, and to ask where we stand with regard to it.