The Crowdstrike Incident for Normies
Crowdstriking
Before last week I am willing to bet not many people had heard of a company called ‘Crowdstrike’. Now, however, I am betting a large portion of the planet at least knows the name. No such thing as bad publicity? I am not sure about that. But who the heck are Crowdstrike, and how come they were able to break everything? Let’s dig in!
Endpoint Detection and Response
Crowdstrike are a cybersecurity company. Their flagship product is called ‘Falcon’ and it’s what is known as an ‘Endpoint Detection and Response’ or EDR tool. These are sophisticated programs that watch for anomalous, suspicious and possibly malicious activity on computers and can take action to stop it. It’s like fancy anti-virus that can detect new viruses before they become a major issue.
An ’endpoint’ just refers to any computer that sits on a network. It’s an endpoint because it’s at the end of a network cable or Wi-Fi connection.
Crowdstrike sells the Falcon platform to corporate customers that need to protect endpoint devices from people with bad intentions on the internet, or perhaps even a rogue employee within a company.
15 Minutes Till Doom
A few years ago I ran an experiment mainly as an excuse to do some coding and play around with some logging and analytics software. This experiment involved putting an endpoint (computer) on the internet running what is called a ‘honeypot’.
In cybersecurity terminology, a honeypot is an inviting target designed to attract and trap bad guys. Honeypots are used to study the methods used by the bad guys to exploit computers. My honeypot was very simple and couldn’t do anything. It simply waited for people to try to log in and recorded the attempts.
Once my honeypot was live on the internet, it took around 15 minutes for the first bad dudes to turn up and attempt to log in.
The reason it didn’t take long for baddies to show up is because there are vast numbers of computers on the internet that have been exploited already and weaponised to do the bidding of their malicious controllers. These big collections of machines are used to form what is called a ‘bot net’ that can be commanded to do many things. One of the things they can do is scan the internet to find other computers to exploit and suck into the bot net for use later. Because the bot nets are so vast (often thousands of computers) they can scan large chunks of the internet quickly looking for victims. Seems crazy, given the vastness of the internet, but it really did only take 15 minutes for my inviting honeypot to be discovered by a bot net.
You can read all about my honeypot experiment on this very blog. The article is called Building an SSH Honeypot and analysing the results with Elastic Stack.
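If you’re curious what the core idea looks like, here’s a tiny sketch (not my original code; the port number and log format are just assumptions): a listener that accepts connections and writes down every attempt.

```python
# A minimal sketch of the honeypot idea: listen on a port, accept connections,
# and record every attempt. Illustrative only; my real honeypot spoke enough
# SSH to capture usernames and passwords, and shipped its logs to Elastic.
import socket
from datetime import datetime, timezone

HOST, PORT = "0.0.0.0", 2222  # assumed port; a real honeypot might sit on 22

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((HOST, PORT))
    server.listen()
    while True:
        conn, addr = server.accept()
        with conn:
            # Record who knocked on the door and when.
            print(f"{datetime.now(timezone.utc).isoformat()} "
                  f"connection attempt from {addr[0]}:{addr[1]}")
```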
Now my honeypot computer was so stupid (by design) that it couldn’t do anything of any use. But imagine putting a computer on the internet that could do something useful or had something valuable on it. Lots of companies have to do this to provide services to consumers or to interact with other businesses. In fact every web server that serves you websites (like this amazing blog) is potentially useful to the bad people. Not just that… Every computer that connects out to the internet (such as the one you are reading this blog on) can potentially get exploited via a bad piece of software. For corporate machines this would be very bad indeed. So we have to protect all of them, all the time.
Painting the Forth Bridge
To protect systems from miscreants and people with ill intentions we have to use anti-virus software or fancy EDR software. But it’s no good just installing and running the software; you also have to keep it up to date.
Keeping your home computer up to date is relatively easy. The software does it for you: it reaches out to the software vendor and pulls down the latest version of the program, along with any ‘signature’ files, which provide the patterns the security software uses to identify bad things. Usually the software will alert you and ask you to apply the update.
In a controlled corporate environment it isn’t that easy. Corporates typically need to control when and how their IT infrastructure updates, especially when an update may require a process or system to restart and interrupt service. So the way this used to be done was via centralised systems within the walls of the enterprise that would download updates and distribute them internally in a staged manner. This could be done over a number of weeks.
However, this started to get problematic. As the bad guys became more relentless with the production of malicious and harmful software they could deliver at scale (via bot nets), there was a pressing need to constantly update the protection. It got to the point where you’d finish one round of updates just in time to start the next, which made it impossible to test effectively. It was a thankless and often unachievable task.
Time and progress march on, and now lots of enterprises have hundreds, commonly thousands, tens of thousands or even hundreds of thousands of endpoint devices. It’s impossible to keep these up to date in a sustainable way using the old staged approach.
Reaction Time
Not only do companies need to keep endpoints up to date, but they need to be able to react quickly to a cybersecurity event to stop it in its tracks and limit damage. To do this, the endpoint software needs to be kept up to date in real time and be ready to be instructed to react to a new threat proactively.
This effectively means that organisations can’t really stage updates that easily, i.e. they can’t really update a few machines first to test whether it works. This is because the whole purpose of the system is to update endpoints quickly in the event of a fast-moving cyberattack. By the time the update is tested for side effects, the attack may have taken hold. Risky as it may be, updates need to be done fast.
This means the endpoint software needs to be constantly connected to a command and control function. With thousands of endpoints to orchestrate, this becomes very expensive in terms of compute and personnel requirements. It’s very complicated for enterprises to do this themselves.
This is because it requires infrastructure and automation on a grand scale. The type of scale that ‘cloud compute’ provides: highly scalable compute resources running on hardware managed by third parties like Amazon Web Services (AWS), Google Cloud Platform (GCP) or Microsoft Azure.
Cybersecurity Utility Company
The holy grail for cybersecurity protection is to be able to collapse the required endpoint software down to as few products as possible and to orchestrate and update it centrally, safely and quickly.
Crowdstrike provided the premier product for doing this in their Falcon platform.
Put simply, Falcon consists of two main components. The first is a sensor program that runs on every endpoint, which can detect a cyber threat and take action when one is found. Crowdstrike’s sensor is a bit like a Swiss army knife in that it fulfils many cybersecurity use cases, so companies don’t need separate tools for antivirus and malicious activity detection, for example. This is very appealing.
The second component is an orchestration platform which keeps the sensors updated and running to specification. It also provides central administration, monitoring and reporting functions. The orchestration platform is hosted on Crowdstrike’s cloud platform, in other words outside the walls of the customer company, so customer companies don’t need to manage or run the complicated and expensive backend infrastructure. Crowdstrike can then sell their cybersecurity service as if it were another utility, like the internet, power or water companies.
Part of that service is also providing upfront testing and assurance. Organisations can pass that burden, and the risk, on to Crowdstrike.
Trust
Companies are required to put a high degree of trust in companies that provide services in this way, especially cybersecurity services. Organisations do trust companies like Crowdstrike for two main reasons: firstly, they have built trust over many years of providing service to their customers; and secondly, because for a security firm trust is critical to the company’s success and its ability to function as a going concern. If a security company breaks the trust of any of its customers, it will quickly be dropped by all its customers and cease to exist. Much like if a bank looks like it hasn’t got enough money, every customer will try to quickly withdraw all their cash.
Once companies have established this trust, it’s not hard to hand off control of critical functions to the trusted party. It takes away a massive headache for internal IT and security functions and provides a degree of assurance and comfort that the security posture is managed and under control.
Crowdstrike earned this trust and have done an excellent job providing the service they do. Hence they have been very successful and have a huge customer base: banks, airlines, payment providers, ATMs, point of sale systems, eCommerce companies, stock exchanges. The list goes on. Lots of big companies spanning the globe, and millions of endpoints using the Falcon platform very successfully.
Kernels
For the Falcon sensor component to be able to perform its Swiss army knife duties it has to be able to see everything that is happening on the endpoint. To do this it needs to run close to the heart of the system. Or maybe close to the brain of the system.
At the centre of modern computer operating systems is a component, or set of components, called the kernel. The kernel has direct and complete control over all components in the system. The operating system is split into things the kernel looks after, such as the physical disks, network adapters, video displays, sound controller and the like, and things the user looks after, such as applications, data files and folders. The kernel ensures that there are no conflicts between things wanting to access the hardware, ensures that when you run multiple applications each gets a fair chance to run, and performs many other complicated low-level functions.
For example, when you want to open a picture from your photo library, the photo album application runs on the user side of the operating system. When it needs to load photo data from the storage device it asks the operating system to instruct the kernel to load the data from storage. The operating system then hands the data back to the application to display the picture.
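To make that split a little more concrete, here’s a tiny sketch (assuming a Unix-like system and a made-up file name) of an application asking the kernel for data. Each call crosses from the user side into the kernel and back:

```python
# A tiny sketch of an application asking the kernel for data. Each os.* call
# below is a thin wrapper around a system call: the request crosses into the
# kernel, the kernel talks to the storage hardware via its device drivers,
# and the bytes are handed back to the application.
import os

PHOTO_PATH = "holiday.jpg"  # hypothetical file, used purely for illustration

# Create a stand-in file so the example runs end to end.
with open(PHOTO_PATH, "wb") as f:
    f.write(b"\xff\xd8\xff" + b"\x00" * 100)

fd = os.open(PHOTO_PATH, os.O_RDONLY)  # system call: open the file
data = os.read(fd, 4096)               # system call: kernel reads from storage
os.close(fd)                            # system call: release the handle

print(f"Read {len(data)} bytes; the application never touched the disk itself.")
```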
However the kernel cannot know everything about every type of storage hardware it may need to use to read data. Therefore hardware vendors provide the code for the kernel to use to read data from their specific hardware. These pieces of code are called device drivers and they run on the privileged kernel side of the operating system.
Anything that runs on the kernel side of the operating system has the ability to see what is going on across the entire system at a low level. Part of the Falcon sensor provided by Crowdstrike is a device driver that runs on the kernel side of the operating system. This is what enables it to detect and prevent “bad things” happening quickly. It can see everything and take action quickly and early.
Updating Brain Functions
Because a device (or kernel) driver is a critical piece of software tapped into the operating system’s brain, updating it can be very risky. So rather than update the device driver directly, Crowdstrike updates data files which the device driver uses to stay up to date with the latest things it requires to detect and stop malware and the like. This means the core driver stays static and ostensibly reliable, but can still be updated by replacing the data files it uses.
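As a sketch of that pattern (purely illustrative; Crowdstrike’s real data files use their own proprietary format, not JSON), imagine a detection engine whose code never changes but which reloads a folder of rule files every time an update lands:

```python
# A sketch of the 'static engine, updatable data' pattern: the engine's code
# stays fixed, while the rules it applies come from data files that can be
# swapped out at any time. Hypothetical directory and file format.
import json
from pathlib import Path

RULES_DIR = Path("rules")  # hypothetical directory of frequently updated rule files

def load_rules(rules_dir: Path) -> list[dict]:
    """Reload every rule file; this is the part that gets updated constantly."""
    rules = []
    for rule_file in sorted(rules_dir.glob("*.json")):
        rules.extend(json.loads(rule_file.read_text()))
    return rules

def scan(event: str, rules: list[dict]) -> bool:
    """The static engine: flag an event if any rule's pattern appears in it."""
    return any(rule["pattern"] in event for rule in rules)

rules = load_rules(RULES_DIR)
print(scan("powershell.exe -enc SQBFAFgA", rules))
```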
Passing Out
When something goes wrong on the kernel side of the operating system, it’s very hard for the system to be confident about its overall health. This is because the kernel is basically the brain of the system. If the brain is internally inconsistent, it can’t be sure of anything. So the safest course of action is to halt and spit out the last few things it was doing in a log. This is like passing out when you get hit in the head, or having a stroke. The brain is protecting the rest of the body by immobilising it when the central nervous system can’t function.
When the Microsoft Windows operating system passes out, much like a human deprived of oxygen, it goes blue. Or rather, it halts and displays the famous ‘Blue Screen of Death’ (BSOD), which provides users with some information about why it halted. In the background, if it can, Windows is also dumping the contents of the computer’s memory and information about running processes to disk so engineers can dissect it and find the root cause.
Crowdstroke
So we know a few things now: the Crowdstrike Falcon sensors all talk to the Crowdstrike cloud infrastructure to stay updated; the Falcon sensor has a kernel driver that can be updated via data files; to function in a rapidly moving cybersecurity world it needs to be updated rapidly; and Crowdstrike are very good at this and are trusted by a lot of organisations.
But…
“When something goes wrong on the kernel side of the operating system, it’s very hard for the system to be confident about its overall health.” A problem will cause the computer to pass out or have a stroke.
Crowdstrike’s data files are, in effect, instructions that the device driver reads and acts on. If there is a bad bug in a data file, the device driver can crash, causing something to go wrong in the kernel and therefore causing a halt and a BSOD. This is what happened.
Crowdstrike’s orchestration platform updated the data files used by every Windows sensor in its customer base. One data file had a bug which caused the device driver to crash, which in turn halted the Windows system running it. Boom. An estimated 8.5 million Windows machines across many different industries all passed out at roughly the same time. This brought those industries to their knees, in some cases for days.
Recovery Pain
The day a large chunk of the planet’s Windows infrastructure went down was a bad day for many IT workers. Remember, the reason organisations use systems like Crowdstrike is to manage tens of thousands of computers, either in big data centres they run themselves or in cloud services like AWS, Azure and GCP. Most of these computers are managed remotely using automation. However, recovering from the Crowdstrike crash required them to be fixed manually. Even if they rebooted themselves, they instantly crashed again due to the bad device driver.
The fix for the problem was to boot the Windows machine into a safe mode where only the core device drivers are loaded, then remove the bad Crowdstrike data file and reboot the machine. It would then boot normally. Simple, right? Nope!
When this is required, the only way to recover is to access the affected computer directly: basically standing in front of the computer and interacting with it. There are two problems with this. First, can you imagine how long it takes to fix thousands of computers by hand, when they are mostly designed to be worked on via remote control? Secondly, how do you even stand in front of a computer that doesn’t physically exist because it’s hosted virtually by a cloud provider?
Luckily, there are what are called ‘out of band’ tools which allow support people to interact with remote machines via other channels even if the machine itself is inoperable. But generally these tools are used to interact with one or two machines at a time and they can be quite laborious to use. Also, it’s not expected that everything will require them at the same time, so they tend not to function perfectly at large scale. I am sure some automation could be and was used to recover many Windows machines, however I expect you can see why it took a while to correct everything when 8.5 million devices were affected.
Bad Testing
The first reaction of many IT professionals was “How did this get through Crowdstrike’s testing?” Remember that updating device drivers is a critical thing. It can very much cause problems like this. With great power comes great responsibility.
There are probably many answers to this. I can posit a couple. One of which is an assumption on my part, the other is from Crowdstrike themselves.
My assumption is that they got overconfident with their testing and software architecture. They lost sight of the danger of the updates because of the split between the device driver and the data files. The engineers in charge of release and testing maybe didn’t put enough weight on the data file updates and got a bit blasé about it. This might be unfair and it’s just my thought.
What Crowdstrike said was that a gap in their testing missed the bug in the data file and passed it as safe to release. Their test processes appear to inspect the contents of the data files and validate them without actually running them on a real machine before releasing them. Maybe they were overconfident that the validator could find every problem when in fact it could not!
I can see the attraction of having a piece of automated testing software that validates the data files rather than testing them on an actual machine. It’s much easier to automate, it’s a lot faster and it takes less compute resource. But you risk missing something that crashes a real computer. I guess they found that out the hard way.
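To illustrate how a validator can pass something that still crashes the real consumer, here’s a deliberately contrived sketch (not Crowdstrike’s actual format or bug): the checker looks at the declared shape of the data, while the ‘driver’ code that consumes it trusts a field that isn’t actually there.

```python
# A contrived sketch of why static validation can miss a crash bug: the
# validator checks the declared shape, but the consumer indexes a field the
# data doesn't actually contain. Not Crowdstrike's real format or bug.

def validate(update: dict) -> bool:
    """The 'content validator': checks structure, never exercises the code path."""
    return update.get("version") == 1 and "fields" in update

def consume(update: dict) -> str:
    """The 'driver': blindly trusts that 21 fields are present."""
    return update["fields"][20]

bad_update = {"version": 1, "fields": ["value"] * 20}  # only 20 fields supplied

print(validate(bad_update))  # True -> passes testing, gets shipped to everyone

try:
    consume(bad_update)
except IndexError as error:
    print(f"Consumer crashed: {error}")  # the user-space stand-in for a kernel crash
```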
Canaries in the Coal Mine
Another thing Crowdstrike could have done, and should do, is release updates that aren’t in response to an active cyberattack to a few customers first, watch to see if there are problems, then release to the rest, effectively doing the staged approach many companies used to do. This way you don’t knock out everyone all at once. They could perhaps offer a discount to customers that allow some of their endpoints to be part of the ‘canary’ pool, or allow customers to tag less critical infrastructure as available canaries.
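A rough sketch of that canary logic might look like the following (every name and threshold here is made up): push to a small pool first, watch the health telemetry, and only then roll out to everyone else.

```python
# A rough sketch of a canary rollout: update a small pool first, watch for
# crashes, and only continue if the canaries stay healthy. All names and
# thresholds are hypothetical.
import random

def deploy(endpoint: str, update: str) -> None:
    print(f"deploying {update} to {endpoint}")

def healthy(endpoint: str) -> bool:
    # Placeholder for real telemetry, e.g. 'has this machine crash-looped?'
    return random.random() > 0.01

def staged_rollout(endpoints: list[str], update: str, canary_fraction: float = 0.1) -> None:
    canary_count = max(1, int(len(endpoints) * canary_fraction))
    canaries, the_rest = endpoints[:canary_count], endpoints[canary_count:]

    for endpoint in canaries:
        deploy(endpoint, update)

    if not all(healthy(endpoint) for endpoint in canaries):
        print("Canaries unhealthy: halting rollout and raising the alarm.")
        return

    for endpoint in the_rest:
        deploy(endpoint, update)

staged_rollout([f"host-{i}" for i in range(20)], "channel-update-042")
```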
Reaction
The reaction to the failure in some circles was to immediately blame the cloud-based nature of the Crowdstrike platform and to say that companies should internalise this function and take control for themselves. However, for the reasons I have outlined above, this is nigh on impossible.
I highly doubt IT and security organisations will fundamentally change their approach to managing endpoint security following this incident. Some organisations may move away from Crowdstrike because they have broken a trust at some level, but they will likely move to a product that works in largely the same way.
It’s a pure risk calculation: the risk of a Crowdstrike-like event happening again versus the risk of a cyberattack crippling an IT function. It’s an interesting argument as to which is more likely. But it’s unlikely that IT and security departments are going to fully take on the burden of running endpoint protection internally. It would quickly swamp your operating costs and turn you into a cybersecurity firm rather than an airline or eCommerce company.
Systemic Risk
What is interesting is how one single company can cause such massive disruption across so many diverse companies. Just as the credit crunch of 2007-2008 revealed how interconnected big banks and financial institutions were, this incident has revealed how interconnected a great number of organisations are via a critical dependency on the same third party. Will there be a regulatory reaction to this? How can we ensure diversity for these systems? It seems like a difficult but critical problem to solve.
Normality
Some called this a warning shot. Some were smug because it only affected one vendor’s OS (Microsoft Windows). But it could just as easily have happened to other systems. I guess it was a warning. But how will the industry react to that warning? I am not sure.
For now we are back to normality. Will the next event be a cyber event where Crowdstrike and its peers are the heroes? Or another large-scale software issue? The warning shot has certainly woken many of us up and has us thinking about this interconnectedness and how we can react and protect our systems better. Time will tell!
Shameless
If you enjoy this content and feel like doing so, please feel free to drop me a tip!