Why did 8.4 million computers die?
Who’s to blame, and how can healthcare companies protect themselves in the future?
Unless you’ve been vacationing on an island (good for you), you’ve probably been affected by or heard about the CrowdStrike bug that took down eight million computers and cost the healthcare sector an estimated $1.94 billion. Mission-critical systems in healthcare went down, which delayed timely care.
Disclaimer: The following is based on public information released by CrowdStrike (https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/) and on my experience working in the Microsoft Windows codebase twenty years ago.
How is CrowdStrike Software Special?
Software crashes all the time, but that doesn’t usually take down the computer. So what was different this time?
Software runs in one of two modes on a computer: kernel mode or user mode. The vast majority of software runs in user mode, so if it hits an error, the operating system kills the affected program but the computer itself is fine.
Kernel mode is a special mode that gives software access to the deepest workings of the computer. As a result, if software running in kernel mode hits an error it does not handle, the whole computer goes down.
From Microsoft’s website:
All code running in kernel mode shares a single virtual address space. As a result, a kernel-mode driver isn’t isolated from other drivers or the operating system. If a kernel-mode driver mistakenly writes to the wrong virtual address, it could compromise data belonging to the operating system or another driver. If a kernel-mode driver crashes, it causes the entire operating system to crash.
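To make the user mode half of this concrete, here is a minimal sketch in Python. It starts a separate user mode process that crashes on purpose and shows that the parent program keeps running. The function name run_crashing_child is my own, chosen for illustration. There is no safe equivalent demo for kernel mode, because a crash there takes the whole machine with it.

```python
import subprocess
import sys

# A tiny child program that dies abruptly, the way a buggy
# user mode application might.
CRASHING_CHILD = "import os; os.abort()"


def run_crashing_child() -> int:
    """Start a separate process that crashes and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", CRASHING_CHILD],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    status = run_crashing_child()
    # The exact status value differs between operating systems,
    # but it is nonzero because the child did not exit cleanly.
    print(f"child process crashed with status {status}")
    print("the parent process and the operating system are still running")
```

The operating system cleans up the crashed child and reports a nonzero status to the parent. Nothing else on the computer is affected.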
There are two main reasons for running software in kernel mode: performance and full access to the computer. Software such as display drivers (which power computer displays) runs in kernel mode for speed. Security software runs in kernel mode to get control of the computer at a deeper level.
This is where CrowdStrike comes in. CrowdStrike software runs in kernel mode because it needs more control over the computer than user mode software is allowed.
Why did CrowdStrike suddenly break down?
Because of the risk described above, Microsoft has a strict testing and certification process for kernel mode software (https://learn.microsoft.com/en-us/windows-hardware/design/compatibility/whcp-certification-process).
CrowdStrike’s problem was that they have to update their software frequently as new security threats are discovered. They couldn’t wait for the certification process each time. (I’m assuming this motivation based on their public information.)
So the solution they came up with was to keep the software the same but download an “instructions” file that their software ran. This, of course, means that kernel mode software can now do things that it was not tested or certified for. (There is no evidence I have seen that Microsoft was unaware that this was happening.)
On July 19th, 2024, CrowdStrike put a new “instructions” file (Channel File 291) on their server. All the computers running CrowdStrike software then downloaded this instructions file, and the CrowdStrike software running in kernel mode promptly tried to read it. The file had an error in it, and the CrowdStrike software did not handle the error properly.
Technically, the CrowdStrike software ended up accessing memory near address 0x00000000, which is not a valid location, so the access raised an error.
Since the CrowdStrike software did not handle this error and was running in kernel mode, the whole computer crashed.
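The kind of error handling that was missing can be sketched in a few lines. This is only an illustration under stated assumptions: the three-field, comma-separated format and the names Rule, parse_rules, and load_rules are invented for this example and are not CrowdStrike’s actual file format or code. The idea is to validate every record before using it, and to fall back to the last known good rules when a downloaded file is malformed.

```python
from dataclasses import dataclass

# Hypothetical format: each line is "name,pattern,action".
EXPECTED_FIELDS = 3


class InvalidRuleFile(Exception):
    """Raised when a downloaded rule file fails validation."""


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: str
    action: str


def parse_rules(text: str) -> list[Rule]:
    """Parse rule text, rejecting any line with missing or empty fields."""
    rules = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(",")]
        # Check the field count BEFORE reading individual fields, so a
        # short record can never cause an out-of-range access.
        if len(fields) != EXPECTED_FIELDS or not all(fields):
            raise InvalidRuleFile(
                f"line {line_no}: expected {EXPECTED_FIELDS} non-empty fields"
            )
        rules.append(Rule(*fields))
    return rules


def load_rules(text: str, last_known_good: list[Rule]) -> list[Rule]:
    """Return the new rules if they validate, otherwise keep the old ones."""
    try:
        return parse_rules(text)
    except InvalidRuleFile:
        return last_known_good
```

With this shape, a bad file costs you one missed update instead of a crashed computer: load_rules("r2,,", current_rules) simply returns current_rules.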
Who’s to blame?
CrowdStrike for missing adequate quality control on the instructions file release process and for lacking proper error handling in their kernel mode software installed on millions of computers.
Microsoft for allowing CrowdStrike to implement kernel mode software that effectively reads instructions from a downloaded file and for not providing a mechanism in Windows to allow security software to do its job without needing to run in kernel mode.
Human error is a part of any process, so we can assume human errors have happened before and will happen again. Our processes and our software architectures are supposed to protect us from human error. The failure here was in the processes and in the software design.
How Can Companies Prevent This In The Future?
Healthcare companies lost millions of dollars and this outage affected timely care for a large number of patients. How can companies (especially healthcare companies) protect themselves in the future from similar outages?
First, companies should establish procedures to review and minimize kernel mode software running on their computers. As mentioned above, errors in kernel mode software are much more serious than errors in user mode software. IT departments should uninstall any kernel mode software that is not needed to accomplish the tasks on the computer.
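One way to start such a review is to compare the drivers installed on each machine against an approved list. The sketch below assumes the CSV output of the driverquery command that ships with Windows, with a "Module Name" column as it appears on English-language systems. The allowlist contents and the function name unapproved_drivers are placeholders for illustration.

```python
import csv
import io

# Hypothetical allowlist. A real one would come out of the IT
# department's own review of what each machine needs.
APPROVED_DRIVERS = {"disk", "ndis", "tcpip"}


def unapproved_drivers(driverquery_csv: str, approved: set[str]) -> list[str]:
    """Return module names from `driverquery /fo csv` output that are
    not on the allowlist, sorted alphabetically."""
    reader = csv.DictReader(io.StringIO(driverquery_csv))
    approved_lower = {name.lower() for name in approved}
    return sorted(
        row["Module Name"]
        for row in reader
        if row["Module Name"].lower() not in approved_lower
    )
```

On a Windows machine, the input text could be captured with subprocess.run(["driverquery", "/fo", "csv"], capture_output=True, text=True).stdout and passed to unapproved_drivers along with APPROVED_DRIVERS. Anything the function returns is a candidate for review, not automatically a candidate for removal.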
Secondly, companies should put pressure on Microsoft to move security software that has to update frequently out of kernel mode. This will require Microsoft to provide the APIs that security software needs to do its job in user mode.
Thirdly, companies should have procedures in place for IT folks to boot computers into safe mode and disable the offending software. This way IT teams can respond more quickly to bring the systems back up.
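For this incident, the workaround CrowdStrike published was to boot into safe mode and delete the files matching C-00000291*.sys in the CrowdStrike folder under the Windows drivers directory. A recovery procedure can capture that kind of step in a small, rehearsed script. The sketch below is one hedged way to do it: it moves matching files into a quarantine folder instead of deleting them, so the step can be undone. The function name quarantine_files and the quarantine folder are my own assumptions, and in a real safe mode session the same step may have to be done by hand or with whatever scripting tools are available there.

```python
import shutil
from pathlib import Path


def quarantine_files(
    driver_dir: Path, pattern: str, quarantine_dir: Path
) -> list[Path]:
    """Move files in driver_dir that match pattern into quarantine_dir.

    Returns the new locations of the moved files. Files that do not
    match the pattern are left untouched.
    """
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(driver_dir.glob(pattern)):
        if path.is_file():
            target = quarantine_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

A call such as quarantine_files(Path(r"C:\Windows\System32\drivers\CrowdStrike"), "C-00000291*.sys", Path(r"C:\quarantine")) would cover the July 19th case. The useful part for a written procedure is that the pattern and folder are parameters, so the same rehearsed steps apply to the next offending file.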