CrowdStrike: Reboot to Stability

The recent CrowdStrike incident is a major one in computing history. It is the biggest outage the world has faced that was not caused by a cyber attack. To avoid repeating such an unnecessary and easily avoidable failure, we need to set the context right.

In the last decade, we have seen a major shift toward cloud services and away from building in-house infrastructure. The incentive is clear: building your own infrastructure requires a lot of capital expenditure. Cloud services, on the other hand, offer flexibility through subscription-based plans that scale with your business needs. This has pushed an overhaul of business strategies across the world.

The three major cloud providers, AWS (from Amazon), Azure (from Microsoft), and GCP (from Google), were and are the major beneficiaries of this shift. Though AWS is still the most popular choice, the service sector relies heavily on Microsoft Azure. Microsoft's image has improved considerably in the last decade, helped by its adoption of Linux-based operating systems in Azure. While Azure services are available with these more secure operating systems, many of its customers' deployments still rely heavily on Windows Server. Service sectors such as banks, airlines, railways, and news media are the most likely to be attracted by the idea of an easy-to-use operating system, and thus end up with subscriptions that include Microsoft Windows or Windows Server.

Now, this does not necessarily mean that Microsoft Windows and Windows Server are a bad lot, but it certainly shows how exposed they are. We have heard news in the past that Microsoft is warming up to Linux and is ready to adopt it at its core. But the sad part is that this adoption has remained confined to additional Azure services and Windows subsystems. A change at the kernel level, which is the heart and brain of an operating system, is something we haven’t seen and are unlikely to see. And we can’t blame Microsoft for not getting rid of its core product.

The other major factor I see, and one that certainly carries some accountability, is the pace at which release cycles are planned nowadays, together with the unplanned layoffs that companies readily adopt in the name of restructuring. In the current industry climate, this tends to get overlooked, especially when a bunch of nincompoops are constantly trying to justify every idea that comes out of the mouth of a head of department. Nobody is ready to do a thorough analysis of the impact on team dynamics. Instead, these decisions are based on nothing more than the company's financial data. To top that, going GA (general availability) frequently with short-handed staff is something that needs a deep dive and introspection, especially now that it is becoming a fashion in the IT industry.

This is important to understand because how a team works heavily shapes the quality of the software it ships. Speaking from experience, development and QA teams generally follow a dangerous practice of ignoring errors or bugs while developing or testing the software. The last-minute impact falls mostly on the documentation teams, who must either create a knowledge base (KB) article or get the messaging right for any user interface impact. A special breed of writers known as UX writers carefully analyzes the issue and guides the development team in wording the error messages correctly. This practice is easy, and it sometimes even helps the QA team test better, because the entire team can focus on finding the bugs with the greatest impact.

Unfortunately, in the last year CrowdStrike laid off a number of employees who actually dealt with such scenarios, in the name of a return-to-office (RTO) mandate. In other words, the employees affected were those who were working comfortably from remote locations and did not wish to return to the office. Many companies did the same, but since CrowdStrike is considered too big an organization to fail, the excuse becomes pardonable. Ironically, the impact isn’t.