How the Microsoft and CrowdStrike Failure Occurred and How to Prevent It from Happening Again

Many consumers likely didn't know the name CrowdStrike before July 19, 2024, but an incident that day made the cybersecurity firm the subject of intense public scrutiny. An update to CrowdStrike's software caused Windows computers at many businesses to crash and become entirely inoperable, affecting operations from Delta Air Lines to MrBeast's YouTube channel.

What allowed this failure to occur on such a massive scale? Cybersecurity expert Yashin Manraj, the founder of Pvotal, explains.

"CrowdStrike's recent software update debacle exposed a critical flaw in its operational approach and that of many similar pseudo-cybersecurity companies: a critical lack of automation, safeguards, and checks and balances in their release management systems," Manraj says. "Unlike established IT infrastructure providers like AWS, Google, and Microsoft, CrowdStrike has no robust automated management and rigorous testing procedures for code updates, which could have easily prevented one of the most significant disruptions to civilian infrastructure in the past decade."

Why did the CrowdStrike failure happen?

But how does the failure of a single piece of software cripple an entire operating system?

"CrowdStrike Falcon isn't a regular client application," Manraj explains. "It is designed to run as a top-level system privilege agent on most endpoints of a company that leverages it. Their Friday morning update causes a BSOD post-update, and then post-reboot, their driver reloads so early in the Windows initialization sequence that Microsoft can't fix it either. It requires IT operators, if possible, to connect physically to kiosks or workstations, unmount the disks, decrypt the drives if they still have the decryption keys, erase the faulty sys file, and reboot."

"While the fix itself is not complex or technically challenging," Manraj continues, "it does require a massive in-person workforce to access devices physically—something antithetical to the work-from-home philosophy and remote culture that many startups have been recently promoting."
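To make the file-removal step concrete, here is a minimal Python sketch of what an operator's helper script might look like once the affected disk is unlocked and mounted. The article does not name the faulty file; the default pattern and driver folder below are assumptions taken from the vendor's publicly circulated remediation guidance, and the function names are hypothetical. The sketch defaults to a dry run and is an illustration, not an official recovery tool.

```python
from pathlib import Path

# Assumption: pattern and folder follow the vendor's public remediation
# guidance, not this article. Verify both before using anything like this.
DEFAULT_PATTERN = "C-00000291*.sys"
DRIVER_SUBDIR = Path("Windows") / "System32" / "drivers" / "CrowdStrike"


def find_faulty_files(mount_root, pattern=DEFAULT_PATTERN):
    """Return matching files under the mounted Windows volume, sorted by path."""
    driver_dir = Path(mount_root) / DRIVER_SUBDIR
    if not driver_dir.is_dir():
        return []
    return sorted(p for p in driver_dir.glob(pattern) if p.is_file())


def remove_faulty_files(mount_root, pattern=DEFAULT_PATTERN, dry_run=True):
    """List matching files and delete them only when dry_run is False."""
    matches = find_faulty_files(mount_root, pattern)
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

Calling remove_faulty_files with the mount point of the affected volume reports what would be deleted; passing dry_run=False performs the deletion. The script is the easy part. As Manraj notes, the costly part is getting physical access to each device and having its decryption keys on hand.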

Another factor may be CrowdStrike's sheer ubiquity. Reports have found that 6,000 or more companies trust CrowdStrike for their endpoint security needs. Given the wide variety of sectors that CrowdStrike serves, it's no wonder that even everyday consumers felt the impact of this massive outage through delayed or canceled flights, transaction failures at banks, and even obstacles to retail commerce.

"CrowdStrike offers a technologically sound security solution for all clients' endpoints, and their marketing success has fostered an impression of ubiquitousness and unparalleled security, where leading companies have adopted their systems without fully understanding the potential risks," explains Manraj. "The illusion of security without awareness of the potential pitfalls of embedding CrowdStrike's various products highlights another internal failure from all its customers for not vetting a third-party solution they integrated into their product line. This highlights the danger of equating commercial success with operational competence."

What are the implications and impacts of the CrowdStrike failure?

The CrowdStrike failure had far-reaching effects. While the worst consequences came on the day of the failure, reverberations were felt in the days and weeks that followed. One report found that, by July 25, only 18% of organizations had fully recovered their systems, while 25% said their systems still needed attention.

"This incident is a stark reminder that applications running with high privileges at the core of the operating system, often touted for their security benefits, require stringent security procedures similar to those employed by OS or cloud providers like Microsoft, Google, Apple, or open-source Linux," explains Manraj. "Companies must diversify their OS choices and implement gradual rollouts for critical systems and employee devices to prevent widespread infrastructure lockouts."

How can businesses be prepared for another incident like the CrowdStrike failure?

"This incident underscores the need for greater awareness and caution in cybersecurity practices," Manraj says, explaining how future incidents like this can be prevented. "Companies must prioritize security over marketing hype and embrace a more robust, diversified, and responsible approach to IT infrastructure management."

Manraj further suggests three significant shifts that need to occur for the industry to become more resilient to these types of crises:

  • Organizational and operational changes: "Companies need to re-evaluate their IT infrastructure, understand the risks of depending on a single vendor, and prioritize vendor due diligence for those running with local system privileges," he asserts. "They need to protect different vendors' critical infrastructure with secret management, such as disk encryption keys and certificates, and engineers to allow, in a worst-case scenario, the recovery of the rest of their physical infrastructure."
  • Workforce education: "Training programs are crucial to educating employees, particularly decision-makers, on the pitfalls of hasty decisions that may compromise the long-term security of other critical infrastructure maintainers such as banks, telecommunications, electricity, and water providers," he explains.
  • Infrastructural diversification: "To prevent future outages, it is essential to implement diverse operating systems and gradual rollout strategies for critical systems and employee devices," he concludes.
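
The gradual rollout strategy Manraj recommends can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how CrowdStrike or any other vendor ships updates: the Ring class, the staged_rollout function, and the deploy and is_healthy callbacks are all hypothetical names that an organization would replace with its own deployment and monitoring tooling.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    """A named group of hosts that receives an update at the same time."""
    name: str
    hosts: list


def staged_rollout(rings, deploy, is_healthy, max_failure_rate=0.0):
    """Deploy ring by ring and halt at the first ring that looks unhealthy.

    Returns (completed_ring_names, halted_ring_name_or_None).
    """
    completed = []
    for ring in rings:
        for host in ring.hosts:
            deploy(host)
        # Check every host in the ring before touching the next ring.
        failures = sum(1 for host in ring.hosts if not is_healthy(host))
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            return completed, ring.name
        completed.append(ring.name)
    return completed, None
```

With rings ordered from a small canary group to the full fleet, a faulty update that crashes hosts in an early ring stops there, and the remaining machines never receive it. The design choice that matters is the health check between rings: without it, a staged rollout is only a slower version of updating everything at once.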

Ed Watal, another cybersecurity expert and the founder and principal of Intellibus, an IT consultancy that assists businesses in adopting advanced technologies like AI, big data, and the cloud, offers another proactive approach to preventing a failure like this from having catastrophic consequences again.

"Threat platforms like CrowdStrike clearly are secure as they are able to use their AI-powered capabilities to manage the threat vectors; however, the agents are still a piece of software and susceptible to software bugs and instability leading to outages," Watal explains. "Organizations can prevent such outages by focusing on controls around how Open Telemetry and XDR agents are released into production environments. Stability & NFRs Assessments like the ones done by Intellibus can also help identify such risks (of missing change controls, untested agents being deployed) ahead of time and predict and potentially prevent such outages."

"Any company that collects data from another business must acknowledge the growing risk and protect its data—and its reputation—by ensuring that effective controls are in place," adds Rob Scott, an attorney and Chief Innovator at monjur, a contracts-as-a-service (CaaS) platform for IT managed services and software providers. "Data privacy agreements and the cybersecurity provisions they outline are critical components of those controls."

Indeed, safety is one of the most important things a business can practice while integrating advanced technology like CrowdStrike into its operations. "While advancements are undoubtedly exciting, companies must prioritize safety throughout their implementation," explains Marcelo Barros, the Cybersecurity Leader and Global Markets Leader at Hacker Rangers, a gamification platform designed to increase engagement in cybersecurity topics. "Dedicating cybersecurity professionals to the process is just the first step. It is important to not lose sight of protecting what matters most at the end of the day: their systems and data."

If businesses are to take one thing away from the disruptive consequences of this CrowdStrike outage, it is that cybersecurity knowledge needs much greater emphasis. Although the convenience of a solution like CrowdStrike can be tempting, a single platform holding such a large market share can have devastating consequences when it fails.

By becoming more knowledgeable about cybersecurity and diversifying their cybersecurity solution providers, businesses can be better prepared to avoid and endure fallout like this in the future.

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.