On July 19, 2024, a faulty update to CrowdStrike’s Falcon endpoint protection platform crashed Windows machines around the world, including many virtual machines running in Microsoft Azure. The incident caused a global IT outage that affected businesses and users worldwide.
Reason Behind the Outage
The root cause of the outage was a faulty content update to CrowdStrike’s Falcon sensor for Windows, not a failure in Microsoft’s own services. The update caused the sensor’s kernel-level driver to crash, which took Windows down with it.
Falcon runs as a kernel-mode driver so that it can inspect system activity at a low level. On the day of the incident, CrowdStrike pushed a routine configuration update, known as a channel file, to Windows hosts running the sensor. The file triggered an out-of-bounds memory read in the driver. Because the fault occurred in kernel mode, each affected machine crashed to the Blue Screen of Death (BSOD), and many then entered a reboot loop.
Microsoft’s name became attached to the outage because the affected machines ran Windows and many of them were virtual machines hosted in Azure. Mac and Linux hosts were not affected. CrowdStrike reverted the update within hours, but many machines that had already crashed could not boot far enough to receive the fix and had to be repaired manually.
Impact and Statistics
The scale and scope of the outage were massive, affecting multiple sectors and millions of users globally.
- Scale of Impact: Microsoft estimated that the faulty update affected about 8.5 million Windows devices, less than one percent of all Windows machines. Because CrowdStrike’s customers include a large share of Fortune 500 companies, the disruption reached far more people than that number suggests, from small businesses to large enterprises across many industries.
- Business Disruption: Major corporations, including financial institutions, healthcare providers, and government agencies, experienced severe disruptions. Critical operations were delayed, causing a ripple effect of issues:
  - Financial Institutions: Banks and financial services could not process transactions, leading to delays and financial losses. ATM services were disrupted, and online banking platforms became inaccessible.
  - Healthcare Providers: Hospitals and clinics faced challenges in accessing patient records and critical systems, potentially delaying medical treatment and care.
  - Government Agencies: Various government services were halted, including public administration and essential services, causing significant inconvenience to citizens.
- Economic Consequences: Initial estimates suggested that the outage caused a cumulative economic loss exceeding $1 billion. Businesses reported lost revenue, increased operational costs, and recovery expenses.
Compliance and Security Risks
The outage also highlighted vulnerabilities in compliance and security protocols. Organizations dependent on continuous uptime faced challenges in meeting regulatory requirements, potentially leading to fines and legal repercussions.
On Friday, July 19, ATSG’s monitoring and support team observed multiple Virtual Machine (VM) reboots within the ATSG Cloud. On review, we found the affected machines displaying the Blue Screen of Death (BSOD). Once the incident was declared, the Major Incident Manager (MIM) on shift followed established procedures to open a bridge, assemble the appropriate team members, and inform clients.
We reviewed logs and information received about the Microsoft outage, quickly identifying a CrowdStrike update as the cause. All members within the Cloud and TOC groups were activated regardless of work shift schedule, and we began restoration services for our clients.
Although ATSG’s Cloud does not use CrowdStrike on its own platforms, some of our customers were affected by the CrowdStrike bug. The ATSG Enterprise Service Desk and Technical Operations team worked with these end users and IT teams to remediate the issue immediately.
For those customers on ATSG’s DaaS platform that were affected, our teams utilized multiple methods to remotely resolve issues and get users operational quickly, as DaaS is centrally managed and easy to access and troubleshoot.
For customers not on ATSG’s DaaS platform, we had to connect to physical machines individually and troubleshoot each one. Where machines were remote and unable to boot, we had to rely on end users to troubleshoot and make changes.
Process Followed to Troubleshoot:
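- Followed CrowdStrike’s published workaround, which involved booting each affected machine into Safe Mode or the Windows Recovery Environment, deleting the faulty channel file matching C-00000291*.sys from %WINDIR%\System32\drivers\CrowdStrike, and rebooting
- Restored from snapshots any VDIs whose data was stored on a central file server (ATSG DaaS customers only)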
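The file-removal step above can be sketched in Python. This is an illustration only, not ATSG’s or CrowdStrike’s tooling: the function name and the `dry_run` flag are hypothetical, and the only detail taken from CrowdStrike’s public guidance is the `C-00000291*.sys` file pattern. The sketch defaults to a dry run so that nothing is deleted until the matches have been reviewed.

```python
from pathlib import Path

# File pattern from CrowdStrike's public remediation guidance for this incident.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir: Path, dry_run: bool = True) -> list[Path]:
    """Find channel files matching the faulty pattern in driver_dir.

    Hypothetical helper: returns the matching paths, and deletes them
    only when dry_run is False.
    """
    matches = sorted(driver_dir.glob(CHANNEL_FILE_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()  # remove the faulty file; other channel files stay
    return matches
```

On an affected host the directory argument would be the CrowdStrike driver folder, for example `Path(r"C:\Windows\System32\drivers\CrowdStrike")`, and the script would have to run from Safe Mode or a recovery environment, since the machine cannot boot normally.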
While we were tending to our ATSG Cloud customers, several of our Managed Operations customers with on-premises and Azure systems were also affected. The ATSG CX/SDM team sent communications to all our clients documenting the problem and offering our full support. Each customer praised ATSG for its quick response, clear communication, and unwavering commitment.
Why ATSG’s Solution is Different
Benefits of Using a Hybrid Cloud During an Outage: At ATSG, our hybrid cloud solution offers a significant advantage during outages by ensuring redundancy and flexibility. We can seamlessly shift workloads to maintain service continuity, minimizing downtime and disruption for our clients. This hybrid approach allows for more robust disaster recovery options and enhances our ability to manage and mitigate risks associated with outages.
Enhanced Network Monitoring and Management:
Our comprehensive network monitoring and management systems provide unparalleled visibility and alerting capabilities. This allows our team to remotely manage, escalate, and centralize issues promptly. During the recent IT outage on July 19th, our monitoring tools detected multiple Virtual Machine (VM) reboots within the ATSG Cloud. We swiftly reviewed logs and correlated them with the Microsoft outage information to pinpoint the root cause. This rapid identification and response demonstrate our ability to manage incidents efficiently, ensuring minimal impact on our clients.
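As a rough illustration of the kind of alerting described above, the following Python sketch flags a burst of VM reboots inside a short time window. It is not ATSG’s monitoring tooling; the function name, the window length, and the threshold are all hypothetical choices made for the example.

```python
from datetime import datetime, timedelta


def reboot_burst_detected(
    reboot_times: list[datetime], window: timedelta, threshold: int
) -> bool:
    """Return True if at least `threshold` reboots fall within any
    time span of length `window` (hypothetical alert rule)."""
    times = sorted(reboot_times)
    start = 0
    for end, current in enumerate(times):
        # Shrink the window from the left until it spans at most `window`.
        while current - times[start] > window:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False
```

A single VM rebooting is routine, so the rule only fires when many reboots cluster together, which is the pattern that suggests a shared cause such as a bad update.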
Strategic Patching Process:
ATSG takes a thoughtful and strategic approach to patching. We implement updates in carefully controlled test and staging groups before rolling them out broadly. This methodical process ensures that updates are thoroughly vetted and tested, reducing the risk of introducing new issues into production environments. Our clients benefit from this meticulous approach, experiencing fewer disruptions and greater stability in their operations.
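The staged approach described above can be sketched as a ring-based rollout that stops as soon as any ring reports a failure. This is a minimal illustration under assumed names, not ATSG’s actual patching system: the ring names, host names, and the `apply_patch` callback are hypothetical.

```python
from typing import Callable, Optional, Sequence


def staged_rollout(
    rings: Sequence[tuple[str, Sequence[str]]],
    apply_patch: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Patch rings in order and halt at the first ring with a failure.

    Returns the hosts patched successfully and the name of the ring
    that halted the rollout (None if every ring succeeded).
    """
    patched: list[str] = []
    for ring_name, hosts in rings:
        results = [(host, apply_patch(host)) for host in hosts]
        patched.extend(host for host, ok in results if ok)
        if not all(ok for _, ok in results):
            return patched, ring_name  # later rings are never touched
    return patched, None


if __name__ == "__main__":
    rings = [
        ("test", ["test-01"]),
        ("staging", ["stage-01", "stage-02"]),
        ("production", ["prod-01", "prod-02"]),
    ]
    # Simulate a patch that breaks one staging host.
    print(staged_rollout(rings, lambda host: host != "stage-02"))
```

Because the rollout halts at the failing ring, a bad update in this example never reaches the production hosts.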
Comprehensive Security Across All Layers:
Security is a paramount concern at ATSG, and our cloud product delivers robust protection at the network, machine, and application layers. By integrating advanced security measures across these layers, we provide comprehensive defense against a wide range of threats. This layered security approach underscores our commitment to safeguarding our clients’ data and infrastructure.
Resolution Status
CrowdStrike and Microsoft acted swiftly to address the crisis. CrowdStrike identified the faulty update and reverted it within hours on July 19, 2024, which stopped additional machines from crashing. Restoring machines that had already crashed took longer, because many needed hands-on remediation, and recovery continued over the following days.
Here’s a detailed timeline of the resolution efforts:
- Initial Response: CrowdStrike mobilized its incident response team shortly after the crashes began. The primary focus was on identifying the root cause and mitigating immediate impacts.
- Diagnosis and Isolation: Engineers traced the crashes to the faulty channel file and reverted the update within hours, which prevented further machines from being affected.
- Workaround and Recovery Tools: CrowdStrike published manual steps for removing the faulty file, and Microsoft released a recovery tool to help IT teams carry out the repair.
- Restoration: Organizations repaired affected machines in phases, prioritizing critical systems and high-impact areas. Many services were back on July 19, though full recovery took days for organizations with large numbers of affected machines.
- Post-Incident Analysis: CrowdStrike published a post-incident review and committed to additional safeguards, including more thorough testing of content updates and staggered deployment, to prevent similar incidents in the future.
The CrowdStrike outage highlighted the critical importance of robust IT infrastructure and responsive support in mitigating disruptions. ATSG’s hybrid cloud solution, advanced monitoring, strategic patching, and comprehensive security measures support swift recovery during incidents and provide ongoing stability and protection for our clients.