Building business resilience

In today's rapidly changing and increasingly digital world, organizations face a myriad of disruptions, from cyber-attacks and natural disasters to economic volatility and pandemics. The recent CrowdStrike outage, which impacted 8.5 million devices, underscores the critical importance of organizational resilience. This incident serves as a powerful reminder that building a resilient organization is not just about surviving disruptions but thriving amidst them.

CrowdStrike outage: a real-world case study

On July 19, 2024, a faulty update to the CrowdStrike Falcon software (one of the most widely used cybersecurity solutions worldwide) caused a significant global outage, highlighting the challenges IT organizations face in balancing cybersecurity with operational stability. This incident impacted approximately 8.5 million Microsoft Windows systems using this solution. Linux and macOS systems were not affected.

The most visible consequences were in the aviation sector, affecting airlines, airports, and travelers. On July 19, over three thousand flights were canceled, either because of technical problems or because of delays cascading from other flights, which caused disruptions even at airports not directly affected by IT issues. According to data provided by Cirium, a company specializing in aviation data analysis, the most affected airports were Shenzhen in China, Amsterdam in the Netherlands, and Atlanta in the USA.

Disruptions also occurred in other sectors, including banks, retail chains (Australia and the United Kingdom), courts (USA), stock markets, and health services (United Kingdom and Canada). The issues began in Asia and Europe and then spread to North America, following the progression of time zones.

What happened: the sequence of events

Origin of the Problem: The issue arose on Friday, July 19, at 4:09 UTC, when an update to the Falcon agent (channel file 291) introduced logical errors that led to crashes in Windows systems, displaying the dreaded BSOD (Blue Screen of Death).
Purpose of the Update: The update aimed to protect Windows systems from certain types of attacks used by command & control (C2) frameworks during cyberattacks.
Temporal Impacts: Windows systems that downloaded the update between 4:09 and 5:27 UTC were affected. Systems that received the update after 5:27 UTC installed a stable, corrected version and were not affected.
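
As a rough illustration of the timeline above, the following Python sketch triages a device inventory by checking whether each host's last content update fell inside the 4:09 to 5:27 UTC window. The inventory, the host names, and the choice to treat the window as inclusive at its start and exclusive at its end are assumptions made for this example; it is not a CrowdStrike tool or API.

```python
from datetime import datetime, timezone

# Window taken from the timeline above (July 19, 2024, UTC).
FAULTY_FROM = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
FIXED_FROM = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


def likely_affected(update_time: datetime, is_windows: bool = True) -> bool:
    """Return True if a host probably received the faulty update.

    update_time must be timezone-aware, otherwise the comparison raises
    a TypeError. The boundary handling is an assumption of this sketch.
    """
    if not is_windows:
        return False  # Linux and macOS systems were not affected
    return FAULTY_FROM <= update_time < FIXED_FROM


# Hypothetical inventory: host name -> (last update time, runs Windows?)
inventory = {
    "hq-laptop-01": (datetime(2024, 7, 19, 4, 30, tzinfo=timezone.utc), True),
    "hq-laptop-02": (datetime(2024, 7, 19, 6, 0, tzinfo=timezone.utc), True),
    "build-srv-01": (datetime(2024, 7, 19, 4, 45, tzinfo=timezone.utc), False),
}

to_check = sorted(
    host for host, (updated, windows) in inventory.items()
    if likely_affected(updated, windows)
)
print(to_check)
```

In this invented inventory only the first laptop needs manual attention: the second one updated after the fix was published, and the build server does not run Windows.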

Impacts on Windows systems

Given the privileged role of the CrowdStrike agent within the Windows kernel, the fix required manual intervention on each affected endpoint:

Laptops/PCs:
- The initial resolution involved repeatedly restarting the Windows system to attempt an automatic resolution of the problem.
- If restarts did not resolve the issue, it was necessary to boot the computer in safe mode and manually delete the files causing the disruption.
- For companies that chose to encrypt the hard drives of end users, the recovery procedure was more complex.
Cloud Hosts:
- The resolution included a "rollback" to a system snapshot prior to 4:09 UTC or disconnecting the system disk volume, manually fixing the problem, and then reconnecting the volume.

Global consequences

The incident had significant repercussions on a global scale, affecting a wide range of users and organizations that use CrowdStrike Falcon software on Windows systems. This incident highlighted the importance of operational continuity in today's digital world. Companies rely on their IT and OT systems to function effectively, and any interruption can lead to productivity losses, economic losses, and loss of customer trust. Maintaining operational continuity requires robust systems and comprehensive business continuity plans.

Key lessons

The CrowdStrike outage offers several key lessons for businesses looking to bolster their resilience:

Diversification of Cybersecurity Providers: Businesses should consider diversifying their cybersecurity solutions to avoid over-reliance on a single provider. This can help mitigate risks if one vendor experiences an outage or a breach.
Regular Stress Testing and Drills: Organizations must conduct regular stress tests and drills to prepare for unexpected disruptions, including those affecting third-party services. By simulating various scenarios, companies can identify vulnerabilities in their response plans and improve their readiness.
Enhanced Vendor Management: Effective vendor management is critical for resilience. Companies should establish clear communication channels and response protocols with their service providers to ensure swift action during outages or breaches.
Building Redundancy: Having redundant systems and backups can help ensure business continuity in the event of a primary system failure. This includes not just technical redundancy but also having alternative vendors or in-house solutions that can be quickly deployed.

Key elements of a resilience program

By examining the CrowdStrike outage and its aftermath, businesses can gain valuable insights into the importance of building robust cybersecurity strategies and broader resilience frameworks that account for the complexities of modern digital ecosystems. The incident underscores the need for a proactive approach to resilience, emphasizing preparedness, agility, and continuous improvement.

Reply’s approach to managing disruption is based on years of experience helping both our own company and our clients develop robust incident management, IT and business continuity, and crisis management plans. Drawing on this experience, we have created a methodology for building strong resilience programs. Organizational resilience does not replace existing risk management processes, such as business continuity or operational risk management. Instead, it builds upon these frameworks to deliver clear and tangible value to the business while minimizing potential harm to our customers.

1. Assessment and planning

Critical Processes and Resources Mapping

Begin by mapping the current state of critical business processes, services, technology, facilities, and people, including third-party services, together with their dependencies. Collect the RTO and RPO of each process and service, and define its impact tolerance.

Mapping critical processes and systems and knowing the Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) for each process and service is a fundamental component of a resilience program because it allows organizations to identify and prioritize the most essential functions that are vital for their operation and survival. Defining impact tolerance involves determining the maximum level of disruption that a process or service can withstand before causing significant harm to the organization. This information is vital for setting realistic recovery goals and thresholds, ensuring that resilience strategies are not only comprehensive but also pragmatic and aligned with the organization's risk appetite.

By understanding which processes and systems are critical and their impact tolerance, businesses can develop targeted strategies to protect these assets from disruptions, whether from cyberattacks, natural disasters, or other unforeseen events. This mapping enables organizations to allocate resources more effectively, ensure continuity of operations under stress, and quickly recover from any interruptions. Furthermore, it helps in uncovering dependencies and interdependencies within and outside the organization, allowing for a more comprehensive and proactive approach to risk management and resilience planning.
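
One way to make such a mapping checkable is to record each service with its recovery targets and dependencies, then look for targets that cannot hold as stated. The Python sketch below is a minimal illustration under assumed data: the `Service` fields, the service names, and the hour values are hypothetical and do not come from any specific framework.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    rto_hours: float               # Recovery Time Objective
    rpo_hours: float               # Recovery Point Objective
    impact_tolerance_hours: float  # maximum tolerable disruption
    depends_on: list[str] = field(default_factory=list)


def find_inconsistencies(services: dict[str, Service]) -> list[str]:
    """Flag services whose recovery targets cannot hold as stated."""
    issues = []
    for svc in services.values():
        if svc.rto_hours > svc.impact_tolerance_hours:
            issues.append(f"{svc.name}: RTO exceeds impact tolerance")
        for dep_name in svc.depends_on:
            dep = services.get(dep_name)
            if dep is None:
                issues.append(f"{svc.name}: dependency {dep_name} is not mapped")
            elif dep.rto_hours > svc.rto_hours:
                issues.append(f"{svc.name}: depends on {dep.name}, which recovers later")
    return issues


# Hypothetical services and values, for illustration only.
services = {
    "payments": Service("payments", 2, 0.25, 4, ["core-db", "sms-gateway"]),
    "core-db": Service("core-db", 4, 0.25, 8),
    "intranet": Service("intranet", 24, 24, 48),
}

issues = find_inconsistencies(services)
for issue in issues:
    print(issue)

# Services with the tightest impact tolerance are recovered first.
recovery_order = [
    s.name for s in sorted(services.values(), key=lambda s: s.impact_tolerance_hours)
]
print(recovery_order)
```

Here the payments service promises recovery in two hours but relies on a database that is only expected back after four, and on a third-party gateway that was never mapped. Both findings are the kind of hidden dependency the mapping exercise is meant to surface.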

Disruption Scenario (Risk) Analysis and Prioritization

Next, map and analyze the risk of potential threats, such as natural disasters, cyberattacks, system faults, or failures of critical third parties, that could impact critical business processes and services. This step involves understanding the scenarios that could disrupt operations and their potential impact, and testing the organization's ability to remain within its impact tolerance in the event of a severe but plausible event.

Performing a threat scenario risk analysis is crucial in a resilience program because it helps organizations anticipate and prepare for a wide range of potential disruptions and prioritize remediation. By systematically identifying and evaluating different threat scenarios, organizations can assess the likelihood and impact of these events on their critical operations. This analysis enables organizations to prioritize risks, allocate resources effectively, and develop targeted mitigation and response strategies. Moreover, understanding various threat scenarios ensures that resilience plans are comprehensive and adaptable, enhancing the organization's ability to withstand and recover from unexpected disruptions while minimizing operational, financial, and reputational damage.
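
A common way to rank scenarios is a simple likelihood times impact score, combined with a check of whether the expected outage would exceed the impact tolerance. The sketch below assumes 1 to 5 scales and invented scenario values; real programs usually calibrate these scales with the business and may weight them differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int               # 1 (rare) to 5 (almost certain), assumed scale
    impact: int                   # 1 (negligible) to 5 (severe), assumed scale
    expected_outage_hours: float  # estimate for a severe but plausible event


def risk_score(scenario: Scenario) -> int:
    """Classic likelihood x impact score; higher means more urgent."""
    return scenario.likelihood * scenario.impact


def prioritize(scenarios: list[Scenario]) -> list[Scenario]:
    """Order scenarios from highest to lowest risk score."""
    return sorted(scenarios, key=risk_score, reverse=True)


def breaches_tolerance(scenario: Scenario, impact_tolerance_hours: float) -> bool:
    """True if the expected outage exceeds the stated impact tolerance."""
    return scenario.expected_outage_hours > impact_tolerance_hours


# Hypothetical scenarios and estimates, for illustration only.
scenarios = [
    Scenario("Regional power loss", likelihood=2, impact=4, expected_outage_hours=6),
    Scenario("Faulty third-party update", likelihood=3, impact=5, expected_outage_hours=12),
    Scenario("Ransomware", likelihood=2, impact=5, expected_outage_hours=72),
]

ranked = prioritize(scenarios)
for s in ranked:
    flag = "exceeds tolerance" if breaches_tolerance(s, 8) else "within tolerance"
    print(f"{risk_score(s):>2}  {s.name}: {flag}")
```

With an assumed impact tolerance of eight hours, the faulty update and ransomware scenarios both break tolerance and rank highest, so they would be remediated first.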

Resilience Measures Gap Analysis

Identify the existing resilience measures, both technical (IT/OT/IoT) and organizational, and perform a gap analysis against a resilience control framework based on industry guidelines and best practices.

Performing a resilience measures gap analysis based on a resilience control framework is vital for a robust resilience program because it allows organizations to systematically identify and address weaknesses in their current preparedness and response strategies. By evaluating existing measures against a structured framework, such as ISO 22301 or the NIST Cybersecurity Framework, organizations can pinpoint specific areas where controls are lacking or insufficient. This targeted approach not only ensures compliance with best practices and standards but also helps prioritize investments and actions needed to strengthen overall resilience. Ultimately, a gap analysis provides a clear roadmap for enhancing an organization's ability to withstand disruptions, maintain critical operations, and recover swiftly, thus safeguarding its long-term sustainability and reputation.
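
At its simplest, a gap analysis compares a list of required controls with the current state of each one. The following sketch shows that comparison in Python. The control identifiers and descriptions are invented for the example and are not quoted from ISO 22301 or the NIST Cybersecurity Framework.

```python
# Hypothetical control catalogue. The identifiers and wording are invented
# for this example and are not quoted from any published framework.
REQUIRED_CONTROLS = {
    "BC-01": "Business impact analysis is documented",
    "BC-02": "Recovery plans exist for critical services",
    "BC-03": "Continuity exercises run at least yearly",
    "TP-01": "Critical suppliers have tested exit plans",
}

# Self-assessed state: "full" or "partial"; an absent control is missing.
implemented = {"BC-01": "full", "BC-02": "partial"}


def gap_analysis(required: dict[str, str], current: dict[str, str]) -> dict[str, str]:
    """Return every required control that is not fully implemented."""
    gaps = {}
    for control_id in required:
        status = current.get(control_id, "missing")
        if status != "full":
            gaps[control_id] = status
    return gaps


def coverage(required: dict[str, str], current: dict[str, str]) -> float:
    """Share of required controls that are fully implemented."""
    if not required:
        return 1.0  # nothing is required, so nothing is missing
    full = sum(1 for control_id in required if current.get(control_id) == "full")
    return full / len(required)


gaps = gap_analysis(REQUIRED_CONTROLS, implemented)
for control_id, status in gaps.items():
    print(f"{control_id} ({status}): {REQUIRED_CONTROLS[control_id]}")
print(f"Coverage: {coverage(REQUIRED_CONTROLS, implemented):.0%}")
```

The resulting list of partial and missing controls is the raw material for the roadmap: each gap can then be weighed against the risk scenarios it would mitigate.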

2. Developing the resilience plan

Define Resilience Objectives

Establish clear objectives for the resilience program, aligned with the organization’s strategic goals.

Defining resilience objectives is a crucial step in any resilience program because it sets clear, measurable goals that guide all subsequent planning and actions. These objectives outline what the organization aims to achieve in terms of maintaining operations, safeguarding assets, and minimizing impact during disruptions. By establishing specific resilience objectives, businesses can align their resources and efforts towards a common purpose, ensuring that every aspect of their resilience strategy—from risk assessments to response plans—is focused on achieving these targets. Clear objectives also enable organizations to measure progress and performance, making it easier to identify gaps, refine strategies, and continuously improve resilience capabilities. In essence, well-defined resilience objectives provide direction and purpose, ensuring that the organization is prepared not only to survive disruptions but to thrive in their aftermath.

Define Resilience Organizational Structure

Establish an organizational structure, defining roles and responsibilities for resilience planning and response. Defining a sound organization is fundamental to an effective resilience program because it establishes a dedicated structure and clear roles responsible for overseeing and executing resilience strategies across the enterprise.

By creating a specific team or appointing resilience officers, organizations ensure that there is accountability and ownership for resilience efforts, which promotes a proactive and holistic approach to risk management. This structure enables streamlined decision-making and communication during crises, ensuring swift, coordinated responses that minimize downtime and impact. Furthermore, a defined resilience organization fosters a culture of preparedness and agility, integrating resilience into the organization's daily operations and strategic planning and moving beyond a siloed approach. In essence, it ensures that resilience is not just a reactive response but an embedded capability that enhances the organization’s ability to anticipate, absorb, and recover from disruptions.

Resilience Plan and Procedures Creation

Develop a comprehensive resilience plan for each disruption scenario, including the resilience requirements needed to prepare systems and processes for recovery after a disruption. Define incident response, business continuity, and communication strategies. Ensure all procedures are well-documented, communicated, and regularly reviewed.

Creating plans and procedures is a cornerstone of a robust resilience program, as it provides a detailed roadmap for responding to disruptions and maintaining critical operations. Well-defined plans outline specific actions to take before, during, and after an incident, ensuring that all team members understand their roles and responsibilities. This preparedness reduces confusion and delays in a crisis, enabling a more efficient and coordinated response. Procedures also help standardize responses to various scenarios, from natural disasters to cyberattacks, ensuring that the organization can adapt to different threats effectively. By having comprehensive plans and procedures in place, organizations enhance their ability to protect their people, assets, and reputation while minimizing downtime and financial loss, ultimately contributing to their long-term sustainability and resilience.

Resource Allocation

Allocate the necessary resources, including budget, personnel, and technology, to implement and sustain the resilience initiatives. Investment in skills, technology, and infrastructure is crucial.

Resource allocation is critical in a resilience program because it ensures that the necessary financial, human, and technological resources are available to support resilience efforts effectively. Proper allocation allows organizations to invest in the right tools, training, and infrastructure needed to anticipate, respond to, and recover from disruptions. By strategically distributing resources, businesses can prioritize their most critical processes and systems, ensuring they remain operational during crises. Effective resource allocation also helps in building redundancies and backups, which are vital for minimizing downtime and loss. Ultimately, thoughtful resource allocation not only enhances the organization’s ability to withstand adverse events but also optimizes cost efficiency, ensuring resilience measures are sustainable over the long term.

Third-Party Involvement

Involve third parties such as suppliers and partners in the resilience planning process to ensure a holistic approach.

Involving third parties is essential in a resilience program because many organizations rely heavily on external partners, suppliers, and service providers to maintain their operations. By integrating third parties into resilience planning, organizations can ensure that these external entities are also prepared to handle disruptions, which minimizes the risk of cascading failures across the supply chain. Collaboration with third parties allows for the alignment of resilience objectives and response strategies, fostering a coordinated approach to managing risks and incidents. Additionally, involving third parties in resilience exercises and drills helps identify potential vulnerabilities and interdependencies, allowing for proactive mitigation measures. Overall, the inclusion of third parties in resilience efforts strengthens the entire ecosystem, enhancing the organization's ability to maintain continuity and recover swiftly from any disruptions.

3. Implementation

Training and Awareness

Implement training programs to ensure all employees understand their roles in maintaining resilience. Training should cover emergency response, crisis management, and the use of resilience technologies.

Training and awareness are vital components of a resilience program because they prepare employees to respond effectively during a crisis. Regular training ensures that all staff members understand their roles, the procedures to follow, and the tools at their disposal, fostering a culture of preparedness and confidence.

Technology Integration and Architectural Improvements

Implement technologies for enhanced resilience. Design and deploy robust system architectures.

Technology integration and architectural improvements are crucial in a resilience program because they ensure that an organization’s infrastructure is robust, adaptable, and capable of withstanding disruptions. By integrating advanced technologies such as cloud computing, automation, and cybersecurity solutions, organizations can enhance their ability to detect, respond to, and recover from incidents more effectively. Architectural improvements, such as designing systems with built-in redundancies and failovers, ensure that critical operations can continue uninterrupted even if part of the infrastructure fails. Together, these measures help create a more flexible and responsive IT environment that not only minimizes downtime and data loss but also supports quick adaptation to evolving threats and challenges, thereby strengthening overall organizational resilience.

4. Testing and validation

Regular Drills and Simulations

Conduct regular IT and business continuity drills to test the effectiveness of the plans. Use simulations to identify potential weaknesses and areas for improvement.

Regular drills and simulations are essential in a resilience program because they provide a realistic and controlled environment for testing the effectiveness of emergency plans and procedures. By regularly practicing responses to various disruption scenarios, such as cyberattacks, natural disasters, or operational failures, organizations can identify weaknesses in their preparedness and recovery strategies and make necessary adjustments before a real crisis occurs. These exercises also help build muscle memory among employees, ensuring that they can act quickly and effectively under pressure. Furthermore, drills foster a culture of awareness and continuous improvement, encouraging teams to stay vigilant and adapt to emerging threats. Ultimately, regular drills and simulations enhance an organization’s ability to respond swiftly and cohesively to unexpected events, minimizing potential impacts and accelerating recovery efforts.

Feedback Loops

Establish feedback mechanisms to capture lessons learned from drills, simulations, and real-world incidents. Use this feedback to continuously improve the resilience framework.

Establishing feedback mechanisms is crucial in a resilience program because it enables continuous improvement and adaptation. Feedback loops provide valuable insights into the effectiveness of current resilience strategies and highlight areas needing enhancement. By systematically gathering feedback from drills, real incidents, and regular operations, organizations can learn from their experiences and refine their plans, processes, and responses accordingly. Moreover, feedback from employees, partners, and stakeholders fosters a culture of open communication and accountability, ensuring that all levels of the organization are engaged in resilience efforts. This ongoing feedback process is essential for staying agile and responsive to new threats and changes in the operational environment, ultimately strengthening the organization’s ability to withstand and recover from disruptions.

5. Monitoring and continuous improvement

Continuous Monitoring

Implement continuous monitoring systems to detect and respond to threats in real time. This includes monitoring IT systems, physical infrastructure, and supply chains. Implementing continuous monitoring systems is crucial in a resilience program because it enables organizations to detect and respond to threats in real time, minimizing potential damage.

By continuously monitoring IT systems, physical infrastructure, and supply chains, organizations can quickly identify anomalies, vulnerabilities, or disruptions that could compromise operations. This proactive approach allows for immediate action, reducing response times and limiting the impact of incidents. Continuous monitoring also helps organizations stay ahead of evolving threats, such as cyberattacks, equipment failures, or supply chain disruptions, by providing timely data and insights. Overall, these systems are essential for maintaining operational continuity and safeguarding against a wide range of risks, ensuring the organization remains resilient in the face of both predictable and unforeseen challenges.
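
To give a sense of how anomaly identification can work in practice, the sketch below flags a reading that deviates strongly from a rolling baseline of recent values. It is a deliberately simple illustration: the window size, the threshold of three standard deviations, and the latency readings are assumptions, and production monitoring tools typically use richer models.

```python
from collections import deque
from statistics import mean, pstdev


class RollingDetector:
    """Flag readings that deviate strongly from a rolling baseline."""

    def __init__(self, window: int = 5, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous against recent history."""
        is_anomaly = False
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            if spread == 0:
                # Flat baseline: any change at all counts as a deviation.
                is_anomaly = value != baseline
            else:
                is_anomaly = abs(value - baseline) > self.threshold * spread
        if not is_anomaly:
            # Keep anomalies out of the baseline so one spike does not
            # mask the next one.
            self.history.append(value)
        return is_anomaly


# Hypothetical response-time readings in milliseconds.
readings = [100, 102, 98, 101, 99, 100, 250, 101]

detector = RollingDetector(window=5, threshold=3.0)
flags = [detector.observe(r) for r in readings]
alerts = [i for i, flagged in enumerate(flags) if flagged]
print(alerts)
```

The detector stays silent until it has a full window of history, then raises an alert only for the 250 ms spike. Excluding flagged values from the baseline is a design choice of this sketch: it keeps the baseline stable, at the cost of adapting more slowly to a genuine, lasting change in behavior.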

Periodic Review and Continuous Improvement

Perform regular reviews to keep the framework up to date with new business processes, threats, and procedures. Adapt and evolve resilience strategies based on lessons learned from disruptions.

Performing regular reviews is vital in a resilience program to ensure that the framework remains current and effective amidst evolving business processes, emerging threats, and new procedures. By routinely assessing and updating resilience strategies, organizations can adapt to changes in their operational environment and refine their approach based on lessons learned from past disruptions. This continuous improvement process allows for the identification of gaps, the incorporation of new best practices, and the alignment of resilience measures with organizational priorities. Regular reviews foster agility and responsiveness, enabling the organization to stay resilient against a dynamic landscape of risks and to maintain a high level of preparedness for future challenges.

Conclusions

Building organizational resilience is a complex and ongoing process that requires a comprehensive approach. By understanding the key elements of resilience and following a structured methodology, organizations can enhance their ability to withstand and recover from disruptions. This not only ensures sustained operations but also maintains stakeholder trust and confidence.

The lessons learned from incidents like the CrowdStrike outage underscore the importance of preparedness, continuous improvement, and proactive engagement with stakeholders. By adopting these strategies and continuously evolving their resilience frameworks, organizations can navigate the challenges of an increasingly uncertain world and emerge stronger from disruptions.

How we can help

Reply can support your company with its expertise, know-how, and technical experience in cybersecurity, incident management, IT and business continuity, and crisis management. We assist our customers in implementing solutions that enhance agility and responsiveness, as well as in developing robust plans to validate and adjust their strategies, ensuring they are prepared for any situation that may arise. Our step-by-step approach is tailored to the customer’s needs and maturity level, allowing us to evaluate an organization’s readiness capabilities and design a suitable solution.
