A Wake-Up Call for Third-Party Risk Management?

We address the inevitability of IT outages and their impact on organizational resilience. Starting with a case study on the recent CrowdStrike incident, we’ll explore various disruptions—from human errors to cyberattacks—and offer strategies to strengthen your organization’s preparedness.

Written by
Andy Fernandez
Published on
July 25, 2024

Navigating the Storms of IT: A Series on Outages and Organizational Resilience

In today's digital landscape, the question isn't if an IT outage will occur, but when. As we've seen with the recent CrowdStrike incident, even global platforms can fall victim to unforeseen disruptions. Whether caused by human error, data corruption, or cyberattacks, outages like this will continue to happen.

This blog post is the first in a series dedicated to exploring third-party risks across the different technologies each of us relies on. Most importantly, we’ll focus on how organizations can prepare for and protect themselves against challenges that range from simple human error to malicious actors.

Throughout this series, we'll dive into various types of incidents, from cloud breaches to supply chain ransomware attacks. Our goal is to equip you with the knowledge and tools necessary to not just weather these disruptions, but to emerge stronger and more resilient.

The first post looks at the recent CrowdStrike outage as a case study, using it as a springboard to discuss broader themes of human error, third-party risk management, and the critical steps organizations must take to prepare for and respond to IT disruptions. We explore immediate remediation tactics, the importance of resilient on-premises systems, and strategies for evaluating and enhancing your preparedness in cloud and SaaS environments.

As we embark on this journey together, one reminder: with IT, preparation is not just about preventing disasters. It's about building the capability to bounce back stronger when they inevitably occur. Let's begin by unpacking the CrowdStrike incident and the valuable lessons it offers for organizations of all sizes.

CrowdStrike Incident Explained

CrowdStrike customers running the Falcon sensor for Windows (version 7.11 and above) experienced system crashes after CrowdStrike released a sensor configuration update to Windows systems. The update triggered a crash and blue screen of death (BSOD) on impacted machines. The effects were felt globally, hitting major airlines, travel, hospitality, hospitals, e-commerce, and much more. This was not a criminal cyberattack, but simple human error. For a quick read, one of many on the recent outage, see the piece by Chris Evans at Architecting IT, “Commentary: Critical Infrastructure and Collective Responsibility.”

Image of quote from CEO and Founder of HYCU Simon Taylor

Microsoft has also released a remediation guide for impacted customers, “Helping Our Customers Through the CrowdStrike Outage.”

‘Largest IT Outage in History’ caused by human error  

Human error is inevitable and impacts all organizations. This time it happened to a critical third-party service with global reach across millions of PCs and systems. However, this is neither the first nor the last outage or third-party incident that will impact organizations across the globe. The lesson is that you must be resilient to any third-party failure. Here are three steps every organization should take:

Step 1: Ensure immediate remediation

CrowdStrike has already released a remediation guide and video for remote users affected by the BSOD. Microsoft has also released a new Recovery Tool with two repair options to expedite the repair process. However, make sure you follow guidance and remediation instructions only from CrowdStrike and Microsoft directly, as we are already seeing cybercriminals capitalize on this incident and directly target CrowdStrike customers.

Step 2: Ensure resilience of your production systems on-premises

So much of our energy on third-party risk management has been focused on public cloud and SaaS applications that we often take our data center services for granted. Whether the threat is a third-party system crash or a cyberattack, every organization running critical applications on-premises should implement the following:

  • Disaster Recovery solutions with the ability to failover to another facility or to the public cloud
  • Comprehensive, application-aware backups that support both point-in-time restores and rapid bulk recovery.
  • Immutable backups that are logically separated to ensure a safe, offsite copy that is accessible in the event of a mass corruption or cyberattack.
  • Regular resilience testing of failovers and restores from DR and backup solutions with documented protocols and runbooks accessible to multiple members of the IT team.
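
As an illustration of the last point, one small piece of a restore test can be automated: checking that a test restore actually matches the data it was taken from. The sketch below is a minimal example under stated assumptions, not a feature of any particular backup product. The function names (sha256_of, verify_restore) and the directory layout are hypothetical, and it assumes the source data is not changing while the comparison runs.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file under source_dir with its restored copy.

    Returns a list of problems; an empty list means the restore matched.
    Both directory arguments are hypothetical paths chosen for the test.
    """
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"checksum mismatch: {relative.as_posix()}")
    return problems
```

A scheduled job could run verify_restore after each test restore and alert on any non-empty result. A file comparison like this complements a full failover exercise but does not replace one, since it says nothing about whether the application starts and serves traffic.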

Step 3: Evaluate your resilience and preparedness in case of third-party disruption in SaaS and Cloud  

Your cloud infrastructure and SaaS applications rely entirely on third-party vendors to deliver these services, maintain availability, and protect your data at the system level. However, these services are also at risk of outages, corruption, and data loss. These vendors provide robust availability and security, but human error means there will always be third-party risk that can lead to downtime, data loss, or corruption.

Whether cloud customers experience a data loss event (e.g., a pension fund whose account is accidentally deleted by a vendor) or a cybersecurity company and its tenants suffer a supply chain attack, these incidents will continue to happen, even with best-of-breed solutions.

To prepare accordingly, you need the right third-party risk management in place. The European Union has introduced the Digital Operational Resilience Act (DORA), which explicitly asks organizations to have a third-party risk management framework for ICT services (e.g., SaaS and cloud applications). This extensive framework highlights the need to protect your applications from third-party risks. Some of the requirements include:

  • Continuous asset discovery
  • Backup policies  
  • Offsite data retention  
  • Resilience testing  
  • Documented runbooks and protocols for business continuity and incident response  
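
To make the list above actionable, a team could track which of these controls are in place for each application and flag the gaps. The following is a rough sketch only, not a compliance tool and not drawn from the DORA text itself. The control identifiers, the Application class, and the example inventory are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Control names mirror the requirement list above; the identifiers are illustrative.
REQUIRED_CONTROLS = (
    "asset_discovery",
    "backup_policy",
    "offsite_retention",
    "resilience_testing",
    "runbooks",
)


@dataclass
class Application:
    """A cloud or SaaS application and the controls currently in place for it."""
    name: str
    controls: set[str] = field(default_factory=set)


def find_gaps(apps: list[Application]) -> dict[str, list[str]]:
    """Map each application name to the required controls it lacks."""
    gaps = {}
    for app in apps:
        missing = [c for c in REQUIRED_CONTROLS if c not in app.controls]
        if missing:
            gaps[app.name] = missing
    return gaps


# Hypothetical inventory: one fully covered application, one with gaps.
inventory = [
    Application("ticketing", set(REQUIRED_CONTROLS)),
    Application("source-control", {"asset_discovery", "backup_policy"}),
]

for app_name, missing in find_gaps(inventory).items():
    print(f"{app_name}: missing {', '.join(missing)}")
```

Keeping the inventory as data means newly discovered applications start with an empty control set and show up as gaps by default, which fits the continuous asset discovery item at the top of the list.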

Watch this on-demand webinar about DORA Compliance using Atlassian Cloud as an example. This highlights customer vs. vendor responsibilities. The principles discussed in this video apply to all your cloud and SaaS applications.  

Conclusion: Be Ready, Be Resilient.  

The teams at CrowdStrike and Microsoft are doing everything they can to remediate and ensure all organizations have the tools they need to return to uninterrupted service and achieve maximum uptime. However, this scenario can and will happen to many vendors, from security and cloud to your business applications.  

The key is to accept that this WILL happen and to make sure your organization has taken the steps necessary to protect and recover your data when the time comes.

Director of Product Management

Andy Fernandez is the Director of Product Management at HYCU, an Atlassian Ventures company. Andy's entire career has been focused on data protection and disaster recovery for critical applications. Previously holding product and GTM positions at Zerto and Veeam, Andy’s focus now is ensuring organizations protect critical SaaS and Cloud applications across ITSM and DevOps. When not working on data protection, Andy loves attending live gigs, finding the local foodie spots, and going to the beach.
