Learn from Amazon Web Services

When Automated Capacity Brought Down Half the Internet

In December 2021, a seemingly routine automated AWS capacity increase triggered a cascading failure: a surge of internal network traffic overwhelmed networking devices and crippled us-east-1, disrupting services that depend on the region. Incident Drill helps engineering teams practice responding to similar high-pressure scenarios, turning chaos into calm.

Amazon Web Services | 2021 | Outage (Cloud)

The High Stakes of Cloud Outages

Major cloud outages highlight the fragility of modern infrastructure. A single misconfiguration or unexpected event can lead to widespread disruption, resulting in significant financial losses, reputational damage, and customer dissatisfaction. Teams need to be prepared to quickly diagnose and mitigate these issues, minimizing downtime and impact.

PREPARE YOUR TEAM

Simulate and Conquer with Incident Drill

Incident Drill provides realistic simulations of complex incidents like the AWS us-east-1 outage. Teams can practice their incident response skills in a safe environment, identifying weaknesses in their processes and improving their ability to collaborate, communicate, and recover from critical events. Build confidence and resilience with hands-on training.

🔥

Realistic Simulations

Experience the pressure of a real-world outage without the real-world consequences.

🔎

Root Cause Analysis

Dive deep into the technical details and understand the underlying causes of the incident.

🤝

Team Collaboration

Practice working together to diagnose, troubleshoot, and resolve incidents effectively.

🗣️

Effective Communication

Learn how to communicate clearly and concisely during high-pressure situations.

⏱️

Time-Based Scenarios

Experience the time pressure of a real incident with scenarios that unfold in real time.

📊

Post-Incident Analysis

Review your team's performance and identify areas for improvement.

WHY TEAMS PRACTICE THIS

Prepare for the Unthinkable

  • Reduce downtime and minimize impact
  • Improve incident response time
  • Enhance team collaboration and communication
  • Build confidence in handling critical incidents
  • Identify weaknesses in your infrastructure
  • Prevent future outages

AWS us-east-1 Outage Timeline (Simplified)

9:30 AM EST: Automated capacity increase initiated
9:45 AM EST: Network devices overwhelmed
10:00 AM EST: Network congestion impacts us-east-1 services
10:30 AM EST: Widespread service disruptions reported
1:00 PM EST: Services begin to recover
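
For teams that want to replay this timeline in a time-based drill, it can help to express each event as an offset from the start of the incident. The sketch below is only an illustration: it copies the simplified times listed above, and the names `TIMELINE` and `minutes_since_start` are hypothetical, not part of Incident Drill or any AWS tooling.

```python
from datetime import datetime

# Simplified timeline copied from the list above (all times EST, same day).
# TIMELINE and minutes_since_start are illustrative names, not a product API.
TIMELINE = [
    ("9:30 AM", "Automated capacity increase initiated"),
    ("9:45 AM", "Network devices overwhelmed"),
    ("10:00 AM", "Network congestion impacts us-east-1 services"),
    ("10:30 AM", "Widespread service disruptions reported"),
    ("1:00 PM", "Services begin to recover"),
]


def minutes_since_start(timeline):
    """Return (offset_minutes, event) pairs relative to the first entry."""
    parsed = [(datetime.strptime(t, "%I:%M %p"), event) for t, event in timeline]
    start = parsed[0][0]
    return [
        (int((ts - start).total_seconds() // 60), event)
        for ts, event in parsed
    ]


offsets = minutes_since_start(TIMELINE)
for offset, event in offsets:
    # Print each event as minutes elapsed since the capacity increase began.
    print(f"T+{offset:>3} min  {event}")
```

Seen this way, the simplified timeline puts the first signs of recovery 210 minutes after the capacity increase began, which is a useful reminder of how long a team may need to sustain its response.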

How It Works

Step 1: Understand the Incident

Review the official AWS post-mortem and related resources.

Step 2: Simulate the Scenario

Use Incident Drill to recreate the conditions that led to the outage.

Step 3: Practice Your Response

Work with your team to diagnose the problem and implement solutions.

Step 4: Analyze and Improve

Review your team's performance and identify areas for improvement.

Ready to Level Up Your Incident Response?

Join the Incident Drill waitlist and be among the first to experience realistic incident simulations.

Get Early Access
  • Founding client discounts
  • Shape the roadmap
  • Direct founder support

Drop your email and we'll reach out with private beta invites and roadmap updates.