Learn from GitLab.com

The GitLab Global Outage: A Failover Gone Wrong

In 2018, a routine failover test at GitLab.com turned into a multi-hour global outage due to a misconfigured database. With Incident Drill, your team can practice handling similar scenarios and build resilience against unexpected failures.

GitLab.com | 2018 | Outage (Configuration)

The Cost of Unpreparedness

Incidents like the GitLab outage highlight the importance of robust failover procedures and thorough testing. Without proper preparation, a single mistake can lead to significant downtime, data loss, and reputational damage.

PREPARE YOUR TEAM

Incident Drill: Your Incident Response Training Platform

Incident Drill provides realistic incident simulations based on real-world events like the GitLab outage. Your team will learn to identify root causes, collaborate effectively, and implement solutions under pressure, all in a safe and controlled environment.

🔥

Realistic Simulations

Experience the pressure of a real incident with meticulously crafted scenarios.

🧑‍💻

Hands-on Practice

Dive into the code and infrastructure to diagnose and resolve the issue.

🤝

Team Collaboration

Work together with your team to develop and implement solutions.

⏱️

Time-Based Challenges

Learn to make critical decisions under time constraints.

📈

Performance Analysis

Receive detailed feedback on your team's performance and identify areas for improvement.

📚

Learn from Experts

Access expert insights and best practices for incident response.

WHY TEAMS PRACTICE THIS

Boost Resilience & Reduce Downtime

  • Improve failover procedures
  • Identify infrastructure weaknesses
  • Enhance team communication
  • Reduce mean time to resolution (MTTR)
  • Minimize the impact of future outages
  • Build confidence in crisis situations

00:00  Start: Routine database failover test
00:15  Primary database unresponsive
00:30  Database mounted with incorrect filesystem
01:00  Data corruption detected
04:00  Service restored
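
For teams that want to put numbers on a timeline like this one, here is a minimal Python sketch. It assumes the timestamps are elapsed hours and minutes since the test began, which fits the "multi-hour" description but is not stated outright. The names `TIMELINE`, `parse_offset`, `phase_durations`, and `time_to_restore` are hypothetical and not part of Incident Drill.

```python
from datetime import timedelta

# Entries copied from the timeline above.
# Assumption: each stamp is HH:MM elapsed since the failover test started.
TIMELINE = [
    ("00:00", "Start: Routine database failover test"),
    ("00:15", "Primary database unresponsive"),
    ("00:30", "Database mounted with incorrect filesystem"),
    ("01:00", "Data corruption detected"),
    ("04:00", "Service restored"),
]


def parse_offset(stamp: str) -> timedelta:
    """Convert an 'HH:MM' elapsed-time stamp into a timedelta."""
    hours, minutes = stamp.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes))


def phase_durations(timeline):
    """Return (event, time until the next event) for every event but the last."""
    offsets = [parse_offset(stamp) for stamp, _ in timeline]
    return [
        (timeline[i][1], offsets[i + 1] - offsets[i])
        for i in range(len(timeline) - 1)
    ]


def time_to_restore(timeline, first_symptom_index=1):
    """Time from the first visible symptom to the final entry (service restored)."""
    offsets = [parse_offset(stamp) for stamp, _ in timeline]
    return offsets[-1] - offsets[first_symptom_index]


if __name__ == "__main__":
    for event, duration in phase_durations(TIMELINE):
        print(f"{event}: {duration} until next milestone")
    print("Time from first symptom to restore:", time_to_restore(TIMELINE))
```

Under that assumption, most of the incident falls between detecting the corruption and restoring service, which is the phase a drill gives a team the chance to rehearse.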

How It Works

Step 1: Simulation Start

Begin the GitLab outage simulation with a realistic initial state.

Step 2: Investigate the Issue

Analyze logs, metrics, and system configurations to identify the root cause.

Step 3: Implement a Solution

Develop and deploy a fix to restore service and prevent further data loss.

Step 4: Post-Incident Analysis

Review the incident, identify lessons learned, and implement improvements.

Ready to Build a More Resilient Team?

Join the Incident Drill waitlist and be among the first to access our platform. Prepare your team for the unexpected and minimize the impact of future incidents.

Get Early Access

  • Founding client discounts
  • Shape the roadmap
  • Direct founder support

Join the Incident Drill waitlist

Drop your email and we'll reach out with private beta invites and roadmap updates.