Experiments

We inject chaos proactively,
instead of dreading the unknown.

How To Create a Chaos Engineering Scenario

 

  • Form a hypothesis

  • Baseline your metrics (e.g. SLOs/SLIs per service)

  • Consider the blast radius

  • Determine what chaos to inject

  • Run your chaos engineering scenario

  • Measure the results of your chaos engineering scenario

  • Find and fix issues or scale the blast radius of the scenario
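
The steps above can be sketched as a small loop. This is a minimal illustration, not a Gremlin API: the `Experiment` fields and the `metric`, `inject`, and `halt` callables are hypothetical placeholders for whatever monitoring and fault-injection tooling you use, and it assumes the hypothesis can be expressed as a single metric staying under a threshold.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                # e.g. "p99 latency stays under 300 ms"
    metric: Callable[[], float]    # hypothetical: reads the SLI being watched
    threshold: float               # SLO bound the metric must stay under
    blast_radius: int              # number of hosts targeted
    inject: Callable[[int], None]  # hypothetical: starts the fault on N hosts
    halt: Callable[[], None]       # hypothetical: stops the fault


def run(exp: Experiment) -> str:
    # Baseline first: do not inject chaos into a system that is already unhealthy.
    baseline = exp.metric()
    if baseline > exp.threshold:
        return "abort: system is unhealthy before injection"

    # Inject the fault within the agreed blast radius, then measure.
    exp.inject(exp.blast_radius)
    try:
        observed = exp.metric()
    finally:
        exp.halt()  # always stop the fault, even if measurement fails

    # Either find and fix the issue, or scale the blast radius.
    if observed > exp.threshold:
        return "fix: hypothesis rejected, find and fix the issue"
    return "scale: hypothesis held, widen the blast radius"
```

In practice the measurement would be taken over a window rather than as a single reading; the single call here only keeps the sketch short.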

 


How To Pick A Chaos Engineering Scenario

 

  1. Identify your top 5 critical services

  2. Choose one of these critical services:

    1. Monitoring & alerting (Datadog & PagerDuty)

    2. Cache (Redis or Memcached)

    3. Payments

  3. Whiteboard the service with your team

  4. Select the Gremlin Scenario:

    1. Validate Autoscaling

    2. Unavailable Dependency

    3. Host/Container Failure

  5. Determine the magnitude: number of servers/length of time
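
One way to think about magnitude is as a schedule that starts small and grows only while each run holds up. The sketch below is an assumption for illustration, not a Gremlin feature: the `Magnitude` type, the `escalation_plan` function, the doubling rule, and the 50% fleet cap are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Magnitude:
    hosts: int       # number of servers targeted
    duration_s: int  # length of time the fault runs, in seconds


def escalation_plan(fleet_size: int, start_hosts: int = 1,
                    duration_s: int = 60,
                    max_fraction: float = 0.5) -> List[Magnitude]:
    """Return successive magnitudes, doubling hosts up to a fleet cap."""
    if start_hosts < 1:
        raise ValueError("start_hosts must be at least 1")
    # Assumed safety cap: never target more than max_fraction of the fleet.
    cap = max(1, int(fleet_size * max_fraction))
    hosts = min(start_hosts, cap)
    plan = []
    while True:
        plan.append(Magnitude(hosts, duration_s))
        if hosts >= cap:
            break
        hosts = min(hosts * 2, cap)
    return plan
```

For a fleet of 20 servers with the defaults, the plan targets 1, 2, 4, 8, and then 10 hosts for 60 seconds each.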


What Is the Value of Chaos Engineering?

 

Find your monitoring gaps, improve your signal-to-noise ratio
“We’ll get paged if that breaks”, until you don’t.
A false sense of security is worse than nothing.

 

Validate Upstream & Downstream Dependencies
Validate that each new service can fail independently.
Protect against cascading failures and knock-on effects.

 

Train your teams
We run fire drills to train firefighters and first responders.
Are you investing in your operations teams?

 

Get A Good Night’s Sleep
Too often we can’t get a good night’s sleep because our pager wakes us up in the middle of the night. Use Chaos Engineering to reduce incidents and increase the time you spend asleep in your bed!