Experiments

We inject chaos proactively,
instead of dreading the unknown.

How To Create a Chaos Engineering Scenario

Form a hypothesis
Baseline your metrics (e.g. SLOs/SLIs per service)
Consider the blast radius
Determine what chaos to inject
Run your chaos engineering scenario
Measure the results of your chaos engineering scenario
Find and fix issues or scale the blast radius of the scenario
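
The steps above can be sketched as a small experiment loop. This is a minimal illustration, not a real chaos tool: `Hypothesis`, `run_experiment`, and the `sample_metric`, `inject`, and `halt` callables are hypothetical names, and the latency numbers are made up for the demo. In practice the callables would wrap your monitoring queries and whatever tool injects the failure.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Hypothesis:
    description: str
    max_allowed: float  # threshold taken from an SLO, e.g. latency in ms


def run_experiment(
    hypothesis: Hypothesis,
    sample_metric: Callable[[], float],
    inject: Callable[[], None],
    halt: Callable[[], None],
    samples: int = 5,
) -> dict:
    # Baseline the metric before injecting anything.
    baseline = mean(sample_metric() for _ in range(samples))
    inject()
    try:
        # Measure the same metric while the chaos is active.
        observed = mean(sample_metric() for _ in range(samples))
    finally:
        halt()  # always stop the chaos, even if measurement fails
    return {
        "hypothesis": hypothesis.description,
        "baseline": baseline,
        "observed": observed,
        "held": observed <= hypothesis.max_allowed,
    }


# Simulated system: injecting chaos raises latency (made-up numbers).
state = {"latency_ms": 120.0}
result = run_experiment(
    Hypothesis("latency stays under 300 ms with one cache node down", 300.0),
    sample_metric=lambda: state["latency_ms"],
    inject=lambda: state.update(latency_ms=450.0),
    halt=lambda: state.update(latency_ms=120.0),
)
```

If `result["held"]` is false, there is an issue to find and fix; if it is true, the same scenario can be rerun with a larger blast radius.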

How To Pick A Chaos Engineering Scenario

Identify your top 5 critical services
Choose one of these critical services:
1. Monitoring & alerting (Datadog & PagerDuty)
2. Cache (Redis or Memcached)
3. Payments
Whiteboard the service with your team
Select the Gremlin Scenario:
1. Validate Autoscaling
2. Unavailable Dependency
3. Host/Container Failure
Determine the magnitude: number of servers/length of time
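
One way to write down the choices above is a small record per scenario, with the magnitude kept explicit so it can be scaled between runs. This is only a sketch: `Scenario` and `scale_up` are hypothetical names, the attack label is an illustrative string, and none of this is Gremlin's API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    service: str     # one of the critical services identified above
    attack: str      # illustrative label, e.g. "unavailable-dependency"
    hosts: int       # magnitude: number of servers targeted
    duration_s: int  # magnitude: length of time, in seconds


def scale_up(s: Scenario, total_hosts: int, factor: int = 2) -> Scenario:
    """Return the next, larger run, never targeting more hosts than exist."""
    return replace(s, hosts=min(s.hosts * factor, total_hosts))


# Start small, then widen the blast radius on a later run.
first = Scenario("cache", "unavailable-dependency", hosts=1, duration_s=60)
second = scale_up(first, total_hosts=10)
```

Starting at one host for a short duration keeps the first run cheap to abort; the cap in `scale_up` is there so repeated scaling cannot ask for more servers than the fleet has.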

What is the value of Chaos Engineering?

Find your monitoring gaps, reduce signal to noise
“We’ll get paged if that breaks”, until you don’t.
A false sense of security is worse than nothing.

Validate Upstream & Downstream Dependencies
Validate that each new service can fail independently.
Protect against cascading failures and knock-on effects.
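
As a rough illustration of a service that fails independently of its dependency, the sketch below treats a cache outage as a cache miss instead of failing the request. The names `get_profile`, `cache_get`, `db_get`, and `broken_cache` are hypothetical, and the use of `ConnectionError` is an assumption about how the cache client reports an outage.

```python
from typing import Callable, Optional


def get_profile(
    user_id: str,
    cache_get: Callable[[str], Optional[dict]],
    db_get: Callable[[str], dict],
) -> dict:
    """Read through a cache, but treat a cache outage as a miss."""
    try:
        cached = cache_get(user_id)
    except ConnectionError:
        cached = None  # dependency is down: degrade instead of failing
    if cached is not None:
        return cached
    return db_get(user_id)


def broken_cache(_key: str) -> Optional[dict]:
    # Simulates the injected "unavailable dependency" failure.
    raise ConnectionError("cache unreachable")


profile = get_profile("u1", broken_cache, lambda uid: {"id": uid, "source": "db"})
```

An Unavailable Dependency scenario checks that this fallback path works in production conditions, and that the extra load on the database does not start a cascading failure of its own.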

Train your teams
We run fire drills and train firefighters and first responders.
Are you investing in your operations teams?

Get A Good Night’s Sleep
Too often our pager wakes us up in the middle of the night and costs us a good night’s sleep. Use Chaos Engineering to reduce incidents and spend more time asleep in your own bed!