Chaos Experiments Catalogue (cHEX)

Recommended Scenario

Hypothesis

Gremlin Scenario Steps

Blast Radius / Targets

Link

Validate Auto-Scaling with Status Checks

Status Checks validate that your cloud provider (for example, DigitalOcean) and a critical dependency (for example, GitHub) are in a steady state before any attacks launch. When CPU usage ramps up and crosses a set threshold, the number of active instances will increase; when CPU usage drops again, the number of active instances will decrease. User sessions will remain active without throwing any errors.



  1. Status Check: Are there any operational issues with my Cloud provider?

  2. Status Check: Are there any operational issues with my critical dependency?

  3. CPU 1 minute

  4. Delay 5s

  5. CPU 1 minute

  6. Delay 5s

  7. CPU 1 minute

  8. Delay 5s

  9. CPU 1 minute

  10. Delay 5s

  1. Statuspage.io

  2. 100% of targets (hosts/pods)

https://app.gremlin.com/scenarios/recommended/validate-auto-scaling/hosts
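The gating role of the two Status Checks can be sketched as follows. This is a minimal illustration, not Gremlin's implementation: it assumes each status source returns a Statuspage-style JSON payload with a `status.indicator` field, and the function names and sample payloads are hypothetical.

```python
import json

# Assumption: a Statuspage-style payload reports "none" when there is no incident.
HEALTHY_INDICATORS = {"none"}


def is_steady_state(status_payload: str) -> bool:
    """Return True when a Statuspage-style JSON payload reports no incident."""
    data = json.loads(status_payload)
    indicator = data.get("status", {}).get("indicator")
    return indicator in HEALTHY_INDICATORS


def should_launch(payloads: list[str]) -> bool:
    """Proceed to the CPU attacks only if every status check passes."""
    return all(is_steady_state(p) for p in payloads)


# Hypothetical sample payloads, not captured from a real provider.
cloud_ok = '{"status": {"indicator": "none", "description": "All Systems Operational"}}'
dependency_degraded = '{"status": {"indicator": "minor", "description": "Degraded"}}'

print(should_launch([cloud_ok, cloud_ok]))             # True
print(should_launch([cloud_ok, dependency_degraded]))  # False
```

A payload without a `status` field is treated as not steady, so the experiment halts when the state of a dependency is unknown.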

Validate Health Checks - Packet Loss

As packet loss degrades network communications, it should cause the targeted node or service to be marked as unhealthy. Your load balancer should distribute requests to other healthy resources. If an orchestrator is used, the unhealthy node or service should be replaced with a new one.

  1. Packet loss 15s

  2. Packet loss 30s

  3. Packet loss 45s

  4. Packet loss 60s

  1. 100% of targets (hosts/pods)

https://app.gremlin.com/scenarios/recommended/validate-health-checks-packet-loss/hosts
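One way to reason about the escalating durations is to compare each stage against the time a health check needs before it marks a target unhealthy. The sketch below assumes a check that probes at a fixed interval and trips after a number of consecutive failures, and that every probe fails while packet loss is active. The interval and threshold values are illustrative assumptions, not settings from the scenario.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    interval_s: int           # seconds between probes (assumed value)
    unhealthy_threshold: int  # consecutive failures before marking unhealthy (assumed value)

    def seconds_to_unhealthy(self) -> int:
        """Failure window long enough to always contain the required failed probes."""
        return self.interval_s * self.unhealthy_threshold


def stages_guaranteed_to_trip(check: HealthCheck, stage_durations_s: list[int]) -> list[int]:
    """Return the stage durations long enough to trip the check regardless of probe timing."""
    return [d for d in stage_durations_s if d >= check.seconds_to_unhealthy()]


# Packet loss stage durations from the scenario steps, in seconds.
STAGES = [15, 30, 45, 60]

check = HealthCheck(interval_s=10, unhealthy_threshold=3)
print(stages_guaranteed_to_trip(check, STAGES))  # [30, 45, 60]
```

Under these assumed settings, the 15s stage would not be expected to mark the target unhealthy, while the longer stages would. Partial packet loss can let some probes through, so real results may differ.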

Cache unavailable

When one cache AZ is made unavailable, users will not experience an outage.

  1. Blackhole - 60s

  1. Cache instances / hosts in an AZ

 

Cache CPU starvation

When CPU resources spike on the cache, additional cache instances will be added to the pool.

  1. CPU 80% - 6 min

  2. Delay 5s

  3. CPU 20% - 2 min

  4. Delay 5s

  5. CPU 10% - 2 min

  1. Cache instances / hosts in an AZ
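The CPU steps above can be laid out on a timeline to see when a scale-out would be expected. This is a rough sketch: the 70% scale-out threshold is a hypothetical value, and it treats the injected CPU percentage as the utilization the autoscaler observes, which ignores baseline load.

```python
# Scenario steps as (kind, cpu_percent, duration_seconds).
STEPS = [
    ("cpu", 80, 6 * 60),
    ("delay", 0, 5),
    ("cpu", 20, 2 * 60),
    ("delay", 0, 5),
    ("cpu", 10, 2 * 60),
]

SCALE_OUT_THRESHOLD = 70  # hypothetical autoscaling threshold, in percent


def total_duration(steps) -> int:
    """Total scenario length in seconds, including delays."""
    return sum(duration for _, _, duration in steps)


def scale_out_windows(steps, threshold: int) -> list[tuple[int, int]]:
    """Return (start_s, end_s) windows where injected CPU is at or above the threshold."""
    windows = []
    elapsed = 0
    for kind, cpu, duration in steps:
        if kind == "cpu" and cpu >= threshold:
            windows.append((elapsed, elapsed + duration))
        elapsed += duration
    return windows


print(total_duration(STEPS))                          # 610
print(scale_out_windows(STEPS, SCALE_OUT_THRESHOLD))  # [(0, 360)]
```

With these assumptions, only the first six-minute stage is expected to add cache instances; the 20% and 10% stages show whether the pool stays stable or scales back in as load falls.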

 

Cache IO starvation

When IO resources spike on the cache instances, we will be alerted via monitors.

  1. IO - 5 min

  1. One single cache instance

 

Cache Process Killer

When memcached is killed by the process killer attack, the instance will shut down and a new instance will replace it.

  1. Process Killer attack on memcached process

  1. One single cache instance