Imagine a ship sailing through unpredictable seas. Traditional chaos engineering is like scheduling fire drills on calm days — useful practice, but not always reflective of real storms. Kubernetes often faces turbulence in the moment: pods fail, nodes crash, or workloads spike without warning.
Event-driven chaos engineering is like training the crew with surprise drills triggered by real conditions. Instead of waiting for disaster, it turns every unexpected wave into a chance to strengthen resilience.
In this blog, we’ll explore how event-driven chaos turns Kubernetes from a vessel that merely survives storms into one that grows stronger with each one. We’ll build an event-driven chaos engineering pipeline in Kubernetes, combining tools like Chaos Mesh, Prometheus, and Event-Driven Ansible (EDA).
Why Chaos Engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Traditional chaos experiments are often scheduled or manually triggered, which can miss critical windows of vulnerability or relevance.
For example:
- What happens when a node fails during a deployment?
- How does your system behave when a spike in traffic coincides with a database upgrade?
These scenarios are not just hypothetical — they’re real, and they often occur in response to events.
Read the earlier blogs in this series to learn more about chaos engineering and how the traditional and event-driven approaches compare.
Why Event-Driven?
Event-driven architectures are designed to respond to changes in state — be it a new deployment, a scaling operation, or a system alert. By integrating chaos engineering with these events, we can:
- Target chaos experiments more precisely (e.g., inject faults during high-risk operations).
- Reduce noise by avoiding irrelevant or redundant tests.
- Accelerate feedback loops for developers and SREs.
- Simulate real-world failure conditions with higher fidelity.
In essence, event-driven chaos engineering transforms resilience testing from a periodic exercise into a continuous, adaptive process. Think of it like fire drills: traditional chaos is “let’s pull the alarm at 2 AM every day,” while event-driven chaos is “when smoke is detected in a wing, trigger a drill immediately.”
Chaos Engineering: Traditional vs. Event-Driven
|  | Traditional Chaos Engineering | Event-Driven Chaos Engineering |
| --- | --- | --- |
| When it runs | Prescheduled experiments (e.g., daily, weekly) | Triggered in real time by actual events (e.g., pod crash, CPU spike) |
| Focus | Testing generic failure scenarios | Responding to live failures as they occur |
| Realism | Simulated conditions, not always reflective of production events | Mirrors real-world incidents and context |
| Goal | Identify weak points through periodic stress | Build adaptive resilience by turning every failure into a learning moment |
| Analogy | Fire drills planned on sunny days | Crew drills launched the instant a storm hits |
Why Inject Chaos After a Real Event?
- Validate resilience at the right time.
  - Instead of injecting chaos at random, you inject it when a real degradation is already in play.
  - Example: API latency is 1.4s (warning) → inject CPU stress → see if autoscaling and retries really protect users (a sketch of such an autoscaling policy follows this list).
- Reveal weak spots in remediation.
  - Auto-remediation may restart a pod, but what if the DB is also slow?
  - Chaos uncovers cascading failures that a single remediation step can’t cover.
- Test SLO guardrails in production-like conditions.
  - Injecting stress during live but controlled signals (e.g., warning alerts, not critical) ensures you test under real workloads, not just in lab simulations.
- Build confidence in automation.
  - Chaos forces the remediation playbooks, HPA policies, and failover logic to run in real time.
  - You validate that remediation is not only coded but also effective under real stress.
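To make the first point concrete, the target workload needs an autoscaling policy that the injected CPU stress can actually exercise. The manifest below is a minimal sketch, assuming a deployment named nginx (as used later in this tutorial), a working metrics-server, and CPU requests set on the pods; it is not part of the tutorial's repository.
# Illustrative HPA so that CPU-stress chaos exercises real autoscaling.
# Assumes metrics-server is installed and the nginx deployment declares CPU requests.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70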
A Safe Design for Chaos
- Warning-level event → inject chaos (to push the system harder); see the rulebook sketch below.
  - If the system plus remediation can hold, you know resilience is strong.
- Critical-level event → skip chaos and remediate immediately.
  - Protects production and ensures healing takes priority.
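As a sketch of how this gating could be expressed in an Event-Driven Ansible rulebook: the webhook source and alert payload shape match what is used later in this tutorial, but the rule names, conditions, and playbook names are illustrative, not the exact ruleset shipped in eda-incluster.yaml.
# Hypothetical severity-gated ruleset: warning alerts trigger chaos, critical alerts go straight to remediation.
- name: Severity-gated chaos and remediation
  hosts: all
  sources:
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5001
  rules:
    # Warning-level alert: the system is degraded but not failing, so push it harder.
    - name: Warning alert -> inject chaos
      condition: event.alerts[0].labels.severity == "warning"
      action:
        run_playbook:
          name: chaos-cpu-stress.yaml
    # Critical-level alert: skip chaos and let remediation take priority.
    - name: Critical alert -> remediate immediately
      condition: event.alerts[0].labels.severity == "critical"
      action:
        run_playbook:
          name: remediate.yml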
Example Use Cases
- High CPU on Application Pods
  - Real-time event: Pod CPU usage > 80% for a sustained period.
  - Alert: Prometheus alert for “PodHighCPU.”
  - Chaos: Inject CPU stress on one pod to simulate saturation.
  - Remediation: Scale deployment replicas or restart the unhealthy pod.
- Node NotReady or Memory Pressure
  - Real-time event: Node marked NotReady or under memory pressure.
  - Alert: “NodeNotReady” alert from kubelet metrics.
  - Chaos: Drain a node or simulate node failure.
  - Remediation: Reschedule pods to healthy nodes or add capacity.
- Database Latency Spike (a Chaos Mesh sketch for this case follows the list)
  - Real-time event: DB query latency exceeds 100ms.
  - Alert: “DbHighLatency” alert raised.
  - Chaos: Introduce network delay between the application and the DB.
  - Remediation: Switch to a read replica, increase the connection pool, or reroute traffic.
- Increased Error Rate (5xx)
  - Real-time event: Error rate > X% in a service.
  - Alert: “HighErrorRate” alert triggers.
  - Chaos: Kill one pod of the service to simulate degraded availability.
  - Remediation: Restart failed pods or scale up to distribute load.
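As an example of what the chaos step for the database-latency case could look like, here is a minimal Chaos Mesh NetworkChaos sketch; the app and db labels are hypothetical placeholders for your own workloads, and the latency and duration values are illustrative.
# Hypothetical NetworkChaos that delays traffic from the application to the database.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-latency-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-app          # placeholder label for the application pods
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        app: my-db         # placeholder label for the database pods
  delay:
    latency: "100ms"
    jitter: "10ms"
  duration: "2m"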
Event-Driven Chaos Engineering Architecture for Kubernetes
The diagram below illustrates an example of an event-driven chaos engineering architecture for a Kubernetes environment. It connects event sources, alert management, event routing, chaos orchestration, remediation, and observability into a closed feedback loop. Our tutorial will be based on this architecture, walking through the layers step by step.

Step-by-Step Tutorial
The prerequisite for this tutorial is a running Kubernetes cluster (Minikube, Kind, or a managed cluster). This tutorial uses Minikube, but the steps apply to any cluster. All the YAML files required for this tutorial can be downloaded or cloned from https://github.com/jojustin/EDAChaos.
Step 1: Start Minikube
minikube start --cpus=4 --memory=8192
kubectl get nodes
Step 2: Install Chaos Mesh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create ns chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-testing --set chaosDaemon.runtime=docker --set chaosDaemon.socketPath=/var/run/docker.sock
kubectl -n chaos-testing get pods
Step 3: Deploy a Sample App
Let’s use a simple nginx deployment as our target.
kubectl create deployment nginx --image=nginx
kubectl get pods -n default -l app=nginx -o wide
kubectl expose deployment nginx --port=80 --type=NodePort
minikube service nginx --url # (optional test)
Make sure all the nginx pods are in a running state.
Step 4: Install Prometheus for Metrics
Install the kube-prometheus-stack using the custom values file values-kps.yaml. This file also defines a route/webhook to the EDA service DNS.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f values-kps.yaml # Overrides default chart configuration with the custom values provided
kubectl get pods -n monitoring
kubectl get crd | grep monitoring.coreos.com # should list prometheusrules, servicemonitors, etc.
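The exact contents of values-kps.yaml live in the repository; a minimal sketch of the Alertmanager route/webhook it is described as defining might look like the excerpt below. The service DNS, port, and /alerts path are assumptions based on the eda-listener service created in Step 6.
# Hypothetical excerpt of values-kps.yaml: route all alerts to the EDA webhook.
alertmanager:
  config:
    route:
      receiver: eda-webhook
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 5m
    receivers:
      - name: eda-webhook
        webhook_configs:
          - url: http://eda-listener.eda.svc.cluster.local:5001/alerts   # assumed EDA service DNS, port, and path
            send_resolved: false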
Step 5: Create a Custom Role to Allow EDA to Read Metrics
Apply the ClusterRole in Kubernetes using kubectl apply -f clusterrole-read-metrics.yaml.
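The repository provides clusterrole-read-metrics.yaml; a minimal sketch of what such a role and binding could contain is shown below. The resources, verbs, and service account name are assumptions, so the actual file may differ.
# Hypothetical read-only role for EDA; adjust the resources and the service account
# name to match what eda-incluster.yaml actually creates.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eda-read-metrics
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eda-read-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eda-read-metrics
subjects:
  - kind: ServiceAccount
    name: eda-listener      # assumed service account used by the EDA listener deployment
    namespace: eda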
Step 6: Deploy EDA In-Cluster
This step uses a single YAML file, eda-incluster.yaml, that installs Ansible, Ansible Rulebook, and the required Ansible Galaxy collections. It also creates the Ansible rulebook, the remediation playbook, and other related resources in Kubernetes. remediate.yml, which is part of eda-incluster.yaml, provides the remediation steps and can be customized for your use case. The GitHub token is part of this file; it can also be created as a Secret and referenced from there. Before applying the file, update the github_owner, github_repo, and token fields. Then deploy the EDA listener by applying the file.
# Apply Ruleset & Remediation
kubectl apply -f eda-incluster.yaml
# Roll out the EDA Listener
kubectl -n eda rollout status deploy/eda-listener
Verify the eda-listener pods are in a running state. You can also check the logs.
kubectl -n eda get pods,svc
kubectl -n eda logs deploy/eda-listener -f
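For orientation, here is a sketch of what a remediation playbook in the spirit of remediate.yml might do: scale the affected deployment and record a summary GitHub issue. The module choices, replica count, and issue contents are illustrative; the playbook bundled in eda-incluster.yaml may differ. The github_owner, github_repo, and token variables are the fields you updated above.
# Illustrative remediation playbook: scale nginx and open a GitHub issue.
- name: Remediate high CPU on nginx
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Scale the nginx deployment to absorb the load
      kubernetes.core.k8s_scale:
        api_version: apps/v1
        kind: Deployment
        name: nginx
        namespace: default
        replicas: 3

    - name: Open a GitHub issue summarizing the event and the action taken
      ansible.builtin.uri:
        url: "https://api.github.com/repos/{{ github_owner }}/{{ github_repo }}/issues"
        method: POST
        headers:
          Authorization: "token {{ token }}"
          Accept: "application/vnd.github+json"
        body_format: json
        body:
          title: "Chaos event: PodHighCPU on nginx"
          body: "Remediation applied: scaled the nginx deployment to 3 replicas."
        status_code: 201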
Step 7: Ensure a Rule Actually Fires
Create the PrometheusRule defined in nginx-high-cpu-rule.yaml; the Prometheus Operator adds it to Prometheus’ running configuration, and Prometheus evaluates the rule at the specified interval. Apply the rule: kubectl apply -f nginx-high-cpu-rule.yaml
Optionally, you can port-forward the Prometheus UI to watch the rule transition from inactive to pending to firing: kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090:9090
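For reference, nginx-high-cpu-rule.yaml could look roughly like the sketch below; the threshold, expression, and labels in the repository's version may differ. The release: monitoring label matches the default rule selector of this Helm release (unless values-kps.yaml overrides it).
# Hypothetical PrometheusRule firing PodHighCPU when nginx pods burn more than 0.8 CPU cores.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nginx-high-cpu
  namespace: monitoring
  labels:
    release: monitoring          # must match the Helm release so the operator selects the rule
spec:
  groups:
    - name: nginx.rules
      rules:
        - alert: PodHighCPU
          expr: sum(rate(container_cpu_usage_seconds_total{namespace="default", pod=~"nginx.*"}[2m])) by (pod) > 0.8
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on {{ $labels.pod }}"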
Step 8: Include Chaos to Stress the CPU
The chaos experiment is defined as a Chaos Mesh StressChaos resource. In a non-production or testing environment, you can manually test the chaos by applying it: kubectl apply -f cpu-stress.yaml
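A minimal sketch of what cpu-stress.yaml could contain is shown below; the manifest in the repository may use different selectors, workers, or duration.
# Illustrative StressChaos targeting one nginx pod with CPU load.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: nginx-cpu-stress
  namespace: chaos-testing
spec:
  mode: one                      # pick a single pod matching the selector
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  stressors:
    cpu:
      workers: 2
      load: 90
  duration: "60s"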
In a production system, for a complete event-driven approach, add a first rule with the run_playbook attribute (part of ruleset.yaml in eda-incluster.yaml) to invoke the chaos stress, like this:
- name: High CPU alert
  condition: event.alerts[0].labels.alertname == "PodHighCPU"
  action:
    run_playbook:
      name: chaos-cpu-stress.yaml
This invokes the StressChaos to drive up CPU on the application. In addition, the remediation rule remains in place so that remediation is still invoked when needed.
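The chaos-cpu-stress.yaml playbook referenced by this rule lives in the repository; a minimal sketch of what such a playbook might do, applying the StressChaos manifest from inside the cluster, is:
# Illustrative playbook that applies the StressChaos resource via the Kubernetes API.
- name: Inject CPU stress via Chaos Mesh
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Apply the StressChaos manifest
      kubernetes.core.k8s:
        state: present
        src: cpu-stress.yaml     # path assumed to be bundled alongside the playbook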
Step 9: Manual Test Without Waiting for Prometheus
You can post a dummy alert directly to EDA to verify the rule and playbook wiring:
kubectl -n eda port-forward svc/eda-listener 5001:5001
# in another terminal
curl -X POST http://localhost:5001/alerts -H 'Content-Type: application/json' -d '{"alerts":[{"labels":{"alertname":"PodHighCPU"},"annotations":{"summary":"Test"}}]}'
# should get 202 Accepted; eda logs show playbook runs
Watch the EDA logs.
kubectl -n eda logs deploy/eda-listener -f
When the high CPU event occurs on the nginx application, the defined remediation is applied and a GitHub summary issue is created. The issue records the details of the chaos event and the actions taken to remediate it, and these insights can be used for feedback.
With this hands-on walkthrough, we demonstrated how Event-Driven Ansible can seamlessly trigger and orchestrate chaos experiments in Kubernetes. By combining Chaos Mesh with EDA, Prometheus, and GitHub workflows, we built an automated feedback loop for resilience validation.
Conclusion
Event-driven chaos engineering moves Kubernetes resilience testing from ad hoc failure injection to an automated, intelligent, and continuous practice. By wiring event sources such as Prometheus alerts or Kubernetes signals into event routers and orchestration layers like EDA, teams can trigger chaos experiments exactly when the system is under stress. This not only validates recovery paths but also closes the loop with observability dashboards and feedback into CI/CD pipelines.
The result is a stronger operational posture: instead of fearing failure, organizations learn from it in real time, hardening their platforms against both predictable and unexpected disruptions. In short, event-driven chaos turns failure into actionable insight — and actionable insight into resilience by design.