Testing in production: Yes, you can (and should)

I wrote a piece recently about why we are all distributed systems engineers now. To my surprise, lots of people objected to the observation that you have to test large distributed systems in production. 

It seems testing in production has gotten a bad rap—despite the fact that we all do it, all the time.

Maybe we associate it with cowboy engineering. We hear “testing in production” and assume this means no unit tests, functional tests, or continuous integration.

It’s good to try and catch things before production—we should do that too! But these things aren’t mutually exclusive. Here are some things to consider about testing in production.

1. You already do it

There are lots of things you already test in prod—because there’s no other way you can test them. Sure, you can spin up clones of various system components or entire systems, and capture real traffic to replay offline (the gold standard of systems testing). But many systems are too big, complex, and cost-prohibitive to clone.

Imagine trying to spin up a copy of Facebook for testing (with its multiple, globally distributed data centers). Imagine trying to spin up a copy of the national electrical grid. Even if you succeed, next you need the same number of clients, the same concurrency, same pipelining and usage patterns, etc. The unpredictability of user traffic makes it impossible to mock; even if you could perfectly reproduce yesterday’s traffic, you still can’t predict tomorrow’s.

It’s easy to get dragged down into bikeshedding about cloning environments and miss the real point: Only production is production, and every time you deploy there you are testing a unique combination of deploy code + software + environment. (Just ask anyone who’s ever confidently deployed to “Staging”, and then “Producktion” (sic).) 

2. So does everyone else

You can’t spin up a copy of Facebook. You can’t spin up a copy of the national power grid. Some things just aren’t amenable to cloning. And that’s fine. You simply can’t usefully mimic the qualities of size and chaos that tease out the long, thin tail of bugs or behaviors you care about.

And you shouldn’t try.

Facebook doesn’t try to spin up a copy of Facebook either. They invest in the tools that allow thousands and thousands of engineers to deploy safely to production every day and observe people interacting with the code they wrote. So does Netflix. So does everyone who is fortunate enough to outgrow the delusion that this is a tractable problem.

3. It’s probably fine

There’s a lot of value in testing… to a point. But if you can catch 80% to 90% of the bugs with 10% to 20% of the effort—and you can—the rest is more usefully poured into making your systems resilient, not preventing failure.

You should be practicing failure regularly. Ideally, everyone who has access to production knows how to do a deploy and rollback, or how to get to a known-good state fast. They should know what a normal operating system looks like, and how to debug basic problems. Knowing how to deal with failure should not be rare.

If you test in production, dealing with failure won’t be rare. I’m talking about things like, “Does this have a memory leak?” Maybe run it as a canary on five hosts overnight and see. “Does this functionality work as planned?” At some point, just ship it with a feature flag so only certain users can exercise it. Stuff like that. Practice shipping and fixing lots of small problems, instead of a few big and dramatic releases.

4. You’ve got bigger problems

You’re shipping code every day and causing self-inflicted damage on the regular, and you can’t tell what it’s doing before, during, or after. It’s not the breaking stuff that’s the problem; you can break things safely. It’s the second part—not knowing what it’s doing—that’s not OK. This bigger problem can be addressed by:

  • Canarying. Automated canarying. Automated canarying in graduated levels with automatic promotion. Multiple canaries in simultaneous flight!
  • Making deploys more automated, robust, and fast (5 minutes on the upper bound is good)
  • Making rollbacks wicked fast and reliable
  • Using instrumentation, observability, and other early warning signs for staged canaries
  • Doing end-to-end health checks of key endpoints
  • Choosing good defaults, feature flags, developer tooling
  • Educating, sharing best practices, standardizing practices, making the easy/fast way the right way
  • Taking as much code and as many back-end components as possible out of the critical path
  • Limiting the blast radius of any given user or change
  • Exploring production, verifying that the expected changes are what actually happened. Knowing what normal looks like

These things are all a great use of your time. Unlike staging and test environments, which are notoriously fragile and flaky and hard to keep in sync with prod.

Do those things

Release engineering is a systematically underinvested skillset at companies with more than 50 people. Your deploys are the cause of nearly all your failures because they inject chaos into your system. Having a staging copy of production is not going to do much to change that (and it adds a large category of problems colloquially known as “it looked just like production, so I just dropped that table…”).

Embrace failure. Chaos and failure are your friends. The issue is not if you will fail, it is when you will fail, and whether you will notice. It’s between whether it will annoy all of your users because the entire site is down, or if it will annoy only a few users until you fix it at your leisure the next morning.

Once upon a time, these were optional skills, even specialties. Not anymore. These are table stakes in your new career as a distributed systems engineer.

Lean into it. It’s probably fine.