what happens when you break prod

One of the worst feelings in the world is deploying a bad change that breaks your product / app / service and disappointing your customers.

Three months into my career at Amazon (my first full-time job), I released a change that took down one of our services in prod. Here’s exactly what happened:

First, you find out fast

Within minutes, thousands of API calls were failing. People started posting in public Slack channels and pinging the on-call directly. The on-call, in this case, was me.

There's a specific kind of stomach-drop that happens when you realize the thing everyone is panicking about is the thing you shipped an hour ago. The best thing you can do is stay calm, let your team know something is going on, then get to work.

Then you stop the bleeding

The first job isn't to fully understand the bug. It's to make it stop. I'd just released a change, so my first instinct was to roll back that change, fast.

I’ve seen WAY too many developers (especially junior ones) get too focused on diagnosing the exact root cause, or hunting for the line of code that is causing the issue, when the top goal is to restore service to users. You can find the EXACT issue later.

Anyway, the rollback took over 10 minutes, which is an eternity when a service is down. But once it finished, things recovered. I told everyone to retry whatever had failed in the last 15 minutes, and that was that.
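If it helps to see the "mitigate first, diagnose later" rule written down, here is a minimal sketch of it in Python. Everything in it is hypothetical and made up for illustration: the Deploy record, the choose_first_action function, the 5% error threshold, and the two-hour "suspect window" are my assumptions, not what our tooling at Amazon actually looked like.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical record of the most recent release."""
    version: str
    previous_version: str
    deployed_at: datetime


def choose_first_action(
    error_rate: float,
    last_deploy: Optional[Deploy],
    now: datetime,
    error_threshold: float = 0.05,                  # assumed: 5% failures means "down"
    suspect_window: timedelta = timedelta(hours=2), # assumed: recent deploys are suspects
) -> str:
    """Pick the first move during an incident. Root cause comes later."""
    if error_rate < error_threshold:
        return "monitor"
    # Something shipped recently: undo it before trying to understand it.
    if last_deploy is not None and now - last_deploy.deployed_at <= suspect_window:
        return f"rollback to {last_deploy.previous_version}"
    # No recent change to blame, so pull in help and start digging.
    return "escalate and investigate"


now = datetime(2024, 1, 1, 12, 0)
shipped_an_hour_ago = Deploy("v2", "v1", now - timedelta(hours=1))
print(choose_first_action(0.40, shipped_an_hour_ago, now))
```

The point of the sketch is the ordering: the rollback branch comes before any investigation, because a recent deploy is the cheapest suspect to rule out.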

Then you write it down

After the dust settled, I was assigned to write a document called a postmortem explaining exactly what happened. At Amazon these are a whole ritual [AWS Example]. The purpose is not to assign blame, and it's definitely not to get anyone fired. It's to figure out what happened and change the process so it can't happen again.

In my case, the culprit was a one-line change. It had been reviewed and approved by another engineer. It had passed through multiple test environments, and baked in a pre-prod environment for multiple days. And it still took down prod. I wrote all of that up without names, and then laid out the process changes that would stop a change like it from going out the same way again.

Always focus on the process that allowed this specific failure case to happen. Junior engineers (like myself at the time) will write bad code. You should have systems in place that prevent that code from reaching prod.

Common examples of these systems: integration tests, unit tests, pre-prod environments, bake times, and code reviews.
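To make one of those concrete, here is a rough sketch of what a bake-time gate could look like in Python. The BakeSample record, the bake_verdict function, and the thresholds (60 minutes, 1% errors) are all hypothetical names and numbers I picked for illustration. A gate like this is only as good as the traffic it sees while baking, so it would not necessarily have caught my one-line change.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BakeSample:
    """Hypothetical metrics snapshot taken while a change bakes in pre-prod."""
    minutes_since_deploy: int
    requests: int
    errors: int


def bake_verdict(
    samples: List[BakeSample],
    min_bake_minutes: int = 60,    # assumed minimum bake time
    max_error_rate: float = 0.01,  # assumed: more than 1% errors fails the bake
) -> str:
    """Return "promote", "hold", or "rollback" for a baking change."""
    total_requests = sum(s.requests for s in samples)
    total_errors = sum(s.errors for s in samples)

    # Too many errors: pull the change back, no matter how long it has baked.
    if total_requests > 0 and total_errors / total_requests > max_error_rate:
        return "rollback"

    # Not enough time or no traffic yet: keep waiting instead of guessing.
    baked_minutes = max((s.minutes_since_deploy for s in samples), default=0)
    if baked_minutes < min_bake_minutes or total_requests == 0:
        return "hold"

    return "promote"


healthy = [BakeSample(30, 1000, 2), BakeSample(60, 1000, 3)]
print(bake_verdict(healthy))
```

Notice that "no traffic" counts as "hold", not "promote". A change that nobody exercised in pre-prod has not really been tested, however many days it sat there.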

So did I get fired?

This wasn't bad for my career. If anything, it helped. A lot of people (including our director) suddenly knew my name. A lot of people read my writing for the first time. I'd accidentally introduced myself to the entire org and some of the most important people who would eventually go on to approve my promotion.

Most people will deploy a bad change at some point in their career as a software engineer, but very few of them will handle it gracefully and let it move them up the ladder.

Break prod. Just know what to do next.

Have a good weekend!
Arjay.

P.S. A lot of people have been enjoying the new updates on The Daily Dev! Welcome to all our new users, and join us here: https://apps.apple.com/us/app/the-daily-dev/id6758278144
