You are about to break something. It might not be today, and it might not be this week, but you will break something that causes downtime or loss of data. It is almost 100% certain.

Or maybe you are too good to break something (you aren’t); even so, you will certainly be privy to something breaking.

When this happens, it is easy to lose your cool and make rash decisions that only serve to prolong the issue or, worse, cause new ones.

Here are some general steps I feel you should follow. It will seem like a lot, but the work can be delegated between team members and many items are quick to do;

  1. Evaluate
  2. Write down everything
  3. Communicate internally
  4. Communicate externally
  5. Fix
  6. Communicate
  7. Discuss
  8. Document

Step zero in all of this is to remain calm; I am hoping this document will help. Fixing a serious issue requires your best work, and you do not do your best work when you feel pressured and overwhelmed.

Evaluate

Stop what you are doing and evaluate. Do not start thinking about solutions; they will come. First you need to be very aware of what the current issue is, which will help you to verbalise it.

This stage also helps to work out if it is broken for everyone or just you. Maybe a poorly configured web server is blocking traffic from your office (or all traffic except from your office!).

Write down everything

Writing down what has happened helps to solidify what the actual problem is and can help at later stages when trying to retrace steps or recreate issues in different environments.

Writing down everything sounds extreme and potentially a waste of time; it isn’t. When you are rushing to fix something, you will forget that you changed the name of a file five steps before the thing you are “100% sure” is the cause.

Communicate internally

You need to let your teammates know what is happening; share with them what you have jotted down.

You especially need to let someone senior know. Given that everyone breaks everything all the time, more senior developers have broken more stuff and have had to fix very similar issues before.

If there is a Slack channel or email list for the project, these are ideal places. If there is someone senior physically near you then it is great to talk to them, but make sure you share the details with everyone.

At no point should you attribute blame. If it is obvious that one person’s change caused the issue, they need to be made aware so that a timely fix can be sorted, but don’t pass the buck and don’t point the finger.

Communicate externally

You need to let the rest of the company or at least the wider non-technical teams know what is happening.

How much information you give them is up to you and will depend on how technical the other teams are and whether the details are of benefit to them.

Most of the time this can be as simple as “The site has had to be taken down for some emergency maintenance, we didn’t plan this and are working hard to fix it”.

It is not your job to tell Marketing/Sales that they should start cancelling any promotion of the site, or whether they need to start planning for a slew of Twitter activity. You should give them as much information as necessary for them to make those calls.

Slack channels and email lists will again play a huge role here.

As with internal communication, never assign blame, even if this was an external event: Marketing sending a campaign that would 100x your regular traffic, for example.

Fix

For obvious reasons it is very hard to detail exactly what you need to do to fix the issue. Using the list of what happened, it should be relatively straightforward to retrace your steps and start to fix from there.

During the fixing stage keep the rest of the development team updated on what you are trying and what has/hasn’t worked. If it feels relevant also keep other teams abreast of larger updates.

It is very hard to over communicate at a time like this.

In the case of data loss, if there are no recent backups worth exploring, there may not be a fix; time to move on to the next stage.

Communicate

Once a fix has been applied (or not applied) it is time to communicate again. You can let the development team know all the gory details. Other business units may just need to hear “We are back”.

If there are any outstanding known issues these should be communicated as well. For example “The site is back up but we can’t register new members at the moment”.

In the case of data loss you need to let the relevant people know what has been lost; depending on the data, there may be other manual ways to get it back.

Discuss

Serious issues are an occupational hazard for development teams but that isn’t to say that we shouldn’t try and mitigate them as much as possible.

A discussion needs to happen within the development team (or, if the initial issue came from a different department, between both departments) to work out what needs to happen to make sure the issue doesn’t happen again. This can take many forms;

  • A new or different technical solution
  • Training of staff to make them aware of issues
  • A new policy on how something should be done
  • A combination of the above

Document

Anything that has been discussed and agreed upon needs to be documented, and action points for individuals should be shared.

This document should be shared with relevant management and senior team members.

Here is a sample outline you could use for your document.

On {Wednesday 1st Feb} our {Product} website experienced {3} hours of downtime.

Our development team worked together to understand and fix the issue.

At the time of writing the site is up and we don’t anticipate any further issues.

The issue was {some faulty code that consumed too many resources on the server, grinding it to a halt}. The fix was {to roll back the code}.

As a result of this we have implemented the following;

  • { Large changes are load tested on a staging environment }
  • { During code review staff have been asked to remain vigilant on potentially slow methods }

Major changes may need to find their way into your roadmap.