There will be screw-ups

Things will go wrong. Even if you have great deployment processes, good test coverage and QA, there will be issues.

Hopefully you'll catch them before they hit production, but they can still leave coworkers unproductive and irritated when you check in some piece of crud that breaks every scenario except the happy path you were working on.

The first thing to do is to assign blame.

Why? It is so human to do so. Everybody does it, so get it done right away.

It is also very unproductive to spend time assigning blame when that time should be spent fixing whatever it is and then making it very unlikely to happen again. Sometimes it's an all-hands-on-deck scenario, and anyone focused on avoiding blame, or assigning it, is not being helpful.

The Promogogo way of bug-fixing

  • 1-
    Assign blame

    At Promogogo we assign all blame to past-employees. You wouldn't believe what that lot got up to. Even if it is something checked in this morning and no one has quit, it is clearly the fault of a past-employee.

    Current employees have already learned from the experience and will not repeat this (in the near future).

  • 2-
    Figure out what's wrong

    Once people are done avoiding or assigning blame, they can focus on finding what's wrong. Sometimes it's obvious, but other times it is not.

    We like to start at one end and methodically go through every step until we find what's broken. It doesn't matter if nothing has changed or we're not seeing any errors on the API side: something is clearly wrong somewhere, and going through it step by step, without having to worry about blame, is a simple thing to do.

  • 3-
    Fix it

    So some pesky past-employee checked in something that introduced a potential null-pointer error (or a timeout, or an unindexed data lookup)? They've already been blamed, so their current and enlightened self can get on with fixing the thing as soon as possible.

  • 4-
    Do a post-mortem once the dust settles

    Figure out whether something could have been done to prevent the issue.

    If yes, do that. If no, don't do anything.

    Learning from (bitter) experience is very valuable. That's how you can lift the curse.
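
The "start at one end and go through every step" habit from step 2 can be sketched in a few lines of Python. This is only an illustration, not our actual tooling: the function name first_broken_step, the step names and the checks themselves are all made up for the example.

```python
from typing import Callable, List, Optional, Tuple

# A check returns True when its step behaves as expected.
Check = Callable[[], bool]


def first_broken_step(steps: List[Tuple[str, Check]]) -> Optional[str]:
    """Walk the steps in order and return the name of the first one that fails.

    A check that raises is treated as failed: not seeing errors on the API
    side does not prove that every step is healthy.
    """
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return name
    return None  # every step passed


# Hypothetical request path, ordered from one end to the other.
steps = [
    ("client sends request", lambda: True),
    ("API accepts request", lambda: True),
    ("database lookup returns rows", lambda: False),  # simulated fault
    ("response is rendered", lambda: True),
]

print(first_broken_step(steps))
```

The point of the ordering is that nothing gets skipped because "that part hasn't changed". Note that the function returns the name of a step, never the name of a person.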