Sometimes, there is an edge case we didn’t anticipate that causes issues in production.

And sometimes, there is a common use case we didn’t test sufficiently that causes issues in production.

And sometimes, production goes down. For reasons yet unknown.

For some of us, there is a production support team that handles these things. They might call us, if they need us, but they handle this stuff. For the rest of us, we need to handle the incident ourselves.

The following is a very simple framework for responding to incidents: a brief write-up of the roles and key steps involved in resolving them. It is primarily a guideline for situations such as production outages, or incidents that impact multiple users and adversely affect the ability to do business. This is a starting place. It is lightweight, but for many organizations it is all the formality they’ll ever actually need.

Roles

There are four key roles: Command, Recorder, Comms, and Research.

Each of these roles needs to be handled. One person can handle multiple roles, but it is best to have at least one individual per role whenever possible. Most roles can have multiple people, if staffing allows or the scope is large enough to warrant a larger group. Only one person can be in the command role at a time.

A fifth role, Support, is optional but very valuable in larger-scale, longer-term incidents.
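The staffing rules above (every key role covered, one person may hold several roles, only one person in command) can be checked mechanically. The following is a minimal sketch, not part of the framework itself; the function name `check_assignments`, the lowercase role labels, and the people in the roster are all made up for illustration.

```python
from collections import Counter

# Role labels are illustrative; the four key roles are required, support is optional.
REQUIRED_ROLES = {"command", "recorder", "comms", "research"}
OPTIONAL_ROLES = {"support"}


def check_assignments(assignments):
    """Return a list of problems with a roster (an empty list means it is fine).

    `assignments` is a list of (person, role) pairs, so one person may appear
    more than once and most roles may have more than one person.
    """
    problems = []
    counts = Counter(role for _, role in assignments)

    for role in sorted(counts):
        if role not in REQUIRED_ROLES | OPTIONAL_ROLES:
            problems.append(f"unknown role: {role}")

    for role in sorted(REQUIRED_ROLES):
        if counts[role] == 0:
            problems.append(f"no one is covering {role}")

    # Only one person can be in the command role at a time.
    if counts["command"] > 1:
        problems.append("more than one person in command")

    return problems


# Hypothetical roster: Ben covers two roles, research has two people.
roster = [
    ("Ana", "command"),
    ("Ben", "recorder"),
    ("Ben", "comms"),
    ("Chi", "research"),
    ("Dev", "research"),
]
print(check_assignments(roster))
```

A check like this is only useful at the start of an incident or when people rotate; in the room, Command still decides who does what.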

Command - Runs the Room

This role coordinates all efforts.
Command does not necessarily make decisions, but can do so if there is a stalemate or time does not allow for discussion.
Command is responsible for managing the approach and focus.
Command is the voice most commonly heard in the room.
No action that potentially impacts our ability to do business is to be taken without express confirmation from Command.
- For example:
  - Code deployments
  - Data updates
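Teams that run deployments or data fixes through scripts sometimes encode this rule in their tooling. The following is a small sketch under that assumption; `CommandGate`, `NotConfirmed`, and the action name are hypothetical, and the lambda stands in for a real deployment or data update.

```python
class NotConfirmed(Exception):
    """Raised when a risky action is attempted without command's confirmation."""


class CommandGate:
    """Tracks which named actions command has expressly confirmed."""

    def __init__(self):
        self._confirmed = set()

    def confirm(self, action_name):
        """Command records express confirmation for one named action."""
        self._confirmed.add(action_name)

    def run(self, action_name, action):
        """Run `action` only if it was confirmed; each confirmation is single-use."""
        if action_name not in self._confirmed:
            raise NotConfirmed(f"{action_name!r} needs confirmation from command")
        self._confirmed.remove(action_name)
        return action()


gate = CommandGate()
gate.confirm("deploy hotfix")
result = gate.run("deploy hotfix", lambda: "deployed")  # placeholder action
```

Making each confirmation single-use is a design choice in this sketch: a second deployment needs a second explicit go-ahead.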

Things command is managing:

Overall Approach
Focus
- Short term (now)
- Long term (over the coming hours)
Identify who needs to be involved
Know who is working on what

Recorder - Takes the Notes

This role records incoming data in a way that is accessible to all participants.
Recorder needs to pay attention.
- People may not explicitly share the data with Recorder.

Things recorder is managing:

Record incoming information
Make information visible to all who need it
Answer command’s questions
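A shared document or chat channel is usually enough for this. If the notes are kept in a tool instead, a timestamped log is one simple shape for them. The following is a sketch only; the `Timeline` and `Entry` names and their fields are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entry:
    at: datetime   # when the information came in
    source: str    # who or what it came from
    note: str      # the information itself


class Timeline:
    """Append-only incident notes that anyone in the room can read."""

    def __init__(self):
        self._entries = []

    def record(self, source, note, at=None):
        """Record incoming information; defaults to the current UTC time."""
        entry = Entry(at or datetime.now(timezone.utc), source, note)
        self._entries.append(entry)
        return entry

    def search(self, text):
        """Entries whose note mentions `text`, in the order they were recorded."""
        needle = text.lower()
        return [e for e in self._entries if needle in e.note.lower()]

    def render(self):
        """Plain-text view suitable for pasting where everyone can see it."""
        return "\n".join(
            f"{e.at:%H:%M:%S} [{e.source}] {e.note}" for e in self._entries
        )
```

`search` is there to help the recorder answer command's questions quickly, for example "when did we first see the errors?".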

Comms - Shares the Status

This role manages communication, primarily external and secondarily internal.
Comms sends status updates to stakeholders on a regular cadence.
- Brief Update
  - Incident Name (include incident ID if one exists)
  - Change in status since last update
  - ETA to resolution
    - TBD
    - Days or Hours
    - Complete
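The brief update outlined above can be turned into a small template so that every update has the same shape. This is a minimal sketch; the function name `format_status_update`, the field labels, and the sample incident name and ID are made up for illustration.

```python
def format_status_update(name, change, eta="TBD", incident_id=None):
    """Build a brief update: incident name, change since last update, ETA.

    `eta` is free text such as "TBD", "2 hours", "3 days", or "Complete".
    """
    title = f"{name} ({incident_id})" if incident_id else name
    return "\n".join([
        f"Incident: {title}",
        f"Change since last update: {change}",
        f"ETA to resolution: {eta}",
    ])


# Hypothetical incident used only as an example.
print(format_status_update(
    "Checkout outage",
    "Root cause identified; fix is being tested.",
    eta="2 hours",
    incident_id="INC-123",
))
```

Keeping the ETA as free text matches the three cases in the outline: unknown, an estimate in days or hours, or complete.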

Things comms is managing:

Understand approach and current focus
Send out regular status updates
Contact anyone we need who is not yet involved

Research - Finds the Clues

Research needs to ensure Command and Recorder have information as soon as possible.

Things research is managing:

Gather the evidence we need
Run queries
Look through code

Support - Helps the Team

This role is available for support and may switch into any other role as needed.
Keep room organized
Runner - go get people and things needed
- Handle food/drink
Fill in for other roles as needed

Steps

Identification

Identify the specific issue.
If there are multiple issues, make note and determine which one to address first.
Limit WIP - Avoid the temptation to run multiple issues at the same time.

Containment

Take steps to contain the issue - how do we stop it from having further impact?
Disable systems? Pause operations?
Communicate to customers?

Impact Analysis

What is the scope of the issue?
How many users are involved?
What communication is required?

Root Cause Analysis

What is causing the issue?
Can we verify the cause?
Can we duplicate the issue in a lower environment?

Mitigation

What are the steps to correct the issue?
Do we need to update code or systems?
Do we need to update any data?

Retrospect

How did we do?
- What about the process worked?
- What about the process didn’t work?
- What do we want to do differently in the future?
What about the issue?
- What did we learn?
- What could we have done to prevent this?

A Lightweight Incident Response Framework