A Lightweight Incident Response Framework
Sometimes, there is an edge case we didn’t anticipate that causes issues in production.
And sometimes, there is a common use case we didn’t test sufficiently that causes issues in production.
And sometimes, production goes down. For reasons yet unknown.
For some of us, there is a production support team that handles these things. They might call us, if they need us, but they handle this stuff. For the rest of us, we need to handle the incident ourselves.
The following is a very simple framework for responding to incidents: a brief write-up of the roles and key steps involved. It primarily serves as a guideline for situations such as production outages or incidents that affect multiple users and impair the ability to do business. This is a starting place. It is lightweight, but for many organizations it is all the formality they’ll ever actually need.
Roles
There are four key roles: Command, Recorder, Comms, and Research.
Each of these roles needs to be handled. One person can handle multiple roles, but it is best to have at least one individual per role whenever possible. Most roles can have multiple people if staffing allows or the scope is large enough to warrant a larger group. Only one person can be in the Command role at a time.
A fifth role, Support, is optional but very valuable in larger-scale, longer-term incidents.
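The staffing rules above can be sketched as a small roster check. This is only an illustration under stated assumptions, not part of the framework itself: the `Role` enum, the `validate_roster` function, and the people's names are all hypothetical.

```python
from enum import Enum


class Role(Enum):
    COMMAND = "Command"
    RECORDER = "Recorder"
    COMMS = "Comms"
    RESEARCH = "Research"
    SUPPORT = "Support"  # optional fifth role


# The four roles that must always be covered by someone.
REQUIRED_ROLES = {Role.COMMAND, Role.RECORDER, Role.COMMS, Role.RESEARCH}


def validate_roster(roster: dict[Role, list[str]]) -> list[str]:
    """Return a list of problems with a roster (an empty list means it is OK)."""
    problems = []
    for role in sorted(REQUIRED_ROLES, key=lambda r: r.value):
        if not roster.get(role):
            problems.append(f"{role.value} is not covered")
    if len(roster.get(Role.COMMAND, [])) > 1:
        problems.append("Command must be held by exactly one person at a time")
    return problems


# One person may hold several roles; Support may be left unstaffed.
roster = {
    Role.COMMAND: ["Ana"],
    Role.RECORDER: ["Ben"],
    Role.COMMS: ["Ben"],
    Role.RESEARCH: ["Chen", "Dee"],
}
print(validate_roster(roster))
```

A check like this is most useful at the start of an incident and whenever someone hands off a role.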
Command - Runs the Room
This role coordinates all efforts.
Command does not necessarily make decisions, but can do so if there is a stalemate or time does not allow.
Command is responsible for managing the approach and focus.
Command is the voice most commonly heard in the room.
No action that could affect our ability to do business is to be taken without express confirmation from Command.
For example:
Code deployments
Data updates
Things Command is managing:
Overall Approach
Focus
Short term (now)
Long term (in hours)
Identify who needs to be involved
Know who is working on what
Recorder - Takes the Notes
This role records incoming data in a way that is accessible to all participants.
Recorder needs to pay attention.
People may not explicitly share data with Recorder.
Things Recorder is managing:
Record incoming information
Make information visible to all who need it
Answer Command’s questions
Comms - Shares the Status
This role manages communication: primarily external, secondarily internal.
Status updates to stakeholders on a regular cadence
Brief Update
Incident Name (include incident ID if one exists)
Change in status since last update
ETA to resolution
TBD
Days or Hours
Complete
Things Comms is managing:
Understand approach and current focus
Send out regular status updates
Contact anyone we need who is not yet involved
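The brief update described above could be templated so every status message has the same shape. The following is a minimal sketch; the `StatusUpdate` class, its field names, and the example incident and ID are hypothetical and only mirror the fields listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusUpdate:
    incident_name: str
    change_since_last: str
    eta: str = "TBD"  # "TBD", an estimate in days or hours, or "Complete"
    incident_id: Optional[str] = None  # include only if one exists

    def render(self) -> str:
        """Format the update as three short lines for stakeholders."""
        title = self.incident_name
        if self.incident_id:
            title = f"{title} ({self.incident_id})"
        return "\n".join([
            f"Incident: {title}",
            f"Change since last update: {self.change_since_last}",
            f"ETA to resolution: {self.eta}",
        ])


# Hypothetical example values for illustration only.
update = StatusUpdate(
    incident_name="Checkout errors",
    change_since_last="Root cause identified; fix in review",
    eta="2 hours",
    incident_id="INC-0001",
)
print(update.render())
```

Keeping the format fixed lets stakeholders scan each update for what changed without rereading the whole history.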
Research - Finds the Clues
Research needs to ensure Command and Recorder have information as soon as possible.
Things Research is managing:
Gather the evidence we need
Run queries
Look through code
Support - Helps the Team
This role is available to help wherever needed and may switch into any other role.
Keep room organized
Runner - go get people and things needed
Handle food/drink
Fill in for other roles as needed
Steps
Identification
Identify the specific issue.
If there are multiple issues, make note and determine which one to address first.
Limit work in progress (WIP) - Avoid the temptation to work multiple issues at the same time.
Containment
Take steps to contain the issue - how do we stop it from having further impact?
Disable systems; pause operations?
Communicate to customers?
Impact Analysis
What is the scope of the issue?
How many users are involved?
What communication is required?
Root Cause Analysis
What is causing the issue?
Can we verify the cause?
Can we duplicate the issue in a lower environment?
Mitigation
What are the steps to correct the issue?
Do we need to update code or systems?
Do we need to update any data?
Retrospect
How did we do?
What about the process worked?
What about the process didn’t work?
What do we want to do differently in the future?
What about the issue?
What did we learn?
What could we have done to prevent this?
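The steps above can be sketched as an ordered checklist for a single incident. This is a simplified illustration: it assumes the steps are worked strictly in the order listed, which real incidents may not follow, and the `Step` and `Incident` names and the example notes are hypothetical.

```python
from enum import IntEnum


class Step(IntEnum):
    IDENTIFICATION = 1
    CONTAINMENT = 2
    IMPACT_ANALYSIS = 3
    ROOT_CAUSE_ANALYSIS = 4
    MITIGATION = 5
    RETROSPECT = 6


class Incident:
    """Tracks one issue at a time, in keeping with the advice to limit WIP."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.step = Step.IDENTIFICATION
        self.log: list[str] = []

    def advance(self, note: str) -> Step:
        """Record a note for the current step, then move to the next one."""
        self.log.append(f"{self.step.name}: {note}")
        if self.step < Step.RETROSPECT:
            self.step = Step(self.step + 1)
        return self.step


# Hypothetical walk through the first two steps.
incident = Incident("Checkout errors")
incident.advance("Payment calls are failing for some users")
incident.advance("Paused the affected checkout flow")
print(incident.step.name)
print(incident.log)
```

The log doubles as raw material for the retrospective, since it captures what was decided at each step.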