Breaking down an Incident Chain


Organizations are like trees: they grow and spread their branches wide. When an incident happens in a giant such as IBM, Microsoft, or Google, it is tough to manage. The most effective approach to incident management is to identify the cause of the incident and mitigate it. But nowadays risks can be interconnected and hard to identify. When we look for the cause through root cause analysis, we often miss the interconnected risks, and the process gets harder the deeper we dig into the incident.

Let me share my view of how to do this in a simpler way. Suppose an incident happens in a big organization, say a service fails for a few minutes. How do we identify the cause of this incident? Suppose we trace one cause after another until we find the root. The organization may be spread across the globe, and a large service failure may be caused by independent risks that occurred in different parts of the world. Tracing them one at a time is complicated and can lead you to a risk that is connected to a different incident. This approach may also take more time to identify the root and resolve the incident.

Let’s solve the issue in a different manner. Suppose we have the same incident as above. The following steps can be used to find the cause and mitigate it.

Step 1: Break down the incident into sections based on different factors like time, place, affected area, criticality, etc.

Step 2: Group the identified incidents into sections depending on the type of incident.

Step 3:  For each grouped incident:

a.   Analyze the incident.

b.   Find the owners responsible for the incident.

c.   Break down into further sections if necessary.

d.   Find the cause of each section.

Step 4: Map all the causes to the sections and form a tree-like structure for the incident where the root node is now the initial incident.

After Step 4, we will have the initial incident broken down into different sections, with a risk mapped to each one.
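To make the tree from Steps 1 to 4 concrete, here is a minimal sketch in Python. It is only one possible way to model it: the Section class, the mapped_causes helper, and every incident, owner, and cause name below are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Section:
    """One node of the incident tree: the initial incident or a section of it."""
    name: str
    owner: Optional[str] = None   # who is responsible (Step 3b)
    cause: Optional[str] = None   # cause found for this section (Step 3d)
    children: list["Section"] = field(default_factory=list)

    def add(self, child: "Section") -> "Section":
        """Attach a sub-section (Step 1 / Step 3c) and return it."""
        self.children.append(child)
        return child


def mapped_causes(node: Section, path: tuple[str, ...] = ()) -> list[tuple[str, str]]:
    """Walk the tree depth-first and return (path to section, cause)
    for every section that has a cause recorded (Step 4)."""
    here = path + (node.name,)
    found = []
    if node.cause is not None:
        found.append((" > ".join(here), node.cause))
    for child in node.children:
        found.extend(mapped_causes(child, here))
    return found


# Hypothetical incident: the root node is the initial incident.
root = Section("Checkout service outage")
eu = root.add(Section("EU region", owner="EU platform team"))
eu.add(Section("Database", owner="DBA team", cause="Connection pool exhausted"))
eu.add(Section("Load balancer", owner="Network team", cause="Stale health-check config"))
root.add(Section("US region", owner="US platform team", cause="Expired TLS certificate"))

for where, cause in mapped_causes(root):
    print(f"{where}: {cause}")
```

Each printed line pairs a section with its cause, which is the mapping that the next step works from.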

Step 5: Analyze the mapped incidents and risks. Find which risks are more critical, and mitigate every risk one by one according to its criticality.

Finally, when all risks are mitigated, we will have the initial incident resolved, along with other unnoticed incidents and risks that were found while dividing it into sections.
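Step 5 comes down to ordering the mapped risks by criticality and working through them from the top. The sketch below assumes a numeric criticality scale from 1 (low) to 5 (high), which is my own assumption and not something the steps prescribe. The Risk class, the mitigation_order function, and the sample risks are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    section: str      # the section of the incident this risk is mapped to
    description: str
    criticality: int  # assumed scale: 1 (low) to 5 (high)


def mitigation_order(risks: list[Risk]) -> list[Risk]:
    """Return the risks sorted from most to least critical.
    Python's sort is stable, so equally critical risks keep their input order."""
    return sorted(risks, key=lambda r: r.criticality, reverse=True)


# Hypothetical risks found while breaking an incident into sections.
risks = [
    Risk("EU region > Load balancer", "Stale health-check config", 2),
    Risk("US region", "Expired TLS certificate", 5),
    Risk("EU region > Database", "Connection pool exhausted", 4),
]

for step, risk in enumerate(mitigation_order(risks), start=1):
    print(f"{step}. [{risk.criticality}] {risk.section}: {risk.description}")
```

A single number is the simplest possible ranking. In practice criticality might combine impact, urgency, and how many sections share the same risk.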

To make the above concept easier to understand, let me describe it in a simple way by considering the organization as a tree and the initial incident as a branch falling down on a fine evening. We first break down the incident (Step 1) and identify possible causes: a heavy wind (considering the climate), a woodpecker that comes there often, the kids who usually climb on the same branch, and so on. So we have different groups that caused the same incident (Step 2). Now we analyze each group and find the owners, i.e., climate change, the woodpecker, and the kids. We can break the incident down further: the kids climb the tree because the fruits on its lower branches are all moldy. This is a further incident and can be grouped into other sections. So, taking each identified section, i.e., moldy fruits, kids, woodpecker, and climate change, we arrive at different causes and solve them one by one according to criticality. For example, moldy fruit is a critical problem and can be solved by applying pesticides or giving the tree proper nourishment.

Similarly, organizations can also do incident management and mitigate risks and other incidents in the chain by following the above steps.