10/1/23

Introduction

How can we best solve complex problems under time pressure? This is a challenge I have faced often throughout my career. Across industries, successful problem-resolution frameworks all seem to share a common factor - they divide resolution into three steps, executed in order: stabilize, analyze, and remediate. In this note I discuss these phases, why each matters, and the anti-patterns that usually lead to poor problem resolution. I make some references to the IT field, but these patterns and anti-patterns apply to problem resolution across all industries.

Stabilization

When impactful problems manifest, they cause immediate pain and noise within the organization, which in turn creates great pressure to “do something” in response. The correct action here is stabilization. To stabilize the system means (1) to get it back to the highest level of service possible, (2) within the existing operating pattern, (3) while making the minimum number of changes.

I suggest taking a moment to review very carefully those three parts of stabilization. Resolutions usually go off the rails because the team tries to skip past stabilization to other phases, or misses one of the three core aspects of stabilization.

Stabilizing a system does not mean getting it working properly. Instead, the goal is to restore the best function the system is currently capable of. Consider a car suddenly warning of tire-pressure loss. The driver might choose to slow down and limp along in the right lane, to stop and add air to the tire, to change the tire for an undersized emergency spare, or even to keep driving on the rim. None of these actions restores full function, but they all preserve some function. Another goal of stabilization is containment: keeping the problem from getting worse, or at least minimizing its rate of degradation. When a house loses a section of roofing shingle, we cover it with a plastic tarp. This isn’t a long-term solution, but it does temporarily prevent further water damage.

Proper stabilization happens within an existing operating pattern, meaning a known activity that the team has documented and practiced. Stabilization is not the place for innovation or extensive analysis. The phrase “I have an idea… we could try” often signals a problematic departure from stabilization activities. Brainstorming has its place in problem resolution, but not in the stabilization phase. Stabilize first! Failing over to a well-tested disaster recovery platform is an example of a stabilization activity within an existing operating pattern.

The reason we stabilize a system in peril before exploring new ideas is that most systems operate properly only in a narrow zone of stability. Picture a rock in a sea of lava - staying on the rock is very important! Wandering into the vast space of configurations where the system does not work properly usually leads only to suffering. With an unstable system, first get back to the island of stability and cling to it as best you can, even if function is degraded. Seek improvement later.

Cling to the island of stability - leave it at your peril!

For a similar reason, the conservative approach of making the minimal number of changes is usually the best path during stabilization. The system was stable. It became unstable through a change, either in the system itself or in its environment. Since most systems are inherently stable only in a narrow range of parameters, most likely only one change, or perhaps two, pushed the system into instability. Therefore we usually need to make only one change to put it back.

Believing that the system needs significant reconfiguration is usually a fallacy (the “apocalyptic thinking fallacy”) that leads to serious error. Likewise, making a chain of successive changes in hopes of restoring better function is generally a grave mistake (“wandering away from the island of stability”) that greatly complicates recovery. Stabilization is an operational activity, not a time for re-engineering the system. When stabilizing, make cautious and methodical attempts to reverse changes or to replace suspect components, following known patterns. Accept that a stable system with degraded functionality is far superior to an unstable system. Pursue further functional improvement during remediation - two phases in the future - not during stabilization.

Common stabilizations include identifying and rolling back the last change made to the system, or taking a malfunctioning component offline and repairing or replacing it.
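As a minimal sketch of the first of these patterns - the change-log structure and the revert mechanism here are illustrative assumptions, not a reference to any particular tool - rolling back only the most recent change might look like:

```python
# Hypothetical sketch: stabilize by reverting only the most recent change.
# One change likely pushed the system off the island of stability, so one
# change should put it back.

def stabilize_by_rollback(change_log, revert):
    """Revert the single most recent change, then stop.

    change_log: list of change records, oldest first.
    revert: callable that undoes one change record.
    """
    if not change_log:
        return None  # nothing to roll back; proceed to analysis instead
    last_change = change_log[-1]
    revert(last_change)  # one change out -> one change back
    return last_change

# Usage: record what was reverted, as input to the analysis phase.
reverted = []
log = ["deploy v1.4", "config: raise pool size", "deploy v1.5"]
stabilize_by_rollback(log, reverted.append)
print(reverted)  # prints ['deploy v1.5']
```

The point of the sketch is what the function does not do: it never chains multiple reverts or improvises new configurations, which would be wandering away from the island of stability.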

Analysis

Once a system has been stabilized, or in parallel with the stabilization activity, another team analyzes the problem. Analysis aims (1) to characterize with data what is actually happening, then (2) to identify the underlying structural condition causing the problem so it can be remediated.

Analysis first requires gathering data. This part is important because humans naturally think in stories, or narratives. People tend to latch on to the first story that explains their problem, but there are almost always several other stories that could also explain it. Proper analysis requires identifying the different narratives, turning them into hypotheses, and trying to falsify them through data, ideally until only one survives. At this point the team has high confidence about the underlying cause of the problem and can make a plan to remediate it.

A team that seizes upon the first story that provides an explanation, without seeking alternative explanations and trying to falsify them all through data gathering and proper analysis, has fallen prey to the “narrative fallacy”. This can lead to the team confidently pursuing a resolution that doesn’t significantly improve the problem, wasting time and resources and eroding trust in its ability to solve problems.

Humans naturally form and follow narratives, but this is poor engineering practice. Much of training engineers in problem resolution consists of training teams out of the instinct to follow the narrative fallacy, replacing it with systematic methods of data-based analysis. A key tell of faulty analysis is a team’s inability to produce any significant data points, relying instead on the narrative as if it were the explanation rather than a hypothesis to be disproved.

There are times when a team must rely on the opinion of specialists, although it is always best to falsify these opinions through data if at all possible. When the available data cannot confirm or disconfirm the specialists’ opinions, the team should gather multiple expert opinions, ideally expressed in writing, because writing clarifies thinking. Those opinions should ideally converge to a consensus with some dissenting outliers. The team should then build a remediation strategy around the consensus and, if this does not work, return to examine the outliers.

If specialists cannot achieve consensus, the team should choose the opinion whose resulting remediation involves the least possible harm, try that, and continue through the remaining opinions and their associated remediations in order of increasing effort, cost, and harm. However, this approach is painful. Continuing to seek specialist analysis and trying to falsify it through data gathering until consensus emerges usually proves a better path.
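The least-harm-first ordering amounts to a simple sorted iteration. As a sketch - the opinion fields, harm scores, and success check below are illustrative assumptions:

```python
# Hypothetical sketch: when experts disagree, attempt remediations in
# order of increasing harm/cost until one resolves the problem.

def try_remediations(opinions, attempt):
    """Try each expert's remediation, least harmful first.

    opinions: list of dicts with a 'remediation' and a numeric 'harm' score.
    attempt: callable returning True if the remediation resolved the issue.
    """
    for opinion in sorted(opinions, key=lambda o: o["harm"]):
        if attempt(opinion["remediation"]):
            return opinion["remediation"]
    return None  # none worked: gather more data, seek consensus again

opinions = [
    {"remediation": "rebuild the cluster", "harm": 9},
    {"remediation": "restart the service", "harm": 1},
    {"remediation": "patch the library",   "harm": 4},
]
# Suppose only the patch actually resolves the problem:
result = try_remediations(opinions, lambda r: r == "patch the library")
print(result)  # prints patch the library
```

Note that the cheap restart is attempted and fails before the costlier patch is tried; the drastic rebuild is never attempted at all.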