Following up on my previous post on Preparation, this post explores common patterns for diagnosing an active software issue.
Painfully obvious but it needs to be said – never stop observing. Much time can be wasted by jumping to wrong conclusions and sticking to them without sufficient observation.
Do not become anchored to the first hypothesis you produce; this is the cognitive bias known as anchoring. Your first hypothesis may describe a symptom instead of a cause, a separate issue, or an entirely irrelevant false positive. Continue observing. Look for opportunities to disprove your hypothesis. A destroyed hypothesis is a victory: you have reduced the problem space. I love a rekt hypothesis.
In medicine, doctors employ the differential diagnosis process, a specialized application of the scientific method. We can do the same with software systems:
- Gather observations about the system and symptoms (logs, version, environment, steps taken).
- List candidate causes (hypotheses). It helps to have a knowledge base of past system issues. If you do not have one, start one.
- Make predictions about what should be true if a hypothesis is correct.
- Design tests that could refute your predictions, prioritized by weighing several factors:
  - Potential for significant threat to safety or security.
  - Likelihood of the candidate being the cause, based on your personal experience and system documentation.
  - Ease of testing the candidate. If something is not very likely but it is quick and easy to rule out, it may be worth doing so sooner.
  - Potential to significantly reduce the problem space. Can you observe the flow of data somewhere in the middle, such as a network call in a client/server application? You are playing the High Low game with a binary search. Design your system in a way that allows these observation points.
- As you rule out candidates and make more observations, return to earlier steps and form new candidates and tests.
Strictly speaking, you never absolutely prove a candidate is the cause. You reach a point of sufficient confidence that a particular change will result in a working system. Always remember that ruling something out is at least as important as adding support to a candidate.
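The prioritization step above can be sketched in a few lines of Python. This is only an illustration: the `Hypothesis` fields, the weights in `priority`, and the example candidates are all made up for the sketch, and in practice you would weigh these factors by judgment rather than by formula.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    threat: float      # 0..1, potential harm to safety or security
    likelihood: float  # 0..1, estimated chance this is the cause
    test_cost: float   # rough hours needed to run the refuting test


def priority(h: Hypothesis) -> float:
    """Higher score means test sooner. The weights are illustrative guesses."""
    return (2.0 * h.threat + h.likelihood) / max(h.test_cost, 0.1)


def rank(candidates: list[Hypothesis]) -> list[Hypothesis]:
    """Order candidates so the most urgent, cheapest tests come first."""
    return sorted(candidates, key=priority, reverse=True)


# Hypothetical candidates for an imaginary incident.
candidates = [
    Hypothesis("expired TLS certificate", threat=0.2, likelihood=0.1, test_cost=0.1),
    Hypothesis("race in cache refresh", threat=0.1, likelihood=0.6, test_cost=8.0),
    Hypothesis("credential leak in logs", threat=0.9, likelihood=0.2, test_cost=1.0),
]

for h in rank(candidates):
    print(f"{priority(h):6.2f}  {h.name}")
```

Note how the unlikely but nearly free certificate check floats to the top, while the most likely candidate lands last because its test takes a full day.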
- If a problem takes a long time or many steps to reproduce, effort spent shortening that time can be just as useful as effort spent on the underlying problem. It will help the current effort and make you more effective in the future.
- Reproduce the problem outside of your system, in the simplest and most isolated way possible. This will also make it easier to post your question to message boards where you can only provide a snippet of code. Console applications are great for this.
- Disable or comment out parts of your system to narrow down the problem space. Trigger the problem with the least amount of executed code possible.
- Familiarize yourself with the flow of data and execution. You can work from the start of the flow and step forward until you see something wrong, or start at the end and step backwards until you see something correct.
- If you are dealing with physical components, have spares you can swap in and out.
- Reduce or eliminate concurrency (multiple threads). Easier said than done, but for difficult problems it can be necessary.
- Force-feed data or simulate external systems to better control test scenarios.
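If you can observe the data between steps, narrowing down the flow becomes a literal binary search. Here is a minimal sketch, assuming the pipeline can be modeled as a list of functions, that the input is known good and the final output known bad, and that once the data goes bad it stays bad. The stage functions and the `is_ok` check are hypothetical stand-ins for your own observation points.

```python
from typing import Callable, Sequence

Stage = Callable[[int], int]


def first_bad_stage(stages: Sequence[Stage], data: int,
                    is_ok: Callable[[int], bool]) -> int:
    """Return the index of the first stage whose output fails is_ok.

    Assumes the input passes is_ok, the final output fails it, and
    bad data never becomes good again further down the pipeline.
    """
    def output_after(n: int) -> int:
        value = data
        for stage in stages[:n]:
            value = stage(value)
        return value

    lo, hi = 0, len(stages)  # good after lo stages, bad after hi stages
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_ok(output_after(mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1


# Hypothetical five-stage pipeline; values are supposed to stay non-negative.
def add_one(x: int) -> int: return x + 1
def double(x: int) -> int: return x * 2
def buggy_offset(x: int) -> int: return x - 100  # the planted defect
def triple(x: int) -> int: return x * 3
def add_five(x: int) -> int: return x + 5


stages = [add_one, double, buggy_offset, triple, add_five]
print(first_bad_stage(stages, 3, lambda v: v >= 0))  # 2
```

With five stages this takes two observations instead of up to five. The savings grow quickly as the flow gets longer, which is the argument for designing observation points into the system.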
Beware the Heisenbug, which will trick you into thinking a problem is solved when in fact it is not. Some warning signs of a Heisenbug:
- The failure path involves unmanaged memory or multiple threads.
- The problem occurs in the Release build but not the Debug build, or does not occur when the Debugger is attached.
- Making trivial, seemingly unrelated changes impacts the reproducibility of the issue, particularly changes to logging.
- CPU load impacts reproducibility.
- The problem occurs on one system but not on another seemingly identical system, or occurs with varying frequency on different systems. But before blaming old Heisenberg, verify the systems are in fact identical and there is no faulty hardware. Good luck.
Tactics for dealing with a Heisenbug:
- All previous tactics for reducing the problem space still apply, with a focus on memory-sensitive and thread-sensitive code.
- Rapid test automation becomes more vital with low-frequency failures.
- Carefully compare/contrast your logs between a session that failed and a session that worked. Does a particular sequence always result in a failure?
- Force one thread to run slower (if you have multiple) and try to create a deterministic situation that fails every time. Then you can be more confident in your fix.
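To illustrate the last tactic, here is a minimal sketch of forcing a lost-update race to happen every time. The `Account` class and the `pause` hook are invented for the example. Instead of a plain sleep, it uses `threading.Event` so the slow thread is guaranteed to stall between its read and its write.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance

    def deposit(self, amount: int, pause=None) -> None:
        current = self.balance            # read
        if pause is not None:
            pause()                       # test hook: stall between read and write
        self.balance = current + amount   # write, may clobber another update


def run_forced_interleaving() -> int:
    account = Account(100)
    slow_has_read = threading.Event()
    fast_is_done = threading.Event()

    def stall() -> None:
        slow_has_read.set()               # let the fast thread go
        fast_is_done.wait(timeout=5)      # hold the slow thread back

    def slow() -> None:
        account.deposit(10, pause=stall)

    def fast() -> None:
        slow_has_read.wait(timeout=5)
        account.deposit(20)
        fast_is_done.set()

    threads = [threading.Thread(target=slow), threading.Thread(target=fast)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance


print(run_forced_interleaving())  # 110, not 130: the fast thread's deposit is lost
```

Once the failure is deterministic, a candidate fix can be judged against it. For example, if `deposit` were changed to hold a lock around the read and the write, this same harness should end at 130 (the stall simply times out while the fast thread waits for the lock).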
When communicating with stakeholders, be wary of providing absolute conclusions too soon. Do not put yourself in the position of being “wrong” when your first hypothesis is disproved. If you are in management, give credit to those who rule things out and narrow the scope of the issue.
Once the system is functioning, you have a few more tasks to consider:
- Create an automated test that checks for the issue in the future. Problems have a habit of returning.
- Maintain a knowledge base of issues, documenting what symptoms indicate or rule out a particular cause of the issue. Both inclusion and exclusion criteria are important.
- Review for opportunities to improve your diagnostic abilities in the future. Do you need more logging? Do you need to design a point of access for observation? How can you shorten the detective work next time?
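A knowledge base does not need to be fancy to be useful. As a rough sketch, each entry could record both the symptoms that indicate a cause and the symptoms that rule it out. The `KnownIssue` structure, the symptom strings, and the example entries below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KnownIssue:
    cause: str
    indicates: set[str] = field(default_factory=set)  # symptoms that support this cause
    rules_out: set[str] = field(default_factory=set)  # symptoms that exclude this cause


def candidate_causes(observed: set[str], knowledge_base: list[KnownIssue]) -> list[str]:
    """Return causes not excluded by any observation, best supported first."""
    scored = []
    for issue in knowledge_base:
        if issue.rules_out & observed:
            continue  # an exclusion criterion was observed
        support = len(issue.indicates & observed)
        if support:
            scored.append((support, issue.cause))
    return [cause for support, cause in sorted(scored, key=lambda s: -s[0])]


# Hypothetical entries from past incidents.
knowledge_base = [
    KnownIssue(
        "connection pool exhausted",
        indicates={"timeouts under load", "errors clear after restart"},
        rules_out={"fails on first request"},
    ),
    KnownIssue(
        "bad config deployment",
        indicates={"fails on first request", "started after deploy"},
        rules_out={"errors clear after restart"},
    ),
    KnownIssue("disk full", indicates={"write errors in log"}),
]

observed = {"timeouts under load", "errors clear after restart", "started after deploy"}
print(candidate_causes(observed, knowledge_base))  # ['connection pool exhausted']
```

The bad config entry matches one observed symptom, but it is dropped because a restart clearing the errors rules it out. That is the exclusion criteria doing their job.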