Systems are designed not to fail, but they inevitably do. Even if your own contribution to a project is solid, complex systems will have aspects outside of your control. Perhaps there is a memory leak in a third-party component. Or the ionizing radiation of a cosmic ray from a galaxy far, far away flips a bit.
If we cannot prevent a failure, we can try to detect it and take corrective action. Microcontrollers commonly provide a Watchdog Timer which does just that. We can use similar concepts in higher level software.
The basic design of a watchdog is that it will trigger a failure signal if it is not reset within some interval. Embedded engineers refer to it as “kicking” the dog (lest it bite you). But I prefer to think of it as telling your watchdog to sit and not attack that bit of software which hasn’t failed yet. If you don’t tell him to sit periodically, he will pounce like the good dog he is.
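To make the idea concrete, here is a minimal sketch of a software watchdog in Python. The names `Watchdog`, `sit`, and `on_timeout` are made up for illustration and do not come from any library. This version pounces at most once per `start()`, which is one of several reasonable design choices.

```python
import threading
import time


class Watchdog:
    """Calls on_timeout() if sit() is not called within `timeout` seconds."""

    def __init__(self, timeout, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout
        self._deadline = 0.0
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        self._stop.clear()
        self.sit()  # begin the first countdown
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def sit(self):
        """Reset the countdown (the embedded world's 'kick')."""
        with self._lock:
            self._deadline = time.monotonic() + self.timeout

    def stop(self):
        self._stop.set()
        # Avoid joining ourselves if stop() is called from on_timeout().
        if self._thread is not None and self._thread is not threading.current_thread():
            self._thread.join()

    def _run(self):
        while True:
            with self._lock:
                remaining = self._deadline - time.monotonic()
            if remaining <= 0:
                self.on_timeout()  # nobody told the dog to sit in time
                return
            # Sleep until the current deadline; wait() returns True if stopped.
            if self._stop.wait(remaining):
                return
```

Healthy code calls `sit()` after each unit of real work. If the calls stop arriving, the background thread runs `on_timeout` once the deadline passes.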
Scenarios
Want to make sure a process keeps running? Want to detect if a hardware peripheral is still alive? You need a good watchdog.
In the case of process monitoring, a simple (but not fully effective) solution is to monitor the Process.Exited event. This has the advantage of immediate feedback without a timer. However, just because a process is alive doesn’t mean it is functioning properly. A better solution is to require the process to do some light but real work that runs through the major parts of your tech stack.
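As a rough illustration of the difference between "alive" and "functioning", the sketch below has the worker touch a heartbeat file after each unit of real work, while a supervisor checks both the exit status and the age of that file. It is written in Python rather than .NET, so `poll()` on a `subprocess.Popen` object stands in for the exit event. The heartbeat file, `beat`, and `check` are assumptions made up for this example.

```python
import os
import time


def beat(path):
    """Worker side: call only after a real unit of work has succeeded."""
    with open(path, "a"):
        pass              # make sure the file exists
    os.utime(path, None)  # set its modification time to now


def check(proc, path, max_age):
    """Supervisor side: classify the worker as 'exited', 'stalled', or 'ok'.

    `proc` is expected to behave like subprocess.Popen: poll() returns None
    while the process is running and its exit code afterwards.
    """
    if proc.poll() is not None:
        return "exited"   # the case an exit event also catches
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        return "stalled"  # running, but it has never produced a heartbeat
    return "ok" if age <= max_age else "stalled"
```

The "stalled" result is the case an exit event alone would miss: the process is still there, but no real work has completed recently.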
On the hardware side, if you have a sensor, continue to monitor sensor readings with the watchdog even if you do not need them for other purposes. If you cannot get readings repeatedly, such as a device that requires user interaction, check your hardware API for commands that provide metadata like firmware version. These make good ping methods to verify the hardware is still communicating through the app, drivers, OS, and cabling. Just make sure you do not interfere with other commands on other threads of your application. Some hardware doesn’t like to be told to do two things at once. You could treat the successful completion of any device command as a watchdog reset and only send pings when nothing else is going on.
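One possible shape for that last idea is a thin wrapper that serializes commands with a lock, counts every successful command as a watchdog reset, and pings only after a quiet period. Everything here is hypothetical: the `device` object with a `send` method, the `"FIRMWARE_VERSION?"` command string, and the `GuardedDevice` name are placeholders for whatever your hardware API provides.

```python
import threading
import time


class GuardedDevice:
    """Wraps a device so every successful command resets the watchdog."""

    def __init__(self, device, on_success, idle_after):
        self._device = device           # hypothetical object with send(command)
        self._on_success = on_success   # for example, your watchdog's reset function
        self._idle_after = idle_after   # seconds of silence before a ping is sent
        self._lock = threading.Lock()
        self._last_ok = time.monotonic()

    def send(self, command):
        with self._lock:                # one command at a time, across threads
            reply = self._device.send(command)  # an exception here skips the reset
            self._last_ok = time.monotonic()
        self._on_success()
        return reply

    def ping_if_idle(self, ping_command="FIRMWARE_VERSION?"):
        """Send a harmless metadata query only if nothing else ran recently."""
        if time.monotonic() - self._last_ok < self._idle_after:
            return False                # real traffic already proved the link
        self.send(ping_command)
        return True
```

Call `ping_if_idle()` from a periodic timer. When the application is busy talking to the device, the pings are skipped and the real commands keep the watchdog quiet.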
Corrective Actions
So your watchdog pounced. What do we do now?
- At the very least, log it. You may find patterns, commonalities, and differences between situations that produce problems, leading you to better solutions.
- If it’s a process, kill it (if it’s even still alive) and restart it.
- Try to close and re-open resources like serial ports. I don’t generally find this works, but a particular device may have a quirk where this helps. It will be faster than…
- Reboot the system. Slow but often the most reliable solution. Consider how this would impact users. It can feel like a cop-out to reboot a system, but don’t feel bad. NASA does it.
Note that there can be multiple stages, trying a less disruptive corrective action before something more drastic. This adds complexity that needs to be tested, but if reboots are highly disruptive it may be worth it.
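A staged recovery might look something like the sketch below, which tries each action in order and stops as soon as a health check passes. The `recover` function and the stage names in the usage note are illustrative assumptions; the actions and the health check are whatever fits your system.

```python
import logging

log = logging.getLogger("watchdog")


def recover(stages, is_healthy):
    """Try each (name, action) pair in order, least disruptive first.

    Returns the name of the stage that restored health, or None if
    every stage was tried without success.
    """
    for name, action in stages:
        log.warning("watchdog pounced, trying: %s", name)
        try:
            action()
        except Exception:
            log.exception("corrective action failed: %s", name)
            continue                  # escalate to the next stage
        if is_healthy():
            log.info("recovered after: %s", name)
            return name
    return None
```

A call might pass stages such as `[("reopen port", reopen_port), ("restart process", restart_process), ("reboot", reboot)]`, where each function is your own. Every stage logs before it acts, so the first corrective action on the list is always covered.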
Pit Bull Pitfalls
- Include a watchdog trigger scenario in your test plan. It may never be encountered in random testing. You need a way to force it and test your corrective actions.
- Purely software-based watchdogs are NOT sufficient for safety-critical systems. See the references at the bottom for hardware watchdog guidance.
- A watchdog that is internal to a process cannot help you if that whole process dies. Internal watchdogs are still fine to use, but be aware of the limitation.
- Do not start your watchdog until you know the system is initialized and ready to go. Turn your watchdog off before closing resources down (e.g. before closing a serial port). Otherwise you will get spurious alerts.
- Consider making your watchdog time limit configurable. It can be hard to know up front what the correct limit should be. If you get spurious alerts in production you may want to raise it quickly.
- Be sure not to call your reset function in a very tight loop, or in a loop that could fail in such a way as to have no delay. This will consume lots of CPU time, potentially bringing the system down while your watchdog thinks everything is just fine.
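The tight-loop pitfall in particular is easy to show in code. The sketch below, using made-up names (`monitored_loop`, `work`, `sit`), resets the watchdog only after the work succeeds and guarantees a pause on every path through the loop, including the failure path.

```python
import logging
import threading
import time

log = logging.getLogger("watchdog")


def monitored_loop(work, sit, period, stop):
    """Run work() roughly every `period` seconds until `stop` is set.

    `stop` is a threading.Event; `sit` is the watchdog's reset function.
    """
    while not stop.is_set():
        started = time.monotonic()
        try:
            work()
        except Exception:
            # No reset here: repeated failures let the watchdog expire.
            log.exception("work failed")
        else:
            sit()                     # reset only after real work succeeded
        elapsed = time.monotonic() - started
        # Pause for the rest of the period even if work() failed instantly,
        # so a fast-failing loop cannot spin and eat the CPU.
        stop.wait(max(0.0, period - elapsed))
```

Because the wait sits outside the `try`, an exception thrown in the first line of `work()` still costs a full period before the next attempt.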
References
1. Jack Ganssle: Great Watchdog Timers for Embedded Systems
2. Niall Murphy: Watchdog Timers (PDF)
3. Michael Barr: Introduction to Watchdog Timers
4. NASA: Computers in Spaceflight
Featured Image Credit
“Bubba” by Wes Bushby