When you write to a memory cell, will you always read back the same value? The vast majority of the time the answer is yes. For an individual using a personal computer the odds and consequences of failure are small enough to be irrelevant. But in some systems it can be a regular concern.
- Data centers and supercomputers. The odds of a memory error rise with the number of chips.
- Financial systems, safety systems, and simulations of chaotic systems (weather forecasting) are averse to even a small chance of a mistake.
- Systems at higher altitude, in aircraft, and especially those in orbit and beyond receive more ionizing radiation, increasing the odds of memory errors.
- Systems with a large quantity of small devices. This category is becoming more prevalent with the Internet of Things. A single device may never have a memory error, but if you deploy thousands the system as a whole will begin to encounter them.
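The scaling argument in the last bullet is just compounding probability. Here is a minimal sketch, assuming every device fails independently and with the same probability over some period. The function name and the 1-in-100,000 rate are hypothetical, chosen only for illustration and not measured from any real hardware.

```python
def prob_any_error(p_single: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices sees an error."""
    # P(no device fails) = (1 - p)^n, so P(at least one fails) is the complement.
    return 1.0 - (1.0 - p_single) ** n_devices


# Hypothetical rate: a 1-in-100,000 chance per device over some period.
p = 1e-5
for n in (1, 1_000, 100_000):
    print(f"{n:>7} devices: {prob_any_error(p, n):.4%}")
```

A rate that is negligible for one device becomes more likely than not once the fleet is large enough.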
Causes and Categories of Memory Errors
Memory errors can be categorized into:
- Soft Errors: do not cause lasting damage to the circuit. The error could be corrected by rewriting the proper value.
- Hard Errors: physically alter the state of the circuit such that damage or permanent failure may occur. At a minimum a power cycle is needed to clear the error.
Causes of memory errors, both hard and soft, include:
- Radiation with enough energy to liberate electrons (ionizing radiation):
  - Cosmic rays from the sun and from far beyond the solar system, including other galaxies. Seriously.
  - Natural sources such as isotopes of uranium, potassium, and thorium. Natural radionuclides may be present as impurities in materials that are used in circuits or their packaging.
  - Man-made sources such as nuclear power, some medical devices, and other industrial users of radiation.
- Interference or crosstalk along communication pathways between circuits, potentially requiring less energy than ionizing radiation to cause a problem.
- Intentional attacks on a device such as Rowhammer, in which repeated access to a row of DRAM can cause bits in adjacent rows to flip.
- Other physical or material defects in the circuit.
The space industry is primarily concerned with high-energy particles and uses the following terminology:
- Single Event Effects (SEE): all memory errors caused by high-energy particles
- Single Event Upset (SEU): soft errors
- Single Event Latchup (SEL), Single Event Gate Rupture (SEGR), Single Event Burnout (SEB): types of hard errors
Colloquially, the term “bit flip” is used to describe all types of memory errors.
Solutions
Many ground-based systems are best protected by error-correcting code (ECC) memory. Extra bits of memory store error detection and correction data. The level of protection varies with the hardware, but typically any single bit flip can be detected and fixed with no interruption to software. Both AWS and Azure use ECC memory.
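To make the "extra bits" idea concrete, here is a toy Hamming(7,4) code: 4 data bits plus 3 parity bits, enough to locate and fix any single flipped bit. This is only an illustration of the principle. Real ECC memory does this in hardware over much wider words, and the function names below are made up for this sketch.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: list[int]) -> list[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a single bit flip in "memory"
recovered = hamming74_decode(stored)
print(recovered == word)  # True
```

A code like this silently miscorrects if two bits flip in the same codeword, which is why practical schemes usually add another parity bit so that double errors are at least detected.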
Systems that are highly safety critical and/or exposed to higher doses of ionizing radiation use redundancy and/or radiation hardening. The Space Shuttle had four computers in active use with one more as an independent backup. Commands to an actuator from a faulty computer were outvoted by the others.
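The outvoting idea can be sketched as a simple majority voter. This is a software illustration of the concept only, not a description of how the Shuttle implemented it. The function name and the command values are hypothetical.

```python
from collections import Counter


def vote(commands):
    """Return the command issued by a strict majority, or None if there is none."""
    if not commands:
        return None
    value, count = Counter(commands).most_common(1)[0]
    return value if count > len(commands) / 2 else None


# Four redundant computers; one has suffered a bit flip and disagrees.
print(vote([10, 10, 10, 42]))  # 10
# A two-against-two split has no majority, so the voter refuses to pick.
print(vote([10, 10, 42, 42]))  # None
```

Returning None on a tie is a design choice for this sketch: the caller has to decide what a safe fallback looks like when no majority exists.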
Insufficient Solutions
Watchdogs
Watchdogs can be used to reboot a system in the event a memory error causes a crash. But data and program instructions can be corrupted in ways that lead to bad outputs with no crash.
Watchdogs are very useful, but are not a complete solution for memory errors.
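A small simulation shows the gap. The Watchdog class below is a hypothetical software stand-in for a hardware timer, and the control loop and its numbers are invented for illustration. After a simulated bit flip in a gain constant the loop keeps running and keeps kicking the watchdog, so nothing ever trips, yet every output is wrong.

```python
class Watchdog:
    """Software stand-in for a hardware watchdog timer (hypothetical API)."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_kick = 0.0

    def kick(self, now: float) -> None:
        self.last_kick = now

    def expired(self, now: float) -> bool:
        return now - self.last_kick > self.timeout


def control_step(gain: int, sensor: int) -> int:
    return gain * sensor


wd = Watchdog(timeout=1.0)
gain = 2
gain ^= 1 << 4  # simulated bit flip: 2 becomes 18

outputs = []
now = 0.0
for tick in range(5):
    now = tick * 0.1
    outputs.append(control_step(gain, sensor=10))
    wd.kick(now)  # the loop is alive, so the watchdog stays quiet

print(outputs)               # [180, 180, 180, 180, 180] instead of 20 each
print(wd.expired(now))       # False: no reboot is triggered
print(wd.expired(now + 2.0)) # True: only a hang would be caught
```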
Software Checksums
It is common to use checksums to verify that data has been transmitted correctly between systems. Radio communications especially are vulnerable to transmission errors from interference. Are application layer checksums a viable solution to protect against memory errors?
They may help, but fundamentally the answer is no. An error can occur in the memory containing your program’s code, not just in its data.
I once had a log file tell me that the DateTime type has no attribute Now (.NET code). No reflection, nothing dynamic in play, only happened one time across thousands of machines. Something under the hood was corrupted. And you thought you were safe from bad pointers in C#.
Finally, if there is any significant time between when you compute a checksum and when you last use that data, a bit flip can happen in between. I encountered a system that had a user workflow between checksum validation and later use of a variable (seconds to minutes). The checksum validation logic passed but later code crashed, with the data dump showing the checksum to be wrong. It was explainable by the flip of a single bit.
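That gap between check and use can be reproduced in a few lines. This is a simulation under stated assumptions: the payload, the flipped bit position, and the flip_bit helper are all made up, and CRC-32 from Python's zlib module stands in for whatever checksum the real system used.

```python
import zlib


def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Flip one bit in place to simulate a single event upset."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)


payload = bytearray(b"transfer 100.00 to account 12345")
stored_crc = zlib.crc32(payload)

# Time of check: validation passes.
passed_validation = zlib.crc32(payload) == stored_crc

# ... seconds to minutes of user workflow go by ...
flip_bit(payload, 83)  # simulated bit flip while the data sits in memory

# Time of use: the data no longer matches the checksum that "passed".
still_valid = zlib.crc32(payload) == stored_crc
print(passed_validation, still_valid)  # True False
```

Re-validating immediately before every use shrinks the window but cannot close it, and it does nothing for a flip in the validation code itself.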
Checksums will not save you from the cosmos.