Enhancing Server Availability and Security Through Failure-Oblivious
Computing
Rinard, Cadar, Dumitran, Roy, Leu, Beebee (2004)
What Kind of Paper is This?
- Big idea?
- Mostly a proof of the idea
- Lots of justification and scoping
The Big Picture
- Safe compiler inserts checks for invalid memory references.
- Ignore failed writes
- Manufacture data for reads (ideally values that will trigger
normal error paths)
- Most software still works!
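The big picture above can be sketched in a few lines. This is my own Python analogue (the paper's implementation is a C compiler pass, not this class): out-of-bounds writes are silently dropped and out-of-bounds reads hand back a manufactured value instead of crashing.

```python
class ObliviousBuffer:
    """Simulates failure-oblivious memory semantics for a fixed-size buffer:
    invalid writes are discarded, invalid reads return a manufactured value."""

    def __init__(self, size, manufactured=0):
        self.data = [0] * size
        self.manufactured = manufactured  # value handed back for bad reads

    def write(self, index, value):
        if 0 <= index < len(self.data):
            self.data[index] = value
        # else: out-of-bounds write is silently discarded

    def read(self, index):
        if 0 <= index < len(self.data):
            return self.data[index]
        return self.manufactured  # out-of-bounds read yields a made-up value

buf = ObliviousBuffer(4)
buf.write(0, 42)
buf.write(100, 99)       # out of bounds: discarded, no crash
print(buf.read(0))       # -> 42
print(buf.read(100))     # -> 0 (manufactured)
```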
The Results
- Memory errors due to security attacks:
- Disable the attack
- Enable continued safe execution
- Other memory errors
- Checks induce a performance penalty comparable to earlier bounds-checking work
- Server continues to run acceptably
- Individual request may not get handled
- Future requests are unlikely to depend on the failure.
Why is it OK to continue executing after such errors?
- What harm does a memory failure cause?
- Termination
- Infinite Looping
- Control flow change resulting in incorrect answer
- Data structure corruption
- Incorrect computational results
- Interception guarantees that you don't crash/terminate
- If read values are well-chosen, infinite looping becomes unlikely.
- Discarding writes keeps errors localized and avoids corrupting data
structures.
- For server apps -- data and control flow propagation is short, so
returning bad values doesn't actually corrupt computation.
- Most common problem addressed is buffer overflow attacks -- the bad
data isn't actually used by anyone and so not corrupting heap/stack
simply avoids the attack, but doesn't damage anything else.
- Can choose return values to follow normal error paths
- Would not work in all cases (e.g., numerical computation).
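As an illustration of the "normal error paths" point (my own Python analogue, not code from the paper): a C-style scan over a buffer missing its NUL terminator would ordinarily walk off the end; if out-of-bounds reads manufacture 0, the scan stops as though the terminator were there, and the caller takes its ordinary "not found" path.

```python
def oblivious_read(buf, i):
    """Bounds-checked read: in-bounds values come from the buffer,
    out-of-bounds reads manufacture 0 (like a C NUL byte)."""
    return buf[i] if 0 <= i < len(buf) else 0

def find_char(buf, target):
    """C-style strchr: scan until the target or a NUL byte is seen.
    With oblivious_read, a missing terminator no longer means an
    unbounded walk through memory -- the manufactured 0 ends the scan
    and the ordinary 'not found' result (-1) is returned."""
    i = 0
    while True:
        c = oblivious_read(buf, i)
        if c == target:
            return i
        if c == 0:           # real or manufactured terminator
            return -1        # normal error path: character not found
        i += 1

unterminated = [104, 105]            # "hi" with no trailing NUL
print(find_char(unterminated, 105))  # -> 1
print(find_char(unterminated, 122))  # -> -1, via the manufactured 0
```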
Advantages
- Increased availability (no termination)
- Improved security (no buffer overflows)
- Low cost -- can adopt this approach with just a recompilation
- Less administration (don't need to immediately patch buffer overrun
vulnerabilities)
Disadvantages
- End up executing new and unanticipated paths
- Developers could come to rely on this approach and become sloppier
Implementation
- Blindingly simple
- Standard checking support
- Throw away writes
- Manufacture values for reads -- small values like 0 and 1 get
returned a lot, because they frequently terminate loops.
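One way to picture that value-manufacturing scheme (an illustrative sketch, not necessarily the paper's exact sequence): a generator biased toward 0 and 1, with occasional larger values mixed in.

```python
import itertools

def manufactured_values():
    """Yield values for successive invalid reads, biased toward 0 and 1
    since those tend to terminate loops (NUL bytes, false booleans,
    small counts), with an occasionally increasing larger value."""
    for n in itertools.count(2):
        yield 0
        yield 1
        yield n   # occasionally try something bigger

gen = manufactured_values()
print([next(gen) for _ in range(9)])  # -> [0, 1, 2, 0, 1, 3, 0, 1, 4]
```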
3-way Evaluation
- Standard (unmodified)
- Bounds-checked
- Failure-oblivious
Evaluation
- Test Programs
- Pine
- Apache
- Sendmail
- Midnight Commander
- Mutt
- Criteria
- Security and Resilience
- Performance
- Stability
- In all cases, the failure-oblivious programs worked fine.
- Although overheads were high (factors of 3 to 8) in some cases, most of those
occurred in interactive programs, for which the acceptability criterion is human
perception, and the slowdowns remained imperceptible to users.
- The failure-oblivious versions worked stably over time.
- In server scenarios, the worker-pool model, where you kill the worker that
caused an error and start a new one, also works, but the kill/restart cycle
induces higher overhead.
- For some of the interactive loads, anything but failure-oblivious computing
results in the user being unable to do something (e.g., open a mailbox).
Wrapping Up
- Acceptability properties reminded me a lot of the agile notion of
test-driven development -- if your tests don't fail, then the code you're
writing is "correct."
- Fun extension: for out-of-bounds writes, store them in a hash map and
then return the right value when that location is later read -- very cute!
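That extension is easy to sketch (again my own hypothetical illustration, assuming a per-buffer side table rather than whatever structure a real implementation would use):

```python
class StashingBuffer:
    """Extension sketch: instead of discarding out-of-bounds writes,
    stash them in a side table keyed by index, and replay the stashed
    value if that same out-of-bounds index is later read."""

    def __init__(self, size):
        self.data = [0] * size
        self.stash = {}  # out-of-bounds index -> last value written there

    def write(self, index, value):
        if 0 <= index < len(self.data):
            self.data[index] = value
        else:
            self.stash[index] = value  # remember instead of discarding

    def read(self, index):
        if 0 <= index < len(self.data):
            return self.data[index]
        return self.stash.get(index, 0)  # replay, or manufacture 0

buf = StashingBuffer(2)
buf.write(5, 7)       # out of bounds: stashed, not discarded
print(buf.read(5))    # -> 7, the "right" value for that read
print(buf.read(9))    # -> 0, never written, so manufactured
```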