Enhancing Server Availability and Security Through Failure-Oblivious
Computing
Rinard, Cadar, Dumitran, Roy, Leu, Beebee (2004)
What Kind of Paper is This?
- Big idea?
- Mostly a proof of the idea
- Lots of justification and scoping
The Big Picture
- Safe compiler inserts checks for invalid memory references.
- Ignore failed writes
- Manufacture data for reads (ideally values that will trigger
normal error paths)
- Most software still works!
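The big picture above can be sketched in a few lines. This is my own Python analogue (the paper's implementation is a C compiler pass, not this class): out-of-bounds writes are silently dropped and out-of-bounds reads hand back a manufactured value instead of crashing.

```python
class ObliviousBuffer:
    """Simulates failure-oblivious memory semantics for a fixed-size buffer:
    invalid writes are discarded, invalid reads return a manufactured value."""

    def __init__(self, size, manufactured=0):
        self.data = [0] * size
        self.manufactured = manufactured  # value handed back for bad reads

    def write(self, index, value):
        if 0 <= index < len(self.data):
            self.data[index] = value
        # else: out-of-bounds write is silently discarded

    def read(self, index):
        if 0 <= index < len(self.data):
            return self.data[index]
        return self.manufactured  # out-of-bounds read yields a made-up value

buf = ObliviousBuffer(4)
buf.write(0, 42)
buf.write(100, 99)       # out of bounds: discarded, no crash
print(buf.read(0))       # -> 42
print(buf.read(100))     # -> 0 (manufactured)
```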
The Results
- Memory errors due to security attacks:
- Disable the attack
- Enable continued safe execution
- Other memory errors
- Checks induce a performance penalty comparable to earlier bounds-checking work
- Server continues to run acceptably
- Individual request may not get handled
- Future requests are unlikely to depend on the failure.
Why is it OK to continue executing after such errors?
- What harm does a memory failure cause?
- Termination
- Infinite Looping
- Control flow change resulting in incorrect answer
- Data structure corruption
- Incorrect computational results
- Interception guarantees that you don't crash/terminate
- If read values are well-chosen, infinite looping becomes unlikely.
- Discarding writes keeps errors localized and avoids corrupting data
structures.
- For server apps -- data and control flow propagation is short, so
returning bad values doesn't actually corrupt computation.
- Most common problem addressed is buffer overflow attacks -- the bad
data isn't actually used by anyone and so not corrupting heap/stack
simply avoids the attack, but doesn't damage anything else.
- Can choose return values to follow normal error paths
- Would not work in all cases (e.g., numerical computation).
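As an illustration of the "normal error paths" point (my own Python analogue, not code from the paper): a C-style scan over a buffer missing its NUL terminator would ordinarily walk off the end; if out-of-bounds reads manufacture 0, the scan stops as though the terminator were there, and the caller takes its ordinary "not found" path.

```python
def oblivious_read(buf, i):
    """Bounds-checked read: in-bounds values come from the buffer,
    out-of-bounds reads manufacture 0 (like a C NUL byte)."""
    return buf[i] if 0 <= i < len(buf) else 0

def find_char(buf, target):
    """C-style strchr: scan until the target or a NUL byte is seen.
    With oblivious_read, a missing terminator no longer means an
    unbounded walk through memory -- the manufactured 0 ends the scan
    and the ordinary 'not found' result (-1) is returned."""
    i = 0
    while True:
        c = oblivious_read(buf, i)
        if c == target:
            return i
        if c == 0:           # real or manufactured terminator
            return -1        # normal error path: character not found
        i += 1

unterminated = [104, 105]            # "hi" with no trailing NUL
print(find_char(unterminated, 105))  # -> 1
print(find_char(unterminated, 122))  # -> -1, via the manufactured 0
```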
Advantages
- Increased availability (no termination)
- Improved security (no buffer overflows)
- Low cost -- can adopt this approach with just a recompilation
- Less administration (don't need to immediately patch buffer overrun
vulnerabilities)
Disadvantages
- End up executing new and unanticipated paths
- Developers could come to rely on this approach and become sloppier
Implementation
- Blindingly simple
- Standard checking support
- Throw away writes
- Manufacture values for reads -- small values like 0 and 1 get
returned a lot, because they frequently terminate loops.
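One way to picture that value-manufacturing scheme (an illustrative sketch, not necessarily the paper's exact sequence): a generator biased toward 0 and 1, with occasional larger values mixed in.

```python
import itertools

def manufactured_values():
    """Yield values for successive invalid reads, biased toward 0 and 1
    since those tend to terminate loops (NUL bytes, false booleans,
    small counts), with an occasionally increasing larger value."""
    for n in itertools.count(2):
        yield 0
        yield 1
        yield n   # occasionally try something bigger

gen = manufactured_values()
print([next(gen) for _ in range(9)])  # -> [0, 1, 2, 0, 1, 3, 0, 1, 4]
```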
3-way Evaluation
- Standard (unmodified)
- Bounds-checked
- Failure-oblivious
Evaluation
- Test Programs
- Pine
- Apache
- Sendmail
- Midnight Commander
- Mutt
- Criteria
- Security and Resilience
- Performance
- Stability
- In all cases, the failure-oblivious programs worked fine.
- Although overheads were high (factors of 3 to 8) in some cases, most of those
occurred in interactive programs, for which the acceptability criterion is human
perception, and the slowdowns remained imperceptible to users.
- The failure-oblivious versions worked stably over time.
- In server scenarios, the worker-pool model, where you kill the worker that
caused an error and start a new one, also works, but the kill/restart cycle
induces higher overhead.
- For some of the interactive loads, anything but failure-oblivious computing
results in the user being unable to do something (e.g., open a mailbox).
Wrapping Up
- Acceptability properties reminded me a lot of the agile notion of
test-driven development -- if your tests don't fail, then the code you're
writing is "correct."
- Fun extension: for out-of-bounds writes, store them in a hash map and
then return the right value when that location is later read -- very cute!
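That extension is easy to sketch (again my own hypothetical illustration, assuming a per-buffer side table rather than whatever structure a real implementation would use):

```python
class StashingBuffer:
    """Extension sketch: instead of discarding out-of-bounds writes,
    stash them in a side table keyed by index, and replay the stashed
    value if that same out-of-bounds index is later read."""

    def __init__(self, size):
        self.data = [0] * size
        self.stash = {}  # out-of-bounds index -> last value written there

    def write(self, index, value):
        if 0 <= index < len(self.data):
            self.data[index] = value
        else:
            self.stash[index] = value  # remember instead of discarding

    def read(self, index):
        if 0 <= index < len(self.data):
            return self.data[index]
        return self.stash.get(index, 0)  # replay, or manufacture 0

buf = StashingBuffer(2)
buf.write(5, 7)       # out of bounds: stashed, not discarded
print(buf.read(5))    # -> 7, the "right" value for that read
print(buf.read(9))    # -> 0, never written, so manufactured
```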