Using Crash Hoare Logic for Certifying the FSCQ File System

Chen, Ziegler, Chajed, Chlipala, Kaashoek, Zeldovich (2015)

What kind of paper is this?

I'd give it the "big idea" tag -- a new logic for verifying software that manages persistent state (here, a file system) seems like a leap forward.
They are also pretty exhaustive in their contributions and do a real end-to-end job of investigating this.

The Story

Once upon a time, people built software that they thought was correct, but they had no way to prove that it was. This was particularly problematic when that software maintained important persistent state. If you lost power or your system crashed, you really wanted to get your data back. Although some daring and courageous researchers in Australia had taken great strides to prove an operating system kernel correct, they ignored the persistent state. The young knights of the Institute of Technology in Massachusetts decided that they could do better! They invented an extension to Sir Hoare's logic that let them reason about crashes and persistent state. They then showed how they could use this to preserve all the world's data, so we could all sleep well at night as we lived happily ever after.

The Technology

Coq proof assistant
CHL (Crash Hoare Logic) -- a specification language embedded in Coq.
Used CHL to specify a subset of the POSIX system calls
Implemented those calls in Coq and proved that the implementation meets the specification.
FSCQ supports the same features as the xv6 (teaching) operating system's file system.
Extract a Haskell implementation from the Coq.
Use the Haskell implementation via a Haskell FUSE implementation (not verified).

The Trusted Computing Base

Haskell
The Haskell driver
The Haskell FUSE library

Crash Hoare Logic

Basic Hoare Logic: Pre-condition, statements, Post-condition
Statements could be disk operations, computation, state manipulation, etc.
What is missing from Hoare Logic? That the set of statements could be interrupted at any point in time, so the post-condition would not hold at the instant of the crash.
So, we extend it: crash conditions, logical address spaces, and recovery execution semantics.
Aha -- in their examples, a is an address and v is the value to write at that address (I wish they'd said that).
Crash Conditions
- Uses separation logic to combine predicates on disjoint parts of the store.
- It seems that the tricky part is enumerating all the possible crash states.
- So, the write specification captures the current value and all previous values (because the block has to contain one of those previous values).
- Sync allows discard of previous values.
- Crash conditions describe the state just before a crash.
- CHL's model specifies that each block nondeterministically chooses one of those states after a crash.
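
A minimal Python sketch of the disk model described in the bullets above, not code from the paper. The class name CrashyDisk and its methods are hypothetical, and sync is treated here as a whole-disk flush, which is a simplifying assumption. Each address keeps its current value plus the earlier values that a crash could still expose.

```python
import random


class CrashyDisk:
    """Toy asynchronous disk (hypothetical, for illustration only)."""

    def __init__(self):
        # address -> list of candidate values, oldest first; last is current
        self.blocks = {}

    def write(self, a, v):
        # The new value becomes current; older values remain crash candidates.
        self.blocks.setdefault(a, []).append(v)

    def read(self, a):
        # Reads always observe the most recent write.
        return self.blocks[a][-1]

    def sync(self):
        # Assumption: sync flushes everything, so previous values are discarded.
        for a in self.blocks:
            self.blocks[a] = self.blocks[a][-1:]

    def crash(self, rng=random):
        # Each block independently settles on one of its candidate values.
        self.blocks = {a: [rng.choice(vs)] for a, vs in self.blocks.items()}
```

After write(0, "old"), sync(), write(0, "new"), a crash may leave block 0 holding either "old" or "new"; after a second sync only "new" can survive. That is why the write specification has to carry the previous values around.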
Logical Address Spaces
- Once again, use separation logic to model in terms of four different kinds of maps:
  1. Disk address to disk block contents
  2. Inode number to inode structure
  3. Within a file, from offsets to data
  4. Filenames to inode numbers (a directory)
- They use address spaces to make the disk look synchronous above the level of the log. This wasn't exactly how I was expecting address spaces to be used, so it doesn't feel natural quite yet.
- Ah, log_rep is a representation invariant that describes the physical disk contents and assigns it to the address space.
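
One way to read log_rep is as a function from the physical disk state to the logical address space that code above the log sees. The sketch below is a toy under assumptions made here (a data region, a list of logged writes, and a single commit flag; all names are hypothetical), not FSCQ's actual representation.

```python
def logical_disk(data_blocks, log_entries, committed):
    """Return the logical address space implied by a toy physical state.

    data_blocks: dict mapping address -> value in the data region
    log_entries: list of (address, value) writes recorded in the log
    committed:   True if the log's commit record reached the disk
    """
    view = dict(data_blocks)  # copy so the physical state is left untouched
    if committed:
        # A committed log logically overrides the data region, in log order.
        for a, v in log_entries:
            view[a] = v
    return view
```

With an uncommitted log the logical disk is just the data region; once the commit flag is set, the logged writes appear all at once. This is roughly the sense in which the disk looks synchronous and atomic above the log.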
Recovery Execution Semantics
- log_recover is a procedure that can be named in the postcondition to describe what can happen if you have a crash.
- Introduce a variable that indicates if you are in a completed state or a recovered state
- log_recover itself does not need a crash condition because you just keep running it until it finishes
- Can have more than just one recovery function (i.e., you can stack them)
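
The recovery execution semantics can be sketched as a small driver loop. This is a hedged illustration, not the paper's formal semantics: the Crash exception and run_with_recovery are hypothetical names, and a crash is simulated by raising the exception.

```python
class Crash(Exception):
    """Raised by toy procedures to simulate a power failure."""


def run_with_recovery(proc, recover):
    """Run proc; if it crashes, rerun recover until one run finishes.

    Returns ("completed", result) when proc finished without a crash,
    or ("recovered", result) when recovery ran to completion instead.
    """
    try:
        return ("completed", proc())
    except Crash:
        pass
    while True:
        try:
            return ("recovered", recover())
        except Crash:
            # A crash during recovery simply restarts recovery.
            continue
```

The tag in the result plays the role of the variable that says whether execution ended in a completed state or a recovered state. The loop only makes sense if recover is safe to rerun from any state it might leave behind when it crashes.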

The Proofs

Implement CHL in Coq and prove it sound.
For each application using CHL, develop spec and the proof (this is where all the work goes)
At a high level, prove that if the precondition of a block is satisfied, then either its postcondition or its recover condition holds.
For the first block inside a high level spec, the precondition of the outer block implies the precondition of the first block.
For all but the last block, prove that the postcondition implies the precondition of the next block.
Prove that the postcondition of the last block implies the postcondition of the outer block.
Prove that the crash conditions all imply the precondition of the recovery procedure.
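
The chain of obligations above can be illustrated by brute force over a tiny finite state space. This is only a sketch under stated assumptions: predicates are plain Python functions on states, implies and check_chain are hypothetical helpers, and the real proofs are done in Coq with separation logic rather than by enumeration.

```python
def implies(p, q, states):
    # p => q holds if every state satisfying p also satisfies q.
    return all(q(s) for s in states if p(s))


def check_chain(outer_pre, outer_post, steps, recover_pre, states):
    """steps: list of (pre, post, crash) predicates in program order.

    Returns the names of the obligations that fail (empty list if none).
    """
    obligations = [("outer pre => first pre", outer_pre, steps[0][0])]
    for i in range(len(steps) - 1):
        obligations.append(
            (f"post {i} => pre {i + 1}", steps[i][1], steps[i + 1][0]))
    obligations.append(("last post => outer post", steps[-1][1], outer_post))
    for i, (_, _, crash) in enumerate(steps):
        obligations.append((f"crash {i} => recover pre", crash, recover_pre))
    return [name for name, p, q in obligations if not implies(p, q, states)]


# Toy example: the state is a progress counter 0..3, and each step advances it.
steps = [
    (lambda s: s == 0, lambda s: s == 1, lambda s: s in (0, 1)),
    (lambda s: s == 1, lambda s: s == 2, lambda s: s in (1, 2)),
]
failed = check_chain(lambda s: s == 0, lambda s: s == 2, steps,
                     lambda s: s <= 2, range(4))
```

In this toy every obligation holds, so failed is empty. Tightening the recovery precondition to s <= 1 makes the crash obligation of the second step fail, which is the kind of gap a CHL proof would force you to close.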
Phases
1. Convert p's specification into a series of proof obligations.
2. Predicate implications: Either proofs are trivial or we use separation logic.

The Prototype

31K LoC (code and proofs)
"several" developers over 18 months
Checking the proofs takes 11 hours on an i7 at 2 GHz
CHL is a DSL in Coq (a macro language)
Extract Haskell from Coq
400-line Haskell driver program; 350 lines of Haskell for buffer cache replacement, fixed-size words, and disk blocks
Used Haskell FUSE for application access.

Eval

Nicely stated questions to answer:
1. Is FSCQ complete enough to run real applications?
  - Runs a build environment
  - Runs a mail server
  - Runs emacs!
  - Compare to ext4 and xv6: Basically comparable to or a bit slower than xv6; usually quite a bit slower than ext4, but within a factor of two for the comparable configuration.
2. What kinds of bugs do the theorems preclude? All the categories listed.
3. Does FSCQ recover from crashes? Appears so
4. How difficult is it to build and evolve the code and proofs?
  - 10x larger than xv6
  - Adding async disk writes: changed 1000 lines of CHL and over half of the implementation and proof for FscqLog; But -- nothing above it.
  - Indirect blocks: changed 1500 LoC and only 50 lines above it
  - Buffer Cache: changed 300 LoC; 600 changed LoC above the Log
  - Optimized log layout: Really tiny changes