Using Crash Hoare Logic for Certifying the FSCQ File System
Chen, Ziegler, Chajed, Chlipala, Kaashoek, Zeldovich (2015)
What kind of paper is this?
- I'd give it the "big idea" tag -- the new logic to be able to
verify persistent things (i.e., a file system) seems like a leap
forward.
- They are also pretty exhaustive in their contributions and do a
real end-to-end job of investigating this.
The Story
- Once upon a time, people built software that they thought was correct, but
they had no way to prove that it was. This was particularly problematic when
that software maintained important persistent state. If you lost power or
your system crashed, you really wanted to get your data back. Although some
daring and courageous researchers in Australia had taken great strides to prove
an operating system kernel correct, they ignored the persistent state. The
young knights of the Institute of Technology in Massachusetts decided that
they could do better! They invented an extension to Sir Hoare's logic that
let them reason about crashes and persistent state. They then showed how they
could use this to preserve all the world's data, so we could all sleep well at
night as we lived happily ever after.
The Technology
- Coq proof assistant
- CHL (Crash Hoare Logic) -- a specification language embedded in Coq.
- Used CHL to specify a subset of the POSIX system calls
- Implemented those calls in Coq and proved that the implementation meets
the specification.
- FSCQ supports the same features as the xv6 (teaching) operating system's
file system.
- Extract a Haskell implementation from the Coq.
- Use the Haskell implementation via a Haskell FUSE implementation (not
verified).
The Trusted Computing Base
- Haskell
- The Haskell driver
- The Haskell FUSE library
Crash Hoare Logic
- Basic Hoare Logic: Pre-condition, statements, Post-condition
- Statements could be disk operations, computation, state manipulation, etc.
- What is missing from Hoare Logic? That the set of statements could be
interrupted at any point in time, so the post-condition would not hold at
the instant of the crash.
- So, we extend it: crash conditions, logical address spaces, and recovery
execution semantics.
- Aha -- in their examples, a is an address and v is the value to
write at that address (I wish they'd said that).
- Crash Conditions
- Uses separation logic to combine predicates on disjoint parts of the store.
- It seems that the tricky part is enumerating all the possible crash states.
- So, the write specification captures the current value and all previous
values (because the block has to contain one of those previous values).
- Sync allows discard of previous values.
- Crash conditions describe the state just before a crash.
- CHL's model specifies that each block nondeterministically chooses one
of those states after a crash.
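To make the crash model concrete for myself, here is a minimal sketch of the bullets above: each block keeps its current value plus the older values that might still be on disk, sync discards the older ones, and a crash lets every block pick independently. The class name CrashDisk, its methods, and the use of small integers as block contents are my own assumptions for illustration, not names from the paper.

```python
import itertools


class CrashDisk:
    """Toy model: each block holds a list of possible values, newest last."""

    def __init__(self, nblocks, initial=0):
        self.blocks = {a: [initial] for a in range(nblocks)}

    def write(self, a, v):
        # Append the new value; the older ones may still be what a crash reveals.
        self.blocks[a].append(v)

    def read(self, a):
        # Reads always see the most recent write.
        return self.blocks[a][-1]

    def sync(self):
        # After a sync, only the latest value of each block can survive a crash.
        for a in self.blocks:
            self.blocks[a] = [self.blocks[a][-1]]

    def crash_states(self):
        # Every block independently picks one of its possible values.
        addrs = sorted(self.blocks)
        choices = [sorted(set(self.blocks[a])) for a in addrs]
        return [dict(zip(addrs, combo)) for combo in itertools.product(*choices)]
```

Two unsynced writes to two different blocks give four possible post-crash disks; after sync() there is exactly one. That blow-up is why enumerating crash states by hand looks painful.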
- Logical Address Spaces
- Once again, use separation logic to model in terms of four different kinds
of maps:
- Disk address to disk block contents
- Inode number to inode structure
- Within a file, from offsets to data
- Filenames to inode numbers (a directory)
- They use address spaces to make the disk look synchronous above the
level of the log. This wasn't exactly how I was expecting address spaces
to be used, so it doesn't feel natural quite yet.
- Ah, log_rep is a representation invariant that describes the physical
disk contents and assigns it to the address space.
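One way I found to think about this: a representation invariant like log_rep behaves like an abstraction function from the physical layout to a logical address space. The sketch below is a heavy simplification under my own assumptions (a data region as a dict, a log as a list of (address, value) pairs, and a single committed flag); it is not the paper's definition of log_rep, and the function name logical_disk is hypothetical.

```python
def logical_disk(data, log_entries, committed):
    """Toy abstraction function: physical layout -> logical address space.

    data:        dict mapping disk address to block contents (data region)
    log_entries: list of (address, value) pairs sitting in the on-disk log
    committed:   whether the log's commit record has been written
    """
    view = dict(data)  # copy, so the physical state is left untouched
    if committed:
        # A committed transaction is logically applied even if the data
        # region has not been updated yet.
        for addr, value in log_entries:
            view[addr] = value
    return view
```

Code above the log only ever reasons about the returned view, which is how the disk can look synchronous from there even though the physical writes are not.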
- Recovery Execution Semantics
- log_recover is a procedure that can be specified in the post condition
to allow for what can happen if you have a crash.
- Introduce a variable that indicates if you are in a completed state or
a recovered state
- Log recover itself does not need a crash condition because you just
keep running it until it finishes
- Can have more than just one recovery function (i.e., you can stack them)
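A small sketch of the "keep running it until it finishes" idea, assuming recovery is idempotent. The disk layout (a dict with "data", "log", and "committed"), the Crash exception, and the crash_after knob for injecting crashes are all my own inventions for illustration; only the name log_recover comes from the notes above.

```python
class Crash(Exception):
    """Simulated crash in the middle of recovery."""


def log_recover(disk, crash_after=None):
    """Replay a committed log into the data region.

    Replaying is idempotent: applying an entry twice leaves the same data,
    so a crash partway through can be handled by starting over.
    """
    steps = 0
    if disk["committed"]:
        for addr, value in disk["log"]:
            if crash_after is not None and steps >= crash_after:
                raise Crash()
            disk["data"][addr] = value
            steps += 1
        # Only clear the log once every entry has been applied.
        disk["committed"] = False
        disk["log"] = []
    return disk


def run_with_recovery(disk, crash_schedule):
    """Re-run recovery after each simulated crash until one run completes.

    Returns the number of attempts that were needed.
    """
    attempts = 0
    for crash_after in list(crash_schedule) + [None]:
        attempts += 1
        try:
            log_recover(disk, crash_after)
            return attempts
        except Crash:
            continue
```

For example, with a two-entry committed log and crash_schedule [0, 1], recovery crashes twice and completes on the third attempt with both entries applied.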
The Proofs
- Implement CHL in Coq and prove it sound.
- For each application using CHL, develop spec and the proof (this is
where all the work goes)
- At a high level, prove that if the precondition of a block is satisfied,
then either its postcondition or its recovery condition holds.
- For the first block inside a high level spec, the precondition of the outer
block implies the precondition of the first block.
- For all but the last block, prove that the postcondition implies the
precondition of the next block.
- Prove that the postcondition of the last block implies the postcondition
of the outer block.
- Prove that the crash conditions all imply the precondition of the recovery
procedure.
- Phases
- Convert p's specification into a series of proof obligations.
- Predicate implications: Either proofs are trivial or we use
separation logic.
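The chaining described above can be sketched as a function that turns an outer spec and its inner steps into a list of implications to discharge. This is only my toy rendering of the bullets: predicates are plain Python functions over integer "states", implications are checked by brute force over a finite set of states rather than proved, and the names Spec, obligations, and failed are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Pred = Callable[[int], bool]


@dataclass
class Spec:
    name: str
    pre: Pred
    post: Pred
    crash: Pred


def obligations(outer, steps, recover_pre):
    """List the implications (name, antecedent, consequent) to discharge."""
    obs = [(f"{outer.name}.pre => {steps[0].name}.pre", outer.pre, steps[0].pre)]
    for cur, nxt in zip(steps, steps[1:]):
        obs.append((f"{cur.name}.post => {nxt.name}.pre", cur.post, nxt.pre))
    last = steps[-1]
    obs.append((f"{last.name}.post => {outer.name}.post", last.post, outer.post))
    for s in steps:
        obs.append((f"{s.name}.crash => recover.pre", s.crash, recover_pre))
    return obs


def failed(obs, states):
    """Names of implications with a counterexample among the given states."""
    return [name for name, p, q in obs if any(p(s) and not q(s) for s in states)]


# Toy example: the state is a counter x, and each step bumps it by one.
outer = Spec("op", lambda x: x >= 0, lambda x: x >= 2, lambda x: x >= 0)
a = Spec("a", lambda x: x >= 0, lambda x: x >= 1, lambda x: x >= 0)
b = Spec("b", lambda x: x >= 1, lambda x: x >= 2, lambda x: x >= 1)
obs = obligations(outer, [a, b], recover_pre=lambda x: x >= 0)
```

Here failed(obs, range(-5, 10)) comes back empty; tightening b's precondition to x >= 2 makes "a.post => b.pre" fail, which is the kind of obligation that in the real system would need an actual proof.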
The Prototype
- 31K LoC (code and proofs)
- "several" developers over 18 months
- Checking the proofs takes 11 hours on a 2 GHz i7
- CHL is a DSL in Coq (a macro language)
- Extract Haskell from Coq
- 400-line Haskell driver program; 350 lines of Haskell for buffer cache
replacement, fixed-size words, and disk blocks
- Used Haskell FUSE for application access.
Eval
- Nicely stated questions to answer:
- Is FSCQ complete enough to run real applications?
- Runs a build environment
- Runs a mail server
- Runs emacs!
- Compare to ext4 and xv6: Basically comparable to or a bit slower than xv6;
usually quite a bit slower than ext4, but within a factor of two for the
comparable configuration.
- What kinds of bugs do the theorems preclude? All the categories listed.
- Does FSCQ recover from crashes? Appears so
- How difficult is it to build and evolve the code and proofs?
- 10x larger than xv6
- Adding async disk writes: changed 1000 lines of CHL and over half of the
implementation and proof for FscqLog, but nothing above it.
- Indirect blocks: changed 1500 LoC and only 50 lines above it
- Buffer Cache: changed 300 LoC; 600 changed LoC above the Log
- Optimized log layout: Really tiny changes