Hance: Storage Systems are Distributed Systems (So Verify Them That Way!)
Hance, Lattuada, Hawblitzel, Howell, Johnson, Parno (2020)
What kind of paper is this?
- Best of both worlds: generalizing a methodology for verifying distributed systems and applying it to storage systems.
- Goal: increase efficiency by reducing wait time and the need to write too many tedious proofs -- balance automation...
The Story
Once upon a time, storage systems used simple data structures that were easy to reason about and verify. However, they suffered performance overheads on random-insertion workloads. Storage systems then switched to LSM-trees and Bε trees. This improved performance but also made the code drastically more complex, which made verifying it extraordinarily difficult and slow, reducing programmer productivity. The authors thought: wait a minute, aren't storage systems just distributed systems in disguise? Using that insight, they generalized IronFleet's proof methodology for verifying distributed systems and applied it to a storage system.
They specified, built, and used Dafny to verify VeriBetrKV, a key-value store using Bε trees and journaling. Along the way, the authors developed techniques to quickly evaluate the correctness of code and proofs using modularization, resulting in 99% of the proofs finishing in 20 seconds or less.
Assumptions
- The spec is correct.
- The environment can reorder requests, but does not duplicate them
and only drops them in the presence of a crash.
- Toolchain is correct (Dafny chain, C++ chain).
Overall Approach
- Design three state machines
- The program state machine: specifies correctness (that can be proven
with Hoare logic) for an optimized, imperative program.
- Environment state machine: encode assumptions about the outside world.
- IOSystem: composes the first two and specifies how they interact.
- Prove that the IOSystem refines a high level application spec.
- Show that the interactions with the outside world match those
allowed in the abstract program state machine.
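To make the refinement idea above concrete for myself, here is a minimal sketch in Python. It is not the paper's Dafny: every name (`Impl`, `abstraction`, `spec_next`, `check_refinement`) is hypothetical, and it only checks one execution at runtime, whereas the paper proves the property for all executions. The toy implementation buffers writes in memory, the spec is a plain map, and each implementation step must correspond to a legal spec step or a stutter.

```python
from dataclasses import dataclass, field


def spec_next(old: dict, new: dict) -> bool:
    """Spec (assumed): a step is a stutter or a change to exactly one key."""
    keys = set(old) | set(new)
    changed = [k for k in keys if old.get(k) != new.get(k)]
    return len(changed) <= 1


@dataclass
class Impl:
    """Toy implementation state: a durable store plus an in-memory buffer."""
    store: dict = field(default_factory=dict)
    buffer: dict = field(default_factory=dict)

    def put(self, key, value):
        # Writes land in the buffer first.
        self.buffer[key] = value

    def flush(self):
        # Move buffered writes into the store; the abstract map is unchanged.
        self.store.update(self.buffer)
        self.buffer.clear()


def abstraction(impl: Impl) -> dict:
    """Interpretation function: the map that this implementation state denotes."""
    view = dict(impl.store)
    view.update(impl.buffer)  # buffered writes shadow stored ones
    return view


def check_refinement(steps) -> bool:
    """Run the steps and check that each one maps to a legal spec step."""
    impl = Impl()
    for name, args in steps:
        before = abstraction(impl)
        getattr(impl, name)(*args)
        after = abstraction(impl)
        if not spec_next(before, after):
            return False
    return True
```

In this sketch `put` shows up as a one-key spec step and `flush` as a stutter, which is the shape of argument the refinement proof makes.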
The IOSystem State Machine
- Three possible state transitions:
- Transition 'a step forward' possibly interacting with the bus
- Disk can process a read/write command.
- Can crash (fail stop).
- Top level proof: IOSystem is a refinement of VeriBetrKV's API spec (a map)
- Question: ensures that it never returns incorrect data, but given
the allowance for random corruption, can't you lose data?
- Multiple disk models:
- For the journal: array of journal entries
- Bε tree: collection of nodes
- Lowest level: byte-oriented
- Question: How do you reconcile the models with the realities of how
disks work? Are journal entries block aligned? How do you avoid corruption
of journal entries on writes?
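The three IOSystem transitions listed above can be sketched as a toy model. This is my own simplification with hypothetical names (`Program`, `Disk`, `IOSystem`), not the paper's definitions: the "bus" is just a queue of pending writes, the disk may service them in any order (matching the reordering assumption), and a crash wipes volatile state and drops in-flight requests.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Disk:
    blocks: dict = field(default_factory=dict)  # durable contents
    queue: list = field(default_factory=list)   # in-flight write requests


@dataclass
class Program:
    cache: dict = field(default_factory=dict)   # volatile in-memory state


@dataclass
class IOSystem:
    program: Program = field(default_factory=Program)
    disk: Disk = field(default_factory=Disk)

    def program_step(self, key, value):
        """Transition 1: the program steps and issues a write on the bus."""
        self.program.cache[key] = value
        self.disk.queue.append((key, value))

    def disk_step(self, rng):
        """Transition 2: the disk processes one pending request, in any order."""
        if self.disk.queue:
            i = rng.randrange(len(self.disk.queue))
            key, value = self.disk.queue.pop(i)
            self.disk.blocks[key] = value

    def crash(self):
        """Transition 3: fail-stop. Volatile state and queued requests vanish."""
        self.program = Program()
        self.disk.queue.clear()
```

A run such as two `program_step` calls, one `disk_step(random.Random(0))`, and then `crash()` leaves exactly one of the two writes on disk, which is the kind of state the top-level proof has to show is still consistent with the map spec.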
A Different Performance Eval
- (I love this part)
- Verification workflow must: use developer time efficiently!
- Minimize tedious typing
- Verification must return/respond quickly
- The key is a balance between exploiting automation and controlling it.
- I think Table 7.1 is Section 7, Table 1?
Reading Dafny
- arrays are mutable; seq are not (both rely on garbage collection)
- Linear extensions developed in this work (neither garbage collected nor reference counted)
- linear seq: non-aliased, mutable
- shared seq: immutable but possibly aliased
- Other extensions
- Linear fields in data structures
- linear elements in sequences
- linear to ordinary references
- ordinary to linear references via trusted class BoxedLinear (puts linear
values in ordinary objects)
- Also built a new C++ backend for Dafny
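A rough way to picture the linear/shared distinction above, assuming I have understood it correctly. Python cannot check linearity statically the way the Dafny extension does, so this sketch (the class `LinearSeq` and its methods are hypothetical) enforces "one live handle" at runtime instead: updating consumes the old handle and reuses the storage in place, while `share` hands out an immutable view that may be aliased freely.

```python
class LinearSeq:
    """Runtime stand-in for a linear sequence: one live handle, in-place updates."""

    def __init__(self, items):
        self._items = list(items)
        self._live = True

    def _check(self):
        if not self._live:
            raise RuntimeError("linear value used after it was consumed")

    def update(self, index, value):
        """Consume this handle and return a new handle over the same storage."""
        self._check()
        self._live = False
        successor = LinearSeq.__new__(LinearSeq)
        successor._items = self._items  # no copy: the storage is reused
        successor._items[index] = value
        successor._live = True
        return successor

    def share(self):
        """Return an immutable view (a tuple copy here, for simplicity)."""
        self._check()
        return tuple(self._items)

    def free(self):
        """Explicitly release the value; there is no garbage collector to do it."""
        self._check()
        self._live = False
        self._items = None
```

Using the old handle after `update` raises here at runtime; in the paper's setting that mistake would be rejected before the program ever runs.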
VeriBetrKV
- COW Bε tree w/logical journal
- Accumulate updates in memory like an LSM and then write in batches
- Large nodes (1-4 MB)
- Their parameters: 2 MB nodes (disk), 128KB nodes (flash), fanout=8
- Three versions of the tree
- Inserts go to ephemeral tree (no updates?)
- frozen tree is being made durable
- persistent tree is durable
- Nodes are identified via an indirection table that maps logical node numbers
to actual nodes/blocks
- Sync (checkpoint) is three steps
- Flush all dirty nodes (remember COW)
- Flush indirection table
- Write superblock that points to indirection table
- Journal (logically) all updates
- On user-initiated sync: flush journal
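The three-step sync above can be sketched as follows. This is a toy model under my own assumptions, not VeriBetrKV's code: the class `ToyStore`, its fields, and the `crash_before_superblock` flag are all hypothetical, nodes are opaque values rather than tree nodes, and the journal is left out. The point is only that copy-on-write plus an indirection table makes the superblock write the single commit point.

```python
class ToyStore:
    def __init__(self):
        self.disk_nodes = {}    # block address -> node contents
        self.disk_tables = {}   # table address -> {node id: block address}
        self.superblock = None  # address of the persistent indirection table
        self.ephemeral = {}     # node id -> contents (in memory)
        self.dirty = set()      # node ids changed since the last sync
        self.next_addr = 0

    def _alloc(self):
        addr = self.next_addr
        self.next_addr += 1
        return addr

    def write_node(self, node_id, contents):
        self.ephemeral[node_id] = contents
        self.dirty.add(node_id)

    def sync(self, crash_before_superblock=False):
        # Freeze a snapshot of the ephemeral state.
        frozen = dict(self.ephemeral)
        frozen_dirty = set(self.dirty)
        new_table = dict(self.disk_tables.get(self.superblock, {}))
        # Step 1: flush dirty nodes to fresh blocks (copy-on-write).
        for node_id in frozen_dirty:
            addr = self._alloc()
            self.disk_nodes[addr] = frozen[node_id]
            new_table[node_id] = addr
        # Step 2: flush the new indirection table.
        table_addr = self._alloc()
        self.disk_tables[table_addr] = new_table
        if crash_before_superblock:
            return  # simulated crash: the old superblock still rules
        # Step 3: write the superblock that points at the new table.
        self.superblock = table_addr
        self.dirty -= frozen_dirty

    def recover(self):
        """What a reader sees after a crash: the persistent tree only."""
        table = self.disk_tables.get(self.superblock, {})
        return {nid: self.disk_nodes[addr] for nid, addr in table.items()}
```

If the simulated crash hits before step 3, `recover()` still returns the previous persistent state, because nothing reachable from the old superblock was overwritten.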
Eval
- Three questions (they said two)
- Have they improved developer experience?
- Can the verification scale?
- Does the verified system perform?
- The Developer experience
- Introduce the tedium metric: lines of proof:lines of code (but they
gave you raw numbers w/out the metric in the table!)
- Also make a scalability argument: tedium is comparable to an earlier
project that was 1/3 the size.
- Spec is 1/5 the implementation (maybe that means fewer bugs?)
- Interactive = < 10 seconds ... (says who? but for verification that
does sound pretty darned good)
- Performance: does it achieve write performance? how do the linear
extensions work?
- Zippo details on how BDB was configured, what version was running, etc.
- Oh man: the 25x number they cite in the abstract is one particular
case: load on HDD. On SSD BDB is only 2x slower (and is not write optimized).
Also note that you would never do the load the way they did; you would use
the bulk load utility that BDB provides.
- And the prose never mentions that BDB is way faster than both RocksDB
and VeriBetrKV on queries.
- Here is the deal: if they had actually made fair comparisons it would not
have detracted from their work at all -- what they did was still cool.
However, because they so badly summarized their results, they run the risk
that people will accuse them of overselling and being dishonest.
- And due to the item above, I was so annoyed that I couldn't care less
about how well their linearization performed, which arguably, is way more
important.
Margo's Pet Peeve
- Just from reading the introduction I can pretty much guarantee that
they used BDB out of the box with a microscopic buffer cache.
- I looked at their artifact: they simply used the C++ STL BDB code,
so yeah, did not configure at all. This is sloppy/sloppy/sloppy!
- Also, they are comparing a write-optimized system (theirs) to a
read-optimized system (ours) using writes. The shepherd should have
caught that. Oh and notice that in Figure 6, BDB kicks the crap out of
every other system, but no one mentions that in the abstract; no they
cite the case for which they optimize and we do not. (A better comparison
would have been WiredTiger's LSM KV store.)
- Oh right, and remember that they haven't implemented much of the
hard stuff (e.g., multi-threading). And I bet they don't support things
like: forward/backward traversal, partial key match, duplicates, etc,
etc, etc.
- Oops -- and no transactions, it is crash consistent as most of its
IOs are synchronous.
- OMG -- and that 25x number in the abstract is on HDD; on an SSD they
are only 2x faster and BDB isn't write optimized. This really infuriated
me.