Hance: Storage Systems are Distributed Systems (So Verify Them That Way!)

Hance, Lattuada, Hawblitzel, Howell, Johnson, Parno (2020)

What kind of paper is this?

Best of both worlds: generalizing a methodology for verifying distributed systems and applying it to storage systems.
Goal: increase efficiency by reducing wait time and the need to write too many tedious proofs -- balance automation...

The Story

Once upon a time, storage systems used simple data structures that were easy to reason about and verify. However, they suffered performance overheads for random-insertion workloads. Storage systems then switched to LSM-trees and B^ε trees. This improved performance, but it also made the code drastically more complex and made verifying it extraordinarily difficult and slow, reducing programmer productivity. The authors thought: wait a minute, aren't storage systems just distributed systems in disguise? Using that insight, they generalized IronFleet's proof methodology for verifying distributed systems and applied it to a storage system. They specified, built, and used Dafny to verify VeriBetrKV, a key-value store using B^ε trees and journaling. Along the way, the authors developed techniques to quickly evaluate the correctness of code and proofs using modularization, resulting in 99% of the proofs finishing in 20 seconds or less.

Assumptions

The spec is correct.
The environment can reorder requests, but does not duplicate them and only drops them in the presence of a crash.
Toolchain is correct (Dafny chain, C++ chain).

Overall Approach

Design three state machines
- The program state machine: specifies correctness (that can be proven with Hoare logic) for an optimized, imperative program.
- Environment state machine: encode assumptions about the outside world.
- IOSystem: composes the first two and specifies how they interact.
Prove that the IOSystem refines a high level application spec.
Show that the interactions with the outside world match those allowed in the abstract program state machine.
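
The refinement obligation above can be illustrated with a toy, finite check. This is only a sketch, not the paper's Dafny proof: the buffered store, the names `abstraction`, `impl_steps`, `spec_allows`, and `check_refinement`, and the tiny key/value domain are all hypothetical, and brute-force enumeration stands in for what a verifier proves for every state.

```python
from itertools import product

KEYS = ("a", "b")      # hypothetical tiny domain so the search stays finite
VALUES = (0, 1)

# Implementation state: (base, buffer), each a frozenset of (key, value) pairs.
def abstraction(state):
    """Map an implementation state to the abstract map it represents."""
    base, buffer = state
    view = dict(base)
    view.update(dict(buffer))   # buffered writes shadow the base
    return view

def impl_steps(state):
    """Yield (label, next_state) for every implementation transition."""
    base, buffer = state
    for k, v in product(KEYS, VALUES):
        new_buffer = dict(buffer)
        new_buffer[k] = v
        yield ("put", k, v), (base, frozenset(new_buffer.items()))
    merged = dict(base)
    merged.update(dict(buffer))
    yield ("flush",), (frozenset(merged.items()), frozenset())

def spec_allows(label, before, after):
    """The abstract spec is a map: puts update it, internal steps must stutter."""
    if label[0] == "put":
        _, k, v = label
        expected = dict(before)
        expected[k] = v
        return after == expected
    return after == before

def check_refinement(initial, max_depth):
    """Check that every reachable implementation step is a legal spec step."""
    seen, frontier = {initial}, [initial]
    for _ in range(max_depth):
        next_frontier = []
        for s in frontier:
            for label, t in impl_steps(s):
                if not spec_allows(label, abstraction(s), abstraction(t)):
                    return False
                if t not in seen:
                    seen.add(t)
                    next_frontier.append(t)
        frontier = next_frontier
    return True
```

Calling `check_refinement((frozenset(), frozenset()), 4)` explores the toy state space from an empty store. The point of the shape is that `flush` changes the implementation state but not the abstract map, which is exactly what a stuttering step in a refinement proof looks like.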

The IOSystem State Machine

Three possible state transitions:
1. Transition 'a step forward' possibly interacting with the bus
2. Disk can process a read/write command.
3. Can crash (fail stop).
Top level proof: IOSystem is a refinement of VeriBetrKV's API spec (a map)
Question: ensures that it never returns incorrect data, but given the allowance for random corruption, can't you lose data?
Multiple disk models:
1. For the journal: array of journal entries
2. B^ε tree: collection of nodes
3. Lowest level: byte-oriented
Question: How do you reconcile the models with the realities of how disks work? Are journal entries block aligned? How do you avoid corruption of journal entries on writes?
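
The three IOSystem transitions, together with the environment assumption that requests can be reordered but are dropped only on a crash, can be sketched as a small simulation. This is an illustrative model under my own assumptions, not the paper's definition: the class layout and the names `program_step`, `disk_step`, and `crash` are hypothetical, and only writes are modeled.

```python
from dataclasses import dataclass, field

@dataclass
class IOSystem:
    disk: dict = field(default_factory=dict)       # durable: address -> value
    in_flight: list = field(default_factory=list)  # write requests on the bus
    memory: dict = field(default_factory=dict)     # volatile program state

    def program_step(self, addr, value):
        """The program takes a step and puts a write request on the bus."""
        self.memory[addr] = value
        self.in_flight.append((addr, value))

    def disk_step(self, index):
        """The disk processes any one in-flight request, so order may change."""
        addr, value = self.in_flight.pop(index)  # removed exactly once: no duplication
        self.disk[addr] = value

    def crash(self):
        """Fail-stop: volatile state and unprocessed requests are lost."""
        self.memory.clear()
        self.in_flight.clear()

system = IOSystem()
system.program_step(0, "x")
system.program_step(1, "y")
system.disk_step(1)   # the second request is processed first (reordering)
system.crash()        # the first request is dropped; only a crash can do this
```

After this run the disk holds only the write to address 1. That makes the data-loss question above concrete: the model permits losing writes that were still in flight at a crash, so any durability guarantee has to be stated relative to what the disk had acknowledged.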

A Different Performance Eval

(I love this part)
Verification workflow must: use developer time efficiently!
Minimize tedious typing
Verification must return/respond quickly
The key is a balance between exploiting automation and controlling it.
I think Table 7.1 is Section 7, Table 1?

Reading Dafny

arrays are mutable; seq are not (both rely on garbage collection)
Linear extensions developed in this work (neither garbage collected nor reference counted)
1. linear seq: non-aliased, mutable
2. shared seq: immutable but possibly aliased
Other extensions
- Linear fields in data structures
- linear elements in sequences
- linear to ordinary references
- ordinary to linear references via trusted class BoxedLinear (puts linear values in ordinary objects)
Also built a new C++ backend for Dafny
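
To make the "non-aliased, mutable" idea of a linear seq concrete, here is a rough runtime analogy. It is not how the Dafny extension works: per these notes the linear discipline is part of the language, whereas this sketch only detects misuse when it happens. The class `LinearSeq` and its methods are hypothetical names.

```python
class LinearSeq:
    """Runtime stand-in for a linear sequence: exactly one live handle."""

    def __init__(self, items):
        self._items = list(items)
        self._live = True

    def _check(self):
        if not self._live:
            raise RuntimeError("use of a consumed linear value")

    def get(self, i):
        self._check()
        return self._items[i]

    def update(self, i, value):
        """Consume this handle and return a fresh one over the same buffer.

        Because the old handle is dead afterwards, mutating in place is safe:
        nobody else can observe the change.
        """
        self._check()
        self._live = False
        self._items[i] = value            # in-place write, no copy
        fresh = LinearSeq.__new__(LinearSeq)
        fresh._items = self._items
        fresh._live = True
        return fresh

s = LinearSeq([1, 2, 3])
t = s.update(0, 9)    # s is consumed; t owns the buffer
first = t.get(0)      # reads the updated element through the live handle
# s.get(0) would now raise RuntimeError, since s was consumed
```

The design point is that single ownership lets an update look functional (old value in, new value out) while compiling to an in-place write, without garbage collection or reference counting.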

VeriBetrKV

COW B^ε tree w/logical journal
- Accumulate updates in memory like an LSM and then write in batches
- Large nodes (1-4 MB)
- Their parameters: 2 MB nodes (disk), 128KB nodes (flash), fanout=8
Three versions of the tree
1. Inserts go to ephemeral tree (no updates?)
2. frozen tree is being made durable
3. persistent tree is durable
Nodes are identified via an indirection table that maps logical node numbers to actual nodes/blocks
Sync (checkpoint) is three steps
1. Flush all dirty nodes (remember COW)
2. Flush indirection table
3. Write superblock that points to indirection table
Journal (logically) all updates
On user-initiated sync: flush journal
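
The three-step sync can be sketched as follows. This is a simplified model under my own assumptions, not VeriBetrKV's code: the class `CowStore`, its fields, and the `crash_before_superblock` flag are hypothetical, and the journal, the frozen/ephemeral distinction, and reclaiming old blocks are all omitted.

```python
class CowStore:
    def __init__(self):
        self.blocks = {}        # the "disk": address -> contents
        self.next_addr = 1
        self.superblock = None  # address of the durable indirection table
        self.table = {}         # in-memory indirection table: node id -> address
        self.dirty = {}         # node id -> contents not yet written to disk

    def _alloc(self, contents):
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = contents
        return addr

    def write_node(self, node_id, contents):
        self.dirty[node_id] = contents

    def sync(self, crash_before_superblock=False):
        # Step 1: flush dirty nodes to fresh blocks (copy-on-write, no overwrite).
        for node_id, contents in self.dirty.items():
            self.table[node_id] = self._alloc(contents)
        self.dirty.clear()
        # Step 2: flush the indirection table to a fresh block.
        table_addr = self._alloc(dict(self.table))
        if crash_before_superblock:
            return  # simulate stopping before the commit point
        # Step 3: write the superblock; this single write commits the checkpoint.
        self.superblock = table_addr

    def recover(self):
        """Rebuild in-memory state from whatever the superblock points to."""
        self.dirty = {}
        if self.superblock is None:
            self.table = {}
        else:
            self.table = dict(self.blocks[self.superblock])

    def read_durable(self, node_id):
        if self.superblock is None:
            return None
        addr = self.blocks[self.superblock].get(node_id)
        return None if addr is None else self.blocks[addr]

store = CowStore()
store.write_node("root", "v1")
store.sync()
store.write_node("root", "v2")
store.sync(crash_before_superblock=True)
store.recover()
```

After the interrupted second sync, `store.read_durable("root")` still returns `"v1"`: the new node and the new table were written to fresh blocks, but nothing reachable from the old superblock was touched. That ordering is why the superblock write can act as the commit point.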

Eval

Three questions (they said two)
1. Have they improved developer experience?
2. Can the verification scale?
3. Does the verified system perform?
The Developer experience
- Introduce the tedium metric: lines of proof:lines of code (but they gave you raw numbers w/out the metric in the table!)
- Also make a scalability argument: tedium is comparable to an earlier project that was 1/3 the size.
- Spec is 1/5 the implementation (maybe that means fewer bugs?)
- Interactive = < 10 seconds ... (says who? but for verification that does sound pretty darned good)
Performance: does it achieve write performance? how do the linear extensions work?
- Zippo details on how BDB was configured, what version was running, etc.
- Oh man: the 25x number they cite in the abstract is one particular case: load on HDD. On SSD BDB is only 2x slower (and is not write optimized). Also note that you would never do the load the way they did; you would use the bulk load utility that BDB provides.
- And the prose never mentions that BDB is way faster than both RocksDB and VeriBetrKV on queries.
- Here is the deal: if they had actually made fair comparisons it would not have detracted from their work at all -- what they did was still cool. However, because they so badly summarized their results, they run the risk that people will accuse them of overselling and being dishonest.
- And due to the item above, I was so annoyed that I couldn't care less about how well their linearization performed, which, arguably, is way more important.

Margo's Pet Peeve

Just from reading the introduction I can pretty much guarantee that they used BDB out of the box with a microscopic buffer cache.
I looked at their artifact: they simply used the C++ STL BDB code, so yeah, did not configure at all. This is sloppy/sloppy/sloppy!
Also, they are comparing a write-optimized system (theirs) to a read-optimized system (ours) using writes. The shepherd should have caught that. Oh and notice that in Figure 6, BDB kicks the crap out of every other system, but no one mentions that in the abstract; no, they cite the case for which they optimize and we do not. (A better comparison would have been WiredTiger's LSM KV store.)
Oh right, and remember that they haven't implemented much of the hard stuff (e.g., multi-threading). And I bet they don't support things like: forward/backward traversal, partial key match, duplicates, etc, etc, etc.
Oops -- and no transactions, it is crash consistent as most of its IOs are synchronous.
OMG -- and that 25x number in the abstract is on HDD; on an SSD they are only 2x faster and BDB isn't write optimized. This really infuriated me.