Mickens: Blizzard: Fast, Cloud-scale Block Storage for
Cloud-oblivious Applications
Mickens, Nightingale, Elson, Gehring, Fan, Kadav, Chidambaram, Khan
What kind of paper is this?
- Describes a system: block-based storage that provides POSIX API access
(and almost POSIX semantics).
The Story
- Applications use the file system (POSIX).
- Cloud services provide blob access.
- Blobs don't work so well for applications.
- However, if you just naively implement POSIX on top of most cloud APIs,
you get abysmal performance.
- We introduce Blizzard, which forgoes some synchrony guarantees and
provides eventual durability, but as a result gives super high performance
for POSIX apps on top of a cloud.
- Users and applications all live happily ever after.
What they did
- Present a virtual disk: looks like a SATA drive; performs like a massively
parallel IO system (for the right workload).
- Built on top of network infrastructure with (1) no oversubscription and
(2) balanced IO and network bandwidth.
- High performance and consistency
- Epoch-based ordered writes
- Asynchronous IO (so writes are not immediately durable, but are acked
immediately).
- Post-crash, clients always see a state where all writes from epochs up to
N-1 are present, some from epoch N, and none from any later epoch (a sketch
of this mechanism follows this list).
- Pick level of disk parallelism by selecting segment size (more on this
later).
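A minimal sketch (my reconstruction, not the paper's code) of epoch-based
ordered writes with fast acks: writes are acked immediately and tagged with
the current epoch, and a write is released to the backing store only once
every write from earlier epochs is durable. The issue_to_fds callback-style
interface and all names here are hypothetical; the point is that any crash
state then contains all writes from epochs before the one currently being
issued, some from that epoch, and none from later ones.

    from collections import defaultdict, deque

    class EpochWriteBuffer:
        def __init__(self, issue_to_fds):
            self.issue_to_fds = issue_to_fds     # async write to backing store (hypothetical API)
            self.current_epoch = 0               # an epoch ends at each flush/sync
            self.oldest_unissued = 0             # lowest epoch whose writes may be issued
            self.pending = defaultdict(deque)    # epoch -> buffered (offset, data) writes
            self.outstanding = defaultdict(int)  # epoch -> writes issued but not yet durable

        def write(self, offset, data):
            # Fast ack: the write is buffered (and possibly issued), not yet durable.
            self.pending[self.current_epoch].append((offset, data))
            self._drain()
            return "ACK"

        def flush(self):
            # Close the current epoch; subsequent writes belong to the next epoch.
            self.current_epoch += 1
            self._drain()

        def _drain(self):
            # Release writes in epoch order: epoch e is issued only when every
            # write from epochs earlier than e has been made durable.
            while self.oldest_unissued <= self.current_epoch:
                e = self.oldest_unissued
                while self.pending[e]:
                    offset, data = self.pending[e].popleft()
                    self.outstanding[e] += 1
                    self.issue_to_fds(offset, data, lambda ep=e: self._durable(ep))
                if e < self.current_epoch and self.outstanding[e] == 0:
                    self.oldest_unissued += 1    # epoch e fully durable; move on
                else:
                    break                        # wait for completions, or epoch still open

        def _durable(self, epoch):
            self.outstanding[epoch] -= 1
            self._drain()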
Goals
- Run unmodified POSIX applications
- Provide "big-data" performance.
Key Enabling Design Point and Tradeoff
- You get high performance by striping.
- If you stripe within a rack, you get low communication overhead, but if you
stripe across racks you get much higher scalability (a rough bandwidth sketch
follows this list).
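A back-of-the-envelope sketch (my numbers, not the paper's): with no
oversubscription and balanced IO/network bandwidth, aggregate throughput
grows with the number of disks you stripe across until the client's NIC
becomes the bottleneck.

    def aggregate_bandwidth_mb_s(num_disks, per_disk_mb_s, client_nic_mb_s):
        # Striping across more disks adds bandwidth until the NIC caps it.
        return min(num_disks * per_disk_mb_s, client_nic_mb_s)

    # Hypothetical values: 100 MB/s disks, a 10 Gb/s (~1250 MB/s) client NIC.
    print(aggregate_bandwidth_mb_s(1, 100, 1250))     # 100: one disk is the limit
    print(aggregate_bandwidth_mb_s(128, 100, 1250))   # 1250: the NIC is now the limit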
Design
- Built atop FDS (Flat Datacenter Storage): blob-scale storage that provides
the network-IO balance -- this means you're guaranteed to get the full
bandwidth of a remote disk (which makes everything locality oblivious).
- Blizzard Virtual disk = FDS blob
- Blobs are broken into 8 MB pieces (tracts).
- Virtual disks striped across (64 or 128) tracts.
- Nested striping avoids convoys and dilation.
- Convoy: a stream of sequential IOs that all hit the same disk at the same time.
(This happens because a file system might request N sequential file system
blocks that all fall in the same tract.)
- Dilation: this occurs as a series of requests travels from client to
server -- even if the requests are sequential and could be combined, network
delays may space them out in time so that they cannot be gathered together
into a single IO.
- Nested striping: stripe data across tracts so that you get enormous
parallelism across lots of disks; the stripe (segment) size lets you control
the degree of parallelism (a mapping sketch follows this list).
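A minimal sketch of nested striping (my illustration, not necessarily the
paper's exact layout): stripe the virtual disk over the segment's tracts at a
small stripe-unit granularity, so sequential IOs round-robin across many
tracts (and hence disks) instead of forming a convoy on one tract. The tract
size comes from FDS; the other constants are hypothetical.

    TRACT_SIZE = 8 * 1024 * 1024     # 8 MB FDS tracts
    STRIPE_UNIT = 128 * 1024         # hypothetical stripe unit (one 128 KB block)
    SEGMENT_SIZE = 128               # number of tracts the virtual disk is striped over
    UNITS_PER_TRACT = TRACT_SIZE // STRIPE_UNIT   # 64 stripe units fit in one tract

    def map_virtual_offset(offset):
        """Map a virtual-disk byte offset to (tract index, offset within that tract)."""
        unit = offset // STRIPE_UNIT          # which stripe unit the byte falls in
        tract = unit % SEGMENT_SIZE           # round-robin stripe units over tracts
        unit_in_tract = unit // SEGMENT_SIZE  # stripe units already packed into this tract
        return tract, unit_in_tract * STRIPE_UNIT + offset % STRIPE_UNIT

    # Sequential 128 KB requests touch 128 distinct tracts before reusing any;
    # with naive striping at tract granularity, the first UNITS_PER_TRACT (64)
    # requests would all have landed in tract 0, i.e. a convoy.
    for i in range(4):
        print(map_virtual_offset(i * STRIPE_UNIT))   # (0, 0), (1, 0), (2, 0), (3, 0)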
Write Semantics
- Write through: exactly what you expect -- slow but durable.
- Flush-epoch commits with fast acks: ack immediately, but release writes in
epoch order -- a write from epoch N is issued only once all writes from
earlier epochs are durable. Eventual durability -- a recovered client sees a
consistent prefix (this is the mechanism sketched after the "What they did"
list above).
[An epoch is the interval between two sync/flush operations.]
- Out-of-order commits with fast acks: ack and issue writes immediately.
However, by using a log-structured store, this approach can still provide
prefix consistency. (The implementation is rather complex -- is it worth it?)
It is very much a log-structured file system approach, with a linear
congruential generator producing a non-sequential allocation pattern so that
you still obtain parallelism from the tracts that comprise a segment (a
sketch follows this list).
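A minimal sketch (my constants, not the paper's) of how a full-period linear
congruential generator can pick the next log-write location: the LCG visits
every slot exactly once per cycle, but in a scrambled rather than sequential
order, so consecutive log appends land on different tracts and the segment's
disks stay busy in parallel.

    NUM_SLOTS = 1 << 20    # hypothetical number of block-sized slots in the log
    A = 4 * 12345 + 1      # Hull-Dobell conditions for a power-of-two modulus:
    C = 2468533            #   a % 4 == 1 and c odd  =>  a full-period permutation

    def next_slot(slot):
        # Each call yields a new, non-sequential slot; over NUM_SLOTS calls,
        # every slot is visited exactly once.
        return (A * slot + C) % NUM_SLOTS

    # Walk a few allocations: sequential appends scatter across the virtual
    # address space, and therefore across the tracts/disks backing the segment.
    slot = 0
    for _ in range(5):
        print(slot)
        slot = next_slot(slot)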
Implementation
- Virtual disk = kernel mode SATA driver + user mode interface to FDS.
- Adds one extra up/down (kernel to user) transition compared with going to
a normal disk.
- Highly concurrent and asynchronous
Evaluation
- What questions do they try to answer?
- What latency do applications see?
- How much bandwidth do applications get?
- Why does increasing the block size from 128 KB to 256 KB improve
performance for segment size = 1, but not for segment sizes 64 and 128 (where
it actually hurts performance)? I wanted to see an explanation. A 128 KB
block should be striped across fewer tracts than a 256 KB one, so it's not
obvious why one is slower than the other.
- Similarly, they show us that Blizzard is faster than EBS, but
don't provide much insight into why.
- Macrobenchmark performance: impressive, but another black box (I'm
seeing a trend here and it makes me sad).
- Nice that they show that they really need their network configuration.