Mickens: Blizzard: Fast, Cloud-scale Block Storage for
Mickens, Nightingale, Elson, Gehring, Fan, Kadav, Chidambaram, Khan
What kind of paper is this?
- Describes a system: block-based storage that provides POSIX API access
(and almost POSIX semantics).
- Applications use the file system (POSIX).
- Cloud services provide blob access.
- Blobs don't work so well for applications.
- However, if you just naively implement POSIX on top of most cloud APIs,
you get abysmal performance.
- We introduce Blizzard, which forgoes some of the synchrony
guarantees and provides eventual durability, but as a result gives
super high performance for POSIX apps on top of a cloud.
- Users and applications all live happily ever after.
What they did
- Present a virtual disk: looks like a SATA drive; performs like
a massively parallel IO system (for the right workload).
- Built on top of network infrastructure with 1: no oversubscription
and 2: balanced IO and network bandwidth.
- High performance and consistency
- Epoch-based ordered writes
- Asynchronous IO (so writes are not immediately durable, but are acked
- Post-crash, clients always see a state where all writes from epoch N-1
are present and some from N, but none from N+1.
- Pick level of disk parallelism by selecting segment size (more on this
- Run unmodified POSIX applications
- Provide "big-data" performance.
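The post-crash guarantee above can be sketched as a small checker. This is
only an illustration, assuming every write is tagged with the epoch in which
it was issued; the function name and the (epoch, write_id) representation
are hypothetical and not taken from the paper.

```python
def is_prefix_consistent(writes, survived):
    """Check the epoch guarantee on a post-crash disk state.

    writes:   list of (epoch, write_id) pairs, in issue order (hypothetical format).
    survived: set of write_ids that are on disk after the crash.

    The state is acceptable if, for the newest epoch N with any surviving
    write, every write from every epoch before N also survived.
    """
    surviving_epochs = [epoch for epoch, wid in writes if wid in survived]
    if not surviving_epochs:
        return True  # nothing survived: the empty prefix is still a prefix
    newest = max(surviving_epochs)
    return all(wid in survived for epoch, wid in writes if epoch < newest)


if __name__ == "__main__":
    log = [(0, "a"), (0, "b"), (1, "c"), (1, "d"), (2, "e")]
    print(is_prefix_consistent(log, {"a", "b", "c"}))  # all of epoch 0, some of epoch 1
    print(is_prefix_consistent(log, {"a", "b", "e"}))  # epoch 2 visible, epoch 1 missing
```

The first call describes a legal recovery state; the second does not, because
a write from epoch 2 is visible while epoch 1 is incomplete.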
Key Enabling Design Point and Tradeoff
- You get high performance by striping.
- If you stripe within a rack, you get low communication overhead, but
- If you stripe across racks you get much higher scalability.
- Build atop FDS: Flat datacenter storage, blob scale storage that
provides the network-IO balance -- this means that you're guaranteed
to be able to get the full bandwidth to a remote disk (makes everything
- Blizzard Virtual disk = FDS blob
- Blobs broken into 8 MB tracts; a segment is a group of tracts.
- Virtual disks striped across (64 or 128) tracts.
- Nested Striping avoids convoys and dilation.
- Convoy: a stream of sequential IOs that hit the same disk at the same time.
(Because a file system might request N file system blocks that could fall in
the same tract.)
- Dilation: This occurs as a series of requests travels from client to
server -- even if the requests are sequential and could be combined, due
to network delays they might be received spaced in time such that they
cannot be gathered together into a single IO.
- Nested Striping: stripe data across tracts so that you get enormous
parallelism on lots of disks. Size of stripes lets you control the degree
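One way to picture the striping described above is a mapping from a virtual
disk block number to a (tract, byte offset) pair. This is a rough sketch, not
the paper's code: it assumes round-robin placement of consecutive blocks
across the tracts of a segment, and the names locate_block and
tracts_per_segment are made up for illustration. Only the 8 MB tract size
comes from the notes.

```python
TRACT_BYTES = 8 * 1024 * 1024  # 8 MB tracts, as in the notes


def locate_block(block_index, block_bytes, tracts_per_segment):
    """Map a virtual-disk block to (global tract index, byte offset in tract).

    Assumption: consecutive blocks inside a segment go round-robin across
    that segment's tracts, so sequential IO fans out over many tracts.
    """
    blocks_per_tract = TRACT_BYTES // block_bytes
    blocks_per_segment = blocks_per_tract * tracts_per_segment
    segment, within = divmod(block_index, blocks_per_segment)
    slot_in_tract, tract_in_segment = divmod(within, tracts_per_segment)
    tract = segment * tracts_per_segment + tract_in_segment
    return tract, slot_in_tract * block_bytes


if __name__ == "__main__":
    kb128 = 128 * 1024
    # Segment size 64: four sequential blocks land on four different tracts.
    print([locate_block(i, kb128, 64)[0] for i in range(4)])
    # Segment size 1: the same four blocks all hit one tract (a convoy).
    print([locate_block(i, kb128, 1)[0] for i in range(4)])
```

With a segment size of 1 the sequential blocks pile onto a single tract, which
is the convoy case; a larger segment spreads them out.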
- Write through: exactly what you expect -- slow but durable.
- Flush-epoch commits with fast acks: Ack immediately, but only write
on epoch barriers. Issue all writes from prior epochs that are in the
write queue. Eventual Durability -- a recovered client sees a
[Epoch is the time between two sync operations.]
- Out of order commits with fast acks: ack and issue writes immediately.
However, using a log-structured store, this approach can still provide
prefix consistency. (Implementation is rather complex -- is it worth it?)
Very much a log-structured file system approach with a linear congruential
generator that generates a pattern that is not sequential so that you can
still obtain parallelism from the tracts that comprise a segment.
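The linear congruential generator idea can be sketched as follows. This is
only an illustration of the general technique, not the generator the paper
uses: the constants a=5 and c=3, the power-of-two slot count, and the name
lcg_slots are all assumptions. With a power-of-two modulus, an odd increment,
and a multiplier congruent to 1 mod 4, the generator visits every slot
exactly once before repeating (the Hull-Dobell full-period conditions), so
the log fills every slot but in a non-sequential order.

```python
def lcg_slots(num_slots, a=5, c=3, seed=0):
    """Yield each log slot in [0, num_slots) exactly once, in LCG order.

    Assumptions for this sketch: num_slots is a power of two, a % 4 == 1,
    and c is odd, which together give the generator a full period.
    """
    if num_slots < 1 or num_slots & (num_slots - 1):
        raise ValueError("num_slots must be a power of two in this sketch")
    if a % 4 != 1 or c % 2 != 1:
        raise ValueError("need a % 4 == 1 and odd c for a full period")
    x = seed % num_slots
    for _ in range(num_slots):
        yield x
        x = (a * x + c) % num_slots


if __name__ == "__main__":
    order = list(lcg_slots(8))
    print(order)                             # [0, 3, 2, 5, 4, 7, 6, 1]
    print(sorted(order) == list(range(8)))   # True: every slot used once
```

Consecutive log writes therefore land on scattered slots rather than
adjacent ones, which is what lets them hit different tracts in parallel.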
- Virtual disk = kernel mode SATA driver + user mode interface to FDS.
- Costs one more up/down (kernel to user) switch than you would incur
going to a normal disk.
- Highly concurrent and asynchronous
- What questions do they try to answer?
- What latency do applications see?
- How much bandwidth do applications get?
- Why does increasing the block size from 128 KB to 256 KB improve
performance for segment size = 1, but not for segment sizes 64 and 128 (for
which it actually hurts performance)? I wanted to see an explanation.
A 128 KB block should be striped across fewer tracts than a 256 KB one, so
it's not obvious why one is slower.
- Similarly, they show us that Blizzard is faster than EBS, but
don't provide much insight into why.
- Macrobenchmark performance: impressive, but another black box (I'm
seeing a trend here and it makes me sad).
- Nice that they show that they really need their network configuration.