The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems
Zhang, Raybuck, Patel, Olynyk, Nelson, Leija, Martinez, Liu, Simpson, Jayakar, Penna, Demoulin, Choudhury, Badam (2021)
What kind of paper is this?
- Unifying technology: there have been a bunch of one-off solutions
in the kernel bypass space; we tried to build a general purpose system
to encompass the range of devices and use cases for which bypass
is appropriate.
- At first blush, this should scare you, because the argument in all the
kernel bypass papers is about specialization, and now we're trying to generalize that work!
That also makes this a 'best of both worlds' paper in that they want the generalization
without sacrificing performance. Using libOSs, the authors achieve this goal.
The Story
- IO devices now run at ns-scale.
- Current operating systems are designed for µs-scale. Existing bypass systems are one-off.
- Demikernel is a new OS architecture and flexible datapath API for general purpose bypass.
- We demonstrate TWO demikernels!
- Life just got way faster.
Requirements
- Heterogeneity: existing protocols function at different levels of
abstraction (e.g., RDMA provides a full network protocol; DPDK provides
a raw NIC interface). This is the obvious response to the trade-off that
HW vendors make: features versus complexity. A datapath OS must address
this diversity.
- Zero-copy coordination:
This requires 1) coping with device address translation (e.g., RDMA
memory registration) and 2) memory management that prevents
modifying/freeing memory that is in use by an offload engine (see the
sketch after this list).
- µs-scale scheduling: Existing systems split scheduling responsibility
among user level, the kernel, and different parts of the stack (e.g., RDMA).
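
A minimal Rust sketch of the second zero-copy requirement: reference
counting keeps a buffer alive while a (purely hypothetical) offload engine
still holds it, so an early application-side free cannot pull memory out
from under an in-flight DMA. `DmaBuffer`, `DmaHandle`, and `submit_dma`
are illustrative names, not Demikernel's API.

    use std::sync::Arc;

    // Hypothetical 4 KB buffer handed to a device for zero-copy I/O.
    struct DmaBuffer(Vec<u8>);

    // The in-flight operation holds its own strong reference to the buffer.
    struct DmaHandle {
        buf: Arc<DmaBuffer>,
    }

    // Stand-in for submitting a zero-copy send/write to an offload engine.
    fn submit_dma(buf: &Arc<DmaBuffer>) -> DmaHandle {
        DmaHandle { buf: Arc::clone(buf) }
    }

    fn main() {
        let buf = Arc::new(DmaBuffer(vec![0u8; 4096]));
        let inflight = submit_dma(&buf);
        // The application drops its handle "too early", but the allocation
        // survives until the device-side reference is also dropped.
        drop(buf);
        assert_eq!(inflight.buf.0.len(), 4096);
    }
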
Demikernel Design Goals
- Simplify µs-scale kernel-bypass system development.
- Portability across heterogeneous devices (RDMA and DPDK NICs, SPDK disks,
programmable devices).
- ns-scale latency overhead.
Demikernel Architecture
- Run in the same process/thread as the application -- protection either
provided by demikernel or the bypass device (fine assumptions for datacenter).
- Cooperative scheduling (i.e., applications run in a tight IO processing loop)
of datapath operations (other stuff goes through a conventional OS).
- (Prototype limitation) Single core scheduling.
- A set of Library OSs -- one for each type of device -- share the same
architecture and APIs.
- New API (PDPIX) centered around an IO queue abstraction. Applications
hand the data/buffer over to demikernel and don't get it back until the IO
completes.
- DMA-capable heap.
PDPIX versus POSIX (only changed what needed changing)
- Libcalls not syscalls
- Queue oriented: no fds, but qds instead (queues are similar to Go channels).
- IO is inherently async: pushing an IO request into the queue returns a
qtoken on which you can choose to wait or not.
- Wait offers much better control than POSIX: you can wait on a specific
qtoken, on any of a set, or on all of a set, and the wait returns the
data directly. No mass wakeup if you have multiple waiters.
- Buffers are passed from application to datapath on push and from datapath
to applications on pop (like Rust's memory ownership model; except that
applications CAN modify buffers that have been turned over to the datapath).
- Demikernel provides use-after-free protection, but not protection
against writes while the buffer is owned by the datapath OS. (The
push/pop/wait flow is sketched below.)
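
A hypothetical Rust-flavored rendering of the PDPIX echo fast path
described above. The real interface is a C libcall API built around qds
and qtokens; the types and signatures here are stand-ins only. The point
is the shape: push/pop return immediately with a qtoken, wait blocks one
coroutine on one token, and buffer ownership moves with each call (Rust's
move semantics enforce exactly the handoff the notes describe).

    // qd: names an I/O queue (not a file); qtoken: names one pending op.
    struct Qd(u32);
    struct QToken(u64);
    struct Buffer(Vec<u8>);

    // Stand-ins for the datapath libcalls (bodies are dummies):
    fn pop(_qd: &Qd) -> QToken { QToken(0) }                // async receive
    fn push(_qd: &Qd, _buf: Buffer) -> QToken { QToken(1) } // async send
    fn wait_pop(_qt: QToken) -> Buffer { Buffer(Vec::new()) }
    fn wait_push(_qt: QToken) {}

    fn echo_loop(qd: Qd) -> ! {
        loop {
            let qt = pop(&qd);       // returns immediately with a qtoken
            let buf = wait_pop(qt);  // datapath hands buffer ownership to app
            let qt = push(&qd, buf); // ownership moves back to the datapath...
            wait_push(qt);           // ...and `buf` is unusable until completion
        }
    }

    fn main() {
        // echo_loop(Qd(0)) would poll forever; shown for shape only.
        let _ = echo_loop;
    }
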
Demikernel Design
- Separate libOS per type of device (i.e., DPDK, RDMA, SPDK).
- Each libOS contains the IO stack for that device, a memory allocator
and a coroutine scheduler.
- Library OSs can be combined (i.e., RDMA and SPDK = RDMAxSPDK).
- Implementation (mostly) in Rust (key features that make this a win:
co-routines, memory ownership, async/await).
IO Processing
- Polling-centered design
- Optimized for the error-free fast path: run each operation to
completion (a minimal sketch follows)
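
A minimal sketch of what a polling, run-to-completion receive loop looks
like: the error-free common case is handled inline with no queuing or
context switch, and a miss falls through to background work. `DeviceQueue`
and its methods are hypothetical stand-ins, not Demikernel's types.

    // Hypothetical device interface; not Demikernel's actual code.
    trait DeviceQueue {
        type Packet;
        fn poll_recv(&mut self) -> Option<Self::Packet>; // non-blocking poll
        fn process(&mut self, pkt: Self::Packet);        // app-level handler
        fn background_work(&mut self);                   // retransmits, ACKs, ...
    }

    // The error-free fast path handles each packet inline, start to
    // finish: no interrupts, no queuing, no context switch.
    fn io_loop<Q: DeviceQueue>(q: &mut Q) -> ! {
        loop {
            match q.poll_recv() {
                Some(pkt) => q.process(pkt), // fast path: run to completion
                None => q.background_work(), // nothing ready: do other work
            }
        }
    }
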
Co-Routines
- Three types
- Fast-path IO processing (one per IO stack) [lower priority]
- Background co-routines (one or more per IO stack) [lower priority]
- Application co-routines (one per blocked token) [highest priority]
- ns-scale latency imposes serious constraints on scheduling
- Hundreds or thousands of co-routines
- No heap allocations in scheduler
- Current implementation is 12 cycles! (a toy scheduler sketch follows)
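
A toy cooperative scheduler showing the flavor: a round-robin polling
executor over Rust coroutines with a no-op waker (in a pure polling design
there is nothing to wake). This is an illustration, not Demikernel's
scheduler; the real one adds the priorities above and keeps heap
allocation off the scheduling path entirely (here, boxing happens once at
spawn time and the polling loop itself allocates nothing).

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, RawWaker, RawWakerVTable, Waker};

    // A waker that does nothing: polling replaces wakeups.
    fn noop_waker() -> Waker {
        fn clone(_: *const ()) -> RawWaker {
            RawWaker::new(std::ptr::null(), &VTABLE)
        }
        fn noop(_: *const ()) {}
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
    }

    struct Scheduler {
        tasks: Vec<Pin<Box<dyn Future<Output = ()>>>>,
    }

    impl Scheduler {
        // Boxing happens once here, not on the scheduling path.
        fn spawn(&mut self, f: impl Future<Output = ()> + 'static) {
            self.tasks.push(Box::pin(f));
        }
        // Round-robin: poll every coroutine once; drop the finished ones.
        fn run(&mut self) {
            let waker = noop_waker();
            let mut cx = Context::from_waker(&waker);
            while !self.tasks.is_empty() {
                self.tasks
                    .retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
            }
        }
    }

    fn main() {
        let mut sched = Scheduler { tasks: Vec::new() };
        sched.spawn(async { println!("application coroutine ran"); });
        sched.spawn(async { println!("fast-path I/O coroutine ran"); });
        sched.run();
    }
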
Realizing demikernels
- Built two! DemiLin and DemiWin
- Five libOSs (a clowder of libOSs) plus cross products:
- DemiWin
- Catpaw (RDMA)
- Catnap (for testing on HW w/out kernel bypass)
- DemiLin
- Catnap (for testing on HW w/out kernel bypass)
- Catmint (RDMA)
- Catnip (DPDK)
- Cattree (SPDK)
- CatmintxCattree (RDMAxSPDK)
- CatnipxCattree (DPDKxSPDK)
Eval
- Baselines: two kernel-bypass applications (testpmd and perftest) and
three kernel-bypass libraries (eRPC, Shenango, Caladan).
- How does demikernel impact development? Implement four applications and
count lines of code. Results: some need fewer (echo, TxnStore), some
require more (UDP Relay, Redis).
- Echo client/server: Made it easy to avoid memory allocations and
copies with minimal effort.
- UDP Relay server: More code, but the developer reported that PDPIX was
easier to use (he got it working) and it was his "favorite part of the system".
- Redis: Required some re-architecting, replicated some functionality (explains
the increased code count). Demikernel fixed some well-known inefficiencies.
Provided 0-copy IO for free.
- TxnStore: Implemented own RPC transport; simplified ability to get 0-copy.
- Demikernel performance
- Echo DemiLin: Using catnap, small improvement over Linux; LibOSs are
competitive with custom solutions from prior work.
- Echo DemiWin: libOS provides HUGE improvement on Windows; even catnap
provides 15% speedup
- Echo DemiLin Azure: Catnap is ~40% better (why so different from native
Linux -- the polling in the vCPU?) and the libOSs are significantly better
(like the custom solutions).
- Echo w/server-side logging DemiLin: Benefits similar to those observed
on DemiWin (~20%)
- NetPIPE throughput DemiLin: Tracks raw RDMA pretty well and does
better for large messages; about 15-20% slower than raw DPDK.
- Throughput versus latency performance: Echo DemiLin
- Catnip (TCP) maintains low-latency across all offered loads.
Tracks eRPC but at 20-30% higher latency.
- Shenango and Caladan can handle higher load before latency skyrockets
(9-10 Gbps compared to 7-8 Gbps for Demikernel-based systems).
- UDP Relay DemiLin (non-expert developer): almost a 2x performance
improvement!
- Redis DemiLin in-memory: libOSs give substantially increased throughput
(20%-200%).
- Redis DemiLin persistent: (fsync after set seems kind of unfair)
both catnap and libOSs show increased performance, but the comparison is
a bit skewed due to Cattree not buffering and therefore forcing Linux
to fsync.
- TxnStore DemiLin: Improved latency across the board (25%-50% of the
latency of Linux UDP)