The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems
Zhang, Raybuck, Patel, Olynyk, Nelson, Leija, Martinez, Liu, Simpson, Jayakar, Penna, Demoulin, Choudhury, Badam (2021)
What kind of paper is this?
- Unifying technology: there have been a bunch of one-off solutions
in the kernel bypass space; we tried to build a general purpose system
to encompass the range of devices and use cases for which bypass
is appropriate.
- At first blush, this should scare you, because the argument in all the
kernel bypass papers is about specialization, and now we're trying to generalize that work!
That also makes this a 'best of both worlds' paper in that they want the generalization
without sacrificing performance. Using libOSs, the authors achieve this goal.
The Story
- IO devices now run at ns-scale.
- Current operating systems are designed for µs-scale. Existing bypass systems are one-off.
- Demikernel is a new OS architecture and flexible datapath API for general purpose bypass.
- We demonstrate TWO demikernels!
- Life just got way faster.
Requirements
- Heterogeneity: existing protocols function at different levels of
abstraction (e.g., RDMA provides a full network protocol; DPDK provides
a raw NIC interface). This is the obvious response to the trade-off that
HW vendors make: features versus complexity. A datapath OS must address
this diversity.
- Zero-copy coordination:
This requires 1) coping with device address translation (e.g., RDMA
memory registration) and 2) memory management that prevents
modifying/freeing memory that is in use by an offload engine (see the
sketch after this list).
- µs-scale scheduling: Existing systems split scheduling responsibility
among user level, the kernel, and different parts of the stack (e.g., RDMA).
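
A minimal Rust sketch of the second zero-copy requirement: reference
counting keeps a buffer alive while a (purely hypothetical) offload engine
still holds it, so an early application-side free cannot pull memory out
from under an in-flight DMA. `DmaBuffer`, `DmaHandle`, and `submit_dma`
are illustrative names, not Demikernel's API.

    use std::sync::Arc;

    // Hypothetical 4 KB buffer handed to a device for zero-copy I/O.
    struct DmaBuffer(Vec<u8>);

    // The in-flight operation holds its own strong reference to the buffer.
    struct DmaHandle {
        buf: Arc<DmaBuffer>,
    }

    // Stand-in for submitting a zero-copy send/write to an offload engine.
    fn submit_dma(buf: &Arc<DmaBuffer>) -> DmaHandle {
        DmaHandle { buf: Arc::clone(buf) }
    }

    fn main() {
        let buf = Arc::new(DmaBuffer(vec![0u8; 4096]));
        let inflight = submit_dma(&buf);
        // The application drops its handle "too early", but the allocation
        // survives until the device-side reference is also dropped.
        drop(buf);
        assert_eq!(inflight.buf.0.len(), 4096);
    }
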
Demikernel Design Goals
- Simplify µs-scale kernel-bypass system development.
- Portability across heterogeneous devices (RDMA and DPDK NICs, SPDK disks,
programmable devices).
- ns-scale latency overhead.
Demikernel Architecture
- Run in the same process/thread as the application -- protection either
provided by demikernel or the bypass device (fine assumptions for datacenter).
- Cooperative scheduling (i.e., applications run in a tight IO processing loop)
of datapath operations (other stuff goes through a conventional OS).
- (Prototype limitation) Single core scheduling.
- A set of Library OSs -- one for each type of device -- share the same
architecture and APIs.
- New API (PDPIX) centered around an IO queue abstraction. Applications
hand the data/buffer over to demikernel and don't get it back until the IO
completes.
- DMA-capable heap.
PDPIX versus POSIX (only changed what needed changing)
- Libcalls not syscalls
- Queue oriented: no fds, but qds instead (queues are similar to Go channels).
- IO is inherently async: pushing an IO request into the queue returns a
qtoken on which you can choose to wait or not.
- Wait offers much better control than POSIX: you can wait on a specific
qtoken, on any of a set, or on all of a set, and the wait returns the
data directly. No mass wakeup if you have multiple waiters.
- Buffers are passed from application to datapath on push and from datapath
to applications on pop (like Rust's memory ownership model; except that
applications CAN modify buffers that have been turned over to the datapath).
- Demikernel provides use-after-free protection, but not protection
against writes while the buffer is owned by the datapath OS. (The
push/pop/wait flow is sketched below.)
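
A hypothetical Rust-flavored rendering of the PDPIX echo fast path
described above. The real interface is a C libcall API built around qds
and qtokens; the types and signatures here are stand-ins only. The point
is the shape: push/pop return immediately with a qtoken, wait blocks one
coroutine on one token, and buffer ownership moves with each call (Rust's
move semantics enforce exactly the handoff the notes describe).

    // qd: names an I/O queue (not a file); qtoken: names one pending op.
    struct Qd(u32);
    struct QToken(u64);
    struct Buffer(Vec<u8>);

    // Stand-ins for the datapath libcalls (bodies are dummies):
    fn pop(_qd: &Qd) -> QToken { QToken(0) }                // async receive
    fn push(_qd: &Qd, _buf: Buffer) -> QToken { QToken(1) } // async send
    fn wait_pop(_qt: QToken) -> Buffer { Buffer(Vec::new()) }
    fn wait_push(_qt: QToken) {}

    fn echo_loop(qd: Qd) -> ! {
        loop {
            let qt = pop(&qd);       // returns immediately with a qtoken
            let buf = wait_pop(qt);  // datapath hands buffer ownership to app
            let qt = push(&qd, buf); // ownership moves back to the datapath...
            wait_push(qt);           // ...and `buf` is unusable until completion
        }
    }

    fn main() {
        // echo_loop(Qd(0)) would poll forever; shown for shape only.
        let _ = echo_loop;
    }
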
Demikernel Design
- Separate libOS per type of device (i.e., DPDK, RDMA, SPDK).
- Each libOS contains the IO stack for that device, a memory allocator
and a coroutine scheduler.
- Library OSs can be combined (i.e., RDMA and SPDK = RDMAxSPDK).
- Implementation (mostly) in Rust (key features that make this a win:
co-routines, memory ownership, async/await).
IO Processing
- Polling-centered design
- Optimized for the error-free fast path: run each operation to
completion (a minimal sketch follows)
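
A minimal sketch of what a polling, run-to-completion receive loop looks
like: the error-free common case is handled inline with no queuing or
context switch, and a miss falls through to background work. `DeviceQueue`
and its methods are hypothetical stand-ins, not Demikernel's types.

    // Hypothetical device interface; not Demikernel's actual code.
    trait DeviceQueue {
        type Packet;
        fn poll_recv(&mut self) -> Option<Self::Packet>; // non-blocking poll
        fn process(&mut self, pkt: Self::Packet);        // app-level handler
        fn background_work(&mut self);                   // retransmits, ACKs, ...
    }

    // The error-free fast path handles each packet inline, start to
    // finish: no interrupts, no queuing, no context switch.
    fn io_loop<Q: DeviceQueue>(q: &mut Q) -> ! {
        loop {
            match q.poll_recv() {
                Some(pkt) => q.process(pkt), // fast path: run to completion
                None => q.background_work(), // nothing ready: do other work
            }
        }
    }
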
Co-Routines
- Three types
- Fast-path IO processing (one per IO stack) [lower priority]
- Background co-routines (one or more per IO stack) [lower priority]
- Application co-routines (one per blocked token) [highest priority]
- ns-scale latency imposes serious constraints on scheduling
- Hundreds or thousands of co-routines
- No heap allocations in scheduler
- Current implementation is 12 cycles! (a toy scheduler sketch follows)
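
A toy cooperative scheduler showing the flavor: a round-robin polling
executor over Rust coroutines with a no-op waker (in a pure polling design
there is nothing to wake). This is an illustration, not Demikernel's
scheduler; the real one adds the priorities above and keeps heap
allocation off the scheduling path entirely (here, boxing happens once at
spawn time and the polling loop itself allocates nothing).

    use std::future::Future;
    use std::pin::Pin;
    use std::task::{Context, RawWaker, RawWakerVTable, Waker};

    // A waker that does nothing: polling replaces wakeups.
    fn noop_waker() -> Waker {
        fn clone(_: *const ()) -> RawWaker {
            RawWaker::new(std::ptr::null(), &VTABLE)
        }
        fn noop(_: *const ()) {}
        static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
        unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
    }

    struct Scheduler {
        tasks: Vec<Pin<Box<dyn Future<Output = ()>>>>,
    }

    impl Scheduler {
        // Boxing happens once here, not on the scheduling path.
        fn spawn(&mut self, f: impl Future<Output = ()> + 'static) {
            self.tasks.push(Box::pin(f));
        }
        // Round-robin: poll every coroutine once; drop the finished ones.
        fn run(&mut self) {
            let waker = noop_waker();
            let mut cx = Context::from_waker(&waker);
            while !self.tasks.is_empty() {
                self.tasks
                    .retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
            }
        }
    }

    fn main() {
        let mut sched = Scheduler { tasks: Vec::new() };
        sched.spawn(async { println!("application coroutine ran"); });
        sched.spawn(async { println!("fast-path I/O coroutine ran"); });
        sched.run();
    }
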
Realizing demikernels
- Built two! DemiLin and DemiWin
- Five libOSs (a clowder of libOSs) plus cross products:
- DemiWin
- Catpaw (RDMA)
- Catnap (for testing on HW w/out kernel bypass)
- DemiLin
- Catnap (for testing on HW w/out kernel bypass)
- Catmint (RDMA)
- Catnip (DPDK)
- Cattree (SPDK)
- CatmintxCattree (RDMAxSPDK)
- CatnipxCattree (DPDKxSPDK)
Eval
- Baselines: two kernel-bypass applications (testpmd and perftest) and
three kernel-bypass libraries (eRPC, Shenango, Caladan).
- How does demikernel impact development? Implement four applications and
count lines of code. Results: some need fewer (echo, TxnStore), some
require more (UDP Relay, Redis).
- Echo client/server: Made it easy to avoid memory allocations and
copies with minimal effort.
- UDP Relay server: More code, but the developer reported that PDPIX was
easier to use (he got it working) and it was his "favorite part of the system".
- Redis: Required some re-architecting, replicated some functionality (explains
the increased code count). Demikernel fixed some well-known inefficiencies.
Provided 0-copy IO for free.
- TxnStore: Implemented own RPC transport; simplified ability to get 0-copy.
- Demikernel performance
- Echo DemiLin: Using catnap, small improvement over Linux; LibOSs are
competitive with custom solutions from prior work.
- Echo DemiWin: libOS provides HUGE improvement on Windows; even catnap
provides 15% speedup
- Echo DemiLin Azure: Catnap is ~40% better (why so different from native
Linux -- the polling in the vCPU?) and the libOSs are significantly better
(like the custom solutions).
- Echo w/server-side logging DemiLin: Benefits similar to those observed
on DemiWin (~20%)
- NetPIPE throughput DemiLin: Tracks raw RDMA pretty well and does
better for large messages; about 15-20% slower than raw DPDK.
- Throughput versus latency performance: Echo DemiLin
- Catnip (TCP) maintains low-latency across all offered loads.
Tracks eRPC but at 20-30% higher latency.
- Shenango and Caladan can handle higher load before latency skyrockets
(9-10 Gbps compared to 7-8 Gbps for Demikernel-based systems).
- UDP Relay DemiLin (non-expert developer): almost a 2x performance
improvement!
- Redis DemiLin in-memory: libOSs give substantially increased throughput
(20%-200%).
- Redis DemiLin persistent: (fsync after set seems kind of unfair)
both catnap and libOSs show increased performance, but the comparison is
a bit skewed due to Cattree not buffering and therefore forcing Linux
to fsync.
- TxnStore DemiLin: Improved latency across the board (25%-50% of the
latency of Linux UDP)