Snap: a Microkernel Approach to Host Networking
Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli
Michael Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman
Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan,
Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat (2019)
What kind of paper is this?
- It describes a system, Snap, that has been running in production for 3 years.
- Experience report of networking at large scale
The Fairy Tale:
Once upon a time, networking happened in the kernel or in user-space services. This was a nightmare:
upgrading and changing the network infrastructure took a long time, making quick adaptation to new
requirements impossible. Worse, upgrades often meant downtime, and integrating upstream changes
often broke optimizations gained through vertical integration. These drawbacks became unbearable for
Google, and Marty et al. decided to combine ideas from microkernels and user-level networking with
the centralized resource management and scheduling of the Linux kernel. With Snap and its Pony
Express transport layer, they could greatly reduce the resource usage of network processing, improve
latency and bandwidth, and upgrade the networking services in less than 250ms. For more than three
years now, they have snapped happily ever after.
- high rate of feature development through user-space implementation and transparent upgrades
- interoperability with existing kernel network functions and application thread schedulers
- composable, encapsulated packet processing functions (engines) -> cf. routers
- Support for OSI layer 4 and 5 functionality by exposing a smart-NIC-like interface (through Pony Express)
- Minimizing processing overhead
- Combines centralization of monolithic kernels with user-space development benefits.
- Decoupling the release of networking functionality from both applications and the kernel.
- Decoupling of scheduling from network services: spin-polling as a system-wide service.
- IPC overhead, the classic microkernel objection, is less of an issue today (services can run on dedicated cores of multi-core machines)
- Control plane and data plane
- engines: define the data processing pipeline (shared or dedicated)
- engine groups: resource accounting, scheduling, ... (mapping engines to threads)
- Pony Express: packet transport layer
- Snap: micro-kernel network service
- Move networking out of the kernel to userspace (Snap framework)
- higher productivity (user-space development) and higher release velocity
- better isolation: runs in a separate address space as a non-root user.
- Runs as a normal user-space process on Linux.
- Custom data-plane operations are implemented as "engines"
- Kernel module driver to efficiently move packets between Snap and the kernel
- Snap modules = control plane; they instantiate new engines
- stateful, single threaded tasks
- Snap provides libraries for ACLs, protocol processing, rate limiting, etc.
- pluggable elements for constructing packet processing pipelines
- communicate through queues (+ mailboxes)
- like resource containers
- define a scheduling policy.
- Three scheduling modes: dedicating cores to engines, spreading engines across cores, and compacting engines onto as few cores as possible.
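A minimal sketch of the engine model described above: a stateful, single-threaded task that communicates only through queues and runs packets through a pipeline of pluggable elements. This is hypothetical illustration code, not Google's implementation; the `Annotate` element and the `budget` parameter are assumptions standing in for real elements (ACLs, rate limiters) and real spin-poll batching.

```python
# Hypothetical sketch (not Snap's actual code): an engine as a stateful,
# single-threaded task polling an input queue and pushing each packet
# through a pipeline of pluggable processing elements.
from collections import deque

class Element:
    """Pluggable packet-processing step (cf. ACLs, rate limiters)."""
    def process(self, pkt):
        raise NotImplementedError

class Annotate(Element):
    """Toy element: tags each packet it sees."""
    def __init__(self, tag):
        self.tag = tag
    def process(self, pkt):
        return {**pkt, "tags": pkt.get("tags", []) + [self.tag]}

class Engine:
    """Single-threaded; all communication happens via queues."""
    def __init__(self, elements):
        self.inq, self.outq = deque(), deque()
        self.elements = elements
    def poll(self, budget=16):
        """One spin-poll step: process at most `budget` queued packets."""
        done = 0
        while self.inq and done < budget:
            pkt = self.inq.popleft()
            for element in self.elements:
                pkt = element.process(pkt)
            self.outq.append(pkt)
            done += 1
        return done

eng = Engine([Annotate("acl-ok"), Annotate("rate-limited")])
eng.inq.extend({"seq": i} for i in range(3))
eng.poll()
print([p["tags"] for p in eng.outq])
# -> [['acl-ok', 'rate-limited'], ['acl-ok', 'rate-limited'], ['acl-ok', 'rate-limited']]
```

Because an engine only touches its own state and its queues, the scheduler is free to place many engines on one thread (compacting) or give each its own core (dedicating) without changing engine code.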
Scheduling Class: MicroQuanta
- Run a given MicroQuanta task for runtime x out of every period y.
- This class gives priority to Snap when the core is shared with other tasks.
- can cleanly separate cores for running applications from cores running snap
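MicroQuanta is a custom kernel scheduling class, so it cannot be reproduced in user space, but its runtime-out-of-period semantics can be illustrated with a toy timeline. The function below is a hypothetical simulation (the parameter values are made up, not from the paper).

```python
# Toy simulation of MicroQuanta-style semantics: the Snap task is
# guaranteed `runtime_us` out of every `period_us` on a shared core;
# the remaining slice goes to other tasks on that core.
def schedule(runtime_us, period_us, total_us, step_us=100):
    """Timeline string: 'S' when the Snap engine runs, '.' when others do."""
    timeline = []
    for t in range(0, total_us, step_us):
        timeline.append("S" if (t % period_us) < runtime_us else ".")
    return "".join(timeline)

# e.g. 900us of runtime every 1000us leaves 10% of the core for other tasks
print(schedule(runtime_us=900, period_us=1000, total_us=3000))
# -> SSSSSSSSS.SSSSSSSSS.SSSSSSSSS.
```

The short period is what keeps tail latency low: other tasks are never starved for longer than `period_us - runtime_us` at a time.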
- custom reliable transport and communications API.
- Focus on asynchronous, operation-level commands and completions (one- and two-sided operations)
- Reliability, congestion control, flow control, ordering, remote data access operations, ...
- two layers: an operation layer that executes operations, and a reliability layer that manages reliable flows.
- flow mapper: associates application-level connections with flows
- Advertise different versions, compatibility, etc. (cf. protobufs, Android, ...)
- Only generate new packets when there is room in the NIC to enqueue them.
- Use of shared engines, or exclusive engines.
- Pre-defined, one-sided operations for RDMA. (more semantics than traditional RDMA)
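The back-pressure rule noted above (packets are generated only when the NIC can accept them) can be sketched as follows. This is an illustrative model, not Pony Express code; `NicTx`, `run_op_layer`, and the queue depth are invented names and values.

```python
# Hypothetical sketch of NIC-driven back-pressure: the operation layer
# turns pending application ops into packets only while the NIC TX ring
# has room, so packets are never generated just to sit in software queues.
from collections import deque

class NicTx:
    """Toy model of a fixed-depth NIC transmit ring."""
    def __init__(self, depth):
        self.depth, self.ring = depth, deque()
    def has_room(self):
        return len(self.ring) < self.depth
    def enqueue(self, pkt):
        self.ring.append(pkt)

def run_op_layer(pending_ops, nic):
    """Dequeue ops only while the NIC can accept the resulting packets."""
    sent = 0
    while pending_ops and nic.has_room():
        op = pending_ops.popleft()
        nic.enqueue({"op": op})
        sent += 1
    return sent

ops = deque(f"write-{i}" for i in range(5))
nic = NicTx(depth=3)
print(run_op_layer(ops, nic), len(ops))
# -> 3 2   (3 packets fit in the ring; 2 ops stay queued as ops, not packets)
```

Keeping the excess work queued as operations rather than packets is what lets the transport apply congestion and flow control decisions as late as possible.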
- target: 200ms or less
- serialize state in new format in memory
- Engines are upgraded one at a time, migrating from the old to the new Snap instance.
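The engine-at-a-time upgrade above can be sketched as a serialize/restore handoff. This is a hypothetical illustration (JSON stands in for whatever upgrade-stable format is actually used; `Engine`, `upgrade`, and the state fields are invented), showing why migrating per engine bounds the blackout to one engine's state at a time.

```python
# Hypothetical sketch of a transparent upgrade: the old Snap process
# serializes each engine's state into an upgrade-stable format, and the
# new process restores it; only one engine is paused at any moment.
import json

class Engine:
    def __init__(self, name, state):
        self.name, self.state = name, state
    def serialize(self):
        return json.dumps({"name": self.name, "state": self.state})
    @classmethod
    def restore(cls, blob):
        d = json.loads(blob)
        return cls(d["name"], d["state"])

def upgrade(old_engines):
    new_engines = []
    for eng in old_engines:            # migrate one engine at a time
        blob = eng.serialize()         # old instance pauses only this engine
        new_engines.append(Engine.restore(blob))  # new instance resumes it
    return new_engines

old = [Engine("tx0", {"next_seq": 42}), Engine("rx0", {"acked": 41})]
new = upgrade(old)
print([(e.name, e.state) for e in new])
# -> [('tx0', {'next_seq': 42}), ('rx0', {'acked': 41})]
```

Because each blob is in a versioned, stable format, the new binary can reinterpret old state even when its in-memory layout has changed.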
- Evaluation platforms?
- Throughput and latency with various configurations (against the Linux TCP stack)
- Interrupt latency impact
- Interference using system calls and background load.
- RDMA one-sided operations
- Transparent upgrades