Snap: a Microkernel Approach to Host Networking
Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli
Michael Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman
Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan,
Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat (2019)
What kind of paper is this?
- It describes a system, Snap, that has been running in production for 3 years.
- Experience report of networking at large scale
The Fairy Tale:
Once upon a time, networking happened in the kernel or in user-space services. This was a nightmare:
upgrading and changing the network infrastructure took a long time, making quick adaptation to new
requirements impossible. Worse, upgrades often meant downtime, and integrating upstream changes
often broke optimizations gained through vertical integration. These drawbacks became unbearable for
Google, and Marty et al. decided to combine ideas from microkernels and user-level networking with
the centralized resource management and scheduling of the Linux kernel. With Snap and its Pony
Express transport layer, they could greatly reduce the resource usage of network processing, improve
latency and bandwidth, and upgrade the networking services in less than 250ms. For more than three
years now, they have snapped happily ever after.
- high rate of feature development through user-space implementation and transparent upgrades
- interoperability with existing kernel network functions and application thread schedulers
- composable, encapsulated packet processing functions (engines) -> cf. routers
- Support for OSI layer 4 and 5 functionality by exposing a smart-NIC-like interface (through Pony Express)
- Minimizing processing overhead
- Combines centralization of monolithic kernels with user-space development benefits.
- Decoupling the release of networking functionality from both applications and the kernel.
- Decoupling of scheduling from network services: spin-polling as a system-wide service.
- IPC overhead, the classic microkernel objection, is less of an issue today (services can run on dedicated cores of multi-core machines)
- Control plane and data plane
- engines: define the data processing pipeline (shared or dedicated)
- engine groups: resource accounting, scheduling, ... (mapping engines to threads)
- Pony Express: packet transport layer
- Snap: micro-kernel network service
- Move networking out of the kernel to userspace (Snap framework)
- higher productivity (user-space development) and higher release velocity
- better isolation: runs in a separate address space as a non-root user.
- Runs as a normal user-space process on Linux.
- Custom data-plane operations are implemented as "engines"
- Kernel module driver to efficiently move packets between Snap and the kernel
- Snap modules = control plane; they instantiate new engines
- stateful, single threaded tasks
- Snap provides libraries for ACLs, protocol processing, rate limiting, etc.
- pluggable elements for constructing packet processing pipelines
- communicate through queues (+ mailboxes)
- like resource containers
- define a scheduling policy.
- Three scheduling modes: dedicating cores to engines, spreading engines across cores, and compacting engines onto as few cores as possible.
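A minimal sketch of the engine model described above: a stateful, single-threaded task that communicates only through queues and runs packets through a pipeline of pluggable elements. This is hypothetical illustration code, not Google's implementation; the `Annotate` element and the `budget` parameter are assumptions standing in for real elements (ACLs, rate limiters) and real spin-poll batching.

```python
# Hypothetical sketch (not Snap's actual code): an engine as a stateful,
# single-threaded task polling an input queue and pushing each packet
# through a pipeline of pluggable processing elements.
from collections import deque

class Element:
    """Pluggable packet-processing step (cf. ACLs, rate limiters)."""
    def process(self, pkt):
        raise NotImplementedError

class Annotate(Element):
    """Toy element: tags each packet it sees."""
    def __init__(self, tag):
        self.tag = tag
    def process(self, pkt):
        return {**pkt, "tags": pkt.get("tags", []) + [self.tag]}

class Engine:
    """Single-threaded; all communication happens via queues."""
    def __init__(self, elements):
        self.inq, self.outq = deque(), deque()
        self.elements = elements
    def poll(self, budget=16):
        """One spin-poll step: process at most `budget` queued packets."""
        done = 0
        while self.inq and done < budget:
            pkt = self.inq.popleft()
            for element in self.elements:
                pkt = element.process(pkt)
            self.outq.append(pkt)
            done += 1
        return done

eng = Engine([Annotate("acl-ok"), Annotate("rate-limited")])
eng.inq.extend({"seq": i} for i in range(3))
eng.poll()
print([p["tags"] for p in eng.outq])
# -> [['acl-ok', 'rate-limited'], ['acl-ok', 'rate-limited'], ['acl-ok', 'rate-limited']]
```

Because an engine only touches its own state and its queues, the scheduler is free to place many engines on one thread (compacting) or give each its own core (dedicating) without changing engine code.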
Scheduling Class: MicroQuanta
- Run a given MicroQuanta task for runtime x out of every period y.
- This class gives priority to Snap when the core is shared with other tasks.
- can cleanly separate cores for running applications from cores running snap
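MicroQuanta is a custom kernel scheduling class, so it cannot be reproduced in user space, but its runtime-out-of-period semantics can be illustrated with a toy timeline. The function below is a hypothetical simulation (the parameter values are made up, not from the paper).

```python
# Toy simulation of MicroQuanta-style semantics: the Snap task is
# guaranteed `runtime_us` out of every `period_us` on a shared core;
# the remaining slice goes to other tasks on that core.
def schedule(runtime_us, period_us, total_us, step_us=100):
    """Timeline string: 'S' when the Snap engine runs, '.' when others do."""
    timeline = []
    for t in range(0, total_us, step_us):
        timeline.append("S" if (t % period_us) < runtime_us else ".")
    return "".join(timeline)

# e.g. 900us of runtime every 1000us leaves 10% of the core for other tasks
print(schedule(runtime_us=900, period_us=1000, total_us=3000))
# -> SSSSSSSSS.SSSSSSSSS.SSSSSSSSS.
```

The short period is what keeps tail latency low: other tasks are never starved for longer than `period_us - runtime_us` at a time.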
- custom reliable transport and communications API.
- Focus on asynchronous, operation-level commands and completions (one- and two-sided operations)
- Reliability, congestion control, flow control, ordering, remote data access operations, ...
- two layers: an operation layer that executes operations, and a reliability layer that manages reliable flows.
- flow mapper: associates application-level connections with flows
- Advertise different versions, compatibility, etc. (cf. protobufs, Android, ...)
- Only generate new packets when there is room in the NIC to enqueue them.
- Use of shared engines, or exclusive engines.
- Pre-defined, one-sided operations for RDMA. (more semantics than traditional RDMA)
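The back-pressure rule noted above (packets are generated only when the NIC can accept them) can be sketched as follows. This is an illustrative model, not Pony Express code; `NicTx`, `run_op_layer`, and the queue depth are invented names and values.

```python
# Hypothetical sketch of NIC-driven back-pressure: the operation layer
# turns pending application ops into packets only while the NIC TX ring
# has room, so packets are never generated just to sit in software queues.
from collections import deque

class NicTx:
    """Toy model of a fixed-depth NIC transmit ring."""
    def __init__(self, depth):
        self.depth, self.ring = depth, deque()
    def has_room(self):
        return len(self.ring) < self.depth
    def enqueue(self, pkt):
        self.ring.append(pkt)

def run_op_layer(pending_ops, nic):
    """Dequeue ops only while the NIC can accept the resulting packets."""
    sent = 0
    while pending_ops and nic.has_room():
        op = pending_ops.popleft()
        nic.enqueue({"op": op})
        sent += 1
    return sent

ops = deque(f"write-{i}" for i in range(5))
nic = NicTx(depth=3)
print(run_op_layer(ops, nic), len(ops))
# -> 3 2   (3 packets fit in the ring; 2 ops stay queued as ops, not packets)
```

Keeping the excess work queued as operations rather than packets is what lets the transport apply congestion and flow control decisions as late as possible.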
- target: 200ms or less
- serialize state in new format in memory
- Engines are upgraded one at a time, migrating from the old to the new Snap instance.
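The engine-at-a-time upgrade above can be sketched as a serialize/restore handoff. This is a hypothetical illustration (JSON stands in for whatever upgrade-stable format is actually used; `Engine`, `upgrade`, and the state fields are invented), showing why migrating per engine bounds the blackout to one engine's state at a time.

```python
# Hypothetical sketch of a transparent upgrade: the old Snap process
# serializes each engine's state into an upgrade-stable format, and the
# new process restores it; only one engine is paused at any moment.
import json

class Engine:
    def __init__(self, name, state):
        self.name, self.state = name, state
    def serialize(self):
        return json.dumps({"name": self.name, "state": self.state})
    @classmethod
    def restore(cls, blob):
        d = json.loads(blob)
        return cls(d["name"], d["state"])

def upgrade(old_engines):
    new_engines = []
    for eng in old_engines:            # migrate one engine at a time
        blob = eng.serialize()         # old instance pauses only this engine
        new_engines.append(Engine.restore(blob))  # new instance resumes it
    return new_engines

old = [Engine("tx0", {"next_seq": 42}), Engine("rx0", {"acked": 41})]
new = upgrade(old)
print([(e.name, e.state) for e in new])
# -> [('tx0', {'next_seq': 42}), ('rx0', {'acked': 41})]
```

Because each blob is in a versioned, stable format, the new binary can reinterpret old state even when its in-memory layout has changed.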
- Evaluation platforms?
- Throughput and latency with various configurations (against the Linux TCP stack)
- Interrupt latency impact
- Interference using system calls and background load.
- RDMA one-sided operations
- Transparent upgrades