NrOS: Effective Replication and Sharing in an Operating System
Bhardwaj, Kulkarni, Achermann, Calciu, Kashyap, Stutsman, Tai, Zellweger (2021)
What kind of paper is this?
- Synthesis: merges ideas from safe-language OSes, the multikernel design, and
node replication (NR).
The Story
- Writing a correct OS is hard; synchronization is particularly hard.
- NrOS is a sequential kernel with no concurrency, specifically designed to
avoid having to reason about concurrency when reasoning about correctness.
- Designing a kernel this way leads to better performance and simplicity.
Principles
- Single-threaded, sequential implementations of core data structures.
- Scale via node replication -- separate replica of each data structure on each
NUMA node. Mutations to kernel state are batched from cores in a NUMA node and then
appended to a log using flat combining. Each node applies the logged updates serially.
- Some parts need concurrent node replication (CNR), which exploits commutativity.
- Basically leverage shared memory to communicate global state updates, while
avoiding all the complexity around consensus.
- Motivation 1: Shared memory continues to be at the heart of modern systems.
- Motivation 2: Memory capacity continues to grow, so replication is OK.
Node Replication (NR)
- Operation log: Circular buffer into which nodes write mutations to shared data.
- Flat combining: One combiner thread applies many mutations.
- Optimized readers-writer lock with writer preference. (Minimal sketch of the overall NR pattern below.)
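To make the NR pattern concrete, here is a minimal, hypothetical Rust sketch of the core idea: a sequential data structure, a shared operation log, and replicas that replay the log before serving reads. It deliberately omits flat combining, the circular-buffer log, and the writer-preference lock, and it does not use the real node-replication crate's API; all names are illustrative.

```rust
use std::sync::Mutex;

// Mutating operations that can be appended to the shared log.
enum Op {
    Insert(u64, u64),
    Remove(u64),
}

// The sequential data structure each replica owns privately.
#[derive(Default)]
struct SeqMap {
    inner: std::collections::BTreeMap<u64, u64>,
}

impl SeqMap {
    fn apply(&mut self, op: &Op) {
        match op {
            Op::Insert(k, v) => { self.inner.insert(*k, *v); }
            Op::Remove(k) => { self.inner.remove(k); }
        }
    }
}

// Shared, append-only operation log (a growable Vec stands in for NR's
// fixed-size circular buffer).
struct Log {
    ops: Mutex<Vec<Op>>,
}

// One replica per NUMA node: a private SeqMap plus how far it has consumed
// the shared log.
struct Replica<'a> {
    log: &'a Log,
    data: SeqMap,
    tail: usize,
}

impl<'a> Replica<'a> {
    // Mutations are appended to the log, then applied locally in log order.
    fn execute_mut(&mut self, op: Op) {
        self.log.ops.lock().unwrap().push(op);
        self.sync();
    }

    // Reads first catch the replica up to the current log tail.
    fn get(&mut self, k: u64) -> Option<u64> {
        self.sync();
        self.data.inner.get(&k).copied()
    }

    fn sync(&mut self) {
        let ops = self.log.ops.lock().unwrap();
        while self.tail < ops.len() {
            self.data.apply(&ops[self.tail]);
            self.tail += 1;
        }
    }
}

fn main() {
    let log = Log { ops: Mutex::new(Vec::new()) };
    let mut r0 = Replica { log: &log, data: SeqMap::default(), tail: 0 };
    let mut r1 = Replica { log: &log, data: SeqMap::default(), tail: 0 };

    r0.execute_mut(Op::Insert(1, 42));
    // r1 observes the update by replaying the shared log.
    assert_eq!(r1.get(1), Some(42));
}
```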
Concurrent NR (CNR) -- elegant in its simplicity
- Place commutative operations on separate logs
- Place conflicting operations in the same log (routing sketch below)
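A hypothetical sketch of how CNR's log routing might look: operations are hashed to one of several logs by the object they touch (the inode here), so commutative operations land on different logs while conflicting ones serialize on the same log. The `FsOp` type, log count, and hashing scheme are illustrative, not NrOS's actual implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_LOGS: usize = 4;

// Example file-system operations that may or may not commute.
enum FsOp {
    Write { inode: u64, offset: u64, len: u64 },
    Truncate { inode: u64, len: u64 },
}

impl FsOp {
    fn inode(&self) -> u64 {
        match self {
            FsOp::Write { inode, .. } | FsOp::Truncate { inode, .. } => *inode,
        }
    }
}

// Pick a log index from the inode so conflicting ops serialize on one log,
// while ops on different inodes can spread across logs.
fn log_for(op: &FsOp) -> usize {
    let mut h = DefaultHasher::new();
    op.inode().hash(&mut h);
    (h.finish() as usize) % NUM_LOGS
}

fn main() {
    let a = FsOp::Write { inode: 7, offset: 0, len: 512 };
    let b = FsOp::Truncate { inode: 7, len: 0 };
    // Same inode => same log => ordered with respect to each other.
    assert_eq!(log_for(&a), log_for(&b));
}
```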
NrOS design
- Multikernel
- The kernel is implemented as a set of per-NUMA-node kernel replicas
- Replicas access and modify local state and maintain consistency via NR
- Major subsystems:
- NR-vMem: Virtual memory (replicates per-process page-mapping metadata and page tables)
- NR-FS: In-memory file system
- NR-Scheduler: Process manager that loads and spawns ELF binaries
- Supports both native execution and POSIX via a NetBSD libOS
- Devices and interrupts are currently out of scope
- Comparable to a lightweight hypervisor
- Everything is shared memory within a NUMA-node.
- Leverage Rust language-level memory safety.
Physical Memory Management
- Divide memory into per-NUMA node caches (NCache).
- Each NCache has two classes of page frames: 4 KB and 2 MB
- Each core has a cache (TCache) of frames; when empty, it refills from the NCache
- Other dynamic allocations are slab-allocated from suitable frames (allocator sketch below)
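A hypothetical sketch of the two-level frame allocator described above: a per-core TCache serves allocations from a local stack of frames and refills in batches from the per-NUMA-node NCache when empty. The `Frame` type, batch size, and field names are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy)]
struct Frame {
    base: u64, // physical address of the frame
}

// Per-NUMA-node cache of free frames.
struct NCache {
    frames_4k: Vec<Frame>,
}

// Per-core cache of free frames.
struct TCache {
    frames_4k: Vec<Frame>,
}

const REFILL_BATCH: usize = 64;

impl TCache {
    fn alloc_4k(&mut self, ncache: &mut NCache) -> Option<Frame> {
        if self.frames_4k.is_empty() {
            // Refill a batch from the NUMA-node cache in one go to amortize
            // the cost of touching the shared allocator.
            let take = REFILL_BATCH.min(ncache.frames_4k.len());
            let start = ncache.frames_4k.len() - take;
            self.frames_4k.extend(ncache.frames_4k.drain(start..));
        }
        self.frames_4k.pop()
    }
}

fn main() {
    let mut ncache = NCache {
        frames_4k: (0..128u64).map(|i| Frame { base: i * 4096 }).collect(),
    };
    let mut tcache = TCache { frames_4k: Vec::new() };
    let f = tcache.alloc_4k(&mut ncache).unwrap();
    println!("allocated frame at {:#x}", f.base);
}
```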
Virtual Memory Management
- Uses MMU for isolation
- Per-process mappings are kept in a B-Tree (sketch below)
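A hypothetical sketch of the B-Tree-based mapping metadata: a per-process B-Tree keyed by virtual address, with a lookup that finds the mapping covering an address. In NrOS these map operations would go through NR so all replicas stay consistent; that plumbing and the hardware page-table updates are omitted here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy)]
struct Mapping {
    frame_base: u64, // backing physical frame
    len: u64,        // length of the mapped region in bytes
}

#[derive(Default)]
struct AddressSpace {
    // virtual address -> mapping metadata, ordered by base address
    mappings: BTreeMap<u64, Mapping>,
}

impl AddressSpace {
    fn map(&mut self, vaddr: u64, m: Mapping) {
        self.mappings.insert(vaddr, m);
    }

    // Find the mapping covering `vaddr`, if any, by looking at the closest
    // mapping at or below the address.
    fn resolve(&self, vaddr: u64) -> Option<(u64, Mapping)> {
        let (base, m) = self.mappings.range(..=vaddr).next_back()?;
        if vaddr < *base + m.len {
            Some((*base, *m))
        } else {
            None
        }
    }
}

fn main() {
    let mut asp = AddressSpace::default();
    asp.map(0x4000_0000, Mapping { frame_base: 0x10_0000, len: 0x20_0000 });
    assert!(asp.resolve(0x4000_1000).is_some());
    assert!(asp.resolve(0x5000_0000).is_none());
}
```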
File System
- Support pread/pwrite instead of fd-based read/write (avoids serializing all IO)
- Do not place file data in logs; use kernel-side buffers and place references to
the buffers in the log (sketch below)
- Write buffers are copied to kernel before logging
- Uses CNR
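A hypothetical sketch of keeping file data out of the log, as described above: the write payload is copied once into a kernel-side buffer, and the log entry carries only a reference to that buffer, so every replica applying the operation shares one copy of the data. Types and names are illustrative.

```rust
use std::sync::Arc;

// What actually lands in the operation log.
enum FsOp {
    Write {
        inode: u64,
        offset: usize,
        // Shared, immutable kernel buffer; the log entry is just a pointer.
        data: Arc<Vec<u8>>,
    },
}

fn enqueue_write(log: &mut Vec<FsOp>, inode: u64, offset: usize, user_buf: &[u8]) {
    // Copy from "user space" into a kernel buffer before logging, so the
    // operation stays valid no matter when each replica applies it.
    let data = Arc::new(user_buf.to_vec());
    log.push(FsOp::Write { inode, offset, data });
}

fn main() {
    let mut log = Vec::new();
    enqueue_write(&mut log, 3, 0, b"hello");
    // The log holds one entry referencing the shared buffer, not the bytes inline.
    if let Some(FsOp::Write { data, .. }) = log.first() {
        assert_eq!(data.len(), 5);
    }
}
```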
Process Management and Scaling
- Inspired by 1) Barrelfish dispatchers, 2) Lithe, and 3) Scheduler activations
- NR-Scheduler allocates CPUs to processes (is a CPU a NUMA-node or a core?)
- Processes may ask for more cores and can relinquish them
- Kernel notifies processes of allocations/deallocations via upcalls.
- Upcalls trigger user-level scheduler
- NR-Scheduler maps process IDs to process structures and executors to cores (sketch below)
- Some process creation actions are replicated; others are not. (To a first
approximation, read-only parts are replicated; writable portions are logged.)
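A hypothetical sketch of the scheduler bookkeeping mentioned above: maps from process ID to process state and from core to the owning process, plus a stubbed upcall that hands a newly granted core to the process's user-level scheduler. The struct layout and upcall signature are illustrative, not NrOS's actual interfaces.

```rust
use std::collections::HashMap;

struct Process {
    pid: u64,
    // Entry point of the user-level scheduler invoked on core grants.
    upcall_entry: fn(core_id: usize),
}

#[derive(Default)]
struct NrScheduler {
    processes: HashMap<u64, Process>,
    // Which process currently owns each core.
    executors: HashMap<usize, u64>,
}

impl NrScheduler {
    fn grant_core(&mut self, pid: u64, core_id: usize) {
        if let Some(p) = self.processes.get(&pid) {
            self.executors.insert(core_id, pid);
            println!("granting core {core_id} to pid {}", p.pid);
            // Upcall into the process; its user-level scheduler decides what to run.
            (p.upcall_entry)(core_id);
        }
    }
}

fn user_scheduler(core_id: usize) {
    println!("user-level scheduler picked up core {core_id}");
}

fn main() {
    let mut s = NrScheduler::default();
    s.processes.insert(1, Process { pid: 1, upcall_entry: user_scheduler });
    s.grant_core(1, 5);
}
```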
Log Garbage Collection
- Use IPIs to get cores to apply log records so that log space can be reclaimed (sketch below)
- Updates from the log are applied during idle periods as well
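A hypothetical sketch of the GC trigger described above: log space behind the slowest replica cannot be reclaimed, so when usage crosses a threshold, lagging replicas are prodded (via an IPI in the real kernel, stubbed here) to apply outstanding entries. The threshold, data layout, and `send_ipi` stub are assumptions.

```rust
const LOG_CAPACITY: usize = 1024;

struct ReplicaState {
    tail: usize, // index of the next log entry this replica will apply
}

// Stand-in for sending an inter-processor interrupt to a core on that replica.
fn send_ipi(replica_id: usize) {
    println!("IPI -> replica {replica_id}: please advance your tail");
}

fn maybe_collect(head: usize, replicas: &[ReplicaState]) {
    // Entries between the slowest replica's tail and the head are still live.
    let min_tail = replicas.iter().map(|r| r.tail).min().unwrap_or(head);
    let in_use = head - min_tail;
    if in_use > LOG_CAPACITY / 2 {
        // Ask every lagging replica to catch up so the space becomes reclaimable.
        for (id, r) in replicas.iter().enumerate() {
            if r.tail == min_tail {
                send_ipi(id);
            }
        }
    }
}

fn main() {
    let replicas = vec![ReplicaState { tail: 100 }, ReplicaState { tail: 900 }];
    maybe_collect(1000, &replicas);
}
```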
Implementation details
- Implemented from scratch in Rust
- Runs on x86-64
- Includes parts of a NetBSD-based libOS (see vibrio below)
- 11K lines of code + 16K LoC library code
- Only 3.6% are unsafe
- NR is 3.4K lines of Rust (5% unsafe)
- Userspace runtime support using vibrio
- Vibrio allows linking against NetBSD rump kernels
Evaluation
- Pinned threads to cores
- Disabled hyperthreads
- Turbo boost enabled
- How does the NrOS design compare against monolithic and multikernel OSes?
- NR-FS vs. tmpfs: allocate 256 MB files; each core accesses exactly one file; the
file is shared among 4-96 cores. Cores either read or write a 1 KB block at random.
- 1 file across 96 cores: read-only scales linearly (Linux does not scale at all)
- 1 file across 96 cores: 10% writes scales to about 10% of read-only throughput
(Linux scales to 8 cores and then falls off).
- 1 file across 96 cores: 60% writes scales to 3x at 16 cores and holds stable
(Linux is faster up to about 24 cores, then drops).
- 1 file across 96 cores: write-only gives a 50% improvement at 16 cores and holds
steady (Linux is faster up to 24 cores).
- When 24 files are shared across up to 96 cores, the trends are similar, but Linux
scales a bit better (and outperforms on the write-heavy load until about 72 cores).
- LevelDB: NrOS throughput scales by about 3.5x going from 4 to 28 cores (a 7x increase
in core count). Linux scales by roughly the same factor, but is slower to begin with.
- NR-vMem: against Linux, sv6, and Barrelfish -- repeated map operations
- NR replicates the entire address space, so no scaling; performance is stable.
- Linux degrades dreadfully as its map operations contend on its red-black tree.
- Barrelfish is about 10x faster because it's fully decoupled
- sv6 gets another factor of 10 beyond Barrelfish due to fine-grained locking.
- CNR appears to be about 2x Barrelfish.
- Those are all throughput numbers. Latency numbers vary depending on architecture.
- What are the latency, memory, and replication-mechanism trade-offs in the NrOS design?
- I didn't feel like they addressed this very well.
- Does NrOS matter for applications? Memcached and LevelDB were the two applications evaluated:
- As mentioned above: for LevelDB, NrOS is about 33% faster.
- For memcached, NrOS with 1 replica gets about 25% better throughput; with 4 replicas, it gets about 33% better throughput.