NrOS: Effective Replication and Sharing in an Operating System
Bhardwaj, Kulkarni, Achermann, Calciu, Kashyap, Stutsman, Tai, Zellweger (2021)
What kind of paper is this?
- Synthesis: merges ideas from safe-language OSes, the multikernel design, and
node replication (NR).
The Story
- Writing a correct OS is hard; synchronization is particularly hard.
- NrOS is a sequential kernel with no concurrency, specifically designed to
avoid having to reason about concurrency when reasoning about correctness.
- Designing a kernel this way leads to better performance and simplicity.
Principles
- Single-threaded, sequential implementations of core data structures.
- Scale via node replication -- separate replica of each data structure on each
NUMA node. Mutations to kernel state are batched from cores in a NUMA node and then
appended to a log using flat combining. Each node applies the logged updates serially.
- Some parts need concurrent node replication (CNR), which exploits commutativity.
- Basically leverage shared memory to communicate global state updates, while
avoiding all the complexity around consensus.
- Motivation 1: Shared memory continues to be at the heart of modern systems.
- Motivation 2: Memory capacity continues to grow, so replication is OK.
Node Replication (NR)
- Operation log: Circular buffer into which nodes write mutations to shared data.
- Flat combining: One combiner thread applies many mutations.
- Optimized readers-writer lock with writer preference. (Minimal sketch of the overall NR pattern below.)
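To make the NR pattern concrete, here is a minimal, hypothetical Rust sketch of the core idea: a sequential data structure, a shared operation log, and replicas that replay the log before serving reads. It deliberately omits flat combining, the circular-buffer log, and the writer-preference lock, and it does not use the real node-replication crate's API; all names are illustrative.

```rust
use std::sync::Mutex;

// Mutating operations that can be appended to the shared log.
enum Op {
    Insert(u64, u64),
    Remove(u64),
}

// The sequential data structure each replica owns privately.
#[derive(Default)]
struct SeqMap {
    inner: std::collections::BTreeMap<u64, u64>,
}

impl SeqMap {
    fn apply(&mut self, op: &Op) {
        match op {
            Op::Insert(k, v) => { self.inner.insert(*k, *v); }
            Op::Remove(k) => { self.inner.remove(k); }
        }
    }
}

// Shared, append-only operation log (a growable Vec stands in for NR's
// fixed-size circular buffer).
struct Log {
    ops: Mutex<Vec<Op>>,
}

// One replica per NUMA node: a private SeqMap plus how far it has consumed
// the shared log.
struct Replica<'a> {
    log: &'a Log,
    data: SeqMap,
    tail: usize,
}

impl<'a> Replica<'a> {
    // Mutations are appended to the log, then applied locally in log order.
    fn execute_mut(&mut self, op: Op) {
        self.log.ops.lock().unwrap().push(op);
        self.sync();
    }

    // Reads first catch the replica up to the current log tail.
    fn get(&mut self, k: u64) -> Option<u64> {
        self.sync();
        self.data.inner.get(&k).copied()
    }

    fn sync(&mut self) {
        let ops = self.log.ops.lock().unwrap();
        while self.tail < ops.len() {
            self.data.apply(&ops[self.tail]);
            self.tail += 1;
        }
    }
}

fn main() {
    let log = Log { ops: Mutex::new(Vec::new()) };
    let mut r0 = Replica { log: &log, data: SeqMap::default(), tail: 0 };
    let mut r1 = Replica { log: &log, data: SeqMap::default(), tail: 0 };

    r0.execute_mut(Op::Insert(1, 42));
    // r1 observes the update by replaying the shared log.
    assert_eq!(r1.get(1), Some(42));
}
```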
Concurrent NR (CNR) -- elegant in its simplicity
- Place commutative operations on separate logs
- Place conflicting operations in the same log (routing sketch below)
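A hypothetical sketch of how CNR's log routing might look: operations are hashed to one of several logs by the object they touch (the inode here), so commutative operations land on different logs while conflicting ones serialize on the same log. The `FsOp` type, log count, and hashing scheme are illustrative, not NrOS's actual implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_LOGS: usize = 4;

// Example file-system operations that may or may not commute.
enum FsOp {
    Write { inode: u64, offset: u64, len: u64 },
    Truncate { inode: u64, len: u64 },
}

impl FsOp {
    fn inode(&self) -> u64 {
        match self {
            FsOp::Write { inode, .. } | FsOp::Truncate { inode, .. } => *inode,
        }
    }
}

// Pick a log index from the inode so conflicting ops serialize on one log,
// while ops on different inodes can spread across logs.
fn log_for(op: &FsOp) -> usize {
    let mut h = DefaultHasher::new();
    op.inode().hash(&mut h);
    (h.finish() as usize) % NUM_LOGS
}

fn main() {
    let a = FsOp::Write { inode: 7, offset: 0, len: 512 };
    let b = FsOp::Truncate { inode: 7, len: 0 };
    // Same inode => same log => ordered with respect to each other.
    assert_eq!(log_for(&a), log_for(&b));
}
```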
NrOS design
- Multikernel
- The kernel is implemented as a set of per-NUMA-node kernel replicas
- Replicas access and modify local state and maintain consistency via NR
- Major subsystems:
- NR-vMem: Virtual memory (replicates per-process page-mapping metadata and page tables)
- NR-FS: In-memory file system
- NR-Scheduler: Process manager that loads and spawns ELF binaries
- Supports both native execution and POSIX via a NetBSD libOS
- Devices and interrupts are currently out of scope
- Comparable to a lightweight hypervisor
- Everything is shared memory within a NUMA-node.
- Leverage Rust language-level memory safety.
Physical Memory Management
- Divide memory into per-NUMA node caches (NCache).
- Each NCache has two classes of page frames: 4 KB and 2 MB
- Each core has a cache (TCache) of frames; when empty, it refills from the NCache
- Other dynamic allocations are slab-allocated from suitable frames (allocator sketch below)
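A hypothetical sketch of the two-level frame allocator described above: a per-core TCache serves allocations from a local stack of frames and refills in batches from the per-NUMA-node NCache when empty. The `Frame` type, batch size, and field names are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy)]
struct Frame {
    base: u64, // physical address of the frame
}

// Per-NUMA-node cache of free frames.
struct NCache {
    frames_4k: Vec<Frame>,
}

// Per-core cache of free frames.
struct TCache {
    frames_4k: Vec<Frame>,
}

const REFILL_BATCH: usize = 64;

impl TCache {
    fn alloc_4k(&mut self, ncache: &mut NCache) -> Option<Frame> {
        if self.frames_4k.is_empty() {
            // Refill a batch from the NUMA-node cache in one go to amortize
            // the cost of touching the shared allocator.
            let take = REFILL_BATCH.min(ncache.frames_4k.len());
            let start = ncache.frames_4k.len() - take;
            self.frames_4k.extend(ncache.frames_4k.drain(start..));
        }
        self.frames_4k.pop()
    }
}

fn main() {
    let mut ncache = NCache {
        frames_4k: (0..128u64).map(|i| Frame { base: i * 4096 }).collect(),
    };
    let mut tcache = TCache { frames_4k: Vec::new() };
    let f = tcache.alloc_4k(&mut ncache).unwrap();
    println!("allocated frame at {:#x}", f.base);
}
```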
Virtual Memory Management
- Uses MMU for isolation
- Per-process mappings are kept in a B-Tree (sketch below)
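A hypothetical sketch of the B-Tree-based mapping metadata: a per-process B-Tree keyed by virtual address, with a lookup that finds the mapping covering an address. In NrOS these map operations would go through NR so all replicas stay consistent; that plumbing and the hardware page-table updates are omitted here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy)]
struct Mapping {
    frame_base: u64, // backing physical frame
    len: u64,        // length of the mapped region in bytes
}

#[derive(Default)]
struct AddressSpace {
    // virtual address -> mapping metadata, ordered by base address
    mappings: BTreeMap<u64, Mapping>,
}

impl AddressSpace {
    fn map(&mut self, vaddr: u64, m: Mapping) {
        self.mappings.insert(vaddr, m);
    }

    // Find the mapping covering `vaddr`, if any, by looking at the closest
    // mapping at or below the address.
    fn resolve(&self, vaddr: u64) -> Option<(u64, Mapping)> {
        let (base, m) = self.mappings.range(..=vaddr).next_back()?;
        if vaddr < *base + m.len {
            Some((*base, *m))
        } else {
            None
        }
    }
}

fn main() {
    let mut asp = AddressSpace::default();
    asp.map(0x4000_0000, Mapping { frame_base: 0x10_0000, len: 0x20_0000 });
    assert!(asp.resolve(0x4000_1000).is_some());
    assert!(asp.resolve(0x5000_0000).is_none());
}
```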
File System
- Support pread/pwrite instead of fd-based read/write (avoids serializing all IO)
- Do not place file data in logs; use kernel-side buffers and place references to
the buffers in the log (sketch below)
- Write buffers are copied to kernel before logging
- Uses CNR
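A hypothetical sketch of keeping file data out of the log, as described above: the write payload is copied once into a kernel-side buffer, and the log entry carries only a reference to that buffer, so every replica applying the operation shares one copy of the data. Types and names are illustrative.

```rust
use std::sync::Arc;

// What actually lands in the operation log.
enum FsOp {
    Write {
        inode: u64,
        offset: usize,
        // Shared, immutable kernel buffer; the log entry is just a pointer.
        data: Arc<Vec<u8>>,
    },
}

fn enqueue_write(log: &mut Vec<FsOp>, inode: u64, offset: usize, user_buf: &[u8]) {
    // Copy from "user space" into a kernel buffer before logging, so the
    // operation stays valid no matter when each replica applies it.
    let data = Arc::new(user_buf.to_vec());
    log.push(FsOp::Write { inode, offset, data });
}

fn main() {
    let mut log = Vec::new();
    enqueue_write(&mut log, 3, 0, b"hello");
    // The log holds one entry referencing the shared buffer, not the bytes inline.
    if let Some(FsOp::Write { data, .. }) = log.first() {
        assert_eq!(data.len(), 5);
    }
}
```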
Process Management and Scaling
- Inspired by 1) Barrelfish dispatchers, 2) Lithe, and 3) Scheduler activations
- NR-Scheduler allocates CPUs to processes (is a CPU a NUMA-node or a core?)
- Processes may ask for more cores and can relinquish them
- Kernel notifies processes of allocations/deallocations via upcalls.
- Upcalls trigger user-level scheduler
- NR-Scheduler maps process IDs to process structures and executors to cores (sketch below)
- Some process creation actions are replicated; others are not. (To a first
approximation, read-only parts are replicated; writable portions are logged.)
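A hypothetical sketch of the scheduler bookkeeping mentioned above: maps from process ID to process state and from core to the owning process, plus a stubbed upcall that hands a newly granted core to the process's user-level scheduler. The struct layout and upcall signature are illustrative, not NrOS's actual interfaces.

```rust
use std::collections::HashMap;

struct Process {
    pid: u64,
    // Entry point of the user-level scheduler invoked on core grants.
    upcall_entry: fn(core_id: usize),
}

#[derive(Default)]
struct NrScheduler {
    processes: HashMap<u64, Process>,
    // Which process currently owns each core.
    executors: HashMap<usize, u64>,
}

impl NrScheduler {
    fn grant_core(&mut self, pid: u64, core_id: usize) {
        if let Some(p) = self.processes.get(&pid) {
            self.executors.insert(core_id, pid);
            println!("granting core {core_id} to pid {}", p.pid);
            // Upcall into the process; its user-level scheduler decides what to run.
            (p.upcall_entry)(core_id);
        }
    }
}

fn user_scheduler(core_id: usize) {
    println!("user-level scheduler picked up core {core_id}");
}

fn main() {
    let mut s = NrScheduler::default();
    s.processes.insert(1, Process { pid: 1, upcall_entry: user_scheduler });
    s.grant_core(1, 5);
}
```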
Log Garbage Collection
- Use IPIs to get cores to apply log records so that log space can be reclaimed (sketch below)
- Updates from the log are applied during idle periods as well
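A hypothetical sketch of the GC trigger described above: log space behind the slowest replica cannot be reclaimed, so when usage crosses a threshold, lagging replicas are prodded (via an IPI in the real kernel, stubbed here) to apply outstanding entries. The threshold, data layout, and `send_ipi` stub are assumptions.

```rust
const LOG_CAPACITY: usize = 1024;

struct ReplicaState {
    tail: usize, // index of the next log entry this replica will apply
}

// Stand-in for sending an inter-processor interrupt to a core on that replica.
fn send_ipi(replica_id: usize) {
    println!("IPI -> replica {replica_id}: please advance your tail");
}

fn maybe_collect(head: usize, replicas: &[ReplicaState]) {
    // Entries between the slowest replica's tail and the head are still live.
    let min_tail = replicas.iter().map(|r| r.tail).min().unwrap_or(head);
    let in_use = head - min_tail;
    if in_use > LOG_CAPACITY / 2 {
        // Ask every lagging replica to catch up so the space becomes reclaimable.
        for (id, r) in replicas.iter().enumerate() {
            if r.tail == min_tail {
                send_ipi(id);
            }
        }
    }
}

fn main() {
    let replicas = vec![ReplicaState { tail: 100 }, ReplicaState { tail: 900 }];
    maybe_collect(1000, &replicas);
}
```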
Implementation details
- Implemented from scratch in Rust
- Runs on x86-64
- Includes parts of a NetBSD-based libOS (see vibrio below)
- 11K lines of code + 16K LoC library code
- Only 3.6% are unsafe
- NR is 3.4K lines of Rust (5% unsafe)
- Userspace runtime support using vibrio
- Vibrio allows linking against NetBSD rump kernels
Evaluation
- Pinned threads to cores
- Disabled hyperthreads
- Turbo boost enabled
- How does the NrOS design compare against monolithic and multikernel OSes?
- NR-FS vs. tmpfs: allocate 256 MB files; each core accesses exactly one file; the
file is shared among 4-96 cores. Cores either read or write a 1 KB block at random.
- 1 file across 96 cores: read-only scales linearly (Linux does not scale at all)
- 1 file across 96 cores: 10% writes scales to about 10% of read-only throughput
(Linux scales to 8 cores and then falls off).
- 1 file across 96 cores: 60% writes scales to 3x at 16 cores and holds stable
(Linux is faster up to about 24 cores, then drops).
- 1 file across 96 cores: write-only gives a 50% improvement at 16 cores and holds
steady (Linux is faster up to 24 cores).
- When 24 files are shared across up to 96 cores, the trends are similar, but Linux
scales a bit better (and outperforms on the write-heavy load until about 72 cores).
- LevelDB: NrOS throughput scales by about 3.5x going from 4 to 28 cores (a 7x increase
in core count). Linux scales by roughly the same factor, but is slower to begin with.
- NR-vMem: against Linux, sv6, and Barrelfish -- repeated map operations
- NR replicates the entire address space, so no scaling; performance is stable.
- Linux degrades dreadfully as its map operations contend on its red-black tree.
- Barrelfish is about 10x faster because it's fully decoupled
- sv6 gets another factor of 10 beyond Barrelfish due to fine-grained locking.
- CNR appears to be about 2x Barrelfish.
- Those are all throughput numbers. Latency numbers vary depending on architecture.
- What are the latency, memory, and replication-mechanism trade-offs in the NrOS design?
- I didn't feel like they addressed this very well.
- Does NrOS matter for applications? Memcached and LevelDB were the two applications evaluated:
- As mentioned above: for LevelDB, NrOS is about 33% faster.
- For memcached, NrOS with 1 replica gets about 25% better throughput; with 4 replicas, it gets about 33% better throughput.