LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation
Shan, Huang, Chen, Zhang (2018)
What kind of paper is this?
- It's somewhere between a 'big idea' and 'we built a thing.'
- A common genre: "hardware is changing, so here is how software has to change."
- This is that genre -- there are reasons for disaggregation; if you buy
that, then you'll want this.
The Story
- Cloud vendors maximize revenue when they can fully utilize their resources.
- If you can manage resources in fine grain units (e.g., here is some CPU,
here is some memory) rather than as an entire server, you can probably get
better overall utilization and make more money.
- BUT, commodity operating systems don't really work like this.
- LegoOS does!
- Cloud vendors can make more money...
- I like this tag line: "When hardware is disaggregated, the OS should be also."
Why Disaggregated Hardware?
- Networking is much faster and more scalable (order of magnitude increase in past decade).
- Networking interfaces moving closer to components: RDMA, NVMe over Fabric; this
allows components to access the network without a CPU.
- HW devices have more processing power.
- What challenges does this present to software (the OS)?
- How do you deliver good performance when resources are across a network
instead of a local bus?
- How do you manage HW components locally with limited compute power?
- How do you manage distributed HW components?
- How do you handle failure of a component?
- What abstractions should an OS expose to applications?
What is a split kernel?
- A collection of loosely-coupled monitors (one per component)
- Operating independently: added, removed, restarted without affecting other monitors.
- Monitors communicate only when one needs resources managed by another.
- Only global tasks: resource allocation across components and handling
component failure.
- All communication is via network messaging.
- Goal is resource packing over performance.
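The bullets above can be made concrete with a tiny sketch. This is not LegoOS's actual wire format; the message struct, opcodes, and `handle` function are all illustrative, showing only that monitors interact exclusively via explicit, addressed messages and ignore traffic that is not theirs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split-kernel message: all inter-monitor communication is
 * explicit network messaging, never shared memory. */
enum op { OP_ALLOC_MEM, OP_READ_PAGE, OP_WRITE_PAGE };

struct msg {
    uint16_t src_monitor;   /* monitor sending the request */
    uint16_t dst_monitor;   /* monitor that owns the resource */
    enum op  opcode;
    uint64_t addr;          /* resource-specific argument */
};

/* A monitor acts only on messages addressed to it; other monitors can be
 * added, removed, or restarted without affecting it. */
int handle(const struct msg *m, uint16_t my_id)
{
    if (m->dst_monitor != my_id)
        return -1;          /* not ours */
    switch (m->opcode) {
    case OP_ALLOC_MEM:      /* ...allocate from locally managed DRAM... */
    case OP_READ_PAGE:      /* ...serve the page over the network... */
    case OP_WRITE_PAGE:     /* ...update the locally managed copy... */
        return 0;
    }
    return -1;
}
```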
What is LegoOS?
- Instance of a split kernel
- Applications see a collection of virtual servers (vNodes)
- vNodes map to components N to M (i.e., 1 vNode can run on multiple components;
1 component can host multiple vNodes).
- Three types of monitors: process, memory, storage (no network?)
- RDMA-based networking stack with no (minimal) shared state across monitors
- Supports Linux ABI
LegoOS abstractions
- Applications/users see vNodes (looks like a virtual machine)
- Only threads within the same process can share writable memory. (If you want
unrelated processes to share memory, you have to use messages.)
Hardware
- pComponent (CPU)
- Has no memory, just caches.
- Need some memory for working set -- implemented as an extended cache (ExCache)
below the LLC.
- Also add small amount of memory on pComponent for LegOS data structures
(uses physical addresses for these).
- Sees only virtual addresses (so caches are virtually indexed, virtually tagged: VIVT)
- No synonyms (multiple VAs mapping to a single PA), because processes cannot
share writable data.
- Use ASID to solve homonym problem (two address spaces having the same virtual
address, but different physical addresses).
- mComponent (DRAM)
- Responsible for virtual-to-physical address translation.
- Manages TLBs
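A small sketch of why the ASID matters for the pComponent's VIVT caches: two processes can use the same virtual address (a homonym), so the tag alone is ambiguous. Including the ASID in the tag disambiguates them without any virtual-to-physical translation on the pComponent. The struct and field widths below are illustrative, not LegoOS's actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative VIVT cache tag that carries the owning process's ASID. */
struct vivt_tag {
    uint64_t vtag;   /* high bits of the virtual address */
    uint16_t asid;   /* address-space ID of the owning process */
    uint8_t  valid;
};

/* A hit requires the virtual tag AND the ASID to match, so homonyms
 * (same VA in two address spaces) never alias each other's lines. */
int vivt_hit(const struct vivt_tag *t, uint64_t vtag, uint16_t asid)
{
    return t->valid && t->vtag == vtag && t->asid == asid;
}
```

Note there is no synonym problem to solve here: LegoOS forbids unrelated processes from sharing writable memory, so two VAs never name the same writable PA.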
Process Management
- Process manager runs in kernel space on pComponent.
- Runs user programs in user space.
- Local thread scheduling.
- Reserves some cores (2-4) for kernel background threads
- Normally, threads run to completion without scheduling or preemption.
- Also manages ExCache (has configurable associativity and replacement policy).
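A minimal sketch of a set-associative software cache in the spirit of ExCache, with associativity as a runtime knob. The set count, way limit, and FIFO replacement are illustrative choices, not LegoOS's actual parameters (the paper makes replacement policy configurable too).

```c
#include <assert.h>
#include <stdint.h>

#define NSETS    256
#define MAX_WAYS 8

struct ex_set {
    uint64_t tags[MAX_WAYS];   /* virtual page tags, one per way */
    int ways;                  /* configurable associativity */
    int next;                  /* FIFO victim pointer */
};

static struct ex_set sets[NSETS];

void excache_init(int ways)
{
    for (int i = 0; i < NSETS; i++) {
        sets[i].ways = ways;
        sets[i].next = 0;
        for (int w = 0; w < MAX_WAYS; w++)
            sets[i].tags[w] = UINT64_MAX;   /* invalid sentinel */
    }
}

/* Returns 1 on hit; on miss, fills a line, evicting FIFO-style. In the
 * real system a miss means fetching the page from an mComponent. */
int excache_access(uint64_t vpage)
{
    struct ex_set *s = &sets[vpage % NSETS];
    for (int w = 0; w < s->ways; w++)
        if (s->tags[w] == vpage)
            return 1;
    s->tags[s->next] = vpage;
    s->next = (s->next + 1) % s->ways;
    return 0;
}
```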
Memory Management
- Three types of memory: anonymous (heap/stack), mapped files, buffer caches.
- The memory monitor manages virtual and physical address spaces.
- No user process on mComponent.
- VM managed in two levels: a process' home mComponent makes coarse grain
allocations (vRegions) and local mComponents do the detailed management (virtual
memory areas, vma, in a tree structure).
- mComponents also manage the physical memory backing data in their vRegions.
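The two-level split can be sketched as: the home mComponent hands out coarse, fixed-size vRegions, and whichever mComponent owns a vRegion does the fine-grained allocation inside it. The 1 GB granularity and the bump allocator are illustrative assumptions, not the paper's actual vma-tree implementation.

```c
#include <assert.h>
#include <stdint.h>

#define VREGION_SIZE (1ULL << 30)   /* illustrative coarse granularity */

struct vregion {
    uint64_t base;    /* start VA of this coarse region */
    uint64_t brk;     /* next free VA inside the region */
    int owner;        /* mComponent managing this region's memory */
};

/* Home mComponent: coarse-grained grant of region #idx to an owner. */
struct vregion vregion_grant(int idx, int owner)
{
    uint64_t base = (uint64_t)idx * VREGION_SIZE;
    return (struct vregion){ .base = base, .brk = base, .owner = owner };
}

/* Owning mComponent: fine-grained allocation inside the region
 * (a bump pointer here; the real system keeps vmas in a tree). */
uint64_t vregion_alloc(struct vregion *r, uint64_t len)
{
    if (r->brk + len > r->base + VREGION_SIZE)
        return 0;                    /* region exhausted */
    uint64_t va = r->brk;
    r->brk += len;
    return va;
}
```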
Storage
- LegoOS supports a hierarchical interface, but sComponents essentially
treat entire pathnames as flat names.
- Hash table maps these names to the sComponent managing that name.
- Buffer caches are at mComponents, not sComponents.
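A sketch of the name-to-sComponent mapping: hash the whole pathname (treated as an opaque flat name) to pick the owning sComponent. The djb2 hash and the sComponent count are illustrative assumptions; the paper only says a hash table maps names to sComponents.

```c
#include <assert.h>

#define NUM_SCOMPONENTS 4   /* illustrative cluster size */

/* djb2 string hash, used here purely for illustration. */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* The hash covers the ENTIRE pathname, not its directory components, so
 * "/a/b/c" and "/a/b/d" may land on different sComponents. */
int scomponent_for(const char *path)
{
    return (int)(djb2(path) % NUM_SCOMPONENTS);
}
```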
Global Resource Management
- Two-level management: three global resource managers -- GPM (process),
GMM (memory), and GSM (storage) -- make global decisions and do load balancing.
- Monitors implement low level policies and manage resources.
Reliability
- Likelihood of memory failure increases (the RAID argument: more
mComponents means more things to fail).
- Since a process runs on a single pComponent, failure granularity is similar
to a monolithic system, so LegoOS does not do anything special.
- Similarly, since LegoOS does not split files across sComponents, nothing
special has to happen there either.
- For memory, use primary-backup replication plus a backup file: the
secondary stores only a log, while the primary stores all the data.
Implementation
- Written in C; target x86-64.
- 113 Linux system calls and 10 vectored syscall opcodes
- 206K SLOC; 56K SLOC drivers
- Emulate disaggregation on constrained servers.
- Three network stacks: 1) RDMA-based RPC, 2) sockets over RDMA, 3) TCP/IP.
- Process Monitor: contiguous physical memory set aside during boot. Fixed
locations for ExCache, tags, meta-data and kernel physical memory.
Eval
- Network Latency: Significantly faster than Linux
- Memory Latency: While p-Local is competitive with Linux, even with 4
workers, LegoOS has almost an order of magnitude lower throughput.
- Storage Throughput: About half that of Linux.
- PARSEC: between 10% and 200% worse performance.
- TensorFlow: LegoOS does better than Linux with a tiny swap, but as ExCache
and swap get larger, everyone converges to the same thing.
- Phoenix: similar behavior.
- Rest of the results just examine LegoOS policy decisions.