LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation
Shan, Huang, Chen, Zhang (2018)
What kind of paper is this?
- It's somewhere between a 'big idea' and 'we built a thing.'
- A common genre: "hardware is changing, so here is how software has to change."
- This is that genre -- there are reasons for disaggregation; if you buy
that, then you'll want this.
The Story
- Cloud vendors maximize revenue when they can fully utilize their resources.
- If you can manage resources in fine grain units (e.g., here is some CPU,
here is some memory) rather than as an entire server, you can probably get
better overall utilization and make more money.
- BUT, commodity operating systems don't really work like this.
- LegoOS does!
- Cloud vendors can make more money...
- I like this tag line: "When hardware is disaggregated, the OS should be also."
Why Disaggregated Hardware?
- Networking is much faster and more scalable (order of magnitude increase in past decade).
- Networking interfaces moving closer to components: RDMA, NVMe over Fabric; this
allows components to access the network without a CPU.
- HW devices have more processing power.
- What challenges does this present to software (the OS)?
- How do you deliver good performance when resources are across a network
instead of a local bus?
- How do you manage HW components locally with limited compute power?
- How do you manage distributed HW components?
- How do you handle failure of a component?
- What abstractions should an OS expose to applications?
What is a split kernel?
- A collection of loosely-coupled monitors (one per component)
- Operating independently: added, removed, restarted without affecting other monitors.
- Monitors communicate only when one needs resources managed by another.
- Only global tasks: resource allocation across components and handling
component failure.
- All communication is via network messaging.
- Goal is resource packing over performance.
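The bullets above can be made concrete with a tiny sketch. This is not LegoOS's actual wire format; the message struct, opcodes, and `handle` function are all illustrative, showing only that monitors interact exclusively via explicit, addressed messages and ignore traffic that is not theirs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split-kernel message: all inter-monitor communication is
 * explicit network messaging, never shared memory. */
enum op { OP_ALLOC_MEM, OP_READ_PAGE, OP_WRITE_PAGE };

struct msg {
    uint16_t src_monitor;   /* monitor sending the request */
    uint16_t dst_monitor;   /* monitor that owns the resource */
    enum op  opcode;
    uint64_t addr;          /* resource-specific argument */
};

/* A monitor acts only on messages addressed to it; other monitors can be
 * added, removed, or restarted without affecting it. */
int handle(const struct msg *m, uint16_t my_id)
{
    if (m->dst_monitor != my_id)
        return -1;          /* not ours */
    switch (m->opcode) {
    case OP_ALLOC_MEM:      /* ...allocate from locally managed DRAM... */
    case OP_READ_PAGE:      /* ...serve the page over the network... */
    case OP_WRITE_PAGE:     /* ...update the locally managed copy... */
        return 0;
    }
    return -1;
}
```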
What is LegoOS?
- Instance of a split kernel
- Applications see a collection of virtual servers (vNodes)
- vNodes map to components N to M (i.e., 1 vNode can run on multiple components;
1 component can host multiple vNodes).
- Three types of monitors: process, memory, storage (no network?)
- RDMA-based networking stack with no (minimal) shared state across monitors
- Supports Linux ABI
LegoOS abstractions
- Applications/users see vNodes (looks like a virtual machine)
- Only threads within the same process can share writable memory. (If you want
unrelated processes to share memory, you have to use messages.)
Hardware
- pComponent (CPU)
- Has no memory, just caches.
- Need some memory for working set -- implemented as an extended cache (ExCache)
below the LLC.
- Also add small amount of memory on pComponent for LegOS data structures
(uses physical addresses for these).
- Sees only virtual addresses (so caches are virtually indexed, virtually tagged: VIVT)
- No synonyms (multiple VAs mapping to a single PA), because processes cannot
share writable data.
- Use ASID to solve homonym problem (two address spaces having the same virtual
address, but different physical addresses).
- mComponent (DRAM)
- Responsible for virtual-to-physical address translation.
- Manages TLBs
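A small sketch of why the ASID matters for the pComponent's VIVT caches: two processes can use the same virtual address (a homonym), so the tag alone is ambiguous. Including the ASID in the tag disambiguates them without any virtual-to-physical translation on the pComponent. The struct and field widths below are illustrative, not LegoOS's actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative VIVT cache tag that carries the owning process's ASID. */
struct vivt_tag {
    uint64_t vtag;   /* high bits of the virtual address */
    uint16_t asid;   /* address-space ID of the owning process */
    uint8_t  valid;
};

/* A hit requires the virtual tag AND the ASID to match, so homonyms
 * (same VA in two address spaces) never alias each other's lines. */
int vivt_hit(const struct vivt_tag *t, uint64_t vtag, uint16_t asid)
{
    return t->valid && t->vtag == vtag && t->asid == asid;
}
```

Note there is no synonym problem to solve here: LegoOS forbids unrelated processes from sharing writable memory, so two VAs never name the same writable PA.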
Process Management
- Process manager runs in kernel space on pComponent.
- Runs user programs in user space.
- Local thread scheduling.
- Reserves some cores (2-4) for kernel background threads
- Normally, threads run to completion without scheduling or preemption.
- Also manages ExCache (has configurable associativity and replacement policy).
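A minimal sketch of a set-associative software cache in the spirit of ExCache, with associativity as a runtime knob. The set count, way limit, and FIFO replacement are illustrative choices, not LegoOS's actual parameters (the paper makes replacement policy configurable too).

```c
#include <assert.h>
#include <stdint.h>

#define NSETS    256
#define MAX_WAYS 8

struct ex_set {
    uint64_t tags[MAX_WAYS];   /* virtual page tags, one per way */
    int ways;                  /* configurable associativity */
    int next;                  /* FIFO victim pointer */
};

static struct ex_set sets[NSETS];

void excache_init(int ways)
{
    for (int i = 0; i < NSETS; i++) {
        sets[i].ways = ways;
        sets[i].next = 0;
        for (int w = 0; w < MAX_WAYS; w++)
            sets[i].tags[w] = UINT64_MAX;   /* invalid sentinel */
    }
}

/* Returns 1 on hit; on miss, fills a line, evicting FIFO-style. In the
 * real system a miss means fetching the page from an mComponent. */
int excache_access(uint64_t vpage)
{
    struct ex_set *s = &sets[vpage % NSETS];
    for (int w = 0; w < s->ways; w++)
        if (s->tags[w] == vpage)
            return 1;
    s->tags[s->next] = vpage;
    s->next = (s->next + 1) % s->ways;
    return 0;
}
```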
Memory Management
- Three types of memory: anonymous (heap/stack), mapped files, buffer caches.
- The memory monitor manages virtual and physical address spaces.
- No user process on mComponent.
- VM managed in two levels: a process' home mComponent makes coarse grain
allocations (vRegions) and local mComponents do the detailed management (virtual
memory areas, vma, in a tree structure).
- mComponents also manage the physical memory backing data in their vRegions.
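The two-level split can be sketched as: the home mComponent hands out coarse, fixed-size vRegions, and whichever mComponent owns a vRegion does the fine-grained allocation inside it. The 1 GB granularity and the bump allocator are illustrative assumptions, not the paper's actual vma-tree implementation.

```c
#include <assert.h>
#include <stdint.h>

#define VREGION_SIZE (1ULL << 30)   /* illustrative coarse granularity */

struct vregion {
    uint64_t base;    /* start VA of this coarse region */
    uint64_t brk;     /* next free VA inside the region */
    int owner;        /* mComponent managing this region's memory */
};

/* Home mComponent: coarse-grained grant of region #idx to an owner. */
struct vregion vregion_grant(int idx, int owner)
{
    uint64_t base = (uint64_t)idx * VREGION_SIZE;
    return (struct vregion){ .base = base, .brk = base, .owner = owner };
}

/* Owning mComponent: fine-grained allocation inside the region
 * (a bump pointer here; the real system keeps vmas in a tree). */
uint64_t vregion_alloc(struct vregion *r, uint64_t len)
{
    if (r->brk + len > r->base + VREGION_SIZE)
        return 0;                    /* region exhausted */
    uint64_t va = r->brk;
    r->brk += len;
    return va;
}
```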
Storage
- LegoOS supports a hierarchical interface, but sComponents essentially
treat entire pathnames as flat names.
- Hash table maps these names to the sComponent managing that name.
- Buffer caches are at mComponents, not sComponents.
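A sketch of the name-to-sComponent mapping: hash the whole pathname (treated as an opaque flat name) to pick the owning sComponent. The djb2 hash and the sComponent count are illustrative assumptions; the paper only says a hash table maps names to sComponents.

```c
#include <assert.h>

#define NUM_SCOMPONENTS 4   /* illustrative cluster size */

/* djb2 string hash, used here purely for illustration. */
static unsigned long djb2(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* The hash covers the ENTIRE pathname, not its directory components, so
 * "/a/b/c" and "/a/b/d" may land on different sComponents. */
int scomponent_for(const char *path)
{
    return (int)(djb2(path) % NUM_SCOMPONENTS);
}
```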
Global Resource Management
- Two-level management: three global resource managers -- GPM (process),
GMM (memory), and GSM (storage) -- make global decisions and do load balancing.
- Monitors implement low level policies and manage resources.
Reliability
- Likelihood of memory failure increases (the RAID argument: more
mComponents means more things to fail).
- Since a process runs on a single pComponent, failure granularity is similar
to a monolithic system, so LegoOS does not do anything special.
- Similarly, since LegoOS does not split files across sComponents, nothing
special has to happen there either.
- For memory, use primary-backup replication plus a backup file: the
secondary stores only a log, while the primary stores all the data.
Implementation
- Written in C; target x86-64.
- 113 Linux system calls and 10 vectored syscall opcodes
- 206K SLOC; 56K SLOC drivers
- Emulate disaggregation on constrained servers.
- Three network stacks: 1) RDMA-based RPC, 2) sockets over RDMA, 3) TCP/IP.
- Process Monitor: contiguous physical memory set aside during boot. Fixed
locations for ExCache, tags, meta-data and kernel physical memory.
Eval
- Network Latency: Significantly faster than Linux
- Memory Latency: While p-Local is competitive with Linux, even with 4
workers, LegoOS has almost an order of magnitude lower throughput.
- Storage Throughput: About half that of Linux.
- PARSEC: between 10% and 200% worse performance.
- TensorFlow: LegoOS does better than Linux with a tiny swap, but as ExCache
and swap get larger, everyone converges to the same thing.
- Phoenix: similar behavior.
- Rest of the results just examine LegoOS policy decisions.