Theseus: An Experiment in Operating System Structure and State Management
Boos, Liyanage, Ijaz, Zhong (2020)
What kind of paper is this?
- Kind of like Singularity, "We report an experimentation of OS structural design, state management, and implementation techniques that leverage the power of modern safe systems programming languages, namely Rust."
The Story
- Although existing systems are modularized (to some extent), they often
exchange state and then hold on to that state in a way that makes modules
(needlessly) depend on one another.
- This fate sharing can lead to system failures.
- Rust has an ownership model for memory that seems well-matched to the
problem of avoiding this kind of state spilling. So let's build a system
that:
- Shifts OS safety responsibilities (as much as possible) to the compiler (Rust).
- Minimizes the state that components hold for each other.
- Theseus maintains dependencies among small components that comprise the OS.
This enables live system evolution and fault recovery.
Fun Facts of System Scale
- 4 person years
- ~38000 lines of from-scratch Rust code
- 900 lines of bootstrap assembly code
- 246 crates of which 176 are first-party
- 72 unsafe code blocks or statements across 21 crates, most of which are for
port I/O or special register access.
Rust
- Strong type and memory safety
- No GC
- [Theseus uses the core and alloc libraries, but not the standard library]
- Rust has an ownership model in which every region of memory has a single
owner.
- Data is freed when its owner goes out of scope.
- Pointers can be borrowed, but the borrowed reference cannot outlive the
owner.
- Rust traits are like Java interfaces or C++ concepts (I think). They
indicate what methods a type must have.
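The ownership, borrowing, and trait points above fit in a few lines of Rust. This is a minimal sketch, not code from Theseus: `Describe`, `Buffer`, `show`, and `ownership_demo` are made-up names, and it uses `std` for convenience even though Theseus itself only uses `core` and `alloc`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// A trait names the methods a type must provide (hypothetical example trait).
trait Describe {
    fn describe(&self) -> String;
}

/// A resource that records its own release in a shared log when dropped.
struct Buffer {
    name: String,
    log: Rc<RefCell<Vec<String>>>,
}

impl Describe for Buffer {
    fn describe(&self) -> String {
        format!("buffer {}", self.name)
    }
}

impl Drop for Buffer {
    // Runs automatically when the owner goes out of scope; no GC involved.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("freed {}", self.name));
    }
}

/// Borrows any `Describe` value; the borrow cannot outlive the owner.
fn show(item: &dyn Describe) -> String {
    item.describe()
}

/// Returns the description seen while borrowed, plus the release log.
fn ownership_demo() -> (String, Vec<String>) {
    let log = Rc::new(RefCell::new(Vec::new()));
    let seen;
    {
        let buf = Buffer { name: "a".to_string(), log: Rc::clone(&log) };
        seen = show(&buf); // the shared borrow ends before ownership moves
        let _moved = buf;  // ownership moves; `buf` is no longer usable
    } // `_moved` goes out of scope here, so `drop` runs exactly once
    let entries = log.borrow().clone();
    (seen, entries)
}
```

Using `buf` after the move, or returning a reference to it out of the inner block, is rejected by the compiler rather than caught at runtime.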
Overview
- Unit of modularity is called a cell.
- Single Address space OS with single privilege level.
- Global heap.
- Design Principles:
- Runtime persistent bounds for all cells.
- Maximize language power for all cells.
- Minimize state spill between cells.
Cells
- Implementation: a crate
- Compile time: single object file
- Runtime: set of memory regions with per-section bounds and dependency
metadata.
- Are dynamically loaded.
- No Rust submodules; everything is a distinct crate.
Intralingual
- The key here is really that wherever possible, use the language (and
therefore the compiler) to do as much work as possible, rather than
manually implementing mechanisms to accomplish things (see examples
at the end of this section).
- Matching the Theseus execution environment to the language.
- For Rust: single address space, single privilege level, single global heap.
- Prioritize safety over performance.
- For each resource, identify its invariants and then maintain
those invariants across all invocations (i.e., make everything strongly
typed).
- Uses Rust ownership and careful unwinding to reclaim resources.
- Invariants for memory
- VA to PA mapping is 1-to-1 (i.e., sharing happens at the language level, via
shared references).
- No memory is accessible outside the bounds of its mapping. Accomplish this by
accessing memory only through correctly sized types.
- Unmap exactly once, and only when no references to the object remain.
- The access privileges (write/execute) are specified at map time and there
is no other way to grant anyone other memory privileges.
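The memory invariants above can be expressed as an owned type whose methods are the only way to touch the memory. This is a sketch under loud assumptions: `MappedRegion` and its methods are hypothetical names, a `Vec<u8>` stands in for real mapped pages, and the `unmaps` counter exists only so the "unmap exactly once" behavior can be observed.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for an owned mapping; a `Vec<u8>` plays the mapped pages.
struct MappedRegion {
    bytes: Vec<u8>,
    writable: bool,          // fixed at map time, never changed afterwards
    unmaps: Rc<Cell<usize>>, // counts unmap operations for the demo
}

impl MappedRegion {
    fn map(len: usize, writable: bool, unmaps: Rc<Cell<usize>>) -> Self {
        MappedRegion { bytes: vec![0; len], writable, unmaps }
    }

    /// Bounds-checked read: the returned slice borrows from `self`, so it
    /// cannot outlive the mapping and cannot reach past its end.
    fn read(&self, offset: usize, len: usize) -> Option<&[u8]> {
        let end = offset.checked_add(len)?;
        self.bytes.get(offset..end)
    }

    /// Writes require both exclusive access and the write permission.
    fn write(&mut self, offset: usize, data: &[u8]) -> Result<(), &'static str> {
        if !self.writable {
            return Err("region is not writable");
        }
        let end = offset.checked_add(data.len()).ok_or("offset overflow")?;
        let dst = self.bytes.get_mut(offset..end).ok_or("out of bounds")?;
        dst.copy_from_slice(data);
        Ok(())
    }
}

impl Drop for MappedRegion {
    // The single owner is dropped once, so the region is unmapped once,
    // and only after every borrow handed out by `read` has ended.
    fn drop(&mut self) {
        self.unmaps.set(self.unmaps.get() + 1);
    }
}
```

Because `read` ties the slice's lifetime to `&self`, the borrow checker rejects any attempt to drop (unmap) the region while a slice into it is still alive.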
- Invariants for Tasks
- Spawning a new thread may not violate memory invariants. (Entry
function can run only once, arguments and return types must be allowed
to safely transfer among threads, and lifetimes of those types must
outlive the thread.)
- All task state must be released in all (terminating) paths.
- Memory accessed by the task (thread) must outlive the task (thread).
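The task invariants above map naturally onto trait bounds of a spawn function. The sketch below is an assumption-laden illustration: `spawn_task` and `sum_on_task` are hypothetical names, and `std::thread` stands in for a Theseus task.

```rust
use std::thread::{self, JoinHandle};

/// Sketch of a spawn interface whose bounds carry the task invariants.
/// `std::thread` stands in for a Theseus task here.
fn spawn_task<F, A, R>(entry: F, arg: A) -> JoinHandle<R>
where
    F: FnOnce(A) -> R + Send + 'static, // entry function can run only once
    A: Send + 'static,                  // argument may cross threads and outlives the task
    R: Send + 'static,                  // same for the return value
{
    thread::spawn(move || entry(arg))
}

/// Sums the values on a separate task and waits for the result.
fn sum_on_task(values: Vec<u64>) -> u64 {
    // `values` is moved into the task, so the caller cannot touch it afterwards.
    let handle = spawn_task(|v: Vec<u64>| v.iter().sum::<u64>(), values);
    handle.join().expect("task panicked")
}
```

Passing an `Rc<T>` or a reference to a local variable as `arg` fails to compile, because `Rc` is not `Send` and a local borrow is not `'static`. The check costs nothing at runtime.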
Minimizing State Spills
- Only possible to spill state at 1) cell interfaces, if 2) callee
modifies state.
- Clients (of a server) maintain their own state, not the server itself.
- Interactions between clients and servers have to carry all the state
necessary in their interaction functions (i.e., the servers are stateless
with respect to client activities). This can work because we're all in the same
address space, so clients can pass references to objects directly.
- Soft state, which does in fact spill state, is OK, because it
can be tossed at any point. (Doesn't this mean that you might be able
to launch side channel attacks?)
- Avoiding state spills leads to a design where task structs are
pretty lean, because state is left in the hands of cells that need
the particular state (e.g., schedulers).
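A tiny illustration of "the client keeps its own state" follows. It is only a sketch: `Cursor` and `read_next` are invented for this example and do not correspond to any Theseus interface.

```rust
/// Progress through a log, owned by the client rather than the server.
#[derive(Debug, Default, PartialEq)]
struct Cursor {
    next: usize,
}

/// A "server" with no per-client bookkeeping: everything it needs arrives
/// in the call, and the updated state stays with the caller.
fn read_next<'a>(log: &'a [&'a str], cursor: &mut Cursor) -> Option<&'a str> {
    let entry = log.get(cursor.next).copied()?;
    cursor.next += 1;
    Some(entry)
}
```

Two clients with separate `Cursor` values progress independently, and the server side could be replaced between calls without losing anyone's position, since it never held that position in the first place.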
Realizing live evolution: Cell Swapping
- Load new cell into a new empty CellSpace.
- Verify dependencies into and out of the new cell(s).
- Update references into the cell in all cells that depend on it.
- Remove the old cell(s).
- Triggered on commit to repo (cute)
- This works so well because cells contain all their state: nothing
that depends on a cell holds state of that cell, so the cell can change
whatever it wants internally.
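The swap steps above can be mimicked with a toy name table. Everything here is hypothetical (`Cell`, `CellSpace`, `require`, `swap`): the real system works on loaded object-file sections and their dependency metadata, whereas this sketch only shows the verify-then-redirect order of operations.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, heavily simplified cell: a name, a version, and the
/// public symbols it exports.
struct Cell {
    name: String,
    version: u32,
    exports: Vec<String>,
}

/// Hypothetical namespace: loaded cells plus the symbols that other
/// cells require from them.
#[derive(Default)]
struct CellSpace {
    cells: HashMap<String, Arc<Cell>>,
    uses: Vec<(String, String)>, // (cell depended upon, symbol required)
}

impl CellSpace {
    fn load(&mut self, cell: Cell) {
        self.cells.insert(cell.name.clone(), Arc::new(cell));
    }

    /// Records that some dependent cell needs `symbol` from `cell`.
    fn require(&mut self, cell: &str, symbol: &str) {
        self.uses.push((cell.to_string(), symbol.to_string()));
    }

    /// `new_cell` plays the role of a cell already loaded in isolation.
    /// Verify dependencies first; only then redirect the shared entry and
    /// hand back the old cell so the caller can release it.
    fn swap(&mut self, new_cell: Cell) -> Result<Option<Arc<Cell>>, String> {
        for (cell, symbol) in &self.uses {
            if *cell == new_cell.name && !new_cell.exports.contains(symbol) {
                return Err(format!("new cell lacks required symbol {}", symbol));
            }
        }
        Ok(self.cells.insert(new_cell.name.clone(), Arc::new(new_cell)))
    }
}
```

A failed verification leaves the old cell in place, so a bad update cannot leave dependents pointing at a missing symbol.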
Availability via Fault Recovery
- System wide fault log.
- Use unwinding to clean up after task.
- Restart the task -- replace corrupted cells with new ones using
cell-swapping.
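The unwind-then-restart idea can be approximated in ordinary Rust. This sketch makes several assumptions: it uses `std::panic::catch_unwind` as a stand-in for Theseus's own unwinding, a plain `Vec<String>` as the fault log, and a second function (`fixed`) in place of a swapped-in cell. All names are invented for illustration.

```rust
use std::cell::Cell;
use std::panic::{self, AssertUnwindSafe};
use std::rc::Rc;

/// Guard whose `Drop` releases a resource, also while a panic unwinds.
struct Held(Rc<Cell<i32>>);

impl Held {
    fn acquire(count: &Rc<Cell<i32>>) -> Held {
        count.set(count.get() + 1);
        Held(Rc::clone(count))
    }
}

impl Drop for Held {
    fn drop(&mut self) {
        self.0.set(self.0.get() - 1);
    }
}

/// Runs `task`; if it panics, logs the fault and runs `replacement` instead.
fn run_with_recovery(
    task: fn(&Rc<Cell<i32>>) -> u32,
    replacement: fn(&Rc<Cell<i32>>) -> u32,
    held: &Rc<Cell<i32>>,
    fault_log: &mut Vec<String>,
) -> u32 {
    match panic::catch_unwind(AssertUnwindSafe(|| task(held))) {
        Ok(value) => value,
        Err(_) => {
            fault_log.push("task panicked; restarting with replacement".to_string());
            replacement(held)
        }
    }
}

/// Simulated faulty task: acquires a resource, then fails.
fn faulty(held: &Rc<Cell<i32>>) -> u32 {
    let _guard = Held::acquire(held);
    panic!("injected fault")
}

/// Simulated replacement: same resource use, no fault.
fn fixed(held: &Rc<Cell<i32>>) -> u32 {
    let _guard = Held::acquire(held);
    7
}
```

After `run_with_recovery(faulty, fixed, ...)` returns, the resource count is back to zero: unwinding ran the guard's destructor in the failed attempt, and normal scope exit ran it in the retry.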
Eval
- Live Evolution
- Inter-Task Communication: Live swap the ITC mechanism while applications
are busily exchanging messages.
- Scheduler and runqueue: Replace the actual queue and the scheduling algorithm
while there are tasks actively running. (I do not completely understand how
the tasks moved from one queue to the other -- oh wait -- I get it! You remove
tasks from the old runqueue, but reschedule them on the new one and when the
old one is empty, you release it. This is super cool!)
- Ethernet driver and Network update Client: fixed a bug without losing
any NIC configuration settings or packets in progress!
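The runqueue swap, as I understand it from the note above, amounts to draining the old queue into the new one and releasing the old one once it is empty. The sketch below encodes that reading only; `Task`, `RoundRobinQueue`, `PriorityQueue`, and `migrate` are hypothetical, and the real mechanism has to do this while tasks are actively running.

```rust
use std::collections::VecDeque;

/// Hypothetical task descriptor; it does not record which queue holds it.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    id: u32,
    priority: u8,
}

/// Old policy: plain round-robin (FIFO).
#[derive(Default)]
struct RoundRobinQueue {
    tasks: VecDeque<Task>,
}

/// New policy: highest priority first.
#[derive(Default)]
struct PriorityQueue {
    tasks: Vec<Task>,
}

impl PriorityQueue {
    fn add(&mut self, task: Task) {
        self.tasks.push(task);
    }

    /// Removes and returns the task with the highest priority, if any.
    fn pick_next(&mut self) -> Option<Task> {
        let idx = self
            .tasks
            .iter()
            .enumerate()
            .max_by_key(|(_, t)| t.priority)
            .map(|(i, _)| i)?;
        Some(self.tasks.remove(idx))
    }
}

/// Moves every task to the new queue; the old queue is taken by value,
/// so it is released when this function returns, after it is empty.
fn migrate(mut old: RoundRobinQueue, new: &mut PriorityQueue) {
    while let Some(task) = old.tasks.pop_front() {
        new.add(task);
    }
}
```

Because `migrate` consumes the old queue, no code can keep scheduling from it after the swap; the compiler rejects any later use.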
- Fault Recovery
- Stress test on HW faults (that's brave!).
- Use QEMU and then do fault injection to induce HW faults.
- Random faults don't do enough damage, so they handcrafted faults as well.
- Faults in ITC (IPC): Theseus recovers in 11 of 13 cases (MINIX fails in all cases).
- Inject 800,000 random, general faults; 0.083% (664) manifest as errors.
- Theseus recovers from 69% (I love the fact that they admit they did not
get them all).
- Performance
- Compare to Linux using LMbench.
- High level takeaway: No glaring performance problems
- Mapped Pages: Better (~10%) performance and scalability, because ownership
means you don't have to traverse a tree to find the page and safety checks
are done at compile time.
- Avoiding state spill: Negligible overhead. It does, in fact, take more time
to, e.g., remove things from runqueues, because you have to search them all
(the task itself does not keep track of where it is), BUT that time is
negligible in any real workload.
- Safety (in the heap allocation) incurs about 22% overhead.
- LMbench: Interestingly, the Theseus numbers look really good and yet they
avoid making strong claims about outperforming Linux; I really like that!