Light-Weight Contexts: An OS Abstraction for Safety and Performance
Litton, Vahldiek-Oberwagner, Elnikety, Garg, Bhattacharjee, Druschel (2016)
What kind of paper is this?
- New idea
- With implementation and evaluation
The Fairy Tale:
Once upon a time processes were the main form of isolation
provided by operating systems.
Then we started allowing multiple threads of execution within
a single process, with all threads able to access all data in
a process address space.
Over time, we've seen that although threads share an address space,
threads also manage data that should be private to the thread.
The authors introduce a new isolation mechanism, light-weight
contexts (lwC), that provide the ability to support multiple
protection domains within a single process.
lwCs also enable interesting applications, such as
checkpoint/restore.
This powerful new abstraction lets programmers and software
designers live happily ever after.
Surprise
- There have been a bunch of similar kinds of designs
- sthreads (intertwined with threads)
- shreds (intertwined with threads)
- Dune uses VT-x to do something similar to lwCs
- Trellis: language support for something kind of like lwC
The Story
- Threads separate execution units from each other.
- lwCs separate memory, execution state, and privilege -- but are orthogonal
to threads.
- lwCs have their own:
- VM mappings
- File descriptors
- Credentials
- Threads can move among different lwCs.
Contributions
- New abstraction
- Implementation
- Sample Applications
- Evaluation
Key Differentiators
- Orthogonal to threads (unlike sthreads, shreds memory domains, Determinator)
- Allows snapshot/restore (unlike sthreads, shreds memory domains)
- Privilege separation within a process (unlike SpaceJMP, Mach, COMPOSITE, Mungi)
- No HW support (unlike Dune, Mondrian; lwC seems to provide similar
functionality to Dune, but in a different manner, which would be
interesting to dig into)
- No language/compiler support (unlike Trellis, Mondrian, CHERI)
- Focus on privilege, not resources (unlike resource containers)
- No dynamic, runtime checking (unlike SFI, CFI, CPI, NaCl)
What is an lwC?
- Has its own virtual address space, page-mappings, file descriptors
and credentials.
- Process starts with a root lwC -- it can create more lwCs and give each
whatever resources it wants to give.
- Referenced by descriptor; may have multiple such descriptors
- Terminates when last reference goes away
- Creating an lwC does not start "running" it -- it's just a context. When
a thread switches to it, it copies the thread state and starts running.
- Switching lwCs is akin to a coroutine yield.
- Creating a child is kind of like clone -- the parent gets to decide
how resources are/are not shared with the child (using resource-spec).
- When a thread leaves an lwC, its state remains in the
original lwC, so that when it returns, it picks that state back up
(with passed values arriving via arguments), as if it had just made a
switch into that lwC.
- Capability based system
- When you create an lwC, each descriptor can be COW, SHARED, or
UNMAP.
- If allowed, you can map resources from one lwC into another
using lwOverlay.
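
The per-resource sharing choice described above can be illustrated with a toy
model. This is only a sketch in Python, not the lwC API: `Context`, `Share`,
and `create_child` are hypothetical names, resources are plain Python objects
rather than memory regions or file descriptors, and COW is modeled as an
eager deep copy instead of a lazy one.

```python
import copy
from enum import Enum


class Share(Enum):
    COW = "cow"        # child gets a private copy (eager deep copy in this toy)
    SHARED = "shared"  # parent and child see the same object
    UNMAP = "unmap"    # child does not get the resource at all


class Context:
    """Toy stand-in for an lwC: a named bag of resources (hypothetical)."""

    def __init__(self, resources=None):
        self.resources = dict(resources or {})

    def create_child(self, spec=None, default=Share.COW):
        """Build a child context; `spec` maps resource name -> Share mode."""
        spec = spec or {}
        child = Context()
        for name, value in self.resources.items():
            mode = spec.get(name, default)
            if mode is Share.SHARED:
                child.resources[name] = value
            elif mode is Share.COW:
                child.resources[name] = copy.deepcopy(value)
            # Share.UNMAP: leave the resource out of the child entirely
        return child


root = Context({"heap": [1, 2, 3], "log": [], "signing_key": ["secret"]})
child = root.create_child({"log": Share.SHARED, "signing_key": Share.UNMAP})

child.resources["heap"].append(4)        # private to the child
child.resources["log"].append("hello")   # visible to the parent too
```

In this model the parent's `heap` is unchanged after the child's write, the
shared `log` reflects it, and the child has no `signing_key` at all.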
Usage patterns
- Snapshot/Rollback
- Create a context to save initial state (the child holds the
saved state)
- Handle a request
- Now switch to the saved context, which in turn destroys the original one
and creates a new one.
- Lather, rinse, repeat.
- Server event-handling isolation (prevent information in different sessions
from leaking to one another).
- Server creates a socket descriptor for each client.
- Uses different contexts to respond to each descriptor
- Isolation of sensitive data (e.g., a signing key)
- Create a child who will have full rights to the key
- Parent relinquishes access to child's space
- Child enters an infinite loop: each time a thread switches in, it
maps in an argument buffer, signs it, and unmaps it
- Monitor child system calls
- Create a child indicating it should trap to parent
- Basically switch to the child context; the parent runs again only on a
system call trap
- If child is allowed to make the call, do it and give the child the result,
else error.
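
The snapshot/rollback pattern above can be sketched as a toy model. This
illustrates only the control flow, not the lwC API: the `Server` class and
its method names are hypothetical, and `copy.deepcopy` stands in for what
would be a copy-on-write child context holding the saved state.

```python
import copy


class Server:
    """Toy server whose per-request state is discarded after each request."""

    def __init__(self):
        self.state = {"requests_handled": 0, "scratch": []}
        self._snapshot = None

    def take_snapshot(self):
        # Stand-in for creating a child context that holds the saved state.
        self._snapshot = copy.deepcopy(self.state)

    def handle(self, request):
        # Handling a request dirties the live state.
        self.state["scratch"].append(request)
        self.state["requests_handled"] += 1
        return request.upper()

    def rollback(self):
        # Stand-in for switching to the saved context: the dirty state is
        # dropped, and a fresh snapshot is taken for the next round.
        self.state = self._snapshot
        self.take_snapshot()


def serve(server, requests):
    server.take_snapshot()
    replies = []
    for request in requests:
        replies.append(server.handle(request))
        server.rollback()  # lather, rinse, repeat
    return replies


server = Server()
replies = serve(server, ["alice", "bob"])
```

After the loop, nothing from either request is left in `server.state`, which
is the property the pattern is after.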
Implementation
- Memory represented by a vmspace, which is a set of vm_map_entry
structures, each of which corresponds to a contiguous chunk of memory.
- Fork and lwCreate are proportional to the number of vm_map_entry
structures.
- Leverage PCID to avoid flushing TLB on switch
- lwCreate copies File Table and gets a pointer to Credentials (ucred)
- Default: child gets a copy of vmspace and file table and a ref to
ucred.
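
A toy model of why creation cost tracks the number of map entries rather
than the amount of memory mapped. This is an assumption-laden sketch:
`MapEntry` and `copy_vmspace` are hypothetical names, and the real
vm_map_entry structure carries far more than the three fields modeled here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MapEntry:
    """Hypothetical stand-in for one contiguous mapped region."""
    start: int
    end: int
    prot: str


def copy_vmspace(entries):
    """Copy an address-space description one entry at a time.

    Returns the copy and the number of entries visited, which is the
    quantity the cost is proportional to in this model.
    """
    copied = []
    steps = 0
    for entry in entries:
        copied.append(replace(entry))  # new object with the same fields
        steps += 1
    return copied, steps


# One 1 GiB region versus 1000 separate 4 KiB regions (about 4 MiB total).
one_big = [MapEntry(0, 1 << 30, "rw")]
many_small = [MapEntry(i * 4096, (i + 1) * 4096, "rw") for i in range(1000)]

_, steps_big = copy_vmspace(one_big)
_, steps_small = copy_vmspace(many_small)
```

The single large mapping takes one step to copy while the many small ones
take a thousand, even though they cover far less memory.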
Evaluation
- What would you like to see in an eval?
- In many ways, Section 4 (applications) was the most important part of the
eval.
- So all I really care about is how long each micro-operation takes and
how lwC compares to alternatives.
- I certainly don't need 5.5 pages of eval!
- This is downright slimy: ``In an
experiment with Linux 3.11.10 on the same hardware,
user thread switches run in 6% of the time required by
semaphore-based kernel thread switches,'' You measured it; add it to the
table in real numbers (which should be about .06 * 4.12 = .25, or 10x faster
than lwC).
- Would have liked to see the lwC create, switch, destroy compared to
threads and processes.