Operating System Support for Safe and Efficient Auxiliary Execution
Jing and Huang (OSDI 2022)
What kind of paper is this?
- Best of both worlds (a way to have both strong isolation and visibility) for helper
tasks (auxiliary execution).
The Story
- Many applications have background or auxiliary tasks (e.g., tuning, debugging,
reconfiguration, deadlock detection, garbage collection, checkpointing).
- In most cases, these tasks run in the same address space as the application, which
sacrifices isolation: a fault or vulnerability in the auxiliary task can crash or corrupt the application.
- Running in a separate process provides better isolation, but at the cost of worse
performance and less visibility into the application it serves.
- Can we have the best of both worlds? Yes! Orbit is a new isolation abstraction
that provides both strong isolation and visibility.
Goals for Orbits
- Strong Isolation (why?)
- Convenient Programming Model
- Automatic State Synchronization
- Controlled Alteration
- First-class Entity
Key Challenges
- Allow auxiliary entity to inspect state from the main entity
- Minimize performance overhead of strong isolation
API
- Created like a thread: orbit_create( ... entry function ...) -- once created,
the orbit can only be invoked via specific orbit execution calls.
- Invoke the orbit via orbit_call (synchronous) or orbit_call_async -- for async calls,
the main task's state is snapshotted BEFORE the call returns.
- Retrieve answer from async orbit: orbit_future_get
State Synchronization
- Data is synced from MAIN thread/process to an ORBIT.
- Data is synced only in orbit areas, which are collections of contiguous virtual pages.
- Orbit areas have the same VA in both the main and orbit.
- So all state that needs to be synchronized must live in an orbit area.
- When an orbit function is called, before the API returns, all pages in
the main task's orbit area are mapped into the orbit and are marked copy-on-write in
main and non-writable in the orbit. (This has to happen on EVERY orbit call.)
Orbit Execution
- Challenges
- A call requires crossing two address spaces.
- Calls can be sync, async, or concurrent (but concurrent just means
they get queued; there is no real concurrency).
- Mechanisms
- Task queue per orbit -- queue entry contains set of marked PTEs
- Each call gets a unique ID.
- Orbits function as single-threaded workers
- orbit_task_return: returns result of last orbit call
- Semaphore indicates whether an orbit has work to do
- Policies
- Calls to an orbit are processed in FIFO order.
- Check for pending returns; if any exist, signal the last orbit thread to wait.
- Privileged orbits can modify main program state.
- Only in orbit areas
- What about concurrent updates by main and orbit?
- Control updates via pull_orbit and push_orbit -- orbit authors place pushes
in code explicitly, and the authors of main explicitly pull. (I am not convinced this
actually works.) You can also push function pointers (e.g., to kill a thread).
Optimizations
- Retain orbit mappings after termination; on next call, keep any that have
not changed in main. (I'd really like data that indicates how often this happens.)
- Keep region bitmaps to avoid traversing too many PTEs (but I thought orbit
areas were small?).
- Support choice of COW vs. COPY (but this assumes you know a lot about
what is going on, and I bet it can vary a lot between invocations).
- Introduce delegate structs to deal with the case where we have large
structs and only some fields need to go in the orbit area. This is basically
just another level of indirection, so you've complicated every structure
that needs this.
Eval
- Research Questions
- Is orbit general enough to rewrite auxiliary tasks in real applications?
- Can orbit-based tasks provide strong isolation?
- How much overhead does orbit introduce?
- Why did they have to do this under QEMU?
- Microbenchmarks (overhead).
- orbit_create as a function of the number of orbit areas: these results confuse me -- orbit_create
is way faster than fork, but my understanding is that almost all
versions of fork do copy-on-write, so shouldn't these be the same?
"Most modern systems, including Linux, use a form of copy-on-write,
where the pages in the process memory are not copied at the time
of the fork call, but later when the parent or child first writes
to the page. That is, each page starts out as shared, and remains
shared until either process writes to that page; the process that
writes gets a new physical page (with the same virtual address)."
(One plausible answer: even a COW fork still duplicates the entire page table and
VMA hierarchy plus other process state, whereas orbit_create only has to set up
the declared orbit areas, which are typically a small slice of the address space.)
- orbit_call: as expected, time increases almost linearly with orbit area size
(it is comparable to fork). (How does it compare to a regular function call?)
- Applications (fault isolation)
- Do the application unit tests actually test auxiliary tasks?
- Fault injection: Null pointer dereferences -- main task keeps running and orbit gracefully
restarts.
- Fault injection: over-allocation (ditto)
- Fault injection: CPU hog
- None of these fault injection results are surprising as we are in a different
address space. However, are these the kind of auxiliary failures that traditionally
cause app failures? (It would have been nice to see any kind of crude analysis of
bug reports to see if this were the case.)
- Real world bug tests (4): Similarly -- they picked bugs that could be isolated into
orbits (but would you actually move backup selection into an orbit?)
- Applications (performance)
- End to end benchmarks show essentially no difference.
- Delegate objects are a huge win (unsurprisingly).
- Maintaining mapping is also a big win (unsurprisingly).
- Code Changes: quite small (does this include use of delegates??)