Tornado: Maximizing Locality and Concurrency
in a Shared Memory Multiprocessor Operating System
Gamsa, Krieger, Appavoo, Stumm (1999)
What kind of paper is this?
- Describes a system
- Motivated by changes in hardware: "The hardware has changed, so the
OS should change too."
Motivation
- Design an operating system specifically for shared-memory
multiprocessors (SMMPs)
- Specifically for SMMPs with NUMA, where locality plays a significant role
Approach
- Object-oriented:
- Clustered objects (partition objects across resources)
- Protected procedure calls (preserve locality on IPC)
- Object-oriented locking
 
The Hardware Landscape
- New class of multiprocessors emerging
- Memory latency is higher (relative to CPU speed)
- Write-sharing is expensive
- Secondary caches are large
- Cache lines are larger, which leads to false sharing
- NUMA effects
- Larger systems (more processors, etc.)
The Position
- The new hardware suggests that locality is of paramount importance.
- Ramifications of limited locality
- Must minimize read/write sharing and write sharing to avoid
cache coherence overhead
- Must minimize false sharing
- Must minimize the distance between accessing processor and main memory
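
A small worked example of the false-sharing point above. This is only an illustrative sketch: the 64-byte line size, the 8-byte counter size, and the helper names are assumptions, not figures from the paper. It shows that per-processor counters packed back to back land on the same cache line, while counters padded out to a full line each get their own.

```python
CACHE_LINE = 64    # bytes; an assumed line size, not a figure from the paper
COUNTER_SIZE = 8   # bytes per counter (also assumed)

def cache_line_of(offset, line_size=CACHE_LINE):
    """Index of the cache line that holds the byte at `offset`."""
    return offset // line_size

def counter_lines(num_cpus, stride):
    """Cache line used by each CPU's counter when counters sit `stride` bytes apart."""
    return [cache_line_of(cpu * stride) for cpu in range(num_cpus)]

# Counters laid out back to back: all four CPUs write to line 0.
packed = counter_lines(4, stride=COUNTER_SIZE)

# Each counter padded to a full line: every CPU writes to its own line.
padded = counter_lines(4, stride=CACHE_LINE)

print(packed)  # [0, 0, 0, 0]
print(padded)  # [0, 1, 2, 3]
```

In the packed layout every write by one processor invalidates the line in the other three caches, even though no counter is logically shared. Padding trades memory for the absence of that coherence traffic.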
 
Object-oriented structure
- Goal is to make as much of the OS state process local as possible.
- Avoids memory sharing and contention for resources.
- What do you suppose the negatives are?
- No global policies, only local ones
- I wonder if you incur more overhead passing stuff around among lots
of different objects?
- Is it hard to make the synchronization deadlock-free?
 
- Implementations can be replaced, so it's easy to experiment: you
can start with a simple implementation and make it more complicated
only if you need to.
Clustered Objects
- The mechanism for handling sharing.
- Provides a systematic way of partitioning resources across
multiple processing elements or nodes.
- Abstraction is a single object; implementation is multiple objects
- Objects are referenced as if there were a single object for the
whole system, but in fact, different processes/processors are actually
accessing different objects (that coordinate).
- The degree of clustering can vary (one representative per system;
one per node; one per process; etc.).
- Synchronization is handled by the objects however they like.
- Example: The Process Object
- One representative (rep) on each processor that is running a thread
from that process.
- Heavily updated fields are updated at a "master"
- Other fields are pulled (e.g., the list of memory regions)
 
- Implemented with another level of indirection (yay!)
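
To make the "single abstraction, multiple reps" idea concrete, here is a minimal sketch of a clustered counter. The class and method names are hypothetical, a plain dict stands in for the per-processor translation table, and the processor id is passed explicitly (in a real kernel it would be implicit). It is not Tornado's implementation.

```python
class CounterRep:
    """Per-processor representative: holds only this processor's share."""
    def __init__(self):
        self.local_count = 0


class ClusteredCounter:
    """One logical counter, implemented as one representative per processor."""
    def __init__(self):
        # Stands in for the per-processor translation table (an assumption).
        self._reps = {}

    def _rep_for(self, cpu):
        # The extra level of indirection: find this processor's rep,
        # creating it lazily on first access.
        rep = self._reps.get(cpu)
        if rep is None:
            rep = self._reps[cpu] = CounterRep()
        return rep

    def increment(self, cpu):
        # Common case touches only processor-local state.
        self._rep_for(cpu).local_count += 1

    def value(self):
        # Rare case: the reps coordinate to produce a global answer.
        return sum(rep.local_count for rep in self._reps.values())


counter = ClusteredCounter()
counter.increment(cpu=0)
counter.increment(cpu=0)
counter.increment(cpu=1)
print(counter.value())  # 3
```

Callers hold one reference (`counter`) and never see the reps. The design choice is that the frequent operation (increment) stays local, while the infrequent one (reading the total) pays the cost of visiting every rep.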
Dynamic Memory Allocation
- Per-processor memory pools
- Enforces NUMA allocation
- Fast synchronization, hand-coded in assembly (16-21 instructions)
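
A rough sketch of the per-processor pool idea, under stated assumptions: the `PerCpuAllocator` class, its methods, and the `(home_cpu, index)` block representation are all hypothetical, and the synchronization a real allocator needs for remote frees is left out. The point it illustrates is that a block is always handed out from, and returned to, the pool of the processor it belongs to.

```python
class PerCpuAllocator:
    """Toy allocator with one free list per processor (illustrative only)."""

    def __init__(self, num_cpus, blocks_per_cpu):
        # Each block is tagged with its home processor so that it can be
        # returned to the right pool no matter who frees it.
        self._free = [
            [(cpu, i) for i in range(blocks_per_cpu)]
            for cpu in range(num_cpus)
        ]

    def alloc(self, cpu):
        # Allocate only from the caller's own pool: no shared free list,
        # and the memory is local to the allocating processor.
        pool = self._free[cpu]
        if not pool:
            raise MemoryError(f"pool for cpu {cpu} is empty")
        return pool.pop()

    def free(self, block):
        # Return the block to its home pool, preserving locality.
        home_cpu, _ = block
        self._free[home_cpu].append(block)

    def free_count(self, cpu):
        return len(self._free[cpu])


allocator = PerCpuAllocator(num_cpus=2, blocks_per_cpu=2)
block = allocator.alloc(cpu=1)
remaining = allocator.free_count(1)   # one block left in cpu 1's pool
allocator.free(block)
```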
Synchronization
- Per-object locks keep cache consistency traffic low
- Optimized for uncontended locks
- Spin-then-block locks (2 bits; 20 instructions)
- Garbage collection provides existence guarantees, so locks are not
needed just to protect an object's existence.
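
The spin-then-block policy can be sketched as follows. This is an assumption-laden illustration, not Tornado's lock: the real one packs its state into 2 bits and is hand-tuned, whereas this version wraps Python's `threading.Lock`, and the `SpinThenBlockLock` name and `spin_limit` parameter are made up for the example.

```python
import threading


class SpinThenBlockLock:
    """Try a bounded number of non-blocking acquires, then block."""

    def __init__(self, spin_limit=100):
        self._lock = threading.Lock()
        self._spin_limit = spin_limit

    def acquire(self):
        # Fast path: spin briefly, betting the lock is free or about to be.
        for _ in range(self._spin_limit):
            if self._lock.acquire(blocking=False):
                return "spun"
        # Slow path: give up the processor and wait for the lock.
        self._lock.acquire()
        return "blocked"

    def release(self):
        self._lock.release()


lock = SpinThenBlockLock()
how = lock.acquire()   # uncontended, so the first non-blocking try succeeds
lock.release()
```

Returning which path was taken is only there to make the behavior observable. The design bet matches the notes above: if locks are per-object and rarely contended, almost every acquire finishes on the fast path.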
IPC
- Client server communication done by protected procedure call (PPC)
- Essentially the thread moves from client to server and back
- Cross-process PPC looks a lot like RPC
Experimental results
- No detail on exactly what the various benchmarks are (gets better in
section 7.2).
- I really like 7.3 -- it's a nice high-level summary of the results.