The Multikernel: A new OS architecture for scalable multicore systems

Baumann, Barham, Dagand, Harris, Isaacs, Peter, Roscoe, Schüpbach, Singhania (2009)

What kind of paper is this?

Classic example of a "hardware changes, so software must change" paper.

The Problem

Multicore is here.
Multicore is in general purpose processors and systems that cannot be tuned for specific platforms.
Applications on these platforms are not predictable.

The Approach

Make all inter-core communication explicit (messages not shared memory).
Make OS structure hardware neutral.
View state as replicated not shared.
Draw on distributed systems more than SMP (shared-memory multiprocessor) systems.

Main Motivations

Systems are increasingly diverse -- different implementations introduce different trade-offs.
Cores are increasingly diverse -- machines and even chips will have different cores for different purposes.
- Current OS design treats all the processors it manages as general purpose and the others as peripheral.
- General purpose cores share an OS.
- However, if you have different ISAs on your general purpose cores, you can't share OSs.
Interconnects matter -- both intercore and intracore.
Messages cost less than shared memory -- cache coherence kills shared memory performance; thus messages where single servers respond to read/write requests do better.
May have to give up on cache coherence! -- claim that cache coherence limits scalability to 80 cores.
Messages are getting easier -- since OSs already respond to messages (i.e., interrupts), message passing is a natural model in an OS.

Hardware is changing more quickly than software.

The Multikernel

Run a kernel on each core and use no shared state.
The idea that state is replicated rather than shared is, in my opinion, the really big idea here.
Potential problems
- Hard to build: but OS programmers are smart.
- May sacrifice certain hardware-specific optimizations (e.g., L2 cache sharing).
- Replica consistency is a pain in the neck -- some operations require agreement; others don't; have to make decisions on a case by case basis.
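To make "replicated, not shared" concrete, here is a minimal sketch in Python. It is my own illustration, not Barrelfish code: the names CoreKernel, local_update, handle_messages, and make_system are hypothetical, and it deliberately leaves out any agreement protocol, so concurrent conflicting updates are not ordered. Each per-core kernel owns a private replica and only learns about changes through explicit messages.

```python
from collections import deque


class CoreKernel:
    """One per-core kernel instance holding a private replica of OS state."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.replica = {}      # private copy; other cores never touch it
        self.inbox = deque()   # explicit messages from other cores
        self.peers = []        # the other CoreKernel instances

    def local_update(self, key, value):
        """Apply an update locally, then message every other replica."""
        self.replica[key] = value
        for peer in self.peers:
            peer.inbox.append(("update", key, value))

    def handle_messages(self):
        """Drain the inbox and apply each update to the local replica."""
        while self.inbox:
            kind, key, value = self.inbox.popleft()
            if kind == "update":
                self.replica[key] = value


def make_system(num_cores):
    """Build num_cores kernels that know about each other only as peers."""
    kernels = [CoreKernel(i) for i in range(num_cores)]
    for k in kernels:
        k.peers = [p for p in kernels if p is not k]
    return kernels
```

Note that between local_update on one core and handle_messages on another, the replicas disagree. Deciding which operations can tolerate that window and which need agreement is exactly the case-by-case pain described above.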
Barrelfish goals
- Comparable performance to existing commodity OS
- Evidence of scalability
- Efficient message-passing
- Demonstrate ability to map to different hardware

A Paper Gem

"While Barrelfish is a point in the multikernel design space, it is not the only way to build a multikernel. In this section we describe our implementation, and note which choices in the design are derived from the model and which are motivated for other reasons, such as local performance, ease of engineering, policy freedom, etc."

Implementation

Vaguely microkernel: kernel mode CPU driver does scheduling, communication, resource allocation. User mode monitor does device drivers, network stacks, memory allocators.
CPU Driver
- Event driven
- Single threaded
- Non-preemptible
- Captures machine-specific part of design.
- Lightweight split-phase (asynch) IPC (uses shared memory for more complex channels).
User level monitor
- Coordinate system wide state.
- Encapsulate mechanism and policy
- Global structures use agreement protocol.
- Monitors ideal for power management
Processes implemented as a collection of dispatchers.
- Dispatchers handle inter-core communication on behalf of processes.
- Multi-threaded processes implemented by user-level thread scheduler in the dispatcher.
Current transport is shared memory channel (URPC); messages are cache-line-sized.
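A rough sketch of what a cache-line-sized message channel could look like, in Python for readability. The 64-byte line size, the trailing 8-byte sequence word, and the names Channel, send, and poll are my assumptions, not the paper's layout. The point is that the sender writes the payload first and the sequence word last, and the receiver polls only that last word.

```python
import struct

CACHE_LINE = 64            # assumed line size in bytes
PAYLOAD = CACHE_LINE - 8   # assumed layout: last 8 bytes hold a sequence word


class Channel:
    """Single-writer, single-reader ring of cache-line-sized slots."""

    def __init__(self, slots=4):
        self.slots = slots
        self.buf = bytearray(CACHE_LINE * slots)  # stands in for shared memory
        self.send_seq = 1   # sequence 0 means "slot never written"
        self.recv_seq = 1

    def send(self, payload):
        """Write one message; return False if the ring is full."""
        if len(payload) > PAYLOAD:
            raise ValueError("message does not fit in one cache line")
        # Simplification: a real sender would learn about free slots from
        # acknowledgements, not by reading the receiver's counter.
        if self.send_seq - self.recv_seq >= self.slots:
            return False
        off = ((self.send_seq - 1) % self.slots) * CACHE_LINE
        self.buf[off:off + PAYLOAD] = payload.ljust(PAYLOAD, b"\0")
        # The sequence word is written last so the receiver never sees
        # a half-written payload.
        struct.pack_into("<Q", self.buf, off + PAYLOAD, self.send_seq)
        self.send_seq += 1
        return True

    def poll(self):
        """Return the next message's payload, or None if nothing has arrived."""
        off = ((self.recv_seq - 1) % self.slots) * CACHE_LINE
        (seq,) = struct.unpack_from("<Q", self.buf, off + PAYLOAD)
        if seq != self.recv_seq:
            return None
        data = bytes(self.buf[off:off + PAYLOAD])
        self.recv_seq += 1
        return data
```

Python gives none of the memory-ordering or cache behavior that makes the real thing fast, so this only illustrates the slot layout and the polling discipline.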
Otherwise messaging is pretty standard:
- Stub generators produce marshalling code.
- Lookup service finds services.
- Monitors set up channels.
Capability based system for managing memory.
- All memory management handled through system calls.
- System calls manipulate capabilities.
- Capabilities are user-level references to kernel objects or physical memory.
- Page tables manipulated by user level code.
- Nice discussion of pros and cons.
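The capability idea can be sketched as a toy model. Everything here (Kernel, boot_ram, sys_retype, sys_map, the "ram"/"frame"/"page_table" type names) is hypothetical and far simpler than any real capability system. User code holds only opaque handles; the kernel keeps the real records and checks the type of every handle passed to a system call, which is what lets user-level code drive page-table updates safely.

```python
class CapError(Exception):
    """Raised when a call is given a missing or wrong-typed capability."""


class Kernel:
    def __init__(self):
        self._caps = {}   # handle -> capability record, kernel-private
        self._next = 1

    def _new(self, ctype, base, size):
        handle = self._next
        self._next += 1
        self._caps[handle] = {"type": ctype, "base": base,
                              "size": size, "entries": {}}
        return handle

    def _lookup(self, handle, ctype):
        cap = self._caps.get(handle)
        if cap is None or cap["type"] != ctype:
            raise CapError(f"handle {handle} is not a valid {ctype} capability")
        return cap

    def boot_ram(self, base, size):
        """Hand out an untyped RAM capability (stands in for boot-time setup)."""
        return self._new("ram", base, size)

    def sys_retype(self, handle, new_type):
        """Turn a RAM capability into a frame or page table, consuming it."""
        if new_type not in ("frame", "page_table"):
            raise CapError(f"cannot retype to {new_type}")
        cap = self._lookup(handle, "ram")
        del self._caps[handle]   # the same RAM cannot be retyped twice
        return self._new(new_type, cap["base"], cap["size"])

    def sys_map(self, pt_handle, frame_handle, vaddr):
        """Install a mapping; only succeeds with correctly typed capabilities."""
        table = self._lookup(pt_handle, "page_table")
        frame = self._lookup(frame_handle, "frame")
        table["entries"][vaddr] = frame["base"]

    def mappings(self, pt_handle):
        """Return a copy of the mappings installed in a page table."""
        return dict(self._lookup(pt_handle, "page_table")["entries"])
```

The policy (what to map where) lives in user code; the kernel only enforces that the handles are valid and of the right type.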
System Knowledge Base: Managing the HW
- Interrogates hardware to figure out configuration: hardware discovery, online measurement, and pre-asserted facts.
- Representation is first order logic
- Express optimization queries over this collection of data.
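As a rough illustration of "facts plus queries", here is a Python stand-in. The real knowledge base is built on a constraint logic programming system, not Python, and the fact names (cache_of, numa_hops), the cache labels, and the helper functions below are made up for this sketch.

```python
from collections import defaultdict

# Facts as plain tuples: (predicate, arg1, arg2, ...). Values are invented.
FACTS = [
    ("cache_of", 0, "L3a"),
    ("cache_of", 1, "L3a"),
    ("cache_of", 2, "L3b"),
    ("cache_of", 3, "L3b"),
    ("numa_hops", "L3a", "L3b", 1),
]


def query(facts, *pattern):
    """Return the facts that match the pattern; None acts as a wildcard."""
    return [
        f for f in facts
        if len(f) == len(pattern)
        and all(p is None or p == v for p, v in zip(pattern, f))
    ]


def cores_sharing_cache(facts):
    """Derive groups of cores that share a cache from the cache_of facts."""
    groups = defaultdict(list)
    for _, core, cache in query(facts, "cache_of", None, None):
        groups[cache].append(core)
    return dict(groups)
```

A derived grouping like this is the kind of input the hardware-dependent TLB shootdown in the evaluation needs.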

Experience

CPU Driver/Monitor division is great from a software engineering perspective, but not so great from a performance perspective.
Current network stack is "suboptimal."

Evaluation

Goal of the eval is to evaluate against the goals outlined in the intro.
TLB Shootdown Case Study
- Standard approach: IPI
  - Originating core sends IPI
  - Receiving cores ack (write to shared variable)
  - Receiving cores then invalidate TLB
  - Originating core continues after receiving all the acks.
- Barrelfish naive: Local monitor broadcasts invalidate to all the other monitors and waits.
- Barrelfish better: Send unicast to each core to avoid expensive cache coherence traffic.
- Barrelfish HW-dependent: Take advantage of core configuration to construct a 2-level multicast tree -- multicast to each set of cores sharing a cache and let them share the info.
- Barrelfish NUMA: multicast to farthest places first.
- Overall effect: Relatively stable performance between 2 and 32 cores; almost 2X Linux at 2 cores, but crossover is around 11 cores (5 for Windows).
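The HW-dependent and NUMA variants can be sketched as a planning step. This is my own reconstruction of the idea in Python, not the paper's code: build_shootdown_plan, the group names, and the hop counts are all hypothetical. The originator picks one leader per cache-sharing group, messages the farthest groups first, and each leader forwards to the rest of its group.

```python
def build_shootdown_plan(origin, groups, hops):
    """Plan a two-level multicast for a TLB invalidation.

    groups: cache-sharing group name -> list of core ids
    hops:   group name -> distance from the originating core's group
    Returns [(leader, followers), ...] in send order, farthest first.
    """
    plan = []
    for name, cores in groups.items():
        # The originator leads its own group; other groups use their
        # lowest-numbered core as the leader (an arbitrary choice).
        leader = origin if origin in cores else min(cores)
        followers = [c for c in sorted(cores) if c != leader]
        plan.append((hops.get(name, 0), leader, followers))
    # Farthest groups first, so the longest-latency messages start earliest.
    plan.sort(key=lambda entry: -entry[0])
    return [(leader, followers) for _, leader, followers in plan]


def originator_messages(origin, plan):
    """Count messages the originating core itself must send."""
    remote_leaders = sum(1 for leader, _ in plan if leader != origin)
    local_followers = sum(len(f) for leader, f in plan if leader == origin)
    return remote_leaders + local_followers
```

With three groups of four cores, the originator sends 5 messages (2 remote leaders plus 3 local peers) instead of the 11 a per-core unicast would need, which is the intuition behind the flat scaling curve.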
Frustratingly little explanation of compute bound jobs -- the openMP results are fascinating -- I would really have loved an explanation. Similarly if the difference is the kernel versus user-level thread scheduler, I would have expected Barrelfish to do better, so was also curious about integer sort results.
The network test could have saved a lot of discussion by simply saying that Barrelfish and Linux performed almost identically, both saturating the card (951.7 versus 951 - of course the fact that they didn't make it 951.0 bugged the heck out of me).
Web server: where is core 0 relative to cores 2 and 3? Very nice results (more than 2X Linux).
Database: How many queries/second did Linux serve?
Hey -- they are actually honest, "An enormous investment has been made in optimizing Linux and Windows for current hardware, and conversely our system is inevitably more lightweight (it is new, and less complete). Instead, they should be read as indication that Barrelfish performs reasonably on contemporary hardware"
Other conclusions:
- Successfully demonstrate scaling
- Porting is relatively easy
- But, they admit no complex applications and haven't scaled too far