The Multikernel: A new OS architecture for scalable multicore systems
Baumann, Barham, Dagand, Harris, Isaacs, Peter, Roscoe, Schüpbach, Singhania (2009)
What kind of paper is this?
- Classic example of a: "Hardware changes so software must change" paper.
The Problem
- Multicore is here.
- Multicore is in general purpose processors and systems that cannot be tuned
for specific platforms.
- Applications on these platforms are not predictable.
The Approach
- Make all inter-core communication explicit (messages not shared memory).
- Make OS structure hardware neutral.
- View state as replicated not shared.
- Draw on distributed systems more than shared-memory multiprocessor (SMP) systems.
Main Motivations
- Systems are increasingly diverse -- different implementations introduce
different trade-offs.
- Cores are increasingly diverse -- machines and even chips will have
different cores for different purposes.
- Current OS design treats all the processors it manages as general
purpose and the others as peripheral.
- General purpose cores share an OS.
- However, if you have different ISAs on your general purpose cores, you
can't share OSs.
- Interconnects matter -- both intercore and intracore.
- Messages cost less than shared memory -- cache coherence kills
shared memory performance; thus messages where single servers respond
to read/write requests do better.
- May have to give up on cache coherence! -- claim that cache coherence
limits scalability to 80 cores.
- Messages are getting easier -- since OSs already respond to messages
(i.e., interrupts), message passing is a natural model in an OS.
Hardware is changing more quickly than software.
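
To make the "single server responds to read/write requests" idea concrete, here is a minimal sketch, assuming Python threads stand in for cores and `queue.Queue` stands in for the interconnect. The names `server`, `rpc`, and `run_demo` are made up for illustration; nothing here comes from the paper, and Python cannot show the cache-coherence cost itself, only the structure.

```python
import queue
import threading

def server(requests):
    """Owns the counter; no other thread ever touches it directly."""
    state = {"counter": 0}
    while True:
        op, arg, reply = requests.get()
        if op == "stop":
            break
        if op == "write":
            state["counter"] += arg
            reply.put("ok")
        elif op == "read":
            reply.put(state["counter"])

def rpc(requests, op, arg=None):
    """Send one request message and block until the reply arrives."""
    reply = queue.Queue(maxsize=1)
    requests.put((op, arg, reply))
    return reply.get()

def client(requests, n):
    """A 'core' that updates the counter only by sending messages."""
    for _ in range(n):
        rpc(requests, "write", 1)

def run_demo(num_clients=4, writes_each=100):
    requests = queue.Queue()
    srv = threading.Thread(target=server, args=(requests,))
    srv.start()
    clients = [threading.Thread(target=client, args=(requests, writes_each))
               for _ in range(num_clients)]
    for c in clients:
        c.start()
    for c in clients:
        c.join()
    total = rpc(requests, "read")
    requests.put(("stop", None, None))
    srv.join()
    return total
```

No lock appears anywhere: the state has exactly one owner, so clients can never race on it, which is the structural point the bullets above are making.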
The Multikernel
- Run a kernel on each core and use no shared state.
- The idea that state is replicated rather than shared is, in my
opinion, the really big idea here.
- Potential problems
- Hard to build: but OS programmers are smart.
- May sacrifice certain hardware-specific optimizations (e.g., L2 cache
sharing).
- Replica consistency is a pain in the neck -- some operations require
agreement; others don't; have to make decisions on a case by case basis.
- Barrelfish goals
- Comparable performance to existing commodity OS
- Evidence of scalability
- Efficient message-passing
- Demonstrate ability to map to different hardware
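
The "replicated, not shared" idea and the "some operations require agreement" caveat can be sketched as follows. This is a toy under stated assumptions: each `Replica` stands in for one core's private copy of OS state, and a simple two-phase commit is used as the agreement step purely for illustration. The class and function names are hypothetical.

```python
class Replica:
    """One core's private copy of some OS state."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.state = {}      # committed key -> value
        self.pending = {}    # key -> (txn_id, proposed value)

    def read(self, key):
        # Reads are purely local: no messages, no agreement.
        return self.state.get(key)

    def prepare(self, txn_id, key, value):
        # Vote no if another update to this key is already in flight.
        if key in self.pending:
            return False
        self.pending[key] = (txn_id, value)
        return True

    def commit(self, txn_id, key):
        owner, value = self.pending.pop(key)
        assert owner == txn_id
        self.state[key] = value

    def abort(self, txn_id, key):
        # Only discard a proposal that this transaction made.
        if key in self.pending and self.pending[key][0] == txn_id:
            del self.pending[key]

def agree_and_update(replicas, txn_id, key, value):
    """Phase 1: collect votes. Phase 2: commit everywhere or abort everywhere."""
    votes = [r.prepare(txn_id, key, value) for r in replicas]
    if all(votes):
        for r in replicas:
            r.commit(txn_id, key)
        return True
    for r in replicas:
        r.abort(txn_id, key)
    return False
```

Reads never pay for agreement; only updates that must be globally consistent do, which is why the decision has to be made case by case.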
A Paper Gem
"While Barrelfish is a point in the multikernel design
space, it is not the only way to build a multikernel. In
this section we describe our implementation, and note
which choices in the design are derived from the model
and which are motivated for other reasons, such as local
performance, ease of engineering, policy freedom, etc."
Implementation
- Vaguely microkernel: kernel mode CPU driver does
scheduling, communication, resource allocation. User mode monitor does
device drivers, network stacks, memory allocators.
- CPU Driver
- Event driven
- Single threaded
- Non-preemptible
- Captures machine-specific part of design.
- Lightweight split-phase (asynch) IPC (uses shared memory for
more complex channels).
- User level monitor
- Coordinate system wide state.
- Encapsulate mechanism and policy
- Global structures use agreement protocol.
- Monitors ideal for power management
- Processes implemented as a collection of dispatchers.
- Dispatchers handle inter-core communication on behalf of processes.
- Multi-threaded processes implemented by user-level thread scheduler
in the dispatcher.
- Current transport is shared memory channel (URPC); messages
are cache-line-sized.
- Otherwise messaging is pretty standard:
- Stub generators produce marshalling code.
- Lookup service finds services.
- Monitors set up channels.
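
A rough sketch of what a cache-line-sized shared-memory channel might look like. Everything specific here is an assumption for illustration: the 64-byte line size, the slot layout (payload followed by a 4-byte sequence number written last), and the `Channel` class itself. A `bytearray` stands in for the shared region, and the sender peeks at the receiver's counter for flow control, where a real channel would need an explicit acknowledgement.

```python
import struct

CACHE_LINE = 64              # assumed cache-line size in bytes
PAYLOAD = CACHE_LINE - 4     # last 4 bytes of each slot hold a sequence number

class Channel:
    """Single-sender, single-receiver ring of cache-line-sized slots."""

    def __init__(self, slots=8):
        self.slots = slots
        self.buf = bytearray(slots * CACHE_LINE)   # stands in for shared memory
        self.send_seq = 1    # sequence numbers start at 1; 0 means "empty"
        self.recv_seq = 1

    def send(self, payload):
        if len(payload) > PAYLOAD:
            raise ValueError("message must fit in one cache line")
        if self.send_seq - self.recv_seq >= self.slots:
            return False     # ring is full; caller retries later
        off = ((self.send_seq - 1) % self.slots) * CACHE_LINE
        self.buf[off:off + PAYLOAD] = payload.ljust(PAYLOAD, b"\0")
        # The sequence number is written last so that a polling receiver
        # never sees a half-written message.
        struct.pack_into("<I", self.buf, off + PAYLOAD, self.send_seq)
        self.send_seq += 1
        return True

    def poll(self):
        """Return the next message, or None if it has not arrived yet."""
        off = ((self.recv_seq - 1) % self.slots) * CACHE_LINE
        seq = struct.unpack_from("<I", self.buf, off + PAYLOAD)[0]
        if seq != self.recv_seq:
            return None
        data = bytes(self.buf[off:off + PAYLOAD])
        self.recv_seq += 1
        return data
```

The receiver learns about a message by polling the last word of the slot, so one message moves exactly one line between caches in this model.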
- Capability based system for managing memory.
- All memory management handled through system calls.
- System calls manipulate capabilities.
- Capabilities are user-level references to kernel objects or physical
memory.
- Page tables manipulated by user level code.
- Nice discussion of pros and cons.
- System Knowledge Base: Managing the HW
- Interrogates hardware to figure out configuration: hardware
discovery, online measurement, and pre-asserted facts.
- Representation is first order logic
- Express optimization queries over this collection of data.
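
As a rough picture of "facts plus optimization queries", here is a tiny fact store in Python. It is only a sketch: the predicate names (`core`, `latency`), the latency numbers, and the `KnowledgeBase` class are all invented for illustration and do not reflect the actual representation Barrelfish uses.

```python
class KnowledgeBase:
    """Facts are tuples like ("latency", src, dst, cycles)."""

    def __init__(self):
        self.facts = set()

    def assert_fact(self, *fact):
        self.facts.add(fact)

    def query(self, *pattern):
        """Return all facts matching the pattern; None matches anything."""
        return sorted(
            f for f in self.facts
            if len(f) == len(pattern)
            and all(p is None or p == v for p, v in zip(pattern, f))
        )

def nearest_core(kb, src):
    """Optimization query: the core with the lowest measured latency from src."""
    measured = kb.query("latency", src, None, None)
    return min(measured, key=lambda f: f[3])[2]

kb = KnowledgeBase()
# Discovered topology (made-up values).
kb.assert_fact("core", 0, "pkg0")
kb.assert_fact("core", 1, "pkg0")
kb.assert_fact("core", 2, "pkg1")
# Online measurements in cycles (made-up values).
kb.assert_fact("latency", 0, 1, 40)
kb.assert_fact("latency", 0, 2, 120)
```

With these facts, `kb.query("core", None, "pkg0")` lists the cores in one package and `nearest_core(kb, 0)` picks the cheapest neighbour of core 0.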
Experience
- CPU Driver/Monitor division is great from a software engineering
perspective, but not so great from a performance perspective.
- Current network stack is "suboptimal."
Evaluation
- Goal of eval is to evaluate against the goals outlined in the intro.
- TLB Shootdown Case Study
- Standard approach: IPI
- Originating core sends IPI
- Receiving cores ack (write to shared variable)
- Receiving cores then invalidate TLB
- Originating core continues after receiving all the acks.
- Barrelfish naive: Local monitor broadcasts invalidate to all the
other monitors and waits.
- Barrelfish better: Send unicast to each core to avoid expensive cache
coherence traffic.
- Barrelfish HW-dependent: Take advantage of core configuration to
construct a 2-level multicast tree -- multicast to each set of
cores sharing a cache and let them share the info.
- Barrelfish NUMA: multicast to farthest places first.
- Overall effect: Relatively stable performance between 2 and 32 cores;
almost 2X Linux at 2 cores, but crossover is around 11 cores (5 for Windows).
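
The HW-dependent and NUMA variants above can be sketched as a send plan. This is a minimal sketch, assuming the topology is given as groups of cores that share a cache plus a distance from the originating core to each group. The function names and the data layout are hypothetical.

```python
def multicast_plan(origin, groups, distance):
    """Build a two-level multicast plan for a TLB invalidation.

    groups:   dict group_id -> list of cores sharing a cache
    distance: dict group_id -> distance from the origin's group
    Returns (first_level, second_level): the leaders the origin messages,
    farthest first, and a dict mapping each forwarding core to the cores
    it forwards to.
    """
    first_level = []
    second_level = {}
    # Farthest groups first, so the slowest messages are in flight earliest.
    for gid in sorted(groups, key=lambda g: distance[g], reverse=True):
        members = [c for c in groups[gid] if c != origin]
        if not members:
            continue
        if origin in groups[gid]:
            # The origin acts as the leader of its own group.
            second_level[origin] = members
        else:
            leader = members[0]
            first_level.append(leader)
            second_level[leader] = members[1:]
    return first_level, second_level

def reached(first_level, second_level):
    """All cores that receive the invalidation, sorted."""
    cores = list(first_level)
    for targets in second_level.values():
        cores.extend(targets)
    return sorted(cores)
```

For three groups of four cores, the origin sends two cross-group messages plus three local ones instead of eleven unicasts, and each leader fans out within its own shared cache.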
- Frustratingly little explanation of compute bound jobs -- the OpenMP
results are fascinating -- I would really have loved an explanation.
Similarly if the difference is the kernel versus user-level thread scheduler,
I would have expected Barrelfish to do better, so was also curious about
integer sort results.
- The network test could have saved a lot of discussion by simply saying
that Barrelfish and Linux performed almost identically, both saturating the
card (951.7 versus 951 - of course the fact that they didn't make it 951.0
bugged the heck out of me).
- Web server: where is core 0 relative to cores 2 and 3? Very nice
results (more than 2X Linux).
- Database: How many queries/second did Linux serve?
- Hey -- they are actually honest, "An enormous investment has
been made in optimizing Linux and Windows for current hardware, and
conversely our system is inevitably more lightweight (it is new,
and less complete). Instead, they should be read as indication that
Barrelfish performs reasonably on contemporary hardware"
- Other conclusions:
- Successfully demonstrate scaling
- Porting is relatively easy
- But, they admit no complex applications and haven't scaled too far