Disco: Running Commodity Operating Systems On Scalable Multiprocessors
Bugnion, Devine, Rosenblum (1997)
What kind of paper?
- New application of old technology
- History repeats itself
- A new twist on the old idea
- Deeply, deeply technical paper -- it's a real slog to read and truly
understand everything they are doing. But it's good technical content, so
worth the effort.
What was the problem they were trying to solve?
- A quick-and-dirty way to implement a multi-processor operating system
- Developing multi-processor operating systems is hard
- Disco "reduces the gap between hardware innovation and the adaptation of system software"
- What do we think of the following, "Our experience with realistic
workloads on a detailed simulator of the FLASH machine show that Disco
achieves its goals."
What are the new ideas?
- Run multiple single-processor operating systems over Disco VMM
- Use distributed system facilities to provide a single-system image to the user
- Eliminates inefficiencies: allow transparent buffer cache sharing among virtual machines
- This was binary-compatible virtualization: OSes were compiled for the same hardware that Disco virtualized.
- Use page placement and dynamic page migration to hide non-uniformity of memory access
Interfaces (between the VMM and the guest OSes)
- Processors: MIPS R10000 (not fully virtualizable)
- Physical Memory: Flat address space starting at 0; Disco hides the
NUMA-ness.
- I/O Devices: disks with different access control.
System implementation
- Disco itself is a multi-threaded shared memory program.
- Multiple OSes run over Disco. They don't all have to be the same OS.
- Single-system image is accomplished by configuring the systems as a cluster.
- Machine resources are managed by the VMM and are dynamically allocated among virtual machines.
- A virtual machine is the unit of scalability and the unit of fault containment: contains both software and hardware faults.
- Code base is very small (13,000 lines, 72KB); the code is replicated into each node's local memory.
- Machine-wide data structures are partitioned, such that they are located on the processor where they are likely to be accessed more often.
- Wait-free synchronization is used to improve scalability.
- Inter-VM communication is done through shared memory.
- Devices are virtualized: all operations on devices are intercepted and emulated
- Non-privileged instructions are run directly on hardware, privileged instructions are emulated.
- Memory pages are migrated and replicated to ensure better locality. They use FLASH hardware counters to find out whether a page should be migrated or replicated.
- Memory and disk are transparently shared. Block cache is shared.
- Copy-on-write disks
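The counter-driven migration/replication policy above can be sketched as a simple decision rule. This is my illustration, not the paper's actual policy: the class name, thresholds, and the 90%-dominance test are all assumptions.

```python
# Hypothetical sketch of a Disco-style page migration/replication policy,
# driven by per-node access counters like FLASH's cache-miss counters.
# Thresholds and structure are illustrative, not from the paper.
from collections import defaultdict

HOT_THRESHOLD = 100   # below this, the page isn't worth moving

class PagePolicy:
    def __init__(self):
        # counts[page][node] = cache misses to this page from that node
        self.counts = defaultdict(lambda: defaultdict(int))
        self.writable = {}  # page -> bool

    def record_miss(self, page, node):
        self.counts[page][node] += 1

    def decide(self, page, home_node):
        by_node = self.counts[page]
        total = sum(by_node.values())
        if total < HOT_THRESHOLD:
            return "leave"
        hottest, hits = max(by_node.items(), key=lambda kv: kv[1])
        if hits > 0.9 * total and hottest != home_node:
            return "migrate"    # dominated by one remote node: move it there
        if not self.writable.get(page, True):
            return "replicate"  # hot, read-shared: give each node a copy
        return "leave"          # hot and write-shared: moving it just ping-pongs
```

A page that is written by many nodes stays put, since neither migration nor replication helps with write sharing.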
Implementation Details
Processors
- Each virtual CPU is kind of like a Disco process.
Disco maintains a data structure for each virtual CPU holding the processor
state (privileged registers, TLB contents, etc.).
- Disco runs in kernel mode
- VMs run in supervisor mode when the OS is running, but in
user mode when an application is running.
- Supervisor mode can access protected parts of address space, but cannot
use privileged instructions or access physical memory directly.
- On a trap, Disco emulates the operations of the virtual processor.
Memory
- Two layers of mapping: virtual->physical, physical->machine
- TLB maps across both: virtual->machine
- The kernel typically runs unmapped (in KSEG0) on MIPS; the OS is relinked
to run in a mapped region so that this translation can happen.
- Implements a second-level software TLB to mitigate two costs: the OS now
consumes more TLB entries, and switching between virtual CPUs requires
TLB flushes.
- Pages are migrated and replicated to reduce cache misses.
- Hardware counters provide data to help Disco decide which pages to
migrate and which to replicate.
I/O
- Disco intercepts all device accesses.
- Disco devices have a clean interface, with everything packed into a
single trap.
- Shared devices handled directly by Disco device drivers.
- Devices used by only a single VM aren't virtualized.
Copy-on-write disks
- When possible, handle I/O via remapping (when the page is already in
memory).
- Multiple VMs can end up sharing memory (uses COW to deal with updates).
- Huge win for things like executable directories and the code parts of the
OS, etc.
- Also works great for things like NFS when combined with the network
interface optimizations described below.
Network Interface
- Communication without replicating data.
- Another application of COW -- don't copy data between VMs, just
remap it.
IRIX changes
- Move kernel out of KSEG0 into a mapped segment.
- New device drivers that use Disco's DMA-based driver interface
- Move frequently trapped register accesses into a special page mapped only
into the OS, so that the OS doesn't have to trap on every such update.
- Change to mbufs to avoid COW traps.
- Specialized bcopy to mmap if possible.
Performance results:
- Ran on a simulator, simulating a slightly different processor (for
performance reasons).
- Minimizing the overhead required modifications to the guest operating systems.
- Main sources of performance overhead: TLB reloading for mostly user-level scientific workloads; high TLB fault rates for database workloads with unpredictable access patterns; instruction-emulation overhead for pmake.
- They show better scalability with Disco as compared to IRIX, a commercial SMP operating system. Is this a fundamental property of VMs? Or can IRIX be fixed by using better synchronization primitives?
Questions:
- Specific:
- What are the traditional problems with virtual machines?
(Virtualization overhead, resource management - don't know when a
resource is no longer in use and can be taken away from a VM,
communication and sharing - old virtual machines could not communicate:
a user could not start two virtual machines that accessed files on
the same disk).
- In the section describing VM overheads, they talk about
emulating privileged instructions. What are they talking about?
- The MIPS R10000 is not fully virtualizable. What does this
mean? How did they solve this problem? Could they have
solved it differently?
- List the performance optimizations that helped Disco's
scalability. (Software TLB cache, shared-memory communication for VMs,
replicated code, partitioned data structures, wait-free synchronization,
page migration and replication, transparently shared memory and
buffer cache.)
- Why did they have to re-map the operating system code?
- Where does performance overhead come from? (Emulation of
privileged instructions and I/O. Increased # of TLB misses - they
don't use ASIDs and flush the TLB on every context switch, plus
they remap the kernel into the mapped memory region).
- How are devices implemented? (One virtual device driver for
each device type.)
- They claim that the changes that they applied by hand to IRIX
could be done automatically. Do you believe them?
- Performance: what are the sources of performance overhead for
applications? How do they differ depending on the application? Are
these performance problems fundamental to virtual machines, or can
they be fixed with a better implementation?
- Were you convinced by their scalability experiments? By the page
migration and replication experiments?
- General:
- Isn't this just an exo-kernel? What are the similarities, what are the differences?
- Isn't this just like Mach running Unix servers? What are the similarities, what are the differences?
- Why didn't the world adopt this idea? Why are people building SMP operating systems? (One reason is that it is difficult to run parallel applications. Disco had to provide special support for them: memory regions shared across multiple virtual machines, so that a parallel application running on different virtual machines can share memory through these segments.)
- Why is this a good idea? Why is this a not-so-good idea?
- What did you like about this paper? What are the good ideas you would use in future system designs?
- Throughout history we saw VMs being used as a "quick-and-simple" solution to complicated problems: IBM's CP/CMS used for time-sharing, Disco for running over NUMA hardware, VMware Workstation for running Microsoft Office apps by Unix geeks. Do VMs have a place of their own, or do they simply serve as placeholders until better technology comes along?
History of Virtual Machines
- Started at IBM in the 1960s
- Were used for time-sharing to multiplex expensive hardware
- Following development of Multics, IBM hurried to announce plans to build TSS, its time-sharing system
- Multics and TSS were late
- But IBM released a system, CP/CMS (CP, the "Control Program", was the virtual machine monitor; CMS was a single-user operating system). This was their quick-and-dirty way to implement time-sharing.
- CP/CMS was the precursor of IBM's VM/370; its descendant, z/VM, is still in use today on IBM mainframes.
What are VMs used for?
- Time sharing (1960s)
- Operating system debugging when hardware is expensive
- Running Windows office software
- Security: honeypots
- Multi-platform OS development (Solaris): develop OS for virtual hardware platform. Run on top of hypervisor. This simplifies development and reduces the size of code tree.
How is virtualization done?
- Types of virtualization:
- Full system simulation or translation (slow, useful for research)
- Binary-compatible: non-privileged instructions run directly on hardware, privileged code traps into the VMM
- Definition of virtualizability (by Popek and Goldberg, 1974): for efficiency, most instructions execute natively; "sensitive" instructions, those that access processor state (status registers, TLB, I/O), must trap and be emulated.
- Hardware architecture matters
- Some hardware is more difficult to virtualize
- Some x86 instructions that read or modify privileged state do not trap when executed in user mode
- Such instructions must be substituted with code that traps into the VMM
- There is heavy use of self-modifying code in the x86 world, so you have to be careful
What is a NUMA?
- Non-uniform memory access
- Some memory is closer (faster to access) than other memory
- ccNUMA is cache-coherent NUMA