Disco: Running Commodity Operating Systems On Scalable Multiprocessors
Bugnion, Devine, Rosenblum (1997)
What kind of paper?
- New application of old technology
- History repeats itself
- A new twist on the old idea
- Deeply, deeply technical paper -- it's a real slog to read and truly
understand everything they are doing. But it's good technical content, so
worth the effort.
What was the problem they were trying to solve?
- A quick-and-dirty way to implement a multi-processor operating system
- Developing multi-processor operating systems is hard
- Disco "reduces the gap between hardware innovation and the adaptation of system software"
- What do we think of the following, "Our experience with realistic
workloads on a detailed simulator of the FLASH machine show that Disco
achieves its goals."
What are the new ideas?
- Run multiple single-processor operating systems over Disco VMM
- Use distributed system facilities to provide a single-system image to the user
- Eliminates inefficiencies: allow transparent buffer cache sharing among virtual machines
- This was binary-compatible virtualization: OSes were compiled for the same hardware that Disco virtualized.
- Use page placement and dynamic page migration to hide non-uniformity of memory access
Interfaces (between the VMM and the guest OSes)
- Processors: MIPS R10000 (not fully virtualizable)
- Physical Memory: Flat address space starting at 0; Disco hides the
NUMA-ness.
- I/O Devices: disks with different access control.
System implementation
- Disco itself is a multi-threaded shared memory program.
- Multiple OSes run over Disco. They don't all have to be the same OS.
- Single-system image is accomplished by configuring the systems as a cluster.
- Machine resources are managed by the VMM and are dynamically allocated among virtual machines.
- A virtual machine is the unit of scalability and the unit of fault containment: contains both software and hardware faults.
- Code base is very small (13,000 lines, 72KB); the code is replicated into each node's local memory.
- Machine-wide data structures are partitioned, such that they are located on the processor where they are likely to be accessed more often.
- Wait-free synchronization is used to improve scalability.
- Inter-VM communication is done through shared memory.
- Devices are virtualized: all operations on devices are intercepted and emulated
- Non-privileged instructions are run directly on hardware, privileged instructions are emulated.
- Memory pages are migrated and replicated to ensure better locality. They use FLASH hardware counters to find out whether a page should be migrated or replicated.
- Memory and disk are transparently shared. Block cache is shared.
- Copy-on-write disks
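The counter-driven migration/replication policy above can be sketched as a simple decision rule. This is my illustration, not the paper's actual policy: the class name, thresholds, and the 90%-dominance test are all assumptions.

```python
# Hypothetical sketch of a Disco-style page migration/replication policy,
# driven by per-node access counters like FLASH's cache-miss counters.
# Thresholds and structure are illustrative, not from the paper.
from collections import defaultdict

HOT_THRESHOLD = 100   # below this, the page isn't worth moving

class PagePolicy:
    def __init__(self):
        # counts[page][node] = cache misses to this page from that node
        self.counts = defaultdict(lambda: defaultdict(int))
        self.writable = {}  # page -> bool

    def record_miss(self, page, node):
        self.counts[page][node] += 1

    def decide(self, page, home_node):
        by_node = self.counts[page]
        total = sum(by_node.values())
        if total < HOT_THRESHOLD:
            return "leave"
        hottest, hits = max(by_node.items(), key=lambda kv: kv[1])
        if hits > 0.9 * total and hottest != home_node:
            return "migrate"    # dominated by one remote node: move it there
        if not self.writable.get(page, True):
            return "replicate"  # hot, read-shared: give each node a copy
        return "leave"          # hot and write-shared: moving it just ping-pongs
```

A page that is written by many nodes stays put, since neither migration nor replication helps with write sharing.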
Implementation Details
Processors
- Each virtual CPU is kind of like a Disco process.
Disco maintains a data structure for each virtual CPU holding the processor
state (privileged registers, TLB contents, etc.).
- Disco runs in kernel mode
- VMs run in supervisor mode when the OS is running, but in
user mode when an application is running.
- Supervisor mode can access protected parts of address space, but cannot
use privileged instructions or access physical memory directly.
- On a trap, Disco emulates the operations of the virtual processor.
Memory
- Two layers of mapping: virtual->physical, physical->machine
- TLB maps across both: virtual->machine
- The kernel typically runs unmapped (in KSEG0) on MIPS; the OS is relinked
to run in a mapped region so that this translation can happen.
- Implements a second-level software TLB to mitigate two costs: the OS now
consumes more TLB entries, and switching between virtual CPUs requires
TLB flushes.
- Pages are migrated and replicated to reduce cache misses.
- Hardware counters provide data to help Disco decide which pages to
migrate and which to replicate.
I/O
- Disco intercepts all device accesses.
- Disco devices have a clean interface, with everything packed into a
single trap.
- Shared devices handled directly by Disco device drivers.
- Devices used by only a single VM aren't virtualized.
Copy-on-write disks
- When possible, handle I/O via remapping (when the page is already in
memory).
- Multiple VMs can end up sharing memory (uses COW to deal with updates).
- Huge win for things like executable directories and the code parts of the
OS, etc.
- Also works great for things like NFS when combined with the network
interface optimizations described below.
Network Interface
- Communication without replicating data.
- Another application of COW -- don't copy data between VMs, just
remap it.
IRIX changes
- Move kernel out of KSEG0 into a mapped segment.
- New device drivers that use Disco's DMA-based driver interface
- Move frequently trapped register accesses into a special page mapped only
into the OS, so that the OS doesn't have to trap on every such update.
- Change to mbufs to avoid COW traps.
- Specialized bcopy to mmap if possible.
Performance results:
- Ran on a simulator, simulating a slightly different processor (for
performance reasons).
- Minimizing the overhead required modifications to the guest operating systems.
- Main sources of performance overhead: TLB reloading for mostly user-level scientific workloads; high TLB fault rates for database workloads with unpredictable access patterns; instruction-emulation overhead for pmake.
- They show better scalability with Disco as compared to IRIX, a commercial SMP operating system. Is this a fundamental property of VMs? Or can IRIX be fixed by using better synchronization primitives?
Questions:
- Specific:
- What are the traditional problems with virtual machines?
(Virtualization overhead, resource management - don't know when a
resource is no longer in use and can be taken away from a VM,
communication and sharing - old virtual machines could not communicate:
a user could not start two virtual machines that accessed files on
the same disk).
- In the section describing VM overheads, they talk about
emulating privileged instructions. What are they talking about?
- The MIPS R10000 is not fully virtualizable. What does this
mean? How did they solve this problem? Could they have
solved it differently?
- List the performance optimizations that helped Disco's
scalability. (Software TLB cache, shared-memory communication for VMs,
replicated code, partitioned data structures, wait-free synchronization,
page migration and replication, transparently shared memory and
buffer cache.)
- Why did they have to re-map the operating system code?
- Where does performance overhead come from? (Emulation of
privileged instructions and I/O. Increased # of TLB misses - they
don't use ASIDs and flush the TLB on every context switch, plus
they remap the kernel into the mapped memory region).
- How are devices implemented? (One virtual device driver for
each device type.)
- They claim that the changes that they applied by hand to IRIX
could be done automatically. Do you believe them?
- Performance: what are the sources of performance overhead for
applications? How do they differ depending on the application? Are
these performance problems fundamental to virtual machines, or can
they be fixed with a better implementation?
- Were you convinced by their scalability experiments? By the page
migration and replication experiments?
- General:
- Isn't this just an exo-kernel? What are the similarities, what are the differences?
- Isn't this just like Mach running Unix servers? What are the similarities, what are the differences?
- Why didn't the world adopt this idea? Why are people building SMP operating systems? (One reason is that it is difficult to run parallel applications. Disco had to provide special support for them: memory regions shared across multiple virtual machines, so that a parallel application running on different virtual machines can share memory through these segments.)
- Why is this a good idea? Why is this a not-so-good idea?
- What did you like about this paper? What are the good ideas you would use in future system designs?
- Throughout history we saw VMs being used as a "quick-and-simple" solution to complicated problems: IBM's CP/CMS used for time-sharing, Disco for running over NUMA hardware, VMware Workstation for running Microsoft Office apps by Unix geeks. Do VMs have a place of their own, or do they simply serve as placeholders until better technology comes along?
History of Virtual Machines
- Started at IBM in the 1960s
- Were used for time-sharing to multiplex expensive hardware
- Following development of Multics, IBM hurried to announce plans to build TSS, its time-sharing system
- Multics and TSS were late
- But IBM released a system, CP/CMS (CP, the "Control Program", was the virtual machine monitor; CMS was a single-user operating system). This was their quick-and-dirty way to implement time-sharing.
- CP/CMS was the precursor of IBM's VM/370; its descendant, z/VM, is still in use today on IBM mainframes.
What are VMs used for?
- Time sharing (1960s)
- Operating system debugging when hardware is expensive
- Running Windows office software
- Security: honeypots
- Multi-platform OS development (Solaris): develop OS for virtual hardware platform. Run on top of hypervisor. This simplifies development and reduces the size of code tree.
How is virtualization done?
- Types of virtualization:
- Full system simulation or translation (slow, useful for research)
- Binary-compatible: non-privileged instructions run directly on hardware, privileged code traps into the VMM
- Definition of virtualizability (by Popek and Goldberg, 1974): for efficiency, most instructions execute natively; "sensitive" instructions, those that access processor state (status registers, TLB, I/O), must trap and be emulated.
- Hardware architecture matters
- Some hardware is more difficult to virtualize
- Some x86 instructions that read or modify privileged state do not trap when executed in user mode
- Such instructions must be substituted with code that traps into the VMM
- There is heavy use of self-modifying code in the x86 world, so you have to be careful
What is a NUMA?
- Non-uniform memory access
- Some memory is closer (faster to access) than other memory
- ccNUMA is cache-coherent NUMA