Container-based Operating System Virtualization:
A Scalable, High-performance Alternative to Hypervisors
Soltesz, Potzl, Fiuczynski, Bavier, Peterson (2007)
What kind of paper is this?
- Best of both worlds: Isolation of VMs; resource efficiency approaching
that of processes.
The Story
- Isolation and sharing are in tension (e.g., processes have weak isolation
but lots of sharing; VMs have strong isolation, but no sharing).
- However, even in isolated environments, e.g., clouds, applications that do
not share any application information do share OS information. Thus, VM
solutions are resource inefficient.
- We present a container-based approach to provide the best of both worlds.
- Cloud vendors can use resources more efficiently, which should lead to
savings for cloud customers.
Context
- VM usage scenarios
  - Secure work environment on laptop
  - Realtime virus detection
  - Intrusion root cause analysis
  - Debug system
  - Under consideration: HPC
  - Under consideration: the Grid
  - Under consideration: PlanetLab/Amazon EC2
- Container usage scenario: not as clearly defined, basically just "cases
in which you are willing to trade away isolation for efficiency"
- Tradeoffs
  - Fault isolation: both container-based and VM-based systems rely on some
    OS; the principal difference is the size of the API. Smaller APIs should
    leave less opportunity for error (in theory).
  - Resource isolation: both types of systems have to implement some kind of
    sophisticated resource allocation policies.
  - Security isolation: while carefully designed containers can try to
    provide strong boundaries, it's way easier in a VM-based system.
Pieces of a Container-based system
- User Application View: root consisting of OS, libraries, etc (shared, but
read-only); assigned resources that can change dynamically; ability to boot,
shutdown, and reboot like an OS.
- Root is shared
- One VM is designated as the host VM (as in Xen) and is a full-blown VM.
- Other VMs are guests -- the implementation of the guest is where things
differ
- Implementation
  - COS: Security isolation on OS objects (PIDs, ptys, shared memory, etc)
    - Contexts: Separate namespaces: IPC keys, IDs, etc are local to a VM and
      there are no pointers between VMs.
    - Filters: Access controls: runtime checks the kernel makes to grant
      access.
  - VMs: Same idea of context and filter, but controlled at the HW level.
- Both systems use essentially the same techniques for resource isolation
(in fact, when Xen runs Linux in Dom0, it uses exactly the same code as a
container-based system on Linux).
- The hypervisor requires about an order of magnitude more code than is
required for the code changes in Linux-VServer.
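
As a rough illustration of the context/filter split noted above, the sketch below models each container as a private key space (context) plus an ownership check (filter). All names here (KernelObject, Context, filter_allows) are hypothetical and are not taken from Linux-VServer; the real mechanisms live inside the kernel and are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class KernelObject:
    """Hypothetical stand-in for an OS object (IPC segment, pty, ...)."""
    owner: str    # name of the context that created the object
    payload: str


@dataclass
class Context:
    """Hypothetical per-container namespace for IPC keys."""
    name: str
    ipc: dict = field(default_factory=dict)

    def create(self, key, payload):
        # Keys are local: the same key in another context names a different object.
        self.ipc[key] = KernelObject(owner=self.name, payload=payload)
        return self.ipc[key]

    def lookup(self, key):
        # A lookup only ever consults this context's own table.
        return self.ipc.get(key)


def filter_allows(caller, obj):
    """Filter: runtime check made before access to an object is granted."""
    return caller.name == obj.owner


vm_a, vm_b = Context("vm_a"), Context("vm_b")
obj_a = vm_a.create(42, "a-data")
obj_b = vm_b.create(42, "b-data")   # same key, separate namespace
```

In this toy model the context alone already keeps vm_a from naming vm_b's objects; the filter is the second line of defense for objects that are reachable through some global identifier.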
Resource Isolation
- CPU: Token bucket + O(1) scheduler
  - A token entitles a VM to run for 1 ms
  - VMs with reservations accumulate tokens according to that reservation
  - Reservations take priority over shares (shares are allocated
    proportionally)
- IO: Network bandwidth allocated by Linux Hierarchical Token Bucket
- Disk IO: Standard Linux Completely Fair Queueing
- Storage: Hard limits on disk blocks and inodes
- Memory: limit resident set size, pinned pages, anon pages, shmem pages
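
The CPU token-bucket idea above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the VServer scheduler: the Bucket class, the fill rates, the capacity, and the tie-breaking rule (most tokens wins) are all made up for illustration, and the interaction with the O(1) scheduler's priorities is ignored. One tick stands for 1 ms, and running for one tick costs one token.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    fill_rate: float        # tokens gained per 1 ms tick (assumed unit)
    capacity: float         # bucket cannot hold more than this
    reserved: bool = False  # True if the VM has a CPU reservation
    tokens: float = 0.0

    def refill(self):
        self.tokens = min(self.capacity, self.tokens + self.fill_rate)


def pick_next(buckets):
    """Pick the VM to run for the next tick, or None if nobody has a token."""
    runnable = [name for name, b in buckets.items() if b.tokens >= 1.0]
    if not runnable:
        return None
    # Reservations take priority over shares.
    reserved = [name for name in runnable if buckets[name].reserved]
    pool = reserved or runnable
    return max(pool, key=lambda name: buckets[name].tokens)


def simulate(buckets, ticks):
    """Run the toy scheduler and return how many ticks each VM received."""
    ran = {name: 0 for name in buckets}
    for _ in range(ticks):
        for b in buckets.values():
            b.refill()
        name = pick_next(buckets)
        if name is not None:
            buckets[name].tokens -= 1.0   # one token pays for 1 ms of CPU
            ran[name] += 1
    return ran


buckets = {
    "reserved_vm": Bucket(fill_rate=0.25, capacity=10, reserved=True),
    "share_vm": Bucket(fill_rate=1.0, capacity=10),
}
ran = simulate(buckets, ticks=100)
```

With these made-up numbers the reserved VM gets exactly its 25% (one tick in four) and the share-based VM absorbs the remaining ticks, which is the work-conserving behavior the notes describe.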
Security Isolation
- PID: currently global; moving to local per VM
- Network: shares routing tables, IP tables, etc, but restricts sockets to
which a given VM can bind. (If routing tables are shared, doesn't this leak
info between VMs?)
- Chroot: Had to hack around a bug that lets you escape your chroot.
- Capabilities are limited to capabilities for a specific VM
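
The bind restriction in the Network item can be pictured as a filter consulted before a socket is bound. The sketch below is only a model of that check: the ALLOWED_ADDRS table, the VM names, the addresses, and may_bind are hypothetical, and the actual check happens in the kernel rather than in user code.

```python
import ipaddress

# Hypothetical assignment of addresses to VMs; routing tables stay shared.
ALLOWED_ADDRS = {
    "vm1": {ipaddress.ip_address("10.0.0.1")},
    "vm2": {ipaddress.ip_address("10.0.0.2")},
}


def may_bind(vm, addr):
    """Return True if `vm` is allowed to bind a socket to `addr`."""
    return ipaddress.ip_address(addr) in ALLOWED_ADDRS.get(vm, set())
```

Note that a check like this only constrains which addresses a VM can use; it does nothing about information visible through the shared routing tables, which is the concern raised in the parenthetical above.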
System Efficiency (Eval)
- Claim: VServer is comparable to Linux; Xen incurs 50% overhead
- Tests are only on a system with a single VM; that makes me suspicious
- Microbenchmarks (lmbench)
  - Is this an eval of VServer or of Xen?
  - Present results only for those tests where the difference between Xen
    and VServer is more than 2x
  - Xen overheads are all from maintaining page tables via upcalls
- System Benchmarks
  - Iperf (network BW): Xen is CPU bound; VServer comparable to Linux
  - dd (write 6 GB file): VServer equals Linux; Xen is about 65% to 85% of
    Linux
  - CPU-only+kernel compile: Xen is comparable to Linux/VServer
  - OSDB: Xen ranges between 50% and 100% of Linux
Isolation
- The one-quarter test seems like a bogus comparison: Xen does not provide
the functionality being tested, so they criticize it for not doing what they
are asking.