Kivity: OSv -- Optimizing the Operating System for Virtual Machines
Kivity, Laor, Costa, Enberg, Har'El, Marti, Zolotarov (2014)
What kind of paper?
- Describes a new OS designed specifically for a virtualized environment.
- Specifically designed for environments where a VM is running a single
application (e.g., a database, a file server).
- Aims to avoid duplicating functionality between the guest OS and the hypervisor.
- Builds on the library-OS idea.
What was the problem they were trying to solve?
- Guest OS and hypervisor duplicate functionality.
- If you run a set of cooperating trusted applications in a VM, they
don't need isolation from one another; hypervisor isolation is sufficient.
- Both the OS and the hypervisor abstract the hardware.
- Goals of general purpose OS and VM OS differ:
- VM OS needs administration at scale (not single-machine admin)
- VM OS need not be portable to different hardware.
- VM OS needs to be fast and small, perhaps tailored to a
specific implementation.
Goals
- Run Linux binaries faster than Linux
- Quick boot
- Develop efficient new APIs that might be used natively or
from a JVM
- Develop a platform for further research
Design
- Library OS -- each VM is a single application (dynamically) linked with
its own copy of the OS. (More accurately the other way around -- OS dynamically
loads the application.)
- Application talks directly to hardware.
- Single address space -- if you wanted more, you should have fired
up multiple VMs -- kernel and application run in the same address space.
- System calls to OSv become ordinary function calls.
- Multiple file systems: ZFS, devfs, ramfs (using VFS).
Memory Management
- Uses virtual memory (because x86_64 requires it for long mode operation)
- Supports mmap.
- No page eviction (because single application -- only need swap).
No Spinlocks
- Bad in virtual environments, where the guest OS cannot prevent a virtual CPU
from being descheduled while it holds a spinlock (because the hypervisor
might deschedule the virtual CPU or the entire VM); the other virtual CPUs
then spin without making progress.
- Instead, do kernel work in threads and use a mutex that is not built on
spinlocks (i.e., a lock-free implementation).
- Use per-CPU run queues and lock-free algorithms for the
scheduler (since it cannot run in a thread).
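To make the "sleep instead of spin" point concrete, here is a minimal sketch of a mutex whose contended waiters go to sleep on a wait queue and are handed ownership directly on unlock. It is an illustration only, not OSv's implementation: the class name SleepingMutex and the demo function are made up for these notes, and because Python exposes no compare-and-swap, a small internal threading.Lock stands in for the atomic operations a genuinely lock-free mutex would use.

```python
import threading
from collections import deque


class SleepingMutex:
    """Toy mutex: a contended waiter sleeps on a queue instead of spinning."""

    def __init__(self):
        # ASSUMPTION: this guard only stands in for atomic operations;
        # a real lock-free mutex would not contain another lock.
        self._guard = threading.Lock()
        self._held = False
        self._waiters = deque()  # FIFO of Events, one per sleeping waiter

    def lock(self):
        with self._guard:
            if not self._held:
                self._held = True
                return
            wakeup = threading.Event()
            self._waiters.append(wakeup)
        # Sleep (no busy loop). unlock() hands ownership to us directly.
        wakeup.wait()

    def unlock(self):
        with self._guard:
            if self._waiters:
                # Direct hand-off: the mutex stays held, the owner changes.
                self._waiters.popleft().set()
            else:
                self._held = False


def demo(n_threads=4, iterations=500):
    """Increment a shared counter under the mutex from several threads."""
    mutex = SleepingMutex()
    counter = {"value": 0}

    def worker():
        for _ in range(iterations):
            mutex.lock()
            counter["value"] += 1  # critical section
            mutex.unlock()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

If the holder is descheduled, the waiters here are asleep rather than burning their own time slices, which is the property the notes care about in a virtualized setting.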
Network Channels
- The idea is to eliminate sharing that typically happens as packets
traverse up and down the network stack.
- Instead, create a channel per-flow.
- Classifier directs an incoming packet to the right flow.
- No synchronization on incoming packets.
- On the send side, packets are handled by the application in the application
thread.
- Only synchronization is a single per-socket send/receive buffer lock.
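The per-flow channel idea can be sketched as follows. This is a toy model under stated assumptions, not OSv's code: the names Packet, Classifier, open_flow, deliver and drain are invented for these notes, a flow is assumed to be identified by the usual 5-tuple, and a plain deque stands in for whatever single-producer/single-consumer queue a real implementation would use.

```python
from collections import deque, namedtuple

# Hypothetical packet record; a flow is keyed by the first five fields.
Packet = namedtuple("Packet", "src_ip src_port dst_ip dst_port proto payload")


class Classifier:
    """Directs each incoming packet to the channel of the flow it belongs to."""

    def __init__(self):
        self.channels = {}  # flow key (5-tuple) -> deque of packets

    def open_flow(self, key):
        """Create the channel for one flow and return it to its owner."""
        channel = deque()
        self.channels[key] = channel
        return channel

    def deliver(self, pkt):
        """Queue pkt on its flow's channel; False if no flow matches."""
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.proto)
        channel = self.channels.get(key)
        if channel is None:
            return False  # a real stack would take some fallback path here
        channel.append(pkt)
        return True


def drain(channel):
    """Run by the thread that owns the flow: consume its queued packets."""
    payloads = []
    while channel:
        payloads.append(channel.popleft().payload)
    return payloads
```

Usage: open one channel per connection, call deliver() for each arriving packet, and have the application thread that owns the connection call drain(). Packets of different flows never touch the same queue, which is where the reduction in sharing comes from.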
Thread Scheduling
- Desirable properties:
- lock-free
- preemptive
- tickless
- fair
- scalable
- efficient
- Don't the N^2 incoming wakeup queues (one per pair of CPUs) potentially
pose a problem?
- How is computing when the next thread should run different from using ticks?
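One way to think about the tickless question: instead of a periodic interrupt, the scheduler can arm a one-shot timer for the moment the scheduling decision could change. The sketch below is a simplification for these notes, not the paper's algorithm: it assumes fairness is judged by plain accumulated runtime in microseconds, and the function name pick_next_and_deadline and the hysteresis_us parameter are hypothetical.

```python
def pick_next_and_deadline(runtimes_us, hysteresis_us=2000):
    """Pick the thread to run and how long it may run before a re-check.

    runtimes_us maps thread name -> accumulated runtime in microseconds
    (ASSUMPTION: a stand-in for whatever fairness measure is really used).
    Returns (thread_name, timer_us); timer_us is None when no timer is needed.
    """
    ordered = sorted(runtimes_us.items(), key=lambda item: item[1])
    name, runtime = ordered[0]
    if len(ordered) == 1:
        # Only one runnable thread: nothing can preempt it, so no timer at all.
        return name, None
    runner_up_runtime = ordered[1][1]
    # Run until we have caught up with the runner-up, plus some slack so two
    # equally loaded threads do not switch back and forth constantly.
    return name, (runner_up_runtime - runtime) + hysteresis_us
```

For example, with runtimes of 10000, 25000 and 40000 microseconds for threads a, b and c, thread a runs and the timer is set 17000 microseconds ahead. A timer interrupt still happens, but only when it can matter, and not at all when a single thread is runnable; a periodic tick would fire regardless.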
New APIs
- Main idea: because Linux was designed to run multiple processes,
it has overheads when you are not running multiple processes, but are
running a single-address-space operating system.
- Examples:
- netmap API: don't have to copy network packets from kernel to application.
- Expose MMU: JVM can use it to make GC more efficient.
- Basically, in this case, crashing the OS is equivalent to crashing
the application, so the fact that bugs in the application cause the OS
to crash just doesn't matter.
- Shrinker: low-memory callback (sounds like an external pager); lets,
say, a buffer cache grow as large as possible without interfering with the
OS. Thought: this sounds like an architecture that basically lets me
customize the OS for my application.
- Balloon: This seems completely orthogonal to OSv -- any OS could do this
(although calling into the JVM through the JNI would probably be gross -- you'd
want a daemon to do this). This feels like a hack to me.
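The shrinker idea can be sketched as a registry of callbacks that the OS invokes when memory runs low. Everything here is hypothetical and written for these notes (ShrinkerRegistry, BufferCache, and the convention that a callback receives a byte target and returns the bytes it actually freed); it is not OSv's API.

```python
class ShrinkerRegistry:
    """OS side: remembers callbacks and calls them when memory runs low."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        # callback(target_bytes) -> bytes actually freed (assumed convention)
        self._callbacks.append(callback)

    def on_low_memory(self, bytes_needed):
        """Ask registered caches to give memory back until the need is met."""
        freed = 0
        for callback in self._callbacks:
            if freed >= bytes_needed:
                break
            freed += callback(bytes_needed - freed)
        return freed


class BufferCache:
    """Application side: grows freely and shrinks only when asked to."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = []  # oldest block first

    def add_blocks(self, count):
        self.blocks.extend(bytearray(self.block_size) for _ in range(count))

    def shrink(self, target_bytes):
        """Drop the oldest blocks until at least target_bytes are freed."""
        freed = 0
        while self.blocks and freed < target_bytes:
            self.blocks.pop(0)
            freed += self.block_size
        return freed
```

Usage: the application calls registry.register(cache.shrink) once, then lets the cache grow; under pressure the OS calls registry.on_low_memory(n). With ten 4096-byte blocks cached and 10000 bytes needed, three blocks (12288 bytes) are released and seven remain.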
Evaluation
- Anyone know why one has to clear the iptables firewall rules?
- Interesting that they start with macrobenchmarks; typically one does
microbenchmarks first and uses them to explain the macrobenchmarks.
- The only stated goal of the evaluation is to demonstrate improvement
over Linux. Are there other things you might have wanted them to evaluate?
- Macrobenchmarks
- Memslap (memcached): They show good numbers, but a) do not explain why,
b) don't demonstrate the ability to integrate the caching hacks so that the
memcache cache can grow as large as possible, and c) don't tell us how long
their Linux took to boot or what its memory sizes were.
- SpecJVM2008: I like the fact that they were willing to show a benchmark
where their system does not really outscore Linux (0.005 doesn't seem
significant to me -- they claim it is because the standard deviation was only 0.002, but
I would have liked a bit more detail about what/how specjvm runs).
- Microbenchmarks
- Netperf: again "we're better" but no explanation of why -- I find this sad.
- JVM Balloon: OK, now they test their cute little hack. And they explain
what is happening. Yay.
- Context switch (really a thread switch): OSv way faster with no explanation.