Arrakis: The Operating System is the Control Plane
Peter, Li, Zhang, Ports, Woos, Krishnamurthy, Anderson, Roscoe (2014)
What kind of paper is this?
- Hardware/Computing landscape is changing (networks are really fast and some
devices provide virtualization), so let's redesign system software for this new
landscape.
- Once upon a time system overheads were dwarfed by the relatively
poor performance of IO.
However, today's IO devices (networking in particular) are both significantly
faster and offer virtualization support.
Now, the software stack that sits on top of these devices is the bottleneck.
If we can give applications direct access to
the hardware for such data movement, they can bypass all that kernel overhead,
making an important class of servers much faster.
And now everyone who uses such services lives happily ever after.
Today's stacks are deeply layered and introduce overhead
- SW demultiplexing
- Security checks
- Context switching
- Cache and lock contention
- Queue management
- Storage persistence (fsync): used to matter less when disks were
the bottleneck; with faster persistent devices, CPU time is a greater concern.
- Data copies
- Parameter checking
- Metadata updates
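The fsync point above is easy to see with a hypothetical micro-benchmark (not from the paper; `timed_writes` is an invented name): time a run of small appends with and without forcing each write to stable storage.

```python
import os, tempfile, time

def timed_writes(n, do_fsync):
    """Time n small appends, optionally forcing each one to stable storage."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(n):
            os.write(fd, b"x" * 64)
            if do_fsync:
                os.fsync(fd)  # wait for the device to acknowledge persistence
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)

buffered = timed_writes(100, do_fsync=False)
durable = timed_writes(100, do_fsync=True)
```

On spinning disks the fsync path was dominated by device latency; on fast persistent devices the fixed software cost of each operation becomes the larger share.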
- Arrakis solution: use SR-IOV to remove kernel mediation from the data plane;
eliminates some overheads entirely and reduces others. Eschewing POSIX
enables true zero-copy IO.
- SR-IOV is designed to support multiple VMs sharing an IO device
- Single device (physical function) can be dynamically turned into multiple
PCI devices (virtual functions).
- Hypervisor creates the virtual functions and installs filters to demultiplex
operations to the right virtual function (guest OS).
- Arrakis uses SR-IOV and IOMMUs to give applications (instead of guest VMs)
direct access to IO devices.
- Note: This is not so much a new idea as the modern embodiment of an idea
that's been around for quite some time.
- Remove kernel from data plane operations
- Require no changes for application programming (but allow changes to produce
even better performance).
- Provide abstractions useful across many devices.
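The zero-copy point can be made concrete with a toy contrast (not the Arrakis API; `FakeNIC` and its methods are invented for the sketch): a POSIX-style read copies bytes out of the device buffer, while a zero-copy interface hands the application a view into it.

```python
class FakeNIC:
    def __init__(self, payload: bytes):
        self.ring = bytearray(payload)   # stand-in for a DMA buffer

    def posix_read(self) -> bytes:
        return bytes(self.ring)          # copy: new allocation per call

    def zero_copy_read(self) -> memoryview:
        return memoryview(self.ring)     # no copy: view of the same memory

nic = FakeNIC(b"packet-payload")
copied = nic.posix_read()
view = nic.zero_copy_read()

nic.ring[0:6] = b"PACKET"                # "device" overwrites the buffer
assert copied[0:6] == b"packet"          # the copy is a private snapshot
assert bytes(view[0:6]) == b"PACKET"     # the view sees the live buffer
```

The copy semantics are exactly what POSIX read/recv promise, which is why keeping that API leaves one copy on the table.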
- Enabling technology:
While SR-IOV is designed for VMs, in Arrakis, each application can
have its own networking stack and virtual device.
- Assumes virtualized devices (i.e., SR-IOV) -- and their interface presents
the abstraction of idealized virtual devices, ideally provided entirely by the
hardware.
- IO stacks provided as libraries (a partial libOS).
- Each device interacts with applications through a DMA send-receive queue.
Optionally might include device-specific operations, e.g., TCP checksum and
segmentation offload.
- Transmit filters (predicates) prevent applications from sending malformed
packets or bypassing security checks.
- Receive filters (predicates) direct incoming packets to the right
application's virtual interface.
- Privileged software can install bandwidth limiters to implement policies
among the virtual devices.
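A minimal sketch of receive-filter demultiplexing, treating a filter as a predicate over packet headers installed by privileged software (all names here are invented; the real filters live in the device or hypervisor, not application code):

```python
class Switchboard:
    def __init__(self):
        self.filters = []                 # (predicate, app) pairs
        self.queues = {}                  # per-application receive queues

    def install_filter(self, predicate, app):
        self.filters.append((predicate, app))
        self.queues.setdefault(app, [])

    def deliver(self, packet):
        # First matching filter wins; a real system would route unmatched
        # packets to a default/kernel path.
        for predicate, app in self.filters:
            if predicate(packet):
                self.queues[app].append(packet)
                return app
        return None

sb = Switchboard()
sb.install_filter(lambda p: p["dst_port"] == 80, "httpd")
sb.install_filter(lambda p: p["dst_port"] == 6379, "redis")

assert sb.deliver({"dst_port": 80, "data": b"GET /"}) == "httpd"
assert sb.deliver({"dst_port": 6379, "data": b"PING"}) == "redis"
assert sb.deliver({"dst_port": 22, "data": b"ssh"}) is None
```

The point of pushing this matching into hardware is that no kernel code runs on the packet's path to the application.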
Device Emulation: To run Arrakis without smart devices, use a core
on the processor to play the role of the smart device.
Control Plane Interface: Key abstractions: virtual interface cards,
doorbells, filters, virtual storage adapters (VSAs), and rate specifiers.
Filters are implemented using Barrelfish capabilities.
VSAs implement the file system on storage devices; an application can
choose to export the names it uses to the VFS.
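One plausible realization of a rate specifier is a token bucket; the paper names the abstraction, not this implementation, so treat the sketch below as an assumption.

```python
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate        # tokens replenished per second
        self.capacity = burst   # maximum burst size
        self.tokens = burst
        self.last = 0.0

    def allow(self, now, cost=1):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

tb = TokenBucket(rate=10, burst=2)   # 10 ops/sec, burst of 2
assert tb.allow(0.0)                 # first burst token
assert tb.allow(0.0)                 # second burst token
assert not tb.allow(0.0)             # bucket empty, operation throttled
assert tb.allow(0.1)                 # 0.1 s later: one token refilled
```

Installed by privileged software, such a limiter enforces sharing policy among virtual devices without touching the data path on every packet.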
Arrakis provides standard POSIX APIs as well as Arrakis-native APIs
(mostly for zero-copy).
Async notifications (virtualized HW interrupts) are delivered via
doorbells, which are exposed through conventional APIs, e.g., select.
Based on Barrelfish, but extended with user-level network stack (Extaris),
POSIX threads, POSIX sockets, epoll, Caladan (persistent data structures), etc.
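The doorbell-through-select mechanism can be emulated in a few lines, using a pipe as a stand-in for the doorbell file descriptor (this mimics the mechanism only, not the real Arrakis API):

```python
import os, select

rfd, wfd = os.pipe()        # stand-in for a doorbell fd

# Nothing pending: select with a zero timeout reports no readable fds.
ready, _, _ = select.select([rfd], [], [], 0)
assert ready == []

os.write(wfd, b"\0")        # "device" rings the doorbell
ready, _, _ = select.select([rfd], [], [], 0)
assert ready == [rfd]       # application wakes via the conventional API

os.read(rfd, 1)             # consume the notification
os.close(rfd)
os.close(wfd)
```

Routing notifications through an fd is what lets unmodified select/epoll-based event loops work on top of the virtualized interrupts.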
Evaluation workloads:
- Read-heavy memcached
- Write-heavy Redis
- Direct HTTP requests
- HTTP requests via a load balancer
- Nice focused questions for the eval:
- What are the performance overheads and how do they compare to Linux?
- Is Arrakis latency/throughput better?
- Can Arrakis provide high performance directly to the application?
- What is the advantage of departing from POSIX?
- UDP echo server: Arrakis/N achieves 94% of the driver limit; roughly 4x
better than Linux. As the application spends more time processing, this
difference (obviously) decreases and evens out at about 64us. Arrakis/P
is nearly as good.
- Memcached: Arrakis is more scalable (maxes out the connection at 4 cores,
while Linux barely scales to 2). And it produces about 3x max throughput.
Changing the implementation to Arrakis/N provides only about another 10% increase.
- Redis: (This adds persistence) Read throughput is about 80% better
with Arrakis/P, but write throughput is about an order of magnitude better!
- HTTP haproxy: Arrakis/P improves throughput by
2 to 2.5x.
- HTTP middlebox load balancer: Arrakis/P improves throughput by 3-5x.
- Is 128 entries in memcached realistic? If not, is it likely to
affect performance? Is the equal load assumption valid?
- Why do we only see throughput numbers?
What did I like?
- Getting the OS out of the way for data operations has repeatedly been
shown to be a win.
- Designing for idealized HW also makes the OS way easier to deal with.