Do OS abstractions make sense on FPGAs?
Korolija, Roscoe, Alonso (2020)
What kind of paper is this?
- A rethinking of SW in light of the reality of HW trends.
The Story
- FPGAs are here, deployed in data centers.
- Prior work has placed subsets of OS functionality on the FPGA.
- What if we went "all in" with FPGAs?
- Coyote is an open source, portable platform that provides a full complement of
OS abstractions, letting us examine the all-in approach.
Context
- Hybrid boards with CPU and FPGA
- Any kind of bus/interconnect (PCIe, CXL, CCIX, OpenCAPI, and native ones such as Intel HARP or ETH Enzian).
- Avoid design decisions that preclude use of any FPGA features.
- Partition FPGA into static (configured at boot) portion and a reconfigurable portion.
The Static Region
- Must have everything necessary to communicate with the host and reconfigure the dynamic region.
- Make the region modular so some components can be omitted.
- Required portions:
- Reconfiguration logic
- xDMA engine for host communication
- Ability to divide dynamic region into virtual FPGAs (vFPGA).
- Optional portions (shared across all vFPGA):
- memory controllers (for directly connected RAM)
- networking (TCP and RDMA at time of writing)
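To make the modularity concrete, here is a hedged C++ sketch of the kind of configuration
a host driver could read out of the static region at load time; the field names and layout
are illustrative assumptions, not Coyote's actual format.

    // Hypothetical descriptor of the static region's configuration, as a host
    // driver might read it at load time. Field names/layout are illustrative.
    #include <cstdint>

    struct StaticRegionConfig {
        uint32_t num_vfpgas;       // how many vFPGAs the dynamic region is split into
        uint32_t xdma_channels;    // DMA channels for host <-> FPGA communication
        bool     has_memory_ctrl;  // optional: controllers for FPGA-attached RAM
        bool     has_tcp;          // optional: TCP network service present
        bool     has_rdma;         // optional: RDMA network service present
    };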
Dynamic Region
- Spatial (multiple vFPGAs) and temporal (reconfiguration) multiplexing
- Each vFPGA divided into user logic (application written by user) and wrapper (part of Coyote)
- Applications can be written in HLS, Verilog, VHDL, OpenCL (or combinations)
- Wrapper: sandboxes user logic and interfaces to the rest of the system (this provides
portability of user code between Coyote systems)
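For a sense of what the user-logic half of a vFPGA looks like, here is a minimal HLS (C++)
sketch; the port names, data width, and pragmas are generic Vitis HLS conventions, not the
exact interfaces Coyote's wrapper exposes.

    // Minimal vFPGA user-logic sketch in HLS C++. The wrapper (Coyote's half of
    // the vFPGA) would connect these streams to host/FPGA memory over AXI4.
    #include <hls_stream.h>
    #include <ap_int.h>

    void user_logic(hls::stream<ap_uint<512> >& in,
                    hls::stream<ap_uint<512> >& out,
                    int n_words) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE s_axilite port=n_words
    #pragma HLS INTERFACE s_axilite port=return
        // Stream n_words 512-bit words through a trivial transformation.
        for (int i = 0; i < n_words; ++i) {
    #pragma HLS PIPELINE II=1
            ap_uint<512> w = in.read();
            out.write(~w);  // placeholder computation
        }
    }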
Trading off OS code for FPGA code
- Host OS must provide a user abstraction for accessing FPGA user logic.
- Faster to put performance critical stuff on FPGA, but space is a precious resource.
- Coyote principle: If functionality is not on the fast path -- it goes on the host
- OS code: driver (for Linux) and runtime manager
- driver: read config info from static region and create vFPGA data structures
- driver: control plane ops -- memory mapping, reconfiguration
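A hedged sketch of what the host-side control plane might look like: open the driver's
per-vFPGA device node, issue control-plane requests, and mmap the wrapper registers so
the data path can later bypass the OS. The device path and ioctl are assumptions, not the
real Coyote driver interface.

    // Hypothetical control-plane interaction with the FPGA driver from user space.
    // Device path, ioctl command, and register layout are assumptions.
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>

    int main() {
        int fd = open("/dev/fpga_vfpga0", O_RDWR);   // per-vFPGA device (assumed path)
        if (fd < 0) return 1;

        // Control-plane ops go through the driver, e.g. partial reconfiguration:
        // ioctl(fd, IOCTL_RECONFIGURE, &bitstream_desc);   // illustrative only

        // Map the vFPGA wrapper's registers; subsequent data-plane interaction is
        // plain loads/stores that bypass the kernel.
        void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { close(fd); return 1; }
        volatile uint32_t* regs = static_cast<volatile uint32_t*>(p);
        uint32_t status = regs[0];   // e.g. read a status register
        (void)status;

        munmap(p, 4096);
        close(fd);
        return 0;
    }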
OS Abstractions
- Processes, threads, tasks: it seems that these do map well to a vFPGA, but
the authors make them sound completely different. Admittedly, the granularity of scheduling
needs to be MUCH larger. (I wish they had given us some basic info about FPGAs, such
as how long reconfiguration takes and longevity issues; I needed this before Section 4.3.)
- Execution environment: AXI4 interfaces provided in the dynamic wrappers; these interface
with memory, network, host, and data/control buses. Applications use descriptors to
communicate with these interfaces.
- Using the FPGA:
- Application creates a job object, which encapsulates the user program, its
data, and its parameters.
- The job object is then passed to the runtime manager, which installs it on the FPGA.
- The CPU then interacts with the application on the FPGA by reading/writing to
memory-mapped registers that the runtime manager creates for the program.
- Now communication bypasses the OS and the runtime manager (i.e., it's just memory
writes from the application/CPU perspective). A hedged host-side sketch of this flow
follows this section.
- Scheduling: pre-emption is difficult, if not impossible. Try to build FPGA work so
that a task can run to completion (the only other option is really to kill a badly
behaving application). In Coyote, scheduling is in the host runtime that installs
the job objects. Uses a modified priority queueing approach.
- Virtual memory
- Useful VM functionality for FPGA: demand paging, relocation
- The challenge is addressing -- relative pointers or pointer swizzling; and the TLB changes on the fly!
- Newer FPGAs also have their own memory and could benefit from VM for that
- They claim a SW-loaded TLB is more appropriate.
- Coyote provides TLBs in the wrappers; these mediate all accesses to FPGA-attached devices and RAM.
- Coyote provides separate 4KB-page and 2MB-page TLBs for each vFPGA.
- For each TLB, the host maintains its contents through SW-accessible wrapper registers,
and has direct access to the user logic (a SW-TLB sketch follows this section).
- Memory Management: FPGA memory has no caches and is more diverse.
In Coyote, the kernel driver does all physical memory allocation (and also sets up the
VM mappings). Memory is striped across channels to ensure high bandwidth (see the striping
helper in the sketch after this section).
- IPC, IO, etc: provides optional HW queues for IPC between vFPGAs; network stack is
a service (HW queues used for communication w/services).
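The "Using the FPGA" flow above, as a hedged host-side sketch: Job, RuntimeManager, and
the register offsets are illustrative stand-ins rather than Coyote's actual API; the point
is that after installation, interaction is just memory-mapped register reads and writes.

    // Hypothetical host-side flow: build a job object, hand it to the runtime
    // manager, then talk to the installed user logic via memory-mapped registers.
    #include <cstdint>
    #include <vector>

    struct Job {
        const char*          bitstream;  // user logic to load into a vFPGA
        std::vector<uint8_t> data;       // input data
        uint64_t             param;      // application parameter
    };

    struct RuntimeManager {
        // Would schedule the job, reconfigure a vFPGA if needed, set up mappings,
        // and expose the wrapper's registers. Stubbed here so the sketch runs.
        volatile uint32_t* install(const Job&) {
            static uint32_t fake_regs[64] = {0, 1};  // pretend "done" is already set
            return fake_regs;
        }
    };

    int main() {
        RuntimeManager rm;
        Job job{"user_logic.bit", std::vector<uint8_t>(1 << 20), 42};
        volatile uint32_t* regs = rm.install(job);

        // From here on, the OS and runtime manager are bypassed: it's just memory
        // reads/writes from the CPU's perspective (register offsets assumed).
        regs[0] = 1;                    // "start"
        while ((regs[1] & 0x1) == 0) {} // poll "done"
        return 0;
    }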
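A hedged model of the virtual-memory and memory-management bullets above: a SW-loaded TLB
per page size that the host fills on a miss, plus the striping idea. Entry format, page
sizes per TLB, and the striping granularity are assumptions for illustration.

    // Illustrative model of a vFPGA's SW-loaded TLBs and of DRAM striping.
    #include <cstdint>
    #include <unordered_map>

    struct TlbEntry { uint64_t phys_base; bool valid; };

    struct VfpgaTlb {
        static constexpr uint64_t SMALL = 4ull << 10;   // 4 KiB pages
        static constexpr uint64_t LARGE = 2ull << 20;   // 2 MiB pages
        std::unordered_map<uint64_t, TlbEntry> small, large;  // filled by the host

        // Translate a virtual address. A miss would interrupt the host so the
        // driver can install the mapping (this is where demand paging hooks in).
        bool translate(uint64_t va, uint64_t& pa) const {
            auto l = large.find(va / LARGE);
            if (l != large.end() && l->second.valid) {
                pa = l->second.phys_base + va % LARGE;
                return true;
            }
            auto s = small.find(va / SMALL);
            if (s != small.end() && s->second.valid) {
                pa = s->second.phys_base + va % SMALL;
                return true;
            }
            return false;  // TLB miss -> host driver fills the entry
        }
    };

    // Striping: place consecutive pages on different memory channels so that
    // sequential accesses draw bandwidth from all channels.
    inline uint32_t channel_for_page(uint64_t page_index, uint32_t num_channels) {
        return static_cast<uint32_t>(page_index % num_channels);
    }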
Eval
- Main Question: Is Coyote's flexibility worth it? (I.e., it makes programming easier, but
probably costs performance, space, etc.)
- Macro-benchmark: Decision Trees (GBDT) -- compare Coyote to
Amazon F1 using SDAccel: Coyote looks good, and even if you scale
Harp-v2 with the clock rate, Coyote is still comparable.
- Space overhead: w/out networking 2-4% base overhead + 1% per additional vFPGA;
w/networking 6-9% base overhead + 1-2% per additional vFPGA.
- Micro-benchmark: context switch (reconfiguration) -- linear function of
how much area a vFPGA occupies. Partial reconfiguration, unsurprisingly, helps
a lot! (I.e., reusing the parts common to multiple jobs.)
- Resource sharing: Really great results that show pretty linear behavior.
- DRAM striping: Striping does almost as well as manually dividing (about a
10% penalty).
- Demand Paging: Works quite well (2-9%) for all but the worst case (which
is basically streaming).