Do OS abstractions make sense on FPGAs?
Korolija, Roscoe, Alonso (2020)
What kind of paper is this?
- A rethinking of SW in light of the reality of HW trends.
The Story
- FPGAs are here, deployed in data centers.
- Prior work has placed subsets of OS functionality on the FPGA.
- What if we went "all in" with FPGAs?
- Coyote is an open source, portable platform that provides a full complement of
OS abstractions, letting us examine the all-in approach.
Context
- Hybrid boards with CPU and FPGA
- Any kind of bus/interconnect (PCIe, CXL, CCIX, OpenCAPI, and native ones such as Intel HARP or ETH Enzian).
- Avoid design decisions that preclude use of any FPGA features.
- Partition FPGA into static (configured at boot) portion and a reconfigurable portion.
The Static Region
- Must have everything necessary to communicate with the host and reconfigure the dynamic region.
- Make the region modular so some components can be omitted.
- Required portions:
- Reconfiguration logic
- xDMA engine for host communication
- Ability to divide dynamic region into virtual FPGAs (vFPGA).
- Optional portions (shared across all vFPGA):
- memory controllers (for directly connected RAM)
- networking (TCP and RDMA at time of writing)
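To make the modularity concrete, here is a hedged C++ sketch of the kind of configuration
a host driver could read out of the static region at load time; the field names and layout
are illustrative assumptions, not Coyote's actual format.

    // Hypothetical descriptor of the static region's configuration, as a host
    // driver might read it at load time. Field names/layout are illustrative.
    #include <cstdint>

    struct StaticRegionConfig {
        uint32_t num_vfpgas;       // how many vFPGAs the dynamic region is split into
        uint32_t xdma_channels;    // DMA channels for host <-> FPGA communication
        bool     has_memory_ctrl;  // optional: controllers for FPGA-attached RAM
        bool     has_tcp;          // optional: TCP network service present
        bool     has_rdma;         // optional: RDMA network service present
    };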
Dynamic Region
- Spatial (multiple vFPGAs) and temporal (reconfiguration) multiplexing
- Each vFPGA divided into user logic (application written by user) and wrapper (part of Coyote)
- Applications can be written in HLS, Verilog, VHDL, OpenCL (or combinations)
- Wrapper: sandboxes user logic and interfaces to the rest of the system (this provides
portability of user code between Coyote systems)
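For a sense of what the user-logic half of a vFPGA looks like, here is a minimal HLS (C++)
sketch; the port names, data width, and pragmas are generic Vitis HLS conventions, not the
exact interfaces Coyote's wrapper exposes.

    // Minimal vFPGA user-logic sketch in HLS C++. The wrapper (Coyote's half of
    // the vFPGA) would connect these streams to host/FPGA memory over AXI4.
    #include <hls_stream.h>
    #include <ap_int.h>

    void user_logic(hls::stream<ap_uint<512> >& in,
                    hls::stream<ap_uint<512> >& out,
                    int n_words) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE s_axilite port=n_words
    #pragma HLS INTERFACE s_axilite port=return
        // Stream n_words 512-bit words through a trivial transformation.
        for (int i = 0; i < n_words; ++i) {
    #pragma HLS PIPELINE II=1
            ap_uint<512> w = in.read();
            out.write(~w);  // placeholder computation
        }
    }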
Trading off OS code for FPGA code
- Host OS must provide a user abstraction for accessing FPGA user logic.
- Faster to put performance critical stuff on FPGA, but space is a precious resource.
- Coyote principle: If functionality is not on the fast path -- it goes on the host
- OS code: driver (for Linux) and runtime manager
- driver: read config info from static region and create vFPGA data structures
- driver: control plane ops -- memory mapping, reconfiguration
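A hedged sketch of what the host-side control plane might look like: open the driver's
per-vFPGA device node, issue control-plane requests, and mmap the wrapper registers so
the data path can later bypass the OS. The device path and ioctl are assumptions, not the
real Coyote driver interface.

    // Hypothetical control-plane interaction with the FPGA driver from user space.
    // Device path, ioctl command, and register layout are assumptions.
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>

    int main() {
        int fd = open("/dev/fpga_vfpga0", O_RDWR);   // per-vFPGA device (assumed path)
        if (fd < 0) return 1;

        // Control-plane ops go through the driver, e.g. partial reconfiguration:
        // ioctl(fd, IOCTL_RECONFIGURE, &bitstream_desc);   // illustrative only

        // Map the vFPGA wrapper's registers; subsequent data-plane interaction is
        // plain loads/stores that bypass the kernel.
        void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { close(fd); return 1; }
        volatile uint32_t* regs = static_cast<volatile uint32_t*>(p);
        uint32_t status = regs[0];   // e.g. read a status register
        (void)status;

        munmap(p, 4096);
        close(fd);
        return 0;
    }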
OS Abstractions
- Processes, threads, tasks: it seems that these do map well to a vFPGA, but
the authors make them sound completely different. Admittedly, the granularity of scheduling
needs to be MUCH larger. (I wish they had given us some basic info about FPGAs, such
as how long reconfiguration takes and longevity issues; I needed this before Section 4.3.)
- Execution environment: AXI4 interfaces provided in the dynamic wrappers; these interface
with memory, network, host, and data/control buses. Applications use descriptors to
communicate with these interfaces.
- Using the FPGA:
- Application creates a job object, which encapsulates the user program, its
data, and its parameters.
- The job object is then passed to the runtime manager, which installs it on the FPGA.
- The CPU then interacts with the application on the FPGA by reading/writing to
memory-mapped registers that the runtime manager creates for the program.
- Now communication bypasses the OS and the runtime manager (i.e., it's just memory
writes from the application/CPU perspective). A hedged host-side sketch of this flow
follows this section.
- Scheduling: pre-emption is difficult, if not impossible. Try to build FPGA work so
that a task can run to completion (the only other option is really to kill a badly
behaving application). In Coyote, scheduling is in the host runtime that installs
the job objects. Uses a modified priority queueing approach.
- Virtual memory
- Useful VM functionality for FPGA: demand paging, relocation
- The challenge is addressing -- relative pointers or pointer swizzling; and the TLB changes on the fly!
- Newer FPGAs also have their own memory and could benefit from VM for that
- They claim a SW-loaded TLB is more appropriate.
- Coyote provides TLBs in the wrappers; these mediate all accesses to FPGA-attached devices and RAM.
- Coyote provides separate 4KB-page and 2MB-page TLBs for each vFPGA.
- For each TLB, the host maintains its contents through SW-accessible wrapper registers,
and has direct access to the user logic (a SW-TLB sketch follows this section).
- Memory Management: FPGA memory has no caches and is more diverse.
In Coyote, the kernel driver does all physical memory allocation (and also sets up the
VM mappings). Memory is striped across channels to ensure high bandwidth (see the striping
helper in the sketch after this section).
- IPC, IO, etc: provides optional HW queues for IPC between vFPGAs; network stack is
a service (HW queues used for communication w/services).
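The "Using the FPGA" flow above, as a hedged host-side sketch: Job, RuntimeManager, and
the register offsets are illustrative stand-ins rather than Coyote's actual API; the point
is that after installation, interaction is just memory-mapped register reads and writes.

    // Hypothetical host-side flow: build a job object, hand it to the runtime
    // manager, then talk to the installed user logic via memory-mapped registers.
    #include <cstdint>
    #include <vector>

    struct Job {
        const char*          bitstream;  // user logic to load into a vFPGA
        std::vector<uint8_t> data;       // input data
        uint64_t             param;      // application parameter
    };

    struct RuntimeManager {
        // Would schedule the job, reconfigure a vFPGA if needed, set up mappings,
        // and expose the wrapper's registers. Stubbed here so the sketch runs.
        volatile uint32_t* install(const Job&) {
            static uint32_t fake_regs[64] = {0, 1};  // pretend "done" is already set
            return fake_regs;
        }
    };

    int main() {
        RuntimeManager rm;
        Job job{"user_logic.bit", std::vector<uint8_t>(1 << 20), 42};
        volatile uint32_t* regs = rm.install(job);

        // From here on, the OS and runtime manager are bypassed: it's just memory
        // reads/writes from the CPU's perspective (register offsets assumed).
        regs[0] = 1;                    // "start"
        while ((regs[1] & 0x1) == 0) {} // poll "done"
        return 0;
    }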
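A hedged model of the virtual-memory and memory-management bullets above: a SW-loaded TLB
per page size that the host fills on a miss, plus the striping idea. Entry format, page
sizes per TLB, and the striping granularity are assumptions for illustration.

    // Illustrative model of a vFPGA's SW-loaded TLBs and of DRAM striping.
    #include <cstdint>
    #include <unordered_map>

    struct TlbEntry { uint64_t phys_base; bool valid; };

    struct VfpgaTlb {
        static constexpr uint64_t SMALL = 4ull << 10;   // 4 KiB pages
        static constexpr uint64_t LARGE = 2ull << 20;   // 2 MiB pages
        std::unordered_map<uint64_t, TlbEntry> small, large;  // filled by the host

        // Translate a virtual address. A miss would interrupt the host so the
        // driver can install the mapping (this is where demand paging hooks in).
        bool translate(uint64_t va, uint64_t& pa) const {
            auto l = large.find(va / LARGE);
            if (l != large.end() && l->second.valid) {
                pa = l->second.phys_base + va % LARGE;
                return true;
            }
            auto s = small.find(va / SMALL);
            if (s != small.end() && s->second.valid) {
                pa = s->second.phys_base + va % SMALL;
                return true;
            }
            return false;  // TLB miss -> host driver fills the entry
        }
    };

    // Striping: place consecutive pages on different memory channels so that
    // sequential accesses draw bandwidth from all channels.
    inline uint32_t channel_for_page(uint64_t page_index, uint32_t num_channels) {
        return static_cast<uint32_t>(page_index % num_channels);
    }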
Eval
- Main Question: Is Coyote's flexibility worth it? (I.e., it makes programming easier, but
probably costs performance, space, etc.)
- Macro-benchmark: Decision Trees (GBDT) -- compare Coyote to
Amazon F1 using SDAccel: Coyote looks good, and even if you scale
Harp-v2 with the clock rate, Coyote is still comparable.
- Space overhead: w/out networking 2-4% base overhead + 1% per additional vFPGA;
w/networking 6-9% base overhead + 1-2% per additional vFPGA.
- Micro-benchmark: context switch (reconfiguration) -- linear function of
how much area a vFPGA occupies. Partial reconfiguration, unsurprisingly, helps
a lot! (I.e., reusing the parts common to multiple jobs.)
- Resource sharing: Really great results that show pretty linear behavior.
- DRAM striping: Striping does almost as well as manually dividing (about a
10% penalty).
- Demand Paging: Works quite well (2-9%) for all but the worst case (which
is basically streaming).