Enzian: An Open, General, CPU/FPGA Platform for Systems Software Research
Cock, Ramdas, Schwyn, Giardino, Turowski, He, Hossle, Korolija, Licciardello, Martsenko, Achermann, Alonso, Roscoe (2022)
What kind of paper is this?
- Big idea: We need a HW platform for research so we built one!
The Story
- Historically, there was sufficient commonality in HW that system software
researchers could use commodity HW.
- The explosion in GPUs and various hardware accelerators means this is
no longer the case. This makes it difficult to propose truly innovative
SW, because the HW can be too special-purpose.
- Enzian is a platform designed specifically to give systems researchers
a good experimental platform.
- Success with Enzian should enable the golden age of architecture to be
matched by a golden age of system software.
The Platform
- Hybrid 2-socket board with one CPU and one FPGA
- Large FPGA (Xilinx XCVU9P) + 1 TB DRAM
- Server-class CPU (48-core ARMv8) + 128GB DRAM and match-action tables
- 2 by 40 Gb/s Ethernet on CPU
- 16 by 25 Gb/s serial lines on FPGA (configurable)
- Asymmetric, cache-coherent NUMA
- Explicit access to cache coherence messages -- FPGA is connected
to the CPU through the native inter-socket cache coherence protocol.
- Detailed Instrumentation
- Programmable baseboard management processor
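A quick arithmetic check on the network figures in the list above, as a minimal sketch. It assumes the per-link rates are raw line rates in gigabits per second and ignores encoding and protocol overhead; the helper names are made up for illustration and do not come from the paper.

```python
# Aggregate raw line rates for the CPU-side and FPGA-side links.
# Assumption: per-link figures are in gigabits per second (Gb/s).

def aggregate_gbps(links: int, gbps_per_link: float) -> float:
    """Total raw line rate in Gb/s for `links` identical links."""
    return links * gbps_per_link


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


cpu_ethernet_gbps = aggregate_gbps(2, 40)    # CPU: 2 links of 40 Gb/s
fpga_serial_gbps = aggregate_gbps(16, 25)    # FPGA: 16 links of 25 Gb/s

print(f"CPU Ethernet: {cpu_ethernet_gbps} Gb/s "
      f"({gbps_to_gigabytes_per_s(cpu_ethernet_gbps)} GB/s)")
print(f"FPGA serial:  {fpga_serial_gbps} Gb/s "
      f"({gbps_to_gigabytes_per_s(fpga_serial_gbps)} GB/s)")
```

Under these assumptions the FPGA side has five times the raw network bandwidth of the CPU side, which fits the FPGA-centric network use cases in the notes.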
How is this research?
- How do you construct a single platform that covers and extends
the space of existing platforms? (At reasonable cost.)
- What kinds of experiments does Enzian enable that could otherwise
not be done?
- What kinds of hardware should we be building?
- How do we expose the research community to fundamental challenges
in building new HW platforms?
Categorization of existing platforms
- PCIe-based accelerators: data copies in bulk to/from accelerator;
challenge for fine-grain computation.
- Fully cache-coherent protocols: good for fine-grain acceleration,
but typically have small caches and no significant memory on the FPGA.
(Kind of the opposite of the category above.)
- SmartNIC FPGA: PCIe bus internal to the NIC; no direct connection
between FPGA and CPU, so they are only used for network processing.
- SmartNIC FPGA w/direct CPU channel: (This was not called out
as a class in the paper, but feels different enough that I wanted to.)
Enables more complicated functionality on the NIC (e.g., KVstore) and
allows for exploration of applications where FPGAs can provide
acceleration with much shorter turn around than an ASIC.
- MPSoC: CPU and FPGA are on the same die. FPGA has access to cache
protocol, so the FPGA can act like part of the CPU memory
system, which enables research on remote memory protocols (e.g., CXL).
These systems have wimpy cores.
Advantages of a research platform
- Provide more instrumentation
- Shorter learning curve for adoption
- Reference point for comparing disparate work
- Broader scope of possible research projects
- Sharing across the community
Eval
- Goals: deliver performance comparable to existing systems and support
existing research projects and potentially new ones.
- Cache Coherence Interconnect: Enzian (ECI) versus PCIe
- Two links of 12 lanes each; experiment uses one link
- Latency: Better up to transfers of 8KB
- Throughput: Better (and then comparable) up to transfer of 2KB
- The authors attribute the improved performance to better protocol design:
ECI is optimized for 128-byte cache line transfers; PCIe is designed for
throughput, so it has a high up-front cost.
- Network (TCP/IP and RDMA)
- TCP
- Using an open-source FPGA stack ported to Coyote
- Compare Enzian performance to that between two Intel Xeon Golds
- Enzian has about 1/3 the latency and about 2.5x the throughput.
- RDMA: open-source RDMA stack on Enzian and commercial Xilinx Alveo U280
- Enzian CPU and FPGA measurements are quite close
- For latency, Enzian is faster than Alveo (FPGA) but slower than Mellanox Host.
- For throughput, Enzian is mostly faster until transfers are >= 8KB
- PCIe Accelerator Style Application: Inference on GBDT (as in Coyote)
- Outperforms all others (these look like the exact same experiments and
results as those from the Coyote paper with one more entry).
- Even so, they state clearly that they do not claim that Enzian
is universally faster (I really liked this).
- FPGA as custom memory controller
- This is a new style benchmark -- it's an example of using Enzian
to prototype something you might consider building.
- Prototype a bulk transfer to the FPGA which performs luminance conversion and optional quantization.
- Cache integration alone yields a 33% speedup (39% w/out quantization)
- The burst/bulk optimization produces an almost 4x improvement.
- Instrumentation: fine-grain power monitoring
- Spectacularly detailed power profiles.
- Can tease out CPU/FPGA and DRAM bank power draws.
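The ECI versus PCIe result above (a protocol with lower up-front cost wins for small transfers) can be illustrated with a toy cost model: transfer time = fixed setup cost + size / bandwidth. This is only a sketch. The setup costs and bandwidths below are hypothetical placeholders, not measurements from the paper, and the assumption that the high-setup link also has the higher bandwidth is mine.

```python
from typing import Optional


def transfer_time_us(size_bytes: float, setup_us: float,
                     bandwidth_gbytes_per_s: float) -> float:
    """Fixed setup cost plus streaming time, in microseconds."""
    # 1 GB/s = 1e9 bytes/s = 1e3 bytes per microsecond
    return setup_us + size_bytes / (bandwidth_gbytes_per_s * 1e3)


def crossover_bytes(setup_a_us: float, bw_a: float,
                    setup_b_us: float, bw_b: float) -> Optional[float]:
    """Transfer size at which link A (lower setup, lower bandwidth)
    stops being faster than link B. Returns None if A never loses
    its lead or never has one."""
    if setup_a_us >= setup_b_us or bw_a >= bw_b:
        return None
    # Solve setup_a + s/(bw_a*1e3) == setup_b + s/(bw_b*1e3) for s.
    return (setup_b_us - setup_a_us) * 1e3 / (1.0 / bw_a - 1.0 / bw_b)


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
LOW_SETUP = dict(setup_us=0.5, bandwidth_gbytes_per_s=8.0)    # "ECI-like"
HIGH_SETUP = dict(setup_us=1.5, bandwidth_gbytes_per_s=16.0)  # "PCIe-like"

for size in (128, 2048, 8192, 65536):
    a = transfer_time_us(size, **LOW_SETUP)
    b = transfer_time_us(size, **HIGH_SETUP)
    print(f"{size:6d} B  low-setup {a:6.2f} us  high-setup {b:6.2f} us")

print("crossover:", crossover_bytes(0.5, 8.0, 1.5, 16.0), "bytes")
```

With these made-up numbers the low-setup link wins below 16000 bytes and loses above it. The measured crossover points in the notes (8KB for latency, 2KB for throughput) depend on the real parameters, which this sketch does not attempt to reproduce.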
Other Use Cases
- FPGA as relational database accelerator: Higher Enzian bandwidth
makes this a win.
- Inference: the TB of memory on the Enzian FPGA makes this a win.
- FPGA as smart storage controller (i.e., support tiered memory).
- FPGA as CPU observer/monitor.
- Cluster of Enzians as a disaggregated memory system
- Port seL4 to BMC.