FAASM: Lightweight Isolation for Efficient Stateful Serverless Computing
Shillaker, Pietzuch (2020)
What kind of paper is this?
- Yet another isolation mechanism -- containers with shared memory. (That is
not a criticism, although seeing this convinces me more than ever that the work
Sid is doing is critical!)
- We could also call this: I built a thing; it's better than the other things.
The Story
- Serverless is now a thing: it isolates functions from each other.
- Sometimes functions need to share; failure to do so is wasteful.
- Two key problems: data access overheads and container resource footprint.
- Faaslets/FAASM allow memory sharing using software-fault isolation (SFI).
- Jobs that need to share memory run way faster and consume less memory.
Margo Rant
- Every time I see someone say things like, "Data access is slow on
serverless" I get very confused. It seems to me that by its very nature
serverless is NOT designed for persistent data -- that is, it's all about
running in a stateless fashion.
- I'm thinking that we really have the wrong abstraction here (and we're
going to stretch serverless until it's unrecognizable).
- In my head, truly serverless should be about not needing persistent state.
- Then we need something else that provides granular decomposition (FaaS)
and persistent state.
- Perhaps the first class is just too small to be interesting?
- Or maybe serverless is just a terrible name?
Requirements for a Serverless Mechanism with better Data Properties
- Strong memory and resource isolation
- Efficient state sharing
- Scaling state across multiple hosts
- Low memory footprint
- Fast instantiation
- Multiple programming languages
Contributions
- Lightweight isolation via SFI (memory), cgroups (CPU), network namespaces and traffic shaping.
- Co-location of data in a shared address space. Locally use shared memory;
globally use "distributed access".
- Warmstart via pre-initialized snapshots. (OS independent)
- Standard POSIX API (with minimal changes).
- Evaluation demonstrating runtime, memory, and network traffic reductions.
Faaslet Function Overview
- Compile functions to WebAssembly for memory safety and control flow integrity.
- CPU isolation via Linux cgroups.
- Fair sharing via Linux CFS (each function runs in its own thread).
- Fair and secure networking via network namespaces, virtual network interfaces,
and traffic shaping (enforces ingress and egress traffic limits).
- Does not offer full POSIX -- instead a very limited set of functions.
- Shared memory via a new shared-region abstraction added to WebAssembly. (Each
faaslet gets a contiguous memory region; shared regions are appended to each
faaslet's region.)
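The shared-region idea above can be sketched roughly as follows. This is a toy model, not FAASM's implementation: each faaslet's private linear memory is a `bytearray`, and a shared region is a `memoryview` over one common buffer, logically appended after the faaslet's private memory (all class and method names here are hypothetical).

```python
class Faaslet:
    """Toy model of a faaslet's linear memory plus appended shared regions."""

    def __init__(self, private_size: int):
        # Private linear memory, analogous to a WebAssembly memory.
        self.memory = bytearray(private_size)
        self.shared_regions = []  # list of (base_offset, memoryview) pairs

    def map_shared_region(self, shared_buf: bytearray) -> int:
        # The shared region is logically appended after all existing memory;
        # we record its base offset and keep a view onto the shared buffer.
        base = len(self.memory) + sum(len(v) for _, v in self.shared_regions)
        self.shared_regions.append((base, memoryview(shared_buf)))
        return base

# One underlying buffer mapped into two faaslets.
shared = bytearray(8)
a, b = Faaslet(64), Faaslet(32)
a.map_shared_region(shared)
b.map_shared_region(shared)

# A write through faaslet a's view is visible to faaslet b.
a.shared_regions[0][1][0] = 42
print(b.shared_regions[0][1][0])  # -> 42
```

The point of the sketch: both faaslets keep their own private memory for isolation, while the shared buffer gives zero-copy local sharing.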
Local and Global State
- Distributed Data Objects (DDOs) are language-level classes that expose high-level
state interfaces, implemented using the Faaslet key-value interface.
- Share in-memory access locally; global access across hosts.
- Faaslets can push to/pull from the global tier.
- If you want global consistency, you have to request global locks; else you can (often)
use local locks.
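The local/global two-tier pattern can be sketched like this. All names are hypothetical (this is not FAASM's real API); the global tier is stood in for by an in-process dict rather than a distributed KV store.

```python
import threading

GLOBAL_TIER = {}    # stand-in for the distributed key-value store
GLOBAL_LOCKS = {}   # per-key "global" locks (single-process stand-in)

class DDO:
    """Toy Distributed Data Object: local copy plus push/pull to a global tier."""

    def __init__(self, key: str):
        self.key = key
        self.local = None                   # locally shared in-memory copy
        self.local_lock = threading.Lock()  # cheap lock for co-located faaslets

    def pull(self):
        # Fetch the authoritative value from the global tier.
        self.local = GLOBAL_TIER.get(self.key)
        return self.local

    def push(self):
        # Publish the local copy to the global tier.
        GLOBAL_TIER[self.key] = self.local

    def update(self, value, global_consistency=False):
        if global_consistency:
            # Cross-host consistency requires the global lock.
            lock = GLOBAL_LOCKS.setdefault(self.key, threading.Lock())
            with lock:
                self.local = value
                self.push()
        else:
            # Co-located faaslets can often get away with a local lock only.
            with self.local_lock:
                self.local = value

d = DDO("weights")
d.update([1.0, 2.0], global_consistency=True)
print(DDO("weights").pull())  # -> [1.0, 2.0]
```

The design choice this illustrates: most updates stay cheap (local lock, no network), and you pay for global locking only when you explicitly ask for cross-host consistency.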
FAASM Runtime
- Interacts with serverless infrastructure and provides scheduling, execution and
state management for faaslets.
- Scheduler tries to schedule faaslets where they have local state.
- I think this means that all instances of the same function must share all possible
state accessed by that function. E.g., If I want to execute a function on behalf of a
customer, then a warm faaslet must have the ENTIRE customer database?
- A protofaaslet is a pre-configured image containing all the code common to every
instance of a faaslet. Improves cold start time.
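The state-locality scheduling policy can be sketched as a few lines (hypothetical names; a simplification of whatever FAASM actually does): prefer a host that already holds the function's state, otherwise fall back to the least-loaded host, which will pull state from the global tier.

```python
def schedule(func_state_key, hosts):
    """Pick a host for a faaslet.

    hosts: dict mapping host name -> {"state": set of state keys, "load": int}
    """
    # Hosts that already hold a local copy of the needed state.
    with_state = [h for h, info in hosts.items()
                  if func_state_key in info["state"]]
    if with_state:
        # Prefer state locality; break ties by load.
        return min(with_state, key=lambda h: hosts[h]["load"])
    # No local copy anywhere: least-loaded host pulls from the global tier.
    return min(hosts, key=lambda h: hosts[h]["load"])

hosts = {
    "host-a": {"state": {"modelA"}, "load": 5},
    "host-b": {"state": set(), "load": 1},
}
print(schedule("modelA", hosts))  # -> host-a (has the state, despite load)
print(schedule("modelB", hosts))  # -> host-b (no one has it; least loaded)
```

This also makes the worry in the note above concrete: locality only wins if a host's local state set actually covers what the function will touch.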
Eval
- How does FAASM state management improve efficiency and performance on parallel
machine learning training?
- How effective are protofaaslets at reducing initialization time and improving
throughput in inference serving?
- How does faaslet isolation affect performance in a linear algebra benchmark using a
dynamic language runtime?
- How does faaslet overhead compare to docker?
- Eval platform implements FAASM in Knative.
- Parallel machine learning training
- Time as a function of number of parallel workers: Faasm scales gracefully out
to 38 workers; as # of workers goes from 2->38, performance improves by about 5x. Knative
runs out of memory at 30 parallel workers.
- Network traffic as a function of number of parallel workers: Faasm exhibits relatively
stable BW from 2-38 workers, while Knative grows pretty linearly.
- Memory usage as a function of number of parallel workers: Faasm exhibits super slow
growth in memory consumption relative to Knative.
- Is there a difference in consistency of the parameters in the two models?
- Machine learning inference
- Faasm, as expected, has super low cold start overhead.
- But inference time is somewhat higher than Knative due to the compilation from
TensorFlow to WebAssembly.
- Language runtime performance (i.e., how does WebAssembly do)
- Faasm and Knative are comparable on Cython matrix multiplication.
- In this app, Faasm reduces network bandwidth by about 13%.
- Polybench : Faasm has comparable performance on all but 2 of the workloads
(the two exceptions are due to missing loop optimizations in WebAssembly).
- Python Suite : Faasm has fairly significant overhead on most of the benchmarks
(e.g., big integer arithmetic is particularly slow in WebAssembly).
- The cold-start results from the last section (Faaslet way faster) are unsurprising
given the design point and techniques.