FAASM: Lightweight Isolation for Efficient Stateful Serverless Computing
Shillaker, Pietzuch (2020)
What kind of paper is this?
- Yet another isolation mechanism -- containers with shared memory. (That is
not a criticism, although seeing this convinces me more than ever that the work
Sid is doing is critical!)
- We could also call this: I built a thing; it's better than the other things.
The Story
- Serverless is now a thing: it isolates functions from each other.
- Sometimes functions need to share; failure to do so is wasteful.
- Two key problems: data access overheads and container resource footprint.
- Faaslets/FAASM allow memory sharing using software-fault isolation (SFI).
- Jobs that need to share memory run way faster and consume less memory.
Margo Rant
- Every time I see someone say things like, "Data access is slow on
serverless" I get very confused. It seems to me that by its very nature
serverless is NOT designed for persistent data -- that is, it's all about
running in a stateless fashion.
- I'm thinking that we really have the wrong abstraction here (and we're
going to stretch serverless until it's unrecognizable).
- In my head, truly serverless should be about not needing persistent state.
- Then we need something else that provides granular decomposition (FaaS)
and persistent state.
- Perhaps the first class is just too small to be interesting?
- Or maybe serverless is just a terrible name?
Requirements for a Serverless Mechanism with better Data Properties
- Strong memory and resource isolation
- Efficient state sharing
- Scaling state across multiple hosts
- Low memory footprint
- Fast instantiation
- Multiple programming languages
Contributions
- Lightweight isolation via SFI (memory), cgroups (CPU), network namespaces and traffic shaping.
- Co-location of data in a shared address space. Locally use shared memory;
globally use "distributed access".
- Warmstart via pre-initialized snapshots. (OS independent)
- Standard POSIX API (with minimal changes).
- Evaluation demonstrating runtime, memory, and network traffic reductions.
Faaslet Function Overview
- Compile functions to WebAssembly for memory safety and control flow integrity.
- CPU isolation via Linux cgroups.
- Fair sharing via Linux CFS (each function runs in its own thread).
- Fair and secure networking via network namespaces, virtual network interfaces,
and traffic shaping (enforces ingress and egress traffic limits).
- Does not offer full POSIX -- instead a very limited set of functions.
- Shared memory via a new shared-region abstraction added to WebAssembly. (Each
faaslet gets a contiguous memory region; shared regions are appended to each
faaslet's region.)
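The shared-region idea above can be sketched roughly as follows. This is a toy model, not FAASM's implementation: each faaslet's private linear memory is a `bytearray`, and a shared region is a `memoryview` over one common buffer, logically appended after the faaslet's private memory (all class and method names here are hypothetical).

```python
class Faaslet:
    """Toy model of a faaslet's linear memory plus appended shared regions."""

    def __init__(self, private_size: int):
        # Private linear memory, analogous to a WebAssembly memory.
        self.memory = bytearray(private_size)
        self.shared_regions = []  # list of (base_offset, memoryview) pairs

    def map_shared_region(self, shared_buf: bytearray) -> int:
        # The shared region is logically appended after all existing memory;
        # we record its base offset and keep a view onto the shared buffer.
        base = len(self.memory) + sum(len(v) for _, v in self.shared_regions)
        self.shared_regions.append((base, memoryview(shared_buf)))
        return base

# One underlying buffer mapped into two faaslets.
shared = bytearray(8)
a, b = Faaslet(64), Faaslet(32)
a.map_shared_region(shared)
b.map_shared_region(shared)

# A write through faaslet a's view is visible to faaslet b.
a.shared_regions[0][1][0] = 42
print(b.shared_regions[0][1][0])  # -> 42
```

The point of the sketch: both faaslets keep their own private memory for isolation, while the shared buffer gives zero-copy local sharing.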
Local and Global State
- Distributed Data Objects (DDOs) are language-level classes that expose high-level
state interfaces, implemented using the Faaslet key-value interface.
- Share in-memory access locally; global access across hosts.
- Faaslets can push to/pull from the global tier.
- If you want global consistency, you have to request global locks; else you can (often)
use local locks.
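The local/global two-tier pattern can be sketched like this. All names are hypothetical (this is not FAASM's real API); the global tier is stood in for by an in-process dict rather than a distributed KV store.

```python
import threading

GLOBAL_TIER = {}    # stand-in for the distributed key-value store
GLOBAL_LOCKS = {}   # per-key "global" locks (single-process stand-in)

class DDO:
    """Toy Distributed Data Object: local copy plus push/pull to a global tier."""

    def __init__(self, key: str):
        self.key = key
        self.local = None                   # locally shared in-memory copy
        self.local_lock = threading.Lock()  # cheap lock for co-located faaslets

    def pull(self):
        # Fetch the authoritative value from the global tier.
        self.local = GLOBAL_TIER.get(self.key)
        return self.local

    def push(self):
        # Publish the local copy to the global tier.
        GLOBAL_TIER[self.key] = self.local

    def update(self, value, global_consistency=False):
        if global_consistency:
            # Cross-host consistency requires the global lock.
            lock = GLOBAL_LOCKS.setdefault(self.key, threading.Lock())
            with lock:
                self.local = value
                self.push()
        else:
            # Co-located faaslets can often get away with a local lock only.
            with self.local_lock:
                self.local = value

d = DDO("weights")
d.update([1.0, 2.0], global_consistency=True)
print(DDO("weights").pull())  # -> [1.0, 2.0]
```

The design choice this illustrates: most updates stay cheap (local lock, no network), and you pay for global locking only when you explicitly ask for cross-host consistency.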
FAASM Runtime
- Interacts with serverless infrastructure and provides scheduling, execution and
state management for faaslets.
- Scheduler tries to schedule faaslets where they have local state.
- I think this means that all instances of the same function must share all possible
state accessed by that function. E.g., If I want to execute a function on behalf of a
customer, then a warm faaslet must have the ENTIRE customer database?
- A protofaaslet is a pre-configured image containing all the code common to every
instance of a faaslet. Improves cold start time.
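The state-locality scheduling policy can be sketched as a few lines (hypothetical names; a simplification of whatever FAASM actually does): prefer a host that already holds the function's state, otherwise fall back to the least-loaded host, which will pull state from the global tier.

```python
def schedule(func_state_key, hosts):
    """Pick a host for a faaslet.

    hosts: dict mapping host name -> {"state": set of state keys, "load": int}
    """
    # Hosts that already hold a local copy of the needed state.
    with_state = [h for h, info in hosts.items()
                  if func_state_key in info["state"]]
    if with_state:
        # Prefer state locality; break ties by load.
        return min(with_state, key=lambda h: hosts[h]["load"])
    # No local copy anywhere: least-loaded host pulls from the global tier.
    return min(hosts, key=lambda h: hosts[h]["load"])

hosts = {
    "host-a": {"state": {"modelA"}, "load": 5},
    "host-b": {"state": set(), "load": 1},
}
print(schedule("modelA", hosts))  # -> host-a (has the state, despite load)
print(schedule("modelB", hosts))  # -> host-b (no one has it; least loaded)
```

This also makes the worry in the note above concrete: locality only wins if a host's local state set actually covers what the function will touch.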
Eval
- How does FAASM state management improve efficiency and performance on parallel
machine learning training?
- How effective are protofaaslets at reducing initialization time and improving
throughput in inference serving?
- How does faaslet isolation affect performance in a linear algebra benchmark using a
dynamic language runtime?
- How does faaslet overhead compare to docker?
- Eval platform implements FAASM in Knative.
- Parallel machine learning training
- Time as a function of number of parallel workers: Faasm scales gracefully out
to 38 workers; as # of workers goes from 2->38, performance improves by about 5x. Knative
runs out of memory at 30 parallel workers.
- Network traffic as a function of number of parallel workers: Faasm exhibits relatively
stable BW from 2-38 workers, while Knative grows pretty linearly.
- Memory usage as a function of number of parallel workers: Faasm exhibits super slow
growth in memory consumption relative to Knative.
- Is there a difference in consistency of the parameters in the two models?
- Machine learning inference
- Faasm, as expected, has super low cold start overhead.
- But inference time is somewhat higher than Knative due to the compilation from
TensorFlow to WebAssembly.
- Language runtime performance (i.e., how does WebAssembly do)
- Faasm and Knative are comparable on Cython matrix multiplication.
- In this app, Faasm reduces network bandwidth by about 13%.
- Polybench : Faasm has comparable performance on all but 2 of the workloads
(the two exceptions are due to missing loop optimizations in WebAssembly).
- Python Suite : Faasm has fairly significant overhead on most of the benchmarks
(e.g., big integer arithmetic is particularly slow in WebAssembly).
- The cold-start results from the last section (Faaslet way faster) are unsurprising
given the design point and techniques.