The Design and Implementation of a Log-Structured File System
Rosenblum, Ousterhout (1992)
What kind of paper is this?
- Motivation
- Design
- Analysis
Motivation
- Widening I/O gap.
- Large caches reduce read traffic.
- Sequential writes improve write performance.
- Log representation is the only on-disk representation.
- Technology is improving CPU speed and disk capacity, but not
disk access times.
- Get rid of two problems in existing file systems:
- Small, unclustered accesses.
- Synchronous I/Os.
Log-Structured File Systems
- Cache data in memory.
- Coalesce multiple writes (data, inodes, directories, indirect
blocks) into a single large, sequential I/O.

Free Space Management
- Threading
- Leave live data in place.
- Write new data to available places.
- Problem: free space becomes fragmented; no large
contiguous runs remain.
- WAFL uses this technique.

- Cleaning
- Copy live data back into log.
- Reclaim space.
- Hybrid
- Use threaded segments.
- Clean on a per segment basis.
- Thread segments together.
- Paper claims segments are always written in their entirety
from front to back; not strictly true for partial-segment writes.

Recovery
- Know where the last modifications were (at the end of the log),
so we know where any inconsistencies might be.
- Use checkpoints and roll-forward to recover.
- Checkpoints
- Flush all modified data.
- Write inode map, segment usage table, and end-of-log
pointer to the checkpoint region.
- At reboot, read both checkpoint regions and initialize
structures from the newer one (see the sketch after this
section).
(assumption: the checkpoint time is written last, so a torn
checkpoint region is detectably stale)
- Checkpoint every 30 seconds (should really checkpoint after
some amount of data has been written).
- Roll-Forward
- Read segments written after the last checkpoint.
- Can lose recently created files if data blocks were
written, but inode was not.
- Use write-ahead logging to maintain consistency
between directory entries and inodes.
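
A hedged sketch of the reboot path in Python; checkpoint regions
are modeled as dicts, and every field name here is an assumption:

    def choose_checkpoint(region_a, region_b):
        """Read both regions; the newer timestamp wins. The time is
        written last, so a crash mid-checkpoint leaves a stale
        timestamp and the other region is chosen."""
        candidates = [r for r in (region_a, region_b) if r is not None]
        return max(candidates, key=lambda r: r["timestamp"])

    def recover(region_a, region_b, summaries_after_checkpoint):
        ckpt = choose_checkpoint(region_a, region_b)
        inode_map = dict(ckpt["inode_map"])      # inode number -> address
        seg_usage = dict(ckpt["segment_usage"])  # live bytes per segment
        # Roll-forward: replay segment summaries written after the
        # checkpoint to pick up recent inode updates.
        for summary in summaries_after_checkpoint:
            for ino, inode_addr in summary.get("inode_updates", []):
                inode_map[ino] = inode_addr
        return inode_map, seg_usage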
Cleaning
- Three-step algorithm
- Read N dirty segments.
- Identify which blocks are live.
- Write live data back to the log (sketched at the end of this
section).
- Identify each block
- Must know which block of which file is being
cleaned.
- Write segment summary at end of each segment.
- Identifies each file appearing in segment and each
block in each file.
- Overhead is one segment summary per partial
segment write.
- Cleaning Policies
- When should the cleaner run?
- How many segments should be cleaned at once?
- Which segments should be cleaned?
- How should cleaned blocks be grouped?
- This work addresses the last two questions (segment selection
and block grouping).
- Cleans a few tens of segments at a time -- requires a few
tens of megabytes of kernel main memory.
- Waits until clean segments are scarce; then begins
cleaning.
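
A sketch of the cleaner referenced above; summary entries are
modeled as (inode_number, block_number, version) tuples, and the
structure shapes are my assumptions (the version check mirrors the
paper's uid optimization for deleted or truncated files):

    def block_is_live(entry, block_addr, inode_map):
        ino, blkno, version = entry
        im = inode_map.get(ino)
        if im is None or im["version"] != version:
            return False              # file deleted/truncated since
        # Live iff the inode still points at this disk address.
        return im["block_addrs"].get(blkno) == block_addr

    def clean(dirty_segments, inode_map, log_tail):
        for seg in dirty_segments:                  # 1. read N segments
            for entry, addr, data in seg["blocks"]: # from segment summary
                if block_is_live(entry, addr, inode_map):  # 2. identify
                    log_tail.append(data)           # 3. rewrite at tail
            seg["state"] = "clean"                  # whole segment reusable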
Cleaning Analysis
- Define Write Cost (WC)
- Disk busy time per byte of new data written.
- Read N segments.
- Write back N*u segments' worth of live data (u = utilization).
- This creates N*(1-u) clean segments for new data.
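- Filling in the paper's formula: reading N segments and writing
back the live fraction costs N + N*u segment I/Os, and frees
room for N*(1-u) segments of new data, so
    write cost = (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1-u)
(e.g., u = 0.8 gives a write cost of 10).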

Cleaning Simulation
- Application of write cost.
- What assumptions are built into the discussion on page 35
and Figure 3?
- Files have only 1 block.
- Effectively ignores logging variants of FFS.
- Assumes all I/Os are randomly distributed across the
disk.
- Assumes no savings in I/O for multiple directory writes.
- Assumes FFS does no worse at higher utilization.
- What assumptions are made in simulation?
- No meta-data.
- Overlooks the maximum I/O size of the underlying
device.
- No reads.
- What is stabilization?
- Free space utilization and write cost.
- Greedy algorithm works just fine for Uniform.
- Greedy yields a worse write cost for Hot-Cold.
- Some segments have very cold data; do not drop
to threshold quickly.
- Hot segments drop to threshold quickly.
- When hot segments are rewritten, begin losing
space immediately.
- Solution: incorporate age (benefit) into selection
criteria.
- Use youngest modify time of any block in a
segment as the age of the segment.
- Derivation of cost-benefit
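- From the paper: cleaning a segment costs 1 + u segment I/Os
(read the whole segment, write back the live fraction u) and
frees (1-u) of a segment, which stays free longer when the
data is old, so clean the segment maximizing
    benefit / cost = ((1 - u) * age) / (1 + u)
(a sketch of both selection policies follows this section).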
- Simulates cost-benefit for Uniform (never looked at what
happens with greedy).
- How do you represent age?
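
A small sketch of both selection policies, assuming each segment
records its utilization u and the youngest modify time of any of
its blocks (the dict fields are illustrative):

    import time

    def greedy_pick(segments):
        # Greedy: always clean the least-utilized segment.
        return min(segments, key=lambda s: s["u"])

    def cost_benefit_pick(segments, now=None):
        # Cost-benefit: prefer segments that are both empty and cold.
        now = time.time() if now is None else now
        def ratio(s):
            age = now - s["youngest_mtime"]  # age per the note above
            return ((1 - s["u"]) * age) / (1 + s["u"])
        return max(segments, key=ratio)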
Performance
- Experience
- Current systems are not disk-bound.
- Claims complexity is less than that of Unix FFS.
- Micro-benchmarks
- Best-case LFS performance.
- Different file system block sizes on the two systems.
- Read-performance explanation: files packed more
densely.
- Write performance: acknowledges the new SunOS
implementation.
- Cleaning results
- Long-term usage patterns.
- Does not address disruption of cleaning.
- Cleaning better than simulated: attributed to files being
larger than in the simulations.
- Recovery
- Very fast!
- One second of recovery for 70 seconds of peak
usage.
Conclusions
- Can use the disk an order of magnitude more efficiently for
writes (70% of bandwidth vs. 5-10% for Unix FFS).