The Design and Implementation of a Log-Structured File System
Rosenblum, Ousterhout (1992)
What kind of paper is this?
- Motivation
- Design
- Analysis
Motivation
- Widening I/O gap.
- Large caches reduce read traffic.
- Sequential writes improve write performance.
- Log representation is the only on-disk representation.
- Technology is improving CPU speed and disk capacity, but not
disk access times.
- Get rid of two problems in existing file systems:
- Small, unclustered accesses.
- Synchronous I/Os.
Log-Structured File Systems
- Cache data in memory.
- Coalesce multiple writes (data, inodes, directories, indirect
blocks) into a single large, sequential I/O.

Free Space Management
- Threading
- Leave live data in place.
- Write new data to available places.
- Problem: free space becomes fragmented; no large
contiguous runs remain.
- WAFL uses this technique.

- Cleaning
- Copy live data back into log.
- Reclaim space.
- Hybrid
- Use threaded segments.
- Clean on a per segment basis.
- Thread segments together.
- Paper claims segments are always written in their entirety
from front to back; not strictly true for partial-segment writes.

Recovery
- Know where the last modifications were (at the end of the log),
so we know where any inconsistencies might be.
- Use checkpoints and roll-forward to recover.
- Checkpoints
- Flush all modified data.
- Write inode map, segment usage table, and end-of-log
pointer to the checkpoint region.
- At reboot, read both checkpoint regions and initialize
structures from the newer one (see the sketch after this
section).
(assumption: the checkpoint time is written last, so a torn
checkpoint region is detectably stale)
- Checkpoint every 30 seconds (should really checkpoint after
some amount of data has been written).
- Roll-Forward
- Read segments written after the last checkpoint.
- Can lose recently created files if data blocks were
written, but inode was not.
- Use write-ahead logging to maintain consistency
between directory entries and inodes.
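
A hedged sketch of the reboot path in Python; checkpoint regions
are modeled as dicts, and every field name here is an assumption:

    def choose_checkpoint(region_a, region_b):
        """Read both regions; the newer timestamp wins. The time is
        written last, so a crash mid-checkpoint leaves a stale
        timestamp and the other region is chosen."""
        candidates = [r for r in (region_a, region_b) if r is not None]
        return max(candidates, key=lambda r: r["timestamp"])

    def recover(region_a, region_b, summaries_after_checkpoint):
        ckpt = choose_checkpoint(region_a, region_b)
        inode_map = dict(ckpt["inode_map"])      # inode number -> address
        seg_usage = dict(ckpt["segment_usage"])  # live bytes per segment
        # Roll-forward: replay segment summaries written after the
        # checkpoint to pick up recent inode updates.
        for summary in summaries_after_checkpoint:
            for ino, inode_addr in summary.get("inode_updates", []):
                inode_map[ino] = inode_addr
        return inode_map, seg_usage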
Cleaning
- Three-step algorithm
- Read N dirty segments.
- Identify which blocks are live.
- Write live data back to the log (sketched at the end of this
section).
- Identify each block
- Must know which block of which file is being
cleaned.
- Write segment summary at end of each segment.
- Identifies each file appearing in segment and each
block in each file.
- Overhead is one segment summary per partial
segment write.
- Cleaning Policies
- When should the cleaner run?
- How many segments should be cleaned at once?
- Which segments should be cleaned?
- How should cleaned blocks be grouped?
- This work addresses the last two questions (segment selection
and block grouping).
- Cleans a few tens of segments at a time -- requires a few
tens of megabytes of kernel main memory.
- Waits until clean segments are scarce; then begins
cleaning.
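
A sketch of the cleaner referenced above; summary entries are
modeled as (inode_number, block_number, version) tuples, and the
structure shapes are my assumptions (the version check mirrors the
paper's uid optimization for deleted or truncated files):

    def block_is_live(entry, block_addr, inode_map):
        ino, blkno, version = entry
        im = inode_map.get(ino)
        if im is None or im["version"] != version:
            return False              # file deleted/truncated since
        # Live iff the inode still points at this disk address.
        return im["block_addrs"].get(blkno) == block_addr

    def clean(dirty_segments, inode_map, log_tail):
        for seg in dirty_segments:                  # 1. read N segments
            for entry, addr, data in seg["blocks"]: # from segment summary
                if block_is_live(entry, addr, inode_map):  # 2. identify
                    log_tail.append(data)           # 3. rewrite at tail
            seg["state"] = "clean"                  # whole segment reusable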
Cleaning Analysis
- Define Write Cost (WC)
- Disk busy time per byte of new data written.
- Read N segments.
- Write back N*u segments' worth of live data (u = utilization).
- This creates N*(1-u) clean segments for new data.
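- Filling in the paper's formula: reading N segments and writing
back the live fraction costs N + N*u segment I/Os, and frees
room for N*(1-u) segments of new data, so
    write cost = (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1-u)
(e.g., u = 0.8 gives a write cost of 10).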

Cleaning Simulation
- Application of write cost.
- What assumptions are built into the discussion on page 35
and Figure 3?
- Files have only 1 block.
- Effectively ignores logging variants of FFS.
- Assumes all I/Os are randomly distributed across the
disk.
- Assumes no savings in I/O for multiple directory writes.
- Assumes FFS does no worse at higher utilization.
- What assumptions are made in simulation?
- No meta-data.
- Overlooks the maximum I/O size of the underlying
device.
- No reads.
- What is stabilization?
- Free space utilization and write cost.
- Greedy algorithm works just fine for Uniform.
- Greedy yields a worse write cost for Hot-Cold.
- Some segments have very cold data; do not drop
to threshold quickly.
- Hot segments drop to threshold quickly.
- When hot segments are rewritten, begin losing
space immediately.
- Solution: incorporate age (benefit) into selection
criteria.
- Use youngest modify time of any block in a
segment as the age of the segment.
- Derivation of cost-benefit
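- From the paper: cleaning a segment costs 1 + u segment I/Os
(read the whole segment, write back the live fraction u) and
frees (1-u) of a segment, which stays free longer when the
data is old, so clean the segment maximizing
    benefit / cost = ((1 - u) * age) / (1 + u)
(a sketch of both selection policies follows this section).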
- Simulates cost-benefit for Uniform (never looked at what
happens with greedy).
- How do you represent age?
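
A small sketch of both selection policies, assuming each segment
records its utilization u and the youngest modify time of any of
its blocks (the dict fields are illustrative):

    import time

    def greedy_pick(segments):
        # Greedy: always clean the least-utilized segment.
        return min(segments, key=lambda s: s["u"])

    def cost_benefit_pick(segments, now=None):
        # Cost-benefit: prefer segments that are both empty and cold.
        now = time.time() if now is None else now
        def ratio(s):
            age = now - s["youngest_mtime"]  # age per the note above
            return ((1 - s["u"]) * age) / (1 + s["u"])
        return max(segments, key=ratio)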
Performance
- Experience
- Current systems are not disk-bound.
- Claims complexity is less than that of Unix FFS.
- Micro-benchmarks
- Best-case LFS performance.
- Different file system block sizes on the two systems.
- Read-performance explanation: files packed more
densely.
- Write performance: acknowledges the new SunOS
implementation.
- Cleaning results
- Long-term usage patterns.
- Does not address disruption of cleaning.
- Cleaning better than simulated: attributed to files being
larger than in the simulations.
- Recovery
- Very fast!
- One second of recovery for 70 seconds of peak
usage.
Conclusions
- Can use the disk an order of magnitude more efficiently for
writes (70% of bandwidth vs. 5-10% for Unix FFS).