A Fast File System for UNIX
McKusick, Joy, Leffler, Fabry (1984)
What kind of paper is this?
The Story
- The original UNIX file system has some performance problems.
- It also has some functionality problems.
- We set out to build a new system to fix both.
- Our system can be up to 10x faster.
The Original File System
- A disk can be divided into partitions; each partition holds one file system.
- Blocks are 512 bytes. This is the unit of transfer to/from disk.
- File IO is buffered in main memory.
- Applications need not worry about block alignment.
- There are no constraints on where data are placed.
- As a result, it cannot provide high throughput to/from the disk.
- Internals
- File system is described by a superblock.
- Superblock contains the number of data blocks and the number of inodes (i.e.,
the maximum number of files that can be created).
- It also holds a pointer to the free list of blocks.
- Files and directories stored as they are today with inodes, data blocks,
direct blocks, indirect blocks, etc.
Design implications
- Inodes and data blocks are allocated to separate parts of the partition:
accessing a file incurs a long seek between the two regions.
- Files in the same directory share no special allocation strategy:
accessing the inodes for all the objects in a directory incurs lots of
random seeks.
- No attempt to allocate the blocks of a file contiguously or near each other:
many seeks while reading a file sequentially.
Early Improvements
- 512-byte blocks => 1024-byte blocks.
- Ordering writes to improve recoverability.
The New File System
- Replicate the superblock to protect against single-block catastrophic loss.
- Minimum block size increased to 4 KB (the block size can be any power of two
greater than or equal to 4 KB).
- Divide partition into cylinder groups.
- Each cylinder group contains: copy of superblock, group metadata
including allocation bitmaps, and a bunch of inodes and data blocks.
- Cylinder group metadata is offset in each group to avoid all metadata
falling on the first platter of a disk.
- As most files are small, allocating them in 4 KB blocks wastes disk space; the
solution is to add a fragment size that is the minimum unit of allocation.
- A block is broken up into 2, 4 or 8 fragments (decided at file system
create time).
- Minimum fragment size == sector size == 512 bytes
- Bitmaps record free space in fragments; finding a block requires finding
the right number of aligned, consecutive free fragments.
- Only the last block of a file can be a fragment and only if that last block
is a direct block (i.e., once you go to indirect blocks, you cease to
allocate fragments).
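The block/fragment rules above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's code: the 4 KB block, 4 fragments per block, and 12 direct pointers are assumed values, and the names layout, find_free_block and find_free_frags are made up for this sketch. The free map is a list of booleans, one per fragment.

```python
BLOCK_SIZE = 4096        # bytes; assumed 4 KB block size
FRAGS_PER_BLOCK = 4      # assumed; 2, 4 or 8 are the allowed choices
FRAG_SIZE = BLOCK_SIZE // FRAGS_PER_BLOCK
NUM_DIRECT = 12          # assumed number of direct block pointers per inode


def layout(file_size):
    """Return (full_blocks, tail_fragments) for a file of file_size bytes."""
    full_blocks, tail_bytes = divmod(file_size, BLOCK_SIZE)
    if tail_bytes == 0:
        return full_blocks, 0
    if full_blocks >= NUM_DIRECT:
        # The last block is reached through an indirect block: no fragments.
        return full_blocks + 1, 0
    tail_frags = -(-tail_bytes // FRAG_SIZE)  # ceiling division
    if tail_frags == FRAGS_PER_BLOCK:
        # A tail that needs every fragment is simply a whole block.
        return full_blocks + 1, 0
    return full_blocks, tail_frags


def find_free_block(free_map):
    """Index of the first fragment of a fully free, block-aligned run, or None."""
    for start in range(0, len(free_map) - FRAGS_PER_BLOCK + 1, FRAGS_PER_BLOCK):
        if all(free_map[start:start + FRAGS_PER_BLOCK]):
            return start
    return None


def find_free_frags(free_map, n):
    """Index of the first run of n free fragments inside a single block, or None."""
    for block_start in range(0, len(free_map), FRAGS_PER_BLOCK):
        block = free_map[block_start:block_start + FRAGS_PER_BLOCK]
        for offset in range(len(block) - n + 1):
            if all(block[offset:offset + n]):
                return block_start + offset
    return None
```

With these assumed sizes, a 5000-byte file takes one full block plus one 1 KB fragment, whereas with 4 KB blocks alone it would take two full blocks.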
Tuning the File System to the Underlying Device
- Goal is to allow tuning the FS to match the characteristics of the
processor and device.
- Smart Allocation:
- Try to allocate a block on the same cylinder as the previous block.
- Ideally, these two blocks are placed at rotationally optimal positions
to achieve near full disk bandwidth transfer.
- The distance between 'rotationally optimal' blocks depends on things such as
whether the processor must handle an interrupt between requests.
- Key disk characteristics such as blocks-per-track, number of tracks,
number of platters, etc. also influence layout.
- Facilitate block allocation by storing an index from each of 8 rotational
positions to available blocks.
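A rough way to see how the 'rotationally optimal' distance could be derived: compute how many block slots pass under the head while the processor is busy between requests, and skip that many. The formula and the names rotational_skip and next_position are assumptions made for this sketch, not taken from the paper.

```python
import math


def rotational_skip(rpm, blocks_per_track, delay_ms):
    """Block slots to skip so the next block arrives once the CPU is ready.

    delay_ms is the assumed processor time between two transfer requests
    (for example, interrupt handling plus scheduling the next transfer).
    """
    ms_per_revolution = 60_000 / rpm
    ms_per_block = ms_per_revolution / blocks_per_track
    return math.ceil(delay_ms / ms_per_block)


def next_position(current, skip, blocks_per_track):
    """Rotational slot for the next block of the file on the same track."""
    return (current + 1 + skip) % blocks_per_track
```

For example, on an assumed 3600 RPM disk with 8 blocks per track, each block takes about 2.08 ms to pass under the head, so a 4 ms processor delay means skipping 2 slots. With no delay, blocks can be laid out consecutively.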
Two-tier Layout Policy
- Global Allocation: Try to cluster related information
- Try to place the inodes for files in a directory in the same cylinder group
- Create directory inodes in groups with more than the average number of free
inodes.
- Place blocks of a given file in the same cylinder, ideally at optimal
rotational positions.
- However, once files get large (allocate an indirect block), move to a
different cylinder group to avoid a single file consuming all the blocks
in a group.
- Local Allocation
- Called by the global allocator, which asks for specific blocks.
- If the requested block is available, return it.
- If the block isn't available, choose a block that is rotationally closest.
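The two tiers can be sketched as two small functions. This is a simplified illustration under assumptions, not the actual allocator: the function names are made up, the tie-break among candidate groups (fewest existing directories) is an assumption of this sketch, and 'rotationally closest' is taken to mean the fewest slots the disk must rotate forward from the requested slot.

```python
def pick_directory_group(free_inodes, dir_counts):
    """Global tier: choose a cylinder group for a new directory.

    free_inodes[g] and dir_counts[g] describe cylinder group g.
    """
    average = sum(free_inodes) / len(free_inodes)
    candidates = [g for g, n in enumerate(free_inodes) if n > average]
    if not candidates:
        # Every group sits exactly at the average: consider all of them.
        candidates = list(range(len(free_inodes)))
    # Assumed tie-break: prefer the group with the fewest directories.
    return min(candidates, key=lambda g: dir_counts[g])


def allocate_local(free_slots, requested, slots_per_track):
    """Local tier: return the requested slot or the rotationally closest one.

    free_slots is a set of free rotational slots on one cylinder; the chosen
    slot is removed from it. Returns None when nothing is free, leaving the
    caller to look elsewhere.
    """
    if not free_slots:
        return None
    if requested in free_slots:
        chosen = requested
    else:
        # Forward rotational distance from the requested slot (assumption).
        chosen = min(free_slots,
                     key=lambda s: (s - requested) % slots_per_track)
    free_slots.remove(chosen)
    return chosen
```

Given free inode counts [10, 50, 40, 5], only groups 1 and 2 are above the average of 26.25, so the new directory goes to whichever of those two holds fewer directories.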
Performance
- Listing directories improves by factor of 2-8 (depending on directory
size and ratio of subdirectories to files).
- Almost an order of magnitude improvement in read throughput
- Depending on the bus, achieve 25-50% of disk bandwidth.
- Write bandwidth about a factor of 5 improvement.
Functional Improvements
- File names can be longer (up to 255 bytes per component; formerly limited to 14 bytes)
- Support for advisory locking (flock)
- Added symbolic links
- Added atomic rename
- Added quotas