CS 161: Operating Systems Spring 2016

Tuesday/Thursday 1:00-2:30

Pierce 301

James Mickens & Margo Seltzer

Home

CS161 Assignment 4: File Systems

In-class Design Review: Tuesday, April 12, 2016
Design Due: Thursday, April 14, 2016 at 5:00PM
Assignment Due: Wednesday, April 27, 2016 at 5:00PM (coursewide extension until Friday, April 29, 2016 at 5:00PM)

Objectives

Explain how recovery can be accomplished by using a journal to undo or redo modifications to the on-disk state.
Make tradeoffs between various possible journal record schemas.
Design a journal record schema (family of journal record types) that will allow your file system to be recovered after a failure.
Implement your journaling design and the corresponding recovery code so that your file system can be recovered after a failure.

Introduction

Up until this point, your OS/161 started anew every time it booted, blissfully unaware of any mayhem it may have unleashed on itself the last time it was running. Obviously, things in the real world aren't so clean. Operating systems must maintain state across boots using some kind of persistent storage. Often such storage is provided by disks, and the abstraction that operating systems usually provide for accessing a disk is a file system.

OS/161's native filesystem is called SFS (Simple File System). Though a relatively naive implementation, SFS boasts most of the basic features you would expect out of a filesystem:

Hierarchical directories
Large file support via multi-level indirect blocks
Fine-grained locking for increased concurrency
An in-memory buffer cache of filesystem data and metadata

All these and more are provided for you, free of charge, with the code you got at the beginning of the semester. So what's left for an ambitious CS161 student to do?

We're glad you asked! SFS has one glaring problem: it isn't resilient to crashes. All writes are buffered and are written to disk at an arbitrary time and in an arbitrary order. If OS/161 were to unexpectedly crash in the middle of operations (incredibly unlikely, we know, but bear with us), the data on disk could easily be left in an inconsistent state, rendering the filesystem unusable and with arbitrary amounts of data lost to eternity. That would be Bad™.

Your job is to modify SFS to be resilient in the face of crashes. This will entail two interrelated tasks:

Supplement normal SFS operation with journaling such that recovery after unexpected shutdown is possible.
Implement recovery code to process the journal (in order to restore the system to a consistent state) prior to mounting after a failure.

We provide an on-disk journal implementation; you define the journal records to be used, the code that issues them, and the code that interprets them at recovery time and makes whatever changes are necessary to recover the operations described.

Additionally, as with anything you implement, you should test your system as thoroughly as possible. We would like you to turn in, as part of your final submission, whatever automated tests you develop to verify your system is working correctly.

(If you would like a refresher on file system consistency, Eddie Kohler gave a fabulous colloquium on this topic that can be viewed online here. We recommend reviewing the first part of this talk for a lucid discussion of file system crash recovery. However, we do not expect anyone to implement Featherstitch for this assignment!)

Setup

Before you begin, you need to do some merging. We are providing you with implementations of several more system calls; you will need to pull this code and merge it. We've pushed the changes to the class distribution repository on code.seas.harvard.edu. To grab them, do:

    git pull git@code.seas.harvard.edu:cs161/os161-2016.git a4

(If you usually rebase, we recommend not doing so here.) The merge mostly adds new system call code in the file kern/syscall/more_syscalls.c, but it changes some other files as well. You will likely have merge conflicts. Hopefully, sorting them out should for the most part be straightforward, but it will depend on how extensively you've changed your code, and where.

The likely causes of merge conflicts:

Additional cases were added in the system call dispatcher kern/arch/mips/syscall/syscall.c to call the new system calls.
Declarations for the new calls were added to kern/include/syscall.h.
The new file was added to kern/conf/kern.conf.

Once you fix the conflicts, you can commit the merge. However, you're not done yet. If you look inside more_syscalls.c, you'll see the following functions:

sys_sync
sys_mkdir
sys_rmdir
sys_remove
sys_link
sys_rename
sys_getdirentry
sys_fstat
sys_fsync
sys_ftruncate

These functions are completely implemented for you. But if you look at the bottom of the file, you'll see that the implementations of sys_getdirentry, sys_fstat, sys_fsync, and sys_ftruncate depend on the file table, and in particular they use the solution set's file table interface and not your own from assignment 2. Therefore, they won't compile yet; you will need them to modify them to work with your file table.

This should be fairly straightforward to do. The file table interface they use is documented in a comment at the top of the file; your file table and open file code supports the same functionality and the same concepts, so you should be able to just replace the calls and structure fields used with your versions, or write fairly thin wrapper functions. If you don't want to deal with this up front it's fine to stub out these calls until you get far enough with the assignment that you need them working. (But do make sure this happens before you submit.)

You shouldn't have to change the other functions in the file at all, as they don't involve file handles and use only VFS-level functions (and copyin/out functions) that you most likely haven't touched. However, you should take the time to make sure you understand how all the new functions work.

Now, config an ASST4 kernel, run bmake depend, and build it. Make sure it still boots.

You are now ready to begin assignment 4. Commit everything and tag your repository asst4-start. Please tag only after committing the big merge, so that we don't see it as part of your diff.

Existing Infrastructure

Up until this point, you have not used SFS. Instead, you have been using emufs (the EMUlator File System). emufs amounts to little more than a pass-through to the underlying Linux filesystem on which sys161 runs. In fact, you probably noticed by now that all the user-level binaries get installed to the directory that you specified with the --ostree parameter when you ran configure. emufs simply accesses those files on the host system.

Beginning now, however, we will want to use a real filesystem. In our case, that filesystem is SFS. Fortunately for us, this switch will be transparent to the rest of the operating system because we are using the filesystem-independent VFS layer. The VFS layer takes high-level requests and dispatches them to the appropriate filesystem-specific code.

There are two main source directories that concern you:

kern/vfs -- VFS-layer code, including the buffer cache
kern/fs/sfs -- SFS filesystem-specific code

You will want to begin by acquainting yourself with both sets of code.

Code Overview

In kern/vfs:

buf.c

Contains the buffer cache code.

device.c

Implements support for vfs-level devices.

devnull.c

Provides the OS/161 equivalent of /dev/null, called "null:".

vfscwd.c

Contains the code that handles current working directories.

vfslist.c

Contains code that pertains to the entire list of mounted of file systems.

vfslookup.c

Contains file-system-independent name translation code. This handles device names; paths are handled in the file system. This separation is convenient in OS/161 but might not be such a good choice for a "real" OS.

vfspath.c

Supports operations on path names and implements the VFS-level operations.

vnode.c

Contains filesystem-independent utility and support code for vnodes.
In kern/fs/sfs:

sfs_balloc.c

Block allocation.

sfs_bmap.c

Block mapping and the guts of truncate.

sfs_dir.c

Low-level directory operations.

sfs_fsops.c

Implementations of the abstract VFS-level operations on a whole volume (FSOP_*).

sfs_inode.c

Inode-level operations.

sfs_io.c

Block I/O, both above and below the buffer cache.

sfs_jphys.c

Physical (on-disk) journal code.

sfs_vnops.c

Implementations of the abstract VFS-level operations on vnodes (VOP_*).

sfsprivate.h

Declarations of types and functions internal to SFS.
In userland/sbin:

dumpsfs

A utility that dumps out the state of an SFS filesystem for debugging.

mksfs

The program that creates new SFS volumes.

sfsck

The checker program for SFS.

Also take note of design/jphys.txt, which explains the physical journal code provided in sfs_jphys.c.

Code reading questions

What happens (in broad terms) if sys_remove is called on a file that is currently open by another running process? Will a read on the file by the second process succeed? A write? Why or why not? (1 point)
Describe the control flow, starting in the system call layer and proceeding through the VFS layer to reach SFS, that occurs for each of the following system calls. You need only trace the names of the functions that are called. Feel free to skip secondary or minor code paths that don't lead into SFS. (1 point)
1. SYS_open
2. SYS_write
3. SYS_mkdir
Now similarly describe the control flow within SFS for each of the same operations. Follow how we get to logic that makes specific updates to on-disk data structures (stopping where buffer-handling code takes over), and feel free to skip code paths that don't do so. (2 points)
And finally, for mkdir, describe what happens to the block containing the newly allocated directory inode from the time the logic you've described in #3 ends to the time the inode eventually reaches the disk device. Trace the code that handles it, at a similar level of detail: function names and a very brief description of what happens at each point is enough. (2 points)
Which of the following lock acquisitions are safe? (1 point)
1. lock a vnode, lock the vnode table
2. lock a vnode, lock the freemap
3. lock the physical journal, get (lock) a buffer
4. lock foo, lock foo/bar
5. lock foo, lock foo/..
Name two problems that the current implementation of sfsck can repair, and one that it cannot. (1 point)
Describe four different types of inconsistent states the filesystem could end up in after a crash, and give a sequence of events that could lead to each. You will probably want to explain in your design doc how you handle these problems. (2 points)

Using SFS

We have prepared a separate handout with a primer on using SFS in OS/161.

Before you start coding, it's probably a good idea to create an SFS volume, mount it, and run some tests on it. Everything should work correctly provided you don't crash the system. If it doesn't work, better to hammer out the problems now rather than after you have the hood open and all the guts of SFS strewn around. The fs tests from the OS/161 boot menu should all run; so should the file- and directory-related tests from testbin/. This includes the stress tests like dirconc and psort.

Also note that a volume that is unmounted before shutdown should be reported clean by sfsck. We strongly recommend that you make sure that when you don't crash your system, the file system tests run, a cleanly unmounted file system does check cleanly and that all is right with the world. If you do not do this, should you later encounter bugs, you won't know whether they were there before you started making modifications for this assignment.

The userland SFS utilities are compiled both for OS/161 and for the System/161 host OS (Linux in the appliance); the host versions can be run from the hostbin directory under $(OSTREE). This is often useful for scripting tests; it can also occasionally be useful for debugging them. If (read: most likely when) you need to adjust these tools, make sure both the host and native versions continue to work.

Making SFS Crash-Resilient

As we said before, SFS is not currently resilient to crashes. While it does employ fine-grained locking to enable safe concurrent operation, no care is taken regarding when, how, and where different structures are written to disk. Your job is to modify SFS so that it can reliably recover from a crash. Specifically, you will turn SFS into a journaling filesystem: changes to the filesystem are written to a journal using write-ahead logging (WAL) so that the filesystem can always be recovered to a consistent state following a crash.

As discussed in class, there are multiple ways to do journaling. You can journal whole blocks; that is, every time you change something you write the whole updated block involved to the journal. This is called "physical block journaling" or sometimes just "physical journaling" and it's how e.g. the Linux ext3 file system works. In some ways it is simple; but it is also not especially efficient and runs into problems if you ever want to journal anything that isn't just an updated copy of a disk block.

The alternative to journaling whole blocks is to write small journal records, which is known as "record journaling". Each record describes a single change; this avoids writing out a lot of redundant data and increases efficiency. Also, it's possible to issue records that describe multiple related changes at once, or things that aren't strictly speaking changes to on-disk structures like checksums of user data. For these reasons record journaling is generally preferred to block journaling.

The content actually appearing in journal records can span a wide continuum, from very high-level records describing high-level actions (e.g. "Remove file F, with inode number I and prior link count N, from directory D") to very low-level records describing low-level modifications (e.g. "change the word at offset 36 in block 1234 from 15 to 18"). These extremes are sometimes referred to as "logical record journaling" and "physical record journaling". Logical record journaling typically yields smaller journals and smaller I/O footprints; however, in practice it's substantially harder to get right and leads to substantially more complex recovery code. Because recovery code is difficult and hard to debug under the best of circumstances, pragmatic implementations tend to lean towards physical records.

In this assignment you will be implementing record journaling, and while you get to pick the exact set of log records you use, we encourage/expect/require you to stick to low-level physical records.

You will need to:

Modify SFS to make calls to the provided on-disk journal (log) to record the information you deem necessary.
Write code that processes the journal after a crash to recover the consistency of the system.

How you do this is entirely up to you, but there are a few ground rules:

You may modify the on-disk file system format, and you need not worry about compatibility with our or your classmates' versions.
You may make some I/O synchronous. It is unavoidable that you will need to ensure certain things are written to disk before proceeding. You should do your best, however, to minimize how often this is necessary, and all such activity should be explicitly justified.
You may not make all I/O synchronous. Though it would indeed make SFS crash-resilient, making every I/O synchronous would make the filesystem painfully slow.
You may lose recently written user data following a crash. Unless the user has specifically requested that their data be flushed to disk with sync or fsync, it is okay (actually inevitable) to lose some user data that was written shortly before a crash. Obviously you should endeavor to minimize such losses.
You may not expose garbage data to users. As was mentioned in class, it is relatively easy for one file to include another file's data after a crash. Do not let this happen.
You may not corrupt the file system. After running recovery the on-disk state of file system must be self-consistent and should be reasonably in accord with what was being done before crashing.
You must ensure that the file system utilities (mksfs, dumpsfs, and sfsck) work. They must be updated to conform to any changes you make to the file system.
You may not scan the entire volume at recovery time (using fsck or the like).

Note that even when doing low-level record journaling, the journal records can be designed at a wide variety of levels of abstraction, as low-level changes to specific bytes or integers, e.g., "change bit 546 of block 4 from 0 to 1", or logical updates to particular data structures, e.g., "allocate inode 546." What records you have, what information they contain, and how they're arranged for recovery is yours to design. (If you find yourself wanting to issue single records for whole file system operations, like "create a directory", you are heading into logical record territory and we advise a different course.)

As implied by the rules above, you do not need to journal user data. This is a fairly typical behavior for a journaling file system.

We provide you with code for an on-disk journal that manages records as arbitrary blobs. You may change this code in the course of doing the assignment; but ideally you will not need to. (If you find that you do want to change the on-disk journal code, please contact the course staff; it probably means either we didn't explain something adequately or it means something about the code should be improved for next time, and either way we'd like to know about it.) Your code will generate those blobs and handle them at recovery time, and take care of knowing what types of records exist and what they mean.

Interfacing to the physical journal

The physical journal code is documented in design/jphys.txt. We aren't going to repeat that material here, so please read the file, and if you have questions please ask.

That said, the interface to the physical journal is divided into roughly three parts:

Startup and shutdown;
reading the journal during recovery;
and writing the journal during operation.

Code to call the startup and shutdown functions has already been added to SFS for you; you will find this code (and the places you should insert calls to your own recovery routines) in the mount and unmount logic in sfs_fsops.c. Your recovery code will use the reading interface to scan the journal. You can scan forwards or backwards, and as many times as you wish, as needed by your recovery plan.

The writing interface can also be subdivided as follows:

Adding records to the journal;
flushing the journal when needed;
and throwing away ("trimming") old journal records.

You will be writing the code for adding records to the journal, and you also need to write code that requests flushing the journal to enforce write-ahead logging. You must also use flush and trim to provide checkpointing, so that the journal does not grow without bound.

You will find that two hooks have already been added to sfs_writeblock (which is the interface used by the buffer cache to write out dirty filesystem blocks) to make sure that the journal itself gets written out in order and to keep the journal code informed of which journal blocks have been written out. Unless we have blundered badly somewhere you should not need to change those hooks or add more calls to the functions they use.

You will find that sfs_jphys_write accepts a callback function. This allows you to update data structures using the LSN of a new record before the journal lock inside jphys is released. The reason for this is discussed in jphys.txt (and will be talked about in section) — it is possible you won't need this hook, but be aware that it's there and of why it's provided, and that if you use it this creates an ordering requirement between the journal lock and any locks you use in the callback.

Limits of the physical journal code

The journal code has some limits that you should be aware of:

Record type codes must be in the range 0-127. If you need more than this (or even this many, or half this many) you're probably doing something wrong.
Record sizes must be even (aligned to a 2-byte boundary). This is due to bit-packing in the record header.
The maximum record size is 508 bytes. Probably you shouldn't have records even close to this size.
Log sequence numbers use a 64-bit type but are actually only up to 48 bits wide. (This is still far more than we will ever need.)

Also, there is currently no logic in the journal code to detect if the on-disk journal head (the end new records are written to) runs into the on-disk journal tail (the oldest record still alive and not trimmed away). This is a bug, or at least a fairly significant missing feature. Without such logic, if this happens you will know at recovery time because the physical journal code won't load the journal and will complain about finding no trim record or not being able to find the tail. As long as you are making regular checkpoints this should not be a serious problem.

Some things to know when debugging

Some useful tidbits:

The journal is a circular buffer of disk blocks. Its start location and size are stored in the superblock. Normally it follows the freemap in the on-disk layout, but it can theoretically appear anywhere.
The record header is 8 bytes long.
The physical journal code has some records of its own, which the iterator interface hides from you so you don't need to cope with them. But you will see them if you dump the journal with dumpsfs.
There are "trim" records, which keep track of where the on-disk tail is. These are used when the journal loads to figure out what region of the journal is live and should be iterated.
There are also "pad" records, which are used to fill up journal blocks when there isn't space for the next record or if a journal block has to be written out before it's full.
Padding of less than 8 bytes at the end of a block is implicit.
Records don't span blocks.
Zeroed-out blocks are treated as if they contained pad records.
There should always be at least one trim record. When you run mksfs the journal is initialized with one.
Because of this the journal should never be completely empty.

Testing (or: The Doom Counter and You)

Testing recovery code is difficult, in part, because it entails being able to anticipate everything that could ever possibly go wrong, which is next to impossible. To help you, we have helpfully added a feature to System/161 we like to call the "Doom Counter." Every time a write is started on a disk device, the counter is decremented. If the value of the counter ever falls to 0, the machine trips over its power cord and pulls it out of the wall, crashing your system. Thus you have a convenient way to purposefully crash your system at arbitrary (but interesting) points in time. The -D option to sys161 allows you to set the Doom Counter; be sure to test with lots of different values.

Using frack

The frack test is a tool for playing back simple test workloads. It is specifically intended for testing recovery. The basic model is as follows:

First you run it in "do" mode, when it runs a workload.
You crash the system during (or immediately after) the workload.
Then you run recovery.
After running recovery you run it again in "check" mode, where it checks the file system state against the workload it ran.

You run it as follows, using the doom counter to trigger crashes:

   frack do workload
   (crash, reboot, run recovery)
   frack check workload

It has several dozen workloads coded into it; you can run frack list to get the list. Currently it is underdocumented and the best way to see what the workloads are is to read the source.

Note that frack is newish and may still have some problems. It is also still somewhat too aggressive about demanding that user data isn't lost.

Using poisondisk

The poisondisk program clears an entire disk to a recognizable (non-zero) byte pattern. The idea is that you can run this, then run mksfs to create a file system, then run a test workload. If the poison values show up in files (or worse, in metadata) after testing, you know you have a problem with exposing uninitialized disk blocks.

You can run poisondisk from inside OS/161 as testbin/poisondisk or outside as hostbin/host-poisondisk.

The poison value (as a byte) is 0xa9, or 169; full words are 0xa9a9a9a9, or 2846468521 (as unsigned decimal) or -1448498775 (as signed decimal). In iso-latin-1 this character prints as a copyright symbol; in UTF-8 a sequence of 0xa9 bytes is an invalid multibyte character.

Miscellaneous Helpful Hints

Yes, your system must be able to tolerate crashing while running recovery code. Yes, we will test this.
It is recommended that you test with multiple SFS volumes at once. We will test this, as well.
A nice benefit of running in a virtual environment is that sys161 "disks" are in fact simply files on the host system. This gives you a convenient way to swap in and out arbitrary disk images, compare disk images before and after crashes (and before and after running sfsck), run with arbitrary Doom Counter values, etc.
dumpsfs is a utility for printing out filesystem state in a human-readable format. It can dump the journal, but it of course doesn't know how to print your log records. Teaching it how to do that is worth your time.
To format a volume, use a command like:
mksfs lhd1raw: mydrive
or
hostbin/host-mksfs LHD1.img mydrive
To mount a drive, use the mount menu command, as in:
mount sfs lhd1:
You can give OS/161 menu commands to run as a command line argument to sys161. For example:
sys161 -D 12345 kernel 'mount sfs lhd1:; cd lhd1:; p /testbin/dirconc; q'
Give your swap disk the nodoom flag in sys161.conf to get more control with the doom counter.
As you change the filesystem, you may (usually will) need to modify the userlevel SFS utilities as well. And, when you change them, don't forget to recompile and reinstall.
When you step into a VOP call in gdb, it always takes you to vnode_check first. The way to get to the real function is to hit 's' (step) on the line containing vnode_check's close-brace.

What to hand in

Design Document

As with the previous assignments, we will be conducting peer design reviews in class, on Tuesday, April 12. You should have a complete first revision of your design document complete by then. Bring two hard copies of that design to class and commit that design to your team repository (in the directory os161/submit/asst4). After the peer review, make any changes to your design document and note what you have changed, then push again to your repository.

By 5:00 pm on Thursday, April 14, please submit your design document by putting it in the submit/asst4 subdirectory of your OS/161 repository as design_document.{pdf,md,txt}. Include solutions to the code reading questions in code_reading.txt in the same directory. Please commit and push to code.seas with a message that clearly identifies the submission, and verify that the submission is visible on the grading server.

Your design document submission must include your plan for converting SFS to being a crash-resilient journalled filesystem. Some things you should consider for your design document include:

What will your log records look like?
What will their corresponding recovery routines look like?
What locking will you need?
What is the overall control flow of your recovery process? What will it have to do?
How will you know that recovery is working?
What other problems do you anticipate?
What is your implementation timeline? Who will get what done by when?

As usual, your design document is worth 30% of your assignment grade.

Implementation

Your final submission, due at 5:00 pm on April 29, should include (along with your code) your revised design document and your sys161.conf file. Place all non-code and config files in the directory submit/asst4 under the top level of the OS/161 source.

You should submit by tagging your repo with asst4-submit, and then pushing your final working copy of your source along with your tags to your team git repository. As always, if you are taking late days please be in contact with your TF; and also, if you have to retag your submission or you do anything else with git such that git-level slipups might cause your TF to grade the wrong version, please send your TF the changeset hash of the right version as a precaution. Refer to the previous assignment specifications if you need a refresher on how to submit.

You must verify your intended submission is displayed on the grading server; we will not accept submissions not visible on the grading server.