Home | Syllabus | Assignments | Resources | Piazza |
In-class Design Review: Thursday, April 16, 2015
Design Due: Saturday, April 18, 2015 at 6:00PM
Assignment Due: Friday, May 1, 2015 at 5:00PM
Objectives |
Introduction |
Up until this point, your OS/161 started anew every time it booted, blissfully unaware of any mayhem it may have unleashed on itself the last time it was running. Obviously, things in the real world aren't so clean. Operating systems must maintain state across boots using some kind of persistent storage. Often such storage is provided by disks, and the abstraction that operating systems usually provide for accessing a disk is a file system.
OS/161's native filesystem is called SFS (Simple File System). Though a relatively naive implementation, SFS boasts most of the basic features you would expect out of a filesystem:
We're glad you asked! SFS has one glaring problem: it isn't resilient to crashes. All writes are buffered and are written to disk at an arbitrary time and in an arbitrary order. If OS/161 were to unexpectedly crash in the middle of operations (incredibly unlikely, we know, but bear with us), the data on disk could easily be left in an inconsistent state, rendering the filesystem unusable and with arbitrary amounts of data lost to eternity. That would be Bad™.
Your job is to modify SFS to be resilient in the face of crashes. This will entail two interrelated tasks:
Additionally, as with anything you implement, you should test your system as thoroughly as possible. We would like you to turn in, as part of your final submission, whatever automated tests you develop to verify your system is working correctly.
(If you would like a refresher on file system consistency, Eddie Kohler gave a fabulous colloquium on this topic that can be viewed online here. We recommend reviewing the first part of this talk for a lucid discussion of file system crash recovery. However, we do not expect anyone to implement Featherstitch for this assignment!)
Setup |
Before you begin, you need to do some merging. We are providing you with implementations of several more system calls; you will need to pull this code and merge it. We've pushed the changes to the class distribution repository on code.seas.harvard.edu. To grab them, do:
git pull git://code.seas.harvard.edu/cs161/os161.git a4(If you usually rebase, we recommend not doing so here.) The merge mostly adds new system call code in the file kern/syscall/more_syscalls.c, but it changes some other files as well. You will likely have merge conflicts. Hopefully, sorting them out should for the most part be straightforward, but it will depend on how extensively you've changed your code, and where.
The likely causes of merge conflicts:
Once you fix the conflicts, you can commit the merge. However, you're not done yet. If you look inside more_syscalls.c, you'll see the following functions:
These functions are completely implemented for you. But if you look at the bottom of the file, you'll see that the implementations of sys_getdirentry, sys_fstat, sys_fsync, and sys_ftruncate depend on the file table, and in particular they use the solution set's file table interface and not your own from assignment 2. Therefore, they won't compile yet; you will need them to adjust your file table.
This should be fairly straightforward to do. The file table interface they use is documented in a comment at the top of the file; your file table and open file code supports the same functionality and the same concepts, so you should be able to just replace the calls and structure fields used with your versions, or write fairly thin wrapper functions. If you don't want to deal with this up front it's fine to stub out these calls until you get far enough with the assignment that you need them working. (But do make sure this happens before you submit.)
You shouldn't have to change the other functions in the file at all, as they don't involve file handles and use only VFS-level functions (and copyin/out functions) that you most likely haven't touched. However, you should take the time to make sure you understand how all the new functions work.
Now, config an ASST4 kernel, run bmake depend, and build it. Make sure it still boots.
You are now ready to begin assignment 4. Commit everything and tag your repository asst4-start. Please tag only after committing the big merge, so that we don't see it as part of your diff.
Existing Infrastructure |
Up until this point, you have not used SFS. Instead, you have been using emufs (the EMUlator File System). emufs amounts to little more than a pass-through to the underlying Linux filesystem on which sys161 runs. In fact, you probably noticed by now that all the user-level binaries get installed to the directory that you specified with the --ostree parameter when you ran configure. emufs simply accesses those files on the host system.
Beginning now, however, we will want to use a real filesystem. In our case, that filesystem is SFS. Fortunately for us, this switch will be transparent to the rest of the operating system because we are using the filesystem-independent VFS layer. The VFS layer takes high-level requests and dispatches them to the appropriate filesystem-specific code.
There are two main source directories that concern you:
Code Overview |
Code reading questions
Using SFS |
We have prepared a separate handout with a primer on using SFS in OS/161.
Before you start coding, it's probably a good idea to create an SFS volume, mount it, and run some tests on it. Everything should work correctly provided you don't crash the system. If it doesn't work, better to hammer out the problems now rather than after you have the hood open and all the guts of SFS strewn around. The fs tests from the OS/161 boot menu should all run; so should the file- and directory-related tests from testbin/. This includes the stress tests like dirconc and psort.
Also note that a volume that is unmounted before shutdown should be reported clean by sfsck. We strongly recommend that you make sure that when you don't crash your system, the file system tests run, a cleanly unmounted file system does check cleanly and that all is right with the world. If you do not do this, should you later encounter bugs, you won't know whether they were there before you started making modifications for this assignment.
The userland SFS utilities are compiled both for OS/161 and for the System/161 host OS (Linux in the appliance); the host versions can be run from the hostbin directory under $(OSTREE). This is often useful for scripting tests; it can also occasionally be useful for debugging them. If (read: most likely when) you need to adjust these tools, make sure both the host and native versions continue to work.
Making SFS Crash-Resilient |
As we said before, SFS is not currently resilient to crashes. While it does employ fine-grained locking to enable safe concurrent operation, no care is taken regarding when, how, and where different structures are written to disk. Your job is to modify SFS so that it can reliably recover from a crash. Specifically, you will turn SFS into a journaling filesystem: changes to the filesystem are written to a journal using write-ahead logging (WAL) so that the filesystem can always be recovered to a consistent state following a crash.
As discussed in class, there are multiple ways to do journaling. You can journal whole blocks; that is, every time you change something you write the whole updated block involved to the journal. This is called "physical block journaling" or sometimes just "physical journaling" and it's how e.g. the Linux ext3 file system works. In some ways it is simple; but it is also not especially efficient and runs into problems if you ever want to journal anything that isn't just an updated copy of a disk block.
The alternative to journaling whole blocks is to write small journal records, which is known as "record journaling". Each record describes a single change; this avoids writing out a lot of redundant data and increases efficiency. Also, it's possible to issue records that describe multiple related changes at once, or things that aren't strictly speaking changes to on-disk structures like checksums of user data. For these reasons record journaling is generally preferred to block journaling.
The content actually appearing in journal records can span a wide continuum, from very high-level records describing high-level actions (e.g. "Remove file F, with inode number I and prior link count N, from directory D") to very low-level records describing low-level modifications (e.g. "change the word at offset 36 in block 1234 from 15 to 18"). These extremes are sometimes referred to as "logical record journaling" and "physical record journaling". Logical record journaling typically yields smaller journals and smaller I/O footprints; however, in practice it's substantially harder to get right and leads to substantially more complex recovery code. Because recovery code is difficult and hard to debug under the best of circumstances, pragmatic implementations tend to lean towards physical records.
In this assignment you will be implementing record journaling, and while you get to pick the exact set of log records you use, we encourage/expect/require you to stick to low-level physical records.
You will need to:
How you do this is entirely up to you, but there are a few ground rules:
Note that even when doing low-level record journaling, the journal records can be designed at a wide variety of levels of abstraction, as low-level changes to specific bytes or integers, e.g., "change bit 546 of block 4 from 0 to 1", or logical updates to particular data structures, e.g., "allocate inode 546." What records you have, what information they contain, and how they're arranged for recovery is yours to design. (If you find yourself wanting to issue single records for whole file system operations, like "create a directory", you are heading into logical record territory and we advise a different course.)
As implied by the rules above, you do not need to journal user data. This is a fairly typical behavior for a journaling file system.
We provide you with code for an on-disk journal that manages records as arbitrary blobs. You may change this code in the course of doing the assignment; but ideally you will not need to. (If you find that you do want to change the on-disk journal code, please contact the course staff; it probably means either we didn't explain something adequately or it means something about the code should be improved for next time, and either way we'd like to know about it.) Your code will generate those blobs and handle them at recovery time, and take care of knowing what types of records exist and what they mean.
Interfacing to the physical journal |
The physical journal code is documented in design/jphys.txt. We aren't going to repeat that material here, so please read the file, and if you have questions please ask.
That said, the interface to the physical journal is divided into roughly three parts:
The writing interface can also be subdivided as follows:
You will find that two hooks have already been added to sfs_writeblock (which is the interface used by the buffer cache to write out dirty filesystem blocks) to make sure that the journal itself gets written out in order and to keep the journal code informed of which journal blocks have been written out. Unless we have blundered badly somewhere you should not need to change those hooks or add more calls to the functions they use.
You will find that sfs_jphys_write accepts a callback function. This allows you to update data structures using the LSN of a new record before the journal lock inside jphys is released. The reason for this is discussed in jphys.txt (and will be talked about in section) — it is possible you won't need this hook, but be aware that it's there and of why it's provided, and that if you use it this creates an ordering requirement between the journal lock and any locks you use in the callback.
Limits of the physical journal code
The journal code has some limits that you should be aware of:
Some things to know when debugging
Some useful tidbits:
Testing (or: The Doom Counter and You) |
Testing recovery code is difficult, in part, because it entails being able to anticipate everything that could ever possibly go wrong, which is next to impossible. To help you, we have helpfully added a feature to System/161 we like to call the "Doom Counter." Every time a write is started on a disk device, the counter is decremented. If the value of the counter ever falls to 0, the machine trips over its power cord and pulls it out of the wall, crashing your system. Thus you have a convenient way to purposefully crash your system at arbitrary (but interesting) points in time. The -D option to sys161 allows you to set the Doom Counter; be sure to test with lots of different values.
Using frack |
The frack test is a tool for playing back simple test workloads. It is specifically intended for testing recovery. The basic model is as follows:
You run it as follows, using the doom counter to trigger crashes:
frack do workload (crash, reboot, run recovery) frack check workloadIt has several dozen workloads coded into it; you can run frack list to get the list. Currently it is underdocumented and the best way to see what the workloads are is to read the source.
Note that frack is newish and may still have some problems. It is also still somewhat too aggressive about demanding that user data isn't lost.
Using poisondisk |
The poisondisk program clears an entire disk to a recognizable (non-zero) byte pattern. The idea is that you can run this, then run mksfs to create a file system, then run a test workload. If the poison values show up in files (or worse, in metadata) after testing, you know you have a problem with exposing uninitialized disk blocks.
You can run poisondisk from inside OS/161 as testbin/poisondisk or outside as hostbin/host-poisondisk.
The poison value (as a byte) is 0xa9, or 169; full words are 0xa9a9a9a9, or 2846468521 (as unsigned decimal) or -1448498775 (as signed decimal). In iso-latin-1 this character prints as a copyright symbol; in UTF-8 a sequence of 0xa9 bytes is an invalid multibyte character.
Miscellaneous Helpful Hints |
mksfs lhd1raw: mydrive
orhostbin/host-mksfs LHD1.img mydrive
mount sfs lhd1:
sys161 -D 12345 kernel mount sfs lhd1:; cd lhd1:; p /testbin/dirconc; q'
What to hand in |
Design Document
Your design document submission must include your plan for converting SFS to being a crash-resilient journalled filesystem. Some things you should consider for your design document include:
Assignment code
Your final submission, due at 5:00 pm on May 1, should include (along with your code):
You should submit by tagging your repo with asst4-submit, and then pushing your final working copy of your source along with your tags to your team git repository. As always, if you are taking late days please be in contact with your TF; and also, if you have to retag your submission or you do anything else with git such that git-level slipups might cause your TF to grade the wrong version, please send your TF the changeset hash of the right version as a precaution. Refer to Assignment 2 if you need a refresher on how to submit.