Computer Science 161 (2013)

Final Exam

May 7, 2013 - May 8, 2013 5:00 PM

240 minutes/240 points

Instructions

The points allocated to each problem correspond to how much time we think the problem should take. Although you have 24 hours in which to take the exam, we do not recommend pulling an all-nighter and spending all 24 of them on it. Here is Margo's advice on how to complete this exam:

Please send your answers to me in a PLAIN TEXT file as an attachment (in a perfect world, your text file would not have lines longer than 80 characters, but I'll reformat it if it does). Label the answers clearly so it is obvious which answers correspond to which questions. Leave space between questions, and place your answers in order in the file. Send your completed exam to margo@eecs.harvard.edu. I will return the graded exam to you; I tend to engage in a dialog as much as possible, so there is some educational value in reading more than just the final number.

This exam is open text book, open source code, and open notes. You may not use computers (cell phones, PDAs, or any other device that can talk to a network) during the exam, other than to record your answers, to view OS/161 or System/161 code, or to examine your notes or the course web site. Please do not access the Piazza group during the exam. You must complete the exam on your own.

Exam Policy Statement

At the end of your exam, assuming that you have followed the rules as stated above, please write the following statement and then place your name after it, asserting that you have followed the rules of the exam. "I have neither given nor received help on this exam and have followed the rules as stated." If this statement is absent, you will hear from me.

1. Buffer Cache Fun (60 points)

These questions are based on real-world challenges in file systems.

A. O_DIRECT

As you learned in assignment 4 and our file system lectures, file systems often maintain an in-memory collection of pages to speed access to frequently used files (i.e., the buffer cache). Software that wants more explicit control over when data reaches disk (e.g., transactional database systems) frequently finds the buffer cache a problem. Such software often implements its own in-memory cache, and when it issues a write system call, it really wants the data to go to disk.

In recognition of this problem, some systems provide an open flag (i.e., a flag to the open system call), O_DIRECT (or the F_NOCACHE flag to the fcntl system call), that directs the file system not to cache data for the file in question, but instead to service read/write requests directly from/to the disk.
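For concreteness, here is a user-level sketch of requesting uncached I/O (Linux-flavored; the function name is made up, and on Mac OS X you would instead call fcntl(fd, F_NOCACHE, 1) after a normal open):

    #define _GNU_SOURCE   /* O_DIRECT is a GNU extension on Linux */
    #include <fcntl.h>

    /* Open a file whose reads and writes bypass the buffer cache.
     * Linux additionally requires O_DIRECT transfers to use
     * block-aligned buffers, offsets, and lengths. */
    int
    open_uncached(const char *path)
    {
        return open(path, O_RDWR | O_DIRECT);
    }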

Consider the following scenario: a process P opens file F with O_DIRECT, while a backup program periodically copies F using cp, which does ordinary cached I/O.

If P follows the semantics of O_DIRECT exactly and cp uses the buffer cache as it currently exists, you can get out-of-date backups. For example, let's say that at time 1, cp copies file F and leaves blocks in the buffer cache. Now P writes a block of F directly to disk (i.e., does not update the buffer cache). If cp runs again while F's blocks are still in the cache, it could copy the old version of F, without P's most recent update.

Design an implementation of O_DIRECT that solves this problem in the context of OS/161. You needn't write pseudocode, but identify the key data structures you want to change and, in English, describe what parts of the code you wish to modify and how you wish to modify them. After you present your design, explain how it avoids the problem described.

Note: One challenge not explicitly mentioned is that you might have dirty buffers in the cache when P opens a file O_DIRECT. You may ignore this situation in this question.

There were many ways to address this problem. In terms of code path,
I was hoping people would mention some aspect of passing the O_DIRECT
flag around and storing it. Then, you could have done various things:
If you make things go through the buffer cache:
* store O_DIRECT in the open-file structure at open time
* pass O_DIRECT through to VOP_READ/VOP_WRITE
* pass it from sfs_read/sfs_write to and through sfs_io
* in sfs_blockio and sfs_partialio, if writing, do buffer_writeout
  at the end, and then do buffer_release_and_invalidate instead of
  buffer_release.
If you make all accesses to O_DIRECT files bypass the cache:
* tag the vnode with O_DIRECT when it's opened that way
  (this probably requires a counter, not just a flag, in case there
  are multiple such opens)
* in sfs_blockio, if the vnode is tagged O_DIRECT, go straight to
  the disk device with the passed-in uio structure.
* in sfs_partialio, even if the vnode is tagged O_DIRECT, you still
  need to use a buffer, so do the same thing as in the previous
  way.
If you try to invalidate cached blocks when you O_DIRECT write them:
* store O_DIRECT in the open-file structure at open time
* pass O_DIRECT through to VOP_READ/VOP_WRITE
* pass it from sfs_read/sfs_write to and through sfs_io
* in sfs_blockio, if doing O_DIRECT, go straight to the disk device
  with the passed-in uio structure.
* on write, also do buffer_drop on the block.
* on read... one way is to ignore the buffer cache (in which case,
  non-O_DIRECT writes not yet flushed out may not be seen); another
  way is to do buffer_drop, and if someone has done a non-O_DIRECT
  write they lose; a third way is to do buffer_read to read the
  block, and buffer_release_and_invalidate afterwards (like above),
  which guarantees write consistency at the expense of doing an
  extra copy for all reads.
* in sfs_partialio, if doing O_DIRECT, just return EINVAL (that is,
  demand block-aligned I/O) or do the same as in the other ways.
The first way (going through the buffer cache) solves the stale read
problem by making sure all buffers are either dropped or up to date.
The second (bypassing the cache) solves it by never caching data that
could go stale. The third (invalidating on O_DIRECT writes) solves it
by explicitly invalidating cached blocks when they go out of date.
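As a concrete illustration of the first way, here is a minimal sketch of sfs_blockio, assuming assignment 4's buffer interface (buffer_read, buffer_map, buffer_mark_dirty, buffer_writeout, buffer_release_and_invalidate) and a hypothetical ioflags argument that carries the O_DIRECT bit down from the open-file structure. Field names are approximate; this is one possible shape, not the official solution.

    static
    int
    sfs_blockio(struct sfs_vnode *sv, struct uio *uio, int ioflags)
    {
        struct buf *iobuffer;
        daddr_t diskblock;
        int result;

        /* ... existing code: translate the file block to diskblock ... */

        result = buffer_read(sv->sv_v.vn_fs /* field names approximate */,
                             diskblock, SFS_BLOCKSIZE, &iobuffer);
        if (result) {
            return result;
        }

        /* Copy between the cached block and the user's buffer. */
        result = uiomove(buffer_map(iobuffer), SFS_BLOCKSIZE, uio);
        if (result == 0 && uio->uio_rw == UIO_WRITE) {
            buffer_mark_dirty(iobuffer);
        }

        /* O_DIRECT here is the hypothetical flag this design adds;
         * it is not in stock OS/161. */
        if (ioflags & O_DIRECT) {
            /* Push writes to disk now (error handling elided),
             * then make sure no copy lingers in the cache. */
            if (uio->uio_rw == UIO_WRITE) {
                buffer_writeout(iobuffer);
            }
            buffer_release_and_invalidate(iobuffer);
        } else {
            buffer_release(iobuffer);
        }

        return result;
    }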

B. Mmap

One of the system calls we did not ask you to implement is mmap. Mmap takes a file (or part of one) and maps it into the address space of a process. Thus, it provides another way (in addition to open/close/read/write) to access file data from a program.
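For concreteness, this is what the interface looks like from user space (standard POSIX mmap, not OS/161 code): a sketch that dumps a file to stdout using ordinary memory loads instead of read.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    main(int argc, char *argv[])
    {
        struct stat st;
        char *p;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror(argv[1]);
            return 1;
        }

        /* Map the whole file read-only into our address space. */
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* The file's bytes are now ordinary memory. */
        fwrite(p, 1, st.st_size, stdout);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }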

2. Virtual Memory (40 points)

The LEG processor MMU has the following characteristics. First-level page table entries are stored as shown below; the "A M" field holds 2 access-mode bits, and the low 2 bits of each entry indicate what kind of entry it is.

First-level page table entries:

    1 MB mapping:                  | 20-bit physical address | 8 unused bits | A M | 1 0 |
    Mapping for a 2nd-level table: | 22-bit physical address | 6 unused bits | A M | 0 1 |

Second-level page table entries look like:

    4 KB page mapping:             | 20-bit physical page #  | 8 unused bits | A M | 0 0 |
  1. Why are 12 bits of the virtual address used to index into the top-level page table? (4) That leaves 20 bits, i.e., 1 MB, so each entry in the top-level table maps 1 MB, which lets you store the translation for large regions directly in the top-level table.
  2. How many entries are there in the top-level page table? (2) 2^12 = 4096.
  3. On a TLB fault handled by the hardware page table path, how many memory accesses are required to read data stored in a 1 MB mapped segment? (2) 2: 1 for the first-level table, 1 for the data.
  4. On a TLB fault handled by the hardware page table path, how many memory accesses are required to access data stored in a 4 KB page (not in a 1 MB segment)? (2) 3: 1 for the first-level table, 1 for the second-level table, 1 for the data.
  5. Which bit(s) of a first-level page table entry tell you whether the address you are translating is part of a 1 MB region or a 4 KB page? (2) The low 2 bits; in fact, either bit alone is enough.
  6. How many entries are there in a 2nd-level page table? (4) 1 MB / 4 KB = 256.
  7. As described, this processor can only access a 32-bit physical address space. How could you let it access a larger physical address space? (4) Use some of the unused bits as part of the physical address.
  8. Is it possible to run the LEG processor on a system with less than 4 GB (32 bits worth) of physical memory? (2) Yes.
  9. Can two different pages in the same 1 MB region mapped in the top-level page table have different access protection? (2) No.
  10. Can two consecutive 4 KB pages mapped through second-level page tables have different access protection? (2) Yes.
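As a sanity check on the sizes above, a few lines of C (the constants just restate the 12/8/12 bit split of the virtual address in this problem):

    #include <stdio.h>

    int
    main(void)
    {
        /* 32-bit VA = 12-bit top index | 8-bit 2nd index | 12-bit offset */
        printf("top-level entries: %d\n", 1 << 12);            /* 4096 */
        printf("bytes mapped per top-level entry: %d\n",
               1 << 20);                                       /* 1 MB */
        printf("2nd-level entries: %d\n", (1 << 20) >> 12);    /* 256 */
        return 0;
    }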
For the next 4 problems, assume that part of the top-level page table contains the following entries (the first entry shown corresponds to the 1025th one in the table, i.e., index 1024). Note that 22-bit address fields are represented using the values 0x000000 through 0x3FFFFF.

---- Beginning of Top Level Table ----
---- Skipping Entries 0 - 1023 ----
Index   Address field   Unused   A M   Type
1024    0x50000         0xFF     0x3   10
1025    0x00000         0xFF     0x0   10
1026    0x123456        0x3F     0x2   01
1027    0xDEAD0         0xFF     0x2   10
1028    0x00BEEF        0x3F     0x3   01
1029    0x2B012B        0x3F     0x3   01
1030    0xABCDE         0xFF     0x1   10
---- Rest of Top Level Table ----
  1. At what address will you find the page table entry for virtual address 0x40277123? (6) 0x48D159DC = (0x123456 << 10) + (0x77 << 2) = 0100 1000 1101 0001 0101 1001 1101 1100.
  2. Given the information above, do you know that 0x40277123 is an address from which you are allowed to read? (2) No; the second-level page's permissions could deny it.
  3. What is the physical address for virtual address 0x40666ABC? (4) 0xABCDE66ABC. (Note that this exceeds 32 bits, which was really quite unfair.)
  4. Given the information above, do you know that you can read from that address? (2) Yes; there are no additional permission bits to check.
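A quick sketch of the arithmetic behind answers 1 and 3 (ordinary C; the constants come straight from the table above):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Answer 1: VA 0x40277123. Top 12 bits = 0x402 = entry 1026,
         * which points at a second-level table at 0x123456 << 10. */
        uint32_t va1 = 0x40277123;
        uint64_t table = (uint64_t)0x123456 << 10;
        uint32_t index = (va1 >> 12) & 0xFF;    /* middle 8 bits = 0x77 */
        printf("PTE at 0x%llx\n",               /* 0x48D159DC */
               (unsigned long long)(table + (index << 2)));

        /* Answer 3: VA 0x40666ABC. Top 12 bits = 0x406 = entry 1030,
         * a 1 MB mapping with frame 0xABCDE; keep the low 20 bits. */
        uint32_t va2 = 0x40666ABC;
        uint64_t pa = ((uint64_t)0xABCDE << 20) | (va2 & 0xFFFFF);
        printf("PA  at 0x%llx\n",               /* 0xABCDE66ABC */
               (unsigned long long)pa);
        return 0;
    }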

3. Multikernel/BarrelFish (80)

For this question, you will need to read the first eight and a half pages of the Multikernel paper ("The Multikernel: A New OS Architecture for Scalable Multicore Systems," Baumann et al., SOSP 2009), which describes the multikernel architecture and its implementation as Barrelfish. (You are, of course, free to read the entire paper, but I wanted to keep the portion necessary for the exam within reason.) Please answer the following questions.
  1. The abstract states, "An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware." Do you believe that it is sufficient to show that performance is comparable? Why/Why not? Limit your discussion to 1-2 paragraphs.
    (20 points) I do believe that comparable performance is sufficient, provided that by "comparable" they mean single-threaded performance. If they include scalability or the ability to exploit multiple cores, then comparable is not sufficient; they need to do much better, since that is their focus. The greatest challenge we face on today's hardware is being able to exploit multiple cores, so if single-threaded performance is nearly as good as it is on regular operating systems, that's OK. They really need to show that they are better able to exploit multiple cores: that when they run many threads/processes, they derive more benefit from additional processors than other operating systems do. Fundamentally, they are claiming that shared memory is a bottleneck, so they want to show that by designing a system that uses shared memory less, they avoid memory bottlenecks and obtain better aggregate throughput from the sum total of the processors.
  2. Figure 3 shows that there is a very small overhead in updating a data item that spans eight cache lines among many (> 8) cores using message passing, but a significant overhead in sharing eight cache lines among an equivalent number of cores using shared memory. What explanation do they provide for this difference?
    (15 points) In a shared-memory arrangement, each write to a cache line causes stalls for the cache misses as the data moves among the various cores. As the number of cores increases, the number of stalls increases. In message passing, there is a relatively constant overhead in transmitting the message to a server process, but then there are no cache misses. So, in shared memory we are limited by the speed of the memory hierarchy, while in the message-passing case we are limited by how long it takes the server to process the message.
  3. In Figure 4, let's assign a numerical scale to the X axis such that "Shared state, one big lock" = 1, "Finer-grained locking" = 2, "Clustered objects, partitioning" = 3, and "Distributed state, replica maintenance" = 4. What number would you give OS/161? Why?
    (10 points) 2: OS/161 is designed for multiple processors, so it tries to do something smarter than "one big lock," but it doesn't do anything fancy such as clustered objects.
  4. Although OS/161 is a shared-memory and not a message-passing architecture, there are places where a core behaves in a split-phase manner (section 3.1). Give one example.
    (15 points) In the I/O system, we frequently issue an I/O, deschedule the currently running process and then go off and do something else.
  5. Had you completed assignments 2, 3, and 4 in Barrelfish, in which Barrelfish component(s) would the majority of the functionality reside for each major piece of functionality: system calls, scheduling, virtual memory, and journaling? If you specify more than one component, explain how the work would be partitioned.
    (20 points)
    • System calls: CPU driver to receive the trap, but would then dispatch to a user level server in many cases.
    • Scheduling: CPU driver
    • Virtual memory: CPU driver for actual hardware MMU manipulation; monitor for keeping track of process page tables.
    • Journaling: User-level service

4. Distributed File Systems (18 points)

You have been hired to design a new distributed file system, TNVFCSDFS (the new very fast, consistent, simple distributed file system). It is up to you to decide if you want your distributed file system protocol to be stateful or stateless. You've decided to poll your customers to figure out what they value most: performance, consistency, or simplicity (which is likely to get your product to market more quickly and more reliably). Your design will depend on what you learn from your customers.
  1. If your customers say that performance is the most important criterion, what will you decide? Why? (10 points) This is the interesting one... I gave credit for either answer if you made valid points (e.g., with state, you can know that your cached copy is valid, save messages to/from the server on open, and cache more aggressively; without state, the server has less work to do and probably scales better).
  2. If your customers say that consistency is the most important criterion, what will you decide? Why? (4 points) If you want consistency, then stateful is the way to go; building a consistent distributed file system without state is very tricky. If you had a rock-solid stateless approach, I gave credit, but I don't think anyone did.
  3. If your customers say that simplicity is the most important criterion, what will you decide? Why? (4 points) Recovery is infinitely easier with a stateless system. If you could convince me that something was simpler with a stateful system, I gave credit, but doing that was tricky.

5. Pot Pourri (42 points)