Assignment 1: Paper Critique and Reproducibility (2022w1)

Due Septemer 30 -- electronically (via handin repository; see instructions) by 5:00 PM

You may complete this assignment either alone or with a partner. If, however, you have a partner, we will have higher expectations in terms of what you actually reproduce. If you plan on working with a partner, please let me know ASAP.

Select one paper that meets all of the following criteria:

It is an evaluation paper.
You are not an author.
It includes data (graphs, tables, numbers).
It has something to do with systems.

The paper selected may be one listed here or one from the reading list, but it does not have to be. As soon as you've selected your paper, please send us email and tell us what paper you are doing and send a link to it if it is not one on this list or in the course readings.

Your job is to:

Critique the paper (details below). Do this part before you attempt part 2.
Reproduce one or more of the experiments in the paper. See below for details on what this might look like depending on the state of the paper's artifact evaluation.
Write an addendum to your critique commenting on the reproducibility of the research presented.

1. Writing your Critique

Focus your attention on the research methodology more than the research idea. Your critique will probably be on the order of one to two pages although there will be exceptions. In your critique you must at least answer the following questions (depending on the paper, there will be other things to discuss as well).

What is the purpose of the paper?
What is the hypothesis that the authors are testing?
What is the experimental setup?
What is good/bad about the experimental setup?
How well was the research carried out? What results are presented?
Do you believe the results? Why/Why not?
What things might you have done differently?
What lessons did you learn from reading this paper critically?

2. Reproducing Results

The goal of this exercise is to understand systems research, writing, and reproducibility. In an age of artifact evaluations, this part can take many different forms. Please read this entire section before getting started.

1. Your paper has no published artifact

Pick one or two experiments from the paper and try to reproduce the tests and measurements. Undoubtedly you will have difficulty actually reproducing the results. This is OK. You will be graded on how you approach the reproduction, how carefully and fairly you compare your experience with that of the authors and how completely you can state the assumptions that you had to make.

2. Your paper has a published artifact that has NOT been through an artifact evaluation.

In this case, you may have difficulty using the published artifact. That would not be surprising. Report on the difficulty and do your best to get something similar to the artifact running! If you get it running easily, pretend that you are in case 3 below.

3. Your paper has a published artifact that has been through an artifact evaluation.

In this case, make sure you run the experiment on a platform that is quite different from that used in the paper. Ideally, the artifact runs easily and your task will require that you understand the benchmarks and platform differences sufficiently well that you can explain/justify your results.

In all cases

Be careful to articulate any hidden assumptions that you make. Think hard about how to interpret your results given different hardware and software configurations. You may take advantage of data and/or tools that have been made available by the authors, but you may not do so to the extent that there is no work left to the assignment.

3. Critique Addendum

Discuss your results and how and why they differ from those published. Then add a few paragraphs to your critique discussing the reproducibility of the results. Comment on whether or not your assessment of the paper changed after trying to reproduce the results.

What to turn in

A write-up of what experiment you are trying to reproduce (identify the corresponding tables/graphs from the original paper).
A description of your experimental setup.
A discussion of any assumptions you made and important information that the authors did not provide in their paper.
A list of any tools and/or traces that you used.

How to turn in

Place all the parts of your assignment into a single PDF document.
Name your document a1_[cwl].pdf where cwl is your CWL. If you worked with a partner, name your document a1_[cwl1]_[cwl2] where the two CWLs are listed alphabetically.
Add and commit this document into your handin_[CWL] repository (if you worked with a partner, both of you should do this).
Push your repository.

Suggested Papers

Here are some suggested papers. If you chose something not on this list, check with either Professor Seltzer or one of the TAs to make sure that the task you are undertaking is reasonable.

Amit 2017: Optimizing the TLB Shootdown Algorithm with Page Access Tracking
Amit 2019: JumpSwitches: Restoring the Performance of Indirect Branches In the Era of Spectre
Balmau 2017: TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. Reproduce any figure numbered 9 or greater.
Blake 2003: High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two (appeared in the 2003 Hot Topics in Operating Systems). Reproduce the graph in Section 4.1.
Cadar 2008: Klee: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. Download their tool (it's not on a Stanford site, it's at llvm.org) and try it on some of the workloads they used.
Curtsinger 2015: COZ: Finding Code that Counts with Causal Profiling. Reproduce any of the examples of COZ profiling. If that works seamlessly, use COZ to evaluate something that they did not evaluate in the paper and report on what you learned about it.
Cutler 2018: The benefits and costs of writing a POSIX kernel in a high-level language. The code for this project is available here. See if you can reproduce some of their measurements about kernel functionality.
Harnik 2013: To Zip or not to Zip: Effective Resource Usage for Real-Time Compression. See if you can get the same kinds of compression timings that the authors got.
Jamet 2020: Characterizing the impact of last-level cache replacement policies on big-data workloads. In theory, it should be easy to use the simulator and tracer used in this study to reproduce their results exactly. See if theory meets practice. If so, analyze a different benchmark using their tools!
Kadekodi 2018: Geriatrix: Aging what you see and what you don’t see. A file system aging approach for modern storage systems. There are so many graphs frmo which to choose -- see if you can reproduce some runtime results on an aged file system.
Koller 2013: Write Policies for Host-side Flash Caches. Start with the analytical results from Figure 1. Then see if you can put together a system that looks something like what the authors did and see if you can run any of their benchmarks.
Kyrola 2012: GraphChi: Large-Scale Graph Computation on Just a PC. Most of the graphs from this paper are available from the SNAP repository and many of the systems against which to compare are open source.
Lawall 2022 : OS scheduling with nest: keeping tasks close together on warm cores. This has undergone artifact evaluation, so this is a type 3 project. You need to make sure you are running on a very different platform. Then you need to explain your results relative to those in the paper.
Lozi 2016: The Linux Scheduler: a Decade of Wasted Cores This paper has a collection of different graphs illustrating several interesting behaviors of the Linux scheduler, see if the behavior described still exists.
Mao 2012: Cache Craftiness for Fast Multicore Key-Value Storage. The software described here is available here. See if you can reproduce any of figures 9 - 11.
Min 2016: Understanding Manycore Scalability of File Systems. You can pretty much try to reproduce anything in these figures!
Ren 2019: An Analysis of Performance Evolution of Linux’s Core Operations
Roghanchi 2017: ffwd: delegation is (much) faster than you think. This paper explores different ways to consistently handle access to shared memory. See if you can reproduce any of the benchmarks in the first three or four figures. Code is available here
Roy 2013: X-Stream: Edge-centric Graph Processing using Streaming Partitions. This paper has a lot of different data - not just run time. Trying to reproduce it should be, um, fun.
Sumbaly 2012: Serving Large-scale Batch Computed Data with Project Voldemort. Using the publicly available Voldemort and MySQL releases, see if you can reproduce any of the graphs in the evaluation.
Vangoor 2017: To FUSE or Not to FUSE: Performance of User-Space File Systems. See if you can reproduce a few of the results from Table 3 on any system to which you have access.
Volos 2014: Aerie: flexible file-system interfaces to storage-class memory. See if you can reproduce Figure 1.
Wu 2018: Anna: A KVS For Any Scale. The code for this system is available here. Can you reproduce any of the comparisons with Redis or Cassandra or any other widely used KV store?
Zhao 2016: Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle. Pick one workload used in the paper and see if you can reproduce it. Can you run workloads not in the paper?