CPSC 508 Final Project (2019)

Project Proposal & Research Plan Due: 5:00 PM October 4, 2019

1st Status Meeting: Early Week of October 14, 2019 or Late Week of October 21

2nd Status Meeting: Week of November 4, 2019

In-class Presentations: November 19/21, 2019

(Depending on how many different projects we have, we may need only one of these days.)

First Draft Due: 9:00 PM November 24, 2019

Final Project Due 5:00 PM December 13, 2019

The goal of the final project is to provide the opportunity for you to conduct systems research. The size of the project can vary, but thinking of it as a conference paper is probably a good model. This History of a Paper outlines a paper from initial submission to final publication. It includes the original (rejected) submission (extended abstract) and reviews, another (accepted) submission and reviews, and the final paper. This collection should give you an idea of how to give and respond to constructive criticism. It will also give you a sense of what I mean by "conference paper."

Final projects may be undertaken in teams of two graduate students or up to four undergraduate students. If you feel that you have a project sufficiently large to warrant more people, come talk to me. Projects may also be undertaken in cooperation with other graduate courses, but any such project must be approved by the professors of both courses. Not surprisingly, we expect more depth and work for a project that is satisfying two class requirements. Similarly, if you wish to undertake a project related to your own research, I will permit it, but you must demonstrate how what we've learned in CPSC 508 influences your work and/or ways in which your research would have been different had you not also been conducting a project in CPSC 508. In other words, your project in CPSC 508 must extend work you would normally have done in some new and/or different way.

For this project, you need to pose a question, design a framework in which to answer the question, conduct the research, and write up your experience and results. There will be five deliverables for this project.

  1. Project Proposal and Research Plan (20%)

    Although your proposal and plan are not due until October 4, if you hand them in early, I will try to give you feedback within 24 hours. That will maximize the time you have to work on the project. I therefore strongly encourage you to hand it in early (unheard of, I know).

    The proposal part should be a single page that describes your project. You should clearly motivate and state the research question you are investigating. Provide a few sentences of explanation about why you think this is an interesting question, why it is important, and how it qualifies as research.

    The research plan is a more comprehensive document. It should include the following components (the numbers in parentheses are an estimate of the number of pages you might need for each section).

  2. Status Meetings

    I encourage you to come talk to me about your project or to schedule other meetings with me as the need arises. At a minimum, I want to meet with you twice before the extended abstract is due. One of those meetings must happen either before the Friday of the week of October 14 or during the second half of the week of October 21, and the other must happen before the end of the week of November 4. These meetings are for your benefit: I expect to answer questions you may have, ask you questions about what you've done, brainstorm about what to do next, etc. If you haven't done anything, you will get little value out of these meetings. Come with questions I can help you answer.

  3. First Draft (30%)

    This is the version of the paper that we will review at the "Mock Program Committee." You should have some of your research completed by this point. The draft should contain all the parts of the paper, although it may have preliminary results and may be missing some results. For any results you do not have, I want you to clearly state that you do not have them, but I still want you to write the results section. That is, I want you to think through what results you hope to have and how you'll present them. This can be super fun, because you can MAKE UP WHATEVER RESULTS YOU WANT. The reason we do this is that when you get actual results, you then have something against which to compare them. When your actual results do not match your predictions, one of two things is true: either your initial intuition was wrong, or your system/tests are wrong. If the former is true, that's really interesting -- your final paper will then explain what you originally thought and why it was not right. If the latter is true, you get to do some debugging. This first submission should contain a complete introduction, background, description of your research, related work, and (preliminary) conclusions. You should be able to write significant parts of this immediately after your project proposal is turned in, so please, please, please don't write it all the night before it is due. The better and more complete the first draft, the more valuable the input that I and your classmates can give you.

  4. In-class Presentation (10%)

    Each group will present a short talk on their research during class on November 19 or 21 (depending on how many group projects there are, we'll see if we need both days). You should plan on approximately a 10-minute presentation and 2-5 minutes for questions and answers. You can think of the in-class presentation as a short conference presentation. You will not have time to present all the details and subtleties of your work, but you should be able to motivate the audience and explain the important results of your work. After your presentation, your classmates should want to read your final report. The presentation is a great way to make sure that they understand what you're trying to do, so that there is no confusion when they read/review your project.

  5. Final Report (40%)

    The final report is a research paper. I expect that most reports will be approximately 10-15 "conference pages," including graphs, tables, diagrams, and references. You should complete the writing early enough that you have time to reread your work and critique it with the rigor that you applied to Assignment 1. Be honest. State the shortcomings of your work. Discuss follow-on projects. I expect that several of these reports will be suitable for submission to a conference, and I will be happy to work with you to turn them into submissions.

    Part of your final report grade will be based upon how well you address comments raised by the program committee. Do not ignore my comments or the reviewers'!

Project Suggestions

I suggest some topics below (we may add to this list after it is up on the web site, so it makes sense to check there if you are stuck for project ideas). You need not pick your final project from this list, but if you decide on a project not on this list, please check with me before fully committing to the project. The key characteristics of a project should be:

  1. The work can reasonably be completed in two months.
  2. We have access to the required hardware and software.
  3. The research question has something to do with systems (I'm willing to give a fair bit of freedom here, but if there are any questions, please check with me).
  4. The project is structured in such a way that you can have tangible results. (No big idea papers probably.)
  5. You will learn something from undertaking this project.

  1. Making Research Reproducible

    I am assuming that after completing homework 1, you have concluded that it's quite difficult to make systems research reproducible. There are many different provenance capture systems in the world, and our hypothesis is that if you have complete provenance for an experiment, then you can automatically construct a virtual machine that can be used to reproduce the research in a paper. We think this would be cool!

    So, pick a provenance capture system (we have ideas -- we have a simple R-based provenance capture system that would be the obvious one to use, but you are free to use others if you prefer) and develop tools that can reconstruct an experiment based solely on the provenance.

    You will undoubtedly want to check out CamFlow and Philip Guo's CDE.
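To make the idea concrete, here is a minimal sketch of the kind of tool this project would build: walk the provenance of an experiment and emit a recipe that recreates the environment and reruns it. The record schema below is entirely hypothetical -- real capture systems (CamFlow, CDE, the R-based system) each have their own formats, and mapping those formats to a recipe is the actual research.

```python
# Toy sketch: derive a reproduction recipe from provenance records.
# The record format is made up for illustration; a real tool would
# consume the schema of whatever capture system you choose.

def build_recipe(records):
    """Turn provenance records into a shell-style recipe that
    recreates the environment and reruns the experiment."""
    packages, inputs, commands = set(), set(), []
    for rec in records:
        if rec["type"] == "library":      # a library the process loaded
            packages.add(f'{rec["name"]}=={rec["version"]}')
        elif rec["type"] == "read":       # an input file the process read
            inputs.add(rec["path"])
        elif rec["type"] == "exec":       # a command that was run
            commands.append(rec["cmdline"])
    lines = [f"pip install {p}" for p in sorted(packages)]
    lines += [f"copy-in {path}" for path in sorted(inputs)]
    lines += commands
    return "\n".join(lines)

records = [
    {"type": "library", "name": "numpy", "version": "1.17.2"},
    {"type": "read", "path": "data/input.csv"},
    {"type": "exec", "cmdline": "python analyze.py data/input.csv"},
]
recipe = build_recipe(records)
```

The interesting problems start where this sketch stops: deciding which reads are inputs versus system noise, capturing library versions faithfully, and emitting a VM or container image rather than a script.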

  2. Cloud Computing in the Wild

    Unikernels [1] have been proposed as an alternative to containers (e.g., Docker) as a cloud application deployment solution. In parallel, the concept of "fog" computing is emerging, in which services are deployed directly into the Internet of Things (IoT) infrastructure to improve latency, privacy, and partition resilience (among other things). This suggests the questions, "Could we allow the dynamic migration of services from the cloud, to edge devices, and directly into end devices?" and "Could this be done while maintaining the use of programming languages and skills backed by a relatively cheap and abundant workforce?"

    Transpilation techniques [2] combined with a unikernel designed for extremely low resource consumption [3] could be a step in this direction. A preliminary proof of concept demonstrated that it is possible to transform a PHP application into a self-contained virtual machine image of a few MB. We want to go beyond that proof of concept and build a prototype that demonstrates effective service migration in this manner.

    1 - Madhavapeddy, Anil, et al. "Unikernels: Library operating systems for the cloud." ACM SIGPLAN Notices 48.4 (2013): 461-472.

    2 - Zhao, Haiping, et al. "The HipHop compiler for PHP." ACM SIGPLAN Notices. Vol. 47. No. 10. ACM, 2012.

    3 - Bratterud, Alfred, et al. "IncludeOS: A minimal, resource efficient unikernel for cloud services." 2015 IEEE 7th International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, 2015.

  3. A General Purpose Isolation Mechanism

    One could view a hypervisor as a mechanism for providing units of isolation whose interface is a machine ISA. Similarly, a conventional operating system provides units of isolation whose interface is the system call API. The trend continues: a JVM is a user-level process that provides units of isolation whose API is Java bytecodes. Web servers adopted a pile of complexity to support virtual domains, thereby introducing yet another way to provide isolation. Some browsers also provide units of isolation between browser tabs. Given this stack of software, each layer providing its own isolation mechanism, one might wonder why we have N different mechanisms instead of a single coherent mechanism. Putting it another way, could all these systems use a single isolation model/implementation? If so, what would it look like?

    The goal of this project would be to design an isolation mechanism that could be used in this way and to evaluate it. (It's possible that you could build something like this on L4.) You could imagine evaluating it using a set of examples like the web server, running in different architectural configurations: a single web server running virtual domains; a web server per virtual machine; a web server using the isolation mechanism you provide.

  4. Using Provenance to Solve OS Problems

    There are many systems papers of the form, "We wanted to solve some problem, so we modified the kernel to produce a bunch of data, and then we used that data to do something." I'd like to see how many of these projects could be done via a single provenance capture system. CamFlow is a selective whole-system provenance capture system; it also has a lovely front-end display engine. I would love to see how many special-purpose systems could be replaced by scripts running over CamFlow data. I could imagine doing this dynamically over streaming data (using CamQuery) or statically over collected data.

    1. For example, prefetching files requires that you know what files are likely to be accessed before programs actually access them -- PASS captures much of that data. So, see if you can replicate the work in "An Analytical Approach to File Prefetching" (1997 USENIX) using PASS. Here are other papers on file prefetching to examine:
      • Marginal Cost-Benefit Analysis for Predictive File Prefetching (ACSME 2003)
      • Design and Implementation of Predictive File Prefetching (USENIX 2002)
    2. Another area where provenance might be useful is cache replacement algorithms -- if you knew what you might need again soon, you would keep it in your cache. Look for papers on caching, such as:
      • A study of integrated prefetching and caching strategies (Sigmetrics PER 1995)
      • Informed prefetching and caching (SOSP 1995)
      • Application controlled prefetching and caching (USENIX 2002)
    3. The Coda file system was designed to help users work in disconnected mode. One component of that system was a hoarding mechanism, in which the system would try to figure out what files you were going to need to function while disconnected. It seems that one could exploit provenance to perform better hoarding. Do it!
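To give a flavor of what "a script running over provenance data" might look like, here is a toy version of one of the simplest predictors in the prefetching literature, a last-successor model: predict that the file accessed after A will be the same file that followed A last time. A real project would feed it an access trace extracted from PASS or CamFlow data; the trace here is made up.

```python
# Toy last-successor file prefetch predictor. On each access, predict
# the file that followed the current file the last time it was seen.

class LastSuccessor:
    def __init__(self):
        self.successor = {}  # file -> file that followed it last time
        self.prev = None     # most recently accessed file

    def access(self, path):
        """Record an access and return the prefetch prediction for the
        next access (None when there is no history yet)."""
        if self.prev is not None:
            self.successor[self.prev] = path
        self.prev = path
        return self.successor.get(path)

# A made-up access trace with a repeating pattern.
trace = ["a.h", "b.c", "a.h", "b.c", "a.h", "x.c"]
model = LastSuccessor()
predictions = [model.access(p) for p in trace]
# predictions == [None, None, "b.c", "a.h", "b.c", None]
```

The research question is how much of this machinery -- and its fancier successors -- can be expressed as queries over a single general-purpose provenance store rather than as bespoke kernel instrumentation.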

    Warning: I have a strong vested interest in this project. The upside is that you are likely to get lots of attention; the downside is that you are likely to get lots of attention.

  5. Storing Whole-System Provenance on Blockchain

    We are frequently asked how we maintain the integrity of provenance. Since CamFlow provides a mechanism for shipping provenance out to a remote site, it's possible that you could simply store provenance on a blockchain. However, it's also possible that the performance of blockchain storage will be too slow. Read the CamFlow paper and as much blockchain background as necessary. Can you make this work? (Consider alternatives to proof-of-work; otherwise this is a non-starter.) (There is a very recent paper that does something like this; I was not impressed with it, so I'm pretty confident there is something more interesting to be done.)
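The integrity property at stake can be illustrated with a bare hash chain: each provenance record commits to the hash of its predecessor, so tampering with history is detectable. This is a toy sketch of that one idea only -- a blockchain adds distribution and consensus on top of it, which is exactly where the performance questions (and the proof-of-work alternatives) come in.

```python
# Minimal hash-chained provenance log: the integrity core of
# blockchain storage, without consensus or distribution.
import hashlib
import json

def make_chain(records):
    """Link each provenance record to the hash of the previous block."""
    blocks, prev = [], "0" * 64
    for rec in records:
        block = {"record": rec, "prev": prev}
        prev = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
        blocks.append(block)
    return blocks

def verify(blocks):
    """Recompute the chain; any tampering breaks a link."""
    prev = "0" * 64
    for block in blocks:
        if block["prev"] != prev:
            return False
        prev = hashlib.sha256(
            json.dumps(block, sort_keys=True).encode()).hexdigest()
    return True

log = make_chain([{"op": "read", "path": "/etc/passwd"},
                  {"op": "exec", "cmd": "sshd"}])
ok_before = verify(log)              # True: untouched log verifies
log[0]["record"]["path"] = "/tmp/x"  # tamper with recorded history
ok_after = verify(log)               # False: the chain no longer links
```

Note that this detects tampering but does not prevent it; preventing an attacker who controls the store from rewriting the whole chain is what the distributed consensus layer is for.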

  6. Prove that LSM-based provenance capture is guaranteed to detect a security breach

    This is a two-step process. First, using the methodology used in this paper, show that the current LSM interface captures all security-related flows in the kernel. Next, given provenance captured at these points, prove (or disprove) that a security violation must show up as an anomaly in the provenance graph.

  7. Real End-to-end Provenance

    Data provenance is metadata that describes how a digital artifact came to be in its present state. One problem with existing provenance capture systems is that they capture only local provenance: provenance from a particular language (e.g., R, Python) or from a particular workflow system (e.g., VisTrails). However, once you copy files, use multiple languages, or connect different programs together in a script, you run the risk of breaking the provenance chain. We believe that whole-system provenance (e.g., CamFlow) could provide the glue that connects different provenance systems. Your goal is to demonstrate some application that uses provenance from multiple different collection sources to do something interesting. For example, given a shell script that calls both R and Python programs, can you automatically build a container or VM that precisely and exactly reproduces the experiment? Alternatively, could you use provenance to build a debugging tool? If you're interested in this project, come talk to me.

  8. Deriving Boot Sequences from Machine-Independent Specifications

    I am really excited about the possibility of generating machine-dependent OS code from machine-independent specifications. I believe that such generation will require a number of different techniques. One particularly tricky piece of OS code is the boot sequence. Your goal, should you choose to accept it, is to study at least two different boot sequences (i.e., from two different processors) and come up with some techniques for (a) describing the sequence in a machine-independent fashion, and (b) generating implementations from that specification. I expect that part (b) will not be particularly elegant the first time around, but that's OK. I recommend not selecting the x86 as one of your target processors! In fact, you might select some simple, special-purpose processors. You might get some inspiration from prior work on automatically generating device drivers.

  9. Tiny OS Components for Tiny Processors

    Pick some special-purpose processor. Design a tiny operating system that either (a) makes it easier to develop applications for the processor, or (b) allows it to communicate seamlessly with a general-purpose processor and OS. Once you've established the functionality you need for your tiny OS, design a set of even tinier components that can be assembled to provide that functionality.