Jump to:

Margo’s Current Harvard Students

  • Rob Bowden – Automatic program repair
  • Jingmei (Crystal) Hu – Evolvable operating systems
  • Xueyuan (Michael) Han – Applications of data provenance to system security

Current Projects

Collaborative Research: Bringing End-to-End Provenance to Scientists

An interdisciplinary team of computer scientists and ecologists have come together to develop tools to facilitate the capture, management, and query of data provenance – the history of how a digital artifact came to be in its present state. Such data provenance improves the transparency, reliability, and reproducibility of scientific results. Most existing provenance systems require users to learn specialized tools and jargon and are unable to integrate provenance from different sources; these are serious obstacles to adoption by domain scientists. This project includes the design, development, deployment, and evaluation of an end-to-end system (eeProv) that encompasses the range of activity from original data analysis by domain scientists to management and analysis of the resulting provenance in a common framework with common tools. This project leverages and integrates development efforts on (1) an emerging system for generating provenance from a computing environment that scientists actually use (the R statistical language) with (2) an emerging system that utilizes a library of language and database adapters to store and manage provenance from virtually any source.

Evolving Operating Systems

Large organizations invest enormous sums of money in software development, which result in large, difficult-to-maintain, complex systems. Over time, systems become increasingly brittle and more terrifying to modify, because no one understand the full system. When these systems are mission critical, they frequently continue to run for decades on outdated hardware, because no one knows how to migrate them to more modern system. The goal of this project is to develop tools and techniques to allow software to evolve, enabling migration to newer hardware platforms. In particular, we are investigating the use of hardware machine description languages as the foundation for developing tools that will let us synthesize systems for new hardware.

Back to top

Past Projects

Automatically Scalable Computation (ASC)

The ASC project is a research collaboration between Harvard University and Boston University to create a powerful, practical automatic parallelization runtime.

Traditional automatic parallelization techniques are not keeping pace with the widespread growth in parallelism in computing. ASC is a research project that studies an alternative approach to autoparallelism.

Rather than compiling sequential programs into parallel programs, our approach arises from the fact that executing a sequential von Neumann program on input data traces out a unique path through the state space of execution. If we could somehow partition that complete path into multiple smaller subpaths, then we could take advantage of all available processors to execute each subpath in parallel.

Citation++: Data citation, provenance, and documentation

This project creates a provenance-enabled data citation system that can both embed in an existing data platform (specifically, DataVerse) as well as function as a standalone service. The system will directly include executable transformations for a limited, but important set of tools: R and SQL. For other tools, it provides a standardized documentation capability to describe transformations. The system is sufficiently flexible to serve either as part of a publication workflow, where data is part of a more conventional publication, or in support of a standalone publication. It also provides data summaries.

Workload-Aware Storage Architectures for Optimal Performance and Energy Efficiency

The most significant performance and energy bottlenecks in a computer are often caused by the storage system, because the gap between storage device and CPU speeds is greater than in any other part of the machine. Big data and new storage media only make things worse, because today’s systems are still optimized for legacy workloads and hard disks. The team at Stony Brook University, Harvard University, and Harvey Mudd College has shown that large systems are poorly optimized, resulting in waste that increases computing costs, slows scientific progress, and jeopardizes the nation’s energy independence.

At Harvard, we are currently investigating two areas: building storage systems for new SMR disk drives and developing systems to store, manipulate, and query large graphs.

Provenance Aware Storage Systems (PASS)

Provenance (also known as pedigree or lineage) refers to the complete history of a document. In the scientific community, provenance refers to the information that describes data in sufficient detail to facilitate reproduction and enable validation of results. In the archival community, provenance refers to the chain of ownership and the transformations a document has undergone. However, in most computer systems today, provenance is an after-thought, implemented as an auxiliary indexing structure parallel to the actual data.

Provenance, however, is merely a particular type of meta-data. The operating system should be responsible for the collection of provenance and the storage system should be responsible for its management. We define a new class of storage system, called a provenance-aware storage system (PASS), that supports the automatic collection and maintenance of provenance. A PASS collects provenance as new objects are created in the system and maintains that provenance just as it maintains conventional file system meta-data. A PASS, in addition to collecting and maintaining provenance, also supports queries upon the provenance.

Currently, we have implemented a prototype that records relevant system activity and stores it persistently in an in-kernel database and responds to user queries about a file’s provenance.

Scalable Data Management

This project addresses the challenge of making petabyte scale storage systems easily usable. The techniques we developed forty years ago are simply not up to the task of managing billions of files. We leverage provenance (a record of the data and processes that contributed to its creation), content, and other attributes to provide a scalable and searchable file namespace that tracks data as it moves through the scientific workflow.


The Hourglass project is building a scalable, robust data collection system to support geographically diverse sensor network applications. Hourglass is an Internet-based infrastructure for connecting a wide range of sensors, services, and applications in a robust fashion. In Hourglass, streams of data elements are routed to one or more applications. These data elements are generated from sensors inside of sensor networks whose internals can be entirely hidden from participants in the Hourglass system. The Hourglass infrastructure consists of an overlay network of well-connected dedicated machines that provides service registration, discovery, and routing of data streams from sensors to client applications. In addition, Hourglass supports a set of in-network services such as filtering, aggregation, compression, and buffering stream data between source and destination. Hourglass also allows third party services to be deployed and used in the network.

Back to top

Examples of Undergraduate Projects and Theses

Optimal Bayesian Rule Lists — Nicholas Larus-Stone (2017)

We demonstrate a new algorithm that finds an optimal rule list as well as providing proof of that optimality. Rule lists, which are lists composed of If-Then statements, are similar to decision tree classifiers and are useful because each step in the model’s decision making process is understandable by humans. Our algorithm finds the optimal rule list through the use of three types of bounds: bounds inherent to the rules themselves, bounds based on the current best solution, and bounds based on symmetry between rule lists. We propose novel data structures to minimize the memory usage and runtime of our algorithm on this exponentially difficult problem. One advantage of our algorithm is that it generates not only the optimal rule list and a certificate of optimality, but also the entire solution space of nearly optimal solutions. Our algorithm therefore allows for the analysis and discovery of all optimal and near-optimal solutions on problems requiring human-interpretable algorithms.

Dependency Tracking in ASC — Peter Kraft (2017)

ASC is a new method of automatic parallelization that transforms parallelization into a machine learning problem. ASC works by predicting the future state of a program’s memory and registers and speculatively executing from predicted states. It is limited by its ability to accurately predict these future states. We introduce a new learning model for ASC as well as a notion of data dependency that reduces the size of the state that ASC needs to predict correctly to produce improved performance. Using these, we substantially increase the set of programs ASC can automatically parallelize, demonstrating speedup on several important classes of programs, such as maps.

The Climate for Women in Harvard CS — Michelle Danoff (2017)

In collaboration with Jim Waldo, we are working to better understand the intersection of gender and computer science at Harvard. The work includes surveys and undergraduate interviews with the goal to understand how students became interested in computer science and how their experience with computer science has been at Harvard. This research involves many components such as survey design, data analysis, and research about efforts at other universities. Our goal is to gain a better understanding of what the experience at Harvard is like now, and pass that on so SEAS and student groups so they can better direct efforts to narrow the gender gap.

Examining the feasibility of crowd-sourcing EpiPens — Gregory Hewitt (2017)

Epinephrine autoinjectors (e.g. EpiPens) are needed within minutes of anaphylactic shock to be effective, and emergency defibrillators are needed just as quickly to counteract a heart attack or heart arrhythmia. However, if someone is in need of one of these devices and does not have one on their person, it can be difficult and time consuming to find one. The goal of my research project is to find and implement the best way to make devices such as these more accessible in times of emergency. My analysis determines that, based on population densities of certain areas and the probability that any given person will have an autoinjector on their person, an app-based, crowdsourcing network can be implemented to solve this accessibility issue. I also discuss ways to safely and securely share, authenticate, and store a users location by examining what work and implementations have already been done in this area. I discuss why using Internet-connected containers that communicate with this network can replace current Automated Emergency Defibrillator (AED) containers drastically improves the accessibility of these devices and provides a significantly higher chance of survival in an emergency situation. I give detail on what this network requires for implementation and what proptocols are required. Finally, I discuss an evaluation of my own implementation of the hardware and software I used to build it.

Identifying and Visualizing Controversy in Wikipedia — Brigette Pare (2017)

Information Flow Audit on Streaming Provenance Graphs — Mohammed Hafeezul Rahman (2017)

Audit compliance on information/data flow graphs has been extensively studied, and there exist numerous applications that allow for provenance capture and checking for data flow violations on these provenance graphs. However, the subgraph containing the violation is usually small compared to the entire information flow graph generated by the system. Thus, we can save significant storage space by developing a framework capable of verifying compliance and enforcing application-specific information security guarantees from the data provenance graphs generated by the application, directly at runtime. The goal of the project is to achieve an in-memory solution to this by designing single-pass graph algorithms to detect violations from the massive streaming provenance graphs.

Predicting the Performance of Automatically Scalable Computation (ASC) — Serena Wang (2016)

The Automatically Scalable Computation (ASC) architecture is a new approach to auto- matic parallelization that transforms parallelization into a machine learning problem. The underlying principle is that if we observe the state of the machine repeatedly at a given place in a sequential program, we can build a model to predict the state of the machine at that place over time. The “place” in the program is defined by an address loaded into the instruction pointer (IP). We present a machine learning approach to automatic IP selection. Our approach relies on the observation that the error function of ASC’s internal machine learning model decreases over time for a good IP value. For a given program and IP value, we build a predictive model for ASC’s error function value over time. The inputs are a program and IP value and the target is ASC’s error function value. We present methods for representing the program and IP value as a feature set. We also present and evaluate three different machine learning models for predicting the error function values.

Community-Attribute Models for Bibliographic Reference Information via Dynamic Graph Evolution — Ross Rheingans-Yoo (2016)

We present Cambridge, the first technique for evolutionary analysis of dynamic graphs via a community-attribute graph model. Community-attribute models have been shown to be superior to models conventionally used for evolutionary analysis, particularly in modeling ecommunity structures in networks where communities exhibit dense overlaps. Thus, our use of a community-attribute model for analysis of a bibliographic network evolving in time allows us to observe not only the evolution of discrete clusters, but also the evolution of the ‘core’ of nodes which are strongly linked to multiple communities simultaneously. In particular, our approach allows us to observe and quantify how the sibling communities resulting from community-splitting events share and compete for external intercommunity influence inherited from parent communities. We present evidence that indicates that in such splitting events, highly-connected nodes which were part of the parent networks ‘strong intercommunity ties’ become concentrated in the siblings’ intersection, whereas highly-connected nodes that are part of ‘weak intercommunity ties’ are dispersed to the individual sibling communities. We discuss the implications of our findings for the field of evolutionary graph analysis, and address the evident promise of dynamic community-attribute models in providing fully generative models for dynamic networks.

The Consumer’s View on Security and Privacy of Health Wearbles — Margot Taylor (2016)

This study investigates the health wearable market, in particular, consumer attitudes towards security and privacy of their data and the factors that might cause them to modify their attitudes. It is clear that security of health wearables and implantables has room for improvement. However, security is not one of consumers’ top purchasing criteria at the moment. A case study on the advertisement of security in the personal computers industry revealed that a large scale hack affecting thousands of users has the potential to increase consumers concern about security sufficiently to make it a purchasing criteria. Furthermore, a survey of 301 American adults revealed factors that could modify consumer’s attitudes towards security. First, consumers are significantly more concerned about medical history and diagnostic data than they are about biometric data. Once a wearable contains medical history data, security becomes a top purchasing criteria. Second, Consumer’s concern levels are affected by the distribution channel, as they trust devices prescribed by a doctor to be more secure. The nuancing of consumer attitudes towards security and privacy of health wearables has the potential to impact health companies advertising techniques, value proposition models, and design process to focus on security more. It furthermore suggests that the speed of improvement of security may increase once devices that store medical history and diagnostic data come to market.

Back to top