A Critique of Seltzer's LFS Measurements

John Ousterhout / john.ousterhout@scriptics.com

The paper "File System Logging Versus Clustering: A Performance Comparison", by Margo Seltzer et al., appears in the Winter 1995 USENIX Conference. This paper is the second in a series of papers analyzing the performance of Seltzer's implementation of a log-structured file system in the BSD kernel (the first appeared in the Winter 1993 USENIX Conference).

The new paper corrects a number of flaws in the earlier paper having to do with the BSD LFS implementation and the choice of benchmarks, but I believe that the new paper is still misleading in several ways:

  1. Section 5 claims to measure the effect of fragmentation on small file performance, but the benchmarking approach is such that it doesn't really measure small files; it measures only large files. Small file performance under FFS will be worse than claimed in the paper.
  2. The measurements in Section 4 are counter-intuitive, and the back-of-the-envelope calculations used to support those measurements are incorrect. At present neither Seltzer nor I can explain the measurements.
  3. BSD LFS does not include a simple optimization that should substantially improve the LFS results in Section 4.

Thus, of the three main results sections (Sections 3, 4, and 5), only the results in Section 3 are completely reliable; the others should be taken with a grain of salt.

The following sections explain the problems in more detail. Several other Web pages are also available on the topic of FFS vs. LFS.

Section 5: FFS Fragmentation

The most serious problem with the paper is Section 5, which attempts to measure the impact of free space fragmentation on FFS performance. This was done in Section 5.3 by taking snapshots of several production FFS disks. The snapshots were used to recreate the disks' block allocation structures on fresh disks, then benchmarks were run on the recreated disks. The results of these benchmarks were compared with the results of the same benchmarks run on clean (unfragmented) disks, to measure the performance degradation due to fragmentation.

Seltzer claims that real workloads will experience the same degradation seen for the benchmarks, but this is not the case. In fact, real workloads are likely to see much greater performance degradation than Seltzer measured, because the benchmarks really measure the performance of very large files; small files tend to suffer more from fragmentation than large files.

The reason for this is explained in Section 5.2 of the paper. In FFS the free space in a cylinder group is unevenly distributed. The first part of a cylinder group tends to be highly fragmented (only a few free blocks distributed far apart), and most of the free space is concentrated at the end of the cylinder group. As explained in Section 5.2, small files tend to be allocated entirely from the beginning of the cylinder group so they are likely to be fragmented. Large files, on the other hand, may consume all of the fragmented free space and have most of their blocks allocated from the more concentrated free space at the end of the cylinder group. Thus large files have better layout than small files. The authors of the paper measured this effect and the measurements appeared in earlier drafts of the paper (these measurements were dropped from the final draft, but I believe that they are available in reference 12 cited by the paper).

Consider the benchmark used in Section 5.3, which creates a large number of files of a given size and measures the throughput. The problem is that the benchmark writes 25 files in a row to a single directory without deleting any files. All 25 of these files will probably fall in the same cylinder group. The first files will consume all of the cylinder group's fragmented space, and the later files will all be allocated in the "good" space at the end of the cylinder group, just as if a single very large file had been written.

Now compare this to real workloads. It is unlikely in practice for 25 new files to be created without any intervening deletes. Real workloads tend to have interspersed creates and deletes, so that new space is constantly being freed throughout the cylinder group and the fragmented free space is never consumed as in the benchmark.

Since the benchmark in Section 5.3 writes 25 files in a row, it is really measuring the fragmentation for files about 25 times as large as those actually used. For example, Section 5.3 used three file sizes: 8KB, 64KB, and 1MB. The 8KB case is really equivalent to writing a single 200KB file, and the 64KB case is equivalent to a single 1.5MB file. Thus the benchmark doesn't really measure small file performance at all (in practice, less than 5% of all files are larger than 200 KB and only about half of all bytes transferred are to files larger than 200 KB). Small-file fragmentation is much worse than large-file fragmentation, so the real-world degradation for 8KB and 64KB files is likely to be worse than Seltzer reports.
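
The arithmetic is trivial but worth making explicit. Here is a one-line check, using the benchmark's 25-files-per-directory count from above:

    # Writing 25 files back-to-back into one cylinder group consumes its free
    # space roughly the way a single file 25 times larger would.
    FILES_PER_RUN = 25
    for size_kb in (8, 64, 1024):
        print(f"{size_kb} KB files behave like one {FILES_PER_RUN * size_kb} KB file")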

There is also a second problem with Section 5, having to do with 8 KB files. Figure 8 shows a performance degradation of a few percent for 8 KB files, but in fact there should be none: 8 KB files occupy only a single block on disk so contiguous block allocation isn't even an issue. It is not obvious to me why there should be any degradation at all for 8KB files; in any case, it would be better to use 16KB files as an example of the smallest files where fragmentation is an issue.

Section 4: Transaction Processing

The other problematic section of the paper is Section 4, which uses a transaction processing workload to compare LFS to FFS. I believe that the results in this section don't make sense without further explanation; until they can be explained I think they should be viewed with skepticism.

The problem with the results in Figure 7 is that LFS shows almost no variation in performance as the disk utilization changes. This benchmark is a pathological case for LFS due to its random access patterns, and I believe that its performance should be very sensitive to disk utilization. See, for example, Figure 4 of the original LFS paper by Mendel Rosenblum (in ACM Transactions on Computer Systems, February 1992).

Seltzer attempts to explain the performance by doing two back-of-the-envelope calculations near the end of Section 4; ultimately she claims that there is only a 10% gap between expected and measured performance. Unfortunately, though, this calculation fails to include two important factors: variation in utilization and the short lifetimes of the indirect blocks. These are explained in the subsections below. In fact, the expected LFS performance is 25-30% higher than what was measured.

Variance in segment utilization

Seltzer's calculation assumes that when an LFS disk is 50% utilized, every segment cleaned will be 50% utilized. However, this is not the case. Under random access patterns there will be variations in the utilizations of segments, and the LFS cleaner can choose the least-utilized segments to clean. Thus, in practice, the utilization of the segments cleaned will be less than the utilization of the disk as a whole. This effect is described in Rosenblum's paper; from Figure 4 in Rosenblum's paper I calculate that if the disk as a whole is 50% utilized, the segments cleaned should be only about 33% utilized, which will yield much better cleaning performance.
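
The effect is easy to reproduce with a small simulation. The sketch below is my own illustration, not code or data from either paper: the segment size, disk size, and update count are arbitrary assumptions, blocks are overwritten uniformly at random, and the cleaner greedily picks the least-utilized segment. The exact figures depend on the parameters, but the segments chosen for cleaning consistently come out well below the disk-wide utilization, which is the point at issue:

    import random

    SEG_BLOCKS = 128          # blocks per segment (assumed)
    NUM_SEGS   = 256          # segments on the simulated disk (assumed)

    def simulate(disk_util, updates=100_000, seed=1):
        rng = random.Random(seed)
        n_live = int(NUM_SEGS * SEG_BLOCKS * disk_util)

        members = [set() for _ in range(NUM_SEGS)]   # live blocks held by each segment
        seg_of  = [0] * n_live                       # segment holding each logical block
        free    = list(range(NUM_SEGS))              # clean segments
        cur, fill = free.pop(), 0                    # current log-head segment

        def append(block):                           # write one block at the log head
            nonlocal cur, fill
            if fill == SEG_BLOCKS:
                cur, fill = free.pop(), 0
            members[cur].add(block)
            seg_of[block] = cur
            fill += 1

        for b in range(n_live):                      # initial sequential layout
            append(b)

        cleaned_utils = []
        for _ in range(updates):
            while len(free) < 2:                     # run the cleaner when space is low
                victim = min((s for s in range(NUM_SEGS)
                              if s != cur and s not in free),
                             key=lambda s: len(members[s]))
                cleaned_utils.append(len(members[victim]) / SEG_BLOCKS)
                survivors = list(members[victim])
                members[victim].clear()
                free.append(victim)
                for b in survivors:                  # copy live data to the log head
                    append(b)
            b = rng.randrange(n_live)                # overwrite a random live block
            members[seg_of[b]].discard(b)
            append(b)

        return sum(cleaned_utils) / len(cleaned_utils)

    for util in (0.5, 0.8):
        print(f"disk {util:.0%} utilized -> cleaned segments ~{simulate(util):.0%} utilized")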

Let's redo Seltzer's back-of-the-envelope calculation (the last one in the paper, in the next-to-last paragraph of Section 4) using Seltzer's disk parameters of 14.5 ms for a random seek and rotational delay, 2.3 MB/sec read bandwidth, 1.7 MB/sec write bandwidth, and a transaction rate of 43 TPS (with cleaning costs ignored). At a 33% utilization of cleaned segments, LFS will read 3 dirty segments, write 1 cleaned segment, and have room for 2 segments' worth of new transactions. It will take 450 ms to read each segment and 603 ms to write the cleaned segment. Then there will be room for 2*168 = 336 new transactions at a rate of 43 TPS, or 7.81 s. Thus the total time for 336 transactions (including both the transactions and the cleaning associated with them) will be 3*450 ms + 603 ms + 7810 ms = 9763 ms, for a rate of about 34.4 transactions per second. The measured rate in Figure 7 is 27 TPS, so there is a discrepancy of about 27% (not 10% as reported in the paper).

Let's also do a calculation for a disk utilization of 80%. From Rosenblum's Figure 4 I calculate the utilization of the cleaned segments to be about 66%, so LFS will read 3 segments for cleaning, write 2 cleaned segments, and have 1 segment for 168 transactions of new data. The total time for 168 transactions will then be 3*450 ms + 2*603 ms + 3900 ms = 6456 ms, or 26 TPS. This is closer to the measured rate of about 24 TPS, but it is still off by about 8%. Also note that this estimate is about 25% lower than the estimate for 50% disk utilization.
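
Both of the estimates above (and the ones in the next subsection) come from the same simple cleaning-cost model, which is easy to write down. The sketch below is mine, not the paper's; the 1 MB segment size is inferred from the 450 ms and 603 ms figures rather than stated in the paper, and the function name is my own:

    # Cleaning-cost model behind the estimates above, using the disk
    # parameters quoted from the paper.

    SEEK_S      = 0.0145      # random seek + rotational delay (s)
    READ_BW     = 2.3e6       # sequential read bandwidth (bytes/s)
    WRITE_BW    = 1.7e6       # sequential write bandwidth (bytes/s)
    SEG_BYTES   = 1.0e6       # segment size (assumed, inferred from the 450/603 ms figures)
    TXN_PER_SEG = 168         # new transactions that fit in one segment
    BASE_TPS    = 43          # transaction rate with cleaning costs ignored

    SEG_READ_S  = SEEK_S + SEG_BYTES / READ_BW    # about 450 ms
    SEG_WRITE_S = SEEK_S + SEG_BYTES / WRITE_BW   # about 603 ms

    def estimated_tps(cleaned_util):
        # Per segment of space recovered for new data, the cleaner reads
        # 1/(1-u) segments and writes back u/(1-u) segments of live data.
        u = cleaned_util
        cleaning_s = SEG_READ_S / (1 - u) + SEG_WRITE_S * u / (1 - u)
        new_data_s = TXN_PER_SEG / BASE_TPS       # time to generate a segment of new data
        return TXN_PER_SEG / (cleaning_s + new_data_s)

    for u in (0.33, 0.66):
        print(f"cleaned segments {u:.0%} utilized -> about {estimated_tps(u):.1f} TPS")

Running this reproduces the estimates of roughly 34 TPS at 33% cleaned-segment utilization and 26 TPS at 66%; plugging in the 25% and 57% utilizations guessed in the next subsection gives roughly 36 and 29 TPS, matching the figures quoted there.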

Short-lived data

The second factor not included in Seltzer's calculation is the effect of indirect blocks. BSD LFS flushes all of the dirty indirect blocks to disk whenever it writes data blocks (I think this is a bug; more on this below). In the benchmark of Section 4 the file has 58 indirect blocks, and the cache holds 173 dirty blocks when disk writes occur. Virtually all of the indirect blocks will be dirty whenever a write occurs, so 58 out of every 173 blocks written to disk will be indirect blocks. Furthermore, these indirect blocks die almost immediately: they will all be overwritten with fresh copies the next time the cache is flushed. The indirect blocks therefore have a much shorter lifetime than the data blocks of the file, and they constitute about 1/3 of all the disk writes.

Rosenblum's paper shows that LFS performs much better when some of the data is short-lived than when all of it is uniformly long-lived (see Figure 7 in the Rosenblum paper), so the measured performance should be better than what was estimated above. I haven't been able to find a way to estimate cleaning costs under this set of conditions that I'm completely comfortable with. My best guess is that at 50% disk utilization the utilization of cleaned segments will be around 25% and at 80% disk utilization the utilization of cleaned segments will be around 57%. This translates into overall performance of 36 TPS and 29 TPS respectively, which are 33% and 20% higher than the measured values. However, I don't completely trust these estimates, so use them with caution.

I don't claim that the measured numbers are wrong, but I do think there is something going on that we don't understand right now. For example, there may be some artifact of the BSD implementation that is affecting the measurements. Until there is a logical explanation for the measurements I think that they should be taken with a grain of salt.

BSD-LFS's writing policy

I believe that there is a bug in the writing policy of BSD-LFS that affects the performance in the measurements of Section 4. The bug is that BSD-LFS writes all dirty indirect blocks whenever it writes any dirty data blocks. In this benchmark, 58 out of every 173 blocks written to disk are indirect blocks. The indirect blocks get modified almost immediately after being written, so they result in a lot of free space on disk that must be reclaimed by cleaning.

I believe that the best policy is to write dirty indirect blocks for a file only if the file has been closed or if the LRU replacement policy determines that the block should be replaced in the cache. In the normal case where a file is opened, written sequentially, and immediately closed, this guarantees that the indirect blocks will be near the file's data on disk. In the case where a file is open a long time and is being accessed randomly, as in the transaction processing benchmark, there is nothing to be gained by writing the indirect blocks. They are not needed for reliability, since there is already enough information in the log to recover the data blocks after a crash without the indirect blocks. If a file is open there will probably be more I/O to the file soon, so the indirect blocks are likely to be modified soon, which invalidates any copies on disk. Writing the indirect blocks just wastes write bandwidth and makes more work for the cleaner.
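
To make the proposed policy concrete, here is a minimal sketch. The types and helper names are invented for illustration; this is not BSD-LFS code, only the decision rule described above:

    from dataclasses import dataclass

    @dataclass
    class Block:
        file_id: int
        is_indirect: bool
        dirty: bool = True
        lru_evicting: bool = False   # set when the cache's LRU policy wants the block out

    def blocks_to_write(dirty_blocks, open_files):
        # Data blocks are always written.  A dirty indirect block is written
        # only if its file has been closed or the LRU policy is about to evict
        # it; otherwise it stays in memory, since the log already contains
        # enough information to recover the data after a crash and the block
        # will almost certainly be dirtied again before the next flush.
        out = []
        for b in dirty_blocks:
            if not b.is_indirect:
                out.append(b)                                   # data block: always write
            elif b.file_id not in open_files or b.lru_evicting:
                out.append(b)                                   # file closed, or being evicted
        return out

    # One open, randomly updated file (as in the TP benchmark): its dirty
    # indirect block is held back and only the data block is written.
    dirty = [Block(file_id=1, is_indirect=False), Block(file_id=1, is_indirect=True)]
    print(len(blocks_to_write(dirty, open_files={1})))          # prints 1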

In the last paragraph of Section 4 Seltzer tries to justify the BSD implementation by arguing that the cache space occupied by dirty indirect blocks is needed for dirty data blocks. However, virtually every indirect block will be touched again before the cache fills, so in fact all the indirect blocks will stay in the cache anyway and no space will be freed for dirty data blocks. If in fact a dirty indirect block ages out of the cache according to the cache's LRU policy, then I agree that it makes sense to write it to disk, but this isn't the case in this particular benchmark.

Thus the results for LFS in Section 4 are a bit pessimistic. For example, this should make a difference of about 1.2ms per transaction in the case where there is no cleaner, which should increase the transaction rate from about 43 TPS to over 45 TPS. I can't estimate how much this will affect LFS performance when the cleaner is turned on, since the absence of the indirect blocks will change the cleaning cost in ways I don't know how to estimate. However, the relative impact of the optimization should be much greater with the cleaner turned on at high disk utilizations than when the cleaner is turned off.
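
The arithmetic behind the 43-to-45 TPS estimate is simply the per-transaction time; a quick check (the 1.2 ms figure is the per-transaction indirect-block write time estimated above):

    # At 43 TPS each transaction takes about 23.3 ms; shaving 1.2 ms of
    # indirect-block writes gives about 22.1 ms, or a bit over 45 TPS.
    base_tps = 43
    saved_ms = 1.2
    print(1000 / (1000 / base_tps - saved_ms))   # about 45.3 TPS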

A Response to Seltzer's Response

This article is a response to Margo Seltzer's response to my critique of her 1995 USENIX paper "File System Logging Versus Clustering: A Performance Comparison".

Here are the four summary paragraphs from Seltzer's response (bulleted and italicized), followed by my comments.

This is a surprising comment, given the number and intensity of discussions we have had on this topic. The 1993 paper is riddled with flaws, including errors in the BSD-LFS implementation, errors in the choice of benchmarks, and errors and omissions in the presentation and analysis of the results. In fact, Seltzer personally fixed two major performance problems in BSD-LFS during the preparation of the 1995 paper, and her response web page describes a flaw in the 1993 paper's analysis. Click here for a more complete critique of the 1993 paper.

In the detailed explanation, Seltzer offers new theories about what might account for the insensitivity of LFS cleaner performance to disk utilization. However, none of these theories is backed up with measurements (the measurements given by Seltzer neither prove nor disprove her theories). In any case they make it even clearer that the paper does not adequately explain what is going on. I stand by my suggestion that the paper's results (and Seltzer's latest theories) should be taken with a grain of salt until the performance can be tied quantitatively to specific architectural features of LFS.

As I explained in my critique, the optimization need not affect the ability of the file system buffer cache to cache dirty data; the current approach represents a performance bug that can be fixed in a way that improves TP performance without adversely affecting other applications.

I'm delighted to see the additional measurements, because they validate my concerns and contradict Seltzer's conclusion above. The new data show that the paper erred by a factor of two in its estimates of performance degradation due to fragmentation. Seltzer's new estimates of performance degradation are 14% for reads and 24% for writes (this is the average of her reported numbers), whereas the average degradations reported in the paper were 6% for reads and 14% for writes. In the worst case, fragmentation degrades read performance by 33% and write performance by 47%.

What's most interesting about this is that the cost of fragmentation in FFS is comparable to the cost of cleaning in LFS. In Rosenblum's measurements of production LFS usage in Sprite, cleaning effectively added 20-60% to the cost of writes (but nothing to the cost of reads); in Seltzer's TP benchmark, which represents a pathological case for LFS, overall degradation due to cleaning was about 45%. However, the cleaner in LFS runs in the background so it may be possible to hide the cleaning costs by cleaning during idle periods; in FFS the fragmentation affects the file layout so there is no way to escape the overhead.

The Real Conclusions

If the original paper by Rosenblum and myself is combined with all of Seltzer's data, including both the data in the two USENIX papers and the data in her web response, I think it is reasonable to draw the following conclusions:

It's also important to remember some of the other advantages of LFS that weren't addressed in Seltzer's measurements, such as faster crash recovery and the ability to handle striped disks and network servers much more efficiently than FFS. Overall, the available data suggests that LFS is a much better file system than FFS.