Notes on a Nine Year Study of File System and Storage Benchmarking


July 16, 2009

7 Min Read

Driven by a general sense that benchmarking practices in the areas of file and storage systems are lacking, we conducted an extensive survey of the benchmarks published in relevant conference papers in recent years. We decided to evaluate the evaluators, if you will. Our May 2008 ACM Transactions on Storage article, entitled "A Nine Year Study of File System and Storage Benchmarking," surveyed 415 file system and storage benchmarks from 106 papers published in four highly regarded conferences (SOSP, OSDI, USENIX, and FAST) between 1999 and 2007.

Our suspicions were confirmed. We found that most popular benchmarks are flawed, and that many research papers used poor benchmarking practices and did not provide a clear indication of the system's true performance. We evaluated benchmarks qualitatively as well as quantitatively: we conducted a set of experiments to show how some widely used benchmarks can conceal or overemphasize overheads. Finally, we provided a set of guidelines that we hope will improve future performance evaluations; an updated version of the guidelines is available on our Web site.

Benchmarks are most often used to provide an idea of how fast some piece of software or hardware runs.  The results can significantly add to, or detract from, the value of a product (be it monetary or otherwise).  For example, they may be used by potential consumers in purchasing decisions, or by researchers to help determine a system's worth.  

Systems benchmarking is a difficult task, and many of the lessons learned from this article are general enough that they can be applied to other system fields.  However, file and storage systems have special properties. Complex interactions between I/O devices, caches, kernel daemons, and other OS components result in behavior that is rather difficult to analyze. Moreover, systems have different features and optimizations, so no single benchmark is always suitable.  Lastly, the large variety of workloads that these systems experience in the real world also adds to this difficulty. 
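To make the caching point concrete, here is a minimal sketch (Linux-specific, assuming root privileges, and using a hypothetical test file path) that times the same sequential read with a cold and a warm page cache; the two measurements often differ by an order of magnitude, which is exactly the kind of effect a benchmark must account for.

import os, time

def time_read(path, cold_cache=False):
    """Time a full sequential read of path, optionally with a cold page cache."""
    if cold_cache:
        # Flush dirty data, then ask the kernel to drop clean caches
        # (pagecache, dentries, inodes). Requires root on Linux.
        os.system("sync")
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
    start = time.time()
    with open(path, "rb") as f:
        while f.read(1 << 20):  # read in 1 MiB chunks until EOF
            pass
    return time.time() - start

# /tmp/testfile is a placeholder for any reasonably large file.
print("cold cache: %.3f s" % time_read("/tmp/testfile", cold_cache=True))
print("warm cache: %.3f s" % time_read("/tmp/testfile"))

Dropping the page cache between runs is one common way to make cold-cache numbers repeatable; whether cold-cache or warm-cache behavior is the relevant one depends on the workload being modeled, which is itself a methodological choice worth stating.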

When a performance evaluation of a system is presented, the results and their implications must be clear to the reader. This includes accurately depicting behavior under realistic workloads and in worst-case scenarios, as well as explaining the reasoning behind the benchmarking methodology. In addition, the reader should be able to verify the benchmark results and compare the performance of one system with that of another. To accomplish these goals, much thought must go into choosing suitable benchmarks and configurations, and accurate results must be properly conveyed. As part of our survey, we checked how well the conference papers (which included some of our own) performed these tasks.

For example, to depict the behavior of a benchmark accurately, it should generally run for at least a few minutes, it should be run multiple times to collect several data points, and the results should be reported with some metric of statistical dispersion (e.g., standard deviation, confidence intervals, or quartiles). In the surveyed papers, approximately 29% of benchmarks ran for less than one minute, which is generally too short to produce accurate results at steady state. Further, about half of the papers did not specify how many times they ran a benchmark, and fewer than 20% ran the benchmark more than five times; we recommend at least ten data points to provide a clear picture of the results. Only about 45% of the surveyed papers included any mention of statistical dispersion. Finally, in terms of portraying the behavior of the system accurately, about 38% of papers used only one or two benchmarks in their performance analysis, which is generally not adequate for providing a complete picture. We have posted the raw data from our survey in an online appendix so that others can review it and use it in future studies.
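As a concrete illustration of these recommendations, the sketch below (the benchmark command and the t-value are our own placeholder assumptions, not part of the survey) runs a workload ten times and reports the mean, the sample standard deviation, and a 95% confidence interval:

import statistics, subprocess, time

def run_once(cmd):
    # Time a single benchmark run; cmd is a placeholder for any workload.
    start = time.time()
    subprocess.run(cmd, shell=True, check=True)
    return time.time() - start

def summarize(samples, t_value=2.262):  # Student's t for 9 degrees of freedom, 95% CI
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    half_width = t_value * stdev / (len(samples) ** 0.5)
    return mean, stdev, half_width

samples = [run_once("./my_benchmark") for _ in range(10)]  # at least ten data points
mean, stdev, ci = summarize(samples)
print(f"mean {mean:.2f}s, stdev {stdev:.2f}s, 95% CI +/- {ci:.2f}s")

Reporting the confidence half-width alongside the mean lets a reader judge whether a claimed improvement is larger than the measurement noise.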

Our survey included descriptions and qualitative analyses of every macro-benchmark used in the surveyed papers, as well as other benchmarks that we deemed worthy of discussion. We also conducted experiments to perform a more quantitative analysis of two very popular benchmarks: a compile benchmark and Postmark (a mail server workload). To do this, we modified the Linux ext2 file system to slow down certain operations; we called this new file system Slowfs. A compile benchmark measures the time taken to compile a piece of software. For OpenSSH compilations, we slowed down Slowfs's read operations (the most time-consuming operations for this benchmark) by up to 32 times, and yet the largest elapsed-time overhead we observed was only 4.5%! For the Postmark experiments, we used three different workload configurations derived from publications, running each with both ext2 and Slowfs. We learned two lessons. First, Postmark's configuration parameters can cause large variations even on ext2 alone: run times varied from 2 to 214 seconds, with the 2-second configuration performing no I/O at all! This problem is aggravated by the fact that few papers report all parameters: in our survey, only 5 out of 30 papers did so. Second, some configurations showed the effects of Slowfs more than others.
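The compile result is less surprising than it first appears: compilation is dominated by CPU time, so even a large slowdown of an operation that accounts for little of the elapsed time barely moves the total. A back-of-the-envelope calculation (the numbers below are illustrative assumptions, not measurements from the paper) makes this concrete:

# Illustrative figures only: suppose read() accounts for 0.15% of a
# 100-second compile, and the rest is CPU-bound compilation.
total = 100.0           # elapsed seconds, hypothetical
read_fraction = 0.0015  # fraction of elapsed time spent in read(), hypothetical
slowdown = 32           # Slowfs read slowdown factor

read_time = total * read_fraction
new_total = (total - read_time) + read_time * slowdown
overhead = (new_total - total) / total
print("elapsed-time overhead: %.1f%%" % (overhead * 100))  # only a few percent, despite 32x slower reads

This is essentially Amdahl's Law at work: a benchmark can only expose overheads in the operations it actually exercises, which is why a compile benchmark reveals little about read performance.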

We recommend that, given the current set of available benchmarks, an accurate way to convey a file or storage system's performance is to use at least one macro-benchmark or trace, as well as several micro-benchmarks. Macro-benchmarks and traces are intended to give an overall idea of how the system might perform under some workload. If traces are used, then special care should be taken with regard to how they are captured and replayed, and how closely they resemble the intended real-world workload. In addition, micro-benchmarks can be used to help understand the system's performance, to test multiple operations to provide a sense of overall performance, or to highlight interesting features of the system (such as cases where it performs particularly well or poorly).
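As an example of the kind of micro-benchmark we have in mind, the sketch below (the directory, file count, and file size are arbitrary placeholders) times small-file creation, read, and deletion as separate phases, so that a slowdown in any one operation is visible rather than averaged away:

import os, time

def timed(label, fn, ops):
    # Run one phase and report its throughput in operations per second.
    start = time.time()
    fn()
    elapsed = time.time() - start
    print("%-8s %10.0f ops/sec" % (label, ops / elapsed))

DIR = "/tmp/microbench"   # placeholder target directory on the file system under test
N = 10_000                # number of small files
os.makedirs(DIR, exist_ok=True)
paths = [os.path.join(DIR, "f%d" % i) for i in range(N)]

def create_files():
    for p in paths:
        with open(p, "wb") as f:
            f.write(b"x" * 4096)   # 4 KiB per file

def read_files():
    for p in paths:
        with open(p, "rb") as f:
            f.read()

def delete_files():
    for p in paths:
        os.remove(p)

timed("create", create_files, N)
timed("read",   read_files,   N)
timed("delete", delete_files, N)
os.rmdir(DIR)

Running each phase with and without a warm cache, and on a freshly mounted file system, helps separate the cost of the operations themselves from caching effects.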

Performance evaluations should better describe not only what the authors did, but also why they did it, which is equally important. Explaining the reasoning behind one's choices is an important principle in research, but it is not being followed consistently in file system and storage performance evaluations. Ideally, an evaluation should include some analysis of the system's expected behavior, with benchmarks chosen to confirm or refute those expectations. Such analysis can offer more insight into a behavior than a graph or table alone.

We believe that the current state of performance evaluations has much room for improvement, and the evidence presented in our survey supports this belief. Computer science is still a relatively young field, and its experimental evaluations need to move further in the direction of precise science. One part of the solution is that standards clearly need to be raised and defined; this will have to be done both by reviewers putting more emphasis on a system's evaluation and by the researchers conducting the experiments. Another part is that this information needs to be better disseminated; we hope that this article, as well as our continuing work, will help researchers and others understand the problems that exist with file and storage system benchmarking. The final part of the solution is creating standardized benchmarks, or benchmarking suites, based on open discussion among file system and storage researchers.

Our article focused on benchmark results published in venues such as conferences and journals. Another aspect is standardized industrial benchmarks. There, how the benchmark is run or chosen, and how the results are presented, are of little interest, as these are all standardized. An interesting question, though, is how effective these benchmarks are, and how the standards shape the products being sold today (for better or worse).

The goal of this project is to raise awareness of issues relating to proper benchmarking practices for file and storage systems. We hope that with greater awareness, standards will be raised, and more rigorous and scientific evaluations will be performed and published. Since the article was published in May 2008, we have held a workshop on storage benchmarking at UCSC and a Birds-of-a-Feather (BoF) session at the 7th USENIX Conference on File and Storage Technologies (FAST 2009). We have also set up a mailing list for announcements of future events, as well as discussions. More information can be found on our Web site, http://fsbench.filesystems.org/.

Further reading:

[1] Avishay Traeger, Erez Zadok, Nikolai Joukov, and Charles P. Wright. "A Nine Year Study of File System and Storage Benchmarking." ACM Transactions on Storage, 4(2), May 2008.

A complete version of the article above can be accessed online.

