Write Off-Loading: Practical Power Management for Enterprise Storage

Write off-loading modifies the per-volume access patterns, creating idle periods during which all the volume's disks can be spun down. For our traces this causes volumes to be idle for 79% of the time on average.

July 9, 2009


Power consumption is a major problem for enterprise data centers, impacting the density of servers and the total cost of ownership. This is causing changes in data center configuration and management. Some components already support power management features: for example, server CPUs can use low-power states and dynamic clock and voltage scaling to reduce power consumption significantly during idle periods. Enterprise storage subsystems do not have such advanced power management and consume a significant amount of power in the data center. An enterprise-grade disk such as the Seagate Cheetah 15K.4 consumes 12W even when idle, whereas a dual-core Intel Xeon processor consumes 24W when idle. Thus, an idle machine with one dual-core processor and two disks already spends as much power on its disks as on its processors. For comparison, the core servers in our building's data center have more than 13 disks per machine on average.
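As a back-of-the-envelope check using the idle power figures quoted above (and the 13-disk average mentioned for our servers), disk power quickly dominates; the short calculation below is only illustrative.

    # Back-of-the-envelope idle power comparison using the figures quoted in the text.
    DISK_IDLE_W = 12   # Seagate Cheetah 15K.4, idle
    CPU_IDLE_W = 24    # dual-core Intel Xeon, idle

    two_disk_server = 2 * DISK_IDLE_W    # 24 W: already equals one idle CPU
    typical_server = 13 * DISK_IDLE_W    # 156 W: roughly 6.5x the idle CPU power

    print(f"2 disks:  {two_disk_server} W idle (vs {CPU_IDLE_W} W for the CPU)")
    print(f"13 disks: {typical_server} W idle")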

Simply buying fewer disks is usually not an option, since this would reduce peak performance and/or capacity. The alternative is to spin down disks when they are not in use. The traditional view is that idle periods in server workloads are too short for this to be effective. However, our analysis of real server workloads shows that there is in fact substantial idle time at the storage volume level. We would also expect, and previous work has confirmed, that main-memory caches are effective at absorbing reads but not writes. Thus we would expect to see periods at the storage level where all the traffic is write traffic. Our analysis shows that this is indeed true, and that the request stream is write-dominated for a substantial fraction of time.

This analysis motivated a technique that we call write off-loading, which allows blocks written to one volume to be redirected to other storage elsewhere in the data center. During periods which are write-dominated, the disks are spun down and the writes are redirected, causing some of the volume's blocks to be off-loaded. Blocks are off-loaded temporarily, for a few minutes up to a few hours, and are reclaimed lazily in the background after the home volume's disks are spun up.

Write off-loading modifies the per-volume access patterns, creating idle periods during which all the volume's disks can be spun down. For our traces this causes volumes to be idle for 79% of the time on average. The cost of doing this is that when a read occurs for a non-off-loaded block, it incurs a significant latency while the disks spin up. However, our results show that this occurs rarely.
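The behavior described above can be summarized as a small per-volume request handler. The sketch below is illustrative only, assuming a simple off-load manager with an in-memory map of off-loaded blocks; names such as OffloadManager, remote_log, and spin_up are hypothetical, not the paper's actual interfaces.

    class OffloadManager:
        """Illustrative sketch of write off-loading for one volume (not the authors' code)."""

        def __init__(self, home_volume, remote_log):
            self.home = home_volume      # the volume whose disks we want to spin down
            self.remote = remote_log     # log-structured area on another, spun-up volume
            self.offloaded = {}          # logical block -> location of latest version in remote log

        def write(self, block, data):
            if self.home.spun_down:
                # Redirect the write instead of spinning the disks up.
                self.offloaded[block] = self.remote.append(block, data)
            else:
                self.home.write(block, data)

        def read(self, block):
            if block in self.offloaded:
                # Off-loaded blocks are served from the remote location: no spin-up needed.
                return self.remote.read(self.offloaded[block])
            if self.home.spun_down:
                # Reads of non-off-loaded blocks pay the spin-up latency (the rare, costly case).
                self.home.spin_up()
            return self.home.read(block)

        def reclaim(self):
            """Lazily copy off-loaded blocks back once the home volume is spun up again."""
            if not self.home.spun_down:
                for block, loc in list(self.offloaded.items()):
                    self.home.write(block, self.remote.read(loc))
                    del self.offloaded[block]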

Write off-loading is implemented at the block level and is transparent to file systems and applications running on the servers. Blocks can be off-loaded from any volume to any available persistent storage in the data center, either on the same machine or on a different one. The storage could be based on disks, NVRAM, or solid-state memory such as flash. Off-loading uses spare capacity and bandwidth on existing volumes and thus does not require provisioning of additional storage. Write off-loading is also applicable to a variety of storage architectures. Our trace analysis and evaluation are based on a Direct Attached Storage (DAS) model, where each server is attached directly to a set of disks, typically configured as one or more RAID arrays. DAS is typical for small data centers such as those serving a single office building. Write off-loading can also be applied to network attached storage (NAS) and storage area networks (SANs).

A major challenge when off-loading writes is to ensure consistency. Each write request to any volume can be off-loaded to one of several other locations depending on a number of criteria, including the power state and the current load on the destination. This per-operation load balancing improves performance, but it means that successive writes of the same logical block could be off-loaded to different destinations. It is imperative that the consistency of the original volume is maintained even in the presence of failures. We achieve this by persisting sufficient meta-data with each off-loaded write to reconstruct the latest version of each block after a failure. This meta-data is also cached in memory as soft state for fast access.
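To make the metadata idea concrete, the following is a minimal sketch, assuming a per-write record keyed by (volume, block) with a per-block version number; the field names and the rebuild_soft_state helper are illustrative, not the paper's on-disk format.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class OffloadRecord:
        """Metadata persisted alongside each off-loaded write (illustrative fields only)."""
        home_volume_id: str    # which volume the block logically belongs to
        logical_block: int     # block address on the home volume
        version: int           # monotonically increasing per-block version number
        length: int            # number of blocks in this write

    def rebuild_soft_state(records):
        """After a failure, scan all persisted records and keep only the newest
        version of each (volume, block) pair; this becomes the in-memory cache."""
        latest = {}
        for rec in records:
            key = (rec.home_volume_id, rec.logical_block)
            if key not in latest or rec.version > latest[key].version:
                latest[key] = rec
        return latest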

Write off-loading also maintains the durability and reliability properties of the data. Off-loading can be restricted to remote locations that have at least as much fault tolerance as the off-loading volume, e.g., RAID volumes only off-load to other RAID volumes. Alternatively, the off-loading mechanism also supports replication, i.e., each off-loaded write can be sent to multiple remote locations.
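One way such a policy could be expressed, purely as an illustration (the choose_targets helper and the fault_tolerance, current_load, and replicas names are assumptions, not the paper's API), is to filter candidate destinations by fault tolerance and fan each off-loaded write out to several of them.

    def choose_targets(source, candidates, replicas=1):
        """Pick off-load destinations at least as fault tolerant as the source volume.

        fault_tolerance is assumed to be the number of disk failures a volume can
        survive (e.g. 0 for a plain volume, 1 for RAID-1 or RAID-5).
        """
        eligible = [c for c in candidates
                    if c.fault_tolerance >= source.fault_tolerance and c.spun_up]
        # Prefer lightly loaded targets; send each off-loaded write to `replicas` of them.
        eligible.sort(key=lambda c: c.current_load)
        return eligible[:replicas]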

Volume Access Patterns

The traditional view is that spinning disks down does not work well for server workloads because the periods of idleness are too short. However, many enterprise servers are less I/O intensive than, for example, TPC benchmarks, which are specifically designed to stress the system under test. Enterprise workloads also show significant variation in usage over time, for example due to diurnal patterns.

To better understand the I/O patterns generated by standard data center servers, we traced the core servers in our building's data center to generate per-volume block-level traces for one week. In total, we traced 36 volumes containing 179 disks on 13 servers. The volumes are RAID-1 for system boot volumes and RAID-5 otherwise. We believe that the servers, data volumes, and their access patterns are representative of a large number of small to medium size enterprise data centers. Although access patterns for system volumes may depend on, for example, the server's operating system, we believe that for data volumes these differences will be small.

The traces were gathered per-volume below the file system cache and capture all block-level reads and writes performed on the 36 volumes traced. The traced period was 168 hours (1 week) starting from 5PM GMT on the 22nd February 2007. The traces were collected using Event Tracing For Windows (ETW), and each event describes an I/O request seen by a Windows disk device (i.e., volume), including a timestamp, the disk number, the start logical block number, the number of blocks transferred, and the type (read or write). The total number of requests traced was 434 million, of which 70% were reads; the total size of the traces was 29GB. A total of 8.5TB was read and 2.3TB written by the traced volumes during the trace period.
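As an illustration of how compact each event is, the sketch below assumes a simple comma-separated export of the ETW fields listed above; the TraceRecord type and parse_line helper are hypothetical, and the actual trace format is not shown in this article.

    from typing import NamedTuple

    class TraceRecord(NamedTuple):
        timestamp: float     # seconds since start of trace (assumed units)
        volume: int          # Windows disk device number
        start_block: int     # starting logical block number
        num_blocks: int      # number of blocks transferred
        is_read: bool        # True for reads, False for writes

    def parse_line(line: str) -> TraceRecord:
        # Assumes a comma-separated export such as: "12.503,5,1048576,8,R"
        ts, vol, lbn, n, op = line.strip().split(",")
        return TraceRecord(float(ts), int(vol), int(lbn), int(n), op.upper() == "R")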

Overall, the workload is read-dominated: the ratio of read to write requests is 2.37. However, 19 of the 36 volumes have read/write ratios below 1.0; for these volumes the overall read-write ratio is only 0.18. Further analysis shows that for most of the volumes, the read workload is bursty. Hence, intuitively, removing the writes from the workload could potentially yield significant idle periods.
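Given records like the hypothetical TraceRecord above, the per-volume read/write request ratio is straightforward to compute; this is illustrative code, not the authors' analysis scripts.

    from collections import Counter

    def read_write_ratios(records):
        """Return {volume: read requests / write requests} for a stream of TraceRecords."""
        reads, writes = Counter(), Counter()
        for r in records:
            (reads if r.is_read else writes)[r.volume] += 1
        # Guard against division by zero for volumes that saw no writes.
        return {vol: reads[vol] / max(writes[vol], 1)
                for vol in reads.keys() | writes.keys()}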

Energy Savings vs. Performance

We measured the effects of write off-loading on energy consumption as well as performance by replaying these traces on a hardware testbed. The testbed uses typical high-end enterprise storage hardware: Seagate Cheetah 15,000rpm disks and HP SmartArray 6400 RAID controllers. From the one-week traces, we replayed all volume traces for two days: the day with the most idle time between I/O requests and the day with the least.

We see that just spinning down the disks when idle saves substantial energy; write off-loading yields further savings, and the savings grow as the scope of off-loading is widened. With rack-level off-loading, when the load is low and write-dominated, a single spun-up volume can absorb the off-loads for the entire rack, which means all the other volumes can be spun down.
[Figure: total energy consumption]
However, spinning down disks does have a performance penalty. Requests sent to a spun-down volume suffer a delay: this happens rarely, but the response time penalty is high. Note that "vanilla" spin-down suffers high response times on both reads and writes, whereas with rack-level write off-loading enabled there is little or no penalty for writes. In fact, write off-loading improves mean response times by load-balancing write bursts across multiple volumes and also by using a write-optimized log layout on the remote volumes. With machine-level off-loading the worst-case response time increases slightly: this is because sometimes we get a burst of writes to a well-provisioned but spun-down volume, and the only option is to off-load to another, less well provisioned volume on the same machine. With rack-level off-loading this is not a problem, since the write bursts can be load-balanced across many volumes on different servers.

Since spin-down and write off-loading can be enabled on a per-volume basis (a given volume can be configured to off-load, to receive off-loaded writes, both, or neither), administrators should not enable spin-down for volumes hosting applications that cannot tolerate the performance penalty. In general, write off-loading should also not be enabled for system volumes, to prevent system data (such as OS patches) from being off-loaded.
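As an illustration, a deployment might capture these choices in a small per-volume policy table; the format below is hypothetical, since the article does not specify a configuration syntax.

    # Hypothetical per-volume policy: whether a volume may off-load its own writes
    # and/or accept off-loaded writes from others. System volumes do neither.
    volume_policy = {
        "srv1:C (system)":   {"offload": False, "accept": False},
        "srv1:D (data)":     {"offload": True,  "accept": True},
        "srv2:E (database)": {"offload": False, "accept": True},  # latency-sensitive: never spin down
    }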

Conclusion

Many server I/O workloads have substantial idle time at the volume level due to diurnal load patterns. This idle time can be exploited for significant power reductions by spinning down or powering down the disks in idle volumes. This makes it important for enterprise storage hardware such as disks and RAID controllers to support spin-up and spin-down; the power savings are even greater if write off-loading is then used to extend the idle periods.


Further reading:
[1] Write Off-Loading: Practical Power Management for Enterprise Storage. Dushyanth Narayanan, Austin Donnelly, and Antony Rowstron. ACM Transactions on Storage, 4(3), November 2008.

A complete version of the article above can be accessed online.
