Automated Administrator Hints For Optimizing Database Storage Utilization

To accommodate the rapid growth in the number of storage devices across data centers and their associated management overhead, data center administrators are inclining more and more toward isolating storage management at data centers using Storage Area Networks (SAN).

July 29, 2009


Data center services for medium to large enterprises typically host several petabytes of data on disk drives. Most of this storage houses data residing in tens to hundreds of databases. This data landscape is both growing and dynamic: new data-centric applications are constantly added at data centers, while regulatory requirements such as SOX prevent old and unused data from being deleted. Further, the data access characteristics of these applications change constantly. Ensuring peak application throughput at data centers depends on addressing this dynamic data management problem in a comprehensive fashion.

Although SANs allow significant isolation of storage management from server management, the storage management problem itself remains complex. Due to the dynamic nature of modern enterprises, the interaction and use of applications, and even the data associated with a single application, change over time. Dynamic changes in the set of "popular" data result in skewed utilization of network storage devices, both in terms of storage space and I/O bandwidth. In statically allocated storage systems, such skewed utilization eventually degrades application performance, creating the need to buy more storage and thereby increasing overall cost.

To avoid purchasing additional storage when existing storage is under-utilized, data center administrators spend copious amounts of time regularly moving data between storage devices to avoid hotspots. However, optimal data movement is a complex problem: it entails obtaining accurate knowledge of data popularity at the right granularity and choosing from an exponential number of possible target configurations, while ensuring that the volume of data moved is minimal. As a result, manual decision making in large data centers containing several terabytes of data and hundreds or thousands of storage devices is time-consuming, inefficient, and typically produces sub-optimal decisions. Off-the-shelf relational databases account for a large portion of these terabytes, and the manual data management tasks of system administrators mostly involve remapping database entities (tables, indexes, logs, etc.) to storage devices.
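The scale of that search space is easy to underestimate. A quick back-of-the-envelope calculation (the figures below are illustrative, not taken from the article) shows why exhaustive search is hopeless:

```python
# Each of n database objects can be placed on any of d storage devices,
# so there are d**n candidate placements before capacity constraints
# even enter the picture. The numbers here are purely illustrative.
objects, devices = 1000, 100

placements = devices ** objects   # 100**1000 == 10**2000 candidates
digits = len(str(placements))

print(digits)  # a number with 2001 decimal digits
```

Even for a modest data center, the placement space dwarfs anything that can be enumerated, which is why heuristic approaches are required.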

[Figure: STORM storage architecture]

STORM is a database storage management system that combines low-overhead gathering of database access and storage usage patterns with efficient analysis to generate accurate and timely hints for the administrator regarding data movement operations. Moving a large amount of data between storage devices requires considerable storage bandwidth and time, and although such movement is typically done in periods of low activity, such as at night, it nevertheless runs the risk of affecting application performance. Moreover, such data movement operations are so critical that they are seldom run unsupervised, so a longer movement time implies greater administrator cost. A longer time requirement also prompts data center managers to postpone such activities and live with skewed usage for as long as possible. It is therefore critical to minimize the overall data movement in any reconfiguration operation.

STORM addresses the reconfiguration problem with the primary objective of minimizing total data movement and the secondary objective of balancing the I/O bandwidth utilization of the storage devices in a SAN system, subject to storage device capacity constraints. STORM approximates a solution to this exponentially complex problem using a two-stage greedy heuristic. The heuristic tries to move smaller objects before larger ones (i.e., greedy on size), from storage nodes with higher bandwidth utilization to storage nodes with lower bandwidth utilization (i.e., greedy on I/O bandwidth utilization).

The STORM approach requires gathering usage data for various database objects such as tables and indices. A journal article about STORM [1] describes non-intrusive techniques for gathering such data in Oracle and SQL Server; a similar approach is available in other databases such as MySQL, PostgreSQL, and DB2. The article also describes how such data can be used to reach a near-optimal decision about which database object should be stored on which storage device in a SAN system.
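The greedy-on-size, greedy-on-utilization idea can be sketched in a few lines of code. This is a minimal illustration of the general technique, not STORM's actual implementation; the `Node` class, the tolerance parameter, and the per-object `(name, size, io_load)` tuples are all hypothetical simplifications:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A hypothetical storage node: free capacity plus hosted objects."""
    name: str
    capacity: int  # free space, e.g. in GB
    objects: list = field(default_factory=list)  # (obj_name, size, io_load)

    @property
    def io_load(self):
        return sum(load for _, _, load in self.objects)

def rebalance(nodes, tolerance=0.1, max_moves=100):
    """Greedy sketch: repeatedly move the smallest movable object from
    the busiest node to the least-busy node that has room for it."""
    moves = []
    for _ in range(max_moves):
        nodes.sort(key=lambda n: n.io_load)
        cold, hot = nodes[0], nodes[-1]
        avg = sum(n.io_load for n in nodes) / len(nodes)
        if hot.io_load - cold.io_load <= tolerance * avg:
            break  # utilization already balanced within tolerance
        # Greedy on size: consider the hot node's smallest objects first.
        for obj in sorted(hot.objects, key=lambda o: o[1]):
            name, size, load = obj
            # Move only if the cold node has space and the move helps.
            if size <= cold.capacity and cold.io_load + load < hot.io_load:
                hot.objects.remove(obj)
                hot.capacity += size
                cold.objects.append(obj)
                cold.capacity -= size
                moves.append((name, hot.name, cold.name))
                break
        else:
            break  # no object can be moved productively; stop
    return moves

# Toy run: one overloaded node, one idle node with free space.
a = Node("A", 0, [("table1", 10, 50), ("index1", 2, 30)])
b = Node("B", 100)
print(rebalance([a, b]))  # moves the small index off the hot node
```

Note that the sketch stops as soon as no single move narrows the gap, which keeps total data movement small at the cost of possibly leaving some residual imbalance, mirroring the trade-off STORM's heuristic makes between move volume and balance quality.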

The STORM approach is database and storage system independent. Balancing database storage utilization across storage devices is simple within the confines of a single database system. However, it is very common in enterprise systems for multiple heterogeneous databases to share a common pool of SAN stores. In such a scenario, balancing the utilization of storage devices in the SAN necessarily calls for a global solution that can work on top of any number of storage and database systems. STORM serves this purpose.

An extensive simulation-based evaluation of STORM revealed that its heuristic converges to an acceptable solution, balancing storage utilization to within 7% of the ideal solution. With the TPC-C database benchmark, STORM improved overall performance by as much as 22% by reconfiguring an initial random, but evenly distributed, placement of database objects.

Further reading:
[1] Kaushik Dutta, Raju Rangaswami, and Sajib Kundu. "Workload-based Generation of Administrator Hints for Optimizing Database Storage Utilization." ACM Transactions on Storage, 3(4), February 2008.

A complete version of the article above can be accessed online.
