Self-Describing Data
With storage costs under the microscope, efficient capacity utilization is critical. That's where data naming - or Self-Describing Data - comes in.
February 16, 2004
Oversubscribed and Underused
A key driver of storage cost is how efficiently we use the capacity available for storing application data. The phrase "oversubscribed and underutilized" crops up frequently in discussions with storage practitioners and industry observers as a kind of shorthand. However, like most industry buzz phrases, the expression has been interpreted in different ways by different vendors, introducing considerable confusion to both the analysis of the storage-cost problem and the development of strategies for addressing it.
In a nutshell, the phrase refers to the all-too-common practice of purchasing too much storage and using it inefficiently. Storage administrators overbuy because they don't always do their homework to understand how much space their applications and data actually require. To make matters worse, they underuse the space by storing infrequently accessed--and, in some cases, useless--data on the expensive gear.
Many "solutions" have appeared, from intelligent arrays and life-cycle management controllers to virtualization and hierarchical storage-management software. For the most part, however, these products blur the line between capacity-allocation efficiency and capacity-utilization efficiency. The former can yield some storage cost-of-ownership improvements, but these are dwarfed by the huge gains that could be made with an effective system of utilization management--real ILM.
Lingo Lowdown
The definition of storage capacity is a no-brainer: It's the physical space available for recording bits of data on storage media. Physical capacity imposes a fixed storage limit, though the size and shape of the data being stored can vary. You can apply compression algorithms, for example, to reduce the capacity requirements of certain types of data, but the real capacity of the media doesn't change.
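To illustrate the distinction, here is a minimal Python sketch using the standard zlib module. The repetitive sample payload is invented, and the savings shown depend entirely on the data; the point is simply that compression shrinks what must be written without changing the physical capacity of the media.

```python
import zlib

# Illustrative payload: highly repetitive data compresses well.
payload = b"transaction_record;" * 10_000

compressed = zlib.compress(payload, level=9)

print(f"logical size: {len(payload):>8} bytes")
print(f"stored size:  {len(compressed):>8} bytes")
# Compression reduces the space this data occupies, but the real
# capacity of the disk it lands on is unchanged.
```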
Capacity allocation refers to the provisioning of physical capacity for discrete data sets. A disk or disk array with a certain capacity may be allocated to store the elements of a database, for example, while another disk or disk array may be allocated to store the data from an engineering CAD/CAM program, a Web site or user files. An efficient capacity-allocation system adapts to changing data storage requirements it senses or even anticipates, reallocating capacity on the fly. An inefficient capacity-allocation system doesn't respond to such changes automatically, introducing downtime when application processes encounter "disk full" conditions and ceasing operation until sufficient capacity is allocated manually.
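As a rough illustration of the allocation side, the following Python sketch polls free space on a pair of hypothetical volumes and flags any that are running low. The mount points and the 10 percent threshold are assumptions for illustration, not recommendations.

```python
import shutil

# Hypothetical mount points for two allocated volumes.
VOLUMES = {"database": "/mnt/db_lun", "cad": "/mnt/cad_lun"}
LOW_WATER_MARK = 0.10  # flag volumes with less than 10% free capacity

def volumes_needing_capacity(volumes=VOLUMES, threshold=LOW_WATER_MARK):
    """Return volumes whose free space has fallen below the threshold."""
    flagged = []
    for name, path in volumes.items():
        usage = shutil.disk_usage(path)
        free_ratio = usage.free / usage.total
        if free_ratio < threshold:
            flagged.append((name, path, free_ratio))
    return flagged

# An efficient allocation system would act on this signal automatically,
# for example by extending the volume; an inefficient one waits for the
# application to hit a "disk full" condition.
for name, path, free in volumes_needing_capacity():
    print(f"{name} ({path}): only {free:.0%} free -- allocate more capacity")
```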
Capacity utilization, by contrast, refers to processes designed to ensure that the use of allocated capacity is optimized according to the characteristics of the data being stored and the cost and capabilities of the devices being used to store it.
An efficient capacity-utilization system considers two general categories of data characteristics: frequency of access to the data, and requirements for its storage that are derived or "inherited" from the business process and application used to create the data. Access frequency is an important characteristic because it helps identify whether data must be stored "online"--on platforms that provide instantaneous accessibility to servers and end users--or can be moved to "near-line" or "offline" platforms that provide decreasing accessibility.
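The access-frequency half of that equation can be sketched in a few lines of Python. The tier names, idle-time thresholds and directory below are illustrative only; a real policy would come from the business.

```python
import time
from pathlib import Path

# Illustrative thresholds; a real policy would come from the business.
NEAR_LINE_AFTER_DAYS = 30
OFFLINE_AFTER_DAYS = 180

def suggested_tier(path: Path) -> str:
    """Suggest a storage tier from the file's last-access timestamp."""
    days_idle = (time.time() - path.stat().st_atime) / 86400
    if days_idle > OFFLINE_AFTER_DAYS:
        return "offline"    # tape, optical or other deep archive
    if days_idle > NEAR_LINE_AFTER_DAYS:
        return "near-line"  # slower, cheaper disk
    return "online"         # fast primary storage

# Hypothetical user-file share to survey.
for f in Path("/mnt/user_files").rglob("*"):
    if f.is_file():
        print(f, suggested_tier(f))
```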
An efficient capacity-utilization system also takes into account the requirements that data inherits from the application used to generate it. Data from a critical application, for example, inherits the "criticality" of the application itself. In a disaster, this data must be recovered first--together with the application software that created it. Data may also inherit retention requirements from the originating application--how many days, weeks, months or years it must be kept, and even how it must be stored--especially if the application (and the business process it supports) is subject to regulatory requirements such as the Gramm-Leach-Bliley Act or the Sarbanes-Oxley Act.
In other cases, the data may inherit security or privacy requirements from applications to comply with regulatory or legal mandates like the Health Insurance Portability and Accountability Act. Other, more esoteric characteristics may be inherited from applications and vary from one company to another. For example, an application used to stream video or audio files across the Internet may require that its data be stored on the outermost disk tracks, the longest contiguous area of storage available on the disk, to reduce jitter during playback.
These data-related characteristics must be matched to a matrix of storage-platform costs and capabilities through a thorough analysis of the storage infrastructure. Different storage platforms offer different attributes in terms of topological accessibility, RAID levels, replication schemes, security features and speeds, for instance, and virtually all platforms carry different costs as a function of their design, length of service and depreciation.
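The matching exercise itself is straightforward to sketch. In the Python fragment below, the platform names, attributes and per-gigabyte costs are invented for illustration; a real matrix would come from an audit of the actual infrastructure.

```python
# Invented matrix of platform capabilities and costs.
PLATFORMS = [
    {"name": "mirrored FC array", "replicated": True,
     "access": "online",    "cost_per_gb": 18.0},
    {"name": "SATA array",        "replicated": False,
     "access": "near-line", "cost_per_gb": 5.0},
    {"name": "tape library",      "replicated": False,
     "access": "offline",   "cost_per_gb": 0.5},
]

def cheapest_suitable_platform(requirements):
    """Return the lowest-cost platform meeting every stated requirement."""
    candidates = [
        p for p in PLATFORMS
        if p["access"] == requirements["access"]
        and (not requirements.get("needs_replication") or p["replicated"])
    ]
    return min(candidates, key=lambda p: p["cost_per_gb"], default=None)

# Data from a critical OLTP application: must stay online and replicated.
print(cheapest_suitable_platform({"access": "online", "needs_replication": True}))
# Aged report data: near-line access is acceptable, no replication needed.
print(cheapest_suitable_platform({"access": "near-line"}))
```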
The objective of an efficient capacity-utilization system is to ensure that data is placed on the right platform when it's created, and is migrated to the most cost-effective platforms--those that meet its inherent requirements at the lowest possible cost--throughout its useful life. Once the data no longer needs to be retained, such a system deletes the data from all platforms automatically.
The Data-Naming Game
How Self-Describing Data Makes for Efficient Capacity Utilization
A true ILM solution implements efficient capacity utilization. Such a solution must provide a mechanism for analyzing applications to discern the characteristics they impart to the data they generate. It must also provide a facility for creating an easy-to-use schema to store the categories of these characteristics, to be used subsequently to add a header or other data-naming "artifact" to data upon creation. And it must provide a mechanism for applying a self-describing header to data before the data is written to any storage platform.
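One way such a header might look is sketched below in Python. The schema fields and the length-prefixed layout are assumptions for illustration, not a published data-naming standard.

```python
import json
import time

def make_header(application, criticality, retention_days, privacy):
    """Build a self-describing header from characteristics the data inherits."""
    return {
        "application": application,
        "criticality": criticality,        # drives recovery priority
        "retention_days": retention_days,  # drives eventual deletion
        "privacy": privacy,                # e.g. HIPAA-governed data
        "created": time.time(),
    }

def write_named_data(path, payload: bytes, header: dict):
    """Prepend the self-describing header to the data before it is stored."""
    header_bytes = json.dumps(header).encode()
    with open(path, "wb") as f:
        f.write(len(header_bytes).to_bytes(4, "big"))  # header length prefix
        f.write(header_bytes)
        f.write(payload)

# Hypothetical record from a claims-processing application.
write_named_data(
    "claim_0001.dat",
    b"...claim record...",
    make_header("claims_processing", "critical", 365 * 7, "HIPAA"),
)
```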
An ILM system demands some other components and functions as well: a knowledge base about storage-platform costs and capabilities; a mechanism that collects and documents storage-infrastructure information and then arranges it in a set of class- or cost-of-service descriptions to reflect the various combinations of storage-infrastructure components available to meet the requirements of different types of named data; and an access-frequency counter function that runs in the infrastructure and checks stored data at regular intervals to determine how often the data has been accessed.
ILM also requires a policy engine that lets users map classes of self-describing data and information from the access-frequency counter function to classes of infrastructure, creating policies that automate the migration of data through the storage infrastructure. This component, together with the enabling components described above, is critical if ILM is to achieve its capacity-utilization efficiency objective.
No ILM vendor yet offers the full suite of functionality described above for a heterogeneous storage environment. The reason, aside from the desire of most vendors to lock consumers into proprietary hardware and software, is that data-naming schemes have not been a development priority within the industry or the standards groups.
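Even without a shipping product, the policy-engine idea is easy to sketch. In the Python toy below, the data classes, idle-time thresholds and infrastructure classes are all invented; a real engine would draw them from the naming schema and the infrastructure knowledge base described above.

```python
import time

# Hypothetical policy table: (idle days, infrastructure class) per data class.
POLICIES = {
    "critical":     [(0, "replicated online"), (90, "near-line"), (730, "delete")],
    "non-critical": [(0, "online"),            (30, "near-line"), (365, "delete")],
}

def placement(data_class, last_access, now=None):
    """Return the infrastructure class the policy currently calls for."""
    now = now or time.time()
    idle_days = (now - last_access) / 86400
    target = None
    for threshold, infra_class in POLICIES[data_class]:  # thresholds ascending
        if idle_days >= threshold:
            target = infra_class
    return target

# A record of the "critical" class untouched for 100 days:
print(placement("critical", time.time() - 100 * 86400))  # -> near-line
```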
But Microsoft's stated objective of replacing its file system with an object-oriented SQL database in its next-generation Windows server OS, code-named "Longhorn," may present new opportunities for data naming. If all files become objects in a database, an opportunity may be created for data description. Adding a self-describing header might be as simple as adding a row above the objects in the database.
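Purely as an illustration of that idea--and using generic SQL through Python's sqlite3 module, not the Longhorn API--descriptive attributes could ride alongside each stored object as part of the same record:

```python
import sqlite3

# Illustrative only: if files are records in a database, self-description
# becomes additional attributes stored with the object itself.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE file_objects (
        name        TEXT,
        content     BLOB,
        application TEXT,     -- self-describing attributes travel
        criticality TEXT,     -- with the object they describe
        retain_days INTEGER
    )
""")
con.execute(
    "INSERT INTO file_objects VALUES (?, ?, ?, ?, ?)",
    ("claim_0001.dat", b"...claim record...",
     "claims_processing", "critical", 2555),
)
# Storage management tools could then query the description directly.
for row in con.execute(
    "SELECT name, criticality FROM file_objects WHERE retain_days > 365"
):
    print(row)
```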
If Longhorn is a hit and administrators of the 80-plus percent of open-system servers deployed today go along with Microsoft's file-system replacement, a huge opportunity may present itself for implementing a data-naming scheme at the source of data creation. Other vendors, such as Oracle and IBM, would likely follow suit with OODB-based file systems for Unix and Linux platforms, creating additional opportunities to implement self-describing data. (Both vendors have suggested a database-as-file-system strategy from time to time in white papers since the mid-1990s.)
Until that happens, current ILM "solutions" using proprietary approaches may deliver incremental improvements in storage-cost containment. However, as with all "stovepipe" approaches, short-term gains may yield longer-term losses by locking consumers into proprietary technologies. Backing data out of a proprietary ILM scheme that has become less than efficient could cost more than loading data into that scheme in the first place.
Jon William Toigo is CEO of storage consultancy Toigo Partners International, founder and chairman of the Data Management Institute, and author of 13 books, including Disaster Recovery Planning: Preparing for the Unthinkable (Pearson Education, 2002) and The Holy Grail of Network Storage Management (Prentice Hall PTR, 2003). Write to him at [email protected].