Expanding Role Of Data Deduplication
May 25, 2010
Data volumes continue to explode: Of the 437 business technology professionals InformationWeek Analytics surveyed for our data deduplication report (available free for a limited time at dedupe.informationweek.com), more than half manage more than 10 TB of data, compared with just 10% who control less than 1 TB. Seven percent manage between 201 TB and 500 TB, and 8% are charged with wrangling more than 500 TB of data. These massive volumes may be a recent development--25% of the 328 business technology pros we surveyed for our 2009 InformationWeek Analytics State of Storage Survey managed less than 1 TB of data--but all indications point to this level of growth being the new normal.
The applications most responsible for the data deluge include the usual suspects: Enterprise databases and data warehouse apps (33%) and e-mail (23%) are cited most in our survey. Rich media, mainly voice and video, was cited by just 16%, but we think the recent surge in voice and video applications will put increasing demands on storage. And yes, we've been warned before about huge looming increases in video traffic that never materialized. But there are good reasons to believe this time may be different, given an increased focus on telecommuting and multimedia. In addition, the American Recovery and Reinvestment Act aims to have up to 90% of healthcare providers in the United States using electronic medical records by 2020. That's a potential tsunami of high-value, regulated--and huge--files.
As more companies jump on the fast track to petabyte land, a variety of vendors have emerged with technologies and management approaches aimed at helping us more efficiently administer large storage pools while lowering costs and increasing security. In our survey on data deduplication, we asked IT pros about their use of some of these technologies, including compression, disk-to-disk-to-tape backups, encryption, virtual tape libraries, thin provisioning, massive array of idle disks (MAID), and data deduplication specifically. Of those, compression is the most commonly used, with 64% of respondents employing the technology in their environments. Survey results show relatively low current adoption rates for data deduplication, with just 24% of respondents using the technology. However, the good news is that 32% are evaluating dedupe, and just 10% say definitively that they won't consider adoption. Only 17% of respondents have deployed thin provisioning, while 15% say they flat out won't; and only 12% say they have deployed MAID, while 17% say they won't.
We found the low adoption rates for these three promising technologies surprising because business as usual is no longer a realistic option. The price of storage in the data center isn't limited to hardware. Escalating power and cooling costs and scarce floor space pose a serious challenge to the "just buy more disks" approach. These three technologies could enhance a well-designed storage plan and--along with increasing disk/platter densities, larger disk drives, and faster performing drives such as solid-state disks--reduce storage hardware requirements.
Of course, compatibility with legacy systems is always an issue. McCarthy Building, a St. Louis-based construction firm with $3.5 billion in annual revenue, uses SATA disks in dual-parity RAID configurations for its Tier 2 storage (more on tiers later). "We replicate production data to a remote site on the same storage," says Chris Reed, director of infrastructure IT. "We deduplicate everywhere we can, especially since the cost is still $0 from NetApp and we haven't seen a performance downside." However, Reed has run into problems with legacy applications, such as the company's Xerox DocuShare document management system, that must run on super-large volumes. That system has grown to more than 4 TB on a single Windows iSCSI volume on a NetApp 3040 cluster, which doesn't support deduplication on volumes larger than 4 TB.
Data deduplication is particularly effective at reducing the storage capacity required for disk-to-disk backup applications. It can also reduce the amount of bandwidth consumed by replication and disaster recovery. We discuss specifics of how data dedupe works in detail in our full report.
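To get a rough feel for why backup data dedupes so well, consider this minimal, hypothetical Python sketch of fixed-size block deduplication. The 4-KB chunk size and SHA-256 fingerprints are our illustrative choices, not any vendor's implementation: identical chunks are stored once, and a second backup adds only the blocks that actually changed.

import hashlib

def dedupe(stream, chunk_size=4096):
    # Split the stream into fixed-size chunks and keep only unique ones.
    store = {}    # fingerprint -> unique chunk (stored once)
    recipe = []   # ordered fingerprints needed to rebuild the stream
    for i in range(0, len(stream), chunk_size):
        chunk = stream[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # store the chunk only if unseen
        recipe.append(digest)
    return store, recipe

# Two nightly backups that differ by one block shrink to three stored chunks:
night1 = b"A" * 8192 + b"B" * 4096
night2 = b"A" * 8192 + b"C" * 4096
store, _ = dedupe(night1 + night2)
print(len(night1 + night2), "logical bytes,", len(store), "unique chunks stored")

Commercial products differ in the details--variable-size, content-defined chunking; collision checks; inline versus post-process operation--but the space savings come from this same principle.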
For those on the fence, recent market events--notably EMC's and NetApp's bidding war for Data Domain--illustrate the importance of dedupe. Companies that don't at least investigate the benefits could be hamstrung by spiraling storage costs.
Lay The Groundwork
Before we go more into dedupe, we want to give a shout out to a cornerstone of any solid data life-cycle management strategy: tiered storage. By matching different types of data with their appropriate storage platforms and media based on requirements such as performance, frequency of access, and data protection levels, tiered storage lets CIOs save money by applying expensive technologies, including data deduplication and thin provisioning, to only the appropriate data.
In a tiered strategy, Tier 1 storage is reserved for demanding applications, such as databases and e-mail, that require the highest performance and can justify the cost of serial-attached SCSI, Fibre Channel SANs, high-performance RAID levels, and the fastest available spindles--or even SSD drives.
Direct attached storage (DAS) is still the top choice of our survey participants for Tier 1 storage of applications such as databases and e-mail. Fibre Channel came in a close second, with 45% of respondents reporting use of Fibre Channel SANs, mainly for Tier 1 storage. Fibre Channel remains strong despite countless predictions of its rapid demise by most storage pundits--and the downright offensive "dead technology walking" label attached by a few.
One survey finding that's not completely unexpected--we were tipped off by similar findings in previous InformationWeek Analytics State of Storage reports--but is nonetheless puzzling is the poor showing by iSCSI SANs, which are the main Tier 1 storage platform for just 16% of respondents. That's less than half the number who report using NAS, our third-place response. Seems most IT pros didn't get the memo that iSCSI would force Fibre Channel into early retirement.
In all seriousness, the continued dearth of interest in iSCSI is mystifying given the current economic backdrop, the widespread availability of iSCSI initiators in recent versions of Windows (desktop and server) and Linux, and the declining cost of 1-Gbps and 10-Gbps connectivity options. We think the slower-than-predicted rate of iSCSI adoption--and the continued success of Fibre Channel--is attributable to a few factors. First, the declining cost of Fibre Channel switches and host bus adapters improves the economic case for the technology. Second, we're seeing slower-than-expected enterprise adoption of 10-Gbps Ethernet, leaving iSCSI at a performance disadvantage against 4-Gbps Fibre Channel.
However, iSCSI's performance future does look bright thanks to emerging Ethernet standards such as 40 Gbps and 100 Gbps that will not only increase the speed limit, but also accelerate adoption of 10-Gbps Ethernet in the short term. In our practice, we also see a reluctance among CIOs to mess with a tried-and-true technology such as Fibre Channel, particularly for critical applications like ERP, e-mail, and enterprise databases. Sometimes, peace of mind is worth a price premium.
Tier 2 comprises the less expensive storage, such as SATA drives, NAS, and low-cost SANs, suitable for apps like archives and backups, where high capacity and low cost are more important than blazing speed. Our survey shows that NAS is the Tier 2 architecture of choice, used by 41% of respondents. DAS is the main Tier 2 storage of 34% of respondents. Once again, iSCSI SAN finished last, with a mere 17% of respondents using the technology primarily for Tier 2 storage. This is an even more surprising result than for Tier 1--we expected iSCSI's low cost relative to Fibre Channel SANs to result in a healthy showing here.
Tier 3 storage typically consists of the lowest-cost media, such as recordable optical or WORM (write once, read many) disks, and is well suited for historical archival and long-term backups.
Applying a tiered strategy lets IT migrate older and less frequently accessed data to lower-cost storage--and in doing so significantly reduces both the growth rate of pricey Tier 1 capacity and overall data center costs. Sounds like a no-brainer, but data classification and planning are essential--including developing policies around retention, appropriate storage architecture, data backup and recovery, growth forecasting and management, and budgeting.
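As a simple illustration of what such a policy can look like once it's codified, here's a hypothetical Python sketch that assigns data to a tier based on how recently it was accessed. The 30-day and one-year cutoffs are invented for the example; real thresholds come out of the classification, retention, and forecasting work described above.

from datetime import datetime, timedelta

# Hypothetical cutoffs for illustration only; real policies come from
# data classification, retention requirements, and growth forecasts.
TIER_RULES = [
    (timedelta(days=30), "Tier 1"),    # hot data stays on fast, expensive storage
    (timedelta(days=365), "Tier 2"),   # warm data moves to SATA/NAS
]

def assign_tier(last_access):
    # Pick a storage tier based on how long ago the data was last touched.
    age = datetime.now() - last_access
    for cutoff, tier in TIER_RULES:
        if age <= cutoff:
            return tier
    return "Tier 3"                    # cold data goes to archival media

print(assign_tier(datetime.now() - timedelta(days=400)))   # prints: Tier 3

An information life-cycle management or automated-tiering product applies this kind of logic continuously, so aging data drifts down the tiers without manual intervention.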
Policy is one area where many are falling behind. For example, one of the most eye-opening results of our survey was the response to our query about data retention periods. The percentage of participants reporting indefinite retention for application data ranged from 30% for Web (wikis and blogs) to a whopping 55% for enterprise database and data warehouse applications. And with the exception of wiki and blog applications and rich media, 50% or more of respondents report at least a five-year retention period--as high as 76% for enterprise databases and data warehouses.
We're clearly struggling to keep up with the complex records management needed to comply with requirements such as the Health Insurance Portability and Accountability Act, related privacy rules, and the Sarbanes-Oxley Act of 2002, just to name a few of the regs bedeviling enterprise IT.
We were also surprised to see that Tier 2 storage growth rates reported by our survey participants weren't dramatically different from Tier 1 growth rates. Twenty-nine percent of respondents reported growth in excess of 25% for Tier 2 storage, compared with 18% for Tier 1, and nearly twice as many respondents are seeing growth rates of 51% to 75% in Tier 2 storage as in Tier 1.
This represents a golden opportunity for IT to adopt more aggressive life-cycle management and shift more growth onto less costly Tier 2 storage. It may also indicate that more automation is needed. To that end, consider deploying information life-cycle management tools or archival systems with automated tiering features. More on those in our full report.