Primary Deduplication Not Good For Backup

Howard Marks

April 23, 2010


As an industry, we have fallen into the trap of thinking of data deduplication as a single technology. When NetApp and EMC were in their bidding war for Datadomain, some analysts wondered why EMC, which already had deduplication technology in Avamar, would want Datadomain's technology. Now that data deduplication is gaining traction in the primary storage market, I thought I would point out that a deduplication system designed for primary data may not be as effective with backup data.

All deduplication systems work by breaking files, or other objects like virtual tapes, into smaller blocks. They then identify the blocks that contain the same data, like the corporate logo on every PowerPoint slide, and use links in their internal file system so that the single stored block can stand in for every other copy of that data across the file system. Breaking the data into blocks is easy; the hard part is figuring out which block alignment will yield the best data reduction. The simplest systems, like NetApp's or ZFS deduplication, simply break each file into fixed-size blocks. This works reasonably well for primary storage file systems that hold a large number of small files, since each file starts on a block boundary. It works especially well for applications like VDI hosting, where there are a lot of duplicate files.
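To make the fixed-block idea concrete, here is a minimal Python sketch of a block store that keeps one copy of each distinct block and a list of links per file. It is only an illustration: the 4 KB block size, the `BlockStore` class and the file names are assumptions made for this example, not how NetApp or ZFS actually implement deduplication.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def split_fixed(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks, starting at offset 0 of the file."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Hypothetical store: one copy of each distinct block, plus per-file links."""

    def __init__(self) -> None:
        self.blocks: dict[bytes, bytes] = {}      # digest -> block contents
        self.files: dict[str, list[bytes]] = {}   # file name -> ordered digests

    def write(self, name: str, data: bytes) -> None:
        recipe = []
        for block in split_fixed(data):
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # store only the first copy
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name: str) -> bytes:
        """Rebuild a file by following its links back to the stored blocks."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def logical_bytes(self) -> int:
        return sum(len(self.blocks[d]) for r in self.files.values() for d in r)

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


# Two made-up files that share their first two blocks.
shared = os.urandom(2 * BLOCK_SIZE)
store = BlockStore()
store.write("deck-a.ppt", shared + os.urandom(BLOCK_SIZE))
store.write("deck-b.ppt", shared + os.urandom(BLOCK_SIZE))
print(store.logical_bytes(), "bytes written,", store.stored_bytes(), "bytes stored")
```

Because both files begin on a block boundary, their shared blocks line up and are stored once: six blocks are written but only four are kept.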

Since the vast majority of today's backup applications create a small number of what are essentially tarballs or .ZIP files when they back up to disk, deduplicating backup targets have to work harder to determine where the block boundaries are. Content-aware systems like Sepaton's and Exagrid's reverse-engineer the backup application's file and/or tape formats so they can identify each source file in the stream and compare it to other copies of that file they've already stored. Other vendors have their own secret sauce, and while Datadomain's hash-based, variable block-size approach made sense when Hugo Patterson, their CTO, explained it to me last week, it's a bit too complicated to describe here.
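For readers who want a feel for what "hash-based, variable block-size" can mean in general, the sketch below shows one generic technique, usually called content-defined chunking. It is not Datadomain's algorithm, which is not described here. The hash, the table of random values, the 4 KB average and the 16 KB cap are all assumptions picked for the example.

```python
import hashlib
import os
import random

AVG_BITS = 12        # assumed: a cut point roughly every 2**12 = 4096 bytes
MAX_CHUNK = 16384    # assumed: force a cut if no boundary shows up by then
MASK64 = (1 << 64) - 1
TOP_BITS = ((1 << AVG_BITS) - 1) << (64 - AVG_BITS)

# Hypothetical lookup table: one random 64-bit value per possible byte.
_rng = random.Random(2010)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def split_by_content(data: bytes) -> list[bytes]:
    """Cut wherever a hash of the most recent bytes hits a chosen pattern."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # The left shift pushes old bytes out, so h reflects only the last 64.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i + 1 - start
        if (h & TOP_BITS) == 0 or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


def shared_chunks(old: bytes, new: bytes) -> tuple[int, int]:
    """Return (chunks of `new` already present in `old`, total chunks in `new`)."""
    known = {hashlib.sha256(c).digest() for c in split_by_content(old)}
    new_chunks = split_by_content(new)
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits, len(new_chunks)


# Made-up data: the same stream with 513 extra bytes inserted at the front.
original = os.urandom(512 * 1024)
hits, total = shared_chunks(original, os.urandom(513) + original)
print(f"{hits} of {total} chunks were already stored")
```

Because the cut points depend on the bytes themselves rather than on their position, inserting data early in a stream moves the chunks without changing most of them. The price is that every byte has to pass through the hash, which is part of why backup targets "have to work harder."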

Now imagine using a simple fixed-block deduping system with a backup stream. Your backup app backs up the C: (system) and E: (data) drives of your server in a single backup job to a single virtual tape file. The system logs are backed up early in the process, and because they have grown since yesterday, all the data that follows is offset 513 bytes from where it was in yesterday's backup. Since 513 bytes is not a multiple of any typical block size, almost every block after the logs now holds a different slice of the data. While there may still be some duplicate blocks, there won't be nearly as many as if the system could reset the alignment.
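The effect is easy to reproduce. The following Python sketch builds two made-up backup streams and counts how many fixed-size blocks of today's stream already exist in yesterday's. The 4 KB block size, the stream sizes and the function names are assumptions for illustration, and the random bytes stand in for real log and file data.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size; real systems vary


def block_digests(stream: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash each fixed-size block, always cutting at multiples of block_size."""
    return [
        hashlib.sha256(stream[i:i + block_size]).digest()
        for i in range(0, len(stream), block_size)
    ]


def duplicate_blocks(old: bytes, new: bytes) -> tuple[int, int]:
    """Return (blocks of `new` already present in `old`, total blocks in `new`)."""
    known = set(block_digests(old))
    digests = block_digests(new)
    return sum(d in known for d in digests), len(digests)


# Stand-ins for the backup stream: system logs first, then everything else.
logs = os.urandom(2 * BLOCK_SIZE)
data = os.urandom(256 * BLOCK_SIZE)
yesterday = logs + data

# Today the logs have grown, either by 513 bytes or by exactly one block.
today_shifted = logs + os.urandom(513) + data
today_aligned = logs + os.urandom(BLOCK_SIZE) + data

print("513-byte shift:", duplicate_blocks(yesterday, today_shifted))
print("one-block shift:", duplicate_blocks(yesterday, today_aligned))
```

With the 513-byte shift, only the two untouched log blocks out of 259 match yesterday's backup. When the same data moves by exactly one block, 258 of 259 match. Nothing about the data changed between the two cases, only its alignment.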

The moral of the story is that not all deduplication schemes are alike. Use primary storage deduplication with the wrong backup app, and you may not see the 20:1 data reduction you're looking for. You'll see some data reduction, but we'll have to try them in the lab to see how much. Disclosure: I am currently working on projects for NetApp and EMC/Datadomain.

About the Author

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M. and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
