Your Mileage Will Vary: Chunking

Howard Marks

March 18, 2011

I've said many times that nowhere in the field of information technology are the words "your mileage may vary" truer than when discussing data deduplication. How much your data will shrink when run through a given vendor's data deduplication engine can vary significantly depending on the data you're trying to dedupe and how well that particular deduplication engine handles that kind of data. One critical factor is how the deduping engine breaks your data down into chunks.

Most data deduplication engines work by breaking the data into chunks and using a hash function to help identify which chunks contain the same data. Once the system has identified duplicate data, it stores just one copy of each chunk and uses pointers in its internal file or chunk management system to keep track of where that chunk appeared in the original data set.
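
To make that mechanism concrete, here is a toy sketch in Python of the hash-and-pointer bookkeeping. It is not any vendor's implementation: the ChunkStore class, its method names and the choice of SHA-256 are all assumptions made for illustration.

import hashlib

class ChunkStore:
    """Toy dedupe store: each unique chunk is kept once, and every backed-up
    object gets a 'recipe' of chunk hashes pointing back into the store."""

    def __init__(self):
        self.chunks = {}   # chunk hash -> chunk bytes, stored only once
        self.recipes = {}  # object name -> ordered list of chunk hashes

    def put(self, name, chunk_list):
        recipe = []
        for chunk in chunk_list:
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # only new data is stored
            recipe.append(digest)
        self.recipes[name] = recipe

    def get(self, name):
        # Rebuild the original stream by following the pointers.
        return b"".join(self.chunks[digest] for digest in self.recipes[name])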

While most deduplication systems use this basic technique, the details of how they decide what data goes into which chunk vary significantly. Some systems just take your data and break it into fixed-size chunks. The system may, for example, define a chunk as 8KBytes or 64KBytes and then slice your data into chunks of that size regardless of the content of the data.
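
A fixed-size chunker really is that simple; a minimal sketch, assuming 8KByte chunks and a made-up function name, is just a slicing loop:

def fixed_chunks(data, size=8 * 1024):
    """Slice the stream into fixed-size chunks; only the last may be short."""
    return [data[i:i + size] for i in range(0, len(data), size)]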

Other systems analyze the data mathematically, choosing the spots where their secret chunk-making function generates particular values as the boundaries between data chunks. On these systems chunk sizes will vary based on the magic formula, but within limits, so chunks may run from 8KBytes to 64KBytes depending on the data.
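
Here is a simplified sketch of that content-defined approach. Real engines use proprietary rolling fingerprints (Rabin fingerprints are a common choice); this toy version just recomputes a CRC over a small sliding window at each position, cuts wherever the low bits of that CRC match a fixed pattern, and enforces 8KByte minimum and 64KByte maximum chunk sizes. The cdc_chunks name, the window size and the mask are assumptions for illustration, not anyone's actual formula.

import zlib

def cdc_chunks(data, min_size=8 * 1024, max_size=64 * 1024,
               window=48, mask=0x1FFF):
    """Content-defined chunking: cut where the fingerprint of the last
    `window` bytes matches `mask`, but never below min_size or above max_size."""
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_size, len(data))
        cut = end                    # fall back to max_size if no boundary fires
        i = start + min_size
        while i < end:
            # A real engine uses a rolling hash here; recomputing a windowed
            # CRC32 at every position is slow but keeps the sketch short.
            if zlib.crc32(data[i - window:i]) & mask == mask:
                cut = i
                break
            i += 1
        chunks.append(data[start:cut])
        start = cut
    return chunks

Because the cut points depend only on the bytes in the window, the same data produces the same boundaries no matter where it lands in the stream, which is exactly the property the backup example below relies on.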

If we implement these two techniques on backup appliances and back up a set of servers with a conventional backup application like NetBackup or Arcserve, the backup app will walk the file system and concatenate the data into a tape-format file on the backup appliance. Let us, for the sake of clarity, assume that the backup application always backs up the files in the root folder first. Then let us further assume that we create a 129-byte file in the root. (I know--bad server administration, but let me get away with it for the sake of a simple example.) When we run another full backup of the server, all the data in that backup will be offset 129 bytes from where it was in the previous backup, because our little file was backed up early in the process.

The fixed-chunk dedupe engine will break the data into its fixed-size chunks as before, but because each chunk now contains data offset 129 bytes from what it held last time, those chunks will generate different hash values and the system will get a very low deduplication ratio. The variable-chunk system may start off forming chunks that are similarly offset, and therefore different, but because its boundaries are driven by the data itself it will soon establish chunk boundaries in the same places it did the last time, and it should therefore deliver a much higher data reduction ratio.
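
You can watch that play out with the two toy chunkers sketched above. The snippet below, which reuses those fixed_chunks and cdc_chunks functions and invents its own sizes, builds a 1MByte "first full backup" out of random bytes, prepends 129 bytes to stand in for the second full backup of the same server, and counts how many chunks of the second run aren't already in the store:

import os
import hashlib

def chunk_hashes(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

first_full = os.urandom(1024 * 1024)         # the first full backup stream
second_full = os.urandom(129) + first_full   # same data, shifted by the 129-byte file

for name, chunker in (("fixed", fixed_chunks), ("variable", cdc_chunks)):
    already_stored = chunk_hashes(chunker(first_full))
    new_chunks = chunk_hashes(chunker(second_full)) - already_stored
    print(f"{name} chunking: {len(new_chunks)} new chunks to store for the second backup")

On a run like this the fixed chunker has to store essentially every chunk of the second backup again, while the variable chunker typically stores only a chunk or two before its boundaries fall back into step with the previous backup.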

Of course, your mileage may vary, and fixed-chunk-size systems can work well when they're used with the right kind of data or tweaked by using some context from the available metadata.


About the Author

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M. and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
