More On Chunking
March 28, 2011
The last time we looked at the chunking process in data deduplication engines ("Your Mileage Will Vary: Chunking"), we were looking pretty favorably at variable chunking that used the contents of the data to assign chunk boundaries. However, as deduplication moves from backup appliances accepting tape, or other backup application-specific format data, into backup applications and primary storage, the advantages of fixed-chunk deduplication start to become apparent.
The primary advantage of fixed-chunk deduplication is lower CPU overhead. Fixed-chunk systems don't have to spend any CPU cycles examining data and determining where the chunk borders should be. They just break data up into chunks like any other file system. In fact, some primary storage deduplication, like NetApp's, uses just the underlying file system's chunks for its foundation.
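To make the "no cycles spent finding boundaries" point concrete, here is a minimal Python sketch of fixed-chunk deduplication. It is an illustration under assumptions, not any vendor's implementation: the 4 KB chunk size, the SHA-256 fingerprint, the in-memory dictionary used as a chunk store, and the names fixed_chunks, dedupe and restore are all hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for illustration


def fixed_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data at fixed offsets; the content is never inspected."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedupe(data: bytes, store: dict, size: int = CHUNK_SIZE):
    """Keep one copy of each unique chunk, keyed by its SHA-256 digest.

    Returns the ordered list of digests needed to rebuild the data.
    """
    recipe = []
    for chunk in fixed_chunks(data, size):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # only the first copy is stored
        recipe.append(digest)
    return recipe


def restore(recipe, store):
    """Rebuild the original bytes from the list of digests."""
    return b"".join(store[digest] for digest in recipe)
```

The only per-chunk work here is the fingerprint and the lookup; where a chunk starts and ends is simple arithmetic on the offset.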
Lower overhead also means lower latency; computing where to put the chunk boundaries takes some time. While vendors have done their best to reduce this additional latency, and will claim it's not noticeable, it exists and might be a problem for primary storage deduplication systems.
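For contrast, the sketch below shows roughly where that extra time goes in a variable-chunk engine: a rolling hash has to be updated for every byte before a boundary can be chosen. This is a simplified Gear-style chunker written for illustration only. The table construction, the mask, the minimum and maximum chunk sizes and the function name content_defined_chunks are all assumptions, and real products use their own tuned algorithms.

```python
import hashlib

# Assumed lookup table: one pseudo-random 32-bit value per possible byte.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]
MASK = (1 << 11) - 1  # a boundary fires on roughly 1 byte in 2048


def content_defined_chunks(data: bytes, min_size: int = 256,
                           max_size: int = 8192):
    """Cut chunks where a rolling hash of recent bytes matches a pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # This update runs once per byte; fixed chunking skips it entirely.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= min_size and (h & MASK) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because the boundary decision depends only on the last few bytes seen, an insertion early in the data tends to disturb just the chunks near it, which is the benefit the per-byte cost buys.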
Backup applications are a simple lot. In their heart of hearts, they just want to send a stream of data to a tape drive somewhere. Since they're making large sequential write requests to a small number of large files, a few milliseconds of latency per request won't have a big impact. For conventional backup applications like NetBackup or NetWorker, throughput is all-important, and latency less so.
Primary storage applications, even simple ones like hosting users' home folders, are much more latency-sensitive. In addition, rather than writing to a small number of very large files like backup applications do, primary storage environments have millions of files of all sizes. Since each file begins on a fresh data chunk, an insertion or other change that throws off the chunk alignment affects only that one file's worth of data. Each new file realigns the process.
Backup software that does its own deduplication, especially applications that deduplicate at the source server like Avamar, PureDisk or Asigra's Cloud Backup, also uses the start and end of each file to determine chunk boundaries. These applications first identify the files that have changed, as a conventional incremental backup would, then start the chunking process on each changed file.
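The realignment effect is easy to see in a small experiment. The Python sketch below chunks the same two files twice, once file by file and once as a single concatenated stream, before and after one byte is inserted at the front of the first file. The file names, sizes, 4 KB chunk size and helper names are all made up for the illustration.

```python
import hashlib

CHUNK = 4096  # assumed chunk size


def sample_bytes(seed, n):
    """Deterministic pseudo-random filler so the chunks don't repeat."""
    blocks = (hashlib.sha256(f"{seed}:{i}".encode()).digest()
              for i in range(n // 32 + 1))
    return b"".join(blocks)[:n]


def chunk_digests(data, size=CHUNK):
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def digests_per_file(files, size=CHUNK):
    """Chunk each file on its own, so every file starts a fresh chunk."""
    digests = set()
    for content in files.values():
        digests.update(chunk_digests(content, size))
    return digests


def digests_as_stream(files, size=CHUNK):
    """Chunk all files as one concatenated stream."""
    return set(chunk_digests(b"".join(files.values()), size))


before = {"a.doc": sample_bytes("a", 20000), "b.doc": sample_bytes("b", 20000)}
after = {**before, "a.doc": b"X" + before["a.doc"]}  # one-byte insertion

new_per_file = digests_per_file(after) - digests_per_file(before)
new_stream = digests_as_stream(after) - digests_as_stream(before)
print(len(new_per_file), "new chunks when chunking per file")
print(len(new_stream), "new chunks when chunking as one stream")
```

Chunked per file, only a.doc's chunks are new and b.doc deduplicates completely. Chunked as one stream, the single inserted byte shifts every later chunk, so b.doc's data has to be stored again as well.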
Using file boundaries can also optimize fixed-chunk deduplication on backup targets, provided the deduplication engine in the backup target knows the format of the tape or tarball-like aggregate files your backup application writes its data in. The dedupe engine can then determine the start and end of each file within the tarball and realign its chunks to those boundaries. Content awareness also lets backup appliances recognize the index marks and catalog data that backup applications insert into the tarball and keep them from throwing off the chunking.
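As a rough illustration of content awareness, the Python sketch below uses the standard library's tarfile module as a stand-in for a backup application's aggregate format. It compares naive fixed chunking of the raw stream with chunking that restarts at each member's data offset and skips the headers. Real backup formats differ, and the helper names, file names, sizes and 4 KB chunk size here are assumptions.

```python
import hashlib
import io
import tarfile

CHUNK = 4096  # assumed chunk size


def sample_bytes(seed, n):
    """Deterministic pseudo-random filler so the chunks don't repeat."""
    blocks = (hashlib.sha256(f"{seed}:{i}".encode()).digest()
              for i in range(n // 32 + 1))
    return b"".join(blocks)[:n]


def build_tar(files):
    """Write the files into an uncompressed tar held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def naive_digests(stream, size=CHUNK):
    """Fixed chunks over the raw stream, headers and padding included."""
    return {hashlib.sha256(stream[i:i + size]).hexdigest()
            for i in range(0, len(stream), size)}


def aligned_digests(stream, size=CHUNK):
    """Fixed chunks restarted at the first data byte of every member."""
    digests = set()
    with tarfile.open(fileobj=io.BytesIO(stream), mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            start = member.offset_data
            end = start + member.size
            for i in range(start, end, size):
                digests.add(hashlib.sha256(stream[i:min(i + size, end)]).hexdigest())
    return digests


a, b = sample_bytes("a", 20480), sample_bytes("b", 40960)
first = build_tar({"a": a, "b": b})
second = build_tar({"a": b"X" + a, "b": b})  # one byte inserted into "a"

new_naive = naive_digests(second) - naive_digests(first)
new_aligned = aligned_digests(second) - aligned_digests(first)
print(len(new_naive), "new chunks with naive chunking")
print(len(new_aligned), "new chunks with boundary-aware chunking")
```

With boundary awareness, only the chunks of the modified member are new. Without it, the grown member pushes the second member to a different offset in the stream, so its unchanged data no longer lines up with the chunks already stored.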
However, fixed-chunk systems can choke on some data. I know of one Data Domain user who used Exchange backups to test Symantec's PureDisk deduplication. They were retaining 40 backups of their Exchange servers in a given amount of storage on their Data Domain systems, but were unable to store even four backups of the same Exchange data in the same amount of storage when it was deduped by PureDisk. Exchange data consists of a small number of large database files that change internally between backups, which is the worst case for PureDisk's dedupe engine. Now, if you used a fixed-chunk dedupe engine where the chunk was smaller than a database page ...
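To play with that closing thought, the Python sketch below simulates a database file whose pages are rewritten in place between two backups, then measures how much of the second backup a fixed-chunk engine would have to store at two chunk sizes. Everything in it is an assumption made for illustration: the 8 KB page size, the pattern of one rewritten page in every 16, and both chunk sizes. It says nothing about how PureDisk or Exchange actually behave.

```python
import hashlib

PAGE = 8192  # assumed database page size


def sample_bytes(seed, n):
    """Deterministic pseudo-random filler so the chunks don't repeat."""
    blocks = (hashlib.sha256(f"{seed}:{i}".encode()).digest()
              for i in range(n // 32 + 1))
    return b"".join(blocks)[:n]


def changed_fraction(old, new, chunk):
    """Fraction of the new file's fixed chunks not already in the old file."""
    known = {hashlib.sha256(old[i:i + chunk]).digest()
             for i in range(0, len(old), chunk)}
    new_chunks = [new[i:i + chunk] for i in range(0, len(new), chunk)]
    changed = sum(1 for c in new_chunks
                  if hashlib.sha256(c).digest() not in known)
    return changed / len(new_chunks)


pages = 256
old = sample_bytes("db", pages * PAGE)

# Rewrite every 16th page in place; the file length does not change.
new = bytearray(old)
for p in range(0, pages, 16):
    new[p * PAGE:(p + 1) * PAGE] = sample_bytes(f"v2-{p}", PAGE)
new = bytes(new)

small = changed_fraction(old, new, 4096)        # chunk smaller than a page
large = changed_fraction(old, new, 128 * 1024)  # chunk spanning 16 pages
print(f"4 KB chunks: {small:.1%} of the file is new")
print(f"128 KB chunks: {large:.1%} of the file is new")
```

In this simulation, chunks smaller than a page confine each change to the page that was touched, while chunks that span many pages pick up at least one modified page each and so never match the previous backup.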
Disclosure: DeepStorage.net has done work for NetApp, Symantec and EMC, whose products were mentioned in this post.