Dedupe Dos and Don'ts


Howard Marks

January 3, 2011


Data deduplication, at least for backup data, has made it to the mainstream. However, it's important to remember that the term "data deduplication" applies to a relatively wide range of technologies that all manage to store data once, even when they're told to store it many times. Since all of these technologies are sensitive to the data being stored, nowhere in IT is the term "your mileage may vary" more true than in dedupe. As 2010 winds down, I figured I'd share a few tips on how to get the most out of deduplication.

Do check that the dedupe solution you're looking at supports your backup application. While most deduplication systems will find some duplicate data in an arbitrary data stream, they get better results when they have some context about the data they're working with. Hash-based deduplication systems break the data into chunks and eliminate duplicate chunks. While they will all start a new chunk at the beginning of a new file, most backup applications store data in aggregate files that resemble Unix tarballs or ZIP files.
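To make the chunk-and-hash idea concrete, here's a minimal sketch of fixed-size chunking with SHA-256 fingerprints. The 4 KB chunk size and the in-memory store are my own illustrative assumptions, not any vendor's design:

```python
# A minimal sketch of hash-based deduplication: split a stream into
# fixed-size chunks, fingerprint each chunk with SHA-256, and store a
# chunk only the first time its fingerprint is seen.
import hashlib

CHUNK_SIZE = 4096  # bytes; real systems vary widely here

def dedupe_store(stream: bytes, store: dict[str, bytes]) -> list[str]:
    """Return the list of chunk fingerprints that reconstruct `stream`,
    adding only previously unseen chunks to `store`."""
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:          # new data: keep one copy
            store[fp] = chunk
        recipe.append(fp)            # duplicate data: just a reference
    return recipe

store: dict[str, bytes] = {}
backup_monday = b"A" * 8192 + b"B" * 8192
backup_tuesday = b"A" * 8192 + b"C" * 8192   # half the data is unchanged

dedupe_store(backup_monday, store)
dedupe_store(backup_tuesday, store)
print(len(store))   # 3 unique chunks kept, though 8 chunks were written
```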

If your dedupe system knows about the aggregate file format your backup app uses, it can start a new chunk for each file from your source system in the backup. This will allow the system to identify more duplicate data. In addition to your data, aggregate files include index information that the backup application uses to speed restores. If you store backup data on a fixed chunk dedupe system, like most of the primary storage systems that dedupe data, this index info may shift data so the system won't recognize that today's backup includes the same data as yesterday's.
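Here's a toy illustration of that boundary-shift problem, again assuming 4 KB fixed chunks; the few bytes of "index" data stand in for the backup application's metadata:

```python
# Why fixed-size chunking stumbles on backup aggregate files: if the
# backup app writes a few bytes of index/header data ahead of the same
# file data, every later chunk boundary shifts and the fingerprints no
# longer line up. Purely illustrative; chunk size and data are made up.
import hashlib

CHUNK = 4096
file_data = bytes(range(256)) * 64           # 16 KB of unchanged file data

def fingerprints(stream: bytes) -> set[str]:
    return {hashlib.sha256(stream[i:i + CHUNK]).hexdigest()
            for i in range(0, len(stream), CHUNK)}

yesterday = b"IDX1" * 4 + file_data          # 16 bytes of index metadata
today     = b"IDX2" * 5 + file_data          # 20 bytes today: boundaries shift

shared = fingerprints(yesterday) & fingerprints(today)
print(len(shared))   # 0: identical file data, yet no chunk fingerprints match
```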

Do keep similar data sources in the same dedupe pool. If your dedupe system isn't capable of storing all your data in a single dedupe pool, split your data so that servers that hold similar data are in the same pool. Putting file servers in one pool and Oracle servers in another, for example, will give you a better dedupe ratio than storing all the data from the New York office in one pool and all the data from Chicago in another.
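To see how pool layout changes the math, here's a hypothetical back-of-the-envelope sketch; the server names and chunk fingerprints are made up, but the ratio arithmetic is the point:

```python
# The dedupe ratio of a pool is (logical chunks written) / (unique chunks
# kept). Grouping servers whose chunks overlap maximizes that ratio.
# The fingerprint lists below are stand-ins for real chunk hashes.
servers = {
    "ny-file":  ["doc1", "doc2", "doc3", "doc1"],
    "chi-file": ["doc1", "doc2", "doc4", "doc2"],
    "ny-ora":   ["db1", "db2", "db3", "db1"],
    "chi-ora":  ["db1", "db2", "db4", "db2"],
}

def pool_ratio(members: list[str]) -> float:
    written = sum(len(servers[m]) for m in members)
    unique = len({fp for m in members for fp in servers[m]})
    return written / unique

# by data type: file servers together, Oracle servers together
print(pool_ratio(["ny-file", "chi-file"]), pool_ratio(["ny-ora", "chi-ora"]))  # 2.0 2.0
# by site: New York together, Chicago together
print(pool_ratio(["ny-file", "ny-ora"]), pool_ratio(["chi-file", "chi-ora"]))  # ~1.33 each
```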

Don't encrypt your data before sending it to the deduping appliance. Encryption algorithms are designed so that similar sets of cleartext generate very different ciphertext, which prevents the deduping appliance from recognizing duplicate data. Compression has much the same effect, so leave compression to the back-end deduplication device rather than the backup software.

Do encrypt your data when you copy it from the deduping appliance to tape to send it off-site, or as you replicate it over the Internet.
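To see why encrypting first hurts, here's a small sketch comparing fingerprints the way a dedupe system would; the XOR keystream is a toy stand-in for a real cipher, not actual cryptography:

```python
# The same cleartext block encrypted in two backup runs (different IVs)
# yields unrelated ciphertext, so the dedupe appliance sees no duplicates.
import hashlib, os

block = b"same customer database page" * 100   # identical data both nights

def toy_encrypt(data: bytes, iv: bytes) -> bytes:
    # keystream derived from the IV; a stand-in, not real cryptography
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(iv + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

monday  = toy_encrypt(block, os.urandom(16))
tuesday = toy_encrypt(block, os.urandom(16))

raw_match = hashlib.sha256(block).hexdigest() == hashlib.sha256(block).hexdigest()
enc_match = hashlib.sha256(monday).hexdigest() == hashlib.sha256(tuesday).hexdigest()
print(raw_match, enc_match)   # True False: cleartext dedupes, ciphertext doesn't
```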

Don't multiplex backups to your virtual tape library. I consider multiplexing an evil technology whose day has passed. Combining the backups from multiple slow servers onto a single fast tape drive made some sense when we were backing up to tape and had to keep the tape drive fed to avoid the dreaded shoe-shining. Since disk systems, with or without dedupe, can accept data slower than their maximum throughput without complaining or slowing things down even further, there's no good reason to multiplex--ever.

Follow these simple tips and your data will dedupe smaller, letting you keep data on disk longer and keeping management happy that they're not buying more disk drives for the dedupe system every few months.


About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
