Source vs. Target Deduplication: Scale Matters

Howard Marks

February 3, 2011

I had a nice conversation with the CEO of a backup software vendor, who shall remain nameless, at last week's Exec Event storage industry schmooze-fest. At the event, the CEO asked why I thought target deduplication appliances like those from Data Domain, Quantum and Sepaton were still around. Why, he asked, doesn't everyone shift to source deduplication since it's so much more elegant?

By running as agents on the hosts being backed up, source deduplication leverages the CPU horsepower of all those hosts to do some of the heavy lifting inherent in data deduplication. This should reduce the CPU horsepower needed in the target system and thus hold down its cost. And while all deduplication schemes minimize the disk space your backup data consumes, deduplicating at the source also minimizes the network bandwidth required to send the backups from source to target.
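To make that division of labor concrete, here's a minimal sketch of a source-side agent in Python, with an in-memory stand-in for the target. It's hypothetical code, not any vendor's actual protocol, and it uses fixed-size chunks where real products use variable-size chunking and far more sophisticated fingerprint indexes. The agent hashes every chunk on the host and ships only the chunks the target hasn't seen before.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # fixed-size chunks for simplicity; real products use variable-size chunking


class DedupeTarget:
    """Stand-in for the backup target's chunk store (normally on the other end of the wire)."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk data
        self.recipes = {}   # file name -> ordered list of fingerprints

    def has_chunk(self, fingerprint):
        return fingerprint in self.chunks

    def store_chunk(self, fingerprint, data):
        self.chunks[fingerprint] = data

    def store_recipe(self, name, fingerprints):
        self.recipes[name] = fingerprints


def backup_file(path, target):
    """Source-side dedupe: hash each chunk on the client, ship only chunks the target lacks."""
    recipe, bytes_sent = [], 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            fingerprint = hashlib.sha256(chunk).hexdigest()  # CPU work done on the host
            recipe.append(fingerprint)
            if not target.has_chunk(fingerprint):            # tiny query instead of 64KB of data
                target.store_chunk(fingerprint, chunk)
                bytes_sent += len(chunk)
    target.store_recipe(path, recipe)  # the target can rebuild the file from the recipe
    return bytes_sent                  # bytes that actually crossed the network
```

The hashing, the expensive part, happens on the host being backed up; all the target sees is a stream of fingerprints plus whatever chunks are genuinely new.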

Since most branch offices run a single shift--leaving servers idle for a 12-hour backup window--and WAN bandwidth from the branch office to the data center comes dear, source deduplication is a great solution to the ROBO (remote office, branch office) backup problem. 
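Some back-of-the-envelope math, with numbers I'm inventing purely for illustration, shows why. Ship only the deduplicated nightly changes and a branch office T1 is plenty; try to push raw fulls over the same link and the backup never finishes:

```python
# Back-of-the-envelope ROBO math with made-up but plausible numbers:
# a 500GB file server, ~2% daily change, and a 1.5Mbit/s T1 to the data center.
full_backup_gb = 500
daily_change_gb = full_backup_gb * 0.02           # 10GB of changed data per night
dedupe_factor = 5                                 # assume dedupe/compression shrinks it further
wan_gb = daily_change_gb / dedupe_factor          # ~2GB actually crosses the WAN

t1_mbit_per_s = 1.5
hours = wan_gb * 8 * 1024 / t1_mbit_per_s / 3600            # ~3 hours -- fits a 12-hour window
hours_full = full_backup_gb * 8 * 1024 / t1_mbit_per_s / 3600  # ~760 hours for a raw full backup
print(f"deduped incremental: {hours:.1f}h, raw full backup: {hours_full:.0f}h")
```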

As a result, and because of the generally abysmal state of ROBO backup at the time, early vendor marketing for source deduplication products such as EMC's Avamar and Symantec's PureDisk pitched them as ROBO solutions.

Source dedupe fits well wherever CPU cycles are available during the backup window, and even better where bandwidth is constrained, such as on a virtual server host backing up 10 guests at a time. Since it's just software, the price is usually right. And since vendors have started building source deduplication into the agents for their core enterprise backup solutions, users don't even need to junk NetWorker, Tivoli Storage Manager or NetBackup to dedupe at the source.

One major sticking point remains: most source deduplication systems can't hold as much data as a DD890 or other big honking backup appliance. By all accounts I can find, Avamar is the best-selling source deduplication software package today. However, an Avamar data store can only grow to a 16-node RAIN (Redundant Array of Independent Nodes) with a total capacity of about 53TBytes (net after RAID but before deduplication), while a DD890, also from EMC, can hold about 300TBytes.

A 53TByte store, which at typical deduplication ratios probably holds 300TBytes to 800TBytes of actual backup data, is a lot of storage for most of us. Large enterprises with more data than that, however, would have to create multiple repositories. These repositories, in addition to being more work to manage, will reduce the deduplication rate because each repository is an independent deduplication realm.
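To put rough numbers on that (the ratios here are assumptions for illustration, not anything EMC publishes), the 300TBytes-to-800TBytes figure is just 53TBytes at a 6x to 15x deduplication ratio, and the penalty for splitting data across independent realms is that the chunks they have in common get stored once per realm instead of once overall:

```python
# Rough math with assumed ratios, just to show the shape of the problem.
usable_tb = 53
for ratio in (6, 15):
    print(f"{ratio}x dedupe ratio -> roughly {usable_tb * ratio}TB of logical backup data")

# Why splitting into N independent repositories hurts the dedupe rate:
# chunks common to all of them (OS images, application binaries, shared documents)
# get stored once per repository instead of once overall.
common_tb, unique_tb, realms = 20, 33, 2
one_realm = common_tb + unique_tb          # 53TB physical in a single realm
split = common_tb * realms + unique_tb     # 73TB physical across two realms
print(f"one realm: {one_realm}TB, {realms} realms: {split}TB for the same logical data")
```

The more the repositories overlap--same OS images, same applications--the bigger that penalty gets.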

Disclaimer: EMC, Quantum and Symantec are, or have been, clients of DeepStorage.net, of which I am founder and chief scientist. Of course, they're on both sides of this question.

About the Author

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
