De-Duplication Rumors Highlight Controversy

Scuttlebutt on various de-dupe partnerships points up arguments over approaches

August 28, 2007


A spate of industry chatter surrounding data de-duplication reflects a growing controversy over the best methods for deploying the technology.

The importance of data de-duplication is growing, exemplified by the success of Data Domain's recent IPO and fresh funding for Diligent Technologies. At the same time, storage professionals are being exposed to a cacophony of sometimes conflicting information about suppliers and their methods.

The rumor mill's on full tilt, for instance, regarding the potential for big vendors like EMC or Sun to add to or supersede their de-duplication partners. The latest whispers say EMC may announce a close partnership with Data Domain, to add to or replace de-duplication from its Avamar acquisition. (For the record, EMC and Data Domain refuse to comment on rumor or speculation.)

One thing to emerge from the swirling gossip is that suppliers are engaged in a fierce battle for mindshare that's liable to confuse would-be customers. And at the heart of the debate is the argument that some methods for de-duplicating data are proving to be better and more scalable than others.

How closely should users listen to the rumblings for and against? Should the latest religious wars over data de-duplication be factored into the list of things prospective de-dupe buyers need to consider up front? To consider these questions, we've compiled a rundown of the latest arguments:

  • Appliances are best. Some argue that products deploying agents on local servers or working within the backup utility itself are less efficient than products based on appliances.

    At least one customer shares a different experience. Jason Paige, information systems manager at financial firm Integral Capital Partners, says he chose Avamar prior to the EMC acquisition because the use of agent software on multiple servers reduced the bandwidth required to back up remote sites. What's more, far from slowing down the backup process, the de-duplication reduced a full backup from 8 hours to 3 and enabled him to incorporate a much more detailed Exchange backup, while supporting nearly six times as many computers.

    Paige concedes, though, that customers with larger installations than his might opt for a back-end "post processing" de-dupe solution.

  • In-line is better than post-processing. Some voices insist that wares from Data Domain and Diligent, which use appliances to de-duplicate data "in line" before it is written to the backup target, are more efficient than products that use "post processing" methods, such as those deployed by ExaGrid, FalconStor, and Sepaton, to de-duplicate data once it is backed up. (A sketch contrasting the two approaches appears after this list.) At this stage, only one vendor, Quantum, claims to support both approaches on one appliance.

    "As the market matures, we are beginning to see battle lines established between where... data de-duplication is performed," maintains George Crump of the Storage Switzerland consultancy in a recent blog. In his view, only the use of in-line appliances guarantees that a server won't be overwhelmed with the processing required to de-duplicate data before it goes to backup. Also, if de-duplication is done after the backup, you need more storage than you already have, and there's a chance your de-duplication processing will interfere with the speed of RTO in case there's a real outage.There are others, though, who claim in-line devices interfere with performance on the network, even though they may use less disk than post processing solutions.

    At least one analyst, who asked not to be named, isn't taking sides. "Both in-line and post processing approaches have advantages and disadvantages. A well done post process de-dupe is just as effective as an ingress de-dupe. A well done ingress de-dupe can be just as fast as a post process de-dupe. Remember: One of the basic pillars in math and programming is that there are multiple ways to achieve the solution. The creators of each solution think their baby is the best and only way to get it done."

  • Hash-based de-duplication algorithms don't scale. Some de-duplication vendors boast that their solutions don't consume as much processing power because they're not based on hash algorithms. This usually goes hand in glove with the argument that one's de-duplication index lives in RAM rather than eating CPU cycles. You get the drift. (The index-arithmetic sketch after this list shows what's actually at stake.)

    There may indeed be a performance case to be made for one solution over another, but without live third-party testing, judgments are strictly speculative. Further, it's not clear which de-duplication vendors are using hashing and which aren't. In fact, several claim to use multiple techniques to get the de-dupe job done.
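
To make the debate concrete, here is a minimal sketch, in Python, of the two approaches. Everything in it is an illustrative assumption rather than any vendor's actual design: fixed-size 4-Kbyte chunks (shipping products typically chunk on content boundaries), SHA-256 fingerprints, a plain dictionary standing in for the chunk store, and made-up function names.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-size chunks; real products often use variable-size chunking


def chunks(data: bytes):
    """Split a byte stream into fixed-size chunks."""
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]


def inline_backup(stream: bytes, store: dict) -> list:
    """In-line: fingerprint each chunk on ingest and write only chunks the
    store hasn't seen. Costs CPU during the backup window, but duplicate
    data never lands on disk."""
    recipe = []  # ordered fingerprints from which the stream can be rebuilt
    for chunk in chunks(stream):
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:
            store[fp] = chunk  # new chunk: persist it once
        recipe.append(fp)
    return recipe


def post_process_backup(stream: bytes, staging: list, store: dict) -> list:
    """Post-process: land the raw backup first (fast ingest, no hashing in
    the backup window), then de-duplicate in a second pass. The staging
    area must temporarily hold the entire raw backup."""
    staging.append(stream)            # phase 1: raw landing
    raw = staging.pop()               # phase 2, run later, off the backup window
    return inline_backup(raw, store)  # same chunk-and-hash logic, just deferred


store: dict = {}
recipe = inline_backup(b"a" * 8192, store)
print(len(store), len(recipe))  # 1 unique chunk stored, 2 chunk references
```

Both paths end with identical contents in the chunk store; the dispute is over when the CPU and disk costs get paid, and whether the staging capacity and the second pass are acceptable.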
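
As for the scaling complaint aimed at hash-based designs, the arithmetic behind it is easy to sketch: a fingerprint index needs one entry per unique chunk, so its RAM footprint grows linearly with unique data and inversely with chunk size. The figures below, 4-Kbyte chunks and 64 bytes per index entry, are assumptions for illustration, not measurements of any product.

```python
def index_ram_bytes(unique_data_bytes: float,
                    chunk_size: int = 4096,   # assumed chunk size
                    entry_bytes: int = 64) -> float:
    """Rough RAM needed to hold the whole fingerprint index in memory:
    one entry (fingerprint plus chunk location) per unique chunk."""
    return (unique_data_bytes / chunk_size) * entry_bytes


TB = 1024 ** 4
# 10 Tbytes of unique data at 4-Kbyte chunks works out to roughly
# 160 Gbytes of index if it all stays in memory.
print(f"{index_ram_bytes(10 * TB) / 1024 ** 3:.0f} GiB")  # -> 160 GiB
```

Larger chunks shrink the index but catch fewer duplicates, which is the tradeoff at the center of the scaling argument.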

In the end, what customers are actually seeking from de-duplication products may not be the same things vendors are pushing. According to analyst Simon Robinson of The 451 Group, users his firm surveyed reported their de-duplication criteria to include items like data integrity, replication support, ease of use, pricing, and packaging. "Performance is key. Data integrity is a huge concern," he says.

Going forward, the matchup between one form of de-duplication and a specific size and type of application may become clearer. For now, it seems suppliers are eager to muddy the waters as much as possible.

  • Data Domain Inc. (Nasdaq: DDUP)

  • Diligent Technologies Corp.

  • EMC Corp. (NYSE: EMC)

  • ExaGrid Systems Inc.

  • Quantum Corp. (NYSE: QTM)

  • Storage Switzerland

  • Sun Microsystems Inc. (Nasdaq: SUNW)

  • The 451 Group
