De-Dupe Dos & Don'ts
Some of the methods you use to get the most out of a real tape library can confuse the de-duping algorithm in your shiny new VTL
January 21, 2009
11:30 AM -- I know the salesman from your favorite three-letter storage vendor told your boss you could drop a new de-duping VTL into your backup scheme and be up and running in 20 minutes without changing your jobs and schedules. While that's technically true, adjusting your backup methods just a little could have a big payoff in higher data reduction rates, as some of the methods you use to get the most out of a real tape library can confuse the de-duping algorithm in your shiny new VTL.
Here are a few tips to help you pack your backup data into the least space on your new backup target.
Do store similar data in the same repository
If for capacity, performance, or other reasons you end up with more than one de-duping backup device, you'll get much better de-duplication factors if you keep the backups of servers that host similar data types together. After all, your users receive files as email attachments, edit them, and send them out as attachments, so there will be lots of duplicate data across your email and file servers.
Backing up the system drives of your Windows servers to one appliance and your databases to another will get you better data reduction than backing up the automotive division servers to one and the aerospace division to the other.
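To put rough numbers on that, here's a back-of-the-envelope sketch in Python. Everything in it is made up for illustration: the chunk counts, the two divisions with one Windows server and one database each, and the assumption that each appliance de-dupes only against its own contents. It simply counts how many unique chunks two appliances would have to keep under each grouping.

```python
# Hypothetical inventory: every server is modeled as a set of chunk IDs.
OS_CHUNKS = {f"os-{i}" for i in range(1000)}  # blocks common to all Windows system drives

def system_drive(name):
    """A system drive: the shared OS blocks plus 50 blocks unique to this server."""
    return OS_CHUNKS | {f"{name}-sys-{i}" for i in range(50)}

def database(name):
    """A database: 5000 blocks that appear nowhere else."""
    return {f"{name}-db-{i}" for i in range(5000)}

def stored_chunks(appliances):
    """Total chunks kept when each appliance de-dupes only within itself."""
    return sum(len(set().union(*servers)) for servers in appliances)

auto_sys, aero_sys = system_drive("auto"), system_drive("aero")
auto_db, aero_db = database("auto"), database("aero")

# Grouping 1: system drives on one appliance, databases on the other.
by_type = stored_chunks([[auto_sys, aero_sys], [auto_db, aero_db]])
# Grouping 2: one appliance per division.
by_division = stored_chunks([[auto_sys, auto_db], [aero_sys, aero_db]])

print("grouped by data type:", by_type)      # 11100
print("grouped by division: ", by_division)  # 12100
```

With these invented numbers the by-division layout stores the common OS blocks twice, once per appliance. The more servers that share those blocks, the bigger the gap gets.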
Don't encrypt or compress before de-duping
All de-duping algorithms work by identifying common blocks of data and storing them only once. Encrypting or compressing data in your backup application or a SAN appliance before the de-duping algorithm sees it hides the commonality.
Compression and encryption algorithms take fixed-size data blocks and transform them to do their work. If 1 byte of the data changes, the compressed and/or encrypted data block changes dramatically, so even a variable-block de-duping solution like those from Quantum, Dell, EMC, or Data Domain can't isolate the small change from the data that stayed the same.
Most de-duping solutions compress data after de-duping (the obvious exception is NetApp's A-SIS, which is targeted more at primary storage than backups), so there's no good reason to compress before.
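If you'd like to see the effect for yourself, here's a minimal Python sketch. It is not how any vendor's product works: it uses simple fixed-size chunks and SHA-256 fingerprints, zlib stands in for whatever compression your backup application applies, and the "Monday" and "Tuesday" data is invented. Tuesday's copy differs from Monday's by one overwritten byte, and the script reports what fraction of Tuesday's chunks the de-duper would have to store as new, with and without compression in front of it.

```python
import hashlib
import random
import zlib

CHUNK = 4096  # fixed-size chunks keep the sketch short; real products differ

def fingerprints(data, size=CHUNK):
    """Return the SHA-256 digest of each fixed-size chunk of data."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def new_fraction(first, second, size=CHUNK):
    """Fraction of second's chunks that were not already stored from first."""
    stored = set(fingerprints(first, size))
    chunks = fingerprints(second, size)
    return sum(fp not in stored for fp in chunks) / len(chunks)

# Build 1 MiB of compressible, made-up "file" data from a small vocabulary.
rng = random.Random(2009)
words = [b"backup", b"tape", b"library", b"virtual", b"block", b"stream",
         b"server", b"restore", b"media", b"catalog", b"schedule", b"job"]
monday = b" ".join(rng.choice(words) for _ in range(200_000))[:1 << 20]

# Tuesday's copy differs from Monday's by a single overwritten byte.
changed = bytearray(monday)
changed[10] = 0
tuesday = bytes(changed)

raw_new = new_fraction(monday, tuesday)
zip_new = new_fraction(zlib.compress(monday), zlib.compress(tuesday))

print(f"new chunks, raw stream:        {raw_new:.1%}")
print(f"new chunks, compressed stream: {zip_new:.1%}")
```

On the raw stream exactly one chunk out of 256 is new. On the compressed stream I'd expect nearly every chunk to come up as new, because the one-byte change ripples through the rest of the compressed output. Encryption should have the same kind of effect, though the sketch doesn't show it.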
I've never thought encrypting disk data in the data center, as opposed to tapes you're going to put in a truck, added significantly to real security, so turning off encryption doesn't really bother me. It does mean that you'll have to spool data off to tape through your backup media servers until the vendors of VTLs and backup software work out APIs that allow direct export to LTO-4 drives using the drive's encryption engine and key management in the backup app.
Don't interleave/multiplex
While interleaving made a lot of sense when you were trying to keep a high-speed tape drive fed, disk backup targets can accept data at any rate without shoe shining. Like compression, interleaving disturbs the data stream so the de-duping algorithm will see part of file A that it previously backed up mixed with file B that's new and think the whole stream is new data.
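Here's a small Python sketch of that problem, under some loud assumptions: fixed 4,096-byte de-dupe chunks, a backup app that takes 1,500 bytes from each stream in turn, and random bytes standing in for file A (already on the VTL) and file B (new tonight). Real multiplexing formats add headers and use their own block sizes, so treat this as an illustration and not a model of any particular product.

```python
import hashlib
import random

CHUNK = 4096  # de-dupe chunk size (assumed)
SLICE = 1500  # bytes the backup app takes from each stream per turn (assumed)

def fingerprints(data, size=CHUNK):
    """Return the SHA-256 digest of each fixed-size chunk of data."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]

def new_fraction(stored, stream):
    """Fraction of the stream's chunks not already in the stored set."""
    chunks = fingerprints(stream)
    return sum(fp not in stored for fp in chunks) / len(chunks)

def multiplex(a, b, size=SLICE):
    """Interleave two equal-length streams, size bytes at a time."""
    out = bytearray()
    for i in range(0, len(a), size):
        out += a[i:i + size]
        out += b[i:i + size]
    return bytes(out)

rng = random.Random(14)
n = 64 * CHUNK
file_a = rng.getrandbits(8 * n).to_bytes(n, "little")  # backed up last night
file_b = rng.getrandbits(8 * n).to_bytes(n, "little")  # brand new tonight

stored = set(fingerprints(file_a))  # what the VTL already holds

one_at_a_time = new_fraction(stored, file_a + file_b)
multiplexed = new_fraction(stored, multiplex(file_a, file_b))

print(f"new chunks, one stream at a time: {one_at_a_time:.0%}")
print(f"new chunks, multiplexed:          {multiplexed:.0%}")
```

Sent one after the other, half the chunks (all of file A) are already on the VTL. Multiplexed, every chunk holds a mix of A and B, so none of them match what was stored before.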
While a content-aware system like Sepaton's could demultiplex and possibly decompress the data before de-duping, there's no good reason to make the VTL CPU work harder than it has to.
Sending each backup stream to its own virtual drive instead of multiplexing will probably mean licensing more tape drives in your backup application. Given what you spent for the new VTLs, the extra money you'll have to send IBM, EMC, or Symantec for tape drive licenses is chickenfeed.
— Howard Marks is chief scientist at Networks Are Our Lives Inc., a Hoboken, N.J.-based consultancy where he's been beating storage network systems into submission and writing about it in computer magazines since 1987. He currently writes for InformationWeek, which is published by the same company as Byte and Switch.