A Data De-Duplication Survival Guide: Part 1

In the first installment of this series, we discuss where to de-duplicate data

May 28, 2008

Editor's note: This is the first installment of a four-part series that will examine the technology and implementation strategies deployed in data de-duplication solutions:

  • Part 1 will look at the basic location of de-duplication -- standalone device, VTL solution, or host software.

  • Part 2 will discuss the timing of de-duplication. This refers to the in-line versus post-processing debate.

  • Part 3 will cover unified versus siloed de-duplication, exploring the advantages of using a single supplier with the same solution covering all secondary data, versus deploying unique de-duplication for various types of data.

  • Part 4 will discuss performance issues. Many de-duplication suppliers claim incredibly high-speed ingestion rates for their systems, and we'll explore how to decipher the claims.

The original products in the data de-duplication market were purpose-built systems focused on improving the value of disk-to-disk backup while letting organizations reduce their reliance on tape.

As data de-duplication solutions have become more prevalent, a few primary storage suppliers have attempted to implement the technology as an add-on feature, most notably in their VTLs. Backup software vendors are also adding the capability to their solutions. With so many data de-duplication options available to the IT manager today, the new question is, Where is the best place to host the data de-duplication process?

As you are reading, keep in mind that the primary focus of data de-duplication is secondary storage -- archive and backup -- as opposed to primary storage. Also note that what constitutes duplicate data may not be immediately obvious. An Oracle database, for example, can be backed up in several ways: using the built-in RMAN utility, using an organization's enterprise backup software application, or using an Oracle-specific backup utility. Each of these methods creates its own data set. Since those data sets are backups of the same Oracle database, the data within each set is essentially identical.

General-purpose de-duplication systems

Several vendors, including Data Domain and Quantum, offer de-duplication systems that are not associated with particular VTLs or backup appliances. These devices can be termed general-purpose de-duplicators.

The advantage of working with a general-purpose data de-duplication storage system is that it is designed solely to de-duplicate data. As a result, these systems are source neutral, meaning that the source backup data can come from multiple applications (backup software, application utilities, archiving applications, or directly from the user).

General-purpose systems provide multiple data access protocols (NFS, CIFS, or tape emulation) and offer multiple types of physical connectivity (Ethernet or Fibre Channel). In the real-world data center, there are many sources of backup data, and there is a distinct advantage in being source neutral.

Although input can be taken from multiple sources, in a general-purpose system, the data de-duplication process is leveraged across all of them. For example, the Microsoft SQL environment may be backed up by an administrator through the backup application to the general-purpose data de-duplication system. Later, the same data may be dumped to the data de-duplication system by the SQL DBA. After that, it may also be captured as part of a VMware image using a VMware backup utility to move the data to the data de-duplication system.

In the above example, all the data is similar, and the redundant segments from each source are eliminated before the data is stored. Be aware that this example covers just one file that changed slightly on one day. This type of multi-protection is not uncommon in today's data center, so the space savings across a week or month could be staggering.

General-purpose data de-duplication systems will typically have (or should have) the ability to do in-line data de-duplication, since that is generally the most efficient process. Ideally, the system should also use variable-length segment identification in order to provide the most aggressive de-duplication effect. For example, it should be able to pick up and store only the changed segments within a database, as opposed to storing the entire file anew on each backup.
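
To make the variable-length-segment idea concrete, here is a minimal Python sketch of content-defined chunking feeding a shared fingerprint store. It is purely illustrative -- the boundary test, segment sizes, and in-memory dictionary are assumptions made for the example, not a description of how any vendor's product actually works.

```python
# Illustrative sketch only: variable-length segment de-duplication.
import hashlib

MASK = 0x1FFF       # boundary when the low 13 bits are zero -> ~8 KB average segments
MIN_SEG = 2048      # avoid pathologically small segments

def segments(data: bytes):
    """Split a stream into variable-length segments at content-defined boundaries,
    so an insertion early in a file shifts only nearby boundaries instead of
    changing every fixed-size block that follows."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF   # toy rolling value
        if (rolling & MASK) == 0 and (i - start) >= MIN_SEG:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

store = {}  # fingerprint -> segment: one repository shared by every backup source

def ingest(stream: bytes) -> int:
    """Store only segments whose fingerprints are not already present.
    Returns the number of new bytes actually written to the repository."""
    new_bytes = 0
    for seg in segments(stream):
        fingerprint = hashlib.sha1(seg).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = seg
            new_bytes += len(seg)
    return new_bytes
```

Whether the stream arrives via the backup application, a DBA's dump, or a VMware backup utility, the second and third copies of essentially identical data contribute almost no new segments, which is the effect described above.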

Lastly, general-purpose data de-duplication systems that include replication provide the optimal way to replicate backup data to remote sites. By leveraging de-duplication, the system needs to replicate only the net-new segments of data across the network.

The most efficient systems will perform de-duplicated replication, in-line, across multiple sites. So far, Data Domain fits the bill. In addition, in-line de-duplication enables the replication process to begin the moment the system starts receiving data. This is unlike VTL systems that typically use post-process data de-duplication and therefore incur a time delay before the replication process can begin -- thus putting the disaster recovery data at risk.
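
A rough sketch of the idea, reusing the hypothetical fingerprint store from the earlier example: only segments the remote site has never seen need to cross the wire. The function name and in-memory dictionaries are assumptions for illustration, not any vendor's replication API.

```python
# Illustrative only: de-duplicated replication sends just the net-new segments.
# The dictionaries stand in for the local and remote repositories; a real system
# would exchange fingerprints over the network rather than share memory.
def replicate(local_store: dict, remote_store: dict) -> int:
    """Copy segments missing at the remote site; return the bytes 'sent'."""
    bytes_sent = 0
    for fingerprint, segment in local_store.items():
        if fingerprint not in remote_store:
            remote_store[fingerprint] = segment   # stands in for a WAN transfer
            bytes_sent += len(segment)
    return bytes_sent
```

With in-line de-duplication, this transfer can be driven segment by segment as data is ingested, rather than waiting for a post-process pass to finish.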

VTL solutions

Suppliers of VTL solutions, such as FalconStor (which supplies EMC and Sun), NetApp, and Sepaton, typically qualify a range of backup applications, but they are not neutral in terms of data source or target. Specifically, VTL solutions emulate a tape library, so only applications with specific support for tape libraries can use the VTL, which limits its applicability.

Many of the utilities prevalent in the data center dump data straight to disk and do not speak a tape protocol, so they simply cannot copy data to a VTL.

Most of the limitations to weigh when considering VTL solutions with data de-duplication center on the added management complexity and on the in-line versus post-processing debate. In general, the added virtual tape management needed to emulate tape on disk adds more complexity to an already complex environment.

Post-processing further complicates ongoing daily management and has a negative impact on time to de-duplicate and time to replicate (or to create the DR copy). Post-processing also requires additional disk capacity to act as a de-duplication landing pad.

Ultimately, more capacity means more disk to manage, more power and cooling, more floor space, and, of course, more spindles to purchase. Adding data de-duplication as a feature to an existing VTL product has thus far been implemented using the less efficient post-process method of de-duplication.

Software-based de-duplication and single instancing

As expected, backup software vendors are now adding data de-duplication to their feature sets. Some, such as CommVault, use a data-reduction technique known as single instancing, in which the backup host performs file-level comparisons as it receives the data.

While this method will certainly reduce some of the storage requirements caused by the backup process, it does nothing to address the network bandwidth requirements, nor does it address multiple copies of similar data (only the data that runs through the specific application will be compared for redundancy).

Single instance storage does not solve the other big problem in backup storage -- files that change slightly on a regular basis.

With single instancing, discrete files that do not change each day are typically “instanced out” of the backup. However, in any backup vaulting strategy, files that don't change are not the issue; the big files that change a little bit every day are the problem.

Databases, VMware images, and Exchange stores often change slightly throughout the day. A file-level single-instance comparison sees the changed version as a different file, not as the same file with a few changes. This means the entire file must be stored again, resulting in an anemic data-reduction effect when compared with true de-duplication techniques. Without block-level reduction, there are no space savings, particularly for database files, which can be very large.

Another challenge that single-instance storage doesn't address is that there are often multiple backup sources for similar data sets. For example, the backup administrator may back up Exchange with the backup software's Exchange module, while the Exchange administrator backs up the Exchange stores with a separate utility. No data reduction happens here, since the backup software never sees the backups created by the standalone utility.

In both cases (apps with frequent small changes and multiple backup sources), a data de-duplication system operating at a block level would identify the redundant blocks and reduce the storage impact even though the backup source (backup application or Exchange utility) is different.
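
The difference is easy to see in a small sketch. The example below is illustrative only -- the 4 KB fixed block size and the file contents are assumptions -- and compares a whole-file hash, as single instancing uses, with a block-level comparison for a large file that changes by a single byte.

```python
# Illustrative only: why file-level single instancing stores a slightly changed
# file in full, while block-level de-duplication stores only the changed block.
import hashlib
import os

def single_instance_new_bytes(old: bytes, new: bytes) -> int:
    """Whole-file comparison: any change at all forces the entire file to be stored."""
    return 0 if hashlib.sha1(old).digest() == hashlib.sha1(new).digest() else len(new)

def block_level_new_bytes(old: bytes, new: bytes, block: int = 4096) -> int:
    """Block-level comparison: only blocks whose hashes are unseen are stored."""
    seen = {hashlib.sha1(old[i:i + block]).digest() for i in range(0, len(old), block)}
    new_bytes = 0
    for i in range(0, len(new), block):
        chunk = new[i:i + block]
        if hashlib.sha1(chunk).digest() not in seen:
            new_bytes += len(chunk)
    return new_bytes

old = os.urandom(10_000_000)                        # yesterday's 10 MB database dump
new = old[:5_000_000] + bytes([old[5_000_000] ^ 0xFF]) + old[5_000_001:]  # one byte changed
print(single_instance_new_bytes(old, new))          # 10000000 -- the whole file again
print(block_level_new_bytes(old, new))              # 4096 -- just the changed block
```

Fixed blocks are used here for brevity; as noted earlier, variable-length segments handle the harder case where an insertion shifts all of the data that follows it.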

Software suppliers that use this single-instancing technique will claim that this storage method is better suited for recovery, implying that there are restore performance issues with de-duplication systems. While there may have been recovery performance issues with some suppliers' data de-duplication systems, a system designed with the correct architecture should show no measurable performance impact from the de-duplication process.

In a real-world data center, there are simply too many other bottlenecks between backed-up data and the source server for recovery from a general-purpose data de-duplication system to be the problem. If the recovery performance requirement exceeds the ability to recover from disk, then other high-availability solutions, like clustering or active targets, should be considered. (An active target is a backup target application that can be browsed and read like a normal file system.)

Finally, the single-instancing method assumes the use of a single software application for all backups, archives, and other data management functions, across all data types. This is not practical. While many backup software suppliers do offer additional components beyond backup, these modules vary widely in functionality, and in reality most customers are going to have separate solutions for backup and archive, as well as applications for particular platforms (such as VMware). There will also be a limit to how much development a software manufacturer will invest in a module for a unique database or operating system.

Summary

The source-neutral, protocol- and connectivity-neutral, data-type-neutral capabilities of a general-purpose data de-duplication system make it the best possible tool for storing backup and archive data. Be careful not to be limited by the particular capabilities of the de-duplication module built into your backup software, or by the tape-only protocol of a VTL.

  • CommVault Systems Inc.

  • EMC Corp. (NYSE: EMC)

  • FalconStor Software Inc. (Nasdaq: FALC)

  • Microsoft Corp. (Nasdaq: MSFT)

  • NetApp Inc. (Nasdaq: NTAP)

  • Sepaton Inc.

  • Sun Microsystems Inc. (Nasdaq: JAVA)

  • VMware Inc.
