Disk-Based Archiving Lowers the Boom on Tape

If disk-to-disk backup put a nail in tape's coffin, then disk-based archiving may just finish it off

October 10, 2007


If disk-to-disk backup put a nail in the coffin of tape archiving, then another technology -- disk-based archiving -- may hasten tape's demise even further.

Disk-based archiving leverages disk, as opposed to tape, to provide long-term access to data that has become static but may be needed at some point in the future. Sometimes called disk nearlining, the technique involves using an intelligent disk target device along with "move" commands or off-the-shelf archiving software to move data out of the backup process and onto disk.

The trend is evident in various new products, as well as in homegrown systems in use at IT organizations. Data Domain's recent announcement of an OS update to better support smaller files points to the reality of data archiving on disk. Suppliers such as Copan, EMC, Hitachi, and Permabit are also angling to lead in this space.

At the heart of the trend is tape's weakness in fulfilling the archiving role, a weakness even more significant than its shortcomings in the backup/recovery function. Besides being particularly slow for finding and recovering single files, tape makes it hard to ensure the integrity of data over time, and upgrading between technology generations on tape is difficult.

Difficulties are compounded by the fact that most customers used to look at an archive with the thought: "If I can recover it, that's fine, but it's not a requirement." Now, with regulations and electronic discovery in legal cases becoming commonplace, the ability to pull data from that archive is critical, if not legally mandated. The ramifications of failing to recover that data can be costly, both to public image and to the financial bottom line. Users can no longer just "hope" that data can be recovered from the archive.

Despite all the discussion of tiered storage, ILM, and HSM, in reality most data centers have three pools of storage: primary storage, backup storage, and archive storage. All organizational data usually gets backed up to the backup pool, and in many cases the backup pool is the archive pool.

How does disk-based archiving improve this picture? The primary reason is the same one that drove the disk-to-disk backup market: data de-duplication.

Data de-duplication is the ability of an appliance, or of a software application running on a server with attached disk, to compare the segments of data being written to it with the data segments that already reside on it. If a duplicate segment is found, a pointer is established to the original data instead of storing the duplicate segment again, removing or "de-duplicating" the redundant segments from the volume.
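To make the mechanics concrete, here is a minimal sketch, in Python, of fixed-size, hash-based segment de-duplication. It is an illustration only, not any vendor's implementation; the 4-Kbyte segment size and in-memory dictionary are assumptions.

    import hashlib

    SEGMENT_SIZE = 4096      # hypothetical fixed segment size, for illustration only
    segment_store = {}       # hash -> unique segment bytes

    def write_with_dedup(data):
        """Split incoming data into segments and store each unique segment once.

        Returns the list of segment hashes ("pointers") needed to rebuild the data.
        """
        pointers = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            digest = hashlib.sha256(segment).hexdigest()
            if digest not in segment_store:      # first time this segment has been seen
                segment_store[digest] = segment
            pointers.append(digest)              # duplicates become pointers only
        return pointers

    def read_with_dedup(pointers):
        """Reassemble the original data by following the pointers."""
        return b"".join(segment_store[p] for p in pointers)

Commercial systems use variable-size segments, on-disk indexes, and collision handling, but the pointer-instead-of-copy idea is the same.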

De-duplication is changing the perception of disk-based archiving and fixes a fundamental flaw with the ILM craze of a few years ago. Back then, you were going to take 80 percent of your data that was not being used and store it on less expensive disk. The problem was, it was still the same amount of storage, just a little cheaper.

We would go to customer sites and identify 10 Tbytes of static data that could be moved off primary storage, but to move that data you had to buy more than 10 Tbytes of ATA disk, not a great value proposition. With data de-duplication, you may only need to buy 3 or 4 Tbytes of capacity to store that same 10 Tbytes. Given the advent of SAS/SATA, de-duplication helps ensure that not only will you use less disk, you'll pay less for the underlying medium as well.

Less disk, of course, means fewer storage elements to manage, plus the reduction in power and cooling costs that all IT managers are looking for today. The combination of SATA technology and data de-duplication lets you build a disk-based archive at a price point that begins to match the cost of tape.

Comparing disk-based archiving to tape

With the cost advantage of tape lessened or even eliminated, comparing a disk-based archive to a tape-based archive is almost unfair. The comparison nonetheless turns up two specific weaknesses of tape: the difficulty of performing single-file restores from data sets that are months if not years old, and the inability to ensure the integrity of data on tape.

The difference between backup and archiving can be most simply identified by looking at how the copy of the data is going to be used. If you make a copy of primary data for data protection purposes, that is a backup. If you copy the primary data to another tier of storage for permanent or near-permanent retention, that is an archive.

Most customers that use tape for archiving are using their backup software to create and manage that archive; the archive is essentially a set of backup jobs that have not been overwritten.

Backup applications maintain a database that tracks which file is on which tape. That database grows with the number of files backed up and the length of time those files must be tracked. Most backup applications will ask you to prune the database after a certain period of time or once it reaches a certain size, and typically that point comes well before you want to, or are able to, get rid of the tape cartridges if you are actually using tape as a permanent archive. In what amounts to bad news disguised as good, you can still access the data on those tapes, but you have to know which tape contains what data (Excel spreadsheet, anyone?) and then rescan that tape. This is a very time-consuming process that typically requires trial and error and a lot of luck. Even if it works, you still have the issue of long-term access.
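A toy sketch of why pruning the backup catalog undermines a tape "archive"; the file paths, tape labels, dates, and retention policy below are invented for illustration.

    from datetime import datetime, timedelta

    # Hypothetical, simplified backup catalog: file path -> (tape label, backup date)
    catalog = {
        "/finance/q3_report.xls": ("TAPE0142", datetime(2004, 10, 1)),
        "/hr/policies.doc":       ("TAPE0388", datetime(2007, 6, 15)),
    }

    def prune_catalog(retention_days):
        """Drop catalog entries older than the retention window.

        The tape cartridges still exist after pruning, but the software can no
        longer tell you which tape holds which file -- hence the spreadsheets
        and the full-tape rescans described above.
        """
        cutoff = datetime.now() - timedelta(days=retention_days)
        for path in list(catalog):
            if catalog[path][1] < cutoff:
                del catalog[path]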

Long-term access to tape is problematic at best. I have seen and been involved with countless situations where tape recoveries have failed. The tape itself is intact, but the attempt to read and recover data is unsuccessful. There are many culprits, but the point is you needed data and you could not get it back. How it happened is nice to know, but it's much better to have a job.

It may not even be that the tape has gone bad or has an error. It could be progress: you may have moved on to a different media format and no longer have a drive that can read that tape, or you may have upgraded your backup application and no longer have a copy of the original application that runs on the newer operating system you just installed, leaving you unable to read the tape.

In addition to data de-duplication, most disk-based archive systems provide capabilities that address these limitations. First is access to the data. Data placed on a disk archive can be kept in its native, original format; there is no need to store it in a tape-efficient, proprietary format. In the past ten years, your means of accessing a CIFS or NFS mount point has changed very little. In that same ten years we have seen a major shift in tape formats (DLT to AIT to LTO) and multiple generations of those formats (DLT to Super DLT, AIT 1 to 5, and LTO 1 to 4). Your chances of reading data from a network volume that is 10 years old are much better than reading it off a tape that has been sitting on a shelf for the same period.

Finding data on disk can be done by merely navigating a directory structure, or, as the archive grows, data can be indexed at the content level to allow "Google-like" searches, making recovery easy. Also, with disk-based archiving, recovery does not always mean data movement. Most requests to the archive are for reference purposes, meaning you need to look at, but not modify, the file. If that is the case, the reference can be done with the data in place on the disk archive, a feat that is almost impossible with tape.

Retention of electronic content continues to become more important for most enterprises. With various rules and regulations, or even corporate governance, the length of time that data must be retained is increasing rapidly. How do you protect the archive, and prove the integrity of the archive, as it ages?
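To make the "Google-like" search idea above concrete, here is a toy content indexer over an archive share. The /mnt/archive path and the naive whitespace tokenizing are assumptions; commercial archives index at ingest with real parsers rather than crawling after the fact.

    import os
    from collections import defaultdict

    def build_index(archive_root):
        """Build a toy inverted index over an archive share: word -> file paths."""
        index = defaultdict(set)
        for dirpath, _dirs, files in os.walk(archive_root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, errors="ignore") as fh:
                        for word in fh.read().lower().split():
                            index[word].add(path)
                except OSError:
                    continue                  # skip unreadable files
        return index

    # e.g. index = build_index("/mnt/archive"); hits = index.get("invoice", set())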

If an archive tape is lost or broken, most organizations do not have a redundant copy; if they do, it is not readily accessible and takes effort to retrieve. A disk-based archive is typically either mirrored or running some sort of RAID configuration that not only provides redundant access to the data but does so instantly. Also, with disk-based archiving you know about drive or media failures before data is lost; with tape you find out only when you go to recover the data.

How can you make sure that the data itself does not degrade? With tape, the only way to do this would be to rescan the tapes on a periodic basis to check for quality. Going further, the data should be migrated to new media every three to five years. I don't know many IT professionals with the time or tools to do this. With disk-based archiving, it is automatic: most disk-based archiving systems have a built-in algorithm that continually checks the integrity of the data for the lifetime of that data.
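In spirit, that continuous integrity check works like the sketch below: record a checksum when data is ingested, then periodically re-read and compare. The manifest path is an assumption; real systems do this inside the array, continuously and automatically.

    import hashlib
    import json
    import os

    MANIFEST = "/mnt/archive/.checksums.json"   # assumed location of the ingest-time manifest

    def checksum(path):
        """SHA-256 of a file, read in 1-Mbyte chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub():
        """Re-read every archived file and flag any whose checksum has drifted."""
        with open(MANIFEST) as fh:
            expected = json.load(fh)            # path -> checksum recorded at ingest
        damaged = []
        for path, digest in expected.items():
            if not os.path.exists(path) or checksum(path) != digest:
                damaged.append(path)            # candidates for repair from a mirror or replica
        return damaged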

The final piece of protection is replication. With the correct data de-duplication strategy built in, replication of the archive to a remote location can be done with modest bandwidth requirements. This near-real-time replication, as data is added to the archive, eliminates the need for the archive to be backed up to disk or tape at all, greatly reducing actual data in the backup path.
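The bandwidth savings follow directly from de-duplication: before sending anything, the source asks the target which segment hashes it already holds and ships only the rest. A sketch of that filter, with hypothetical names:

    def segments_to_replicate(local_segments, remote_hashes):
        """Return only the segments the remote site does not already hold.

        local_segments: hash -> segment bytes at the primary site
        remote_hashes:  set of segment hashes already stored at the DR site
        Because only previously unseen segments cross the WAN, bandwidth stays modest.
        """
        return {h: seg for h, seg in local_segments.items() if h not in remote_hashes}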

Comparison to dumb disk

You may be tempted to try to create a disk-based archive yourself by purchasing regular ATA-based arrays, either as an add-on shelf from your primary storage supplier or as a separate unit. Alternatively, you may be a supplier thinking of creating a disk-based archive by adding functionality to your existing array.

The problem with this approach is that you get none of what the disk-based archiving vendors are offering: none of the efficiencies of data de-duplication, none of the protection (data integrity checking), and replicating the archive's data is so expensive as to be impractical, because there is no data de-duplication.

We are seeing some suppliers of cheap disk try to "bolt on" a data de-duplication capability to existing wares. And as an end user, you could also buy the cheap disk wholesale and just add de-dupe functionality as a service. Typically, this form of de-duplication falls into the category known as "post process" de-duplication; the data is stored on the unit first and is then crawled for redundant data, a time-consuming process.

With post-process data de-duplication you need extra storage space. You must have enough storage to receive the un-de-duplicated data, then enough capacity to store the actual de-duplicated data, and typically some sort of intermediary storage as a temporary working area. This not only requires more storage as backup and archive sets continue to grow, but also adds greater complexity to the overall architecture, thanks to difficult capacity planning, more points to manage, and more points to fail.

With post-process de-duplication, you also have to wait until the de-duplication pass is complete before replicating the data across the WAN. Given the time delays involved in de-duplicating after the data has landed, this can be very problematic and will greatly delay the time it takes for the DR site to come into sync.

In-line de-duplication, which occurs as the storage system receives the data, provides a better solution. There is no requirement for extra storage and no delay in replicating to a remote site. There is also no management complexity related to orchestrating separate data storage, data de-duplication, and data replication processes and procedures.

Moving data into a disk-based archive

As a general rule, there are five types of data in the enterprise that can be put on an archive: files, images, email, databases, and VMware images.

For most of us, files and images are obvious candidates, but what makes disk-based archiving unique is that placing this data may not require a data mover (a software application that identifies and moves data to an archive, setting up a transparent link back to it). Since the disk archive is essentially a very smart, but possibly slow, network share, files can be moved to it manually with standard copy-and-move commands. One has to question investing in an application whose job is to set up a transparent link to a file that has not been accessed in years. With a disk-based archive that is not a requirement, which can greatly reduce costs and simplify installation and management of the archive.
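Because the archive presents itself as an ordinary network share, "manually moved" really can be a handful of lines of script. A sketch, assuming the archive is mounted at /mnt/archive and a two-year last-access policy (both made up for illustration):

    import os
    import shutil
    import time

    ARCHIVE_MOUNT = "/mnt/archive"          # assumed CIFS/NFS mount of the disk archive
    AGE_LIMIT_DAYS = 730                    # example policy: not accessed in two years

    def archive_old_files(source_root):
        """Move files untouched for AGE_LIMIT_DAYS onto the archive share,
        preserving the directory layout so they stay browsable in native format."""
        cutoff = time.time() - AGE_LIMIT_DAYS * 86400
        for dirpath, _dirs, files in os.walk(source_root):
            for name in files:
                src = os.path.join(dirpath, name)
                if os.stat(src).st_atime < cutoff:
                    dst = os.path.join(ARCHIVE_MOUNT, os.path.relpath(src, source_root))
                    os.makedirs(os.path.dirname(dst), exist_ok=True)
                    shutil.move(src, dst)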

Using a disk-based archive as an email archive target may also seem obvious. Even though many email applications offer their own redundancy checking on files, they are not as granular as a disk-based archive, so great gains in storage efficiency will still be made. Add built-in integrity checking and replication capabilities, and a disk-based archive is a natural target storage device for email archiving software.

Databases may be a surprise. I am not advocating running the database from the archive, but rather archiving copies of the database to it. I see growing requirements to reproduce a database environment as it looked on a given day, and a disk-based archive makes this very easy. Because of the NAS-like functionality of the disk-based archive, you can use the database's internal dump or archive capabilities to drive data to the archive. With data de-duplication, which is especially effective on databases, storage efficiencies will be very high.
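As one concrete illustration (PostgreSQL and the paths below are assumptions, not anything the article specifies), driving a database dump onto the archive share is just the database's own dump tool pointed at the mount:

    import subprocess
    from datetime import date

    ARCHIVE_DIR = "/mnt/archive/db"         # assumed path on the archive share
    DB_NAME = "sales"                       # hypothetical database name

    def dump_to_archive():
        """Write a dated, point-in-time dump straight onto the disk archive,
        using the database's own dump tool (PostgreSQL's pg_dump as one example)."""
        target = f"{ARCHIVE_DIR}/{DB_NAME}_{date.today()}.dump"
        subprocess.run(["pg_dump", "-Fc", "-f", target, DB_NAME], check=True)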

VMware as a data generator may be a total shock. How do you back up and archive all those OS images that VMware creates? A disk-based archive is ideal: VMware data is extremely redundant, and once it is run through the de-dupe engine, storage efficiencies are extremely high.

To sum up, disk-based archiving eliminates issues of data movers, and it offers what is essentially a safe, storage-efficient NAS you can start using right away. Simply identifying old files in the environment and manually moving or dumping database and VMware data to it can show immediate payback. You may still need a data mover, but you have bought yourself time to decide where and when to use it.

Being able to leverage a disk-based archive target across so many functions is important. The purchase can be driven by legal or regulatory requirements but then leveraged for practical gains in the data center, such as reducing backup windows, reducing primary storage investments, lowering power and cooling costs, and cutting administration time.

With a disk-based archive in place, we can achieve a highly optimized storage architecture. The primary pool is used only for your working set of data: databases, email, and files that have been accessed in the last few months. The disk-to-disk backup pool backs up the now smaller primary pool, and the archive pool safely and securely retains the information that is no longer active or changing.

The result is a primary storage investment that grows relatively slowly (remember, primary storage keeps getting cheaper, so the longer you can delay a primary storage purchase the better); a reduction in backup and recovery windows; and an archive environment that will grow, but at an order of magnitude slower pace than regular disk. The deliverable of disk-based archiving is costs down, operational efficiencies up.

— George Crump, Founder and President, Storage Switzerland
