Disaster Recovery Planning
Explore DR strategies that can mean the difference between bankroll and bust.
January 16, 2004
Over the years, DRP (disaster-recovery planning) and its cousin BCP (business-continuity planning) have generated a variety of methods for protecting infrastructure components like networks, servers and software, even in complex, n-tier client-server configurations. However, data remains at significant risk. Physical infrastructure protection and recovery strategies are based on redundancy or replacement, but data cannot be replaced--which takes one of those two strategies off the table. The only way to protect data is redundancy: Make a good copy and store it safely out of harm's way.
Some claim that disaster recovery focuses on IT infrastructure replacement, while BCP focuses on business-process continuance. Others argue that DRP is an oxymoron, saying, "If you can recover from it, how can it be a disaster?" This school claims that BCP is more reflective of the goals of the activity and has a more positive psychological impact (read: more politically correct).
At the end of the day, we don't care a whit what you call it--DRP, BCP or EIEIO. It all means the same thing: Avoid preventable interruptions and develop strategies to cope with interruptions you can't prevent.
The first step is to copy your data. That's easy enough, right? Au contraire. In data replication, many factors add complexity and cost. Time, for example. Copying takes time--less if the copy is made to disk, more if to tape. Top tape-backup speeds achieved in laboratories today hover at about 2 TB per hour, assuming a sturdy interconnect, a state-of-the-art drive, perfect media and a well-behaved software stack. Disk-to-disk copying takes a fraction of the time required by tape, though this, again, is a function of interconnect (usually WAN) robustness, array-controller efficiency and many other factors.
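To put those throughput figures in perspective, here is a back-of-the-envelope sketch of the copy-window arithmetic. The data-set size and sustained rates below are illustrative assumptions, not benchmarks of any particular drive or array.

```python
# Back-of-the-envelope copy-window estimate (illustrative figures, not benchmarks).

def copy_hours(data_tb: float, throughput_tb_per_hr: float) -> float:
    """Hours needed to copy data_tb terabytes at a sustained throughput."""
    return data_tb / throughput_tb_per_hr

data_tb = 10.0  # assumed size of the data set to be protected

# Hypothetical sustained rates: a well-fed tape drive vs. disk-to-disk over a fast link
for label, rate in [("tape streaming", 2.0), ("disk-to-disk", 8.0)]:
    hours = copy_hours(data_tb, rate)
    print(f"{label:>14}: {hours:.1f} hours for {data_tb:.0f} TB at {rate} TB/hr")
```

Even crude numbers like these make it clear why the length of the copy window, not just media cost, drives the disk-versus-tape decision.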
Then there's geography. You want the copy of your data to reside far enough away so it won't be consumed by the same disaster that interrupted normal access to the original. With tape, this is no problem. The portability of removable media means you can make a local copy, then ship it off site for safekeeping. In disk-to-disk, the copy must be directed to an off-site target platform across a network interconnect on an ongoing basis.
Opinions vary on what constitutes an acceptable distance between source and target in disk-to-disk copying, but be aware that the greater the distance between the original disk platform and the remote platform, the more the data on the two devices is out of sync. This is called the delta in disaster-recovery parlance, and data deltas can be the difference between boom and bust. More to the point, deltas can determine whether the remote copy of your data can be used to restore application processing in the event of an interruption. Crash consistency is the shorthand used to express this concept.
Regardless of interconnect and platform efficiencies, deltas begin to accrue once data travels about 16 kilometers, thanks to speed-of-light constraints on signal velocity. In a communications system, propagation delay refers to the time lag between a signal's departure from the source and its arrival at the destination. When two arrays configured in a synchronous mirror are placed 16 kilometers or more apart, the propagation delay that accrues to storage I/O signals introduces noticeable latency into application performance: The application must wait until a response is received confirming that data has been written to both the primary (local) and secondary (remote) array before it continues with the next I/O. If the arrays are configured in an asynchronous mirror, the application doesn't wait for confirmation of the write at the remote location, but the delta then reflects the transit time between the local and remote arrays and worsens as the two are placed farther apart. Attempts to work around distance-induced latency have led to a proliferation of journaling, spoofing and caching strategies that do not so much surmount the issue of propagation delay as provide ways to live with it (see "Signal Velocity," below).
[Figure: Signal Velocity]
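For readers who want to run the numbers themselves, here is a rough sketch of the propagation-delay arithmetic. It assumes a signal velocity of roughly two-thirds the speed of light in optical fiber and one round trip per acknowledged write; real links add switch, protocol and controller overhead on top of this.

```python
# Rough propagation-delay estimate for a synchronous mirror (a sketch, not a vendor formula).
# Assumes signal velocity of about two-thirds the speed of light in fiber and one round trip
# per acknowledged write; switch, protocol and controller latency are ignored here.

SPEED_OF_LIGHT_KM_S = 300_000                     # vacuum, km/s
FIBER_VELOCITY_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3 # typical propagation speed in optical fiber

def round_trip_ms(distance_km: float) -> float:
    """Milliseconds added to each synchronous write by distance alone."""
    return (2 * distance_km / FIBER_VELOCITY_KM_S) * 1000

for km in (16, 80, 400, 1600):
    print(f"{km:>5} km: +{round_trip_ms(km):.2f} ms per mirrored write (propagation only)")
```

The per-write penalty looks small in isolation, but multiplied across thousands of I/Os per second it explains why synchronous mirrors are rarely stretched much beyond metropolitan distances.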
Other factors affecting the copy process may include the need to quiesce (turn off or idle) servers or applications while data copies are being made, or the requirement to do the copying within narrow windows to avoid overburdening production servers or networks with background copy tasks. In the face of many companies' 24x7x365 operating schedules, opportunities for off-line copying or backup are in increasingly short supply. Moreover, as the quantity of data that needs to be copied grows (estimates range between 40 percent and 100 percent per year, depending on the analyst), the idea of copying massive amounts of data in short windows of opportunity is becoming increasingly laughable.
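To illustrate how quickly that growth consumes a fixed window, consider the simple projection below. The starting data size, throughput and window length are assumptions chosen purely for illustration.

```python
# Sketch: how quickly annual data growth outgrows a fixed backup window.
# The starting size, throughput and window length are illustrative assumptions.

def years_until_window_blown(start_tb, throughput_tb_per_hr, window_hr, annual_growth):
    """Count whole years before a full copy no longer fits in the window."""
    size, years = start_tb, 0
    while size / throughput_tb_per_hr <= window_hr:
        size *= (1 + annual_growth)
        years += 1
    return years

for growth in (0.40, 1.00):  # the 40 to 100 percent annual growth range cited by analysts
    years = years_until_window_blown(start_tb=5.0, throughput_tb_per_hr=2.0,
                                     window_hr=6.0, annual_growth=growth)
    print(f"At {growth:.0%} annual growth, a 5 TB set outgrows a 6-hour window in ~{years} years")
```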
Finally, pragmatic constraints, typically related to option costs, limit the data-protection alternatives in most companies. Until recently, the industry provided only two choices for copying data: disk to tape (streaming backup) and disk to disk (mirroring). Tape has been the workhorse for data protection in most companies for the past 20 years, while array-to-array mirroring over distance has provided a pricey option for those with deep pockets and zero tolerance for downtime.
Proprietary mirroring schemes have always been a cost accelerator in disk-to-disk copying. With conventional mirroring, a company must buy two arrays (three for multihop mirroring) from the same vendor to use the vendor's remote mirroring software. Today, many systems tout "vendor-neutral" software-based mirroring but require that storage arrays first be aggregated into a Fibre Channel fabric--aka a SAN--which must then be overlaid with virtualization technology to facilitate copying. Such configurations still carry hefty price tags, and that's before factoring in the ongoing cost of a WAN interconnect or the additional expense of security and encryption for remote data copies.
What you get for all this extra money, of course, is fast recovery. Disk-to-disk replication saves a lot of time and eliminates the hassle of reloading data from tape into a usable form. High-end mirroring is important for certain industry segments, such as finance, whose operations are extremely sensitive to interruptions. These firms also tend to set up redundant data centers with identical gear to ensure that operations absolutely, positively will not stop.
For the rest of us, however, the cash to build a customized data center in waiting is not in the budget. The next best option is to buy a contract with a disaster-recovery facilities service provider, test your recovery strategy often and cross your fingers. With providers of this nature, flexibility of restore is often just as important as speed of restore, because commercial disaster-recovery facilities are often oversubscribed and can seldom guarantee that storage platforms identical to those used in your production setting will be available for recovery. Thus, being able to restore data to an alternative platform may be key to getting your enterprise back in business quickly after an interruption.
Score one for tape, which provides the means to restore data on the fly to whichever set of LUNs (logical unit numbers) is available.
Disk-to-disk advocates suggest that times are changing. Enabled by Fibre Channel fabrics, new software and inexpensive Serial ATA disk arrays, disk to disk may improve the economics and efficiencies of data protection, they say. Numerous vendors, including Breece Hill, Nexsan and Quantum, have introduced disk-based products that place a second tier of disk between primary or production arrays and tape-backup libraries, shortening backup time and expediting restores. Breece Hill combines disk and a tape autoloader in the same box, aimed at small and midsize businesses.
Some vendors use disk to emulate tape, taking advantage of the speed of disk to expedite backup processes, then dumping the data from aggregated backup streams to tape as an off-line process. Others present the Tier 2 disk platform as disk in its own right, providing a means to restore specific files rapidly in the event of accidental corruption, as well as a location for performing data-hygiene functions, such as virus scanning, junk-data purging and duplicate-data elimination, before the data is written to tape.
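As a concrete, if simplified, illustration of one such data-hygiene step, the sketch below skips duplicate file content on a Tier 2 staging area before it would be handed to a tape-out job. The staging path is hypothetical, and commercial products typically work at sub-file granularity and handle far more than this.

```python
# A minimal sketch of duplicate-data elimination on a Tier 2 disk staging area,
# one of the data-hygiene steps mentioned above. The path is hypothetical; real
# products also handle virus scanning, junk-file purging and sub-file deduplication.

import hashlib
from pathlib import Path

def unique_files(staging_dir: str):
    """Yield one representative path per unique file content, keyed by SHA-256."""
    seen = {}
    for path in Path(staging_dir).rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen[digest] = path
            yield path  # only unique content moves on to the tape-out job

if __name__ == "__main__":
    for f in unique_files("/staging/tier2"):  # hypothetical staging mount
        print("queue for tape:", f)
```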
Meanwhile, products from Arkivio, Avamar Technologies and other vendors seek to advance data protection beyond simple data copying and into data life-cycle management. They provide a less expensive home for infrequently accessed data, along with management tools for migrating the data from Tier 2 disk to tape or optical--or into the waste bin--as access requirements dictate.
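The general idea behind such policy-driven migration can be sketched in a few lines. The age thresholds, tier names and directory below are illustrative assumptions, not any vendor's actual policy engine.

```python
# Illustrative policy sketch for data life-cycle migration by last-access age.
# Thresholds, tier names and the scanned directory are assumptions for illustration only.

import time
from pathlib import Path

DAY = 86_400
POLICY = [                  # (minimum idle days, destination tier)
    (365, "purge-review"),  # candidates for the waste bin, pending review
    (180, "tape/optical"),
    (30,  "tier2-disk"),
]

def classify(path: Path) -> str:
    """Return the destination tier for a file based on days since last access."""
    idle_days = (time.time() - path.stat().st_atime) / DAY
    for threshold, tier in POLICY:
        if idle_days >= threshold:
            return tier
    return "primary"        # recently used data stays where it is

for p in Path("/data/projects").rglob("*"):   # hypothetical data set
    if p.is_file():
        print(p, "->", classify(p))
```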
All these "enhanced backup" products are expanding the options for data copying and data protection. Last year, these burgeoning approaches were in the process of being vetted by a nascent industry association, EBSI (Enhanced Backup Solutions Initiative), when the popular organization was first blockaded, then absorbed, by SNIA (Storage Networking Industry Association).
EBSI, when launched in late 2002, proved immensely popular with consumers, 450 of whom volunteered their support to the organization within a few days of its Web site unveiling. The encouraging grassroots response, which underscored the need for a sanity check on the proliferating set of alternatives for data protection, caught the attention of many vendors--and SNIA.
Apparently concerned about the popularity of a rival group and the threat of member vendor funds being spent on a non-SNIA endeavor, SNIA put the word out to its members not to join EBSI, stating that SNIA was about to launch SNIF (Storage Network Industry Forum) covering the same turf. Behind closed doors, EBSI founders were told they should let themselves be absorbed by SNIA or risk losing the investment they had made in EBSI. The founders chose to fold their activities into SNIA, where they were recast as the Data Protection Forum.
Little has been done since to vet and compare data-protection solutions appearing in the market. At the Fall Storage Networking World event in 2003, the forum chairman jokingly remarked that the group had met twice since being formed and had spent most of its time debating the meaning of the word continuous in the phrase continuous data protection. Let's hope 2004 brings more progress.
Also obfuscating planning efforts is the lack of clarity in government regulations. After 9/11, a panel of regulatory agencies was convened to assess the adequacy of data protection within the financial community. The group stopped short, however, of mandating or recommending distance requirements for data mirroring. In the wake of accounting scandals, corporate governance regulations have emerged that place data protection on the front burner but offer little guidance on how to provide protection in a compliant way. Health-care data privacy and portability laws, such as HIPAA (Health Insurance Portability and Accountability Act), as well as a growing number of Homeland Security laws and regulations, also have focused on data protection, but compliance systems and auditing standards remain moving targets.
In the final analysis, selecting the most appropriate strategy for data protection comes down to application requirements and budget, not much different from any other large IT investment. First and foremost, planners must understand applications and the business they support. They must define the characteristics that data inherits from the applications that produce it--in terms of criticality, priority of restoration, and requirements for access, retention and security.
This analysis helps determine which data must be copied, how frequently copies must be updated and which platforms should host the data in a recovery setting. Unfortunately, this is a laborious and time-consuming process; there are no automation shortcuts. However, a thorough analysis can narrow the range of options. If you aren't sure how to assess your application requirements, find a consultant to do it for you or teach you the basics.
Keep in mind that there is no "one size fits all" solution--different applications and data have different recovery requirements, which may be best served by several types of systems. Good policy-based storage-management software may prove useful in keeping multiple processes under control in complex settings (find news and reviews of storage-oriented products on our Storage Pipeline site, at www.nwc.storagepipeline.com). Test the candidate products thoroughly for compatibility with your requirements and infrastructure. And keep testing them as requirements change.
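One informal way to capture the output of that requirements analysis is a small, explicit table mapping each application to the protection attributes its data inherits. The sketch below is illustrative only; the application names and values are assumptions that a real plan would replace with figures from interviews and business-impact analysis.

```python
# Illustrative only: a tiny table of per-application protection requirements.
# Names and values are assumptions, stand-ins for real business-impact analysis results.

from dataclasses import dataclass

@dataclass
class ProtectionProfile:
    criticality: str         # e.g. "mission-critical", "important", "deferrable"
    copy_frequency_hrs: int  # how often copies must be refreshed (the tolerable delta)
    restore_priority: int    # 1 = first application restored at the recovery site
    retention_days: int      # how long copies must be kept for compliance

PROFILES = {
    "order-entry":  ProtectionProfile("mission-critical", 1,   1, 2555),
    "email":        ProtectionProfile("important",        24,  3, 365),
    "test-and-dev": ProtectionProfile("deferrable",       168, 9, 30),
}

for app, p in sorted(PROFILES.items(), key=lambda kv: kv[1].restore_priority):
    print(f"{app:<13} restore #{p.restore_priority}, copy every {p.copy_frequency_hrs} h, "
          f"keep {p.retention_days} d ({p.criticality})")
```

Even a simple inventory like this makes it harder for any single "one size fits all" product pitch to paper over genuinely different recovery requirements.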
Finally, share your results and experiences--in user groups, online forums and letters to the editor. There's a dearth of reliable "in-the-trenches" information for fellow planners, and the vendor marketecture mill is operating, as usual, in overdrive. Your experience can help bridge the information gap.
Bottom line: Data protection is an important undertaking that has been the odd man out at many companies for too long. Unfortunately, this places planners, many of them with no formal training, in the awkward situation of having to "bolt on" solutions to infrastructures that were not designed for recoverability. (To see how a group of vendors helped our fictional grocery titan beef up its disaster-recovery capabilities, see "Natural Selection.")
Jon William Toigo is CEO of storage consultancy Toigo Partners International and author of 13 books, including Disaster Recovery Planning: Preparing for the Unthinkable (Pearson Education, 2002). Write to him at [email protected].
If protecting your organization's data and maintaining an effective disaster-recovery plan are part of your job description, you may feel like the cards are stacked against you. Between the wildly increasing amount of data that needs to be backed up and stored, and the growing threats from script kiddies, disgruntled employees and Mother Nature, it's a tough way to make a living.
But there are some strategies that can tilt the odds in your favor. In "Data at Risk," we explain how to tap available resources, from professional organizations to local user groups, and cover the finer points of data replication, disk versus tape, physical siting of backup locations, data deltas, costs and more.
For our review, "Natural Selection," we issued an RFP (request for proposal) to Computer Associates, Fujitsu Softek, Hewlett-Packard, Quantum Corp., Tacit Networks and Veritas Software on behalf of our hypothetical retail company, Darwin's Groceries. The company, whose corporate motto is, "Driving mom-and-pop grocers to extinction, one community at a time," has awakened to the need for serious data protection after protests at some of its SuperGigantic stores turned ugly.
Although we found Tacit's response interesting, it did not fulfill the criteria for inclusion in the race for Editor's Choice. That honor went to Fujitsu Softek, because its solution could be broken down into three manageable chunks and neither required a huge rip-and-replace nor locked Darwin's into expensive hardware or proprietary technology.
You can find our complete RFP and all six vendors' responses here.
Web Links
• NWC Project: Data Recovery Plans
• "The Survivor's Guide to 2004: Storage and Servers"
• "Special Report: Ultimate Enterprise Storage"
Network Computing invited 15,000 readers to participate in an e-mail poll on data protection and received more than 640 responses. Although unscientific, the poll sheds light on the state of data protection.
Approximately 95 percent of respondents reported having an on-site data-protection capability, and 90 percent said this capability is used routinely. However, only 75 percent said they test or audit the capability on an ongoing basis. Of the 70 percent who said they've experienced an interruption requiring recovery services from their data-protection capability, the majority--89 percent--said they've succeeded in recovering critical data using their technique.
Storage quantities included in respondents' data-protection schemes varied widely, from tens of gigabytes to tens of terabytes. So did the time frame required for data restoration following an interruption, ranging from minutes to days.
Interestingly, 41 percent of respondents admitted to including all data in their recovery schemes rather than culling out duplicates or junk files. Sixty-four percent said their data copies are physically located within 30 miles of original-data repositories, whereas 24 percent reported that backup data is located beyond 30 miles.
The largest group (47 percent) reported a data delta--the time lag between original data and its most recent protected copy--of more than six hours. Twenty-five percent estimated a one- to six-hour difference, 10 percent boasted a delta measured in minutes, and a lucky few--5 percent--claimed a delta of less than 60 seconds.
Forty-seven percent claimed they would cope with their data delta by rekeying data or re-entering transactions as part of their recovery, while 15 percent claimed that their solution was bulletproof and would require no extraneous data reconstruction. Some 12 percent claimed they would move forward without replacing lost data, and nearly the same number said they had no idea how they would replace lost data. We like honesty.
Confidence in the recovery system is a litmus test of plan solvency. Only 33 percent reported being very confident about their data-protection strategy. Fifty-four percent said they were somewhat confident, and nearly 4 percent indicated that their résumés were up-to-date and stored safely off-site.
We created a scenario in which a fictitious supermarket company looks to improve its data management and protection, as well as its ability to weather data-loss disasters.
View the RFP responses of our participating vendors.