Avoiding DR Disasters
Proper focus and testing can reduce the risk that a disaster recovery plan fails when it's needed
June 1, 2007
If you had to act on your disaster recovery plan today, would it work?
If you're not sure, you're not alone. Sources say it's not only hard work to establish a DR plan; it's also hard work to make sure it amounts to more than theory.
"There's no magic answer. You have to plan it and try it, then try it again," says Rick Erickson, assistant network administrator at Fidelity Bank of Edina, Minn. His group has been working on a DR plan for over two years and testing a secondary physical DR site every few months for nearly a year.
Each test reveals more. "You can find a lot of problems, everything from making sure you're using the right cabling to configuring the firewall properly and making sure you have the right addresses," he says.
"I think it's probably impossible to predict every scenario, but if you have a plan and practice it, you can capture a lot of potential problems," says Luis Salazar, a shareholder and attorney with the firm of Greenberg Traurig, a worldwide law practice headquartered in Miami.Salazar isn't just talking as an advisor to corporate clients who are looking to set up data management and business continuity plans. His own office was destroyed during Hurricane Wilma in 2005. Thanks to an alert IT staff, he was able to work from the nearest WiFi hotspot, even though he had no office or desk.
So what do astute users like these do to ensure their DR plans actually work? Here is a checklist we gathered from talking to Erickson, Salazar, and others:
Set recovery priorities. This is task number one for most experts we talked to. "The first hurdle is to decide what gets stored/replicated/protected and to what extent," states Richard Taylor, senior system programmer, IT, for Clark County, Nev. It's not easy, he says: "If you thought ILM/records management/HSM was hard to get consensus on, wait until you try to get a roomful of battlin' business units to agree that only two of them get 'the GOOD backup!' "
Still, input from all levels of management, especially the highest level, will help establish the proper lists. Banks might determine that financial reporting is more important than supplier or employee information, for example. A retailer may need to preserve customer lists and transactions, as well as inventory data.
Scope out your facilities and match them to your requirements. Salazar says that once a list of key IT resources is made and the most critical ones are identified, it's important to determine which systems those applications depend on. "You'll need to promote internal resources and timing for the systems that will get those applications up and running."
Sometimes testing will reveal a need for new technology. Erickson's group at Fidelity Bank, for instance, discovered that one or two of their applications would be "clumsy" to recover unless they maintained offsite copies of virtual servers. (The bank uses Microsoft's virtualization software.) As long as the virtual servers are replicated at the remote DR site, it's relatively quick to update specific data, he says.
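How replica freshness gets verified depends entirely on the virtualization and replication tools in use, but the basic idea can be sketched roughly. In the hypothetical check below, the directory path, the .vhd extension, and the 24-hour threshold are assumptions for illustration, not details of the bank's setup:

    # Hypothetical sketch: flag replicated virtual-machine disk images at the
    # DR site whose last-modified time is older than the replication window.
    # The path, file extension, and threshold are illustrative assumptions.
    import time
    import pathlib

    REPLICA_DIR = pathlib.Path("/dr-site/vm-replicas")  # placeholder location
    MAX_AGE_SECONDS = 24 * 3600

    for image in REPLICA_DIR.glob("*.vhd"):
        age = time.time() - image.stat().st_mtime
        if age > MAX_AGE_SECONDS:
            print(f"Stale replica: {image.name} last updated {age / 3600:.1f} hours ago")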
Nail down timeframes. Establish your recovery time objective (RTO) and recovery point objective (RPO) up front. Erickson's group, for instance, assumes that most banking transactions must be completed within a 24-hour period. Their DR system calls for loan, operations, and teller areas to be up and running no later than a day after an outage. There are two- to three-day recovery times for less critical applications, such as some internal reporting functions.
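One way to make such objectives concrete is to record them per application tier and check drill results against them. Here is a minimal sketch; the tier names, application names, and numbers are invented, loosely modeled on the timeframes described above:

    # Illustrative only: hypothetical recovery tiers and applications, not the
    # bank's actual inventory.
    from datetime import timedelta

    RECOVERY_TIERS = {
        # tier: (recovery time objective, recovery point objective)
        "critical": (timedelta(hours=24), timedelta(hours=24)),  # loans, operations, teller
        "standard": (timedelta(days=3), timedelta(days=1)),      # internal reporting, etc.
    }

    APPLICATIONS = {
        "teller_system": "critical",
        "loan_processing": "critical",
        "internal_reports": "standard",
    }

    def drill_met_objectives(app, downtime, data_loss_window):
        """Return True if a recovery drill met the application's RTO and RPO."""
        rto, rpo = RECOVERY_TIERS[APPLICATIONS[app]]
        return downtime <= rto and data_loss_window <= rpo

    # A drill that restored the teller system in 18 hours with 6 hours of lost
    # transactions would meet its 24-hour objectives.
    print(drill_met_objectives("teller_system", timedelta(hours=18), timedelta(hours=6)))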
Call for help if you need it. Fidelity Bank doesn't rely solely on internal advice. Erickson and his group talk to vendors about disaster recovery in order to make sure they've got everything covered. Then they have an external consultant come by for an assessment.
Keep track of changes. "Most DR failures are caused by configuration and environmental changes to the storage infrastructure," says George Crump of consultancy Storage Switzerland.
"Make DR plan updates part of your normal change control management, and if you are not doing change control management, add that to part of your DR plan, as they are interrelated," says analyst Greg Schulz of the StorageIO consultancy. "Periodically audit backup and replication to ensure that data can be read and restored to alternate sources."
Test, test, and retest. Then test your tests. "Do periodic audits of test plans, procedures, and documentation, using people not familiar with the processes to help determine what is known and what is assumed and what is documented," says Schulz. "A bit of common sense in DR can go a long way."Schulz says it's important to use testing correctly. "The emphasis should not be as much on a successful test, rather, a focus on finding issues and fixing them. Granted, no ones wants to have a failed test. However, if you cannot find the faults or issues, it then becomes tougher to fix them and avoid problems with an actual recovery."
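One concrete way to turn "testing to find issues" into routine practice is to restore a sample of files to an alternate location and verify they match the originals. In this minimal sketch, "restore-tool" stands in for whatever backup software's command line is actually in use:

    # Minimal sketch of a restore audit: restore a sample file to scratch space
    # and verify it matches the source. The restore command is a placeholder.
    import hashlib
    import pathlib
    import subprocess
    import tempfile

    def sha256(path):
        return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

    def audit_restore(source_file, backup_id):
        with tempfile.TemporaryDirectory() as scratch:
            # Placeholder: invoke the real backup tool's restore command here.
            subprocess.run(["restore-tool", "restore", backup_id,
                            "--file", source_file, "--to", scratch], check=True)
            restored = pathlib.Path(scratch) / pathlib.Path(source_file).name
            ok = sha256(source_file) == sha256(restored)
            print(f"{source_file}: {'OK' if ok else 'MISMATCH -- investigate'}")
            return ok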
Count out IT. How will your systems recover if IT folk aren't available? It's important to determine how non-technical staff could cope in the absence of IT support.
Pick your spots well. Mirrored DR sites should be far enough from the main site that a single disaster is unlikely to take out both. Fidelity Bank, for instance, picked a site 14 miles away, then used data compression from Silver Peak to keep replication traffic moving quickly over a T1 line. Salazar says his firm uses a Citrix VPN to link its Miami headquarters to a DR site in Ga., the reasoning being that a hurricane hitting Fla. is unlikely to reach as far as Ga.
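A quick back-of-the-envelope calculation shows why compression matters on a link that size. The 3:1 ratio below is an illustrative assumption, not a figure from either firm:

    # Rough arithmetic: how much data a T1 (1.544 Mbit/s) can carry in a day.
    T1_BITS_PER_SEC = 1.544e6
    SECONDS_PER_DAY = 86_400

    raw_gb_per_day = T1_BITS_PER_SEC / 8 * SECONDS_PER_DAY / 1e9  # about 16.7 GB/day
    compressed_gb_per_day = raw_gb_per_day * 3                    # about 50 GB/day at 3:1

    print(f"Uncompressed: ~{raw_gb_per_day:.1f} GB of changed data per day")
    print(f"At 3:1 compression: ~{compressed_gb_per_day:.0f} GB of source data per day")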
The above is hardly a comprehensive list. Sources indicate DR is really a work in progress. But by taking the right steps and performing detailed tests, it's possible to build in a wide margin of safety.
Mary Jander, Site Editor, Byte and Switch
Companies mentioned in this article: Silver Peak Systems Inc., The StorageIO Group