Refreshing Your Network DR Plan
Hurricane Helene was a reminder that network DR plans should be up to date. Here is a checklist to be prepared for the next disaster.
October 30, 2024
Micro networks, segmented networks, and enterprise networks all are now a part of the corporate IT ecosystem, which makes it important to revisit network disaster recovery and failover plans. Are all of these new network assets prepared for resilience and rapid recovery if disaster strikes?
Most sites find that they need to revisit and update their DR plans. These plans have historically been directed toward the restoration of enterprise systems and, at best, restoration of an entire enterprise network. They don’t necessarily address new network deployments such as subnetworks, micro networks, and edge computing.
The bottom line is that it's probably smart to revisit your network disaster recovery and restoration plan to see how inclusive it is.
Step one: Develop a network DR failover plan
When I ask CIOs about their disaster recovery plans, the clear emphasis is on mission-critical systems and time to recovery. Risk assessment usually gets attention too, including researching and preparing for vulnerabilities such as network security breaches, but all too often there are no well-developed DR plans for the networks themselves. Network staff factor failovers into daily network operations, but these operations are seldom formalized or tested.
Just how do you formalize a network disaster recovery plan?
The network DR plan is roughly 50% administrative documentation and 50% technical detail, so a logical first step is seeing what types of documentation you already have in place. The documentation will likely consist of network configuration documentation and update information, network recovery procedures, and a list of network vendors and their emergency contact information. In most cases, the network staff will find that this documentation is out of date.
Fortunately, there is a plethora of network documentation tools available in the commercial marketplace. These tools provide ease of use for documenting and some degree of automation, but the actual work of producing documentation falls on the network staff. The ultimate benefit is that you can bring your documentation current in the software, which then makes it easier to keep the documentation updated as you go forward.
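As a rough illustration of that first documentation audit, the sketch below flags documents whose last review is older than a cutoff. The document names mirror the categories mentioned above; the review dates and the 180-day threshold are assumptions made up for the example, not a recommendation or the behavior of any particular documentation tool.

```python
from datetime import date


def stale_documents(last_reviewed, today, max_age_days=180):
    """Return the names of documents last reviewed more than max_age_days ago."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if (today - reviewed).days > max_age_days
    )


# Hypothetical inventory: document type -> date it was last reviewed.
docs = {
    "network configuration": date(2023, 2, 1),
    "recovery procedures": date(2024, 9, 15),
    "vendor emergency contacts": date(2022, 11, 20),
}

stale = stale_documents(docs, today=date(2024, 10, 30))
print(stale)
```

A list like this gives the network staff a starting order for bringing documentation current.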
Step two: Look for risk points
In pre-micro network and pre-edge days, looking for network technical risk points was largely confined to vulnerability and security testing at network end points that interfaced with the outside world. Now, edge and micro-networks distributed throughout the company present new risk points.
Is a new micro-network in manufacturing properly secured at its edge? How well are your multi-cloud deployment boundaries secured? What about mobile devices and other automated equipment?
In short, the number of potential security and DR risk points has multiplied exponentially. Any one of these risk points could admit a harmful threat that throws the network staff into DR mode.
The only way to identify all network risk points is for network staff to find them one by one. Staff is then in a position to assess where documented security and failover procedures exist for these risk points and where they don't.
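One simple way to keep track of that one-by-one review is a small inventory that records, for each risk point, whether documented security and failover procedures exist. The sketch below is only an illustration: the `RiskPoint` fields and the sample entries are hypothetical, and a real inventory would likely live in a documentation or asset-management tool.

```python
from dataclasses import dataclass


@dataclass
class RiskPoint:
    name: str
    location: str
    has_security_procedure: bool
    has_failover_procedure: bool


def find_gaps(points):
    """Map each risk point with a missing procedure to the list of what is missing."""
    gaps = {}
    for point in points:
        missing = []
        if not point.has_security_procedure:
            missing.append("security")
        if not point.has_failover_procedure:
            missing.append("failover")
        if missing:
            gaps[point.name] = missing
    return gaps


# Hypothetical entries gathered during the review.
inventory = [
    RiskPoint("manufacturing-micro-network", "plant floor", True, False),
    RiskPoint("multi-cloud-boundary", "cloud", False, False),
    RiskPoint("hq-internet-edge", "headquarters", True, True),
]

gaps = find_gaps(inventory)
for name, missing in gaps.items():
    print(f"{name}: missing {', '.join(missing)} procedure(s)")
```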
Step three: Look for critical points of failure
Networks aggregate hardware, software, cabling, equipment, vendors, and communications technologies into a single routing system that must have multiple traffic routes for doing its work if it is to be truly resilient and failover-capable. For this reason, the goal of any network DR plan should be to eliminate as many single critical points of failure as possible.
If an ISP service goes down, the network, micro-networks, cloud, and other edge services should be able to fail over instantaneously to the services of an alternate ISP. At most sites today, this is standard practice. So is keeping a spare parts inventory for routers, IoT equipment, and other common network components, so that a part can quickly be swapped out if it fails.
What isn’t always standard practice is ensuring that failover schemes exist for micro networks and edge IT. What happens if manufacturing’s micro network goes down? Does it fail over to a cloud resource or to the central network? Network failovers should be defined and documented for all company networks so single point of failure risks can be avoided.
A similar precaution should be made for network equipment that presents a single point of failure. What happens if the power fails and your generator doesn’t work? Is there an alternative? If your VoIP phone system goes out, does the VoIP provider have failover, or do you need to plan for an alternative?
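If the network topology is documented as a list of links, a first pass at finding single points of failure can be automated. The sketch below is a minimal brute-force approach under stated assumptions: it takes each intermediate node offline in turn and checks whether the chosen sites can still reach the internet. The topology and node names are hypothetical, and the check covers only connectivity, not power, vendors, or capacity.

```python
from collections import deque


def reachable(links, start, removed):
    """Return the set of nodes reachable from start while `removed` is offline."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links, source, targets):
    """Map each intermediate node to the targets cut off from source if it fails."""
    result = {}
    for node in links:
        if node == source or node in targets:
            continue
        alive = reachable(links, source, removed=node)
        cut_off = sorted(t for t in targets if t not in alive)
        if cut_off:
            result[node] = cut_off
    return result


# Hypothetical topology: two ISPs, one core router, one plant switch.
links = {
    "internet": ["isp-a", "isp-b"],
    "isp-a": ["internet", "core-router"],
    "isp-b": ["internet", "core-router"],
    "core-router": ["isp-a", "isp-b", "hq-lan", "plant-switch"],
    "hq-lan": ["core-router"],
    "plant-switch": ["core-router", "plant-micro-network"],
    "plant-micro-network": ["plant-switch"],
}

spof = single_points_of_failure(links, "internet", ["hq-lan", "plant-micro-network"])
for node, lost in spof.items():
    print(f"{node} is a single point of failure for: {', '.join(lost)}")
```

In this made-up topology the dual ISPs are not flagged because each covers for the other, while the core router and the plant switch are, which is the kind of gap a failover scheme for the micro network would need to close.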
Step four: Formalize the network DR plan in writing
A network DR plan should address all networks throughout the enterprise, from the central corporate network out to the edge micro-networks. The documented failover operations should address all probable failure points and give specific operational direction for network personnel.
Internal network policies and SLAs for network users should also be considered.
What will be your mean time to recovery? Do you make a commitment to users that you will make the best effort to restore service within a specific timeframe? Do you have a way of informing users of ongoing network status while you are working on a recovery? Can users be failed over to a cloud while you are working on restoring service? Finally, is your network DR plan integrated with the overall IT DR plan?
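To put a number behind the first of those questions, mean time to recovery can be computed from past outage records as the average time between the start of an outage and the restoration of service. The sketch below assumes a simple log of start and restore timestamps; the sample outages and the one-hour target are invented for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(outages):
    """Average the (restored - started) duration over a list of outages."""
    durations = [restored - started for started, restored in outages]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical outage log: (outage started, service restored).
outages = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 45)),
    (datetime(2024, 3, 2, 14, 0), datetime(2024, 3, 2, 16, 0)),
    (datetime(2024, 6, 18, 22, 30), datetime(2024, 6, 18, 23, 45)),
]

mttr = mean_time_to_recovery(outages)
target = timedelta(hours=1)  # assumed commitment to users, for illustration only
meets_target = mttr <= target
print(f"MTTR: {mttr}, meets target: {meets_target}")
```

Comparing the measured figure with whatever timeframe is promised to users shows whether the commitment is realistic before it goes into an SLA.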
Step five: Test your DR plan
I’ve been through network DR plan testing with a number of companies, and I can’t think of one instance where all tests executed flawlessly.
We often discovered that recovery procedures were missing or out of date. Sometimes, we discovered that a certain documented procedure didn’t actually work in practice. User communication methods during an outage weren’t always well articulated—and there were single critical points of network failure that had been missed.
We unearthed these DR plan shortcomings by testing a variety of different DR scenarios. This enabled us to correct procedures and (where necessary) retrain network personnel on revisions to the plan.