Recent Outages Highlight the Need for Digital Resilience and Experience Assurance
This year’s network outages serve as a wake-up call for organizations of all sizes. They need to be proactive about managing their networks and battle-testing them for today’s unpredictable landscape.
August 30, 2024
A string of network outages at major companies in the first half of 2024 underscores the critical role of the IT network in our always-on digital world. These disruptions, impacting millions of users, highlight the importance of ensuring network infrastructure performance and resilience. Network complexity, combined with increasing demand and persistent cyber threats, calls for a new approach that delivers the scalability, flexibility, security, and ease of use needed to ensure consistent performance and protect against disruptions.
Specific details regarding the causes of these outages are still being investigated, but one thing is clear: network disruptions are on the rise and can be caused by a multitude of factors, so you need to be prepared for anything. Causes can range from hardware failures and software glitches to cyberattacks and even human errors in network configuration.
The recent outages raise a critical question for all organizations, especially large enterprises with global footprints and millions of customers: Is our network equipped to handle the ever-evolving demands of today's digital landscape? If the answer to that question is “no” or “I’m not sure,” you may have a serious problem on your hands.
Modernization is essential
The good news is that advancements in network technology offer solutions. Modernized networks, leveraging the power of intelligent automation, offer the agility and resilience needed for today's world. Here are some best practices for IT operations management (ITOM) that can help reduce the risk of network outages. By implementing these practices, organizations could potentially have prevented or lessened the impact of the significant outages we saw in the first half of this year.
Testing and Validation: On February 22, a major cellular network provider in the United States experienced a widespread network outage, impacting millions of customers across the country for roughly 12 hours. The company attributed the issue to an error during a network expansion project. Although we lack complete knowledge of the company's specific network environment, we do know that automated testing and validation are key to minimizing the risk of these types of errors. This includes pre-change lab-based testing, pre-checks, and post-checks to ensure an optimal network state before and after changes are made. While it's impossible to say definitively whether these techniques would have prevented this specific outage entirely, they could lessen the impact and help organizations in a similar situation restore service more quickly.
Configuration Management: In early March, a technical issue caused widespread outages for a major social media platform, impacting roughly half a million users globally for over two hours. According to ThousandEyes, the outage was likely caused by a backend service such as authentication. Outages like this often stem from software bugs or configuration errors introduced during updates or maintenance. Traditional troubleshooting methods can be time-consuming, leading to delays in resolving the issue and extended downtime for users. By automating configuration management, IT teams can thoroughly vet new updates and configurations before they go live. This can help catch and fix bugs much sooner, potentially preventing outages entirely. Rollback and preview capabilities provide additional measures to avoid major outages. Additionally, continuous integration and continuous delivery (CI/CD) practices can streamline the deployment of bug fixes once they've passed testing. This helps resolve outages quickly and minimizes user disruption.
Network Monitoring: On March 6, a brief disruption left many users unable to access another social media platform. Based on early analyses, an issue with the platform’s backend system (likely the servers storing user data and posts) prevented it from responding to requests from the edge network, resulting in a temporary outage for users. Modernized networks often have sophisticated monitoring tools that can detect issues like these early on. Combining monitoring with auto-remediation capabilities allows for quicker resolution of problems before they significantly impact users. Additionally, networks with intelligent features can reroute traffic or switch to backup systems instantaneously, minimizing overall downtime.
Network Visibility: On March 15, a major outage crippled a fast-food restaurant chain’s operations worldwide for several hours, impacting millions of customers across numerous countries. The outage was caused by a minor configuration change by a third-party vendor, highlighting the complexity and increased vulnerability of interconnected technology systems. With better visibility into the entire technology stack, including what third-party vendors are doing, enterprises can better identify potential problems before they cause outages. To further strengthen their defenses, enterprises can implement redundancy and diversification, making their networks less susceptible to outages caused by single points of failure.
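To make the pre-check, post-check, and rollback ideas above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any of the affected companies manage changes: the device configuration is modeled as a plain dictionary, and the names guarded_change, has_default_route, raise_mtu, and drop_route are hypothetical helpers invented for this example.

```python
import copy
from typing import Callable, Dict

# Assumption: a device configuration is modeled as a simple key/value mapping.
Config = Dict[str, str]


def guarded_change(config: Config,
                   change: Callable[[Config], None],
                   is_healthy: Callable[[Config], bool]) -> bool:
    """Apply `change` in place only if the pre- and post-checks pass.

    Returns True if the change was kept, False if it was skipped or rolled back.
    """
    if not is_healthy(config):          # pre-check: do not change an unhealthy network
        return False
    snapshot = copy.deepcopy(config)    # known-good state to roll back to
    change(config)
    if is_healthy(config):              # post-check: verify state after the change
        return True
    config.clear()                      # rollback: restore the snapshot in place
    config.update(snapshot)
    return False


def has_default_route(config: Config) -> bool:
    """Hypothetical health check: the device must keep a default route."""
    return config.get("default_route", "") != ""


def raise_mtu(config: Config) -> None:
    """Hypothetical safe change."""
    config["mtu"] = "9000"


def drop_route(config: Config) -> None:
    """Hypothetical faulty change that removes the default route."""
    config["default_route"] = ""


config = {"default_route": "10.0.0.1", "mtu": "1500"}
kept = guarded_change(config, raise_mtu, has_default_route)
rolled_back = not guarded_change(config, drop_route, has_default_route)
```

In this sketch the safe change is kept, while the faulty one fails its post-check and the configuration returns to its last known-good state. A real deployment would replace the dictionary with calls to the network devices or a lab environment.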
Steps to network modernization
To take this further, below are key steps all enterprises can take to modernize their networks and reduce the risk of costly outages:
Establish proactive responses: Set up systems to respond to monitoring and alerting conditions. Include periodic and triggered configuration audits, configuration and state drift detection, and proactive troubleshooting procedures to pinpoint network issues.
Enable self-healing mechanisms: Utilize technology, such as network automation with auto-remediation, to automatically fix common network problems, for example by correcting configuration errors, restarting failed devices, and rerouting traffic.
Enforce standardization with configuration management: Implement a system to enforce standard configurations, track changes, and enable rollbacks to known-good states.
Integrate continuous testing: Incorporate automated testing and validation, including pre-change lab-based testing, pre-checks, and post-checks to ensure optimal network state throughout changes.
Maintain clear documentation and visualization: Update network documentation, device inventories, and topology maps regularly. This minimizes manual errors and speeds up troubleshooting.
Streamline security posture with enforcement: Enforce security policy configuration automatically to minimize threats and the likelihood of security-related outages. Make sure patching and OS upgrades are current to reduce exposure.
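Several of the steps above (configuration audits, drift detection, standard enforcement, and self-healing) share one underlying pattern: compare what is running against an approved baseline, then restore the baseline where they differ. The Python sketch below shows that pattern in its simplest form. It assumes configurations can be represented as dictionaries, and the function names detect_drift and remediate, along with the sample settings, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: both the approved ("golden") and running configurations
# are available as simple key/value mappings.
Config = Dict[str, str]
Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def detect_drift(golden: Config, running: Config) -> Drift:
    """Return {setting: (expected, actual)} for every setting that differs."""
    drift: Drift = {}
    for key in sorted(set(golden) | set(running)):
        expected, actual = golden.get(key), running.get(key)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


def remediate(golden: Config, running: Config) -> List[str]:
    """Bring `running` back to `golden` in place; return the settings changed."""
    drift = detect_drift(golden, running)
    for key, (expected, _actual) in drift.items():
        if expected is None:
            del running[key]         # setting is not in the approved baseline
        else:
            running[key] = expected  # restore the approved value
    return list(drift)


golden = {"ntp_server": "192.0.2.10", "ssh_version": "2"}
running = {"ntp_server": "192.0.2.10", "ssh_version": "1", "telnet": "enabled"}

drift = detect_drift(golden, running)
changed = remediate(golden, running)
```

Here the audit flags the downgraded ssh_version and the unapproved telnet setting, and remediation restores the baseline. In practice the drift report would typically feed an alert or change ticket first, with automatic remediation reserved for well-understood cases.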
The network outages from the first half of the year serve as a wake-up call for organizations of all sizes. No longer can businesses cross their fingers and hope for their networks to get the job done. They need to be proactive about managing their networks and battle-testing them for today’s unpredictable landscape. By embracing network modernization best practices, businesses can build more agility and resilience into their existing infrastructure. This minimizes downtime and mitigates the impact of outages, and it also ensures a smoother and more reliable experience for users and employees alike. Investing in network modernization is no longer a luxury; it's a business imperative for thriving in today's ever-connected digital landscape.