3 Real Lessons in Network Resilience
By Ahmed Algam
Over the past few months, I’ve seen several real-world network failures up close. These incidents drove home a hard truth about architecting for network resilience:
Out-of-Band (OOB) access isn’t optional. It’s essential.
Here are three short but very real stories that made this point crystal clear.
1. The Power Outage That Didn’t Stop Us
Our Fremont office went dark. Completely dark. There was a power outage and our provider failed to give us a heads-up, so it took us by surprise.
No power meant routers, ESXi hosts, Proxmox servers, backup systems, and even Wi-Fi were knocked offline. It was a total blackout.
But we weren’t scrambling. We had architected a true out-of-band path using LTE. Even with the production network down, we still had a way in.
From miles away, we diagnosed the problem, rebooted critical infrastructure, and got things running again before most people even noticed.
Lesson: Your recovery plan is only as good as your last mile. If your failover path isn’t truly independent, it’s not a plan – it’s wishful thinking.
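One way to make “truly independent” concrete is to write down what each access path relies on and look for overlap. The sketch below is only an illustration of that idea: the component names (`utility-power`, `lte-modem`, and so on) are hypothetical and are not taken from the Fremont setup.

```python
# Hypothetical inventory: every component each access path relies on.
PRIMARY_PATH = {"utility-power", "core-router", "isp-fiber"}
OOB_PATH = {"ups-battery", "oob-console-server", "lte-modem"}


def shared_dependencies(primary: set[str], backup: set[str]) -> set[str]:
    """Return the components both paths rely on (shared points of failure)."""
    return primary & backup


def is_independent(primary: set[str], backup: set[str]) -> bool:
    """A backup path counts as independent only if it shares nothing with the primary."""
    return not shared_dependencies(primary, backup)


if __name__ == "__main__":
    print(is_independent(PRIMARY_PATH, OOB_PATH))
    # A modem fed from the same power source as the core router is not independent.
    print(shared_dependencies(PRIMARY_PATH, {"utility-power", "lte-modem"}))
```

Any non-empty result from `shared_dependencies` names something that can take down both paths at once. The check is only as good as the inventory behind it, so power feeds, cabling, and upstream carriers belong in the lists too.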
2. The Engineer Who Locked Himself Out
A partner’s network went down during a routine change. Not uncommon. What was uncommon? The fact that they had no access to fix it.
All their management traffic – SSH, APIs, everything – was routed through the same production network that had just failed. When that network died, so did their ability to reach any routers or switches. The team was flying blind.
We got the call, helped them recover, and discussed Isolated Management Infrastructure (IMI) best practices afterward.
Lesson: Never mix management and user traffic. You need a management plane that exists outside your data plane, especially when uptime is mission-critical.
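A quick audit can catch the most obvious form of this mistake: management addresses that sit inside production address space. This is a minimal sketch using Python’s standard `ipaddress` module. The subnet, device names, and addresses are all hypothetical.

```python
import ipaddress

# Hypothetical addressing plan, for illustration only.
PRODUCTION_NETS = [ipaddress.ip_network("10.0.0.0/16")]


def in_band_interfaces(mgmt_addrs: dict[str, str]) -> list[str]:
    """Return devices whose management address lies inside a production network."""
    flagged = []
    for device, addr in mgmt_addrs.items():
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in PRODUCTION_NETS):
            flagged.append(device)
    return flagged


if __name__ == "__main__":
    devices = {
        "core-router": "10.0.1.1",          # management shares the production range
        "access-switch": "192.168.100.12",  # dedicated management range
    }
    print(in_band_interfaces(devices))
```

Passing this check is necessary but not sufficient. A separate subnet carried over the same switches and uplinks as production still fails along with them, so the physical path needs the same scrutiny as the addressing.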
3. “That’s So Obvious Now…” – The Failover Fail
A customer had the right idea: install a 4G modem as a failover path. This is common, and it’s a great way to gain access in case the main path goes down.
But the modem was physically wired into their primary Cisco router.
When that router failed (power surge), so did the modem. To make things worse, their monitoring agent was running in-band. So when the network collapsed, their monitoring did, too. No visibility, no access, no control.
We pointed out this problem. Then we suggested running the agent on dedicated OOB gear instead. Their response?
“That’s so obvious now…but I didn’t even think about it.”
Lesson: Monitoring doesn’t help if it goes down with everything else. Build it into your OOB infrastructure. Make it resilient, not just present.
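The failure in this story can be reasoned about as a dependency chain: anything that relies on a failed component fails with it. The sketch below models that under a simplifying assumption (a component goes down if any one of its dependencies is down). The component names and both dependency maps are hypothetical reconstructions for illustration, not the customer’s actual topology.

```python
# Hypothetical dependency map: component -> what it directly depends on.
DEPENDS_ON = {
    "4g-modem": ["primary-router"],          # modem wired into the router
    "monitoring-agent": ["production-lan"],  # agent running in-band
    "production-lan": ["primary-router"],
    "primary-router": [],
}

# Same site, but with the modem standalone and the agent on OOB gear.
OOB_DESIGN = {**DEPENDS_ON, "4g-modem": [], "monitoring-agent": ["4g-modem"]}


def failed_components(root: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that goes down when `root` fails."""
    down = {root}
    changed = True
    while changed:  # repeat until no new failures propagate
        changed = False
        for component, deps in depends_on.items():
            if component not in down and any(d in down for d in deps):
                down.add(component)
                changed = True
    return down


if __name__ == "__main__":
    print(sorted(failed_components("primary-router", DEPENDS_ON)))
    print(sorted(failed_components("primary-router", OOB_DESIGN)))
```

In the first map, losing the router takes the modem and the monitoring agent with it. In the second, both survive, which is the difference between having a failover path and having one that works when it is needed.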
What I Want You To Take Away From These Stories
Resilience isn’t just about having backup tools or extra hardware.
It’s about designing for failure. It’s about building your architecture so that even if the core goes dark, you still have eyes and hands on the network.
OOB isn’t a luxury. It’s your lifeline. Make sure to architect it like one.
Here Are Resources to Help Build Your OOB Lifeline
- Rollback Gone Wrong: How Out-of-Band Saved Our Engineering Backbone
- After The Firewall Fails: How Gen 3 Out-of-Band Cuts the Ransomware Killchain
- Out-of-Band Deployment Guide
- The CrowdStrike Outage: How to Recover Fast and Avoid the Next Outage