3 Real Lessons in Network Resilience

By Ahmed Algam

Failure is a necessary part of life. It shines a light on things that you didn’t give enough attention to, so you can learn and grow. The same goes for life in IT. We do a lot of planning to prevent failure, but it inevitably shows up and reveals the flaws in our plans. We don’t like failure, but we kind of need it.

Over the past few months, I’ve seen many real-world examples of this. These incidents drove home a hard truth about architecting for network resilience:

Out-of-Band (OOB) access isn’t optional. It’s essential.

Here are three short but very real stories that made this point crystal clear.

1. The Power Outage That Didn’t Stop Us

Our Fremont office went dark. Completely dark. There was a power outage and our provider failed to give us a heads-up, so it took us by surprise.

No power meant routers, ESXi hosts, Proxmox servers, backup systems, and even Wi-Fi were knocked offline. It was a total blackout.

But we weren’t scrambling. We had architected a true out-of-band path using LTE. Even with the production network down, we still had a way in.

From miles away, we diagnosed the problem, rebooted critical infrastructure, and got things running again before most people even noticed.

Lesson: Your recovery plan is only as good as your last mile. If your failover path isn’t truly independent, it’s not a plan – it’s wishful thinking.
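One rough way to test whether a failover path is truly independent is to list what each path relies on and look for overlap. The snippet below is a minimal sketch of that idea, not a description of our actual tooling; the component names are hypothetical placeholders, and a real inventory would come from your own documentation.

```python
def shared_dependencies(primary, failover):
    """Return the components that both access paths rely on."""
    return sorted(set(primary) & set(failover))


# Hypothetical inventories: everything each path needs in order to work.
primary_path = ["utility-power", "core-router", "isp-fiber"]
oob_path = ["battery-backup", "oob-console", "lte-carrier"]

overlap = shared_dependencies(primary_path, oob_path)
if overlap:
    print("Not independent, shared components:", overlap)
else:
    print("No shared components found")
```

An empty result only means that nothing in the lists overlaps. It is only as honest as the inventory behind it, so power feeds and physical cabling belong in the lists too.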

2. The Engineer Who Locked Himself Out

A partner’s network went down during a routine change. Not uncommon. What was uncommon? The fact that they had no access to fix it.

All their management traffic – SSH, APIs, everything – was routed through the same production network that had just failed. When that network died, so did their ability to reach any routers or switches. The team was flying blind.

We got the call, helped them recover, and discussed Isolated Management Infrastructure (IMI) best practices afterward.

Lesson: Never mix management and user traffic. You need a management plane that exists outside your data plane, especially when uptime is mission-critical.
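A quick sanity check for this lesson is to confirm that no device’s management address lives inside a production subnet. The sketch below uses Python’s standard ipaddress module; the device names and address ranges are made up for illustration.

```python
import ipaddress


def mgmt_in_production(mgmt_addrs, production_nets):
    """Return device names whose management IP sits inside a production network."""
    nets = [ipaddress.ip_network(n) for n in production_nets]
    flagged = []
    for device, addr in mgmt_addrs.items():
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in nets):
            flagged.append(device)
    return sorted(flagged)


# Hypothetical example data, not taken from any real network.
production_nets = ["10.0.0.0/16"]
mgmt_addrs = {"core-sw1": "10.0.5.2", "edge-rtr1": "192.168.100.11"}

print(mgmt_in_production(mgmt_addrs, production_nets))
```

Separate addressing is necessary but not sufficient. A management subnet that still rides the same switches and uplinks as user traffic fails along with them, so treat this as a first filter rather than proof of isolation.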

3. “That’s So Obvious Now…” – The Failover Fail

A customer had the right idea: install a 4G modem as a failover path. This is common, and it’s a great way to gain access in case the main path goes down.

But the modem was physically wired into their primary Cisco router.

When that router failed (power surge), so did the modem. To make things worse, their monitoring agent was running in-band. So when the network collapsed, their monitoring did, too. No visibility, no access, no control.

We pointed out this problem. Then we suggested running the agent on dedicated OOB gear instead. Their response?

“That’s so obvious now… but I didn’t even think about it.”

Lesson: Monitoring doesn’t help if it goes down with everything else. Build it into your OOB infrastructure. Make it resilient, not just present.
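The failure in this story was transitive: the modem and the monitoring agent both depended, directly or indirectly, on the router that died. A small dependency walk makes that visible before an outage does. This is a simplified sketch that assumes a component goes down if anything it depends on goes down; all component names are hypothetical.

```python
from collections import deque


def failed_set(depends_on, root_failure):
    """Return every component that goes down when root_failure fails."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    down = {root_failure}
    queue = deque([root_failure])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, []):
            if component not in down:
                down.add(component)
                queue.append(component)
    return down


# Layout from the story: modem wired to the router, agent running in-band.
as_built = {
    "production-lan": ["primary-router"],
    "lte-modem": ["primary-router"],
    "monitoring-agent": ["production-lan"],
}

# Suggested layout: modem and agent hang off dedicated OOB gear instead.
redesigned = {
    "production-lan": ["primary-router"],
    "lte-modem": ["oob-appliance"],
    "monitoring-agent": ["oob-appliance"],
}

print(sorted(failed_set(as_built, "primary-router")))
print(sorted(failed_set(redesigned, "primary-router")))
```

In the first layout, losing the router takes the modem and the monitoring agent with it. In the second, only the production LAN is affected, so access and visibility survive.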

What I Want You To Take Away From These Stories

Resilience isn’t just about having backup tools or extra hardware.

It’s about designing for failure. It’s about building your architecture so that even if the core goes dark, you still have eyes and hands on the network.

OOB isn’t a luxury. It’s your lifeline. Make sure to architect it like one.

Get Hands-On Help From Our Engineers
My colleagues have years of experience architecting these resilience practices. Please use the form to send us a message and get help with your specific use case.

ZPE Systems delivers innovative solutions to simplify infrastructure management at the datacenter, branch, and edge. Learn how our Zero Pain Ecosystem can solve your biggest network orchestration pain points.