Why VPNs and Jump Hosts Fail MSPs at Scale, And How To Fix It
MSPs and Managed Network Service providers depend on remote access every day. Engineers connect to firewalls, routers, switches, hypervisors, and servers across dozens or even hundreds of customer environments. It’s a core function of operations, and without it, MSPs just wouldn’t exist.
The foundation of the remote access model is familiar for many providers: VPN tunnels combined with jump hosts or bastion servers. These tools allow engineers to log into a centralized environment and reach infrastructure across customer networks. This model works reasonably well when there are few customers. But as MSPs add sites, scale their customer base, and deploy more infrastructure, this traditional model becomes unmanageable.
Let’s find out why by looking at how VPN and jump host architectures actually behave during real-world failure scenarios.
The MSP Remote Access Model
Most MSP/MNS environments rely on a layered remote access architecture. Engineers connect through a VPN gateway hosted either by the MSP or the customer environment. Once authenticated, they reach an internal jump host or bastion server that acts as a controlled entry point to the network infrastructure.
From the jump host/bastion server, they access infrastructure including:
- Edge routers and firewalls
- Core switches
- Hypervisors and storage systems
- Monitoring servers
- Identity services
- Virtual infrastructure platforms (like VMware, Microsoft Hyper-V, etc.)
Image: MSP remote access relies on the very infrastructure it manages.
This architecture has some benefits. It centralizes access control for the specific customer environment, somewhat simplifies credential management, and allows security teams to enforce authentication policies before engineers reach sensitive systems.
But remote access relies on the assumption that all of this production infrastructure remains operational.
What happens when it fails?
When In-Band Management Breaks: Common Failure Scenarios
VPNs and jump hosts operate entirely in-band, meaning they rely on the same network infrastructure they are meant to manage.
We covered this dependency at length in our last MSP article. Essentially, in-band management is cut off during failures, turning small issues into big outages that eat into MSP margins. And there’s a whole range of failures that can occur. Here are just a few of the common scenarios that lead to long outages and truck rolls:
Routing failures can entirely remove the path between engineers and the environment. A BGP misconfiguration, OSPF failure, or even a bad firmware update can drop VPN sessions instantly. The device causing the issue may still be running, but without access, engineers can’t fix it.
Firewall policy errors often block management traffic. A single misapplied rule or automated update can cut off access to internal systems. The firewall is online but unreachable, making a simple rule change impossible without on-site help.
WAN or ISP outages eliminate remote connectivity altogether. Even if the internal network is still functioning, engineers outside the environment have no way in. What should be a quick fix becomes a truck roll.
Authentication failures can lock engineers out of jump hosts, even when systems are otherwise healthy. If identity services like Active Directory or LDAP are unavailable, login attempts fail and troubleshooting stops.
Core service failures, such as DNS or certificate validation issues, can also break access indirectly. Devices may still be reachable, but the tools used to connect to them stop working.
We’ll break these scenarios down further in a separate article, but the pattern is clear: Even when infrastructure is still running, engineers lose the ability to reach it when it matters most.
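To make that pattern concrete, here is a minimal Python sketch of the dependency chain behind in-band access. The dependency names and the `SiteHealth` structure are hypothetical, chosen only to mirror the scenarios above; they are not taken from any product's health model.

```python
from dataclasses import dataclass

# Hypothetical dependency names that mirror the failure scenarios above.
IN_BAND_DEPENDENCIES = ("wan", "routing", "firewall_policy", "identity", "dns")


@dataclass
class SiteHealth:
    """Health flags for one customer site (illustrative only)."""
    wan: bool = True
    routing: bool = True
    firewall_policy: bool = True
    identity: bool = True
    dns: bool = True
    device_running: bool = True  # the device engineers actually want to fix


def in_band_reachable(site: SiteHealth) -> bool:
    """VPN plus jump host access works only if every dependency is healthy."""
    return all(getattr(site, dep) for dep in IN_BAND_DEPENDENCIES)


def failed_dependencies(site: SiteHealth) -> list:
    """Return the dependencies that currently block in-band access."""
    return [dep for dep in IN_BAND_DEPENDENCIES if not getattr(site, dep)]


# A misapplied firewall rule: the device is still up, but nobody can reach it.
site = SiteHealth(firewall_policy=False)
print(site.device_running, in_band_reachable(site), failed_dependencies(site))
```

Because reachability is the logical AND of every dependency, each service added to the access path is one more thing that can lock engineers out of otherwise healthy equipment.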
Why the Problem Gets Worse as MSPs Scale
Let’s set aside the fragility of this in-band remote access model and talk strictly about scale. When you’re managing dozens of customer environments, each one adds its own VPN gateways, firewall policies, routing domains, and identity integrations.
That simple remote access model turns into a highly distributed patchwork of VPN tunnels, jump hosts, bastion servers, and authentication systems spanning multiple networks. It doesn’t take a large leap of the imagination to see why this doesn’t scale.
Access is Fragmented
Engineers rarely connect to a single management environment (unless of course they’re using ZPE Cloud). Instead, they maintain separate access paths for each customer, which looks like this:
- Different VPN clients or portals
- Separate credential sets
- Unique bastion hosts
- Different network segmentation models
Image: MSPs need to juggle multiple access paths, credentials, and infrastructure for different customers.
Troubleshooting a single outage may require navigating several access layers before even reaching the affected device. This slows response time and increases the likelihood of access failures during incidents.
Ops Overhead Grows
As environments get bigger, so does the job of maintaining access infrastructure. MSP teams need to set up and maintain VPN gateways, manage identity federation between organizations, monitor jump host infrastructure, rotate/secure access credentials, and fix connectivity issues.
It’s easy for engineers to spend as much time maintaining the access system as they do managing the infrastructure itself.
Recovery Delays Multiply Across Sites
One incident is manageable. But imagine there’s a regional ISP outage or widespread software bug that takes down a dozen customer sites. Engineers are forced to:
- Queue troubleshooting tasks across environments
- Dispatch all their technicians to remote locations
- Coordinate access with third-party facilities
- Work around broken VPN connectivity
Image: Software bugs, like the one that caused 2024’s CrowdStrike outage, can render mission-critical PCs useless until remedied by on-site intervention.
As the number of managed sites grows, these recovery delays compound and the limitations of traditional remote access become clear.
Operational Costs Rise Quietly
Across so many sites and incidents per year, the financial impact adds up. That once-practical remote access solution becomes a hefty cost of doing business, especially when incidents require additional troubleshooting hours, escalations to senior engineers, on-site recovery and travel expenses, and SLA penalties or credits.
Engineering Turns Into Firefighting
One of the biggest business impacts is on engineering focus. Instead of optimizing the network, automating jobs, or rolling out security enhancements, engineers spend their time putting out operational fires. When strategic improvements take a back seat to remote access failures and reactive outage recovery, teams become less productive.
How To Fix It: Separate Management From Production
Solving the challenge doesn’t involve deploying more remote access or monitoring tools. Many MSPs are taking a step back and addressing the underlying architecture. They’re finding that out-of-band management using the proper Isolated Management Infrastructure (IMI) is the only path forward (pun intended).
Maintain Access When the Network Fails
Out-of-band architectures introduce a separate management path that operates independently of the production network. Instead of relying solely on VPN connectivity through the customer infrastructure, engineers can reach devices through a dedicated management plane designed specifically for recovery and operational control. This includes:
- Direct console access to network and other devices
- Independent connectivity using secondary and tertiary WAN links
- Centralized management gateways that remain reachable during major outages
This management plane is reachable via 5G/cellular, satellite (like Starlink), secondary ISP, and other links. Modern serial console servers, like the Nodegrid Serial Console Plus, also include enterprise-grade security features such as multi-factor authentication, zero trust controls, and isolation that keeps the management plane hidden from threats. MSPs remain in control whether they’re battling a widespread outage or an active cyberattack.
Image: Out-of-band management allows MSPs to securely connect to infrastructure, even when the production network fails.
If routing breaks, engineers can still reach the router console.
If firewall policies block access, engineers can log in through the out-of-band path and correct the rule.
If the WAN circuit fails entirely, cellular/satellite connectivity still provides a path into the environment.
The key difference is that management access no longer depends on the health of the production network. It stays reachable over its own independent path, even while production is down.
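The fallback behavior described here can be sketched as an ordered list of management paths, each with its own health probe. This is a simplified illustration, not how any particular out-of-band product selects links: the path names, their ordering, and the `select_management_path` helper are all assumptions, and the probes are stubbed with fixed results so the example runs anywhere.

```python
from typing import Callable, Optional, Sequence, Tuple

# A probe returns True when its path is currently usable.
Probe = Callable[[], bool]


def select_management_path(paths: Sequence[Tuple[str, Probe]]) -> Optional[str]:
    """Return the first path whose probe succeeds, in priority order."""
    for name, is_up in paths:
        try:
            if is_up():
                return name
        except OSError:
            # A probe that errors out (e.g. a network failure) counts as down.
            continue
    return None


# Stubbed probes simulating a WAN outage that also affects the secondary ISP.
# Names and ordering are hypothetical.
paths = [
    ("production_vpn", lambda: False),
    ("secondary_isp", lambda: False),
    ("cellular", lambda: True),
    ("satellite", lambda: True),
]

print(select_management_path(paths))  # prints "cellular"
```

In a real deployment each probe would test an actual link, but the design point is the same: the management path is chosen independently of whether the production path is healthy.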
Simplify Operations Across Many Environments
Out-of-band helps address the operational complexity that scales with traditional in-band management. Engineers no longer need to juggle separate VPNs, credentials, jump hosts, etc. for each customer. They get one management infrastructure that centralizes access and standardizes connectivity across sites. MSP teams get to:
- Maintain consistent access workflows across customers
- Enforce centralized authentication and authorization policies
- Audit administrative activity across all managed environments
- Reduce the number of tools required to access infrastructure
Image: Out-of-band helps MSPs streamline day-to-day operations by eliminating the need to juggle multiple VPNs, credentials, jump hosts, and other access layers for each customer.
MSPs that use ZPE Cloud, the secure management portal, can log in once and simply click to switch between customer environments. This simplifies day-to-day operations and outage recovery, and helps teams become more productive.
Combine Resilient Access and Centralized Control
Modern platforms combine out-of-band connectivity with centralized orchestration to provide both operational resilience and secure access management. Solutions like ZPE’s Nodegrid are designed to act as a dedicated management gateway for distributed infrastructure. Within this single platform, MSPs can:
- Maintain always-available console access to networking, compute, and the rest of their device stack
- Connect to remote sites through independent cellular or secondary links
- Enforce role-based access controls and identity integration
- Record and audit administrative sessions with detailed logging
- Manage thousands of devices across geographically distributed environments
Image: ZPE’s Nodegrid devices combine 9+ functions into one and create an isolated management infrastructure ideal for secure, reliable access to production assets.
This architecture effectively creates an isolated management plane that remains available even when the production network is experiencing failures.
Make Recovery Predictable Instead of Reactive
For MSPs, the real advantage of this model is operational. When engineers know they will always be able to reach infrastructure during an outage, recovery becomes faster and more consistent. Troubleshooting can begin immediately, configuration errors can be corrected remotely, and incidents that used to require on-site intervention can be resolved from the operations center.
At scale, these improvements translate directly into measurable outcomes:
- Faster mean time to resolution
- Fewer truck rolls
- Lower operational overhead
- Improved SLA performance
In other words, the architecture changes how teams handle operations and how efficiently MSPs grow their business.
Understanding the Financial Impact
For many providers, the operational costs of traditional remote access models remain hidden until they analyze how often incidents require on-site intervention or extended troubleshooting.
To help MSP teams quantify this impact, we created a simple worksheet that estimates the true cost of downtime across managed environments.
It walks through common inputs such as incident volume, technician time, truck roll costs, and SLA penalties to calculate the annual financial impact of outage recovery.
From there, it shows how resilient management infrastructure can significantly reduce those costs. Download it now to analyze your costs and see your potential ROI from adopting out-of-band.
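For a rough sense of the arithmetic involved, here is a minimal Python sketch that combines the inputs named above into an annual figure. It is not the worksheet's actual formula: the field names, the simple additive model, and every number in the example are placeholder assumptions for illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class DowntimeInputs:
    """Hypothetical inputs; substitute your own figures."""
    incidents_per_year: int
    technician_hours_per_incident: float
    technician_hourly_rate: float
    truck_roll_fraction: float  # share of incidents needing an on-site visit
    truck_roll_cost: float
    sla_penalty_per_incident: float


def annual_recovery_cost(x: DowntimeInputs) -> float:
    """Sum labor, truck roll, and SLA costs over one year."""
    labor = (x.incidents_per_year
             * x.technician_hours_per_incident
             * x.technician_hourly_rate)
    truck_rolls = x.incidents_per_year * x.truck_roll_fraction * x.truck_roll_cost
    sla = x.incidents_per_year * x.sla_penalty_per_incident
    return labor + truck_rolls + sla


# Placeholder numbers only, not benchmarks.
baseline = DowntimeInputs(
    incidents_per_year=100,
    technician_hours_per_incident=4.0,
    technician_hourly_rate=100.0,
    truck_roll_fraction=0.5,
    truck_roll_cost=1000.0,
    sla_penalty_per_incident=200.0,
)

# Assumed scenario: remote recovery halves the share of incidents needing a visit.
with_oob = replace(baseline, truck_roll_fraction=0.25)

print(annual_recovery_cost(baseline))   # 110000.0
print(annual_recovery_cost(with_oob))   # 85000.0
```

Changing a single input, such as the share of incidents that need a truck roll, shows how sensitive the annual total is to on-site recovery.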