Network Management Best Practices
Network management involves administering, controlling, and monitoring an organization’s network. For most companies, the top priority for network teams is ensuring the continuous availability of critical business services, even during disruptive events like natural disasters, ransomware attacks, and infrastructure failures. Network resilience is the ability to continue operating (if in a degraded state) and delivering digital services in the face of adversity. This guide discusses the network management best practices for improving and supporting network resilience.
Network management best practices
Isolated Management Infrastructure (IMI)
Major ransomware attacks and breaches happen so frequently that cybersecurity professionals must now operate as if the network has already been compromised. This high-threat atmosphere led to the rise of the Zero Trust Security methodology discussed below. It’s also why a recent CISA Binding Directive outlines the best practice of isolating your management interfaces to a designated management network.
Moving all control functions for network infrastructure off the production LAN reduces the risk of cybercriminals accessing your management interfaces and “crown jewel” assets. This practice is known as isolated management infrastructure (IMI), and it separates the management plane from the data plane using designated network infrastructure. Doing so prevents attackers on the production network from finding and accessing the interfaces used to control servers, firewalls, routers, and other critical infrastructure devices. Thanks to management network segmentation and zero-trust security controls, hacking an IMI is almost impossible.
The best practice is to use out-of-band (OOB) serial consoles (a.k.a. console servers or terminal servers) to help construct the IMI. An OOB management solution uses dedicated network interfaces (such as 4G/5G cellular LTE or fiber) to provide an Internet connection for remote management access that doesn’t rely upon the primary production network at all. The benefit of using an OOB console server for the IMI is that teams have continuous access to monitor, manage, troubleshoot, and recover remote infrastructure when the production network is unavailable. Additionally, routing management ports to terminate on OOB terminal servers deployed top-of-rack creates multiple layers of management isolation to protect critical assets from criminals on the network.
Another network management best practice aided by IMI and OOB serial consoles is an isolated recovery environment (IRE). An IRE is built with designated infrastructure that is easily and quickly deployable, including an OOB control plane (such as a serial console), redundant storage & compute, and security and recovery tools. This gives teams a safe environment to recover from ransomware attacks without worrying about reinfection. Ideally, the IMI will use devices that consolidate network functions to enable easy deployments and scaling of IRE and OOB, but those devices should have robust features that can host the apps, tools, and services required to rebuild systems and restore data.
Network automation
Modern networks are large, complex, and ever-expanding, with user expectations growing more demanding every day. Even the best network administrator sometimes makes mistakes, either through negligence or because they have an overwhelming amount of work to do. Maybe they copy and paste the wrong security setting for a particular firewall appliance in a rush to deploy a new site on time; perhaps they miss a critical device health alert because they’re responding to a separate incident. These human errors, while understandable, can have devastating consequences on network resilience by causing security breaches, equipment failure, and service outages.
Network automation removes human error from the equation, ensuring network management tasks are carried out perfectly every time. Automation streamlines the most tedious network and fleet management tasks so teams can improve efficiency without allowing anything to fall through the cracks. Automation tools also respond to changing network conditions faster than human administrators to optimize the performance and availability of critical systems and services.
Network security
As discussed above, network breaches occur so frequently that it’s now a security best practice to assume attackers are already on the network. This is part of the zero trust security methodology, which follows the principle of “never trust, always verify” regarding all the users, devices, and applications that access the network. Zero trust security uses strong authentication methods (e.g., 2FA or one-time passwords), hardware roots of trust, and network micro-segmentation. These methods prevent attackers from moving around the network and accessing valuable resources (such as management interfaces).
Another security-related network management best practice is to extend zero-trust controls and policies to the network’s edges, such as to work-from-home devices, branch offices, and other remote business sites. This is achieved using edge-centric security solutions such as Security Service Edge (SSE) and Secure Access Service Edge (SASE). These technologies route remote, web-destined network traffic through a whole stack of cloud-based security solutions. This allows organizations to apply consistent security to edge traffic without creating bottlenecks at a centralized firewall or deploying additional security appliances at each site.
Another emerging network management best practice, especially for complex, automated infrastructures, is using artificial intelligence (AI) and machine learning to aid security and recovery. For example, AIOps solutions analyze data pulled from various sources on the network, including monitoring platforms, security appliances, and system event logs. AIOps is excellent at detecting anomalies, extrapolating potential consequences, and positing solutions. It can find novel and zero-day threats on the network, spot the signs of an imminent device failure, and perform root-cause analysis (RCA) to discover the source of problems. AIOps enhances the management, automation, and security practices on this list to improve the overall efficiency and resilience of enterprise networks.
Network management FAQs
1. How do I ensure interoperability amongst network management solutions?
Managing a modern network requires many different solutions, often from many different vendors. All these solutions must work together to prevent the management plane from getting too complex and ensure there are no coverage gaps. One option is to stick within one vendor’s ecosystem, but you may miss out on beneficial features or pay for functionality you don’t need. The best approach is to use a vendor-neutral (a.k.a. vendor-agnostic) network management platform to unify all your tools. To learn more, read The Benefits of Vendor Agnostic Platforms in Network Management.
2. What’s the difference between network automation and orchestration?
Network automation and network orchestration are two concepts that are often referenced together, leading to some confusion about the difference between them. Network automation focuses on individual tasks and processes, such as deploying a single software update. Network orchestration involves coordinating and managing multiple tasks and processes, or even entire workflows, such as configuring and deploying all the software on a server. To learn more, read IT Automation vs Orchestration: What’s the Difference?
3. Is network resilience the same as redundancy and backups?
Redundancy and backups are both critical to business continuity, but they do not equate to network resilience. Backups are copies of data, configurations, and code that are used to restore failed (or compromised) production systems. Redundancy duplicates services, applications, and systems so the primary versions can be “failed over” in case of failure or attack. Resilience is an organization’s overall ability to recover or adapt when major disruptions occur. To learn more, read Network Resilience: What is a Resilience System?
Resilient network management with Nodegrid
These network management best practices represent the industry-leading solutions for addressing the most common resilience challenges facing organizations. The network resilience experts at ZPE Systems can help you implement these practices with Gen 3 out-of-band management solutions and a vendor-neutral network management platform that supports automation. ZPE’s Nodegrid platform is the perfect ransomware recovery multi-tool, providing an isolated control plane as well as access to all the tools and software needed to restore critical operations.
Network management best practices for ransomware recovery and resilience
Learn more about using Nodegrid to improve ransomware resilience by downloading our white paper, 3 Steps to Ransomware Recovery.
ZPE Systems offers various solutions to help you implement your enterprise network management strategy.
Including data center infrastructure management, critical remote infrastructure management, and a secure uCPE gateway for distributed branch & edge networks. To learn more, contact us online.
Network Resilience: What is a Resilience System?
Network resilience means being able to withstand or recover from adversity, service degradation, and complete outages with minimal business disruption. The longer business-critical services are down, or systems are breached, the greater the risk of significant financial, reputational, and legal consequences. A resilience system is a set of technologies that enable an organization to continue operating while teams work to repair failures and recover from cyberattacks. But what exactly is a resilience system, and what does it look like? This guide to network resilience defines resilience systems, provides example use cases, compares them to related technologies like backups and redundant systems, and describes the key components required to build them.
What is a resilience system?
A resilience system provides all the infrastructure, tools, and services necessary to continue operating, if in a degraded state, during major incidents. It also includes everything needed to recover data, rebuild systems, perform security testing, and continue delivering core business functionality. A resilience system is typically isolated from the production network, preventing cybercriminals from finding and compromising it and ensuring teams have continuous access even if the primary network goes down.
Resilience system use cases
Some examples of the challenges that resilience systems help overcome include:
1. Ransomware recovery
In a ransomware attack, cybercriminals infect systems with malware that spreads throughout the network and encrypts any data it encounters. Modern ransomware now uses packaged attacks that move at machine speed, instantly incapacitating entire networks. Organizations completely lose access to critical systems and data until they pay a ransom, often in untraceable cryptocurrency. Ransomware is an exceptionally tenacious form of malware and tends to reinfect backup data and rebuilt systems, significantly hampering recovery efforts and increasing the duration and cost of the attack. The best practice for resilience systems is to isolate them on an out-of-band (OOB) network, inaccessible to hackers who have breached the production in-band network. Doing so creates a safe, isolated recovery environment (IRE) where teams can restore critical data and systems without the risk of reinfection. The resilience system includes all the tools and hardware needed to restore critical business services and infrastructure. An IRE significantly accelerates ransomware recovery and minimizes downtime, so businesses can avoid paying ransoms and reduce the overall cost of attacks.
2. Network outages
Enterprise network architectures and supply chains are highly complex, with lots of moving parts that rely on external vendors to maintain availability. Just one of those vendors dropping the ball could take the entire organization offline, severely impacting network resilience. For example, in 2023, an expired cryptographic certificate caused Cisco’s Viptela SD-WAN appliances to fail on reboot, completely taking down affected networks until the issue was resolved. With a resilience system, Viptela customers could have potentially avoided this downtime by failing over to alternative network resources. For example, a resilience system with integrated cellular failover allows branches to continue connecting to and delivering critical business services while also providing a lifeline for remote teams to access and recover failed systems. A resilience system also provides observability and automatic notifications so teams are instantly alerted to issues like certificate expirations and can respond quickly to recover critical services.
3. Shift to remote work
Incidents like ransomware attacks and equipment failures happen frequently enough that companies can create detailed plans and proactively implement solutions to minimize their impact, but not all adverse events are so predictable. When the COVID-19 pandemic struck, the massive shift to remote work strained the network resources of most organizations. Instead of maintaining a limited number of branch offices, teams suddenly had to treat every employee as a new branch, leading to performance degradation and outages as they scrambled to reinforce the business’s remote capabilities. A resilience system gives teams the tools and resources they need to provision additional infrastructure, manage networking logic, deploy new security solutions, and more, even while the primary network is offline or under a heavy load. A resilience system is the key to quickly adjusting network performance and security to adapt to sudden changes like a transition to fully remote operations.
Do backups and redundancy equate to network resilience?
The short answer is no; backups and redundancy do not equate to network resilience, though they do contribute to making systems more resilient.
- Backups are copies of data, configurations, and application code used to do a hot or cold restore when a production system fails. The underlying infrastructure must remain operational for teams to access and use backups, and unless additional resilience measures are taken, it’s easy for backups to become infected or compromised, severely hampering recovery efforts.
- Redundancy involves duplicating critical systems, services, and applications as a failsafe in case the primaries go down. Organizations can “fail over” to the redundancies to continue critical business operations during outages. However, redundant systems are just as susceptible to failures and infections without additional resilience measures like out-of-band management and isolated management infrastructure.
Backups and redundancy are part of network resilience but alone are not enough to ensure business continuity. Resilience systems focus on maintaining the architecture of the production network while adding the ability to recover or adapt to adversity. The next section discusses all the tools and technologies that make up network resilience systems.
What does a resilience system look like?
There are four key components that go into a resilience system.
Alternative networking and compute resources ensure the organization can failover in the event of a network failure or continue delivering services when production servers are unavailable. Teams also need enough storage to restore backup data, build new systems, and support the content delivery network (CDN). Automation solutions like zero-touch provisioning (ZTP), configuration management, and security validation tools accelerate the recovery process while mitigating the risk of human error. Combined, these components enable teams to reduce the frequency, severity, and duration of outages, improving overall network resilience.
Network resilience with ZPE Systems
A resilient network will continue delivering critical business services in the face of any challenge, whether from cybercriminals, supply chain issues, global events, or even plain human error. A resilience system is isolated from the production network to ensure security and availability, and it consists of all the tools and technologies needed to troubleshoot, recover, and deliver your most crucial data, applications, and infrastructure. The Nodegrid platform from ZPE Systems is the perfect foundation for a resilience system. Nodegrid is a vendor-neutral, out-of-band management solution capable of running your choice of third-party software. Nodegrid allows you to build a highly customizable IRE containing all the tools needed to safely recover from ransomware. You can even use Nodegrid to deliver services while the primary network or systems are down, making it your all-in-one network resilience multi-tool.
Want to ensure network resilience by accelerating ransomware recovery?
Minimize the business impact of ransomware with the help of our whitepaper, 3 Steps to Ransomware Recovery. Learn how to follow Gartner’s best practices to build an Isolated Recovery Environment



