Providing Out-of-Band Connectivity to Mission-Critical IT Resources

What to do if You’re Ransomware’d: A Healthcare Example


This article was written by James Cabe, CISSP, a 30-year cybersecurity expert who’s helped major companies including Microsoft and Fortinet.

Ransomware gangs target the innocent and vulnerable. They hit a Chicago hospital in December 2023, a London hospital in October the same year, and schools and hospitals in New Jersey as recently as January 2024. This is one of the biggest reasons I’m committed to stopping these criminals by educating organizations on how to re-think and re-architect their approach to cybersecurity.

In previous articles, I discussed IMI (Isolated Management Infrastructure) and IRE (Isolated Recovery Environments), and how they could have quickly altered outcomes for MGM, Ragnar Locker victims, and organizations affected by the MOVEit vulnerability. Using IMI and IRE, organizations find that the key to not only speedy recovery, but also to limiting the blast radius and attack persistence, is isolation.

Why is isolation (not segmentation) key to ransomware recovery?

The NIST Cybersecurity Framework (version 1.1) has five core functions: Identify, Protect, Detect, Respond, and Recover. It’s missing a crucial step, however: Isolate. Stay tuned for a full breakdown of this in my next article. Isolation is critical because attacks move at machine speed and are both pervasive and persistent. If your management network is not fully isolated from production assets, the infection can spread to everything. Suddenly, you’re locked out completely and looking at months of tedious recovery. For healthcare providers, this jeopardizes everything from patient care to regulatory compliance.

Isolation is integral to building a resilience system, or in other words, a system that gives you more than basic serial console/out-of-band access and instead provides an entire infrastructure dedicated to keeping you in control of your systems — be it during a ransomware attack, ISP outage, natural disaster, etc. Because this infrastructure is physically and virtually isolated from production (no dependencies on production switches/routers, no open management ports, etc.), it’s nearly impossible for attackers to lock you out.

So, what really should you do if you’re ransomware’d? Let’s walk through an example attack on a healthcare system, and compare the traditional DR (Disaster Recovery) response to the IMI/IRE approach.

Ransomware in Healthcare: Disaster Recovery vs Isolated Recovery

Suppose you’re in charge of a hospital’s network. MDIoT, patient databases, and DICOM storage are the crown jewels of your infrastructure. Suddenly, you discover ransomware has encrypted patient records and is likely spreading quickly to other crown jewel assets. The risks and potential fallout can’t be overstated. Millions of people are depending on you to protect their sensitive info, while the hospital is depending on you to help them avoid regulatory/legal penalties and ensure they can continue operating.

The problem with Disaster Recovery

Though the word ‘recovery’ is in the name, the DR approach is limited in its capacity to recover systems during an attack. Disaster Recovery typically employs a couple of things:

  • Backups, which are copies of data, configurations, and code that are used to restore a production system when it fails.
  • Redundancy, which involves duplicating critical systems, services, and applications as a failsafe in the event that primaries go down (think cellular failover devices, secondary firewalls, etc.).

What happens when you try to activate your DR processes? It’s highly likely that you won’t be able to, because the typical DR setup relies on the production network. There’s no isolation.

Think about it this way: your backup servers need direct access to the data they’re backing up. If your file servers get pwned, your backup servers will, too. If your primary firewall gets hacked, your secondary will, too. The problem with backup and redundancy systems — and any system, for that matter — is that when they depend on the underlying infrastructure to remain operational, they’re just as susceptible to outages and attacks. It’s like having a reserve parachute that depends on the main parachute.
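To make the reserve-parachute problem concrete, here is a minimal Python sketch that walks a dependency map and reports which recovery systems still lean on production. The inventory, system names, and the `depends_on` helper are all hypothetical and exist only for illustration; a real audit would pull this data from your own asset inventory.

```python
from collections import deque

def depends_on(graph, start, targets):
    """Return the members of `targets` that `start` depends on, directly or indirectly."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return (seen - {start}) & set(targets)

# Hypothetical inventory: each system maps to the things it needs in order to function.
DEPENDENCIES = {
    "backup-server": ["prod-switch", "file-server"],
    "secondary-firewall": ["prod-switch"],
    "file-server": ["prod-switch"],
    "oob-console": ["cellular-modem"],
}
PRODUCTION = {"prod-switch", "file-server"}

for system in ("backup-server", "secondary-firewall", "oob-console"):
    shared = depends_on(DEPENDENCIES, system, PRODUCTION)
    status = ", ".join(sorted(shared)) if shared else "isolated from production"
    print(f"{system}: {status}")
```

In this made-up inventory, only the out-of-band console has no path back to production gear, which is the property an isolated design is trying to guarantee.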

And what about the rest of your systems? You just discovered the attack has encrypted your servers and is quickly bringing operations to a crawl. How are you going to get in and fight back? What if you try to log into your management network, only to find that you’re locked out? All of your tools, configurations, and capabilities have been compromised.

This is why CISA, the FBI, US Navy, and other agencies recommend implementing Isolated Management Infrastructure.

IMI and IRE guarantee you can fight back against ransomware

You discover that the ransomware has spread. Not only has it encrypted data and stopped operations, but it has also locked you out of your own management network and is affecting the software configurations throughout the hospital. This is where IMI (Isolated Management Infrastructure) and IRE (Isolated Recovery Environment) come in.

Because IMI is physically separate from affected systems, it guarantees management access so teams can set up communication and a temporary ‘war room’ for incident response. The IRE can then be created using a combination of cellular, compute, connectivity, and power control (see diagram for design and steps). Docker containers should be used to bring up each step.

Diagram showing a chart containing the systems and open-source tools that can be deployed for an Isolated Recovery Environment

Image: The infrastructure and incident response protocol involved in the Isolated Recovery Environment. These products were chosen from free or open-source projects that have proven useful in each stage of recovery. Each phase can be automated in pieces and then torn down along with its Docker container to limit the risk of leakage between phases.
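As a rough sketch of the "one container per phase" idea, the Python below only builds the `docker run` and `docker stop` command lines for each phase; it does not execute them. The phase names and image names are placeholders I made up, and defaulting to `--network none` is an assumption that will not suit tools that need connectivity.

```python
import shlex

def start_phase_cmd(phase, image, network="none"):
    """Build the argv that starts one recovery phase in a throwaway container."""
    return [
        "docker", "run",
        "--rm",                    # remove the container as soon as it stops
        "--detach",                # run in the background
        "--name", f"ire-{phase}",
        "--network", network,      # "none" means no network access (assumed default)
        image,
    ]

def stop_phase_cmd(phase):
    """Build the argv that stops (and, because of --rm, removes) a phase container."""
    return ["docker", "stop", f"ire-{phase}"]

# Hypothetical phases and images, for illustration only.
PHASES = [
    ("war-room", "example/chat:latest"),
    ("scan", "example/scanner:latest"),
    ("restore", "example/restore-tools:latest"),
]

for phase, image in PHASES:
    print(shlex.join(start_phase_cmd(phase, image)))
    print(shlex.join(stop_phase_cmd(phase)))
```

On a host inside the IRE with Docker installed, each list could be passed to `subprocess.run(cmd, check=True)`. Keeping the command construction separate makes it easy to review exactly what will run before anything touches the environment.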

Without diving too far into the technicalities, the IRE enables you to recover survivable data, restore software configurations, and prevent reinfection. Here are some things you can do (and should do) in this scenario, courtesy of the IRE:

Establish your war room

You can’t fight ransomware if you can’t securely communicate with your team. Use the IRE to create offline, break-the-glass accounts that are not attached to email. This allows you to communicate and set up ticketing for forensics purposes.

Isolate affected systems

There’s no use running antivirus if reinfection can occur. Use the IRE to take offline the switch that connects the backup and file servers. Isolate these servers from each other and shut down direct backup ports. Then, you can remote-in (KVM, iKVM, iDRAC) to run antivirus and EDR (Endpoint Detection and Response).

Restore data and device images

The key is to have the most current backups available, both for patient data and for device/software configurations. Because the IRE provides an isolated environment, and you’ve already pulled your backups offline, you can gradually restore data, re-image devices, and restore configurations without risking reinfection. The IRE ensures devices “keep away” from each other until they can be cleansed and recovered.

Things You’ll Need To Build The IMI and IRE

Network Automation Blueprint

We’ve created a comprehensive blueprint that shows how to implement the architecture for IMI and IRE. Don’t let the name fool you. The Network Automation Blueprint covers everything from establishing a dedicated management network, to automating deployment of services for ransomware recovery. Get your PDF copy now at the link below.

Gen 3 Console Servers To Replace End-of-Life Gear

It’s nearly impossible to build the IMI or deploy the IRE using older console servers. That’s because these only give you basic remote access and a hint of automation capabilities, while the IMI and IRE also require the ability to run VMs and containers. Gen 3 console servers provide what the IMI and IRE need, including full control plane/data plane separation, app hosting, and on-demand deployment of VMs/containers. They’ve also been validated by Synopsys and have built-in security features I’ve been talking about for years. Check out the link below for resources about Gen 3 and how we’ll help you upgrade.

Get in touch with me!

I’d love to talk with you about IMI, IRE, and resilience systems. These are becoming more crucial to operational resilience and ransomware recovery, and countries are passing new regulations that will require these approaches. Get in touch with me via social media to talk about this!

IT Automation vs Orchestration: What’s the Difference?


IT automation and orchestration are two important concepts in the field of information technology that are often used interchangeably but are actually quite different. IT automation focuses on individual tasks, whereas orchestration encompasses multiple tasks or even entire workflows. Each approach produces different results and helps teams meet different goals. They also have their own benefits and challenges that must be considered. This guide compares IT automation vs orchestration to clear up misconceptions and help organizations choose the right approach to streamlining their IT operations.

IT Automation vs Orchestration: What’s the Difference?


IT automation refers to the use of technology to automate repetitive tasks and processes, including things like automated backups, software updates, and monitoring systems. The goal of IT automation is to free up time and resources for IT professionals by automating routine tasks, allowing them to focus on more strategic initiatives.

Orchestration, on the other hand, is the coordination and management of multiple processes or entire workflows. This can include things like configuring and deploying new servers, managing network connections, and monitoring the performance of many different systems. The goal of orchestration is to improve the overall efficiency of IT operations, reducing costs and enabling greater scalability.
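The distinction can be sketched in a few lines of Python. Everything here is hypothetical: the function names and the returned strings stand in for real tooling. `backup_config` represents a single automated task, while `deploy_server` represents orchestration, meaning several automated tasks coordinated in a required order.

```python
def backup_config(device):
    """Automation: one repeatable task against one device."""
    return f"backed up {device}"

def install_os(server):
    return f"installed OS on {server}"

def configure_network(server):
    return f"configured network on {server}"

def enable_monitoring(server):
    return f"enabled monitoring on {server}"

def deploy_server(server):
    """Orchestration: coordinate several automated tasks as one workflow."""
    steps = [install_os, configure_network, enable_monitoring]
    # Steps run in order; an exception in any step stops the workflow there.
    return [step(server) for step in steps]

print(backup_config("core-switch-01"))
for line in deploy_server("app-server-07"):
    print(line)
```

Each step function is useful on its own as automation; the orchestration layer adds ordering and a single entry point for the whole workflow.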

The benefits of IT automation vs orchestration

Benefits of IT Automation vs Orchestration

IT Automation

  • Saves time
  • Reduces human error
  • Improves compliance

Orchestration

  • Increases operational efficiency
  • Improves network scalability
  • Ensures IT system reliability

One of the main benefits of IT automation is that it can save time and resources for IT professionals. By automating routine tasks, IT teams can focus on more strategic initiatives and projects. Additionally, automation helps reduce human error and increases the accuracy, speed, and efficiency of tasks. Automation also improves compliance, as automated processes are less prone to human negligence and are easier to audit.

Orchestration, on the other hand, helps improve the overall efficiency and effectiveness of IT operations. By automating the coordination and management of multiple tasks, orchestration helps ensure that different systems and processes work together seamlessly. Additionally, orchestration helps improve the scalability and reliability of IT systems by ensuring different components are configured and deployed correctly.

The challenges of IT automation and orchestration

IT Automation and Orchestration Challenges

IT Complexity

Teams can’t effectively automate IT operations unless they thoroughly understand all the tasks, systems, and workflows comprising a highly complex network.

Automation Skills Gap

A high demand for automation engineers makes it difficult and expensive to recruit, train, and retain qualified IT automation and orchestration professionals.

Supporting Infrastructure

Effective automation and orchestration deployments require a robust underlying infrastructure of specialized hardware and software solutions.

One of the main challenges of automation and orchestration is the complexity of IT systems. As organizations rely more heavily on specialized technology and grow both in size and in number of business sites, IT systems become increasingly complex and difficult to manage. Automation and orchestration help reduce complexity by automating routine tasks and coordinating the management of different systems. However, teams must understand those tasks and systems well enough to know how to automate them effectively; otherwise, mistakes will proliferate or there will be gaps in automated workflows.

Another IT automation and orchestration challenge is the need for skilled professionals to deploy and manage these solutions. As automation and orchestration become more prevalent, the demand for skilled professionals has increased, making it harder (and more expensive) to recruit and retain qualified automation engineers. The alternative is for organizations to spend time and resources training existing IT staff to work with automation and orchestration.

Additionally, organizations need to invest in the technology and infrastructure necessary to support automation and orchestration. Some examples of these automation infrastructure components include:

  • Gen 3 out-of-band (OOB) serial consoles, which allow teams to deploy third-party automation on an OOB network that doesn’t rely on production infrastructure, improving security and resilience. Gen 3 OOB also moves bandwidth-hogging orchestration workflows off the production network, which reduces latency for better performance.
  • Software-defined networking, which virtualizes the control and management processes and abstracts them from underlying LAN and WAN hardware. SDN, SD-WAN, and SD-Branch technologies enable a high degree of automation for networking workflows such as load balancing, application-aware routing, and failover.
  • Infrastructure as Code (IaC), which turns infrastructure configurations into software code. IaC enables the use of version control, zero-touch deployments, automatic configuration management, automated security testing, and other tools and processes that support automation and improve network resilience.
  • Orchestrator software, which controls all of the automated workflows on a network. The orchestrator is the central hub for teams to create, deploy, monitor, and troubleshoot automated workflows and infrastructure.
  • AIOps, or artificial intelligence for IT operations, which analyzes all the logs and data pulled from automated infrastructure devices and security appliances. AIOps provides predictive maintenance insights, automatic root-cause analysis (RCA), enhanced threat detection, and other functionality to help support a complex, automated network infrastructure.
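To illustrate the Infrastructure as Code item in the list above, here is a minimal Python sketch of the underlying idea: the desired configuration is plain data that can live in version control, and a small function works out what must change on a device to match it. The setting names, values, and the `plan_changes` helper are assumptions made for this example, not any particular IaC tool's API.

```python
def plan_changes(desired, actual):
    """Compare desired and actual settings; return the changes needed to converge."""
    changes = []
    for key, value in sorted(desired.items()):
        if key not in actual:
            changes.append(("add", key, value))
        elif actual[key] != value:
            changes.append(("update", key, value))
    # Anything present on the device but absent from the desired state is drift.
    for key in sorted(actual.keys() - desired.keys()):
        changes.append(("remove", key, actual[key]))
    return changes

# Hypothetical desired state (as it might be stored in version control)
# and the state actually found on the device.
DESIRED = {"hostname": "branch-gw-01", "ntp_server": "10.0.0.5", "ssh_enabled": True}
ACTUAL = {"hostname": "branch-gw-01", "ntp_server": "10.0.0.9", "telnet_enabled": True}

for action, key, value in plan_changes(DESIRED, ACTUAL):
    print(action, key, value)
```

Because the plan is computed rather than typed by hand, the same desired state can be reviewed, tested, and applied repeatedly across many devices.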

Tips for overcoming IT automation and orchestration challenges

While every organization will face unique IT automation and orchestration hurdles, there are two basic tips to help simplify any deployment. Using consolidated network hardware and vendor-neutral platforms can help reduce the complexity of network infrastructure, the need to hire additional staff, and the cost to deploy automation infrastructure.

  • Consolidated network hardware, such as all-in-one branch/edge gateway routers, significantly reduces the number of devices deployed at each business site. Fewer devices to automate means less complexity, and organizations save money on deployment costs like hardware overhead and automation license seats.
  • Vendor-neutral platforms, such as the Nodegrid infrastructure management platform from ZPE Systems, allow teams to use the automation and orchestration tools they’re most comfortable with regardless of provider, reducing the skills gap. Open platforms ensure seamless interoperability between all the various automated components to decrease management complexity. Vendor-neutral hardware also allows organizations to run software from multiple vendors on a single device, enabling even greater network consolidation to reduce the complexity and cost of automated infrastructure deployments.

Choosing IT automation vs orchestration

IT automation and orchestration are interconnected concepts that are frequently, but incorrectly, used interchangeably. Automation focuses on individual tasks, while orchestration manages multiple tasks and entire workflows. Both automation and orchestration can help improve the efficiency and effectiveness of IT operations, but they have their unique benefits and challenges. Organizations must carefully consider their IT systems and needs when deciding which approach to use.

IT automation vs orchestration simplified

The network automation experts at ZPE Systems have helped Big Tech brands like Amazon and Uber improve operational efficiency and resilience with IT automation and orchestration. Learn how to use these best practices to streamline your IT operations by downloading our Network Automation Blueprint.


Network Management Best Practices


Network management involves administering, controlling, and monitoring an organization’s network. For most companies, the top priority for network teams is ensuring the continuous availability of critical business services, even during disruptive events like natural disasters, ransomware attacks, and infrastructure failures. Network resilience is the ability to continue operating (even if in a degraded state) and delivering digital services in the face of adversity. This guide discusses the network management best practices for improving and supporting network resilience.

Network management best practices

Network Management Best Practices for Resilience

Isolated Management Infrastructure (IMI)

  • Moves management interfaces off the production network to protect them from cybercriminals

  • Out-of-band (OOB) management ensures continuous remote access to IMI even when production infrastructure is offline

  • Isolated Recovery Environments (IREs) allow teams to restore infrastructure and services without risking reinfection

Network Automation

  • Reduces the risk of failures or security breaches by eliminating human error in configuration changes

  • Simplifies fleet management tasks like connectivity checks, device location monitoring, and software patching

  • Application-aware routing, intelligent load balancing, and automatic failover ensure optimal performance and availability

Network Security

  • Zero trust security protects valuable data and resources from attackers already on the network

  • SASE and SSE extend enterprise security policies and tools to remote users, applications, and devices

  • AIOps provides enhanced security monitoring, threat detection, and remediation capabilities

Isolated Management Infrastructure (IMI)

Major ransomware attacks and breaches happen so frequently that cybersecurity professionals must now operate as if the network has already been compromised. This high-threat atmosphere led to the rise of the Zero Trust Security methodology discussed below. It’s also why a recent CISA Binding Operational Directive (BOD 23-02) outlines the best practice of isolating your management interfaces to a designated management network.

Moving all control functions for network infrastructure off the production LAN reduces the risk of cybercriminals accessing your management interfaces and “crown jewel” assets. This practice is known as isolated management infrastructure (IMI), and it separates the management plane from the data plane using designated network infrastructure. Doing so prevents attackers on the production network from finding and accessing the interfaces used to control servers, firewalls, routers, and other critical infrastructure devices. Thanks to management network segmentation and zero-trust security controls, hacking an IMI is almost impossible.

A diagram showing a multi-layered isolated management infrastructure.
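One simple way to check the separation described above is to audit where management interfaces actually live. The Python sketch below flags any device whose management address falls inside a production subnet. The subnet, device names, and addresses are invented for illustration, and a real audit would also need to cover routing and firewall rules, not just addressing.

```python
import ipaddress

def exposed_management(devices, production_nets):
    """Return devices whose management IP sits inside a production subnet."""
    nets = [ipaddress.ip_network(n) for n in production_nets]
    exposed = []
    for name, mgmt_ip in devices.items():
        addr = ipaddress.ip_address(mgmt_ip)
        if any(addr in net for net in nets):
            exposed.append(name)
    return sorted(exposed)

# Hypothetical addressing plan: production uses 10.10.0.0/16,
# while the isolated management network uses 192.168.100.0/24.
PRODUCTION_NETS = ["10.10.0.0/16"]
DEVICES = {
    "core-fw": "10.10.1.1",
    "edge-router": "192.168.100.12",
    "db-server-bmc": "10.10.40.7",
}

print(exposed_management(DEVICES, PRODUCTION_NETS))
```

Any device the function returns is a candidate for moving its management port onto the dedicated management network.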

The best practice is to use out-of-band (OOB) serial consoles (a.k.a. console servers or terminal servers) to help construct the IMI. An OOB management solution uses dedicated network interfaces (such as 4G LTE/5G cellular or fiber) to provide an Internet connection for remote management access that doesn’t rely upon the primary production network at all. The benefit of using an OOB console server for the IMI is that teams have continuous access to monitor, manage, troubleshoot, and recover remote infrastructure when the production network is unavailable. Additionally, routing management ports to terminate on OOB terminal servers deployed top-of-rack creates multiple layers of management isolation to protect critical assets from criminals on the network.

A diagram showing the components of an isolated recovery environment.

Another network management best practice aided by IMI and OOB serial consoles is an isolated recovery environment (IRE). An IRE is built with designated infrastructure that is easily and quickly deployable, including an OOB control plane (such as a serial console), redundant storage & compute, and security and recovery tools. This gives teams a safe environment to recover from ransomware attacks without worrying about reinfection. Ideally, the IMI will use devices that consolidate network functions to enable easy deployments and scaling of IRE and OOB, and those devices should have features robust enough to host the apps, tools, and services required to rebuild systems and restore data.

Network automation

Modern networks are large, complex, and ever-expanding, with user expectations growing more demanding every day. Even the best network administrator sometimes makes mistakes, either through negligence or because they have an overwhelming amount of work to do. Maybe they copy and paste the wrong security setting for a particular firewall appliance in a rush to deploy a new site on time; perhaps they miss a critical device health alert because they’re responding to a separate incident. These human errors, while understandable, can have devastating consequences on network resilience by causing security breaches, equipment failure, and service outages.

Network automation reduces the opportunity for human error, helping ensure network management tasks are carried out consistently every time. Automation streamlines the most tedious network and fleet management tasks so teams can improve efficiency without allowing anything to fall through the cracks. Automation tools also respond to changing network conditions faster than human administrators to optimize the performance and availability of critical systems and services.

Network Automation Examples

Infrastructure as Code (IaC) abstracts infrastructure configurations from the underlying hardware so they can be written and deployed as repeatable, automatable scripts.

Zero Touch Provisioning automatically downloads and installs new network device configurations with little to no human interaction to streamline remote deployments.

Software-Defined Wide Area Networking (SD-WAN) decouples WAN control functions from the underlying hardware to enable features like application-aware routing, intelligent load balancing, and automatic failover to improve performance and availability.

Automatic Patch Management ensures software vulnerabilities are closed before being exploited by cybercriminals while providing automatic recovery and rollback in case of issues.
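The patch-and-rollback behavior described in the last example can be sketched as follows. This is a simplified Python illustration under stated assumptions: the device is just a dictionary, and `health_check` is a hypothetical callable standing in for whatever post-patch verification your tooling performs.

```python
def patch_with_rollback(device, new_version, health_check):
    """Apply a patch, verify the device, and restore the old version on failure."""
    previous = device["version"]
    device["version"] = new_version      # stand-in for the real upgrade step
    if health_check(device):
        return "patched"
    device["version"] = previous         # health check failed: roll back
    return "rolled back"

# Hypothetical device record and health checks, for illustration only.
router = {"name": "branch-rtr-03", "version": "4.2.0"}

print(patch_with_rollback(router, "4.2.1", lambda d: True), router["version"])
print(patch_with_rollback(router, "4.3.0", lambda d: False), router["version"])
```

The important design point is that the previous version is recorded before the change, so a failed verification leaves the device in its last known-good state.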

Network security

As discussed above, network breaches occur so frequently that it’s now a security best practice to assume attackers are already on the network. This is part of the zero trust security methodology, which follows the principle of “never trust, always verify” regarding all the users, devices, and applications that access the network. Zero trust security uses strong authentication methods (e.g., 2FA or one-time passwords), hardware roots of trust, and network micro-segmentation. These methods prevent attackers from moving around the network and accessing valuable resources (such as management interfaces).
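The “never trust, always verify” principle and micro-segmentation can be pictured as a default-deny check: a connection is refused unless the requester is authenticated and the exact flow is on an allow list. The Python below is a toy sketch; the segment names, ports, and the `is_allowed` helper are hypothetical, and real zero trust platforms evaluate far more context than this.

```python
# Hypothetical allow list of (source segment, destination segment, port) flows.
ALLOWED_FLOWS = {
    ("admin-jump-host", "mgmt-interface", 22),
    ("web-tier", "app-tier", 8443),
}

def is_allowed(source, destination, port, authenticated):
    """Default deny: permit only authenticated requests for explicitly listed flows."""
    if not authenticated:
        return False
    return (source, destination, port) in ALLOWED_FLOWS

# An authenticated admin on the jump host may reach the management interface.
print(is_allowed("admin-jump-host", "mgmt-interface", 22, authenticated=True))
# A compromised web server trying to move laterally is refused.
print(is_allowed("web-tier", "mgmt-interface", 22, authenticated=True))
# Being on an allowed path is not enough without authentication.
print(is_allowed("admin-jump-host", "mgmt-interface", 22, authenticated=False))
```

Because nothing is permitted implicitly, an attacker who lands in one segment gains no automatic path to management interfaces in another.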

Another security-related network management best practice is to extend zero-trust controls and policies to the network’s edges, such as to work-from-home devices, branch offices, and other remote business sites. This is achieved using edge-centric security solutions such as Security Service Edge (SSE) and Secure Access Service Edge (SASE). These technologies route remote, web-destined network traffic through a whole stack of cloud-based security solutions. This allows organizations to apply consistent security to edge traffic without creating bottlenecks at a centralized firewall or deploying additional security appliances at each site.

A diagram illustrating a basic SASE network security architecture.

Another emerging network management best practice, especially for complex, automated infrastructures, is using artificial intelligence (AI) and machine learning to aid security and recovery. For example, AIOps solutions analyze data pulled from various sources on the network, including monitoring platforms, security appliances, and system event logs. AIOps is excellent at detecting anomalies, extrapolating potential consequences, and positing solutions. It can find novel and zero-day threats on the network, spot the signs of an imminent device failure, and perform root-cause analysis (RCA) to discover the source of problems. AIOps enhances the management, automation, and security practices on this list to improve the overall efficiency and resilience of enterprise networks.
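Commercial AIOps products rely on much richer models than anything shown here, but the basic idea of anomaly detection can be illustrated with a simple statistical baseline. In this Python sketch, a new reading is flagged when it sits more than a chosen number of standard deviations from recent history. The metric values, the threshold, and the `is_anomalous` helper are assumptions for illustration only.

```python
from statistics import mean, stdev

def is_anomalous(history, reading, threshold=3.0):
    """Flag a reading that deviates from the historical mean by more than
    `threshold` sample standard deviations."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return reading != mu
    return abs(reading - mu) / sigma > threshold

# Hypothetical CPU temperature samples (degrees C) from one appliance.
HISTORY = [41, 42, 40, 43, 41, 42, 40, 41]

print(is_anomalous(HISTORY, 42))   # within the normal range
print(is_anomalous(HISTORY, 78))   # far outside the baseline
```

A check like this only catches deviations in a single metric; correlating many signals to find root causes is where the AI and machine learning described above come in.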

Network management FAQs

1. How do I ensure interoperability amongst network management solutions?

Managing a modern network requires many different solutions, often from many different vendors. All these solutions must work together to prevent the management plane from getting too complex and ensure there are no coverage gaps. One option is to stick within one vendor’s ecosystem, but you may miss out on beneficial features or pay for functionality you don’t need. The best approach is to use a vendor-neutral (a.k.a. vendor-agnostic) network management platform to unify all your tools. To learn more, read The Benefits of Vendor Agnostic Platforms in Network Management.

2. What’s the difference between network automation and orchestration?

Network automation and network orchestration are two concepts that are often referenced together, leading to some confusion about the difference between them. Network automation focuses on individual tasks and processes, such as deploying a single software update. Network orchestration involves coordinating and managing multiple tasks and processes, or even entire workflows, such as configuring and deploying all the software on a server. To learn more, read IT Automation vs Orchestration: What’s the Difference?

3. Is network resilience the same as redundancy and backups?

Redundancy and backups are both critical to business continuity, but they do not equate to network resilience. Backups are copies of data, configurations, and code that are used to restore failed (or compromised) production systems. Redundancy duplicates services, applications, and systems so the primary versions can be “failed over” in case of failure or attack. Resilience is an organization’s overall ability to recover or adapt when major disruptions occur. To learn more, read Network Resilience: What is a Resilience System?

Resilient network management with Nodegrid

These network management best practices represent the industry-leading solutions for addressing the most common resilience challenges facing organizations. The network resilience experts at ZPE Systems can help you implement these practices with Gen 3 out-of-band management solutions and a vendor-neutral network management platform that supports automation. ZPE’s Nodegrid platform is the perfect ransomware recovery multi-tool, providing an isolated control plane as well as access to all the tools and software needed to restore critical operations.

Network management best practices for ransomware recovery and resilience

Learn more about using Nodegrid to improve ransomware resilience by downloading our white paper, 3 Steps to Ransomware Recovery.


ZPE Systems offers various solutions to help you implement your enterprise network management strategy, including data center infrastructure management, critical remote infrastructure management, and a secure uCPE gateway for distributed branch & edge networks. To learn more, contact us online.


Network Resilience: What is a Resilience System?


Network resilience means being able to withstand or recover from adversity, service degradation, and complete outages with minimal business disruption. The longer business-critical services are down, or systems are breached, the greater the risk of significant financial, reputational, and legal consequences. A resilience system is a set of technologies that enable an organization to continue operating while teams work to repair failures and recover from cyberattacks. But what exactly is a resilience system, and what does it look like? This guide to network resilience defines resilience systems, provides example use cases, compares them to related technologies like backups and redundant systems, and describes the key components required to build them.

What is a resilience system?

A resilience system provides all the infrastructure, tools, and services necessary to continue operating, even if in a degraded state, during major incidents. It also includes everything needed to recover data, rebuild systems, perform security testing, and continue delivering core business functionality. A resilience system is typically isolated from the production network, preventing cybercriminals from finding and compromising it and ensuring teams have continuous access even if the primary network goes down.

Resilience system use cases

Some examples of the challenges that resilience systems help overcome include:

1. Ransomware recovery

In a ransomware attack, cybercriminals infect systems with malware that spreads throughout the network and encrypts any data it encounters. Modern ransomware now uses packaged attacks that move at machine speed, instantly incapacitating entire networks. Organizations completely lose access to critical systems and data until they pay a ransom, often in untraceable cryptocurrency. Ransomware is an exceptionally tenacious form of malware and tends to reinfect backup data and rebuilt systems, significantly hampering recovery efforts and increasing the duration and cost of the attack. The best practice for resilience systems is to isolate them on an out-of-band (OOB) network, inaccessible to hackers who have breached the production in-band network. Doing so creates a safe, isolated recovery environment (IRE) where teams can restore critical data and systems without the risk of reinfection. The resilience system includes all the tools and hardware needed to restore critical business services and infrastructure. An IRE significantly accelerates ransomware recovery and minimizes downtime, so businesses can avoid paying ransoms and reduce the overall cost of attacks.

2. Network outages

Enterprise network architectures and supply chains are highly complex, with lots of moving parts that rely on external vendors to maintain availability. Just one of those vendors dropping the ball could take the entire organization offline, severely impacting network resilience. For example, in 2023, an expired cryptographic certificate caused Cisco’s Viptela SD-WAN appliances to fail on reboot, completely taking down affected networks until the issue was resolved. With a resilience system, Viptela customers could have potentially avoided this downtime by failing over to alternative network resources. For example, a resilience system with integrated cellular failover allows branches to continue connecting to and delivering critical business services while also providing a lifeline for remote teams to access and recover failed systems. A resilience system also provides observability and automatic notifications so teams are instantly alerted to issues like certificate expirations and can respond quickly to recover critical services.
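The certificate-expiration alerting mentioned above is easy to picture with a small Python sketch. It takes a table of certificate expiry dates and reports those that have expired or will expire within a warning window. The certificate names, dates, and the 30-day window are invented for this example; a real system would read expiry dates from the certificates themselves.

```python
from datetime import date, timedelta

def expiring_soon(certs, today, warn_days=30):
    """Return names of certificates already expired or expiring within warn_days."""
    cutoff = today + timedelta(days=warn_days)
    return sorted(name for name, not_after in certs.items() if not_after <= cutoff)

# Hypothetical certificate inventory: name -> expiry ("not after") date.
CERTS = {
    "sdwan-controller": date(2024, 5, 9),
    "vpn-gateway": date(2025, 1, 15),
    "oob-console": date(2024, 4, 1),
}

for name in expiring_soon(CERTS, today=date(2024, 4, 20)):
    print(f"ALERT: certificate for {name} needs renewal")
```

Running a check like this from the isolated management network means the alert still fires even if the production monitoring stack is the thing that failed.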

3. Shift to remote work

Incidents like ransomware attacks and equipment failures happen frequently enough that companies can create detailed plans and proactively implement solutions to minimize their impact, but not all adverse events are so predictable. When the COVID-19 pandemic struck, the massive shift to remote work strained the network resources of most organizations. Instead of maintaining a limited number of branch offices, teams suddenly had to treat every employee as a new branch, leading to performance degradation and outages as they scrambled to reinforce the business’s remote capabilities. A resilience system gives teams the tools and resources they need to provision additional infrastructure, manage networking logic, deploy new security solutions, and more, even while the primary network is offline or under a heavy load. A resilience system is the key to quickly adjusting network performance and security to adapt to sudden changes like a transition to fully remote operations.

Do backups and redundancy equate to network resilience?

The short answer is no; backups and redundancy do not equate to network resilience, though they do contribute to making systems more resilient.

  • Backups are copies of data, configurations, and application code used to do a hot or cold restore when a production system fails. The underlying infrastructure must remain operational for teams to access and use backups, and unless additional resilience measures are taken, it’s easy for backups to become infected or compromised, severely hampering recovery efforts.
  • Redundancy involves duplicating critical systems, services, and applications as a failsafe in case the primaries go down. Organizations can “fail over” to the redundancies to continue critical business operations during outages. However, redundant systems are just as susceptible to failures and infections without additional resilience measures like out-of-band management and isolated management infrastructure.
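One way to picture an "additional resilience measure" for backups is an integrity check before anything is restored. The sketch below assumes a hypothetical manifest of SHA-256 hashes recorded when each backup was taken and kept somewhere the production network cannot reach. The function names are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backup archives do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_tampered(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return names of backups that are missing or whose hash differs.

    `manifest` maps file name -> SHA-256 recorded at backup time
    (assumed to be stored off the production network).
    """
    suspect = []
    for name, expected in sorted(manifest.items()):
        candidate = backup_dir / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            suspect.append(name)
    return suspect
```

A matching hash only shows that a file has not changed since the manifest was written. It does not prove the backup was clean when it was taken, so a check like this complements, rather than replaces, scanning restored data inside an isolated environment.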

Backups and redundancy are part of network resilience but alone are not enough to ensure business continuity. Resilience systems focus on maintaining the architecture of the production network while adding the ability to recover or adapt to adversity. The next section discusses all the tools and technologies that make up network resilience systems.

What does a resilience system look like?

There are four key components that go into a resilience system.

Key Components of a Resilience System

  • Alternative Networking: Full-stack routing and switching, Wi-Fi, VoIP, virtualization, software-defined network overlays for SDN & SD-WAN
  • Alternative Compute: Full-stack compute, containers, virtual machines, and any other resources needed to run applications and deliver services
  • Storage & Storage Recovery: Enough storage to recover systems and applications as well as support content delivery
  • Automation: Tools like zero-touch provisioning (ZTP) to facilitate speedy recovery while minimizing human error

Alternative networking and compute resources ensure the organization can fail over in the event of a network failure or continue delivering services when production servers are unavailable. Teams also need enough storage to restore backup data, build new systems, and support the content delivery network (CDN). Automation solutions like zero-touch provisioning (ZTP), configuration management, and security validation tools accelerate the recovery process while mitigating the risk of human error. Combined, these components enable teams to reduce the frequency, severity, and duration of outages, improving overall network resilience.
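The failover idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Link` type, its field names, and the priority scheme are hypothetical, and the `healthy` flag stands in for whatever health probe a real system would run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    name: str       # hypothetical label, e.g. "fiber" or "cellular"
    priority: int   # lower number = preferred
    healthy: bool   # result of whatever health probe you run


def pick_active_link(links: list[Link]) -> Optional[Link]:
    """Return the most-preferred healthy link, or None if every link is down."""
    healthy = [link for link in links if link.healthy]
    return min(healthy, key=lambda link: link.priority, default=None)
```

For example, given a failed primary link and a healthy cellular link, `pick_active_link` returns the cellular link. When the primary recovers, it is chosen again because of its lower priority number.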

Network resilience with ZPE Systems

A resilient network will continue delivering critical business services in the face of any challenge, whether from cybercriminals, supply chain issues, global events, or even plain human error. A resilience system is isolated from the production network to ensure security and availability, and it consists of all the tools and technologies needed to troubleshoot, recover, and deliver your most crucial data, applications, and infrastructure. The Nodegrid platform from ZPE Systems is the perfect foundation for a resilience system. Nodegrid is a vendor-neutral, out-of-band management solution capable of running your choice of third-party software. Nodegrid allows you to build a highly customizable IRE containing all the tools needed to safely recover from ransomware. You can even use Nodegrid to deliver services while the primary network or systems are down, making it your all-in-one network resilience multi-tool.

Want to ensure network resilience by accelerating ransomware recovery?

Minimize the business impact of ransomware with the help of our whitepaper, 3 Steps to Ransomware Recovery. Learn how to follow Gartner’s best practices to build an Isolated Recovery Environment.


Network Resilience Doesn’t Mean What it Did 20 Years Ago

Network resilience requirements have changed

Enterprise networks are like air. When they’re running smoothly, it’s easy to take them for granted, as business users and customers are able to go about their normal activities. But when customer service reps are suddenly cut off from their ticketing system, or family movie night turns into a game of “Is it my router, or the network?”, everyone notices. This is why network resilience is critical.

But what exactly does resilience mean today? Let’s find out by looking at some recent real-world examples, the history of network architectures, and why network resilience doesn’t mean what it did 20 years ago.

Why does network resilience matter?

There’s no shortage of real-world examples showing why network resilience matters. The takeaway is that network resilience is directly tied to business, which means that it impacts revenue, costs, and risks. Here is a brief list of resilience-related incidents that occurred in 2023 alone:

  • FAA (Federal Aviation Administration) – An overworked contractor unintentionally deleted files, which delayed flights nationwide for an entire day.
  • Southwest Airlines – A firewall configuration change caused 16,000 flight cancellations and cost the company about $1 billion.
  • MOVEit FTP exploit – Thousands of global organizations fell victim to a MOVEit vulnerability, which allowed attackers to steal the personal data of millions of people.
  • MGM Resorts – A social-engineering attack and a lack of recovery systems let an attack persist for weeks, causing millions in losses per day.
  • Ragnar Locker attacks – Several large organizations were locked out of IT systems for days, which slowed or halted customer operations worldwide.

What does network resilience mean?

Based on the examples above, it might seem that network resilience could mean different things. It might mean having backups of golden configs that you could easily restore in case of a mistake. It might mean beefing up your security and/or replacing outdated systems. It might mean having recovery processes in place.

So, which is it?

The answer is, it’s all of these and more.

Donald Firesmith (Carnegie Mellon) defines resilience this way: “A system is resilient if it continues to carry out its mission in the face of adversity (i.e., if it provides required capabilities despite excessive stresses that can cause disruptions).”

Network resilience means having a network that continues to serve its essential functions despite adversity. Adversity can stem from human error, system outages, cyberattacks, and even natural disasters that threaten to degrade or completely halt normal network operations. Achieving network resilience requires the ability to quickly address issues ranging from device failures and misconfigurations to full-blown ISP outages and ransomware attacks.

The problem is, this is now much more difficult than it used to be.

How did network resilience become so complicated?

Twenty years ago, IT teams managed a centralized architecture. The data center was able to serve end-users and customers with the minimal services they needed. Being “constantly connected” wasn’t a concern for most people. For the business, achieving resilience was as simple as going on-site or remoting-in via serial console to fix issues at the data center.

Network architecture showing simplicity of data center connected via MPLS to branch office

Then in the mid-2000s, the advent of the cloud changed everything. Infrastructure, data, and computing became decentralized into a distributed mix of on-prem and cloud solutions. Users could connect from anywhere, and on-demand services allowed people to be plugged in around-the-clock. Services for work, school, and entertainment could be delivered anytime, no matter where users were.

Network architecture showing complexity of data center, CDN, remote user, branch office, all connected via many paths

Behind the scenes, this explosion of architecture created three problems for achieving network resilience, which a simple serial console could no longer fix:

  • Too Much Work: Infrastructure, data, and computing are widely distributed. Systems inevitably break and require work, but teams don’t have the staff to keep up.
  • Too Much Complexity: Pairing cloud and box-based stacks creates complex networks. Teams leave systems outdated, because they don’t want to break this delicate architecture.
  • Too Much Risk: Unpatched, outdated systems are prime targets for packaged attacks that move at machine speed. Defense requires recovery tools that teams don’t have.

Enabling businesses to be resilient in the modern age requires an approach that’s different from simply deploying a serial console for remote troubleshooting. Gen 1 and 2 serial consoles, which have dominated the market for 20 years, were designed to solve basic issues by offering limited remote access and some automation. The problem is, these still leave teams lacking the confidence to answer questions like:

  • “How can we guarantee access to fix stuff that breaks, without rolling trucks?”
  • “Can we automate change management, without fear of breaking the network?”
  • “Attacks are inevitable — How do we stop hackers from cutting off our access?”

Hyperscalers, Internet Service Providers, Big Tech, and even the military have a resilience model that they’ve proven over the last decade. Their approach involves fully isolating command and control from data and user environments. This allows them to not only gain low-level remote access to maintain and fix systems, but also to “defend the hill” and maintain control if systems are compromised or destroyed.

This approach uses something called Isolated Management Infrastructure (IMI).

Isolated Management Infrastructure is the best practice for network resilience

Isolated Management Infrastructure is the practice of creating a management network that is completely separate from the production network. Most IT teams know this kind of network as out-of-band management; IMI, however, provides many capabilities that can’t be hosted on a traditional serial console or OOB network. And with vulnerabilities increasing, CISA issued a binding operational directive (BOD 23-02) requiring federal civilian agencies to secure or remove internet-exposed management interfaces, and its guidance recommends an isolated management network.

Isolated Management Infrastructure using Gen 3 serial consoles, like ZPE Systems’ Nodegrid devices, provides more than simple remote access and automation. Similar to a proper out-of-band network, IMI is completely isolated from production assets. This means there are no dependencies on production devices or connections, and management interfaces are not exposed to the internet or production gear. In the event of an outage or attack, teams retain management access, and this is just the beginning of the benefits of having IMI.

A network architecture diagram showing Isolated Management Infrastructure next to production infrastructure

IMI includes more than nine functions that are required for teams to fully service their production assets. These include:

  • Low-level access to all management interfaces, including serial, Ethernet, USB, IPMI, and others, to guarantee remote access to the entire environment
  • Open, edge-native automation to ensure services can continue operating in the event of outages or change errors
  • Computing, storage, and jumpbox capabilities that can natively host the apps and tools to deploy an IRE, to ensure fast, effective recovery from attacks
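A simple audit can help confirm that management interfaces really sit on the isolated network. The sketch below is an illustration under assumptions: the inventory format, interface names, and the `10.255.0.0/24` management subnet are hypothetical, and the check only compares addresses against the subnets you declare as isolated.

```python
import ipaddress


def exposed_interfaces(
    mgmt_interfaces: dict[str, str], isolated_cidrs: list[str]
) -> list[str]:
    """Return names of management interfaces outside every isolated subnet.

    `mgmt_interfaces` maps an interface label -> its IP address.
    `isolated_cidrs` lists the subnets reserved for the management network.
    """
    networks = [ipaddress.ip_network(cidr) for cidr in isolated_cidrs]
    exposed = []
    for name, address in sorted(mgmt_interfaces.items()):
        ip = ipaddress.ip_address(address)
        if not any(ip in net for net in networks):
            exposed.append(name)
    return exposed
```

Address membership is only one signal. It does not prove that the management network has no physical or routing dependency on production gear, so a check like this is a starting point for review rather than proof of isolation.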

Get the guide to build IMI

ZPE Systems has worked alongside Big Tech to fulfill their requirements for IMI. In doing so, we created the Network Automation blueprint as a technical guide to help any organization build their own Isolated Management Infrastructure. Download the blueprint now to get started.

Edge Computing Requirements


The Internet of Things (IoT) and remote work capabilities have allowed many organizations to conduct critical business operations at the enterprise network’s edges. Wearable medical sensors, automated industrial machinery, self-service kiosks, and other edge devices must transmit data to and from software applications, machine learning training systems, and data warehouses in centralized data centers or the cloud. Those transmissions eat up valuable MPLS bandwidth and are attractive targets for cybercriminals.

Edge computing involves moving data processing systems and applications closer to the devices that generate the data at the network’s edges. Edge computing can reduce WAN traffic to save on bandwidth costs and improve latency. It can also reduce the attack surface by keeping edge data on the local network or, in some cases, on the same device.
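To illustrate how processing data at the edge can reduce WAN traffic, the following Python sketch collapses a window of raw sensor readings into a small summary before anything is sent upstream. The function names and the choice of summary fields are assumptions for illustration. Real savings depend on the data and the encoding used.

```python
import json
from statistics import mean


def summarize(readings: list[float]) -> dict:
    """Collapse a window of raw sensor readings into a small summary record."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": round(mean(readings), 2),
    }


def payload_bytes(obj) -> int:
    """Size of the JSON-encoded payload that would cross the WAN."""
    return len(json.dumps(obj).encode("utf-8"))
```

Comparing `payload_bytes(readings)` with `payload_bytes(summarize(readings))` for a window of several hundred readings shows the difference between shipping every sample to the data center and shipping only the result.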

Running powerful data analytics and artificial intelligence applications outside the data center creates specific challenges. For example, space is usually limited at the edge, and devices might be outdoors where power and climate control are more complex. This guide discusses the edge computing requirements for hardware, networking, availability, security, and visibility to address these concerns.

Edge computing requirements

The primary requirements for edge computing are:

1. Compute

As the name implies, edge computing requires enough computing power to run the applications that process edge data. The four primary concerns are:

  • Processing power: CPUs (central processing units), GPUs (graphics processing units), or SoCs (systems on chips)
  • Memory: RAM (random access memory)
  • Storage: SSDs (solid state drives), SCM (storage class memory), or Flash memory
  • Coprocessors: Supplemental processing power needed for specific tasks, such as DPUs (data processing units) for network and storage offload or dedicated accelerators for AI

The specific edge computing requirements for each will vary, as it’s essential to match the available compute resources with the needs of the edge applications.
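Matching compute resources to application needs can be expressed as a quick sizing check. The sketch below is a simplified illustration: the `Resources` type, its fields, and the 20% headroom default are assumptions, and real sizing would also account for coprocessors and peak load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: int
    ram_gb: int
    storage_gb: int


def shortfalls(
    device: Resources, apps: list[Resources], headroom: float = 0.2
) -> dict[str, float]:
    """Return how far each resource falls short once `headroom` is reserved.

    An empty dict means the combined applications fit on the device.
    """
    usable = 1.0 - headroom
    gaps = {}
    for field in ("cpu_cores", "ram_gb", "storage_gb"):
        needed = sum(getattr(app, field) for app in apps)
        available = getattr(device, field) * usable
        if needed > available:
            gaps[field] = round(needed - available, 2)
    return gaps
```

Reserving headroom is a design choice: it leaves room for the operating system, monitoring agents, and bursts in demand, so the device is not sized exactly to the sum of its workloads.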

2. Small, ruggedized chassis

Space is often quite limited in edge sites, and devices may not be treated as delicately as they would be in a data center. Edge computing devices must be small enough to squeeze into tight spaces and rugged enough to handle the conditions they’ll be deployed in. For example, smart cities connect public infrastructure and services using IoT and networking devices installed in roadside cabinets, on top of streetlights, and in other challenging deployment sites. Edge computing devices in other applications might be subject to constant vibrations from industrial machinery, the humidity of an offshore oil rig, or even the vacuum of outer space.

3. Power

In some cases, edge deployments can use the same PDUs (power distribution units) and UPSes (uninterruptible power supplies) as a data center deployment. Non-traditional implementations, which might be outdoors, underground, or underwater, may require energy-efficient edge computing devices using alternative power sources like batteries or solar.

4. Wired & wireless connectivity

Edge computing systems must have both wired and wireless network connectivity options because organizations might deploy them somewhere without access to an Ethernet wall jack. Cellular connectivity via 4G/5G adds more flexibility and ideally provides network failover/out-of-band capabilities.

5. Out-of-band (OOB) management

Many edge deployment sites don’t have any IT staff on hand, so teams manage the devices and infrastructure remotely. If something happens to take down the network, such as an equipment failure or ransomware attack, IT is completely cut off and must dispatch a costly and time-consuming truck roll to recover. Out-of-band (OOB) management creates an alternative path to remote systems that doesn’t rely on any production infrastructure, ensuring teams have continuous access to edge computing sites even during outages.

6. Security

Edge computing reduces some security risks but can create new ones. Security teams carefully monitor and control data center solutions, but systems at the edge are often left out. Edge-centric security platforms such as SSE (Security Service Edge) help by applying enterprise Zero Trust policies and controls to edge applications, devices, and users. Edge security solutions often need hardware to host agent-based software, which should be factored into edge computing requirements and budgets. Additionally, edge devices should have secure Roots of Trust (RoTs) that provide cryptographic functions, key management, and other features that harden device security.

7. Visibility

Because of a lack of IT presence at the edge, it’s often difficult to catch problems like high humidity, overheating fans, or physical tampering until they affect the performance or availability of edge computing systems. This leads to a break/fix approach to edge management, where teams spend all their time fixing issues after they occur rather than focusing on improvements and innovations. Teams need visibility into environmental conditions, device health, and security at the edge to fix issues before they cause outages or breaches.
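As a rough sketch of what environmental visibility might look like in code, the following Python function flags sensor readings that are missing or outside an allowed band. The metric names and limits are hypothetical. Real thresholds should come from the hardware vendor's data sheet.

```python
# Hypothetical limits; real values come from the hardware vendor's data sheet.
LIMITS = {
    "temperature_c": (5.0, 40.0),
    "humidity_pct": (10.0, 80.0),
}


def out_of_range(sample: dict[str, float], limits=LIMITS) -> list[str]:
    """Return an alert for each metric that is missing or outside its band."""
    alerts = []
    for metric, (low, high) in limits.items():
        value = sample.get(metric)
        if value is None:
            # A silent sensor is itself worth flagging at an unstaffed site.
            alerts.append(f"{metric}: no reading")
        elif not low <= value <= high:
            alerts.append(f"{metric}: {value} outside {low}-{high}")
    return alerts
```

Treating a missing reading as an alert matters at sites with no IT staff, where a failed or unplugged sensor would otherwise look the same as a healthy one.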

Streamlining edge computing requirements

An edge computing deployment designed around these seven requirements will be more cost-effective while avoiding some of the biggest edge hurdles. Another way to streamline edge deployments is with consolidated, vendor-neutral devices that combine core networking and computing capabilities with the ability to integrate and unify third-party edge solutions. For example, the Nodegrid platform from ZPE Systems delivers computing power, wired & wireless connectivity, OOB management, environmental monitoring, and more in a single, small device. ZPE’s integrated edge routers use the open, Linux-based Nodegrid OS capable of running Guest OSes and Docker containers for your choice of third-party AI/ML, data analytics, SSE, and more. Nodegrid also allows you to extend automated control to the edge with Gen 3 out-of-band management for greater efficiency and resilience.

Want to learn more about how Nodegrid makes edge computing easier and more cost-effective?

To learn more about consolidating your edge computing requirements with the vendor-neutral Nodegrid platform, schedule a free demo!
