Data Center Resilience Archives

AI Data Center Infrastructure

by Jordan Baker | Aug 9, 2024 | Actionable Data, Application Hosting, Data Center Management, Data Center Resilience, Improve Network Security, Increase Productivity, Micro-segmentation, Minimize Impact of Disruptions, Monitoring & Reporting, Network Automation, Out of Band Management, Remote Network Management, Serial Consoles, Streamline Deployments, Zero Touch Provisioning (ZTP), Zero Trust Security

Artificial intelligence is transforming business operations across nearly every industry, with the recent McKinsey global survey finding that 72% of organizations had adopted AI, and 65% regularly use generative AI (GenAI) tools specifically. GenAI and other artificial intelligence technologies are extremely resource-intensive, requiring more computational power, data storage, and energy than traditional workloads. AI data center infrastructure also requires high-speed, low-latency networking connections and unified, scalable management hardware to ensure maximum performance and availability. This post describes the key components of AI data center infrastructure before providing advice for overcoming common pitfalls to improve the efficiency of AI deployments.

AI data center infrastructure components

Computing

Generative AI and other artificial intelligence technologies require significant processing power. AI workloads typically run on graphics processing units (GPUs), which are made up of many smaller cores that perform simple, repetitive computing tasks in parallel. GPUs can be clustered together to process data for AI much faster than CPUs.

Storage

AI requires vast amounts of data for training and inference. On-premises AI data centers typically use object storage systems with solid-state disks (SSDs) composed of multiple sections of flash memory (a.k.a., flash storage). Storage solutions for AI workloads must be modular so additional capacity can be added as data needs grow, through either physical or logical (networking) connections between devices.

Networking

AI workloads are often distributed across multiple computing and storage nodes within the same data center. To prevent packet loss or delays from affecting the accuracy or performance of AI models, nodes must be connected with high-speed, low-latency networking. Additionally, high-throughput WAN connections are needed to accommodate all the data flowing in from end-users, business sites, cloud apps, IoT devices, and other sources across the enterprise.

Power

AI infrastructure uses significantly more power than traditional data center infrastructure, with a rack of three or four AI servers consuming as much energy as 30 to 40 standard servers. To prevent issues, these power demands must be accounted for in the layout design for new AI data center deployments and, if necessary, discussed with the colocation provider to ensure enough power is available.

Management

Data center infrastructure, especially at the scale required for AI, is typically managed with a jump box, terminal server, or serial console that allows admins to control multiple devices at once. The best practice is to use an out-of-band (OOB) management device that separates the control plane from the data plane using alternative network interfaces. An OOB console server provides several important functions:

It provides an alternative path to data center infrastructure that isn’t reliant on the production ISP, WAN, or LAN, ensuring remote administrators have continuous access to troubleshoot and recover systems faster, without an on-site visit.
It isolates management interfaces from the production network, preventing malware or compromised accounts from jumping over from an infected system and hijacking critical data center infrastructure.
It helps create an isolated recovery environment where teams can clean and rebuild systems during a ransomware attack or other breach without risking reinfection.

An OOB serial console helps minimize disruptions to AI infrastructure. For example, teams can use OOB to remotely control PDU outlets to power cycle a hung server. Or, if a networking device failure brings down the LAN, teams can use a 5G cellular OOB connection to troubleshoot and fix the problem. Out-of-band management reduces the need for costly, time-consuming site visits, which significantly improves the resilience of AI infrastructure.

AI data center challenges

Artificial intelligence workloads, and the data center infrastructure needed to support them, are highly complex. Many IT teams struggle to efficiently provision, maintain, and repair AI data center infrastructure at the scale and speed required, especially when workflows are fragmented across legacy and multi-vendor solutions that may not integrate. The best way to ensure data center teams can keep up with the demands of artificial intelligence is with a unified AI orchestration platform. Such a platform should include:

Automation for repetitive provisioning and troubleshooting tasks
Unification of all AI-related workflows with a single, vendor-neutral platform
Resilience with cellular failover and Gen 3 out-of-band management.

To learn more, read AI Orchestration: Solving Challenges to Improve AI Value

Improving operational efficiency with a vendor-neutral platform

Nodegrid is a Gen 3 out-of-band management solution that provides the perfect unification platform for AI data center orchestration. The vendor-neutral Nodegrid platform can integrate with or directly run third-party software, unifying all your networking, management, automation, security, and recovery workflows. A single, 1RU Nodegrid Serial Console Plus (NSCP) can manage up to 96 data center devices, and even extend automation to legacy and mixed-vendor solutions that wouldn’t otherwise support it. Nodegrid Serial Consoles enable the fast and cost-efficient infrastructure scaling required to support GenAI and other artificial intelligence technologies.

Make Nodegrid your AI data center orchestration platform

Request a demo to learn how Nodegrid can improve the efficiency and resilience of your AI data center infrastructure.
Contact Us

Why Securing IT Means Replacing End-of-Life Console Servers

by Jordan Baker | Jul 25, 2024 | Data Center Management, Data Center Resilience, Improve Network Security, Increase Productivity, Micro-segmentation, Minimize Impact of Disruptions, Network Automation, Out of Band Management, Remote Network Management, Serial Consoles, User Management, Vendor Neutral Platform, Zero Touch Provisioning (ZTP), Zero Trust Security

The world as we know it is connected to IT, and IT relies on its underlying infrastructure. Organizations must prioritize maintaining this infrastructure; otherwise, any disruption or breach has a ripple effect that takes services offline for millions of users (take 2024’s CrowdStrike outage, for example). A big part of this maintenance is ensuring that all hardware components, including console servers, are up-to-date and secure. Most console servers reach end-of-life (EOL) and need to be replaced, but for many reasons, whether budgetary concerns or the “if it isn’t broken” mentality, IT teams often keep their EOL devices. Let’s look at the risks of using EOL console servers, and why replacing them goes hand-in-hand with securing IT.

The Risks of Using End-of-Life Console Servers

End-of-life console servers can undermine the security and functionality of IT systems. These risks include:

1. Lack of Security Features and Updates

Aging console servers lack adequate hardware and management security features, meaning they can’t support a zero trust approach. On top of this, once a console server reaches EOL, the manufacturer stops providing security patches and updates. The device then becomes vulnerable to newly discovered CVEs and complex cyberattacks (like the MOVEit and Ragnar Locker breaches). Cybercriminals often target outdated hardware because they know that these devices are no longer receiving updates, making them easy entry points for launching attacks.

2. Compliance Issues

Many industries have stringent regulatory requirements regarding data security and IT infrastructure. DORA, NIS2 (EU), NIST2 (US), PCI 4.0 (finance), and CER Directive are just a few of the updated regulations that are cracking down on how organizations architect IT, including the management layer. Using EOL hardware can lead to non-compliance, resulting in fines and legal repercussions. Regulatory bodies expect organizations to use up-to-date and secure equipment to protect sensitive information.

3. Prolonged Recovery

EOL console servers are prone to failures and inefficiencies. As these devices age, their performance deteriorates, leading to increased downtime and disruptions. Most console servers are Gen 2, meaning they offer basic remote troubleshooting (to address break/fix scenarios) and limited automation capabilities. When there is a severe disruption, such as a ransomware attack, hackers can easily access and encrypt these devices to lock out admin access. Organizations then must endure prolonged recovery (like the CrowdStrike outage, or 2023’s MGM attack) because they need to physically decommission and restore their infrastructure.

The Importance of Replacing EOL Console Servers

Here’s why replacing EOL console servers is essential to securing IT:

1. Modern Security Approach

Zero trust is an approach that uses segmentation across IT assets. This ensures that only authorized users can access resources necessary for their job function. This approach requires SAML, SSO, MFA/2FA, and role-based access controls, which are only supported by modern console servers. Modern devices additionally feature advanced security through encryption, signed OS, and tampering detection. This ensures a complete cyber and physical approach to security.

2. Protection Against New Threats

New CVEs and evolving threats can easily take advantage of EOL devices that no longer receive updates. Modern console servers benefit from ongoing support in the form of firmware upgrades and security patches. Upgrading with a security-focused device vendor can drastically shrink the attack surface, by addressing supply chain security risks, codebase integrity, and CVE patching.

3. Ease of Compliance

EOL devices lack modern security features, but this isn’t the only reason why they make it difficult or impossible to comply with regulations. They also lack the ability to isolate the control plane from the production network (see Diagram 1 below), meaning attackers can easily move between the two in order to launch ransomware and steal sensitive information. Watchdog agencies and new legislation are stipulating that organizations follow the latest best practice of separating the control plane from production, called Isolated Management Infrastructure (IMI). Modern console servers make this best practice simple to achieve by offering drop-in out-of-band that is completely isolated from production assets (see Diagram 2 below). This means that the organization is always in control of its IT assets and sensitive data.

Diagram 1: Though an acceptable approach, Gen 2 out-of-band lacks isolation and leaves management interfaces vulnerable to the internet.

Diagram 2: Gen 3 out-of-band fully isolates the control plane to guarantee organizations retain control of their IT assets and sensitive info.

4. Faster Recovery

New console servers are designed to handle more workloads and functions, which eliminates single-purpose devices and shrinks the attack surface. They can also run VMs and Docker containers to host applications. This enables what Gartner calls the Isolated Recovery Environment (IRE) (see Diagram 3 below), which is becoming essential for faster recovery from ransomware. Since the IMI component prohibits attackers from accessing the control plane, admins retain control during an attack. They can use the IMI to deploy their IRE and the necessary applications — remotely — to decommission, cleanse, and restore their infected infrastructure. This means that they don’t have to roll trucks week after week when there’s an attack; they just need to log into their management infrastructure to begin assessing and responding immediately, which significantly reduces recovery times.

Diagram 3: The Isolated Recovery Environment allows for a comprehensive and rapid response to ransomware attacks.

Get a Walkthrough of IMI and IRE

Let’s cover what IMI and IRE would look like in your environment and walk through some outage recovery scenarios. Use the link below to set up a technical discussion.

Set Up a Demo

Meet Me at Cisco Live Amsterdam 2026

Visit booth C10 at Cisco Live Amsterdam to chat about IMI, IRE, and replacing end-of-life console servers. You can also catch my 10-minute presentation on Wednesday, February 11 at 1:50pm in the Speakers Corner. I’ll cover From Pilot Projects to Global Rollouts: Why Out-of-Band Management is Crucial for Scaling AI Infrastructure, with more concepts and network diagrams showing how to achieve true resilience. Visit our Cisco Live page below to let me know you’re coming. See you at the show!

Rene Neumann presents at Cisco Live Amsterdam 2026

Meet at Cisco Live

Critical Entities Resilience Directive

by Jordan Baker | Jun 5, 2024 | Application Hosting, Data Center Management, Data Center Resilience, Improve Network Security, Micro-segmentation, Minimize Impact of Disruptions, Monitoring & Reporting, Out of Band Management, Remote Network Management, Streamline Deployments, Vendor Neutral Platform, Zero Trust Security

The Critical Entities Resilience (CER) Directive is an EU regulation designed to prevent disruption to the services considered essential to society or the economy. The CER Directive outlines the obligations of critical entities to prepare for any potential hazard, including natural disasters, human errors, terrorist attacks, and cybersecurity breaches. EU Member States have until 17 October 2024 to adopt and publish resilience measures required for their critical entities, and those measures officially take effect from 18 October 2024. Member States must identify and notify critical entities by July 2026; these entities then only have ten months to comply with CER requirements. With such a tight timeframe to demonstrate compliance with the Critical Entities Resilience Directive, organizations that might be deemed critical should begin preparing their resilience strategies now.

Citation: Directive (EU) 2022/2557 of the European Parliament and of the Council of 14 December 2022 on the resilience of critical entities and repealing Council Directive 2008/114/EC

Who does the Critical Entities Resilience Directive apply to, and why does it matter?

The CER Directive covers eleven sectors and subsectors that provide services essential to society, the economy, public health & safety, or preserving the environment. These include:

In-Scope Sectors Covered by the CER Directive
Sector	Subsectors
Energy	Electricity Heating and cooling Oil & gas Hydrogen
Transport	Air Rail Water Road Public transportation
Banking	Deposit, lending, and credit institutions
Financial Market Infrastructure	Trading venues Clearing systems
Health	Healthcare providers Reference laboratories Medicinal research and development Pharmaceutical manufacturers Critical medical device manufacturers Medicinal distributors
Drinking Water	Drinking water suppliers Drinking water distributors
Waste Water	Collection Treatment Disposal
Digital Infrastructure	Internet Exchange Point providers DNS providers Top-level domain (TLD) name registries Cloud service providers Data center service providers Content Delivery Networks (CDNs) Trust service providers Electronic communications providers Emergency communication networks
Public Administration	Public administration entities of central governments
Space	Operators of ground-based infrastructure for space-based services
Food Production, Processing, and Distribution	Large-scale industrial food production and processing Food supply chain services Food wholesale distributors

The Critical Entities Resilience Directive is one of several new EU regulations (such as DORA and NIS2) created to establish consistent guidelines for resilience in sectors where any service disruption has a significant negative impact on society or the economy. Whereas DORA applies primarily to financial institutions and supporting services, and NIS2 focuses on cybersecurity threats, the CER Directive is broader in scope and addresses other, non-digital threats to resilience such as natural disasters and global health crises (e.g., COVID-19).

The penalties for noncompliance will vary by Member State but are likely to include fines, public notification, remediation, and withdrawal of authorization.

CER Directive requirements for critical entities

Most of the CER Directive requirements apply to Member States, outlining how the designated authorities will adopt and enforce resilience measures and support critical entities in achieving compliance. However, there are five key provisions that relevant organizations should be aware of as they prepare for their identification as critical entities.

1. Article 4: Strategy on the resilience of critical entities

EU Member States have until 17 January 2026 to adopt a strategy outlining the guidelines and procedures for critical entities to achieve and maintain a high level of resilience. Essentially, this strategy will describe the requirements for CER Directive compliance in each Member State and provide guidance on how to meet those requirements. Potentially critical entities can prepare by examining existing resilience frameworks and regulations to anticipate the policies, tools, and procedures that will likely be required.

2. Article 5: Risk assessment by Member States

Member States have until 17 January 2026 to perform a risk assessment of all essential services. These assessments must account for natural and human-made risks, including accidents, natural disasters, public health emergencies, terrorist attacks, and antagonistic threats. Member States will then use the risk assessments to identify critical entities within each sector.

3. Article 12: Risk assessment by critical entities

Critical entities must perform risk assessments using similar criteria to Article 5 within nine months of being notified of their designation as critical and at least every four years afterward. If an organization already conducts risk assessments according to other similar resilience guidelines or frameworks, Member States have the authority to decide whether or not those assessments meet CER Directive compliance requirements.

4. Article 13: Resilience measures of critical entities

Critical entities must take the appropriate technical, security, and policy measures to ensure resilience, including a comprehensive strategy for service continuity and disaster recovery. Examples of resilience measures outlined by the CER Directive include:

CER Directive Resilience Measures
Requirements	Examples
Adopt disaster risk reduction and climate adaptation measures	Using an environmental monitoring system to detect and respond to rising temperatures, humidity, and other relevant conditions
Ensure adequate physical protection of the premises and critical infrastructure, including fencing, barriers, perimeter monitoring tools, detection equipment, and access controls	Installing proximity sensors in data center racks to automatically notify security teams if an unauthorized user physically tampers with remote infrastructure
Respond to, resist, and mitigate service disruptions	Deploying out-of-band (OOB) serial consoles with cellular capabilities to ensure continuous remote management access to critical infrastructure
Recover from incidents using business continuity measures to resume provisioning essential services	Building a resilience system containing all the infrastructure and tools needed to rebuild and recover while still delivering core services
Manage employee security by classifying personnel who exercise critical functions, establishing access rights and controls, and performing background checks as needed	Adopting zero-trust security policies and controls that assign access privileges according to role (role-based access control, or RBAC)

5. Article 15: Incident notification

Critical entities must notify the competent authority of any incidents that have or could significantly disrupt essential services within 24 hours of detection. The significance of a disruption is determined according to the following parameters:

How many users the disruption affects;
How long the disruption lasts;
The geographical area the disruption affects.

The incident notification must explain the nature, cause, and potential consequences of the disruption, including any cross-border implications.

How Nodegrid simplifies CER Directive compliance

Nodegrid is a Gen 3 out-of-band management platform that makes the perfect foundation for a resilience system. Nodegrid OOB separates the control plane from the data plane to ensure continuous remote management access to critical infrastructure even during production network outages. Vendor-neutral serial consoles and integrated branch service routers directly host third-party software for security, automation, recovery, and more, reducing hardware overhead at each site while ensuring teams have access to all the tools they need to restore essential services.

Looking to Upgrade to a Nodegrid serial console?

Prepare for the Critical Entities Resilience Directive by replacing your discontinued, EOL serial console with a Gen 3 out-of-band solution from Nodegrid.

Click here to learn more!

DORA Act: 5 Takeaways For The Financial Sector

by Jordan Baker | Mar 7, 2024 | Data Center Management, Data Center Resilience, Data Logging, Improve Network Security, Increase Productivity, Micro-segmentation, Minimize Impact of Disruptions, Modernize Legacy Environments, Monitoring & Reporting, Network Automation, Remote Network Management, User Management, Zero Touch Provisioning (ZTP), Zero Trust Security

The Digital Operational Resilience Act (DORA) is a regulatory initiative within the European Union that aims to enhance the operational resilience of the financial sector. Its main goal is to prevent and mitigate cyber threats and operational disruptions. The DORA Act outlines regulatory requirements for the security of network and information systems “whereby all firms need to make sure they can withstand, respond to and recover from all types of ICT-related disruptions and threats” (DORA Act website).

Who and What Are Covered Under the DORA Act?

The DORA Act is a regulation that covers all financial entities within the European Union (EU). It recognizes the critical role of information and communication technology (ICT) systems in financial services. DORA applies to financial services including payments, securities, credit rating, algorithmic trading, lending, insurance, and back-office operations. It establishes a framework for ICT risk management through technical standards, which are being released in two phases, the first of which was published on January 17, 2024. The DORA Act will go into effect in its entirety on January 17, 2025.

With cyberattacks constantly in the news cycle, it’s no surprise that governing bodies are putting forth standards for operational resilience. But without combing through this lengthy piece of legislation, what should IT teams start thinking about from a practical standpoint? Here are 5 takeaways on what the DORA Act means for the financial sector.

DORA Act: 5 Takeaways for the Financial Sector

1. Shore-up your cybersecurity measures

The DORA Act emphasizes strengthening cybersecurity measures within the financial sector. It requires financial institutions, such as banks, stock exchanges, and financial infrastructure providers, to implement robust cybersecurity controls and protocols. These include adopting advanced authentication mechanisms, encryption standards, and network segmentation to protect sensitive financial data and critical infrastructure from cyber threats. Part of this will also require organizations to apply system patches and updates in a timely manner, which means automated patching will become necessary to every organization’s security posture.

2. Implement resilience systems

Operational resilience is a key focus area of the DORA Act, aiming to ensure the continuity of essential financial services in the face of cyber threats, natural disasters, and other operational disruptions. Financial institutions are required to develop comprehensive business continuity plans, establish redundant systems and backup facilities, and conduct regular stress tests to assess their ability to withstand and recover from various scenarios. Implementing a resilience system helps with this, as it provides all the infrastructure, tools, and services necessary to continue operating during major incidents.

3. Conduct regular scans for vulnerabilities

The DORA Act mandates financial institutions to implement robust risk management practices to identify, assess, and mitigate cyber risks and operational vulnerabilities. This includes conducting regular assessments, vulnerability scans, and penetration tests, and developing incident response procedures to quickly address threats. This is all part of taking a proactive approach to identify and mitigate cyber incidents, and reduce the impact that adverse events have on financial stability and consumer confidence.

4. Collaborate and share information with industry peers

The DORA Act encourages financial institutions to share cybersecurity threat intelligence, incident data, and best practices with industry peers, regulators, and law enforcement agencies. The ability to monitor systems and collect data will be crucial to this approach, and will require systems that can rapidly (and securely) deploy apps/services during ongoing incidents. This will help financial institutions to better understand emerging threats, coordinate responses to cyber incidents, and strengthen collective defenses against threats and operational disruptions.

5. Segment physical and logical systems to pass regular audits

Through the DORA Act, regulators are empowered to conduct regular assessments, audits, and inspections of systems. This will ensure that financial institutions are implementing adequate controls and safeguards to protect against cyber threats and operational disruptions. A crucial part to this will involve physical and logical separation of systems, such as through Isolated Management Infrastructure, as well as implementing zero trust architecture across the organization. These will help bolster resilience by eliminating control dependencies between management and production networks, which will also help to streamline audits.

Get the blueprint to help you comply with the DORA Act

DORA’s requirements are meant to help IT teams better protect sensitive data and the integrity of financial systems as a whole. But without a proper network management infrastructure, their production networks are too sensitive to errors and vulnerable to attacks. ZPE has created the blueprint that covers these 5 crucial takeaways outlined in the DORA Act. The architecture outlined in this blueprint has been trusted by Big Tech for more than a decade, as it allows them to deploy modern cybersecurity measures, physically and logically separated systems, and rapid recovery processes. Download the blueprint now.

Download blueprint

Network Resilience: What is a Resilience System?

by Jordan Baker | Feb 2, 2024 | Application Hosting, Data Center Resilience, Improve Network Security, Increase Productivity, Micro-segmentation, Minimize Impact of Disruptions, Monitoring & Reporting, Network Automation, Out of Band Management, Power Management, Remote Network Management, Serial Consoles, Virtualization, Zero Touch Provisioning (ZTP), Zero Trust Security

A digital web of interconnected network resilience concepts being selected by a business person in a suit.

Network resilience means being able to withstand or recover from adversity, service degradation, and complete outages with minimal business disruption. The longer business-critical services are down, or systems are breached, the greater the risk of significant financial, reputational, and legal consequences. A resilience system is a set of technologies that enable an organization to continue operating while teams work to repair failures and recover from cyberattacks. But what exactly is a resilience system, and what does it look like? This guide to network resilience defines resilience systems, provides example use cases, compares them to related technologies like backups and redundant systems, and describes the key components required to build them.

What is a resilience system?

A resilience system provides all the infrastructure, tools, and services necessary to continue operating, if in a degraded state, during major incidents. It also includes everything needed to recover data, rebuild systems, perform security testing, and continue delivering core business functionality. A resilience system is typically isolated from the production network, preventing cybercriminals from finding and compromising it and ensuring teams have continuous access even if the primary network goes down.

Resilience system use cases

Some examples of the challenges that resilience systems help overcome include:

1. Ransomware recovery

In a ransomware attack, cybercriminals infect systems with malware that spreads throughout the network and encrypts any data it encounters. Modern ransomware now uses packaged attacks that move at machine speed, instantly incapacitating entire networks. Organizations completely lose access to critical systems and data until they pay a ransom, often in untraceable cryptocurrency. Ransomware is an exceptionally tenacious form of malware and tends to reinfect backup data and rebuilt systems, significantly hampering recovery efforts and increasing the duration and cost of the attack. The best practice for resilience systems is to isolate them on an out-of-band (OOB) network, inaccessible to hackers who have breached the production in-band network. Doing so creates a safe, isolated recovery environment (IRE) where teams can restore critical data and systems without the risk of reinfection. The resilience system includes all the tools and hardware needed to restore critical business services and infrastructure. An IRE significantly accelerates ransomware recovery and minimizes downtime, so businesses can avoid paying ransoms and reduce the overall cost of attacks.

Learn more about ransomware recovery with resilience systems:

2. Network outages

Enterprise network architectures and supply chains are highly complex, with lots of moving parts that rely on external vendors to maintain availability. Just one of those vendors dropping the ball could take the entire organization offline, severely impacting network resilience. For example, in 2023, an expired cryptographic certificate caused Cisco’s Viptela SD-WAN appliances to fail on reboot, completely taking down affected networks until the issue was resolved. With a resilience system, Viptela customers could have potentially avoided this downtime by failing over to alternative network resources. For example, a resilience system with integrated cellular failover allows branches to continue connecting to and delivering critical business services while also providing a lifeline for remote teams to access and recover failed systems. A resilience system also provides observability and automatic notifications so teams are instantly alerted to issues like certificate expirations and can respond quickly to recover critical services.

3. Shift to remote work

Incidents like ransomware attacks and equipment failures happen frequently enough that companies can create detailed plans and proactively implement solutions to minimize their impact, but not all adverse events are so predictable. When the COVID-19 pandemic struck, the massive shift to remote work strained the network resources of most organizations. Instead of maintaining a limited number of branch offices, teams suddenly had to treat every employee as a new branch, leading to performance degradation and outages as they scrambled to reinforce the business’s remote capabilities. A resilience system gives teams the tools and resources they need to provision additional infrastructure, manage networking logic, deploy new security solutions, and more, even while the primary network is offline or under a heavy load. A resilience system is the key to quickly adjusting network performance and security to adapt to sudden changes like a transition to fully remote operations.

Do backups and redundancy equate to network resilience?

The short answer is no; backups and redundancy do not equate to network resilience, though they do contribute to making systems more resilient.

Backups are copies of data, configurations, and application code used to do a hot or cold restore when a production system fails. The underlying infrastructure must remain operational for teams to access and use backups, and unless additional resilience measures are taken, it’s easy for backups to become infected or compromised, severely hampering recovery efforts.
Redundancy involves duplicating critical systems, services, and applications as a failsafe in case the primaries go down. Organizations can “fail over” to the redundancies to continue critical business operations during outages. However, redundant systems are just as susceptible to failures and infections without additional resilience measures like out-of-band management and isolated management infrastructure.

Backups and redundancy are part of network resilience but alone are not enough to ensure business continuity. Resilience systems focus on maintaining the architecture of the production network while adding the ability to recover or adapt to adversity. The next section discusses all the tools and technologies that make up network resilience systems.

What does a resilience system look like?

There are four key components that go into a resilience system.

Key Components of a Resilience System
Alternative Networking	Full-stack routing and switching, Wi-Fi, VoIP, virtualization, software-defined network overlays for SDN & SD-WAN
Alternative Compute	Full-stack compute, containers, virtual machines, and any other resources needed to run applications and deliver services
Storage & Storage Recovery	Enough storage to recover systems and applications as well as support content delivery
Automation	Tools like zero-touch provisioning (ZTP) to facilitate speedy recovery while minimizing human error

Alternative networking and compute resources ensure the organization can failover in the event of a network failure or continue delivering services when production servers are unavailable. Teams also need enough storage to restore backup data, build new systems, and support the content delivery network (CDN). Automation solutions like zero-touch provisioning (ZTP), configuration management, and security validation tools accelerate the recovery process while mitigating the risk of human error. Combined, these components enable teams to reduce the frequency, severity, and duration of outages, improving overall network resilience.

Network resilience with ZPE Systems

A resilient network will continue delivering critical business services in the face of any challenge, whether from cybercriminals, supply chain issues, global events, or even plain human error. A resilience system is isolated from the production network to ensure security and availability, and it consists of all the tools and technologies needed to troubleshoot, recover, and deliver your most crucial data, applications, and infrastructure. The Nodegrid platform from ZPE Systems is the perfect foundation for a resilience system. Nodegrid is a vendor-neutral, out-of-band management solution capable of running your choice of third-party software. Nodegrid allows you to build a highly customizable IRE containing all the tools needed to safely recover from ransomware. You can even use Nodegrid to deliver services while the primary network or systems are down, making it your all-in-one network resilience multi-tool.

Want to ensure network resilience by accelerating ransomware recovery?

Minimize the business impact of ransomware with the help of our whitepaper, 3 Steps to Ransomware Recovery. Learn how to follow Gartner’s best practices to build an Isolated Recovery Environment

Download Whitepaper

Out-of-Band Management: What It Is and Why You Need It

by Jordan Baker | Jan 26, 2024 | Data Center Management, Data Center Resilience, Improve Network Security, Minimize Impact of Disruptions, Monitoring & Reporting, Out of Band Management, Power Management, Remote Network Management, Serial Consoles

Thumbnail – What is out-of-band management

This scenario is every IT professional’s worst nightmare: it’s the middle of the night, a remote site on the other side of the country has gone offline, and nobody knows why. A single minute of downtime can cost anywhere from several hundred dollars to tens of thousands of dollars, and the nearest tech is a six-hour plane ride away. Consider 2024’s CrowdStrike outage and the devastation caused for banks, airports, and many other organizations.

A bar chart showing the average hourly cost of downtime by industry.

Data Source: SolarWinds

Out-of-band management offers the solution: a way for teams to access critical remote infrastructure during outages and breaches without “out-of-chair” expenses. Out-of-band management allows organizations to recover remote infrastructure faster, reducing the duration and expense of downtime.

This guide to out-of-band management answers critical questions about what this technology is, why you need it, and how to choose the right solution.

Table of contents

What is Out-of-Band Management?

Why Do You Need Out-of-Band Management?

What is an Out-of-Band Management solution?

How Does Out-of-Band Management Improve Security and Resilience?

Why Choose Nodegrid for Out-of-Band Management?

What is out-of-band management?

Out-of-band management (OOBM) involves controlling network infrastructure and workflows on an out-of-band network. An out-of-band network is an entirely separate network that runs parallel with your production (or in-band) network but doesn’t rely on any of the same infrastructure or services. OOBM allows teams to administer network infrastructure remotely on a dedicated connection, such as secondary Fiber or cellular LTE, that will remain available even if the in-band network goes down from an equipment failure, ISP outage, or ransomware attack.

The biggest reason to use out-of-band management is to ensure continuous, uninterrupted access to critical remote infrastructure even when the primary network is down. OOBM allows teams to recover from outages and cyberattacks faster and more cost-efficiently because they can access, troubleshoot, and restore systems without rolling trucks or hiring on-site services.

Out-of-band management provides a lifeline for teams to access critical remote infrastructure when the production network is offline. It allows them to immediately begin troubleshooting and repairing the issue to restore services ASAP. With OOBM, companies save money on recovery expenses, and minimize the duration and business impact of downtime.

What is an OOBM serial console?

Some organizations use OOBM jump boxes (or jump servers) that are connected to both the in-band and out-of-band networks, allowing administrators to “jump” from one network to the other for management. Examples of low-cost jump boxes include the Intel NUC and the Raspberry Pi. However, OOBM jump boxes are security risks because they do not effectively isolate the management infrastructure, plus they require an entire duplicate infrastructure of devices and services to create the out-of-band network. The best practice for security, resilience, and efficiency is to deploy an all-in-one, out-of-band management solution.

An out-of-band management solution uses hardware devices known as serial consoles, which connect to infrastructure devices via their management port (usually RS232 Serial, Ethernet, or USB). Serial consoles are known by lots of other names, including terminal servers, console servers, console server switches, serial routers, and serial switches.

The serial console has dedicated network interfaces to provide an Internet connection for remote management access, often fiber or 4G/5G cellular LTE, so they don’t connect to or rely upon the primary production network at all. This gives teams the ability to continuously monitor and administer critical remote infrastructure even during an ISP or WAN outage that would make a jump box inaccessible.

Learn more about security & resilience with out-of-band management

Understanding Serial Console Interfaces

Serial Console Redirection Guide

Serial Console PDU Management Guide

Administrators remotely access an OOBM serial console via this dedicated link and, from there, can view and manage all connected infrastructure from a single, convenient software platform. This software is typically deployed on-premises and runs as a VM (virtual machine) either on the serial console itself or on a separate machine, but there are some cloud-based OOBM network management software tools.

Out-of-band management software varies from provider to provider, with most offering second-generation (or Gen 2) solutions that provide some built-in automation capabilities but do not support vendor-neutral integrations with third-party tools. Newer, third-generation (or Gen 3) solutions use an open, x86 Linux-based operating system to allow easy integrations with other vendors’ software for automation, orchestration, security, monitoring, and more.

The benefits of out-of-band management

Out-of-band management can help you:

Improve network performance: Performing resource-intensive management, automation, and orchestration workflows on the out-of-band network reduces the strain on the production network for better speed and reliability.

Reduce your attack surface: OOBM is a core component of an isolated management infrastructure (IMI), which takes management interfaces off the production network to prevent cybercriminals from accessing them during a breach. IMI is recommended by CISA’s binding directive 23-02.

Accelerate ransomware recovery: The OOBM network can be used to create an isolated recovery environment (IRE) where teams can safely rebuild and recover from ransomware attacks without the risk of reinfection, reducing the duration and expense of ransomware-related outages.

Streamline repairs and rebuilds: OOBM provides the ability to deploy the tools and applications needed to isolate, cleanse, rebuild, and restore services that have been affected by failures and ransomware.

The security and resilience benefits of out-of-band management are discussed further below.

How does out-of-band management improve security and resilience?

Network breaches and ransomware attacks occur so frequently that most businesses know it’s no longer a question of “if,” but “when” they’ll be hit. Once cybercriminals compromise a device or account and can move around the network, it’s only a matter of time before they find the management interfaces and take complete control over critical infrastructure.

OOBM and management infrastructure isolation

Serial consoles create an out-of-band network by directly connecting to the management port of infrastructure devices and moving all control functions off of the production LAN. This isolates the management plane from the data plane, which is part of a cybersecurity best practice known as isolated management infrastructure (IMI). An IMI further segments the management network and routes management ports to terminate on top-of-rack, OOBM serial switches, creating multiple layers of isolated management. The isolated management plane is always remotely accessible to engineers via the OOBM connection, but it remains hidden from any cybercriminals who may breach the production network.

OOBM and ransomware recovery

Out-of-band management also improves security and resilience by aiding in ransomware recovery. According to a Sophos survey, 70% of companies hit by ransomware take longer than two weeks to recover, due in no small part to the pervasive nature of the malware used and how frequently rebuilt systems and recovered data get reinfected. Today’s ransomware attacks are now pre-packaged and move at machine speed – meaning instantly – across infrastructure, bringing entire businesses down before they’ve even realized they’re under attack. The longer the business is offline, the more revenue (and customer trust) is lost, causing recovery costs to skyrocket.

An IMI using out-of-band management gives teams an isolated recovery environment (IRE) where they can recover data and rebuild systems without the risk of reinfection. The IRE allows organizations to get services back online faster to reduce the financial and reputational consequences of ransomware attacks.

Resilience is defined as the ability to continuously operate and deliver services, if in a degraded fashion, even while undergoing major failures and breaches. Out-of-band management improves resilience by ensuring that teams have continuous access to critical remote infrastructure no matter what’s going wrong with the production environment. OOBM serial consoles also isolate the management infrastructure to protect it from attackers on the primary network and provide a safe environment for teams to recover from ransomware.

Learn more about security & resilience with out-of-band management

Dissecting the MGM Cyberattack: Lions, Tigers, & Bears, Oh My!

The Biggest Ransomware Attack You Haven’t Heard of…Yet

Breaking Down The 2023 Ragnar Locker Cyberattacks

What to do if You’re Ransomware’d: A Healthcare Example

Why choose Nodegrid for out-of-band management?

Many network teams think of out-of-band as being a huge expense and time sink. Setting up proper infrastructure for OOBM and IMI typically requires 6 or more boxes at each business site for routing, switching, firewall, storage, cellular access, and a jump box. The Nodegrid platform from ZPE Systems reduces the cost and headache of out-of-band management by combining all these functions and more into a single box. Teams can easily drop a Nodegrid box in each site at a fraction of the cost of deploying a traditional OOBM network.

The first Gen 3 OOBM solution

Nodegrid is the first and only Gen 3 out-of-band management solution. Nodegrid OOBM devices use the x86 Linux-based NodegridOS, which is capable of running VMs and Docker containers to host your choice of third-party applications for automation, orchestration, security, SD-WAN, and more. Nodegrid’s ability to host other vendors’ software ensures that teams have access to all the tools they need to troubleshoot and recover infrastructure from within the IMI environment, making it the perfect network resilience multi-tool.

Nodegrid OOBM software is available as an on-premises solution or a highly scalable cloud-based app, and both support easy integrations with tools for monitoring, automated configuration management, and more. This enables teams to consolidate and streamline their workflows, maximizing efficiency while reducing the risk of human error.

Nodegrid’s other key features include:

Built-in 5G/4G LTE and Wi-Fi options for OOB and network failover
OOB support over IPMI, ILO, DRAC, CIMC, vSerial, and KVM
Robust hardware security like BIOS protection, UEFI Secure Boot, and an encrypted solid-state disk
SAML 2.0 and two-factor authentication (2FA)
Support for legacy and mixed-vendor infrastructure without expensive adapters

ZPE Systems offers a wide range of out-of-band management devices to fit any deployment size and use case, including the 96-port Nodegrid Serial Console Plus (NSCP) for large and hyperscale data centers, and the Nodegrid Gate SR, which combines branch gateway routing and OOB serial console functionality for remote business sites like retail stores and manufacturing plants.

Nodegrid OOB serial console comparison

Guest OS

Docker Apps

Wi-Fi

Cellular (Dual-SIM)

Serial Ports

Data Sheet

Nodegrid Serial Console S Series

1-2

16, 32 or 48

Download

Nodegrid Serial Console Plus (NSCP)

1-2

Yes

16, 32, 48 or 96

Download

Nodegrid OOB network edge router comparison

Guest OS

Docker Apps

Wi-Fi

Cellular (Dual-SIM)

Serial Ports

Data Sheet

Nodegrid Link SR

1-2

Yes

Download

Nodegrid Bold SR

1-2

Yes

1-2

Download

Nodegrid Hive SR

1-2

1-3

Yes

1-2

Download

Nodegrid Gate SR

1-3

1-4

Yes

1-2

Download

Nodegrid Net SR

1-6

1-4

Yes

1-4

16-80

Download

Nodegrid Mini SR

1-2

Yes

Via USB

Download

Get scalable network resilience with the only Gen 3 out-of-band management solution

Only Nodegrid OOBM delivers network control, security, automation, and resilience with a completely vendor-neutral platform. To see Nodegrid out-of-band management in action, request a free demo.

Request a Demo

« Older Entries

Next Entries »

ZPE Solution Pathways

Discover Nodegrid

AI Data Center Infrastructure

AI data center infrastructure components

Computing

Storage

Networking

Power

Management

AI data center challenges

Improving operational efficiency with a vendor-neutral platform

Make Nodegrid your AI data center orchestration platform

Why Securing IT Means Replacing End-of-Life Console Servers

The Risks of Using End-of-Life Console Servers

1. Lack of Security Features and Updates

2. Compliance Issues

3. Prolonged Recovery

The Importance of Replacing EOL Console Servers

1. Modern Security Approach

2. Protection Against New Threats

3. Ease of Compliance

4. Faster Recovery

Get a Walkthrough of IMI and IRE

Meet Me at Cisco Live Amsterdam 2026

Critical Entities Resilience Directive

Who does the Critical Entities Resilience Directive apply to, and why does it matter?

In-Scope Sectors Covered by the CER Directive

CER Directive requirements for critical entities

1. Article 4: Strategy on the resilience of critical entities

2. Article 5: Risk assessment by Member States

3. Article 12: Risk assessment by critical entities

4. Article 13: Resilience measures of critical entities

CER Directive Resilience Measures

5. Article 15: Incident notification

How Nodegrid simplifies CER Directive compliance

Looking to Upgrade to a Nodegrid serial console?

Read more about network resilience:

DORA Act: 5 Takeaways For The Financial Sector

Who and What Are Covered Under the DORA Act?

DORA Act: 5 Takeaways for the Financial Sector

1. Shore-up your cybersecurity measures

2. Implement resilience systems

3. Conduct regular scans for vulnerabilities

4. Collaborate and share information with industry peers

5. Segment physical and logical systems to pass regular audits

Get the blueprint to help you comply with the DORA Act

Network Resilience: What is a Resilience System?

Table of contents

What is a resilience system?

Resilience system use cases

1. Ransomware recovery

2. Network outages

3. Shift to remote work

Do backups and redundancy equate to network resilience?

What does a resilience system look like?

Key Components of a Resilience System

Network resilience with ZPE Systems

Want to ensure network resilience by accelerating ransomware recovery?

Out-of-Band Management: What It Is and Why You Need It

Data Source: SolarWinds

What is out-of-band management?

What is an OOBM serial console?

The benefits of out-of-band management

How does out-of-band management improve security and resilience?

OOBM and management infrastructure isolation

OOBM and ransomware recovery

Why choose Nodegrid for out-of-band management?

The first Gen 3 OOBM solution

Nodegrid OOB serial console comparison

Nodegrid OOB network edge router comparison

Get scalable network resilience with the only Gen 3 out-of-band management solution