Providing Out-of-Band Connectivity to Mission-Critical IT Resources

What to do if You’re Ransomware’d: A Healthcare Example

What to do if youre ransomwared

This article was written by James Cabe, CISSP, a 30-year cybersecurity expert who’s helped major companies including Microsoft and Fortinet.

Ransomware gangs target the innocent and vulnerable. They hit a Chicago hospital in December 2023, a London hospital in October the same year, and schools and hospitals in New Jersey as recently as January 2024. This is one of the biggest reasons I’m committed to stopping these criminals by educating organizations on how to re-think and re-architect their approach to cybersecurity.

In previous articles, I discussed IMI (Isolated Management Infrastructure) and IRE (Isolated Recovery Environments), and how they could have quickly altered outcomes for MGM, Ragnar Locker victims, and organizations affected by the MOVEit vulnerability. Using IMI and IRE, organizations find that the key to not only speedy recovery, but also to limiting the blast radius and attack persistence, is isolation.

Why is isolation (not segmentation) key to ransomware recovery?

The NIST framework for incident response has five steps: Identify, Protect, Detect, Respond, and Recover. It’s missing a crucial step, however: Isolate. Stay tuned for a full breakdown of this in my next article. But the reason this is so critical is because attacks move at machine speed, and are very pervasive and persistent. If your management network is not fully isolated from production assets, the infection spreads to everything. Suddenly, you’re locked out completely and looking at months of tedious recovery. For healthcare providers, this jeopardizes everything from patient care to regulatory compliance.

Isolation is integral to building a resilience system, or in other words, a system that gives you more than basic serial console/out-of-band access and instead provides an entire infrastructure dedicated to keeping you in control of your systems — be it during a ransomware attack, ISP outage, natural disaster, etc. Because this infrastructure is physically and virtually isolated from production (no dependencies on production switches/routers, no open management ports, etc.), it’s nearly impossible for attackers to lock you out.

So, what really should you do if you’re ransomware’d? Let’s walk through an example attack on a healthcare system, and compare the traditional DR (Disaster Recovery) response to the IMI/IRE approach.

Ransomware in Healthcare: Disaster Recovery vs Isolated Recovery

Suppose you’re in charge of a hospital’s network. MDIoT, patient databases, and DICOM storage are the crown jewels of your infrastructure. Suddenly, you discover ransomware has encrypted patient records and is likely spreading quickly to other crown jewel assets. The risks and potential fallout can’t be understated. Millions of people are depending on you to protect their sensitive info, while the hospital is depending on you to help them avoid regulatory/legal penalties and ensure they can continue operating.

The problem with Disaster Recovery

Though the word ‘recovery’ is in the name, the DR approach is limited in its capacity to recover systems during an attack. Disaster Recovery typically employs a couple things:

  • Backups, which are copies of data, configurations, and code that are used to restore a production system when it fails.
  • Redundancy, which involves duplicating critical systems, services, and applications as a failsafe in the event that primaries go down (think cellular failover devices, secondary firewalls, etc.).

What happens when you activate your DR processes? It’s highly likely that you won’t be able to, and that’s because the typical DR setup relies on the production network. There’s no isolation.

Think about it this way: your backup servers need direct access to the data they’re backing up. If your file servers get pwned, your backup servers will, too. If your primary firewall gets hacked, your secondary will, too. The problem with backup and redundancy systems — and any system, for that matter — is that when they depend on the underlying infrastructure to remain operational, they’re just as susceptible to outages and attacks. It’s like having a reserve parachute that depends on the main parachute.

And what about the rest of your systems? You just discovered the attack has encrypted your servers and is quickly bringing operations to a crawl. How are you going to get in and fight back? What if you try to log into your management network, only to find that you’re locked out? All of your tools, configurations, and capabilities have been compromised.

This is why CISA, the FBI, US Navy, and other agencies recommend implementing Isolated Management Infrastructure.

IMI and IRE guarantee you can fight back against ransomware

You discover that the ransomware has spread. Not only has it encrypted data and stopped operations, but it has also locked you out of your own management network and is affecting the software configurations throughout the hospital. This is where IMI (Isolated Management Infrastructure) and IRE (Isolated Recovery Environment) come in.

Because IMI is physically separate from affected systems, it guarantees management access so teams can set up communication and a temporary ‘war room’ for incident response. The IRE can then be created using a combination of cellular, compute, connectivity, and power control (see diagram for design and steps). Docker containers should be used to bring up each step.

Diagram showing a chart containing the systems and open-source tools that can be deployed for an Isolated Recovery Environment

Image: The infrastructure and incident response protocol involved in the Isolated Recovery Environment. These products were chosen from free or open source projects that have proven to be very useful in each of these stages of recovery. These can be automated in pieces for each phase, and then be brought down via Docker container to eliminate the risk of leakage or risk during each phase.

Without diving too far into the technicalities, the IRE enables you to recover survivable data, restore software configurations, and prevent reinfection. Here are some things you can do (and should do) in this scenario, courtesy of the IRE:

Establish your war room

You can’t fight ransomware if you can’t securely communicate with your team. Use the IRE to create offline, break-the-glass accounts that are not attached to email. This allows you to communicate and set up ticketing for forensics purposes.

Isolate affected systems

There’s no use running antivirus if reinfection can occur. Use the IRE to take offline the switch that connects the backup and file servers. Isolate these servers from each other and shut down direct backup ports. Then, you can remote-in (KVM, iKVM, iDRAC) to run antivirus and EDR (Endpoint Detection and Response).

Restore data and device images

The key is to have backup data at its most current, both for patient data and device/software configurations. Because the IRE provides an isolated environment, and you’ve already pulled your backups offline, you can gradually restore data, re-image devices, and restore configurations without risking reinfection. The IRE ensures devices “keep away” from each other until they can be cleansed and recovered.

Things You’ll Need To Build The IMI and IRE

Network Automation Blueprint

We’ve created a comprehensive blueprint that shows how to implement the architecture for IMI and IRE. Don’t let the name fool you. The Network Automation Blueprint covers everything from establishing a dedicated management network, to automating deployment of services for ransomware recovery. Get your PDF copy now at the link below.

Gen 3 Console Servers To Replace End-of-Life Gear

It’s nearly impossible to build the IMI or deploy the IRE using older console servers. That’s because these only give you basic remote access and a hint of automation capabilities. You’ll still need the ability to run VMs and containers. Gen 3 console servers let you do all of the things for IMI and IRE, like full control plane/data plane separation, hosting apps, and deploying VMs/containers on-demand. They’ve also been validated by Synopsys and have built-in security features I’ve been talking about for years. Check out the link below for resources about Gen 3 and how we’ll help you upgrade.

Get in touch with me!

I’d love to talk with you about IMI, IRE, and resilience systems. These are becoming more crucial to operational resilience and ransomware recovery, and countries are passing new regulations that will require these approaches. Get in touch with me via social media to talk about this!

IT Automation vs Orchestration: What’s the Difference?

it-automation-vs-orchestration

IT automation and orchestration are two important concepts in the field of information technology that are often used interchangeably but are actually quite different. IT automation focuses on individual tasks, whereas orchestration encompasses multiple tasks or even entire workflows. Each approach produces different results and helps teams meet different goals. They also have their own benefits and challenges that must be considered. This guide compares IT automation vs orchestration to clear up misconceptions and help organizations choose the right approach to streamlining their IT operations.

IT Automation vs Orchestration: What’s the Difference?

IT Automation vs Orchestration

IT automation refers to the use of technology to automate repetitive tasks and processes, including things like automated backups, software updates, and monitoring systems. The goal of IT automation is to free up time and resources for IT professionals by automating routine tasks, allowing them to focus on more strategic initiatives.

Orchestration, on the other hand, is the coordination and management of multiple processes or entire workflows. This can include things like configuring and deploying new servers, managing network connections, and monitoring the performance of many different systems. The goal of orchestration is to improve the overall efficiency of IT operations, reducing costs and enabling greater scalability.

The benefits of IT automation vs orchestration

Benefits of IT Automation vs Orchestration

IT Automation

  • Saves time
  • Reduces human error
  • Improves compliance

Orchestration

  • Increases operational efficiency
  • Improves network scalability
  • Ensures IT system reliability

One of the main benefits of IT automation is that it can save time and resources for IT professionals. By automating routine tasks, IT teams can focus on more strategic initiatives and projects. Additionally, automation helps reduce human error and increases the accuracy, speed, and efficiency of tasks. Automation also improves compliance, as automated processes are less prone to human negligence and are easier to audit.

Orchestration, on the other hand, helps improve the overall efficiency and effectiveness of IT operations. By automating the coordination and management of multiple tasks, orchestration helps ensure that different systems and processes work together seamlessly. Additionally, orchestration helps improve the scalability and reliability of IT systems by ensuring different components are configured and deployed correctly.

The challenges of IT automation and orchestration

IT Automation and Orchestration Challenges

IT Complexity

Teams can’t effectively automate IT operations unless they thoroughly understand all the tasks, systems, and workflows comprising a highly complex network.

Automation Skills Gap

A high demand for automation engineers makes it difficult and expensive to recruit, train, and retain qualified IT automation and orchestration professionals.

Supporting Infrastructure

Effective automation and orchestration deployments require a robust underlying infrastructure of specialized hardware and software solutions.

One of the main challenges of automation and orchestration is the complexity of IT systems. As organizations rely more heavily on specialized technology and grow both in size and in number of business sites, IT systems become increasingly complex and difficult to manage. Automation and orchestration help reduce complexity by automating routine tasks and coordinating the management of different systems. However, teams must understand those tasks and systems well enough to know how to automate them effectively; otherwise, mistakes will proliferate or there will be gaps in automated workflows.

Another IT automation and orchestration challenge is the need for skilled professionals to deploy and manage these solutions. As automation and orchestration become more prevalent, the demand for skilled professionals has increased, making it harder (and more expensive) to recruit and retain qualified automation engineers. The alternative is for organizations to spend time and resources training existing IT staff to work with automation and orchestration.

Additionally, organizations need to invest in the technology and infrastructure necessary to support automation and orchestration. Some examples of these automation infrastructure components include:

  • Gen 3 out-of-band (OOB) serial consoles, which allow teams to deploy third-party automation on an OOB network that doesn’t rely on production infrastructure, improving security and resilience. Gen 3 OOB also moves bandwidth-hogging orchestration workflows off the production network, which reduces latency for better performance.
  • Software-defined networking, which virtualizes the control and management processes and abstracts them from underlying LAN and WAN hardware. SDN, SD-WAN, and SD-Branch technologies enable a high degree of automation for networking workflows such as load balancing, application-aware routing, and failover.
  • Infrastructure as Code (IaC), which turns infrastructure configurations into software code. IaC enables the use of version control, zero-touch deployments, automatic configuration management, automated security testing, and other tools and processes that support automation and improve network resilience.
  • Orchestrator software, which controls all of the automated workflows on a network. The orchestrator is the central hub for teams to create, deploy, monitor, and troubleshoot automated workflows and infrastructure.
  • AIOps, or artificial intelligence for IT operations, which analyzes all the logs and data pulled from automated infrastructure devices and security appliances. AIOps provides predictive maintenance insights, automatic root-cause analysis (RCA), enhanced threat detection, and other functionality to help support a complex, automated network infrastructure.

Tips for overcoming IT automation and orchestration challenges

While every organization will face unique IT automation and orchestration hurdles, there are two basic tips to help simplify any deployment. Using consolidated network hardware and vendor-neutral platforms can help reduce the complexity of network infrastructure, the need to hire additional staff, and the cost to deploy automation infrastructure.

  • Consolidated network hardware, such as all-in-one branch/edge gateway routers, significantly reduces the number of devices deployed at each business site. Fewer devices to automate means less complexity, and organizations save money on deployment costs like hardware overhead and automation license seats.
  • Vendor-neutral platforms, such as the Nodegrid infrastructure management platform from ZPE Systems, allow teams to use the automation and orchestration tools they’re most comfortable with regardless of provider, reducing the skills gap. Open platforms ensure seamless interoperability between all the various automated components to decrease management complexity. Vendor-neutral hardware also allows organizations to run software from multiple vendors on a single device, enabling even greater network consolidation to reduce the complexity and cost of automated infrastructure deployments.

Choosing IT automation vs orchestration

IT automation and orchestration are interconnected concepts that are frequently, but incorrectly, used interchangeably. Automation focuses on individual tasks, while orchestration manages multiple tasks and entire workflows. Both automation and orchestration can help improve the efficiency and effectiveness of IT operations, but they have their unique benefits and challenges. Organizations must carefully consider their IT systems and needs when deciding which approach to use.

IT automation vs orchestration simplified

The network automation experts at ZPE Systems have helped Big Tech brands like Amazon and Uber improve operational efficiency and resilience with IT automation and orchestration. Learn how to use these best practices to streamline your IT operations by downloading our Network Automation Blueprint.

Download the Blueprint

Network Resilience: What is a Resilience System?

A digital web of interconnected network resilience concepts being selected by a business person in a suit.

Network resilience means being able to withstand or recover from adversity, service degradation, and complete outages with minimal business disruption. The longer business-critical services are down, or systems are breached, the greater the risk of significant financial, reputational, and legal consequences. A resilience system is a set of technologies that enable an organization to continue operating while teams work to repair failures and recover from cyberattacks. But what exactly is a resilience system, and what does it look like? This guide to network resilience defines resilience systems, provides example use cases, compares them to related technologies like backups and redundant systems, and describes the key components required to build them.

What is a resilience system?

A resilience system provides all the infrastructure, tools, and services necessary to continue operating, if in a degraded state, during major incidents. It also includes everything needed to recover data, rebuild systems, perform security testing, and continue delivering core business functionality. A resilience system is typically isolated from the production network, preventing cybercriminals from finding and compromising it and ensuring teams have continuous access even if the primary network goes down.

Resilience system use cases

Some examples of the challenges that resilience systems help overcome include:

1. Ransomware recovery

In a ransomware attack, cybercriminals infect systems with malware that spreads throughout the network and encrypts any data it encounters. Modern ransomware now uses packaged attacks that move at machine speed, instantly incapacitating entire networks. Organizations completely lose access to critical systems and data until they pay a ransom, often in untraceable cryptocurrency. Ransomware is an exceptionally tenacious form of malware and tends to reinfect backup data and rebuilt systems, significantly hampering recovery efforts and increasing the duration and cost of the attack. The best practice for resilience systems is to isolate them on an out-of-band (OOB) network, inaccessible to hackers who have breached the production in-band network. Doing so creates a safe, isolated recovery environment (IRE) where teams can restore critical data and systems without the risk of reinfection. The resilience system includes all the tools and hardware needed to restore critical business services and infrastructure. An IRE significantly accelerates ransomware recovery and minimizes downtime, so businesses can avoid paying ransoms and reduce the overall cost of attacks.

2. Network outages

Enterprise network architectures and supply chains are highly complex, with lots of moving parts that rely on external vendors to maintain availability. Just one of those vendors dropping the ball could take the entire organization offline, severely impacting network resilience. For example, in 2023, an expired cryptographic certificate caused Cisco’s Viptela SD-WAN appliances to fail on reboot, completely taking down affected networks until the issue was resolved. With a resilience system, Viptela customers could have potentially avoided this downtime by failing over to alternative network resources. For example, a resilience system with integrated cellular failover allows branches to continue connecting to and delivering critical business services while also providing a lifeline for remote teams to access and recover failed systems. A resilience system also provides observability and automatic notifications so teams are instantly alerted to issues like certificate expirations and can respond quickly to recover critical services.

3. Shift to remote work

Incidents like ransomware attacks and equipment failures happen frequently enough that companies can create detailed plans and proactively implement solutions to minimize their impact, but not all adverse events are so predictable. When the COVID-19 pandemic struck, the massive shift to remote work strained the network resources of most organizations. Instead of maintaining a limited number of branch offices, teams suddenly had to treat every employee as a new branch, leading to performance degradation and outages as they scrambled to reinforce the business’s remote capabilities. A resilience system gives teams the tools and resources they need to provision additional infrastructure, manage networking logic, deploy new security solutions, and more, even while the primary network is offline or under a heavy load. A resilience system is the key to quickly adjusting network performance and security to adapt to sudden changes like a transition to fully remote operations.

Do backups and redundancy equate to network resilience?

The short answer is no; backups and redundancy do not equate to network resilience, though they do contribute to making systems more resilient.

  • Backups are copies of data, configurations, and application code used to do a hot or cold restore when a production system fails. The underlying infrastructure must remain operational for teams to access and use backups, and unless additional resilience measures are taken, it’s easy for backups to become infected or compromised, severely hampering recovery efforts.
  • Redundancy involves duplicating critical systems, services, and applications as a failsafe in case the primaries go down. Organizations can “fail over” to the redundancies to continue critical business operations during outages. However, redundant systems are just as susceptible to failures and infections without additional resilience measures like out-of-band management and isolated management infrastructure.

Backups and redundancy are part of network resilience but alone are not enough to ensure business continuity. Resilience systems focus on maintaining the architecture of the production network while adding the ability to recover or adapt to adversity. The next section discusses all the tools and technologies that make up network resilience systems.

What does a resilience system look like?

There are four key components that go into a resilience system.

Key Components of a Resilience System

Alternative Networking

Full-stack routing and switching, Wi-Fi, VoIP, virtualization, software-defined network overlays for SDN & SD-WAN

Alternative Compute

Full-stack compute, containers, virtual machines, and any other resources needed to run applications and deliver services

Storage & Storage Recovery

Enough storage to recover systems and applications as well as support content delivery

Automation

Tools like zero-touch provisioning (ZTP) to facilitate speedy recovery while minimizing human error

Alternative networking and compute resources ensure the organization can failover in the event of a network failure or continue delivering services when production servers are unavailable. Teams also need enough storage to restore backup data, build new systems, and support the content delivery network (CDN). Automation solutions like zero-touch provisioning (ZTP), configuration management, and security validation tools accelerate the recovery process while mitigating the risk of human error. Combined, these components enable teams to reduce the frequency, severity, and duration of outages, improving overall network resilience.

Network resilience with ZPE Systems

A resilient network will continue delivering critical business services in the face of any challenge, whether from cybercriminals, supply chain issues, global events, or even plain human error. A resilience system is isolated from the production network to ensure security and availability, and it consists of all the tools and technologies needed to troubleshoot, recover, and deliver your most crucial data, applications, and infrastructure. The Nodegrid platform from ZPE Systems is the perfect foundation for a resilience system. Nodegrid is a vendor-neutral, out-of-band management solution capable of running your choice of third-party software. Nodegrid allows you to build a highly customizable IRE containing all the tools needed to safely recover from ransomware. You can even use Nodegrid to deliver services while the primary network or systems are down, making it your all-in-one network resilience multi-tool.

Want to ensure network resilience by accelerating ransomware recovery?

Minimize the business impact of ransomware with the help of our whitepaper, 3 Steps to Ransomware Recovery. Learn how to follow Gartner’s best practices to build an Isolated Recovery Environment

Download Whitepaper

Out-of-Band Management: What It Is and Why You Need It

Thumbnail – What is out-of-band management

This scenario is every IT professional’s worst nightmare: it’s the middle of the night, a remote site on the other side of the country has gone offline, and nobody knows why. A single minute of downtime can cost anywhere from several hundred dollars to tens of thousands of dollars, and the nearest tech is a six-hour plane ride away. Consider 2024’s CrowdStrike outage and the devastation caused for banks, airports, and many other organizations.

A bar chart showing the average hourly cost of downtime by industry.
Data Source: SolarWinds

Out-of-band management offers the solution: a way for teams to access critical remote infrastructure during outages and breaches without “out-of-chair” expenses. Out-of-band management allows organizations to recover remote infrastructure faster, reducing the duration and expense of downtime.

This guide to out-of-band management answers critical questions about what this technology is, why you need it, and how to choose the right solution.

What is out-of-band management?

Out-of-band management (OOBM) involves controlling network infrastructure and workflows on an out-of-band network. An out-of-band network is an entirely separate network that runs parallel with your production (or in-band) network but doesn’t rely on any of the same infrastructure or services. OOBM allows teams to administer network infrastructure remotely on a dedicated connection, such as secondary Fiber or cellular LTE, that will remain available even if the in-band network goes down from an equipment failure, ISP outage, or ransomware attack.

A diagram showing how out-of-band management works.

The biggest reason to use out-of-band management is to ensure continuous, uninterrupted access to critical remote infrastructure even when the primary network is down. OOBM allows teams to recover from outages and cyberattacks faster and more cost-efficiently because they can access, troubleshoot, and restore systems without rolling trucks or hiring on-site services.

Out-of-band management provides a lifeline for teams to access critical remote infrastructure when the production network is offline. It allows them to immediately begin troubleshooting and repairing the issue to restore services ASAP. With OOBM, companies save money on recovery expenses, and minimize the duration and business impact of downtime.

What is an OOBM serial console?

Front and back views of the Nodegrid out-of-band management serial console.

Some organizations use OOBM jump boxes (or jump servers) that are connected to both the in-band and out-of-band networks, allowing administrators to “jump” from one network to the other for management. Examples of low-cost jump boxes include the Intel NUC and the Raspberry Pi. However, OOBM jump boxes are security risks because they do not effectively isolate the management infrastructure, plus they require an entire duplicate infrastructure of devices and services to create the out-of-band network. The best practice for security, resilience, and efficiency is to deploy an all-in-one, out-of-band management solution.

An out-of-band management solution uses hardware devices known as serial consoles, which connect to infrastructure devices via their management port (usually RS232 Serial, Ethernet, or USB). Serial consoles are known by lots of other names, including terminal servers, console servers, console server switches, serial routers, and serial switches.

The serial console has dedicated network interfaces to provide an Internet connection for remote management access, often fiber or 4G/5G cellular LTE, so they don’t connect to or rely upon the primary production network at all. This gives teams the ability to continuously monitor and administer critical remote infrastructure even during an ISP or WAN outage that would make a jump box inaccessible.

 Administrators remotely access an OOBM serial console via this dedicated link and, from there, can view and manage all connected infrastructure from a single, convenient software platform. This software is typically deployed on-premises and runs as a VM (virtual machine)  either on the serial console itself or on a separate machine, but there are some cloud-based OOBM network management software tools.

Out-of-band management software varies from provider to provider, with most offering second-generation (or Gen 2) solutions that provide some built-in automation capabilities but do not support vendor-neutral integrations with third-party tools. Newer, third-generation (or Gen 3) solutions use an open, x86 Linux-based operating system to allow easy integrations with other vendors’ software for automation, orchestration, security, monitoring, and more.

The benefits of out-of-band management

Out-of-band management can help you:

  • Improve network performance: Performing resource-intensive management, automation, and orchestration workflows on the out-of-band network reduces the strain on the production network for better speed and reliability.
  • Accelerate ransomware recovery: The OOBM network can be used to create an isolated recovery environment (IRE) where teams can safely rebuild and recover from ransomware attacks without the risk of reinfection, reducing the duration and expense of ransomware-related outages.
  • Streamline repairs and rebuilds: OOBM provides the ability to deploy the tools and applications needed to isolate, cleanse, rebuild, and restore services that have been affected by failures and ransomware.

The security and resilience benefits of out-of-band management are discussed further below.

How does out-of-band management improve security and resilience?

Network breaches and ransomware attacks occur so frequently that most businesses know it’s no longer a question of “if,” but “when” they’ll be hit. Once cybercriminals compromise a device or account and can move around the network, it’s only a matter of time before they find the management interfaces and take complete control over critical infrastructure.

OOBM and management infrastructure isolation

Serial consoles create an out-of-band network by directly connecting to the management port of infrastructure devices and moving all control functions off of the production LAN. This isolates the management plane from the data plane, which is part of a cybersecurity best practice known as isolated management infrastructure (IMI). An IMI further segments the management network and routes management ports to terminate on top-of-rack, OOBM serial switches, creating multiple layers of isolated management. The isolated management plane is always remotely accessible to engineers via the OOBM connection, but it remains hidden from any cybercriminals who may breach the production network.

Multi Layered OOB IMI – ZPE Systems

 

OOBM and ransomware recovery

Out-of-band management also improves security and resilience by aiding in ransomware recovery. According to a Sophos survey, 70% of companies hit by ransomware take longer than two weeks to recover, due in no small part to the pervasive nature of the malware used and how frequently rebuilt systems and recovered data get reinfected. Today’s ransomware attacks are now pre-packaged and move at machine speed – meaning instantly – across infrastructure, bringing entire businesses down before they’ve even realized they’re under attack. The longer the business is offline, the more revenue (and customer trust) is lost, causing recovery costs to skyrocket.

An IMI using out-of-band management gives teams an isolated recovery environment (IRE) where they can recover data and rebuild systems without the risk of reinfection. The IRE allows organizations to get services back online faster to reduce the financial and reputational consequences of ransomware attacks.

A diagram showing the components of an isolated recovery environment.

Resilience is defined as the ability to continuously operate and deliver services, if in a degraded fashion, even while undergoing major failures and breaches. Out-of-band management improves resilience by ensuring that teams have continuous access to critical remote infrastructure no matter what’s going wrong with the production environment. OOBM serial consoles also isolate the management infrastructure to protect it from attackers on the primary network and provide a safe environment for teams to recover from ransomware.

Why choose Nodegrid for out-of-band management?

Many network teams think of out-of-band as being a huge expense and time sink. Setting up  proper infrastructure for OOBM and IMI typically requires 6 or more boxes at each business site for routing, switching, firewall, storage, cellular access, and a jump box. The Nodegrid platform from ZPE Systems reduces the cost and headache of out-of-band management by combining all these functions and more into a single box. Teams can easily drop a Nodegrid box in each site at a fraction of the cost of deploying a traditional OOBM network.

A diagram showing ZPE’s multi-function capabilities for IMI in branch and edge sites.

The first Gen 3 OOBM solution

Nodegrid is the first and only Gen 3 out-of-band management solution. Nodegrid OOBM devices use the x86 Linux-based NodegridOS, which is capable of running VMs and Docker containers to host your choice of third-party applications for automation, orchestration, security, SD-WAN, and more. Nodegrid’s ability to host other vendors’ software ensures that teams have access to all the tools they need to troubleshoot and recover infrastructure from within the IMI environment, making it the perfect network resilience multi-tool.

Nodegrid OOBM software is available as an on-premises solution or a highly scalable cloud-based app, and both support easy integrations with tools for monitoring, automated configuration management, and more. This enables teams to consolidate and streamline their workflows, maximizing efficiency while reducing the risk of human error.

Nodegrid’s other key features include:

  • Built-in 5G/4G LTE and Wi-Fi options for OOB and network failover
  • OOB support over IPMI, ILO, DRAC, CIMC, vSerial, and KVM
  • Robust hardware security like BIOS protection, UEFI Secure Boot, and an encrypted solid-state disk
  • SAML 2.0 and two-factor authentication (2FA)
  • Support for legacy and mixed-vendor infrastructure without expensive adapters

ZPE Systems offers a wide range of out-of-band management devices to fit any deployment size and use case, including the 96-port Nodegrid Serial Console Plus (NSCP) for large and hyperscale data centers, and the Nodegrid Gate SR, which combines branch gateway routing and OOB serial console functionality for remote business sites like retail stores and manufacturing plants.

Nodegrid OOB serial console comparison


Guest OS
Docker Apps
Wi-Fi
Cellular (Dual-SIM)
Serial Ports
Data Sheet
Nodegrid Serial Console S Series
1
1-2
No
1
16, 32 or 48
Nodegrid Serial Console Plus (NSCP)
1
1-2
Yes
1
16, 32, 48 or 96

Nodegrid OOB network edge router comparison


Guest OS
Docker Apps
Wi-Fi
Cellular (Dual-SIM)
Serial Ports
Data Sheet
Nodegrid Link SR
1
1-2
Yes
1
1
Nodegrid Bold SR
1
1-2
Yes
1-2
8
Nodegrid Hive SR
1-2
1-3
Yes
1-2
8
Nodegrid Gate SR
1-3
1-4
Yes
1-2
8
Nodegrid Net SR
1-6
1-4
Yes
1-4
16-80
Nodegrid Mini SR
1
1-2
Yes
1
Via USB

Get scalable network resilience with the only Gen 3 out-of-band management solution

Only Nodegrid OOBM delivers network control, security, automation, and resilience with a completely vendor-neutral platform. To see Nodegrid out-of-band management in action, request a free demo.

Request a Demo

IT Infrastructure Management Best Practices

A small team uses IT infrastructure management best practices to manage an enterprise network

A single hour of downtime costs organizations more than $300,000 in lost business, making network and service reliability critical to revenue. The biggest challenge facing IT infrastructure teams is ensuring network resilience, which is the ability to continue operating and delivering services during equipment failures, ransomware attacks, and other emergencies. This guide discusses IT infrastructure management best practices for creating and maintaining more resilient enterprise networks.
.

What is IT infrastructure management? It’s a collection of all the workflows involved in deploying and maintaining an organization’s network infrastructure. 

IT infrastructure management best practices

The following IT infrastructure management best practices help improve network resilience while streamlining operations. Click the links on the left for a more detailed look at the technologies and processes involved with each.

Isolated Management Infrastructure (IMI)

• Protects management interfaces in case attackers hack the production network

• Ensures continuous access using OOB (out-of-band) management

• Provides a safe environment to fight through and recover from ransomware

Network and Infrastructure Automation

• Reduces the risk of human error in network configurations and workflows

• Enables faster deployments so new business sites generate revenue sooner

• Accelerates recovery by automating device provisioning and deployment

• Allows small IT infrastructure teams to effectively manage enterprise networks

Vendor-Neutral Platforms

• Reduces technical debt by allowing the use of familiar tools

• Extends OOB, automation, AIOps, etc. to legacy/mixed-vendor infrastructure

• Consolidates network infrastructure to reduce complexity and human error

• Eliminates device sprawl and the need to sacrifice features

AIOps

• Improves security detection to defend against novel attacks

• Provides insights and recommendations to improve network health for a better end-user experience

• Accelerates incident resolution with automatic triaging and root-cause analysis (RCA)

Isolated management infrastructure (IMI)

Management interfaces provide the crucial path to monitoring and controlling critical infrastructure, like servers and switches, as well as crown-jewel digital assets like intellectual property (IP). If management interfaces are exposed to the internet or rely on the production network, attackers can easily hijack your critical infrastructure, access valuable resources, and take down the entire network. This is why CISA released a binding directive that instructs organizations to move management interfaces to a separate network, a practice known as isolated management infrastructure (IMI).

The best practice for building an IMI is to use Gen 3 out-of-band (OOB) serial consoles, which unify the management of all connected devices and ensure continuous remote access via alternative network interfaces (such as 4G/5G cellular). OOB management gives IT teams a lifeline to troubleshoot and recover remote infrastructure during equipment failures and outages on the production network. The key is to ensure that OOB serial consoles are fully isolated from production and can run the applications, tools, and services needed to fight through a ransomware attack or outage without taking critical infrastructure offline for extended periods. This essentially allows you to instantly create a virtual War Room for coordinated recovery efforts to get you back online in a matter of hours instead of days or weeks. A diagram showing a multi-layered isolated management infrastructure. An IMI using out-of-band serial consoles also provides a safe environment to recover from ransomware attacks. The pervasive nature of ransomware and its tendency to re-infect cleaned systems mean it can take companies between 1 and 6 months to fully recover from an attack, with costs and revenue losses mounting with every day of downtime. The best practice is to use OOB serial consoles to create an isolated recovery environment (IRE) where teams can restore and rebuild without risking reinfection.
.

Network and infrastructure automation

As enterprise network architectures grow more complex to support technologies like microservices applications, edge computing, and artificial intelligence, teams find it increasingly difficult to manually monitor and manage all the moving parts. Complexity increases the risk of configuration mistakes, which cause up to 35% of cybersecurity incidents. Network and infrastructure automation handles many tedious, repetitive tasks prone to human error, improving resilience and giving admins more time to focus on revenue-generating projects.

Additionally, automated device provisioning tools like zero-touch provisioning (ZTP) and configuration management tools like RedHat Ansible make it easier for teams to recover critical infrastructure after a failure or attack. Network and infrastructure automation help organizations reduce the duration of outages and allow small IT infrastructure teams to manage large enterprise networks effectively, improving resilience and reducing costs.

For an in-depth look at network and infrastructure automation, read the Best Network Automation Tools and What to Use Them For

Vendor-neutral platforms

Most enterprise networks bring together devices and solutions from many providers, and they often don’t interoperate easily. This box-based approach creates vendor lock-in and technical debt by preventing admins from using the tools or scripting languages they’re familiar with, and it makes a fragmented, complex architecture of management solutions that are difficult to operate efficiently. Organizations also end up compromising on features, ending up with a lot of stuff they don’t need and too little of what they do need.

A vendor-neutral IT infrastructure management platform allows teams to unify all their workflows and solutions. It integrates your administrators’ favorite tools to reduce technical debt and provides a centralized place to deploy, orchestrate, and monitor the entire network. It also extends technologies like OOB, automation, and AIOps to otherwise unsupported legacy and mixed-vendor solutions. Such a platform is revolutionary in the same way smartphones were – instead of needing a separate calculator, watch, pager, phone, etc., everything was combined in a single device. A vendor-neutral management platform allows you to run all the apps, services, and tools you need without buying a bunch of extra hardware. It’s a crucial IT infrastructure management best practice for resilience because it consolidates and unifies network architectures to reduce complexity and prevent human error.

Learn more about the benefits of a vendor-neutral IT infrastructure management platform by reading How To Ensure Network Scalability, Reliability, and Security With a Single Platform

AIOps

AIOps applies artificial intelligence technologies to IT operations to maximize resilience and efficiency. Some AIOps use cases include:

  • Security detection: AIOps security monitoring solutions are better at catching novel attacks (those using methods never encountered or documented before) than traditional, signature-based detection methods that rely on a database of known attack vectors.
  • Data analysis: AIOps can analyze all the gigabytes of logs generated by network infrastructure and provide health visualizations and recommendations for preventing potential issues or optimizing performance.
  • Root-cause analysis (RCA): Ingesting infrastructure logs allows AIOps to identify problems on the network, perform root-cause analysis to determine the source of the issues, and create & prioritize service incidents to accelerate remediation.

AIOps is often thought of as “intelligent automation” because, while most automation follows a predetermined script or playbook of actions, AIOps can make decisions on-the-fly in response to analyzed data. AIOps and automation work together to reduce management complexity and improve network resilience.

Want to find out more about using AIOps and automation to create a more resilient network? Read Using AIOps and Machine Learning To Manage Automated Network Infrastructure

IT infrastructure management best practices for maximum resilience

Network resilience is one of the top IT infrastructure management challenges facing modern enterprises. These IT infrastructure management best practices ensure resilience by isolating management infrastructure from attackers, reducing the risk of human error during configurations and other tedious workflows, breaking vendor lock-in to decrease network complexity, and applying artificial intelligence to the defense and maintenance of critical infrastructure.

Need help getting started with these practices and technologies? ZPE Systems can help simplify IT infrastructure management with the vendor-neutral Nodegrid platform. Nodegrid’s OOB serial consoles and integrated branch routers allow you to build an isolated management infrastructure that supports your choice of third-party solutions for automation, AIOps, and more.

Want to learn how to make IT infrastructure management easier with Nodegrid?

To learn more about implementing IT infrastructure management best practices for resilience with Nodegrid, download our Network Automation Blueprint

Request a Demo

Collaboration in DevOps: Strategies and Best Practices

Collaboration in DevOps is illustrated by two team members working together in front of the DevOps infinity logo.
The DevOps methodology combines the software development and IT operations teams into a highly collaborative unit. In a DevOps environment, team members work simultaneously on the same code base, using automation and source control to accelerate releases. The transformation from a traditional, siloed organizational structure to a streamlined, fast-paced DevOps company is rewarding yet challenging. That’s why it’s important to have the right strategy, and in this guide to collaboration in DevOps, you’ll discover tips and best practices for a smooth transition.

Collaboration in DevOps: Strategies and best practices

A successful DevOps implementation results in a tightly interwoven team of software and infrastructure specialists working together to release high-quality applications as quickly as possible. This transition tends to be easier for developers, who are already used to working with software code, source control tools, and automation. Infrastructure teams, on the other hand, sometimes struggle to work at the velocity needed to support DevOps software projects and lack experience with automation technologies, causing a lot of frustration and delaying DevOps initiatives. The following strategies and best practices will help bring Dev and Ops together while minimizing friction.

Turn infrastructure and network configurations into software code

Infrastructure and network teams can’t keep up with the velocity of DevOps software development if they’re manually configuring, deploying, and troubleshooting resources using the GUI (graphical user interface) or CLI (command line interface). The best practice in a DevOps environment is to use software abstraction to turn all configurations and networking logic into code.

Infrastructure as Code (IaC)

Infrastructure as Code (IaC) tools allow teams to write configurations as software code that provisions new resources automatically with the click of a button. IaC configurations can be executed as often as needed to deploy DevOps infrastructure very rapidly and at a large scale.

Software-Defined Networking (SDN) 

Software-defined networking (SDN) and Software-defined wide-area networking (SD-WAN) use software abstraction layers to manage networking logic and workflows. SDN allows networking teams to control, monitor, and troubleshoot very large and complex network architectures from a centralized platform while using automation to optimize performance and prevent downtime.

Software abstraction helps accelerate resource provisioning, reducing delays and friction between Dev and Ops. It can also be used to bring networking teams into the DevOps fold with automated, software-defined networks, creating what’s known as a NetDevOps environment.

Use common, centralized tools for software source control

Collaboration in DevOps means a whole team of developers or sysadmins may work on the same code base simultaneously. This is highly efficient — but risky. Development teams have used software source control tools like GitHub for years to track and manage code changes and prevent overwriting each other’s work. In a DevOps organization using IaC and SDN, the best practice is to incorporate infrastructure and network code into the same source control system used for software code.

Managing infrastructure configurations using a tool like GitHub ensures that sysadmins can’t make unauthorized changes to critical resources. For example, administrators initiate many ransomware attacks and other major outages by directly changing infrastructure configurations without testing or approval. This happened in a high-profile MGM cyberattack when an IT staff member fell victim to social engineering and granted elevated Okta privileges to an attacker without having to get approval from a second pair of eyes.

Using DevOps source control, all infrastructure changes must be reviewed and approved by a second party in the IT department to ensure they don’t introduce vulnerabilities or malicious code into production. Sysadmins can work quickly and creatively, knowing there’s a safety net to catch mistakes, reducing Ops delays, and fostering a more collaborative environment.

Consolidate and integrate DevOps tools with a vendor-neutral platform

An enterprise DevOps deployment usually involves dozens – if not hundreds – of different tools to automate and streamline the many workflows involved in a software development project. Having so many individual DevOps tools deployed around the enterprise increases the management complexity, which can have the following consequences.

  • Human error – The harder it is to stay on top of patch releases, security bulletins, and monitoring logs, the more likely it is that an issue will slip between the cracks until it causes an outage or breach.
  • Security complexity – Every additional DevOps tool added to the architecture makes integrating and implementing a consistent security model more complex and challenging, increasing the risk of coverage gaps.
  • Spiraling costs – With many different solutions handling individual workflows around the enterprise, the likelihood of buying redundant services or paying for unneeded features increases, which can impact ROI.
  • Reduced efficiency – DevOps aims to increase operational efficiency, but having to work across so many disparate tools can slow teams down, especially when those tools don’t interoperate.

The best practice is consolidating your DevOps tools with a centralized, vendor-neutral platform. For example, the Nodegrid Services Delivery Platform from ZPE Systems can host and integrate 3rd-party DevOps tools, unifying them under a single management umbrella. Nodegrid gives IT teams single-pane-of-glass control over the entire DevOps architecture, including the underlying network infrastructure, which reduces management complexity, increases efficiency, and improves ROI.

Maximize DevOps success

DevOps collaboration can improve operational efficiency and allow companies to release software at the velocity required to stay competitive in the market. Using software abstraction, centralized source code control, and vendor-neutral management platforms reduces friction on your DevOps journey. The best practice is to unify your DevOps environment with a vendor-neutral platform like Nodegrid to maximize control, cost-effectiveness, and productivity.

Want to Simplify collaboration in DevOps with the Nodegrid platform?

Reach out to ZPE Systems today to learn more about how the Nodegrid Services Delivery Platform can help you simplify collaboration in DevOps.

 

Contact Us