Providing Out-of-Band Connectivity to Mission-Critical IT Resources

Network Resilience Doesn’t Mean What it Did 20 Years Ago

Network resilience requirements have changed

Enterprise networks are like air. When they’re running smoothly, it’s easy to take them for granted, as business users and customers are able to go about their normal activities. But when customer service reps are suddenly cut off from their ticketing system, or family movie night turns into a game of “Is it my router, or the network?”, everyone notices. This is why network resilience is critical.

But, what exactly does resilience mean today? Let’s find out by looking at some recent real-world examples, the history of network architectures, and why network resilience doesn’t mean what it did 20 years ago.

Why does network resilience matter?

There’s no shortage of real-world examples showing why network resilience matters. The takeaway is that network resilience is directly tied to business, which means that it impacts revenue, costs, and risks. Here is a brief list of resilience-related incidents that occurred in 2023 alone:

  • FAA (Federal Aviation Administration) – An overworked contractor unintentionally deleted files, which delayed flights nationwide for an entire day.
  • Southwest Airlines – A firewall configuration change caused 16,000 flight cancellations and cost the company about $1 billion.
  • MOVEit file transfer exploit – Thousands of organizations worldwide fell victim to a MOVEit vulnerability, which allowed attackers to steal the personal data of millions of people.
  • MGM Resorts – A social engineering exploit and a lack of recovery systems let an attack persist for weeks, causing millions in losses per day.
  • Ragnar Locker attacks – Several large organizations were locked out of IT systems for days, which slowed or halted customer operations worldwide.

What does network resilience mean?

Based on the examples above, it might seem that network resilience could mean different things. It might mean having backups of golden configs that you could easily restore in case of a mistake. It might mean beefing up your security and/or replacing outdated systems. It might mean having recovery processes in place.

So, which is it?

The answer is, it’s all of these and more.

Donald Firesmith (Carnegie Mellon) defines resilience this way: “A system is resilient if it continues to carry out its mission in the face of adversity (i.e., if it provides required capabilities despite excessive stresses that can cause disruptions).”

Network resilience means having a network that continues to serve its essential functions despite adversity. Adversity can stem from human error, system outages, cyberattacks, and even natural disasters that threaten to degrade or completely halt normal network operations. Achieving network resilience requires the ability to quickly address issues ranging from device failures and misconfigurations, to full-blown ISP outages and ransomware attacks.

The problem is, this is now much more difficult than it used to be.

How did network resilience become so complicated?

Twenty years ago, IT teams managed a centralized architecture. The data center was able to serve end-users and customers with the minimal services they needed. Being “constantly connected” wasn’t a concern for most people. For the business, achieving resilience was as simple as going on-site or remoting-in via serial console to fix issues at the data center.

Network architecture showing simplicity of data center connected via MPLS to branch office

Then in the mid-2000s, the advent of the cloud changed everything. Infrastructure, data, and computing became decentralized into a distributed mix of on-prem and cloud solutions. Users could connect from anywhere, and on-demand services allowed people to be plugged in around-the-clock. Services for work, school, and entertainment could be delivered anytime, no matter where users were.

Network architecture showing complexity of data center, CDN, remote user, branch office, all connected via many paths

Behind the scenes, this explosion of architecture created three problems for achieving network resilience, which a simple serial console could no longer fix:

Too Much Work

Infrastructure, data, and computing are widely distributed. Systems inevitably break and require work, but teams don’t have the staff to keep up.

Too Much Complexity

Pairing cloud and box-based stacks creates complex networks. Teams leave systems outdated, because they don’t want to break this delicate architecture.

Too Much Risk

Unpatched, outdated systems are prime targets for packaged attacks that move at machine speed. Defense requires recovery tools that teams don’t have.

Enabling businesses to be resilient in the modern age requires an approach that’s different from simply deploying a serial console for remote troubleshooting. Gen 1 and 2 serial consoles, which have dominated the market for 20 years, were designed to solve basic issues by offering limited remote access and some automation. The problem is, these still leave teams lacking the confidence to answer questions like:

  • “How can we guarantee access to fix stuff that breaks, without rolling trucks?”
  • “Can we automate change management, without fear of breaking the network?”
  • “Attacks are inevitable — How do we stop hackers from cutting off our access?”

Hyperscalers, Internet Service Providers, Big Tech, and even the military have a resilience model that they’ve proven over the last decade. Their approach involves fully isolating command and control from data and user environments. This allows them to not only gain low-level remote access to maintain and fix systems, but also to “defend the hill” and maintain control if systems are compromised or destroyed.

This approach uses something called Isolated Management Infrastructure (IMI).

Isolated Management Infrastructure is the best practice for network resilience

Isolated Management Infrastructure is the practice of creating a management network that is completely separate from the production network. Most IT teams know this kind of network as out-of-band management; IMI, however, provides many capabilities that can’t be hosted on a traditional serial console or OOB network. And with vulnerabilities increasing, CISA issued a binding directive specifically calling for organizations to implement IMI.

Isolated Management Infrastructure using Gen 3 serial consoles, like ZPE Systems’ Nodegrid devices, provides more than simple remote access and automation. Similar to a proper out-of-band network, IMI is completely isolated from production assets. This means there are no dependencies on production devices or connections, and management interfaces are not exposed to the internet or production gear. In the event of an outage or attack, teams retain management access, and this is just the beginning of the benefits of having IMI.

A network architecture diagram showing Isolated Management Infrastructure next to production infrastructure

IMI includes more than nine functions that are required for teams to fully service their production assets. These include:

  • Low-level access to all management interfaces, including serial, Ethernet, USB, IPMI, and others, to guarantee remote access to the entire environment
  • Open, edge-native automation to ensure services can continue operating in the event of outages or change errors
  • Computing, storage, and jumpbox capabilities that can natively host the apps and tools to deploy an isolated recovery environment (IRE), to ensure fast, effective recovery from attacks

Get the guide to build IMI

ZPE Systems has worked alongside Big Tech to fulfill their requirements for IMI. In doing so, we created the Network Automation blueprint as a technical guide to help any organization build their own Isolated Management Infrastructure. Download the blueprint now to get started.

IT Infrastructure Management Best Practices

A small team uses IT infrastructure management best practices to manage an enterprise network

A single hour of downtime costs organizations more than $300,000 in lost business, making network and service reliability critical to revenue. The biggest challenge facing IT infrastructure teams is ensuring network resilience, which is the ability to continue operating and delivering services during equipment failures, ransomware attacks, and other emergencies. This guide discusses IT infrastructure management best practices for creating and maintaining more resilient enterprise networks.

What is IT infrastructure management? It’s a collection of all the workflows involved in deploying and maintaining an organization’s network infrastructure. 

IT infrastructure management best practices

The following IT infrastructure management best practices help improve network resilience while streamlining operations. Each is covered in more detail in the sections below.

Isolated Management Infrastructure (IMI)

• Protects management interfaces in case attackers hack the production network

• Ensures continuous access using OOB (out-of-band) management

• Provides a safe environment to fight through and recover from ransomware

Network and Infrastructure Automation

• Reduces the risk of human error in network configurations and workflows

• Enables faster deployments so new business sites generate revenue sooner

• Accelerates recovery by automating device provisioning and deployment

• Allows small IT infrastructure teams to effectively manage enterprise networks

Vendor-Neutral Platforms

• Reduces technical debt by allowing the use of familiar tools

• Extends OOB, automation, AIOps, etc. to legacy/mixed-vendor infrastructure

• Consolidates network infrastructure to reduce complexity and human error

• Eliminates device sprawl and the need to sacrifice features

AIOps

• Improves security detection to defend against novel attacks

• Provides insights and recommendations to improve network health for a better end-user experience

• Accelerates incident resolution with automatic triaging and root-cause analysis (RCA)

Isolated management infrastructure (IMI)

Management interfaces provide the crucial path to monitoring and controlling critical infrastructure, like servers and switches, as well as crown-jewel digital assets like intellectual property (IP). If management interfaces are exposed to the internet or rely on the production network, attackers can easily hijack your critical infrastructure, access valuable resources, and take down the entire network. This is why CISA released a binding directive that instructs organizations to move management interfaces to a separate network, a practice known as isolated management infrastructure (IMI).
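As a rough illustration of what "isolated" means in practice, the sketch below audits an inventory of management interface addresses and flags any that sit on a production subnet or on a publicly routable address. The subnets, device names, and function names are hypothetical placeholders, not part of any vendor tool, and a real audit would also need to check routing and firewall rules.

```python
import ipaddress

# Hypothetical address plan; replace with your own subnets.
PRODUCTION_NETS = [ipaddress.ip_network("10.10.0.0/16")]
MANAGEMENT_NETS = [ipaddress.ip_network("172.16.50.0/24")]  # the isolated management network


def classify_mgmt_interface(addr: str) -> str:
    """Return 'isolated', 'on-production', 'internet-exposed', or 'unknown'."""
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in MANAGEMENT_NETS):
        return "isolated"
    if any(ip in net for net in PRODUCTION_NETS):
        return "on-production"
    if ip.is_global:
        return "internet-exposed"
    return "unknown"


def audit(inventory: dict[str, str]) -> dict[str, str]:
    """Map each device to its finding, skipping interfaces that are already isolated."""
    findings = {}
    for device, addr in inventory.items():
        status = classify_mgmt_interface(addr)
        if status != "isolated":
            findings[device] = status
    return findings


if __name__ == "__main__":
    # Hypothetical inventory of device name -> management interface address.
    inventory = {
        "core-switch-1": "172.16.50.11",
        "edge-router-1": "8.8.8.8",
        "db-server-ipmi": "10.10.4.20",
    }
    for device, status in audit(inventory).items():
        print(f"{device}: {status}")
```

Anything the audit reports is a candidate for moving onto the separate management network.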

The best practice for building an IMI is to use Gen 3 out-of-band (OOB) serial consoles, which unify the management of all connected devices and ensure continuous remote access via alternative network interfaces (such as 4G/5G cellular). OOB management gives IT teams a lifeline to troubleshoot and recover remote infrastructure during equipment failures and outages on the production network. The key is to ensure that OOB serial consoles are fully isolated from production and can run the applications, tools, and services needed to fight through a ransomware attack or outage without taking critical infrastructure offline for extended periods. This essentially allows you to instantly create a virtual War Room for coordinated recovery efforts to get you back online in a matter of hours instead of days or weeks.

A diagram showing a multi-layered isolated management infrastructure.

An IMI using out-of-band serial consoles also provides a safe environment to recover from ransomware attacks. The pervasive nature of ransomware and its tendency to re-infect cleaned systems mean it can take companies between 1 and 6 months to fully recover from an attack, with costs and revenue losses mounting with every day of downtime. The best practice is to use OOB serial consoles to create an isolated recovery environment (IRE) where teams can restore and rebuild without risking reinfection.

Network and infrastructure automation

As enterprise network architectures grow more complex to support technologies like microservices applications, edge computing, and artificial intelligence, teams find it increasingly difficult to manually monitor and manage all the moving parts. Complexity increases the risk of configuration mistakes, which cause up to 35% of cybersecurity incidents. Network and infrastructure automation handles many tedious, repetitive tasks prone to human error, improving resilience and giving admins more time to focus on revenue-generating projects.

Additionally, automated device provisioning tools like zero-touch provisioning (ZTP) and configuration management tools like Red Hat Ansible make it easier for teams to recover critical infrastructure after a failure or attack. Network and infrastructure automation helps organizations reduce the duration of outages and allows small IT infrastructure teams to manage large enterprise networks effectively, improving resilience and reducing costs.
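One small piece of automated recovery is detecting when a device has drifted from its known-good ("golden") configuration. The sketch below uses Python's standard difflib module to compare the two as plain text. The configuration snippets and function names are illustrative assumptions, and it is not how ZTP or Ansible work internally.

```python
import difflib


def config_drift(golden: str, running: str) -> list[str]:
    """Return unified-diff lines describing how the running config differs from golden."""
    diff = difflib.unified_diff(
        golden.splitlines(),
        running.splitlines(),
        fromfile="golden",
        tofile="running",
        lineterm="",
    )
    return list(diff)


def needs_restore(golden: str, running: str) -> bool:
    """True when the running configuration no longer matches the golden copy."""
    return bool(config_drift(golden, running))


if __name__ == "__main__":
    # Hypothetical configuration text, for illustration only.
    golden = "hostname branch-fw-01\nntp server 192.0.2.10\nlogging host 192.0.2.20\n"
    running = "hostname branch-fw-01\nntp server 192.0.2.10\n"
    for line in config_drift(golden, running):
        print(line)
    print("restore needed:", needs_restore(golden, running))
```

An automation workflow could run a check like this on a schedule and push the golden copy back when drift appears, instead of waiting for someone to notice the mistake.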

For an in-depth look at network and infrastructure automation, read the Best Network Automation Tools and What to Use Them For

Vendor-neutral platforms

Most enterprise networks bring together devices and solutions from many providers, and they often don’t interoperate easily. This box-based approach creates vendor lock-in and technical debt by preventing admins from using the tools or scripting languages they’re familiar with, and it results in a fragmented, complex architecture of management solutions that are difficult to operate efficiently. Organizations also compromise on features, ending up with many capabilities they don’t need and too few of the ones they do.

A vendor-neutral IT infrastructure management platform allows teams to unify all their workflows and solutions. It integrates your administrators’ favorite tools to reduce technical debt and provides a centralized place to deploy, orchestrate, and monitor the entire network. It also extends technologies like OOB, automation, and AIOps to otherwise unsupported legacy and mixed-vendor solutions. Such a platform is revolutionary in the same way smartphones were – instead of needing a separate calculator, watch, pager, phone, etc., everything was combined in a single device. A vendor-neutral management platform allows you to run all the apps, services, and tools you need without buying a bunch of extra hardware. It’s a crucial IT infrastructure management best practice for resilience because it consolidates and unifies network architectures to reduce complexity and prevent human error.

Learn more about the benefits of a vendor-neutral IT infrastructure management platform by reading How To Ensure Network Scalability, Reliability, and Security With a Single Platform

AIOps

AIOps applies artificial intelligence technologies to IT operations to maximize resilience and efficiency. Some AIOps use cases include:

  • Security detection: AIOps security monitoring solutions are better at catching novel attacks (those using methods never encountered or documented before) than traditional, signature-based detection methods that rely on a database of known attack vectors.
  • Data analysis: AIOps can analyze all the gigabytes of logs generated by network infrastructure and provide health visualizations and recommendations for preventing potential issues or optimizing performance.
  • Root-cause analysis (RCA): Ingesting infrastructure logs allows AIOps to identify problems on the network, perform root-cause analysis to determine the source of the issues, and create & prioritize service incidents to accelerate remediation.

AIOps is often thought of as “intelligent automation” because, while most automation follows a predetermined script or playbook of actions, AIOps can make decisions on-the-fly in response to analyzed data. AIOps and automation work together to reduce management complexity and improve network resilience.
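To make the contrast with signature-based detection concrete, here is a deliberately simple statistical stand-in: it learns a baseline from past measurements and flags new values that deviate sharply from it, without any database of known attacks. Real AIOps products use far more sophisticated models. The threshold, sample data, and function name are assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(baseline: list[float], observed: list[float],
                   threshold: float = 3.0) -> list[int]:
    """Return indexes of observed values more than `threshold` standard
    deviations away from the baseline mean (a basic z-score test).

    The baseline must contain at least two samples.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as anomalous.
        return [i for i, value in enumerate(observed) if value != mu]
    return [i for i, value in enumerate(observed)
            if abs(value - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical failed-login counts per minute.
    baseline = [10, 12, 11, 9, 10, 11, 12, 10]
    observed = [11, 10, 48, 12]
    print(find_anomalies(baseline, observed))
```

In the sample data, only the spike to 48 is flagged, even though nothing about it matches a known signature.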

Want to find out more about using AIOps and automation to create a more resilient network? Read Using AIOps and Machine Learning To Manage Automated Network Infrastructure

IT infrastructure management best practices for maximum resilience

Network resilience is one of the top IT infrastructure management challenges facing modern enterprises. These IT infrastructure management best practices ensure resilience by isolating management infrastructure from attackers, reducing the risk of human error during configurations and other tedious workflows, breaking vendor lock-in to decrease network complexity, and applying artificial intelligence to the defense and maintenance of critical infrastructure.

Need help getting started with these practices and technologies? ZPE Systems can help simplify IT infrastructure management with the vendor-neutral Nodegrid platform. Nodegrid’s OOB serial consoles and integrated branch routers allow you to build an isolated management infrastructure that supports your choice of third-party solutions for automation, AIOps, and more.

Want to learn how to make IT infrastructure management easier with Nodegrid?

To learn more about implementing IT infrastructure management best practices for resilience with Nodegrid, download our Network Automation Blueprint


Collaboration in DevOps: Strategies and Best Practices

Collaboration in DevOps is illustrated by two team members working together in front of the DevOps infinity logo.
The DevOps methodology combines the software development and IT operations teams into a highly collaborative unit. In a DevOps environment, team members work simultaneously on the same code base, using automation and source control to accelerate releases. The transformation from a traditional, siloed organizational structure to a streamlined, fast-paced DevOps company is rewarding yet challenging. That’s why it’s important to have the right strategy, and in this guide to collaboration in DevOps, you’ll discover tips and best practices for a smooth transition.

Collaboration in DevOps: Strategies and best practices

A successful DevOps implementation results in a tightly interwoven team of software and infrastructure specialists working together to release high-quality applications as quickly as possible. This transition tends to be easier for developers, who are already used to working with software code, source control tools, and automation. Infrastructure teams, on the other hand, sometimes struggle to work at the velocity needed to support DevOps software projects and lack experience with automation technologies, causing a lot of frustration and delaying DevOps initiatives. The following strategies and best practices will help bring Dev and Ops together while minimizing friction.

Turn infrastructure and network configurations into software code

Infrastructure and network teams can’t keep up with the velocity of DevOps software development if they’re manually configuring, deploying, and troubleshooting resources using the GUI (graphical user interface) or CLI (command line interface). The best practice in a DevOps environment is to use software abstraction to turn all configurations and networking logic into code.

Infrastructure as Code (IaC)

Infrastructure as Code (IaC) tools allow teams to write configurations as software code that provisions new resources automatically with the click of a button. IaC configurations can be executed as often as needed to deploy DevOps infrastructure very rapidly and at a large scale.
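The reason IaC configurations can be executed repeatedly is that they describe a desired end state rather than a list of manual steps. The sketch below shows that idea in miniature: it compares a desired state with the actual state and computes only the changes needed. The resource names and the plan/apply_plan functions are hypothetical and do not reflect any specific IaC tool's API.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Compute the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_plan(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, dict]:
    """Apply the plan to an in-memory copy of `actual` and return the new state."""
    new_state = {name: dict(spec) for name, spec in actual.items()}
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:  # create or update
            new_state[name] = dict(desired[name])
    return new_state


if __name__ == "__main__":
    # Hypothetical resources, for illustration only.
    desired = {"vlan10": {"id": 10}, "vlan20": {"id": 20}}
    actual = {"vlan10": {"id": 11}, "vlan30": {"id": 30}}
    print(plan(desired, actual))
    state = apply_plan(desired, actual)
    print(plan(desired, state))  # a second run finds nothing left to change
```

Because a second run produces an empty plan, the same definition can be applied as often as needed without side effects.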

Software-Defined Networking (SDN) 

Software-defined networking (SDN) and software-defined wide-area networking (SD-WAN) use software abstraction layers to manage networking logic and workflows. SDN allows networking teams to control, monitor, and troubleshoot very large and complex network architectures from a centralized platform while using automation to optimize performance and prevent downtime.

Software abstraction helps accelerate resource provisioning, reducing delays and friction between Dev and Ops. It can also be used to bring networking teams into the DevOps fold with automated, software-defined networks, creating what’s known as a NetDevOps environment.

Use common, centralized tools for software source control

Collaboration in DevOps means a whole team of developers or sysadmins may work on the same code base simultaneously. This is highly efficient — but risky. Development teams have used software source control tools like GitHub for years to track and manage code changes and prevent overwriting each other’s work. In a DevOps organization using IaC and SDN, the best practice is to incorporate infrastructure and network code into the same source control system used for software code.

Managing infrastructure configurations using a tool like GitHub helps ensure that sysadmins can’t make unauthorized changes to critical resources. Many ransomware attacks and other major outages start when an administrator changes infrastructure configurations directly, without testing or approval. This happened in a high-profile MGM cyberattack, when an IT staff member fell victim to social engineering and granted elevated Okta privileges to an attacker without needing approval from a second pair of eyes.

Using DevOps source control, all infrastructure changes must be reviewed and approved by a second party in the IT department to ensure they don’t introduce vulnerabilities or malicious code into production. Sysadmins can work quickly and creatively, knowing there’s a safety net to catch mistakes, which reduces Ops delays and fosters a more collaborative environment.
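The "second pair of eyes" rule can be expressed as a simple check: a change may only go to production once someone other than its author has approved it. The sketch below models that rule in isolation. The ChangeRequest class and can_merge function are hypothetical. In practice, the source control platform's own review and branch protection settings would enforce this.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed infrastructure change awaiting review (illustrative model)."""
    author: str
    description: str
    approvals: set[str] = field(default_factory=set)


def can_merge(change: ChangeRequest, required_approvals: int = 1) -> bool:
    """Allow a merge only with enough approvals from people other than the author."""
    independent = change.approvals - {change.author}
    return len(independent) >= required_approvals


if __name__ == "__main__":
    change = ChangeRequest(author="alice", description="Open port 8443 on branch firewall")
    change.approvals.add("alice")      # self-approval does not count
    print(can_merge(change))
    change.approvals.add("bob")        # an independent reviewer signs off
    print(can_merge(change))
```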

Consolidate and integrate DevOps tools with a vendor-neutral platform

An enterprise DevOps deployment usually involves dozens – if not hundreds – of different tools to automate and streamline the many workflows involved in a software development project. Having so many individual DevOps tools deployed around the enterprise increases the management complexity, which can have the following consequences.

  • Human error – The harder it is to stay on top of patch releases, security bulletins, and monitoring logs, the more likely it is that an issue will slip between the cracks until it causes an outage or breach.
  • Security complexity – Every additional DevOps tool added to the architecture makes integrating and implementing a consistent security model more complex and challenging, increasing the risk of coverage gaps.
  • Spiraling costs – With many different solutions handling individual workflows around the enterprise, the likelihood of buying redundant services or paying for unneeded features increases, which can impact ROI.
  • Reduced efficiency – DevOps aims to increase operational efficiency, but having to work across so many disparate tools can slow teams down, especially when those tools don’t interoperate.

The best practice is consolidating your DevOps tools with a centralized, vendor-neutral platform. For example, the Nodegrid Services Delivery Platform from ZPE Systems can host and integrate 3rd-party DevOps tools, unifying them under a single management umbrella. Nodegrid gives IT teams single-pane-of-glass control over the entire DevOps architecture, including the underlying network infrastructure, which reduces management complexity, increases efficiency, and improves ROI.

Maximize DevOps success

DevOps collaboration can improve operational efficiency and allow companies to release software at the velocity required to stay competitive in the market. Using software abstraction, centralized source code control, and vendor-neutral management platforms reduces friction on your DevOps journey. The best practice is to unify your DevOps environment with a vendor-neutral platform like Nodegrid to maximize control, cost-effectiveness, and productivity.

Want to simplify collaboration in DevOps with the Nodegrid platform?

Reach out to ZPE Systems today to learn more about how the Nodegrid Services Delivery Platform can help you simplify collaboration in DevOps.

 


Terminal Servers: Uses, Benefits, and Examples

Terminal servers are network management devices providing remote access to and control over remote infrastructure. They typically connect to infrastructure devices via serial ports (hence their alternate names, serial consoles, console servers, serial console routers, or serial switches). IT teams use terminal servers to consolidate remote device management and create an out-of-band (OOB) control plane for remote network infrastructure. Terminal servers offer several benefits over other remote management solutions, such as better performance, resilience, and security. This guide answers all your questions about terminal servers, discussing their uses and benefits before describing what to look for in the best terminal server solution.

What is a terminal server?

A terminal server is a networking device used to manage other equipment. It directly connects to servers, switches, routers, and other equipment using management ports, which are typically (but not always) serial ports. Network administrators remotely access the terminal server and use it to manage all connected devices in the data center rack or branch where it’s installed.

What are the uses for terminal servers?

Network teams use terminal servers for two primary functions: remote infrastructure management consolidation and out-of-band management.

  1. Terminal servers unify management for all connected devices, so administrators don’t need to log in to each separate solution individually. Terminal servers save significant time and effort, which reduces the risk of fatigue and human error that could take down the network.
  2. Terminal servers provide remote out-of-band (OOB) management, creating a separate, isolated network dedicated to infrastructure management and troubleshooting. OOB allows administrators to troubleshoot and recover remote infrastructure during equipment failures, network outages, and ransomware attacks.
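The value of a separate out-of-band path is easiest to see as a fallback order: when the production (in-band) network is down, management traffic moves to a dedicated path instead of stopping. The sketch below picks the first reachable path from a preference list. The path names are hypothetical, and the reachability check is passed in as a function so the logic can be shown without touching a real network.

```python
from typing import Callable, Optional, Sequence

# Hypothetical management paths, ordered from most to least preferred.
DEFAULT_PATHS = ("in-band", "oob-ethernet", "oob-cellular")


def select_path(is_up: Callable[[str], bool],
                paths: Sequence[str] = DEFAULT_PATHS) -> Optional[str]:
    """Return the first path that the supplied probe reports as reachable."""
    for path in paths:
        if is_up(path):
            return path
    return None  # no management path available; on-site work would be needed


if __name__ == "__main__":
    # Simulate a production outage in which only the cellular link still works.
    link_state = {"in-band": False, "oob-ethernet": False, "oob-cellular": True}
    print(select_path(lambda path: link_state[path]))
```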

Learn more about using OOB terminal servers to recover from ransomware attacks by reading How to Build an Isolated Recovery Environment (IRE).

What are the benefits of terminal servers?

There are other ways to gain remote OOB management access to remote infrastructure, such as using Intel NUC jump boxes. Despite this, terminal servers are the better option for OOB management because they offer benefits including:

The benefits of terminal servers

Centralized management

Even with a jump box, administrators typically must access the CLI of each infrastructure solution individually. Each jump box is also separately managed and accessed. A terminal server provides a single management platform to access and control all connected devices. That management platform works across all terminal servers from the same vendor, allowing teams to monitor and manage infrastructure across all remote sites from a single portal.

Remote recovery

When a jump box crashes or loses network access, there’s usually no way to recover it remotely, necessitating costly and time-consuming truck rolls before diagnostics can even begin. Terminal servers use OOB connection options like 5G/4G LTE to ensure continuous access to remote infrastructure even during major network outages. Out-of-band management gives remote teams a lifeline to troubleshoot, rebuild, and recover infrastructure fast.

Improved performance

Network and infrastructure management workflows can use a lot of bandwidth, especially when organizations use automation tools and orchestration platforms, potentially impacting end-user performance. Terminal servers create a dedicated OOB control plane where teams can execute as many resource-intensive automation workflows as needed without taking bandwidth away from production applications and users.

Stronger security

Jump boxes often lack the security features and oversight of other enterprise network resources, which makes them vulnerable to exploitation by malicious actors. Terminal servers are secured by onboard hardware Roots of Trust (e.g., TPM), receive patches from the vendor like other enterprise-grade solutions, and can be onboarded with cybersecurity monitoring tools and Zero Trust security policies to defend the management network.

Examples of terminal servers

Examples of popular terminal server solutions include the Opengear CM8100, the Avocent ACS8000, and the Nodegrid Serial Console Plus. The Opengear and Avocent solutions are second-generation, or Gen 2, terminal servers, which means they provide some automation support but suffer from vendor lock-in. The Nodegrid solution is the only Gen 3 terminal server, offering unlimited integration support for 3rd-party automation, security, SD-WAN, and more.

What to look for in the best terminal server

Terminal servers have evolved, so there is a wide range of options with varying capabilities and features. Some key characteristics of the best terminal server include:

  • 5G/4G LTE and Wi-Fi options for out-of-band access and network failover
  • Support for legacy devices without costly adapters or complicated configuration tweaks
  • Advanced authentication support, including two-factor authentication (2FA) and SAML 2.0
  • Robust onboard hardware security features like a self-encrypted SSD and UEFI Secure Boot
  • An open, Linux-based OS that supports Guest OS and Docker containers for third-party software
  • Support for zero-touch provisioning (ZTP), custom scripts, and third-party automation tools
  • A vendor-neutral, centralized management and orchestration platform for all connected solutions

These characteristics give organizations greater resilience, enabling them to continue operating and providing services in a degraded fashion while recovering from outages and ransomware. In addition, vendor-neutral support for legacy devices and third-party automation enables companies to scale their operations efficiently without costly upgrades.

Why choose Nodegrid terminal servers?

Only one terminal server provides all the features listed above on a completely vendor-neutral platform – the Nodegrid solution from ZPE Systems.

The Nodegrid S Series terminal server uses auto-sensing ports to discover legacy and mixed-vendor infrastructure solutions and bring them under one unified management umbrella.

The Nodegrid Serial Console Plus (NSCP) is the first terminal server to offer 96 management ports on a 1U rack-mounted device (Patent No. 9,905,980).

ZPE also offers integrated branch/edge services routers with terminal server functionality, so you can consolidate your infrastructure while extending your capabilities.

All Nodegrid devices offer a variety of OOB and failover options to ensure maximum speed and reliability. They’re protected by comprehensive onboard security features like TPM 2.0, self-encrypted disk (SED), BIOS protection, Signed OS, and geofencing to keep malicious actors off the management network. They also run the open, Linux-based Nodegrid OS, supporting Guest OS and Docker containers so you can host third-party applications for automation, security, AIOps, and more. Nodegrid extends automation, security, and control to all the legacy and mixed-vendor devices on your network and unifies them with a centralized, vendor-neutral management platform for ultimate scalability, resilience, and efficiency.

Want to learn more about Nodegrid terminal servers?

ZPE Systems offers terminal server solutions for data center, branch, and edge deployments. Schedule a free demo to see Nodegrid terminal servers in action.


Best Network Performance Monitoring Tools

Network performance monitoring tools provide visibility into the health and efficiency of networks and their underlying infrastructure of devices and software. Some platforms focus entirely on collecting and analyzing logs from various sources on the network, while others provide additional management capabilities that let you control, change, and troubleshoot network infrastructure. Choosing the right solution requires a thoughtful consideration of factors such as the cost, scalability, and interoperability of the software, as well as your team’s experience and abilities. This guide compares three of the best network performance monitoring tools by analyzing these critical factors before providing advice on the most scalable and cost-effective way to deploy your solutions.
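For a sense of what "health and efficiency" looks like as data, the sketch below summarizes a series of latency probes into packet loss and latency percentiles, which are typical building blocks of a monitoring dashboard. The sample values and function names are illustrative assumptions, and the percentile uses the simple nearest-rank method rather than any particular vendor's calculation.

```python
import math
from typing import Optional


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def summarize(probes_ms: list[Optional[float]]) -> dict[str, Optional[float]]:
    """Summarize probe results, where None marks a probe that got no reply."""
    if not probes_ms:
        raise ValueError("at least one probe result is required")
    received = [p for p in probes_ms if p is not None]
    loss_pct = 100.0 * (len(probes_ms) - len(received)) / len(probes_ms)
    return {
        "loss_pct": loss_pct,
        "p50_ms": percentile(received, 50) if received else None,
        "p95_ms": percentile(received, 95) if received else None,
    }


if __name__ == "__main__":
    # Hypothetical round-trip times in milliseconds; None means the probe was lost.
    probes = [20, 22, None, 21, 19, 250, 23, None, 20, 21]
    print(summarize(probes))
```

Tracking a high percentile alongside the median helps surface occasional spikes that an average would hide.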

Comparing the best network performance monitoring tools

Platform: SolarWinds Network Performance Monitor (NPM)

Key features:
  • Network device, performance, and fault monitoring
  • Deep packet inspection and analysis
  • LAN and WAN monitoring
  • Automatic network discovery, mapping, and monitoring
  • Network availability monitoring
  • Network diagnostics
  • Network path analysis
  • Network performance testing
  • SNMP monitoring
  • Wi-Fi analysis

Platform: Kentik

Key features:
  • Network telemetry dashboards
  • Multi-vendor network monitoring
  • Cloud, edge, and hybrid cloud monitoring
  • SaaS application performance & uptime monitoring
  • Intelligent automated alerts
  • SNMP, traffic flow, VPC, host agent, and synthetic monitoring
  • Multi-cloud performance monitoring
  • Kubernetes workload monitoring
  • SD-WAN monitoring
  • Network security monitoring
  • Network map visualizations
  • QoE monitoring

Platform: ThousandEyes

Key features:
  • Network availability and performance testing
  • WAN performance monitoring
  • Cisco SD-WAN monitoring and optimization
  • Browser session monitoring
  • Network path visibility
  • User Wi-Fi connectivity monitoring
  • VPN mapping and monitoring
  • Cross-layer data visualizations

Disclaimer: This comparison was written by a 3rd party in collaboration with ZPE Systems using data gathered from publicly available data sheets and admin guides, as of 10/20/2023. Please email us if you have corrections or edits, or want to review additional attributes: Matrix@zpesystems.com

SolarWinds Network Performance Monitor (NPM)

The Network Performance Monitor (NPM) is part of the SolarWinds Orion platform of integrated products. This mature and richly featured monitoring software is delivered as a cloud-based service and can observe SaaS (software as a service), cloud, hybrid cloud, and on-premises infrastructure. With advanced features like deep packet inspection (DPI), WAN optimization monitoring, automatic network mapping, and automated diagnostic tools, SolarWinds NPM is meant to be a complete, enterprise-grade observability solution. As part of the Orion platform, it’s also extensible with other products from the SolarWinds ecosystem, such as Network Configuration Manager. As an enterprise solution, SolarWinds NPM comes with a high price tag that grows as additional monitoring agents are added, which limits scalability. Another important factor to consider is that SolarWinds suffered a high-profile supply chain attack, disclosed in December 2020, that compromised thousands of customers, so there are security risks involved in trusting the Orion supply chain. Additionally, despite a large library of integrations, SolarWinds is a closed ecosystem that doesn’t work well with 3rd-party tools or custom scripts.

Pros

  • Supports SaaS, cloud, and on-premises networks
  • Includes advanced monitoring features like DPI
  • Part of a large ecosystem of observability and management solutions

Cons

  • Pricing is expensive and limits scalability
  • Recently suffered a high-profile breach that impacted thousands of customers
  • Closed ecosystem may not support your 3rd-party tools

Kentik

Kentik is an end-to-end network observability platform for cloud, multi-cloud, hybrid cloud, SaaS, and data center infrastructure. In addition to network performance monitoring, the platform includes monitoring solutions for SaaS application performance and SD-WAN performance. Other observability features include SaaS uptime monitoring, AI-driven insights and alerts, network security monitoring, and QoE (Quality of Experience) monitoring. Kentik also recently launched a Kubernetes network monitoring solution called Kentik Kube that provides end-to-end cluster visibility. Overall, Kentik is a powerful network observability platform that includes many of its most innovative features in its “Essentials” and “Pro” pricing packages, providing a lot of bang for your buck. The downside is that you can’t subscribe to features individually and must purchase a whole package, meaning you could end up paying for features you don’t need. Because Kentik is not a large vendor, its customer service may be slow to respond in some cases. Additionally, although Kentik does have a large library of integrations, it is not a vendor-neutral platform.

Pros

  • Supports cloud, multi-cloud, hybrid cloud, SaaS, and data center infrastructure
  • Includes many advanced features and solutions at no additional cost
  • Provides AI-driven network insights and intelligent alerts

Cons

  • Products aren’t available a la carte
  • Customer service and technical support can be slow to respond
  • Isn’t entirely vendor-neutral

ThousandEyes

ThousandEyes is a digital experience monitoring platform primarily focused on network and application synthetic testing, end-user performance monitoring, and ISP Internet monitoring for SaaS, cloud, and on-premises networks. Additionally, ThousandEyes is part of the Cisco family and can be used to monitor and optimize Cisco SD-WAN architectures. Across its family of observability products, ThousandEyes includes features like wireless network visibility, SaaS performance visualizations, cloud application outage detection, and SD-WAN performance forecasting. The major advantage of the ThousandEyes platform is that it provides true end-to-end visibility of the entire service delivery chain, including end-user device performance and third-party provider availability. One downside is that the endpoint agent-based monitoring solution requires on-premises VMs to run, which can be cumbersome to maintain and limits scalability. The pricing is expensive compared to similar solutions, and you may have to combine products to get all the features you need. Additionally, ThousandEyes is not a vendor-neutral platform and has a relatively small library of integrations.

Pros

  • Supports SaaS, cloud, and on-premises networks
  • Works with Cisco DNA software for SD-WAN monitoring
  • Provides end-to-end visibility of the entire service delivery chain

Cons

  • Agent-based monitoring requires on-premises VMs, limiting scalability
  • Pricing is expensive compared to similar solutions
  • Limited integrations, preventing interoperability

Conclusion

Each of the solutions on this list has advantages that make it well-suited to certain environments, as well as limitations to consider. SolarWinds NPM is part of a large ecosystem of observability and management solutions that includes advanced features like DPI, but it suffered a major security breach and has a closed ecosystem. Kentik packs a lot of innovative, AI-driven monitoring capabilities into its platform offerings, but its pricing tiers are inflexible, and it doesn’t have the large, enterprise-grade support team of its larger competitors. ThousandEyes provides end-to-end visibility of the entire service delivery chain and works seamlessly with Cisco DNA software, but it has a steep learning curve and a limited library of integrations.

How to run the best network performance monitoring tools

Most network performance monitoring tools, even cloud-based SaaS offerings, communicate with endpoint agents using software deployed on VMs (virtual machines) running on-premises in each business location. Running these VMs on fully provisioned servers or PCs is expensive, but deploying them on NUCs (small form-factor mini PCs) is highly insecure, especially as organizations scale out with distributed branches and edge computing sites. What’s needed is a consolidated hardware solution that combines critical branch, edge, and data center networking functionality with vendor-neutral VM and application hosting, such as the Nodegrid platform from ZPE Systems. Nodegrid’s serial switches and network edge routers run the open, Linux-based Nodegrid OS, which can host your choice of third-party software, including Docker containers, for network performance monitoring, SD-WAN, security, automation, and more. Nodegrid’s versatile, modular hardware solutions also provide out-of-band (OOB) management access to critical remote infrastructure and monitoring solutions, giving teams a lifeline to recover from outages and ransomware attacks. Nodegrid uses innovative, enterprise-grade security features like Secure Boot, self-encrypted disk, and two-factor authentication (2FA), and its onboard software is frequently patched for vulnerabilities to defend against a breach. Deploying Nodegrid at each business site consolidates your network to reduce hardware overhead, streamlining management and enabling easy scalability.

Deploy the best network performance monitoring tools with Nodegrid

Reach out to ZPE Systems to see a demo of how the best network performance monitoring tools run on the Nodegrid platform.

Breaking Down The 2023 Ragnar Locker Cyberattacks


This article was written by James Cabe, CISSP, a 30-year cybersecurity expert who’s helped major companies including Microsoft and Fortinet.

Throughout 2023, several organizations were successfully hit by Ragnar Locker cyberattacks. The affected victims spanned the globe and were forced to shut down much of their critical operations, while the attackers demanded tens of millions of dollars in ransom payments. Despite the group being taken down by law enforcement in October, organizations are re-evaluating their defensive measures — and more importantly, their recovery strategies — to combat these attacks.

If you read my previous articles about the ongoing MOVEit breach and the ransomware that hit MGM, you probably know that isolation is key. It helps you fight through attacks by cutting the kill chain, so that you can restore services quickly without reinfection.

Who Carries Out Ragnar Locker Cyberattacks?

Recent Ragnar Locker cyberattacks were carried out by the Dark Angels Team cybercriminal group. Dark Angels Team’s modus operandi is to breach a company’s defenses, spread laterally, and steal data that can be used to extort the target company. The approach they take involves gaining access to the Windows domain controller, where they deploy ransomware. They encrypt devices using Windows and ESXi encryptors, which gives organizations little recourse aside from taking their critical systems offline in order to stop the spread.

Dark Angels banner

How Do Ragnar Locker Cyberattacks Start?

Ragnar Locker breaches, like all ransomware attacks, follow a kill chain that must first be initiated. MITRE ATT&CK calls this first stage “Initial Access,” and in these attacks, initial access comes from social engineering. Email stuffing is often the tactic of choice, whereby the attacker sends an email that appears to have a trail of replies or forwards (see the example below). Email trails like this trick spam filters and land directly in the target’s inbox. When an employee clicks a malicious link inside the email, the attack kicks off.

An email showing an example of email stuffing.

Image: Email stuffing is used by marketers and threat actors alike to bypass spam filters.
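A mail filter could look for this trick in several ways. The sketch below shows one rough heuristic, not a tested detection rule: a message whose subject claims to be a reply, but which carries no In-Reply-To or References header, probably did not come from a real thread. The function name and the heuristic itself are assumptions made for illustration, and an attacker who forges those headers would slip past it.

```python
import re
from email import message_from_string

# Matches subjects that start with "Re:" in any letter case.
REPLY_PREFIX = re.compile(r"^\s*re\s*:", re.IGNORECASE)


def looks_like_fake_reply(raw_message: str) -> bool:
    """Return True when the subject claims a reply but no threading headers exist.

    Hypothetical heuristic: genuine replies normally carry an In-Reply-To
    or References header that points at the earlier message.
    """
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    has_thread_headers = bool(msg.get("In-Reply-To") or msg.get("References"))
    return bool(REPLY_PREFIX.match(subject)) and not has_thread_headers


if __name__ == "__main__":
    sample = "From: sender@example.com\nSubject: RE: RE: Invoice\n\nPlease see the link."
    print(looks_like_fake_reply(sample))
```

A flag from a check like this is best treated as a reason to quarantine or warn the user, since some legitimate mail clients also omit threading headers.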

How Do Companies Discover Ragnar Locker Cyberattacks?

After the Ragnar Locker cyberattack kicks off, the bad link uses Java to load the locker ransomware, then a series of batch scripts installs a payload consisting of virtual box emulation software. This emulation software takes over and encrypts the host, and displays the ransomware message (see image below).

A Ragnar Locker ransomware message shown in a notes file.

Image: A Ragnar Locker ransomware message showing on encrypted devices.

How Do Ragnar Locker Cyberattacks Spread?

The attack spreads by gaining access to Windows domain controllers and then attacking the management interfaces of the VMware ESXi machines. Most organizations don’t properly segment or isolate these management interfaces. This makes them especially vulnerable even to older Babuk ransomware source code that is an ESXi encryptor. Basically, the attackers only need to gain access to the management network, and then they can attack the production network.
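As a first-pass audit of the segmentation gap described above, a team might check whether any management interface address sits inside a production subnet. This is a minimal sketch under stated assumptions: the function name, the inventory lists, and the example addresses are all hypothetical, and address overlap is only a proxy for isolation. Routing and firewall rules still need to be reviewed separately.

```python
import ipaddress


def exposed_management_interfaces(mgmt_ips, production_subnets):
    """Return the management IPs that fall inside any production subnet.

    Both arguments are hypothetical inventory lists of strings, for example
    exported from an asset database.
    """
    networks = [ipaddress.ip_network(subnet) for subnet in production_subnets]
    return [
        ip
        for ip in mgmt_ips
        if any(ipaddress.ip_address(ip) in net for net in networks)
    ]


if __name__ == "__main__":
    # Example values only; replace with your own inventory.
    mgmt = ["10.0.1.5", "192.168.100.10"]
    prod = ["10.0.0.0/16"]
    print(exposed_management_interfaces(mgmt, prod))
```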

From Intel471: “VMware’s ESXi is called a ‘bare metal’ hypervisor because the underlying hardware on which it is installed doesn’t need an operating system. ESXi allows the hardware to be utilized for multiple virtual machines (VMs), which saves on hardware costs. ESXi is a fruitful target for attackers since it may be connected to several VMs and the storage for them. Security experts warn ransomware actors have built specific binaries to target these systems. Groups joining this trend include HelloKitty, Black Basta, Cheerscrypt and GwisinLocker.”

They continue, “Over the last few years, several vulnerabilities have been identified in ESXi, including CVE-2021-21974. The vulnerability is a heap overflow vulnerability within Open Service Location Protocol (OpenSLP), which is a network discovery tool. The vulnerability is remotely exploitable over port 427, and has a Common Vulnerability Scoring System Version 3.0 (CVSSv3) base score of 8.8. It’s suspected that it may be the vulnerability exploited in this attack. VMware said that ‘significantly out-of-date products’ were targeted with vulnerabilities that had been addressed. It affects ESXi versions 7.0 before ESXi70U1c-17325551, 6.7 before ESXi670-202102401-SG and 6.5 before ESXi650-202102101-SG. Due to other vulnerabilities in OpenSLP, VMware disabled OpenSLP starting in 2021 in ESXi versions 7.0 U2c and ESXi 8.0, which is the current version.”

Ultimately, these attacks exploit a combination of two gaps: a lack of management plane isolation for the VMware management interfaces, specifically on port 427 (OpenSLP), and a lack of patching and updating. Organizations also typically lack a backup authentication mechanism for the control plane, as well as Privileged Access Management, which are both good fallback options.
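To test the isolation gap described above, one simple check is to try a TCP connection to port 427 from a host on the production network. The sketch below assumes a hypothetical helper name and checks TCP only, while SLP can also use UDP, so a negative result is not proof of isolation. Run it only against systems you are authorized to test.

```python
import socket


def tcp_port_reachable(host: str, port: int = 427, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address; substitute an ESXi management IP from your own inventory.
    print(tcp_port_reachable("192.0.2.10"))
```

If this returns True when run from a production host, the management interface is reachable where it should not be, and the host’s patch level deserves an immediate look.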

How Can Companies Stop Ragnar Locker Cyberattacks?

Ragnar Locker ransomware and other attacks are successful because companies don’t employ proper management plane isolation. Attackers can gain access to VMware management interfaces, and then they essentially have the keys to the kingdom. That’s it. No amount of defense can save you.

If you recall CISA’s binding operational directive, they call for an isolated management infrastructure. This is what we refer to as IMI. Rather than serving as a defense, like we think of traditional cybersecurity products, the IMI is an architecture that allows you to fight back. It’s your quick-reaction force, your cavalry, your secret weapon that ensures you always have a counterattack ready to deploy.

IMI is infrastructure that is dedicated — and most importantly, fully isolated from production assets — to ensuring operations can recover quickly from breaches and outages. Here’s a graphical breakdown:

Isolated Management Infrastructure diagram

The IMI includes all of the tools you need for rerouting traffic, decommissioning affected gear, wiping/re-imaging devices, and restoring infrastructure. You can also incorporate automation to speed the process along and make recovery something that happens in minutes or hours at the most. Aside from being completely isolated from production assets, the IMI itself is also segmented and employs zero trust practices. This means that you and only you have access to your secret weapon for cutting the ransomware kill chain.

How Do You Use Isolated Management Infrastructure?

An IMI can host an IRE (Isolated Recovery Environment), which is used to cut off all user data and remote access (except for OOB) to an entire infected site. A properly implemented recovery environment should automate most of these activities to speed up the recovery. One of the first considerations is the requirement for a secondary organization in your IAM that is not attached to normal operations. This is what is known as a set of “Break the Glass” accounts. The concept is well known in military circles and has made its way into formal practice as part of a strong ransomware playbook. Once these accounts are in place, you can instantiate selected Zero Trust remote access to the site using credentials that are not in the scope of the attack, and then bring up a communications channel for a virtual war room using software like Rocket.Chat, Jitsi, Slack, or other standalone communications tools that can be installed in the IRE.
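The separation requirement for “Break the Glass” accounts can be expressed as a simple audit. The sketch below is illustrative only: the Account fields, the organization labels, and the function name are assumptions, not part of any particular IAM product. It flags any break-glass account that lives in the production IAM organization or uses a production email domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    username: str
    iam_org: str       # hypothetical label for the IAM organization or tenant
    email_domain: str  # domain of the contact address tied to the account


def break_glass_violations(break_glass, production_org, production_email_domains):
    """Return human-readable problems for accounts that are not truly separate."""
    problems = []
    for acct in break_glass:
        if acct.iam_org == production_org:
            problems.append(f"{acct.username}: lives in the production IAM organization")
        if acct.email_domain in production_email_domains:
            problems.append(f"{acct.username}: uses a production email domain")
    return problems


if __name__ == "__main__":
    # Example values only.
    accounts = [
        Account("bg-admin-1", "recovery-org", "recovery.example.net"),
        Account("bg-admin-2", "corp-org", "corp.example.com"),
    ]
    for problem in break_glass_violations(accounts, "corp-org", {"corp.example.com"}):
        print(problem)
```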

Avoiding normal authentication methods, IAM, and communication channels is required for the integrity of the recovery and strengthens the recovery playbook. During this time, do not use any email that is associated directly with the organization. Ideally, email should never touch an account that is associated with the organization either.

The next step is to create a new set of clean-side networks that do not directly connect to the main backbone, or to put them behind another firewall so that machines can be triaged as good or bad. Using sniffer software running on the IRE, the recovery team can then run a passive or active scan against all machines that are still trying to send email to Exchange/M365. You can give access to users whose machines are deemed good (not sending traffic), but use an EDR to block them from opening Outlook for a while and keep them on web email instead. From there, continue working through the site to find all the sending machines and check whether each has a good backup. If not, back up the infected drive for offline data retrieval later. Then re-image while scanning the UEFI BIOS during boot (if needed, run an IPMI scan). If the site has a list of assets that are considered crown jewels, prioritize these.
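The good/bad triage described above can be sketched as a small sorting step over sniffer output. Everything here is an assumption for illustration: the flow records are simplified to (source host, destination) pairs, the mail server set is supplied by the recovery team, and the function name is hypothetical. Hosts seen talking to a mail server are marked suspect, and crown-jewel assets are moved to the front of the suspect queue.

```python
def triage_hosts(all_hosts, observed_flows, mail_servers, crown_jewels=frozenset()):
    """Split hosts into (clean, suspect) based on observed mail traffic.

    observed_flows: iterable of (source_host, destination) pairs from a sniffer.
    mail_servers:   set of destinations that count as Exchange/M365 endpoints.
    crown_jewels:   assets to handle first within the suspect list.
    """
    senders = {src for src, dst in observed_flows if dst in mail_servers}
    clean = [host for host in all_hosts if host not in senders]
    suspect = [host for host in all_hosts if host in senders]
    # Stable sort: crown jewels first, original order otherwise preserved.
    suspect.sort(key=lambda host: host not in crown_jewels)
    return clean, suspect


if __name__ == "__main__":
    # Example values only.
    hosts = ["pc1", "pc2", "db1", "pc3"]
    flows = [("pc2", "mail.example.com"), ("db1", "mail.example.com")]
    print(triage_hosts(hosts, flows, {"mail.example.com"}, crown_jewels={"db1"}))
```

Silence on the wire is weak evidence that a machine is clean, so hosts in the clean list would still warrant the Outlook restriction described above.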

Once you have a segmented “clean side” established with all the network services required to operate the site (DNS, IAM, DHCP), Internet access can be restored to the site on a limited basis, which means outbound communications only and nothing inbound. Restorative operations can continue apace, making sure that the infected-side assets are captured in backup for later forensics, following chain of custody, in case damages are found to exceed insurance limits. This is decided in the war room.
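The restoration gate described above reduces to two checks: are the required services up on the clean side, and is the connection outbound? The following sketch uses hypothetical function names and plain strings for service names and traffic direction. A real deployment would enforce this in firewall policy, with code like this serving only as a pre-flight check in a recovery runbook.

```python
# Services the text lists as required before Internet access is restored.
REQUIRED_SERVICES = {"dns", "iam", "dhcp"}


def ready_for_internet(services_up):
    """Return True when every required clean-side service is reported up."""
    return REQUIRED_SERVICES <= {name.lower() for name in services_up}


def allow_connection(direction, clean_side_ready):
    """Permit only outbound traffic, and only once the clean side is ready."""
    return clean_side_ready and direction == "outbound"


if __name__ == "__main__":
    ready = ready_for_internet(["DNS", "IAM", "DHCP"])
    print(allow_connection("outbound", ready))
    print(allow_connection("inbound", ready))
```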

Download the Isolated Management Infrastructure Blueprint

Now is the time to lay the groundwork for your IMI so you can fight back against ransomware. Download the Network Automation Blueprint, which gives you a step-by-step guide to building your Isolated Management Infrastructure.

Get in touch with me!

True security can only be achieved through resilience, and that’s my mission. If you want help shoring up your defenses, building an IMI, and implementing a Resilience System, get in touch with me. Here are links to my social media accounts: