Benefits of Single-Homed Nodes

Introduction

Redundancy is a fundamental tenet of good systems design and one that you have likely implemented at multiple levels in your existing systems. You don’t run mission-critical processes on just one server, house your data in just one data center, or rely entirely on just one connection to deliver data to your customers. At each level, you configure redundant systems so as to avoid having any single point of failure.

Similarly, it is important to build redundancy into your digital experience monitoring strategy to ensure that you receive consistent, reliable, actionable data, and don’t miss issues that might affect your end-users. Catchpoint provides everything you need to accomplish these goals. However, as with any toolkit, you need to understand how each component works to use it effectively.

This article explains why Catchpoint has chosen a “single-homed” backbone node design and provides guidance on how to utilize these nodes in a way that results in a resilient monitoring strategy.

Nodes Overview

Catchpoint’s worldwide fleet of test nodes provides a multitude of vantage points from which to monitor your systems. There are several node types, each defined by the type of provider they are located within and/or connected to:

Backbone nodes connect directly to major global and regional ISPs.
Last-mile nodes connect to regional consumer ISPs from users’ home networks.
Cloud nodes are located within major cloud providers such as AWS, Azure, Google, Oracle, Tencent, AliCloud, and IBM. These nodes rely on the cloud providers’ default networks, which are multi-homed.
Wireless nodes connect to regional 3G and 4G networks.

This article focuses primarily on the design of Catchpoint’s backbone nodes. These nodes provide a view of the performance of your systems from outside the infrastructure in which they reside, enabling them to accurately represent the end-user experience while excluding the variations introduced by end users’ networks.

Next, we will look closer at how these nodes work.

Single-homed Versus Multi-homed

Each backbone node consists of several redundant instances of Catchpoint’s node software running in a specific physical location. In each geographic region, there are multiple “single-homed” backbone nodes. In other words, each node is connected to a single Tier-1 ISP.

Each single-homed node routes all traffic through the one ISP it is connected to. Catchpoint deliberately chose this design for our backbone nodes as opposed to using “multi-homing,” where one node per region would be connected to multiple ISPs.

A multi-homed node may route any request through any of its connected ISPs, with the routing handled transparently and automatically by BGP policies.

Since redundancy is so important, you might wonder why Catchpoint didn’t simply set up one multi-homed node in each region. After all, wouldn’t this have been simpler to implement, while providing connection redundancy inherently? This is an understandable line of reasoning, but there are significant drawbacks to multi-homing when it comes to performance monitoring. The following sections provide a closer look at these drawbacks and show how you can leverage Catchpoint's single-homed nodes to achieve redundancy without increasing costs.

Why Not Multi-home?

Multi-homed Performance Monitors Produce Noise

An effective performance monitoring strategy begins with establishing a solid baseline measurement. The more precise your baseline, the easier it is to detect significant deviations that might indicate a problem. To establish a useful baseline, your measuring instruments must take their measurements in exactly the same way every time. If they keep changing what they measure, or how they measure it, this introduces “noise,” which makes your baseline less precise and actual performance deviations more difficult to detect.

This highlights the fundamental issue with multi-homing when it comes to performance monitoring. Recall that with a multi-homed node, routing is handled by BGP, so you can't predict or control which connection the node will use for any given request. Each ISP will have its own BGP routing policies as well, so they will use different routes to reach your data center (and third-party hosts). If your service is hosted in multiple data centers, two different ISPs might not even route requests to the same data center! It is also possible that a multi-homed node could send requests out via one connection and receive responses back via another, as in the diagram below.

All of these complexities make performance metrics generated by multi-homed nodes inherently unreliable for a monitoring strategy. This means a less precise baseline, which makes it much more difficult to distinguish real performance changes from all the noise.
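The effect of mixed paths on a baseline can be illustrated with a minimal sketch. The response times below are made-up values rather than real measurements, and the helper names (`baseline`, `is_outlier`) and the three-standard-deviation rule are assumptions chosen for illustration, not part of any Catchpoint feature.

```python
import statistics

# Hypothetical response times in milliseconds; not real measurements.
isp_a_ms = [100, 102, 98, 101, 99, 100, 103, 97]      # path via ISP A
isp_b_ms = [140, 138, 142, 141, 139, 140, 137, 143]   # path via ISP B


def baseline(samples):
    """Return (mean, population standard deviation) of the samples."""
    return statistics.mean(samples), statistics.pstdev(samples)


def is_outlier(value, mean, sd, k=3.0):
    """Flag a value that is more than k standard deviations from the mean."""
    return abs(value - mean) > k * sd


# A single-homed node only ever measures one path.
single_mean, single_sd = baseline(isp_a_ms)

# A multi-homed node mixes both paths into one series.
multi_mean, multi_sd = baseline(isp_a_ms + isp_b_ms)

# A 25 ms slowdown on the ISP A path (100 ms -> 125 ms).
slow_ms = 125
print("single-homed baseline flags it:", is_outlier(slow_ms, single_mean, single_sd))
print("multi-homed baseline flags it: ", is_outlier(slow_ms, multi_mean, multi_sd))
```

With these sample values, the mixed series has a much wider spread than either path alone, so the same slowdown that stands out against the single-path baseline falls well inside the mixed baseline.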

Multi-homed Node Design Risk: False Positives

You may have heard the fable of the boy who cried wolf. This tale is a classic example of a “false positive,” where an alarm is sounded even though there is actually no threat. One of the consequences of noise in your performance measurements is the potential for false-positive alerts.

Suppose you were monitoring your website from a multi-homed node and received an alert about an apparent change in performance. It is possible that something has gone wrong, but it might simply be the result of BGP rerouting traffic through another ISP with different performance characteristics. You’d have to do some digging and analysis just to figure out whether the alert is telling you about a real customer-impacting issue or is merely the result of the routing change.

In contrast, a single-homed node provides a fixed vantage point, always routing every request through the same ISP. This means that when you see a significant change in that node’s test results, you can be much more confident that it indicates a real issue that needs attention. The performance change could still be due to external routing, either by the transit ISP or your ISP, but these are issues that do impact your users, and which you may be able to influence through your relationships with the providers.

Multi-homed Node Design Risk: False Negatives

When it comes to monitoring, the problem of false positives is obvious and easy to understand – nobody likes getting a bunch of alerts only to investigate and find out that there is nothing wrong. But there is another more subtle type of problem that multi-homed nodes can introduce, which is referred to as a “false negative.” This is when there actually is an issue that you should be aware of, but your monitors are blind to it, so you never receive any alert at all.

Suppose BGP happens to be routing all traffic from the multi-homed node through just one of the two ISPs you have in your data center. Meanwhile, another ISP in the region experiences a significant performance degradation or outage. For all you know, everything is just fine: everything on your dashboards looks good and your alert queue is quiet. Because the multi-homed node is favoring just one ISP and ignoring the other, it has left you blind to a network issue that potentially impacts many of your users and affects the perceived performance of your systems.

If, instead, you were monitoring via several single-homed nodes, each connected to a different ISP, you would catch the ISP issue immediately.

If you also targeted nodes in other geographies, you’d be able to see whether the provider’s issue is regional or global. This type of intelligence is especially valuable if you deliver services via multi-homed connections, as you can use it to adjust your own BGP policies.
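As a rough sketch of this kind of analysis, the snippet below groups test results by ISP within a region and flags any ISP whose median response time is far above that of the others. The data, the ISP and region labels, the function names, and the `factor=2.0` threshold are all hypothetical; this is plain Python over made-up results, not a Catchpoint API.

```python
from collections import defaultdict
from statistics import median

# Hypothetical test results: (region, ISP of the single-homed node, response ms).
results = [
    ("New York", "ISP-A", 101), ("New York", "ISP-A", 99),
    ("New York", "ISP-B", 350), ("New York", "ISP-B", 365),
    ("New York", "ISP-C", 104), ("New York", "ISP-C", 98),
    ("London", "ISP-A", 120), ("London", "ISP-A", 118),
    ("London", "ISP-B", 122), ("London", "ISP-B", 119),
]


def median_by_isp(rows, region):
    """Return {isp: median response time} for one region."""
    grouped = defaultdict(list)
    for row_region, isp, ms in rows:
        if row_region == region:
            grouped[isp].append(ms)
    return {isp: median(values) for isp, values in grouped.items()}


def degraded_isps(rows, region, factor=2.0):
    """List ISPs whose median exceeds `factor` times the median of the other ISPs."""
    medians = median_by_isp(rows, region)
    flagged = []
    for isp, value in medians.items():
        others = [v for name, v in medians.items() if name != isp]
        if others and value > factor * median(others):
            flagged.append(isp)
    return sorted(flagged)


for region in ("New York", "London"):
    print(region, degraded_isps(results, region))
```

In this made-up data set, one ISP is flagged in New York and none in London, which would suggest a regional rather than a global problem with that provider.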

Building Redundancy into Your Monitoring Strategy

We’ve established how relying on one multi-homed node per region exposes you to noise, false positives, and false negatives. But simply switching to one single-homed node per region is not the best solution either, as this would sacrifice connection redundancy entirely and still cause false negatives, because you would measure from only one ISP. If you were only monitoring through one ISP and that provider went down, you would see a total interruption in your data for that region.

We recommend the following best practices, which take full advantage of the tools Catchpoint puts at your disposal to build not only redundancy but also better visibility into your monitoring strategy.

Select multiple nodes per city/region, so that if one node is unable to perform tests for any reason, you will still receive results from other nodes.
Select several nodes in adjacent cities to augment your monitoring and simulate the experiences of a wider range of users.
Select backbone nodes on a variety of ISPs so that you can detect ISP-specific issues, and so that if any single ISP goes down, you will continue to receive test results from the others.
Select other types of nodes to supplement and corroborate the data from the backbone nodes. For example, use a cloud node if you just need to monitor the availability of a cloud-based application. These nodes are multi-homed because their purpose is simply to monitor cloud-platform availability and not to capture external performance-related issues.
Take advantage of Targeting and Scheduling to distribute a fixed number of test runs across all of the selected nodes. This way you can avoid increasing your monitoring spend, while still providing thorough testing coverage.
Configure your alerts to trigger only after an appropriate number of nodes indicate a performance change. This helps avoid false positives due to the odd issue with a single node.
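The last practice, alerting only when several nodes agree, can be sketched as follows. This is an illustration of the logic only: the function name, the node names, and the `threshold_pct` and `min_nodes` parameters are hypothetical and do not correspond to actual Catchpoint alert settings.

```python
def should_alert(node_results, baseline_ms, threshold_pct=50.0, min_nodes=2):
    """Decide whether to alert based on how many nodes breach the threshold.

    node_results: dict mapping node name -> latest response time in ms.
    baseline_ms: expected response time in ms.
    Returns (alert, breaching_nodes).
    """
    limit = baseline_ms * (1 + threshold_pct / 100)
    breaching = sorted(name for name, ms in node_results.items() if ms > limit)
    return len(breaching) >= min_nodes, breaching


# Hypothetical node names and readings.
one_bad_node = {"NYC-ISP-A": 310, "NYC-ISP-B": 102, "NYC-ISP-C": 98}
two_bad_nodes = {"NYC-ISP-A": 310, "NYC-ISP-B": 295, "NYC-ISP-C": 98}

print(should_alert(one_bad_node, baseline_ms=100))
print(should_alert(two_bad_nodes, baseline_ms=100))
```

A single slow node does not trigger the alert in this sketch, while two slow nodes do. The right value for `min_nodes` depends on how many nodes you target and how quickly you need to be notified.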

Example Scenario 1: Digital Experience Monitoring

Suppose you want to monitor the performance of your website from New York and Los Angeles. Your monitoring budget will cover a total of 60 test runs per hour. You might initially consider selecting one backbone node in each city and scheduling your test to run on each node every two minutes (30 test runs per hour in each city). Let’s look closer at that scenario.

Now suppose that the ISP in New York becomes unavailable for an hour due to unexpected maintenance by the provider. Due to the lack of redundancy in node targeting, this would result in a total interruption to your monitoring from the New York region.

Let’s reconsider and instead try targeting multiple nodes per city and configuring Catchpoint to run the test on a random node every two minutes.

By targeting multiple nodes with random node selection, your costs are the same and an ISP failure does not interrupt monitoring.
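The two targeting choices in this scenario can be compared with a small simulation. It is a sketch under stated assumptions: the node names and ISPs are invented, each city targets three nodes on three different ISPs in the redundant case, and a run counts as successful whenever the randomly chosen node is not on the failed ISP. It does not call or model any Catchpoint scheduling API.

```python
import random

RUNS_PER_CITY_PER_HOUR = 30  # one run every two minutes, as in the scenario


def simulate_hour(nodes, failed_isp, rng):
    """Simulate one hour of test runs for one city.

    nodes: list of (node_name, isp) tuples targeted in that city.
    failed_isp: ISP that is unavailable for the whole hour.
    Returns (runs_scheduled, runs_succeeded).
    """
    succeeded = 0
    for _ in range(RUNS_PER_CITY_PER_HOUR):
        _node, isp = rng.choice(nodes)  # a random node is picked for each run
        if isp != failed_isp:
            succeeded += 1
    return RUNS_PER_CITY_PER_HOUR, succeeded


# Hypothetical node names and ISPs, for illustration only.
single_target = [("NYC-1", "ISP-A")]
multi_target = [("NYC-1", "ISP-A"), ("NYC-2", "ISP-B"), ("NYC-3", "ISP-C")]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
single_runs, single_ok = simulate_hour(single_target, "ISP-A", rng)
multi_runs, multi_ok = simulate_hour(multi_target, "ISP-A", rng)

print(f"one node:    {single_ok}/{single_runs} runs returned data")
print(f"three nodes: {multi_ok}/{multi_runs} runs returned data")
```

Both configurations schedule the same 30 runs per hour in New York, so the cost is unchanged. With three equally weighted nodes, about two-thirds of the runs (20 of 30 on average) still return data while one ISP is down.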

Example Scenario 2: Cloud Application Monitoring

Let’s suppose you want to monitor your cloud-based application, and you only care about the application's performance and availability and whether it executes what it needs to. You are not concerned about reachability to the application from various places on the internet, the performance and SLAs of ISPs, cloud provider SLAs, other vendors' SLAs, or the end-user's experience.

The best way to accomplish this would be to target cloud nodes, which are located within your cloud providers’ multi-homed networks, and in the same geographical locations where your application resides. Please note that certain cloud providers (Microsoft Azure, Google Cloud) block traceroutes in some or all of their regions, and there is no workaround for this limitation.

Conclusion

Catchpoint provides a very powerful toolkit that can help you accomplish almost any digital experience monitoring goal, but it is essential to understand how each component of the system works in order to use it effectively. Our single-homed backbone nodes provide fixed vantage points from which to measure your systems, avoiding the noise inherent in multi-homed nodes. By targeting multiple backbone nodes and taking advantage of other node types as well, you can build redundancy into your monitoring system so that it provides consistent, clear, actionable data, for a truly resilient monitoring strategy.
