Ensuring Catchpoint’s resilience: why we are changing our data storage and analytics policy

Availability remains a critical component of Catchpoint’s pillars of monitoring.

We work with our customers to help them achieve excellent user experiences through available services – and we apply the same practices to our own internal systems.

This is why we carefully analyze our own Internet Stack to minimize the dependencies that we have in common with our customers. When there’s a common dependency, we add redundancies through other providers to ensure resiliency. To do this, we provide thousands of worldwide vantage points in real networks that customers can use to baseline application performance. We also run our core analytics architecture from dedicated infrastructure in the Tier 5 Switch SuperNAP7, so that there are no shared dependencies with our customers.

As Catchpoint’s products, our customer base, and the complexity of the telemetry we collect all grow, we’ve seen an increase in unpredictable data patterns and schemas. For example, customers generally use RUM or OpenTelemetry tracing data to monitor the real-user performance of their applications.

On Black Friday, the volume of this data increases severalfold. It’s difficult to scale dedicated infrastructure in a datacenter when this happens. We’ve been doing it for years, but it’s becoming harder as data volumes grow.

While we used to insist on hosting all services in the core datacenter, the current landscape of the internet favors ubiquitous cloud services, which offer significant advantages over private hosting, including the ability to scale as needed.

After thorough consideration of our customers’ many use cases, we’ve decided to leverage public cloud (IaaS) vendors for certain data.

Our promise is that our services will be available whenever you need immediate discovery and notification of problems. We will continue to design our infrastructure to minimize dependency overlap with customers for critical use cases such as Availability, Reachability, and Performance baselining.

By migrating some components to the public cloud, we can provide faster feature development and be more responsive at a larger scale for both data ingestion and analytics.

At this time, the following components are being migrated:

RUM, Tracing (which has always been cloud-based), and Capture data may be stored in the public cloud.
Data pipelines will ensure resilience by using multiple providers (including cloud) and paths to deliver data to the core datacenter from the point of measurement.
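
As a rough illustration of the multi-provider, multi-path idea, the sketch below tries a list of delivery paths in order and stops at the first one that succeeds. It is not Catchpoint’s implementation: the names `deliver`, `DeliveryPath`, and `DeliveryError` are hypothetical, and a real pipeline would likely add retries, timeouts, and buffering at the point of measurement.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical type: a delivery path accepts a payload and raises on failure.
DeliveryPath = Callable[[bytes], None]


class DeliveryError(Exception):
    """Raised when every configured path has failed."""


def deliver(payload: bytes, paths: Iterable[Tuple[str, DeliveryPath]]) -> str:
    """Try each (name, path) pair in order; return the name of the first success."""
    failures = []
    for name, send in paths:
        try:
            send(payload)
            return name
        except Exception as exc:  # a real pipeline would catch narrower errors
            failures.append(f"{name}: {exc}")
    # No path accepted the payload, so report what went wrong on each one.
    raise DeliveryError("all paths failed: " + "; ".join(failures))
```

Each path here would wrap one provider or network route toward the core datacenter. Ordering the list by preference keeps the primary route in use until it fails, at which point the next provider takes over.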

Migrations are completely transparent to customers – there will be no impact, except that you will see performance improvements and faster feature iteration as these migrations are completed.

We will assess future features to make appropriate storage choices based on their supported use cases.
