How Apify built its own observability hub

Web scraping and browser automation are growing rapidly, and observability is key to keeping our platform stable and performant. At Apify, we recently replaced a key observability component with a self-hosted one. Learn how we made the transition, what challenges we encountered, and what we learned along the way.

Our scale: The numbers behind Apify

Apify is a widely adopted cloud-based platform for web scraping and browser automation, with:

  • 25,000+ active users
  • 1.5M+ containers started daily
  • Up to 8,000 API requests per second

The whole platform produces a large amount of monitoring data, and our observability stack has to handle all of it. Specifically:

  • 130 GB of logs across all components
  • 4,000+ high-cardinality metrics, which means 6–10 million active time series (a rough back-of-the-envelope estimate follows this list)
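
To make the jump from metric names to active time series concrete, here is a rough sketch: every unique combination of label values becomes its own series, so a few thousand metric names can easily multiply into millions of series. The label fan-out below is an illustrative assumption, not our real label set.

```typescript
// Rough, hypothetical estimate of how metric names turn into active series:
// every unique combination of label values counts as a separate series.
const metricNames = 4_000;

// Assumed average label fan-out per metric, e.g. ~50 pods across ~40 nodes.
const avgLabelCombinationsPerMetric = 50 * 40;

const activeSeries = metricNames * avgLabelCombinationsPerMetric;
console.log(activeSeries.toLocaleString()); // 8,000,000, i.e. within the 6–10M range
```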

Start manually, automate later … and iterate

Our CTO Marek shared Apify’s observability story three years ago and described the “Start manually, automate later … and iterate” approach. Now it was time for the next iteration, so we took a close look at New Relic, the infrastructure monitoring platform we had been using since 2020.

At that time, it was a great starting point because it offered out-of-the-box simplicity with minimal setup – perfect for our initial observability needs. But as the platform grew, we faced some limitations.

  1. Not a perfect fit for us: As a company that supports open communities (we are building one around our platform), we prefer tools that are popular and widespread in those communities. This makes integrations easier for us, since there are plenty of ready-made solutions (dashboards, integrations), and it also gives us a better chance of hiring good engineers for future growth.
  2. Data sampling: Sampling in New Relic made it harder for developers to debug accurately, especially during spikes.
  3. Developer frustrations: Poor UX, such as difficulties copying user IDs directly from dashboards, slowed down debugging. In addition, bugs and memory leaks in their SDKs degraded the performance of the Apify platform.
  4. High costs: Changes in New Relic’s pricing model, plus the growth of the Apify Platform over the past two years, have caused costs to increase significantly.
  5. No proper deployment: In the “start manually” days, we didn’t use IaC (Infrastructure as Code) or GitOps techniques for configuration. To be fair, the problem wasn’t New Relic (it supports IaC) but the way we used it. Today, this means that any change to the monitoring setup is increasingly difficult to implement and track.

Making the switch to Grafana

Migrating to a new observability platform hasn’t been without obstacles. We had to deal with several challenges:

  1. Starting from scratch: Moving away from the proprietary New Relic SDK meant rewriting our custom metrics for the new platform (see the sketch after this list). And since we were making big changes anyway, we reviewed all metrics, alerts, and dashboards.
  2. Developer adoption: Transitioning teams to a system with different workflows and interfaces required proper preparation.
  3. Scalability and cost balance: The load on our platform fluctuates, so the new monitoring stack had to handle it effectively while keeping costs in check.
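
As a rough illustration of the first point, this is the kind of change the metric rewrite involved: replacing calls to the vendor-specific New Relic agent with the vendor-neutral OpenTelemetry metrics API (part of the stack described below). The meter, metric names, and attributes here are placeholders, not our actual code.

```typescript
// Illustrative sketch only: names and attributes are hypothetical.

// Before: a custom metric recorded through the New Relic Node.js agent, roughly:
//   import newrelic from 'newrelic';
//   newrelic.recordMetric('Custom/Containers/Started', 1);

// After: the vendor-neutral OpenTelemetry metrics API.
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('platform-worker');

const containersStarted = meter.createCounter('containers_started_total', {
  description: 'Number of containers started by the platform',
});

export function onContainerStarted(region: string): void {
  // Keep attributes low-cardinality: unbounded values multiply the series count.
  containersStarted.add(1, { region });
}
```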

After exploring various alternatives, we chose Grafana because of its popularity, strong community support, and open-source architecture. Initially, we considered Grafana Cloud (we prefer managed services!), but we quickly realized how high the associated costs would be ($60–80k/month at our scale) ☹️

Instead, we decided to self-host Grafana because we estimated it would cost 1/10 of the price. And it’s always good to have that experience!

The Hub

We took the deployment of our monitoring platform seriously. To keep it secure and well organized, we set it up in a separate AWS account with its own infrastructure. This decision helped us keep things isolated and allowed us to better manage costs and access.

We wanted to get as close as possible to the Grafana Cloud architecture, which has been developed and refined over many years and whose design is proven. The large community around it is a big advantage, too – lots of open-source dashboards and plugins are ready to use!

We settled on the following components:

  • Grafana for dashboards and visualization.
  • Prometheus & AlertManager for recording metrics and processing alerts.
  • Mimir as a long-term metric storage.
  • Loki for collecting logs and Kubernetes events from the platform itself.
  • OpenTelemetry: the Collector on the infrastructure side and the SDK in our Node.js applications for exporting custom metrics and traces (a minimal wiring sketch follows this list).
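
To give a sense of the application side, here is a minimal, hypothetical sketch of wiring the OpenTelemetry Node.js SDK so that traces and custom metrics are exported over OTLP to a collector. The service name, collector endpoint, and export interval are placeholder assumptions, not our production configuration.

```typescript
// Minimal sketch: wiring the OpenTelemetry SDK into a Node.js service.
// Endpoint, service name, and interval are assumptions for illustration.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'example-api', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317', // assumed in-cluster collector endpoint
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 15_000, // export every 15 seconds
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush telemetry and shut down cleanly when the container stops.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```

The exact exporters and endpoints depend on how the Collector is deployed in the cluster; auto-instrumentation covers the common HTTP and database libraries, while custom metrics go through the API shown earlier.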

Using GitHub, Terraform, and Terramate for the deployment ensures that every change is consistent across all environments and easy to revert.

We simply call our platform “The Hub”.

The platform runs on an EKS cluster with approximately 40 nodes, using around 1 TB of memory and 250 CPU cores in total. We used the ARM architecture whenever possible; as a result, 80% of our nodes are ARM-based (Graviton instances).

For our developers with ❤️

The platform is designed for our developers’ everyday use. We’ve made sure they have everything they need from the start – for example, complete documentation and ready-made dashboards to make the transition as easy as possible.
