Monitoring the World: Scaling Thanos in Dynamic Prometheus Environments #23

angelicagardner · 2024-07-16T10:03:11Z

Info

Speaker(s): Colin Douch, Cloudflare
Video URL: https://youtu.be/ofhvbG0iTjU?si=be7Df_-4ryWneqiZ

Official Abstract/Summary

Cloudflare's Thanos journey started back in 2017, with conversations about how we could have a single pane of glass to monitor our new Prometheus infrastructure, replacing our old centralised OpenTSDB instance. Since then, our Prometheus footprint has grown to monitor nearly 500 datacenters around the world, with Thanos continuing to provide that invaluable single pane of glass. Along the way, we've encountered and solved interesting scaling problems arising from running hundreds of geographically dispersed sidecars, collecting tens of billions of active time series. In this talk, we will explain these challenges and present the tooling we have developed to automatically manage and scale our infrastructure: from creating and wiring new buckets and sidecars as we provision new Prometheus servers around the world, to automatically sharding stores as our buckets grow, to utilising our spare CPU capacity to run compactors in locations during non-peak hours.

angelicagardner added 2024 Year prometheus monitoring ThanosCon Conference/Host Type thanos scalability labels Jul 16, 2024
