Monitoring the World: Scaling Thanos in Dynamic Prometheus Environments #23

Open
angelicagardner opened this issue Jul 16, 2024 · 0 comments

@angelicagardner
Owner

Info

Speaker(s): Colin Douch, Cloudflare
Video URL: https://youtu.be/ofhvbG0iTjU?si=be7Df_-4ryWneqiZ

Official Abstract/Summary

Cloudflare's Thanos journey started back in 2017, with conversations about how we could have a single pane of glass to monitor our new Prometheus infrastructure, replacing our old centralised OpenTSDB instance. Since then, our Prometheus footprint has grown to monitor nearly 500 datacenters around the world, with Thanos continuing to provide that invaluable single pane of glass. Along the way, we've encountered and solved interesting scaling problems arising from running hundreds of geographically dispersed sidecars, collecting tens of billions of active timeseries. In this talk, we will explain these challenges and present the tooling we have developed to automatically manage and scale our infrastructure: from creating and wiring new buckets and sidecars as we provision new Prometheus servers around the world, to automatically sharding stores as our buckets grow, to utilising our spare CPU capacity to run compactors in locations during non-peak hours.
