Cloudflare's Thanos journey started back in 2017, with conversations about how we could have a single pane of glass to monitor our new Prometheus infrastructure, replacing our old centralised OpenTSDB instance. Since then, our Prometheus footprint has grown to monitor nearly 500 datacenters around the world, with Thanos continuing to provide that invaluable single pane of glass. Along the way, we've encountered and solved interesting scaling problems arising from running hundreds of geographically dispersed sidecars, collecting tens of billions of active timeseries. In this talk, we will explain these challenges and present the tooling we have developed to automatically manage and scale our infrastructure: from creating and wiring up new buckets and sidecars as we provision new Prometheus servers around the world, to automatically sharding stores as our buckets grow, to utilising spare CPU capacity to run compactors in locations during their off-peak hours.
Info
Speaker(s): Colin Douch, Cloudflare
Video URL: https://youtu.be/ofhvbG0iTjU?si=be7Df_-4ryWneqiZ
Official Abstract/Summary