We are observing a sudden surge in M3 aggregation errors of the type "too far in past" after around 30 hours of traffic (1.5 full days of prod traffic). I am not sure it is just caused by load, since 1) the cluster looks very stable for the first full day of prod traffic, and 2) the increase in these errors is not gradual. It really is a sudden spike, and it eventually causes the M3 Coordinators to OOM and crash.
The surge in these errors is accompanied by an increase in the ingest latency metric.
Observed a couple of deviations between my config and the standard configs in the repo:
entryTTL: 11h in the aggregator config? I see it set to 1h in almost all the config examples in the repo.
bufferDurationForPastTimedMetric: 5m, as opposed to 10s in almost all the config examples in the repo.
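For comparison, a minimal sketch of how these two settings look in the repo's example m3aggregator YAML (the values shown are the repo-example ones my config deviates from; the nesting under the aggregator: block follows those examples and should be verified against your own file):

aggregator:
  entryTTL: 1h
  bufferDurationForPastTimedMetric: 10s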
We are using the fleet-of-m3-coordinators-and-m3-aggregators topology to aggregate metrics before sending them to downstream Grafana remote storage:
https://m3db.io/docs/how_to/any_remote_storage/#fleet-of-m3-coordinators-and-m3-aggregators
Apps -> OTel Collector (Prometheus receiver) -> M3 Coordinator -> M3 Aggregators -> M3 Coordinator (aggregated metrics) -> Prometheus remote write to Grafana
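For the first hop, a minimal sketch of the collector side, assuming the OTel Collectors forward to the M3 Coordinators via the prometheusremotewrite exporter (the endpoint path is M3 Coordinator's standard Prometheus remote write path; the host placeholder is ours):

exporters:
  prometheusremotewrite:
    endpoint: "http://<m3coordinator-host>:7201/api/v1/prom/remote/write"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]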
Topology Details:
25 OTel Collector pods -> 25 M3 Coordinator nodes -> 24 M3 Aggregator nodes (512 shards, RF 2)
M3 Agg End to End details dashboard snapshot.pdf
M3 Coordinator Dashboard snapshot.pdf
I suspect something might be wrong with my configs
M3 Agg Log example:
Configs:
M3 Coordinator
M3 Agg
Performance issues
If the issue is performance related, please provide the following information along with a description of the issue that you're experiencing:
What service is experiencing the performance issue? (M3Coordinator, M3DB, M3Aggregator, etc.)
Approximately how many datapoints per second is the service handling?
What is the approximate series cardinality that the service is handling in a given time window? I.e. how many unique time series are being measured?
What is the hardware configuration (number of CPU cores, amount of RAM, disk size and types, etc.) that the service is running on? Is the service the only process running on the host, or is it colocated with other software?
What is the configuration of the service? Please include any YAML files, as well as namespace / placement configuration (with any sensitive information anonymized if necessary).
How are you using the service? For example, are you performing reads/writes to the service via Prometheus, or are you using a custom script?
In addition to the above information, CPU and heap profiles are always greatly appreciated.
CPU / Heap Profiles
CPU and heap profiles are critical to helping us debug performance issues. All our services run with the net/http/pprof server enabled by default.
Instructions for obtaining CPU / heap profiles for various services are below, please attach these profiles to the issue whenever possible.
M3Coordinator
CPU
curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/profile?seconds=5 > m3coord_cpu.out
Heap
curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/heap > m3coord_heap.out
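Since the coordinators in this report eventually OOM, allocation and goroutine profiles from the same net/http/pprof server may also help (these are standard pprof endpoints, using the same host/port placeholders as above):
curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/allocs > m3coord_allocs.out
curl <HOST_NAME>:<PORT(default 7201)>/debug/pprof/goroutine > m3coord_goroutine.out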
M3DB
CPU
curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/profile?seconds=5 > m3db_cpu.out
Heap
curl <HOST_NAME>:<PORT(default 9004)>/debug/pprof/heap > m3db_heap.out
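M3Aggregator (not covered by the template above, but it is the service at issue here; since all services run with the net/http/pprof server enabled by default, the same profiles should be obtainable, with the port placeholder below being whatever HTTP listen address is configured for m3aggregator):
CPU
curl <AGG_HOST_NAME>:<AGG_HTTP_PORT>/debug/pprof/profile?seconds=5 > m3agg_cpu.out
Heap
curl <AGG_HOST_NAME>:<AGG_HTTP_PORT>/debug/pprof/heap > m3agg_heap.out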
M3DB Grafana Dashboard Screenshots
If the service experiencing performance issues is M3DB and you're monitoring it using Prometheus, any screenshots you could provide using this dashboard would be helpful.