
Latency going up as more hosts are added #156

Open
modena01 opened this issue Jun 11, 2024 · 7 comments

@modena01

modena01 commented Jun 11, 2024

Thanks for smokeping! I am a prometheus newb, so please bear with me. Smokeping was working fine for me at first with a single host, then I tried adding about 100 additional hosts to ping, and the reported ICMP latency went up significantly. I dropped back down to 21 hosts, and latency dropped, but not back to the same level as with 1 target host.
Is it correct to have a config like this:

targets:
  - hosts:
      - my.one.host
      - my.two.host
  - hosts:
      - my.three.host

Is the purpose of having multiple "hosts" sections merely to set different variables, such as interval and size, for different groups of hosts?
If smokeping_prober is creating, tracking, and reporting buckets to Prometheus, is there a valid reason to scrape it from Prometheus any more often than, say, every 1 minute?

My prometheus config is as yet very simple:

- job_name: 'smokeping_prober'
  scrape_interval: 60s
  static_configs:
    - targets: ['localhost:9374']

From the prometheus log, I see a message like this when I have a single ICMP target:

"Waiting 1s between starting pingers" 

but with 21 targets I get:

"Waiting 47.619047ms between starting pingers"

so it is clearly dividing 1000ms by the number of targets, but I cannot find this in the smokeping code, so I guess it is Prometheus doing this? I was looking at it while trying to figure out why the reported latency climbs higher and higher the more ICMP target hosts I add.

Thanks for your help.

@SuperQ
Owner

SuperQ commented Jun 11, 2024

No, that message is from an older version of the smokeping_prober. The message was removed when we added dynamic reload support.
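
For reference, that old message came from spreading pinger start times evenly across one second, which is why the wait shrinks as you add targets. A minimal Go sketch of the arithmetic only (not the removed code itself):

package main

import (
    "fmt"
    "time"
)

func main() {
    // Spread pinger start times evenly across one second:
    // 1 target   -> 1s between starts
    // 21 targets -> 1s / 21 = 47.619047ms between starts
    for _, numTargets := range []int{1, 21} {
        wait := time.Second / time.Duration(numTargets)
        fmt.Printf("Waiting %v between starting pingers (%d targets)\n", wait, numTargets)
    }
}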

Reported latency may be going up because the prober is being starved for CPU and unable to process response packets fast enough.

@modena01
Author

modena01 commented Jun 13, 2024

Thanks SuperQ, I have now updated to the latest version. Here is an example of what happens when I went from 21 hosts to around 100:

[screenshot: latency graph showing the jump after going from 21 to ~100 hosts]

Do I need to run multiple smokeping instances and split the hosts across the instances? Increasing the interval period does not seem to help.

@modena01
Author

I'm looking at needing hundreds (probably 500+) hosts to monitor...

@Nachtfalkeaw

How often do you ping, and how many hosts?
What packet size for the ICMP packets?
How many CPU cores do you have?

I am pinging a few hundred hosts (200-300), but with different intervals: some I ping every 200ms and others every 5s. I noticed that the CPU load is higher at the beginning than at later times; maybe the load gets distributed. Running "top", I sometimes see smokeping_prober consume 1100% CPU and at other times only 300-500%.

The Prometheus scrape interval defines the bucket length, meaning each bucket contains all ping results from one scrape interval. If you ping a host every 1s and scrape every 60s, you have 60 results in that bucket. This may be OK for you, but if some pings have high latency, you cannot tell whether they occurred at the beginning, at the end, or spread across the bucket.

So it depends on the use case. I scrape every 15s, which contains at least 3 pings for the "every 5s ping" targets.

So back to your question: I would check your CPU consumption, and, if possible, just add a few more CPU cores and check how the behaviour changes.
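
As a quick worked example of that scrape-interval arithmetic (a sketch only; it assumes every ping completes and is recorded within the scrape window):

package main

import (
    "fmt"
    "time"
)

// resultsPerScrape is how many ping observations accumulate between scrapes.
func resultsPerScrape(scrapeInterval, pingInterval time.Duration) int64 {
    return int64(scrapeInterval / pingInterval)
}

func main() {
    fmt.Println(resultsPerScrape(60*time.Second, 1*time.Second))        // 60 results per scrape
    fmt.Println(resultsPerScrape(15*time.Second, 5*time.Second))        // 3 results per scrape
    fmt.Println(resultsPerScrape(15*time.Second, 200*time.Millisecond)) // 75 results per scrape
}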

@Alb0t

Alb0t commented Nov 1, 2024

Anyone here find a solution? Facing the same issue here. Increasing GOMAXPROCS seems to help, but only a little bit. Using the latest version.

Trying to ping ~500 hosts every second, with 24-byte ICMP packets. Is this some issue with Prometheus's addToBucket concurrency? Maybe we need more distinct metrics instead of label pairs?

@Alb0t

Alb0t commented Nov 1, 2024

[screenshot: pprof profile]

@SuperQ
Owner

SuperQ commented Nov 2, 2024

Is this some issue with Prometheus's addToBucket concurrency? Maybe we need more distinct metrics instead of label pairs?

No, the timing is calculated entirely outside of the metric manipulation. There is no difference in performance between metrics and labels in Prometheus monitoring.

My guess right now is this has to do with the way the pingers are structured. Every target creates a new UDP listener, which means that we now have a lot of socket listeners all trying to read the packets off the receive queue. This is creating a lot of contention and delay.

What we should do is create a single small pool of UDP receivers which timestamp the packets and send them to the correct metric.
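
Roughly the shape I have in mind, as a sketch only (the real implementation would use ICMP sockets and per-target sequence matching, none of which is shown here):

package main

import (
    "log"
    "net"
    "time"
)

// reply is a received packet plus its arrival timestamp.
type reply struct {
    from    net.Addr
    arrived time.Time
    payload []byte
}

// runReceiverPool starts n readers on one shared PacketConn and sends
// timestamped replies on the returned channel, instead of one listener
// per target.
func runReceiverPool(conn net.PacketConn, n int) <-chan reply {
    out := make(chan reply, 1024)
    for i := 0; i < n; i++ {
        go func() {
            buf := make([]byte, 1500)
            for {
                nRead, from, err := conn.ReadFrom(buf)
                if err != nil {
                    log.Printf("receiver exiting: %v", err)
                    return
                }
                payload := make([]byte, nRead)
                copy(payload, buf[:nRead])
                out <- reply{from: from, arrived: time.Now(), payload: payload}
            }
        }()
    }
    return out
}

func main() {
    // Hypothetical socket for illustration; the prober opens ICMP sockets.
    conn, err := net.ListenPacket("udp4", "0.0.0.0:0")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    for r := range runReceiverPool(conn, 4) {
        // Here a dispatcher would match r.from (and the sequence number in
        // r.payload) to a target and observe the latency in its histogram.
        _ = r
    }
}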
