adding global_cache_ttl feature #622

Open
wants to merge 3 commits into master
Conversation

victoramsantos
Contributor

This change adds the global_cache_ttl feature. When its value is greater than 0, cloudwatch-exporter returns the response from the last /metrics call (only updating the cloudwatch_* and Java metrics) until global_cache_ttl seconds have passed since that call.

Why?

When running Prometheus in HA, multiple servers scrape cloudwatch-exporter, which multiplies the calls to the AWS APIs. With this feature we can cache the last answer and guarantee that for global_cache_ttl seconds (recommended to be the same as the scrape_interval Prometheus uses to scrape cloudwatch-exporter) we will not call AWS again.

Changes

  • ADD: If global_cache_ttl is greater than 0, AWS metrics are cached
  • ADD: cloudwatch_exporter_cached_answer returns 1 if the answer was served from the cache, 0 otherwise
  • ADD: Unit tests that call CloudWatchCollector several times, simulating the passage of time and checking both cached and non-cached responses.
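
A rough sketch of the caching check described above (illustrative only; the field names and exact logic are assumptions, not necessarily what the diff does):

// Illustrative only: how a TTL-based cache check could look.
// globalCacheTtl is the configured value in seconds.
private List<MetricFamilySamples> cachedSamples = null;
private long lastRefreshNanos = 0;

private boolean shouldCache() {
  return globalCacheTtl > 0;
}

private boolean shouldReturnFromCache() {
  long ageSeconds = (System.nanoTime() - lastRefreshNanos) / 1_000_000_000L;
  return cachedSamples != null && ageSeconds < globalCacheTtl;
}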

Signed-off-by: Victor Amorim <victor.amorim@nubank.com.br>
@victoramsantos
Contributor Author

@matthiasr Please, any feedback is welcome 😊

Contributor

@matthiasr left a comment

Sorry for the delay, work keeps me busy 😔

I like the idea! I am not so sure about the name… "global cache" describes the implementation more than the effect, and the addition to the README doesn't really explain when one would want it. Your PR description does, but a user won't see that. What else could we call the setting to make it more about the effect for the user? Something with refresh time? Or we leave it as is, but expand on it more in the documentation?

How will this interact with timestamps? What do users need to be aware of when they set this and enable timestamps? At what point will it start to impact queries due to the lookback delta?

In addition to the gauge showing whether this request was served from cache or not, I would also like to have a counter for cache hits. This would help quantify the impact, which is not reliably possible with the gauge alone, especially in an HA setup where one Prometheus may always be the one triggering the refresh. The cache hit counter should have the same labels as the metric for the requests overall so it is easy to get a ratio in PromQL.
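
A minimal sketch of such a counter pair with the Java simpleclient (the metric names are placeholders, not the exporter's actual ones; whatever labels the overall request metric carries should be mirrored on the cache-hit counter):

import io.prometheus.client.Counter;

// Placeholder names; the key point is that both counters share the same label set,
// so the cache hit ratio is a simple division in PromQL.
static final Counter scrapesTotal = Counter.build()
    .name("cloudwatch_exporter_scrapes_total")
    .help("Total /metrics requests handled.")
    .register();

static final Counter cacheHitsTotal = Counter.build()
    .name("cloudwatch_exporter_cache_hits_total")
    .help("/metrics requests served from the cache.")
    .register();

// Cache hit ratio in PromQL:
//   rate(cloudwatch_exporter_cache_hits_total[5m]) / rate(cloudwatch_exporter_scrapes_total[5m])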

public List<MetricFamilySamples> collect() {
  long start = System.nanoTime();
  double error = 0;
  List<MetricFamilySamples> mfs = new ArrayList<>();

  if (shouldCache() && shouldReturnFromCache()) {
    LOGGER.log(Level.INFO, "Returning from cache");
Contributor

This should be debug level, otherwise it will be very noisy
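
Assuming the existing java.util.logging LOGGER (which the Level.INFO call suggests), the debug-equivalent level would be FINE:

LOGGER.log(Level.FINE, "Returning from cache");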

@victoramsantos
Contributor Author

Hi, thanks for the feedback!

What else could we call the setting to make it more about the effect for the user? Something with refresh time? Or we leave it as is, but expand on it more in the documentation?

I'm trying to think of something 😅. I'll also update the documentation.

How will this interact with timestamps? What do users need to be aware of when they set this and enable timestamps? At what point will it start to impact queries due to the lookback delta?

Sorry, I didn't understand these questions :/ What do you mean by enabling timestamps, and what is the lookback delta?

In addition to the gauge showing whether this request was served from cache or not, I would also like to have a counter for cache hits. This would help quantify the impact, which is not reliably possible with the gauge alone, especially in an HA setup where one Prometheus may always be the one triggering the refresh. The cache hit counter should have the same labels as the metric for the requests overall so it is easy to get a ratio in PromQL.

Agree, I'll update it.

I'm not great with tests, so do you think we need more, or is what we have enough?

@matthiasr
Contributor

Most of the time, Prometheus records samples at the time of scrape. However, this does not work with CloudWatch – because metric values converge, we need to look back in time. By default, the exporter will get metrics from 10 minutes ago. Now that could be terribly confusing because you would see a change in a metric 10 minutes later in Prometheus than in CloudWatch. To compensate, by default, the exporter also reports timestamps. This means even though Prometheus scrapes the exporter at, say, 08:42, it knows that the actual metric value is from 08:32, and saves the sample as such.

However, when you run a query in Prometheus, it only looks back 5 minutes by default (the lookback delta). Thus, when you run an instant query (e.g. the "Table" tab in the Prometheus web interface) at 08:42, it won't show anything. This is a frequent source of "bug" reports 😬; unfortunately, it's a consequence of the model mismatch between CloudWatch and Prometheus, and not something the exporter can really do anything about.

I am wondering (and would like documentation on) the effects of the cache on timestamps. Since it caches the response, presumably (and probably correctly), the timestamps it reports will be even further in the past.

Now that I think of it, the problem may go beyond the lookback delta – if the same Prometheus scrapes the exporter again, it will see duplicate samples. I am not 100% sure how it handles those; it will definitely affect queries though. For example, if one were to set a global cache TTL of 5 minutes, a rate(some_metric[4m]) will not work, because it will only see one or zero samples, even though the scrape frequency might be much higher.

I think a user can avoid this, and still get the benefit you are after, by never setting this value larger than the smallest scrape interval of any Prometheus server that scrapes it. That way, each Prometheus will always see new data between scrapes. We could even use this as inspiration for the naming: call it minimum_scrape_interval or so, and explain that it reduces API calls when multiple Prometheus servers with the same scrape interval request metrics, but should never be set higher than the smallest scrape interval in use.

@matthiasr added the enhancement and java labels on Mar 3, 2024