adding global_cache_ttl feature #622

Open
wants to merge 3 commits into master
Conversation

victoramsantos
Contributor

This change adds the global_cache_ttl feature. When its value is greater than 0, cloudwatch-exporter returns the response from the last /metrics call (only updating the cloudwatch_* and Java metrics) until global_cache_ttl seconds have passed since that call.

Why?

When running Prometheus in HA, multiple servers scrape cloudwatch-exporter, which multiplies the calls to the AWS APIs. With this feature we can cache the last answer and guarantee that for global_cache_ttl seconds (recommended to be the same as the scrape_interval Prometheus uses to scrape cloudwatch-exporter) we will not call AWS again.

Changes

  • ADD: If global_cache_ttl is greater than 0, AWS metrics are cached
  • ADD: cloudwatch_exporter_cached_answer returns 1 if the answer was served from the cache, 0 otherwise
  • ADD: Unit tests that call CloudWatchCollector several times, simulating the passage of time and checking both cached and non-cached responses.
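
A rough sketch of the caching check described above (illustrative only; the field names and exact logic are assumptions, not necessarily what the diff does):

// Illustrative only: how a TTL-based cache check could look.
// globalCacheTtl is the configured value in seconds.
private List<MetricFamilySamples> cachedSamples = null;
private long lastRefreshNanos = 0;

private boolean shouldCache() {
  return globalCacheTtl > 0;
}

private boolean shouldReturnFromCache() {
  long ageSeconds = (System.nanoTime() - lastRefreshNanos) / 1_000_000_000L;
  return cachedSamples != null && ageSeconds < globalCacheTtl;
}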

Signed-off-by: Victor Amorim <victor.amorim@nubank.com.br>
@victoramsantos
Contributor Author

@matthiasr Please, any feedback is welcome 😊

Contributor

@matthiasr left a comment

Sorry for the delay, work keeps me busy 😔

I like the idea! I am not so sure about the name… "global cache" describes the implementation more than the effect, and the addition to the README doesn't really explain when one would want it. Your PR description does, but a user won't see that. What else could we call the setting to make it more about the effect for the user? Something with refresh time? Or we leave it as is, but expand on it more in the documentation?

How will this interact with timestamps? What do users need to be aware of when they set this and enable timestamps? At what point will it start to impact queries due to the lookback delta?

In addition to the gauge showing whether this request was served from cache or not, I would also like to have a counter for cache hits. This would help quantify the impact, which is not reliably possible with the gauge alone, especially in an HA setup where one Prometheus may always be the one triggering the refresh. The cache hit counter should have the same labels as the metric for the requests overall so it is easy to get a ratio in PromQL.
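
A minimal sketch of such a counter pair with the Java simpleclient (the metric names are placeholders, not the exporter's actual ones; whatever labels the overall request metric carries should be mirrored on the cache-hit counter):

import io.prometheus.client.Counter;

// Placeholder names; the key point is that both counters share the same label set,
// so the cache hit ratio is a simple division in PromQL.
static final Counter scrapesTotal = Counter.build()
    .name("cloudwatch_exporter_scrapes_total")
    .help("Total /metrics requests handled.")
    .register();

static final Counter cacheHitsTotal = Counter.build()
    .name("cloudwatch_exporter_cache_hits_total")
    .help("/metrics requests served from the cache.")
    .register();

// Cache hit ratio in PromQL:
//   rate(cloudwatch_exporter_cache_hits_total[5m]) / rate(cloudwatch_exporter_scrapes_total[5m])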

public List<MetricFamilySamples> collect() {
  long start = System.nanoTime();
  double error = 0;
  List<MetricFamilySamples> mfs = new ArrayList<>();

  if (shouldCache() && shouldReturnFromCache()) {
    LOGGER.log(Level.INFO, "Returning from cache");
Contributor

This should be debug level, otherwise it will be very noisy
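
Assuming the existing java.util.logging LOGGER (which the Level.INFO call suggests), the debug-equivalent level would be FINE:

LOGGER.log(Level.FINE, "Returning from cache");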

@victoramsantos
Contributor Author

Hi, thanks for the feedback!

What else could we call the setting to make it more about the effect for the user? Something with refresh time? Or we leave it as is, but expand on it more in the documentation?

I'm trying to think of something 😅. I'll also update the documentation.

How will this interact with timestamps? What do users need to be aware of when they set this and enable timestamps? At what point will it start to impact queries due to the lookback delta?

Sorry, I didn't understand these questions :/ What do you mean by enabling timestamps, and what is the lookback delta?

In addition to the gauge showing whether this request was served from cache or not, I would also like to have a counter for cache hits. This would help quantify the impact, which is not reliably possible with the gauge alone, especially in an HA setup where one Prometheus may always be the one triggering the refresh. The cache hit counter should have the same labels as the metric for the requests overall so it is easy to get a ratio in PromQL.

Agree, I'll update it.

I'm not great with tests, so do you think we need more, or is what we have enough?

@matthiasr
Contributor

Most of the time, Prometheus records samples at the time of scrape. However, this does not work with CloudWatch – because metric values converge, we need to look back in time. By default, the exporter will get metrics from 10 minutes ago. Now that could be terribly confusing because you would see a change in a metric 10 minutes later in Prometheus than in CloudWatch. To compensate, by default, the exporter also reports timestamps. This means even though Prometheus scrapes the exporter at, say, 08:42, it knows that the actual metric value is from 08:32, and saves the sample as such.

However, when you run a query in Prometheus, it only looks back 5 minutes by default (the lookback delta). Thus, when you run an instant query (e.g. the "Table" tab in the Prometheus web interface) at 08:42, it won't show anything. This is a frequent source of "bug" reports 😬; unfortunately, it's a consequence of the model mismatch between CloudWatch and Prometheus, and not something the exporter can really do anything about.

I am wondering (and would like documentation on) the effects of the cache on timestamps. Since it caches the response, presumably (and probably correctly), the timestamps it reports will be even further in the past.

Now that I think of it, the problem may go beyond the lookback delta – if the same Prometheus scrapes the exporter again, it will see duplicate samples. I am not 100% sure how it handles those; it will definitely affect queries though. For example, if one were to set a global cache TTL of 5 minutes, a rate(some_metric[4m]) will not work, because it will only see one or zero samples, even though the scrape frequency might be much higher.

I think a user can avoid this, and still get the benefit you are after, by never setting this value larger than the smallest scrape interval of any Prometheus server that scrapes it. That way, each Prometheus will always see new data between scrapes. We could even use this as inspiration for the naming: call it minimum_scrape_interval or so, and explain that it reduces API calls when multiple Prometheus servers with the same scrape interval request metrics, but should never be set higher than the smallest scrape interval in use.

@matthiasr added the enhancement and java labels on Mar 3, 2024