Significantly improve inhibitor performance via new cache data structure #4134
+122
−44
This PR improves or matches performance of the existing inhibitor in almost all benchmark cases. When each inhibition rule has many inhibiting alerts, performance is improved by several orders of magnitude. No change to the inhibitor interface is necessary.
The new implementation replaces the `InhibitRule`'s `scache` with an `icache` (meaning "intersection cache"), which pre-calculates the equal label values a target alert would need in order to be inhibited by each inhibitor alert. In practice, this is implemented by calculating a fingerprint for just the `equalLabels` of each inhibiting alert and storing that alert in a map keyed by the `equalLabels` fingerprint. This allows O(1) access to the set of inhibiting alerts for an `InhibitRule` in the `Inhibitor.Mutes` method. The new data structure is essentially just a map from each `equalLabels` fingerprint to the inhibiting alerts that share it.

To insert a new inhibitor alert "A":
1. Build a `LabelSet` of just the subset of A's labels which are in the InhibitRule's `equal_labels`. Compute the fingerprint for this LabelSet and call it "E".
2. Store A in the map under the key E.

To check if an alert "T" is inhibited:
1. Build a `LabelSet` of just the subset of T's labels which are in the InhibitRule's `equal_labels`. Compute the fingerprint for this LabelSet and call it "E".
2. Look up E in the map. Only the inhibiting alerts stored under E can inhibit T.

The new implementation has a few other minor improvements which help performance: it reduces the number of times an inhibitor alert's fingerprint is calculated by caching it in the `icache`, it pre-calculates whether an inhibitor matches the source and target matchers of an inhibit rule, it calls `time.Now()` less frequently, and it avoids checking whether a target alert matches the source matchers unless necessary.

In the real world, this change helps most when an inhibit rule covers many inhibitors and target alerts. For example, a rule where all critical alerts inhibit all warning alerts with the same `instance` and `alertname` can result in many possible inhibitors attached to a rule where each inhibitor alert only inhibits one target alert. The old implementation of the inhibitor is extremely inefficient in this case: assuming N inhibiting alerts and a fixed number of `equalLabels`, it performs O(N) string comparisons per target. The new implementation is effectively O(1) in this case (except in a pathological case where all alerts in the inhibitor are resolved, since the new implementation still has to iterate past resolved alerts).

The new implementation does require slightly more memory per inhibiting alert than the old one: we store the alert's fingerprint and a boolean recording whether the inhibitor matches both the source and the target. This works out to about 16 extra bytes per cached inhibitor alert after accounting for padding. For 10,000 alerts, that's about 160KB. I think this is a totally acceptable increase in memory requirements.
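For readers who want to see the shape of the idea, here is a minimal, self-contained sketch of a map keyed by the fingerprint of the equal labels. It is not the code in this PR: the `Alert`, `LabelSet`, `Fingerprint`, `icache`, `insert`, and `inhibited` names are simplified stand-ins chosen for illustration, the FNV hash is an assumption, and the source and target matcher checks are left out entirely.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Simplified stand-ins for the real label set and fingerprint types
// (assumptions for this sketch only).
type LabelSet map[string]string
type Fingerprint uint64

// Alert is a hypothetical, minimal alert.
type Alert struct {
	Labels   LabelSet
	Resolved bool
}

// equalFingerprint hashes only the labels named in equal, in the rule's
// fixed order. A label missing from ls hashes as the empty string.
func equalFingerprint(ls LabelSet, equal []string) Fingerprint {
	h := fnv.New64a()
	for _, name := range equal {
		h.Write([]byte(name))
		h.Write([]byte{0xff}) // separator so "ab"+"c" differs from "a"+"bc"
		h.Write([]byte(ls[name]))
		h.Write([]byte{0xff})
	}
	return Fingerprint(h.Sum64())
}

// icache maps an equal-labels fingerprint to the inhibiting alerts that
// share it. A hash collision could produce a false match in this sketch.
type icache struct {
	equal   []string
	byEqual map[Fingerprint][]*Alert
}

func newIcache(equal ...string) *icache {
	return &icache{equal: equal, byEqual: make(map[Fingerprint][]*Alert)}
}

// insert stores inhibitor alert a under the fingerprint E of its equal labels.
func (c *icache) insert(a *Alert) {
	e := equalFingerprint(a.Labels, c.equal)
	c.byEqual[e] = append(c.byEqual[e], a)
}

// inhibited computes E for target t and scans only the alerts stored under E,
// skipping resolved ones.
func (c *icache) inhibited(t *Alert) bool {
	e := equalFingerprint(t.Labels, c.equal)
	for _, a := range c.byEqual[e] {
		if !a.Resolved {
			return true
		}
	}
	return false
}

func main() {
	c := newIcache("instance", "alertname")
	c.insert(&Alert{Labels: LabelSet{"alertname": "DiskFull", "instance": "a", "severity": "critical"}})

	same := &Alert{Labels: LabelSet{"alertname": "DiskFull", "instance": "a", "severity": "warning"}}
	other := &Alert{Labels: LabelSet{"alertname": "DiskFull", "instance": "b", "severity": "warning"}}
	fmt.Println(c.inhibited(same), c.inhibited(other))
}
```

The lookup cost depends on how many inhibitors share the target's equal-label values, not on how many inhibitors the rule holds in total. The loop over resolved alerts is the one place where the cost can still grow with the number of inhibitors stored under the same key.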
We've been running a very similar patch, which adds all inhibiting alerts to the `marker`, in our production environment for more than a month. In our environment this still results in a more than 4x reduction in time spent in the `Inhibitor.Mutes` method.

This change was motivated by observations which show a large amount of CPU time spent doing string comparisons in the inhibitor. It seems like we're not the only ones who have been looking for this, and I think this might be a good alternative to #3933.