Signficantly improve inhibitor performance via new cache datastructure #4134

Spaceman1701 · 2024-11-22T20:56:18Z

This PR improves or matches performance of the existing inhibitor in almost all benchmark cases. When each inhibition rule has many inhibiting alerts, performance is improved by several orders of magnitude. No change to the inhibitor interface is necessary.

                                                     │    main.txt    │    icache-for-upstream-final.txt    │
                                                     │     sec/op     │    sec/op     vs base               │
Mutes/1_inhibition_rule,_1_inhibiting_alert-4            1.865µ ±  4%   1.597µ ±  7%  -14.35% (p=0.002 n=6)
Mutes/10_inhibition_rules,_1_inhibiting_alert-4          2.007µ ±  7%   1.464µ ±  5%  -27.04% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1_inhibiting_alert-4         1.941µ ±  7%   1.447µ ± 16%  -25.46% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_1_inhibiting_alert-4        1.884µ ±  3%   1.422µ ± 10%  -24.55% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_1_inhibiting_alert-4       1.879µ ± 13%   1.419µ ±  4%  -24.49% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10_inhibiting_alerts-4          1.973µ ±  8%   1.404µ ±  4%  -28.86% (p=0.002 n=6)
Mutes/1_inhibition_rule,_100_inhibiting_alerts-4         3.726µ ± 10%   1.348µ ± 13%  -63.84% (p=0.002 n=6)
Mutes/1_inhibition_rule,_1000_inhibiting_alerts-4       21.485µ ±  5%   1.411µ ±  3%  -93.43% (p=0.002 n=6)
Mutes/1_inhibition_rule,_10000_inhibiting_alerts-4     184.245µ ±  9%   1.489µ ±  6%  -99.19% (p=0.002 n=6)
Mutes/100_inhibition_rules,_1000_inhibiting_alerts-4    20.453µ ± 14%   1.409µ ±  5%  -93.11% (p=0.002 n=6)
Mutes/10_inhibition_rules,_last_rule_matches-4           2.226µ ±  8%   1.912µ ±  8%  -14.13% (p=0.002 n=6)
Mutes/100_inhibition_rules,_last_rule_matches-4          6.959µ ±  2%   6.293µ ±  5%   -9.57% (p=0.002 n=6)
Mutes/1000_inhibition_rules,_last_rule_matches-4         55.08µ ±  4%   48.42µ ±  3%  -12.10% (p=0.002 n=6)
Mutes/10000_inhibition_rules,_last_rule_matches-4        534.1µ ±  7%   488.9µ ±  4%   -8.45% (p=0.002 n=6)
geomean                                                  8.267µ         3.181µ        -61.52%                                            8.267µ         3.364µ        -59.31%

The new implementation replaces the InhibitRule's scache with an icache (meaning "intersection cache") which pre-calculates the equal label values a target alert would need to be inhibited by each inhibitor alert. Practically, this is implemented by calculating a fingerprint for just the equalLabels of each inhibiting alert and storing that alert in a map keyed by the equalLabels fingerprint. This allows O(1) access to the set of inhibiting alerts for an InhibitRule in the Inhibitor.Mutes method. The new data structure is essentially just:

map[model.Fingerpint]map[model.Fingerprint]*types.Alert
// equaLabelsFp -> fp -> alert

To insert a new inhibitor alert "A":

does "A" match the source labels? If not, exit
make a LabelSet of just the subset of A's labels which are in the InhibitRule's equal_labels. Compute the fingerpint for this LabelSet and call it "E"
find the entry in the icache for E and insert A into it using A's fingerprint

To check if an alert "T" is inhibited:

does T match the target labels? If not, exit. T is not inhibited
make a LabelSet of just the subset of T's labels which are in the InhibitRule's equal_labels. Compute the fingerpint for this LabelSet and call it "E"
does an entry exist in the InhibitRule's icache for E? If not, exit. T is not inhibited
Find the first non-resolved alert in the icache entry and return it. T is inhibited by this alert.

The new implementation has a few other minor improvements which help performance: it reduces the number of time an inhibitor alert's fingerprint is calculated by caching it in the icache, it pre-calculates whether an inhibitor matches the source and target matchers for an inhibit rules, it calls time.Now() less frequently, and it avoids checking if a target alert matches the source matchers unless necessary.

In the real world, this change helps most when an inhibit rule covers many inhibitors and target alerts. For example, a rule where all critical alerts inhibit all warning alerts with the same instance and alertname can result in many possible inhibitors attached to a rule where each inhibitor alert only inhibits one target alert. The old implementation of the inhibitor is extremely inefficient in this case: assuming N inhibiting alerts and a fixed number of equalLabels, it performs O(N) string comparisons per target. The new implementation is effectively O(1) in this case (except in a pathological case where all alerts in the inhibitor are resolved, since the new implementation still has to iterate past resolved alerts).

The new implementation does require slightly more memory per inhibiting alert than the old one: we store the alert's fingerprint and a boolean to check if the inhibitor matches both the source and the target. This works out to be about 16 extra bytes per cached inhibitor alert after accounting for padding. For 10,000 alerts, that's about 160KB. I think this is a totally acceptable increase in memory requirements.

We've been running a very similar patch in our production environment which adds all inhibiting alerts to the marker for more than a month. In our environment this still results in a more then 4x reduction in time spent in the Inhibitor.Mutes method.

This change was motivated by observations which show a large amount of CPU spent doing string comparisons in the inhibitor. It seems like we're not the only ones who have been looking for this, and I think this might be a good alternative to #3933.

Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>

Spaceman1701 added 2 commits November 22, 2024 15:39

reimplement inhibitor datastructures to improve performance

fdef566

Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>

fix formatting

30e3747

Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>

Spaceman1701 force-pushed the icache-with-first-result branch from 6b19c38 to 30e3747 Compare November 22, 2024 21:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Signficantly improve inhibitor performance via new cache datastructure #4134

Signficantly improve inhibitor performance via new cache datastructure #4134

Spaceman1701 commented Nov 22, 2024 •

edited

Loading

Signficantly improve inhibitor performance via new cache datastructure #4134

Are you sure you want to change the base?

Signficantly improve inhibitor performance via new cache datastructure #4134

Conversation

Spaceman1701 commented Nov 22, 2024 • edited Loading

Spaceman1701 commented Nov 22, 2024 •

edited

Loading