Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Add best practices for metrics #2528

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

mrueg
Copy link
Member

@mrueg mrueg commented Oct 17, 2024

What this PR does / why we need it:
This PR introduces a doc for best practices around metrics.
Hoping to get some comments and an agreement on these ideas.
@rexagod @dgrisonnet @dashpole @logicalhan @CatherineF-dev

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
None
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 17, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 17, 2024
@mrueg mrueg changed the title doc: Add best practices for metrics docs: Add best practices for metrics Oct 17, 2024
@mrueg mrueg force-pushed the metrics-best-practices branch 2 times, most recently from af6c15a to 95a7f71 Compare October 17, 2024 21:47
@CatherineF-dev
Copy link
Contributor

LGTM for Stability

Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really nice so far. A couple of comments.

docs/design/metrics-best-practices.md Outdated Show resolved Hide resolved

### Optional properties

Some Kubernetes objects have optional fields. In case there is an optional value, it is better to not expose the label at all instead of exposing a "nil" value or an empty string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure about this one.

Prometheus metric families should have consistent labels.

For example, this is considered invalid.

my_metric{a="foo",b="bar"} 1
my_metric{b="baz"}1

In the above case, you want to expose an empty string label so the metric family is consistent.

my_metric{a="foo",b="bar"} 1
my_metric{a="",b="baz"}1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to know! I'll fix it (and will add some fixes to the next release of ksm).
Is there any upstream documentation on this that suggests labels to be empty instead of leaving them out so I can reference them in here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had an exmaple in the OpenMetrics exposition spec, but I can't find it right now. I'll see if I can find the docs on this.

Copy link
Member

@rexagod rexagod Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The technicality here can be split into two considerations:

KSM bypasses this trait by design (a promtool check metrics will succeed for the invalid example above).

Screenshot 2024-10-24 at 1 01 59 PM

That being said, it's always good to be consistent with the ecosystem, given we can incorporate a solution that ensures this in the CI.

docs/design/metrics-best-practices.md Outdated Show resolved Hide resolved
@mrueg
Copy link
Member Author

mrueg commented Oct 22, 2024

I'm also trying to suggest we move most labels into a single metric

metric_with_lots_of_labels{uid=identifier, a1=a1,a2=a2,...an=an} 1

instead of

metric_for_a1{uid=identifier,a1=a1}
metric_for_a2{uid=identifier,a2=a2}
....
metric_for_an{uid=identifier,an=an}

unless there are good reasons why the latter one is better for a TSDB/Querying etc

@rexagod
Copy link
Member

rexagod commented Oct 24, 2024

I believe field-specific metrics may help (a) limit cardinality (ref)? In the original issue, there was talk about including only tightly bounded labelsets in the _info metrics, however, that may lead to separate metrics that may fall into the "gray area" and (b) be ambiguous to the user if they are in an _info metric or not, as there isn't a feasible way to guess their inclusion or exclusion from the same without knowing a fields possible values beforehand. Not to mention, (c) static values may become dynamic in the future and cause breaking changes when moved to their own metrics (referring to the metric stability spec for BETA metrics).

@mrueg
Copy link
Member Author

mrueg commented Oct 28, 2024

I believe field-specific metrics may help (a) limit cardinality (ref)? In the original issue, there was talk about including only tightly bounded labelsets in the _info metrics, however, that may lead to separate metrics that may fall into the "gray area" and (b) be ambiguous to the user if they are in an _info metric or not, as there isn't a feasible way to guess their inclusion or exclusion from the same without knowing a fields possible values beforehand.
Not to mention, (c) static values may become dynamic in the future and cause breaking changes when moved to their own metrics (referring to the metric stability spec for BETA metrics).

This is why you need to look at the lifecycle of the object. If it's not lifetime scoped, it's dynamic. I'd rather have a few metric series with more labels than a bunch of metrics to avoid bad UX during querying where you need to do a few group lefts and joins to get to the result you want to see.

Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mrueg mrueg mentioned this pull request Oct 31, 2024
@richabanker
Copy link
Contributor

/triage accepted
/assign @rexagod

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 31, 2024
Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!


### Optional properties

Some Kubernetes objects have optional fields. In case there is an optional value, the label should still be exposed, ideally as an empty string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty string and not having the label is the same thing in Prometheus, so why do this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be treated differently by other monitoring systems, so rather stay explicit here and ensure we do it the same way everywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to chime in on #2528 (comment)

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: brancz, mrueg, SuperQ

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-authored-by: Ben Kochie <superq@gmail.com>
@mrueg
Copy link
Member Author

mrueg commented Nov 14, 2024

@dgrisonnet @CatherineF-dev @rexagod I think this is in a good state now. :)

@rexagod
Copy link
Member

rexagod commented Nov 15, 2024

/assign @dgrisonnet

For visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants