Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug-1898341: add host tag #1018

Merged
merged 4 commits into from
May 23, 2024
Merged

bug-1898341: add host tag #1018

merged 4 commits into from
May 23, 2024

Conversation

willkg
Copy link
Contributor

@willkg willkg commented May 23, 2024

This does some minor cleanup. We don't use statsd_namespace so there's no point it having a configuration variable for it. This removes some header material from antenna/__init__.py.

This adds a markus backend to capture what metrics were emitted during tests. We can compare before-and-after after making fundamental changes like this to how metrics are emitted.

After I landed the CaptureMetricsUsed code, I captured the metrics, ran them through sort and uniq and get this:

histogram	breakpad_resource.crash_size	['payload:compressed']
histogram	breakpad_resource.crash_size	['payload:uncompressed']
histogram	breakpad_resource.gzipped_crash_decompress	['result:success']
incr	breakpad_resource.gzipped_crash	[]
incr	breakpad_resource.incoming_crash	[]
incr	breakpad_resource.throttle	['result:accept']
incr	breakpad_resource.throttle_rule	['rule:accept_everything']
incr	breakpad_resource.throttle_rule	['rule:is_nightly']
incr	crashmover.publish_crash_dropped.count	[]
incr	crashmover.publish_crash_exception.count	[]
incr	crashmover.save_crash.count	[]
incr	crashmover.save_crash_dropped.count	[]
incr	crashmover.save_crash_exception.count	[]
incr	health.broken.count	[]
incr	health.heartbeat.count	[]
incr	health.lbheartbeat.count	[]
incr	health.version.count	[]
timing	breakpad_resource.on_post.time	[]
timing	crashmover.crash_handling.time	[]
timing	crashmover.crash_publish.time	[]
timing	crashmover.crash_save.time	[]

I changed the code to use a metrics client defined in antenna/libmarkus.py. In doing that, I had to make changes to a lot of keys. Then I added the host tag.

That gives us this list which is the same, but with the addition of the host tag information.

histogram	breakpad_resource.crash_size	['payload:compressed', 'host:49c259757b46']
histogram	breakpad_resource.crash_size	['payload:compressed', 'host:7fefaf509b01']
histogram	breakpad_resource.crash_size	['payload:uncompressed', 'host:49c259757b46']
histogram	breakpad_resource.crash_size	['payload:uncompressed', 'host:7fefaf509b01']
histogram	breakpad_resource.gzipped_crash_decompress	['result:success', 'host:49c259757b46']
histogram	breakpad_resource.gzipped_crash_decompress	['result:success', 'host:7fefaf509b01']
incr	breakpad_resource.gzipped_crash	['host:49c259757b46']
incr	breakpad_resource.gzipped_crash	['host:7fefaf509b01']
incr	breakpad_resource.incoming_crash	['host:49c259757b46']
incr	breakpad_resource.incoming_crash	['host:7fefaf509b01']
incr	breakpad_resource.throttle	['result:accept', 'host:49c259757b46']
incr	breakpad_resource.throttle	['result:accept', 'host:7fefaf509b01']
incr	breakpad_resource.throttle_rule	['rule:accept_everything', 'host:49c259757b46']
incr	breakpad_resource.throttle_rule	['rule:accept_everything', 'host:7fefaf509b01']
incr	breakpad_resource.throttle_rule	['rule:is_nightly', 'host:49c259757b46']
incr	breakpad_resource.throttle_rule	['rule:is_nightly', 'host:7fefaf509b01']
incr	crashmover.publish_crash_dropped.count	['host:49c259757b46']
incr	crashmover.publish_crash_dropped.count	['host:7fefaf509b01']
incr	crashmover.publish_crash_exception.count	['host:49c259757b46']
incr	crashmover.publish_crash_exception.count	['host:7fefaf509b01']
incr	crashmover.save_crash.count	['host:49c259757b46']
incr	crashmover.save_crash.count	['host:7fefaf509b01']
incr	crashmover.save_crash_dropped.count	['host:49c259757b46']
incr	crashmover.save_crash_dropped.count	['host:7fefaf509b01']
incr	crashmover.save_crash_exception.count	['host:49c259757b46']
incr	crashmover.save_crash_exception.count	['host:7fefaf509b01']
incr	health.broken.count	['host:49c259757b46']
incr	health.broken.count	['host:7fefaf509b01']
incr	health.heartbeat.count	['host:49c259757b46']
incr	health.heartbeat.count	['host:7fefaf509b01']
incr	health.lbheartbeat.count	['host:49c259757b46']
incr	health.lbheartbeat.count	['host:7fefaf509b01']
incr	health.version.count	['host:49c259757b46']
incr	health.version.count	['host:7fefaf509b01']
timing	breakpad_resource.on_post.time	['host:49c259757b46']
timing	breakpad_resource.on_post.time	['host:7fefaf509b01']
timing	crashmover.crash_handling.time	['host:49c259757b46']
timing	crashmover.crash_handling.time	['host:7fefaf509b01']
timing	crashmover.crash_publish.time	['host:49c259757b46']
timing	crashmover.crash_publish.time	['host:7fefaf509b01']
timing	crashmover.crash_save.time	['host:49c259757b46']
timing	crashmover.crash_save.time	['host:7fefaf509b01']

That's not quite what we want. It adds the host tag for both AWS and GCP environments. However, I think that'll be ok.

Pretty sure we can see the hostname value in Sentry since it gets passed into the sentry init function.

  • Antenna: ec2-18-234-116-224.compute-1.amazonaws.com_i-04a52c01b39074c53
  • Tecken: ip-172-31-59-195.ec2.internal

If we look at Grafana, we see the values that telegraf is putting in.

  • Antenna: ip-172-31-23-117.us-west-2.compute.internal
  • Tecken: ip-172-31-12-209.us-west-2.compute.internal

I implemented the Antenna hostname thing much like we have Tecken:

  • configuration key is "HOSTNAME"
  • if it's not set, the value defaults to socket.gethostname()
  • we don't set it in the infrastructure code

Given that, I think even though this implementation adds a host tag for both AWS and GCP environments, it'll be fine.

We don't want to be setting statsd_namespace--that's not our standard
for metrics. Ergo, we should remove it.
This captures the metrics used in the tests in a `metrics_emitted.txt`
file. You can uniquify the contents of this file to determine what got
emitted and compare it from run to run.
@willkg willkg requested a review from a team as a code owner May 23, 2024 00:35
This changes HOST_ID to HOSTNAME to match the convention and what we get
in GCP.

If the hostname is set, this will add a "host:{hostname}" tag to all
emitted metrics.
@willkg willkg force-pushed the willkg-bug-1898341-hostname branch from bb79815 to 5bdc213 Compare May 23, 2024 12:05
@willkg willkg merged commit 7605048 into main May 23, 2024
2 checks passed
@willkg
Copy link
Contributor Author

willkg commented May 23, 2024

There's an issue in Markus for building something that lets you register metrics and then enforce that only those metrics get emitted. I was tinkering with something along those lines with Eliot, but I don't like the shape of that experiment. Anyhow, something like that would have been really helpful here. Some day I'll get there.

Thank you for reviewing!

@willkg willkg deleted the willkg-bug-1898341-hostname branch May 23, 2024 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants