Add API for fetching details about an oximeter producer #7139

Open: bnaecker wants to merge 2 commits into main from add-producer-details-api

@bnaecker bnaecker force-pushed the add-producer-details-api branch 2 times, most recently from 6c93ed5 to 041af41 Compare November 22, 2024 02:36
@bnaecker
Collaborator Author

This PR adds a bunch of useful debugging information to the oximeter collector. I wrote this after debugging #7120 and the related problems, and to help validate #7097. It adds an endpoint to the oximeter collector for fetching detailed information about a specific producer, such as the time it was registered or updated; the time of the last successful or failed collection; and the total numbers of successful and failed collections. I've added a basic sanity-check test, and an omdb subcommand for exercising it too. Here's what that looks like.

I started up Omicron on my dev Helios machine, and we can list the producers like so:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter list-producers
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
Collector ID: e6b6ef16-59f0-428c-a4da-e00ba1dc7920

Last refresh: 2024-11-22 01:35:32.348074186 UTC

ID                                   ADDRESS                       INTERVAL
1d9c8c20-2218-444b-b860-144b21e4e991 [fd00:1122:3344:101::1]:44361 30s
2d7c6a47-815c-4aab-9c23-9fe6f3af682c [fd00:1122:3344:101::b]:36292 10s
5bd1f5ec-d084-4ebb-8b55-1eef2d1c1002 [fd00:1122:3344:101::c]:58573 10s
6dc59b22-be1a-4c4a-996a-ac5a9cd90870 [fd00:1122:3344:101::2]:8001  1s
79e9a733-2db0-4719-a841-9639afddece2 [fd00:1122:3344:101::a]:57584 10s
b2135f96-792b-44c9-aa6c-b827fa92b556 [fd00:1122:3344:101::1]:8001  1s
c7223ed4-af03-42f9-ace7-7976b8602b4a [fd00:1122:3344:101::2]:4677  1s
d3ec7c1e-99c5-460b-a536-bd5d5d097f68 [fd00:1122:3344:101::2]:40056 10s
f4117644-8add-4fc6-b865-7aa2d2ffb399 [fd00:1122:3344:101::2]:53096 10s

This subcommand already existed, but now we can drill down to see what's happening in each producer. Looking at a single producer, we get this:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter producer-details 106064ab-9f51-4ae5-b1b0-481c087b2a0f
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
              ID: 106064ab-9f51-4ae5-b1b0-481c087b2a0f
         Address: [fd00:1122:3344:101::1]:39136
      Registered: 2024-11-22T02:18:03.612Z
         Updated: 2024-11-22T02:18:03.612Z
        Interval: 30s
 Last collection: 2024-11-22T02:28:03.618Z
    Last success: 2024-11-22T02:28:03.657Z (39.004488ms, 846 samples)
    Last failure: Never
       Successes: 21
        Failures: 0

These all show zero failures because things are working fine on my machine. I wanted to experiment a bit to see what happens when things do start to fail. So I disabled one of the Nexus services, which is producer 40badf8b-9c27-4c5d-a010-81b9bc70d0f8, in the corresponding Nexus zone. When we do that, we start to see this:

note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
              ID: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
         Address: [fd00:1122:3344:101::a]:34819
      Registered: 2024-11-22T02:18:48.600Z
         Updated: 2024-11-22T02:18:48.600Z
        Interval: 10s
 Last collection: 2024-11-22T02:26:28.604Z
    Last success: 2024-11-22T02:26:18.605Z (1.095562ms, 2 samples)
    Last failure: 2024-11-22T02:26:28.605Z (unreachable)
       Successes: 46
        Failures: 1

So now there are some failures, and the last failure line shows when that happened and why (the server was unreachable). After a few seconds, Nexus comes back up and re-registers itself as a producer, which updates oximeter's information about it. We can see that here:

note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
              ID: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
         Address: [fd00:1122:3344:101::a]:47557
      Registered: 2024-11-22T02:18:48.600Z
         Updated: 2024-11-22T02:26:33.605Z
        Interval: 10s
 Last collection: 2024-11-22T02:26:43.605Z
    Last success: 2024-11-22T02:26:43.607Z (1.658154ms, 2 samples)
    Last failure: 2024-11-22T02:26:28.605Z (unreachable)
       Successes: 47
        Failures: 1

The number of successes has incremented, and the address has changed. Note that the lines printing the last failure and success are sticky, so the last failure will stick around forever, even if it was a long time ago. I've found that pretty helpful.
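
The sticky behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Details`, `Success`, and `Failure` types and their fields are hypothetical names, and `std::time::SystemTime` stands in for whatever timestamp type the collector actually uses. The point is only that each outcome overwrites its own slot and bumps its own counter, so a later success never clears an earlier failure.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical record of the most recent successful collection.
#[derive(Debug, Clone, PartialEq)]
pub struct Success {
    pub at: SystemTime,
    pub duration: Duration,
    pub n_samples: u64,
}

/// Hypothetical record of the most recent failed collection.
#[derive(Debug, Clone, PartialEq)]
pub struct Failure {
    pub at: SystemTime,
    pub reason: String,
}

/// Hypothetical per-producer summary kept by the collector.
#[derive(Debug, Default)]
pub struct Details {
    pub n_successes: u64,
    pub n_failures: u64,
    pub last_success: Option<Success>,
    pub last_failure: Option<Failure>,
}

impl Details {
    /// Record a success. Only the success slot and counter change.
    pub fn on_success(&mut self, at: SystemTime, duration: Duration, n_samples: u64) {
        self.n_successes += 1;
        self.last_success = Some(Success { at, duration, n_samples });
    }

    /// Record a failure. Only the failure slot and counter change.
    pub fn on_failure(&mut self, at: SystemTime, reason: &str) {
        self.n_failures += 1;
        self.last_failure = Some(Failure { at, reason: reason.to_string() });
    }
}
```

With this shape, a failure followed by a success leaves `last_failure` populated, which matches the output above where the `unreachable` failure is still shown after collections resume.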

This is all in addition to the timeseries we're already reporting, which show the cumulative number of collections and failures, broken down by the reason for the failure. We can see this failure here:

bnaecker@shale : ~/omicron $ ./target/release/omdb oxql
note: ClickHouse URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
Oximeter Query Language shell

Basic commands:
  \?, \h, help       - Print this help
  \q, quit, exit, ^D - Exit the shell
  \l                 - List timeseries
  \d <timeseries>    - Describe a timeseries
  \ql [<operation>]  - Get OxQL help about an operation

Or try entering an OxQL `get` query
0x〉get oximeter_collector:failed_collections | filter producer_id == "40badf8b-9c27-4c5d-a010-81b9bc70d0f8" | last 1

oximeter_collector:failed_collections

 base_route:
 collector_id: b8043883-0e83-4e39-9057-c1189a1905d2
 collector_ip: fd00:1122:3344:101::d
 collector_port: 12223
 producer_id: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
 producer_ip: fd00:1122:3344:101::a
 producer_port: 34819
 reason: unreachable
   [2024-11-22 02:26:28.605131895, 2024-11-22 02:33:48.609147507]: [1]

- Add `producer_details` API to `oximeter` collector, which returns
  information about registration time, update time, and collection
  summaries.
- Update producer details during collections themselves
- Add `omdb oximeter producer-details` subcommand for printing
- Closes #7125
///
/// # Panics
///
/// This panics if no collection was started.
Contributor

Hm, this seems okay but not ideal. Could start_collection() return some kind of handle that could turn into success/fail and statically avoid this panic? Something like

let handle = details.start_collection();
match do_collection() {
    Ok(_) => details.end_collection(handle.success(n_samples)),
    Err(_) => details.end_collection(handle.failure("failure reason")),
}

If the handle held the start time internally, that would also avoid incorrect timings caused by incorrectly paired calls; e.g.,

  1. start_collection()
  2. start_collection()
  3. on_success()
  4. on_success()

This is a little questionable given we serialize collection requests today, but I think if a caller did this both success calls would calculate time based on when the second collection started, which is probably not what they intended.
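
One way to flesh out the handle idea is sketched below. It is an assumption-laden illustration rather than what the PR ended up with: `CollectionHandle`, `Outcome`, and this `Details` type are hypothetical, and only the method names `start_collection`, `end_collection`, `success`, and `failure` are taken from the snippet above. Because each handle owns its start time and is consumed by `success` or `failure`, two overlapping collections cannot share or overwrite a start time, and there is no "no collection was started" state to panic on.

```rust
use std::time::{Duration, Instant};

/// Hypothetical token returned when a collection starts. It owns the start time.
#[derive(Debug)]
pub struct CollectionHandle {
    started: Instant,
}

/// Hypothetical result of finishing a collection.
#[derive(Debug)]
pub enum Outcome {
    Success { elapsed: Duration, n_samples: u64 },
    Failure { elapsed: Duration, reason: String },
}

impl CollectionHandle {
    /// Consume the handle, so each start can be completed at most once.
    pub fn success(self, n_samples: u64) -> Outcome {
        Outcome::Success { elapsed: self.started.elapsed(), n_samples }
    }

    /// Consume the handle and record why the collection failed.
    pub fn failure(self, reason: &str) -> Outcome {
        Outcome::Failure { elapsed: self.started.elapsed(), reason: reason.to_string() }
    }
}

/// Hypothetical per-producer summary.
#[derive(Debug, Default)]
pub struct Details {
    pub n_successes: u64,
    pub n_failures: u64,
    pub last: Option<Outcome>,
}

impl Details {
    /// Starting a collection stores nothing in `Details` itself.
    pub fn start_collection(&self) -> CollectionHandle {
        CollectionHandle { started: Instant::now() }
    }

    /// Ending a collection requires an `Outcome`, which can only come from a handle.
    pub fn end_collection(&mut self, outcome: Outcome) {
        match &outcome {
            Outcome::Success { .. } => self.n_successes += 1,
            Outcome::Failure { .. } => self.n_failures += 1,
        }
        self.last = Some(outcome);
    }
}
```

In the double-start sequence listed above, each `start_collection()` call would yield its own handle, so each completion computes its elapsed time from its own start.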

@@ -341,11 +350,13 @@ async fn collection_loop(
log,
"collection task received explicit request to collect"
);
details.start_collection();
Contributor

This doesn't seem right to me for a couple reasons:

  1. We aren't actually starting a collection here; we're attempting to enqueue a request to start a collection, which is a difference that seems important when debugging
  2. If the timer fires immediately before or after this, we could call start_collection() twice before either on_*() completes, since we have two independent queues

I'm not entirely sure how to suggest reworking this (although this seems related to my comment above about making start_collection() return a handle, since that might at least let us avoid interleaving). Do we care about distinguishing how much of the time spent collecting was spent in the internal queue vs the actual collection request?

Collaborator Author

> Do we care about distinguishing how much of the time spent collecting was spent in the internal queue vs the actual collection request?

I'm not sure, but it seems like it's not super tricky to get that and it's strictly more information. I'll rework this to handle the other comment, and figure out how to include this bit too.
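
A rough sketch of how queue time and collection time could be separated is shown below. It is not the collector's implementation: the real collection loop is async, while this uses a synchronous `std::sync::mpsc` channel to stay self-contained, and `CollectionRequest`, `Timings`, and `handle_one` are hypothetical names. The idea is that the request records when it was enqueued, and the worker takes a second timestamp when it dequeues the request.

```rust
use std::sync::mpsc;
use std::time::{Duration, Instant};

/// Hypothetical request placed on the collection queue.
pub struct CollectionRequest {
    enqueued_at: Instant,
}

impl CollectionRequest {
    /// Stamp the request at the moment it is created and enqueued.
    pub fn new() -> Self {
        Self { enqueued_at: Instant::now() }
    }
}

/// Hypothetical timing breakdown for one collection.
#[derive(Debug, Clone, Copy)]
pub struct Timings {
    /// Time between enqueueing the request and starting the collection.
    pub queued_for: Duration,
    /// Time spent in the collection itself.
    pub duration: Duration,
}

/// Take one request off the queue, run `collect`, and split the timing.
/// Returns `None` once every sender has been dropped and the queue is empty.
pub fn handle_one<T>(
    rx: &mpsc::Receiver<CollectionRequest>,
    collect: impl FnOnce() -> T,
) -> Option<(T, Timings)> {
    let request = rx.recv().ok()?;
    // The collection really starts here, not when the request was enqueued.
    let started = Instant::now();
    let queued_for = started.duration_since(request.enqueued_at);
    let result = collect();
    let duration = started.elapsed();
    Some((result, Timings { queued_for, duration }))
}
```

Carrying the enqueue time inside the request also means two independent queues cannot confuse each other's start times, since each request is timed on its own.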

- Rework how we track the start of a collection, so that we can always
  associate a start with its correct end.
- Independently track the last success and failure for a producer
- Update `oximeter` OpenAPI spec
- Update `omdb oximeter producer-details` with new API
@bnaecker
Collaborator Author

bnaecker commented Dec 3, 2024

@jgallagher I've reworked this all pretty significantly in 3ce03a5. It's not exactly what you laid out in your comment, since there were some awkward interactions with types that are private to the oximeter-collector crate. It seems a bit complicated to me, so I'm open to suggestions for improving it, but it does indeed do what we want.

@bnaecker
Collaborator Author

bnaecker commented Dec 3, 2024

For completeness, I started up the control plane again locally and here's the kind of output we get now:

bnaecker@shale : ~/omicron $ cargo r --bin omdb -- oximeter producer-details ee483402-4095-48f3-956e-64b229e0c0d7
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.86s
     Running `target/debug/omdb oximeter producer-details ee483402-4095-48f3-956e-64b229e0c0d7`
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223

          ID: ee483402-4095-48f3-956e-64b229e0c0d7
     Address: [fd00:1122:3344:101::1]:32972
  Registered: 2024-12-03T04:45:32.347Z
     Updated: 2024-12-03T04:45:32.347Z
    Interval: 30s
   Successes: 1520
    Failures: 0

Last success:
  Started at: 2024-12-03T17:25:02.654Z
  Queued for: 7.31µs
    Duration: 37.223142ms
     Samples: 846

Last failure: None

Development

Successfully merging this pull request may close these issues.

oximeter could report more debugging information about its producers
2 participants