-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add API for fetching details about an oximeter producer #7139
base: main
Are you sure you want to change the base?
Conversation
6c93ed5
to
041af41
Compare
This PR adds a bunch of useful debugging information into the I started up Omicron on my dev Helios machine, and we can list the producers like so:
This tool already existed, but now we can drill down to see what's happening in each. Just looking at the first one we get this:
These all show zero failures because things are working fine on my machine. I wanted to experiment a bit to see what happens when things do start to fail. So I disabled one of the Nexus services, which is producer
So now there are some failures, and the last failure line shows when that happened and why (the server was unreachable). After a few seconds, Nexus comes back up and re-registers itself as a producer, which updates
The number of successes has incremented, and the address has changed. Note that the lines printing the last failure and success are sticky, so the last failure will stick around forever, even if it was a long time ago. I've found that pretty helpful. This is all in addition to the timeseries we're already reporting showing the cumulative number of collections and failures, broken down by the reason for the failure. We can see this failure here:
|
041af41
to
300658d
Compare
- Add `producer_details` API to `oximeter` collector, which returns information about registration time, update time, and collection summaries. - Update producer details during collections themselves - Add `omdb oximeter producer-details` subcommand for printing - Closes #7125
300658d
to
69dbc6a
Compare
oximeter/api/src/lib.rs
Outdated
/// | ||
/// # Panics | ||
/// | ||
/// This panics if no collection was started. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this seems okay but not ideal. Could start_collection()
return some kind of handle that could turn into success/fail and statically avoid this panic? Something like
let handle = details.start_collection();
match do_collection() {
Ok(_) => details.end_collection(handle.success(n_samples)),
Err(_) => details.end_collection(handle.failure("failure reason")),
}
If the handle held the start time internally, that would also avoid incorrect timings caused by incorrectly paired calls; e.g.,
start_collection()
start_collection()
on_success()
on_success()
This is a little questionable given we serialize collection requests today, but I think if a caller did this both success calls would calculate time based on when the second collection started, which is probably not what they intended.
oximeter/collector/src/agent.rs
Outdated
@@ -341,11 +350,13 @@ async fn collection_loop( | |||
log, | |||
"collection task received explicit request to collect" | |||
); | |||
details.start_collection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right to me for a couple reasons:
- We aren't actually starting a collection here; we're attempting to enqueue a request to start a collection, which is a difference that seems important when debugging
- If the timer fires immediately before or after this, we could call
start_collection()
twice before eitheron_*()
completes, since we have two independent queues
I'm not entirely sure how to suggest reworking this (although this seems related to my comment above about making start_collection()
return a handle, since that might at least let us avoid interleaving). Do we care about distinguishing how much of the time spent collecting was spent in the internal queue vs the actual collection request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about distinguishing how much of the time spent collecting was spent in the internal queue vs the actual collection request?
I'm not sure, but it seems like it's not super tricky to get that and it's strictly more information. I'll rework this to handle the other comment, and figure out how to include this bit too.
- Rework how we track the start of a collection, so that we can always associate a start with its correct end. - Independently track the last success and failure for a producer - Update `oximeter` OpenAPI spec - Update `omdb oximeter producer-details` with new API
@jgallagher I've reworked this all pretty significantly in 3ce03a5. It's not exactly what you laid out in your comment, since there were some awkward interactions with types that are private to the |
For completeness, I started up the control plane again locally and here's the kind of output we get now:
|
producer_details
API tooximeter
collector, which returns information about registration time, update time, and collection summaries.omdb oximeter producer-details
subcommand for printingoximeter
could report more debugging information about its producers #7125