Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: Provide explicit timings to the Performance Monitoring API #941

Open
leops opened this issue Jan 30, 2024 · 4 comments
Open

Comments

@leops
Copy link

leops commented Jan 30, 2024

Exposing a lower-level API for manually creating transactions and spans with explicit timing values (instead of automatically reading the system clock when the spans are started / finished) would be useful to capture timing informations at a higher resolution than the default millisecond (which is often not precise enough for native code so microseconds / nanoseconds may be needed), or to build up a transaction from timing data coming from an external source (eg. performance counters from the GPU). I could easily open a PR implementing this, but it does raise the question of what the naming convention and general API design for these explicit transaction management functions should be.

@supervacuus
Copy link
Collaborator

supervacuus commented Jan 31, 2024

Hi @leops, as far as I know, microseconds are the current limit in the backend for performance/tracing events. Anything smaller than this would be truncated (or rounded), so opening the API to arbitrary scales doesn't make much sense (without considerable changes in the backend, and that discussion would have to happen in another issue tracker). Changing our timestamping to microseconds might be sensible, though.

I am unsure whether the tracing API is the proper interface for instrumentation at that scale level because we can gather a maximum of 1000 spans before a transaction must flush. Transaction flushing is a costly operation (the transport happens in a backend thread, but preparing the envelope for sending happens in the calling thread and includes serialization, something you might not want to have in a tight loop).

This sounds like the perfect use case for developer metrics, where you have client-side aggregation of arbitrary measurements (you can use it for profiling code at that low level) over time. This is implemented only in the backend and in the Python SDK (?) in a very early phase and only rolled out to some customers. However, a discussion regarding the needs of Native SDK users might already make sense.

CC: @kahest.

@leops
Copy link
Author

leops commented Feb 1, 2024

Setting the resolution limit to microseconds sounds sensible, I think collecting timings at nanosecond precision is bound to have a lot of measurement noise for one-shot captures anyway. The cost of transaction flushing doesn't sound like that much of a problem though as having an explicit API would allow for the envelope to be built up asynchronously in a background thread (this would necessarily be the case for collecting GPU timings for instance, with timestamp queries being processed asynchronously on the CPU between frames), but I agree this kind of high-resolution transaction should still avoid getting anywhere near the 1000 spans limits and only keep track of operations at a macroscopic level within that scope since performing too many measurements would once again add a lot of noise / overhead. I think the metrics feature is potentially interesting, as would profiling but there's still use case where you want to have a clear hierarchical view of what's going on when a given operation is being slower than expected, and profiling isn't supported in the Native SDK (and once again even less so on GPUs)

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 2 Feb 1, 2024
@supervacuus
Copy link
Collaborator

Hi @leops. That makes a lot of sense. So, to quickly summarize:

  • microseconds would be enough fine-grained resolution for now
  • you'd prefer to have hierarchical tracing rather than generic metrics (the latter aren't currently implemented anyway, but even if they were, performance/tracing would still be the better match for your use case)
  • you would run the transaction flush asynchronously. I mainly mention this because people use the tracing API synchronously in screen-refresh loops or UI threads, which is different from the scenario for which they were initially implemented.
  • if I got this right, you would still prefer to pass explicit timestamps or construct the span brackets yourself rather than defining them explicitly in the code, not only due to different timing resolutions but also because the measurements you want to take are not expressible in the host code. To be clear, you would still need to align these measurements with our wall clock to make these interpretable in traces.

Do you think this makes sense? I am asking to understand what kind of interface would be sufficient.

@leops
Copy link
Author

leops commented Feb 5, 2024

Yes I think this matches what I had in mind

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Status: Backlog
Development

No branches or pull requests

3 participants