How to determine job latency via instrumentation #1030

stevenharman · 2021-11-23T03:42:56Z

👋 Hello! I'm just getting started with OTel and we're hoping to lean on opentelemetry-ruby to get traces to the backend. I've got OpenTelemetry::SDK configured and exporting to Honeycomb for one of our smaller, lower risk components, as a bit of a live playground. In this setup Sidekiq is being instrumented via #use_all, and we're seeing those traces in the backend.

One thing I'm having trouble figuring out is how to see the latency of a job as it's being processed. Perhaps this is a limitation of my experience with the particular backend tooling (Honeycomb), but I want to make sure I understand why certain information about a Job is set as a Span attribute, while other bits are Span Events. Namely, there is a Span Event called enqueued_at added to the Job's process Span, and the timestamp of that event is the Job's enqueued_at value. Would it be equally valid to have an attribute like messaging.sidekiq.enqueued_at set, rather than, or perhaps along with, the Span Event?

I see that the Sidekiq instrumentation is using the semantic convention for Messaging, which makes no mention of a concept like enqueued_at (or perhaps sent_at, given the messaging-based focus). Is the lack of that concept in the semantic conventions perhaps the reason to use an Event?

I suppose I'm really just looking for guidance on how the instrumentation is expected to be used. That is, the latency of jobs in a work queue like Sidekiq has often been useful info to have when troubleshooting, or for alerting, etc... So am I holding it wrong? Or are my expectations off track?

Thank you for all of the time and effort y'all put into this work.

fbogsany · 2021-11-24T15:53:19Z
Maintainer

open-telemetry/opentelemetry-specification#1582 was proposed to add semantic conventions for job queue systems. Unfortunately, it was closed by the stale bot. It was very much WIP and didn't include the enqueued_at attribute.

We followed the messaging semantic conventions for Sidekiq because it was the closest thing available in the OTel semantic conventions.

I don't recall the conversation leading to using Events rather than Attributes for recording the enqueued_at timestamp. My best guess, though, is that Attributes do not permit timestamps as values, whereas Events naturally include a timestamp, and some backends may be able to render timed events using offset markers from the span, providing a visual representation of the queuing time in this case.

stevenharman Nov 24, 2021
Author

Thanks for getting back to me, @fbogsany. A few questions/comments.

> open-telemetry/opentelemetry-specification#1582 was proposed to add semantic conventions for job queue systems. Unfortunately, it was closed by the stale bot. It was very much WIP and didn't include the enqueued_at attribute.

That is unfortunate, for sure. That the RFC didn't include something akin to enqueued_at seems like a gap to me (granted, the RFC was still in flight, so...). Understanding the latency of processing a Job (or honestly, a message in the case of the Messaging convention) seems like such an important part of understanding how the system is working, debugging problems, etc... It seems almost essential. All of which leaves me feeling like I must be missing something. 🤔

> We followed the messaging semantic conventions for Sidekiq because it was the closest thing available in the OTel semantic conventions.

Makes total sense, given that the Jobs convention was only in RFC.

> Attributes do not permit timestamps as values, whereas Events naturally include a timestamp

It's true that timestamps are not valid values for Attributes. However, there is precedent, including in the Jobs RFC, for encoding timestamps as ISO 8601 strings in Attributes.
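
A minimal sketch of that encoding, assuming the job hash stores enqueued_at as epoch seconds in a Float (as Sidekiq did around the time of this thread; newer versions may use a different unit, so check yours). The helper name `enqueued_at_iso8601` is hypothetical and not part of any library:

```ruby
require 'time'

# Hypothetical helper: turn an epoch-seconds Float into an ISO 8601 String
# with millisecond precision, which is a valid Attribute value.
def enqueued_at_iso8601(epoch_seconds)
  Time.at(epoch_seconds).utc.iso8601(3)
end
```

The resulting String could then be set on a span with something like `span.set_attribute('messaging.sidekiq.enqueued_at', enqueued_at_iso8601(job['enqueued_at']))`.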

> and some backends may be able to render timed events using offset markers from the span, providing a visual representation of the queuing time in this case.

That is true. However, the timestamp of the Span Event is always going to precede the start of the containing Span, since the Job was created and enqueued before it is processed. That said, it's possible that the containing Span is a child span, and so given enough levels and context propagation, a backend might render the Span Event "inside" one of the parent Spans. But I'm wondering if that's less likely/useful than an Attribute which is directly usable within the Span. 🤔

Given all of the above, what advice would you give for proceeding? Would it make sense to change the instrumentation (for Sidekiq, etc...) to add an attribute, something like enqueued_at: ISO8601(job.enqueued_at), while also keeping the Span Events for the reasons above? Or maybe to replace the Span Events with such Attributes?

Short of changing the instrumentation, I suppose everyone who uses the instrumentation (and wants the latency of processing) will need to roll their own and/or patch the existing instrumentation? Which feels... not great.
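
For anyone rolling their own in the meantime, here is a rough sketch of what that could look like as a Sidekiq server middleware. It is not what the instrumentation does today. It assumes enqueued_at is epoch seconds as a Float, that the middleware runs inside the span created by the Sidekiq instrumentation (so the current span is the job's process span), and that the class name `EnqueuedAtAttributes` and the attribute `messaging.sidekiq.latency_ms` are made up for illustration:

```ruby
require 'time'

# Hypothetical Sidekiq server middleware that copies the job's enqueued_at
# onto the current span as attributes. The span lookup and the clock are
# injectable so the class can be exercised without OpenTelemetry loaded.
class EnqueuedAtAttributes
  def initialize(current_span: -> { OpenTelemetry::Trace.current_span },
                 clock: -> { Time.now.to_f })
    @current_span = current_span
    @clock = clock
  end

  # Sidekiq server middleware signature: (job instance, job hash, queue name).
  def call(_worker, job, _queue)
    enqueued_at = job['enqueued_at'] # assumed: epoch seconds as a Float
    if enqueued_at
      span = @current_span.call
      # ISO 8601 String, since Attributes do not accept timestamp values.
      span.set_attribute('messaging.sidekiq.enqueued_at',
                         Time.at(enqueued_at).utc.iso8601(3))
      # Time spent waiting in the queue, in whole milliseconds.
      span.set_attribute('messaging.sidekiq.latency_ms',
                         ((@clock.call - enqueued_at) * 1000).round)
    end
    yield
  end
end
```

It would be registered with `chain.add EnqueuedAtAttributes` inside `config.server_middleware` in a `Sidekiq.configure_server` block. Whether the current span at that point is really the job's span depends on middleware ordering, so treat this as a starting point rather than a drop-in fix.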

Thank you again for your insight and all of the effort y'all put into this work.

stevenharman May 9, 2023
Author

There's been some discussion of job systems, latencies, and OTel over in the contrib repo. Still no great answer, but at least some workarounds are proposed.

robbkidd Aug 25, 2023
Collaborator

Here are the results of an experiment using the OpenTelemetry Collector (Contrib Edition) to calculate job latency (a comment on the PR linked by @stevenharman, linked again here for other folks to find).
