SDK v1.13 draft: add KeywordRecognizer support to UWP VA #500
base: main
Conversation
```csharp
this.audioIntoConnectorSink.BookmarkPosition = KeywordRejectionTimeout;
this.EnsureConnectorReady();
this.logger.Log($"Starting connector");
_ = this.connector.StartKeywordRecognitionAsync(this.ConfirmationModel as KeywordRecognitionModel);
```
@trrwilson Does this line mean 2nd stage recognition will be run a second time? If so, is there a way to tell the connector that 2nd stage has already been evaluated and to skip to 3rd stage?
(I presume that if 3rd stage isn't enabled, a call to ListenOnceAsync() would work here, and that StartKeywordRecognitionAsync is being used instead to ensure 3rd stage gets called.)
Hey @tomh05! Thanks for piling on here and apologies I've been away a bit--between a week of vacation and then getting pulled into a few things, I've been more absent on this front than I'd like.
You are 100% correct that the DialogServiceConnector will unnecessarily repeat the on-device portion of the keyword spotting. Very rarely (but still occasionally) they'll even disagree, with the DialogServiceConnector not deigning to fire after the KeywordRecognizer does. That last part is likely attributable to subtle differences in the byte alignment based on how the KeywordRecognitionResult selects the audio start position, but I'm rambling.
Two parts of that:
- The SDK should support a means of doing this. We have an item on our backlog to design and implement a way to chain a KeywordRecognizer and DialogServiceConnector together to get KWV without repeating on-device KWS. I'd be curious to know from you as one of our most informed consumers, though: how do you think it should work, i.e. what code would make sense to write? There are a lot of ways to specify this--something in the config, a parameter to the Start() call, a different way of creating/annotating the input, and more--and part of the debate is what the most intuitive and clear way to expose this option/capability would be.
- In the interim, one thing that can be experimented with is using ListenOnce instead of KWS on the DialogServiceConnector (a sketch follows this list). That will not be eligible for KWV right now, but at least as an exploration/prototyping step, it'd allow a full observation of the latency benefit of skipping that second step. In my own ad hoc testing for this change, I saw that the KWS delayed the start of the stream to the service by ~200-400ms depending on configuration; that doesn't translate to a full 200-400ms of extra time until the KWV result arrives (it runs faster than real-time by a good margin), but it's still going to be a considerable increase.
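For anyone exploring that interim path, here's a minimal sketch. Names like `keywordModelPath` and the byte-pumping loop are illustrative assumptions, and the real audio adapters in this change do considerably more:

```csharp
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Dialog;

public static class InterimListenOnce
{
    public static async Task<SpeechRecognitionResult> RunAsync(
        DialogServiceConfig dialogConfig, string keywordModelPath)
    {
        // 1st/2nd stage: on-device keyword spotting; no service connection needed yet.
        var model = KeywordRecognitionModel.FromFile(keywordModelPath);
        using var micInput = AudioConfig.FromDefaultMicrophoneInput();
        using var keywordRecognizer = new KeywordRecognizer(micInput);
        KeywordRecognitionResult kwsResult = await keywordRecognizer.RecognizeOnceAsync(model);

        // Feed the result's audio (the keyword plus the live audio that follows it)
        // into a connector whose input is a push stream -- the "audio adapter" role,
        // greatly simplified; lifetime/cleanup management is omitted here.
        var pushStream = AudioInputStream.CreatePushStream();
        var connector = new DialogServiceConnector(dialogConfig, AudioConfig.FromStreamInput(pushStream));

        var resultAudio = AudioDataStream.FromResult(kwsResult);
        _ = Task.Run(() =>
        {
            var buffer = new byte[3200]; // ~100ms of 16kHz/16-bit mono PCM
            uint bytesRead;
            while ((bytesRead = resultAudio.ReadData(buffer)) > 0)
            {
                pushStream.Write(buffer, (int)bytesRead);
            }
        });

        // ListenOnceAsync() starts an ordinary turn, so the connector's on-device
        // keyword stage isn't repeated -- but no 3rd-stage (cloud) KWV happens either.
        return await connector.ListenOnceAsync();
    }
}
```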
Hi @trrwilson! Will throw my 2 cents in here 🙂
Re: SDK provision for avoiding the repetition of 2nd stage, this feels like a good fit for a setting on the Connector, because it should be consistent in an application based on the architecture. Adding a parameter on the `Start` call would imply that it could be different at the initialization of different conversations within the same app, but surely any given app would either use the `KeywordRecognizer`, or it wouldn't? If there is a use-case for apps to switch between online and offline recognition, then a parameter to the `Start` call seems like a reasonable fallback position 👍
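To make that trade-off concrete, a purely hypothetical sketch -- neither the property name nor the overload below exists in the SDK today; both are invented for illustration:

```csharp
// Option A (hypothetical): a config-level setting, fixed once for the app's
// architecture. "KeywordVerification.SkipOnDeviceStage" is an invented name.
var dialogConfig = BotFrameworkConfig.FromSubscription("<subscription-key>", "<region>");
dialogConfig.SetProperty("KeywordVerification.SkipOnDeviceStage", "true");

// Option B (also hypothetical): a per-call parameter, which would allow different
// behavior at the start of different conversations within the same app:
// await connector.StartKeywordRecognitionAsync(model, skipOnDeviceStage: true);
```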
Since we can't do 3rd stage verification just now, we can experiment with using `ListenOnce` in the short term and see what the latency improvements are like. Is there a timeline for the SDK changes to avoid repeating 2nd stage? Once 3rd stage is available to us, it would be a shame to have to pick between "have 3rd stage, but gain latency from repeating 2nd" and "lower latency, but no 3rd stage".
Purpose
Speech SDK v1.12 introduced a new `KeywordRecognizer` object that enables standalone on-device keyword matching without an active connection to Azure Speech Services. The audio associated with results from this object can then be routed into existing objects (such as the `DialogServiceConnector`) for use in existing scenarios.

This functionality has a significant benefit to voice assistant applications that may be initiated in a "cold start" situation, where the `DialogServiceConnector` won't begin processing keyword audio until a connection is established (a latency hit of several hundred milliseconds).

`KeywordRecognizer` allows us to parallelize and skip (4) and (5) above, typically saving more than 500ms in cold start and often saving multiple seconds (depending on token retrieval and connection establishment speeds). An on-device result can be obtained in parallel to networking needs, and the `DialogServiceConnector`, as a consumer of the `KeywordRecognitionResult`'s audio, can catch up after user-facing action has already begun. This addresses #486 .
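A minimal sketch of that parallelization, assuming the connector and recognizer are already constructed (simplified relative to this change's actual code):

```csharp
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Dialog;

public static class ColdStart
{
    public static async Task<KeywordRecognitionResult> ActivateAsync(
        DialogServiceConnector connector,    // assumed already constructed
        KeywordRecognizer keywordRecognizer, // assumed already constructed
        KeywordRecognitionModel model)
    {
        // Kick off token retrieval / connection establishment without awaiting it...
        Task connectTask = connector.ConnectAsync();

        // ...while the on-device recognizer listens with no network dependency.
        KeywordRecognitionResult result = await keywordRecognizer.RecognizeOnceAsync(model);

        // A local keyword result is available here, typically before the connection
        // resolves; user-facing action (earcon, UI) can begin immediately.
        await connectTask;
        return result;
    }
}
```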
Caveats: chaining a `KeywordRecognizer` into a `DialogServiceConnector` isn't trivial and requires both audio adapters and some state management. Investigation with v1.12 also revealed that multi-turn use of an audio stream derived from a `KeywordRecognitionResult` did not automatically consume recognized audio, which made effective use additionally challenging. This automatic consumption behavior is fixed in v1.13, and this change takes a dependency on that fix.

Further, since audio adapters were already necessary, this change also applies said adapters to improve the keyword rejection behavior (and remove the so-called "failsafe timer" approach):

- Previously, audio was pushed (into the `DialogServiceConnector`) as fast as possible, meaning we have no accounting of how much data is/has been consumed at any point.
- Now, rejection is keyed to a bookmark position relative to audio from the `AgentAudioProducer` -- this means we'll evaluate an audio range from approximately 1200ms before a keyword detection threshold to approximately 800ms after that threshold, and conclude "no keyword" if no confirmation result is obtained from that evaluation.
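As a rough illustration of the bookmark idea, a simplified, hypothetical sink (the actual `audioIntoConnectorSink` in this change is more involved):

```csharp
using System;

public class BookmarkingAudioSink
{
    // 16 kHz, 16-bit mono PCM: 32,000 bytes per second of audio.
    private const int BytesPerSecond = 32000;

    private long bytesConsumed;
    private bool bookmarkRaised;

    // Consumption position at which to raise the bookmark, e.g. roughly the
    // ~1200ms-before + ~800ms-after keyword evaluation window (~2s of audio).
    public TimeSpan BookmarkPosition { get; set; } = TimeSpan.MaxValue;

    public event EventHandler BookmarkReached;

    // Called from the audio adapter each time the connector reads from the sink;
    // consumed audio duration (not wall-clock time) drives the rejection decision.
    public void OnAudioConsumed(int byteCount)
    {
        bytesConsumed += byteCount;
        var consumed = TimeSpan.FromSeconds((double)bytesConsumed / BytesPerSecond);
        if (!bookmarkRaised && consumed >= BookmarkPosition)
        {
            bookmarkRaised = true;
            BookmarkReached?.Invoke(this, EventArgs.Empty);
        }
    }
}
```

If no confirmation result has arrived by the time `BookmarkReached` fires, the keyword can be treated as rejected based on how much audio was actually evaluated, rather than on a wall-clock "failsafe timer".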
Does this introduce a breaking change?
Keyword detection metrics are likely impacted by the introduction of the new objects. Efforts were made to preserve the logic but there's likely something regressed that can/should be addressed in a subsequent submission.
Pull Request Type
How to Test / What to Check
Note: as of draft time, validation is still in progress.