This document describes the API used to communicate with the SEPIA Speech-To-Text (STT) Server.
[UNDER CONSTRUCTION: Please create an issue to push me and update this :-p]
In the meantime please follow the discussion here: SEPIA-Framework/sepia-docs#112
or check out the Javascript client and test-page for code examples.
Communication with the server is handled via WebSocket messages in JSON format. Each message has a specific 'type' that defines its purpose (e.g. handle configuration, send audio chunks etc.). The general flow of events is as follows (a minimal code sketch follows the list):
- The client opens a WebSocket connection using the URL `ws://[server-ip]:[port]/socket` (or `wss://` if you use custom SSL) and listens for the `onopen` event.
- If the `onopen` event arrives, the client sends a 'welcome' message in JSON format that contains the authentication data and the desired ASR configuration. It then checks `onmessage` for a response of the same type and `onerror` for ... errors.
- If the welcome message was confirmed, the client starts sending chunks of audio (raw audio buffer) to the server. This can be a typed array, blob or binary data, depending on your client language.
- The server starts the transcription process and will send "partial" and "final" results (type 'result') in JSON format. The client receives the results via the `onmessage` handler.
- Depending on the settings (continuous=true/false) or other conditions, either the server or the client will end the process and close the connection. The client listens for `onclose` and `onerror` events.
- The client can close the connection at any time by sending a JSON message of type 'audioend', indicating that it will not send any more audio chunks. The server will then try to finalize the running process and send a final result.
To check if the server is actually online you can send a simple HTTP GET request to the 'ping' endpoint: `[server-ip]:[port]/ping`.
The answer will be something like this: `{"result":"success","server":"SEPIA STT Server","version":"0.9.5"}`.
There is another GET endpoint called `/settings` that will give you some more details, e.g.:
{
"result": "success",
"settings": {
"version": "0.9.5",
"engine": "vosk",
"models": ["vosk-model-small-de", "vosk-model-small-en-us"],
"languages": ["de-DE", "en-US"],
"features": ["partial_results", "alternatives", "words_ts", "phrase_list", "speaker_detection"]
}
}
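For example, a quick status check from JavaScript could look like this (a sketch using `fetch`; `serverIp` and `port` are placeholders for your server address):

```javascript
// Ping the server and read its settings (sketch)
var base = "http://" + serverIp + ":" + port;

fetch(base + "/ping")
    .then(function (res) { return res.json(); })
    .then(function (json) {
        console.log("Online:", json.result === "success", "- version:", json.version);
    });

fetch(base + "/settings")
    .then(function (res) { return res.json(); })
    .then(function (json) {
        console.log("Engine:", json.settings.engine, "- models:", json.settings.models);
    });
```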
The 'welcome' message should be sent after the WebSocket `onopen` event is received. It authenticates the user and tells the server which model and parameters should be used for speech recognition.
The 'data' parameter (here we name it 'optionsData') defines things like the samplerate (almost always 16000), whether the ASR process stops after a "final" result (continuous=true/false), what language and ASR model to use, etc.
If the 'model' parameter is not given, the server will choose the first available model for the given 'language'. NOTE: 'model' can overrule 'language' if there is a mismatch.
optionsData = {
"samplerate": 16000,
"language": "en-US",
"model": "vosk-model-small-en-us",
"optimizeFinalResult": true,
"alternatives": 1,
"continuous": false,
...
}
Some engines can have additional parameters, like "phrases" for Vosk. You can use the included demos to play with the available options.
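As a sketch, a Vosk phrase list could be added to the options like this (the exact behavior depends on engine and model, and the phrase values here are made up; the welcome response below suggests "phrases" is a list of strings):

```javascript
// Engine-specific option (Vosk): bias recognition toward expected phrases
optionsData.phrases = ["turn on the light", "stop the timer"];
```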
Send the event:
websocket.send(JSON.stringify({
    "type": "welcome",
    "data": optionsData,
    "client_id": clientId,
    "access_token": accessToken,
    "msg_id": messageId
}))
- `type` - Socket message type, in this case 'welcome'.
- `data` - Configuration data for the ASR process, see 'optionsData' above.
- `client_id` and `access_token` - Defined inside server.conf, e.g. 'user001' and 'ecd71870d19...' or by default 'any' and 'test1234' (`common_auth_token`).
- `msg_id` - An incrementing number you can assign to any message to track responses (e.g.: 1, 2, ...). See the helper sketch below.
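One simple way to implement the `msg_id` tracking is a small send helper (a sketch, not part of the official API):

```javascript
// Attach an auto-incremented msg_id to every JSON message to match responses later
var msgId = 0;
function sendJson(websocket, msg) {
    msg.msg_id = ++msgId;
    websocket.send(JSON.stringify(msg));
    return msg.msg_id;
}
```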
If the welcome event was successful, you will get a response like this (example for Vosk, msg_id=1):
{
"type": "welcome",
"msg_id": 1,
"code": 200,
"info": {
"version": "0.9.5",
"engine": "vosk",
"models": ["vosk-model-small-de", "vosk-model-small-en-us"],
"languages": ["de-DE", "en-US"],
"features": ["partial_results", "alternatives", "words_ts", "phrase_list", "speaker_detection"],
"options": {
"language": "en-US",
"model": "vosk-model-small-en-us",
"samplerate": 16000,
"optimizeFinalResult": true,
"alternatives": 1,
"continuous": false,
"words": false,
"speaker": false,
"phrases": []
}
}
}
The object contains the actual, active settings applied in response to your welcome request, plus some additional info like the available models, languages and features of the server.
If something went wrong, e.g. a failed authentication, you will get an error message in return.
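A handler for the confirmation could look like this (a sketch; the exact error message format is not documented here, so the non-200 check is an assumption based on the 'code' field shown above):

```javascript
ws.onmessage = function (event) {
    var msg = JSON.parse(event.data);
    if (msg.type === "welcome" && msg.code === 200) {
        console.log("Active options:", msg.info.options);
        // safe to start sending audio now
    } else if (msg.code && msg.code !== 200) {
        // assumed error shape: non-200 'code', e.g. failed authentication
        console.error("Welcome failed:", msg);
    }
};
```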
TBD
TBD
TBD