In this part of the lesson, you will write code to convert speech in the captured audio to text using the speech service.
The audio can be sent to the speech service using the REST API. To use the speech service, first you need to request an access token, then use that token to access the REST API. These access tokens expire after 10 minutes, so your code should request them on a regular basis to ensure they are always up to date.
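One way to keep a token up to date without requesting a new one for every call is to cache it and fetch a replacement shortly before the 10 minute expiry. The sketch below is an illustration only and is not part of the lesson code: `TokenCache`, `fetch_token`, `max_age_seconds`, and `clock` are hypothetical names, and the 9 minute refresh interval is an assumed safety margin.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Caches an access token and refreshes it before it expires.

    Hypothetical helper: fetch_token is any function that returns a
    fresh token string (for example, one that calls the token endpoint).
    """

    def __init__(self,
                 fetch_token: Callable[[], str],
                 max_age_seconds: float = 9 * 60,  # assumed margin under 10 minutes
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_token = fetch_token
        self._max_age_seconds = max_age_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if it is missing or too old."""
        now = self._clock()
        too_old = now - self._fetched_at >= self._max_age_seconds
        if self._token is None or too_old:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

The steps below define a get_access_token function. A cache like this could wrap it, for example `token_cache = TokenCache(get_access_token)`, with `token_cache.get()` called wherever a token is needed.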
- Open the smart-timer project on your Pi.
- Remove the play_audio function. This is no longer needed as you don't want a smart timer to repeat back to you what you said.
- Add the following import to the top of the app.py file:

      import requests
- Add the following code above the while True loop to declare some settings for the speech service:

      speech_api_key = '<key>'
      location = '<location>'
      language = '<language>'

  Replace <key> with the API key for your speech service resource. Replace <location> with the location you used when you created the speech service resource.

  Replace <language> with the locale name for the language you will be speaking in, for example en-GB for English, or zh-HK for Cantonese. You can find a list of the supported languages and their locale names in the Language and voice support documentation on Microsoft docs.
- Below this, add the following function to get an access token:

      def get_access_token():
          # The API key is passed to the token issuing endpoint as a header
          headers = {
              'Ocp-Apim-Subscription-Key': speech_api_key
          }

          token_endpoint = f'https://{location}.api.cognitive.microsoft.com/sts/v1.0/issuetoken'
          response = requests.post(token_endpoint, headers=headers)

          # The body of the response is the access token
          return str(response.text)

  This calls a token issuing endpoint, passing the API key as a header. This call returns an access token that can be used to call the speech services.
- Below this, declare a function to convert speech in the captured audio to text using the REST API:

      def convert_speech_to_text(buffer):
- Inside this function, set up the REST API URL and headers:

          url = f'https://{location}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1'

          headers = {
              'Authorization': 'Bearer ' + get_access_token(),
              'Content-Type': f'audio/wav; codecs=audio/pcm; samplerate={rate}',
              'Accept': 'application/json;text/xml'
          }

          params = {
              'language': language
          }

  This builds a URL using the location of the speech services resource. It then populates the headers with the access token from the get_access_token function, as well as the sample rate used to capture the audio. Finally, it defines some parameters to be passed with the URL containing the language in the audio.
- Below this, add the following code to call the REST API and get back the text:

          response = requests.post(url, headers=headers, params=params, data=buffer)
          response_json = response.json()

          if response_json['RecognitionStatus'] == 'Success':
              return response_json['DisplayText']
          else:
              return ''

  This calls the URL and decodes the JSON value that comes in the response. The RecognitionStatus value in the response indicates if the call was able to extract speech into text successfully, and if this is Success then the text is returned from the function, otherwise an empty string is returned.
- Above the while True: loop, define a function to process the text returned from the speech to text service. This function will just print the text to the console for now.

      def process_text(text):
          print(text)
- Finally, replace the call to play_audio in the while True loop with a call to the convert_speech_to_text function, passing the text to the process_text function:

      text = convert_speech_to_text(buffer)
      process_text(text)
- Run the code. Press the button and speak into the microphone. Release the button when you are done, and the audio will be converted to text and printed to the console.

      pi@raspberrypi:~/smart-timer $ python3 app.py
      Hello world.
      Welcome to IoT for beginners.

  Try different types of sentences, along with sentences where words sound the same but have different meanings. For example, if you are speaking in English, say 'I want to buy two bananas and an apple too', and notice how it will use the correct to, two and too based on the context of the word, not just its sound.
💁 You can find this code in the code-speech-to-text/pi folder.
😀 Your speech to text program was a success!
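As an optional hardening step, the response handling in convert_speech_to_text could be made more defensive. The lesson code calls response.json() and indexes RecognitionStatus directly, both of which raise an exception if the body is not the expected JSON, for instance after a failed request. The sketch below is one possible approach, not the lesson's code: `extract_display_text` is a hypothetical helper, and treating every non-200 status as "no text" is an assumption.

```python
import json


def extract_display_text(status_code: int, body: str) -> str:
    """Return the recognized text, or '' if the request or recognition failed.

    Hypothetical helper: takes the HTTP status code and raw response body
    so it can be checked without calling the speech service.
    """
    # Assumption: anything other than HTTP 200 is treated as a failed request
    if status_code != 200:
        return ''

    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # The body was not valid JSON, so there is no text to return
        return ''

    if not isinstance(payload, dict):
        return ''

    # Only a 'Success' status carries usable text
    if payload.get('RecognitionStatus') != 'Success':
        return ''

    return payload.get('DisplayText', '')
```

Inside convert_speech_to_text, this could be called as `return extract_display_text(response.status_code, response.text)` in place of the if/else block.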