
How µSpeech Detects Phonemes


µSpeech can currently detect two phonemes with great accuracy and a number of others with lesser accuracy. µSpeech uses a series of simple algorithms to do this. Don't expect HMMs, mixture models, or any other machine-learning techniques: the Arduino cannot handle them.

First algorithm

introduced in version 1.0.0

In our speech we use three basic types of phonemes: vowels, fricatives, and plosives. µSpeech is best at handling fricatives, which include /s/, /sh/, and /f/. There are two routes µSpeech uses to identify them. First off, vowels have very clear, low-frequency waveforms, so the 'complexity' of the waveform is far lower. Sounds like /s/, /sh/, and /ch/ have very complex waveforms that are almost like white noise; they are generally made by air moving quickly through the mouth, and the voice box takes no part in them. µSpeech ranks sounds in order of their complexity:

complexity = sum(abs(derivative of sound)) / sum(abs(sound))

Letters like /v/ and /z/ are made by a combination of sound from the voice box and the mouth; these have mid-range complexity values. The threshold values require calibration on the user's side for the algorithm to work accurately. Vowels have a very low complexity score.
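
As a rough illustration, here is a minimal sketch of that ratio in plain C++. The window size, the ×100 integer scaling, and the sample values are assumptions for illustration, not the library's actual code; on the Arduino itself, the same loop would run over a small buffer of ADC readings.

```cpp
#include <cstdio>
#include <cstdlib>

// Complexity of a window of microphone samples (centred on zero):
// the sum of absolute sample-to-sample changes (the "derivative")
// divided by the sum of absolute sample values (the "integral").
// Noisy fricatives change rapidly, so they score high; smooth
// vowels score low.
int complexity(const int *samples, int n) {
    long derivative = 0;
    long integral = 0;
    for (int i = 0; i < n; ++i) {
        integral += abs(samples[i]);
        if (i > 0) derivative += abs(samples[i] - samples[i - 1]);
    }
    if (integral == 0) return 0;               // silence: avoid divide by zero
    return (int)(derivative * 100 / integral); // scaled integer ratio
}

int main() {
    int vowelLike[8]     = {0, 50, 90, 100, 90, 50, 0, -50};     // slow wave
    int fricativeLike[8] = {30, -40, 25, -35, 45, -20, 30, -50}; // noise-like
    printf("vowel-like: %d\n", complexity(vowelLike, 8));         // low score
    printf("fricative-like: %d\n", complexity(fricativeLike, 8)); // high score
}
```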

The elusive "f"

introduced in version 2.0.0

One of the most elusive fricatives is /f/. To deal with it, we have found a different algorithm: by using a simple low-pass filter, we can determine when a user says /f/. This filter requires calibration to function properly.
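
As a sketch of how such a check might look, the following applies a first-order (exponential) low-pass filter and flags a frame as /f/-like when little of its energy survives filtering. The alpha of 1/4 and the threshold are hypothetical calibration values, not ones taken from µSpeech:

```cpp
#include <cstdlib>

// Hypothetical calibrated threshold: percentage of total energy that
// may survive the low-pass filter for the frame to count as /f/.
const int F_THRESHOLD = 20;

bool looksLikeF(const int *samples, int n) {
    long lowPassEnergy = 0;
    long rawEnergy = 0;
    int filtered = 0;
    for (int i = 0; i < n; ++i) {
        // First-order IIR low-pass: y += (x - y) / 4
        filtered += (samples[i] - filtered) / 4;
        lowPassEnergy += abs(filtered);
        rawEnergy += abs(samples[i]);
    }
    if (rawEnergy == 0) return false;  // silence
    // /f/ is almost all high-frequency noise, so very little of it
    // passes the filter.
    return (lowPassEnergy * 100 / rawEnergy) < F_THRESHOLD;
}
```

On its own this would not separate /f/ from other noise-like fricatives such as /s/; presumably it is combined with the complexity score above.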

Going from letters to syllables

introduced in version 4.0.0

In the past, µSpeech detected only phonemes; recently, an accumulator vector has been added in order to identify and compare words. Version 4.2.0 will have online statistics algorithms.
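
A minimal sketch of what such an accumulator vector could look like, assuming it is a per-phoneme histogram built up over an utterance and compared against stored word templates. The phoneme set and the Manhattan distance are illustrative choices, not necessarily what the library uses:

```cpp
#include <cstdlib>
#include <cstring>

const int NUM_PHONEMES = 5;  // assumed classes, e.g. s, sh/ch, f, v/z, vowel

struct Accumulator {
    int counts[NUM_PHONEMES];

    void reset() { memset(counts, 0, sizeof(counts)); }

    // Call once per classified frame while the word is being spoken.
    void add(int phoneme) {
        if (phoneme >= 0 && phoneme < NUM_PHONEMES) counts[phoneme]++;
    }

    // Manhattan distance to a stored word template; smaller is closer.
    int distance(const int *templateCounts) const {
        int d = 0;
        for (int i = 0; i < NUM_PHONEMES; ++i)
            d += abs(counts[i] - templateCounts[i]);
        return d;
    }
};
```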
