The starting point for doing speech recognition on an Arduino-based board is TensorFlow Lite for Microcontrollers with the example sketch called micro_speech.

I presented a DRAFT proposal for a Micro Speech API in one of my recent posts. In the meantime I have finally managed to adapt the micro_speech example from TensorFlow Lite to follow the philosophy of my Arduino Audio Tools Library. The example uses a TensorFlow model which can recognise the words ‘yes’ and ‘no’. The output stream class is TfLiteAudioStream. In the example I am using an AudioKit, but you can replace this with any type of microphone.

The Arduino Sketch

Here is the complete Arduino Sketch. The audio source is the microphone of the AudioKit and the audio sink is the TfLiteAudioStream class, which classifies the audio into 4 different output labels.

#include "AudioTools.h"
#include "AudioLibs/AudioKit.h"
#include "AudioLibs/TfLiteAudioStream.h"
#include "model.h"  // tensorflow model

AudioKitStream kit;     // Audio source
TfLiteAudioStream tfl;  // Audio sink
const char* kCategoryLabels[4] = {
    "silence",
    "unknown",
    "yes",
    "no",
};
StreamCopy copier(tfl, kit);  // copy mic to tfl
int channels = 1;
int samples_per_second = 16000;

void respondToCommand(const char* found_command, uint8_t score,
                      bool is_new_command) {
  if (is_new_command) {
    char buffer[80];
    sprintf(buffer, "Result: %s, score: %d, is_new: %s", found_command,
            score, is_new_command ? "true" : "false");
    Serial.println(buffer);
  }
}

void setup() {
  Serial.begin(115200);
  AudioLogger::instance().begin(Serial, AudioLogger::Warning);

  // setup Audiokit Microphone
  auto cfg = kit.defaultConfig(RX_MODE);
  cfg.input_device = AUDIO_HAL_ADC_INPUT_LINE2;
  cfg.channels = channels;
  cfg.sample_rate = samples_per_second;
  cfg.use_apll = false;
  cfg.auto_clear = true;
  cfg.buffer_size = 512;
  cfg.buffer_count = 16;
  kit.begin(cfg);

  // Setup tensorflow output
  auto tcfg = tfl.defaultConfig();
  tcfg.setCategories(kCategoryLabels);
  tcfg.channels = channels;
  tcfg.sample_rate = samples_per_second;
  tcfg.kTensorArenaSize = 10 * 1024;
  tcfg.respondToCommand = respondToCommand;
  tcfg.model = g_model;
  tfl.begin(tcfg);
}

void loop() { copier.copy(); }

The key information that must be provided in the TensorFlow configuration is:

  • number of channels
  • sample rate
  • kTensorArenaSize
  • a callback for handling the responses (respondToCommand)
  • the model
  • the labels

Like in any other audio sketch, we just need to copy the data from the input to the output class.

Overall Processing Logic

The TfLiteAudioStream applies a Fast Fourier Transform (FFT) to slices of audio data (defined by kFeatureSliceStrideMs and kFeatureSliceDurationMs); each FFT result has the length kFeatureSliceSize. These results are used to update a spectrogram (with the size kFeatureSliceSize x kFeatureSliceCount). After 2 (kSlicesToProcess) new FFT results have been added to the end, TensorFlow evaluates the updated spectrogram to calculate the classification result. These results are post-processed (in the TfLiteRecognizeCommands class) to make sure that the result is stable.
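To make the slicing logic more concrete, here is a minimal host-side C++ sketch of a rolling spectrogram. It is not the library's implementation: the class RollingSpectrogram and its addSlice method are hypothetical, the int8_t feature type is an assumption, and the constant values are illustrative ones borrowed from the usual micro_speech settings rather than taken from this post. Only the constant names and the value 2 for kSlicesToProcess come from the description above.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative values (assumptions, not confirmed for this library).
constexpr int kSampleRate = 16000;           // matches the sketch above
constexpr int kFeatureSliceDurationMs = 30;  // length of one audio slice
constexpr int kFeatureSliceStrideMs = 20;    // distance between slice starts
constexpr int kFeatureSliceSize = 40;        // values per FFT result
constexpr int kFeatureSliceCount = 49;       // slices kept in the spectrogram
constexpr int kSlicesToProcess = 2;          // new slices before inference

// Number of audio samples covered by one slice and by one stride.
constexpr int kSamplesPerSlice = kSampleRate * kFeatureSliceDurationMs / 1000;
constexpr int kSamplesPerStride = kSampleRate * kFeatureSliceStrideMs / 1000;

// Hypothetical helper: keeps the most recent kFeatureSliceCount slices.
class RollingSpectrogram {
 public:
  // Drops the oldest slice, appends the new one at the end and returns
  // true once kSlicesToProcess new slices have arrived, which is the
  // moment at which the model would be evaluated.
  bool addSlice(const int8_t* slice) {
    // Shift all existing slices one position toward the front.
    std::copy(data_.begin() + kFeatureSliceSize, data_.end(), data_.begin());
    // Write the new slice into the last position.
    std::copy(slice, slice + kFeatureSliceSize,
              data_.end() - kFeatureSliceSize);
    if (++pending_ >= kSlicesToProcess) {
      pending_ = 0;
      return true;
    }
    return false;
  }

  const int8_t* data() const { return data_.data(); }

  static constexpr std::size_t size() {
    return static_cast<std::size_t>(kFeatureSliceSize) * kFeatureSliceCount;
  }

 private:
  std::array<int8_t, kFeatureSliceSize * kFeatureSliceCount> data_{};
  int pending_ = 0;
};
```

With these assumed values, one slice covers 480 samples and a new slice starts every 320 samples, so consecutive slices overlap by 10 ms. Evaluating the model only on every second slice trades a little latency for fewer inference runs.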


Building the TensorFlow Model

Here is the relevant Jupyter notebook. I am also providing the necessary files to run it in Docker. Just execute docker-compose up and connect to http://localhost:8888.

