In my last couple of blog posts I compared the following Text-to-Speech (TTS) libraries that are available for Arduino:

  • SAM: Software Automatic Mouth
  • TTS: Text-to-Speech Library for Arduino
  • Flite: Festival Lite

I was hoping to find some TinyML-based implementations, but so far without success. I have put this on my to-do list for some long, cold winter days.

In conclusion, the sound quality is directly related to the memory consumption, so we might never get high-quality speech generated on microcontrollers because we just don’t have enough memory available. I think there is a good reason why Google and Amazon only provide their TTS functionality over the network.

An alternative approach is to record all required words, store them on an SD card and just play back these recordings to generate the sound output, as demonstrated in my arduino-simple-tts project.
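
The idea can be sketched in plain C++: split the text into words and look up one recording per word. This is only an illustration and not how arduino-simple-tts is implemented; the names `fileForWord` and `wordsToFiles` and the `/words/<word>.wav` file layout are assumptions made up for this example.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical naming scheme: one recording per word, stored as
// /words/<lowercase word>.wav on the SD card.
std::string fileForWord(const std::string& word) {
  assert(!word.empty());  // an empty word has no recording
  return "/words/" + word + ".wav";
}

// Hypothetical helper: turns a sentence into the list of recordings to play,
// in the order in which the words are spoken.
std::vector<std::string> wordsToFiles(const std::string& text) {
  std::vector<std::string> files;
  std::string word;
  for (unsigned char c : text) {
    if (std::isalnum(c)) {
      word += static_cast<char>(std::tolower(c));
    } else if (!word.empty()) {
      // whitespace or punctuation ends the current word
      files.push_back(fileForWord(word));
      word.clear();
    }
  }
  if (!word.empty()) {
    files.push_back(fileForWord(word));  // last word without trailing separator
  }
  return files;
}
```

Calling `wordsToFiles("Hello, my name is Alice")` yields five file names, which could then be played one after the other. The obvious limitation is that only words with an existing recording can be spoken.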

I think the best option for dynamically generated TTS is to delegate the speech generation (and maybe even the output) to a separate machine. A Raspberry Pi already makes all the difference, and there are plenty of resources on the internet that cover this topic.

My TTS projects of choice are:

  • Rhasspy, which provides multiple TTS implementations and a simple REST API
  • Mozilla TTS, which implements some state-of-the-art models

Sending a POST request to the Rhasspy URL “http://address:12101/api/text-to-speech” returns a WAV file. Here is the corresponding Arduino sketch, which sends the request to Rhasspy and writes the output to I2S:

#include "AudioTools.h"
#include "AudioCodecs/CodecWAV.h"

using namespace audio_tools;  

// URLStream -copy-> EncodedAudioStream (WAVDecoder) -> I2SStream
URLStream url("ssid","password");       // WiFi credentials
I2SStream i2s;                          // I2S output
WAVDecoder decoder;                     // decodes WAV to PCM
EncodedAudioStream out(&i2s, &decoder); // passes the decoded PCM on to I2S
StreamCopy copier(out, url);            // copies from url to out


void setup() {
  Serial.begin(115200);
  AudioLogger::instance().begin(Serial, AudioLogger::Debug);  

  // set up I2S output
  auto config = i2s.defaultConfig(TX_MODE);
  config.sample_rate = 16000;
  config.bits_per_sample = 16;
  config.channels = 1;
  i2s.begin(config);

  // Rhasspy: POST the text, receive the WAV data
  url.begin("http://192.168.1.37:12101/api/text-to-speech?play=false",  "text/plain",POST,"Hallo, my name is Alice");
}

void loop(){
  // copy audio from url -> i2s
  if (!copier.copy()) {
    i2s.end();
    LOGI("stopped");
    stop();
  }
}

This sketch, which is part of the arduino-audio-tools library, is also available on GitHub.
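
To make clearer what a POST request to this endpoint looks like on the wire, here is a small sketch in standard C++ that builds a plain HTTP/1.1 request for the same URL. It is an illustration only: `buildTtsRequest` is a made-up name, and the exact headers that URLStream sends are not reproduced here and may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: formats an HTTP/1.1 POST request for Rhasspy's
// /api/text-to-speech endpoint. The text to speak is sent as the
// plain-text request body.
std::string buildTtsRequest(const std::string& host, int port, const std::string& text) {
  assert(port > 0 && port <= 65535);  // valid TCP port range
  std::string req;
  req += "POST /api/text-to-speech?play=false HTTP/1.1\r\n";
  req += "Host: " + host + ":" + std::to_string(port) + "\r\n";
  req += "Content-Type: text/plain\r\n";
  req += "Content-Length: " + std::to_string(text.size()) + "\r\n";
  req += "Connection: close\r\n";
  req += "\r\n";  // blank line separates the headers from the body
  req += text;
  return req;
}
```

For example, `buildTtsRequest("192.168.1.37", 12101, "Hallo, my name is Alice")` produces the request text that would be written to a TCP connection; the response body is then the WAV data.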

If your microcontroller does not support I2S, you can use one of the following output classes instead:

  • AnalogAudioStream
  • PWMAudioOutput
  • VS1053Stream
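
Whichever output is used, it has to know the sample rate, channel count and bit depth of the audio. A WAV file carries this information in its header. The sketch below shows, in plain C++ and independent of AudioTools, how these values can be read from the canonical 44-byte PCM header. `WavInfo` and `parseWavHeader` are names made up for this example, and real files may contain extra chunks that this sketch does not handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical struct for this example, not an AudioTools type.
struct WavInfo {
  uint16_t channels = 0;
  uint32_t sampleRate = 0;
  uint16_t bitsPerSample = 0;
};

// WAV stores integers in little-endian order; read them byte by byte
// so that the code does not depend on the byte order of the host.
static uint16_t readU16(const uint8_t* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}
static uint32_t readU32(const uint8_t* p) {
  return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
         (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Parses the canonical layout: "RIFF" <size> "WAVE" "fmt " <16> <format fields>.
// Returns false if the buffer is shorter than 44 bytes, the magic strings
// do not match, or the audio format is not uncompressed PCM.
bool parseWavHeader(const uint8_t* data, size_t len, WavInfo& info) {
  assert(data != nullptr);
  if (len < 44) return false;
  if (std::memcmp(data, "RIFF", 4) != 0) return false;
  if (std::memcmp(data + 8, "WAVE", 4) != 0) return false;
  if (std::memcmp(data + 12, "fmt ", 4) != 0) return false;
  if (readU16(data + 20) != 1) return false;  // audio format 1 = PCM
  info.channels = readU16(data + 22);
  info.sampleRate = readU32(data + 24);
  info.bitsPerSample = readU16(data + 34);
  return true;
}
```

For the sketch above one would expect a header reporting 16000 Hz, 1 channel and 16 bits, which matches the values assigned to `config`. The PCM samples start after the header.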

Addendum

A lot has happened since I wrote this library. The generic TTS Arduino library with by far the best audio quality is my arduino-espeak-ng.

Here is the updated list of all my blog posts that cover TTS on microcontrollers.


4 Comments

Inso · 25. July 2024 at 10:34

Hiho,

I now had some time to start with the code, starting with this example.
Could it be that it is a bit outdated, or am I using the wrong library?

I have added
https://github.com/pschatzmann/arduino-audio-tools
“AudioTools.h” is found but “CodecWAV.h” I had to change to “AudioCodecs/CodecWAV.h”. Hope that is the right file?

Now the WAVDecoder does not accept the i2s object, instead AudioDecoderExt object is needed (assume 2nd object must be AudioFormat::PCM).

AudioOutputStream does not name a type at all, and I cannot find that class in the latest release.

StreamCopy expects AudioStream or AudioOutput now, but they are abstract..

I am sure that after spending some time with the examples of the lib I will get it to work, but maybe it is also interesting for a potential reader here to know if this is deprecated, what to use and so on 🙂

Best regards

    pschatzmann · 25. July 2024 at 10:53

    I suggest taking the examples directly from the AudioTools library, since the posts only reflect the state of the library at the time of posting:
    – the using statement is no longer necessary
    – the WAV decoder has moved, so use #include "AudioCodecs/CodecWAV.h"
    – the use of WAVDecoder has changed and you now need to use an EncodedAudioStream

Inso · 17. July 2024 at 19:05

Hi,

I found this article quite interesting. I actually plan to have an ESP32 S3 project with TTS, but also would like to have it as a speech to command system.
When reading about Rhasspy here I saw that it also offers this functionality.

Sadly, the only information I could find is: “Rhasspy receives audio over MQTT using the Hermes protocol[..]”. I already have an MQTT server running. Check.

There is already the “ESP32-Rhasspy-Satellite” project which offers all in one, but it is doing a lot of stuff under the hood and has a lot of stuff implemented I would not like to use, f.e. the LED library and so on. I already have an ESP32 MQTT WLan Error Handling implementation I would like to use including my whole way of handling the flow and so on.
Sadly, lots of code from the project is hidden, so I cannot just extract what I need. I can see what libs are used, but they do not help.

Here on your blog, you have a strong focus on audio. Is there by any chance a blog post that can help here (took a look on the index, but from the names it seems there is nothing which could help), or maybe you have some ideas on this?

My plan for this is something like:
– read data from MEMS microphone. Saw on https://www.pschatzmann.ch/home/2021/04/29/bluetooth-a2dp-streaming-from-an-digital-i2s-microphone/ that you just read from I2S, but is this already WAV? Or is conversion needed?
– use a ring buffer to store before sending (so I can take the block size I need from it step by step?) https://github.com/CDFER/Ring-Buffer-Demo-ESP32-Arduino
– send this data by MQTT (topic as mentioned at https://rhasspy.readthedocs.io/en/latest/audio-input/)
My question here is what they mean by using the Hermes protocol (there is not much to find which helps me, Wiki page is very slim..). Also: what is the byte size?
And I wonder what is the best way to determine when to stop the audio streaming to Rhasspy.

For starting the recording, I would use a separate local wake word lib like https://github.com/kahrendt/microWakeWord to trigger the whole process.

Would be great if you would have some input on this, or could share your thoughts about my ideas (as you maybe can see from my text, I am not experienced with audio stuff at all^^).

Best regards

Inso
