I started to look into Text-to-Speech (TTS) synthesis on microcontrollers, with the final goal of comparing different engines.

Since I did not want to bother connecting the microcontroller to an output device, I decided to have an ESP32 render the result to a web browser before committing to any solution.

Unfortunately, there are no Arduino TTS engines that provide their result as a stream, so I started to “extend” some existing projects. The first one is “SAM”.

I created this project with the intention of providing SAM as an Arduino library that offers a simple API and supports different output alternatives.

SAM is a very small Text-To-Speech (TTS) program written in C that runs on most popular platforms. It is an adaptation to C of the speech software SAM (Software Automatic Mouth) for the Commodore C64, published in 1982 by Don’t Ask Software (now SoftVoice, Inc.). It includes a Text-To-Phoneme converter called reciter and a Phoneme-To-Speech routine for the final output. It is so small that it also works on embedded computers.

The Arduino sketch for the web server is quite small because I am using my arduino-audio-tools library. SAM writes directly to the web client's stream in a callback:

#include "AudioServer.h"
#include "sam_arduino.h"

using namespace audio_tools;  

AudioWAVServer server("ssid","password");
int channels = 1;
int bits_per_sample = 8;

// Callback which provides the audio data 
void outputData(Stream &out){
  Serial.print("providing data...");
  SAM sam(out,  false);
  sam.setOutputChannels(channels);
  sam.setOutputBitsPerSample(bits_per_sample);
  sam.say("hallo, I am SAM");
}

void setup(){
  Serial.begin(115200);
  // start data sink - provide a callback
  server.begin(outputData, SAM::sampleRate(), channels, bits_per_sample);
}


// Arduino loop  
void loop() {
  // Handle new connections
  server.doLoop();  
}
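
The sketch above passes the sample rate, channel count and bits per sample to the server because a WAV stream has to announce them before the first sample. As a rough illustration of what that means, here is a minimal, standalone C++ sketch of the canonical 44-byte PCM WAV header. It is not taken from arduino-audio-tools, and I have not checked how AudioWAVServer builds its header; `makeWavHeader` and `putLE` are hypothetical names used only for this example.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: stores the lowest `bytes` bytes of `value`
// in little-endian order, as the WAV format requires.
static void putLE(uint8_t *dst, uint32_t value, int bytes) {
  for (int i = 0; i < bytes; ++i) {
    dst[i] = static_cast<uint8_t>((value >> (8 * i)) & 0xFF);
  }
}

// Hypothetical helper: builds the canonical 44-byte header for
// uncompressed PCM audio. data_bytes is the size of the sample data.
std::array<uint8_t, 44> makeWavHeader(uint32_t sample_rate,
                                      uint16_t channels,
                                      uint16_t bits_per_sample,
                                      uint32_t data_bytes) {
  std::array<uint8_t, 44> h{};
  const uint16_t block_align = channels * bits_per_sample / 8;  // bytes per frame
  const uint32_t byte_rate = sample_rate * block_align;         // bytes per second

  std::memcpy(&h[0], "RIFF", 4);
  putLE(&h[4], 36 + data_bytes, 4);   // size of everything after this field
  std::memcpy(&h[8], "WAVE", 4);
  std::memcpy(&h[12], "fmt ", 4);
  putLE(&h[16], 16, 4);               // size of the PCM fmt chunk
  putLE(&h[20], 1, 2);                // audio format 1 = PCM
  putLE(&h[22], channels, 2);
  putLE(&h[24], sample_rate, 4);
  putLE(&h[28], byte_rate, 4);
  putLE(&h[32], block_align, 2);
  putLE(&h[34], bits_per_sample, 2);
  std::memcpy(&h[36], "data", 4);
  putLE(&h[40], data_bytes, 4);
  return h;
}
```

For the mono, 8-bit setup used here, a call would look like `makeWavHeader(sampleRate, 1, 8, dataBytes)`. When streaming, the final data size is usually not known in advance, so a real server has to put some placeholder into the two size fields; which value to use is a design choice this sketch leaves open.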

The result took quite some time to generate (23 seconds), and it does not sound great.

I am afraid that this slowness is preventing I2S from working…
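
Why slowness matters for I2S: an I2S output consumes samples at the playback rate, so the generator has to deliver them at least that fast. A small sketch of the arithmetic, assuming the 23 seconds is pure generation time; the function names and the numbers in the usage note are hypothetical, since the length of the generated audio was not measured here.

```cpp
#include <cstdint>

// Duration of the generated audio in seconds.
double audioSeconds(uint32_t num_samples, uint32_t sample_rate) {
  return static_cast<double>(num_samples) / sample_rate;
}

// Ratio of generation time to playback time.
// A value above 1.0 means the samples are produced more slowly than
// a real-time output such as I2S would consume them.
double realTimeFactor(double generation_seconds,
                      uint32_t num_samples,
                      uint32_t sample_rate) {
  return generation_seconds / audioSeconds(num_samples, sample_rate);
}
```

As a purely hypothetical example, if the phrase were 2 seconds long (44100 samples at an assumed 22050 Hz), `realTimeFactor(23.0, 44100, 22050)` would give 11.5, meaning generation runs more than ten times too slowly to feed the output directly. In that case the whole phrase would need to be buffered before playback starts.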


1 Comment

Jason · 5. September 2021 at 3:09

Sounds like the pitch and speed are doubled or more? Cut them in half, you might be surprised! Remember, the higher the value, the lower/slower the result.
