MBROLA is a speech synthesizer based on the concatenation of diphones. It takes a list of phonemes as input, together with prosodic information (duration of phonemes and a piecewise linear description of pitch), and produces speech samples on 16 bits (linear), at the sampling frequency of the diphone database.
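To make that input format a bit more concrete, here is a minimal sketch of how such a phoneme list could be written out. It assumes the plain-text `.pho` layout, where each line holds a phoneme name, a duration in milliseconds and optional pairs of (position in percent of the duration, pitch in Hz). The helper `to_pho` and the phoneme names are only for illustration; the valid phoneme set depends on the diphone database you use.

```python
def to_pho(segments):
    """Render (phoneme, duration_ms, pitch_points) tuples as .pho text.

    pitch_points is a list of (position_percent, pitch_hz) pairs that
    describe the piecewise linear pitch curve inside the phoneme.
    """
    lines = []
    for phoneme, duration_ms, pitch_points in segments:
        fields = [phoneme, str(duration_ms)]
        for position_percent, pitch_hz in pitch_points:
            fields += [str(position_percent), str(pitch_hz)]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


# Illustrative phoneme names only; check the documentation of your voice.
segments = [
    ("_", 100, []),                       # leading silence
    ("h", 80, []),
    ("@", 60, [(0, 120)]),                # pitch target at the start
    ("l", 70, []),
    ("@U", 200, [(0, 130), (100, 100)]),  # pitch falls over the phoneme
    ("_", 100, []),                       # trailing silence
]

print(to_pho(segments), end="")
```

The resulting text is what gets handed to the synthesizer together with a voice database.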
Today I was playing around with MBROLA and was impressed by the quality of the speech. eSpeak provides an integration with the MBROLA speech synthesizer, so I considered integrating this implementation into my library as well.
But then I realised that the corresponding voices need at least 5 to 20 MB of memory. Unfortunately this is much more than we have available on any microcontroller!
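To put rough numbers on this, here is a small sketch that compares a voice size with a memory budget. The 264 KB figure is the on-chip SRAM of an RP2040 and is only used as an example; the function `fits` and the `reserved_bytes` parameter are made up for illustration.

```python
KB = 1024
MB = 1024 * KB


def fits(voice_bytes, available_bytes, reserved_bytes=0):
    """Return True if the voice plus a reserved amount fits the budget."""
    return voice_bytes + reserved_bytes <= available_bytes


# Example budget: on-chip SRAM of an RP2040 (assumption for illustration).
rp2040_sram = 264 * KB

for voice_mb in (5, 20):
    ok = fits(voice_mb * MB, rp2040_sram)
    print(f"{voice_mb} MB voice in {rp2040_sram // KB} KB: {'fits' if ok else 'too big'}")
```

Even the smallest voice is many times larger than the example budget.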
Bad luck…
2 Comments
knghtbrd · 26. August 2025 at 19:33
Notably, there are RP2040 boards which include 16 MB of flash in addition to what the chip gives you. If you can load the voice data into the extra flash, you’ll likely have enough program space for the engine.
pschatzmann · 26. August 2025 at 22:22
Google gives this: For instance, a single voice file can be tens to hundreds of megabytes, depending on the complexity and extent of the language data.