Last week I was asked whether my Arduino Audio Tools project supports the playback of M4A files. It didn’t, but I thought this would be a good opportunity to try out Vibe Coding: the problem space is well documented and well established, so it seemed like a good fit. Needless to say, at that point I did not know anything about M4A. So I just asked Claude to generate an audio file extractor class for AAC that follows my Container API.
It generated a lot of detailed code that looked very promising at first glance. But running some tests proved that it did not work, and after looking at the code and the execution logic in the debugger I came to the conclusion that it was hopeless and impossible to fix!
Then I had a fairly detailed discussion with ChatGPT, which led me to a better understanding of what needs to be done:
- Parse the MP4 container format and handle the relevant boxes only
- Build a sample table that gives the sizes of the records to be played back
- For AAC, get the profile, sample rate index and channel index, and use them to build an ADTS header that is added before the samples
- For ALAC, extract the magic cookie and provide it to the decoder
- Return the audio data contained in the mdat box so that it can be sent to a decoder
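The ADTS step in the list above can be sketched as follows. This is a minimal illustration of the standard 7-byte ADTS header layout (without CRC), not the code used in the project; the function name `makeAdtsHeader` and its parameters are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: builds a 7-byte ADTS header (no CRC) for one AAC frame.
// audioObjectType: MPEG-4 audio object type (2 = AAC LC); ADTS stores it minus 1.
// sampleRateIndex: 4-bit index into the standard rate table (4 = 44100 Hz).
// channelConfig:   3-bit channel configuration (2 = stereo).
// payloadSize:     size of the raw AAC frame in bytes (header is added on top).
std::array<uint8_t, 7> makeAdtsHeader(uint8_t audioObjectType,
                                      uint8_t sampleRateIndex,
                                      uint8_t channelConfig,
                                      size_t payloadSize) {
  const uint32_t frameLength = uint32_t(payloadSize) + 7;  // 13-bit field
  const uint8_t profile = uint8_t((audioObjectType - 1) & 0x03);
  std::array<uint8_t, 7> h{};
  h[0] = 0xFF;  // syncword, upper 8 bits
  h[1] = 0xF1;  // syncword low 4 bits, MPEG-4, layer 0, no CRC
  h[2] = uint8_t((profile << 6) | ((sampleRateIndex & 0x0F) << 2) |
                 ((channelConfig >> 2) & 0x01));
  h[3] = uint8_t(((channelConfig & 0x03) << 6) | ((frameLength >> 11) & 0x03));
  h[4] = uint8_t((frameLength >> 3) & 0xFF);
  h[5] = uint8_t(((frameLength & 0x07) << 5) | 0x1F);  // + buffer fullness high bits
  h[6] = 0xFC;  // buffer fullness low bits, one raw data block per frame
  return h;
}
```

The header would be written to the decoder immediately before each raw AAC frame taken from the mdat box.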
The M4A Audio Format
The M4A audio format is based on the MP4 container structure, which organizes data into a hierarchy of boxes. Each box begins with a 4-byte big-endian size field, followed by a 4-byte type identifier. Container boxes can nest other boxes and typically contain no data of their own, although exceptions exist. In contrast, non-container boxes hold actual data.
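To make the box layout concrete, here is a minimal sketch that reads one box header from a byte buffer. The `BoxHeader` struct and the function names are hypothetical and only serve this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper type for this sketch; not part of any library.
struct BoxHeader {
  uint32_t size;  // total box size in bytes, including the 8-byte header
  char type[5];   // four-character type code plus terminating '\0'
};

// Reads a 32-bit big-endian integer from p.
static uint32_t readBE32(const uint8_t* p) {
  return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
         (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Fills 'out' from the first 8 bytes of 'data'.
// Returns false if fewer than 8 bytes are available.
// Not handled here: size == 1 (a 64-bit size follows the type) and
// size == 0 (the box extends to the end of the file).
bool readBoxHeader(const uint8_t* data, size_t len, BoxHeader& out) {
  if (len < 8) return false;
  out.size = readBE32(data);
  std::memcpy(out.type, data + 4, 4);
  out.type[4] = '\0';
  return true;
}
```

For a container box, the same function would be called again on the bytes that follow the header to walk its children.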
Audio samples are stored in the mdat box, but to extract them correctly, information from the stsz box is required. The stsz box indicates the total number of samples and either a fixed sample size or, if the size is zero, a table of individual sample sizes. This data is essential to determine how many bytes to read from mdat for each sample. Additionally, the child boxes within the stsd box reveal the audio format, such as alac (Apple Lossless), mp4a (AAC), or mp3.
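The stsz layout described above can be read roughly like this. It is a sketch under the assumption that `payload` points just behind the 8-byte box header; `SampleSizes` and `parseStsz` are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch.
struct SampleSizes {
  uint32_t fixedSize = 0;       // non-zero: every sample has this size
  uint32_t count = 0;           // total number of samples
  std::vector<uint32_t> table;  // per-sample sizes, used when fixedSize == 0
  uint32_t sizeOf(uint32_t i) const {
    return fixedSize != 0 ? fixedSize : table[i];
  }
};

// Reads a 32-bit big-endian integer from p.
static uint32_t readBE32(const uint8_t* p) {
  return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
         (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Payload layout: version/flags (4 bytes), sample size (4), sample count (4),
// followed by 'count' 4-byte entries when the sample size is zero.
bool parseStsz(const uint8_t* payload, size_t len, SampleSizes& out) {
  if (len < 12) return false;
  out.fixedSize = readBE32(payload + 4);
  out.count = readBE32(payload + 8);
  out.table.clear();
  if (out.fixedSize != 0) return true;             // no table present
  if (out.count > (len - 12) / 4) return false;    // table is truncated
  out.table.reserve(out.count);
  for (uint32_t i = 0; i < out.count; ++i) {
    out.table.push_back(readBE32(payload + 12 + 4 * size_t(i)));
  }
  return true;
}
```

Kept in memory like this, the table costs 4 bytes per sample, which adds up quickly for long files.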
How to eat an elephant: cut him up into little pieces!
So I decided to split the problem up into different C++ classes:
The MP4 Parser
I expected to find an existing MP4 parser that is simple, lean and tested, runs on a microcontroller, and provides a simple callback API where I can just plug in some callbacks with the relevant logic. To my big surprise I found nothing!
So I was forced to do this myself: I wrote and tested my MP4Parser class which, to keep things simple, expects each box to fit into the RAM buffer. Here I had quite a few surprises: in some rare cases it did not find a valid box at the expected location, so I added some fallback error handling that just scans for the next box and reports the skipped data under the last box type.
As a next step, I extended this with an MP4ParserIncremental class, which removes this restriction and can provide the box content incrementally when the buffer is too small. After that functionality was tested, I integrated it back into the MP4Parser.
This was quite simple to test: just execute the test sketch and check if you find the boxes at the indicated locations!
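One way such a "scan for the next box" fallback could look is sketched below. The plausibility check (a size of at least 8 followed by four alphanumeric type characters) is an assumption made for this example and not necessarily what MP4Parser does; it would also miss type codes that contain other characters.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>

// Reads a 32-bit big-endian integer from p.
static uint32_t readBE32(const uint8_t* p) {
  return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
         (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Heuristic (assumption): a header is plausible if the size covers at least
// the 8 header bytes and the type consists of 4 alphanumeric characters.
static bool looksLikeBoxHeader(const uint8_t* p) {
  if (readBE32(p) < 8) return false;
  for (int i = 4; i < 8; ++i) {
    if (!std::isalnum(p[i])) return false;
  }
  return true;
}

// Returns the offset of the next plausible box header at or after 'start',
// or -1 if none is found. Bytes between 'start' and the returned offset
// are the data that was skipped.
long findNextBox(const uint8_t* data, size_t len, size_t start) {
  for (size_t pos = start; pos + 8 <= len; ++pos) {
    if (looksLikeBoxHeader(data + pos)) return long(pos);
  }
  return -1;
}
```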
Extracting the Audio Data
Next I decided to have a separate M4AAudioDemuxer class which implements all the necessary parser callbacks and the data extraction logic, and provides the result as frame entries in a callback. This class basically
- forwards the written data to the parser
- implements the M4A data extraction logic
- and provides the result via a callback
So this class is where the major work was done. With the parsing issues out of the way, I could concentrate on the logical issues. The two major ones were the following:
- The logic proposed by Claude for extracting the AAC profile, sample rate index and channel index just produced wrong data. I tried quite a few alternative AI models until one of them proposed a working solution.
- The proposed logic for extracting the ALAC magic cookie was also giving wrong values, until I figured out that there is an alac box inside the alac box that contains the correct data.
With these issues out of the way, the decoding of audio started to work…
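For reference, the three AAC values are usually packed into the first two bytes of the AudioSpecificConfig, which MP4 files carry as decoder-specific info inside the esds box. The following is a minimal sketch of that bit layout, not the code used in M4AAudioDemuxer; `AacConfig` and `parseAudioSpecificConfig` are hypothetical names, and locating the bytes inside esds is not shown.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch.
struct AacConfig {
  uint8_t audioObjectType;  // 5 bits, e.g. 2 = AAC LC
  uint8_t sampleRateIndex;  // 4 bits, e.g. 4 = 44100 Hz, 3 = 48000 Hz
  uint8_t channelConfig;    // 4 bits, e.g. 2 = stereo
};

// Decodes the common 2-byte form: 5 bits object type, 4 bits sample rate
// index, 4 bits channel configuration.
// Returns false for the escape values (object type 31, rate index 15),
// which use a longer encoding that this sketch does not handle.
bool parseAudioSpecificConfig(const uint8_t* asc, size_t len, AacConfig& out) {
  if (len < 2) return false;
  out.audioObjectType = uint8_t(asc[0] >> 3);
  out.sampleRateIndex = uint8_t(((asc[0] & 0x07) << 1) | (asc[1] >> 7));
  out.channelConfig = uint8_t((asc[1] >> 3) & 0x0F);
  if (out.audioObjectType == 31 || out.sampleRateIndex == 15) return false;
  return true;
}
```

The bytes 0x12 0x10, for example, decode to AAC LC at 44100 Hz in stereo.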
The Container API
The final ContainerM4A, which is a ContainerDecoder subclass, was quite simple to implement:
- Get a MultiDecoder via the constructor, so that we can support the relevant audio formats: AAC and ALAC
- Just subscribe to the data provided by the M4AAudioDemuxer callback
- When receiving a Frame from the callback, just select the right decoder and write the data to it.
Example Arduino Sketch
An example sketch can be found on GitHub. I was using an AudioKit for testing, but you can easily adapt this logic to work with any other supported output type.
There is also an example using the AudioPlayer. Here you have three options: you can use a single MultiDecoder for both the AudioPlayer and the ContainerM4A; you can use the MultiDecoder only for the ContainerM4A and provide the container to the AudioPlayer; or you can use two separate MultiDecoders to clearly specify which audio types the AudioPlayer should support and which types the ContainerM4A should support.
Caveats
- The M4A file format needs quite a lot of RAM to store the sample table, so don’t even try this without enough PSRAM!
- The M4A file needs to be stored in streaming format, where the mdat box (with the audio data) is at the end. Some files do not follow this layout: those need to be processed twice!
Outlook
I am planning to provide an implementation where the API works with an Arduino File: with this we do not need to store the sample table in memory, but can read it directly from the file when needed. This is much more memory efficient…