In this post, I present an overview of the process of integrating text-to-speech into the Hybrid Publishing workflow. The first step is choosing the text-to-speech (TTS) software. The choice should take a few criteria into account: quality of the available voices, ease of installation and use, platform support, documentation, license, language support, and any limit on the number of words that can be converted. The TTS market is expanding and there are plenty of non-libre solutions (Cepstral, Ivona and Google, to name a few). Among libre and open source alternatives, my research pointed me to these possibilities:
Espeak is very easy to install and use, but the quality of the available voices is quite poor: the output has a metallic timbre and sounds far too artificial. It is possible to install additional voices, but these do not seem to be of much higher quality.
Flite is a lightweight version of Festival (which has been around for quite a while). The demos currently showcase good quality voices. Even though those demo voices are not available for use as of now, the ones that are installed with the application are of reasonably good quality.
PicoTTS is based on the SVOX Pico engine for Android. Its voices are of reasonably good quality, but its word limit seems to be lower than that of the other alternatives: it was not possible to convert the entire sample chapter. The sample chapter is real content: it is the introduction chapter of a real publication and has 7823 words. To test the software, we created a reduced version which has 4144 words.
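One way to work around a word limit like this is to split the text into smaller pieces and convert them one at a time. The following is only a minimal sketch of that idea, not something tested against PicoTTS itself: the function name split_into_chunks and the max_words value of 1000 are hypothetical, since the actual limit is not known here and would have to be found by experiment.

```python
import re


def split_into_chunks(text, max_words=1000):
    """Split text into chunks of at most max_words words.

    Chunks end at sentence boundaries where possible. max_words is a
    hypothetical value, not a documented limit of any TTS engine.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        # A single sentence longer than the limit is cut at word boundaries.
        while len(words) > max_words:
            if current:
                chunks.append(" ".join(current))
                current, count = [], 0
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new chunk if this sentence would overflow the current one.
        if current and count + len(words) > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk could then be passed to the TTS engine separately and the resulting audio files joined afterwards. Cutting at sentence boundaries is a design choice meant to avoid audible breaks in the middle of a sentence.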
MaryTTS has a different architecture from the other alternatives: it is written in Java and runs as a server that answers synthesis requests. MaryTTS is open source, supports multiple languages and has reasonably good quality voices. It can demand a lot of resources from the machine. There is an online demo.
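To give an idea of what the request-based approach looks like in practice, here is a rough sketch of a client. It assumes a MaryTTS 5.x server running locally on port 59125 with an HTTP /process endpoint, and the parameter names (INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE, AUDIO, LOCALE) reflect my understanding of that interface; they should be checked against the MaryTTS documentation for the installed version. The function names are my own.

```python
import urllib.parse
import urllib.request


def build_marytts_url(text, host="localhost", port=59125, locale="en_US"):
    """Build a synthesis request URL for a MaryTTS server.

    Host, port and parameter names are assumptions about a default
    MaryTTS 5.x setup; adjust them to match the actual installation.
    """
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE_FILE",
        "LOCALE": locale,
    }
    return f"http://{host}:{port}/process?{urllib.parse.urlencode(params)}"


def synthesize_to_file(text, wav_path, **server_options):
    """Request audio for text and save the response body as a WAV file.

    Requires a running server; raises urllib.error.URLError otherwise.
    """
    url = build_marytts_url(text, **server_options)
    with urllib.request.urlopen(url) as response:
        audio = response.read()
    with open(wav_path, "wb") as wav_file:
        wav_file.write(audio)
```

With a server running, a call such as synthesize_to_file("Hello world", "hello.wav") would be expected to write the synthesized audio to hello.wav. Long texts may be better sent in smaller pieces, since the whole input travels in the URL.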
After weighing the pros and cons of these four alternatives, it seems sensible to start testing the workflow with Flite. In the next post I’ll describe, step by step, how to produce an m4b file from a text input.