Spoken Word: Bringing Read-Along Speech Synthesis to the Web

Back in December 2009 I did a hackathon to create an HTML5 Audio Read-Along (demo) which highlighted the text of words spoken in the corresponding audio being played. To introduce the project I wrote:

Screenshot of ReadPlease 2003When I was in college, my most valuable tool for writing papers was a text-to-speech (TTS) program [ReadPlease 2003]. I could paste in a draft of my paper and it would highlight each word as it was spoken, so I could give my proof-reading eyes a break and do proof-listening while I read along; I caught many mistakes I would have missed. Likewise, for powering through course readings I would copy the material into the TTS program whenever possible and speed up the reading rate; because the words are highlighted, it’s easy to re-find your place if you look away and just listen for awhile. (I constantly use OS X’s selected-text speech feature, but unfortunately it does not highlight words). A decade after my college days, I would have hoped that such TTS read-alongs would have become common on the Web (though there is work-in-progress Chrome API and a W3C draft spec now under development), even as read-along apps are prolific in places like the Apple App Store for kids books.

As I further note in the project’s readme, the process I used to create this read-along demo was extremely tedious. It took me four hours to manually find the indices for a couple minutes of speech. I painstakingly obtained time indices for each word in a segment of speech audio to align with its corresponding text so that the text could be highlighted. Naturally my project was just intended as a demo and it is unreasonable to expect anyone else to go through the same process. Nevertheless, I think my proof of concept is compelling. I won second place in the HTML5 audio Dev Derby by Mozilla back in 2012.

Several years later I made Listenability which was an open source DIY clone of the now-defunct “SoundGecko” service. It allowed for you to create a podcast of articles that you sent to your blog and leveraged your system’s own speech synthesis to generate the podcast audio enclosure asynchronously. Daniel Bachhuber created SimpleTTS which integrates WordPress with the Amazon Polly text-to-speech to create the MP3 files and attached them to posts. His work was then followed-up with another Polly solution, this time being developed directly by AWS in partnership with WP Engine. These Polly integrations provide great ways to integrate speech synthesis into the publishing workflow.

Publishing text content in audio form provides key value for users because it introduces another mode for reading the content, but instead of reading with your eyes, you can read with your ears, such as while you are doing dishes or riding a bike. Speech synthesis makes audio scalable by automating the audio creation; it introduces your content into domains normally dominated by music, audiobooks, podcasts, and (oh yeah) radio.

The Amazon Polly solutions are great for when you want to publish audio as an alternative to the text. What they aren’t as great for is publishing audio alongside the text as I set out to demonstrate in the read-along experience in December 2009. (It is possible to implement a read-long with Polly using Speech Marks, but the aforementioned integrations don’t yet do so.) If there is an audio player sitting at the top of an article any you hit play, you can quickly lose your place in the text if you’re trying to read along since the currently-spoken words are not highlighted. Additionally, if you are reading the article with your eyes and then decide you want to switch to audio while you do the dishes, it is difficult to seek the audio content to the place where you last read in the text content. What I want to see is a multi-modal reading experience.

So in December 2017 I worked on another Christmas vacation project. Since Chrome, Firefox, and Safari now support an (experimental) Web Speech API with speech synthesis, you can now do text-to-speech in browsers using just the operating system’s own installed TTS voices (which are now excellent). With this it is possible to automate the read-along interface that I had created manually before. I call this new project Spoken Word. Here’s a video showing an example:

Here’s a full rundown of the features:

  • Uses local text-to-speech engine in user’s browser. Directly interfaces with the speechSynthesis browser API. Zero external requests or dependencies, so it works offline and there is no network latency.
  • Words are selected/highlighted as they are being spoken to allow you to read along.
  • Skips speaking elements that should not be read, including footnote superscripts (the sup element). These elements are configurable.
  • Pauses of different length added are between headings versus paragraphs.
  • Controls remain in view during playback, with each the current text being spoken persistently being scrolled into view. (Requires browser support for position:sticky.)
  • Back/forward controls allow you to skip to the next paragraph; when not speaking, the next paragraph to read will be selected entirely.
  • Select text to read from that point; click on text during speech to immediately change position.
  • Multi-lingual support, allowing embedded text with [lang] attribute to be spoken by the appropriate voice (assuming the user has it installed), switching to language voices in the middle of a sentence.
  • Settings for changing the default voice (for each language), along with settings for the rate of speech and its pitch. (Not supported by all engines.) Changes can be made while speaking.
  • Hit escape to pause during playback.
  • Speech preferences are persistently stored in localStorage, with changes synced across windows (of a given site).
  • Ability to use JS in standalone manner (such as in bookmarklet). Published on npm. Otherwise, it is primarily packaged as a WordPress plugin.
  • Known to work in the latest desktop versions of Chrome, Firefox, and Safari. (Tested on OSX.) It does not work reliably in mobile/touch browsers on Android or iOS, apparently due both to the (still experimental) speechSynthesis API not being implemented well enough on those systems and/or programmatic range selection does not work the same way as on desktop. For these reasons, the functionality is disabled by default on mobile operating systems.

Screenshots of the WordPress plugin with the Twenty Seventeen theme active:

You can try it out on a standalone example with some test content, or install the WordPress plugin on your own site (as it is installed here on my blog for this very article, but you need a desktop browser currently to see it).

For more details, see the GitHub project. Pull requests are welcome and the code is MIT licensed. I hope that this project inspires multi-modal read-along experiences to become common on the Web.

Leave a Reply