Spoken Word: Bringing Read-Along Speech Synthesis to the Web

Update 2020-01-16:

Really love the “Read Aloud Selection” feature of @MSEdgeDev! I've long been wanting to see this kind of read-along capability built-in to browsers.

Previously I made a “Spoken Word” JS app allowing a webpage to embed TTS playback: https://t.co/xCKX9y1kyq

Built-in is better! pic.twitter.com/94XF1cLtXN
— Weston Ruter #WCAsia (Bluesky @weston.ruter.net) (@westonruter) January 17, 2020

Back in December 2009 I did a hackathon to create an HTML5 Audio Read-Along (demo) which highlighted the text of words spoken in the corresponding audio being played. To introduce the project I wrote:

When I was in college, my most valuable tool for writing papers was a text-to-speech (TTS) program [ReadPlease 2003]. I could paste in a draft of my paper and it would highlight each word as it was spoken, so I could give my proof-reading eyes a break and do proof-listening while I read along; I caught many mistakes I would have missed. Likewise, for powering through course readings I would copy the material into the TTS program whenever possible and speed up the reading rate; because the words are highlighted, it’s easy to re-find your place if you look away and just listen for awhile. (I constantly use OS X’s selected-text speech feature, but unfortunately it does not highlight words). A decade after my college days, I would have hoped that such TTS read-alongs would have become common on the Web (though there is work-in-progress Chrome API and a W3C draft spec now under development), even as read-along apps are prolific in places like the Apple App Store for kids books.

Screenshot of ReadPlease 2003

As I further note in the project’s readme, the process I used to create this read-along demo was extremely tedious. I painstakingly obtained time indices for each word in a segment of speech audio to align it with its corresponding text so that the text could be highlighted; it took me four hours to manually find the indices for a couple of minutes of speech. Naturally, my project was intended only as a demo, and it is unreasonable to expect anyone else to go through the same process. Nevertheless, I think my proof of concept is compelling. I won second place in the HTML5 audio Dev Derby by Mozilla back in 2012.
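
To illustrate how per-word time indices can drive the highlighting, here is a minimal JavaScript sketch. The `wordTimings` data, its field names, and the `findWordIndex` helper are hypothetical and are not taken from the demo's code; the sketch only assumes that each word has a start and end time in seconds and that the list is sorted by start time.

```javascript
// Hypothetical timing data: one entry per word, times in seconds,
// sorted by start time. The real demo's data format may differ.
const wordTimings = [
  { word: "When", start: 0.0, end: 0.25 },
  { word: "I", start: 0.25, end: 0.35 },
  { word: "was", start: 0.35, end: 0.6 },
  { word: "in", start: 0.6, end: 0.75 },
  { word: "college", start: 0.9, end: 1.4 },
];

// Binary search for the index of the word being spoken at `time`.
// Returns -1 when `time` falls before the first word, after the last
// word, or in a silent gap between two words.
function findWordIndex(timings, time) {
  let lo = 0;
  let hi = timings.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (timings[mid].start <= time) {
      found = mid; // last word seen so far that starts at or before `time`
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found !== -1 && time < timings[found].end ? found : -1;
}

// In a browser, an <audio> element's "timeupdate" event could call
// findWordIndex(wordTimings, audio.currentTime) and highlight that word.
```

The lookup is the cheap part; producing the timing list by hand is what made the process so slow.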

Screenshot of my HTML5 Audio Read-Along.

Several years later I made Listenability, an open source DIY clone of the now-defunct “SoundGecko” service. It allowed you to create a podcast of articles that you sent to your blog, and it leveraged your system’s own speech synthesis to generate the podcast audio enclosure asynchronously. Daniel Bachhuber created SimpleTTS, which integrates WordPress with the Amazon Polly text-to-speech service to create MP3 files and attach them to posts. His work was then followed up by another Polly solution, this time developed directly by AWS in partnership with WP Engine. These Polly integrations provide great ways to integrate speech synthesis into the publishing workflow.

Publishing text content in audio form provides key value for users because it introduces another mode for reading the content: instead of reading with your eyes, you can read with your ears, such as while you are doing dishes or riding a bike. Speech synthesis makes audio scalable by automating the audio creation; it introduces your content into domains normally dominated by music, audiobooks, podcasts, and (oh yeah) radio.

The Amazon Polly solutions are great for when you want to publish audio as an alternative to the text. What they aren’t as great for is publishing audio alongside the text, as I set out to demonstrate in the read-along experience in December 2009. (It is possible to implement a read-along with Polly using Speech Marks, but the aforementioned integrations don’t yet do so.) If there is an audio player sitting at the top of an article and you hit play, you can quickly lose your place in the text if you’re trying to read along, since the currently-spoken words are not highlighted. Additionally, if you are reading the article with your eyes and then decide you want to switch to audio while you do the dishes, it is difficult to seek the audio content to the place where you last read in the text content. What I want to see is a multi-modal reading experience.
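
As a rough illustration of the Speech Marks route, the JavaScript sketch below turns Speech Marks output into a list of word timings. It assumes the line-delimited JSON shape described in the Polly documentation (`time` in milliseconds, `type`, `start`, `end`, `value`); the sample values, the `parseSpeechMarks` name, and the output field names are made up for the example and are not part of any of the integrations mentioned above.

```javascript
// Illustrative sample only: three Speech Marks lines in the documented
// line-delimited JSON shape. The timing values are invented.
const sampleMarks = [
  '{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}',
  '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
  '{"time":373,"type":"word","start":5,"end":8,"value":"had"}',
].join("\n");

// Parse line-delimited Speech Marks and keep only the word marks.
// `start`/`end` are documented as byte offsets into the input text, so
// they equal JavaScript string indices only while the text is pure ASCII.
function parseSpeechMarks(ndjson) {
  return ndjson
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "")
    .map((line) => JSON.parse(line))
    .filter((mark) => mark.type === "word")
    .map((mark) => ({
      word: mark.value,
      startMs: mark.time, // when the word begins in the audio stream
      textStart: mark.start,
      textEnd: mark.end,
    }));
}

const words = parseSpeechMarks(sampleMarks);
```

Each mark carries only a start time, so a player would treat a word as current until the next mark's time is reached.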

So in December 2017 I worked on another Christmas vacation project. Since Chrome, Firefox, and Safari now support an (experimental) Web Speech API with speech synthesis, you can now do text-to-speech in browsers using just the operating system’s own installed TTS voices (which are now excellent). With this it is possible to automate the read-along interface that I had created manually before. I call this new project Spoken Word. Here’s a video showing an example:

Here’s a full rundown of the features:

Uses local text-to-speech engine in user’s browser. Directly interfaces with the speechSynthesis browser API. Zero external requests or dependencies, so it works offline and there is no network latency.
Words are selected/highlighted as they are being spoken to allow you to read along.
Skips speaking elements that should not be read, including footnote superscripts (the sup element). These elements are configurable.
Pauses of different lengths are added between headings versus paragraphs.
Controls remain in view during playback, with the text currently being spoken persistently scrolled into view. (Requires browser support for position:sticky.)
Back/forward controls allow you to skip to the previous or next paragraph; when not speaking, the next paragraph to read will be selected entirely.
Select text to read from that point; click on text during speech to immediately change position.
Multi-lingual support, allowing embedded text with [lang] attribute to be spoken by the appropriate voice (assuming the user has it installed), switching to language voices in the middle of a sentence.
Settings for changing the default voice (for each language), along with settings for the rate of speech and its pitch. (Not supported by all engines.) Changes can be made while speaking.
Hit escape to pause during playback.
Speech preferences are persistently stored in localStorage, with changes synced across windows (of a given site).
Ability to use JS in standalone manner (such as in bookmarklet). Published on npm. Otherwise, it is primarily packaged as a WordPress plugin.
Known to work in the latest desktop versions of Chrome, Firefox, and Safari. (Tested on OS X.) It does not work reliably in mobile/touch browsers on Android or iOS, apparently because the (still experimental) speechSynthesis API is not implemented well enough on those systems and/or because programmatic range selection does not work the same way as on desktop. For these reasons, the functionality is disabled by default on mobile operating systems.
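
For readers curious how the word highlighting and the language-specific voice choice in the list above can be wired to the speechSynthesis API, here is a minimal JavaScript sketch. It is not the plugin's actual code: `pickVoice`, `wordRangeAt`, and `speakWithHighlight` are hypothetical helper names, and the fallback rules are assumptions. It relies only on the standard `SpeechSynthesisUtterance` `boundary` event, whose `charIndex` (and, in some browsers, `charLength`) locates the word about to be spoken.

```javascript
// Choose a voice for a language tag such as "en-US": an exact match first,
// then any voice with the same primary language ("en"), otherwise null.
// Underscores are normalized because some platforms report "en_US".
function pickVoice(voices, lang) {
  const norm = (tag) => tag.replace(/_/g, "-").toLowerCase();
  const wanted = norm(lang);
  const primary = wanted.split("-")[0];
  return (
    voices.find((v) => norm(v.lang) === wanted) ||
    voices.find((v) => norm(v.lang).split("-")[0] === primary) ||
    null
  );
}

// Turn a boundary event's charIndex into a { start, end } range in `text`.
// When the browser gives no charLength, fall back to the run of
// non-whitespace characters (which may include trailing punctuation).
function wordRangeAt(text, charIndex, charLength) {
  if (charLength > 0) {
    return { start: charIndex, end: charIndex + charLength };
  }
  const match = /^\S+/.exec(text.slice(charIndex));
  return { start: charIndex, end: charIndex + (match ? match[0].length : 0) };
}

// Speak `text` and call `onWord` with the range of each word as it starts.
// Returns null where the Web Speech API is unavailable.
function speakWithHighlight(text, lang, onWord) {
  if (typeof speechSynthesis === "undefined") {
    return null;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = lang;
  // getVoices() can be empty until the "voiceschanged" event has fired.
  const voice = pickVoice(speechSynthesis.getVoices(), lang);
  if (voice) {
    utterance.voice = voice;
  }
  utterance.addEventListener("boundary", (event) => {
    if (event.name === "word") {
      onWord(wordRangeAt(text, event.charIndex, event.charLength));
    }
  });
  speechSynthesis.speak(utterance);
  return utterance;
}
```

In a page, the `onWord` callback could map the returned offsets onto a DOM Range and select it, which is one way to get the read-along effect; text with a different [lang] attribute would be spoken as a separate utterance with its own voice.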

Screenshots of the WordPress plugin with the Twenty Seventeen theme active:

You can try it out on a standalone example with some test content, or install the WordPress plugin on your own site (as it is installed here on my blog for this very article, but you need a desktop browser currently to see it).

For more details, see the GitHub project. Pull requests are welcome and the code is MIT licensed. I hope that this project inspires multi-modal read-along experiences to become common on the Web.

Comments

7 responses to “Spoken Word: Bringing Read-Along Speech Synthesis to the Web”

April 15, 2019

Jane Lawson

Hi Weston

I’m an English teacher running a website on Drupal 7. I have been searching for ages for something such as you have developed, and finally tracked you down. I hope you don’t mind me contacting you.

My website users are learning English, and I record audio lessons which they are supposed to speak along to. Highlighting the words as they go would be so very useful and would make my lessons more effective.

I see that you are now developing Spoken Word for WordPress. I’m hoping beyond hope that you will do the same for Drupal.

I have a few questions:

1. Is it possible to use this with a pre-supplied audio file, rather than speech recognition? I use actors or real people for my audio and would like for the speech recognition element of your software to align each word to the corresponding timestamp in my audio file.

You can see a sample lesson of mine here
https://www.dailystep.com/en/lesson/advanced-english-conversation-advanced-english-listening-skills-learn-advanced-english

2. I would also love the option to be able to choose to highlight sentences, or phrases rather than individual words. These could perhaps be determined by the software whenever a full stop (period) or comma appears, and potentially could also be manually adjusted.

3. Is it possible to manually select the colour of the highlight? That way I could, for example, highlight clause by clause in yellow, or sentence by sentence in green.

4. I noticed that in your Word Press module, Spoken Word, that the user can click on an area of text and the audio will play from there. That sounds great.

But what would also be fantastic is if the user could select an area of text and that corresponding audio would play. Potentially with the option to repeat that same step so they could listen to a selected area of text as often as they like to before deselecting it.

5. Your blog page says that it does not work on Android or iOS at the moment due to the still experimental speech synthesis not being implemented well enough on those systems.

However, if the correct audio file is supplied so you know exactly what the audio is, would that make it easier for this to work on mobile devices? Most traffic these days comes from these devices.

I think there would be a big demand for this.

Thank you for your time in reading this Weston, and I’d love to hear back from you whenever you get the time. Also, congratulations on developing such a useful and innovative program. I’d really appreciate any reply or advice you can give me on this topic.

Best wishes,

Jane

English Teacher
DailyStep English
https://www.dailystep.com

January 17, 2020

Aditya

Hi,
We are trying to use the above example for a solution. But was unable to compile and run the code present in github.

Please let us know if there are any changes required to run the code from github:

Tried npm run start after getting the code from

https://github.com/westonruter/spoken-word

Thanks & Regards,
Aditya.

January 17, 2020

Panduranga Bhandarkar

Hello,

We intend to use the spoken word highlighting with Watson Speech to Text. Is it possible to use this code only for the word highlighting with Watson Speech to Text API?

If yes, can you provide information on how this code can be leveraged, to just use the spoken word highlighting in text?

If not, please do let us know, so that we can explore other options.

Thank you!

1. January 21, 2020
  
  Weston Ruter
  
  If Watson has voices which can be installed in the operating system alongside other voices that come with the OS, then yes. This project relies entirely on the SpeechSynthesis API in the browser.
  
June 16, 2020

Josette Seur

Hello, I want to make videos on youtube of children books that I read aloud. I therefore have the text I read embedded in the images I have taken of the books. And alongside that, I have an audio file of me reading the book. I have seen so many videos on you tube of people reading books and having the words highlighted at the same time, but it is impossible for me to figure out how they do it, and what software they use to create this effect. Can you please help me? I would be so thankful!

1. June 16, 2020
  
  Weston Ruter
  
  Sorry, I can’t help. I’m not actively maintaining this project anymore.
  
July 15, 2020

Lynda

Hi. Thanks so much for making and giving this code. I think it’s fantastic for helping to encourage struggling readers.
Best wishes
Lynda