Spoken Word: Bringing Read-Along Speech Synthesis to the Web

Back in December 2009 I did a hackathon to create an HTML5 Audio Read-Along (demo) which highlighted the text of words spoken in the corresponding audio being played. To introduce the project I wrote:

When I was in college, my most valuable tool for writing papers was a text-to-speech (TTS) program [ReadPlease 2003]. I could paste in a draft of my paper and it would highlight each word as it was spoken, so I could give my proof-reading eyes a break and do proof-listening while I read along; I caught many mistakes I would have missed. Likewise, for powering through course readings I would copy the material into the TTS program whenever possible and speed up the reading rate; because the words are highlighted, it’s easy to re-find your place if you look away and just listen for a while. (I constantly use OS X’s selected-text speech feature, but unfortunately it does not highlight words.) A decade after my college days, I would have hoped that such TTS read-alongs would have become common on the Web (though there is a work-in-progress Chrome API and a W3C draft spec now under development), even as read-along apps are prolific in places like the Apple App Store for kids’ books.

As I further note in the project’s readme, the process I used to create this read-along demo was extremely tedious. It took me four hours to manually find the indices for a couple minutes of speech. I painstakingly obtained time indices for each word in a segment of speech audio to align with its corresponding text so that the text could be highlighted. Naturally my project was just intended as a demo and it is unreasonable to expect anyone else to go through the same process. Nevertheless, I think my proof of concept is compelling. I won second place in the HTML5 audio Dev Derby by Mozilla back in 2012.

Several years later I made Listenability, an open source DIY clone of the now-defunct “SoundGecko” service. It allowed you to create a podcast of articles that you sent to your blog, and it leveraged your system’s own speech synthesis to generate the podcast audio enclosure asynchronously. Daniel Bachhuber created SimpleTTS, which integrates WordPress with the Amazon Polly text-to-speech service to create MP3 files and attach them to posts. His work was then followed up with another Polly solution, this time developed directly by AWS in partnership with WP Engine. These Polly integrations provide great ways to integrate speech synthesis into the publishing workflow.

Publishing text content in audio form provides key value for users because it introduces another mode for reading the content: instead of reading with your eyes, you can read with your ears, such as while you are doing dishes or riding a bike. Speech synthesis makes audio scalable by automating the audio creation; it introduces your content into domains normally dominated by music, audiobooks, podcasts, and (oh yeah) radio.

The Amazon Polly solutions are great for when you want to publish audio as an alternative to the text. What they aren’t as great for is publishing audio alongside the text, as I set out to demonstrate in the read-along experience in December 2009. (It is possible to implement a read-along with Polly using Speech Marks, but the aforementioned integrations don’t yet do so.) If there is an audio player sitting at the top of an article and you hit play, you can quickly lose your place in the text if you’re trying to read along, since the currently-spoken words are not highlighted. Additionally, if you are reading the article with your eyes and then decide you want to switch to audio while you do the dishes, it is difficult to seek the audio content to the place where you last read in the text content. What I want to see is a multi-modal reading experience.
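
As a rough illustration of what a time-indexed read-along needs, the sketch below assumes word marks shaped like the word entries in Polly’s Speech Marks output: a time in milliseconds plus start and end offsets into the text. The sample marks and the function names are made up for illustration, and real Speech Marks offsets are byte offsets, which only line up with character offsets for plain ASCII text.

```javascript
// Hypothetical word marks: time is milliseconds from the start of the audio,
// start/end are offsets of the word within the text.
const text = "Mary had a little lamb";
const marks = [
    { time: 0, start: 0, end: 4, value: "Mary" },
    { time: 370, start: 5, end: 8, value: "had" },
    { time: 520, start: 9, end: 10, value: "a" },
    { time: 610, start: 11, end: 17, value: "little" },
    { time: 900, start: 18, end: 22, value: "lamb" },
];

// Find the word being spoken at a playback position: a binary search for
// the last mark whose time is at or before that position.
function markAtTime(marks, ms) {
    let lo = 0;
    let hi = marks.length - 1;
    let found = null;
    while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        if (marks[mid].time <= ms) {
            found = marks[mid];
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return found;
}

// Reverse lookup: given the text offset where the reader stopped,
// find the audio time (in milliseconds) to seek to.
function timeForOffset(marks, offset) {
    const mark = marks.find((m) => offset < m.end);
    return mark ? mark.time : null;
}
```

In a browser, markAtTime could be called from an audio element’s timeupdate handler with currentTime * 1000 to pick the word to highlight, and timeForOffset could set currentTime when the reader clicks a word in the text.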

So in December 2017 I worked on another Christmas vacation project. Since Chrome, Firefox, and Safari now support an (experimental) Web Speech API with speech synthesis, you can now do text-to-speech in browsers using just the operating system’s own installed TTS voices (which are now excellent). With this it is possible to automate the read-along interface that I had created manually before. I call this new project Spoken Word. Here’s a video showing an example:

Here’s a full rundown of the features:

  • Uses local text-to-speech engine in user’s browser. Directly interfaces with the speechSynthesis browser API. Zero external requests or dependencies, so it works offline and there is no network latency.
  • Words are selected/highlighted as they are being spoken to allow you to read along.
  • Skips speaking elements that should not be read, including footnote superscripts (the sup element). These elements are configurable.
  • Pauses of different lengths are added between headings versus paragraphs.
  • Controls remain in view during playback, with the text currently being spoken persistently scrolled into view. (Requires browser support for position:sticky.)
  • Back/forward controls allow you to skip to the next paragraph; when not speaking, the next paragraph to read will be selected entirely.
  • Select text to read from that point; click on text during speech to immediately change position.
  • Multi-lingual support, allowing embedded text with [lang] attribute to be spoken by the appropriate voice (assuming the user has it installed), switching to language voices in the middle of a sentence.
  • Settings for changing the default voice (for each language), along with settings for the rate of speech and its pitch. (Not supported by all engines.) Changes can be made while speaking.
  • Hit escape to pause during playback.
  • Speech preferences are persistently stored in localStorage, with changes synced across windows (of a given site).
  • Ability to use the JS in a standalone manner (such as in a bookmarklet). Published on npm. Otherwise, it is primarily packaged as a WordPress plugin.
  • Known to work in the latest desktop versions of Chrome, Firefox, and Safari. (Tested on OSX.) It does not work reliably in mobile/touch browsers on Android or iOS, apparently because the (still experimental) speechSynthesis API is not implemented well enough on those systems and/or because programmatic range selection does not work the same way as on desktop. For these reasons, the functionality is disabled by default on mobile operating systems.
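
The word highlighting in the list above depends on the boundary events that the speechSynthesis API fires while an utterance is being spoken. The following is a minimal sketch of that idea, not the plugin’s actual code: wordRangeAt and speakWithHighlight are hypothetical names, the highlighting itself is left to a caller-supplied callback, and some engines or voices may not fire boundary events at all.

```javascript
// Given the full text and the charIndex reported by a boundary event,
// return the start/end offsets of the word beginning at that index.
// Trailing punctuation is included; a real implementation would likely trim it.
function wordRangeAt(text, charIndex) {
    const match = /^\S+/.exec(text.slice(charIndex));
    const length = match ? match[0].length : 0;
    return { start: charIndex, end: charIndex + length };
}

// Browser-only: speak the text and report each word's range as it is spoken.
// onWord is a caller-supplied callback, e.g. one that selects a DOM Range.
function speakWithHighlight(text, onWord) {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.onboundary = (event) => {
        // Engines may also report "sentence" boundaries; only words matter here.
        if (event.name === "word") {
            onWord(wordRangeAt(text, event.charIndex));
        }
    };
    speechSynthesis.speak(utterance);
    return utterance;
}
```

Keeping the offset calculation in its own function means it can be checked without a browser, while speakWithHighlight only touches the speech API when it is actually called.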

Screenshots of the WordPress plugin with the Twenty Seventeen theme active:

You can try it out on a standalone example with some test content, or install the WordPress plugin on your own site (as it is installed here on my blog for this very article, but you need a desktop browser currently to see it).

For more details, see the GitHub project. Pull requests are welcome and the code is MIT licensed. I hope that this project inspires multi-modal read-along experiences to become common on the Web.

ECMAScript Proposal: Named Function Parameters

I recently ran across the ES wiki which is documenting proposals and features for new versions of ECMAScript (JavaScript). I was excited to see the spread operator “...” which basically brings Perl-style lists to JavaScript. I was also excited to see the spread-related rest parameters which basically implement Python’s positional parameter glob *args; however, I did not see something equivalent to Python’s named parameter glob **kwargs (see on Python Docs).

I’ve been giving thought to passing in named arguments to function calls in JavaScript, eliminating the need for the current pattern of wrapping named arguments in an object, like:

function foo(kwargs){
    if(kwargs.z === undefined)
        kwargs.z = 3; //default value
    return [kwargs.x, kwargs.y, kwargs.z];
}
foo({z:3, y:2, x:1}); //=> [1, 2, 3]

Instead of this, I’d love to be able to discretely define what arguments I’m expecting in a positional list, but then also allow them to be passed in as named arguments:

function foo(x, y, z){
    return [x, y, z];
}
foo(1, 2, 3) === foo(z:3, x:1, y:2); //=> [1, 2, 3]

This could also work with the ES proposal for parameter default values:

function foo(x, y, z = 3){
    return [x, y, z];
}
foo(1, 2) === foo(y:2, x:1); //=> [1, 2, 3]

Although the current proposal has a requirement that only trailing formal parameters may have default values specified, this shouldn’t be necessary when named parameters are used:

function foo(x = 1, y, z){
    return [x, y, z];
}
foo(z:3, y:2); //=> [1, 2, 3]

Furthermore, although the current spread proposal only works with arrays, it could also work with objects for named parameters:

var kwargs = {y:2, x:1, z:3};
foo(...kwargs); //=> [1,2,3]

Finally, the rest parameters proposal uses spread to implement the rough equivalent of Python’s positional parameter glob (*args), but I’m not sure how it could also be applied to named parameters to support Python’s **kwargs at the same time—I’m not sure how named rest parameters would work. For example in Python:

def foo(x, *args, **kwargs):
    return [x, args[0], kwargs['z']]
foo(1, 2, z=3)  # => [1, 2, 3]

Python has the * to prefix positional parameter globs and ** to prefix named parameter globs, but so far ECMAScript only has one prefix: the “...” spread operator.
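
For comparison, here is a sketch of how close the object-wrapping pattern gets when combined with syntax that was later standardized in ECMAScript (parameter destructuring with defaults in ES2015, object rest/spread in ES2018). It is not the call-site syntax proposed above, since callers still pass an object, and the function names are only illustrative.

```javascript
// Named arguments via a destructured object parameter.
// Defaults do not need to trail: x has one, y and z do not.
function foo({ x = 1, y, z } = {}) {
    return [x, y, z];
}

// Rough analogue of Python's foo(x, *args, **kwargs): named arguments
// arrive in the first object, with any extras collected into kwargs,
// and remaining positional arguments are collected into args.
function bar({ x, ...kwargs }, ...args) {
    return [x, args[0], kwargs.z];
}

const named = { y: 2, z: 3 };
console.log(foo(named));                //=> [1, 2, 3]
console.log(foo({ ...named, x: 10 }));  //=> [10, 2, 3]
console.log(bar({ x: 1, z: 3 }, 2));    //=> [1, 2, 3]
```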

These are just some rough ideas about how JavaScript could support named parameters. I’m sure there are things I’ve missed and implications I haven’t thought of, but I wanted to get my thoughts out there. What do you think?

Programming Languages I’ve Learned In Order

Update: See also list on MY TECHNE.

What follows are the programming languages I’ve learned in the order of learning them; their relative importance is marked up with the big element, and the small element indicates that I didn’t fully learn or actually use the language.

  1. Perl 5
  2. JavaScript / ECMAScript
  3. PHP 4 & 5
  4. SQL
  5. Visual Basic 6
  6. Java
  7. Classic ASP: VBScript & JScript
  8. (Visual) C/C++
  9. XSLT
  10. Ruby
  11. Python (I expect/hope that this will supplant PHP in the next couple years, getting an extra big or two.)

While Python is now my server-side language of choice, I would be much happier to use JavaScript end-to-end. Thanks to the CommonJS initiative, this is becoming a reality.

Not included in the list above are markup languages and other related technologies: (X)HTML 4 & 5, XML, CSS, DOM, RSS, Atom, RDF, XML Schema, XPath, JSON, JSON/XML-RPC, SVG, VML, OSIS, iCal, Microformats, MathML, etc.

Idea via and prompted by James Tauber, via Dougal Matthews.

Proposal for Customizing Google’s Crawlable Ajax URLs

On the Shepherd Interactive site, we have a dynamic navigation menu in Flash. In order to prevent it from having to reload each time a page is changed, I implemented Ajax loading so that the SWF only has to load once. This is similar to what Lala and Facebook do. So if your browser is Ajax-enabled, upon visiting:

http://shepherdinteractive.com/portfolio/interactive/

You will get redirected to the site root / with the old path supplied as the URL hash fragment, which is then loaded in via JavaScript as the page content:

http://shepherdinteractive.com/#portfolio/interactive/

However, according to Google’s Making AJAX Applications Crawlable specification, Ajax pretty URLs are any containing a hash fragment beginning with !, for example:

http://shepherdinteractive.com/#!portfolio/interactive/

The purpose of the ! is merely to inform Googlebot that such a URL is for an Ajax page whose content can be fetched via:

http://shepherdinteractive.com/?_escaped_fragment_=portfolio/interactive/
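
The mapping from a pretty URL to the URL that gets fetched can be sketched as a small function. This is only an illustration under simplifying assumptions: toEscapedFragmentUrl is a hypothetical name, and the escaping of special characters in the fragment is left out.

```javascript
// Map a "#!" pretty URL to the URL a crawler would fetch instead.
// Returns null when the fragment does not start with "!".
// Assumption: the fragment needs no escaping (true for the example above).
function toEscapedFragmentUrl(prettyUrl) {
    const url = new URL(prettyUrl);
    if (!url.hash.startsWith("#!")) {
        return null;
    }
    const fragment = url.hash.slice(2);
    // Append to an existing query string if there is one.
    const separator = url.search ? "&" : "?";
    return url.origin + url.pathname + url.search + separator +
        "_escaped_fragment_=" + fragment;
}

console.log(toEscapedFragmentUrl("http://shepherdinteractive.com/#!portfolio/interactive/"));
//=> http://shepherdinteractive.com/?_escaped_fragment_=portfolio/interactive/
```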

The problem I have with Google’s specification is that the pretty URL Ajax fragment prefix (!) is mandated; it is not customizable. I should be able to tell Googlebot which fragment identifiers are for Ajax content and which are not. Therefore, instead of requiring authors to conform to Google’s Ajax specification, I propose that Google adopt an extension to robots.txt which allows site owners to let Googlebot know what to look for. The current specification’s mandate for ! could be indicated via:

Ajax-Fragment-Prefix: !

Or it could be changed to anything else, such as “ajax:”. If the Ajax fragment doesn’t have a prefix at all (as in the case of Shepherd Interactive’s website above), a regular expression pattern match could be specified in robots.txt, for example:

Ajax-Fragment-Pattern: .+/

This would tell Googlebot that a URL with a fragment containing a slash should be fetched via the _escaped_fragment_ query parameter, and that the Ajax URL itself (including the fragment identifier) should be indexed and returned verbatim in the search results.
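
To make the proposal concrete, here is a sketch of how a crawler might interpret the two directives. Both directives are only proposed in this post and are not something Googlebot supports; the function names are hypothetical, and the pattern is tested unanchored here, which is an assumption since the proposal does not say how matching should be anchored.

```javascript
// Parse the two proposed (hypothetical) directives out of robots.txt text.
function parseAjaxDirectives(robotsTxt) {
    const rules = { prefix: null, pattern: null };
    const directive = /^\s*(Ajax-Fragment-Prefix|Ajax-Fragment-Pattern)\s*:\s*(.*?)\s*$/i;
    for (const line of robotsTxt.split(/\r?\n/)) {
        const match = directive.exec(line);
        if (!match) {
            continue;
        }
        if (match[1].toLowerCase() === "ajax-fragment-prefix") {
            rules.prefix = match[2];
        } else {
            // Assumption: unanchored match against the fragment.
            rules.pattern = new RegExp(match[2]);
        }
    }
    return rules;
}

// Decide whether a URL's fragment should be treated as Ajax content.
function isAjaxFragment(urlString, rules) {
    const fragment = new URL(urlString).hash.slice(1);
    if (fragment === "") {
        return false;
    }
    if (rules.prefix !== null && fragment.startsWith(rules.prefix)) {
        return true;
    }
    return rules.pattern !== null && rules.pattern.test(fragment);
}

const rules = parseAjaxDirectives("User-agent: *\nAjax-Fragment-Pattern: .+/\n");
console.log(isAjaxFragment("http://shepherdinteractive.com/#about-us/our-company/", rules)); //=> true
console.log(isAjaxFragment("http://shepherdinteractive.com/#top", rules));                   //=> false
```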

It’s true that the Shepherd Interactive site implements Hijax (progressive enhancement with Ajax) techniques, so every Ajax URL has a corresponding non-Ajax URL; so in this sense, Google can still access all of the content. The problem, however, is with links to Ajax URLs from around the Web. I assume for Googlebot, every link to an Ajax URL without the obligatory prefix ! is interpreted as referring to the home page (site root /):

  • http://shepherdinteractive.com/#portfolio/interactive/harmer-steel/
  • http://shepherdinteractive.com/#about-us/our-company/
  • http://shepherdinteractive.com/#services/web-development/

So Google then assigns no additional PageRank to our Ajax URLs since it gets assigned to the home page. If, however, Googlebot could be told that those are actually Ajax URLs, then the PageRank could be properly assigned. (My assumptions here could be incorrect.)

Thoughts?

Update 2010-04-22: “Cowboy” Ben Alman brought up the excellent point that customizable Ajax-crawlable URLs need to work on a per-path basis, as different single-page web apps on the same server (or even in the same folder) might have different URI requirements. In addition to (or instead of) using robots.txt to tell Googlebot the format of the Ajax-crawlable URLs, why not allow that information to be placed in meta tags? Their specification already includes:

<meta name="fragment" content="!">

Why not make Googlebot support additional meta tags for recognizing the prefix (if any) or pattern that characterizes an Ajax-crawlable URL on a given page? For example:

Current default behavior:
<meta name="crawlable-fragment-prefix" content="!">
Starting with “/” like …about/#/us/:
<meta name="crawlable-fragment-prefix" content="/">
All fragments are crawlable:
<meta name="crawlable-fragment-prefix" content="">
Any fragment ending with a slash:
<meta name="crawlable-fragment-pattern" content=".+/">

With these meta tags, authors would be able to have complete page-level control over the structure of Ajax-crawlable URLs (hash fragments).