Weston Ruter

Web application developer in Portland, Oregon

HTML5 Audio Read-Along

Update: I’ve created a new ESV Text/Audio Aligner project on GitHub which queries the ESV text and audio data from their API and aligns them using the newly-improved CMU Sphinx codebase, which now works better with long audio. It’s still a work in progress.

Jump to demo below to listen while reading along to the nativity story from the Gospel of Luke.

I guess I’ve been on an HTML5 Audio kick lately, with this and my previous post on using it to play Google’s Text-To-Speech (TTS) service on Google Translate. I wrote that post while attempting to build a text-to-speech interface which audibly reads the text on the page while highlighting the currently-spoken word (like my old favorite ReadPlease), but a bug in Google Chrome for Mac discouraged me from finishing it. However, when I saw the inestimable Paul Irish’s post about the jquery-singalong plugin which times HTML5 Audio with text and a bouncing ball cue, I was inspired to do the same but for a (non-TTS) read-along.

The text I chose was the nativity story from the Gospel of Luke in the English Standard Version (ESV), not only in keeping with the (true) Christmas spirit but also because the ESV Online has an excellent API which allows both a passage’s text and its audio to be queried. With the text and audio in hand, each word in the text needs to be time-indexed with its begin time and duration in the corresponding audio. In the past, audio Bibles were divided into chapter segments only, and that was as granular as you could go; the ESV team innovated by taking this granularity down to the verse level. Unfortunately, the granularity is not available at the word level. Therefore, in order to make this exciting (IMHO) read-along demo work, I manually traversed the audio to find each word’s begin time and duration (spending more time than I care to say), and I added these time indices to the word markup as data-begin and data-dur attributes, akin to SMIL’s begin and dur attributes (which I suppose I could use). I also spruced up the markup for the passage by incorporating some HTML5 constructs as well as adding OSIS verse elements, as I’ve been talking about over at Open Scriptures.
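To make the word markup concrete, here is a minimal sketch of how a table of word timings could be turned into elements carrying data-begin and data-dur attributes. The span element, the helper names (escapeHtml, wordSpan, passageMarkup), and the timing values are all assumptions for illustration; the demo’s actual markup may differ.

```javascript
// Hypothetical timing rows (seconds); these are NOT the demo's real values.
const rows = [
  { text: "In", begin: 0.5, dur: 0.2 },
  { text: "those", begin: 0.7, dur: 0.3 },
  { text: "days", begin: 1.0, dur: 0.5 },
];

// Escape characters that are significant in HTML text content.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap one word in a span (assumed element) carrying its begin time and duration.
function wordSpan(row) {
  return (
    '<span data-begin="' + row.begin + '" data-dur="' + row.dur + '">' +
    escapeHtml(row.text) +
    "</span>"
  );
}

// Join the per-word spans with single spaces to form the passage markup.
function passageMarkup(rows) {
  return rows.map(wordSpan).join(" ");
}

console.log(passageMarkup(rows));
```

Keeping the timings in attributes on the words themselves means a script can read everything it needs straight from the DOM, without a separate data file.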

The read-along demo has been tested in Firefox 3.5, Chrome 4, and Safari 4 (as expected, it will not work in Internet Explorer). Chrome plays the MP3 as served from the ESV API. Firefox doesn’t support MP3, so I include an Ogg Vorbis source as well. Safari seems to have trouble playing the MP3 and can’t play Ogg, so I also include an 8 kHz WAV source.
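A browser normally makes this choice itself when an audio element lists several sources, but the fallback logic can be sketched in script as well. The file names and the pickSource helper below are hypothetical; the only real API involved is the media element’s canPlayType method, which is passed in as a function so the sketch can also run outside a browser.

```javascript
// Candidate sources in order of preference. File names are made up.
const sources = [
  { src: "luke2.mp3", type: "audio/mpeg" },
  { src: "luke2.ogg", type: 'audio/ogg; codecs="vorbis"' },
  { src: "luke2.wav", type: "audio/wav" },
];

// Return the first source the browser reports it can (maybe/probably) play,
// or null when none is playable. canPlayType is any function with the same
// contract as HTMLMediaElement.canPlayType.
function pickSource(sources, canPlayType) {
  for (const source of sources) {
    const answer = canPlayType(source.type);
    if (answer === "maybe" || answer === "probably") {
      return source;
    }
  }
  return null;
}

// In a browser, the real method would be passed in like this:
//   const audio = document.createElement("audio");
//   const chosen = pickSource(sources, (type) => audio.canPlayType(type));
//   if (chosen) { audio.src = chosen.src; }
```

Note that canPlayType only reports what the browser claims to support; it would not catch a case like Safari’s, where a nominally supported format still has trouble playing.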

What it does: Upon playing the audio, the word corresponding to the one currently being spoken is highlighted (via DOM Range/Selection); this eliminates the need to re-find your place if you momentarily look away. Likewise, when manually adjusting the seek position, the words which correspond to each position sought will be highlighted; and conversely, clicking a word causes the audio to seek to its corresponding position (and double-clicking will then cause it to start playing). Thus the text itself serves as an interface for navigating the audio. See the inline JavaScript source code for all the magic.
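The behavior described above can be sketched roughly as follows. This is not the demo’s inline source, just one way to do it under stated assumptions: the words are sorted by begin time, wordElements is a plain array of the elements carrying data-begin and data-dur, and the names findWordIndex and attachReadAlong are invented for this example.

```javascript
// Made-up sample timings (seconds), sorted by begin time.
const words = [
  { text: "In", begin: 0.5, dur: 0.2 },
  { text: "those", begin: 0.7, dur: 0.3 },
  { text: "days", begin: 1.0, dur: 0.5 },
];

// Binary search for the last word whose begin time is <= time.
// Returns -1 when time falls before the first word.
function findWordIndex(words, time) {
  let lo = 0;
  let hi = words.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (words[mid].begin <= time) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Browser-only wiring; call it with an audio element and an array of word elements.
function attachReadAlong(audio, wordElements) {
  const timings = wordElements.map((el) => ({
    begin: parseFloat(el.getAttribute("data-begin")),
    dur: parseFloat(el.getAttribute("data-dur")),
  }));

  // Highlight the current word by selecting it, both during playback and on seek.
  audio.addEventListener("timeupdate", () => {
    const i = findWordIndex(timings, audio.currentTime);
    if (i < 0) return;
    const range = document.createRange();
    range.selectNodeContents(wordElements[i]);
    const selection = window.getSelection();
    selection.removeAllRanges();
    selection.addRange(range);
  });

  // Clicking a word seeks to it; double-clicking then starts playback.
  wordElements.forEach((el, i) => {
    el.addEventListener("click", () => {
      audio.currentTime = timings[i].begin;
    });
    el.addEventListener("dblclick", () => {
      audio.play();
    });
  });
}
```

In this sketch a word stays highlighted through any silence that follows it, until the next word begins; using data-dur to clear the selection during pauses would be a small extension.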

Without further ado, please enjoy listening while reading along to the Christmas story!


Luke 2:1-20 (ESV)

In those days a decree went out from Caesar Augustus that all the world should be registered. This was the first registration when Quirinius was governor of Syria. And all went to be registered, each to his own town. And Joseph also went up from Galilee, from the town of Nazareth, to Judea, to the city of David, which is called Bethlehem, because he was of the house and lineage of David, to be registered with Mary, his betrothed, who was with child. And while they were there, the time came for her to give birth. And she gave birth to her firstborn son and wrapped him in swaddling cloths and laid him in a manger, because there was no place for them in the inn.

And in the same region there were shepherds out in the field, keeping watch over their flock by night. And an angel of the Lord appeared to them, and the glory of the Lord shone around them, and they were filled with fear. And the angel said to them, Fear not, for behold, I bring you good news of great joy that will be for all the people. For unto you is born this day in the city of David a Savior, who is Christ the Lord. And this will be a sign for you: you will find a baby wrapped in swaddling cloths and lying in a manger. And suddenly there was with the angel a multitude of the heavenly host praising God and saying,

Glory to God in the highest,
and on earth peace among those with whom he is pleased!

When the angels went away from them into heaven, the shepherds said to one another, Let us go over to Bethlehem and see this thing that has happened, which the Lord has made known to us. And they went with haste and found Mary and Joseph, and the baby lying in a manger. And when they saw it, they made known the saying that had been told them concerning this child. And all who heard it wondered at what the shepherds told them. But Mary treasured up all these things, pondering them in her heart. And the shepherds returned, glorifying and praising God for all they had heard and seen, as it had been told them.


Scripture taken from The Holy Bible, English Standard Version. Copyright ©2001 by Crossway Bibles, a publishing ministry of Good News Publishers. Used by permission. All rights reserved. Text provided by the Crossway Bibles Web Service.

Comments

  1. Rui Lopes

    Interesting demo, but I do have to say that we did this at my HCI research group 3-4 years ago, but with HTML+TIME (IE only, though). Check the “Rich Content Books for All” project at http://hcim.di.fc.ul.pt/wiki/RiCoBA for more details.

    Cheers

  2. Jesse Griffin

    Very cool Weston! I especially like the seek-via-word-click feature.

  3. Joshua Clark

    I wanted to build something like this for sermon transcripts. Upload a sermon mp3, and receive a time-coded transcript, instantly searchable. I guess this would be STT, not TTS. But once transcribed, this would be the perfect front-end.

  4. Gerardo Capiel

    You should check out a JavaScript based TTS implementation at http://scotland.proximity.on.ca/dxr/tmp/audio/tts/ . It would be interesting to link that work with your work. Keep me posted, if you do anything with this.

  5. Dmitry

    Hi, Weston,
    You’ve mentioned that you manually traversed the audio and set all the timings. How exactly did you do that? What software did you use?

    So, we have both a sound file read by a human (not TTS) and the transcript. We want to do the slicing (the process of recovering word timings) automatically. Do you think that is possible?
    I was trying to do something similar and tried to employ Free-TTS… But I didn’t manage to do anything useful.
    I’m doing some study-English software and that exciting read-along feature is what I’m really looking for. Can you advise?

  6. Weston Ruter

    @Dmitry:
    I think I used Audacity and a Google Spreadsheet to manually obtain the timings. I literally stepped-through the audio second-by-second finding the start and end time for each word in the audio and then added them to the spreadsheet. It took a few hours just for this passage.

    I know it is possible to automatically obtain the time indices for the words in audio given a transcript. The closest I got to doing this myself was utilizing the CMU Sphinx project which includes the ability to align text and audio. I had some success, but hit a roadblock. You can read all about my efforts and see the code on a thread on the Sphinx forum: http://sourceforge.net/projects/cmusphinx/forums/forum/382337/topic/4503550

    If you’re able to tweak my code to get the desired results, please let me know!

  7. Weston Ruter

    I filed a feature request for Google Chrome to extend their experimental TTS API to facilitate read-along apps: http://code.google.com/p/chromium/issues/detail?id=83404

    I was excited to learn about Chrome’s experimental TTS API. An application that I am very keen to develop is a read-along, where the page highlights the text corresponding with the words as they are being spoken. To do this, an API would need to be exposed for determining when each word is spoken. Currently, the Chrome TTS API only has events for onSpeak and onStop. To do a read-along, however, something like a “onSayWord” or “onUtter” event would be needed, where the event handler would be passed an Event object indicating the actual word being spoken and maybe a word index (the original text passed in would need to get broken up into individual utterances). It would also be useful to be able to seek to a specific time position within TTS audio given a word index (or time index)—this would allow you to navigate the audio via selecting words in the text. See the URL example provided for the kind of application I’d love to build utilizing such extensions to the TTS API.

  8. Weston Ruter

    The CMU Sphinx project has improved their code to work with aligning long audio with text. I’ve created a new project which uses it to align the ESV text with the audio: https://github.com/westonruter/esv-text-audio-aligner

  9. Harry Pannu

    Weston,

    In your attempt to use Sphinx4, were you able to get word cue points directly from MP3 or did you convert MP3 to WAV format first? I am trying to make Sphinx4 use MP3 directly to save the labor of WAV conversion. I am getting good results. I would like to exchange information with you hoping it will benefit both of us. Looking forward to hearing from you.

    Thanks,
    Harry Pannu

  10. Weston Ruter

    @Harry:
    In my Sphinx4 attempt, I did indeed convert from MP3 to WAV first. I wasn’t able to get Sphinx to work with MP3. However, my build script automatically converts the MP3s to WAV format before passing into Sphinx, so it’s no extra labor. See https://github.com/westonruter/esv-text-audio-aligner/blob/master/align.py#L120

  11. Harry Pannu

    Did you ever need to tweak the cue points generated by Sphinx? I am working on a web app that is supposed to auto-generate cue points for a given script and audio file at the click of a button in the web interface. On top of that, it should have an “easy-to-use” interface for editors to manually correct any errors.

    After struggling for several days, at the end, it turned out to be pretty simple to get the MP3 decoding working with Sphinx. I would be happy to provide you the specifics if interested in trying it. It will be comforting to know that it works for you as well.

  12. Weston Ruter

    @Harry:
    The cue points are not reliable enough yet for me. Maybe things have improved since last time I tried, but compare the results of aligning John 1 and John 3:
    https://github.com/westonruter/esv-text-audio-aligner/blob/master/reports/John.1
    https://github.com/westonruter/esv-text-audio-aligner/blob/master/reports/John.3

    John 3 gets extremely misaligned. See more information in the last post on the Sphinx forums: http://sourceforge.net/projects/cmusphinx/forums/forum/382337/topic/4503550

  13. Harry Pannu

    Weston,

    I have already skimmed through most of your posts as I was looking for a solution for myself. I just added a message of my own about how I got MP3 working with Sphinx for everyone’s good.

    See https://sourceforge.net/projects/cmusphinx/forums/forum/382337/topic/4787012?message=10898990
