Weston Ruter

Web application developer in Portland, Oregon

Proposal for Customizing Google’s Crawlable Ajax URLs

On the Shepherd Interactive site, we have a dynamic navigation menu in Flash. In order to prevent it from having to reload each time a page is changed, I implemented Ajax loading so that the SWF only has to load once. This is similar to what Lala and Facebook do. So if your browser is Ajax-enabled, upon visiting:

http://shepherdinteractive.com/portfolio/interactive/

You will get redirected to site root / with the old path supplied as the URL hash fragment which is then loaded in via JavaScript as the page content:

http://shepherdinteractive.com/#portfolio/interactive/

However, according to Google’s Making AJAX Applications Crawlable specification, Ajax pretty URLs are any containing a hash fragment beginning with !, for example:

http://shepherdinteractive.com/#!portfolio/interactive/

The purpose of the ! is merely to inform Googlebot that such a URL is for an Ajax page whose content can be fetched via:

http://shepherdinteractive.com/?_escaped_fragment_=portfolio/interactive/
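The mapping from the pretty URL to the fetched URL can be sketched in JavaScript. This is only an illustration: the function name `toEscapedFragmentUrl` is hypothetical, and the escaping is simplified to the few characters that would otherwise break the query string.

```javascript
// Hypothetical helper: map a "#!" pretty URL to the URL a crawler would fetch.
// Returns null when the URL has no fragment or the fragment lacks the "!" prefix.
function toEscapedFragmentUrl(prettyUrl) {
  const hashIndex = prettyUrl.indexOf("#");
  if (hashIndex === -1) return null;

  const base = prettyUrl.slice(0, hashIndex);
  const fragment = prettyUrl.slice(hashIndex + 1);
  if (!fragment.startsWith("!")) return null;

  // Simplified escaping (an assumption): only %, #, &, + and whitespace.
  const state = fragment.slice(1).replace(/[%#&+\s]/g, (ch) => encodeURIComponent(ch));
  const separator = base.includes("?") ? "&" : "?";
  return base + separator + "_escaped_fragment_=" + state;
}

console.log(toEscapedFragmentUrl("http://shepherdinteractive.com/#!portfolio/interactive/"));
```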

The problem I have with Google’s specification is that the pretty URL Ajax fragment prefix (!) is mandated; it is not customizable. I should be able to tell Googlebot which fragment identifiers are for Ajax content and which are not. Therefore, instead of requiring authors to conform to Google’s Ajax specification, I propose that Google adopt an extension to robots.txt which allows site owners to let Googlebot know what to look for. The current specification’s mandate for ! could be indicated via:

Ajax-Fragment-Prefix: !

Or it could be changed to anything else, such as “ajax:”. If the Ajax fragment doesn’t have a prefix at all (as in the case of Shepherd Interactive’s website above), a regular expression pattern match could be specified in robots.txt, for example:

Ajax-Fragment-Pattern: .+/

This would tell Googlebot that a URL with a fragment containing a slash should be fetched via the _escaped_fragment_ query parameter, and that the Ajax URL itself (including the fragment identifier) should be indexed and returned verbatim in the search results.
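To make the proposal concrete, here is a rough sketch of how a crawler might read these two directives and test a fragment against them. Everything here is hypothetical: the function names are invented for illustration, and since the proposal does not say whether the pattern is anchored, the sketch assumes it must match the entire fragment.

```javascript
// Hypothetical parser for the two proposed robots.txt directives.
function parseAjaxDirectives(robotsTxt) {
  const rules = { prefix: null, pattern: null };
  const directive = /^\s*(Ajax-Fragment-(?:Prefix|Pattern))\s*:\s*(.*?)\s*$/i;
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = directive.exec(line);
    if (!match) continue; // unrelated line such as "User-agent: *"
    if (match[1].toLowerCase() === "ajax-fragment-prefix") {
      rules.prefix = match[2];
    } else {
      // Assumption: the pattern must match the whole fragment.
      rules.pattern = new RegExp("^(?:" + match[2] + ")$");
    }
  }
  return rules;
}

// True when the fragment should be fetched via _escaped_fragment_.
function isAjaxFragment(fragment, rules) {
  if (rules.prefix !== null && fragment.startsWith(rules.prefix)) return true;
  if (rules.pattern !== null && rules.pattern.test(fragment)) return true;
  return false;
}

const rules = parseAjaxDirectives("User-agent: *\nAjax-Fragment-Pattern: .+/");
console.log(isAjaxFragment("portfolio/interactive/", rules));
console.log(isAjaxFragment("top", rules));
```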

It’s true that the Shepherd Interactive site implements Hijax (progressive enhancement with Ajax) techniques, so every Ajax URL has a corresponding non-Ajax URL; so in this sense, Google can still access all of the content. The problem, however, is with links to Ajax URLs from around the Web. I assume for Googlebot, every link to an Ajax URL without the obligatory prefix ! is interpreted as referring to the home page (site root /):

  • http://shepherdinteractive.com/#portfolio/interactive/harmer-steel/
  • http://shepherdinteractive.com/#about-us/our-company/
  • http://shepherdinteractive.com/#services/web-development/

So Google then assigns no additional PageRank to our Ajax URLs since it gets assigned to the home page. If, however, Googlebot could be told that those are actually Ajax URLs, then the PageRank could be properly assigned. (My assumptions here could be incorrect.)
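The effect I am assuming can be shown with a small sketch: if a crawler simply discards the fragment (which is never sent to the server in an HTTP request anyway), the three links above become indistinguishable. The helper name `stripFragment` is hypothetical, and whether Googlebot actually behaves this way is, again, my assumption.

```javascript
const links = [
  "http://shepherdinteractive.com/#portfolio/interactive/harmer-steel/",
  "http://shepherdinteractive.com/#about-us/our-company/",
  "http://shepherdinteractive.com/#services/web-development/",
];

// Hypothetical helper: drop everything from the first "#" onward.
function stripFragment(url) {
  return url.split("#")[0];
}

// All three links collapse to the same target, the site root.
const distinctTargets = new Set(links.map(stripFragment));
console.log([...distinctTargets]);
```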

Thoughts?

Update 2010-04-22: “Cowboy” Ben Alman brought up the excellent point that customizable Ajax-crawlable URLs need to work on a per-path basis, as different single-page web apps on the same server (or even in the same folder) might have different URI requirements. In addition to (or instead of) using robots.txt to tell Googlebot the format of the Ajax-crawlable URLs, why not allow that information to be placed in meta tags? Their specification already includes:

<meta name="fragment" content="!">

Why not make Googlebot support additional meta tags for recognizing the prefix (if any) or pattern that characterizes an Ajax-crawlable URL on a given page? For example:

Current default behavior:
<meta name="crawlable-fragment-prefix" content="!">
Starting with “/” like …about/#/us/:
<meta name="crawlable-fragment-prefix" content="/">
All fragments are crawlable:
<meta name="crawlable-fragment-prefix" content="">
Any fragment ending with a slash:
<meta name="crawlable-fragment-pattern" content=".+/">

With these meta tags, authors would be able to have complete page-level control over the structure of Ajax-crawlable URLs (hash fragments).
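As a rough sketch of how page-level control could work, the function below builds a fragment test from the proposed meta tags found in a page's HTML. The function name is hypothetical, and several things are assumptions: a regular expression stands in for a real HTML parser to keep the sketch self-contained, the pattern is assumed to match the whole fragment, and a page with none of these tags is assumed to fall back to the current ! prefix.

```javascript
// Hypothetical: derive a "is this fragment crawlable?" test from a page's meta tags.
function fragmentRuleFromHtml(html) {
  // Simplification: expects name before content, both double-quoted.
  const metaPattern =
    /<meta\s+name="crawlable-fragment-(prefix|pattern)"\s+content="([^"]*)"\s*\/?>/gi;
  const tests = [];
  for (const [, kind, content] of html.matchAll(metaPattern)) {
    if (kind.toLowerCase() === "prefix") {
      // An empty prefix matches every fragment.
      tests.push((fragment) => fragment.startsWith(content));
    } else {
      // Assumption: the pattern must match the entire fragment.
      const regex = new RegExp("^(?:" + content + ")$");
      tests.push((fragment) => regex.test(fragment));
    }
  }
  // Assumption: without any of the proposed tags, keep the current "!" behavior.
  if (tests.length === 0) tests.push((fragment) => fragment.startsWith("!"));
  return (fragment) => tests.some((test) => test(fragment));
}

const isCrawlable = fragmentRuleFromHtml('<meta name="crawlable-fragment-pattern" content=".+/">');
console.log(isCrawlable("about-us/our-company/"));
console.log(isCrawlable("top"));
```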

Photo of “Audible Ajax”

Weston Ruter and Audible Ajax

I miss the Audible Ajax podcast on Ajaxian. Photo by my friend Nathan Watkins; taken 2006-07-17 in Midelt, Morocco where I was soaking in the podcast.

We’re Having a Baby Boy!

Five months in, I’m pleased to publicly announce that my wife and I are expecting a baby boy at the end of May! His name is “Asecret” ;-)

Baby Boy Ruter profile pic taken 2009-12-18 08:39:43

Portland, Oregon Snow Driving

I took this video outside of Shepherd Interactive, my work. View in HD to see the sparks fly :-) See also the footage of my commute home which ended up being a journey half on foot from my work to downtown: part 1 and part 2.

What follows are screengrabs from Google Maps of the traffic around 5pm. One word: gridlock.

Google Text-To-Speech (TTS)

Update: Andufo shared the happy news that more languages are now available in the Google TTS service! I have added a new language selection drop-down for English, Spanish, French, German, Italian, and Haitian Creole.

Google Translate announced the ability to hear translations into English spoken via text-to-speech (TTS). Looking at the Firebug Net panel for where this TTS data was coming from, I saw that the speech audio is in MP3 format and is queried via a simple HTTP GET (REST) request: http://translate.google.com/translate_tts?tl=en&q=text. Google Translate originally noted that the speech was only available for short translations to English (multiple languages are now supported), and it turns out that the TTS web service restricts the text to 100 characters. Another restriction is that the service returns 404 Not Found if the request includes a Referer header (presumably one that is not for translate.google.com).
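Building the request URL described above can be sketched as follows. The function name `ttsUrl` is hypothetical, and the length check assumes the 100-character limit corresponds to the JavaScript string length; the endpoint and the `tl` and `q` parameters are the ones observed above.

```javascript
// Hypothetical helper: build the GET URL for the observed TTS endpoint.
function ttsUrl(text, lang = "en") {
  // Assumption: the service's limit maps to JavaScript string length.
  if (text.length > 100) {
    throw new RangeError("Text is limited to 100 characters");
  }
  const url = new URL("http://translate.google.com/translate_tts");
  url.searchParams.set("tl", lang); // target language
  url.searchParams.set("q", text);  // text to be spoken
  return url.toString();
}

console.log(ttsUrl("text"));
```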

In spite of the limitations of the web service, which certainly reflect the intention that it only be used by Google Translate, thanks to HTML5's new Audio element and rel="noreferrer", the service may be utilized by client-side web applications like the following (Google Chrome 4 recommended):

Google Text-To-Speech (TTS)

I am really excited at the prospect of text-to-speech being made available on the Web! It's just too bad that fetching MP3s from a remote web service is the only standard way of doing so currently; modern operating systems all have TTS capabilities, so it's a shame that web apps can't utilize them via client-side scripting. I posted to the WHATWG mailing list about such a Text-To-Speech (TTS) Web API for JavaScript, and I was directed to a recent thread about a Web API for speech recognition and synthesis.

Perhaps there is some momentum building here? Having TTS available in the browser would boost accessibility for the visually impaired and improve usability for people on the go. TTS is just another technology that has traditionally been relegated to desktop applications, but as the Open Web advances as the preferred platform for application development, it is an essential service to make available (as with the Geolocation API, Device API, etc.). And besides, I want to build TTS applications and my motto is: If it can't be done on the Open Web, it's not worth doing at all!