Proposal for Customizing Google’s Crawlable Ajax URLs

On the Shepherd Interactive site, we have a dynamic navigation menu in Flash. To keep it from reloading every time the page changes, I implemented Ajax page loading so that the SWF only has to load once. This is similar to what Lala and Facebook do. So if your browser is Ajax-enabled, upon visiting:

http://shepherdinteractive.com/portfolio/interactive/

you are redirected to the site root (/), with the old path supplied as the URL hash fragment, which is then loaded in via JavaScript as the page content:

http://shepherdinteractive.com/#portfolio/interactive/
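
The mechanism is roughly as follows (a minimal sketch rather than the site’s actual code; the “content” element ID and the convention that the server returns an HTML partial for the hashed path are assumptions for illustration):

// Fetch the page content named by the hash fragment and inject it.
function loadFromHash() {
  var path = window.location.hash.replace(/^#/, ''); // e.g. "portfolio/interactive/"
  if (!path) return;
  var xhr = new XMLHttpRequest();
  xhr.open('GET', '/' + path, true);
  xhr.setRequestHeader('X-Requested-With', 'XMLHttpRequest'); // hint the server to return a partial
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
      document.getElementById('content').innerHTML = xhr.responseText;
    }
  };
  xhr.send();
}

// Run on page load and whenever the fragment changes.
window.addEventListener('load', loadFromHash);
window.addEventListener('hashchange', loadFromHash);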

However, according to Google’s Making AJAX Applications Crawlable specification, an Ajax “pretty URL” is any URL whose hash fragment begins with !, for example:

http://shepherdinteractive.com/#!portfolio/interactive/

The purpose of the ! is merely to inform Googlebot that such a URL is for an Ajax page whose content can be fetched via:

http://shepherdinteractive.com/?_escaped_fragment_=portfolio/interactive/
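
The rewrite Googlebot performs is mechanical: everything after #! moves into the _escaped_fragment_ query parameter. A sketch of the mapping (encodeURIComponent here is only an approximation of the spec’s escaping rules, which leave characters such as slashes unescaped, as in the example above):

// Map a pretty Ajax URL (#!fragment) to the URL Googlebot actually fetches.
function toCrawlableUrl(prettyUrl) {
  var i = prettyUrl.indexOf('#!');
  if (i === -1) return prettyUrl; // not an Ajax pretty URL
  var base = prettyUrl.slice(0, i);
  var fragment = prettyUrl.slice(i + 2);
  var separator = base.indexOf('?') === -1 ? '?' : '&';
  return base + separator + '_escaped_fragment_=' + encodeURIComponent(fragment);
}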

The problem I have with Google’s specification is that the pretty-URL Ajax fragment prefix (!) is mandated; it is not customizable. I should be able to tell Googlebot which fragment identifiers are for Ajax content and which are not. Therefore, instead of requiring authors to conform to Google’s Ajax specification, I propose that Google adopt an extension to robots.txt that lets site owners tell Googlebot what to look for. The current specification’s mandate for ! could be indicated via:

Ajax-Fragment-Prefix: !

Or it could be changed to anything else, such as “ajax:”. If the Ajax fragment doesn’t have a prefix at all (as in the case of Shepherd Interactive’s website above), a regular expression pattern match could be specified in robots.txt, for example:

Ajax-Fragment-Pattern: .+/

This would tell Googlebot that a URL with a fragment containing a slash should be fetched via the _escaped_fragment_ query parameter, and that the Ajax URL itself (including the fragment identifier) should be indexed and returned verbatim in the search results.
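
To make the idea concrete, here is a sketch of how a crawler might apply the two (hypothetical) directives to a discovered URL:

// Return the Ajax fragment to fetch via _escaped_fragment_, or null if
// the URL’s fragment is not Ajax content under the robots.txt rules.
// Both directives are hypothetical, per the proposal above.
function getCrawlableFragment(url, rules) {
  var i = url.indexOf('#');
  if (i === -1) return null;
  var fragment = url.slice(i + 1);
  if (rules.prefix !== undefined) { // Ajax-Fragment-Prefix
    return fragment.indexOf(rules.prefix) === 0
      ? fragment.slice(rules.prefix.length)
      : null;
  }
  if (rules.pattern) { // Ajax-Fragment-Pattern
    return new RegExp('^(?:' + rules.pattern + ')$').test(fragment) ? fragment : null;
  }
  return null;
}

// With “Ajax-Fragment-Pattern: .+/” as above:
// getCrawlableFragment('http://shepherdinteractive.com/#portfolio/interactive/', { pattern: '.+/' })
// => 'portfolio/interactive/'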

It’s true that the Shepherd Interactive site implements Hijax (progressive enhancement with Ajax) techniques, so every Ajax URL has a corresponding non-Ajax URL; in this sense, Google can still access all of the content. The problem, however, is with links to Ajax URLs from around the Web. I assume that Googlebot interprets every link to an Ajax URL lacking the obligatory ! prefix as referring to the home page (site root /):

  • http://shepherdinteractive.com/#portfolio/interactive/harmer-steel/
  • http://shepherdinteractive.com/#about-us/our-company/
  • http://shepherdinteractive.com/#services/web-development/

So Google assigns no additional PageRank to our Ajax URLs, since it all gets assigned to the home page instead. If, however, Googlebot could be told that those are actually Ajax URLs, then the PageRank could be properly assigned. (My assumptions here could be incorrect.)

Thoughts?

Update 2010-04-22: “Cowboy” Ben Alman brought up the excellent point that customizable Ajax-crawlable URLs need to work on a per-path basis, as different single-page web apps on the same server (or even in the same folder) might have different URI requirements. So in addition to (or instead of) using robots.txt to tell Googlebot the format of Ajax-crawlable URLs, why not allow that information to be placed in meta tags? Google’s specification already includes:

<meta name="fragment" content="!">

Why not make Googlebot support additional meta tags for recognizing the prefix (if any) or pattern that characterizes an Ajax-crawlable URL on a given page? For example:

<!-- Current default behavior: -->
<meta name="crawlable-fragment-prefix" content="!">

<!-- Fragments starting with “/”, as in …about/#/us/: -->
<meta name="crawlable-fragment-prefix" content="/">

<!-- All fragments are crawlable: -->
<meta name="crawlable-fragment-prefix" content="">

<!-- Any fragment ending with a slash: -->
<meta name="crawlable-fragment-pattern" content=".+/">

With these meta tags, authors would have complete page-level control over the structure of their Ajax-crawlable URLs (hash fragments).
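
For illustration, a crawler honoring these (again, hypothetical) meta tags might decide crawlability per page along these lines:

// Given a parsed document and a URL’s hash fragment, decide whether the
// fragment denotes Ajax-crawlable content. The meta names are the
// hypothetical ones proposed above.
function fragmentIsCrawlable(doc, fragment) {
  var prefixTag = doc.querySelector('meta[name="crawlable-fragment-prefix"]');
  var patternTag = doc.querySelector('meta[name="crawlable-fragment-pattern"]');
  if (prefixTag) {
    // An empty content value matches every fragment, i.e. all are crawlable.
    return fragment.indexOf(prefixTag.getAttribute('content')) === 0;
  }
  if (patternTag) {
    return new RegExp('^(?:' + patternTag.getAttribute('content') + ')$').test(fragment);
  }
  return fragment.charAt(0) === '!'; // fall back to the current spec’s default
}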

6 replies on “Proposal for Customizing Google’s Crawlable Ajax URLs”

Weston, I agree completely… The completely arbitrary fragment pattern that Google is enforcing here is exactly that… arbitrary. I don’t want to have to change the way my URIs are decorated on Google’s whim. The robots.txt suggestion is SO much more user-friendly (and hidden from the end user and my JavaScript).

Granted, any proposed solution needs to be able to work on a per-path basis, as different single page web apps on the same server (or even in the same folder) might have different URI requirements.

@cowboy: I’m glad you agree! Any idea on how to communicate our concerns back to Google? I searched and searched for a forum/group to be able to share my feedback, but I didn’t come up with much.

Weston, I think the meta tag is a great per-page solution, and I wouldn’t mind seeing even just a plain ol’ <meta name="crawlable-fragment" content="yes"/> or something similar that denotes “this is a crawlable single-page web app.”

I like the idea of us having more control as opposed to being locked into Google’s conventions. I also strongly feel we should be able to control the “_escaped_fragment_=” part too… what if my app already uses that parameter name for something else?

Uh, rename it? You’re already going to be jumping through hoops to make all this work anyway. I suppose you want to customize background-color to be backgroundColor, or just bgc, in CSS too? Frankly, the background details end users never see aren’t much of a concern to me, and #! is a fair convention to use, if we’re going to be adopting this mess.

Granted, they could offer something more customizable, but in the end all it means is more processor time on their servers just to stroke the egos of the 0.0001% of us who make the web sites. Anyway, why would you, being Google, want to give up branding/bragging rights?

I don’t think we’re there yet with this #! mess; but we can’t disguise theming it as improving it either…
