March 2009 Archives

Underscores are so 90s


Something unexpected I just learned: many now consider it best practice to use hyphens ( - ) instead of underscores ( _ ) when separating words in URLs. For example, instead of this:

http://example.com/fun_stuff/my_cats.html

You'd be better off doing this:

http://example.com/fun-stuff/my-cats.html

Why? Simple: Google recommends it. According to Matt Cutts, the big G will do a better job recognizing and indexing the words in your URL if you separate them with hyphens.
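If you generate your URLs from titles, the convention is easy to apply in code. Here's a minimal sketch, assuming you want lowercase ASCII slugs; the function name hyphenate_slug is my own invention, not part of Movable Type or any Google tool:

```python
import re


def hyphenate_slug(text: str) -> str:
    """Lowercase the text and join its words with hyphens."""
    # Treat underscores, spaces, and any other run of non-alphanumeric
    # characters as a word break.
    words = re.split(r"[^a-z0-9]+", text.lower())
    # Drop empty pieces left over from leading or trailing separators.
    return "-".join(word for word in words if word)


print(hyphenate_slug("fun_stuff"))  # fun-stuff
print(hyphenate_slug("My Cats"))    # my-cats
```

Note that this throws away non-ASCII letters entirely, so a real site would probably want to transliterate them first.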

This has everything to do with search engine optimization (SEO), a term I tend to shy away from, what with its coming from the sinister realm of marketing versus the practical software architecture tack that I try to stick with. Sometimes the paths cross unexpectedly, though.

For me, this happened most recently as I set up a new Movable Type (4.23) instance for a client. I was surprised to find that, by default, it named every file it created with dashes instead of underscores. Googling the topic led me to discussions from people asking how to get MT to use dashes in its filenames instead of underscores - the exact opposite of what I sought, but I was intrigued that so many should care. A little more searching led to this post on Google Inside, as well as the articles I linked to above.

I always approach the notion of work-pattern changes due to Google Fiat with a bit of skepticism, but this one's apparently settled into place globally over the last few years without my really noticing. Interesting, anyway.

Ajaxload is a nifty do-one-thing-well web-based service. Poke in a few parameters, and out pops a little animated spinner graphic. It's optimized against the background color of your choice, and free for you to download and use however you wish.

In web applications, you most often see these graphics used wherever there's AJAX. Their sudden appearance and rolling motion help reassure users that something is happening, and they should stand by and await further results. They often look something like this: [image: tiny_spinner.gif]. (That, in fact, is one I just created with Ajaxload for use with Planbeast, a new side project of mine.)

I somehow failed to find this site when searching Google on likely keywords earlier this afternoon; I found it instead through a link in some Scriptaculous documentation. Many thanks to Catherine Roman for this simple and useful service.

I recently finished a project for a client that involved installing CAPTCHAs on their various web forms. You've seen these before - they're the little widgets that challenge you to retype the intentionally garbled numbers and letters they display in order to prove that you're an actual human, and not a node of some spammer's botnet.

In researching the current best practices for adding a CAPTCHA to an existing site, I found ReCAPTCHA, a project of Carnegie Mellon University. You've probably come across examples of this particular flavor of CAPTCHA before. The image always consists of two unrelated words or word fragments, usually resembling smudgily typed copy with some additional bot-foiling visual artifacts thrown on for good measure. Something like this:

[image: captcha.png]

It happens that the two words have a very good reason for looking like they do: they have been scanned out of old printed books and periodicals, part of various CMU-affiliated efforts to digitize old media. As ReCAPTCHA's own About page explains, even the best OCR software is only so-so at recognizing text, and frequently can't recognize words that would be obvious to any human reader of the language at hand.

Now, here's the cool part: by plugging itself into ReCAPTCHA, a computer working in one of these massive scanning projects can submit a word it's unsure about to the global community of people who happen to be filling in web-based forms at that very moment. It will quickly get a response that 98 percent of all the people who saw it thought the word was "doggy" (or whatnot), and that will be enough agreement for the machine's purposes.
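To make the agreement idea concrete, here's a rough sketch of how such a tally might work. This is my own illustration, not ReCAPTCHA's actual code: the function name consensus, the 90 percent threshold, and the three-vote minimum are all assumptions.

```python
from collections import Counter
from typing import Optional


def consensus(readings: list[str], threshold: float = 0.9,
              min_votes: int = 3) -> Optional[str]:
    """Return the word most people typed, or None if they haven't agreed yet."""
    # Normalize each reading so "Doggy " and "doggy" count as the same vote.
    votes = Counter(r.strip().lower() for r in readings if r.strip())
    total = sum(votes.values())
    if total == 0 or total < min_votes:
        return None  # Not enough people have seen the word yet.
    word, count = votes.most_common(1)[0]
    # Accept the top answer only if a large enough share of readers chose it.
    return word if count / total >= threshold else None


# 49 of 50 readers (98 percent) typed "doggy", so the word is accepted.
print(consensus(["doggy"] * 49 + ["doggv"]))  # doggy
```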

The reason there are two words per ReCAPTCHA instance is that one of the words is undergoing this kind of trial, while the other one already has - in other words, the ReCAPTCHA system already knows what word it is. This is how the widget still functions as a CAPTCHA - the entity filling it in must still be correct about at least one of the words, if it wants to prove that it's a human. Meanwhile, bots are foiled not just for the usual reasons, but because all the words on display have already proven to be confusing to computers trying to read them!
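The two-word trick can be sketched in a few lines. Again, this is only my guess at the shape of the logic, under the assumption that a response is judged on the known word alone; check_response and its parameters are hypothetical names, not ReCAPTCHA's API.

```python
def check_response(control_word: str, typed_control: str,
                   typed_unknown: str, unknown_votes: list[str]) -> bool:
    """Judge the response on the known word; keep the other reading on a pass."""
    # Only the control word, whose answer is already known, decides pass or fail.
    passed = typed_control.strip().lower() == control_word.strip().lower()
    # A reading of the unknown word is recorded only when the person got
    # the control word right, so failed attempts can't pollute the tally.
    if passed and typed_unknown.strip():
        unknown_votes.append(typed_unknown.strip().lower())
    return passed


votes: list[str] = []
print(check_response("morning", "Morning", "doggy", votes))   # True
print(check_response("morning", "mourning", "doggy", votes))  # False
print(votes)                                                  # ['doggy']
```

The person filling in the form never learns which word was the control, which is what keeps them honest about both.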

I think this is incredibly cool. That slimy spammers have made technologies like CAPTCHAs a necessity of the modern web is quite unfortunate, but the way that ReCAPTCHA has found a way to put a positive, culture-preserving spin on it is ingenious and laudable.