Tuesday, August 19, 2008

Making Captcha Technology Productive

Shelf:Life caught this story from last Thursday's Times - Captcha technology (where you have to type in a garbled word in order to post a comment or register on a given website) is being used to assist in OCR: "Instead of displaying a random collection of letters and numbers, the newly designed Captchas present the user with a word from an old manuscript [by which they don't actually mean manuscript, but rather printed document, like a book or newspaper] that a computer, somewhere, is having trouble deciphering."

Luis von Ahn, a 29-year-old Carnegie Mellon prof, devised the original Captcha system eight years ago, and plans to officially roll out the new version next month (some 45,000 sites are using it already). The recognized words are fed back to the Internet Archive, but von Ahn says other digitization projects could also use this system.

How it works: "When three people type in the same word, the system deduces that this must be the one displayed on the manuscript, and relays this to the computer which has been stumped by the mystery word." Perfect? Certainly not, but I bet it's not too far off (von Ahn says ReCaptcha has proven 99% accurate, compared with about 80% for traditional OCR methods). Given the potential, it seems this effort should be strongly encouraged. von Ahn: "About 60 million Captchas are solved around the world every day - each taking roughly ten seconds. Individually that's not a lot of time, but in aggregate these puzzles consume more than 150,000 hours of work each day. What if we could make more use of this effort?"
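
For the curious, here is a rough Python sketch of the "three people agree" rule as the article describes it, plus a quick check of von Ahn's arithmetic. This is only my illustration, not ReCaptcha's actual code: the function names, the lowercasing of answers, and the idea of stopping at the first word to reach three matches are all assumptions on my part.

```python
from collections import Counter


def agreed_word(typed_answers, votes_needed=3):
    """Return the first answer typed by votes_needed people, or None.

    Assumption: answers are compared after trimming whitespace and
    lowercasing; the article does not say how matching is done.
    """
    counts = Counter()
    for answer in typed_answers:
        normalized = answer.strip().lower()
        if not normalized:
            continue  # ignore blank submissions
        counts[normalized] += 1
        if counts[normalized] >= votes_needed:
            return normalized
    return None  # no word has enough matching answers yet


def daily_hours(captchas_per_day=60_000_000, seconds_each=10):
    """Total human time spent per day, in hours, using the quoted figures."""
    return captchas_per_day * seconds_each / 3600


# Three of the four hypothetical typists agree, so the word is accepted.
print(agreed_word(["Manuscript", "manuscript", "manuscrlpt", "manuscript "]))
# 60 million Captchas at ten seconds each, expressed in hours.
print(round(daily_hours()))
```

The second number comes out to roughly 166,667 hours, which squares with the quoted "more than 150,000 hours of work each day."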

I like it. Gold star.

[Update: Clearly I didn't read the Globe closely enough this Sunday. Matthew Battles has a piece in the "Ideas" section about this very topic, which is also well worth reading. Thanks to PKS for tipping me off.]