Friday 8 February 2013

ReCaptcha, spam and the world of digital books (honestly)

Jenny writes: I was trying to buy tickets for One Direction’s matinee show in Manchester through Ticketmaster, when I encountered a sentence that transported me to a new universe.
(And it's not the one you've just read.) Ticketmaster's website wanted to confirm that I was a real person, not a spambot (I was that close to scoring the tickets, I swear). It asked me to type in two words to prove that I was a human being. So far, so normal. Those wavy, unintelligible words framed in a box that you're supposed to decipher - officially known as Captcha codes - are everywhere online. But what caught my eye was a little disclaimer at the side.
“By entering the words in the box you are also helping digitize books from the Internet Archive and preserve literature that was written before the computer age.”
Really? How?
Professor Luis von Ahn of Carnegie Mellon University - inventor of the Captcha code - realised a few years ago that the energy we put into solving these codes could be put into doing something more useful. He was aware of the many organisations trying to digitise all sorts of literature and other material, like the Internet Archive, a non-profit digital library whose aim is to offer "free universal access to books, movies & music, as well as 260 billion archived web pages".
The trouble is that when old paper documents are digitised, some words remain indecipherable to the scanning machine and have to be verified by a human being. When you've got a project as massive as the Internet Archive's, that's incredibly time-consuming. Luis von Ahn's idea was to take the words that the scanner hasn't recognised and pump them out as reCaptcha codes - harnessing the brain power of online shoppers to solve the problem words. Once a word has been typed in the same way by a couple of individuals, it gets popped back into a database and, when verified, is used to help fully digitise the text it came from. (A more detailed explanation of the process can be found here.)
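
For the curious, the process described above can be sketched in a few lines of Python. This is only an illustration, not reCaptcha's actual code: the class and function names are made up, the threshold of two matching answers is a guess based on "a couple of individuals", and the idea that one of the two words is a known "control" word used to check you are human is an assumption about how the pairing works.

```python
from collections import Counter, defaultdict

# Assumed value: how many matching answers count as "a couple of individuals".
AGREEMENT_THRESHOLD = 2


def normalise(text):
    """Ignore case and stray spaces when comparing typed answers."""
    return text.strip().lower()


class WordPool:
    """Hypothetical store of scanned words the machine could not read."""

    def __init__(self, threshold=AGREEMENT_THRESHOLD):
        self.threshold = threshold
        self.votes = defaultdict(Counter)  # word id -> tally of typed answers
        self.verified = {}                 # word id -> accepted transcription

    def submit(self, word_id, control_answer, control_guess, unknown_guess):
        """Return True if the visitor passes as human.

        The control word has a known answer; the unknown word does not.
        Only visitors who get the control word right get a vote on the
        unknown one.
        """
        if normalise(control_guess) != normalise(control_answer):
            return False  # failed the human check, so the vote is discarded

        if word_id not in self.verified:
            guess = normalise(unknown_guess)
            self.votes[word_id][guess] += 1
            # Enough people agree: accept the word for the digitised text.
            if self.votes[word_id][guess] >= self.threshold:
                self.verified[word_id] = guess
        return True


pool = WordPool()
pool.submit("scan-17", "morning", "Morning", "upon")   # first vote for "upon"
pool.submit("scan-17", "morning", "morning", " Upon ")  # second vote, now accepted
print(pool.verified)
```

In this sketch a wrong answer on the control word throws away the whole attempt, which is what would stop a spambot from stuffing the database with nonsense.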

This isn’t news to people who know about techie stuff: there have been articles in the Guardian, Independent and New York Times. But I must have been doing something else those days.
Is this the stuff of digital utopia - free and instant access to literature, including all the back issues of the New York Times? Or is it a nightmare: yet another time-wasting barrier between me and my consumable goods? Are we all just feeding the machine? Google bought reCaptcha in 2009, and has been using the technology to help Google Books, its controversial large-scale project to digitise out-of-copyright books (controversial because it turned out many of them weren't actually out of copyright). What's certain is that by typing in these codes, we're all contributing to the new digital landscape.
And it may be that the shelf life of these codes is pretty limited anyway. People hate them, and other solutions are constantly being developed. Since I drafted this article, Ticketmaster has announced it is dropping Captchas in favour of a more user-friendly alternative.
So perhaps reCaptcha codes have already had their moment in the sun. Overall, I feel good that I've helped other people access free books online. And that I've taken part in the world’s largest palaeography project.
So having done my good deed for the day, only one question remains: anyone got spare One Direction tickets?
