Tom's Vanity Press
The online home of some random schmo.
reCAPTCHA - wanky name, awesome concept
May 7, 2010 at 21:50:22
Categories: meta tech history

At the risk of turning this into "yet another tech blog" (do we really need to talk about computers on the computer all the time!?), I would like to announce that I've finally gotten around to implementing reCAPTCHA on my vanity press after hearing about it just under a year ago. "Regular readers" of this blog know all too well about the glacial pace at which it is updated, content- and features-wise.

For those of you who have been living like ostriches this past year or so, reCAPTCHA is - in my opinion - a novel solution to two seemingly unrelated problems.

On the one hand, you've got millions upon billions of webpages out there with forms, comment boxes and the like that can be filled in and submitted either manually by legitimate users (and non-legitimate spammers) of the websites or automatically by spam-bot programs written by shady individuals. The latter is a problem that has plagued the World Wide Web (and the rest of the Internet) since time immemorial and continues to do so to this day.

On the other hand, you've got a massive pile of newspapers with issues dating back one hundred, maybe two hundred years or more. Newsprint is a low-grade type of paper that tends to decay rather rapidly, but it is relatively cheap and the contents of most newspapers are ephemeral in nature, so publishers keep using it. Newspapers also take up significant amounts of space in storage. But what if you want to hang on to the papers for posterity, so that future generations may know that a pair of good leather shoes cost sixpence in 1906 or be able to read second-hand accounts of key historical events written as they happened, et cetera?

The traditional solution has been to transfer newspapers onto microfilm, which can be read with a microfilm reader (essentially a light projected through the film onto a white surface).
It's not a bad system - microfilms, after all, take up far less space than newspapers and decay more slowly - but it does have its drawbacks. Since most newspapers are not indexed, it can be a wild goose chase trying to find a specific article (or set of articles). Also, microfilm readers can make you nauseous if you use them for too long - I speak from experience!

With increased computing power and storage and improved optical character recognition (OCR) techniques, it has become increasingly commonplace for newspaper articles to be digitised. While digital storage media aren't completely immune to the ravages of time and other forms of corruption, they are many orders of magnitude less volatile than acetate-based microfilm or paper, they can be perfectly reproduced and copied/backed up, and their contents can be readily indexed and cross-referenced, making the discovery of information a lot simpler and less time-consuming.

So clearly, digitising newspapers/books/journals/periodicals/etc. is the way to go. There's still one major problem: the character-recognition programs aren't perfect. They can't accurately transcribe each and every word from the newspapers due to the imperfect nature of ink-printed text on hundred-year-old newsprint. And there's no way in buggery that a small but dedicated team of researchers can pore over each and every newspaper in their collection to check for accuracy - they'd be finished long after the Four Horsemen of the Apocalypse have ridden into town!

Then one day, a bright spark by the name of Luis von Ahn had an idea. "OK, we've got all of this digitised text that our computers have had trouble recognising. I'm pretty sure that no computer program out there could accurately recognise the text. We also have these spam-bot programs out there on the 'net, relentlessly filling in web forms, clogging up Internet traffic with ads for Viagra substitutes and what have you and making the 'net an unpleasant place for everyone else.
But these web forms have millions of legitimate human users. Humans can accurately interpret printed text in ways computers cannot. So why not give webpage users a little test to see if the filled-in and submitted online form has been sent by a human or a machine? It'll be a 'Completely Automated Public Turing test to tell Computers and Humans Apart', or CAPTCHA for short. We could build these CAPTCHAs for those poor, hapless web users out of the words our computers couldn't digitise properly. We'll give them a word that we've correctly digitised and another unknown word, but we won't tell them which is the known word. If the user types the known word correctly, then chances are the user's response for the other word is correct as well, so the form can be successfully submitted and we'll have a new word successfully digitised. Of course, we'll have to put this word out a few times to different users, just to be sure. The poor spam-bot, however, is likely to fail at guessing the word just like our computers did, so the spam it produces will be stopped dead in its tracks. Millions of webpage forms, millions of users, millions of form submissions a second... why, we'll have our text accurately and fully digitised in no time!"

(I'm paraphrasing Dr. von Ahn's thought processes, of course. Who knows what his brilliant mind was really thinking...)

At the time of writing, the reCAPTCHA project is proofreading the New York Times, but when that's finished, they'll no doubt be moving on to other newspapers and books, especially since that great Internet behemoth Google - which already has its own vast digital library of books - acquired reCAPTCHA last year.

The National Library of Australia has been busy digitising my fair nation's newspapers as well. However, they have adopted an all-too-slow method of proofreading what has been digitised.
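For the curious, the two-word trick described above can be sketched in a few lines of Python. This is only my own toy illustration, not how reCAPTCHA is actually built: the class name, the agreement threshold of three and the lower-casing of answers are all assumptions I've made up for the example.

```python
from collections import Counter

# Hypothetical threshold: how many matching answers we want from
# different users before we trust a transcription of the unknown word.
AGREEMENT_THRESHOLD = 3


def normalise(word):
    """Ignore case and stray whitespace when comparing answers."""
    return word.strip().lower()


class WordPairChallenge:
    """One known (control) word shown alongside one unknown word."""

    def __init__(self, known_answer):
        self.known_answer = normalise(known_answer)
        self.votes = Counter()  # guesses collected for the unknown word

    def submit(self, known_guess, unknown_guess):
        """Return True if the submitter passes as human.

        The guess for the unknown word is only counted as a vote
        when the control word was typed correctly.
        """
        if normalise(known_guess) != self.known_answer:
            return False  # failed the control word: treat as a bot
        self.votes[normalise(unknown_guess)] += 1
        return True

    def accepted_transcription(self):
        """Return the unknown word once enough users agree, else None."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= AGREEMENT_THRESHOLD else None
```

A form handler would call submit() with whatever the visitor typed, let the comment through only if it returns True, and check accepted_transcription() now and then to see whether the mystery word has finally been settled.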
Compare the NLA's claimed 9000+ site visitors who have offered their corrections while trawling through random newspaper pages online to the millions of anonymous users offering their corrections simply by submitting a form, and you can see why it would be awesome if the NLA got in bed with reCAPTCHA.

Hopefully, this means that I will stop receiving daily spam comments (which I don't publish, obviously - I find none of them particularly amusing), although I'm a thousand percent sure teams of spammers are busily beavering away as I type (and as you read), trying to defeat the system, so it'll only be a matter of time before I receive automated spam comments again.

For now though, you, dear reader, are able to make not one, but two worthwhile contributions: the first being constructive, interesting and spam-free comments on this or any other of my posts, current or yet to be written; the second being your small contribution to ensuring that a part of written human history is preserved for current and future generations, so that they may have ready access to an invaluable source of information that has been extensively vetted for accuracy.

I am proud to have finally become a participant in reCAPTCHA - a truly worthwhile project for humanity.

Comment now! Preserve history!

Tom

Last updated: May 8, 2010 at 14:18:23
Your host
Tomislav "Tom" Bozic, a "recovering hikikomori" and "Croatian mirepoix", was born on 14th Iyyar 5744, or 27th Floréal CXCII, and spends most of his time within the Sydney, New South Wales, Australia metropolitan area. (The rest shall be revealed in due course...)
All dates and times displayed on this page are based on Sydney local time.