Friday, December 17, 2010

New Addictive Google Tool

Yesterday afternoon Google Labs released Books Ngram Viewer, a nifty graphing tool based on a corpus of 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish (from among the 15 million books scanned by Google since 2004). The full datasets are also downloadable. This is the first time such an extensive collection of data has been made available to researchers; even so, its limitations must be considered (they are well laid out in a Guardian piece).

The scholarly thrust behind this is an article in Science, "Quantitative Analysis of Culture Using Millions of Digitized Books." There is more coverage from the NYTimes (which also considers some of the drawbacks to this approach), NPR, and the Boston Globe.

Ngram has proven addictive, with lots of interesting uses cropping up on blogs and Twitter. One of my favorites, from @cliotropic, is here ("American bifurcated-garment nomenclature"). I've also found it fascinating to look at things like diseases, rivalries, more rivalries, farm animals, &c. There's lots to examine here, and much fertile ground for scholars to work with.

[I'm going to add some updates to this as they come out: the Scientific American coverage is important, as is the worry that this tool will result in serious misinterpretations of data. Also, I'm not a huge fan of the term the study's authors are using for this kind of work: Culturomics (now complete with website). Also added: the first in a series of posts from Ben Schmidt at Sapping Attention, which makes for essential reading, and Mark Davies' call for a comparison of Ngram to COHA (Corpus of Historical American English), which can do some even more interesting things and in many ways is much more useful to scholars. At Thingology, Tim points out one of the huge metadata problems (the Google OCR's failure to read the long s as an s rather than an f), and Natalie Binder maintains the viewer "isn't ready for prime time."]

[Further update: there's now an important critique by Geoffrey Nunberg in the Chronicle Review.]