Monday, December 20, 2010

Google's Ngram Viewer

I've been playing off and on with Google's Ngram Viewer since it was announced on Friday. This is the tool that lets you graph how frequently given words or phrases were used over time. All sorts of interesting experiments are possible -- for example, try comparing the usage of a word against a synonym, an antonym or a euphemism (or you could examine those three words themselves -- "antonym" seems to be much less frequently used but growing in frequency!).

But I've already noted some anomalies. The plot for "United States of America" is surprisingly spiky, with remarkably few mentions in the early 1800s. That is perhaps an artifact of the sources available for the Google book digitization project, but it does cast doubt on some of the conclusions being drawn from this tool.

Worse, there are definitely some issues with dating and with automated text recognition. Search for "Genomics", and some awfully early references show up. These seem to fall into two categories: serious book dating errors and text recognition errors. In the former category, I don't believe Nucleic Acids Research was being published in 1835, and a number of other periodicals seem to be afflicted with similar misdatings. In the latter, "générales" seems to be a favorite to be transmuted into "genomics".

These issues do not invalidate the tool, but they do urge caution in interpreting results -- particularly if trying to explore the emergence and acceptance of a new term.

One way to deal with this would be to turn the problem around. A systematic search for anachronistic word patterns could identify misdatings or questionable datings in either direction. This would catch not only documents transported backwards in time, but also ones which should be flagged for time travel in the other direction. For example, using the tool I discovered that someone sharing my surname co-authored a screed against Masonry back in the 1700s -- and this same work shows up as a modern book due to a reprinting in recent years.
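
To make that idea a bit more concrete, here is a rough Python sketch of what such an anachronism check might look like. Everything in it is my own assumption for illustration: the `Doc` record, the `flag_anachronisms` function, and especially the `TERM_WINDOWS` table, whose years are placeholder values rather than measured data. A real version would presumably derive the windows from the downloadable ngram counts themselves, and would need something smarter than plain substring matching.

```python
from dataclasses import dataclass

# Hypothetical table: term -> (earliest, latest) plausible year of common use.
# None means "no bound". The years are illustrative placeholders only.
TERM_WINDOWS = {
    "genomics": (1986, None),
    "nucleic acids": (1889, None),
    "phlogiston": (None, 1850),
}


@dataclass
class Doc:
    title: str
    claimed_year: int  # the date the catalogue assigns to the document
    text: str


def flag_anachronisms(doc, windows=TERM_WINDOWS):
    """Return (term, reason) pairs for terms that clash with the claimed date."""
    lowered = doc.text.lower()  # crude case-insensitive substring matching
    flags = []
    for term, (earliest, latest) in windows.items():
        if term not in lowered:
            continue
        if earliest is not None and doc.claimed_year < earliest:
            # Modern vocabulary in a supposedly old document
            flags.append((term, "dated too early"))
        elif latest is not None and doc.claimed_year > latest:
            # Archaic vocabulary in a supposedly modern document
            flags.append((term, "dated too late"))
    return flags


if __name__ == "__main__":
    suspect = Doc("Nucleic Acids Research", 1835,
                  "Recent advances in genomics and nucleic acids")
    for term, reason in flag_anachronisms(suspect):
        print(f"{suspect.title} ({suspect.claimed_year}): '{term}' -> {reason}")
```

A "dated too late" flag is only a hint that something deserves a human look: a modern history of chemistry can legitimately discuss old terms, while a recent reprint of an old text is exactly the case worth catching.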

But in any case, it is an interesting way to explore language and culture, even without a little tidying & curation.

3 comments:

CJ said...

I'll indulge in a little shameless promotion of a friend and colleague's work and point you to MLTrends (http://www.ogic.ca/mltrends, Palidwor et al. (2010) J. Biomed. Discov. Collab. 5, 1-6), which does essentially the same thing for Medline.

gawp said...

Not bad, but they manage to avoid citing prior work in visualizing and analyzing word usage in a time-annotated corpus of text. And there is quite a bit of it.

e.g. Batchelor MT, Henry BI, Watt SD: Who cares what's new? Nature 1997, 387(6631): 337.

+10 points for making the ngram files available for download, though.

bluekeybox said...

> The plot for "United States of America" is surprisingly spiky, with surprisingly few mentions in the early 1800s. That is perhaps an artifact of the sources available for the Google book digitization project, but it does cast concern on some of the conclusions being drawn from this tool

That's because Google Ngram is case-sensitive. Try searching for United States of America.