Last update: May 13, 2013
An N-gram is a contiguous sequence of n items from a given sequence, collected from a text or speech corpus. An N-gram could be any combination of letters, phonemes, syllables, words or base pairs, according to the application.
An N-gram of size 1 is referred to as a unigram, size 2 as a bigram, and size 3 as a trigram. Larger sizes are referred to by the value of N (four-gram, five-gram, …). N-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using an N-gram distribution.
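As a minimal illustration of the definition above, the following Python sketch extracts the contiguous N-grams of a token sequence; it assumes simple whitespace tokenization and the function name is ours, not a standard library API:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 1))  # unigrams: [('to',), ('be',), ('or',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('to', 'be'), ('be', 'or'), ...]
print(ngrams(tokens, 3))  # trigrams: [('to', 'be', 'or'), ...]
```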
“All Our N-gram are Belong to You” was the title of a post published in August 2006 by Alex Franz and Thorsten Brants on the Google Research Blog. Google believed that the entire research community should benefit from access to the massive amounts of data it had collected by scanning books and analyzing the web. The data was distributed by the Linguistic Data Consortium (LDC) of the University of Pennsylvania. Four years later (December 2010), Google unveiled an online tool for analyzing the history of the data digitized as part of the Google Books project (the N-Gram Viewer). The appeal of the N-Gram Viewer was obvious not only to scholars (professional linguists, historians, and bibliophiles) in the digital humanities, linguistics, and lexicography; casual users also got pleasure out of generating graphs showing how key words and phrases changed over the past few centuries.
Version 2 of the N-Gram Viewer was presented in October 2012 by engineering manager Jon Orwant. A detailed description of how to use the N-Gram Viewer is available on the Google Books website. The maximum string that can be analyzed is five words long (a five-gram). Mathematical operators allow you to add, subtract, multiply, and divide the counts of N-grams. Part-of-speech tags are available for advanced use, for example to distinguish between verb and noun uses of the same word. To make trends more apparent, the data can be viewed as a moving average (0 = raw data without smoothing, 3 = default, 50 = maximum). The results are normalized by the number of books published in each year. The data can also be downloaded for further exploration.
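To make the normalization and smoothing steps concrete, here is a small Python sketch using invented yearly counts; it only illustrates the idea of relative frequency and a centered moving average, and is not Google's actual pipeline:

```python
# Hypothetical yearly counts for one phrase, and hypothetical totals per year
# (standing in for the size of the corpus published that year).
counts = {1900: 120, 1901: 90, 1902: 200, 1903: 160, 1904: 110}
totals = {1900: 1.0e6, 1901: 0.9e6, 1902: 1.4e6, 1903: 1.2e6, 1904: 1.0e6}

# Normalize: relative frequency per year instead of raw counts.
freq = {year: counts[year] / totals[year] for year in counts}

def smooth(series, window):
    """Centered moving average over +/- `window` years; window=0 keeps raw data."""
    years = sorted(series)
    out = {}
    for i, year in enumerate(years):
        lo, hi = max(0, i - window), min(len(years), i + window + 1)
        vals = [series[years[j]] for j in range(lo, hi)]
        out[year] = sum(vals) / len(vals)
    return out

print(smooth(freq, 0))  # raw relative frequencies (no smoothing)
print(smooth(freq, 3))  # smoothed with a window of 3, the viewer's default
```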
N-Gram data is also provided by other institutions. Some sources are listed below:
- Microsoft Web N-gram Services
- N-grams data (Corpus of Contemporary American English)
- Music N-gram viewer
- DBpedia: structured information extracted from Wikipedia
Links to further information about N-grams are provided in the following list:
- Information is Beautiful: Google Ngram Experiments, by David McCandless
- What we learned from 5 million books (TED video), by Erez Lieberman Aiden and Jean-Baptiste Michel
- Natural Language Processing for the Working Programmer, by Daniël de Kok and Harm Brouwer
- Language Detection With N-Grams, by Ian Barber
- Post your top 5 N-grams here! (TED)
- Syntactic Annotations for the Google Books Ngram Corpus
- Analyzing Women and Men With Google Ngram’s Help, by Liz Colville