
1505 – 2008 A-words analysis

Today we will explore an awesome n-grams dataset provided by Google. The data is obtained by counting words and phrases in books from the 16th century onward.

Dataset

I've downloaded the data for words starting with the letter "a". After decompressing, it turned out to be a 1.7 GB text file with 86.6 million lines. Seems like enough for a start. We'll also need a file with per-year totals for books published and words used. Format descriptions and links to all the files are available on this page.

Tools

To do the analysis we'll use the Scala Spark shell. You'll also need a text editor to review the results. 🙂

Analysis plan

So, what are we going to get from the data? The first idea that comes to mind is finding the most frequent words for each year. Applied in a straightforward way, this approach leads to a trivial result: for every year the most used words would be "and", "a", "about", "all", etc. Taking word order into account could lead to some (mostly philosophical) insights, yet the overall outcome seems rather useless.

To eliminate overly common words, we'll replace the raw frequency of a word with a more complex quantity: TF-IDF (treating each year as a separate document). This quantity penalizes words that occur across too many years. Thus, "and" gets a much lower score than other, more meaningful words.
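Concretely, treating each year as a separate document, the quantity computed in the code below is:

    TF-IDF(w, y) = TF(w, y) * IDF(w)
    TF(w, y) = (occurrences of w in year y) / (total words in year y)
    IDF(w)   = log(N / n_w), where N is the number of years
               and n_w is the number of years in which w occurs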

So, our plan is:

  1. For each word, count the number of years it occurs in and convert that count to IDF.
  2. Compute each word's frequency for each year and then its TF-IDF.
  3. Get top N words for each year.
  4. Review the results.

The code

Let's start by defining paths and writing the data loading code. I'm not sure why, but the file with total counts separates entries with the tab (\t) character, so the whole file is a single line. Within each entry, columns are separated by commas. Thus, we load the text file, take the first line, split it by tabs, remove blank entries, split each entry into columns, and produce a map [YEAR => (words count, books count)].

    val path_totals = "googlebooks-eng-all-totalcounts-20120701.txt"
    val totals = sc.textFile(path_totals)
      .first                          // the whole file is a single line
      .split("\t")                    // entries are separated by tabs
      .filter(_.trim.nonEmpty)        // remove blank entries
      .map(x => {
        val t = x.split(",")          // columns are separated by commas
        (t(0).toInt,                  // year
         (t(1).toLong, t(3).toLong))  // (words count, books count)
      })
      .toMap
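As a quick sanity check (assuming the year in question is present in your totals file), we can look up a year in the resulting map:

    println(totals(2000)) // prints the (words count, books count) pair for the year 2000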

Now we'll write the code for loading the core data. Each line holds the information for one word-year pair, with fields separated by tab characters. We'll need a class Word to represent a record.

    val path = "googlebooks-eng-all-1gram-20120701-a"

    @SerialVersionUID(101L)
    class Word(a_year: Int, a_word: String,
               a_cnt: Int, a_books: Int)
        extends Serializable {
      val year: Int = a_year
      val word: String = a_word
      val cnt: Int = a_cnt
      val books: Int = a_books

      override def toString(): String =
        "(" + year + "," + word + "," + cnt + "," + books + ")"
    }

    val raw_data = sc.textFile(path)
      .map(x => {
        val elem = x.split("\t")
        new Word(elem(1).toInt, // year
                 elem(0),       // word
                 elem(2).toInt, // occurrences
                 elem(3).toInt) // books
      })
      .cache()
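To verify the parsing, it doesn't hurt to peek at a few records; each one prints through the toString defined above:

    raw_data.take(3).foreach(println) // each record prints as (year,word,cnt,books)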

The next step is computing the inverse "document" frequencies.

    val iyf = raw_data.map(x => (x.word, 1.0))
      .reduceByKey(_ + _) // count the number of years each word occurs in
      .mapValues(x => Math.log(totals.size/x)) // and convert it to IDF

    val iyf_words = raw_data.map(x => (x.word, // get word frequency for each year
        (x.year, x.cnt.toFloat/totals(x.year)._1)))
      .join(iyf) // join term frequencies with the IDF values
      .mapValues(x => (x._1._1, x._1._2*x._2)) // get TF-IDF

    // now convert to a more convenient form: [YEAR => (WORD, TF-IDF)]
    val years = iyf_words.map(x => (x._2._1, (x._1, x._2._2))).cache()

All the preparations are done. Now we need to query for the top words and collect the results.

    val max_val = 10 // get top 10 words
    val years_top = years.groupByKey() // group values for each year
      .mapValues(x => x.toList.sortBy(-_._2).take(max_val)) // sort descending by TF-IDF and keep the top 10
      .map(x => (x._1, x._2.map(_._1))) // form [YEAR => words list] pairs
    val temp = years_top.sortByKey().collect() // run the computation and load the ordered results into memory

A side note regarding the groupByKey() call. It works and produces the desired results, yet it is not the kind of method you want to invoke when working with large datasets. The reason is that for this call Spark will keep in memory all the values corresponding to a key, and for large datasets the amount of data even for a single key can exceed your memory capacity. Thus you could find Spark stopping execution and throwing memory-related exceptions. A better way is to substitute groupByKey() with aggregateByKey() using appropriate parameters, as sketched below.
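For illustration, here is a minimal sketch of that substitution (years_top2 is a name introduced here; max_val and years come from the snippets above). The accumulator keeps only the current top-N candidates per year, so the full list of values for a year is never materialized:

    val years_top2 = years.aggregateByKey(List.empty[(String, Double)])(
      // fold one (word, score) pair into a partition-local top-N list
      (acc, v) => (v :: acc).sortBy(-_._2).take(max_val),
      // merge two partition-local top-N lists
      (a, b) => (a ++ b).sortBy(-_._2).take(max_val)
    ).map(x => (x._1, x._2.map(_._1)))

Re-sorting the small list on every step is not the most efficient choice, but it keeps the sketch short; a bounded priority queue would serve better in production.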

To finalize the coding part, we'll save the results to a file.

    import java.io._

    def printToFile(f: java.io.File)(op: java.io.PrintWriter => Unit): Unit = {
      val p = new java.io.PrintWriter(f)
      try { op(p) } finally { p.close() }
    }

    printToFile(new File("out_iyf.txt")) { p => temp.foreach(p.println) }

Open the file out_iyf.txt in your favorite text editor and enjoy! Here is a slightly cleaned-up version of the file.

Results review

OK, now let's take a look at the results and see what insights we can get. For the last 100-200 years the words "America" and "American" constantly dominate the top. Moreover, there are three periods: until the 1830s these words appeared rarely; in the 1830s-1880s they at last entered the lower parts of the lists; and since the 1880s they have constantly occupied the top positions. These periods seem to correlate with publishing history. Large publishing houses in America started working in the first half of the 19th century, though they mostly published pirated copies of British books. 🙂 The 19th century was also rich in technological innovations for publishing. It might be that new technologies and the introduction of royalties led to the rapid rise of the American publishing industry.

Another interesting example is the word "army". It appears frequently in the 18th-19th centuries and then fades out. It might be that over the last 100 years it has been replaced by other, more specific terms. Note the 1758-1764 run of years when "army" held the top spot; before and after that period other terms occasionally occupy the top-1 position. It could be a coincidence, yet it might be that many books at that time were describing what is now called the Seven Years' War, whose main conflict lasted from 1756 to 1763.

A few other words to check: automobile, aircraft, analysis, average.

Conclusion

Even a rather specific and modest-sized dataset can lead to interesting insights.
