1505 – 2008 all words analysis

This is a rather short follow-up to the previous post, where we reviewed words starting with the letter "A". I've downloaded the rest of the 1-gram dataset. After unpacking, it exploded to 20+ GB.

Simple solution

So, how should we change our A-words script so that it works with all words? The simple way shows off the beauty of Spark (and Hadoop) by exploiting wildcard file naming. We only need to change one character: put an asterisk instead of the letter "a" in the file-name pattern.

  val path = "googlebooks-eng-all-1gram-20120701-*"
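
For context, here is a minimal sketch of how that wildcard path might feed into Spark; sc.textFile() expands the glob on its own, so every per-letter shard is read into a single RDD. The field layout (tab-separated ngram, year, match_count, volume_count) follows the published 20120701 1-gram format, but the variable names are illustrative, not taken from the previous post.

  // Illustrative only: read all shards matched by the glob into one RDD.
  val lines = sc.textFile(path)

  // Each line of the 20120701 release is tab-separated:
  // ngram <TAB> year <TAB> match_count <TAB> volume_count
  val records = lines.map(_.split("\t")).collect {
    case Array(word, year, matches, _) => (year.toInt, (word, matches.toLong))
  }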

If your machine(s) have plenty of memory, you might see this script finish successfully. In most setups, though, you'll hit out-of-memory exceptions. The groupByKey() call is to blame for this failure: for some years, the full set of words does not fit into memory. A future release of Spark may optimize this function, but for now it is unusable for our task. A sketch of the failing version follows; then let's try another way.
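
For reference, the failing approach likely looked something like this (a sketch, assuming years holds (year, (word, score)) pairs and max_val is the number of top words to keep). groupByKey() materializes every word for a given year as one in-memory collection before we can pick the top entries, and that collection alone can exhaust the heap.

  // Sketch of the groupByKey() approach that runs out of memory:
  val years_top = years
    .groupByKey()  // pulls ALL words for a year into memory at once
    .map { case (year, words) =>
      (year, words.toSeq.sortBy(-_._2).take(max_val).map(_._1).toList)
    }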

Working solution

To make the script work, we replace groupByKey() with an aggregateByKey() call. The rest of the code remains the same.

  val years_top = years.aggregateByKey(
    Array.fill[(String, Double)](max_val)(("[blank]", 0.0))
  )(
    (y: Array[(String, Double)], x: (String, Double)) => {
      // replace the current minimum if the new word scores higher,
      // then restore descending order
      if (y(max_val - 1)._2 < x._2) { y(max_val - 1) = x; y.sortBy(-_._2) } else y
    },
    (y1: Array[(String, Double)], y2: Array[(String, Double)]) =>
      (y1 ++ y2).sortBy(-_._2).take(max_val)
  ).map(x => (x._1, x._2.map(_._1).toList))

We pass three arguments to aggregateByKey(). The first is not a function but an initial value: the container created whenever a new key is encountered. The second is a function that updates this container each further time the key is met within a partition. The third is a function that merges two containers built on different partitions.
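To make the mechanics concrete, here is a small self-contained toy run. The data and numbers are invented for illustration, max_val is set to 2, and the update function is written immutably here, unlike the in-place version above.

  val toy = sc.parallelize(Seq(
    ("2000", ("apple",  3.0)),
    ("2000", ("banana", 5.0)),
    ("2000", ("cherry", 1.0)),
    ("2001", ("apple",  2.0))
  ), numSlices = 2)

  val max_val = 2
  val top2 = toy.aggregateByKey(
    Array.fill[(String, Double)](max_val)(("[blank]", 0.0))  // initial container
  )(
    // within a partition: admit x if it beats the current minimum
    (acc, x) =>
      if (acc(max_val - 1)._2 < x._2) (acc :+ x).sortBy(-_._2).take(max_val) else acc,
    // across partitions: merge two containers and keep the best max_val
    (a, b) => (a ++ b).sortBy(-_._2).take(max_val)
  ).mapValues(_.map(_._1).toList)

  top2.collect()
  // e.g. Array(("2000", List("banana", "apple")), ("2001", List("apple", "[blank]")))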

This code runs without heavy memory demands, and it may even run faster. The results it produces (slightly edited) turn out to be less interesting than the ones we got from the A-words alone; probably some additional filters are required.
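
One candidate filter (my assumption, not something from the original script) would be to drop the part-of-speech-tagged entries such as "run_VERB" that the 1-gram corpus contains, along with tokens carrying digits or punctuation:

  // Hypothetical cleanup pass: keep only plain alphabetic words,
  // dropping POS-tagged entries like "run_VERB" and noisy tokens.
  val filtered = years.filter { case (_, (word, _)) =>
    !word.contains("_") && word.nonEmpty && word.forall(_.isLetter)
  }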

Good luck with your analysis.