1505–2008 all words analysis

This is a rather short follow-up to the previous post, where we reviewed words starting with the letter "A". I've downloaded the rest of the 1-gram dataset; after unpacking, it exploded to 20+ GB.

Simple solution

So, how should we change our A-words script so that it works with all words? The simple way shows the beauty of Spark (and Hadoop): wildcard naming. We only need to change one character, putting an asterisk instead of the letter "a" in the file name.

```scala
val path = "googlebooks-eng-all-1gram-20120701-*"
```

If you have a lot of memory available on your machine(s), you might see this script finish successfully. But on most setups you'll get out-of-memory exceptions. The groupByKey() call is the culprit: for some years, all the words simply do not fit into memory. Future releases of Spark may optimize this function, but for now it is unusable for our task. Let's try another way.
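The difference can be illustrated in plain Python, outside of Spark (function names here are illustrative): grouping materializes every value for a key before reducing, while an aggregate keeps only bounded per-key state.

```python
def group_then_top(pairs, n):
    # groupByKey-style: collect the whole group in memory, then reduce
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)   # entire group held at once
    return {k: sorted(vs, reverse=True)[:n] for k, vs in groups.items()}

def aggregate_top(pairs, n):
    # aggregateByKey-style: never keep more than n values per key
    tops = {}
    for k, v in pairs:
        acc = tops.setdefault(k, [])
        acc.append(v)
        acc.sort(reverse=True)
        del acc[n:]                          # bounded per-key state
    return tops
```

Both produce the same result, but the second never holds a full group for any key, which is what makes the aggregate approach survive skewed keys.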

Working solution

To make the script work, we will use the aggregateByKey() call instead. The rest of the code remains the same.

```scala
val years_top = years.aggregateByKey(
  // initial container: max_val placeholder entries per key
  Array.fill[(String, Double)](max_val)(("[blank]", 0.0))
)(
  // merge a value into a container: replace the weakest entry
  // if the new word beats it, then re-sort by count (descending)
  (y: Array[(String, Double)], x: (String, Double)) =>
    if (y(max_val - 1)._2 < x._2) { y(max_val - 1) = x; y.sortBy(-_._2) } else y,
  // merge two containers: concatenate, sort, keep the top max_val
  (y1: Array[(String, Double)], y2: Array[(String, Double)]) =>
    (y1 ++ y2).sortBy(-_._2).take(max_val)
).map(x => (x._1, x._2.map(_._1).toList))
```

aggregateByKey() takes three arguments: a zero value and two functions. The zero value creates a container when a new key is first encountered. The sequence function handles every subsequent occurrence of that key: it updates the container. The combine function merges two containers built on different partitions.
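The same zero-value/sequence/combine logic can be sketched in plain Python, independent of Spark (the names and the driver loop are illustrative; Spark would apply comb_op when merging per-partition results):

```python
MAX_VAL = 3  # keep the top-3 words per year

def zero():
    # container created when a new key appears
    return [("[blank]", 0.0)] * MAX_VAL

def seq_op(acc, item):
    # replace the weakest entry if the new word beats it, keep sorted desc
    if acc[-1][1] < item[1]:
        acc = sorted(acc[:-1] + [item], key=lambda t: -t[1])
    return acc

def comb_op(a, b):
    # merge two containers: concatenate, sort, truncate
    return sorted(a + b, key=lambda t: -t[1])[:MAX_VAL]

def aggregate_by_key(pairs):
    # single-partition driver; Spark would run seq_op per partition
    # and comb_op across partitions
    out = {}
    for key, value in pairs:
        out[key] = seq_op(out.get(key, zero()), value)
    return out
```

Feeding it (year, (word, count)) pairs yields the top words per year with never more than MAX_VAL entries held per key.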

This code runs without high memory requirements, and it may even run faster. Here are the (slightly edited) results produced by this script. In fact, they seem less interesting than the ones we got from the A-words alone; probably some additional filters are required.

Good luck in your analysis.

PCA-based pretraining of neural networks

Training a neural network on large datasets can be a rather long and challenging task. There are many approaches to reducing training time: parallelization, early stopping, momentum, dimensionality reduction, etc. They provide faster convergence, prevent unnecessary iterations, and utilize hardware resources more efficiently. In this post we'll see how good initialization can affect training. Continue reading PCA-based pretraining of neural networks

Notes on “Intriguing properties of neural networks” paper

I've recently encountered a great paper describing interesting properties of neural networks. It tries to go beyond the "black box" view of ANNs and shows that individual neurons and layers can carry their own meaning, often comprehensible even to a human. Briefly, each layer produces a space (with its neurons as a basis) in which each vector has semantic information associated with it.

There is one more idea described in the paper. A neural network (especially a deep one), viewed as a function on the input space, is not always smooth. By smoothness I don't mean the existence of derivatives, but rather that inputs in the vicinity of training samples can receive unexpected classification labels. The authors describe a way to obtain pairs of images, visually almost indistinguishable, that the network classifies into different classes (e.g. a bus classified as an ostrich). By applying this procedure, it is possible to modify a dataset so that the error rate goes up from, say, 5-10% to 100%! Moreover, this "corrupted" dataset also leads to high error rates for networks that were trained on different samples of data or have a different architecture than the network used to modify the images.
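For a toy linear model the construction can be sketched in a few lines of numpy. The paper itself uses an optimization-based search on deep networks, so this is only an illustration of the idea, not the authors' method:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adversarial_shift(w, x, eps):
    # Move x by a tiny step that maximally increases a linear model's
    # output w.x, flipping a near-boundary prediction while changing
    # each coordinate of x by at most eps.
    return x + eps * np.sign(w)
```

A point classified just below 0.5 crosses the decision boundary after a perturbation that is barely visible in the input, which is the unsettling property the paper demonstrates on real images.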

Two improvements are proposed for training routines:

  1. train an ANN, generate modified images, add them to the dataset, and train the ANN again on the updated data;
  2. add a regularization term to the loss function so that instabilities in the network output are compensated.

After a first reading it wasn't clear to me what form the regularization term should take. Yet expressions for upper bounds on the instability are provided, so it shouldn't be hard to come up with some solution.

As a quick-and-dirty term, I'd give gradient regularization a try, i.e. penalizing high values of the gradient with respect to the inputs. This could slow down learning, though, because computing the Hessian becomes necessary.
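For logistic regression such a penalty can be written in closed form, since the input gradient of the per-sample cross-entropy is (p − y)·w. A minimal numpy sketch of this idea (my own assumption, not a term from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_with_grad_penalty(w, X, y, lam):
    # Cross-entropy plus a penalty on the squared input-gradient norm,
    # one crude way to damp output instability around training points.
    p = sigmoid(X @ w)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # for logistic regression: d(ce_i)/dx_i = (p_i - y_i) * w,
    # so ||d(ce_i)/dx_i||^2 = (p_i - y_i)^2 * ||w||^2
    grad_norms = (p - y) ** 2 * np.sum(w ** 2)
    return ce + lam * np.mean(grad_norms)
```

For deep networks the input gradient has no such closed form, which is where the Hessian cost mentioned above comes in.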

MATLAB path in debug environment

Suppose you've run some function foo() in MATLAB and it complains that another function bar() is not found. You immediately realize that the directory containing bar() is not on the path, and you add it. Oops, the stop-on-error flag is set, so we are in debug mode. Leave that mode by pressing Shift+F5 and re-run foo().

Surprisingly, you will see the same error message again! The reason is that your path changes were applied in the temporary workspace of the debugger. So, by leaving debug mode and moving back to the original workspace, you have reverted all path changes. Long story short: always double-check which environment you are in before applying any changes.

Clustering via mutual information maximization. Part 2

In the previous post we reviewed the logistic regression model and designed a simple clustering algorithm based on it. We managed to get a decent clustering on a subset of the MNIST dataset. Yet there were some drawbacks to our approach.
Continue reading Clustering via mutual information maximization. Part 2

Clustering via mutual information maximization. Part 1

In this series of posts we will train a logistic regression classifier in three ways: supervised, unsupervised, and something in between. We will find out that for some machine learning problems you need only a few labels for your data to get a decent model. Continue reading Clustering via mutual information maximization. Part 1

Prediction of random numbers. Part 2

In the first part of this post we considered a human-based binary random number generator and built a predictive model for it. The model appeared to work better than a coin toss, at least for the short sequence we had. This part develops more complex models with somewhat higher accuracy. Continue reading Prediction of random numbers. Part 2