So, how should we change our A-words script so that it works with all words? The simple way shows the beauty of Spark (and Hadoop): wildcard naming. We just need to change one character: put an asterisk instead of the letter "a" in the filename with the list of words.

val path = "googlebooks-eng-all-1gram-20120701-*"

If you have a lot of memory available on your machine(s), you might see this script finish successfully. But in most setups you'll see out-of-memory exceptions. The *groupByKey()* call is to blame for this failure: for some years the full set of words does not fit into memory. It might be that future releases of Spark will optimize this function, but for now it is unusable for our task. Let's try another way.

To get our script working we will use the *aggregateByKey()* call. The rest of the code remains the same.

val years_top = years.aggregateByKey(List(("[blank]", 0.0f)))(
    (acc, v) => (v :: acc).sortBy(-_._2).take(max_val), // merge a new value into the accumulator
    (a, b) => (a ++ b).sortBy(-_._2).take(max_val))     // merge two accumulators
  .map(x => (x._1, x._2.map(_._1).toList))

We pass three arguments to *aggregateByKey()*. The first one is an initial accumulator used when a new key is encountered. The second is a function invoked every subsequent time we meet that key: it merges the value into the accumulator created from the first argument. The third one merges two accumulators (e.g. built on different partitions).
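For readers who want the logic outside of Spark, here is a pure-Python sketch of the same bounded top-N idea (the names and data are illustrative, not the Spark API): each partial result keeps at most N entries, so memory stays bounded no matter how many values a key has.

```python
MAX_VAL = 3  # keep only the top 3 entries per key

def merge_value(acc, pair):
    """Merge one (word, score) pair into a bounded accumulator."""
    return sorted(acc + [pair], key=lambda p: -p[1])[:MAX_VAL]

def merge_accs(a, b):
    """Merge two partial top-N lists (e.g. built on different partitions)."""
    return sorted(a + b, key=lambda p: -p[1])[:MAX_VAL]

part1 = []
for pair in [("aardvark", 0.1), ("apple", 0.9), ("ant", 0.5)]:
    part1 = merge_value(part1, pair)

part2 = []
for pair in [("arc", 0.7), ("atom", 0.2)]:
    part2 = merge_value(part2, pair)

top = merge_accs(part1, part2)
print([w for w, _ in top])  # ['apple', 'arc', 'ant']
```

At no point does any list grow beyond MAX_VAL entries, which is exactly why this approach avoids the out-of-memory failures of full grouping.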

This code runs without high memory requirements and may even run faster. Here are the (slightly edited) results produced by this script. In fact, they seem less interesting than the ones we got from the A-words alone. Probably some additional filters are required.

Good luck in your analysis.

I've downloaded the data for words starting with the letter "a". After decompressing, it turned out to be a 1.7 GB text file with 86.6 million lines. That seems like enough for a start. We'll also need a file with per-year totals of books written and words used. A description of the formats and links to all files are available on this page.

To do the analysis we'll use the Scala Spark shell. You'll also need a text editor to review the results.

So, what are we going to get from the data? The first idea that comes to mind is to get the most frequent words for each year. Applying this approach in a straightforward way leads to a trivial result: for all years the most used words would be "and", "a", "about", "all" etc. Taking word order into account could lead to some (mostly philosophical) insights, yet the overall outcome seems rather useless.

To eliminate overly common words, we'll replace the frequency of a word with a more complex quantity: TF-IDF (treating each year as a separate document). This quantity penalizes words that occur too often across the years. Thus "and" gets a much lower score than other, more meaningful words.

So, our plan is:

- Compute for each word total frequency across all years and convert it to IDF.
- Get frequency for each word for each year and then compute TF-IDF.
- Get top N words for each year.
- Review the results.
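The plan above can be prototyped in plain Python on toy data before writing any Spark code (the tiny corpus and the logarithmic IDF here are illustrative, chosen just to show the mechanics):

```python
import math
from collections import Counter

# toy corpus: each "year" acts as a document with word counts
years = {
    1900: Counter({"and": 50, "army": 10}),
    1901: Counter({"and": 60, "apple": 5}),
    1902: Counter({"and": 40, "army": 7, "apple": 1}),
}

n_years = len(years)
df = Counter()                    # document frequency of each word
for counts in years.values():
    df.update(counts.keys())

def tfidf(year):
    counts = years[year]
    total = sum(counts.values())  # total words used that year
    return {w: (c / total) * math.log(n_years / df[w]) for w, c in counts.items()}

scores = tfidf(1900)
top_word = max(scores, key=scores.get)
print(top_word)  # 'army': "and" appears every year, so its IDF is zero
```

Even on this toy corpus the filtering effect is visible: the ubiquitous "and" scores exactly zero despite dominating the raw counts.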

Let's start by defining paths and writing the data loading code. I'm not sure why, but the file with total counts separates entries with the tab (\t) character, so it is a one-liner. Within each entry, columns are separated by commas. Thus, we load the text file, take the first line, split it by tabs, remove blank entries, split each entry into columns and produce a map [YEAR => (words count, books count)].

val path_totals = "googlebooks-eng-all-totalcounts-20120701.txt"
val totals = sc.textFile(path_totals)
  .first
  .split("\t")
  .filter(_ != " ")
  .map(x => {val t = x.split(","); (t(0).toInt, (t(1).toLong, t(3).toLong))})
  .toMap

Now we'll write code for loading the core data. Each line carries information for one year-word pair, with fields separated by the tab character. We'll need a class *Word* to represent a record.

val path = "googlebooks-eng-all-1gram-20120701-a"

@SerialVersionUID(101L)
class Word(a_year: Int, a_word: String, a_cnt: Int, a_books: Int) extends Serializable {
  val year: Int = a_year
  val word: String = a_word
  val cnt: Int = a_cnt
  val books: Int = a_books
}

val raw_data = sc.textFile(path)
  .map(x => {val elem = x.split("\t"); new Word(elem(1).toInt, elem(0), elem(2).toInt, elem(3).toInt)})
  .cache()

Next step is the computation of inverse "document" frequencies.

val iyf = raw_data.map(x => (x.word, 1.0))
  .reduceByKey(_ + _)                          // count word occurrences across years
val iyf_words = raw_data.map(x => (x.word,     // get word frequencies for each year
    (x.year, x.cnt.toFloat/totals(x.year)._1)))
  .join(iyf)                                   // join with year frequencies
  .mapValues(x => (x._1._1, x._1._2*x._2))     // get TF-IDF
// now convert to a more convenient form: [YEAR => (WORD, TF-IDF)]
val years = iyf_words.map(x => (x._2._1, (x._1, x._2._2))).cache()

All the preparations are done. Now we need to query for top words and collect the results.

val max_val = 10                                        // get top 10 words
val years_top = years.groupByKey()                      // group values for each year
  .mapValues(x => x.toList.sortBy(-_._2).take(max_val)) // sort descending by TF-IDF and keep the top 10
  .map(x => (x._1, x._2.map(_._1)))                     // form [YEAR => words list] map
val temp = years_top.sortByKey().collect()              // run the computation and load ordered results into memory

A side note regarding the *groupByKey()* call. It works and produces the desired results, yet it is not the kind of method you want to invoke when working with large datasets. The reason is that for this call Spark keeps in memory all values corresponding to a key, and for large datasets you'll often find that even one key's values exceed your memory capacity. Thus you may find Spark stopping execution and throwing memory-related exceptions. A better way is to substitute *groupByKey()* with *aggregateByKey()* using appropriate parameters.

To finalize the coding part we'll save the results to a file.

import java.io._

def printToFile(f: File)(op: PrintWriter => Unit) {
  val p = new PrintWriter(f)
  try { op(p) } finally { p.close() }
}

printToFile(new File("out_iyf.txt")) { p =>
  temp.foreach(x => p.println(x._1 + ": " + x._2.mkString(" ")))
}

Open the file *out_iyf.txt* in your favorite text editor and enjoy! Here is a slightly cleaned version of the file.

Ok, now let's take a look at the results and see what insights we can get. For the last 100-200 years the words "America" and "American" constantly dominate the top. Moreover, there are three periods: until the 1830s these words appeared rarely, in the 1830s-1880s they at last got into the lower part of the lists, and since the 1880s they have constantly occupied top positions. These periods seem to correlate with publishing history. Large publishing houses in America started working in the first half of the 19th century, though they mostly published pirated copies of British books. The 19th century was also rich in technological innovations for publishing. It might be that new technologies and the introduction of royalties led to the rapid rise of the American publishing industry.

Another interesting example is the word "army". It frequently appears in the 18th-19th centuries and then fades out. It might be that in the last 100 years it has been replaced by other, more specific synonyms. Note the 1758-1764 series of years when "army" held the top position. Before and after that period other terms occasionally occupy it. It could be a coincidence, yet it might be that a lot of books at that time were describing what is now called the Seven Years' War, whose main conflict lasted from 1756 to 1763.

A few other words to check: automobile, aircraft, analysis, average.

Even a rather specific and modest-sized dataset can lead to interesting insights.

Parameter initialization plays an important role in model training. Even for convex loss functions, where the global minimum is reachable from any starting point, a good initial guess can significantly reduce training time. For more complex models, bad initialization can lead you to a local minimum with rather high loss. One common practice is to initialize the weights with small random numbers. It avoids most of the dangers, because the chance of getting really bad initial values is small. Yet with this kind of starting point you also should not expect any benefits. Let's do some investigation and try to come up with a possibly more productive way to select starting parameters.

One last note before we move on: there are some clever methods of initialization via pretraining. One of them is to train a layer in an unsupervised manner as an RBM and then proceed to supervised training. It is an interesting and effective approach, yet we'll not cover it in this post. Still, we'll do pretraining too.

Suppose that your feature vector space is high-dimensional, e.g. the ANN has thousands of inputs. A natural approach here is to reduce dimensionality before training. This provides a speedup during training, usually at the cost of accuracy. Though it is often possible to get up to a 50% decrease in the number of inputs with a tiny loss in classification quality. A well-known way to do dimensionality reduction is principal component analysis (PCA). The trick is to find the axes along which most of the data variation is observed and then to project the data into a lower-dimensional subspace. Think of noisy data along a straight line in 3-D space: that line would be a good choice of axis for a 1-D representation of the data. Assuming that the data is already z-scored, the algorithm for PCA is straightforward:

- rotate the basis axes so that they coincide with the directions where the data changes the most;
- keep only the dimensions where the data varies noticeably.

The first step of the algorithm (rotation) in the linear algebra sense means multiplication by some matrix. Luckily, MATLAB already has the routine necessary to find that matrix: singular value decomposition (SVD). Moreover, the columns of the matrix returned by the *svd()* call are already sorted so that the first columns correspond to the directions with higher variability. So, our second step is to keep only the first few columns. Multiplication by the resulting matrix reduces the dimensionality of the input.

Suppose that we have an input vector $x$. Assume that $U$ is the matrix returned by the *svd()* call and $\tilde{x} = U_k^T x$ is the input after the PCA transformation ($U_k$ keeps only the first $k$ columns of $U$). We want to train a network layer using the PCA-preprocessed data. The weights of the layer that we are training are denoted by the matrix $\tilde{W}$. The forward pass for this layer involves the following computation:

$$y = \sigma(\tilde{W}\tilde{x}) = \sigma(\tilde{W}\,U_k^T x) \qquad (1)$$

On the other hand, this is the same as calculating:

$$y = \sigma(W x), \qquad W = \tilde{W}\,U_k^T \qquad (2)$$

In other words, it is equivalent to a larger layer trained on the initial non-PCA inputs with weights $W = \tilde{W}\,U_k^T$.

We can exploit this fact by training a smaller classifier on the dataset of reduced dimensionality, multiplying its weights by $U_k^T$ and then using the obtained weights as a starting point for training on the non-reduced dataset. Hopefully, the initial model will catch the outline of the dataset and significantly simplify the task for the more complex one.
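This equivalence is easy to sanity-check numerically. Here is a pure-Python sketch (tiny hand-made matrices, illustrative numbers; only the linear part of the layer is shown): the small layer applied to the projected input and the expanded layer applied to the raw input produce the same result.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# projection matrix (the transposed first k columns of U): 3 inputs -> 2 dims
UkT = [[1.0, 0.0, 1.0],
       [0.0, 1.0, -1.0]]
# weights of the small layer trained on PCA-reduced inputs
Wt = [[0.5, -2.0]]

x = [[3.0], [1.0], [2.0]]             # raw input column-vector
y_small = matmul(Wt, matmul(UkT, x))  # small layer on the projected input

W_full = matmul(Wt, UkT)              # expanded weights for raw inputs
y_full = matmul(W_full, x)            # full layer on the raw input

print(y_small, y_full)  # [[4.5]] [[4.5]]
```

Because matrix multiplication is associative, the expanded weights give exactly the same outputs, so they are a legitimate warm start for further training on the raw inputs.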

I've trained three groups of logistic regressions on subset of MNIST:

- on images pre-processed by PCA;
- pretrain on images pre-processed by PCA, extend weights to the size corresponding to raw images, continue training (now on raw images);
- on raw images.

All the classifiers used the same fixed learning rate. This makes the starting point important for the learning algorithm. With a gradually decreasing learning rate, the classifiers from groups 2 and 3 would reach similar accuracy.

The following image shows a box plot of error rates for each group.

*Figure 1. Error rates for three groups of classifiers.*

The PCA-based group is less accurate than the two other groups, yet it sometimes achieves decent results. The difference between group 2 and group 3 is not big, but it is consistent: in most cases the model with PCA pretraining yields higher accuracy.

Code for training classifiers from group 2 can be found in my GitHub repo. It is worth noting that PCA involves rather costly operations. So, if you have a lot of data, it might be wiser to use a smaller subset for the dimensionality reduction, say, 3-5 times the initial number of dimensions. It should save you a lot of computation time without hurting the accuracy much.

There is one more idea described in the paper. Neural networks (especially deep ones), viewed as functions of their inputs, are not always smooth. By smoothness here I don't mean the existence of derivatives, but rather the fact that inputs in the vicinity of train set samples can get unexpected classification labels. The authors describe a way to obtain visually almost indistinguishable pairs of images that the network classifies into different classes (e.g. a bus classified as an ostrich). By applying this procedure it is possible to modify a dataset so that the error rate goes up from, say, 5-10% to 100%! Moreover, this "corrupted" dataset also leads to high error rates for networks that were trained on different samples of the data or have a different architecture than the network used for the image modification.

Two improvements are proposed for training routines:

- train an ANN, get modified images and add them to the dataset, train the ANN again on updated data;
- add a regularization term to loss function so that network output instabilities will be compensated.

After the first reading it wasn't clear to me what form the regularization term should have. Yet expressions for the upper bounds of the instability are provided, so it shouldn't be hard to come up with some solution.

As a quick and dirty term I'd try gradient regularization, i.e. penalizing high values of the gradient. Yet this could slow down learning, because computing the Hessian becomes necessary.

Surprisingly, you will see the same error message again! The reason is that your path changes were applied in the temporary workspace of the debugger. So, by leaving debug mode and moving back to the original workspace you have reverted all path changes. Long story short: always double-check which environment you are in before applying any changes.

It was hard to describe the model analytically, especially in the medians case. As a consequence, reliably expanding the model to more than two clusters seemed challenging.

Recall the cost function that we maximized in unsupervised learning:

$$L(W) = \sum_{n=1}^{N}\Big[\sigma_+(Wx_n)\ln p_n + \big(1-\sigma_+(Wx_n)\big)\ln(1-p_n)\Big], \qquad p_n = \sigma(Wx_n) \qquad (1)$$

Here $\sigma_+$ is a step function that produces 1 if its argument is greater than a threshold and 0 otherwise. The value of the threshold is either the mean or the median of the arguments computed over all the samples.

When deriving weight update rules or other important formulas you usually do two things: averaging and differentiation. While the first one does not seem to be a problem, the latter is an issue. The reason is that the step function is not smooth: its derivative at the threshold point is undefined unless we make some additional assumptions. Furthermore, the location of the threshold point is not fixed and depends on the dataset that you use.

Overall, we can try dealing with the difficulties by redefining the step function so that its derivative is a delta-function. The other way is to split expression (1) into two parts: when the threshold is exceeded and when it is not, then derive all the formulas for both cases and combine them. Still, you would need to apply other tricks in order to extend the model to more complex cases.

How can we make $\sigma_+$ more friendly to derivatives? The answer is plain: drop the plus sign. This can be viewed as an approximation of the step function with the smooth sigmoid $\sigma$. They share similar features, and for arguments with big absolute values the sigmoid acts more and more like the step function. So let's examine our new cost function:

$$L(W) = \sum_{n=1}^{N}\big[p_n\ln p_n + (1-p_n)\ln(1-p_n)\big], \qquad p_n = \sigma(Wx_n) \qquad (2)$$

It turns out that this is a rather well-known quantity in physics and information theory: entropy, up to a sign. There are various definitions of entropy depending on the field of application. On a high level, entropy measures how messy the system is: high values mean that there is a lot of chaos, low values mean that everything is in order. When searching for the weights maximizing function (2), we, in fact, minimize entropy. Thus we force the algorithm to find a clustering with as little clutter as possible.

Let's check the definition of entropy from information theory:

$$S = -\sum_{s\in\Omega} p(s)\ln p(s) \qquad (3)$$

Here $\Omega$ is the set of all possible system states (e.g. the set of feature values for a sample plus a label) and $p(s)$ is the probability of observing a specific state $s$. The value of $-\ln p(s)$ is the amount of information gained from the knowledge that the system is in state $s$. If the probabilities are close to the extreme values, the entropy gets near zero. The reason is that with such probabilities you are already almost certain which state the system is in. On the other hand, for probabilities around 50% you get the maximum value of $S$, because any state would be a surprise to you.
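A quick numeric illustration of this behavior (pure Python, natural logarithms, a two-state system):

```python
import math

def entropy(probs):
    """Shannon entropy in nats; 0*log(0) terms are treated as zero."""
    return -sum(p * math.log(p) for p in probs if p > 0)

near_certain = entropy([0.99, 0.01])  # almost no surprise left
uncertain = entropy([0.5, 0.5])       # maximum surprise for two states

print(round(near_certain, 3), round(uncertain, 3))
```

The near-certain system scores close to zero, while the 50/50 system hits the two-state maximum of ln 2.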

The probabilities are given by a model. By training a model that minimizes the entropy you are virtually asking it to make a clusterization with the least achievable surprise.

To rewrite expression (3) in form (2) we need to substitute the state $s$ with a label $y$ and features $x$:

$$S = -\sum_{(x,y)\in\Omega} p(x,y)\ln p(x,y) \qquad (4)$$

Note that $\Omega$ is now the set of all possible features-label combinations. We limited our model to produce only two clusters, so for every features instance $x$ there are two possible combinations: $(x, 0)$ and $(x, 1)$. We have a limited amount of train samples, so an exact calculation of $p(x,y)$ is not always feasible. To resolve this problem we can approximate the entropy by summing over the whole train set. If the dataset has been properly sampled, we will get decent results. The updated formula for the entropy yields

$$S \approx -\sum_{n=1}^{N}\sum_{y\in\{0,1\}} p(y|x_n)\ln p(x_n,y) \qquad (5)$$

Expression (5) uses the joint features-labels probabilities $p(x,y)$. On the other hand, our models provide conditional probabilities $p(y|x)$. Having the distribution of the features known, we could use Bayes' theorem and get the joint distribution. Unfortunately, you often don't know the marginal distribution. You could get it by bootstrapping or by simply assuming that all samples are equally likely to appear. Still, there is another, more elegant way.

It would be cool if we could substitute $p(x,y)$ in (5) with $p(y|x)$. The question is, what would we get then?

$$S(y|x) \approx -\sum_{n=1}^{N}\sum_{y\in\{0,1\}} p(y|x_n)\ln p(y|x_n) \qquad (6)$$

This is called conditional entropy. It shows how much information is needed to assign a cluster label $y$ to a sample with known features $x$. A good thing is that $S$ and $S(y|x)$ are related:

$$S = S(x) + S(y|x) \qquad (7)$$

Also, $S(x)$ depends only on the features, so it is the same for any model. As a consequence, a model minimizing the conditional entropy $S(y|x)$ also minimizes the entropy $S$. Using the fact that $p(y{=}0|x) = 1 - p(y{=}1|x)$ and substituting the model estimate of $p(y|x)$ into (6), we arrive at the cost function (2) with a negative sign.
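The chain rule relating the two entropies can be verified on a small hand-made joint distribution. A pure-Python check (illustrative numbers):

```python
import math

# joint distribution p(x, y) over two binary variables
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return -sum(q * math.log(q) for q in dist.values() if q > 0)

p_x = {0: joint[(0, 0)] + joint[(0, 1)], 1: joint[(1, 0)] + joint[(1, 1)]}
S_joint = H(joint)  # entropy of the whole (x, y) system
S_x = H(p_x)        # entropy of the features alone
# conditional entropy: -sum over (x, y) of p(x, y) * ln p(y|x)
S_y_given_x = -sum(q * math.log(q / p_x[x]) for (x, y), q in joint.items())

print(abs(S_joint - (S_x + S_y_given_x)) < 1e-9)  # True: the chain rule holds
```

Since the feature entropy term is fixed by the data, whatever drives the conditional term down drives the total down too, which is exactly the argument above.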

Now that we know the nature of the model, we can analyze its pros and cons. The good things are that it has a clear physical sense, it is tractable for analysis and it is fairly simple. Nevertheless, it has some issues too. The main one is the presence of a local minimum that produces absolutely useless results. Suppose that all the weights of the model are near zero. Then the output of the model doesn't care what data is passed in: it is always the same value, determined by the bias. A bad and obviously undesired situation. With a bias magnitude high enough you get almost perfect 1 or 0 probabilities. That yields a very low value of the conditional entropy and thus would be accepted by the training routine as a good solution.

Usually it is enough to initialize the weights with small random values before training to feel relatively safe from those local minima; the model then works fine. Alternatively, we can modify the cost function by adding an anti-regularization term. To be honest, I even planned to use this model (6) as a basis and name the post "Clustering via entropy minimization". However, after doing some search I found a paper describing this approach in more detail and one more work improving the algorithm (Grandvalet 2004 and Gomes 2010).

One more remark before going to the next section. If you calculate the derivatives $\partial S(y|x)/\partial W$, you will get the weights update rule

$$W \leftarrow W + \alpha\sum_{n=1}^{N} p_n(1-p_n)\ln\frac{p_n}{1-p_n}\;x_n^T \qquad (8)$$

Here $p_n$ denotes the conditional probability $p(y{=}1|x_n)$. Similarly to the previous part, the weights update rule takes the form of a sum over samples weighted by a gain. In this case the gain is

$$G(p) = p(1-p)\ln\frac{p}{1-p} \qquad (9)$$

*Figure 1. Sample gain for entropy minimization model.*

The gain shares similar features with the models from the previous part: it becomes close to zero when $p$ travels to the extreme values. Also note the zero at $p = 0.5$. The interpretation is the following: samples with a probability estimate of circa 50% are near the decision boundary and could belong to either class, so it is much harder to extract information from them. The farther a sample is from the decision boundary, the more confident the model is in its class. During the learning phase the algorithm changes the weights so that, on average, samples become more and more distant from the decision boundary. This is similar to what SVM does. There is also a turning point: if the model is around 95% or more certain in a sample's class, the utility of that sample decreases. It prevents large growth of the weights, acting as a regularizer. So, effectively, the algorithm uses only the part of the samples that lies in the 60-95% region of probabilities, where it is fairly confident in the samples' class.

What is bad about all samples being in one class? Having everything in one basket makes the introduction of classes useless, because no information is gained by knowing the class. In other words, the class labels are then completely unrelated to the features. So, what we really want is not to have everything in order, but to have labels that are closely related to the sample features. In math language, we would like to maximize mutual information, i.e. the mutual dependence between labels and features. In terms of formulas, it is described as a distance between the joint features-labels distribution and the product of the marginals

$$I(x,y) = \sum_{x,y} p(x,y)\ln\frac{p(x,y)}{p(x)\,p(y)} \qquad (10)$$

Having $I(x,y) = 0$ means that $p(x,y) = p(x)\,p(y)$, which essentially represents independence of the features and the labels. So, we want to maximize $I(x,y)$, i.e. to put as much information from the features into the labels as possible. One could think of this as a very lossy compression.

Expression (10) operates with three probabilities: $p(x,y)$, $p(x)$ and $p(y)$. Yet our model provides only the conditional probability $p(y|x)$. Luckily, mutual information is closely related to entropy:

$$I(x,y) = S(y) - S(y|x) \qquad (11)$$

There is a great image on Wikipedia describing this relation.

*Figure 2. Venn diagram showing relation between entropies and mutual information.*

The new model is pretty much the same as the one we already had. The difference is in the first term $S(y)$: the entropy of the labels. Maximization of (11) means minimization of (6) while keeping the labels entropy high enough. Note that the labels in this case are considered separately from the features:

$$S(y) = -N\sum_{y\in\{0,1\}} p(y)\ln p(y) \qquad (12)$$

The factor $N$ occurs because we are calculating the entropy over all the labels in the dataset. In this case all of them act as independent subsystems and thus their entropies are additive. The simplest way to estimate $p(y)$ is to assign labels to all the samples in the dataset and then calculate the frequencies of ones and zeros.

Suppose that all elements are classified into the same cluster. This leads to $p(y{=}1)$ equal to either 1 or 0. Thus $S(y)$ becomes zero and decreases the mutual information! Seems like the problem is solved.

Let's move on and derive the weights update rule. To do this we need to find the gradients $\partial I/\partial W$. We already have expression (8) for the conditional entropy derivatives. The marginal distribution $p(y)$ is not related to the features $x$; still, we can connect it to the conditional $p(y|x)$.

$$p(y) = \big\langle p(y|x)\big\rangle_x \qquad (13)$$

Here $\langle\cdot\rangle_x$ stands for averaging over all possible samples. To estimate the expected value we take the average over the dataset of $N$ samples.

$$p(y) \approx \frac{1}{N}\sum_{n=1}^{N} p(y|x_n) \qquad (14)$$
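As a tiny pure-Python illustration (the model outputs are made up): the marginal estimate is just the average of the conditional outputs over the dataset.

```python
# conditional probabilities p(y=1|x_n) produced by some model on four samples
cond = [0.75, 0.75, 0.25, 0.25]

p_y1 = sum(cond) / len(cond)  # estimate of p(y=1), as in (14)
p_y0 = 1.0 - p_y1             # the two marginals must sum to one

print(p_y1, p_y0)  # 0.5 0.5
```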

Combining it all together, we get an expression for the labels entropy

$$S(y) \approx -N\big[\bar p\ln\bar p + (1-\bar p)\ln(1-\bar p)\big], \qquad \bar p = \frac{1}{N}\sum_{n=1}^{N} p(y{=}1|x_n) \qquad (15)$$

and its derivatives

$$\frac{\partial S(y)}{\partial W} = -N\ln\frac{\bar p}{1-\bar p}\,\frac{\partial \bar p}{\partial W} \qquad (16)$$

We can rewrite the last equation in terms of sums over samples

$$\frac{\partial S(y)}{\partial W} = -\ln\frac{\bar p}{1-\bar p}\sum_{n=1}^{N} p_n(1-p_n)\,x_n^T \qquad (17)$$

The gradient (17) is very similar to the one for the conditional entropy. Combining both of them, we get the following weights update rule

$$W \leftarrow W + \alpha\sum_{n=1}^{N} p_n(1-p_n)\Big(\ln\frac{p_n}{1-p_n} - \ln\frac{\bar p}{1-\bar p}\Big)x_n^T \qquad (18)$$

Again, the rule (18) has the form of a weighted sum over the dataset. The expression for the gain is

$$G(p) = p(1-p)\Big(\ln\frac{p}{1-p} - \ln\frac{\bar p}{1-\bar p}\Big) \qquad (19)$$

It is close to (9). When $\bar p = \tfrac{1}{2}$, expressions (19) and (9) coincide.

*Figure 3. Sample gain for mutual information maximization model.*

The gain function is always zero near the extreme points 0 and 1. It also has one zero in between, whose location is determined by the mean value $\bar p$. The interpretation is the same as in the entropy case: points too close to or too far from the decision boundary provide no new information to the learning algorithm. Also note how the curve's form changes, favoring the samples from the larger cluster. This hints that the method won't work well if the dataset is too skewed, i.e. there are too few samples for one of the classes.

We have cost function (11) and weights update rule (18). With regard to theory we are done. Now it is time to code the equations.

We'll start with the simplest parts: functions for entropy and mutual information calculation.

function S = get_entropy(p)
    % entropy estimate; 0*log(0) terms are forced to zero by the p+(p==0) trick
    S = -mean(p.*log(p + (p==0)) + (1-p).*log(1 - p + (p==1)));
end

Note that *mean* is used instead of *sum*. It works as a normalization, bringing the numbers to the same scale independent of the number of samples in the dataset.

The same normalization is used in the mutual information function; otherwise we would have to multiply the labels entropy *Ht* by the number of samples.

The next step is gradients estimation.

function [s, gW] = get_grads(W, data)
    s = get_probs(W, data);                          % model probability estimates
    p_mean = mean(s);                                % marginal estimate, formula (14)
    l_expr = log(s./(1-s)) - log(p_mean/(1-p_mean)); % log-odds part of the gain (19)
    l_expr(isnan(l_expr) | l_expr == 0 | isinf(l_expr)) = 1;
    sfun = s.*(1-s).*l_expr;                         % per-sample gains
    gW = mean(bsxfun(@times, data, sfun), 2);        % gain-weighted average of inputs
    gW = gW.';
end

The gradient in unsupervised learning should not depend on labels, so we pass only the weights and the samples to the *get_grads* function. The assignment to *l_expr* aims to prevent infinite or NaN values of the gradients. Analytically we can show that in those extreme cases *sfun* takes zero values. However, during the calculation MATLAB does not take limits or account for the functions' orders of growth, so that line is introduced to deal with the issue.

The rest of the code is very similar to the one from the first part. Here is a GitHub link to a file that takes care of initialization, the train loop and tests. The key differences from what we've already written in part 1 are the change of the cost function (*get_mint* instead of *get_logll*) and the absence of labels in the gradients calculation.

There are also a few minor tweaks. We will review two of them.

The first one is data preprocessing. You cannot be 100% sure of your dataset. It is possible that there are duplicate features, NaNs, useless features etc. That's why it is important to cleanse your inputs before passing them to the learning algorithm.

function [data, keep_inds, means, stds] = preprocess_data(data, keep_inds, means, stds)
    if nargin < 2 % statistics not provided: compute them from this dataset
        means = mean(data, 2);
        stds = std(data, 0, 2);
        keep_inds = stds>0; % no variation => no useful information
        means = means(keep_inds);
        stds = stds(keep_inds);
    end
    data = data(keep_inds, :);
    % z-scoring
    data = bsxfun(@minus, data, means);
    data = bsxfun(@times, data, 1./stds);
end

This is one example of a preprocessing function. It removes constant features, i.e. ones that have the same value across the whole dataset. This can reduce the dimensionality of your dataset and thus speed up the training. After the useless entries are removed, z-scoring is applied. Z-scoring is a normalization routine that centres the features and scales them to have unit variance. Having all the features on the same scale is generally a good idea: the features are combined linearly inside the sigmoid, so vast discrepancies in scale could lead to shadowing of small-scale features and imprecise models. In fact, similar ideas are used in MATLAB Neural Networks (see the documentation for the removeconstantrows and mapminmax functions).
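The same cleansing steps can be sketched in plain Python (an illustrative helper, mirroring the MATLAB routine; features stored as rows, one column per sample):

```python
import math

def preprocess(data):
    """Drop constant features, then z-score the rest."""
    cleaned = []
    for row in data:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        if var == 0:
            continue  # no variation => no useful information
        std = math.sqrt(var)
        cleaned.append([(v - mean) / std for v in row])  # z-scoring
    return cleaned

data = [[1.0, 2.0, 3.0],   # informative feature
        [5.0, 5.0, 5.0]]   # constant feature, gets dropped
clean = preprocess(data)
print(len(clean))  # 1
```

After the call, every surviving feature has zero mean and unit variance, so no single feature can dominate the linear combination inside the sigmoid.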

Do you remember how we initialized the weights in our early models? We used small random numbers. For supervised models that worked well. Yet for clustering purposes this approach produced the flipping labels issue: initialization determined which cluster would be marked with which label. The issue is not critical: you can switch the labels after reviewing the trained model, and often it is not even important what the actual labels are. Still, it would be good to know ahead of time that, for example, a zero from the MNIST dataset will be marked by close-to-zero outputs of the model. To achieve this we can train a supervised model on a tiny bit of labeled data. If you don't have labels, you can add them manually. For the subset of MNIST that we used in the previous part it is enough to have as little as 2-3 labeled samples per class, still leaving thousands of samples unlabeled. The initialization routine is shown below.

function W = init_weights(data, labels)
    % Argument labels is a Nx2 matrix.
    % The first column represents labels (e.g. zeros or ones).
    % The second column: indices of samples from data corresponding to the labels.
    % So, if the first two samples have the "0" label and the 10th and 13th have the "1" label, then
    % labels = [0, 1; 0, 2; 1, 10; 1, 13];
    if nargin < 2 || isempty(labels)
        W = 0.01*randn(1, size(data, 1)); % fall back to small random weights
    else
        X = data(:, labels(:, 2));
        t = labels(:, 1)';
        W = train_lr(X, t);
    end
end

The supervised model is trained using the *train_lr* function that was introduced in part 1. The curious thing is that we took unlabeled data, added a few labels, passed that to our routines and as a result obtained a semi-supervised clustering algorithm that produces stable cluster labels. In other words, two runs of training will likely end with models assigning the same labels to most of the data. Thus, the labels flipping issue is resolved!

Wow! It was a long road, but we've made it to the end. Let's summarize our achievements. We've reviewed the logistic regression model and created algorithms for classification and clustering based on it. Our models expect that there are only two classes/clusters in the data. This makes the models simpler to understand and code, yet limits their power. Using some tricks you can solve multiclass problems even with this dichotomous toolkit; two of many possible ways are a hierarchy of models and the one-vs-all approach.
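For example, the one-vs-all trick can be sketched independently of the underlying model (pure Python; the lambdas stand in for trained binary classifiers and are purely illustrative):

```python
# three binary "models", each scoring how strongly a sample belongs to its class
models = {
    "a": lambda x: -abs(x - 1.0),
    "b": lambda x: -abs(x - 5.0),
    "c": lambda x: -abs(x - 9.0),
}

def one_vs_all(x):
    """Ask every binary model and pick the most confident one."""
    return max(models, key=lambda cls: models[cls](x))

print([one_vs_all(v) for v in (0.5, 4.8, 10.0)])  # ['a', 'b', 'c']
```

The same pattern works with any binary scorer, including the logistic regressions built in this series: train one model per class and let the highest score decide.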

After introducing the baseline model for unsupervised learning, we highlighted its drawbacks and eliminated them by changing the cost function and adding smart supervised initialization of the weights. Most of the code mentioned in this post can be found on GitHub.

At this point you have a decent clustering framework and an understanding of how and why it works. Good luck in your experiments!

Let's start with a model that you can find in nearly any introductory machine learning course: logistic regression.

The aim of this model is to divide the input data into two predefined classes (A and B). Suppose that the input is provided in the form of a column-vector $x$. Then logistic regression estimates the probability that the input corresponds to class A:

$$p(A|x) = \sigma(\tilde{W}x + b)$$

Here the row-vector $\tilde{W}$ and the scalar $b$ are the parameters of the model (named weights and bias). I'll assume that the first element of $x$ is always equal to 1. In this case we can concatenate $b$ and $\tilde{W}$, forming a row-vector of weights $W$. After this operation the expression transforms to a simpler form:

$$p(A|x) = \sigma(Wx) = \frac{1}{1 + e^{-Wx}} \qquad (1)$$

*Figure 1. Plot of the sigmoid function $\sigma$. This function maps a linear combination of the inputs $Wx$ into a probability.*

Training a regression is the process of estimating the weights that maximize some target function. In order to train a supervised classifier we need input vectors and corresponding labels: for each input vector $x_n$ we need a label $t_n$ describing which class $x_n$ belongs to. Let's assign $t_n = 1$ when $x_n$ is from class A and $t_n = 0$ if $x_n$ is from class B. Also suppose we have $N$ input-label pairs.

What is the distribution of the probabilities for the classes given features $x$? It is the same as for a Bernoulli process:

$$p(t|x) = p^{\,t}(1-p)^{1-t}, \qquad p = \sigma(Wx) \qquad (2)$$

The joint probability distribution for the features and the labels according to Bayes' theorem reads $p(x,t) = p(t|x)\,p(x)$.

Now let's try to find the weights maximizing the probability for all $N$ pairs. In other words, we want to maximize the following function:

$$\mathcal{L}(W) = \prod_{n=1}^{N} p(t_n|x_n)\,p(x_n)$$

This function is called the likelihood, and the overall approach is maximum likelihood estimation. Often it is easier to maximize the logarithm of the likelihood; both achieve their maximum at the same point. The logarithm of the likelihood is:

$$\ln\mathcal{L}(W) = \sum_{n=1}^{N}\big[t_n\ln p_n + (1-t_n)\ln(1-p_n) + \ln p(x_n)\big]$$

Note that the third summand under the sum sign, $\ln p(x_n)$, is independent of $W$, so we can throw it out of our target function:

$$L(W) = \sum_{n=1}^{N}\big[t_n\ln p_n + (1-t_n)\ln(1-p_n)\big], \qquad p_n = \sigma(Wx_n) \qquad (3)$$

One of the common ways to maximize a function is gradient ascent. Gradient ascent is an iterative method: on each iteration it updates the estimate of the optimal parameters (weights) by a value proportional to the gradient of the function with respect to the parameters. You can think of it as climbing a mountain, choosing the steepest way at each step.

$$W \leftarrow W + \alpha\,\frac{\partial L}{\partial W} \qquad (4)$$

The coefficient $\alpha$ is called the learning rate. It controls the speed of convergence of the algorithm. The gradient of the log-likelihood (3) yields

$$\frac{\partial L}{\partial W} = \sum_{n=1}^{N}\Big[\frac{t_n}{p_n} - \frac{1-t_n}{1-p_n}\Big]p_n(1-p_n)\,x_n^T \qquad (5)$$

It can be further simplified using the fact that all labels are binary with values either 0 or 1:

$$\frac{\partial L}{\partial W} = \sum_{n=1}^{N}(t_n - p_n)\,x_n^T \qquad (6)$$

Note that gradients (5) and (6) have the following form:

$$\frac{\partial L}{\partial W} = \sum_{n=1}^{N} G_n\,x_n^T \qquad (7)$$

Here $G_n$ is a scalar gain for a given sample from the dataset. The gain depends only on the labels and the model outputs and remains the same for all components of the vector $x_n$. The amplitude of the gain approaches zero when $p_n \to t_n$: the more accurately a sample is classified by the model, the lower its impact on the weights change. At this point we are done with the theory. To build a framework for logistic regression training we need to code formulas (2), (3), (4) and (6).
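As a quick cross-check of the whole recipe before the MATLAB version, here is a pure-Python sketch on a tiny toy dataset (all numbers illustrative; the first input component is fixed to 1 to play the role of the bias):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# toy 1-D dataset with a bias component: class boundary around x = 2
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
t = [0, 0, 1, 1]
W = [0.0, 0.0]
alpha = 0.1  # learning rate

for _ in range(500):
    grad = [0.0, 0.0]
    for x, label in zip(X, t):
        p = sigmoid(sum(w * v for w, v in zip(W, x)))
        for j in range(len(W)):
            grad[j] += (label - p) * x[j]  # gain (t - p) times the input, as in (6)
    W = [w + alpha * g for w, g in zip(W, grad)]  # gradient ascent step, as in (4)

preds = [sigmoid(sum(w * v for w, v in zip(W, x))) > 0.5 for x in X]
print(preds)  # [False, False, True, True]
```

After a few hundred ascent steps the model separates the two classes; the bias weight goes negative while the feature weight grows positive, placing the decision boundary between the clusters.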

All the coding in this post will be done in MATLAB. We will start with a simple piece of code: estimation of class probability given inputs and weights.

Part 1: code for the sigmoid function.

function s = sigmoid(data)
% sigmoid
% Element-wise logistic function.
s = 1./(1 + exp(-data));

Part 2: code for equation (2).

function p = get_probs(W, data)
p = sigmoid(W*data);

All the code is vectorized so you can pass multiple samples in *data* argument.

The next step is calculation of log-likelihood values for a given dataset (3).

function ll = get_logll(p, t)
% get_logll
% Log-likelihood (3) for predicted probabilities p and labels t.
ll = sum(t.*log(p + (p==0)) + (1-t).*log(1 - p + (p==1)));

Sometimes it is possible that *p* will be equal to zero or one. In that case one of the logarithms turns to −∞. As a result we could get *0·log(0) = NaN* for some samples instead of the expected zeros. To prevent this kind of issue the argument of the logarithm has been changed from *p* to *p+(p==0)* (and, symmetrically, from *1-p* to *1-p+(p==1)*). So the value of the logarithm, the sum and the resulting log-likelihood will always be finite numbers.
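The failure mode and the guard are easy to demonstrate in Python/NumPy (the probability values here are made up, chosen to trigger the edge cases):

```python
import numpy as np

p = np.array([0.0, 0.5, 1.0])  # predicted probabilities, including both extremes
t = np.array([0, 1, 1])        # labels

with np.errstate(divide='ignore', invalid='ignore'):
    naive = t * np.log(p) + (1 - t) * np.log(1 - p)
# 0*log(0) yields NaN for the first and last samples
safe = t * np.log(p + (p == 0)) + (1 - t) * np.log(1 - p + (p == 1))
print(naive)  # NaN in first and last positions
print(safe)   # finite everywhere: 0, log(0.5), 0
```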

At this point we can do prediction for a model and assess its quality. Now it is time for some training code. Gradient estimation (6):

function gW = get_lr_grads(W, data, t)
% gradient (6) of the mean log-likelihood
s = get_probs(W, data); % get probabilities estimates
sfun = t - s; % calculate gain for all samples
gW = mean(bsxfun(@times, data, sfun), 2); % average gradient over samples
gW = gW.';

To have the true gradient of the log-likelihood you need to replace *mean* with *sum* in line 5. In fact *gW* is the gradient of the mean log-likelihood over all samples, which is *N* times lower than the log-likelihood itself, so both reach their maximum at the same point. The reason for using *mean* is that *sum* would be unbounded for large datasets, while it is reasonable to expect the mean log-likelihood to be finite.

A top-level function that does initialization, updates weights according to (4) and prints training state information.

function W = train_lr(data, t)
% train_lr
max_iter = 100; % maximum number of iterations
min_diff = 0.01; % minimum tolerable change in log-likelihood
alpha = 0.1; % learning rate
old_ll = -Inf; % variable for storing previous log-likelihood
W = 0.01*randn(1, size(data, 1)); % random initialization of weights
p = get_probs(W, data);
ll = get_logll(p, t);
fprintf('Iteration #0 --log-likelihood: %.3f\n', ll);
i = 1;
while i <= max_iter
    gW = get_lr_grads(W, data, t); % gradient estimation
    W = W + alpha*gW; % weights update
    p = get_probs(W, data); % outputs of current model
    ll = get_logll(p, t); % current log-likelihood
    if ll - old_ll < min_diff
        break; % no noticeable changes, terminate loop
    end
    errs = 100*mean((p > 0.5) ~= t); % make decision: p > 0.5 means class A
    fprintf('Iteration #%d --log-likelihood: %.3f, errors: %.2f%%\n', i, ll, errs);
    old_ll = ll;
    i = i + 1;
end
fprintf('Results after %d iterations: log-likelihood: %.3f, errors: %.2f%%\n', i-1, ll, errs);
end

Comments describe most of the code; a few additional notes. To make a prediction it is necessary to make a decision at some point. Having the probability is cool, but you need to decide which class the input belongs to, right? See line 20 above. To make a decision we simply check if the value is greater than 0.5 (50% probability): for higher values assign class A, for lower — class B. The good thing about this approach is its simplicity: it is very easy to implement. The drawback is that, for instance, if the model outputs exactly 50% probability, you assign class B even though you don't have enough evidence to be confident in this decision. You don't have enough evidence to make any decision in this case!

Note the initialization code in line 7. The log-likelihood in our problem is concave, so we can reach its maximum using gradient ascent from any starting point given enough time. In fact we could use any weights in the initialization part, for example all zeros. This is not always true: for more complex models (e.g. neural networks) the starting point is crucial and often determines which one of many local maxima you eventually arrive at.
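The claim that gradient ascent reaches the maximum of a concave function from any starting point is easy to illustrate on a toy one-dimensional example (a Python sketch of my own, not part of the post's MATLAB framework):

```python
def grad_ascent(grad, w0, alpha=0.1, iters=200):
    # repeatedly step in the direction of the gradient, as in update rule (4)
    w = w0
    for _ in range(iters):
        w = w + alpha * grad(w)
    return w

# f(w) = -(w - 3)**2 is concave with gradient -2*(w - 3) and a single maximum at w = 3
w = grad_ascent(lambda w: -2 * (w - 3), w0=0.0)
print(w)  # converges to 3 regardless of w0
```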

To conclude this section let's train a model. We will use MNIST dataset. It is a well-known playground for machine learning models. I think it will not be far from truth to say that virtually any machine learning model has been trained on MNIST at least once. It is a labeled dataset of hand-written digits. Each sample is a 28x28 grey-scale image of a single digit. Here is one example of digit 0 from train set.

*Figure 2. Digit 0 from train set of MNIST.*

You can get MNIST in original format and read it by yourself or download version already prepared for MATLAB (43 megs). I used code provided by G. Hinton to create those mat-files.

From now on I'll assume that you have a set of mat-files named *digit0.mat* - *digit9.mat*, and that each file contains a matrix *D* of size *Nx784*, where *N* is the number of images of the corresponding digit. Let's read data for two digits (zero and one in this case), concatenate them into one matrix and start training.

load digit0.mat; data0 = D.';
load digit1.mat; data1 = D.';
data = [data0, data1];
t0 = zeros(1, size(data0, 2)); % labels for zeros
t1 = ones(1, size(data1, 2)); % labels for ones
t = [t0, t1];
W = train_lr(data, t);

Output of this code (actual numbers would possibly be different):

Iteration #0 --log-likelihood: -10008.723
Iteration #1 --log-likelihood: -5767.701, errors: 19.31%
Iteration #2 --log-likelihood: -4279.538, errors: 7.00%
Results after 2 iterations: log-likelihood: -4279.538, errors: 7.00%

It works! It took a few iterations to converge to a decent result with a 7% error rate. By adding minor modifications to the training code you can get even more accurate models. For example, by applying z-score normalization to the inputs and removing constant components you can cut the number of errors in half.
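A minimal sketch of that preprocessing in Python/NumPy (the function name and layout are mine, assuming features in rows and samples in columns as in the MATLAB code above): drop constant features, then z-score the rest.

```python
import numpy as np

def preprocess(X):
    # X: features in rows, samples in columns
    X = np.asarray(X, dtype=float)
    keep = X.std(axis=1) > 0  # remove constant components
    X = X[keep]
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / sd      # z-score each remaining feature

X = np.array([[1.0, 2.0, 3.0],
              [5.0, 5.0, 5.0],   # constant feature, gets dropped
              [0.0, 10.0, 20.0]])
Z = preprocess(X)
print(Z.shape)  # (2, 3)
```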

Too easy? No problem, let's make the problem more challenging and remove all the labels.

Now all you have is a set of images and you want to divide them into groups. This is called clustering. Suppose that you've reviewed the images and found out that they are handwritten digits: zeros and ones. How would you modify your logistic regression model to handle this case? The naive way is to let the weights update rule figure out the labels itself.

∇F(W) = Σ_n ( [y_n > 0.5] − y_n ) x_n    (8)

where [y_n > 0.5] equals one if and only if y_n > 0.5. Otherwise it is equal to zero.

To test this approach we can slightly modify existing code by replacing line 4 in *get_lr_grads* function with the following code:

sfun = (s>0.5) - s;

Note that the gradients are now independent of the labels. The general equation for the gradient (7) remains almost the same; only the form of the sample gain changed.

Now train the model. You'll find that models trained by this code tend to group all cases into one class. Indeed, having all probabilities close to the extreme points gives us gradients of almost zero magnitude. That is a local maximum for the function we are trying to maximize. In fact, the gradient of the form (8) corresponds to the following function:

F(W) = Σ_n ( [y_n > 0.5] ln y_n + [y_n ≤ 0.5] ln(1 − y_n) )    (9)

Recall the log-likelihood (3) for supervised training. Function (9) is its analog for our model, with labels replaced by thresholded model outputs. It is clear that when y_n is close to zero or one for all cases, this function gets close to zero, which is its maximum. By the way, a model that classifies all images correctly is a local maximum of this function too. So there is a minor chance that after training you'll arrive at a decent solution. For this way of training the initialization of weights is crucial and completely determines the local maximum that the model arrives at.

Suppose that we know that the numbers of zeros and ones in the dataset are approximately the same. We can then update our label assignment strategy. On each iteration we will estimate a threshold value for y that divides the dataset into two groups. One way to get that threshold is to calculate the mean or median value of y; both should work. Let's change line 4 in the gradients estimation function again:

sfun = (s>median(s)) - s;
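Why the median threshold cannot collapse everything into one class is easy to see in Python (the model outputs here are synthetic, just for illustration): thresholding at the median always splits the samples into two equal halves.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.random(100)         # stand-in for model outputs on 100 samples
labels = s > np.median(s)   # pseudo-labels from the median threshold
sfun = labels - s           # the gain from the modified line 4
print(labels.sum())  # 50: exactly half the samples land in each class
```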

Training yields a model with either 5-10% or 90-95% error rates. 95% means that our model has switched labels for ones and zeros and "real" error rate is 5%. Note that error rates are in the same range as for supervised case. We don't even need labels to do well on this task!

The drawback of this approach is that it is harder to describe analytically, especially when medians are used. As a consequence it is harder to reliably extend this method to the case of more clusters.

That's all for now. In the next part we will introduce a clever initialization of weights that prevents label flipping. Also we will derive a similar model that is more open to analysis and extensions. Hint: check the definition of entropy in the CS sense and this post's title.

]]>Our model was based on a sequence of 64 elements. It is a tiny piece of data. Also there was no separation between training and testing datasets, so we can't rule out overfitting (i.e. good results only on data that the model has already seen).

A good thing is that these issues can be easily resolved by gathering more data. You can fire up your favorite IDE (or text editor) and write a simple application that will assist you in data collection. If you are too lazy, just use my data. A bunch of typed zeroes and ones led to 5000+ points (100 sequences of 50 elements on average). By the way, you can also find all the code from this post on GitHub.

One of the ways to collect data using MATLAB:

function data = collect_data(fname)
% collect_data
% Collects data from user and stores it in data/'fname'.mat path.
data = [];
% collect data while we get 0 or 1 as input
while true
    in = input('0/1: ');
    if isempty(in) || (in ~= 0 && in ~= 1), break; end
    data(end+1) = in;%#ok
    clc;
end
% save and display stats
save(['data/' fname '.mat'], 'data');
fprintf('%d points collected, mean value: %.4f\n', length(data), mean(data));

If you plan to use this code, make sure that you have a *data* directory where sequences will be stored. In addition here is a handy code to read the data:

function data = get_all_data()
% get_all_data
% Reads all data gathered so far. Output is a cell-array.
data = {};
files = dir('data/*.mat');
for i = 1:length(files)
    z = load(['data/' files(i).name]);
    data{end+1} = z.data;%#ok
end

Output of this function is a cell-array with sequences. From this point I'll assume that you have enough data to do the following calculations.

To warm up let's do some analysis. First we check the distribution of means for the sequences generated.

data = get_all_data();
D = []; mns = [];
for i = 1:length(data)
    D = [D, data{i}];%#ok
    mns(end+1) = mean(data{i});%#ok
end

All the means are now stored in *mns* variable. Also note that *D* now holds all datapoints concatenated together. We will use it later.

The mean of *mns* (i.e. the mean of means) appears to be *mean(mns) = 0.5033*, so on average I produce the same amount of zeroes and ones. The standard deviation of the distribution of means is *std(mns) = 0.044*. Thus the ones rate for most sequences lies in the 42-59% interval; you can get these numbers by stepping two standard deviations from the mean on both sides. A visual check of the histogram of the distribution of means (*hist(mns)*) confirms our calculations.
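The interval is just the mean plus or minus two standard deviations:

```python
mean_mns, std_mns = 0.5033, 0.044  # statistics reported above
lo = mean_mns - 2 * std_mns
hi = mean_mns + 2 * std_mns
print(round(lo, 4), round(hi, 4))  # 0.4153 0.5913, i.e. roughly 42-59%
```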

A mean close to 50% says that coin tossing will deliver error rates close to 50%. Doesn't seem to be an exciting result.

Now let's go a bit deeper and calculate the sample autocorrelation function (ACF). In the previous post we used the R acf function to do this. To do the same in MATLAB I've coded a simple ACF calculation script which you can find here. It returns ACF values for a given number of lags and plots a graph if no output arguments are provided. Here is an example of a call and the figure that it produces.

get_acf(D);% D contains all data points

Red lines represent the confidence interval for the autocorrelation of white noise. It is clearly seen that the first 5-6 lags have significant ACF values. It is possible that this is related to the capacity of our working memory. There is a famous paper by George Miller, "The Magical Number Seven, Plus or Minus Two": the number of objects that a human can hold in working memory is close to 7. Look at the figure above: correlations vanish for lags higher than 6-7. A coincidence?
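For readers outside MATLAB, here is a small Python version of a sample ACF (my own sketch, not the script linked above). On a perfectly alternating sequence it exposes the kind of strong negative lag-1 correlation discussed here:

```python
import numpy as np

def get_acf(x, nlags=10):
    # sample autocorrelation: lagged covariance over variance
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(nlags + 1)])

acf = get_acf([0, 1] * 50, nlags=3)
print(acf.round(2))  # lag 0 is 1, lag 1 strongly negative, lag 2 strongly positive
```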

In the first part we observed that the closest neighbors possessed a significant negative correlation. So why don't we try the model from the previous part on this new data? After running a test you will find that the error rate is 38%. It is quite good, far better than a coin toss. Still, there are other peaks of the ACF that we haven't taken into account yet. Can we do better by exploiting correlations at higher lags?

To make use of more information we need a more complex model. Apparently, this model should involve information about the sequence development history. There are lots of models based on the number of ones in the previous *N* steps, the length of the last run of zeroes, a linear combination of previous elements or some other statistics. Among them are LPC, multistep Markov chains or even SVMs and neural networks.

A more straightforward approach is to model a joint distribution for a few consecutive elements. To do that we should split sequences into chunks of data (frames). For example, if we want to model a joint distribution of three consecutive elements, then the sequence of seven elements {0, 1, 1, 0, 1, 0, 1} leads to five frames: {0, 1, 1}, {1, 1, 0}, {1, 0, 1}, {0, 1, 0}, {1, 0, 1}. Note that frames are overlapping.

One way to model a distribution is to train a Restricted Boltzmann Machine. It is a cool unsupervised model that has become popular in the last several years. It was invented by Geoffrey Hinton and is an important building brick of today's machine learning mainstream: deep learning. However, we will use a simpler yet powerful model. Recall the way we got our trivial guesser: we were counting pairs of neighboring elements. To explore more sophisticated data patterns we will count occurrences of three, four or even more elements.

To model an arbitrary joint distribution of *N* binary variables you need 2^N − 1 parameters. Thus the number of parameters grows exponentially fast with history length. With the amount of data at hand we should expect models with memory length up to 7-8 to work reasonably well. Luckily, the ACF suggests that we probably won't need any longer memory.

The following code goes through a sequence *data* frame-by-frame and checks *ord* consecutive elements. Then it looks at the next element and updates the corresponding counter in *p*.

function p = get_predictors(data, ord, p)
% get_predictors
% Estimates distribution from data.
if nargin < 3
    % initialization to have 50% probability as a default case
    p = ones(2, 2^ord);
end
for i = 1:length(data)-ord
    D = data(i:i+ord); % frame of ord elements plus the next element
    % calculate index
    ind = 1;
    for j = 0:ord-1
        if D(j+1)
            ind = ind + 2^j;
        end
    end
    % update distribution
    p(D(end)+1, ind) = p(D(end)+1, ind) + 1;
end

I'll explain the counter indexing rules a bit. If the next element (i.e. the one after *ord* consecutive elements) is zero, then the first row is used; otherwise the second row is used. Column index *ind* is the number that we get after treating the consecutive elements as a binary number and converting it to decimal base (e.g. 1001 leads to index 9). So, each element of the counter *p* is the number of occurrences of zero or one (depending on row index) after the previous *ord* elements (determined by column index). This representation is flexible enough that we can add more sequences and update it using only new data. Getting probabilities from counters is still easy. For example, the conditional probability of finding 1 given the subsequence {0, 1, 1} of previous elements can be obtained by evaluating p(2, ind) / (p(1, ind) + p(2, ind)), where ind is the column index corresponding to {0, 1, 1}.
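The same counting scheme can be sketched in Python with a dictionary instead of a 2x2^ord matrix (the helper names are mine): counters start at one, so unseen histories default to a 50% probability.

```python
from collections import defaultdict

def count_transitions(seq, order):
    # counts[history] = [times 0 followed, times 1 followed]
    counts = defaultdict(lambda: [1, 1])  # start at 1 => 50% by default
    for i in range(len(seq) - order):
        history = tuple(seq[i:i + order])
        counts[history][seq[i + order]] += 1
    return counts

def prob_one(counts, history):
    # conditional probability of 1 given the previous elements
    c0, c1 = counts[tuple(history)]
    return c1 / (c0 + c1)

seq = [0, 1, 1, 0, 1, 0, 1]         # the example sequence from the text
counts = count_transitions(seq, 3)
print(prob_one(counts, [0, 1, 1]))  # 1/3: one 0 observed after {0, 1, 1}
print(prob_one(counts, [1, 1, 1]))  # 0.5: never seen, default applies
```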

Suppose that we have estimated the distribution. What next? Now we need to use it to predict future outcomes given the history. The most straightforward way is to sample from the distribution. The other way is to apply a threshold, i.e. to always predict the value with the higher probability. I've coded a script that runs a test for models with various memory lengths. It uses half of the sequences to estimate distributions and then tests models on the remaining data. So, tests are run on data that wasn't seen by the models before.

First let's check how the model works if we use a random distribution instead, i.e. all values in the *p* variable are just random numbers. So, we generate a random distribution and sample from it.

1-step memory: 1347/2468 errors (54.58%), CI: 2.78%, chance 25.00%
2-step memory: 1063/2418 errors (43.96%), CI: 2.81%, chance 12.50%
3-step memory: 1292/2368 errors (54.56%), CI: 2.84%, chance 6.25%
4-step memory: 1151/2318 errors (49.65%), CI: 2.88%, chance 3.12%
5-step memory: 1132/2268 errors (49.91%), CI: 2.91%, chance 1.56%
6-step memory: 1184/2218 errors (53.38%), CI: 2.94%, chance 0.78%
7-step memory: 1095/2168 errors (50.51%), CI: 2.98%, chance 0.39%
8-step memory: 1068/2118 errors (50.42%), CI: 3.01%, chance 0.20%

What do all these numbers mean? The third line reads: "3-step memory: 1292/2368 errors (54.56%), CI: 2.84%, chance 6.25%". It describes the results of the test for the model that takes into account the previous three elements. After running the prediction test on 2368 cases, 1292 mistakes occurred. That yields an error rate of 54.56%. The radius of the 95% confidence interval is approximately 2.84%, so the true error rate is somewhere between 51.72% and 57.40%. You probably noticed that the number of cases is not the same for all the models. We cannot predict outcomes when we don't know the history: for the model with 5-step memory we can't predict the first 5 elements of each sequence. This is the reason why the number of cases gets smaller for models with longer memory.

It is clearly seen that all the models have high error rates. The lower value for 2-step memory was likely obtained by chance. What if we try applying a threshold instead of sampling?

1-step memory: 1215/2468 errors (49.23%), CI: 2.79%, chance 25.00%
2-step memory: 1189/2418 errors (49.17%), CI: 2.82%, chance 12.50%
3-step memory: 1100/2368 errors (46.45%), CI: 2.84%, chance 6.25%
4-step memory: 1090/2318 errors (47.02%), CI: 2.88%, chance 3.12%
5-step memory: 1048/2268 errors (46.21%), CI: 2.91%, chance 1.56%
6-step memory: 1161/2218 errors (52.34%), CI: 2.94%, chance 0.78%
7-step memory: 1090/2168 errors (50.28%), CI: 2.98%, chance 0.39%
8-step memory: 1001/2118 errors (47.26%), CI: 3.01%, chance 0.20%

Error rates got lower, but they are still close to 50%. If you run the test multiple times it is possible that you'd occasionally see good results, especially for models with shorter memory. The reason for this is the very limited space of possible model instances. Consider a 1-step model with a threshold. You can formulate only four distinct rules: **predict 1 if previous is 1**, **predict 1 if previous is 0**, **predict 0 if previous is 1**, **predict 0 if previous is 0**. By choosing two non-contradictory rules from this set you get a predicting model; there are only four possible ways to do that. If one of them reflects the real distribution of the data, then with probability 25% a random 1-step model with a threshold will work well. That is what the chance value describes. For N-step memory the space of possible models grows extremely fast, so the chance of guessing a good model is *at least* halved with every extra step of memory (these are the "chance" values in the tables). Chance drops quickly and seems low for memory lengths of 3 or greater.

Now it is time to run a test for predictors based on the real distribution. Results for predictions using sampling:

1-step memory: 1165/2468 errors (47.20%), CI: 2.79%, chance 25.00%
2-step memory: 1096/2418 errors (45.33%), CI: 2.81%, chance 12.50%
3-step memory: 1075/2368 errors (45.40%), CI: 2.84%, chance 6.25%
4-step memory: 1005/2318 errors (43.36%), CI: 2.87%, chance 3.12%
5-step memory: 1009/2268 errors (44.49%), CI: 2.90%, chance 1.56%
6-step memory: 957/2218 errors (43.15%), CI: 2.93%, chance 0.78%
7-step memory: 911/2168 errors (42.02%), CI: 2.96%, chance 0.39%
8-step memory: 891/2118 errors (42.07%), CI: 2.99%, chance 0.20%

Looks better. All confidence intervals are below 50%, and for some models below the values that we had for the random distribution. It means that estimating the distribution indeed helps!

Threshold instead of sampling:

1-step memory: 938/2468 errors (38.01%), CI: 2.75%, chance 25.00%
2-step memory: 927/2418 errors (38.34%), CI: 2.78%, chance 12.50%
3-step memory: 818/2368 errors (34.54%), CI: 2.78%, chance 6.25%
4-step memory: 810/2318 errors (34.94%), CI: 2.81%, chance 3.12%
5-step memory: 798/2268 errors (35.19%), CI: 2.84%, chance 1.56%
6-step memory: 762/2218 errors (34.36%), CI: 2.87%, chance 0.78%
7-step memory: 766/2168 errors (35.33%), CI: 2.91%, chance 0.39%
8-step memory: 750/2118 errors (35.41%), CI: 2.95%, chance 0.20%

Error rates dropped by 7-10 percentage points for all memory lengths. So, one thing is sorted out: models with a threshold work better for our problem, and the improvement is big indeed.

We have achieved an error rate comparable to what we had in the first part. The key difference is that now we used much more data, and thus the results are more reliable. The upper bound of the 95% confidence interval for the error rate of the best model is lower than 38% (it was around 46% for the model from part 1). Shorter (1- or 2-step) memory leads to slightly lower accuracy, but the difference between the "3+" models is not significant. So, if you need a simple yet accurate model, choose the 3-step model. If you need the most accurate model, try the 6-step model, though it is likely to work about as accurately as the 3-step one.

The post appears to be a bit longer than I expected, so here is a short TL;DR summary. We have used manually generated data to understand if it is possible to predict "random" numbers that are created by a human being. A quick check of the ACF shows that there are significant correlations at lags up to 6-7. This could be related to estimates of working memory capacity. Also, by manually examining the data you can find that it is very unlikely to encounter a long run of zeroes or ones, even more unlikely than in the case of a Bernoulli process. It is possible that in your case the patterns would be different. When I asked my wife to do the experiment from the first part, it emerged that she tends to produce longer runs of the same element. However, that is a pattern too.

After reviewing the data we estimated the distribution describing the conditional probability of finding 1 at some position given the previous history. To run tests we used sequences that hadn't been seen by the model before. Models that have information about at least three previous elements achieve error rates of 35% and less. It means that in approximately 2 of 3 cases these models make a correct prediction. So, treating yourself as a random number generator is not a good idea.

]]>I got this issue 2 or 3 times in the last two months. There is a corresponding Launchpad entry. The bug has had "Fix released" status for Ubuntu since mid September, yet somehow it still happens. Last time I saw this bug I had Ubuntu 14.04 with Unity 7.2.2. Apt-get-upgrading changed the version of Unity to 7.2.3, so maybe it is indeed fixed now. However, if you encounter this bug, try the following steps:

- Switch to the console (usually you can do this by pressing Ctrl+Alt+F1).
- Log in if you haven't done this earlier.
- Kill all instances of the unity-panel-service process (*sudo killall unity-panel-service*).
- Switch back to the graphical interface (Ctrl+Alt+F7).
- If the problem persists, repeat one more time.

A few notes. After switching to the graphical interface you might first see a black screen. In that case wait for some time to let the service restart. In case of success you will be able to unlock the screen. You will also find that applications are no longer on the virtual desktops where they were initially positioned, so you will have to arrange them manually one more time.

In the comments to the bug report there was one more solution: you could try running *unity --replace*. In my case that worked, but all windows lost their headers and borders. Probably it could be sorted out, for example by starting *metacity*, but that seems like a longer workaround.

**UPD**. Had this issue again a few minutes ago. Seems like this workaround doesn't always help.

That simple clarification makes it possible to achieve a higher score. I'm not sure if we have a good source of entropy in our organisms. But the way a human acts is often somewhat predictable. One of the issues that appears when randomness meets common sense has been reviewed in the previous post.

If your random number generator is good, then it is challenging to build a model for it. It has no significant correlations, no memory, no periodicity. Well, it could be periodic, but the period should be enormous. Modern random number generators rely on external sources of entropy (mouse movements, keyboard strokes etc.).

The random number generator gathers environmental noise from device drivers and other sources into an entropy pool. The generator also keeps an estimate of the number of bits of noise in the entropy pool. From this entropy pool random numbers are created.

Even after those complications, the numbers produced are called pseudorandom. And what about the random number generator in your head?

Try an experiment on your own. Write a series of ones and zeroes and make them random, or at least they should appear random from your perspective. Close your eyes to prevent effects caused by seeing the previous numbers in the sequence. Also try not to think much; to achieve that, write 2-3 numbers per second. Now observe your result.

When I was doing that I often had some quick thought like "hm.. too many ones, let's do a zero" or had switches between anti-symmetric (1, 0, 1, 0, 1) and symmetric (1, 1, 1, 0, 0) regimes. Also I occasionally mentally complained that it is hard to produce a lot of random numbers when you have to choose either zero or one. Here is an example of what you could get.

*Figure 1. Example of manual sampling.*

There are 64 elements in that sample: 31 ones and 33 zeroes. So the mean value is close to 0.5, which should be common for the majority of sequences. Now we can enumerate all pairs 0-0, 0-1, 1-0 and 1-1 (see the poorly painted table above). It reveals an important fact: the number of 0-1 and 1-0 pairs is twice the number of 0-0 and 1-1 pairs. What does it mean? It says that we have strong negative correlation at lag 1 (i.e. between closest neighbors).

*Figure 2. ACF for generated series.*

Indeed, after plotting the ACF it is clear that there are significant correlations at lags 1 and 12. We will leave the latter aside, but it might be possible to extract important insights from it too. As for the negative acf(1): it means that it is noticeably more likely to find 1 followed by 0 or 0 followed by 1. This regularity can be utilized to build a naive guesser: **if the current element is X, then the next element will be 1-X**. Applying it to the sequence we get two thirds of the elements predicted correctly. Using R we can check the bounds of the confidence interval for the success rate:

> binom.test(42, 63)

        Exact binomial test

data:  42 and 63
number of successes = 42, number of trials = 63, p-value = 0.01114
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5366192 0.7804625
sample estimates:
probability of success
             0.6666667

The lower bound of the confidence interval (53.66%) is close to what we would get by simply using a coin toss to do predictions. But the true probability of success is higher than 50% at a significance level of 0.05. So, our model works and it is more accurate than a coin!
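The p-value can be reproduced without R. A Python sketch of the exact two-sided binomial test under p = 0.5 (the null distribution is symmetric, so doubling one tail matches what binom.test reports here):

```python
from math import comb

def binom_two_sided_p(successes, n):
    # exact two-sided binomial test against p = 0.5
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = binom_two_sided_p(42, 63)
print(p < 0.05)  # True: the null "fair coin" hypothesis is rejected
```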

If you get more sequences or a longer sequence, the success rate will likely be closer to the lower bound of the confidence interval due to fancy patterns that exist in the data; these patterns are incomprehensible to the trivial model considered. Nevertheless, there is an important consequence: a human introduces correlations into the data he generates, so it is possible to build a model that effectively predicts future values of a random human-generated sequence. One possible explanation is that we remember which elements we sampled in previous moments in time, and that information affects future values.
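The naive guesser itself is one line of logic; here it is in Python, applied to a short made-up sequence (not the actual 64-element sample from the figure):

```python
def naive_guess_score(seq):
    # rule: if the current element is X, predict the next element is 1 - X
    preds = [1 - x for x in seq[:-1]]
    correct = sum(p == actual for p, actual in zip(preds, seq[1:]))
    return correct, len(preds)

seq = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]  # hypothetical human-typed sequence
correct, n = naive_guess_score(seq)
print(correct, n)  # 7 9
```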

In the next part we will assume that people use memory when generating random numbers. That should help us build a more complex and accurate predictive model.

]]>