Clustering via mutual information maximization. Part 2

In the previous post we reviewed the logistic regression model and designed a simple clustering algorithm based on it. We managed to get a decent clustering on a subset of the MNIST dataset. Yet there were some drawbacks in our approach.

It was hard to describe the model analytically, especially for the median-threshold case. As a consequence, a reliable extension of the model to more than two clusters seemed challenging.

Why is it hard?

Recall the cost function that we maximized in unsupervised learning:

(1)   \begin{equation*} F_{cl} = \sum_{i=1}^{N}\big\{\sigma_+(Wx_i)\log\sigma(Wx_i) + (1-\sigma_+(Wx_i))\log(1-\sigma(Wx_i))\big\}. \end{equation*}

Here \sigma_+ is a step function that produces 1 if its argument is greater than a threshold and 0 otherwise. The threshold is either the mean or the median of the arguments computed over all samples.
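As a reminder, here is a minimal MATLAB sketch of \sigma_+ (the function name sigma_plus and the row vector a = W*x collected over all samples are assumptions for illustration):

function y = sigma_plus(a, thr_type)
% Hard assignment: 1 if the activation exceeds the threshold, 0 otherwise.
% The threshold is the mean (default) or the median of all activations.
if nargin < 2
    thr_type = 'mean';
end
if strcmp(thr_type, 'median')
    thr = median(a);
else
    thr = mean(a);
end
y = double(a > thr);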

When deriving weight update rules or other important formulas you usually do two things: averaging and differentiation. The first one is not a problem, but the latter is. The reason is that \sigma_+ is not smooth: its derivative at the threshold point is undefined unless we make additional assumptions. Furthermore, the location of the threshold point is not fixed and depends on the dataset you use.

We could try to deal with these difficulties by redefining \sigma_+ so that its derivative is a delta function. Another way is to split expression (1) into two parts: one for samples above the threshold and one for those below, derive all the formulas for both cases and combine them. Still, you would need further tricks to extend the model to more complex cases.

Entropy: measure the chaos

How can we make \sigma_+ more friendly to derivatives? The answer is plain: drop the plus sign. 🙂 This can be viewed as approximating the step function \sigma_+ with a smooth sigmoid \sigma. They share similar features, and for arguments with large absolute value the sigmoid behaves more and more like the step function. So let's examine our new cost function:

(2)   \begin{equation*} F_{cl} = \sum_{i=1}^{N}\big\{\sigma(Wx_i)\log\sigma(Wx_i) + (1-\sigma(Wx_i))\log(1-\sigma(Wx_i))\big\}. \end{equation*}

It turns out that this is a rather well-known quantity in physics and information theory: entropy, up to a sign. There are various definitions of entropy depending on the field of application. On a high level, entropy measures how messy a system is: high values mean there is a lot of chaos, low values mean that everything is in order. When searching for weights maximizing (2), we in fact minimize entropy. Thus we force the algorithm to find a clustering with as little clutter as possible.

Let's check the definition of entropy H from information theory:

(3)   \begin{equation*} H = -\sum_{\{S\}}p(S)\log(p(S)). \end{equation*}

Here \{S\} is the set of all possible system states (e.g. the set of feature values for a sample plus a label) and p(S) is the probability of observing a specific state. The value of H is the expected amount of information gained from learning which state the system is in. If the probabilities are close to the extreme values, the entropy gets close to zero: with such probabilities you are already almost certain which state the system is in. On the other hand, for probabilities around 50% you get the maximum value of H, because any state would be a surprise to you.
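A quick numeric illustration of the binary case in MATLAB (natural logarithm, i.e. nats):

H = @(p) -(p.*log(p) + (1-p).*log(1-p));   % binary entropy of a two-state system
H(0.5)    % ~0.693: maximal uncertainty, any state is a surprise
H(0.99)   % ~0.056: we are almost sure of the state, little information to gain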

The probabilities p(S) are given by a model. By training a model to minimize the entropy you are effectively asking for a clustering with the least achievable surprise.

To rewrite expression (3) in the form of (2) we need to substitute the state S with labels t and features x

(4)   \begin{equation*} H = -\sum_{\{x, t\}}p(x, t)\log(p(x, t)). \end{equation*}

Note that \{x, t\} is the set of all possible feature-label combinations. We limited our model to produce only two clusters, so for every feature vector x there are two possible combinations: \{x, t=0\} and \{x, t=1\}. We have a limited number of training samples, so exact calculation of H is not always feasible. To resolve this problem we can approximate the entropy by summing over the training set. If the dataset has been properly sampled, this approximation gives decent results. The updated formula for the entropy reads

(5)   \begin{multline*} H = -\sum_i p(x_i, t=0)\log(p(x_i, t=0))\\ -\sum_i p(x_i, t=1)\log(p(x_i, t=1)). \end{multline*}

Expression (5) uses the joint feature-label probabilities p(x, t). Our model, on the other hand, provides conditional probabilities p(t | x). If the feature distribution p(x) were known, we could combine it with the conditional, p(x, t) = p(t | x)p(x), and get the joint distribution. Unfortunately, you often don't know the marginal distribution. You could estimate it by bootstrapping or simply assume that all samples are equally likely to appear. Still, there is another, more elegant way.

It would be nice if we could simply substitute p(x, t) in (5) with p(t | x). The question is: what do we get then?

(6)   \begin{multline*} H_c = -\sum_i p(t=0 | x_i)\log(p(t=0 | x_i))\\ -\sum_i p(t=1 | x_i)\log(p(t=1 | x_i)). \end{multline*}

This is called conditional entropy. It shows how much information is needed to assign a cluster label t to a sample with known features x. A good thing is that H and H_c are related

(7)   \begin{equation*} H_c[t | x] = H[x, t] - H[x]. \end{equation*}

Also, H[x] depends only on the features, so it is the same for any model. As a consequence, a model minimizing the conditional entropy H_c also minimizes the entropy H. Using the fact that p(t=0 | x_i) = 1 - p(t=1 | x_i) and substituting p(t=1 | x_i) with the model estimate \sigma(Wx_i) in (6), we arrive at the cost function (2) taken with a negative sign.
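Writing the substitution out explicitly,

\begin{equation*} H_c = -\sum_{i=1}^{N}\big\{\sigma(Wx_i)\log\sigma(Wx_i) + (1-\sigma(Wx_i))\log(1-\sigma(Wx_i))\big\} = -F_{cl}, \end{equation*}

so maximizing (2) is exactly minimizing the conditional entropy.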

Now that we know the nature of the model, we can analyze its pros and cons. The good things are that it has a clear physical meaning, it is tractable for analysis and it is fairly simple. Nevertheless, it has some issues too, and the main one is the presence of a local minimum that produces absolutely useless results. Suppose that all weights of the model except the bias are near zero. Then the output of the model doesn't care what data is passed in: it will always be the same value, determined by the bias. A bad and obviously undesired situation. With a large enough bias magnitude you get almost perfect 0 or 1 probabilities. That yields a very low value of the conditional entropy and thus would be accepted by the training routine as a good solution.
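A tiny MATLAB illustration of this trap (assuming data is a features-by-samples matrix whose first row is all ones, the bias input, as produced by the preprocessing routine shown later):

W = zeros(1, size(data, 1));
W(1) = 20;                               % large bias, all feature weights at zero
p = 1./(1 + exp(-W*data));               % every output is ~1 regardless of the input
Hc = -mean(p.*log(p) + (1-p).*log(1-p))  % nearly zero: "perfect order", yet a useless clustering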

Usually, it is enough to initialize the weights with small random values before training to feel relatively safe from those local minima; the model then works fine. Alternatively, we can modify the cost function by adding an anti-regularization term. To be honest, I even planned to use this model (6) as a basis and name the post "Clustering via entropy minimization". However, after doing some searching I found a paper describing this approach in more detail and one more work improving the algorithm (Grandvalet 2004 and Gomes 2010).

One more remark before going to the next section. If you calculate the derivative \nabla_W H_c and step against it (we are minimizing H_c), you get the weights update rule

(8)   \begin{equation*} \Delta W \propto \sum_i p_i (1-p_i) \log(\frac{p_i}{1-p_i}) x_i \end{equation*}

Here p_i denotes the conditional probability p(t=1 | x_i). Similarly to the previous part, the weights update rule takes the form of a sum over samples weighted by a gain g(p). In this case the gain is

(9)   \begin{equation*} g(p) = p (1-p) \log(\frac{p}{1-p}). \end{equation*}
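For reference, (8) follows from the chain rule together with \nabla_W p_i = p_i(1-p_i)x_i for the sigmoid:

\begin{equation*} \nabla_W H_c = -\sum_i \log\Big(\frac{p_i}{1-p_i}\Big)\, p_i(1-p_i)\, x_i, \end{equation*}

and stepping against this gradient gives the update rule (8).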

"Entropy model" sample gain
Figure 1. Sample gain for entropy minimization model.

The gain shares similar features with the models from the previous part: it becomes close to zero as p approaches the extreme values. Also note the zero at p = 0.5. The interpretation is as follows: samples with roughly a 50% probability estimate lie near the decision boundary and could belong to either class, so it is much harder to extract information from them. The farther a sample is from the decision boundary, the more confident the model is in its class. During the learning phase the algorithm changes the weights so that, on average, samples move farther and farther from the decision boundary. This is similar to what an SVM does. There is also a turning point: once the model is around 95% or more certain about a sample's class, the utility of that sample decreases. This prevents a large growth of the weights, acting as a regularizer. So, effectively, the algorithm relies mostly on the samples that lie in the 60-95% probability region, where it is fairly confident about the sample's class.

Mutual information

What is wrong with all samples being in one class? Having everything in one basket makes the introduction of classes useless, because no information is gained by knowing the class. In other words, the class labels are completely unrelated to the features. So, what we really want is not to have everything in order, but to have labels that are closely related to the sample features. In the language of math, we would like to maximize mutual information, i.e. the mutual dependence between labels and features. In terms of formulas, it measures how far the joint feature-label distribution p(x,t) is from the product of the marginals p(x)p(t)

(10)   \begin{equation*} I = \sum_{\{x,t\}} p(x, t) \log(\frac{p(x,t)}{p(x)p(t)}). \end{equation*}

Having I = 0 means that p(x, t) = p(x)p(t), which is exactly independence of features and labels. So, we want to maximize I, i.e. to push as much information from the features into the labels as possible. One could think of this as a very lossy compression.

Expression (10) operates with three probabilities: p(x, t), p(x) and p(t). Yet our model provides only the conditional probability p(t | x). Luckily, mutual information is closely related to entropy:

(11)   \begin{equation*} I[x,t] = H[t] - H[t | x]. \end{equation*}

There is a great image on Wikipedia describing this relation.
Figure 2. Venn diagram showing the relation between entropies and mutual information.

The new model is pretty much the same as the one we already had. The difference is in the first term H[t]: the entropy of the labels. Maximizing (11) means minimizing (6) while keeping the label entropy high enough. Note that the labels in this case are considered on their own, apart from the features

(12)   \begin{equation*} H[t] = -N\sum_{\{t\}} p(t)\log p(t) = -N p(0)\log p(0) - N p(1) \log p(1). \end{equation*}

The factor N appears because we are calculating the entropy over all labels in the dataset; they act as independent subsystems, so their entropies are additive. The simplest way to estimate p(t) is to assign labels to all the samples in the dataset and then calculate the frequencies of ones and zeros.
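For instance, with a vector p of model outputs over the whole dataset (a stand-in name here), the frequency estimate and the label entropy (12) take one line each; the extra terms guard against log(0), as in the code below:

p1 = mean(p >= 0.5);   % fraction of samples assigned label t=1
p0 = 1 - p1;
Ht = -numel(p)*(p0*log(p0 + (p0==0)) + p1*log(p1 + (p1==0)));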

Now suppose that all elements are assigned to the same cluster. This makes p(t) equal to either 1 or 0, so H[t] becomes zero and drags the mutual information down! Seems like the problem is solved.

Let's move on and derive the weights update rule. To do this we need the gradient \nabla_W I = \nabla_W H[t] - \nabla_W H[t | x]. The conditional entropy part has already been handled in (8). The marginal distribution p(t) does not depend on the features x directly, but we can connect it to the conditional p(t | x).

(13)   \begin{equation*} p(t) = \sum_{\{x\}}p(x,t) = \sum_{\{x\}}p(t | x) p(x) = \langle p(t | x) \rangle_x. \end{equation*}

Here \langle \cdot \rangle_x stands for averaging over all possible samples. To estimate the expected value \langle p(t | x) \rangle_x we take the average over the dataset of N samples.

(14)   \begin{equation*} p(t) = \langle p(t | x) \rangle_x = \frac{1}{N}\sum_i p(t | x_i). \end{equation*}

Putting it all together, we get an expression for the entropy

(15)   \begin{multline*} H[t] = -N\langle p(t=1 | x) \rangle_x \log \langle p(t=1 | x) \rangle_x\\ -N\langle 1-p(t=1 | x) \rangle_x \log \langle 1-p(t=1 | x) \rangle_x \end{multline*}

and its derivatives

(16)   \begin{equation*} \nabla_W H[t] = -N\langle p(t=1 | x)(1-p(t=1 | x)) x \rangle_x \log\frac{\langle p(t=1 | x) \rangle_x}{1-\langle p(t=1 | x) \rangle_x}. \end{equation*}

We can rewrite the last equation in terms of sums over samples

(17)   \begin{equation*} \nabla_W H[t] = -\log\Big(\frac{\sum_i p_i}{N-\sum_i p_i}\Big)\sum_i p_i (1-p_i) x_i. \end{equation*}

The gradient (17) is very similar to the one for conditional entropy H[t | x]. Combining both of them we get the following weights update rule

(18)   \begin{equation*} \Delta W \propto \sum_i p_i (1-p_i) \big(\log(\frac{p_i}{1-p_i}) - \log\frac{\sum_j p_j}{N-\sum_j p_j}\big) x_i. \end{equation*}

Again, the rule (18) has the form of a weighted sum over the dataset. The expression for the gain is

(19)   \begin{equation*} g(p) = p (1-p) \log\big(\frac{p}{\langle p \rangle_x}\frac{1-\langle p \rangle_x}{1-p}\big). \end{equation*}

It is close to (9): when \langle p \rangle_x = 0.5, expressions (19) and (9) coincide.
Figure 3. Sample gain for the mutual information maximization model.

The gain function g(p) is always zero near the extreme points 0 and 1, and it has one more zero in between, located exactly at p = \langle p \rangle_x. The interpretation is the same as in the entropy case: points too close to or too far from the decision boundary provide no new information to the learning algorithm. Also note how the shape of the curve changes, favoring the samples from the larger cluster. This hints that the method won't work well if the dataset is too skewed, i.e. if there are too few samples in one of the classes.
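To see how this zero moves with the cluster balance, one can quickly plot the gain (19) for a few values of \langle p \rangle_x (a standalone sketch, not part of the clustering code):

p = linspace(0.001, 0.999, 999);
figure; hold on;
for avg_p = [0.3 0.5 0.7]
    g = p.*(1-p).*log(p./(1-p).*(1-avg_p)./avg_p);
    plot(p, g);                       % the curve crosses zero exactly at p = avg_p
end
hold off;
xlabel('p'); ylabel('g(p)');
legend('avg p = 0.3', 'avg p = 0.5', 'avg p = 0.7');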

We have the cost function (11) and the weights update rule (18). As far as the theory goes, we are done. Now it is time to code the equations.

Implementation

We'll start with the simplest parts: functions for entropy and mutual information calculation.

function S = get_entropy(p)
% Conditional entropy H[t|x], normalized by the number of samples.
% The (p==0) and (p==1) terms guard against log(0) producing NaNs.
S = -mean(p.*log(p+(p==0))+(1-p).*log(1-p+(p==1)), 2);

Note that mean is used instead of sum. It works as a normalization, keeping the numbers on the same scale regardless of the number of samples in the dataset.

function I = get_mint(p)
% Mutual information between labels and features, normalized per sample.
avg_p = mean(p);
% Label entropy H[t] per sample, see (12); the extra terms guard against log(0).
Ht = -(avg_p*log(avg_p + (avg_p==0)) + (1-avg_p)*log(1-avg_p + (avg_p==1)));
I = Ht - get_entropy(p);

The same normalization is used here; otherwise we would have to multiply Ht by the number of samples.

The next step is gradient estimation.

function [s, gW] = get_grads(W, data)
% Returns probabilities p(t=1|x) and the gradient of the mutual information.
s = get_probs(W, data);
% Argument of the logarithm in the gain (19).
l_expr = s./(1-s).*mean(1-s)./mean(s);
% Reset degenerate values (0, Inf, NaN) to 1, so that their log becomes 0.
l_expr(isnan(l_expr) | l_expr == 0 | isinf(l_expr)) = 1;
sfun = s.*(1-s).*log(l_expr);   % per-sample gain g(p_i)
gW = mean(bsxfun(@times, data, sfun), 2);
gW = gW.';

The gradient in unsupervised learning should not depend on labels, so we pass only the weights and the samples to the get_grads function. The assignment that resets degenerate values of l_expr (zeros, Infs and NaNs) to 1 aims to prevent infinite or NaN gradient values. Analytically we can show that in those extreme cases sfun takes zero values; however, MATLAB does not take limits or account for the order of growth of functions during the computation, so the explicit reset deals with this issue.

The rest of the code is very similar to the one from the first part. Here is a GitHub link to a file that takes care of initialization, the train loop and tests. The key differences from what we've already written in part 1 are the change of the cost function (get_mint instead of get_logll) and the absence of labels in the gradient calculation.
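For orientation, a minimal sketch of such a train loop could look like this (the actual file on GitHub differs in details; the learning rate, the epoch count and the raw_data variable are placeholders, and preprocess_data / init_weights are described in the next section):

[data, keep_inds, means, stds] = preprocess_data(raw_data);   % raw_data: features x samples
W = init_weights(data);              % random init; pass a few labels for a stable labeling
lr = 1;                              % learning rate, tune for your data
for epoch = 1:200
    [p, gW] = get_grads(W, data);
    W = W + lr*gW;                   % gradient ascent on the mutual information
    fprintf('Epoch %d: I = %.4f\n', epoch, get_mint(p));
end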

Preprocessing and initialization

There are also a few minor tweaks. We will review two of them.

The first one is data preprocessing. You can never be 100% sure about your dataset: it may contain duplicate features, NaNs, useless constant features and so on. That's why it is important to cleanse the inputs before passing them to the learning algorithm.

function [data, keep_inds, means, stds] = preprocess_data(data, keep_inds, means, stds)
if nargin < 2
    means = mean(data, 2);
    stds = std(data, [], 2);
    keep_inds = stds > 0; % no variation => no useful information
    means = means(keep_inds);
    stds = stds(keep_inds);
    fprintf('After preprocessing %d of %d features remained.\n', sum(keep_inds), size(data, 1));
end
data = data(keep_inds, :);
% z-scoring
data = bsxfun(@minus, data, means);
data = bsxfun(@times, data, 1./stds);
data = [ones(1, size(data, 2)); data]; % prepend the bias row

This is one example of a preprocessing function. It removes constant features, i.e. ones that have the same value across the whole dataset. This can reduce the dimensionality of the dataset and thus speed up training. After the useless entries are removed, z-scoring is applied. Z-scoring is a normalization routine that centres features and scales them to unit variance. Having all the features on the same scale is generally a good idea: the features enter the sigmoid as a linear combination, so vast discrepancies in scale could let large-scale features shadow small-scale ones and lead to imprecise models. In fact, similar ideas are used in the MATLAB Neural Network Toolbox (see the documentation for the removeconstantrows and mapminmax functions).
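The extra outputs exist so that exactly the same transform can be applied to new data later, for example (a usage sketch; test_data is a hypothetical held-out set):

[train_data, keep_inds, means, stds] = preprocess_data(train_data);
test_data = preprocess_data(test_data, keep_inds, means, stds);   % reuse the train statistics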

Do you remember how we initialized the weights in our early models? We used small random numbers. For supervised models that worked well. Yet for clustering purposes this approach produced the label-flipping issue: the initialization determined which cluster would be marked with which label. The issue is not critical: you can switch the labels after reviewing the trained model, and often the actual labels don't even matter. Still, it would be good to know ahead of time that, for example, zeros from the MNIST dataset will be marked by near-zero outputs of the model. To achieve this we can train a supervised model on a tiny labeled part of the dataset. If you don't have labels, you can add them manually: for the subset of MNIST that we used in the previous part it is enough to have as little as 2-3 labeled samples per class, still leaving thousands of samples unlabeled. The initialization routine is shown below.

function W = init_weights(data, labels)
% Argument labels is an Nx2 matrix.
% The first column contains the labels (e.g. zeros or ones).
% The second column: indices of samples from data corresponding to the labels.
% So, if the first two samples have the "0" label and the 10th and 13th have the "1" label, then
% labels = [0, 1; 0, 2; 1, 10; 1, 13];
if nargin < 2 || isempty(labels)
    disp('Preinitialization via randomization.');
    W = randn(1, size(data, 1))/10;
else
    X = data(:, labels(:, 2));
    t = labels(:, 1)';
    disp('Training LR.');
    W = train_lr(X, t);
end

The supervised model is trained using the train_lr function that was introduced in part 1. The curious thing is that we took unlabeled data, added a few labels, passed that to our routines and, as a result, obtained a semi-supervised clustering algorithm that produces stable cluster labels. In other words, two runs of training will most likely end up with models assigning the same labels to most of the data. Thus, the label-flipping issue is resolved!
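As a usage sketch (the sample indices below are made up for illustration): label a handful of samples by hand and warm-start the weights from them before the usual train loop:

labels = [0, 15; 0, 102; 1, 8; 1, 250];   % [label, sample index] pairs picked by hand
W = init_weights(data, labels);           % supervised warm start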

End of the post, at last

Wow! It was a long road, but we've made it to the end. Let's summarize our achievements. We've reviewed the logistic regression model and created classification and clustering algorithms based on it. Our models expect only two classes/clusters in the data, which makes them simpler to understand and code, yet limits their power. Using some tricks you can solve multiclass problems even with this dichotomous toolkit; two of many possible ways are a hierarchy of models and the one-vs-all approach.
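For example, a rough sketch of the one-vs-all idea with the supervised train_lr from part 1 (assuming class labels 1..K in a vector t and preprocessed features X with samples in columns) could look like this:

K = max(t);                                      % number of classes
W_all = zeros(K, size(X, 1));
for k = 1:K
    W_all(k, :) = train_lr(X, double(t == k));   % binary model: class k vs the rest
end
probs = 1./(1 + exp(-W_all*X));                  % K x N one-vs-rest probabilities
[~, predicted] = max(probs, [], 1);              % pick the most confident binary model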

After introducing the baseline model for unsupervised learning, we highlighted its drawbacks and eliminated them by changing the cost function and adding a smart supervised initialization of the weights. Most of the code mentioned in this post can be found on GitHub.

At this point you have a decent clustering framework and an understanding of how and why it works. Good luck in your experiments! 😉
