Clustering via mutual information maximization. Part 1

In this series of posts we will train a logistic regression classifier in three ways: supervised, unsupervised, and something in between. We will find out that for some machine learning problems you need only a few labels for your data to get a decent model.

Classification: logistic regression 101

Let's start with a model that you can find in nearly any introductory machine learning course: logistic regression.

The aim of this model is to divide input data into two predefined classes (A and B). Suppose that the input is provided in the form of a column vector x. Then logistic regression estimates the probability p_A that the input corresponds to class A:

    \[ p_A = \sigma(wx+b) = \frac{1}{1 + \exp(-wx - b)}. \]

Where the row vector w and the scalar b are the parameters of the model (named weights and bias). I'll assume that the first element of x is always equal to 1. In this case we can concatenate b and w, forming a row vector of weights W = \{b, w\}. After this operation p_A takes a simpler form:

(1)   \begin{equation*} p_A = \sigma(Wx) = \frac{1}{1 + \exp(-Wx)}. \end{equation*}
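As a quick illustration (the values are my own choosing, not from the post), the augmented form gives exactly the same probability as the original one:

  w = [0.5, -0.3]; % weights (row vector)
  b = 0.1; % bias
  x = [2; 1]; % raw input (column vector)
  W = [b, w]; % concatenated weights {b, w}
  xa = [1; x]; % input with the constant 1 prepended
  p_A = 1/(1 + exp(-W*xa)) % equals 1/(1 + exp(-(w*x + b)))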

Figure 1. Plot of the sigmoid function. This function maps a linear combination of inputs Wx \in \mathbb{R} into a probability p_A \in (0, 1).

Training a regression is the process of estimating the weights that maximize some target function. In order to train a supervised classifier we need input vectors and corresponding labels: for each input vector x we need a label t describing which class x belongs to. Let's assign t = 1 when x is from class A and t = 0 when x is from class B. Also suppose we have N input-label pairs.

What is the distribution of the class probabilities given the features, p(t|x)? It is the Bernoulli distribution:

(2)   \begin{equation*} p(t|x) = p_A^t (1-p_A)^{1-t} = \sigma(Wx)^t (1-\sigma(Wx))^{1-t}. \end{equation*}

The joint probability distribution of features and labels factorizes by the product rule as p(x, t) = p(t|x)p(x).
Now let's find the weights W that maximize the probability p(x, t) over all (x_i, t_i) pairs. In other words, we want to maximize the following function:

    \[ L = \prod_{i=1}^{N}p(x_i, t_i) = \prod_{i=1}^{N}\sigma(Wx_i)^{t_i} (1-\sigma(Wx_i))^{1-t_i}p(x_i). \]

This function is called the likelihood, and the overall approach is maximum likelihood estimation. It is often easier to maximize the logarithm of the likelihood; both reach their maximum at the same point. The log-likelihood is:

    \[ \log L = \sum_{i=1}^{N}\big\{t_i\log\sigma(Wx_i) + (1-t_i)\log(1-\sigma(Wx_i)) + \log p(x_i)\big\}. \]

Note that the third summand under the sum is independent of W, so we can drop it from our target function:

(3)   \begin{equation*} \log L = \sum_{i=1}^{N}\big\{t_i\log\sigma(Wx_i) + (1-t_i)\log(1-\sigma(Wx_i))\big\}. \end{equation*}

One common way to maximize a function is gradient ascent. Gradient ascent is an iterative method: on each iteration it updates the estimate of the optimal parameters (weights), and the size of the update is proportional to the gradient of the function with respect to the parameters. You can think of it as climbing a mountain by choosing the steepest direction at each step.

(4)   \begin{equation*} W_{next} = W_{prev} + \alpha \nabla_W \log L. \end{equation*}

The coefficient \alpha is called the learning rate; it controls the speed of convergence of the algorithm. Using the fact that \sigma'(z) = \sigma(z)(1 - \sigma(z)), the gradient of the log-likelihood (3) is

(5)   \begin{multline*} \nabla_W \log L = \sum_{i=1}^{N}\sigma(Wx_i)(1-\sigma(Wx_i)) \big(\frac{t_i}{\sigma(Wx_i)} - \frac{1-t_i}{1-\sigma(Wx_i)}\big)x_i \\ = \sum_{i=1}^{N}\big\{t_i(1-\sigma(Wx_i)) - (1-t_i)\sigma(Wx_i)\big\}x_i. \end{multline*}

It can be further simplified using the fact that all labels are binary with values either 0 or 1:

(6)   \begin{equation*} \nabla_W \log L = \sum_{i=1}^{N}(t_i - \sigma(Wx_i))x_i. \end{equation*}
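As a quick numerical sanity check (my own addition, not from the post), the per-sample gains in (5) and (6) agree:

  s = 0.7; t = 1; % a hypothetical model output and label
  g5 = s*(1-s)*(t/s - (1-t)/(1-s)) % gain as written in (5), equals 0.3
  g6 = t - s % simplified gain from (6), also 0.3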

Note that gradients (5) and (6) have the following form:

(7)   \begin{equation*} \nabla_W \log L = \sum_{(x, t) \in (X, T)}g(p_A, t)x. \end{equation*}

Here g(p_A, t) is a scalar gain for a given sample from the dataset. The gain depends only on the label and the model output, and it is the same for all components of the vector x. The magnitude of the gain approaches zero as p_A \rightarrow t: the more accurately a sample is classified by the model, the less it affects the weight update. At this point we are done with theory. To build a framework for logistic regression training we need to code formulas (2), (3), (4) and (6).

Coding the logistic regression

All the coding in this post will be done in MATLAB. We will start with a simple piece of code: estimating the class probability given inputs and weights.
Part 1: code for \sigma(x) function.

  function s = sigmoid(data)
   
  s = 1./(1+exp(-data));

Part 2: code for equation (2).

  function p = get_probs(W, data)
   
  p = sigmoid(W*data);

All the code is vectorized so you can pass multiple samples in data argument.
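For example (illustrative values of my own), here are three two-dimensional samples stored as columns, each with the constant 1 already prepended:

  W = [0.1, 0.5, -0.3]; % [b, w] for two-dimensional inputs
  data = [1, 1, 1; 2, 0, -1; 1, 3, 0.5]; % one sample per column
  p = get_probs(W, data) % a 1x3 vector of class probabilities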

The next step is calculating the log-likelihood (3) for a given dataset.

  function ll = get_logll(p, t)
   
  ll = sum(t.*log(p+(p==0)) + (1-t).*log(1-p+(p==1)));

Sometimes p can be exactly zero or one. In that case one of the logarithms turns into -\infty, and for some samples we would get 0*log(0) = NaN instead of the expected zero. To prevent this kind of issue, the argument of the first logarithm has been changed from p to p+(p==0) (and similarly for the second one), so the logarithms, the sum, and the resulting log-likelihood are always finite.
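A quick check (my addition) that the guard keeps the result finite even at the extremes:

  p = [0, 0.5, 1];
  t = [0, 1, 1];
  ll = get_logll(p, t) % equals log(0.5); finite, no NaN from 0*log(0)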

At this point we can make predictions with a model and assess their quality. Now it is time for some training code. Gradient estimation (6):

  1. function gW = get_lr_grads(W, data, t)
  2.  
  3. s = get_probs(W, data); % get probabilities estimates
  4. sfun = t - s; % calculate gain for all samples
  5. gW = mean(bsxfun(@times, data, sfun), 2); % apply gain
  6. gW = gW.';

To get the true gradient of the log-likelihood you would need to replace mean with sum in line 5. As written, gW is the gradient of the mean log-likelihood over all samples, which is N times smaller than the log-likelihood itself, so both reach their maximum at the same point. The reason for using the mean is that the sum would grow unboundedly for large datasets, while it is reasonable to expect the mean log-likelihood to stay finite.
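For reference, the true-gradient variant of line 5 would be:

  gW = mean(bsxfun(@times, data, sfun), 2); % -> sum(bsxfun(@times, data, sfun), 2); % unnormalized gradient of (3)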
Here is the top-level function that performs initialization, updates the weights according to (4), and prints training progress.

  1. function W = train_lr(data, t)
  2.  
  3. max_iter = 100; % maximum number of iterations
  4. min_diff = 0.01; % minimum tolerable change in log-likelihood
  5. alpha = 0.1; % learning rate
  6.  
  7. W = randn(1, size(data, 1))/10; % weights initialization
  8.  
  9. i = 1; % iterations counter
  10. old_ll = -Inf; % previous log-likelihood (-Inf so the first check never triggers)
  11. p = get_probs(W, data);
  12. ll = get_logll(p, t);
  13. fprintf('Iteration #%d\n', 0);
  14. fprintf('--log-likelihood: %2.3f\n', ll);
  15. while i < max_iter % main training loop
  16.     fprintf('Iteration #%d\n', i);
  17.     gW = get_lr_grads(W, data, t); % gradient estimation
  18.     W = W + alpha*gW; % weights update
  19.     p = get_probs(W, data); % outputs of current model
  20.     dist = mean((p>0.5)~=(t>0.5)); % fraction of incorrect predictions
  21.     ll = get_logll(p, t); % current log-likelihood
  22.  
  23.     fprintf('--log-likelihood: %2.3f, errors: %2.2f%%\n', ll, 100*dist);
  24.     if ll-old_ll < min_diff
  25.         break; % no noticeable change, terminate loop
  26.     end
  27.     old_ll = ll;
  28.     i = i+1;
  29.     save('log_reg.mat', 'W'); % save intermediate results
  30. end
  31. fprintf('Results after %d iterations: log-likelihood: %2.3f, errors: %2.2f%%\n', i, ll, 100*dist);

The comments describe most of the code, but a few additional notes are in order. To make predictions it is necessary to make a decision at some point. Having the probability is cool, but you need to decide which class the input belongs to, right? 🙂 See line 20 above. To make a decision we simply check whether the value is greater than 0.5 (50% probability): for higher values assign class A, for lower ones class B. The good thing about this approach is its simplicity: it is very easy to implement. The drawback is that if, for instance, the model outputs exactly a 50% probability, you assign class B even though you don't have enough evidence to be confident in that decision. You don't have enough evidence to make any decision in this case!
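If you want this decision rule as a standalone helper, here is a minimal sketch (the function name predict_lr is my invention, not from the post):

  function labels = predict_lr(W, data)
   
  % Hard 0/1 labels using the 0.5-probability threshold discussed above.
  labels = get_probs(W, data) > 0.5;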

Note the initialization code in line 7. The log-likelihood in our problem is concave, so we can reach its maximum using gradient ascent from any starting point given enough time. In fact we could use any weights in the initialization, for example all zeros. This is not always the case: for more complex models (e.g. neural networks) the starting point is crucial and often determines which of the many local maxima you eventually arrive at.
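For instance, the all-zeros initialization mentioned above would be a one-line change to line 7:

  W = zeros(1, size(data, 1)); % any starting point works for a concave objective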

To conclude this section let's train a model. We will use the MNIST dataset, a well-known playground for machine learning models. I think it would not be far from the truth to say that virtually every machine learning model has been trained on MNIST at least once. It is a labeled dataset of handwritten digits; each sample is a 28x28 grayscale image of a single digit. Here is one example of the digit 0 from the training set.
Figure 2. A digit 0 from the MNIST training set.

You can get MNIST in the original format and read it yourself, or download a version already prepared for MATLAB (43 megs). I used code provided by G. Hinton to create those mat-files.

From now on I'll assume that you have a set of mat-files named digit0.mat - digit9.mat, each containing a matrix D of size Nx784, where N is the number of images of the corresponding digit. Let's read the data for two digits (zero and one in this case), concatenate it into one matrix, and start training.

  load('digit0.mat');
  data0 = D.';
  t0 = zeros(1, size(data0, 2));
  load('digit1.mat');
  data1 = D.';
  t1 = ones(1, size(data1, 2));
  data = [data0, data1];
  t = [t0, t1];
  data = data/max(data(:)); % convert all elements to the [0, 1] range
  data = [ones(1, length(t)); data]; % prepend the constant-1 row for the bias
  W = train_lr(data, t);

Output of this code (your actual numbers may differ):

Iteration #0
--log-likelihood: -10008.723
Iteration #1
--log-likelihood: -5767.701, errors: 19.31%
Iteration #2
--log-likelihood: -4279.538, errors: 7.00%
Results after 2 iterations: log-likelihood: -4279.538, errors: 7.00%

It works! It took a few iterations to converge to a decent result with a 7% error rate. With minor modifications to the training code you can get even more accurate models. For example, by applying a z-score transform to the inputs and removing constant components you can cut the number of errors in half.
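Here is a sketch of that preprocessing (the exact details are my assumption; the post only names the idea). It should be applied to data before the constant-1 row is prepended:

  keep = std(data, 0, 2) > 0; % drop constant components (e.g. always-black pixels)
  data = data(keep, :);
  data = bsxfun(@minus, data, mean(data, 2)); % zero mean per component
  data = bsxfun(@rdivide, data, std(data, 0, 2)); % unit variance per component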

Too easy? No problem, let's make the problem more challenging and remove all the labels.

Clustering: naive approach

Now all you have is a set of images, and you want to divide them into groups. This is called clustering. Suppose that you've reviewed the images and found out that they are handwritten digits: zeros and ones. How would you modify your logistic regression model to handle this case? The naive way is to let the weight update rule (6) figure out the labels itself:

(8)   \begin{equation*} \nabla_W \log L = \sum_{i=1}^{N}(\sigma_+(Wx_i) - \sigma(Wx_i))x_i. \end{equation*}

Where \sigma_+ equals one if and only if \sigma > 0.5, and zero otherwise.

To test this approach we can slightly modify the existing code by replacing line 4 in the get_lr_grads function with the following:

  sfun = (s>0.5) - s;

Note that the gradients are now independent of the labels. The general equation for the gradient (7) remains almost the same; only the form of the sample gain has changed: g(p_A, t) = g(p_A).

Now train the model. You'll find that models trained by this code tend to group all samples into one class. Indeed, having all probabilities close to the extreme points gives gradients of almost zero magnitude. That is a local maximum of the function we are trying to maximize. In fact, the gradient of the form (8) corresponds to the following function:

(9)   \begin{equation*} F_{cl} = \sum_{i=1}^{N}\big\{\sigma_+(Wx_i)\log\sigma(Wx_i) + (1-\sigma_+(Wx_i))\log(1-\sigma(Wx_i))\big\}. \end{equation*}

Recall the log-likelihood (3) for supervised training; the function F_{cl} is its analog for our model. Clearly, when \sigma(Wx_i) is close to zero or one for all samples, this function gets close to zero, which is its maximum. By the way, a model that classifies all images correctly is a local maximum of this function too, so there is a small chance that after training you'll arrive at a decent solution. For this way of training, the initialization of the weights is crucial and completely determines the local maximum that the model arrives at.
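Incidentally (my observation, not from the post), F_{cl} can be computed by reusing get_logll with the model's own hard assignments as pseudo-labels:

  p = get_probs(W, data);
  F_cl = get_logll(p, p > 0.5); % the sum in equation (9)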

Suppose that we know that the numbers of zeros and ones in the dataset are approximately the same. We can then update our label assignment strategy: on each iteration we estimate a threshold value of \sigma that divides the dataset into two groups. One way to get that threshold is to take the mean or median value of \sigma; both should work. Let's change line 4 of the gradient estimation function again:

  sfun = (s>median(s)) - s;

Training yields a model with either a 5-10% or a 90-95% error rate. 95% means that our model has swapped the labels for ones and zeros, and the "real" error rate is 5%. Note that the error rates are in the same range as in the supervised case. We don't even need labels to do well on this task! 🙂
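If you end up with the flipped solution, negating the weights swaps the two classes, since \sigma(-Wx) = 1 - \sigma(Wx). A quick fix (my addition, using the held-out labels only to detect the flip):

  if mean((get_probs(W, data) > 0.5) ~= (t > 0.5)) > 0.5
      W = -W; % negate weights to swap class assignments
  end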

The drawback of the thresholding approach is that it is harder to describe analytically, especially when medians are used. As a consequence, it is harder to reliably extrapolate this method to cases with more clusters.

That's all for now. In the next part we will introduce a clever initialization of the weights that prevents label flipping. We will also derive a similar model that is more open to analysis and extension. Hint: check the definition of entropy in the information-theoretic sense and this post's title. 😉
