Training a neural network on a large dataset can be slow and challenging. There are many approaches to reducing training time: parallelization, early stopping, momentum, dimensionality reduction, etc. They speed up convergence, prevent unnecessary iterations, and use hardware resources more efficiently. In this post we'll see how a good initialization can affect training.

Parameter initialization plays an important role in model training. Even for convex loss functions, where the global minimum is reachable from any starting point, a good initial guess can significantly reduce training time. For more complex models, a bad initialization can lead to a local minimum with rather high loss. One common practice is to initialize the weights with small random numbers. This avoids most of the dangers, because the chance of drawing really bad initial values is small. Yet you should not expect any benefit from this kind of starting point either. Let's investigate and try to come up with a possibly more productive way to choose the initial parameters.

One last note before we move on: there are some clever methods for initialization via pretraining. One of them is to train a layer in an unsupervised manner as an RBM and then proceed to supervised training. It is an interesting and effective approach, yet we will not cover it in this post. Still, we'll do a kind of pretraining too.

#### Principal components analysis in a nutshell

Suppose that your feature space is high-dimensional, e.g. an ANN has thousands of inputs. A natural approach is to reduce dimensionality before training. This speeds up training, usually at the cost of accuracy. However, it is often possible to cut the number of inputs by up to 50% with only a tiny loss in classification quality. A well-known way to do dimensionality reduction is principal components analysis (PCA). The trick is to find the axes along which most of the data variation is observed and then project the data onto a lower-dimensional subspace. Think of noisy data scattered along a straight line in 3-D space. That line would be a good choice of axis for a 1-D representation of the data. Assuming that the data is already z-scored, the algorithm for PCA is straightforward:

- rotate the basis axes so that they coincide with the directions in which the data changes the most;
- keep only the dimensions along which the data varies noticeably.

In the linear algebra sense, the first step of the algorithm (rotation) is a multiplication by some matrix. Luckily, MATLAB already has the routine needed to find that matrix: singular value decomposition (SVD). Moreover, the columns of the matrix returned by the *svd()* call are already sorted so that the first columns correspond to the directions with higher variability. So the second step is to keep only the first few columns. Multiplying by the resulting matrix reduces the dimensionality of the input.
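The two steps can be sketched in a few lines. This is a minimal illustration in Python with NumPy rather than MATLAB, and it takes the SVD of the z-scored data matrix itself, which is an assumption about the setup rather than the exact call used for this post. The function name `pca_basis` and the toy data are hypothetical.

```python
import numpy as np

def pca_basis(X, k):
    """Z-score X (n_samples x n_features) and return its top-k principal directions.

    Returns the mean and std used for z-scoring, a basis of shape
    (n_features, k) with one direction per column, and all singular values.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid division by zero
    Z = (X - mean) / std
    # Rows of Vt are the principal directions, sorted by decreasing singular value.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:k].T, s

# Toy data: noisy points scattered along a straight line in 3-D space.
rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
direction = np.array([[1.0, 2.0, -1.0]])
X = t * direction + 0.05 * rng.normal(size=(500, 3))

mean, std, U, s = pca_basis(X, k=1)
X_reduced = ((X - mean) / std) @ U           # shape (500, 1)
explained = s[0] ** 2 / np.sum(s ** 2)       # share of variance kept by one axis
```

Here a single axis should retain nearly all of the variance, which matches the "line in 3-D space" picture above.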

#### The trick

Suppose that we have an input vector $x$. Let $U$ be the matrix returned by the *svd()* call (with only the first few columns kept), and let $\tilde{x} = U^T x$ be the input after the PCA transformation. We want to train a network layer on PCA-preprocessed data. The weights of the layer being trained are denoted by the matrix $W$. The forward pass for this layer involves the following computation:

$$W \tilde{x} \tag{1}$$

On the other hand, this is the same as calculating:

$$W U^T x \tag{2}$$

In other words, it is equivalent to a larger layer trained on the initial non-PCA inputs with weights $W U^T$.

We can exploit this fact by training a smaller classifier on the dataset of reduced dimensionality, multiplying its weights by $U^T$, and then using the resulting weights as a starting point for training on the non-reduced dataset. Hopefully, the initial model will capture the outline of the dataset and significantly simplify the task for the more complex one.
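The equivalence is easy to check numerically. The sketch below uses hypothetical names (`U` for the truncated PCA basis, `W_small` for the weights of the layer trained on reduced inputs, `W_full` for the extended weights) and random numbers in place of a trained model, so it only demonstrates the algebra, not a training result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k, n_outputs = 20, 5, 3

# Stand-in for a truncated PCA basis: any matrix with k columns works for the
# algebra; in practice it would come from an SVD of the training data.
U, _ = np.linalg.qr(rng.normal(size=(n_features, k)))   # shape (n_features, k)

# Stand-in for the weights of a layer trained on PCA-reduced inputs.
W_small = rng.normal(size=(n_outputs, k))

x = rng.normal(size=n_features)   # one raw (already z-scored) input
x_reduced = U.T @ x               # the same input after the PCA projection

# Extend the weights so that the layer accepts raw inputs directly.
W_full = W_small @ U.T            # shape (n_outputs, n_features)

out_small = W_small @ x_reduced   # forward pass on the reduced input
out_full = W_full @ x             # forward pass on the raw input
```

Both forward passes produce the same pre-activations, so the extended layer starts exactly where the small one finished, provided the raw inputs are z-scored the same way as the data used for PCA.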

#### Experiment

I've trained three groups of logistic regressions on a subset of MNIST:

- trained on images pre-processed by PCA;
- pretrained on images pre-processed by PCA, then the weights are extended to the size corresponding to raw images and training continues (now on raw images);
- trained on raw images.
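The three setups can be outlined roughly as follows. This is not the code used for the experiment: it is a small NumPy sketch on synthetic data instead of MNIST, with a bias-free softmax regression trained by batch gradient descent, and every name and hyperparameter in it (`train`, `error_rate`, the learning rate, the number of epochs) is an assumption made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, W, lr=0.1, epochs=100):
    """Batch gradient descent for softmax regression with a fixed learning rate."""
    Y = np.eye(W.shape[0])[y]              # one-hot targets, shape (n, classes)
    for _ in range(epochs):
        P = softmax(X @ W.T)
        W = W - lr * (P - Y).T @ X / len(X)
    return W

def error_rate(X, y, W):
    return np.mean(np.argmax(X @ W.T, axis=1) != y)

# Synthetic stand-in for the dataset: 3 classes in 30 dimensions, z-scored.
rng = np.random.default_rng(0)
n, d, k, c = 600, 30, 5, 3
y = rng.integers(0, c, size=n)
X = rng.normal(size=(c, d))[y] + rng.normal(size=(n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Truncated PCA basis from the SVD of the data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
U = Vt[:k].T                               # shape (d, k)

# Group 1: train on PCA-reduced inputs only.
W_pca = train(X @ U, y, np.zeros((c, k)))
# Group 2: extend the pretrained weights, then continue on raw inputs.
W_pre = train(X, y, W_pca @ U.T)
# Group 3: train on raw inputs from small random weights.
W_raw = train(X, y, 0.01 * rng.normal(size=(c, d)))

errors = [error_rate(X @ U, y, W_pca), error_rate(X, y, W_pre), error_rate(X, y, W_raw)]
```

The synthetic data is far easier than MNIST, so the numbers in `errors` say nothing about the comparison itself; the sketch only shows how the three training procedures differ in their inputs and starting weights.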

All the classifiers used the same fixed learning rate, which makes the starting point of the learning algorithm important. With a gradually decreasing learning rate, all the classifiers from group 2 and group 3 would reach similar accuracy.

The following image shows a box plot of error rates for each group.

*Figure 1. Error rates for three groups of classifiers.*

The PCA-based group is less accurate than the other two groups, yet it sometimes achieves decent results. The difference between group 2 and group 3 is not big, but it is consistent: in most cases the model with PCA pretraining yields higher accuracy.

Code for training the classifiers from group 2 can be found in my GitHub repo. It is worth noting that PCA involves rather costly operations. So, if you have a lot of data, it might be wiser to use a smaller subset for dimensionality reduction, say, 3-5 times the initial number of dimensions in size. It should save you a lot of computation time without hurting the accuracy much.
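One way to do the subsampling is sketched below. The function name `pca_basis_from_subset`, the `factor` parameter and the toy data are hypothetical, and the input is assumed to be z-scored already.

```python
import numpy as np

def pca_basis_from_subset(X, k, factor=4, seed=0):
    """Estimate the top-k PCA basis from a random subset of the rows of X.

    The subset holds factor * n_features samples (or all of them if X is
    smaller). X is assumed to be z-scored already.
    """
    n_samples, n_features = X.shape
    m = min(n_samples, factor * n_features)
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_samples, size=m, replace=False)
    _, _, Vt = np.linalg.svd(X[idx], full_matrices=False)
    return Vt[:k].T                      # shape (n_features, k)

# Toy data with three strong latent factors in 20 dimensions.
rng = np.random.default_rng(1)
latent = rng.normal(size=(5000, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(5000, 20))
X = (X - X.mean(axis=0)) / X.std(axis=0)

U_subset = pca_basis_from_subset(X, k=3)            # uses 80 of 5000 rows
_, _, Vt_full = np.linalg.svd(X, full_matrices=False)
U_full = Vt_full[:3].T

# Cosines of the principal angles between the two subspaces (1 means identical).
alignment = np.linalg.svd(U_full.T @ U_subset, compute_uv=False)
```

On data with a clear low-dimensional structure like this toy example, the subspace found from the subset should nearly coincide with the one found from the full dataset; how well that carries over to real data depends on how noisy the data is.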