Training a neural network on a large dataset can be long and challenging. There are many approaches to reducing training time: parallelization, early stopping, momentum, dimensionality reduction, etc. They speed up convergence, prevent unnecessary iterations, and use hardware resources more efficiently. In this post we'll see how good initialization can affect training. Continue reading PCA-based pretraining of neural networks

# Tag Archives: machine learning

# Notes on “Intriguing properties of neural networks” paper

I recently came across a great paper describing interesting properties of neural networks. It tries to go beyond the "black box" view of ANNs and shows that single neurons and layers can have their own meaning, one that is often comprehensible even to a human. Briefly, each layer produces a space (with the neurons as a basis) in which each vector has semantic information associated with it.

The paper describes one more idea. A neural network (especially a deep one), viewed as a function on the input space, is not always smooth. By smoothness I don't mean the existence of derivatives, but rather the fact that inputs in the vicinity of training set samples can be assigned unexpected class labels. The authors describe how to obtain pairs of visually almost indistinguishable images that the network assigns to different classes (e.g. a bus classified as an ostrich). By applying this procedure it is possible to modify a dataset so that the error rate goes up from, say, 5-10% to 100%! Moreover, this "corrupted" dataset also leads to high error rates for networks that were trained on different samples of the data or have a different architecture than the network used to modify the images.
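To make the idea concrete, here is a minimal sketch of such a search on a toy linear classifier. It is not the paper's procedure (the authors use box-constrained L-BFGS on deep networks); it simply steps against the gradient of the class score with respect to the input until the label flips. The weights `w`, the bias `b`, the "image" `x`, and the function names are all hypothetical, randomly generated stand-ins.

```python
import numpy as np

def predict(w, b, x):
    """Predicted class (0 or 1) of a linear classifier with score w @ x + b."""
    return int(w @ x + b > 0)

def find_adversarial(w, b, x, step=0.01, max_iter=10000):
    """Take small steps against the current class until the label flips."""
    label = predict(w, b, x)
    # The gradient of the score with respect to the input is simply w.
    direction = w / np.linalg.norm(w)
    if label == 1:
        direction = -direction       # lower the score to leave class 1
    x_adv = x.copy()
    for _ in range(max_iter):
        if predict(w, b, x_adv) != label:
            break
        x_adv += step * direction
    return x_adv

rng = np.random.default_rng(0)
w = rng.normal(size=784)             # hypothetical "trained" weights
b = 0.0
x = rng.uniform(size=784)            # hypothetical flattened 28x28 image
x_adv = find_adversarial(w, b, x)
print("label before:", predict(w, b, x), "after:", predict(w, b, x_adv))
print("largest per-pixel change:", np.max(np.abs(x_adv - x)))
```

The sketch does not clip pixel values to a valid range, which a real image attack would need to do.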

Two improvements are proposed for training routines:

- train an ANN, generate modified images and add them to the dataset, then train the ANN again on the updated data;
- add a regularization term to the loss function that compensates for instabilities in the network output.
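The first option can be sketched roughly as follows, assuming a logistic regression model in place of a real ANN and a fixed shift of size `eps` toward the decision boundary in place of the paper's image modification procedure. The data, `train`, `perturb`, and all parameter values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, epochs=200, lr=0.1):
    """Fit logistic regression with plain batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def perturb(X, y, w, eps=0.1):
    """Shift every sample by eps toward the wrong side of the boundary."""
    direction = w / (np.linalg.norm(w) + 1e-12)
    sign = np.where(y == 1, -1.0, 1.0)   # class 1 moves against w, class 0 along w
    return X + eps * sign[:, None] * direction

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = train(X, y)                   # train on the original data
X_mod = perturb(X, y, w)             # build modified samples, labels unchanged
X_aug = np.vstack([X, X_mod])        # add them to the dataset
y_aug = np.concatenate([y, y])
w2, b2 = train(X_aug, y_aug)         # train again on the updated data
```

In practice the generate-and-retrain cycle could be repeated several times, since the retrained network has its own set of hard examples.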

After the first reading it wasn't clear to me what form the regularization term should take. Yet expressions for upper bounds on the instability are provided, so it shouldn't be hard to come up with a solution.
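For reference, the bound can be written roughly as below. This is a paraphrase of the paper's argument rather than a quotation, and the notation is an assumption: $\phi_k$ is the $k$-th of $K$ layers, $W_k$ is its weight matrix, $r$ is a perturbation of the input, and $\|W_k\|$ is the operator norm (largest singular value).

$$
\|\phi(x) - \phi(x + r)\| \le L \, \|r\|, \qquad L = \prod_{k=1}^{K} L_k, \qquad L_k \le \|W_k\| \ \text{for a fully connected ReLU layer}
$$

One candidate regularizer, then, would penalize the norms $\|W_k\|$ so that the product $L$ stays small.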

As a quick and dirty term I'd try gradient regularization, i.e. penalizing large values of the gradient with respect to the input. Yet this could slow down learning, because minimizing such a penalty requires second-order derivatives (Hessian terms).
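Here is a small sketch of what such a penalty could look like, assuming a logistic regression model so that everything can be written by hand. The penalty is the squared norm of the gradient of the loss with respect to the input, weighted by a hypothetical coefficient `lam`; the function names and the sample values are illustrative only. Note the `curv` term in the gradient: it is the second derivative of the loss with respect to the logit, which is where second-order information enters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, b, x, y, lam):
    """Cross-entropy plus lam times the squared norm of the input gradient."""
    p = sigmoid(w @ x + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w             # d(loss)/dx for a logistic model
    return loss + lam * (grad_x @ grad_x)

def objective_grad(w, b, x, y, lam):
    """Analytic gradient of the objective with respect to w and b."""
    p = sigmoid(w @ x + b)
    err = p - y                      # first derivative of the loss w.r.t. the logit
    curv = p * (1 - p)               # second derivative of the loss w.r.t. the logit
    w_sq = w @ w
    dw = err * x + lam * (2 * err * curv * w_sq * x + 2 * err**2 * w)
    db = err + lam * 2 * err * curv * w_sq
    return dw, db

# Hypothetical single training sample and parameters.
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.3, -0.7])
y = 1.0
lam = 0.5
dw, db = objective_grad(w, b, x, y, lam)
```

For a deep network the same quantity would have to come from automatic differentiation applied twice (differentiating through the input gradient), which is the extra cost mentioned above.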

# Clustering via mutual information maximization. Part 2

In the previous post we reviewed the logistic regression model and designed a simple clustering algorithm based on it. We managed to get a decent clustering on a subset of the MNIST dataset. Yet our approach had some drawbacks.

# Clustering via mutual information maximization. Part 1

In this series of posts we will train a logistic regression classifier in three ways: supervised, unsupervised, and something in between. We will find out that for some machine learning problems you need only a few labels for your data to get a decent model.