Scaling of noise in neural networks

To be more precise, this effect appears whenever you have to sum up noisy values. The fundamental rule is that the variance of a sum of independent random variables equals the sum of their variances.

How is this related to neural networks?

Consider an ordinary feedforward neural network with 100 inputs, 10 neurons in the hidden layer and 1 output neuron. To compute the argument of the transfer function of each hidden neuron, you have to add up 100 inputs multiplied by the corresponding weights. Then the 10 outputs of the hidden neurons are weighted and summed again to produce the output of the network. If we assume that each input is an independent random variable with variance S and that the weights are of order one, then the output of the network, as a random variable, will have a variance of roughly 100*10*S! The actual value will differ because of the transfer functions and the weights (a weight w scales variance by w^2, and hidden neurons that share the same inputs are not fully independent), but as a rough estimate the noise variance of the values passed to this network is scaled by a factor of about 1000.
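The estimate above can be checked numerically. Below is a minimal sketch, not a definitive experiment: it assumes a purely linear network (no transfer function), independent Gaussian input noise, and weights drawn at random from {-1, +1} so that every weight has unit magnitude and the hidden outputs are roughly uncorrelated. The names `forward` and `analytic_gain` are made up for this illustration.

```python
import random
import statistics

def forward(x, w_hidden, w_out):
    """Linear 100-10-1 network: weighted sums only, no transfer function."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

def analytic_gain(w_hidden, w_out):
    """Exact output/input variance ratio for independent inputs of equal variance."""
    n_in = len(w_hidden[0])
    # Effective coefficient of each input in the (linear) network output.
    coeffs = [sum(w_out[j] * w_hidden[j][i] for j in range(len(w_out)))
              for i in range(n_in)]
    return sum(c * c for c in coeffs)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_in, n_hidden, n_samples = 100, 10, 2000

# Assumption: weights are +1 or -1 with equal probability.
w_hidden = [[rng.choice((-1.0, 1.0)) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [rng.choice((-1.0, 1.0)) for _ in range(n_hidden)]

S = 1e-4  # variance of the noise on every input
outputs = [
    forward([rng.gauss(0.0, S ** 0.5) for _ in range(n_in)], w_hidden, w_out)
    for _ in range(n_samples)
]

exact_gain = analytic_gain(w_hidden, w_out)
empirical_gain = statistics.variance(outputs) / S
print("exact variance gain:    ", exact_gain)
print("empirical variance gain:", empirical_gain)
```

Both numbers should land in the neighbourhood of 1000, the exact value depending on the particular weights drawn. If all weights are set to exactly +1 instead, the hidden neurons become identical copies of each other and the gain rises to 100*10*10, which shows how much the independence assumption matters.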

Why would the input be noisy?

There are at least two sources of noise:

  • noise captured during measurements;
  • quantization noise.

The first one is related to the way you gather your inputs and ideally could be eliminated or significantly reduced by using precise devices or clever algorithms. The latter is caused by the way information is stored in computer memory. Each input is usually represented as a single or double precision floating-point number; doubles are more common. The good thing about the floating-point format is that it covers a wide range of values. The drawback is that you only have a fixed number of significant digits (6 to 9 for single precision, around 16 for double precision). That introduces uncertainty in the actual values. Beyond some point you cannot distinguish a number like 1 followed by many zeros from the same number plus one, i.e. both are represented by the same floating-point value. This post describes other issues caused by the same effect.
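Here is a small illustration of that last point, using only the Python standard library. The helper `to_float32` is a made-up name for this sketch; it rounds a Python float (which is a double) to the nearest single precision value by packing it into 4 bytes and unpacking it again.

```python
import struct

def to_float32(x):
    """Round a double to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = 100000000.0  # 1 followed by eight zeros
b = 100000001.0  # the same number plus one

# As doubles the two values are different...
print(a == b)                          # False
# ...but in single precision they collapse to the same constant.
print(to_float32(a) == to_float32(b))  # True

# Doubles hit the same wall, just much later.
print(1e16 + 1.0 == 1e16)              # True
```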

One way to represent this uncertainty is to describe a neural network input (IN) as the sum of the actual value (X) and a noise random variable (N): IN = X + N. The noise N has zero mean and a variance large enough to cover all missing digits. So, for single precision floats, the variance of the noise would be between IN*1e-6 and IN*1e-9. Assuming that the measurements were ideal (i.e. the variance of X is negligible), we get that IN and N have equal variances.
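As an immediate consequence of the factor of 1000, the variance of the neural network output would be between IN*1e-3 and IN*1e-6. In other words, we have lost 3 significant digits! This is a pessimistic figure: if precision is measured by the standard deviation rather than the variance, the noise grows by the square root of 1000, roughly 32, which is closer to 1.5 digits.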

If we used a more complicated model and more features, things would get even worse. Suppose we do some image classification task on 32x32 monochrome images and, for some reason, decide to use raw pixels as features. That yields 1024 features. Let's pick a neural network with two hidden layers of 100 neurons each and one output neuron that serves as a class likelihood measure. The same estimate gives a factor of 1024*100*100, about 1e7, so the output of the network will have 7 fewer significant digits than the input. That leaves us unable to use single precision numbers. For doubles, though, values would still be rather precise, with 8-9 significant digits left.
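The "simple math" is just a product of layer fan-ins. A rough sketch, under the same simplifying assumptions as the estimate itself (weights of unit magnitude, independent summands, transfer functions ignored); `estimated_variance_gain` is a hypothetical helper written for this post, not part of any library:

```python
import math

def estimated_variance_gain(layer_sizes):
    """Multiply the fan-in of every layer after the input.

    layer_sizes lists the number of units per layer, inputs first,
    e.g. [100, 10, 1] for the 100-10-1 network.
    """
    gain = 1
    for fan_in in layer_sizes[:-1]:
        gain *= fan_in
    return gain

small = estimated_variance_gain([100, 10, 1])
large = estimated_variance_gain([1024, 100, 100, 1])

for name, gain in (("100-10-1", small), ("1024-100-100-1", large)):
    digits = math.log10(gain)  # decimal digits lost, measured on the variance
    print(name, gain, round(digits, 2), round(digits / 2, 2))
```

The last column is the same loss measured on the standard deviation, which is half as many digits because the standard deviation is the square root of the variance.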

This effect should not matter if you use a reasonable number of features and your network architecture is not too complicated. In other cases, be cautious and stick to doubles. 🙂
