
Prediction of random numbers. Part 1

What is the chance of guessing a random number? For zero-or-one guessing, 50% seems like a fair estimate. But what if you know that the source of randomness is a human being?

That simple clarification makes it possible to achieve a higher score. I'm not sure whether our organisms contain a good source of entropy, but the way a human acts is often somewhat predictable. One of the issues that appears when randomness meets common sense was reviewed in the previous post.

Random numbers are hard to get

If your random number generator is good, then it is challenging to build a model for it: it has no significant correlations, no memory, no periodicity. Well, it could be periodic, but the period should be enormous. Modern random number generators rely on external sources of entropy (mouse movements, keyboard strokes, etc.).

The random number generator gathers environmental noise from device drivers and other sources into an entropy pool. The generator also keeps an estimate of the number of bits of noise in the entropy pool. From this entropy pool random numbers are created.

Even after those complications, the numbers produced are called pseudorandom. And what about the random number generator in your head?

People are biased

Try an experiment on your own. Write down a series of ones and zeroes. Make them random, or at least make them appear random from your perspective. Close your eyes, to avoid being influenced by seeing the previous numbers in the sequence. Also try not to think much: to achieve that, write 2-3 numbers per second. Now observe your result. 🙂

When I was doing that, I often had quick thoughts like "hm.. too many ones, let's do a zero" or switched between anti-symmetric (1, 0, 1, 0, 1) and symmetric (1, 1, 1, 0, 0) regimes. Occasionally I also mentally complained that it is hard to produce a lot of random numbers when you have to choose either zero or one. Here is an example of what you could get.
Figure 1. Example of manual sampling.

There are 64 elements in that sample: 31 ones and 33 zeroes. So the mean value is close to 0.5, which should be common for the majority of sequences. Now we can enumerate all pairs 0-0, 0-1, 1-0, and 1-1 (see the poorly painted table above). It reveals an important fact: the number of 0-1 and 1-0 pairs is twice the number of 0-0 and 1-1 pairs. What does that mean? It means we have a strong negative correlation at lag 1 (i.e. nearest neighbors).
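The pair-counting step can be sketched in a few lines. The original analysis was done in R, and the hand-written sequence from Figure 1 is not reproduced in the text, so this Python sketch uses a short stand-in sequence:

```python
from collections import Counter

# Stand-in for the hand-written sequence (the actual 64-element
# sequence from Figure 1 is not reproduced in the post).
seq = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0]

# Count overlapping neighbor pairs: 0-0, 0-1, 1-0, 1-1.
pairs = Counter(zip(seq, seq[1:]))
for (a, b), n in sorted(pairs.items()):
    print(f"{a}-{b}: {n}")
```

Even in this short stand-in, the mixed pairs (0-1 and 1-0) outnumber the repeated pairs (0-0 and 1-1), the same imbalance observed in the real sample.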

Figure 2. ACF for the generated series.
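The sample autocorrelation behind a plot like Figure 2 can be computed directly. Here is a minimal Python sketch (the post's own analysis used R); the stand-in sequence is strictly alternating, which is why its lag-1 autocorrelation comes out close to -1:

```python
def acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

# A strictly alternating stand-in sequence: the extreme case of
# negative lag-1 correlation.
seq = [1, 0] * 8
print(round(acf(seq, 1), 3))
```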

Indeed, after plotting the ACF it is clear that there are significant correlations at lags 1 and 12. We will leave the latter aside, though it might be possible to extract useful insights from it too. As for the negative acf(1): it means that it is noticeably more likely to find a 1 followed by a 0, or a 0 followed by a 1. This regularity can be utilized to build a naive guesser: if the current element is X, then the next element is predicted to be 1-X. Applying it to the sequence, we get two thirds of the elements predicted correctly. Using R we can check the bounds of the confidence interval for the success rate:

  > binom.test(42, 63)

          Exact binomial test

  data:  42 and 63
  number of successes = 42, number of trials = 63, p-value = 0.01114
  alternative hypothesis: true probability of success is not equal to 0.5
  95 percent confidence interval:
   0.5366192 0.7804625
  sample estimates:
  probability of success
               0.6666667

The lower bound of the confidence interval (53.66%) is close to what we would get by simply tossing a coin to make predictions. But the true probability of success is higher than 50% at the 0.05 significance level. So our model works, and it is more accurate than a coin!
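The p-value reported by binom.test can also be reproduced by hand. Here is a minimal Python sketch of the exact two-sided binomial test, valid only for the symmetric null p = 0.5 (which is what binom.test tests by default):

```python
from math import comb

def binom_p_value(successes, trials):
    """Exact two-sided binomial test against p = 0.5.

    For p = 0.5 the binomial distribution is symmetric, so the
    two-sided p-value is twice the one-sided tail probability.
    Assumes successes >= trials / 2 (upper tail is taken).
    """
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return 2 * tail / 2 ** trials

p = binom_p_value(42, 63)
print(f"p-value: {p:.5f}")  # close to the 0.01114 reported by binom.test
```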

If you collect more sequences, or a longer sequence, the success rate will likely move closer to the lower bound of the confidence interval, because the data contains fancier patterns that this trivial model cannot capture. Nevertheless, there is an important consequence: a human introduces correlations into the data he generates, so it is possible to build a model that effectively predicts future values of a "random" human-generated sequence. One possible explanation is that we remember which elements we sampled at previous moments in time, and that information affects future values.
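This consequence can be illustrated with a toy simulation: model a "human" who flips the previous symbol more often than chance (an assumed model, not real data), and check that the flip-predicting guesser from above beats a coin:

```python
import random

random.seed(42)

def human_sequence(n, p_flip=2/3):
    """Toy 'human': flips the previous symbol with probability p_flip.
    This is an assumed stand-in model, not measured human behavior."""
    seq = [random.randint(0, 1)]
    for _ in range(n - 1):
        prev = seq[-1]
        seq.append(1 - prev if random.random() < p_flip else prev)
    return seq

seq = human_sequence(10_000)

# Naive guesser from the text: predict 1 - current for the next element.
hits = sum((1 - seq[i]) == seq[i + 1] for i in range(len(seq) - 1))
rate = hits / (len(seq) - 1)
print(f"success rate: {rate:.3f}")  # near 2/3, well above a coin's 0.5
```

With p_flip = 0.5 the same guesser drops back to roughly 50%, so the advantage comes entirely from the correlation the "human" introduces.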

In the next part we will assume that people use memory when generating random numbers. That should help us build a more complex and accurate predictive model.
