Suppose that you are provided with two binary sequences, {0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1} and {0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0}. Which one is more likely to have been generated by tossing a fair coin (i.e. a Bernoulli process with success probability 0.5)?

Both sequences have the same probability of being generated, equal to (1/2)^14 ≈ 0.00006 (i.e. very unlikely). But common sense gives a hint that the second one looks more "random". Indeed, one sequence was generated by repeating the 0-1 pattern and the other by an R command:

```r
> rbinom(14, 1, .5)
 [1] 0 1 1 0 0 0 0 1 1 1 0 0 1 0
```

Though, I cheated by re-running it until I got a sample looking natural enough. 🙂
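The probability figure above is easy to check directly in R:

```r
# Probability of any particular 14-element sequence
# under a fair Bernoulli process: (1/2)^14
0.5^14
# approximately 6.1e-05
```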

The same common sense suggests that all-zeros or all-ones sequences are not that "random" either. Why?

The key to this common sense is that, while looking at a sequence, a person examines its macro-characteristics. In other words, we look for quantities that describe the data as a whole, and then we decide whether that description seems likely. The notion of likeliness comes from the fact that some macro-states can be reached from a vast number of micro-states (sequences), while other macro-states have only one corresponding micro-state. The number of micro-states representing a macro-state (also called its statistical weight) can be treated as a probability if normalized properly.
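As a quick illustration of statistical weight, the macro-state "exactly k ones out of 14" is realized by choose(14, k) distinct micro-states:

```r
# Statistical weight of the macro-state "exactly k ones out of 14":
# the number of distinct binary sequences realizing it.
choose(14, 0)   # all zeros: a single micro-state
choose(14, 7)   # balanced sequence: 3432 micro-states
```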

The simplest example is the number of ones in a sequence (equivalently, its mean value). Given the way the sequence is generated, it seems natural to expect it to have the same, or nearly the same, number of ones and zeros. You can check that, among 14-element sequences, roughly every fifth sequence has exactly the same number of ones and zeros. Furthermore, 82% of sequences have from 5 to 9 ones.
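The "every fifth sequence" claim can be verified directly:

```r
# Fraction of 14-element sequences with exactly 7 ones
choose(14, 7)/2^14
# approximately 0.21, i.e. about every fifth sequence
```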

```r
> sum(choose(14, 5:9))/2^14
[1] 0.8204346
```

So you will rarely see fewer than 5 or more than 9 ones. That explains our "sense" that sequences similar to all-zeros do not seem "random". But what about the more complex case of the 0-1 pattern? Its mean value is 0.5, which sounds like a neat value. Nevertheless, the sequence looks odd due to its periodicity. Let's study this case.

A common way to check for patterns in a series is to examine its autocorrelation function (acf). It reveals the similarity of values at a certain lag. For example, the acf at lag 2 shows how similar entries standing two elements apart are (first & third, second & fourth, etc.). We can easily calculate acf values in R and plot them.

```r
> t <- c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
> acf(t)
```

I have used a longer sequence to plot this image, but for 14 elements it would look similar. Values between the two horizontal lines are statistically insignificant: such values are likely to occur if an arbitrary sequence is chosen. Recall the analysis of the number of ones, where most sequences had close numbers of ones and zeros. A similar regularity exists here: most sequences lead to acf values close to zero. The longer the sequence, the closer the values get to the axis.
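The horizontal lines drawn by acf correspond approximately to the bounds ±1.96/√n; a minimal sketch of the computation for the 28-element sequence used above:

```r
# Approximate 95% significance bounds used by acf(): +/- qnorm(0.975)/sqrt(n)
n <- 28
qnorm(0.975)/sqrt(n)
# approximately 0.37
```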

High positive values at even lags mean that if the n-th element is a one, then it is very likely that ones appear at positions n+2, n+4, n+6, etc. The negative value at lag 1 suggests that adjacent elements should not be the same. Both facts are indeed true for the sequence considered.
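The numbers behind the plot can be inspected without plotting; for the alternating sequence, the lag-1 value is close to −1 and the lag-2 value close to +1:

```r
# Numeric acf values for the alternating 28-element sequence
t <- rep(0:1, 14)
r <- acf(t, lag.max = 2, plot = FALSE)$acf
r
# lag 0 is always 1; lag 1 is about -0.96, lag 2 is about +0.93
```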

The presence of significant correlations is a rare thing in an arbitrarily chosen sequence, so one more piece of "common sense" is explained.

To conclude the post, I'll repeat the thesis from the beginning. Both sequences are equally likely (that is, equally unlikely) to be sampled. When a person examines a sequence, they look at its macro-features and estimate how probable those features are for a randomly sampled sequence. That is the source of "common sense".
