We need some harder evidence.
This is where the simple method I alluded to earlier comes into play.
If you were to count the number of 1’s and 0’s in the strings, you would find they both contain the same number of each, which tells you little.
However, for any random process, any sequence within it (e.g. “11” or “101”) should occur with the same probability as any other sequence of the same length.
This is because one number in the string should not influence the next.
If I graph the counts of the different sequences up to a length of two, they look like this:

[Figure: Sequence counting of String 1 (random) and String 2 (non-random)]

As you can see, it is now obvious which string is non-random, as “10” and “01” are much more likely than “11” and “00”.
This makes string 2 somewhat deterministic and therefore predictable: if I know the previous value in the string, I can predict the next one with pretty good accuracy.
In this case, if the last digit was a 1, there is an 80% chance the next will be a 0, and vice versa.
This leads to the structuring which people were picking up on.
I found this method of counting sequences was a nice way to check the randomness of any random number generator you use and quantitatively gauge its randomness.
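As a quick sketch of this sequence-counting check, here is one way it could be written in Python (the function name `count_sequences` and the string lengths are my own choices for illustration):

```python
import random
from collections import Counter

def count_sequences(bits, length=2):
    """Count every overlapping sub-sequence of the given length."""
    return Counter(bits[i:i + length] for i in range(len(bits) - length + 1))

# In a long random bit string, all four two-bit sequences
# ("00", "01", "10", "11") should appear roughly equally often.
random_bits = "".join(random.choice("01") for _ in range(10_000))
print(count_sequences(random_bits))

# A string that alternates almost perfectly is heavily skewed
# towards "01" and "10" -- the signature of non-randomness above.
alternating = "01" * 5_000
print(count_sequences(alternating))
```

The further the counts deviate from a uniform spread, the less random the source looks; repeating the check for longer sequence lengths makes the test stricter.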
Note: for anyone interested, the non-random string was made by me mashing the keys rapidly with my forefingers, hence why a 1 so often follows a 0 and vice versa.
The other string was made using the `random` package in Python.
But Why Should We Care?

Simply put, in Data Science we use a lot of random number generators in our models.
A Random Forest model gets its name because we use randomness to grow the trees, and a K-Means clustering algorithm uses randomness to set its initial positions.
If the algorithm we get our randomness from is not very random, then we might get trapped in local minima, or fail to fully explore the sample space within our data and miss the solution to our questions.
This lack of randomness is a big issue in cryptography, since any predictability leaks information about how things are encrypted and can be used to help break the encryption.
For example, the Enigma machine could encode any letter as any other letter but never as itself; this effectively weakened the encryption, since the solution space was significantly smaller than if any letter had been possible.
The algorithms we use in our data science work to get our random numbers are called pseudo-random number generators (other random number generators do exist but are outside the scope of what is already a long e-mail).
The numbers are generated by an algorithm running inside a computer, and because computers are inherently deterministic, the generators themselves are not truly random.
What I mean by this is that they will eventually repeat, and if you use the same starting number, you always get the same sequence of numbers (this is what setting the seed in code does); poorly designed or basic generators repeat quickly or get stuck.
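This seeding behaviour is easy to see with Python's `random` module: seeding with the same value reproduces the exact same "random" sequence.

```python
import random

# Seeding with the same value always reproduces the same sequence --
# this is what makes seeded results repeatable.
random.seed(42)
first_run = [random.randint(0, 9) for _ in range(5)]

random.seed(42)
second_run = [random.randint(0, 9) for _ in range(5)]

print(first_run == second_run)  # True: identical sequences
```

This determinism is exactly what makes seeded experiments reproducible, and exactly why the generator is only "pseudo" random.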
To demonstrate this, we can look at one of the first random number generation algorithms, the middle-square method, invented by John von Neumann, who needed lots of random numbers for his nuclear research.
The way it works is:

1. Take a four-digit starting number (or pad the front with zeros until it is four digits long).
2. Multiply it by itself.
3. Pad the front of the resulting number with zeros until it is double the original seed size (8 digits).
4. Extract the middle four digits.
5. Use this as the seed to generate the next number (go to step 2).

Sounds great, right? Well, as I said before, some generators can get stuck, and we can see that if I use this algorithm as an input to a random walk program and compare it to a modern pseudo-random generator.
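The steps above can be sketched in a few lines of Python. The function name `middle_square` is my own; the seed 2100 is one of the known values that quickly falls into a short repeating cycle:

```python
def middle_square(seed, n):
    """Generate n numbers with von Neumann's middle-square method."""
    numbers = []
    value = seed
    for _ in range(n):
        squared = str(value * value).zfill(8)  # pad the square to 8 digits
        value = int(squared[2:6])              # extract the middle four
        numbers.append(value)
    return numbers

# Seed 2100 falls into the cycle 4100 -> 8100 -> 6100 -> 2100 -> ...
# and is then stuck repeating those four values forever.
print(middle_square(2100, 8))
# -> [4100, 8100, 6100, 2100, 4100, 8100, 6100, 2100]
```

Other seeds collapse to zero instead (once the value hits 0000, every subsequent number is 0), which is the "getting stuck" behaviour visible in the random walk comparison.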
[Figure: Comparison of a modern and an old random number generator used to generate random walks.]
As you can see, while the Neumann plot starts off well, it eventually gets stuck and repeats itself, whereas the modern one continues.
Therefore, the Neumann method is only useful for much shorter sequences of random numbers than the modern algorithm.
What is my point?

Pseudo-random generators are not truly random; however, they appear random over the time scales we use them for.
- This is plenty good enough in many cases.
- We use them a lot in Data Science work, even if you do not explicitly use them in your code.
- Setting a seed gives reproducible results, which is very useful for designing machine learning algorithms, but be aware it might not be desirable in production.
- Designing a good pseudo-random generator is very hard! Make sure you use the ones people have heavily vetted.

If you are still with me after all this, then I commend you, and you now know a little of the importance (and difficulty) of generating truly random numbers.