Testing random code

Simulation algorithms, like those used in various types of computational modelling, are often non-deterministic: they are meant to perform different computations each time, exploring the different possible events that can happen.

This is achieved by relying on a random number generator (RNG), another algorithm or framework that, when requested, will produce numbers in a way that is difficult to predict — hence, essentially random.

These numbers effectively form part of the main code’s input, ensuring it runs differently each time.
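To make this concrete, here is a minimal Python sketch (the toy simulate function below is an invented stand-in for a real model, and I use NumPy's random generator):

```python
import numpy as np

def simulate(n_steps, rng):
    """Toy 'simulation': a symmetric one-dimensional random walk."""
    steps = rng.choice([-1, 1], size=n_steps)  # each step is drawn from the RNG
    return steps.sum()                         # final position of the walk

rng = np.random.default_rng()   # unseeded: its state differs on every run
print(simulate(1000, rng))      # so this prints a different value each time
```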

In this case, the varying output is an integral part of the code’s intended behaviour.

Unfortunately, that same variability complicates things when it comes to reasoning about the code.

Scientific code is still code and, although often written under different conditions than industrial software, should be treated with the same attention.

As part of that, it should be accompanied by tests, to ensure not only that it does what it is supposed to in its current form, but also that changes to it do not break this behaviour.

Randomness, however, complicates this: tests (whether of a small unit of code or end-to-end) usually rely on having an expected, correct result to compare the output of the execution to.

This makes it much harder to reason about the correctness of code that uses RNGs!

This is something that I, and I’m sure many others, have faced in our work.

In this post, I will try to discuss some ideas about countering this and show how they work (and when they don’t).

But first, a clarification: there are many strategies, theoretical and applied, for verifying the correctness of RNGs.

This post will not discuss these — indeed, it will assume that we have a (correct!) RNG that we can call, and will focus instead on how to check the correctness of our own random code.

Set the seed

If you’ve read this far, and have tried something similar in the past, you’re probably already saying “Just set the seed and be done with it!”.

The seed is a value that reflects the state of an RNG: for a particular seed value, the sequence of numbers produced is always the same.
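For instance, with NumPy (a sketch; other libraries expose the same idea through slightly different APIs):

```python
import numpy as np

rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

# Two generators initialised with the same seed produce identical sequences.
assert np.array_equal(rng_a.random(5), rng_b.random(5))
```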

This makes for a simple method to sanity-check your code (sketched in code below):

1. Record the current value of the seed (or set it to something arbitrary).
2. Run the code and record the results.
3. Set the seed to the value from step 1.
4. Run the code again and check the results.
5. If the results differ from those of step 2, fail; otherwise, pass.
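Here is a minimal pytest-style sketch of that procedure, reusing the toy simulate function from earlier (both are illustrative, not a prescribed implementation):

```python
import numpy as np

# Assumes the toy simulate() function from the earlier sketch is in scope.

def test_simulation_is_reproducible():
    seed = 12345                                          # step 1: fix an arbitrary seed
    first = simulate(1000, np.random.default_rng(seed))   # step 2: run and record
    second = simulate(1000, np.random.default_rng(seed))  # steps 3-4: reset seed, rerun
    assert first == second                                # step 5: fail if they differ
```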

By resetting the seed, we are essentially removing the effect of the randomness.

This has the advantage that the expected results (steps 1–2 above) can be precomputed once and stored with the test code, ready to be used for verification any number of times in the future.
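In practice that might look like a “golden value” test, where the stored result was recorded once from a trusted run of your own code (the number below is purely illustrative):

```python
import numpy as np

# Recorded once from a trusted run with this exact seed and code version.
# (The value below is purely illustrative, not a real recorded result.)
EXPECTED_FINAL_POSITION = -14

def test_simulation_matches_recorded_output():
    rng = np.random.default_rng(12345)
    assert simulate(1000, rng) == EXPECTED_FINAL_POSITION  # simulate() from earlier
```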

Simple? Certainly.

But there’s a problem: what if we change our code to make different requests to the RNG? Assume, for example, that we discover an optimisation that lets us call the RNG fewer times, or that we refactor our code to compute values in a different order.

Suddenly, the sequence of random numbers no longer corresponds to the original code, the output will be different and, if we have a testing framework set up, the tests will fail.

This is to be expected, but it means that we don’t know whether the failures are simply due to the different stream of random numbers, or to some mistake we have made when changing the code! The tests therefore no longer ensure the correctness of our code.
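To see the brittleness concretely, here is a sketch in which a harmless refactor, drawing the same two numbers in the opposite order, changes the output for a fixed seed:

```python
import numpy as np

def model_v1(rng):
    a = rng.random()   # first draw
    b = rng.random()   # second draw
    return a - b

def model_v2(rng):
    b = rng.random()   # the same two draws, requested in the opposite order
    a = rng.random()
    return a - b

# Statistically the two versions are equivalent, but for a fixed seed they
# disagree, so a seed-based regression test would flag this refactor as broken.
print(model_v1(np.random.default_rng(0)))
print(model_v2(np.random.default_rng(0)))
```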

Check the mean

In one sense, the previous problem occurs because we are checking just one run of our random code.

We know that the results are liable to change — wouldn’t it be better to run it multiple times, and verify the average result? In contrast to a single run, it’s possible that we may even know of an analytical expression for what that result should be (mathematically, the expected value of the output).

If we don’t, we could run the algorithm multiple times and compute the average from that.

This certainly provides some statistical information, but it is not always sufficient — after all, sometimes the variance characteristics of a method are just as important as its mean.

How do we know what statistics we should look at? Furthermore, there is still the question of how many runs are enough: with too few, our test results will likely be unreliable, but too many will slow down what should be an easily repeatable process.
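For what it’s worth, here is one possible shape for such a check, using the toy random walk from earlier: its final position has expected value zero, and the standard error of the sample mean gives a principled tolerance (the number of runs and the four-standard-error threshold are judgment calls, not a rule):

```python
import numpy as np

def test_mean_final_position_is_roughly_zero():
    n_runs = 500
    rng = np.random.default_rng(0)   # seeding the test keeps it repeatable
    results = np.array([simulate(1000, rng) for _ in range(n_runs)])

    mean = results.mean()
    stderr = results.std(ddof=1) / np.sqrt(n_runs)

    # The symmetric walk's final position has expected value 0. Allowing
    # ~4 standard errors makes spurious failures rare while still catching
    # any systematic bias larger than a few step lengths.
    assert abs(mean) < 4 * stderr
```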

Of course, doing something is an improvement over having no checks at all, but is there anything more detailed that we can do?

Verify the properties

Here is where things start getting more involved — or interesting, if you’re into that sort of thing! It may be that looking into your algorithm in detail can give you some insight into how to verify that it is properly implemented.

For example, John Geweke has proposed a framework for checking implementations of Monte Carlo simulators, based on the mathematical properties of the algorithm.
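Geweke’s framework itself doesn’t fit in a snippet, but here is a much simpler check in the same spirit: if our code claims to draw from a known distribution, we can test that distributional property directly. The sample_normal function below is a hypothetical sampler under test; the check uses SciPy’s Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy import stats

def sample_normal(n, rng):
    """Hypothetical sampler under test: is supposed to draw from N(0, 1)."""
    return rng.normal(loc=0.0, scale=1.0, size=n)

def test_samples_match_claimed_distribution():
    rng = np.random.default_rng(0)
    samples = sample_normal(10_000, rng)
    # Kolmogorov-Smirnov test against the claimed standard normal target.
    result = stats.kstest(samples, "norm")
    # A very small p-value would mean the samples don't follow N(0, 1).
    assert result.pvalue > 0.001
```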

On the one hand, this approach can give very powerful, specific results.

On the other, it cannot be applied wholesale to other algorithms, and replicating it for a new method requires deep familiarity with the theory behind it.

However, that’s where the scientists developing the methods (or, for more established ones, those studying and using them) can play a role! By bringing together a deep understanding of the theory with a commitment to reproducibility and reliability, we could arrive at a robust testing approach that would be of benefit to everyone.

So — how do you test your random code?
