The answer, of course, is that we need a “good enough” alternative.
We’re sampling after all, so the level of accuracy is usually flexible if it means performance gains.
So let’s see if we can come up with a different approach.
The goal will be to bootstrap 60% of the 150,000 records in the “SNOWFLAKE_SAMPLE_DATA”.
We’d like to do one iteration on the table (in the algorithmic sense), and therefore make a decision on each row just once in isolation if possible.
We’ll need to come up with a way of selecting it potentially multiple times, but keeping within the parameters of the sample.
A simple modification, if we decided arbitrarily that ten sweeps of the table was good enough, then we could just pick 6% each time, like this:This isn’t very good at all.
For starters, we dealt with every row ten times, and we haven’t really quantified the inaccuracy at all, we just made up a number.
So let’s have a look at what happens for each row each time we do an iteration, and we’ll consider the two extremes: always getting selected, and never getting selected.
After a certain number of iterations, inevitably there’s a point where (as we’d say in Australia) each record has had a “fair go”.
So we need a way of drawing that line.
For the purpose of this exercise, let’s allow the user to specify two things:The desired size of the sample, as a percentage of the original tableA percentage value, where we stop iterating when the probability of never being selected dips below it.
Imagine this as a horizontal line you can place somewhere on the chart above.
Equation time!As a reminder, the goal is a 60% sample.
Forgetting the “with replacement” part for a second, the probability of a row getting selected every time:Or of not getting selected, just flip it around:1–0.
6 = 0.
4The value on the right-hand side is what we want the user to specify.
As you can see, it always decreases, and we will keep adding iterations until we reach it.
So we can express this generally, then solve it to figure out how many iterations we should do:Great, now we have a formula to use.
Now, let’s get back to Snowflake.
Remember, we don’t actually want to do multiple iterations.
But the tricky part is, there isn’t any way to make a row appear more than once in a result set….
or is there?!?!?!Enter UDTFsI really like UDTFs, they have a way of enabling elegant solutions instead of awful boilerplate.
Check this one out:We can use this to kill two birds with one stone.
First of all, it will apply that formula to work out a good number of iterations (i.
number of chances to be selected) based on that threshold we invented.
Then, it is courteous enough to divide the sample fraction by that number, and write out that number of rows.
To demonstrate, the UDTF output on its own looks like this:select * from table(SAMPLE_PROBABILITY(0.
00001::double))The formula says we should do 13 iterations, and 0.
6/13 = 0.
04615384615In other words, we will give each row 13 chances at 0.
04615384615 probability per try.
All we need to do is join to this UDTF like so:The raw result of:SELECT customer.
00001::double))is actually 150,000 * 13 = 1,950,000 rows, but then we add the WHERE clause:where uniform(0::float, 1::float, random()) < SAMPLE_PROBABILITYThis generates a random number between 0.
0 and 1.
0 and removes those rows below the threshold.
A 100% correct implementation should give us 90,000 rows (150,000 * 0.
This one (with the inputs I chose) seems to yield somewhere between 89,000 and 91,000 each time.
And of course, just as we wanted, there are the chance of duplicates.
You can even see one in the first page of results:And on our 150,000 table, to return the ~90k records, is a little over a second on an x-small warehouse.