Wiggly Distributions and Nonparametrics

And then, muuuuuch more speculatively, let’s talk about why it might be bad for a non-parametric estimator.

Scenario One: Timing database queries

Suppose you run an app that lets people read blog posts.

Whenever a reader loads any particular post, your app makes two queries, in parallel, to two separate databases: the first returns the blog-author’s profile picture, and the second the actual text of the blogpost.

You’d like to know which query typically returns first, and by how many milliseconds.

Your hunch is there’s an opportunity to use an idle fetch thread to do some other prep-work while still waiting for its sibling-query to finish.

To see what kind of downtime your faster fetch thread is dealing with, you modify the app to start tracking some stats, over which you’re hoping to perform some non-parametric estimates:

- Track response-time of the profile pic, T_pic
- Track response-time of the blog text, T_text
- Calculate the difference, T_delta := T_pic - T_text, and log the result

Turns out it’s your lucky day: you don’t know it yet, but the distribution underlying T_delta is Gaussian with mean of 100 milliseconds and standard deviation of 60 milliseconds.
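(To make that concrete, here’s a tiny sketch of what the logging boils down to, with numpy’s random number generator standing in for the two databases; the 100 ms and 60 ms figures are the ones from the scenario, and everything else is just illustrative naming.)

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_t_delta(n_requests):
    """Sketch of Scenario One: T_delta is (secretly) Gaussian,
    mean 100 ms, standard deviation 60 ms."""
    return rng.normal(loc=100.0, scale=60.0, size=n_requests)

# A day's worth of page loads, say.
t_delta = simulate_t_delta(10_000)
print(t_delta.mean(), t_delta.std())  # should land near 100 and 60
```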

As Evans shows in her original post, the Gaussian distribution definitely abides by Wasserman’s integral condition.

So as far as Wasserman’s intro would have it, any nonparametric estimates you wanna make over this data should mostly come out okay.

Or at least, you’ll have a good sense of what the performance of that estimator should be: any error bounds and guarantees that Wasserman’s book offers will still apply to your data, since its underlying distribution is so smooth and well-behaved.
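(For reference, and going from memory of how Evans quotes it rather than from the book itself, the integral condition is a Sobolev-style smoothness requirement on the density, roughly:)

```latex
% Recollection of the smoothness class from Wasserman's intro; treat as approximate.
f \in \mathcal{F} = \left\{ f : \int \big( f''(x) \big)^2 \, dx < \infty \right\}
```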

To set us up for a comparison later, here’s what that nice, smooth Gaussian looks like, in terms of distribution and density.

Scenario Two: Timing database queries but sometimes something goes squirrelly

Let’s set everything back up as in Scenario One — two queries, two response-times, log the difference.

Except now imagine, every once in a while, some phantom bug strikes.

For some pathological cases, the picture always takes exactly 234 milliseconds longer to load than the text.

(For instance: maybe a well-meaning developer has included a code path that’s only meant to arise during unit or integration testing.

And for test purposes, it was nice to be able to control the relative database fetch times.

Hence the memorably human magic number of “234.” Except somehow, what was only supposed to happen at test time has leaked out into production!)

In this case, the distribution P(T_delta ≤ t) is going to have a sharp discontinuity at precisely t = 234. Note that what’s important is not the size of that discontinuity, but that it’s discontinuous at all.
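(Here’s a quick simulation of that hybrid, assuming, purely for illustration, that the bug fires on about 1% of requests; the jump in the empirical CDF at 234 ms shows up no matter what that fraction is.)

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_t_delta_buggy(n_requests, p_bug=0.01):
    """Scenario Two sketch: usually the Scenario One Gaussian, but with
    (assumed) probability p_bug the bug fires and T_delta is exactly 234 ms."""
    clean = rng.normal(loc=100.0, scale=60.0, size=n_requests)
    bugged = rng.random(n_requests) < p_bug
    return np.where(bugged, 234.0, clean)

t_delta = simulate_t_delta_buggy(100_000)

# Empirical CDF just below vs. at 234 ms; the gap between them is the jump.
print(f"P(T_delta <  234) ~ {np.mean(t_delta < 234.0):.4f}")
print(f"P(T_delta <= 234) ~ {np.mean(t_delta <= 234.0):.4f}")
```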

Even if only one-in-a-million T_delta’s are pathologically 234 milliseconds, the density chart at right will still have a big, infinitely-tall impulsive spike.
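(Writing it out, with p standing in for whatever fraction of requests hit the bug, the distribution splits into a smooth Gaussian part plus a point mass, and the corresponding “density” picks up a Dirac delta at 234 ms no matter how small p is. The p notation here is mine, purely for illustration.)

```latex
% Phi and phi are the standard normal CDF and density; delta is a Dirac delta.
P(T_\delta \le t) = (1 - p)\,\Phi\!\left(\tfrac{t - 100}{60}\right) + p\,\mathbf{1}\{t \ge 234\}
\qquad
f(t) = \frac{1 - p}{60}\,\phi\!\left(\tfrac{t - 100}{60}\right) + p\,\delta(t - 234)
```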

What’s this mean for non-parametrics?

This distribution strikes me as super unsmooth.

I’m not a great measure theorist, but I don’t believe this counts as differentiable, let alone as having a density that satisfies Wasserman’s constraint.

In fact, in the very next section of the intro, Wasserman talks about handling the case where a density is absolutely continuous, or when the density is discrete, and it makes me wonder if there’s a plan for hybrid cases like our Scenario 2.

Here’s where I gotta cop to not having actually read this book.

Maybe there’s an answer in store!

But that aside, I can take a stab at why a hybrid distribution like Scenario 2 is particularly tricky in terms of playing nice with nonparametrics.

Imagine we’re performing kernel density estimation.

This is a lot like taking a histogram, but without the harsh boundary jumps you see between bins.

The intuition is that if we made an observation of, say, T_delta = -30.5 ms, we should smear that around a little bit. Treat it as if it’s telling us, “not only is -30.5 ms a possible outcome, but that probably means nearby values are also possible.” We treat the appearance of -30.5 ms in our dataset as a reason to increase our belief that -30.8 ms, and -29.2 ms, and other similar values are likely to arise in the future.
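(Here’s what that smearing looks like in practice, using scipy’s gaussian_kde as a stand-in kernel density estimator; the specific values are the ones from the paragraph above, and the bandwidth is just scipy’s default.)

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Scenario One data: draws from the smooth Gaussian T_delta.
t_delta = rng.normal(loc=100.0, scale=60.0, size=2_000)

# gaussian_kde places a small Gaussian kernel on every observation, so a draw
# near -30.5 ms also nudges up the estimated density at -30.8 ms, -29.2 ms, etc.
kde = gaussian_kde(t_delta)

for x in (-30.8, -30.5, -29.2):
    print(f"estimated density at {x:6.1f} ms: {kde([x])[0]:.6f}")
```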

For a nice, smooth distribution, I can see how that would be the case.

I bet that for nice, smooth distributions, you can arrive at some nice error bounds for how incorrect your kernel density estimate is after viewing N draws from the distribution.

The smoothness really does mean that the probability of observing something near x is just about the same as the probability of observing something near x ± ε.

Scenario 2 doesn’t work like that, not in the same way.

Because there’s that one magic number — 234 milliseconds — where any observation of that value is just of a completely different character than the rest of the T_delta space.

The density just changes too rapidly around 234 ms for those samples to power a healthy kernel density estimate.

And that’s true no matter the size of the distribution’s discontinuity gap.

For another nonparametric estimator, consider bootstrap sampling in Scenario 2.

The cases where your bootstrap subpopulation doesn’t contain any bogus 234 ms points look exactly like Scenario 1 — and maybe those cases are different enough from the ones that do that it’s harder to combine them with the other samples.

How often does that 234-less condition arise? What about the exclusively-234 case? What do these crazy-corner-case samples do to your estimation procedure?

So your error bounds for Scenario 2 will now hinge not just on the number of draws N you’ve observed, but also on how common it is to draw those pathological 234 ms cases.
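(To put a rough number on the first of those questions, here’s a back-of-the-envelope plus a simulation. The sample size and bug rate are made up for illustration; nothing in the scenario pins them down.)

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1_000       # observed sample size (illustrative)
p_bug = 0.001   # assumed fraction of pathological 234 ms draws (illustrative)

# Build a dataset with exactly round(n * p_bug) pathological points.
k_bugged = int(round(n * p_bug))
data = np.concatenate([
    np.full(k_bugged, 234.0),
    rng.normal(loc=100.0, scale=60.0, size=n - k_bugged),
])

# Back-of-the-envelope: chance a bootstrap resample of size n misses all of them.
p_no_234 = ((n - k_bugged) / n) ** n
print(f"P(resample has no 234s), analytic  ~ {p_no_234:.3f}")

# Sanity check by actually resampling.
resamples = rng.choice(data, size=(5_000, n), replace=True)
frac_without = np.mean(~(resamples == 234.0).any(axis=1))
print(f"P(resample has no 234s), simulated ~ {frac_without:.3f}")
```

With one pathological point per thousand observations, roughly a third of your bootstrap resamples contain no 234s at all, so the bootstrap really does split into two quite different sub-populations.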

And the error bounds in Wasserman’s book probably don’t leave room for that degree of freedom.

Although, if the book covers all of nonparametric statistics, there must be something about these kinds of hybrid densities! I should… read the book.

