Traditional methods of data de-identification obscure data values.

For example, you might truncate a date to just the year.

Differential privacy obscures query values by injecting enough noise to keep from revealing information on an individual.

Let’s compare two approaches for de-identifying a person’s age: truncation and differential privacy.

Truncation

First consider truncating birth date to year.

For example, anyone born between January 1, 1955 and December 31, 1955 would be recorded as being born in 1955.

This effectively produces a 100% confidence interval that is one year wide.
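
As a minimal sketch of the truncation described above, the following keeps only the year of a birth date and recovers the one-year range of dates consistent with it. The function names `truncate_to_year` and `year_interval` are hypothetical, chosen only for illustration.

```python
from datetime import date


def truncate_to_year(birth_date):
    """Keep only the year of a birth date, discarding month and day."""
    return birth_date.year


def year_interval(year):
    """Return the earliest and latest birth dates consistent with a truncated year."""
    return date(year, 1, 1), date(year, 12, 31)


# Illustrative birth date, not taken from any real data set.
released = truncate_to_year(date(1955, 7, 4))
low, high = year_interval(released)
print(released, low, high)
```

Every birth date in the returned range maps to the same released value, which is why the interval has 100% coverage.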

Next we’ll compare this to a 95% confidence interval using ε-differential privacy.

Differential privacy

Differential privacy adds noise in proportion to the sensitivity Δ of a query.

Here sensitivity means the maximum impact that one record could have on the result.

For example, a query that counts records has sensitivity 1.

Suppose people live to a maximum of 120 years.

Then in a database with n records [1], one person’s presence in or absence from the database would change the average age by no more than 120/n years, the worst case corresponding to the extremely unlikely event of a database of n-1 newborns and one person 120 years old.
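
The worst case can be checked directly. This sketch assumes the query is the average age and that ages are bounded by 120; the helper names are hypothetical.

```python
MAX_AGE = 120  # assumed upper bound on age, as in the text


def mean(values):
    return sum(values) / len(values)


def worst_case_change(n):
    """Change in average age when a 120-year-old joins n-1 newborns (needs n >= 2)."""
    without_elder = [0] * (n - 1)
    with_elder = without_elder + [MAX_AGE]
    return abs(mean(with_elder) - mean(without_elder))


for n in (2, 10, 1000):
    # The second column is the bound 120/n for comparison.
    print(n, worst_case_change(n), MAX_AGE / n)
```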

The Laplace mechanism implements ε-differential privacy by adding noise with a Laplace(Δ/ε) distribution, which in our example means Laplace(120/nε).
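
Here is a rough sketch of the Laplace mechanism applied to an average age query, using only the standard library. It assumes ages are bounded by 120 and that n is known; `laplace_noise` and `private_mean_age` are hypothetical names, and this is an illustration rather than a production implementation.

```python
import random

MAX_AGE = 120  # assumed upper bound on age, as in the text


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # expovariate takes a rate, so rate 1/scale gives mean `scale`.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean_age(ages, epsilon, rng=None):
    """Average age plus Laplace noise with scale sensitivity / epsilon."""
    rng = rng or random.Random()
    n = len(ages)
    sensitivity = MAX_AGE / n      # Δ for the average, per the text
    scale = sensitivity / epsilon  # b = Δ/ε = 120/(nε)
    return sum(ages) / n + laplace_noise(scale, rng)


# Illustrative data only: 1000 made-up ages.
ages = [random.Random(42).randint(0, 100) for _ in range(1000)]
print(private_mean_age(ages, epsilon=1.0))
```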

A 95% confidence interval for a Laplace distribution with scale b centered at 0 is

[b log 0.05, -b log 0.05]

which is very nearly

[-3b, 3b].
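
The interval follows from the Laplace tail probability P(|X| > t) = exp(-t/b). A small sketch, with the hypothetical helper `laplace_interval`, shows how close the multiplier is to 3.

```python
import math


def laplace_interval(scale, coverage=0.95):
    """Symmetric interval holding `coverage` of a Laplace(0, scale) distribution."""
    # Solve exp(-t / scale) = 1 - coverage for the half-width t.
    half_width = -scale * math.log(1 - coverage)
    return -half_width, half_width


low, high = laplace_interval(1.0)
print(low, high)  # multiplier of b, to be compared with -3 and 3
```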

In our case b = 120/nε, and so a 95% confidence interval for the noise we add would be [-360/nε, 360/nε].

When n = 1000 and ε = 1, this means we’re adding noise that’s usually between -0.36 and 0.36, i.e. we know the average age to within about 4 months.
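
A quick simulation can sanity-check this figure. The sketch below draws Laplace noise with scale 120/(nε) for n = 1000 and ε = 1 and counts how often it lands within ±0.36; the function names, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def fraction_within(bound, scale, trials=100_000, seed=0):
    """Fraction of simulated noise draws with absolute value at most `bound`."""
    rng = random.Random(seed)
    hits = sum(abs(laplace_noise(scale, rng)) <= bound for _ in range(trials))
    return hits / trials


scale = 120 / (1000 * 1.0)  # b = 120/(nε) with n = 1000, ε = 1
frac = fraction_within(0.36, scale)
print(frac)       # should be close to 0.95
print(0.36 * 12)  # the half-width converted from years to months
```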

But if n = 1 and ε = 1, our confidence interval is the true age ± 360.

Since this is wider than the a priori bounds of [0, 120], we’d truncate our answer to be between 0 and 120.

So we could query for the age of an individual, but we’d learn nothing.
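
To see why nothing is learned, here is a small sketch of clamping the n = 1, ε = 1 interval to the a priori bounds. The names `clamp_age` and `clamped_interval` are hypothetical.

```python
MIN_AGE, MAX_AGE = 0.0, 120.0  # a priori bounds assumed in the text


def clamp_age(value):
    """Restrict a value to the a priori range [0, 120]."""
    return max(MIN_AGE, min(MAX_AGE, value))


def clamped_interval(true_age, half_width):
    """Interval true_age ± half_width, truncated to the a priori bounds."""
    return clamp_age(true_age - half_width), clamp_age(true_age + half_width)


# With half-width 360, the clamped interval is the same for every true age.
for age in (0.0, 35.0, 120.0):
    print(age, clamped_interval(age, 360.0))
```

Because the clamped interval is always the full range, it does not depend on the individual's actual age.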

With n = 1, the width of our confidence interval is 720/ε, and so to get a confidence interval one year wide, as we get with truncation, we would set ε = 720.

Ordinarily ε is much smaller than 720 in application, say between 1 and 10, which means differential privacy reveals far less information than truncation does.

Even if you truncate age to decade rather than year, this still reveals more information than differential privacy provided ε < 72.
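
The break-even values of ε can be computed for any truncation width. This sketch uses the approximation of 3 for the 95% multiplier, as in the text, with an option for the exact value -log 0.05; `epsilon_for_width` is a hypothetical helper.

```python
import math

MAX_AGE = 120  # assumed upper bound on age, as in the text


def epsilon_for_width(width_years, n=1, exact=False):
    """ε at which the 95% noise interval for an average over n records has the given width."""
    multiplier = -math.log(0.05) if exact else 3.0
    # Width = 2 * multiplier * b, with b = MAX_AGE / (n * ε); solve for ε.
    return 2 * multiplier * MAX_AGE / (n * width_years)


print(epsilon_for_width(1))              # truncation to year
print(epsilon_for_width(10))             # truncation to decade
print(epsilon_for_width(1, exact=True))  # slightly below the approximate value
```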

[1] Ordinarily even the number of records in the database is kept private, but we’ll assume here that for some reason we know the number of rows a priori.
