The idea behind PATE is to apply a DP aggregation (the Report Noisy Max algorithm) to the outputs of sensitive models, called “teachers” (sensitive models are models trained directly on labeled sensitive data).

The aggregated result will be used for training another public model on unlabeled public data.

Applying DP to the teachers’ responses can therefore be viewed as a proxy for preserving the privacy of the sensitive data, provided that the teachers are trained on disjoint subsets of the data.

Why must they be disjoint? Imagine that each teacher is trained on the whole dataset. Removing one teacher then has no effect on whether any individual data subject participates in the aggregated result, because that individual still participates in the training of the other teachers.

In that case, applying differential privacy to the teachers’ responses is no longer a valid proxy for preserving the privacy of individual data subjects in the sensitive data.

Thus, training data sets must be disjoint.
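As a rough illustration of the disjointness requirement, the sketch below shuffles the sample indices and deals them out so that each index lands in exactly one teacher's subset. The function name `split_disjoint` and the sizes are hypothetical, chosen only for this example, and the code uses plain Python so that it runs without extra dependencies.

```python
import random


def split_disjoint(n_samples, n_teachers, seed=0):
    """Partition sample indices 0..n_samples-1 into n_teachers disjoint subsets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin: position i goes to teacher
    # i % n_teachers, so no index can appear in more than one subset.
    return [indices[t::n_teachers] for t in range(n_teachers)]


# Hypothetical sizes: 1000 sensitive samples shared among 10 teachers.
subsets = split_disjoint(n_samples=1000, n_teachers=10)
print([len(s) for s in subsets])
```

With such a split, removing one teacher removes every trace of the individuals in its subset, which is what makes DP on the teachers' votes a meaningful proxy.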

Install PySyft

Analyzing the differential privacy of PATE (that is, performing PATE analysis) can be done with PySyft.

You can follow these steps to install PySyft and related libraries.

The easiest way to install the required libraries is with Conda.

Create a new environment, then install the dependencies in that environment.

In your terminal:

If you get any errors relating to zstd, run the following (skip this step if everything above installed fine):

Then retry installing syft (pip install syft).

APIs of PATE Differential Privacy Analysis

We will use NumPy and the perform_analysis function of the “pate” package in PySyft:

Description of the perform_analysis function:

Input (parameters are simplified for the purpose of the article):

- A 2-dimensional array of shape (m, n) holding the teacher models’ outputs, where m is the number of teachers (one output per teacher for each query) and n is the number of queries.

- A vector of the most voted labels.

- ε of the Report Noisy Max algorithm.

Output:

- Data independent epsilon [3]: the privacy loss in the worst case.

- Data dependent epsilon [3]: a tighter bound on the privacy loss, based on the actual values of the teacher models’ outputs.

Compute the data dependent epsilon: for each query, calculate a tight bound on the privacy loss of the aggregated result, which is generated by applying the Report Noisy Max algorithm to the teacher models’ outputs.

Sum these per-query bounds, then assign the result to the data dependent epsilon.

Return: the data independent epsilon and the data dependent epsilon.

Description of the cal_max and noisy_max functions:

cal_max finds the most voted label for each query.

noisy_max is the Report Noisy Max algorithm.
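The article's own versions of these two functions are not shown here, so the following is only a minimal sketch of what they might look like. It assumes that labels are integers in `range(num_labels)`, that ties go to the smallest label, and that Laplace noise with scale 1/ε is added to each vote count; the `num_labels` and `rng` parameters and the `_vote_counts` helper are hypothetical additions for this example.

```python
import random


def _vote_counts(preds, query, num_labels):
    """Count how many teachers voted for each label on one query."""
    counts = [0] * num_labels
    for teacher_row in preds:
        counts[teacher_row[query]] += 1
    return counts


def cal_max(preds, num_labels):
    """Most voted label per query; preds is an (m, n) list of lists."""
    labels = []
    for q in range(len(preds[0])):
        counts = _vote_counts(preds, q, num_labels)
        # max() keeps the first maximum, so ties go to the smallest label.
        labels.append(max(range(num_labels), key=counts.__getitem__))
    return labels


def noisy_max(preds, num_labels, epsilon, rng=None):
    """Report Noisy Max: add Laplace noise to each count, then take the argmax."""
    rng = rng or random.Random()
    labels = []
    for q in range(len(preds[0])):
        counts = _vote_counts(preds, q, num_labels)
        # The difference of two Exp(rate=epsilon) draws is Laplace(0, 1/epsilon).
        # The 1/epsilon scale is an assumption of this sketch.
        noisy = [c + rng.expovariate(epsilon) - rng.expovariate(epsilon)
                 for c in counts]
        labels.append(max(range(num_labels), key=noisy.__getitem__))
    return labels


# 4 teachers (rows) answering 3 queries (columns).
teacher_preds = [[7, 2, 1], [7, 2, 1], [7, 2, 0], [7, 3, 1]]
print(cal_max(teacher_preds, num_labels=10))  # [7, 2, 1]
print(noisy_max(teacher_preds, num_labels=10, epsilon=0.3, rng=random.Random(0)))
```

A smaller ε means larger noise, so the noisy answer departs from the plain majority vote more often; that is the price paid for a lower per-query privacy loss.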

PATE Analysis

In this part, I will use the word “data” for the teacher models’ outputs.

Let’s consider these scenarios and compare our expectations with the analysis results:

There is no information in the data:

Create the data:

    array([[0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           ...,
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.]])

Expectation: There would be no general information and no private information in the aggregated results.

Thus, the data dependent epsilon should be low.

Perform PATE analysis:

Warning: May not have used enough values of l. Increase 'moments' variable and run again.

Data Independent Epsilon: 10000.164470363787

Data Dependent Epsilon: 0.16447036378528898

The data dependent epsilon is quite low, which matches the expectation.

There is rich information in the data and a strong agreement among the teachers:

Create the data:

    array([[  0,   1,   2, ..., 997, 998, 999],
           [  0,   1,   2, ..., 997, 998, 999],
           [  0,   1,   2, ..., 997, 998, 999],
           ...,
           [  0,   1,   2, ..., 997, 998, 999],
           [  0,   1,   2, ..., 997, 998, 999],
           [  0,   1,   2, ..., 997, 998, 999]])

Expectation: Rich information in the data may lead to both rich general information and rich private information.

In this case, however, there is a strong agreement among the teachers.

Thus, one teacher opting out doesn’t affect the final result.

This means that there is no private information.

Consequently, the data dependent epsilon should be low, even with a large privacy loss level (ε=5) for the Report Noisy Max algorithm.

PATE analysis (the same code):

Warning: May not have used enough values of l. Increase 'moments' variable and run again.

Data Independent Epsilon: 10000.57564627325

Data Dependent Epsilon: 0.5756462732485115

The data dependent epsilon is quite low, which matches our expectation.

Random data:

Create the data:

    array([[74, 60, 81, ..., 44, 95, 90],
           [77, 50, 72, ..., 40, 14, 49],
           [69, 60, 54, ..., 57, 60,  9],
           ...,
           [57, 47, 67, ..., 40, 59, 55],
           [26, 74, 21, ..., 27, 88, 57],
           [85, 39, 39, ...,  7, 84, 30]])

Expectation: In contrast with the previous case, there is no longer a strong agreement among the teachers.

Thus, the rich information is split between general information and private information.

Therefore, the data dependent epsilon should be high.

PATE analysis (the same code):

Data Independent Epsilon: 10000.57564627325

Data Dependent Epsilon: 8135.9030202753

The data dependent epsilon is high, which matches our expectation.
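One way to build intuition for the gap between the last two scenarios is to look at the vote margin, meaning the difference between the top vote count and the runner-up for each query. The sketch below is only an intuition aid with hypothetical sizes and a hypothetical `vote_margins` helper; it does not reproduce the quantity that perform_analysis computes.

```python
import random


def vote_margins(preds, num_labels):
    """For each query, return (top vote count) - (runner-up vote count)."""
    margins = []
    for q in range(len(preds[0])):
        counts = [0] * num_labels
        for teacher_row in preds:
            counts[teacher_row[q]] += 1
        top, second = sorted(counts, reverse=True)[:2]
        margins.append(top - second)
    return margins


m, n, num_labels = 100, 50, 100  # hypothetical: teachers, queries, labels

# Strong agreement: every teacher answers query q with label q.
agree = [list(range(n)) for _ in range(m)]

# Random data: every teacher answers every query with a uniform random label.
rng = random.Random(0)
rand = [[rng.randrange(num_labels) for _ in range(n)] for _ in range(m)]

agree_margins = vote_margins(agree, num_labels)
rand_margins = vote_margins(rand, num_labels)
print(min(agree_margins))     # 100: all teachers vote for the same label
print(sum(rand_margins) / n)  # average margin for the random votes
```

When the margin is close to the number of teachers, a single teacher changing its vote cannot change the winning label, so little about any one teacher's data leaks. When the margin is small, one vote can flip the outcome, which is consistent with the much larger data dependent epsilon observed for random data.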

Private Machine Learning with PATE (in PyTorch)

Figure 7 [3]: Private Machine Learning with PATE.

How does it work?

With PATE, we use the aggregated results of the teacher models to train another model, a “student”, on unlabeled (incomplete) public data.

This lets us benefit from sensitive data sources.

Problem Description

Data:

- The MNIST dataset consists of greyscale images of handwritten digits. Each image is 28×28 pixels.

- We will use the MNIST dataset available in the “torchvision” package. The training dataset represents the labeled sensitive dataset, while the test dataset represents the unlabeled public data.

Goal:

- Provide strong privacy guarantees for the labeled sensitive dataset.

- Train a model that classifies an input image (28×28 pixels) of a single handwritten digit (0–9) from the unlabeled public data.

Implementation

Import libraries

Download aggregation.py and import it and related libraries.

Load MNIST data

Train student model

I will assume that you already have the predictions of 250 fine-tuned teacher models for 2048 queries (2048 input samples from the unlabeled public data), stored in a 2-dimensional array as described in the PATE Analysis part:

    array([[7, 2, 1, ..., 3, 4, 6],
           [7, 2, 1, ..., 3, 9, 5],
           [7, 2, 1, ..., 3, 4, 6],
           ...,
           [7, 9, 7, ..., 9, 9, 8],
           [2, 2, 2, ..., 7, 1, 7],
           [6, 6, 6, ..., 6, 6, 6]])

PATE analysis:

Data Independent Epsilon: 751.0955105579642

Data Dependent Epsilon: 0.7707510458439465

Use the inputs and the Report Noisy Max outputs (privacy loss level ε=0.3) of the 2048 queries to train the student model:

Training code:

Result: The model has 82.5% accuracy and a total privacy loss of less than 0.771.

You can find the source code in this GitHub project.

Final Thoughts

Differential privacy is a powerful tool for quantifying and solving practical problems related to privacy.

Its flexible definition gives it the potential to be applied in a wide range of applications, including Machine Learning applications.

All of this is just a starting point, but I hope it is also a sign that we can get the benefits of big data techniques without compromising on privacy.

References

[1] Kobbi Nissim, et al. Differential Privacy: A Primer for a Non-technical Audience. February 14, 2018.

[2] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[3] Nicolas Papernot, et al. Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data. 2017.

[4] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures. 2015, pp. 1322–1333. doi:10.1145/2810103.2813677.