Speaker 1
00:00
Today, we're very happy to have Andrew Trask. He's a brilliant writer, researcher, tweeter, that's a word, in the world of machine learning and artificial intelligence. He is the author of Grokking Deep Learning, the book that I highly recommended in the lecture on Monday. He's the leader and creator of OpenMined, which is an open-source community that strives to make our algorithms, our data, and our world in general more privacy-preserving.
Speaker 1
00:31
He is coming to us by way of Oxford, but without that rich, complex, beautiful, sophisticated British accent, unfortunately. He is one of the best educators and truly one of the nicest people I know. So please give him a warm welcome.
Speaker 2
00:52
Thanks. That was a very generous introduction. So yeah, today we're going to be talking about Privacy-Preserving AI. This talk is going to come in 2 parts.
Speaker 2
00:59
So the first is going to be looking at privacy tools from the context of a data scientist or a researcher, like how their actual UX might change, because I think that's the best way to communicate some of the new technologies that are coming about in that context. Then we're going to zoom out and look, under the assumption that these kinds of technologies become mature, at what that is going to do to society. What sort of consequences or side effects could these kinds of tools have, both positive and negative?
Speaker 2
01:27
So first, let's ask the question, Is it possible to answer questions using data that we cannot see? This is going to be the key question that we look at today. And let's start with an example. So first, if we wanted to answer the question, what do tumors look like in humans?
Speaker 2
01:43
Well, this is a pretty complex question. Tumors are pretty complicated things. So we might train an AI classifier. If we wanted to do that, we would first need to download a dataset of tumor-related images.
Speaker 2
01:54
So we'd be able to statistically study these and be able to recognize what tumors look like in humans. But this kind of data is not very easy to come by. It's rarely collected, it's difficult to move around, and it's highly regulated. So we're probably going to buy it from a relatively small number of sources that are able to actually collect and manage this kind of information.
Speaker 2
02:16
The scarcity and constraints around this are likely to make it a relatively expensive purchase. If it's going to be an expensive purchase for us to answer this question, well, then we're going to need to find someone to finance our project. If we need someone to finance our project, we have to come up with a way to pay them back, which means a business plan. If we're going to create a business plan, then we have to find a business partner.
Speaker 2
02:32
If we're going to find a business partner, we have to spam all our classmates on LinkedIn looking for someone to start a business with us. And all of this is because we wanted to answer the question, what do tumors look like in humans? What if we wanted to answer a different question? What if we wanted to answer the question, what do handwritten digits look like?
Speaker 2
02:48
Well, this would be a totally different story, right? We'd download a dataset, we'd download a state-of-the-art training script from GitHub, we'd run it, and a few minutes later we'd have the ability to classify handwritten digits with potentially superhuman accuracy, if such a thing exists. So why is it so different between these two questions?
Speaker 2
03:11
The reason is that getting access to private data, data about people, is really, really hard. And as a result, we spend most of our time working on problems and tasks like this. So ImageNet, MNIST, CIFAR-10. Anybody who's trained a classifier on MNIST before, raise your hand.
Speaker 2
03:29
I expect pretty much everybody. Instead of working on problems like this. So, is anyone trying to classify or predict dementia, diabetes, Alzheimer's, depression, anxiety?
Speaker 2
03:49
No one. So why is it that we spend all our time on tasks like this, when these tasks represent our friends and loved ones and problems in society that really, really matter? Not to say that there aren't people working on these, there absolutely are, there are whole fields dedicated to them, but for the machine learning community at large, these tasks are pretty inaccessible.
Speaker 2
04:13
In fact, in order to work on one of these, you'd have to dedicate a portion of your life just to getting access to the data, whether that's doing a startup or joining a hospital or what have you, whereas other kinds of datasets are simply readily accessible. This brings us back to our question: is it possible to answer questions using data that we cannot see? So in this talk, we're going to walk through a few different techniques.
Speaker 2
04:44
If the answer to this question is yes, the combination of these techniques is going to try to make it so that we can actually pip install access to datasets like these in the same way that we pip install access to other deep learning tools. And the idea here is to lower the barrier to entry and increase the accessibility of some of the most important problems that we would like to address. So as Lex mentioned, I lead a community called OpenMined, which is an open-source community of a little over 6,000 people who are focused on lowering the barrier to entry to privacy-preserving AI and machine learning. Specifically, one of the tools we're working on, and the one we're talking about today, is called PySyft.
Speaker 2
05:21
PySyft extends the major deep learning frameworks with the ability to do privacy-preserving machine learning. So specifically today, we're going to be looking at the extensions into PyTorch. So, PyTorch, are people generally familiar with PyTorch? Yeah, quite a few users.
Speaker 2
05:36
It's my hope that by walking through a few of these tools, it'll become clear how we can start to be able to do data science, the act of answering questions, using data that we don't actually have direct access to. Then in the second half of the talk, we're going to generalize this to answering questions even if you're not necessarily a data scientist. So, the first tool is remote execution. Okay.
Speaker 2
05:59
So let me just walk you through this. We're going to jump into code for a minute, but hopefully this is fairly line-by-line and relatively simple. And even if you aren't familiar with PyTorch, I think it's relatively intuitive; we're looking at lists of numbers and these kinds of things. So up at the top, we import torch as a deep learning framework. PySyft extends Torch with this thing called TorchHook. All it's doing is iterating through the library and basically monkey-patching in lots of new functionality.
Speaker 2
06:22
And most deep learning frameworks are built around one core primitive, and that core primitive is the tensor, right? For those of you who don't know what tensors are, just think of them as nested lists of numbers for now, and that'll be good enough for this talk. But for us, we introduce a second core primitive, which is the worker, right?
Speaker 2
06:38
And a worker is a location within which computation is going to occur, right? So in this case, we have a virtualized worker that is pointing to, say, a hospital data center. The assumption we have is that this worker will allow us to run computation inside the data center without us actually having direct access to the machine itself.
Speaker 2
06:58
It gives us a limited, whitelisted set of methods that we can use on this remote machine.
Speaker 2
07:04
So just to give you an example, there's that core primitive we talked about a minute ago. We have a torch tensor, say 1, 3, 4, 5. The first method that we added is called just .send. This does exactly what you might expect: it takes the tensor, serializes it, sends it into the hospital data center, and returns back to me a pointer.
Speaker 2
07:22
Now, this pointer is really, really special. And for those of you who are actually familiar with deep learning frameworks, I hope this really resonates with you. Because it has the full PyTorch API as a part of it, but whenever you execute something using this pointer, instead of it running locally, even though it looks and feels like it's running locally, it actually executes on the remote machine and returns back to you another pointer to the result. The idea here being that I can now coordinate remote computations without necessarily having to have direct access to the machine.
Speaker 2
07:55
Of course, I can also call .get, and we'll see that this is actually really important: getting permissions around when you can do a .get request and actually ask for data from a remote machine to be sent back to you. So just remember that. Cool.
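To make that concrete, here is a minimal sketch of this workflow using the PySyft 0.2-era API that this part of the talk is describing; the worker name and tensor values are illustrative rather than taken from the slides.

```python
# A minimal sketch of the remote-execution workflow, assuming the PySyft 0.2-era API
# (syft.TorchHook, VirtualWorker, .send / .get). Names and values are illustrative.
import torch
import syft as sy

hook = sy.TorchHook(torch)                        # monkey-patches torch with .send/.get etc.
hospital = sy.VirtualWorker(hook, id="hospital")  # stands in for a remote hospital data center

x = torch.tensor([1, 3, 4, 5]).send(hospital)     # serialize and ship the tensor; keep a pointer
y = x + x                                         # executes remotely; returns another pointer

print(y)        # a pointer tensor referencing the result held by the "hospital" worker
print(y.get())  # .get() requests the actual values back -> tensor([ 2,  6,  8, 10])
```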
Speaker 2
08:07
So this is where we start. So in the Pareto principle, 80 percent for 20 percent, this is like the first big cut. So pros, data remains on a remote machine. We can now, in theory, do data science on a machine that we don't have access to, that we don't own, right?
Speaker 2
08:24
But the problem is, the first con we want to address is: how can we actually do good data science without physically seeing the data, right? It's all well and good to say, I'm going to train a deep learning classifier, but the process of answering questions is inherently iterative, right? It's inherently give and take: I learn a little bit and I ask a little bit, I learn a little bit and I ask a little bit, right? This brings me to the second tool.
Speaker 2
08:46
So, search and example data. Again, we're starting really simple; it will get more complex here in a minute. In this case, let's say we have what's called a grid.
Speaker 2
08:53
So PyGrid: if PySyft is the library, PyGrid is sort of the platform version. Again, this is all open-source Apache 2 stuff. Here we have what's called a grid client, which could be an interface to a large number of datasets inside of a big hospital, right?
Speaker 2
09:10
And so let's say I wanted to train a classifier to do something with diabetes, right? Say, to predict diabetes, or a certain kind of diabetes, or a certain attribute of diabetes. I should be able to perform a remote search, and I get back pointers to the remote information.
Speaker 2
09:27
I can get back detailed descriptions of what the information is without me actually looking at it: how it was collected, what the rows and columns are, what the types of the different values are, what ranges the values can take on, things that allow me to do remote normalization and so forth, and then in some cases even look at samples of this data. These samples could be human curated, they could be generated from a GAN, or they could be actual short snippets from the real dataset; maybe it's okay to release small amounts but not large amounts.
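As an illustration of this search-and-sample workflow, here is a small mock in plain Python. It is not the real PyGrid API; the class names, tags, and fields are invented purely to show the idea that the data scientist gets metadata and curated samples, never the raw rows.

```python
# Illustrative mock (not the actual PyGrid API) of remote search plus example data.
from dataclasses import dataclass, field


@dataclass
class RemoteDataset:
    tags: list
    description: str          # how it was collected, columns, types, value ranges
    curated_sample: list      # human-curated / GAN-generated / tiny approved snippet
    _private_rows: list = field(default_factory=list, repr=False)  # never exposed

    def sample(self):
        return self.curated_sample  # safe to release in small amounts


class Grid:
    def __init__(self, datasets):
        self.datasets = datasets

    def search(self, *query_tags):
        # returns dataset handles and metadata only, never the private rows themselves
        return [d for d in self.datasets if set(query_tags) & set(d.tags)]


grid = Grid([RemoteDataset(
    tags=["#diabetes", "#hospital-a"],
    description="10k patients; columns: age (18-90), bmi, hba1c, diagnosis (0/1)",
    curated_sample=[{"age": 54, "bmi": 31.2, "hba1c": 7.1, "diagnosis": 1}],
)])

for ds in grid.search("#diabetes"):
    print(ds.description)   # enough context for feature engineering and normalization
    print(ds.sample())
```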
Speaker 2
10:00
The reason I highlight this, this isn't like crazy complex stuff. So prior to going back to school, I used to work for a company called Digital Reasoning. We did sort of on-prem data science. So we delivered AI services to corporations behind the firewall.
Speaker 2
10:17
So we worked with classified information, we worked with investment banks helping prevent insider trading, and doing data science on data that your home team, back in Nashville in our case, is not able to see is really, really challenging. But there are some things that can give you the first big jump before you get into the more complex tools that handle the more challenging use cases. Cool. So basic remote execution, remote procedure calls, basic private search, and the ability to look at sample data give us enough general context to be able to start doing things like feature engineering and evaluating quality.
Speaker 2
10:54
So now the data remains on the remote machine, we can do some basic feature engineering, and here's where things get a little more complicated. If you remember, in the very first slide where I showed you some code, at the bottom I called .get on the tensor. What that did was it took the pointer to some remote information and said, hey, send that information to me. That is an incredibly important bottleneck.
Speaker 2
11:20
Unfortunately, despite the fact that I'm doing all my remote execution, if that's just naively implemented, well, I can just steal all the data that I want. I just call .get on whatever pointers I want, and there's no real added security. So what are we going to do about this? This brings us to tool number 3, called differential privacy. Differential privacy, who's come across it before? Raise your hands. A little higher? Okay. Cool. Awesome.
Speaker 2
11:43
Good. So I'm going to do a quick high-level overview of the intuition of differential privacy, then we're going to jump into how it can look in code, and I'll give you resources for a deeper dive into differential privacy at the end of the talk, should you be interested. So differential privacy, loosely stated, is a field that allows you to do statistical analysis without compromising the privacy of the dataset, right? More specifically, it allows you to query a database, right?
Speaker 2
12:14
While making certain guarantees about the privacy of the records contained within the database. So let me show you what I mean. Let's say we have an example database; this is the canonical DB if you look in the literature on differential privacy. It'll have one row per person and one column of zeros and ones, which correspond to true and false.
Speaker 2
12:32
We don't actually really care what those zeros and ones are indicating. It could be the presence of a disease, it could be male or female; it's just some sensitive attribute, something that's worth protecting, right? Our goal is to ensure that a statistical analysis doesn't compromise privacy. So what we're going to do is query this database, right?
Speaker 2
12:50
So we're gonna run some function over the entire database, and we're going to look at the result, and then we're gonna ask a very important question. We're going to ask, if I were to remove someone from this database, say John, would the output of my function change? If the answer to that is no, then intuitively we can say that, well, this output is not conditioned on John's private information. Now, if we could say that about everyone in the database, well then, okay, it would be a perfectly privacy-preserving query, but it might not be that useful.
Speaker 2
13:32
But this intuitive definition I think is quite powerful, right? The notion of: how can we construct queries that are invariant to removing someone or replacing them with someone else, okay? And the maximum amount that the output of a function can change as a result of removing or replacing one of the individuals is known as the sensitivity. Okay.
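Here is a tiny sketch of that sensitivity idea on an assumed toy database: re-run a count query with each person removed and take the largest change in the output.

```python
# Sensitivity of a counting query over the canonical 0/1 database (toy values).
db = [0, 1, 1, 0, 1, 0, 1, 1]        # one row per person, one sensitive 0/1 attribute

def query(database):
    return sum(database)             # a simple count query

full_result = query(db)
max_change = max(
    abs(full_result - query(db[:i] + db[i + 1:]))   # remove person i and re-run
    for i in range(len(db))
)
print(max_change)   # 1 -> the sensitivity of a count is 1:
                    # removing any single person changes the output by at most one
```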
Speaker 2
13:54
Important. So if you're reading the literature and you come across sensitivity, that's what we're talking about. So what do we do when we have a really sensitive function? We're going to take a bit of a sidestep for a minute.
Speaker 2
14:07
I have a twin sister who's finishing a PhD in political science. In political science, they often need to answer questions about very taboo behavior. Okay. Something that people are likely to lie about.
Speaker 2
14:20
So let's say I wanted to survey everyone in this room and answer the question: what percentage of you are secretly serial killers? Right? Not because I think any one of you is, but because I genuinely want to understand this trend. Right?
Speaker 2
14:38
I'm not trying to arrest people. I'm not trying to be an instrument of the criminal justice system. I'm trying to be a sociologist or a political scientist and understand this actual trend. The problem is, even if I sit down with each one of you in a private room and I say, I promise, I promise, I promise, I won't tell anybody, right?
Speaker 2
14:55
I'm still going to get a skewed distribution, right? There may be some people who are just going to think, why would I risk telling you this private information? And so what sociologists can do is this technique called randomized response, where, I should have brought a coin, you take a coin and you give it to each person before you survey them, right?
Speaker 2
15:13
And you ask them to flip it twice somewhere that you cannot see. So I would ask each one of you to flip a coin twice, somewhere that I cannot see. Then I would instruct you: if the first coin flip is heads, answer honestly. But if the first coin flip is tails, answer yes or no based on the second coin flip.
Speaker 2
15:38
Okay. So roughly half the time, you'll be honest, and the other half of the time, you'll be giving me a perfect 50-50 coin flip. And the cool thing is that what this is actually doing is taking whatever the true mean of the distribution is and averaging it with a 50-50 coin flip, right? So if, say, 55 percent of you answered yes, that you are a serial killer, then I know that the true center of the distribution is actually 60 percent, because it was 60 percent averaged with a 50-50 coin flip. Does that make sense? However, despite the fact that I can recover the center of the distribution, given enough samples, each individual person has plausible deniability. If you said yes, it could have been because you actually are, or it could have been because you just happened to flip a certain sequence of coin flips, okay?
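A small simulation of that randomized-response protocol, with an assumed true rate of 10 percent just for illustration, shows how the noisy answers can be de-biased back to the true mean while each individual answer stays deniable.

```python
# Randomized response: answer honestly on the first flip's heads, otherwise
# answer with the second flip. The true rate (10%) is an assumption for the demo.
import random

def randomized_response(true_answer: bool) -> bool:
    if random.random() < 0.5:      # first flip: heads -> answer honestly
        return true_answer
    return random.random() < 0.5   # tails -> answer with the second coin flip

n = 100_000
true_rate = 0.10
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]

observed = sum(answers) / n        # ~ 0.5 * true_rate + 0.25 (averaged with a 50-50 coin)
recovered = 2 * observed - 0.5     # de-bias: invert the averaging
print(round(observed, 3), round(recovered, 3))   # e.g. ~0.30 observed, ~0.10 recovered
# With observed = 0.55, this same formula gives 0.60, matching the example in the talk.
```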
Speaker 2
16:33
Now, this concept of adding noise to data to give plausible deniability is the secret weapon of differential privacy, right? And the field itself is a set of mathematical proofs for trying to do this as efficiently as possible: to add the smallest amount of noise and get the most accurate results with the best possible privacy protections, right? There is a meaningful base trade-off there, kind of a Pareto trade-off, and we're trying to push that trade-off down.
Speaker 2
17:09
So the field of research that is differential privacy is looking at how to add noise to data, and to the resulting queries, to give plausible deniability to the members of a database or a training dataset. Does that make sense? Now, a few terms that you should be familiar with. There's local and there's global differential privacy.
Speaker 2
17:31
Local differential privacy adds noise to data before it's sent to the statistician. So in this case, the one with the coin flip, this was local differential privacy. It affords you the best protection, because you never actually reveal your information in the clear to someone, right? And then there's global differential privacy, which says: okay, we're going to put everything in the database, perform a query, and then before the output of the query gets published, we're going to add a little bit of noise to that output, okay?
Speaker 2
17:58
This tends to have a much better privacy-accuracy trade-off, but you have to trust the database owner not to compromise the results. Okay. And we'll see there are some other things we can do there. But are you with me so far? This is a good point for questions if you have any.
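For the global flavor, a minimal sketch is the Laplace mechanism: the database owner computes the query on the raw data and releases only a noised answer, with the noise scale set by sensitivity over epsilon. The database contents and epsilon values below are made up for illustration.

```python
# Global differential privacy via the Laplace mechanism (toy data, assumed epsilons).
import numpy as np

db = np.random.randint(0, 2, size=5000)   # the 0/1 column from the canonical example

def private_count(database, epsilon, sensitivity=1.0):
    true_answer = database.sum()                        # computed on raw data, inside the trust boundary
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_answer + noise                          # only the noised answer is released

print(db.sum())                          # the exact count (never published)
print(private_count(db, epsilon=0.1))    # more noise, stronger privacy guarantee
print(private_count(db, epsilon=1.0))    # less noise, weaker guarantee
```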
Speaker 2
18:09
Got it. So the question is: is this verifiable? Is any of this differential privacy process verifiable? That is a fantastic question, and one that absolutely comes up in practice.
Speaker 2
18:21
So first, with local differential privacy, the nice thing is that everyone's doing it for themselves, right? So in that sense, if you're flipping your own coins and answering your own questions, that's your verification; you're kind of trusting yourself. For global differential privacy, stay tuned for the next tool and we'll come back to that.
Speaker 3
18:39
All right. So what does this look like in code?
Speaker 2
18:43
So first, we have a pointer to a remote private dataset and we call .get. Whoa, we get a big fat error, right? You just asked to see the raw value of some private data point, which you cannot do, right?
Speaker 2
18:53
Instead, you pass an epsilon into .get to add the appropriate amount of noise. So, one thing I haven't mentioned yet about differential privacy: I mentioned sensitivity, right? Sensitivity was related to the type of query, the type of function that we wanted to run, and its invariance to removing or replacing individual entries in the database.
Speaker 2
19:11
Epsilon is a measure of what we call our privacy budget, right? And what our privacy budget is saying is: what's the upper bound on the amount of statistical uniqueness that I'm going to allow to come out of this database? And actually, I'm going to take one more side track here, because I think it's really worth mentioning.
Speaker 2
19:32
Data anonymization. Anyone familiar with data anonymization, come across this term before? Taking a document and redacting the social security numbers and all that kind of stuff. By and large, it does not work.
Speaker 2
19:44
If you don't remember anything else from this talk: it is very dangerous to do just dataset anonymization. Okay? And differential privacy, in some respects, is the formal version of data anonymization. Instead of just saying, okay, I'm going to redact out these pieces and then I'll be fine, this is saying, okay, we can do a lot better.
Speaker 2
20:02
So for example, the Netflix Prize, the Netflix machine learning prize. If you remember this, it was a big million-dollar prize; maybe some people in here competed in it. In this prize, Netflix published an anonymized dataset of movies and users. They took all the movies and replaced them with numbers, they took all the users and replaced them with numbers, and then you just had sparsely populated movie ratings in this matrix, right?
Speaker 2
20:27
Seemingly anonymous, right? There are no names of any kind. But the problem is that each row is statistically unique, meaning it's kind of its own fingerprint.
Speaker 2
20:41
And so two months after the dataset was published, some researchers, at UT Austin I think it was, were able to go and scrape IMDB, basically create the same matrix from IMDB, and then just compare the two. And it turns out that people who were into movie rating were into movie rating in both places, watching movies at similar times, with similar patterns and similar tastes, right? And they were able to de-anonymize the first dataset with a high degree of accuracy.
Speaker 2
21:13
It happened again with medical records; there's a famous case of, I think, a Massachusetts politician being de-anonymized through very similar techniques. So one person goes and buys an anonymized medical dataset over here that has, say, birthdate and zip code, and this one has zip code and gender, and this one has zip code, gender, and whether or not you have cancer, right? And when you get all these together, you can start to use the uniqueness in each one to relink it all back together. This is so doable, to the extreme that I unfortunately know of companies whose business model is to buy anonymized datasets, de-anonymize them, and sell market intelligence to insurance companies.
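A toy version of that linkage attack, using entirely invented records, looks like this: two releases that each seem harmless share quasi-identifiers, and joining on them re-attaches names to the sensitive column.

```python
# Toy linkage attack: join two "anonymized" releases on shared quasi-identifiers.
hospital_release = [   # no names, but quasi-identifiers remain
    {"zip": "02139", "birth_year": 1952, "gender": "M", "diagnosis": "cancer"},
    {"zip": "02142", "birth_year": 1987, "gender": "F", "diagnosis": "healthy"},
]

voter_roll = [         # a second, public dataset with names and the same quasi-identifiers
    {"name": "J. Smith", "zip": "02139", "birth_year": 1952, "gender": "M"},
    {"name": "A. Jones", "zip": "02142", "birth_year": 1987, "gender": "F"},
]

def link(records_a, records_b, keys=("zip", "birth_year", "gender")):
    matches = []
    for a in records_a:
        for b in records_b:
            if all(a[k] == b[k] for k in keys):   # statistically unique combination
                matches.append({**a, **b})
    return matches

for row in link(hospital_release, voter_roll):
    print(row["name"], "->", row["diagnosis"])    # re-identified sensitive attribute
```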
Speaker 2
21:56
Ooh, right? But it can be done, okay? And the reason it can be done is that just because the dataset you are publishing, the one that you are physically looking at, doesn't seem like it has social security numbers and stuff in it, does not mean that there isn't enough unique statistical signal for it to be linked to something else. And so when I say a maximum amount of epsilon: epsilon is an upper bound on the statistical uniqueness that you're publishing from a dataset, right? And so what this tool represents is saying, okay, apply however much noise you need to, given whatever computational graph led back to private data for this tensor, right?
Speaker 2
22:38
To put an upper bound on the potential for linkage attacks, right? Now, if you set epsilon to 0, then that's effectively saying I'm only going to allow patterns that have occurred at least twice, right? Okay, meaning two different people had this pattern, and thus it's not unique to either one. Yes? So, what happens if you perform the query twice?
Speaker 2
23:01
So the random noise would be re-randomized and sent again, and you're absolutely correct. So this epsilon is how much I'm spending with this query. If I ran this three times, I would spend an epsilon of 0.3. Does that make sense? This is a 0.1 query; if I did it multiple times, the epsilons would sum. And so for any given data science project,
Speaker 2
23:19
what we're advocating is that you're given an epsilon budget that you're not allowed to exceed, no matter how many queries you run. Now, there's another subfield of differential privacy that's looking at single-query approaches, which is all around synthetic datasets: how can I perform one query against a whole dataset and create a synthetic dataset that has certain invariances that are desirable, right?
Speaker 2
23:41
So I can do good statistics on it, and then I can query it as many times as I want. Anyway, we don't have to get into that now.
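To make the budgeting idea concrete, here is a sketch of a per-project epsilon accountant; the 0.3 budget and the Laplace noise are assumptions for illustration, not how any particular library actually tracks it.

```python
# Sketch of a per-project privacy budget: epsilons sum, and queries are refused
# once the budget is exhausted. Budget and query costs are assumed values.
import numpy as np


class PrivacyBudgetExceeded(Exception):
    """Raised when a query would push the project past its epsilon budget."""


class BudgetedDataset:
    def __init__(self, data, max_epsilon=0.3):
        self.data = data
        self.max_epsilon = max_epsilon    # the project-level budget
        self.spent = 0.0

    def get(self, query, epsilon):
        if self.spent + epsilon > self.max_epsilon + 1e-12:
            raise PrivacyBudgetExceeded(f"spent {self.spent:.2f}, requested {epsilon:.2f}")
        self.spent += epsilon                             # epsilons sum across queries
        noise = np.random.laplace(0.0, 1.0 / epsilon)     # assumes a sensitivity-1 query
        return query(self.data) + noise


ds = BudgetedDataset([0, 1, 1, 0, 1], max_epsilon=0.3)
print(ds.get(sum, epsilon=0.1))   # ok: 0.1 spent
print(ds.get(sum, epsilon=0.1))   # ok: 0.2 spent
print(ds.get(sum, epsilon=0.1))   # ok: 0.3 spent
# ds.get(sum, epsilon=0.1)        # a fourth query would raise PrivacyBudgetExceeded
```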
Speaker 3
23:51
Does that answer your question?
Speaker 2
23:53
Cool. Awesome. So now you might think, okay, this is a lost cause: how can we be answering questions while holding back the statistical signal? But here's the difference: if I have a dataset and I want to know what causes cancer, I could query the dataset and learn that smoking causes cancer without learning whether any particular individual is or is not a smoker.
Speaker 2
24:16
Does that make sense? Right? And the reason for that is, is that I'm specifically looking for patterns that are occurring multiple times across different people. And this actually happens to really, closely mirror the type of generalization that we want in machine learning statistics anyways.
Speaker 2
24:32
Does that make sense? Like, as machine learning practitioners, we're actually not really interested in the one-offs, right? I mean, sometimes our models memorize things, this, this happens, right? But we're actually more interested in the things that are- the things that are not specific to you.
Speaker 2
24:46
I want the things that are going to work, the heart treatments that are going to work for everyone in this room. Obviously, if you need a heart treatment, I'd be happy for you to have one. But what we're chiefly interested in are the things that generalize, which is why this is realistic, and why, with continued effort on both the tooling and the theory side, we can have a much better reality than today.
Speaker 2
25:09
Cool. So, pros, just to review. First, remote execution allows data to remain on the remote machine; with search and sampling, we can feature engineer using toy data; and with differential privacy, we have a formal, rigorous privacy budgeting mechanism, right? Now, shoot.
Speaker 2
25:24
How is the privacy budget set? Is it defined by the user or is it defined by the dataset owner or someone else? This is a really, really interesting question actually. So first, it's definitely not set by the data scientist, because that would be a bit of a conflict of interest.
Speaker 2
25:41
And at- at first you might say, it should be the data owner, okay? So the hospital, right? That's trying to cover their butt, right? And make sure that their assets are protected both legally and commercially, right?
Speaker 2
25:54
So they're trying to make money off this, so there are proper incentives there. But the interesting thing, and this gets back to your question, is: what happens if I have, say, a radiology scan in two different hospitals, right? And they each spend one epsilon worth of my privacy in their hospital, right?
Speaker 2
26:18
That means that actually two epsilon of my private information is out there, right? Someone just has to be clever enough to go to both places to make the join. This is exactly the same mechanism we were talking about a second ago, when someone went from Netflix to IMDB. So the true answer to who should be setting epsilon budgets, although logistically it's going to be challenging, and we'll talk about this a little bit in part 2 of the talk, but I'm going a little slow.
Speaker 2
26:44
But okay. It should be us. It should be people, and it should be people setting it around their own information, right? You should be setting your personal epsilon budget.
Speaker 2
26:55
That makes sense? That's an aspirational goal. We've got a long way before we can get to that level of, of infrastructure around these kinds of things. And we can talk about that and we can definitely talk about more of that in the kind of question-answer session as well.
Speaker 2
27:09
But I think, in theory, that's what we would want. Okay. There are two weaknesses of this approach that we still have to address; someone asked about this. I think it was you.
Speaker 2
27:23
Yeah, you asked the question. So first, the data is safe but the model is put at risk. And what if we need to do a join? Actually, yours is a third one, which I should totally add to the slide.
Speaker 2
27:32
So first, if I'm sending my computations, my model, into the hospital to learn how to be a better cancer classifier, my model is put at risk. It's kind of a bummer if this is a $10 million healthcare model and I'm just sending it out to a thousand different hospitals to learn. So that's potentially risky. Second, what if I need to do a join or a computation across multiple different data owners who don't trust each other, right?
Speaker 2
27:54
Who sends whose data to whom, right? And thirdly, as you pointed out, how do I trust that these computations are actually happening the way that I am telling the remote machine they should happen? This brings me to my absolute favorite tool: secure multi-party computation.
Speaker 2
28:13
Come across this before? Raise them high. Okay, cool. Little bit above average.
Speaker 2
28:18
Most machine learning people have not heard about this yet, and I absolutely love it; this is the coolest thing I've learned about since learning about AI and machine learning. This is a really, really cool technique. Encrypted computation. How about homomorphic encryption?
Speaker 2
28:30
Have you come across homomorphic encryption? Okay, a few more. Yeah, this is related to that. So first, the textbook definition goes like this.
Speaker 2
28:39
If you go on Wikipedia, you'd see that secure MPC allows multiple people to combine their private inputs to compute a function, without revealing their inputs to each other, okay? But in the context of machine learning, the implication of this is that multiple different individuals can share ownership of a number, okay? Share ownership of a number. Let me show you what I mean.
Speaker 2
29:00
So let's say I have the number 5, my happy smiling face, and I split this into two shares, 2 and 3. Okay. I've got two friends, Bob and Marianne, and I give them these shares. They are now the shareholders of this number. Okay.
Speaker 2
29:18
And now I'm gonna go away. And this number is shared between them. Okay. And this, this gives us several desirable properties.
Speaker 2
29:27
First, it's encrypted, in the sense that neither Bob nor Marianne can tell what number is encrypted between them by looking at their own share by itself. Now, for those of you who are familiar with cryptographic math, I'm hand-waving over this a little bit. Decryption would typically be adding the shares together modulo a large prime, so the shares would typically look like large pseudo-random numbers, right?
Speaker 2
29:56
But for the sake of making it intuitive, I've picked pseudo-random numbers that are easy on the eyes. So first, these two values are encrypted, and second, we get shared governance, meaning that we cannot decrypt these numbers or do anything with them unless all of the shareholders agree. Okay? But the truly extraordinary part is that while this number is encrypted between these individuals, we can actually perform computation, right?
Speaker 2
30:26
So in this case, let's say we wanted to multiply the encrypted number by 2: each person can multiply their share by 2, and now they have an encrypted number 10, right? And there's a whole variety of protocols allowing you to do different functions, such as the functions needed for machine learning, while numbers are in this encrypted state, okay? I'll give you some more resources at the end if you're interested in learning more about this. Now, the big tie-in.
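Here is a small sketch of additive secret sharing with the modular arithmetic just mentioned; the modulus and the specific values are illustrative, and real protocols for multiplying two shared numbers together need extra machinery (for example, the crypto provider that comes up next).

```python
# Additive secret sharing: shares look random, sum to the secret mod Q, and
# support addition and scaling by public constants without reconstruction.
import random

Q = 2**61 - 1   # a large prime modulus (illustrative choice)

def share(secret, n_shares=2):
    shares = [random.randrange(Q) for _ in range(n_shares - 1)]
    shares.append((secret - sum(shares)) % Q)     # shares sum to the secret mod Q
    return shares

def reconstruct(shares):
    return sum(shares) % Q                        # requires every shareholder's cooperation

bob, marianne = share(5)
print(bob, marianne)                  # two large pseudo-random-looking numbers
print(reconstruct([bob, marianne]))   # 5

# Multiply the shared number by a public constant: each party scales their own share.
doubled = [(s * 2) % Q for s in (bob, marianne)]
print(reconstruct(doubled))           # 10

# Add two shared numbers: each party adds their shares locally.
a_shares, b_shares = share(20), share(22)
summed = [(a + b) % Q for a, b in zip(a_shares, b_shares)]
print(reconstruct(summed))            # 42
```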
Speaker 2
30:52
Models and datasets are just large collections of numbers, which we can individually encrypt and individually share governance over. Now, specifically to reference your question, there are two configurations of secure MPC: active and passive security. In the active security model, you can tell if anyone performs computation that you did not independently authorize, which is great. So what does this look like in practice when we go back to the code?
Speaker 2
31:18
So in this case, we don't need just one worker, it's not just one hospital, because we're looking to have shared governance, shared ownership, amongst multiple different individuals. So let's say we have Bob, Alice, and Tao, and a crypto provider which we won't go into now. I can take a tensor, and instead of calling .send and sending that tensor to someone else, now I call .share, and that splits each value into multiple different shares and distributes those amongst the shareholders, right? So in this case, Bob, Alice, and Tao.
Speaker 2
31:47
However, in the frameworks that we're working on, you still get kind of the same PyTorch-like interface, and all the cryptographic protocol happens under the hood. And the idea here is to make it so that we can sort of do encrypted machine learning without you necessarily having to be a cryptographer, right? And vice versa, cryptographers can improve the algorithms and machine learning people can automatically inherit them, right? So kind of classic sort of open-source machine learning library, making complex intelligence more accessible to people, if that makes sense.
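A minimal sketch of what that interface looked like in the PySyft 0.2-era API follows; the worker names mirror the example above, and the comment about models reflects the tutorial pattern of that version rather than the exact code on the slide.

```python
# Sketch of .share() under the PySyft 0.2-era API; all the SMPC protocol details
# happen under the hood while the interface stays PyTorch-like.
import torch
import syft as sy

hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
tao = sy.VirtualWorker(hook, id="tao")
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")

x = torch.tensor([1, 3, 4, 5]).share(bob, alice, tao, crypto_provider=crypto_provider)
y = x + x            # computed on secret shares; no single worker sees the values
print(y.get())       # all shareholders cooperate to decrypt -> tensor([ 2,  6,  8, 10])

# The same pattern extends to models in that version's tutorials:
# model.fix_precision().share(bob, alice, tao, crypto_provider=crypto_provider)
# enables encrypted prediction (and, more slowly, encrypted training).
```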
Speaker 2
32:16
And what we can do on tensors, we can also do on models. So we can do encrypted training, and encrypted prediction, and we're gonna get into, what kind of awesome use cases this opens up in a bit. And this is a nice set of features, right? In my opinion, this is, this is sort of the MVP of doing privacy preserving data science, right?
Speaker 2
32:39
The idea being that I can have remote access to a remote dataset; I can learn high-level latent patterns, like what causes cancer, without learning whether individuals have cancer; I can pull back just that high-level information, with formal mathematical guarantees over the filter it's coming back through, right? And I can work with datasets from multiple different data owners while making sure that each individual data owner is protected. Now, what's the catch?
Speaker 2
33:11
Okay. So first is computational complexity, right? Encrypted computation, secure MPC, involves sending lots of information over the network. I think the state of the art for deep learning prediction is about a 13x slowdown over plaintext, which is inconvenient but not deadly, right?
Speaker 2
33:32
But you do have to understand that assumes something like two AWS machines talking to each other, which is relatively fast. We also haven't had any hardware optimization yet; to the extent that NVIDIA did a lot for deep learning, there will probably be some sort of Cisco-like player that does something similar for encrypted, secure MPC-based deep learning, right? Let's see. So this brings us back to the fundamental question.
Speaker 2
33:57
Is it possible to answer questions using data we cannot see? The theory is absolutely there; that's something I feel reasonably confident saying about the theoretical frameworks we have. And actually, the other thing that's really worth mentioning here is that these techniques come from totally different fields, which is why they haven't necessarily been combined that much yet.
Speaker 2
34:13
I'll get more into that in a second. But it's my hope that by considering what these tools can do, it'll open up your eyes to the potential that, in general, we can have this new ability to answer questions using information that we don't actually own ourselves. Because from a sociological standpoint, that's net new for us as a species, if that makes sense. Previously, we had to have a trusted third party who would take all the information in themselves and make some sort of neutral decision, right?
Speaker 2
34:46
So we'll come to that in a second. One of the big long-term goals of our community is to make infrastructure for this that is secure enough and robust enough, and of course free and Apache 2 open-source licensed, so that information on the world's most important problems will be this accessible. Then we can spend less time working on tasks like that, and more time working on tasks like this. So this is going to be the breaking point between Part 1 and Part 2.
Speaker 2
35:17
Part 2 will be a bit shorter. But if you're interested in diving deeper on the technicals of this, here's a 6 or 7 hour course that I taught just on these concepts and on the tools. It's free on Udacity. Feel free to check it out.
Speaker 2
35:30
So the question was about how I specified that a model can be encrypted during training: is that the same as homomorphic encryption, or is that something else? A couple of years ago, there was a big burst in the literature around training on encrypted data, where you would homomorphically encrypt the dataset, and it turned out that enough statistical regularity was preserved that you could actually train on that dataset without decrypting it. This is similar to that, except one downside of that approach is that in order to use the model in the future, you still have to be able to encrypt data with the same key, which is often constraining in practice.
Speaker 2
36:11
Also, there's a pretty big hit to accuracy, because you're training on data that inherently has a lot of noise added to it. What I'm advocating for here is that instead we encrypt both the model and the dataset during training, but inside the encryption, inside the box, it's actually performing the same computations that it would be doing in plaintext. So you don't get any degradation in accuracy, and you don't get tied to one particular public-private key pair.
Speaker 2
36:37
Yeah, so the question was to comment on federated learning, specifically Google's implementation. I think Google's implementation is great. Obviously, the fact that they've shown this can be done with hundreds of millions of users is incredibly powerful; I mean, even inventing the term and creating momentum in that direction.
Speaker 2
36:55
One thing that is worth mentioning is that there are two forms of federated learning. One is sort of the one where your model is... federated learning, sorry, I ought to talk about what that is first. Okay.
Speaker 2
37:09
Yes, I'll do that quickly. So federated learning is basically the first thing I talked about: remote execution. Everyone has a smartphone; if you've got Android or iOS, you plug your phone in at night and it attaches to Wi-Fi.
Speaker 2
37:24
You know when you text and it recommends the next word? That model is trained using federated learning, meaning that it learns on your device how to do that better, and then that model gets uploaded to the cloud, as opposed to uploading all of your texts to the cloud and training one global model. Does that make sense?
Speaker 2
37:41
So you plug your phone in at night, the model comes down, trains locally, and goes back up; it's federated, right? That's basically what federated learning is in a nutshell. And it was pioneered by a team at Google, and they do really fantastic work.
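To make that "model comes down, trains locally, goes back up" loop concrete, here is a toy federated-averaging sketch in plain PyTorch; the model, the fake client data, and the number of rounds are all invented, and a real deployment adds client sampling, secure aggregation, and so on.

```python
# Toy federated averaging: each client trains a copy locally, only weights return.
import copy
import torch
from torch import nn, optim

global_model = nn.Linear(10, 2)   # stand-in for, e.g., a next-word model

def local_update(model, data, target, epochs=1):
    model = copy.deepcopy(model)                  # "model comes down" to the phone
    opt = optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(data), target)
        loss.backward()
        opt.step()                                # trains locally on the device's own data
    return model.state_dict()                     # only weights go back up, never the data

# Five fake "phones", each with its own private batch of data.
clients = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]

for rnd in range(3):
    updates = [local_update(global_model, x, y) for x, y in clients]
    averaged = {name: torch.stack([u[name] for u in updates]).mean(dim=0)
                for name in updates[0]}           # federated averaging of the weights
    global_model.load_state_dict(averaged)
    print(f"round {rnd}: aggregated {len(updates)} client updates")
```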
Speaker 2
37:55
They've paid down a lot of the technical debt, a lot of the technical risk around it, and they've published really great papers outlining how they do it, which is fantastic. What I outlined here is actually a slightly different style of federated learning. There's federated learning with a fixed dataset and a fixed model and lots of users, where the data is very ephemeral: phones are constantly logging in and logging off, you're plugging your phone in and then you're taking it out.
Speaker 2
38:25
That is one style of federated learning. It's really useful for product development: if you want to build a smartphone app that has a piece of intelligence in it, but getting access to the data to train that intelligence would be prohibitively difficult, or you just want the value proposition of protecting privacy, right? That's what that style of federated learning is good for.
Speaker 2
38:45
What I've outlined here is a bit more exploratory federated learning, where instead of the model being hosted in the cloud and data owners showing up and making it a bit smarter every once in a while, now the data is hosted at a variety of different private clouds, and data scientists show up and say, I want to do something with diabetes today, or I want to study dementia today, something like that, right? This is much more difficult, because the attack vectors for this are much larger, right? I'm trying to be able to answer arbitrary questions about arbitrary datasets in a protected environment. Right.
Speaker 2
39:21
So I think, yeah, those are my general thoughts on it. Does federated learning leak any information? So, the question was, does federated learning leak information. Federated learning by itself is not a secure protocol, and that's why you need this ensemble of techniques. It is perfectly possible for a federated learning model to simply memorize the dataset and then spit that back out later.
Speaker 2
39:43
You have to combine it with something like differential privacy in order to be able to prevent that from happening.
Speaker 2
39:47
Does that make sense? So just because the training is happening on my device does not mean it's not memorizing my data. Does that make sense?
Speaker 2
39:55
Okay. So now I want to zoom out and go a little less from the data science practitioner perspective. Now I'll take more the perspective of an economist or a political scientist, someone looking globally at: okay, if this becomes mature, what happens? Right?
Speaker 2
40:10
And, and this is where it gets really exciting. Anyone entrepreneurial? Anyone? Everyone?
Speaker 2
40:15
I don't know. No one? Okay. Cool.
Speaker 2
40:17
Well, this is the part for you. So, the big difference is this ability to answer questions using data you can't see. Because as it turns out, most people spend a great deal of their lives just answering questions, and a lot of it involves personal data. I mean, whether it's minute things like where's my water, where are my keys, or what movie should I watch tonight, or what kind of diet should I have to be able to sleep well, right? I mean, a wide variety of different questions, right?
Speaker 2
40:53
And we're limited in our answering ability by the information that we have, right? So this ability to answer questions using data we don't have is, I think, sociologically quite important. There are four different areas that I want to highlight as big groups of use cases for this technology, to help inspire you to see where this infrastructure can go. And actually, before I jump into that, has anyone been to Edinburgh?
Speaker 2
41:20
Edinburgh? Cool. Just tour like the castle and stuff like that. So my wife and I, this is my wife, Amber.
Speaker 2
41:28
We went to Edinburgh for the first time about six months ago, in September. And we did the underground tour, was it... we did a ghost tour. Yeah. We did a ghost tour, and it was really cool.
Speaker 2
41:46
There was one thing I took away from it. There was this point where we were standing, we had just walked out of the tunnels, and she was pointing up at some of the architecture. And then she started talking about the cobblestone streets and why the cobblestone streets are there. One of the main purposes of cobblestone streets was to lift you out of the muck. And the reason there was muck is that they didn't have any internal plumbing, so the sewage was just poured out into the street, right?
Speaker 2
42:17
Because you live in a big city, and this was the norm everywhere, right? And actually, I think she even implied that the invention, or the popularization, of the umbrella had less to do with actual rain and a bit more to do with buckets of stuff coming down from on high. Which is a whole different world, when you think about what that means.
Speaker 2
42:37
But the reason I bring this up is that, however many hundred years ago, people were walking through sludge; sewage was just everywhere, right? It was all over the place, people were walking through it everywhere they went, and they were wondering why they got sick, right? And it wasn't because they wanted it to be that way, it's just that it was a natural consequence of the technology they had at the time, right? This is not malice, this is not anyone being good or bad or evil or whatever; it's just the way things were.
Speaker 2
43:13
And I think there's a strong analogy to be made with how our data is handled as a society at the moment, right? We've just sort of walked into a society where new inventions have come up, new practical things, new uses for them, and now everywhere we go, we're constantly spreading and spewing our data all over the place, right? I mean, every camera that sees me walking down the street; goodness, there's a company that takes a whole picture of the Earth by satellite every day. Like, how the hell am I supposed to do anything without everyone following me around all the time, right? And I imagine that whoever it was, I'm not a historian so I don't really know, but whoever it was that said, what if we ran plumbing from every single apartment, business, school, maybe even some public toilets, underground, under our city, all to one location, and then processed it, used chemical treatments, and turned that into usable drinking water.
Speaker 2
44:14
Like, how laughable would that have been? It would have been just the most massive logistical infrastructure problem ever: to take a working city, dig up the whole thing, to take already-constructed buildings and run pipes through all of them. I mean, Oxford, gosh, there's a building there that's so old they don't have showers, because they didn't want to run the plumbing for them.
Speaker 2
44:37
You have to ladle water over yourself. It's in Merton College; it's quite famous, right? Anyway, the infrastructure challenges must have seemed absolutely massive.
Speaker 2
44:49
And so as I walk through four broad areas where things could theoretically be different based on this technology, I think it's probably going to hit you like, whoa, that's a lot of change. But I think the need is sufficiently great. I mean, if you view our lives as just one long process of answering important questions, whether it's where we're going to get food or what causes cancer, making sure that the right people can answer their questions, without data just getting spewed everywhere so that the wrong people can answer theirs,
Speaker 2
45:25
right, is important. And, yeah, anyway. So I know there's going to be a certain ridiculousness to maybe what some of this sounds like. But I hope that you at least see that, theoretically, the basic building blocks are there, and that what really stands between us and a world that's fundamentally different is adoption, maturing of the technology, and good engineering.
Speaker 2
45:51
Because I think once Sir Thomas Crapper invented the toilet, I do remember that one, at that point the basics were there, and what stood between them and modern sanitation was implementation, adoption, and engineering. I think that's where we are. The best part is we have companies like Google that have already paved the way with some very large rollouts of the early pieces of this technology.
Speaker 2
46:19
Cool. So what are the big categories? One I've already talked about: open data for science. This one is a really big deal.
Speaker 2
46:38
The reason it's a really big deal is mostly because everyone gets excited about making AI progress, right? Everyone gets super excited about superhuman ability in X, Y, or Z. When I started my PhD at Oxford, I worked for a professor named Phil Blunsom. The first thing he told me when I sat my butt down in his office on my first day as a student was: Andrew, everyone's going to want to work on models.
Speaker 2
47:00
But if you look historically, the biggest jumps in progress have happened when we had new big datasets or the ability to process new big datasets. And just to give a few anecdotes, ImageNet, right? ImageNet. GPUs allowing us to process larger datasets.
Speaker 2
47:16
Even things like AlphaGo: that's a synthetically generated, effectively infinite dataset. Or, I don't know, did anyone watch the AlphaStar live stream on YouTube? They talked about how it had trained on something like 200 years of StarCraft, right?
Speaker 2
47:33
Or if you look at Watson playing Jeopardy, that was on the heels of a new large structured dataset based on Wikipedia. Or if you look at Garry Kasparov and IBM's Deep Blue, that was on the heels of the largest open dataset of chess matches ever having been published online, right?
Speaker 2
47:55
There's this echo: big new dataset, big new breakthrough, big new dataset, big new breakthrough, right? And what we're talking about here is potentially several orders of magnitude more data, relatively quickly. And the reason for that is, I'm not saying we're going to invent a new machine, that machine is going to collect this data, and then it's going to come online. I'm saying there are thousands and thousands of enterprises, millions of smartphones, and hundreds of governments that all already have this data sitting inside of data warehouses.
Speaker 2
48:27
Largely untapped, for two reasons: one, legal risk, and two, commercial viability. If I give you a dataset, all of a sudden I've just doubled the supply. What does that do to my ability to bill for it?
Speaker 2
48:40
And there's the legal risk that you might do something bad with it that comes back to hurt me. With this category, I know it's just one phrase on the slide, but this is like ImageNet for every data task that's already been established, right? For example, we're working with a professor at Oxford in the psychology department who wants to study dementia. The problem with dementia is that every hospital has like five cases, right? It's not a very centralized disease.
Speaker 2
49:11
It's not like all the cancer patients go to one big center where all the technology is; dementia is sprinkled everywhere. And so the big thing that's blocking him as a dementia researcher is access to data, and so he's investing in private data science platforms. And I didn't persuade him to; I found him after he was already looking to do that.
Speaker 2
49:33
But pick any challenge where data is already being collected, and this can unlock not larger amounts of data in existence, but larger amounts of data that can be used together. Does that make sense? This is like a thousand startups right here. Instead of going out and trying to buy as many datasets as you can, which is a really hard and really expensive task,
Speaker 2
49:53
talk to anyone in Silicon Valley right now trying to do a data science startup, instead you go to each individual person that has a dataset and you say, hey, let me create a gateway between you and the rest of the world that's going to keep your data safe and allow people to leverage it. Right? That's a repeatable business model.
Speaker 2
50:12
Pick a use case, right? Be the radiology network gatekeeper, right? Okay. So enough on that one.
Speaker 2
50:21
But does it make sense how, on a huge variety of tasks, just the ability to have a data silo that you can do data science against is going to increase the accuracy of a huge variety of models really, really quickly? Cool? All right, second one. Oh, that's not right.
Speaker 2
50:44
Single-use accountability. This one's a little bit tricky. You get to the airport and you get your bag checked.
Speaker 2
51:07
Everyone's familiar with this process, I assume. What happens? Someone's sitting at a monitor, and they see all the objects in your bag. So that occasionally, they can spot objects that are dangerous or illicit, right?
Speaker 2
51:24
There's a lot of extra information leakage due to the fact that they have to sit and look at thousands of objects, basically searching every single person's bag totally and completely, just so that occasionally they can find that one. The question they actually want to answer is: is there anything dangerous in this bag?
Speaker 2
51:42
But in order to answer it, they have to basically acquire access to the whole bag, right? So let's think about the same approach of answering questions using data we can't see. The best example of this in the analog world is a sniffing dog.
Speaker 2
51:59
Some of you have had sniffing dogs give your bag a whiff at the airport, right? This is actually a really privacy-preserving thing, because dogs don't speak English or any other language. So the dog comes by, nope, everything's fine, and it moves on. The dog has the ability to reveal only 1 bit of information, without anyone having to search every single bag.
Speaker 2
52:25
Okay. That is what I mean when I say a single-use accountability system. It means I am looking at some data stream because I'm holding someone accountable, right? We want to make it so that I can only answer the question that I claim to be looking into.
Speaker 2
52:41
So if this is a video feed, for example: instead of getting access to the raw video feed, with its millions of bits of information and every single person in the frame of view walking around doing whatever, which, even if I'm a good person, I technically could use for other purposes, I instead build a system around, say, a machine learning classifier, an auditable piece of technology, that looks for whatever I'm supposed to be looking for, right? And I only see the frames, I only open up the bags, that I actually have to. Okay.
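As a rough illustration of the single-use idea, here is a short Python sketch with hypothetical names and toy data; the classifier is a stand-in, and the point is only that the human reviewer receives a 1-bit answer per item plus an audit log, rather than the raw stream.

```python
# A sketch of single-use accountability (hypothetical names, toy data): the
# reviewer never sees the raw stream, only the items an auditable classifier
# flags, and every decision is logged so the process itself can be audited.

def review_stream(frames, detects_threat):
    """`detects_threat` is an auditable classifier: frame -> bool."""
    audit_log = []
    flagged = []
    for i, frame in enumerate(frames):
        is_threat = bool(detects_threat(frame))
        audit_log.append((i, is_threat))   # reviewable record of each decision
        if is_threat:
            flagged.append((i, frame))     # only these ever reach a human
    return flagged, audit_log


# Toy usage: frames are dicts; the "classifier" here is a stand-in rule.
frames = [{"contains": ["laptop"]},
          {"contains": ["knife"]},
          {"contains": ["book"]}]
flagged, log = review_stream(frames, lambda f: "knife" in f["contains"])
print(len(flagged), "of", len(frames), "items shown to the reviewer")  # 1 of 3
```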
Speaker 2
53:18
This does 2 things. 1, it makes all of our accountability systems more privacy-preserving, which is great; it mitigates any potential dual or multi-use. 2, it means that holding people accountable in areas that were simply too off-limits before might become possible, right?
Speaker 2
53:46
1 of the things that was really challenging: we used to do email surveillance at Digital Reasoning, right? It was basically helping investment banks find insider traders, because they want to help enforce the laws. They get billion-dollar fines if anyone causes an infraction.
Speaker 2
54:03
But 1 of the things that was really difficult about developing these kinds of systems was that the data is so sensitive, right? We're talking about hundreds of millions of emails at some massive investment bank. There's so much private information in there that barely any of our data scientists were able to actually work with the data and try to make the system better. Right?
Speaker 2
54:24
And this makes it really, really difficult. Anyway, cool. So enough on that. Third 1, and this is the 1 I think is just incredibly exciting: end-to-end encrypted services.
Speaker 2
54:46
Everyone familiar with WhatsApp, Telegram, any of these? These are messaging apps, right? Where a message is encrypted on your phone and sent directly to someone else's phone, and only that person's phone can decrypt it, right?
Speaker 2
55:02
Which means that someone can provide a service, messaging, without the service provider seeing any of the information that they're actually providing the service over, right? Very powerful idea. The intuition here is that with a combination of machine learning, encrypted computation, and differential privacy, we could do the same thing for entire services. So imagine going to the doctor, okay?
Speaker 2
55:28
So you go to the doctor. This is really a computation between 2 different datasets. On the 1 hand, you have the dataset that the doctor has, which is their medical background, their knowledge of different procedures and diseases and tests and all that kind of stuff. And then you have your dataset, which is your symptoms, your medical history, the things you've recently eaten, your genes, your genetic predispositions, your heritage, those kinds of things, right? And you're bringing these 2 datasets together to compute a function.
Speaker 2
56:00
And that function is: what treatment should you have, if any? Okay. And the idea here is that there's this new field called structured transparency, which I guess I should probably mention.
Speaker 2
56:23
I'm not even sure you can call it a new field yet, because it's not in the literature, but it's been bouncing around a few different circles. And on the board, it's f(x, y), I'm not very good with chalk, sorry, and then this is z.
Speaker 2
56:48
Okay. So this is 2 different people providing their data, computing a function and an output. Differential privacy protects the output; encrypted computation, like the MPC we talked about earlier, protects the input, right? So it allows them to compute f(x, y) without revealing their inputs.
Speaker 2
57:15
Remember this? So basically: encrypt x, encrypt y, compute the function while everything is encrypted. Do we remember this?
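For readers who want the mechanics, here is a minimal additive secret sharing sketch in Python, the simplest flavor of the MPC idea being described; real protocols also handle multiplication, fixed-point encoding of model weights, and malicious parties, none of which is shown here.

```python
# A minimal additive secret sharing sketch: each input is split into random
# shares that sum to it modulo Q, the parties compute on shares, and only the
# final result is ever reconstructed. Neither raw input is ever revealed.
import random

Q = 2**61 - 1   # all arithmetic is modulo a large prime

def share(secret, n_parties=3):
    """Split `secret` into n random shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

def add_shared(x_shares, y_shares):
    """Each party adds the shares it holds locally; nobody sees x or y."""
    return [(xs + ys) % Q for xs, ys in zip(x_shares, y_shares)]

x, y = 25, 17                        # two parties' private inputs
x_shares, y_shares = share(x), share(y)
z_shares = add_shared(x_shares, y_shares)
print(reconstruct(z_shares))         # 42, computed without revealing x or y
```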
Speaker 2
57:22
Right? And so there are 3 processes here, right? There's input privacy, which is MPC, there's the logic, and then there's output privacy. And this is what you need to be able to do end-to-end encrypted services. Okay. So imagine: there are machine learning models that can now do skin cancer prediction, right? I can take a picture of my arm and send it to a machine learning model, and it will predict whether or not I have melanoma on my arm, right? Okay.
Speaker 2
57:53
So in this case, the machine learning model is perhaps owned by a hospital or a startup, and the image of my arm is mine. Okay? Encrypt both; the logic is done by the machine learning model. If the prediction is going to be published to the rest of the world, you use differential privacy on the output, but in this case the prediction can come back to me, and only I see the decrypted result. Okay. The implication is that the doctor role, facilitated by machine learning, can classify whether or not I have cancer, can provide this service, without anyone seeing my medical information. I can go to the doctor and get a prognosis without ever revealing my medical records to anyone, including the doctor. Right? Does that make sense?
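The remaining piece, output privacy, can be sketched with the Laplace mechanism from differential privacy: before any statistic computed over many patients is published to the rest of the world, noise calibrated to the query's sensitivity and a privacy budget epsilon is added. This is an illustrative sketch, not production code, and the names and numbers are made up.

```python
# A sketch of output privacy via the Laplace mechanism (illustrative data,
# not production code): before a statistic over many patients is published,
# noise scaled to the query's sensitivity and privacy budget epsilon is added.
import numpy as np

def private_count(records, predicate, epsilon=0.5):
    """A counting query has sensitivity 1, so the Laplace scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# E.g. publishing roughly how many melanoma-positive predictions the service made.
predictions = [{"melanoma": True}, {"melanoma": False}, {"melanoma": True}]
print(private_count(predictions, lambda p: p["melanoma"]))   # noisy count near 2
```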
Speaker 2
58:46
And if you believe that the services that are repeatable, that we do for millions and millions of people, can create a training dataset that we can then train a classifier on, then we should be able to upgrade them to be end-to-end encrypted. Does that make sense? So again, it's kind of big.
Speaker 2
59:06
It assumes that AI is smart enough to do it. There are lots of questions around quality and quality assurance and all these kinds of things that have to be addressed. There are very likely to be different institutions that we need. But I hope that these 3 big categories, and this is by no means comprehensive, will be sufficient for helping lay the groundwork for how each person could be empowered with sole control over the only copies of their information, while still receiving the same goods and services they've become accustomed to.
Speaker 2
59:39
Cool. Thanks. Questions. Let's do it.
Speaker 1
59:43
First, please give Andrew a big hand.
Speaker 1
59:55
Andrew, it was fascinating, really, really fascinating. An amazing set of ideas.