Speaker 1
00:00
Today, we're very happy to have Andrew Trask. He's a brilliant writer, researcher, tweeter, that's a word, in the world of machine learning and artificial intelligence. He is the author of Grokking Deep Learning, the book that I highly recommended in the lecture on Monday. He's the leader and creator of OpenMined, which is an open-source community that strives to make our algorithms, our data, and our world in general more privacy-preserving.
Speaker 1
00:31
He is coming to us by way of Oxford, but without that rich, complex, beautiful, sophisticated British accent, unfortunately. He is one of the best educators and truly one of the nicest people I know. So please give him a warm welcome.
Speaker 2
00:52
Thanks. That was a very generous introduction. So yeah, today we're going to be talking about Privacy-Preserving AI. This talk is going to come in 2 parts.
Speaker 2
00:59
So the first is going to be looking at privacy tools from the context of a data scientist or a researcher, like how their actual UX might change, because I think that's the best way to communicate some of the new technologies that are coming about in that context. Then we're going to zoom out and look, under the assumption that these kinds of technologies become mature, at what that is going to do to society. What sort of consequences or side effects could these kinds of tools have, both positive and negative?
Speaker 2
01:27
So first, let's ask the question, Is it possible to answer questions using data that we cannot see? This is going to be the key question that we look at today. And let's start with an example. So first, if we wanted to answer the question, what do tumors look like in humans?
Speaker 2
01:43
Well, this is a pretty complex question. Tumors are pretty complicated things. So we might train an AI classifier. If we wanted to do that, we would first need to download a dataset of tumor-related images.
Speaker 2
01:54
So we'd be able to statistically study these and be able to recognize what tumors look like in humans. But this kind of data is not very easy to come by. It's rarely collected, it's difficult to move around, and it's highly regulated. So we're probably going to buy it from a relatively small number of sources that are able to actually collect and manage this kind of information.
Speaker 2
02:16
The scarcity and constraints around this are likely to make it a relatively expensive purchase. If it's going to be an expensive purchase for us to answer this question, well, then we're going to need to find someone to finance our project. If we need someone to finance our project, we have to come up with a way to pay them back, which means a business plan. If we're going to create a business plan, then we have to find a business partner.
Speaker 2
02:32
If we're going to find a business partner, we have to spam all our classmates on LinkedIn looking for someone to start a business with us. And all of this is because we wanted to answer the question, what do tumors look like in humans? What if we wanted to answer a different question? What if we wanted to answer the question, what do handwritten digits look like?
Speaker 2
02:48
Well, this would be a totally different story, right? We'd download a dataset, we'd download a state-of-the-art training script from GitHub, we'd run it, and a few minutes later we'd have the ability to classify handwritten digits with potentially superhuman accuracy, if such a thing exists. So why is it so different between these two questions?
Speaker 2
03:11
The reason is that getting access to private data, data about people, is really, really hard. And as a result, we spend most of our time working on problems and tasks like this. So ImageNet, MNIST, CIFAR-10. Anybody who's trained a classifier on MNIST before, raise your hand.
Speaker 2
03:29
I expect pretty much everybody. Instead of working on problems like this. So, is anyone trying to classify or predict dementia, diabetes, Alzheimer's, depression, anxiety?
Speaker 2
03:49
No one. So why is it that we spend all our time on tasks like this, when these tasks represent our friends and loved ones and problems in society that really, really matter? Not to say that there aren't people working on these, there absolutely are, there are whole fields dedicated to them, but for the machine learning community at large, these tasks are pretty inaccessible.
Speaker 2
04:13
In fact, in order to work on one of these, you'd have to dedicate a portion of your life just to getting access to the data, whether that's doing a startup or joining a hospital or what have you, whereas other kinds of datasets are simply readily accessible. This brings us back to our question: is it possible to answer questions using data that we cannot see? So in this talk, we're going to walk through a few different techniques.
Speaker 2
04:44
If the answer to this question is yes, the combination of these techniques is going to try to make it so that we can actually pip install access to datasets like these in the same way that we pip install access to other deep learning tools. And the idea here is to lower the barrier to entry and increase the accessibility of some of the most important problems that we would like to address. So as Lex mentioned, I lead a community called OpenMined, which is an open-source community of a little over 6,000 people who are focused on lowering the barrier to entry to privacy-preserving AI and machine learning. Specifically, one of the tools we're working on, and the one we're talking about today, is called PySyft.
Speaker 2
05:21
PySyft extends the major deep learning frameworks with the ability to do privacy-preserving machine learning. So specifically today, we're going to be looking at the extensions into PyTorch. So, PyTorch, are people generally familiar with PyTorch? Yeah, quite a few users.
Speaker 2
05:36
It's my hope that by walking through a few of these tools, it'll become clear how we can start to be able to do data science, the act of answering questions, using data that we don't actually have direct access to. Then in the second half of the talk, we're going to generalize this to answering questions even if you're not necessarily a data scientist. So, the first tool is remote execution. Okay.
Speaker 2
05:59
So let me just walk you through this. We're going to jump into code for a minute, but hopefully this is fairly line-by-line and relatively simple. And even if you aren't familiar with PyTorch, I think it's relatively intuitive; we're looking at lists of numbers and these kinds of things. So up at the top, we import torch as a deep learning framework. PySyft extends Torch with this thing called TorchHook. All it's doing is iterating through the library and basically monkey-patching in lots of new functionality.
Speaker 2
06:22
And most deep learning frameworks are built around one core primitive, and that core primitive is the tensor, right? For those of you who don't know what tensors are, just think of them as nested lists of numbers for now, and that'll be good enough for this talk. But for us, we introduce a second core primitive, which is the worker, right?
Speaker 2
06:38
And a worker is a location within which computation is going to occur, right? So in this case, we have a virtualized worker that is pointing to, say, a hospital data center. The assumption we have is that this worker will allow us to run computation inside the data center without us actually having direct access to the machine itself.
Speaker 2
06:58
It gives us a limited, whitelisted set of methods that we can use on this remote machine.
Speaker 2
07:04
So just to give you an example, there's that core primitive we talked about a minute ago. We have a torch tensor, say 1, 3, 4, 5. The first method that we added is called just .send. This does exactly what you might expect: it takes the tensor, serializes it, sends it into the hospital data center, and returns back to me a pointer.
Speaker 2
07:22
Now, this pointer is really, really special. And for those of you who are actually familiar with deep learning frameworks, I hope this really resonates with you. Because it has the full PyTorch API as a part of it, but whenever you execute something using this pointer, instead of it running locally, even though it looks and feels like it's running locally, it actually executes on the remote machine and returns back to you another pointer to the result. The idea here being that I can now coordinate remote computations without necessarily having to have direct access to the machine.
Speaker 2
07:55
Of course, I can also call .get, and we'll see that this is actually really important: getting permissions around when you can do a .get request and actually ask for data from a remote machine to be sent back to you. So just remember that. Cool.
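To make that concrete, here is a minimal sketch of this workflow using the PySyft 0.2-era API that this part of the talk is describing; the worker name and tensor values are illustrative rather than taken from the slides.

```python
# A minimal sketch of the remote-execution workflow, assuming the PySyft 0.2-era API
# (syft.TorchHook, VirtualWorker, .send / .get). Names and values are illustrative.
import torch
import syft as sy

hook = sy.TorchHook(torch)                        # monkey-patches torch with .send/.get etc.
hospital = sy.VirtualWorker(hook, id="hospital")  # stands in for a remote hospital data center

x = torch.tensor([1, 3, 4, 5]).send(hospital)     # serialize and ship the tensor; keep a pointer
y = x + x                                         # executes remotely; returns another pointer

print(y)        # a pointer tensor referencing the result held by the "hospital" worker
print(y.get())  # .get() requests the actual values back -> tensor([ 2,  6,  8, 10])
```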
Speaker 2
08:07
So this is where we start. So in the Pareto principle, 80 percent for 20 percent, this is like the first big cut. So pros, data remains on a remote machine. We can now, in theory, do data science on a machine that we don't have access to, that we don't own, right?
Speaker 2
08:24
But the problem is, the first con we want to address is: how can we actually do good data science without physically seeing the data, right? It's all well and good to say, I'm going to train a deep learning classifier, but the process of answering questions is inherently iterative, right? It's inherently give and take: I learn a little bit and I ask a little bit, I learn a little bit and I ask a little bit, right? This brings me to the second tool.
Speaker 2
08:46
So, search and example data. Again, we're starting really simple; it will get more complex here in a minute. In this case, let's say we have what's called a grid.
Speaker 2
08:53
So PyGrid: if PySyft is the library, PyGrid is sort of the platform version. Again, this is all open-source Apache 2 stuff. Here we have what's called a grid client, which could be an interface to a large number of datasets inside of a big hospital, right?
Speaker 2
09:10
And so let's say I wanted to train a classifier to do something with diabetes, right? Say, to predict diabetes, or a certain kind of diabetes, or a certain attribute of diabetes. I should be able to perform a remote search, and I get back pointers to the remote information.
Speaker 2
09:27
I can get back detailed descriptions of what the information is without me actually looking at it: how it was collected, what the rows and columns are, what the types of the different values are, what ranges the values can take on, things that allow me to do remote normalization and so forth, and then in some cases even look at samples of this data. These samples could be human curated, they could be generated from a GAN, or they could be actual short snippets from the real dataset; maybe it's okay to release small amounts but not large amounts.
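As an illustration of this search-and-sample workflow, here is a small mock in plain Python. It is not the real PyGrid API; the class names, tags, and fields are invented purely to show the idea that the data scientist gets metadata and curated samples, never the raw rows.

```python
# Illustrative mock (not the actual PyGrid API) of remote search plus example data.
from dataclasses import dataclass, field


@dataclass
class RemoteDataset:
    tags: list
    description: str          # how it was collected, columns, types, value ranges
    curated_sample: list      # human-curated / GAN-generated / tiny approved snippet
    _private_rows: list = field(default_factory=list, repr=False)  # never exposed

    def sample(self):
        return self.curated_sample  # safe to release in small amounts


class Grid:
    def __init__(self, datasets):
        self.datasets = datasets

    def search(self, *query_tags):
        # returns dataset handles and metadata only, never the private rows themselves
        return [d for d in self.datasets if set(query_tags) & set(d.tags)]


grid = Grid([RemoteDataset(
    tags=["#diabetes", "#hospital-a"],
    description="10k patients; columns: age (18-90), bmi, hba1c, diagnosis (0/1)",
    curated_sample=[{"age": 54, "bmi": 31.2, "hba1c": 7.1, "diagnosis": 1}],
)])

for ds in grid.search("#diabetes"):
    print(ds.description)   # enough context for feature engineering and normalization
    print(ds.sample())
```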
Speaker 2
10:00
The reason I highlight this, this isn't like crazy complex stuff. So prior to going back to school, I used to work for a company called Digital Reasoning. We did sort of on-prem data science. So we delivered AI services to corporations behind the firewall.
Speaker 2
10:17
So we worked with classified information, we worked with investment banks helping prevent insider trading, and doing data science on data that your home team, back in Nashville in our case, is not able to see is really, really challenging. But there are some things that can give you the first big jump before you get into the more complex tools that handle the more challenging use cases. Cool. So basic remote execution, remote procedure calls, basic private search, and the ability to look at sample data give us enough general context to be able to start doing things like feature engineering and evaluating quality.
Speaker 2
10:54
So now the data remains on the remote machine, we can do some basic feature engineering, and here's where things get a little more complicated. If you remember, in the very first slide where I showed you some code, at the bottom I called .get on the tensor. What that did was it took the pointer to some remote information and said, hey, send that information to me. That is an incredibly important bottleneck.
Speaker 2
11:20
Unfortunately, despite the fact that I'm doing all my remote execution, if that's just naively implemented, well, I can just steal all the data that I want. I just call .get on whatever pointers I want, and there's no real added security. So what are we going to do about this? This brings us to tool number 3, called differential privacy. Differential privacy, who's come across it before? Raise your hands. A little higher? Okay. Cool. Awesome.
Speaker 2
11:43
Good. So I'm going to do a quick high-level overview of the intuition of differential privacy, then we're going to jump into how it can look in code, and I'll give you resources for a deeper dive into differential privacy at the end of the talk, should you be interested. So differential privacy, loosely stated, is a field that allows you to do statistical analysis without compromising the privacy of the dataset, right? More specifically, it allows you to query a database, right?
Speaker 2
12:14
While making certain guarantees about the privacy of the records contained within the database. So let me show you what I mean. Let's say we have an example database; this is the canonical DB if you look in the literature on differential privacy. It'll have one row per person and one column of zeros and ones, which correspond to true and false.
Speaker 2
12:32
We don't actually really care what those zeros and ones are indicating. It could be the presence of a disease, it could be male or female; it's just some sensitive attribute, something that's worth protecting, right? Our goal is to ensure that a statistical analysis doesn't compromise privacy. So what we're going to do is query this database, right?
Speaker 2
12:50
So we're gonna run some function over the entire database, and we're going to look at the result, and then we're gonna ask a very important question. We're going to ask, if I were to remove someone from this database, say John, would the output of my function change? If the answer to that is no, then intuitively we can say that, well, this output is not conditioned on John's private information. Now, if we could say that about everyone in the database, well then, okay, it would be a perfectly privacy-preserving query, but it might not be that useful.
Speaker 2
13:32
But this intuitive definition I think is quite powerful, right? The notion of: how can we construct queries that are invariant to removing someone or replacing them with someone else, okay? And the maximum amount that the output of a function can change as a result of removing or replacing one of the individuals is known as the sensitivity. Okay.
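Here is a tiny sketch of that sensitivity idea on an assumed toy database: re-run a count query with each person removed and take the largest change in the output.

```python
# Sensitivity of a counting query over the canonical 0/1 database (toy values).
db = [0, 1, 1, 0, 1, 0, 1, 1]        # one row per person, one sensitive 0/1 attribute

def query(database):
    return sum(database)             # a simple count query

full_result = query(db)
max_change = max(
    abs(full_result - query(db[:i] + db[i + 1:]))   # remove person i and re-run
    for i in range(len(db))
)
print(max_change)   # 1 -> the sensitivity of a count is 1:
                    # removing any single person changes the output by at most one
```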
Speaker 2
13:54
Important. So if you're reading the literature and you come across sensitivity, that's what we're talking about. So what do we do when we have a really sensitive function? We're going to take a bit of a sidestep for a minute.
Speaker 2
14:07
I have a twin sister who's finishing a PhD in political science. In political science, they often need to answer questions about very taboo behavior. Okay. Something that people are likely to lie about.
Speaker 2
14:20
So let's say I wanted to survey everyone in this room and answer the question: what percentage of you are secretly serial killers? Right? Not because I think any one of you is, but because I genuinely want to understand this trend. Right?
Speaker 2
14:38
I'm not trying to arrest people. I'm not trying to be an instrument of the criminal justice system. I'm trying to be a sociologist or a political scientist and understand this actual trend. The problem is, even if I sit down with each one of you in a private room and I say, I promise, I promise, I promise, I won't tell anybody, right?
Speaker 2
14:55
I'm still going to get a skewed distribution, right? There may be some people who are just going to think, why would I risk telling you this private information? And so what sociologists can do is this technique called randomized response, where, I should have brought a coin, you take a coin and you give it to each person before you survey them, right?
Speaker 2
15:13
And you ask them to flip it twice somewhere that you cannot see. So I would ask each one of you to flip a coin twice, somewhere that I cannot see. Then I would instruct you: if the first coin flip is heads, answer honestly. But if the first coin flip is tails, answer yes or no based on the second coin flip.
Speaker 2
15:38
Okay. So roughly half the time, you'll be honest, and the other half of the time, you'll be giving me a perfect 50-50 coin flip. And the cool thing is that what this is actually doing is taking whatever the true mean of the distribution is and averaging it with a 50-50 coin flip, right? So if, say, 55 percent of you answered yes, that you are a serial killer, then I know that the true center of the distribution is actually 60 percent, because it was 60 percent averaged with a 50-50 coin flip. Does that make sense? However, despite the fact that I can recover the center of the distribution, given enough samples, each individual person has plausible deniability. If you said yes, it could have been because you actually are, or it could have been because you just happened to flip a certain sequence of coin flips, okay?
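A small simulation of that randomized-response protocol, with an assumed true rate of 10 percent just for illustration, shows how the noisy answers can be de-biased back to the true mean while each individual answer stays deniable.

```python
# Randomized response: answer honestly on the first flip's heads, otherwise
# answer with the second flip. The true rate (10%) is an assumption for the demo.
import random

def randomized_response(true_answer: bool) -> bool:
    if random.random() < 0.5:      # first flip: heads -> answer honestly
        return true_answer
    return random.random() < 0.5   # tails -> answer with the second coin flip

n = 100_000
true_rate = 0.10
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]

observed = sum(answers) / n        # ~ 0.5 * true_rate + 0.25 (averaged with a 50-50 coin)
recovered = 2 * observed - 0.5     # de-bias: invert the averaging
print(round(observed, 3), round(recovered, 3))   # e.g. ~0.30 observed, ~0.10 recovered
# With observed = 0.55, this same formula gives 0.60, matching the example in the talk.
```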
Speaker 2
16:33
Now, this concept of adding noise to data to give plausible deniability is the secret weapon of differential privacy, right? And the field itself is a set of mathematical proofs for trying to do this as efficiently as possible: to add the smallest amount of noise and get the most accurate results with the best possible privacy protections, right? There is a meaningful base trade-off there, kind of a Pareto trade-off, and we're trying to push that trade-off down.
Speaker 2
17:09
So the field of research that is differential privacy is looking at how to add noise to data, and to the resulting queries, to give plausible deniability to the members of a database or a training dataset. Does that make sense? Now, a few terms that you should be familiar with. There's local and there's global differential privacy.
Speaker 2
17:31
Local differential privacy adds noise to data before it's sent to the statistician. So in this case, the one with the coin flip, this was local differential privacy. It affords you the best protection, because you never actually reveal your information in the clear to someone, right? And then there's global differential privacy, which says: okay, we're going to put everything in the database, perform a query, and then before the output of the query gets published, we're going to add a little bit of noise to that output, okay?
Speaker 2
17:58
This tends to have a much better privacy-accuracy trade-off, but you have to trust the database owner not to compromise the results. Okay. And we'll see there are some other things we can do there. But are you with me so far? This is a good point for questions if you have any.
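For the global flavor, a minimal sketch is the Laplace mechanism: the database owner computes the query on the raw data and releases only a noised answer, with the noise scale set by sensitivity over epsilon. The database contents and epsilon values below are made up for illustration.

```python
# Global differential privacy via the Laplace mechanism (toy data, assumed epsilons).
import numpy as np

db = np.random.randint(0, 2, size=5000)   # the 0/1 column from the canonical example

def private_count(database, epsilon, sensitivity=1.0):
    true_answer = database.sum()                        # computed on raw data, inside the trust boundary
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_answer + noise                          # only the noised answer is released

print(db.sum())                          # the exact count (never published)
print(private_count(db, epsilon=0.1))    # more noise, stronger privacy guarantee
print(private_count(db, epsilon=1.0))    # less noise, weaker guarantee
```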
Speaker 2
18:09
Got it. So the question is: is this verifiable? Is any of this differential privacy process verifiable? That is a fantastic question, and one that absolutely comes up in practice.
Speaker 2
18:21
So first, with local differential privacy, the nice thing is that everyone's doing it for themselves, right? So in that sense, if you're flipping your own coins and answering your own questions, that's your verification; you're kind of trusting yourself. For global differential privacy, stay tuned for the next tool and we'll come back to that.
Speaker 3
18:39
All right. So what does this look like in code?
Speaker 2
18:43
So first, we have a pointer to a remote private dataset and we call .get. Whoa, we get a big fat error, right? You just asked to see the raw value of some private data point, which you cannot do, right?
Speaker 2
18:53
Instead, you pass an epsilon into .get to add the appropriate amount of noise. So, one thing I haven't mentioned yet about differential privacy: I mentioned sensitivity, right? Sensitivity was related to the type of query, the type of function that we wanted to run, and its invariance to removing or replacing individual entries in the database.
Speaker 2
19:11
Epsilon is a measure of what we call our privacy budget, right? And what our privacy budget is saying is: what's the upper bound on the amount of statistical uniqueness that I'm going to allow to come out of this database? And actually, I'm going to take one more side track here, because I think it's really worth mentioning.
Speaker 2
19:32
Data anonymization. Anyone familiar with data anonymization, come across this term before? Taking a document and redacting the social security numbers and all that kind of stuff. By and large, it does not work.
Speaker 2
19:44
If you don't remember anything else from this talk: it is very dangerous to do just dataset anonymization. Okay? And differential privacy, in some respects, is the formal version of data anonymization. Instead of just saying, okay, I'm going to redact out these pieces and then I'll be fine, this is saying, okay, we can do a lot better.
Speaker 2
20:02
So for example, the Netflix Prize, the Netflix machine learning prize. If you remember this, it was a big million-dollar prize; maybe some people in here competed in it. In this prize, Netflix published an anonymized dataset of movies and users. They took all the movies and replaced them with numbers, they took all the users and replaced them with numbers, and then you just had sparsely populated movie ratings in this matrix, right?
Speaker 2
20:27
Seemingly anonymous, right? There are no names of any kind. But the problem is that each row is statistically unique, meaning it's kind of its own fingerprint.
Speaker 2
20:41
And so two months after the dataset was published, some researchers, at UT Austin I think it was, were able to go and scrape IMDB, basically create the same matrix from IMDB, and then just compare the two. And it turns out that people who were into movie rating were into movie rating in both places, watching movies at similar times, with similar patterns and similar tastes, right? And they were able to de-anonymize the first dataset with a high degree of accuracy.
Speaker 2
21:13
It happened again with medical records; there's a famous case of, I think, a Massachusetts politician being de-anonymized through very similar techniques. So one person goes and buys an anonymized medical dataset over here that has, say, birthdate and zip code, and this one has zip code and gender, and this one has zip code, gender, and whether or not you have cancer, right? And when you get all these together, you can start to use the uniqueness in each one to relink it all back together. This is so doable, to the extreme that I unfortunately know of companies whose business model is to buy anonymized datasets, de-anonymize them, and sell market intelligence to insurance companies.
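A toy version of that linkage attack, using entirely invented records, looks like this: two releases that each seem harmless share quasi-identifiers, and joining on them re-attaches names to the sensitive column.

```python
# Toy linkage attack: join two "anonymized" releases on shared quasi-identifiers.
hospital_release = [   # no names, but quasi-identifiers remain
    {"zip": "02139", "birth_year": 1952, "gender": "M", "diagnosis": "cancer"},
    {"zip": "02142", "birth_year": 1987, "gender": "F", "diagnosis": "healthy"},
]

voter_roll = [         # a second, public dataset with names and the same quasi-identifiers
    {"name": "J. Smith", "zip": "02139", "birth_year": 1952, "gender": "M"},
    {"name": "A. Jones", "zip": "02142", "birth_year": 1987, "gender": "F"},
]

def link(records_a, records_b, keys=("zip", "birth_year", "gender")):
    matches = []
    for a in records_a:
        for b in records_b:
            if all(a[k] == b[k] for k in keys):   # statistically unique combination
                matches.append({**a, **b})
    return matches

for row in link(hospital_release, voter_roll):
    print(row["name"], "->", row["diagnosis"])    # re-identified sensitive attribute
```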
Speaker 2
21:56
Ooh, right? But it can be done, okay? And the reason it can be done is that just because the dataset you are publishing, the one that you are physically looking at, doesn't seem like it has social security numbers and stuff in it, does not mean that there isn't enough unique statistical signal for it to be linked to something else. And so when I say a maximum amount of epsilon: epsilon is an upper bound on the statistical uniqueness that you're publishing from a dataset, right? And so what this tool represents is saying, okay, apply however much noise you need to, given whatever computational graph led back to private data for this tensor, right?
Speaker 2
22:38
To put an upper bound on the potential for linkage attacks, right? Now, if you set epsilon to 0, then that's effectively saying I'm only going to allow patterns that have occurred at least twice, right? Okay, meaning two different people had this pattern, and thus it's not unique to either one. Yes? So, what happens if you perform the query twice?
Speaker 2
23:01
So the random noise would be re-randomized and sent again, and you're absolutely correct. So this epsilon is how much I'm spending with this query. If I ran this three times, I would spend an epsilon of 0.3. Does that make sense? This is a 0.1 query; if I did it multiple times, the epsilons would sum. And so for any given data science project,
Speaker 2
23:19
what we're advocating is that you're given an epsilon budget that you're not allowed to exceed, no matter how many queries you run. Now, there's another subfield of differential privacy that's looking at single-query approaches, which is all around synthetic datasets: how can I perform one query against a whole dataset and create a synthetic dataset that has certain invariances that are desirable, right?
Speaker 2
23:41
So I can do good statistics on it, and then I can query it as many times as I want. Anyway, we don't have to get into that now.
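To make the budgeting idea concrete, here is a sketch of a per-project epsilon accountant; the 0.3 budget and the Laplace noise are assumptions for illustration, not how any particular library actually tracks it.

```python
# Sketch of a per-project privacy budget: epsilons sum, and queries are refused
# once the budget is exhausted. Budget and query costs are assumed values.
import numpy as np


class PrivacyBudgetExceeded(Exception):
    """Raised when a query would push the project past its epsilon budget."""


class BudgetedDataset:
    def __init__(self, data, max_epsilon=0.3):
        self.data = data
        self.max_epsilon = max_epsilon    # the project-level budget
        self.spent = 0.0

    def get(self, query, epsilon):
        if self.spent + epsilon > self.max_epsilon + 1e-12:
            raise PrivacyBudgetExceeded(f"spent {self.spent:.2f}, requested {epsilon:.2f}")
        self.spent += epsilon                             # epsilons sum across queries
        noise = np.random.laplace(0.0, 1.0 / epsilon)     # assumes a sensitivity-1 query
        return query(self.data) + noise


ds = BudgetedDataset([0, 1, 1, 0, 1], max_epsilon=0.3)
print(ds.get(sum, epsilon=0.1))   # ok: 0.1 spent
print(ds.get(sum, epsilon=0.1))   # ok: 0.2 spent
print(ds.get(sum, epsilon=0.1))   # ok: 0.3 spent
# ds.get(sum, epsilon=0.1)        # a fourth query would raise PrivacyBudgetExceeded
```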
Speaker 3
23:51
Does that answer your question?
Speaker 2
23:53
Cool. Awesome. So now you might think, okay, this is a lost cause: how can we be answering questions while holding back the statistical signal? But here's the difference: if I have a dataset and I want to know what causes cancer, I could query the dataset and learn that smoking causes cancer without learning whether any particular individual is or is not a smoker.
Speaker 2
24:16
Does that make sense? Right? And the reason for that is, is that I'm specifically looking for patterns that are occurring multiple times across different people. And this actually happens to really, closely mirror the type of generalization that we want in machine learning statistics anyways.
Speaker 2
24:32
Does that make sense? Like, as machine learning practitioners, we're actually not really interested in the one-offs, right? I mean, sometimes our models memorize things, this, this happens, right? But we're actually more interested in the things that are- the things that are not specific to you.
Speaker 2
24:46
I want the things that are going to work, the heart treatments that are going to work for everyone in this room. Obviously, if you need a heart treatment, I'd be happy for you to have one. But what we're chiefly interested in are the things that generalize, which is why this is realistic, and why, with continued effort on both the tooling and the theory side, we can have a much better reality than today.
Speaker 2
25:09
Cool. So, pros, just to review. First, remote execution allows data to remain on the remote machine; with search and sampling, we can feature engineer using toy data; and with differential privacy, we have a formal, rigorous privacy budgeting mechanism, right? Now, shoot.
Speaker 2
25:24
How is the privacy budget set? Is it defined by the user or is it defined by the dataset owner or someone else? This is a really, really interesting question actually. So first, it's definitely not set by the data scientist, because that would be a bit of a conflict of interest.
Speaker 2
25:41
And at- at first you might say, it should be the data owner, okay? So the hospital, right? That's trying to cover their butt, right? And make sure that their assets are protected both legally and commercially, right?
Speaker 2
25:54
So they're trying to make money off this, so there are proper incentives there. But the interesting thing, and this gets back to your question, is: what happens if I have, say, a radiology scan in two different hospitals, right? And they each spend one epsilon worth of my privacy in their hospital, right?
Speaker 2
26:18
That means that actually two epsilon of my private information is out there, right? Someone just has to be clever enough to go to both places to make the join. This is exactly the same mechanism we were talking about a second ago, when someone went from Netflix to IMDB. So the true answer to who should be setting epsilon budgets, although logistically it's going to be challenging, and we'll talk about this a little bit in part 2 of the talk, but I'm going a little slow.
Speaker 2
26:44
But okay. It should be us. It should be people, and it should be people setting it around their own information, right? You should be setting your personal epsilon budget.
Speaker 2
26:55
That makes sense? That's an aspirational goal. We've got a long way before we can get to that level of, of infrastructure around these kinds of things. And we can talk about that and we can definitely talk about more of that in the kind of question-answer session as well.
Speaker 2
27:09
But I think, in theory, that's what we would want. Okay. There are two weaknesses of this approach that we still have to address; someone asked about this. I think it was you.
Speaker 2
27:23
Yeah, you asked the question. So first, the data is safe but the model is put at risk. And what if we need to do a join? Actually, yours is a third one, which I should totally add to the slide.
Speaker 2
27:32
So first, if I'm sending my computations, my model, into the hospital to learn how to be a better cancer classifier, my model is put at risk. It's kind of a bummer if this is a $10 million healthcare model and I'm just sending it out to a thousand different hospitals to learn. So that's potentially risky. Second, what if I need to do a join or a computation across multiple different data owners who don't trust each other, right?
Speaker 2
27:54
Who sends whose data to whom, right? And thirdly, as you pointed out, how do I trust that these computations are actually happening the way that I am telling the remote machine they should happen? This brings me to my absolute favorite tool: secure multi-party computation.
Speaker 2
28:13
Come across this before? Raise them high. Okay, cool. Little bit above average.
Speaker 2
28:18
Most machine learning people have not heard about this yet, and I absolutely love it; this is the coolest thing I've learned about since learning about AI and machine learning. This is a really, really cool technique. Encrypted computation. How about homomorphic encryption?
Speaker 2
28:30
Have you come across homomorphic encryption? Okay, a few more. Yeah, this is related to that. So first, the textbook definition goes like this.
Speaker 2
28:39
If you go on Wikipedia, you'd see that secure MPC allows multiple people to combine their private inputs to compute a function, without revealing their inputs to each other, okay? But in the context of machine learning, the implication of this is that multiple different individuals can share ownership of a number, okay? Share ownership of a number. Let me show you what I mean.
Speaker 2
29:00
So let's say I have the number 5, my happy smiling face, and I split this into two shares, 2 and 3. Okay. I've got two friends, Bob and Marianne, and I give them these shares. They are now the shareholders of this number. Okay.
Speaker 2
29:18
And now I'm gonna go away. And this number is shared between them. Okay. And this, this gives us several desirable properties.
Speaker 2
29:27
First, it's encrypted, in the sense that neither Bob nor Marianne can tell what number is encrypted between them by looking at their own share by itself. Now, for those of you who are familiar with cryptographic math, I'm hand-waving over this a little bit. Decryption would typically be adding the shares together modulo a large prime, so the shares would typically look like large pseudo-random numbers, right?
Speaker 2
29:56
But for the sake of making it intuitive, I've picked pseudo-random numbers that are easy on the eyes. So first, these two values are encrypted, and second, we get shared governance, meaning that we cannot decrypt these numbers or do anything with them unless all of the shareholders agree. Okay? But the truly extraordinary part is that while this number is encrypted between these individuals, we can actually perform computation, right?
Speaker 2
30:26
So in this case, let's say we wanted to multiply the encrypted number by 2: each person can multiply their share by 2, and now they have an encrypted number 10, right? And there's a whole variety of protocols allowing you to do different functions, such as the functions needed for machine learning, while numbers are in this encrypted state, okay? I'll give you some more resources at the end if you're interested in learning more about this. Now, the big tie-in.
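Here is a small sketch of additive secret sharing with the modular arithmetic just mentioned; the modulus and the specific values are illustrative, and real protocols for multiplying two shared numbers together need extra machinery (for example, the crypto provider that comes up next).

```python
# Additive secret sharing: shares look random, sum to the secret mod Q, and
# support addition and scaling by public constants without reconstruction.
import random

Q = 2**61 - 1   # a large prime modulus (illustrative choice)

def share(secret, n_shares=2):
    shares = [random.randrange(Q) for _ in range(n_shares - 1)]
    shares.append((secret - sum(shares)) % Q)     # shares sum to the secret mod Q
    return shares

def reconstruct(shares):
    return sum(shares) % Q                        # requires every shareholder's cooperation

bob, marianne = share(5)
print(bob, marianne)                  # two large pseudo-random-looking numbers
print(reconstruct([bob, marianne]))   # 5

# Multiply the shared number by a public constant: each party scales their own share.
doubled = [(s * 2) % Q for s in (bob, marianne)]
print(reconstruct(doubled))           # 10

# Add two shared numbers: each party adds their shares locally.
a_shares, b_shares = share(20), share(22)
summed = [(a + b) % Q for a, b in zip(a_shares, b_shares)]
print(reconstruct(summed))            # 42
```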
Speaker 2
30:52
Models and datasets are just large collections of numbers, which we can individually encrypt and individually share governance over. Now, specifically to reference your question, there are two configurations of secure MPC: active and passive security. In the active security model, you can tell if anyone performs computation that you did not independently authorize, which is great. So what does this look like in practice when we go back to the code?
Speaker 2
31:18
So in this case, we don't need just one worker, it's not just one hospital, because we're looking to have shared governance, shared ownership, amongst multiple different individuals. So let's say we have Bob, Alice, and Tao, and a crypto provider which we won't go into now. I can take a tensor, and instead of calling .send and sending that tensor to someone else, now I call .share, and that splits each value into multiple different shares and distributes those amongst the shareholders, right? So in this case, Bob, Alice, and Tao.
Speaker 2
31:47
However, in the frameworks that we're working on, you still get kind of the same PyTorch-like interface, and all the cryptographic protocol happens under the hood. And the idea here is to make it so that we can sort of do encrypted machine learning without you necessarily having to be a cryptographer, right? And vice versa, cryptographers can improve the algorithms and machine learning people can automatically inherit them, right? So kind of classic sort of open-source machine learning library, making complex intelligence more accessible to people, if that makes sense.
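A minimal sketch of what that interface looked like in the PySyft 0.2-era API follows; the worker names mirror the example above, and the comment about models reflects the tutorial pattern of that version rather than the exact code on the slide.

```python
# Sketch of .share() under the PySyft 0.2-era API; all the SMPC protocol details
# happen under the hood while the interface stays PyTorch-like.
import torch
import syft as sy

hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
tao = sy.VirtualWorker(hook, id="tao")
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")

x = torch.tensor([1, 3, 4, 5]).share(bob, alice, tao, crypto_provider=crypto_provider)
y = x + x            # computed on secret shares; no single worker sees the values
print(y.get())       # all shareholders cooperate to decrypt -> tensor([ 2,  6,  8, 10])

# The same pattern extends to models in that version's tutorials:
# model.fix_precision().share(bob, alice, tao, crypto_provider=crypto_provider)
# enables encrypted prediction (and, more slowly, encrypted training).
```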
Speaker 2
32:16
And what we can do on tensors, we can also do on models. So we can do encrypted training, and encrypted prediction, and we're gonna get into, what kind of awesome use cases this opens up in a bit. And this is a nice set of features, right? In my opinion, this is, this is sort of the MVP of doing privacy preserving data science, right?
Speaker 2
32:39
The idea being that I can have remote access to a remote dataset; I can learn high-level latent patterns, like what causes cancer, without learning whether individuals have cancer; I can pull back just that high-level information, with formal mathematical guarantees over the filter it's coming back through, right? And I can work with datasets from multiple different data owners while making sure that each individual data owner is protected. Now, what's the catch?
Speaker 2
33:11
Okay. So first is computational complexity, right? Encrypted computation, secure MPC, involves sending lots of information over the network. I think the state of the art for deep learning prediction is about a 13x slowdown over plaintext, which is inconvenient but not deadly, right?
Speaker 2
33:32
But you do have to understand that assumes something like two AWS machines talking to each other, which is relatively fast. We also haven't had any hardware optimization yet; to the extent that NVIDIA did a lot for deep learning, there will probably be some sort of Cisco-like player that does something similar for encrypted, secure MPC-based deep learning, right? Let's see. So this brings us back to the fundamental question.
Speaker 2
33:57
Is it possible to answer questions using data we cannot see? The theory is absolutely there; that's something I feel reasonably confident saying about the theoretical frameworks we have. And actually, the other thing that's really worth mentioning here is that these techniques come from totally different fields, which is why they haven't necessarily been combined that much yet.
Speaker 2
34:13
I'll get more into that in a second. But it's my hope that by considering what these tools can do, it'll open up your eyes to the potential that, in general, we can have this new ability to answer questions using information that we don't actually own ourselves. Because from a sociological standpoint, that's net new for us as a species, if that makes sense. Previously, we had to have a trusted third party who would take all the information in themselves and make some sort of neutral decision, right?
Speaker 2
34:46
So we'll come to that in a second. One of the big long-term goals of our community is to make infrastructure for this that is secure enough and robust enough, and of course free and Apache 2 open-source licensed, so that information on the world's most important problems will be this accessible. Then we can spend less time working on tasks like that, and more time working on tasks like this. So this is going to be the breaking point between Part 1 and Part 2.
Speaker 2
35:17
Part 2 will be a bit shorter. But if you're interested in diving deeper on the technicals of this, here's a 6 or 7 hour course that I taught just on these concepts and on the tools. It's free on Udacity. Feel free to check it out.
Speaker 2
35:30
So the question was about how I specified that a model can be encrypted during training: is that the same as homomorphic encryption, or is that something else? A couple of years ago, there was a big burst in the literature around training on encrypted data, where you would homomorphically encrypt the dataset, and it turned out that enough statistical regularity was preserved that you could actually train on that dataset without decrypting it. This is similar to that, except one downside of that approach is that in order to use the model in the future, you still have to be able to encrypt data with the same key, which is often constraining in practice.
Speaker 2
36:11
Also, there's a pretty big hit to accuracy, because you're training on data that inherently has a lot of noise added to it. What I'm advocating for here is that instead we encrypt both the model and the dataset during training, but inside the encryption, inside the box, it's actually performing the same computations that it would be doing in plaintext. So you don't get any degradation in accuracy, and you don't get tied to one particular public-private key pair.
Speaker 2
36:37
Yeah, so the question was to comment on federated learning, specifically Google's implementation. I think Google's implementation is great. Obviously, the fact that they've shown this can be done with hundreds of millions of users is incredibly powerful; I mean, even inventing the term and creating momentum in that direction.
Speaker 2
36:55
One thing that is worth mentioning is that there are two forms of federated learning. One is sort of the one where your model is... federated learning, sorry, I ought to talk about what that is first. Okay.
Speaker 2
37:09
Yes, I'll do that quickly. So federated learning is basically the first thing I talked about: remote execution. Everyone has a smartphone; if you've got Android or iOS, you plug your phone in at night and it attaches to Wi-Fi.
Speaker 2
37:24
You know when you text and it recommends the next word? That model is trained using federated learning, meaning that it learns on your device how to do that better, and then that model gets uploaded to the cloud, as opposed to uploading all of your texts to the cloud and training one global model. Does that make sense?
Speaker 2
37:41
So you plug your phone in at night, the model comes down, trains locally, and goes back up; it's federated, right? That's basically what federated learning is in a nutshell. And it was pioneered by a team at Google, and they do really fantastic work.
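To make that "model comes down, trains locally, goes back up" loop concrete, here is a toy federated-averaging sketch in plain PyTorch; the model, the fake client data, and the number of rounds are all invented, and a real deployment adds client sampling, secure aggregation, and so on.

```python
# Toy federated averaging: each client trains a copy locally, only weights return.
import copy
import torch
from torch import nn, optim

global_model = nn.Linear(10, 2)   # stand-in for, e.g., a next-word model

def local_update(model, data, target, epochs=1):
    model = copy.deepcopy(model)                  # "model comes down" to the phone
    opt = optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(data), target)
        loss.backward()
        opt.step()                                # trains locally on the device's own data
    return model.state_dict()                     # only weights go back up, never the data

# Five fake "phones", each with its own private batch of data.
clients = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]

for rnd in range(3):
    updates = [local_update(global_model, x, y) for x, y in clients]
    averaged = {name: torch.stack([u[name] for u in updates]).mean(dim=0)
                for name in updates[0]}           # federated averaging of the weights
    global_model.load_state_dict(averaged)
    print(f"round {rnd}: aggregated {len(updates)} client updates")
```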
Speaker 2
37:55
They've paid down a lot of the technical debt, a lot of the technical risk around it, and they've published really great papers outlining how they do it, which is fantastic. What I outlined here is actually a slightly different style of federated learning. There's federated learning with a fixed dataset and a fixed model and lots of users, where the data is very ephemeral: phones are constantly logging in and logging off, you're plugging your phone in and then you're taking it out.
Speaker 2
38:25
That is one style of federated learning. It's really useful for product development: if you want to build a smartphone app that has a piece of intelligence in it, but getting access to the data to train that intelligence would be prohibitively difficult, or you just want the value proposition of protecting privacy, right? That's what that style of federated learning is good for.
Speaker 2
38:45
What I've outlined here is a bit more exploratory federated learning, where instead of the model being hosted in the cloud and data owners showing up and making it a bit smarter every once in a while, now the data is hosted at a variety of different private clouds, and data scientists show up and say, I want to do something with diabetes today, or I want to study dementia today, something like that, right? This is much more difficult, because the attack vectors for this are much larger, right? I'm trying to be able to answer arbitrary questions about arbitrary datasets in a protected environment. Right.
Speaker 2
39:21
So I think, yeah, those are my general thoughts on it. Does federated learning leak any information? So, the question was, does federated learning leak information. Federated learning by itself is not a secure protocol, and that's why you need this ensemble of techniques. It is perfectly possible for a federated learning model to simply memorize the dataset and then spit that back out later.
Speaker 2
39:43
You have to combine it with something like differential privacy in order to be able to prevent that from happening.
Speaker 2
39:47
Does that make sense? So just because the training is happening on my device does not mean it's not memorizing my data. Does that make sense?
Speaker 2
39:55
Okay. So now I want to zoom out and go a little less from the data science practitioner perspective. Now I'll take more the perspective of an economist or a political scientist, someone looking globally at: okay, if this becomes mature, what happens? Right?
Speaker 2
40:10
And, and this is where it gets really exciting. Anyone entrepreneurial? Anyone? Everyone?
Speaker 2
40:15
I don't know. No one? Okay. Cool.
Speaker 2
40:17
Well, this is the part for you. So, the big difference is this ability to answer questions using data you can't see. Because as it turns out, most people spend a great deal of their lives just answering questions, and a lot of it involves personal data. I mean, whether it's minute things like where's my water, where are my keys, or what movie should I watch tonight, or what kind of diet should I have to be able to sleep well, right? I mean, a wide variety of different questions, right?
Speaker 2
40:53
And we're limited in our answering ability by the information that we have, right? So this ability to answer questions using data we don't have is, I think, sociologically quite important. There are four different areas that I want to highlight as big groups of use cases for this technology, to help inspire you to see where this infrastructure can go. And actually, before I jump into that, has anyone been to Edinburgh?
Speaker 2
41:20
Edinburgh? Cool. Just tour like the castle and stuff like that. So my wife and I, this is my wife, Amber.
Speaker 2
41:28
We went to Edinburgh for the first time about six months ago, in September. And we did the underground tour, was it... we did a ghost tour. Yeah. We did a ghost tour, and it was really cool.
Speaker 2
41:46
There was one thing I took away from it. There was this point where we were standing, we had just walked out of the tunnels, and she was pointing up at some of the architecture. And then she started talking about the cobblestone streets and why the cobblestone streets are there. One of the main purposes of cobblestone streets was to lift you out of the muck. And the reason there was muck is that they didn't have any internal plumbing, so the sewage was just poured out into the street, right?
Speaker 2
42:17
Because you live in a big city, and this was the norm everywhere, right? And actually, I think she even implied that the invention, or the popularization, of the umbrella had less to do with actual rain and a bit more to do with buckets of stuff coming down from on high. Which is a whole different world, when you think about what that means.
Speaker 2
42:37
But the reason I bring this up is that, however many hundred years ago, people were walking through sludge; sewage was just everywhere, right? It was all over the place, people were walking through it everywhere they went, and they were wondering why they got sick, right? And it wasn't because they wanted it to be that way, it's just that it was a natural consequence of the technology they had at the time, right? This is not malice, this is not anyone being good or bad or evil or whatever; it's just the way things were.
Speaker 2
43:13
And I think there's a strong analogy to be made with how our data is handled as a society at the moment, right? We've just sort of walked into a society where new inventions have come up, new practical things, new uses for them, and now everywhere we go, we're constantly spreading and spewing our data all over the place, right? I mean, every camera that sees me walking down the street; goodness, there's a company that takes a whole picture of the Earth by satellite every day. Like, how the hell am I supposed to do anything without everyone following me around all the time, right? And I imagine that whoever it was, I'm not a historian so I don't really know, but whoever it was that said, what if we ran plumbing from every single apartment, business, school, maybe even some public toilets, underground, under our city, all to one location, and then processed it, used chemical treatments, and turned that into usable drinking water.
Speaker 2
44:14
Like, how laughable would that have been? It would have been just the most massive logistical infrastructure problem ever: to take a working city, dig up the whole thing, to take already-constructed buildings and run pipes through all of them. I mean, Oxford, gosh, there's a building there that's so old they don't have showers, because they didn't want to run the plumbing for them.
Speaker 2
44:37
You have to ladle water over yourself. It's in Merton College; it's quite famous, right? Anyway, the infrastructure challenges must have seemed absolutely massive.
Speaker 2
44:49
And so as I walk through four broad areas where things could theoretically be different based on this technology, I think it's probably going to hit you like, whoa, that's a lot of change. But I think the need is sufficiently great. I mean, if you view our lives as just one long process of answering important questions, whether it's where we're going to get food or what causes cancer, making sure that the right people can answer their questions, without data just getting spewed everywhere so that the wrong people can answer theirs,
Speaker 2
45:25
right, is important. And, yeah, anyway. So I know there's going to be a certain ridiculousness to maybe what some of this sounds like. But I hope that you at least see that, theoretically, the basic building blocks are there, and that what really stands between us and a world that's fundamentally different is adoption, maturing of the technology, and good engineering.
Speaker 2
45:51
Because I think once Sir Thomas Crapper invented the toilet, I do remember that one, at that point the basics were there, and what stood between them and modern sanitation was implementation, adoption, and engineering. I think that's where we are. The best part is we have companies like Google that have already paved the way with some very large rollouts of the early pieces of this technology.
Speaker 2
46:19
Cool. So what are the big categories? One I've already talked about: open data for science. This one is a really big deal.
Speaker 2
46:38
The reason it's a really big deal is mostly because everyone gets excited about making AI progress, right? Everyone gets super excited about superhuman ability in X, Y, or Z. When I started my PhD at Oxford, I worked for a professor named Phil Blunsom. The first thing he told me when I sat my butt down in his office on my first day as a student was: Andrew, everyone's going to want to work on models.
Speaker 2
47:00
But if you look historically, the biggest jumps in progress have happened when we had new big datasets or the ability to process new big datasets. And just to give a few anecdotes, ImageNet, right? ImageNet. GPUs allowing us to process larger datasets.
Speaker 2
47:16
Even things like AlphaGo: that's a synthetically generated, effectively infinite dataset. Or, I don't know, did anyone watch the AlphaStar live stream on YouTube? They talked about how it had trained on something like 200 years of StarCraft, right?
Speaker 2
47:33
Or if you look at Watson playing Jeopardy, that was on the heels of a new large structured dataset based on Wikipedia. Or if you look at Garry Kasparov and IBM's Deep Blue, that was on the heels of the largest open dataset of chess matches ever having been published online, right?
Speaker 2
47:55
There's this echo: big new dataset, big new breakthrough, big new dataset, big new breakthrough, right? And what we're talking about here is potentially several orders of magnitude more data, relatively quickly. And the reason for that is, I'm not saying we're going to invent a new machine, that machine is going to collect this data, and then it's going to come online. I'm saying there are thousands and thousands of enterprises, millions of smartphones, and hundreds of governments that all already have this data sitting inside of data warehouses.
Speaker 2
48:27
Largely untapped, for two reasons: one, legal risk, and two, commercial viability. If I give you a dataset, all of a sudden I've just doubled the supply. What does that do to my ability to bill for it?
Speaker 2
48:40
And there's the legal risk that you might do something bad with it that comes back to hurt me. With this category, I know it's just one phrase on the slide, but this is like ImageNet for every data task that's already been established, right? For example, we're working with a professor at Oxford in the psychology department who wants to study dementia. The problem with dementia is that every hospital has like five cases, right? It's not a very centralized disease.
Speaker 2
49:11
It's not like all the cancer patients go to one big center where all the technology is; dementia is sprinkled everywhere. And so the big thing that's blocking him as a dementia researcher is access to data, and so he's investing in private data science platforms. And I didn't persuade him to; I found him after he was already looking to do that.
Speaker 2
49:33
But pick any challenge where data is already being collected, and this can unlock not larger amounts of data in existence, but larger amounts of data that can be used together. Does that make sense? This is like a thousand startups right here. Instead of going out and trying to buy as many datasets as you can, which is a really hard and really expensive task,
Speaker 2
49:53
talk to anyone in Silicon Valley right now trying to do a data science startup, instead you go to each individual person that has a dataset and you say, hey, let me create a gateway between you and the rest of the world that's going to keep your data safe and allow people to leverage it. Right? That's a repeatable business model.
Speaker 2
50:12
Pick a use case, right? Be the radiology network gatekeeper, right? Okay. So enough on that one.
Speaker 2
50:21
But does it make sense how, on a huge variety of tasks, just the ability to have a data silo that you can do data science against is going to increase the accuracy of a huge variety of models really, really quickly? Cool? All right, second one. Oh, that's not right.
Speaker 2
50:44
Single-use accountability. This one's a little bit tricky. You get to the airport and you get your bag checked.
Speaker 2
51:07
Everyone's familiar with this process, I assume. What happens? Someone's sitting at a monitor, and they see all the objects in your bag. So that occasionally, they can spot objects that are dangerous or illicit, right?
Speaker 2
51:24
There's a lot of extra information leakage due to the fact that they have to sit and look at thousands of objects, basically searching every single person's bag totally and completely, just so that occasionally they can find that one. The question they actually want to answer is: is there anything dangerous in this bag?
Speaker 2
51:42
But in order to answer it, they have to basically acquire access to the whole bag, right? So let's think about the same approach of answering questions using data we can't see. The best example of this in the analog world is a sniffing dog.
Speaker 2
51:59
Some of you have had sniffing dogs give your bag a whiff at the airport, right? This is actually a really privacy-preserving thing, because dogs don't speak English or any other language. So the dog comes by, nope, everything's fine, and it moves on. The dog has the ability to reveal only 1 bit of information, without anyone having to search every single bag.
Speaker 2
52:25
Okay. That is what I mean when I say a single-use accountability system. It means I am looking at some data stream because I'm holding someone accountable, right? We want to make it so that I can only answer the question that I claim to be looking into.
Speaker 2
52:41
So if this is a video feed, for example: instead of getting access to the raw video feed, with its millions of bits of information and every single person in the frame of view walking around doing whatever, which, even if I'm a good person, I technically could use for other purposes, I instead build a system around, say, a machine learning classifier, an auditable piece of technology, that looks for whatever I'm supposed to be looking for, right? And I only see the frames, I only open up the bags, that I actually have to. Okay.
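As a rough illustration of the single-use idea, here is a short Python sketch with hypothetical names and toy data; the classifier is a stand-in, and the point is only that the human reviewer receives a 1-bit answer per item plus an audit log, rather than the raw stream.

```python
# A sketch of single-use accountability (hypothetical names, toy data): the
# reviewer never sees the raw stream, only the items an auditable classifier
# flags, and every decision is logged so the process itself can be audited.

def review_stream(frames, detects_threat):
    """`detects_threat` is an auditable classifier: frame -> bool."""
    audit_log = []
    flagged = []
    for i, frame in enumerate(frames):
        is_threat = bool(detects_threat(frame))
        audit_log.append((i, is_threat))   # reviewable record of each decision
        if is_threat:
            flagged.append((i, frame))     # only these ever reach a human
    return flagged, audit_log


# Toy usage: frames are dicts; the "classifier" here is a stand-in rule.
frames = [{"contains": ["laptop"]},
          {"contains": ["knife"]},
          {"contains": ["book"]}]
flagged, log = review_stream(frames, lambda f: "knife" in f["contains"])
print(len(flagged), "of", len(frames), "items shown to the reviewer")  # 1 of 3
```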
Speaker 2
53:18
This does 2 things. 1, it makes all of our accountability systems more privacy-preserving, which is great; it mitigates any potential dual or multi-use. 2, it means that holding people accountable in areas that were simply too off-limits before might become possible, right?
Speaker 2
53:46
1 of the things that was really challenging: we used to do email surveillance at Digital Reasoning, right? It was basically helping investment banks find insider traders, because they want to help enforce the laws. They get billion-dollar fines if anyone causes an infraction.
Speaker 2
54:03
But 1 of the things that was really difficult about developing these kinds of systems was that the data is so sensitive, right? We're talking about hundreds of millions of emails at some massive investment bank. There's so much private information in there that barely any of our data scientists were able to actually work with the data and try to make the system better. Right?
Speaker 2
54:24
And this makes it really, really difficult. Anyway, cool. So enough on that. Third 1, and this is the 1 I think is just incredibly exciting: end-to-end encrypted services.
Speaker 2
54:46
Everyone familiar with WhatsApp, Telegram, any of these? These are messaging apps, right? Where a message is encrypted on your phone and sent directly to someone else's phone, and only that person's phone can decrypt it, right?
Speaker 2
55:02
Which means that someone can provide a service, messaging, without the service provider seeing any of the information that they're actually providing the service over, right? Very powerful idea. The intuition here is that with a combination of machine learning, encrypted computation, and differential privacy, we could do the same thing for entire services. So imagine going to the doctor, okay?
Speaker 2
55:28
So you go to the doctor. This is really a computation between 2 different datasets. On the 1 hand, you have the dataset that the doctor has, which is their medical background, their knowledge of different procedures and diseases and tests and all that kind of stuff. And then you have your dataset, which is your symptoms, your medical history, the things you've recently eaten, your genes, your genetic predispositions, your heritage, those kinds of things, right? And you're bringing these 2 datasets together to compute a function.
Speaker 2
56:00
And that function is: what treatment should you have, if any? Okay. And the idea here is that there's this new field called structured transparency, which I guess I should probably mention.
Speaker 2
56:23
I'm not even sure you can call it a new field yet, because it's not in the literature, but it's been bouncing around a few different circles. And on the board, it's f(x, y), I'm not very good with chalk, sorry, and then this is z.
Speaker 2
56:48
Okay. So this is 2 different people providing their data, computing a function and an output. Differential privacy protects the output; encrypted computation, like the MPC we talked about earlier, protects the input, right? So it allows them to compute f(x, y) without revealing their inputs.
Speaker 2
57:15
Remember this? So basically: encrypt x, encrypt y, compute the function while everything is encrypted. Do we remember this?
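For readers who want the mechanics, here is a minimal additive secret sharing sketch in Python, the simplest flavor of the MPC idea being described; real protocols also handle multiplication, fixed-point encoding of model weights, and malicious parties, none of which is shown here.

```python
# A minimal additive secret sharing sketch: each input is split into random
# shares that sum to it modulo Q, the parties compute on shares, and only the
# final result is ever reconstructed. Neither raw input is ever revealed.
import random

Q = 2**61 - 1   # all arithmetic is modulo a large prime

def share(secret, n_parties=3):
    """Split `secret` into n random shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

def add_shared(x_shares, y_shares):
    """Each party adds the shares it holds locally; nobody sees x or y."""
    return [(xs + ys) % Q for xs, ys in zip(x_shares, y_shares)]

x, y = 25, 17                        # two parties' private inputs
x_shares, y_shares = share(x), share(y)
z_shares = add_shared(x_shares, y_shares)
print(reconstruct(z_shares))         # 42, computed without revealing x or y
```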
Speaker 2
57:22
Right? And so there are 3 processes here, right? There's input privacy, which is MPC, there's the logic, and then there's output privacy. And this is what you need to be able to do end-to-end encrypted services. Okay. So imagine: there are machine learning models that can now do skin cancer prediction, right? I can take a picture of my arm and send it to a machine learning model, and it will predict whether or not I have melanoma on my arm, right? Okay.
Speaker 2
57:53
So in this case, the machine learning model is perhaps owned by a hospital or a startup, and the image of my arm is mine. Okay? Encrypt both; the logic is done by the machine learning model. If the prediction is going to be published to the rest of the world, you use differential privacy on the output, but in this case the prediction can come back to me, and only I see the decrypted result. Okay. The implication is that the doctor role, facilitated by machine learning, can classify whether or not I have cancer, can provide this service, without anyone seeing my medical information. I can go to the doctor and get a prognosis without ever revealing my medical records to anyone, including the doctor. Right? Does that make sense?
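The remaining piece, output privacy, can be sketched with the Laplace mechanism from differential privacy: before any statistic computed over many patients is published to the rest of the world, noise calibrated to the query's sensitivity and a privacy budget epsilon is added. This is an illustrative sketch, not production code, and the names and numbers are made up.

```python
# A sketch of output privacy via the Laplace mechanism (illustrative data,
# not production code): before a statistic over many patients is published,
# noise scaled to the query's sensitivity and privacy budget epsilon is added.
import numpy as np

def private_count(records, predicate, epsilon=0.5):
    """A counting query has sensitivity 1, so the Laplace scale is 1/epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# E.g. publishing roughly how many melanoma-positive predictions the service made.
predictions = [{"melanoma": True}, {"melanoma": False}, {"melanoma": True}]
print(private_count(predictions, lambda p: p["melanoma"]))   # noisy count near 2
```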
Speaker 2
58:46
And if you believe that the services that are repeatable, that we do for millions and millions of people, can create a training dataset that we can then train a classifier on, then we should be able to upgrade them to be end-to-end encrypted. Does that make sense? So again, it's kind of big.
Speaker 2
59:06
It assumes that AI is smart enough to do it. There are lots of questions around quality and quality assurance and all these kinds of things that have to be addressed. There are very likely to be different institutions that we need. But I hope that these 3 big categories, and this is by no means comprehensive, will be sufficient for helping lay the groundwork for how each person could be empowered with sole control over the only copies of their information, while still receiving the same goods and services they've become accustomed to.
Speaker 2
59:39
Cool. Thanks. Questions. Let's do it.
Speaker 1
59:43
First, please give Andrew a big hand.
Speaker 1
59:55
Andrew, it was fascinating, really, really fascinating. An amazing set of ideas.