Speaker 1
00:00
The following is a conversation with Marcus Hutter, senior research scientist at Google DeepMind. Throughout his career of research, including with Juergen Schmidhuber and Shane Legg, he has proposed a lot of interesting ideas in and around the field of artificial general intelligence, including the development of the AIXI model, spelled A-I-X-I, which is a mathematical approach to AGI that incorporates ideas of Kolmogorov complexity, Solomonoff induction, and reinforcement learning. In 2006, Marcus launched the 50,000 euro Hutter Prize for lossless compression of human knowledge. The idea behind this prize is that the ability to compress well is closely related to intelligence.
Speaker 1
00:47
This to me is a profound idea. Specifically, if you can compress the first 100 megabytes or 1 gigabyte of Wikipedia better than your predecessors, your compressor likely has to also be smarter. The intention of this prize is to encourage the development of intelligent compressors as a path to AGI. In conjunction with this podcast release just a few days ago, Marcus announced a 10X increase in several aspects of this prize, including the money, to 500,000 euros.
Speaker 1
01:22
The better your compressor works relative to the previous winners, the higher fraction of that prize money is awarded to you. You can learn more about it if you Google simply, Hutter Prize. I'm a big fan of benchmarks for developing AI systems, and the Hutter Prize may indeed be one that will spark some good ideas for approaches that will make progress on the path of developing AGI systems. This is the Artificial Intelligence Podcast.
Speaker 1
01:50
If you enjoy it, subscribe on YouTube, give it five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter at Lex Fridman, spelled F-R-I-D-M-A-N. As usual, I'll do one or two minutes of ads now, and never any ads in the middle that can break the flow of the conversation. I hope that works for you and doesn't hurt the listening experience. This show is presented by Cash App, the number one finance app in the App Store.
Speaker 1
02:17
When you get it, use code LEXPODCAST. Cash App lets you send money to friends, buy Bitcoin, and invest in the stock market with as little as $1. Brokerage services are provided by Cash App investing, a subsidiary of Square, and member SIPC. Since Cash App allows you to send and receive money digitally, peer-to-peer, and security in all digital transactions is very important, let me mention the PCI data security standard that Cash App is compliant with.
Speaker 1
02:47
I'm a big fan of standards for safety and security. PCI DSS is a good example of that, where a bunch of competitors got together and agreed that there needs to be a global standard around the security of transactions. Now we just need to do the same for autonomous vehicles and AI systems in general. So again, if you get Cash App from the App Store or Google Play and use the code LEXPODCAST, you'll get $10 and Cash App will also donate $10 to FIRST, one of my favorite organizations that is helping to advance robotics and STEM education for young people around the world.
Speaker 1
03:27
And now, here's my conversation with Marcus Hutter.
Speaker 2
03:32
Do you think of the universe as a computer or maybe an information processing system? Let's go with a big question first.
Speaker 3
03:39
Okay, I'll go with the big question first. I think it's a very interesting hypothesis or idea. And I have a background in physics, so I know a little bit about physical theories, the standard model of particle physics and general relativity theory.
Speaker 3
03:54
And they are amazing and describe virtually everything in the universe. And they're all, in a sense, computable theories. I mean, they're very hard to compute. And, you know, it's very elegant, simple theories which describe virtually everything in the universe.
Speaker 3
04:07
So there's a strong indication that somehow the universe is computable. At least, it's a plausible hypothesis.
Speaker 2
04:17
So why do you think, just like you said, general relativity, quantum field theory, why do you think that the laws of physics are so nice and beautiful and simple and compressible? Do you think our universe was designed, or is naturally this way? Are we just focusing on the parts that are especially compressible? Do human minds just enjoy something about that simplicity, and in fact there are other things that are not so compressible?
Speaker 3
04:46
No, I strongly believe, and I'm pretty convinced, that the universe is inherently beautiful, elegant and simple and described by these equations, and we're not just picking that. I mean, if there were some phenomena which cannot be neatly described, scientists would try that, right? And there's biology, which is more messy, but we understand that it's an emergent phenomenon, and these are complex systems, but they still follow the same rules, right, of quantum electrodynamics.
Speaker 3
05:14
All of chemistry follows that and we know that. I mean, we cannot compute everything because we have limited computational resources. No, I think it's not a bias of the humans, but it's objectively simple. I mean, of course, you never know, you know, maybe there's some corners very far out in the universe or super, super tiny below the nucleus of atoms or, well, parallel universes which are not nice and simple, but there's no evidence for that.
Speaker 3
05:40
And we should apply Occam's razor and choose the simplest theory consistent with it. But also, it's a little bit self-referential.
Speaker 2
05:47
So maybe a quick pause, what is Occam's Razor?
Speaker 3
05:50
So Occam's razor says that you should not multiply entities beyond necessity, which, if you translate it to proper English in the scientific context, means that if you have two theories or hypotheses or models which equally well describe the phenomenon you study, or the data, you should choose the simpler one.
Speaker 2
06:13
So that's just the principle?
Speaker 3
06:15
Yes.
Speaker 2
06:15
Or sort of, that's not like a provable law, perhaps. Perhaps we'll kind of discuss it and think about it, but what's the intuition of why the simpler answer is the one that is likelier to be a more correct descriptor of whatever we're talking about?
Speaker 3
06:35
I believe that Occam's razor is probably the most important principle in science. I mean, of course, we need logical deduction and we do experimental design, but science is about understanding the world, finding models of the world, and we can come up with crazy complex models which explain everything but predict nothing. But the simple models seem to have predictive power, and it's a valid question why. And there are two answers to that.
Speaker 3
07:06
You can just accept it, that is the principle of science, and we use this principle and it seems to be successful. We don't know why, but it just happens to be. Or you can try, you know, find another principle which explains Occam's razor. And if we start with the assumption that the world is governed by simple rules, then there's a bias towards simplicity.
Speaker 3
07:31
And applying Occam's razor is the mechanism to finding these rules. And actually, in a more quantitative sense, and we come back to that later in the case of Solomonoff induction, you can rigorously prove that: if you assume that the world is simple, then Occam's razor is the best you can do in a certain sense.
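To make the quantitative version of this concrete, here is a toy two-part-code (MDL) sketch in Python. It is an illustration in the spirit of what is described, not Hutter's formalism; the model description lengths are hand-assigned assumptions, invented for the example.

```python
import math

# Toy two-part-code (MDL) scoring: bits to describe the model itself,
# plus bits to describe the data under the model (-log2 likelihood).
# Occam's razor is the first term; fit to the data is the second.
# The model description lengths below are hand-assigned for illustration.

data = "110110110110110110"

def coin_bits(data, p):
    # -log2 P(data | coin with P(bit=1) = p)
    return sum(-math.log2(p if b == "1" else 1 - p) for b in data)

candidates = [
    # (name, model description bits, data bits given the model)
    ("repeat '110'",      8.0, 0.0),                    # fits the data exactly
    ("fair coin p=1/2",   1.0, coin_bits(data, 0.5)),
    ("biased coin p=2/3", 4.0, coin_bits(data, 2 / 3)),
    ("memorize the data", float(len(data)), 0.0),       # model *is* the data
]

for name, model_bits, data_bits in candidates:
    print(f"{name:20s} total = {model_bits + data_bits:5.1f} bits")
# "repeat '110'" gets the shortest total code: the Occam-preferred model.
```

Equally good fits favor the shorter model, and a better fit can buy back some model complexity, which is the trade-off being described.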
Speaker 2
07:49
So I apologize for the romanticized question, but why do you think, outside of its effectiveness, why do you think we find simplicity so appealing as human beings? Why does E equals MC squared seem so beautiful to us humans?
Speaker 3
08:08
I guess mostly, in general, many things can be explained by an evolutionary argument. And there are some artifacts in humans which are just artifacts and not evolutionarily necessary. But with this beauty and simplicity, I believe at least the core is about, like science, finding regularities in the world, understanding the world, which is necessary for survival.
Speaker 3
08:37
If I look at a bush and I just see noise, and there is a tiger and it eats me, then I'm dead. But if I try to find a pattern... And we know that humans are prone to find more patterns in data than there are, you know, like the Mars face and all these things. But this bias towards finding patterns, even if they are not there, but I mean, it's best of course if they are, yeah? It helps us for survival.
Speaker 2
09:06
Yeah, that's fascinating. I haven't thought really about the, I thought I just loved science, but indeed, in terms of just survival purposes, there is an evolutionary argument for why we find the work of Einstein so beautiful.
Speaker 2
09:24
Maybe a quick small tangent, could you describe what Solomonoff induction is?
Speaker 3
09:24
Yeah, so that's a theory which I claim, and Ray Solomonoff sort of claimed a long time ago, that this solves the big philosophical problem of induction. And I believe the claim is essentially true.
Speaker 3
09:45
And what it does is the following. So, okay, for the picky listener, induction can be interpreted narrowly and widely. Narrowly means inferring models from data. And widely means also then using these models for doing predictions, so prediction is also part of induction.
Speaker 3
10:06
So I'm a little sloppy sort of with the terminology and maybe that comes from Ray Solomonoff, you know, being sloppy, maybe I shouldn't say that. He can't complain anymore. So let me explain a little bit this theory in simple terms. So assume you have a data sequence.
Speaker 3
10:24
Make it very simple. The simplest one, say, 1, 1, 1, 1, 1, and you see 100 ones. What do you think comes next? The natural answer, I'm going to speed up a little bit, the natural answer is of course, you know, 1.
Speaker 3
10:35
And the question is why? Well, we see a pattern there. There's a 1 and we repeat it. And why should it suddenly after 100 ones be different?
Speaker 3
10:45
So what we're looking for is simple explanations or models for the data we have. And now the question is, a model has to be presented in a certain language. In which language do we use? In science, we want formal languages, and we can use mathematics, or we can use programs on a computer.
Speaker 3
11:04
So abstractly on a Turing machine, for instance, or it can be a general-purpose computer. And there are of course lots of models. You can say maybe it's 100 ones and then 100 zeros and then 100 ones; that's a model, right? But there are simpler models.
Speaker 3
11:16
There's the model: print 1 in a loop. That also explains the data. And if you push that to the extreme, you are looking for the shortest program which, if you run this program, reproduces the data you have. It will not stop.
Speaker 3
11:31
It will continue, naturally. And this you take for your prediction. And on the sequence of ones, it's very plausible, right, that print 1 in a loop is the shortest program. We can give some more complex examples, like 1, 2, 3, 4, 5.
Speaker 3
11:45
What comes next? The short program is again, you know, a counter. And so that is, roughly speaking, how Solomonoff induction works. The extra twist is that it can also deal with noisy data.
Speaker 3
11:57
So if you have, for instance, a coin flip, say a biased coin which comes up heads with 60% probability, then it will learn and figure this out, and after a while it predicts, oh, the next coin flip will be heads with probability 60%. So it's the stochastic version of that.
Speaker 2
12:15
But the goal is, the dream is, always the search for the short program.
Speaker 3
12:18
Yes, yeah. Well, in Solomonoff induction, precisely what you do is, so you combine... so looking for the shortest program is like applying Occam's razor, like looking for the simplest theory. There's also Epicurus' principle, which says, if you have multiple hypotheses which equally well describe your data, don't discard any of them.
Speaker 3
12:36
Keep all of them around, you never know. And you can put it together and say, okay, I have a bias towards simplicity, but I don't rule out the larger models. And technically what we do is we weigh the shorter models higher and the longer models lower. And you use a Bayesian technique: you have a prior, which is precisely 2 to the minus the complexity of the program, and you weigh all these hypotheses and take this mixture, and then you also get this stochasticity in.
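Here is a minimal sketch of that weighting, assuming a tiny hand-picked hypothesis class with invented program lengths; the real mixture ranges over all programs on a universal machine.

```python
# A finite toy version of the Solomonoff-style mixture: weigh each model
# by 2^(-program length) times its likelihood on the data seen so far,
# then predict with the weighted average. The "program lengths" are
# invented for illustration.

data = "1" * 100  # we have seen 100 ones

models = [
    # (name, assumed program length in bits, P(next bit = 1) under the model)
    ("print 1 forever", 5, 1.0 - 1e-9),
    ("fair coin",       8, 0.5),
    ("print 0 forever", 5, 1e-9),
]

weights = []
for _, length, p1 in models:
    prior = 2.0 ** -length                 # Occam prior: shorter is likelier
    likelihood = 1.0
    for b in data:
        likelihood *= p1 if b == "1" else 1.0 - p1
    weights.append(prior * likelihood)

z = sum(weights)
p_next = sum(w / z * p1 for w, (_, _, p1) in zip(weights, models))
print(f"P(next bit = 1) ~= {p_next:.6f}")  # ~1: "print 1 forever" dominates
```

After 100 ones, the short deterministic model carries essentially all the posterior weight, which is the Epicurus-plus-Occam combination just described.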
Speaker 2
13:07
Yeah, like many of your ideas, that's just a beautiful idea of weighing based on the simplicity of the program. I love that. That seems to me maybe a very human-centric concept, seems to be a very appealing way of discovering good programs in this world.
Speaker 2
13:25
You've used the term compression quite a bit. I think it's a beautiful idea. We just talked about simplicity, and maybe science, or just all of our intellectual pursuits, is basically the attempt to compress the complexity all around us into something simple. So what does this word mean to you, compression?
Speaker 3
13:50
I essentially have already explained it. So compression means for me, finding short programs for the data or the phenomenon at hand. You could interpret it more widely as finding simple theories which can be mathematical theories or maybe even informal, like just in words.
Speaker 3
14:09
Compression means finding short descriptions, explanations, programs for the data.
Speaker 2
14:15
Do you see science as a kind of our human attempt at compression? So we're speaking more generally, because when you say programs, you're kind of zooming in on a particular sort of, almost like a computer science, artificial intelligence focus, but do you see all of human endeavor as a kind of compression?
Speaker 3
14:34
Well, at least all of science I see as an endeavor of compression, not all of humanity maybe. And well, there are also some other aspects of science like experimental design, right? I mean, we create experiments specifically to get extra knowledge.
Speaker 3
14:49
And this is, that is then part of the decision making process. But once we have the data to understand the data is essentially compression. So I don't see any difference between compression, understanding and prediction.
Speaker 2
15:05
So we're jumping around topics a little bit, but returning back to simplicity, a fascinating concept of Kolmogorov complexity. So in your sense, do most objects in our mathematical universe have high Kolmogorov complexity? And maybe, first of all, what is Kolmogorov complexity?
Speaker 3
15:25
Okay, Kolmogorov complexity is a notion of simplicity or complexity, and it takes the compression view to the extreme. So I explained before that if you have some data sequence, just think about a file on a computer, that's just, you know, a string of bits. And we have data compressors, like we compress big files into zip files with certain compressors. And you can also produce self-extracting archives; that means an executable which, if you run it, reproduces your original file without needing an extra decompressor.
Speaker 3
16:02
It's just the decompressor plus the archive together in one. And now there are better and worse compressors, and you can ask, what is the ultimate compressor? So what is the shortest possible self-extracting archive you could produce for a certain data set, which reproduces the data set? And the length of this is called the Kolmogorov complexity. And arguably, that is the information content in the data set.
Speaker 3
16:27
I mean, if the data set is very redundant or very boring, you can compress it very well, so the information content should be low, and it is low according to this definition.
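Kolmogorov complexity itself is uncomputable, but any off-the-shelf compressor gives an upper bound, which is enough to see the idea. A small sketch:

```python
import os
import zlib

# Kolmogorov complexity is uncomputable, but any real compressor gives an
# upper bound: K(x) <= compressed size + size of the fixed decompressor.
# Compare a redundant string with a random-looking one, in the spirit of
# the self-extracting-archive picture above.

boring = b"ab" * 5000            # highly redundant: low information content
random_ish = os.urandom(10000)   # incompressible with overwhelming probability

for name, x in [("boring", boring), ("random", random_ish)]:
    print(f"{name}: {len(x)} bytes -> {len(zlib.compress(x, 9))} compressed")
```

The redundant string shrinks enormously while the random bytes barely shrink at all, matching the definition: low information content means short description.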
Speaker 2
16:36
So it's the length of the shortest program that summarizes the data?
Speaker 3
16:41
Yes, yeah.
Speaker 2
16:42
And what's your sense of our universe, when we think about the different objects in our universe, concepts or whatever, at every level: do they have high or low Kolmogorov complexity? So what's the hope? Do we have a lot of hope in being able to summarize much of our world?
Speaker 3
17:05
That's a tricky and difficult question. So, as I said before, I believe that the whole universe, based on the evidence we have, is very simple. So it has a very short description.
Speaker 2
17:19
Sorry, to linger on that, the whole universe, what does that mean? Do you mean at the very basic fundamental level, in order to create the universe?
Speaker 3
17:19
Yes, yeah. You need a very short program and you run it.
Speaker 2
17:32
To get the thing going. To get
Speaker 3
17:34
the thing going and then it will reproduce our universe. There's a problem with noise. We can come back to that later possibly.
Speaker 2
17:42
Is noise a problem, or is it a bug or a feature?
Speaker 3
17:42
I would say it makes our life as scientists really, really much harder. I mean, think about it: without noise, we wouldn't need all of the statistics.
Speaker 2
17:55
But then maybe we wouldn't feel like there's a free will. Maybe we need that for the, for the...
Speaker 3
18:01
This is an illusion, that noise can give you free will.
Speaker 2
18:05
At least in
Speaker 3
18:05
that way, it's a feature. But also, if you don't have noise, you have chaotic phenomena, which are effectively like noise. So we can't get away without statistics even then.
Speaker 3
18:15
I mean, think about rolling a die, and forget about quantum mechanics, and you know exactly how you throw it. But I mean, it's still so hard to compute the trajectory that effectively it is best to model it as coming out with a number with probability 1 over 6. But from this sort of philosophical Kolmogorov complexity perspective, if we didn't have noise, then arguably you could describe the whole universe as well as the standard model plus general relativity. I mean, we don't have a theory of everything yet, but sort of assuming we are close to it or have it, yeah?
Speaker 3
18:52
Plus the initial conditions, which may hopefully be simple, and then you just run it and then you would reproduce the universe. But that's spoiled by noise or by chaotic systems or by initial conditions, which may be complex. So now if we don't take the whole universe, we're just a subset, just take planet Earth. Planet Earth cannot be compressed into a couple of equations.
Speaker 3
19:17
This is a hugely complex system.
Speaker 2
19:17
So interesting. So when you look at the whole, the whole thing might be simple, but when you just take a small window, then...
Speaker 3
19:17
It may become complex, and that may be counterintuitive, but there's a very nice analogy: the library of all books.
Speaker 3
19:34
So imagine you have a normal library with interesting books, and you go there: great, lots of information, and quite complex, yeah? So now I create a library which contains all possible books, say, of 500 pages. So the first book just has AAAA over all the pages. The next book is also all A's but ends with B, and so on. I create this library of all books. I can write a super short program which creates this library.
Speaker 3
19:57
So this library which has all books has zero information content. And you take a subset of this library, and suddenly you have a lot of information in there.
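The library-of-all-books point can be made literal in a few lines; book length is shrunk to 3 characters here so the enumeration stays finite:

```python
from itertools import product

# The library of all books, in code: a program that enumerates *every*
# string over an alphabet is a few lines long, so the complete library has
# essentially zero information content. Picking out a particular subset
# (an actual library) is where the information lives.

ALPHABET = "AB"
BOOK_LEN = 3   # stand-in for "500 pages"

def all_books():
    # a tiny program whose output is the whole library
    for chars in product(ALPHABET, repeat=BOOK_LEN):
        yield "".join(chars)

print(list(all_books()))
# ['AAA', 'AAB', ..., 'BBB']: the whole library from a very short program.
# A subset like {'ABA', 'BAB'} needs extra bits just to say which books it contains.
```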
Speaker 2
20:05
So that's fascinating. I think one of the most beautiful mathematical objects that at least today seems to be understudied or under-talked about is cellular automata. What lessons do you draw from, sort of, the Game of Life for cellular automata, where you start with the simple rules, just like you're describing with the universe, and somehow complexity emerges?
Speaker 2
20:26
Do you feel like you have an intuitive grasp on the fascinating behavior of such systems, where, like you said, some chaotic behavior could happen, some complexity could emerge, or it could die out into some very rigid structures? Do you have a sense about cellular automata that somehow transfers maybe to the bigger questions of our universe? Yeah, the cellular automata, and especially
Speaker 3
20:52
Conway's Game of Life, is really great, because the rules are so simple, you can explain it to every child, and even by hand you can simulate a little bit, and you see these beautiful patterns emerge, and people have proven that it's even Turing complete. You can not only use a computer to simulate the Game of Life, but you can also use the Game of Life to simulate any computer. That is truly amazing.
Speaker 3
21:16
And it's the prime example, probably, to demonstrate that very simple rules can lead to very rich phenomena. And people sometimes ask, you know, how can chemistry and biology be so rich? I mean, this can't be based on simple rules. But no, we know quantum electrodynamics describes all of chemistry.
Speaker 3
21:37
And we come back to that later. I claim intelligence can be explained or described in one single equation, this very rich phenomenon. You asked also whether I understand this phenomenon. The answer is probably not.
Speaker 3
21:54
And there's this saying: you never really understand things, you just get used to them. And I think I'm pretty used to cellular automata. So you believe that you understand now why this phenomenon happens. But I'll give you a different example.
Speaker 3
22:09
I didn't play too much with this Conway's Game of Life, but a little bit more with fractals and with the Mandelbrot set and its beautiful patterns. Just look up the Mandelbrot set. And, well, when the computers were really slow and I just had a black and white monitor, I programmed my own programs in assembler to...
Speaker 2
22:29
Assembler, wow. You're legit.
Speaker 3
22:30
To get these vectors on the screen, and I was mesmerized. And much later, so I returned to this every couple of years, and then I tried to understand what is going on.
Speaker 3
22:43
And you can understand a little bit, so I tried to derive the locations. There are these circles and the apple shape. And then you have smaller Mandelbrot sets recursively in this set. And there's a way to mathematically, by solving high order polynomials, to figure out where these centers are and what size they are approximately.
Speaker 3
23:08
And by sort of mathematically approaching this problem, you slowly get a feeling of why things are like they are. And that is, you know, a first step to understanding why this phenomenon is so rich.
Speaker 2
23:27
Do you think it's possible, what's your intuition, do you think it's possible to reverse engineer and find the short program that generated these fractals by looking at the fractals?
Speaker 3
23:39
Well, in principle, yes. I mean, in principle, what you can do is you take any data set, you take these fractals, or whatever data set you have, say a picture of Conway's Game of Life, and you run through all programs. You take programs of size 1, 2, 3, 4, and so on, and run them all in parallel, in so-called dovetailing fashion.
Speaker 3
23:59
Give them computational resources: the first one 50%, the second one half of that, and so on, and let them run. Wait until they halt, give an output, compare it to your data, and if some of these programs produce the correct data, then you stop, and then you already have some program. It may be a long program, because it's faster. And then you continue, and you get shorter and shorter programs until you eventually find the shortest program.
Speaker 3
24:22
The interesting thing is, you can never know whether it's the shortest program, because there could be an even shorter program which is just even slower, and you just have to wait, yeah. But asymptotically, and actually after finite time, you have the shortest program. So this is a theoretical, but completely impractical, way of finding the underlying structure in every data set. And that is what Solomonoff induction does, and Kolmogorov complexity. In practice, of course, we have to approach the problem more intelligently.
Speaker 3
24:53
And then, if you take resource limitations into account, there's, for instance, the field of pseudo-random numbers. These are deterministic sequences, but no algorithm which is fast (fast means runs in polynomial time) can detect that it's actually deterministic. So we can produce random numbers, maybe not that interesting, but just as an example: we can produce complex-looking data, and we can then prove that no fast algorithm can detect the underlying pattern.
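A schematic of the dovetailing search described above. The "programs" here are a hand-made toy list with assigned description lengths, purely to show the scheduling idea; the real search enumerates all programs of a universal machine.

```python
# Schematic dovetailing: run all candidate programs "in parallel" with
# growing step budgets, and keep the shortest one whose output matches the
# data. The programs never halt; the budget is what keeps this safe.

target = "111111"

def ones():          # toy program: "print 1 forever"
    while True:
        yield "1"

def alternate():     # toy program: "print 10 forever"
    while True:
        yield "1"
        yield "0"

programs = [(5, ones), (7, alternate)]   # (description length in bits, program)

def dovetail(programs, target, rounds=8):
    best = None
    runs = [(length, gen(), []) for length, gen in programs]
    for budget in range(1, rounds + 1):      # round k grants k more steps each
        for length, it, out in runs:
            for _ in range(budget):
                out.append(next(it))
            if "".join(out[:len(target)]) == target:
                if best is None or length < best[0]:
                    best = (length, "".join(out[:len(target)]))
    return best

print(dovetail(programs, target))   # (5, '111111'): shortest program found
```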
Speaker 2
25:31
Which is unfortunate. That's a big challenge for our search for simple programs in the space of artificial intelligence, perhaps. Yes, it definitely is
Speaker 3
25:43
for artificial intelligence and it's quite surprising that it's, I can't say easy, I mean physicists worked really hard to find these theories, but apparently it was possible for human minds to find these simple rules in the universe. It could have been different, right?
Speaker 2
25:59
It could have been different. It's awe-inspiring. So let me ask another absurdly big question.
Speaker 2
26:08
What is intelligence in your view?
Speaker 3
26:12
So I have of course a definition.
Speaker 2
26:12
I wasn't sure what you were gonna say, because you could have just as easily said, I have no clue.
Speaker 3
26:12
Which many people would say. But I'm not modest in this question.
Speaker 3
26:28
So the informal version, which I worked out together with Shane Legg, who co-founded DeepMind, is that intelligence measures an agent's ability to perform well in a wide range of environments. So that doesn't sound very impressive, but these words have been very carefully chosen, and there is a mathematical theory behind that, and we come back to that later. And if you look at this definition by itself, it seems like, yeah, okay, but it seems a lot of things are missing. But if you think it through, then you realize that most, and I claim all, of the other traits, at least of rational intelligence, which we usually associate with intelligence, are emergent phenomena from this definition. Like, you know, creativity, memorization, planning, knowledge.
Speaker 3
27:22
You all need that in order to perform well in a wide range of environments. So you don't have to explicitly mention that in a definition.
Speaker 2
27:30
Interesting. So yeah, so consciousness, abstract reasoning, all these kinds of things are just emergent phenomena that help you. Can you say the definition again? So, multiple environments. Did you mention the word goals?
Speaker 3
27:46
No, but we have an alternative definition. Instead of performing well, you can just replace it by goals. So intelligence measures an agent's ability to achieve goals in a wide range of environments.
Speaker 3
27:55
That's more or less equal.
Speaker 2
27:56
But it's interesting, because in there, there's an injection of the word goals. So we want to specify there should be a goal.
Speaker 3
28:03
Yeah, but perform well is sort of, what does it mean? It's the same problem. Yeah, there's a little bit
Speaker 2
28:08
of a gray area, but it's much closer to something that could be formalized. In your view, are humans, where do humans fit into that definition? Are they general intelligence systems that are able to perform in, like how good are they at fulfilling that definition, at performing well in multiple environments?
Speaker 3
28:31
Yeah, that's a big question. I mean, the humans are performing best among all species we know of. Depends,
Speaker 2
28:41
you could say that trees and plants are doing a better job, they'll probably outlast us. Yeah, but they are in a
Speaker 3
28:47
much more narrow environment, right? I mean, you just, you know, have a little bit of air pollution and these trees die, and we can adapt, right? We build houses, we build filters,
Speaker 2
28:57
we do geoengineering, so- The multiple environment part.
Speaker 3
29:01
Yeah, that is very important, yeah. So that distinguishes narrow intelligence from wide intelligence, also in AI research. So let
Speaker 2
29:09
me ask the Alan Turing question. Can machines think? Can machines be intelligent?
Speaker 2
29:16
So in your view, I have to kind of ask, the answer's probably yes, but I want to kind of hear your thoughts on it: can machines be made to fulfill this definition of intelligence, to achieve intelligence?
Speaker 3
29:30
Well, we are sort of getting there, and on a small scale, we are already there. The wide range of environments is still missing, but we have self-driving cars, we have programs that play Go and chess, we have speech recognition. So it's pretty amazing, but you can, you know, these are narrow environments.
Speaker 3
29:49
But if you look at AlphaZero, that was also developed by DeepMind. I mean, DeepMind got famous with AlphaGo, and then came AlphaZero a year later. That was truly amazing. It's a reinforcement learning algorithm which is able, just by self-play, to play chess and then also Go.
Speaker 3
30:08
And I mean, yes, they're both games, but they're quite different games. And you don't feed them the rules of the game. And the most remarkable thing, which is still a mystery to me, is that usually for any decent chess program, I don't know much about Go, you need opening books and endgame tables and so on, and nothing was put in there.
Speaker 2
30:29
Just like- Especially with AlphaZero, the self-play mechanism, starting from scratch, being able to actually learn new strategies, is incredible. Yeah, it
Speaker 3
30:40
rediscovered all these famous openings within four hours by itself. What I was really happy about, I'm a terrible chess player, but I like the Queen's Gambit. And AlphaZero figured out that this is
Speaker 2
30:52
the best opening. Finally, somebody proved you correct.
Speaker 3
31:00
So yes, to answer your question: yes, I believe that general intelligence is possible. And it also, I mean, it depends how you define it. Does AGI, artificial general intelligence, only refer to achieving human level? Is a subhuman level, but quite broad, also general intelligence? So we have to distinguish. Or is it only superhuman intelligence, general artificial intelligence?
Speaker 2
31:24
Is there a test in your mind, like the Turing test in natural language, or some other test, that would impress the heck out of you, that would kind of cross the line of your sense of intelligence within the framework that you said? Well, the Turing test, well,
Speaker 3
31:41
it has been criticized a lot, but I think it's not as bad as some people think. Some people think it's too strong. So it tests not just for a system to be intelligent, but it also has to fake human deception, right, which is, you know, much harder.
Speaker 3
31:59
And on the other hand they say it's too weak, yeah, because it just maybe fakes, you know, emotions or intelligent behavior. It's not real. But I don't think that's the problem or big problem. So if you would pass the Turing test, So a conversation over terminal with a bot for an hour or maybe a day or so, and you can fool a human into not knowing whether this is a human or not, so that's the Turing test.
Speaker 3
32:27
I would be truly impressed. And we have this annual competitions, the Loebner Prize. And I mean, it started with Eliza, that was the first conversational program. And what is it called?
Speaker 3
32:40
The Japanese Mitsuku, or so, that's the winner of the last couple of years. And-
Speaker 2
32:45
It's quite impressive.
Speaker 3
32:45
Yeah, it's quite impressive. And then Google has developed Meena, right? Just recently, that's an open-domain conversational bot.
Speaker 3
32:55
Just a couple of weeks ago, I think.
Speaker 2
32:57
Yeah, I kind of like the metric that the Alexa Prize has proposed. I mean, maybe it's obvious to you, it wasn't to me, of setting sort of a length of a conversation: like, you want the bot to be sufficiently interesting that you'd wanna keep talking to it for like 20 minutes. And that's a surprisingly effective aggregate metric, because really, like, nobody has the patience to be able to talk to a bot that's not interesting and intelligent and witty and is able to go on to different tangents, jump domains, be able to say something interesting to maintain your attention.
Speaker 3
33:36
Maybe many humans will also fail this test.
Speaker 2
33:36
That's, unfortunately, we set, just like with autonomous vehicles, with chatbots, we also set a bar that's way too high to reach.
Speaker 3
33:36
I said the Turing test is not as bad as some people believe, but what is really not useful about the Turing test is that it gives us no guidance on how to develop these systems in the first place.
Speaker 3
34:00
Of course, we can develop them by trial and error and, you know, do whatever, and then run the test and see whether it works or not. But a mathematical definition of intelligence gives us, you know, an objective which we can then analyze by theoretical or computational tools, and maybe even prove how close we are. And we will come back to that later with the AIXI model. So I mentioned compression, right?
Speaker 3
34:31
So in natural language processing, they have achieved amazing results. And one way to test this, of course, is you take the system, you train it, and then you see how well it performs on the task. But a lot of performance measurement is done by so-called perplexity, which is essentially the same as complexity, or compression length. So the NLP community develops new systems, and then they measure the compression length, and then they have rankings and leaderboards, because there's a strong correlation between compressing well and the systems performing well at the task at hand.
Speaker 3
35:07
It's not perfect, but it's good enough for them as an intermediate aim.
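The perplexity-compression link can be shown with any compressor standing in for a language model; a sketch, with zlib as the stand-in:

```python
import zlib

# The perplexity <-> compression link: a model (or compressor) that spends
# H bits per character corresponds to a per-character perplexity of 2^H.
# zlib stands in for a language model here; a real LM's cross-entropy
# plugs into the same formula.

text = ("the quick brown fox jumps over the lazy dog " * 200).encode()

bits_per_char = 8 * len(zlib.compress(text, 9)) / len(text)
print(f"{bits_per_char:.3f} bits/char -> perplexity {2 ** bits_per_char:.2f}")
```

Lower bits per character means better compression, which is exactly lower perplexity; that is the correlation the NLP leaderboards exploit.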
Speaker 2
35:15
So you mean a measure... so this is kind of almost returning to Kolmogorov complexity. So you're saying good compression usually means good intelligence.
Speaker 3
35:15
Yes.
Speaker 2
35:26
So you mentioned you're one of the only people who dared boldly to try to formalize the idea of artificial general intelligence, to have a mathematical framework for intelligence, just like, as we mentioned, termed AIXI, A-I-X-I. So let me ask the basic question: what is AIXI?
Speaker 3
35:54
Okay, so let me first say what it stands for because.
Speaker 2
35:57
What it stands for, actually, that's probably the more basic question. Yeah. The first question
Speaker 3
36:02
is usually how it's pronounced, but finally I put it on the website how it's pronounced, and you figured it out. The name comes from AI, artificial intelligence, and the X-I is the Greek letter xi, which I used for Solomonoff's distribution, for quite stupid reasons, which I'm not willing to repeat here in front of the camera. So it just happened to be more or less arbitrary; I chose this xi.
Speaker 3
36:31
But it also has nice other interpretations. So there are actions and perceptions in this model: an agent has actions and perceptions over time. So this is A index i, X index i: the action at time i, followed by the perception at time i.
Speaker 2
36:49
We'll go with that. I'll edit out the first part. I'm just kidding.
Speaker 3
36:53
I have some more interpretations. So at some point, maybe 5 years ago or 10 years ago, I was in Barcelona, and on a big church there was some text engraved in stone, and the word "així" appeared there
Speaker 2
37:10
a couple of times.
Speaker 3
37:12
I was very surprised and happy about that. And I looked it up, so it is the Catalan language, and it means, with some interpretation, "that's it" or "that's the right thing to do," yeah, eureka.
Speaker 2
37:24
Oh, so it's almost like destined somehow came to you in a dream, so okay.
Speaker 3
37:32
And similarly, there's a Chinese word also written like "aixi" if you transcribe it to Pinyin. And the final one is that it is AI crossed with induction, because that is, and that's going more to the content now... So good old-fashioned AI is more about, you know, planning in a known, deterministic world.
Speaker 3
37:48
And induction is more about, often, you know, IID data and inferring models. And essentially what this AIXI model does is combine these two.
Speaker 2
37:56
And I actually also recently, I think, heard that in Japanese, AI means love. So if you can combine XI somehow with that, I think we can, there might be some interesting ideas there. So I see, let's then take the next step.
Speaker 2
38:12
Can you maybe talk at the big level of what is this mathematical framework.
Speaker 3
38:19
Yeah, so it consists essentially of 2 parts. 1 is the learning and induction and prediction part, and the other 1 is the planning part. So let's come first to the learning, induction, prediction part, which essentially I explained already before.
Speaker 3
38:35
So what we need for any agent to act well is that it can somehow predict what happens. I mean, if you have no idea what your actions do, how can you decide which actions are good or not? So you need to have some model of what your actions affect. So what you do is you have some experience, you build models, like scientists, of your experience, then you hope these models are roughly correct, and then you use these models for prediction.
Speaker 2
39:03
And the model is, sorry to interrupt, and the model is based on your perception of the world, how your actions will affect that world.
Speaker 3
39:10
That's not...
Speaker 2
39:10
So how do you think about the model?
Speaker 3
39:10
That's not the important part, but it is technically important. At this stage, we can just think about predicting, say, stock market data, weather data, or IQ sequences, 1, 2, 3, 4, 5, what comes next, yeah?
Speaker 3
39:24
So, of course, our actions affect what we're doing, but I come back to that in a second.
Speaker 2
39:30
So, and I'll keep just interrupting. So just to draw a line between prediction and planning: what do you mean by prediction in this way? Is it trying to predict the environment without your long-term action in that environment? What is prediction?
Speaker 3
39:49
Okay, if you want to put the actions in now, okay, then let's put in now, yeah.
Speaker 2
39:54
We don't have to put them now. Yeah, yeah. Scratch it, scratch it, dumb question, okay.
Speaker 3
39:58
So the simplest form of prediction is that you just have data which you passively observe, and you want to predict what happens without interfering. As I said, weather forecasting, stock market, IQ sequences, or just anything. And Solomonoff's theory of induction is based on compression.
Speaker 3
40:18
So you look for the shortest program which describes your data sequence, and then you take this program and run it, which reproduces your data sequence by definition, and then you let it continue running, and it will produce some predictions. And you can rigorously prove that for any prediction task, this is essentially the best possible predictor. Of course, if there's a task which is unpredictable, like, you know, fair coin flips, yeah, I cannot predict the next fair coin flip. What Solomonoff does is say, okay, the next head is probably 50%.
Speaker 3
40:51
It's the best you can do. So if something is unpredictable, Solomonoff will also not magically predict it. But if there is some pattern and predictability, then Solomonoff induction will figure that out eventually, and not just eventually, but rather quickly, and you can have proven convergence rates, whatever your data is. So that is pure magic in a sense.
Speaker 3
41:14
What's the catch? Well, the catch is that it's not computable, and we come back to that later. You cannot just implement it, even with Google's resources here, and run it and predict the stock market and become rich. I mean, Ray Solomonoff already tried it at the time.
Speaker 2
41:28
But so the basic task is you're in the environment and you're interacting with the environment to try to learn a model of that environment and the model is in the space of all these programs and your goal is to get a bunch of programs that are simple.
Speaker 3
41:41
And so let's go to the actions now. But actually, good that you asked. Usually I skip this part, although there is also a minor contribution which I did with the action part; I usually sort of just jump to the decision part. So let me explain the action part now.
Speaker 3
41:53
Thanks for asking.
Speaker 2
41:54
Yes.
Speaker 3
41:55
So you have to modify it a little bit: now you're not just predicting a sequence which just comes to you, but you have an observation, then you act somehow, and then you want to predict the next observation based on the past observations and your actions. Then you take the next action; you don't care about predicting it, because you're doing it.
Speaker 3
42:24
You just condition extra on your actions. There's an interesting alternative: that you also try to predict your own actions, if you want.
Speaker 2
42:38
In the past or the future?
Speaker 3
42:38
Your future actions.
Speaker 2
42:39
That's interesting. Wait, let me wrap my head around that. I think my brain just broke.
Speaker 3
42:45
We should maybe discuss that later, after I've explained the AIXI model. That's an interesting variation. But that is a really interesting variation.
Speaker 2
42:51
And a quick comment, I don't know if you want to insert that in here, but in terms of observations, you're looking at the entire big history, the long history of the observations.
Speaker 3
43:03
Exactly, that's very important: the whole history, from birth, sort of, of the agent. And we can come back to that, and also why this is important here. Often, you know, in RL you have MDPs, Markov decision processes, which are much more limiting.
Speaker 3
43:15
Okay, so now we can predict conditioned on actions. So even if we influence the environment. But prediction is not all we want to do, right? We also want to act really in the world. And the question is how to choose the actions.
Speaker 3
43:29
And we don't want to greedily choose the actions, you know, just what is best in the next time step. And first I should say, how do we measure performance? We measure performance by giving the agent reward.
Speaker 3
43:43
That's the so-called reinforcement learning framework. So every time step, you can give it a positive reward, a negative reward, or maybe no reward. It could be very scarce, right? Like if you play chess, just at the end of the game, you give plus 1 for winning or minus 1 for losing.
Speaker 3
43:56
So in the AIXI framework, that's completely sufficient. So occasionally you give a reward signal, and you ask the agent to maximize reward, but not greedily, sort of, you know, the next one, the next one, because that's very bad in the long run if you're greedy. But over the lifetime of the agent. So let's assume the agent lives for m time steps, let's say it dies in, sort of, 100 years sharp; that's just, you know, the simplest model to explain.
Speaker 2
44:19
So it looks at
Speaker 3
44:20
the future reward sum and asks, what is my action sequence, or actually more precisely my policy, which leads in expectation, because I don't know the world, to the maximum reward sum? Let me give you an analogy. In chess, for instance, we know how to play optimally in theory; it's just a minimax strategy.
Speaker 3
44:42
I play the move which seems best to me under the assumption that the opponent plays the move which is best for him, so worst for me, under the assumption that I play, again, the best move. And then you have this expectimax tree to the end of the game, and then you propagate back, and then you get the best possible move. So that is the optimal strategy, which von Neumann already figured out a long time ago, for playing adversarial games. Luckily, or maybe unluckily for the theory, it becomes harder: the world is not always adversarial. It can be, if there are other humans, even cooperative. Or nature, I mean, dead nature, is stochastic, you know; things just happen randomly, or don't care about you. So what you have to take into account is the noise, and not necessarily adversariality.
Speaker 3
45:30
So you replace the minimum on the opponent's side by an expectation, which is general enough to include also adversarial cases. So now instead of a minimax strategy you have an expectimax strategy. So far so good, so that is well known, it's called sequential decision theory. But the question is, on which probability distribution do you base that?
Speaker 3
45:52
If I have the true probability distribution, like, say, I play backgammon, right? There's dice, and there's certain randomness involved. I can calculate probabilities and feed them into the expectimax or the sequential decision tree, and come up with the optimal decision if I have enough compute.
Speaker 3
46:09
What is the probability that the driver in front of me brakes? I don't know. It depends on all kinds of things, and especially in new situations I don't know. So this is this unknown thing about prediction, and that's where Solomonoff comes in.
Speaker 3
46:24
So what you do is, in the sequential decision tree, you just replace the true distribution, which we don't know, by this universal distribution. I didn't explicitly talk about it, but this is used for universal prediction, and you plug it into the sequential decision tree mechanism. And then you get the best of both worlds. You have a long-term planning agent, but it doesn't need to know anything about the world, because the Solomonoff induction part learns.
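A minimal expectimax sketch of the planning part just described. Here `model` is a placeholder environment model supplied for illustration, where AIXI would use the Solomonoff mixture; the names and the toy environment are assumptions, not DeepMind code.

```python
# Minimal expectimax: max over the agent's actions, expectation over next
# outcomes under a model. `model(state, action)` returns a list of
# (probability, reward, next_state) tuples.

def expectimax(state, model, actions, horizon):
    """Expected reward sum of the best policy over `horizon` steps."""
    if horizon == 0:
        return 0.0
    values = []
    for a in actions(state):
        v = sum(p * (r + expectimax(s2, model, actions, horizon - 1))
                for p, r, s2 in model(state, a))
        values.append(v)
    return max(values)

# Toy environment: "safe" pays 0.5 surely; "risky" pays 1.0 with prob 0.6.
model = lambda s, a: ([(1.0, 0.5, s)] if a == "safe"
                      else [(0.6, 1.0, s), (0.4, 0.0, s)])
actions = lambda s: ["safe", "risky"]

print(f"{expectimax('start', model, actions, horizon=3):.2f}")  # 1.80 = 3 * 0.6
```

Replacing the max on the opponent's side of minimax with this expectation is exactly the minimax-to-expectimax move described above.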
Speaker 2
46:51
Can you explicitly try to describe the universal distribution, and how Solomonoff induction plays a role here? I'm trying to understand.
Speaker 3
47:00
So what it does is: in the simplest case, I said, take the shortest program describing your data, run it, have a prediction, which would be deterministic. Yes. Okay, but you should not just take the shortest program, but also consider the longer ones, but give them lower a priori probability.
Speaker 3
47:18
So in the Bayesian framework, you say a priori any distribution, which is a model or a stochastic program, has a certain a priori probability, which is 2 to the minus the length of this program (and why 2 to the minus length, you know, I could explain). So longer programs are punished a priori. And then you multiply it with the so-called likelihood function, which, as the name suggests, is how likely this model is given the data at hand. So if you have a very wrong model, it's very unlikely that this model is true.
Speaker 3
47:54
And so it's a very small number. So even if the model is simple, it gets penalized by that. And what you do is then you take just the sum, or the weighted average, over it. And this gives you a probability distribution.
Speaker 3
48:07
So that's the universal distribution, also called the Solomonoff distribution.
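In standard notation, the mixture just described is usually written like this (one common rendering of it):

```latex
% The universal (Solomonoff) mixture: every (semi)computable model \nu is
% weighed by 2^{-K(\nu)}, where K(\nu) is the length of the shortest
% program computing \nu, and prediction is the posterior-weighted average.
\[
  \xi(x_{1:n}) \;=\; \sum_{\nu \in \mathcal{M}} 2^{-K(\nu)}\,\nu(x_{1:n}),
  \qquad
  \xi(x_{n+1}\mid x_{1:n}) \;=\; \frac{\xi(x_{1:n+1})}{\xi(x_{1:n})}.
\]
```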
Speaker 2
48:10
So it's weighed by the simplicity of the program and the likelihood. Yes. It's kind of a nice idea.
Speaker 2
48:17
Yeah. So, okay. And then you said you're planning N or M, I forgot the letter, steps into the future. So how difficult is that problem?
Speaker 2
48:28
What's involved there? Okay, so basic optimization problem, what are we talking about?
Speaker 3
48:31
Yeah, so you have a planning problem up to horizon M, and that's exponential time in the horizon M, which is, I mean, computable but intractable. I mean, even for chess it's already intractable to do that exactly, and, you know, for Go too.
Speaker 2
48:31
But it could also be a discounted kind of framework, or?
Speaker 3
48:49
Yeah. So having a hard horizon, you know, at 100 years, is just for simplicity of discussing the model, and also sometimes the math is simpler. But there are lots of variations. Actually, it's a quite interesting parameter.
Speaker 3
49:03
There's nothing really problematic about it, but it's very interesting. So for instance, you think, no, let's let the parameter M tend to infinity, right? You want an agent which lives forever, right? If you do it naively, you have 2 problems.
Speaker 3
49:17
First, the mathematics breaks down, because you have an infinite reward sum, which may give infinity: getting reward 0.1 every time step gives infinity, and getting reward 1 every time step gives infinity, so they're equally good. Not really what we want. The other problem is that if you have an infinite life, you can be lazy for as long as you want, for 10 years, and then catch up with the same expected reward. And, you know, think about yourself, or maybe some friends or so: if they knew they lived forever, you know, why work hard now? Just enjoy your life and then catch up later. So that's another problem with the infinite horizon.
Speaker 3
49:56
And you mentioned, yes, we can go to discounting. But then the standard discounting is so-called geometric discounting: so a dollar today is worth about as much as, you know, one dollar and five cents tomorrow. So if you do this so-called geometric discounting, you have introduced an effective horizon: the agent is now motivated to look ahead a certain amount of time, effectively.
Speaker 3
50:18
It's like a moving horizon. And for any fixed effective horizon, there is a problem which requires a larger horizon to solve. So if I look ahead, you know, five time steps, I'm a terrible chess player, right? I need to look ahead longer.
Speaker 3
50:34
If I play Go, I probably have to look ahead even longer. So for every horizon, there is a problem which this horizon cannot solve. But I introduced the so-called near-harmonic horizon, where the discount goes down with 1 over t rather than exponentially in t, which produces an agent which effectively looks into the future proportional to its age. So if it's 5 years old, it plans for 5 years.
Speaker 3
50:57
If it's 100 years old, it then plans for 100 years.
Speaker 2
51:00
And it's
Speaker 3
51:00
a little bit similar to humans too, right? I mean, children don't plan ahead very long, but when we become adults, we plan ahead longer. Maybe when we get very old, I mean, we know that we don't live forever, and maybe then our horizon shrinks again.
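A quick numeric sketch of the two discount choices. The transcript says 1 over t; the code below uses a 1/t^2 weight (the near-harmonic choice in Hutter's published work), but any polynomial falloff shows the effect: a horizon that grows with the agent's age, versus a fixed one for geometric discounting.

```python
# Geometric discounting yields a fixed effective lookahead regardless of
# the agent's age; a polynomially decaying discount yields a lookahead
# that grows proportionally with age.

def effective_horizon(weight, t, mass=0.8, tail_len=200000):
    """Steps past time t needed to cover `mass` of the remaining weight."""
    tail = [weight(k) for k in range(t, t + tail_len)]
    need, acc = mass * sum(tail), 0.0
    for i, w in enumerate(tail, start=1):
        acc += w
        if acc >= need:
            return i

geometric = lambda t: 0.95 ** t     # effective horizon stays ~32 steps
harmonic = lambda t: 1.0 / t ** 2   # effective horizon grows like ~4 * t

for age in (10, 100, 1000):
    print(age, effective_horizon(geometric, age), effective_horizon(harmonic, age))
```

At age 10 the harmonic agent looks roughly 40 steps ahead, at age 1000 roughly 4000, while the geometric agent's lookahead never changes; this is the "plans proportional to its age" behavior described above.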
Speaker 2
51:15
So that's really interesting. So adjusting the horizon, is there some mathematical benefit of that? Or is it just nice? I mean, intuitively, empirically, it would probably be a good idea to sort of push the horizon back, extend the horizon as you experience more of the world. But are there some mathematical conclusions here
Speaker 3
51:35
that are beneficial? With Solomonoff, with the prediction part, we have extremely strong finite-time and finite-data results: you have so and so much data, then you lose so and so much.
Speaker 3
51:47
So the theory is really great. With the AIXI model, with the planning part, many results are only asymptotic, which, well...
Speaker 2
51:47
What does asymptotic mean?
Speaker 3
51:47
Asymptotic means you can prove, for instance, that in the long run, if the agent acts long enough, then it performs optimally, or some nice thing happens.
Speaker 3
52:06
So, but you don't know how fast it converges. So it may converge fast, but we're just not able to prove it because of difficult problem, or maybe there's a bug in the model so that it's really that slow. So that is what asymptotic means, sort of eventually, but we don't know how fast. And if I give the agent a fixed horizon M, then I cannot prove asymptotic results, right?
Speaker 3
52:32
So I mean, sort of if it dies in 100 years, then in 100 years it's over, I cannot say eventually. So this is the advantage of the discounting that I can prove asymptotic results.
Speaker 2
52:42
So just to clarify: okay, I've built up a model, we're now in the moment, I have this way of looking several steps ahead. How do I pick what action I will take?
Speaker 3
53:00
It's like with playing chess, right? You do this minimax. In this case here, you do expectimax based on the Solomonoff distribution. You propagate back, and then, well, an action falls out.
Speaker 3
53:12
The action which maximizes the future expected reward under the Solomonoff distribution, and then
Speaker 2
53:16
you just take this action. And then repeat.
Speaker 3
53:19
And then you get a new observation, and you feed in this action and observation, and you repeat.
Speaker 2
53:23
And the reward, so on.
Speaker 3
53:24
Yeah, so the reward too, yeah.
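The interaction loop, as bare scaffolding; `plan` is a random placeholder where AIXI's expectimax over the Solomonoff distribution would go, and the toy environment is an assumption for the example.

```python
import random

# Perceive-act-reward loop: at each step the agent picks an action given
# the *whole* history, the environment returns an observation and a
# (possibly sparse) reward, and the tuple is appended to the history.

def plan(history, actions=("left", "right")):
    return random.choice(actions)              # placeholder policy

def environment(action):
    obs = "ping" if action == "left" else "pong"
    reward = 1.0 if action == "left" else 0.0  # toy reward signal
    return obs, reward

history, total = [], 0.0
for t in range(10):
    a = plan(history)                          # action a_t from full history
    o, r = environment(a)                      # perception x_t = (o_t, r_t)
    history.append((a, o, r))
    total += r
print(f"return over 10 steps: {total}")
```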
Speaker 2
53:26
And then maybe you can even predict your own action. I love that idea. But okay, this big framework, what is it... it's kind of a beautiful mathematical framework to think about artificial general intelligence.
Speaker 2
53:41
What can you... what does it help you intuit about how to build such systems? Or, maybe from another perspective, what does it help us understand about AGI?
Speaker 3
53:56
So when I started in the field, I was always interested in two things. One was AGI, the name didn't exist then, but it was called general AI or strong AI, and the physics theory of everything. So I switched back and forth between computer science and physics quite often.
Speaker 2
54:14
You said the theory of everything.
Speaker 3
54:15
The theory of everything, yeah, just like.
Speaker 2
54:18
It was basically the two biggest problems before all of humanity.
Speaker 3
54:23
Yeah, I can explain, if you want, at some later time, why I'm interested in these two questions. Can I
Speaker 2
54:30
ask you, on a small tangent, if it was one to be solved, which one would you... if an apple fell on your head and there was a brilliant insight and you could arrive at the solution to one, would it be AGI or the theory of everything?
Speaker 3
54:30
Definitely AGI, because once the AGI problem is solved, I can ask the AGI to solve the other problem for me.
Speaker 2
54:30
Yeah, brilliantly put.
Speaker 2
54:57
Okay, so as you were saying about it.
Speaker 3
55:01
Okay, so the reason why I didn't settle... I mean, this thought about, once you have solved AGI, it solves all kinds of other problems, not just the theory of everything problem, but all kinds of more useful problems to humanity, is very appealing to many people. And I had this thought also. But I was quite disappointed with the state of the art of the field of AI.
Speaker 3
55:25
There was some theory about logical reasoning, but I was never convinced that this would fly. And then there were these more heuristic approaches with neural networks, and I didn't like these heuristics. And also, I didn't have any good idea myself. So that's the reason why I toggled back and forth quite some while, and even worked four and a half years in a company developing software, something completely unrelated.
Speaker 3
55:49
But then I had this idea about the AIXI model. And so what it gives you: it gives you a gold standard. So I have proven that this is the most intelligent agent which anybody could "build," in quotation marks, because it's just mathematical and you need infinite compute. Yeah, but this is the limit, and it is completely specified. It's not just a framework; you know, every year tens of frameworks are developed which are just skeletons, and then pieces are missing, and usually these missing pieces, you know, turn out to be really, really difficult. And so this is completely and uniquely defined, and we can analyze that mathematically. And we have also developed some approximations.
Speaker 3
56:37
I can talk about that a little bit later. That would be sort of the top-down approach, like, say, von Neumann's minimax theory: that is the theoretically optimal play of games. And now we need to approximate it, put heuristics in, prune the tree, blah, blah, blah, and so on. So we can do that also with the AIXI model, but for general AI.
Speaker 3
56:55
It can also inspire those, and most researchers go bottom-up, right? They have their systems, they try to make them more general, more intelligent. It can inspire in which direction to go.
Speaker 2
56:55
What do you mean by that?
Speaker 3
57:09
So if you have some choice to make, right? So how should I evaluate my system if I can't do cross-validation? How should I do my learning if my standard regularization doesn't work well? So the answer is always this.
Speaker 3
57:22
We have a system which does everything. That's AIXI. It's just, you know, completely in the ivory tower, completely useless from a practical point of view. But you can look at it and see, ah, yeah, maybe I can take some aspects.
Speaker 3
57:34
Instead of Kolmogorov complexity, let's just take some compressors which have been developed so far. And for the planning, well, we have UCT, which has also been used in Go. And at least it's inspired me a lot to have this formal definition. And if you look at other fields, you know, like, I always come back to physics because I have a physics background.
Speaker 3
57:58
Think about the phenomenon of energy. That was for a long time a mysterious concept, and at some point it was completely formalized, and that really helped a lot. And you can point out a lot of these things which were first mysterious and vague, and then they have been rigorously formalized. Speed and acceleration have been confused, right, until they were formally defined; there was a time like this. And people, you know, often, who don't have any background, still confuse it.
Speaker 3
58:27
So, and this AIXI model, or the intelligence definition, which is sort of the dual to it, we come back to that later, formalizes the notion of intelligence uniquely and rigorously.
Speaker 2
58:38
So in a sense, it serves as kind of the light at the end of the tunnel.
Speaker 3
58:43
Yes, yeah.
Speaker 2
58:44
So, I mean, there's a million questions I could ask here. So maybe, kind of, okay, let's feel around in the dark a little bit. So there have been, here at DeepMind but in general, a lot of breakthrough ideas, just like we've been saying, around reinforcement learning.
Speaker 2
58:59
So how do you see the progress in reinforcement learning as different? Like, which subset of AIXI does it occupy? The current, like you said, maybe the Markov assumption is made quite often in reinforcement learning. There are other assumptions made in order to make the system work. What do you see as the difference, the connection, between reinforcement learning and AIXI?
Speaker 3
59:17
So the major difference is that essentially all other approaches make stronger assumptions.
Speaker 3
59:35
So in reinforcement learning, the Markov assumption is that the next state or next observation only depends on the previous observation, and not the whole history, which makes, of course, the mathematics much easier, rather than dealing with histories. Of course, they profit from it also, because then you have algorithms that run on current computers and do something practically useful. But for general AI, all the assumptions which are made...