Sergey Levine: Robotics and Machine Learning | Lex Fridman Podcast #108

1 hour 37 minutes 30 seconds

S1

Speaker 1

01:00:00

Clean in some way, like, for example, clean in the sense that the classes in your multi-class classification problem separate linearly. So they have some kind of good representation, and we call this a feature representation. And for a long time, people were very worried about features in the world of supervised learning, because somebody had to actually build those features. So you couldn't just take an image and plug it into your logistic regression or your SVM or something.

S1

Speaker 1

01:00:22

Someone had to take that image and process it using some handwritten code. And then neural nets came along, and they could actually learn the features. And suddenly, we could apply learning directly to the raw inputs, which was great for images, but it was even more great for all the other fields where people hadn't come up with good features yet.

S1

Speaker 1

01:00:39

And one of those fields was actually reinforcement learning. Because in reinforcement learning, the notion of features, if you don't use neural nets and you have to design your own features, is very, very opaque. It's very hard to imagine. Let's say I'm playing chess or Go.

S1

Speaker 1

01:00:53

What is a feature with which I can represent the value function for Go or even the optimal policy for Go linearly? I don't even know how to start thinking about it. And people tried all sorts of things. They would write down, you know, an expert chess player looks for whether the knight is in the middle of the board or not.

S1

Speaker 1

01:01:09

So that's a feature: is the knight in the middle of the board? And they would write these long lists of kind of arbitrary, made-up stuff. And that was really getting us nowhere.
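
To make the contrast concrete, here is a minimal sketch of the two recipes being described. The feature definitions below are made-up stand-ins for the kind of expert-written heuristics mentioned above, not any actual chess engine's features:

```python
import numpy as np
import torch
import torch.nn as nn

# --- The old recipe: hand-designed features + a linear value function ---
def handcrafted_features(board):          # board: 8x8x12 one-hot piece planes
    knight_plane = board[:, :, 1]         # suppose plane 1 marks knights
    center = knight_plane[2:6, 2:6].sum() # "is the knight in the middle of the board?"
    material = board.sum()                # crude material-count proxy
    return np.array([center, material, 1.0])  # plus a bias term

weights = np.array([0.3, 0.01, 0.0])      # someone has to tune these too

def linear_value(board):
    # The value function is forced to be linear in whatever
    # features the designer happened to write down.
    return weights @ handcrafted_features(board)

# --- The deep RL recipe: learn the features from the raw board ---
value_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(8 * 8 * 12, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),                    # scalar value estimate
)

board = np.zeros((8, 8, 12), dtype=np.float32)
board[3, 3, 1] = 1.0                      # a knight on a center square
print(linear_value(board))
print(value_net(torch.from_numpy(board).unsqueeze(0)).item())
```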

S2

Speaker 2

01:01:17

But that's a little... chess is a little more accessible than the robotics problem. Absolutely. Right, there are at least experts in the different features for chess. But still, the neural network there, to me, I mean, you put it eloquently and almost made it seem like a natural step to add neural networks, but the fact that neural networks are able to discover features in the control problem, it's very interesting, it's hopeful.

S2

Speaker 2

01:01:46

I'm not sure what to think about it, but it feels hopeful that the control problem has features to be learned. I guess my question is: is it surprising to you how far the deep side of deep reinforcement learning was able to go, what space of problems it has been able to tackle, especially in games with AlphaStar and AlphaZero, and just the representation power there, and in the robotics space? And what is your sense of the limits of this representation power in the control context?

S1

Speaker 1

01:02:24

I think one thing that makes it a little hard to fully answer this question is that in settings where we would like to push these things to the limit, we encounter other bottlenecks. So, like, the reason that I can't get my robot to learn how to, I don't know, do the dishes in the kitchen is not because its neural net is not big enough. It's because when you try to actually do trial-and-error learning, reinforcement learning, directly in the real world, where you have the potential to gather these large, highly varied and complex data sets, you start running into other problems.

S1

Speaker 1

01:03:11

Like, one problem you run into very quickly, and it'll first sound like a very pragmatic problem, but it actually turns out to be a pretty deep scientific problem: take the robot, put it in your kitchen, have it try to learn to do the dishes with trial and error. It'll break all your dishes. And then you'll have no more dishes to clean.

S1

Speaker 1

01:03:26

Now you might think this is a very practical issue, but there's something to this, which is that if you have a person trying to do this, a person will have some degree of common sense. They'll break one dish, and they'll be a little more careful with the next one. And if they break all of them, they're going to go and get more, or something like that. There's all sorts of scaffolding that comes very naturally to us in our learning process.

S1

Speaker 1

01:03:46

Like, you know, if I have to learn something through trial and error, I have the common sense to know that I have to, you know, try multiple times. If I screw something up, I ask for help, or I reset things, or something like that. And all of that is kind of outside of the classic reinforcement learning problem formulation. There are other things that can also be categorized as kind of scaffolding but are very important.

S1

Speaker 1

01:04:07

Like, for example, where do you get your reward function? If I want to learn how to pour a cup of water, well, how do I know if I've done it correctly? Now, that probably requires an entire computer vision system to be built just to determine that. And that seems a little bit inelegant.

S1

Speaker 1

01:04:21

So there are all sorts of things like this that start to come up when we think through what we really need to get reinforcement learning to happen at scale in the real world. And I think that many of these things actually suggest a little bit of a shortcoming in the problem formulation and a few deeper questions that we have to resolve.
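
One common workaround for the missing reward function, sketched here under assumptions: train a success classifier on labeled outcome images and use its predicted success probability as the reward. The tiny network, image size, and task are illustrative, not a specific published system:

```python
import torch
import torch.nn as nn

# A success classifier as a stand-in reward function: given a camera
# image, predict the probability that the task (e.g., water poured
# into the cup) succeeded, and use that probability as the reward.
classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 1),
)

def reward_from_image(image):
    # image: (3, 64, 64) RGB tensor from the robot's camera
    logit = classifier(image.unsqueeze(0))
    return torch.sigmoid(logit).item()   # P(success), used as the reward

# In practice the classifier is first trained on labeled success/failure
# frames (e.g., with nn.BCEWithLogitsLoss) before being used this way.
print(reward_from_image(torch.rand(3, 64, 64)))
```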

S2

Speaker 2

01:04:36

That's really interesting. I talked to David Silver about AlphaZero, and it seems like we haven't hit the limit at all in the context where there are no broken dishes. So in the case of Go, it's really about just scaling compute.

S2

Speaker 2

01:04:55

So again, the bottleneck is the amount of money you're willing to invest in compute, and then maybe the scaffolding around how difficult it is to scale compute. But there, there's no limit. And it's interesting that now we move to the real world, and there's the broken dishes, and the reward function like you mentioned. That's really nice. So how do we push forward there?

S2

Speaker 2

01:05:19

Do you think, there's this kind of sample efficiency question that people bring up, you know, not having to break 100,000 dishes. Is this an algorithm question? Is this a data selection question? What do you think?

S2

Speaker 2

01:05:38

How do we not break too many dishes?

S1

Speaker 1

01:05:41

Yeah. Well, one way we can think about that is that maybe we need to be better at reusing our data, building that iceberg. So perhaps it's too much to hope that you can have a machine that in isolation, in a vacuum, without anything else, can just master complex tasks in minutes the way that people do. But perhaps it also doesn't have to.

S1

Speaker 1

01:06:09

Perhaps what it really needs to do is have an existence, a lifetime, where it does many things, and the previous things that it has done prepare it to do new things more efficiently. The study of these kinds of questions typically falls under categories like multi-task learning or meta-learning, but they all fundamentally deal with the same general theme, which is: use experience from doing other things to learn to do new things efficiently and quickly.
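
A minimal sketch of this "reuse experience across tasks" idea, using sine-wave regression (a standard learning-to-learn toy problem) rather than robotics. The architecture, task family, and step counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def sample_task():
    # Each "task" is a sine wave with random amplitude and phase.
    amp, phase = torch.rand(1) * 4 + 1, torch.rand(1) * 3.14
    return lambda x: amp * torch.sin(x + phase)

trunk = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 1)
opt = torch.optim.Adam(list(trunk.parameters()) + list(head.parameters()), lr=1e-3)

# Multitask pretraining: the trunk sees many tasks, so its features
# must capture what the whole task family has in common.
for step in range(2000):
    f = sample_task()
    x = torch.rand(32, 1) * 10 - 5
    loss = ((head(trunk(x)) - f(x)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# New task, few samples: adapt only a small head, reusing the trunk,
# i.e., previous experience prepares the learner to adapt quickly.
f_new = sample_task()
x_few = torch.rand(10, 1) * 10 - 5
new_head = nn.Linear(64, 1)
head_opt = torch.optim.Adam(new_head.parameters(), lr=1e-2)
for step in range(200):
    loss = ((new_head(trunk(x_few).detach()) - f_new(x_few)) ** 2).mean()
    head_opt.zero_grad(); loss.backward(); head_opt.step()
print(f"few-shot fit error: {loss.item():.3f}")
```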

S2

Speaker 2

01:06:37

So what do you think about, if you just look at one particular case study, Tesla Autopilot, which is quickly approaching a million vehicles on the road, where some percentage of the time, 30, 40% of the time, the car is driven using the computer vision multitask HydraNet, that's what they call it, HydraNet, right? And the other percent of the time it's human controlled. From the human side, how can we use that data?

S2

Speaker 2

01:07:09

What's your sense? So like, what's the signal? Do you have ideas in this autonomous vehicle space, where people can lose their lives? You know, it's a safety-critical environment.

S2

Speaker 2

01:07:21

So how do we use that data?

S1

Speaker 1

01:07:24

So I think that, actually, the kind of problems that come up when we want systems that are reliable and that can understand the limits of their capabilities are very similar to the kind of problems that come up when we're doing off-policy reinforcement learning. As I mentioned before, in off-policy reinforcement learning, the big problem is that you need to know when you can trust the predictions of your model. Because if you're trying to evaluate some pattern of behavior for which your model doesn't give you an accurate prediction, then you shouldn't use that to modify your policy.

S1

Speaker 1

01:07:57

It's actually very similar to the problem that we're faced when we actually then deploy that thing and we want to decide whether we trust it in the moment or not. So perhaps we just need to do a better job of figuring out that part. And that's a very deep research question, of course.

S1

Speaker 1

01:08:10

But it's also a question that a lot of people are working on. So I'm pretty optimistic that we can make some progress on that over the next few years.
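
To make the "knowing when to trust the model" idea concrete, here is a minimal sketch of one common approach: train an ensemble of Q-functions and treat high disagreement between them as a signal not to trust (or act on) the prediction. The network sizes and threshold are illustrative assumptions, not a specific published method:

```python
import torch
import torch.nn as nn

# An ensemble of Q-networks, trained on the same off-policy data with
# different initializations (and, in practice, different minibatches).
def make_q(state_dim, action_dim):
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )

ensemble = [make_q(state_dim=10, action_dim=4) for _ in range(5)]

def q_with_uncertainty(state, action):
    x = torch.cat([state, action], dim=-1)
    preds = torch.stack([q(x) for q in ensemble])   # (5, 1)
    return preds.mean(), preds.std()

def trusted(state, action, max_std=0.1):
    # High ensemble disagreement ~ the data didn't pin this value down,
    # so don't use the prediction to update the policy (or to act on it).
    _, std = q_with_uncertainty(state, action)
    return std.item() < max_std

state, action = torch.randn(10), torch.randn(4)
mean_q, std_q = q_with_uncertainty(state, action)
print(f"Q={mean_q.item():.3f} ± {std_q.item():.3f}, trusted={trusted(state, action)}")
```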

S2

Speaker 2

01:08:15

What's the role of simulation in reinforcement learning, in deep reinforcement learning? Like, how essential is it? It's been essential for some interesting breakthroughs so far.

S2

Speaker 2

01:08:28

Do you think it's a crutch that we rely on? I mean, again, this connects to our off-policy discussion, but do you think we can ever get rid of simulation? Or do you think simulation will actually take over, and we'll create more and more realistic simulations that will allow us to solve actual real-world problems, like transferring the models we learn in simulation to real-world problems?

S1

Speaker 1

01:08:49

Yeah. I think that simulation is a very pragmatic tool that we can use to get a lot of useful stuff to work right now. But I think that in the long run, we will need to build machines that can learn from real data, because that's the only way that we'll get them to improve perpetually. Because if we can't have our machines learn from real data, if they have to rely on simulated data, eventually the simulator becomes the bottleneck.

S1

Speaker 1

01:09:12

In fact, this is a general thing. If your machine has any bottleneck that is built by humans and that doesn't improve from data, it will eventually be the thing that holds it back. And if you're entirely reliant on your simulator, that'll be the bottleneck. If you're entirely reliant on a manually designed controller, that's going to be the bottleneck.

S1

Speaker 1

01:09:30

So simulation is very useful. It's very pragmatic, but it's not a substitute for being able to utilize real experience. By the way, this is something that I think is quite relevant now, especially in the context of some of the things we've discussed because some of these scaffolding issues that I mentioned, things like the broken dishes and the unknown reward function, like these are not problems that you would ever stumble on when working in a purely simulated kind of environment. But they become very apparent when we try to actually run these things in the real world.

S2

Speaker 2

01:10:03

To throw a brief wrench into our discussion, let me ask, do you think we're living in a simulation?

S1

Speaker 1

01:10:07

Oh, I have no idea.

S2

Speaker 2

01:10:09

Do you think that's a useful thing to even think about, the fundamental physics nature of reality? Or, from another perspective, the reason I think the simulation hypothesis is interesting is to think about how difficult it would be to create sort of a virtual reality game type situation that would be sufficiently convincing to us humans, or sufficiently enjoyable, that we wouldn't want to leave. I mean, that's actually a practical engineering challenge.

S2

Speaker 2

01:10:43

And I personally really enjoy virtual reality, but it's quite far away. I kind of think about what it would take for me to want to spend more time in virtual reality versus the real world. And that's sort of a nice, clean question, because at that point, if I want to live in a virtual reality, that means we're just a few years away from a majority of the population living in a virtual reality, and that's how we create the simulation, right? You don't need to actually simulate the quantum gravity and every aspect of the universe. And that's an interesting question for reinforcement learning too.

S2

Speaker 2

01:11:23

If you want to make sufficiently realistic simulations that blur the difference between the real world and the simulation, then some of the problems we've been talking about kind of go away, if we can create actually interesting, rich simulations.

S1

Speaker 1

01:11:40

It's an interesting question, and I think it casts your previous question in a very interesting light. Because in some ways, asking whether we can... well, the more practical version of this is: can we build simulators that are good enough to train essentially AI systems that will work in the world? And it's kind of interesting to think about what this implies.

S1

Speaker 1

01:12:05

If true, it kind of implies that it's easier to create the universe than it is to create a brain. And put that way, it seems kind of weird.

S2

Speaker 2

01:12:15

The aspect of the simulation most interesting to me is the simulation around the humans. That seems to be a complexity that makes the robotics problem harder. Now I don't know if every robotics person agrees with that notion.

S2

Speaker 2

01:12:31

Just as a quick aside, what are your thoughts about when the human enters the picture of the robotics problem? How does that change the reinforcement learning problem, the learning problem in general?

S1

Speaker 1

01:12:45

Yeah, I think that's a kind of complex question. I guess my hope for a while had been that if we build these robotic learning systems that are multi-task, that utilize lots of prior data, and that learn from their own experience, the bit where they have to interact with people will be perhaps handled in much the same way as all the other bits. If they have prior experience of interacting with people, and they can learn from their own experience of interacting with people for this new task, maybe that'll be enough.

S1

Speaker 1

01:13:17

Now, of course, if it's not enough, there are many other things we can do. And there's quite a bit of research in that area. But I think it's worth a shot to see whether the multi-agent interaction, the ability to understand that other beings in the world have their own goals, intentions, and thoughts, and so on, whether that kind of understanding can emerge automatically from simply learning to do things and maximize utility.

S2

Speaker 2

01:13:44

That information arises from the data. You've said something about gravity, that you don't need to explicitly inject anything into the system, that it can be learned from the data, and gravity is an example of something that can be learned from data, sort of like the physics of the world. What are the limits of what we can learn from data?

S2

Speaker 2

01:14:08

Do you really... do you think we can... so a very simple, clean way to ask it is: do you really think we can learn gravity from just data? The idea, the laws of gravity.

S1

Speaker 1

01:14:19

So, something that I think is a common pitfall when thinking about prior knowledge and learning is to assume that just because we know something, it's better to tell the machine about it rather than have it figure it out on its own. In many cases, things that are important, that affect many of the events the machine will experience, are actually pretty easy to learn. Like, you know, if every time you drop something it falls down, then, yeah, you might get kind of the Newton's version, not Einstein's version, but it'll be pretty good.

S1

Speaker 1

01:14:56

And it will probably be sufficient for you to act rationally in the world because you see the phenomenon all the time. So things that are readily apparent from the data, we might not need to specify those by hand. It might actually be easier to let the machine figure them out.
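
As a toy illustration of how the "Newton's version" of gravity falls out of data: fit a curve to noisy observations of a falling object, and the constant g is recovered. All numbers here are simulated for illustration:

```python
import numpy as np

# Simulated observations of a dropped object: height over time,
# with measurement noise (units: meters, seconds).
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 50)
h0 = 20.0
heights = h0 - 0.5 * 9.81 * t**2 + rng.normal(0, 0.05, t.shape)

# Fit h(t) = a*t^2 + b*t + c; under constant gravity, a = -g/2.
a, b, c = np.polyfit(t, heights, deg=2)
print(f"estimated g = {-2 * a:.2f} m/s^2")   # ~9.81: Newton's version, from data
```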

S2

Speaker 2

01:15:10

It just feels like there might be a space of many local minima, in terms of theories of this world, that we would discover and get stuck on. Newtonian mechanics is not necessarily easy to come by.

S1

Speaker 1

01:15:27

Yeah, and in fact, in some fields of science, human civilization got stuck in exactly these kinds of local optima. For example, if you think about how people tried to figure out biology and medicine: for the longest time, the kinds of rules, the kinds of principles that serve us very well in our day-to-day lives actually served us very poorly in understanding medicine and biology. We had very superstitious and weird ideas about how the body worked until the advent of the modern scientific method.

S1

Speaker 1

01:15:57

So that does seem to be a failing of this approach, but it's also a failing of human intelligence, arguably.

S2

Speaker 2

01:16:04

Maybe a small aside, but, you know, the idea of self-play is fascinating in reinforcement learning: creating a competitive context in which agents can play against each other at the same skill level, thereby increasing each other's skill level. This kind of self-improving mechanism seems exceptionally powerful in the contexts where it can be applied. First of all, is it beautiful to you that this mechanism works as well as it does? And also, can it be generalized to other contexts, like the robotics space, or anything applicable to the real world?

S1

Speaker 1

01:16:43

I think that it's a very interesting idea, but I suspect that the bottleneck to actually generalizing it to the robotic setting is going to be the same as the bottleneck for everything else: we need to be able to build machines that can get better and better through natural interaction with the world. And once we can do that, then they can go out and play, they can play with each other, they can play with people, they can play with the natural environment.

S1

Speaker 1

01:17:12

But before we get there, we've got all these other problems we have to get out of the way.

S2

Speaker 2

01:17:16

So there's no shortcut around that. You have to interact with a natural environment that...

S1

Speaker 1

01:17:20

Well, because in a self-play setting, you still need a mediating mechanism. So the reason that self-play works for a board game is because the rules of that board game mediate the interaction between the agents. So the kind of intelligent behavior that will emerge depends very heavily on the nature of that mediating mechanism.
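
A minimal illustration of that mediating-mechanism point: in the sketch below, two agents adapt to each other with no knowledge beyond the payoff rules of rock-paper-scissors, and those rules alone determine what behavior emerges (here, time-averaged play drifting toward the uniform equilibrium). The learning rule, learning rate, and step count are arbitrary illustrative choices:

```python
import numpy as np

# Rock-paper-scissors payoffs for player A: payoff[i, j] is A's reward
# when A plays action i and B plays action j. These rules are the
# mediating mechanism; they dictate what behavior self-play converges to.
payoff = np.array([[ 0., -1.,  1.],
                   [ 1.,  0., -1.],
                   [-1.,  1.,  0.]])

def strategy(w):
    # Softmax over accumulated payoff estimates (exponential weights).
    p = np.exp(w - w.max())
    return p / p.sum()

wA, wB = np.zeros(3), np.zeros(3)
avgA, avgB, lr, steps = np.zeros(3), np.zeros(3), 0.1, 5000

for _ in range(steps):
    pA, pB = strategy(wA), strategy(wB)
    avgA += pA
    avgB += pB
    wA += lr * (payoff @ pB)      # A improves against B's current mix
    wB += lr * (-payoff.T @ pA)   # B improves against A (zero-sum game)

# Time-averaged play approaches the game's equilibrium: uniform 1/3 each.
print(avgA / steps, avgB / steps)
```

Change the payoff matrix and the emergent behavior changes with it, which is the sense in which the rules of the game, not the learning algorithm, shape what self-play discovers.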

S2

Speaker 2

01:17:39

So on the side of reward functions: coming up with good reward functions seems to be the thing that we associate with general intelligence. Like, human beings seem to value the idea of developing our own reward functions, of arriving at meaning and so on. And yet in reinforcement learning, we often kind of treat the reward as a given.

S2

Speaker 2

01:18:02

What's your sense of how we develop good reward functions?

S1

Speaker 1

01:18:08

Yeah, I think that's a very complicated and very deep question. And you're completely right that, classically in reinforcement learning, this question has kind of been treated as a non-issue: you sort of treat the reward as this external thing that comes from some other bit of your biology, and you kind of don't worry about it. And I do think that that's actually a little bit of a mistake, that we should worry about it.

S1

Speaker 1

01:18:33

And we can approach it in a few different ways. We can approach it, for instance, by thinking of reward as a communication medium. We can say, well, how does a person communicate to a robot what its objective is? You can approach it also as sort of more of an intrinsic motivation medium.

S1

Speaker 1

01:18:47

You could say, can we write down kind of a general objective that leads to good capability? Like, for example, can you write down some objective such that even in the absence of any other task, if you maximize that objective, you'll sort of learn useful things? This is something that has sometimes been called unsupervised reinforcement learning, which I think is a really fascinating area of research, especially today. We've done a bit of work on that recently.

S1

Speaker 1

01:19:12

One of the things we've studied is whether we can have some notion of unsupervised reinforcement learning by means of information-theoretic quantities, like, for instance, minimizing a Bayesian measure of surprise. This is an idea that was pioneered actually in the computational neuroscience community by folks like Karl Friston. And we've done some work recently that shows that you can actually learn pretty interesting skills by essentially behaving in a way that allows you to make accurate predictions about the world. It seems a little circular: do the things that will lead to you getting the right answer for prediction.

S1

Speaker 1

01:19:48

But by doing this, you can sort of discover stable niches in the world. You can discover that if you're playing Tetris, then correctly clearing the rows will let you play Tetris for longer and keep the board nice and clean, which sort of satisfies some desire for order in the world, and as a result gives you some degree of leverage over your domain. So we're exploring that pretty actively.
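
A minimal sketch of that surprise-minimization idea: keep a density model of the states visited so far and reward the agent for landing in states the model finds likely (high log-probability means low surprise). The diagonal Gaussian model here is a deliberately simple stand-in for the richer density models used in actual work:

```python
import numpy as np

class SurpriseMinimizingReward:
    """Intrinsic reward = log-probability of the current state under a
    density model fit to the states visited so far. Staying predictable
    (keeping the Tetris board clean, say) scores well; chaos scores badly."""

    def __init__(self, state_dim):
        self.mean = np.zeros(state_dim)
        self.var = np.ones(state_dim)
        self.n = 0

    def update(self, state):
        # Online (Welford-style) update of running mean/variance of states.
        self.n += 1
        delta = state - self.mean
        self.mean += delta / self.n
        self.var += (delta * (state - self.mean) - self.var) / self.n

    def reward(self, state):
        var = np.maximum(self.var, 1e-6)
        log_prob = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (state - self.mean) ** 2 / var
        )
        return log_prob  # higher = less surprising = more reward

rng = np.random.default_rng(0)
model = SurpriseMinimizingReward(state_dim=4)
for _ in range(100):
    model.update(rng.normal(0, 1, 4))     # states the agent has visited
print(model.reward(np.zeros(4)))          # familiar state: high reward
print(model.reward(np.full(4, 10.0)))     # surprising state: very low reward
```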

S2

Speaker 2

01:20:08

Is there a role for a human notion of curiosity in itself being the reward, sort of discovering new things about the world?

S1

Speaker 1

01:20:19

So one of the things that I'm pretty interested in is whether discovering new things can actually be an emergent property of some other objective that quantifies capability. So new things for the sake of new things might not by itself be the right answer, but perhaps we can figure out an objective for which discovering new things is actually the natural consequence. That's something we're working on right now, but I don't have a clear answer for you there yet; that's still a work in progress.

S2

Speaker 2

01:20:49

You mean just that it's a curious observation to see sort of creative patterns of curiosity on the way to optimize for a particular...

S1

Speaker 1

01:21:00

On the way to optimize for a particular measure of capability.

S2

Speaker 2

01:21:05

Is there ways to understand or anticipate unexpected, unintended consequences of particular reward functions? Sort of anticipate the kind of strategies that might be developed and try to avoid highly detrimental strategies.

S1

Speaker 1

01:21:26

Yeah, so classically, this is something that has been pretty hard in reinforcement learning, because it's difficult for a designer to have good intuition about what a learning algorithm will come up with when they give it some objective. There are ways to mitigate that. One way to mitigate it is to actually define an objective that says: don't do weird stuff.

S1

Speaker 1

01:21:45

You can actually quantify it and say: just don't enter situations that have low probability under the distribution of states you've seen before. It turns out that that's one very good way to do off-policy reinforcement learning. So we can do some things like that.
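
A sketch of that "don't do weird stuff" objective: penalize the reward whenever the agent enters a state with low probability under a density model fit to previously seen states. The Gaussian density, threshold, and penalty size are illustrative assumptions, giving the general flavor rather than a specific algorithm:

```python
import numpy as np

def fit_state_density(dataset_states):
    # Simple density model over previously seen states: a diagonal
    # Gaussian fit to the dataset (a stand-in for something richer,
    # like a VAE or a flow, in real systems).
    mean = dataset_states.mean(axis=0)
    var = dataset_states.var(axis=0) + 1e-6
    def log_prob(s):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (s - mean) ** 2 / var)
    return log_prob

def penalized_reward(task_reward, state, log_prob, threshold=-10.0, penalty=100.0):
    # "Don't enter situations that have low probability under the
    # distribution of states you've seen before."
    if log_prob(state) < threshold:
        return task_reward - penalty
    return task_reward

dataset = np.random.default_rng(0).normal(0, 1, size=(1000, 4))
log_prob = fit_state_density(dataset)
print(penalized_reward(1.0, np.zeros(4), log_prob))      # in-distribution: 1.0
print(penalized_reward(1.0, np.full(4, 8.0), log_prob))  # weird state: -99.0
```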

S2

Speaker 2

01:22:02

If we slowly venture, speaking about reward functions, into greater and greater levels of intelligence... I mean, Stuart Russell thinks about this, the alignment of AI systems with us humans. So how do we ensure that AGI systems align with us humans? It's kind of a reward function question: specifying the behavior of AI systems such that their success aligns with the broader intended interests of human beings.

S2

Speaker 2

01:22:40

Do you have thoughts on this? Do you have kind of concerns of where reinforcement learning fits into this? Or are you really focused on the current moment of us being quite far away and trying to solve the robotics problem?

S1

Speaker 1

01:22:51

I don't have a great answer to this, but I do think that this is a problem that's important to figure out. For my part, I'm actually a bit more concerned about the other side of this equation: maybe rather than unintended consequences from objectives that are specified too well, I'm actually more worried right now about unintended consequences from objectives that are not optimized well enough, which might become a very pressing problem when we, for instance, try to use these techniques for safety-critical systems like cars and aircraft and so on. I think at some point we'll face the issue of objectives being optimized too well, but right now I think we're more likely to face the issue of them not being optimized well enough.

S2

Speaker 2

01:23:36

But you don't think unintended consequences can arise even when you're far from optimality, sort of like on the path to it?

S1

Speaker 1

01:23:43

Oh no, I think unintended consequences can absolutely arise. It's just, I think right now, the bottleneck for improving reliability, safety, and things like that is more with systems that need to work better, that need to optimize their objective better.

S2

Speaker 2

01:23:58

Do you have thoughts, concerns about existential threats of human-level intelligence? If we put on our hat of looking 10, 20, 100, 500 years from now, do you have concerns about existential threats of AI systems?

S1

Speaker 1

01:24:15

I think there are absolutely existential threats for AI systems, just like there are for any powerful technology. But I think that these kinds of problems can take many forms, and some of those forms will come down to people with nefarious intent. Some of them will come down to AI systems that have some fatal flaws, and some of them will of course come down to AI systems that are too capable in some way.

S1

Speaker 1

01:24:44

But among this set of potential concerns, I would actually be much more concerned about the first two right now, and principally the one with nefarious humans, than I am about the others, because through all of human history, it's actually the nefarious humans that have been the problem, not the nefarious machines. And I think that right now the best that I can do to make sure things go well is to build the best technology I can and also hopefully promote responsible use of that technology.

S2

Speaker 2

01:25:13

Do you think RL systems have something to teach us humans? You said nefarious humans getting us in trouble. I mean, machine learning systems have in some ways revealed to us the ethical flaws in our data.

S2

Speaker 2

01:25:28

In that same kind of way, can reinforcement learning teach us about ourselves? Has it taught something? What have you learned about yourself from trying to build robots and reinforcement learning systems?

S1

Speaker 1

01:25:42

I'm not sure what I've learned about myself, but maybe part of the answer to your question might become a little more apparent once we see more widespread deployment of reinforcement learning for decision-making support in domains like healthcare, education, social media, etc. And I think we will see some interesting stuff emerge there. We will see, for instance, what kinds of behaviors these systems come up with in situations where there is interaction with humans and where they have the possibility of influencing human behavior.

S1

Speaker 1

01:26:18

I think we're not quite there yet, but maybe in the next few years we'll see some interesting stuff come out in that area.

S2

Speaker 2

01:26:23

I hope outside the research space too, because the exciting space where this could be observed is large companies that deal with large data, and I hope there's some transparency there. Because one of the things that's unclear when I look at social networks, and just online, is why an algorithm did something, or whether an algorithm was even involved. And it'd be interesting from a research perspective to observe the results of those algorithms, to open up that data, or to at least be sufficiently transparent about the behavior of these AI systems in the real world.

S2

Speaker 2

01:27:01

What's your sense... I don't know if you've looked at the blog post "The Bitter Lesson" by Rich Sutton, where he looks at sort of the big lesson of research in AI and reinforcement learning: that simple, general methods that leverage computation seem to work well. So basically, don't try to do any kind of fancy algorithms, just wait for computation to get fast. Do you share this kind of intuition?

S1

Speaker 1

01:27:31

I think the high level idea makes a lot of sense. I'm not sure that my takeaway would be that we don't need to work on algorithms. I think that my takeaway would be that we should work on general algorithms.

S1

Speaker 1

01:27:44

Actually, I think that this idea of needing to better automate the acquisition of experience in the real world follows pretty naturally from Rich Sutton's conclusion. If the claim is that automated general methods plus data leads to good results, then it makes sense that we should build general methods, and we should build the kinds of methods that we can deploy and get to go out there and collect their experience autonomously. I think one place where the current state of things falls a little bit short of that is the going out there and collecting the data autonomously, which is easy to do in a simulated board game but very hard to do in the real world.

S2

Speaker 2

01:28:27

Yeah, it keeps coming back to this one problem, right? So your mind is focused there now, in this real world. It just seems scary, this step of collecting the data.

S2

Speaker 2

01:28:41

And it seems unclear to me how we can do it effectively.

S1

Speaker 1

01:28:45

Yeah. Well, you know, there are 7 billion people in the world, and each of them had to do that at some point in their lives.

S2

Speaker 2

01:28:50

And we should leverage that experience that they've all done. We should be able to try to collect that kind of data. Okay, big questions.

S2

Speaker 2

01:29:02

Maybe stepping back through your life: what book or books, technical or fiction or philosophical, had a big impact on the way you saw the world, on the way you thought about the world, your life in general? And maybe, if it's different, what books would you recommend people consider reading on their own intellectual journey? It could be within reinforcement learning, but it could be much bigger.

S1

Speaker 1

01:29:32

I don't know if this is, like, a scientifically particularly meaningful answer, but the honest answer is that I actually found a lot of the work by Isaac Asimov to be very inspiring when I was younger. I don't know if that has anything to do with AI necessarily.

S2

Speaker 2

01:29:50

You don't think it had a ripple effect in your life? Maybe it did.

S1

Speaker 1

01:29:56

But yeah, it was a vision of a future where, first of all, artificial intelligence systems, artificial robotic systems, have kind of a big place, a big role in society, and where we try to imagine the limiting case of technological advancement and how that might play out in our future history. I think that that was in some way influential. I don't really know how, but I would recommend it.

S1

Speaker 1

01:30:34

I mean, if nothing else, you'd be well entertained.

S2

Speaker 2

01:30:36

When did you first, yourself, fall in love with the idea of artificial intelligence, get captivated by this field?

S1

Speaker 1

01:30:45

So my honest answer here is actually that I only really started to think about it as something that I might want to do actually in graduate school pretty late. And a big part of that was that until somewhere around 2009, 2010, it just wasn't really high on my priority list because I didn't think that it was something where we were going to see very substantial advances in my lifetime. Maybe in terms of my career, the time when I really decided I wanted to work on this was when I actually took a seminar course that was taught by Professor Andrew Ng.

S1

Speaker 1

01:31:24

And at that point, I, of course, had a decent understanding of the technical things involved. But one of the things that really resonated with me was when he said, in the opening lecture, something to the effect of: well, he used to have graduate students come to him and talk about how they want to work on AI, and he would kind of chuckle and give them some math problem to deal with. But now he's actually thinking that this is an area where we might see substantial advances in our lifetime.

S1

Speaker 1

01:31:47

And that kind of got me thinking, because, you know, in some abstract sense, yeah, you can kind of imagine that. But in a very real sense, when someone who had been working on that kind of stuff their whole career suddenly says that, yeah, that had some effect on me.

S2

Speaker 2

01:32:03

Yeah, this might be a special moment in the history of the field. That this is where we might see some interesting breakthroughs. So in the space of advice, somebody who's interested in getting started in machine learning or reinforcement learning, what advice would you give to maybe an undergraduate student or maybe even younger, what are the first steps to take and further on, what are the steps to take on that journey?

S1

Speaker 1

01:32:32

So something that I think is important to do is to not be afraid to spend time imagining the kind of outcome that you might like to see. So 1 outcome might be a successful career, a large paycheck or something, or state of the art results in some benchmark. But hopefully, that's not the thing that's the main driving force for somebody.

S1

Speaker 1

01:32:57

But I think that if someone who is a student considering a career in AI takes a little while, sits down, and thinks: what do I really want to see? What do I want to see a machine do? What do I want to see a robot do? What do I want to see a natural language system do?

S1

Speaker 1

01:33:12

Just imagine it, you know, almost like a commercial for a future product, or something that you'd like to see in the world, and then actually sit down and think about the steps that are necessary to get there. And hopefully that thing is not a better number on ImageNet classification. It's probably an actual thing that we can't do today that would be really awesome, whether it's a robot butler or, you know, a really awesome healthcare decision-making support system, whatever it is that you find inspiring.

S1

Speaker 1

01:33:41

And I think that thinking about that and then backtracking from there and imagining the steps needed to get there will actually lead to much better research. It'll lead to rethinking the assumptions. It'll lead to working on the bottlenecks that other people aren't working on.

S2

Speaker 2

01:33:55

And then, naturally, to turn it to you: we've talked about reward functions, and you've just given advice on looking forward, on what kind of change you would like to make in the world. What do you think, ridiculously big question, what do you think is the meaning of life? What is the meaning of your life?

S2

Speaker 2

01:34:13

What gives you fulfillment, purpose, happiness, and meaning?

S1

Speaker 1

01:34:20

That's a very big question.

S2

Speaker 2

01:34:24

What's the reward function under which you are operating?

S1

Speaker 1

01:34:27

Yeah, I think one thing that does give, if not meaning, at least satisfaction, is some degree of confidence that I'm working on a problem that really matters. I feel like it's less important to me to actually solve a problem; it's quite nice to spend my time on things that I believe really matter. And I try pretty hard to look for that.

S2

Speaker 2

01:34:52

I don't know if it's easy to answer this, but if you're successful, what does that look like? What's the big dream? Now, of course, success is built on top of success and you keep going forever, but what is the dream?

S1

Speaker 1

01:35:10

Yeah, so one very concrete thing, or maybe as concrete as it's going to get here, is to see machines that actually get better and better the longer they exist in the world. On the surface, one might even think that that's something we have today, but I think we really don't. I think that there is an unending complexity in the universe, and to date, all of the machines that we've been able to build don't improve up to the limit of that complexity.

S1

Speaker 1

01:35:44

They hit a wall somewhere. Maybe they hit a wall because they're in a simulator that is only a very limited, very pale imitation of the real world, or they hit a wall because they rely on a labeled dataset. But they never hit the wall of running out of stuff to see. So I'd like to build a machine that can go as far as possible in that regard.

S2

Speaker 2

01:36:04

Runs up against the ceiling of the complexity of the universe. Yes. Well, I don't think there's a better way to end it, Sergey.

S2

Speaker 2

01:36:11

Thank you so much, it's a huge honor. I can't wait to see the amazing work that you have yet to publish, and in the education space, in terms of reinforcement learning. Thank you for inspiring the world. Thank you for the great research you do.

S2

Speaker 2

01:36:24

Thank you.

S3

Speaker 3

01:36:25

Thanks for listening to this conversation with Sergey Levine, and thank you to our sponsors, Cash App and ExpressVPN. Please consider supporting this podcast by downloading Cash App and using code LexPodcast, and by signing up at ExpressVPN.com/LexPod. Click all the links, buy all the stuff; it's the best way to support this podcast and the journey I'm on.

S3

Speaker 3

01:36:52

If you enjoy this thing, subscribe on YouTube, review it with five stars on Apple Podcasts, support it on Patreon, or connect with me on Twitter at Lex Fridman, spelled somehow, if you can figure out how, without using the letter E, just F-R-I-D-M-A-N. And now, let me leave you with some words from Salvador Dali: intelligence without ambition is a bird without wings.