An introductory lecture for MIT course 6.S094 on the basics of deep learning including a few key ideas, subfields, and the big picture of why neural networks have inspired and energized an entire new generation of researchers. For more lecture videos on deep learning, reinforcement learning (RL), artificial intelligence (AI & AGI), and podcast conversations, visit our website or follow TensorFlow code tutorials on our GitHub repo.

INFO:
Website: https://deeplearning.mit.edu
GitHub: https://github.com/lexfridman/mit-deep-learning
Slides: http://bit.ly/deep-learning-basics-slides
Playlist: http://bit.ly/deep-learning-playlist
Blog post: https://link.medium.com/TkE476jw2T

OUTLINE:
0:00 - Introduction
0:53 - Deep learning in one slide
4:55 - History of ideas and tools
9:43 - Simple example in TensorFlow
11:36 - TensorFlow in one slide
13:32 - Deep learning is representation learning
16:02 - Why deep learning (and why not)
22:00 - Challenges for supervised learning
38:27 - Key low-level concepts
46:15 - Higher-level methods
1:06:00 - Toward artificial general intelligence

CONNECT:
- If you enjoyed this video, please subscribe to this channel.
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman

It's really good to see everybody here, making it in the cold.

on deep learning that we're running throughout this month.

The website where you can get all the content, the videos, the lectures, and the code is deeplearning.mit.edu.

The videos and slides will be made available there, along with a GitHub repository that's accompanying the course.

Assignments for registered students will be emailed later on in the week.

And you can always contact us with questions, concerns, comments at hcai (human-centered AI) at mit.edu.

So let's start through the basics, the fundamentals.

To summarize in one slide, what is deep learning?

It is a way to extract useful patterns from data in an automated way, with as little human effort involved as possible, hence the automated.

How? The fundamental aspect that we'll talk about a lot is the optimization of neural networks.

The practical nature that we'll provide through the code and so on is that there are libraries that make it accessible and easy to do some of the most powerful things in deep learning using Python, TensorFlow, and friends.

The hard part always with machine learning and artificial intelligence in general is asking good questions and getting good data.

A lot of times the exciting aspect of what the news covers, and the exciting aspect of what is published in the prestigious conferences, on arXiv, and in blog posts, is the methodology.

The hard part is applying that methodology to solve real world problems, to solve fascinating, interesting problems, and that requires data.

That requires asking the right questions of that data, organizing that data, and labeling, selecting aspects of that data that can reveal the answers to the questions you ask.

So why has there been this breakthrough over the past decade in the application of neural networks? What has happened, what has changed? The ideas of neural networks have been around since the 1940s, and ideas have been percolating even before.

The digitization of information, data, the ability to access data easily in a distributed fashion across the world.

All kinds of problems have now a digital form that can be accessed by learning algorithms.

Hardware, compute, both the Moore's Law of CPU and GPU and ASICs, Google's TPU systems, hardware that enables the efficient, effective large-scale execution of these algorithms.

Community, people here, people all over the world being able to work together, to talk to each other, to feed the fire of excitement behind machine learning.

The tooling, as we'll talk about TensorFlow, PyTorch, and everything in between, that enables a person with an idea to reach a solution in less and less and less time.

Higher and higher levels of abstraction empower people to solve problems in less and less time with less and less knowledge, where the idea and the data become the central point, not the effort that takes you from idea to the solution.

And there's been a lot of exciting progress, some of which we'll talk about, from face recognition to the general problem of scene understanding, image classification, to speech, text, natural language processing, transcription, translation, to medical applications and medical diagnosis, to cars being able to solve many aspects of perception in autonomous vehicles with drivable area and lane detection and object detection, to digital assistants, the ones on your phone and beyond, the ones in your home, to ads and recommender systems, from Netflix to search to social, Facebook, and, of course, the deep reinforcement learning successes in the playing of games, from board games to StarCraft and Dota.

Deep learning is more than a set of tools to solve practical problems.

Pamela McCorduck said in '79, "AI began with the ancient wish to forge the gods."

Throughout our history, throughout our civilization, human civilization, we've dreamed about creating echoes of whatever is in this mind of ours in the machine and creating living organisms.

From the popular culture in the 1800s with Frankenstein to Ex Machina, this vision, this dream of understanding intelligence and creating intelligence has captivated all of us.

And deep learning is at the core of that because there's aspects of it, the learning aspects, that captivate our imagination about what is possible, given data and methodology, what learning, learning to learn, and beyond, how far that can take us.

And here visualized is just 3% of the neurons and one millionth of the synapses in our own brain.

This incredible structure that's in our mind and there's only echoes of it, small shadows of it in our artificial neural networks that we're able to create.

But nevertheless, those echoes are inspiring to us.

The history of neural networks on this pale blue dot of ours started quite a while ago, with summers and winters, with excitements and periods of pessimism, starting in the 40s with neural networks and the implementation of those neural networks as a perceptron in the 50s, with ideas of backpropagation, restricted Boltzmann machines, and recurrent neural networks in the 70s and 80s, with convolutional neural networks and the MNIST dataset, with datasets beginning to percolate, and LSTMs and bidirectional RNNs in the 90s, and the rebranding and the rebirth of neural networks under the flag of deep learning and deep belief nets in 2006.

The birth of ImageNet in 2009, the data set on which the possibilities of what deep learning can bring to the world were first illustrated in recent years.

And AlexNet, the network that performed exactly that on ImageNet, with a few ideas like dropout that improve neural networks over time, year by year improving the performance of neural networks.

In 2014, the idea of GANs, which Yann LeCun called the most exciting idea of the last 20 years: generative adversarial networks, the ability to, with very little supervision, generate data, to generate ideas after forming representations of those.

From the understanding, from the high-level abstractions of what is extracted in the data, be able to generate new samples, create.

The idea of being able to create as opposed to memorize is really exciting.

And on the applied side, in 2014 with DeepFace, the ability to do face recognition.

There's been a lot of breakthroughs on the computer vision front, that being one of them.

The world was inspired, captivated in 2016 with AlphaGo and in '17 with AlphaZero, beating, with less and less and less effort, the best players in the world at Go.

A problem that, for most of the history of artificial intelligence, was thought to be unsolvable.

And new ideas with capsule networks, and this year, 2018, was the year of natural language processing.

Google's BERT and others that we'll talk about were breakthroughs in the ability to understand language, understand speech, and everything, including generation, that's built all around that.

And there's a parallel history of tooling starting in the 60s with the Perceptron and the wiring diagrams.

And ending this year with PyTorch 1.0 and TensorFlow 2.0.

These are really solidified, exciting, powerful ecosystems of tools that enable you to do a lot with very little effort.

The sky is the limit thanks to the tooling.

So let's then, from the big picture, go to the smallest.

Everything should be made as simple as possible.

So let's start simple with a little piece of code before we jump into the details and a big run through everything that is possible in deep learning.

At the very basic level, with just a few lines of code, really six here, six little pieces of code, you can train a neural network to understand what's going on in an image.

The classic that I will always love, the MNIST dataset, the handwritten digits, where the input to a neural network, a machine learning system, is the picture of a handwritten digit, and the output is the number that's in that picture.

It's as simple as in the first step, import the library, TensorFlow.

Third step, like Lego bricks, stack on top of each other, the neural network, layer by layer, with a hidden layer, an input layer, an output layer.

Evaluate the model in step 5 on the testing data set, and that's it.

You're ready to predict what's in the image.
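As a rough sketch of what those six pieces of code could look like with the Keras API in TensorFlow: the transcript only preserves steps one, three, and five, so the data loading, training, and prediction steps below are assumptions, as are the layer sizes, the optimizer, and the number of epochs. It needs TensorFlow installed and downloads MNIST the first time it is called.

```python
def train_mnist(epochs=5):
    # Step 1: import the library. Imported inside the function so that
    # TensorFlow is only required when the sketch is actually run.
    import tensorflow as tf

    # Step 2 (assumed): load the MNIST digits and scale pixels to [0, 1].
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Step 3: stack the network layer by layer, like Lego bricks:
    # an input layer, one hidden layer, and an output layer.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Step 4 (assumed): pick an optimizer and a loss, then train.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=epochs)

    # Step 5: evaluate the model on the testing data set.
    test_loss, test_accuracy = model.evaluate(x_test, y_test)

    # Step 6 (assumed): predict the digit in one test image.
    probabilities = model.predict(x_test[:1])
    return test_accuracy, int(probabilities.argmax())
```

Calling `train_mnist()` returns the test accuracy and the predicted digit for the first test image.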

And much of this code, obviously much more complicated, or much more elaborate and rich and interesting and complex, we'll be making available on GitHub, in our repository that accompanies these courses.

Today we've released the first tutorial on driver scene segmentation.

And then on the tooling side, in one slide, before we dive into neural networks and deep learning: amongst many other things, TensorFlow is a deep learning library, an open source library from Google.

The most popular one to date, the most active, with a large ecosystem.

It's not just something you import in Python to solve some basic problems.

Much of what we'll do in this course will be the highest level API with Keras.

But there's also the ability to run in the browser with TensorFlow.js, on the phone with TensorFlow Lite, and in the cloud: without any need to have a computer, hardware, anything on your own machine, you can run all the code that we're providing in the cloud with Google Colaboratory. There's the optimized ASIC hardware that Google has built for TensorFlow with their TPU, the Tensor Processing Unit, the ability to visualize with TensorBoard, and models provided in TensorFlow Hub.

And there's just an entire ecosystem, including, most importantly, I think, documentation and blogs that make it extremely accessible to understand the fundamentals of the tooling that allow you to solve the problems from natural language processing to computer vision, to GANs, generative adversarial neural networks, and everything in between, deep reinforcement learning and so on.

So that's why we're excited to sort of work both in the theory in this course, in this series of lectures, and in the tooling and the applied side of TensorFlow.

It really makes these ideas exceptionally accessible.

So deep learning, at the core, is the ability to form higher and higher levels of abstraction, of representations of data and raw patterns, higher and higher levels of understanding of patterns.

And those representations are extremely important and effective for being able to interpret data.

Under certain representations data is trivial to understand.

Cat versus dog, blue dot versus green triangle.

In this task, drawing a line under polar coordinates is trivial.

Under Cartesian coordinates, it is very difficult, well, impossible to do accurately.

And that's a trivial example of a representation.
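A small sketch of that point, under made-up data: two classes of points sit on concentric rings (the radii 1 and 3 and the 36 points per ring are assumptions for illustration). A single threshold on the radius separates them perfectly, while the best single threshold on the x coordinate cannot.

```python
import math

def make_rings(n_per_ring=36):
    """Two classes on concentric circles: radius 1 (label 0), radius 3 (label 1)."""
    points, labels = [], []
    for label, radius in ((0, 1.0), (1, 3.0)):
        for k in range(n_per_ring):
            angle = 2 * math.pi * k / n_per_ring
            points.append((radius * math.cos(angle), radius * math.sin(angle)))
            labels.append(label)
    return points, labels

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def best_threshold_accuracy(values, labels):
    """Best accuracy of any one-feature rule 'value > t' (or its flipped version)."""
    best = 0.0
    for t in sorted(set(values)):
        acc = accuracy([int(v > t) for v in values], labels)
        best = max(best, acc, 1 - acc)
    return best

points, labels = make_rings()
xs = [x for x, _ in points]                      # Cartesian: one raw coordinate
radii = [math.hypot(x, y) for x, y in points]    # polar: the radius

cartesian_acc = best_threshold_accuracy(xs, labels)
polar_acc = best_threshold_accuracy(radii, labels)
print("threshold on x:", cartesian_acc)
print("threshold on radius:", polar_acc)
```

The data is identical in both cases; only the representation changes.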

So our task with deep learning, with machine learning in general, is forming representations that map the topology, whatever the topology, the rich space of the problem that you're trying to deal with of the raw inputs, map it in such a way that the final representation is trivial to work with, trivial to classify, trivial to perform regression, trivial to generate new samples of that data.

And that representation of higher and higher levels of representation is really the dream of artificial intelligence.

That is what understanding is, making the complex simple, like Einstein said a few slides ago.

And as Juergen Schmidhuber, or whoever else said it, I don't know, put it, that's been the dream of all of science in general: the history of science is the history of compression progress, of forming simpler and simpler representations of ideas.

A model of the universe, of our solar system, with the earth at the center of it is much more complex to do physics on than a model where the sun is at the center.

Those higher and higher levels of simple representations enable us to do extremely powerful things.

That has been the dream of science and the dream of artificial intelligence.

the grander world of machine learning and artificial intelligence?

It's the ability to more and more remove the input of human experts, to remove the costly, inefficient effort of human beings from the picture.

Deep learning automates much of the extraction from the raw data, gets us closer and closer to the raw data without the need of human involvement, human expert involvement.

It's the ability to form representations from the raw data, as opposed to having a human being extract features, as was done in the 80s and 90s and the early aughts, with which the machine learning algorithms can then work.

The automated extraction of features enables us to work with larger and larger data sets, removing the human completely except for the supervision, the labeling step, at the very end.

There's always a balance between excitement and disillusionment.

The Gartner hype cycle, as much as we don't like to think about it, applies to almost every single technology.

Of course, the magnitude of the peaks and the troughs is different.

of an inflated expectation with deep learning.

And that's something we have to think about as we talk about some of the ideas and exciting possibilities of the future.

And with self-driving cars, which we'll talk about in future lectures in this course, we're at the same.

In fact, we're a little bit beyond the peak.

And so it's up to us, this is MIT, and the engineers and the people working on this in the world, to carry us through the trough, to carry us through the future, as the ups and downs of the excitement progress forward into the plateau of productivity.

Especially with humanoid robotics, robotic manipulation, and even, yes, autonomous vehicles: the majority of the aspects of autonomous vehicles do not, to date, involve machine learning to an extensive amount.

The problems are not formulated as data-driven learning.

Instead, they're model-based optimization methods that don't learn from data over time.

And then from the speakers these couple of weeks, we'll get to see how much machine learning is starting to creep in.

But the example shown here is the amazing humanoid robotics of Boston Dynamics.

To date, almost no machine learning has been used, except for trivial perception.

The same with autonomous vehicles: almost no machine learning and deep learning has been used, except with perception.

Some aspect of enhanced perception from the visual texture information.

Plus, what's starting to be used a little bit more is recurrent neural networks to predict the future, to predict the intent of the different players in the scene, in order to anticipate what the future is.

Most of the success that you see today, the 10 million miles that Waymo has achieved, has been attributed mostly to non-machine learning methods.

Here's a really clean example of unintended consequences.

Ethical issues we have to really think about.

When an algorithm learns from data based on an objective function, a loss function, the power, the consequences of an algorithm that optimizes that function are not always obvious.

Here's an example of a human player playing the game of CoastRunners. It's a boat racing game where the task is to go around the racetrack and try to win the race.

And the objective is to get as many points as possible.

The finishing time, how long it took you to finish, the finishing position, where you were in the ranking, and picking up quote unquote turbos, those little green things along the way, they give you points.

So we design an agent, in this case an RL agent, that optimizes for the rewards.

And what we find on the right here is that the agent discovers that the optimal actually has nothing to do with finishing the race or the ranking.

It can get many more points by just focusing on the turbos and collecting those little green dots, because they regenerate.

So you go in circles over and over and over, slamming into the wall, collecting the green turbos.

Now that's a very clear example of a well-reasoned, well-formulated objective function that has totally unexpected consequences, at least without sort of considering those consequences ahead of time.
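A toy sketch of how that can happen. This is not the actual game or the actual agent: the point values, the episode length, and the respawn rate are all invented for illustration. The score rewards both finishing and turbos, and because turbos regenerate, the looping strategy ends up with the higher score.

```python
def episode_score(turbos_collected, finished, finish_bonus=100, turbo_points=10):
    """Hypothetical scoring rule: points for turbos plus a bonus for finishing."""
    return turbos_collected * turbo_points + (finish_bonus if finished else 0)

def race_to_finish(steps):
    # Picks up the handful of turbos on the racing line once, then finishes.
    return episode_score(turbos_collected=5, finished=True)

def circle_for_turbos(steps, respawn_every=4):
    # Never finishes; grabs a regenerated turbo every few steps instead.
    return episode_score(turbos_collected=steps // respawn_every, finished=False)

steps = 200
racer = race_to_finish(steps)
looper = circle_for_turbos(steps)

# An optimizer that only sees the score prefers whichever policy scores higher.
best_policy = max(
    (("race_to_finish", racer), ("circle_for_turbos", looper)),
    key=lambda item: item[1],
)[0]
print(racer, looper, best_policy)
```

Nothing in the scoring rule says that finishing the race is the point, so nothing stops the optimizer from ignoring it.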

And so that shows the need for AI safety for a human in the loop of machine learning.

That's why not deep learning exclusively.

The challenge of deep learning algorithms, of deep learning applied, is to ask the right question and understand what the answers mean.

You have to take a step back and look at the difference, the distinction, the levels, degrees of what the algorithm is accomplishing.

For example, image classification is not necessarily scene understanding.

In fact, it's very far from scene understanding.

Classification may be very far from understanding.

And the data sets can vary drastically across the different benchmarks and the data sets used.

The professionally done photographs versus synthetically generated images versus real world data.

And the real world data is where the big impact is.

So oftentimes the one doesn't transfer to the other.

Solving all of these problems of different lighting variations, of pose variation, interclass variation, all the things that we take for granted as human beings with our incredible perception system all have to be solved in order to gain greater and greater understanding of a scene.

And all the other things we have to close the gap on that we're not even close to yet.

Here's an image from the Andrej Karpathy blog from a few years ago of former President Obama stepping on a scale.

We can classify, we can do semantic segmentation of the scene, we can do object detection, we can do a little bit of 3D reconstruction from a video version of the scene.

But what we can't do well is all the things we take for granted.

We can't tell the images in the mirrors apart from reality.

We can't deal with the sparsity of information.

With just a few pixels on President Obama's face, we can still identify him as the president.

The 3D structure of the scene, that there's a foot on top of a scale, that there's human beings behind from a single image.

Things we can trivially do using all the common-sense semantic knowledge that we have, machines cannot do.

The physics of the scene, that there's gravity,

And the biggest thing, the hardest thing, is

And what's on people's minds about what's on other people's minds, and so on.

Mental models of the world, being able to infer what people are thinking about.

Being able to infer, there's been a lot of exciting work here at MIT about what people are looking at.

But we're not even close to solving that problem either.

But what they're thinking about, we're not even, we haven't even begun to really think about that problem.

And I think I'm harping on the visual perception problem because it's one we really take for granted as human beings, especially when trying to solve real-world problems, especially when trying to solve autonomous driving: we have 540 million years of data for visual perception, so we take it for granted.

We don't realize how difficult it is, and we kind of focus all our attention on this recent development of a hundred thousand years of abstract thought, being able to play chess, being able to reason, but visual perception is nevertheless extremely difficult at every single layer of what's required to perceive, interpret, and understand the fundamentals of a scene.

And a trivial way to show that is just all the ways you can mess with these image classification systems by adding a little bit of noise.

In the last few years, there's been a lot of papers, a lot of work to show that you can mess with these systems by adding noise. Here, with 99% accuracy, the system predicts a dog; add a little bit of distortion, and immediately the system predicts with 99% accuracy that it's an ostrich.

And you can do that kind of manipulation with just a single pixel.
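A minimal sketch of why small distortions can flip a prediction, using a made-up linear classifier rather than a real deep network (the weights, the "image", and the step size are all assumptions). Every pixel is nudged by a tiny amount in the direction that lowers the score, in the spirit of gradient-sign attacks, and across many pixels those nudges add up.

```python
def score(weights, pixels):
    """Linear 'dog' score: positive means dog, otherwise ostrich."""
    return sum(w * p for w, p in zip(weights, pixels))

def predict(weights, pixels):
    return "dog" if score(weights, pixels) > 0 else "ostrich"

def perturb(weights, pixels, epsilon):
    """Move each pixel by at most epsilon in the direction that lowers the score."""
    return [p - epsilon * (1 if w > 0 else -1) for w, p in zip(weights, pixels)]

n_pixels = 1000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(n_pixels)]
image = [0.52 if i % 2 == 0 else 0.48 for i in range(n_pixels)]

adversarial = perturb(weights, image, epsilon=0.03)
largest_change = max(abs(a - b) for a, b in zip(adversarial, image))

print(predict(weights, image), "->", predict(weights, adversarial))
print("largest per-pixel change:", largest_change)
```

No single pixel moves by more than about 0.03 on a 0-to-1 scale, yet the predicted class changes.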

So that's just a clean way to show the gap between image classification on an artificial data set like ImageNet and real world perception that has to be solved, especially for life critical situations like autonomous driving.

I really like Max Tegmark's visualization of this rising sea over the landscape of human competence, from Hans Moravec.

And this is the difference, as we progress forward and discuss some of these machine learning methods: there is human intelligence, general human intelligence, let's call it Einstein here, that's able to generalize over all kinds of problems, from the common sense to the incredibly complex.

And then there is the way we've been doing it, especially with data-driven machine learning, which is savants, which is specialized intelligence, extremely smart at a particular task but not able to transfer except in a very narrow neighborhood on this little landscape, with art, cinematography, and book writing at the peaks, and chess, arithmetic, theorem proving, and vision at the bottom in the lake.

And there's this rising sea as we solve problem after problem. The question is, can the methodology and the approach of deep learning, of everything we're doing now, keep the sea rising?

Or do fundamental breakthroughs have to happen in order to generalize and solve these problems?

And so in the specialized cases, where the successes are, the systems essentially boil down to this: given the data set and the ground truth for that data set, here the apartment cost in the Boston area, be able to input several parameters and, based on those parameters, predict the apartment cost.

That's the basic premise, the approach behind the successful supervised deep learning systems today.

If you have good enough data, if there's good enough ground truth and it can be formalized, we can solve it.

Some of the recent promise, deep reinforcement learning, on which we will do an entire series of lectures in the third week, showed that from raw sensory information with very little annotation, through self-play, systems that learn without human supervision are able to perform extremely well in these constrained contexts.

Here, Pong from pixels: being able to perceive the raw pixels of this Pong game as raw input and learn the fundamental, quote-unquote, physics of this game.

Understand how it is this game behaves and how to be able to win this game.

That's kind of a step toward general purpose artificial intelligence.

But it is a very small step because it's in a simulated, very trivial situation.

With less and less human supervision, can we solve huge real-world problems? From the top, supervised learning, where the majority of the teaching is done by human beings through the annotation process, through labeling all the data by showing different examples, and further and further down to semi-supervised learning, reinforcement learning, and unsupervised learning, removing the teacher from the picture and making that teacher extremely efficient when it is needed.

Of course, data augmentation is one way, as we'll talk about: taking a small number of examples and messing with that set of examples, augmenting it through trivial and through complex methods of cropping, stretching, shifting, and so on, including through generative networks, modifying those images to grow a small data set into a large one, to decrease further and further the input of the human teacher.
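A minimal sketch of the trivial end of that spectrum, with images represented as plain nested lists of pixel values. The helper names and the particular set of transforms are assumptions for illustration; real pipelines typically use a library and apply transforms at random.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels=1, fill=0):
    """Slide the image right, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def crop(image, top, left, height, width):
    """Cut out a height-by-width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def augment(image):
    """Grow one labeled example into several that share the same label."""
    flipped = flip_horizontal(image)
    return [image, flipped, shift_right(image), shift_right(flipped)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

for variant in augment(image):
    print(variant)
```

Each variant keeps the label of the original, so one human annotation now covers four training examples.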

But still, that's quite far away from the incredibly efficient, both teaching and learning that humans do.

This is a video, and there's many of them online, of a human baby walking for the first time.

One day you're on all fours, and the next day you put your two hands up and then you figure out the rest.

Well, you can kind of, ish, you can kind of play around with it.

But the point is you're extremely efficient.

From just a few examples, you're able to learn the fundamental aspects of how to solve a particular problem.

Machines in most cases need thousands, millions and sometimes more examples depending on the life critical nature of the application.

The data flow of supervised learning systems is: there's input data, there's a learning system, and there is output.

Now in the training stage for the output, we have the ground truth.

And so we use that ground truth to teach the system.

In the testing stage, when it goes out into the wild, there's new input data over which we have to generalize with the learning system and have to make our best guess.

In the training stage, the process with neural networks is: given the input data for which we have the ground truth, pass it through the model, get the prediction, and given that we have the ground truth, we can compare the prediction to the ground truth, look at the error, and based on the error, adjust the weights.
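That loop can be sketched in a few lines, here for the simplest possible model, a single weight and bias, rather than a neural network. The data, the learning rate, and the number of steps are assumptions chosen so the example converges; the weights are adjusted by plain gradient descent on the mean squared error.

```python
# Input data and its ground truth (generated from y = 2x + 1).
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [1.0, 3.0, 5.0, 7.0, 9.0]

weight, bias = 0.0, 0.0
learning_rate = 0.05

def mean_squared_error(weight, bias):
    return sum((weight * x + bias - y) ** 2
               for x, y in zip(inputs, targets)) / len(inputs)

initial_loss = mean_squared_error(weight, bias)

for step in range(2000):
    # Pass the inputs through the model and compare with the ground truth.
    errors = [weight * x + bias - y for x, y in zip(inputs, targets)]
    # Gradient of the mean squared error with respect to each parameter.
    grad_weight = 2 * sum(e * x for e, x in zip(errors, inputs)) / len(inputs)
    grad_bias = 2 * sum(errors) / len(inputs)
    # Based on the error, adjust the weights.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

final_loss = mean_squared_error(weight, bias)
print(round(weight, 3), round(bias, 3))
```

In a neural network the same loop applies, with backpropagation computing the gradients for every layer's weights.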

The types of predictions we can make are regression and classification.

Regression is continuous and classification is categorical.

Here, if we look at weather, the regression problem asks what the temperature is going to be tomorrow, and the classification formulation of that problem asks whether it is going to be hot or cold, given some threshold definition of what hot or cold is.

On the classification front, it could be multi-class, which is the standard formulation, where a particular entity can only be one thing, and then there's multi-label, where a particular entity can be multiple things.
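One common way to express that difference at the output of a network, sketched here with made-up labels and scores: a softmax turns the scores into a single distribution from which one class is picked, while independent sigmoids let every label be switched on or off by itself. The 0.5 threshold is an assumption.

```python
import math

def softmax(scores):
    """Scores to probabilities that sum to 1 (multi-class: exactly one winner)."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Score to an independent probability in (0, 1) (multi-label)."""
    return 1.0 / (1.0 + math.exp(-score))

labels = ["cat", "dog", "outdoor"]
scores = [2.0, -1.0, 1.5]   # hypothetical raw outputs of a model

probabilities = softmax(scores)
multi_class = labels[probabilities.index(max(probabilities))]
multi_label = [name for name, s in zip(labels, scores) if sigmoid(s) > 0.5]

print(multi_class)   # one label
print(multi_label)   # any number of labels
```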

And overall, the input to the system can be not just a single sample of the particular data set and the output doesn't have to be a particular sample of the ground truth data set.

It can be a sequence, sequence to sequence, a single sample to a sequence, a sequence to a sample, and so on.

From video captioning to translation to natural language generation to, of course, the one-to-one of general computer vision.

Let's step back from the big to the small, to a single neuron, inspired by our own brain, the biological neural networks in our brain, and the computational block that is behind a lot of the intelligence in our mind.

The artificial neuron has inputs with weights on them, plus a bias, an activation function and an output.

As I showed before, here visualized is the thalamocortical system, with 3 million neurons and 476 million synapses.

The full brain has a hundred billion neurons and a thousand trillion synapses.

ResNet and some of the other state-of-the-art networks have tens to hundreds of millions of edges, of synapses.

The human brain has 10 million times more synapses than artificial neural networks, and there are other differences.
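The arithmetic behind that factor, as a quick check. The brain figure is the one quoted in the lecture; the network figure is an assumed order of magnitude, taking 100 million edges as the representative size.

```python
brain_synapses = 1_000 * 10**12   # "a thousand trillion synapses"
network_edges = 100 * 10**6       # assumed: a network with 100 million edges

ratio = brain_synapses // network_edges
print(f"{ratio:,}")  # 10,000,000
```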

The topology of the brain is asynchronous and not constructed in layers.

The learning algorithm for artificial neural networks is backpropagation; for our biological networks, we don't know.

That's one of the mysteries of the human brain.

The power consumption, human brains are much more efficient than neural networks.

That's one of the problems that we're trying to solve, and ASICs are starting to solve some of these problems.

In the biological neural networks, you really never stop learning.

You're always learning, always changing both on the hardware and the software.

In artificial neural networks, oftentimes there's a distinct training stage and a distinct testing stage, when you release the thing in the wild.

Online learning is an exceptionally difficult thing that we're still in the very early stages of.

This neuron takes a few inputs, the fundamental computational block behind neural networks.

Takes a few inputs, applies weights, which are the parameters that are learned, sums them up, puts it into a nonlinear activation function after adding the bias, also a learned parameter, and gives an output.
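That computation is small enough to write out directly. A minimal sketch, assuming a sigmoid as the nonlinear activation function (the lecture does not name one here) and made-up inputs and weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs, plus the bias, through a nonlinearity."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + bias)

output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)  # 0.5, since the weighted sum plus bias is exactly 0
```

The weights and the bias are the parameters that training adjusts; the inputs and the choice of activation function stay fixed.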


Deep Learning Basics: Introduction and Overview