New lecture on recent developments in deep learning that are defining the state of the art in our field (algorithms, applications, and tools). This is not a complete list, but hopefully includes a good sampling of new exciting ideas. For more lecture videos visit our website or follow code tutorials on our GitHub repo.

INFO:
Website: https://deeplearning.mit.edu
GitHub: https://github.com/lexfridman/mit-deep-learning
Slides: http://bit.ly/2HiZyvP
Playlist: http://bit.ly/deep-learning-playlist

OUTLINE:
0:00 - Introduction
2:00 - BERT and Natural Language Processing
14:00 - Tesla Autopilot Hardware v2+: Neural Networks at Scale
16:25 - AdaNet: AutoML with Ensembles
18:32 - AutoAugment: Deep RL Data Augmentation
22:53 - Training Deep Networks with Synthetic Data
24:37 - Segmentation Annotation with Polygon-RNN++
26:39 - DAWNBench: Training Fast and Cheap
29:06 - BigGAN: State of the Art in Image Synthesis
30:14 - Video-to-Video Synthesis
32:12 - Semantic Segmentation
36:03 - AlphaZero & OpenAI Five
43:34 - Deep Learning Frameworks
44:40 - 2019 and beyond

CONNECT:
- If you enjoyed this video, please subscribe to this channel.
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman

The thing I would very much like to talk about today is the state of the art in deep learning.

really at the height of some of the great accomplishments that have happened but also stand at the beginning.

And it's up to us to define where this incredible data-driven technology takes us.

And so I'd like to talk a little bit about the breakthroughs that happened in 2017 and 2018 that take us to this point.

lecture is not on the state-of-the-art results on main machine learning benchmarks.

So the various image classification, object detection or the NLP benchmarks or the GAN benchmarks.

This isn't about the cutting-edge algorithm that's available on GitHub that performs best on a particular benchmark.

Ideas and developments that are at the cutting edge of what defines this exciting field of deep learning.

And so I'd like to go through a bunch of different areas that I think are really exciting.

Now of course this is also not a lecture that's complete.

There's other things that I may be totally missing that happened in 2017 and 18 that are particularly exciting to people here, people beyond.

For example, medical applications of deep learning is something I totally don't touch on.

And protein folding and all kinds of applications that there has been some exciting developments from DeepMind and so on that don't touch on.

So forgive me if your favorite developments are missing but hopefully this encompasses some of the really fundamental things that have happened both on the theory side, on the application side and on the community side of all of us being able to work together on this on these kinds of technologies.

in terms of deep learning is the year of natural language processing.

Many have described this year as the ImageNet moment in

for computer vision when AlexNet was the first neural network that really gave that big jump in performance on computer vision and started to inspire people what's possible with deep learning, with purely learning-based methods.

same way, there's been a series of developments from

and led up to 18 with the development of BERT that

has made on benchmarks and in our ability to apply NLP to solve various NLP tasks, natural language processing tasks, a total leap.

So let's tell the story of what takes us there.

I've mentioned a little bit on Monday about the encoder-decoder recurrent neural networks.

So this idea of recurrent neural networks encode sequences of data and output something.

Output either a single prediction or another sequence.

When the input sequence and the output sequence are not the same, necessarily the same size.

We have to translate from 1 language to another.

The encoder-decoder architecture takes the following process.

It takes in the sequence of words or the sequence of samples as the input and uses the recurring units whether it's LSTM or GRUs or beyond and encodes that sentence into a single vector.

So forms an embedding of that sentence of what it represented, a representation of that sentence.

And then feeds that representation in the decoder recurrent neural network that then generates the the sequence of words that form the sentence in the language that's being translated to.

So first you encode by taking a sequence and mapping it to a fixed size vector representation and then you decode by taking that fixed size vector representation and unrolling it into the sentence that can be of different length than the input sentence Okay, that's the encoder-decoder structure for recurrent neural networks It's been very effective for machine translation and dealing with arbitrary length input sequences, arbitrary length output sequences.

beyond, it's an improvement on the encoder-decoder architecture.

It allows the, it provides a mechanism that allows to look back at the input sequence.

So as opposed to saying you have a sequence that's the input sentence and that all gets collapsed into a single vector representation, you're allowed to look back at the particular samples from the input sequence

can also learn which aspects are important for which aspects of the decoding process.

Which aspects of the input sequence are important to the output sequence.

Visualize another way, and there's a few visualizations here that are quite incredible that are done by Jay Alomar.

I highly recommend you follow the links and look at the further details of these visualizations of attention.

So if we look at neural machine translation, the encoder RNN takes a sequence of words and throughout, after every sequence, forms a set of hidden representations, a hidden state that captures the representation of the words that followed.

And those sets of hidden representations as opposed to being collapsed to a single fixed size vector are then all pushed forward to the decoder that are then used by the decoder to translate but in a selective way where the decoder here visualized on the y-axis the input language and on the x the output language.

The decoder weighs the different parts of the input sequence differently in order to determine how to best translate, generate the word that forms the translation in the full output sentence.

Allowing, expanding the encoded decoder architecture to allow for

selective attention to the input sequence

as opposed to collapsing everything down into a fixed representation.

In the encoding process, allowing the encoder to also selectively look in forming the hidden representations at other parts of the input sequence in order to form those representations.

It allows you to determine for certain words what are the important relevant aspects of the input sequence that can help you encode that word the best.

So it improves the encoder process by allowing it to look at the entirety of the context that's self-attention Building on that, transformer it's using the self-attention mechanism in the encoder to form these sets of representations on the input sequence.

And then as part of the decoding process, follow the same but in reverse with a bunch of self-attention that's able to look back again.

So it's self-attention on the encoder, attention on the decoder and that's where the magic that's where the entirety of the magic is that's able to capture the rich context of the input sequence in order to generate in a contextual way the output sequence.

So let's take a step back then and look at what is critical to natural language in order to be able to reason about words, construct a language model and be able to reason about the words in order to classify a sentence, or translate a sentence, or compare 2 sentences and so on.

The sentences are collections of words, or characters, and those characters and words have to have an efficient representation that's meaningful for that kind of understanding and that's what the process of embedding is.

We talked a little bit about it on Monday and so the traditional Word2Vec process of embedding is you use some kind of trick in an unsupervised way to map words into a compressed representation So language modeling is the process of determining which words follow each other usually So 1 way you can use in a skip-gram model taking huge data sets of words you know there's writing all over the place taking those data sets and feeding a neural network that in a supervised way looks at which words are usually follow the input So the input is a word the output is which word are statistically likely to follow that word and the same with the preceding word And doing this kind of unsupervised learning which is what Word2Vec does if you throw away the output and the input and just take in the hidden representation form in the middle that's how you form this compressed embedding a meaningful representation that when 2 words are related in a language modeling sense 2 words are related they're going to be in that representation close to

each other and when they're totally unrelated have nothing to do with each other

ELMo is the approach of using bi-directional LSTMs to learn that representation.

And what bi-directional, bi-directionally.

So looking not just the sequence to lead up to the word but in both directions.

The sequence that followed the sequence that before.

And that allows you to learn the rich full context of the word.

In learning the rich full context of the word.

In learning the rich full context of the word you're forming representations that are much better able to represent the statistical language model behind the kind of corpus of language that you're looking at.

And this has taken a big leap in ability to then further algorithms that then with the language model a reasoning about doing things like sentence classification, sentence comparison, so on, translation, that representation is much more effective for working with language.

The idea of the OpenAI transformer is the next step forward is taking the same transformer that I mentioned previously, the encoder with self-attention, decoder with attention looking back at the input sequence, and using it, taking the language learned by the decoder and using that as a language model and then chopping off layers and training on a specific language task like sentence classification.

Now BERT is the thing that did the big leap in performance.

With the transformer formulation, there's always, there's no bi-directional element.

So the encoding step and the decoding step.

It takes in the full sequence of the sentence and masks out some percentage of the words, 15% of the words, 15% of the samples, the tokens from the sequence.

And tasks the entire encoding self-attention mechanism to predict the words that are missing.

That construct and then you stack a ton of them together.

A ton of those encoders self-attention feedforward network, self-attention feedforward network together.

And that allows you to learn the rich context of the language to then at the end perform all kinds of tasks.

You can create first of all like ELMo and like Word2Vec create rich contextual embeddings.

Take a set of words and represent them in the space that's very efficient to reason with.

You can do language classification, you can do sentence pair classification, you could do the similarity of 2 sentences, multiple choice question answering, general question answering, tagging of sentences.

Okay I lingered on that 1 a little bit too long, but it is also the 1 I'm really excited about and really if there's a breakthrough this year has been it's thanks to BERT.

The other thing I'm very excited about is totally jumping away from the NeurIPS, the theory, those kind of academic developments in deep learning and into the world of applied deep learning.

So Tesla has a system called Autopilot where the hardware version 2 of that system is a implementation of the NVIDIA Drive PX2 system which runs a ton of neural networks.

There's 8 cameras on the car and a variant of the Inception network is now taking in all 8 cameras at different resolutions as input and performing various tasks like drivable area segmentation, like object detection and some basic localization tasks.

So you have now a huge fleet of vehicles where it's not engineers some I'm sure are engineers but it's really regular consumers people that have purchased the car have no understanding in many cases of what a neural network's limitations and capabilities are, so on.

Now it has, a neural network is controlling the well-being, it's decisions, it's perceptions, and the control decision based on those perceptions are controlling the life of a human being.

And that to me is 1 of the great sort of breakthroughs of 17 and

In terms of the development of what AI can do in a practical sense in impacting the world.

And so 1 billion miles, over 1 billion miles have been driven in autopilot.

Now there's 2 types of systems currently operating in Tesla's.

There's hardware version 1, hardware version 2.

Hardware version 1 was Intel Mobileye Monocular Camera Perception System.

As far as we know that was not using a neural network and it was a fixed system that wasn't learning at least online learning in the Teslas.

The other is hardware version 2 and it's about half and half now in terms of the miles driven.

The hardware version 2 has a neural network that's always learning, there's weekly updates, it's always improving the model, shipping new weights and so on.

That's the exciting set of breakthroughs.

In terms of AutoML, the dream of automating some aspects or all aspects or as many aspects as possible of the machine learning process where you can just drop in a data set that you're working on and the system will automatically determine all the parameters from the details of the architectures, the size of the architecture, the different modules in that architecture, the hyper parameters used for training the architecture, running that, doing inference, everything.

All is done for you, all you just feed it is data.

So that's been the success of the neural architecture search in 16 and 17 and there's been a few ideas with Google AutoML that's really trying to almost create an API where you just drop in your data set and it's using reinforcement learning and recurrent neural networks to, given a few modules, stitch them together in such a way where the objective function is optimizing the performance of the overall system.

And they showed a lot of exciting results.

Google showed and others that outperform state-of-the-art systems both in terms of efficiency and in terms of accuracy.

Now in 18 there have been a few improvements on this direction and 1 of them is Adonet where it's now using the same reinforcement learning AutoML formulation to build ensembles on neural networks So in many cases state-of-the-art performance can be achieved by as opposed to taking a single architecture is building up a multitude, an ensemble, a collection of architectures.

And that's what is doing here is given candidate architectures, stitching them together to form an ensemble to get state-of-the-art performance.

Now that state-of-the-art performance is not a leap, a breakthrough leap forward, but it's nevertheless a step forward and it's a very exciting field that's going

There's an area of machine learning that's heavily understudied and I think it's extremely exciting area.

with AlexNet achieving the breakthrough performance of showing that what deep learning networks are capable of.

From that point on, from 2012 to today, there's been nonstop, extremely active developments of different architectures that, even on just ImageNet alone, on doing the image classification task, have improved performance over and over and over with totally new ideas.

Now on the other side, on the data side there's been very few ideas about how to do data augmentation.

So data augmentation is the process of, you know, it's what kids always do when you learn about an object, right?

Is you look at an object and you kind of like twist it around, is taking the raw data and messing it with such a way that it can give you much richer representation of what this data can look like in other forms, in other contexts in the real world.

There's been very few developments, I think still.

And this auto-augment is just a step, a tiny step into that direction that I hope that we as a community invest a lot of effort in.

So what Auto Augment does is it says okay so there's these data augmentation methods like translating the image, sharing the image, doing color manipulation like color inversion.

Let's take those as basic actions you can take and then use reinforcement learning and an RNN again construct to stitch those actions together in such a way that can augment data like on ImageNet to when you train on that data it gets state-of-the-art performance.

So mess with the data in a way that optimizes the way you mess with the data.

So, and then they've also showed that given that the set of data augmentation policies that are learned to optimize for example for ImageNet, given some kind of architecture, you can take that learned set of policies for data augmentation and apply it to a totally different data set So there's the process of transfer learning So what is transfer learning?

So we talked about transfer learning you have a neural network that learns to do cat versus dog, or no, learns to do a thousand class classification problem on ImageNet and then you transfer, you chop off a few layers and you transfer on the task of your own data set of cat versus dog.

What you're transferring is the weights that are learned on the ImageNet classification task and now you're then fine-tuning those weights on the specific

personal cat versus dog dataset you have.

transfer as part of the transfer learning process take the data augmentation policies learned on ImageNet and transfer those.

You can transfer both the weights and the policies.

That's a really super exciting idea, I think.

It wasn't quite demonstrated extremely well here in terms of performance.

So it got an improvement in performance and so on but it kind of inspired an idea that's something that we need to really think about.

How to augment data in an interesting way such that given just a few samples of data we can generate huge data sets in a way that you can then form meaningful, complex, rich representations from.

I think that's really exciting and 1 of the ways that you break open the problem of how do we learn a lot from a little.

Training deep neural networks with synthetic data, this also really an exciting topic that a few groups, but especially NVIDIA has invested a lot in and here's a from a CVPR 2018 probably my favorite work on this topic is they really went crazy and said okay let's mess with synthetic data in every way we could possibly can.

So on the left there is showing a set of backgrounds, then there's also a set of artificial objects, and you have a car or some kind of object that you're trying to classify.

So let's take that car and mess with it with every way possible.

Apply lighting variation to it with every way possible.

So what NVIDIA is really good at is creating realistic scenes.

And they said, okay, let's create realistic scenes but let's also go way above board and not do realistic at all.

Do things that can't possibly happen in reality and so generally these huge data sets I want you to train and again achieve quite interesting, quite

good performance on image classification.

to ImageNet and so on these kinds of tasks.

You're not going to outperform networks that were trained on ImageNet but they show that with just a small sample from from those real images they can fine tune this network train on synthetic images totally fake images to achieve state of

Again, another way to generate, to get to learn a lot from very little by generating fake worlds synthetically.

The process of annotation, which for supervised learning, is what you need to do in order to train the network.

You need to be able to provide ground truth, you need to be able to provide ground truth, you need to be able to label whatever the entity that is being learned.

And so for image classification that's saying what is going on in the image.

And part of that was done on ImageNet by doing a Google search for creating candidates.

Then there is the object detection task of detecting the bounty box.

And so saying, drawing the actual bounty box is a little bit more difficult but it's a couple of clicks and so on.

Then if we take the final, the probably 1 of the higher complexity tasks of perception of image understanding is segmentation.

Is actually drawing either pixel level or polygons the outline of a particular object.

Now if you have to annotate that, that's extremely costly.

So the work with polygon RNN is to use recurring neural networks to make suggestions for polygons.

There's a few tricks to form these high resolution polygons.

So the idea is it drops in a single point.

You draw a bounding box around an object.

You use convolutional neural networks to drop the first point and then use recurrent neural networks to draw around it.

There's a few tricks and this tool is available online.

Again, the dream with AutoML is to remove the human from the picture as much as possible.

With data augmentation remove the human from the picture as much as possible for menial data.

Automate the boring stuff and in this case the act of drawing a polygon tried to automate it as much as possible.

The interesting other dimension along which deep learning is recently been trying to be optimized is how do we make deep learning accessible?

So the Dawn bench from Stanford, the benchmark, the Dawn bench benchmark from Stanford, asked, formulated an interesting competition which got a lot of attention and a lot of progress.

want to achieve 93% accuracy on ImageNet and 94% on CIFAR-10 let's now compete, that's like the requirement.

That's not compete how you can do it in the least amount of time and for the least amount of dollars.

Do the training in the least amount of time and the training in the least amount of dollars.

Like literally dollars you're allowed to spend to do this.

And Fast AI, you know, it's a renegade group, awesome renegade group of deep learning researchers, have been able to train on ImageNet in 3 hours.

So this is for training process for 25 bucks.

So the key idea that they were playing with is quite simple but really boils down to messing with the learning rate throughout the process of training.

So the learning rate is how much you based on the loss function, based on the error the neural network observes how much do you adjust the weights?

So they found that if they crank up the learning rate while decreasing the momentum which is a parameter of the optimization process where they do it that jointly they're able to make the network learn really fast.

That's really exciting and the benchmark itself is also really exciting because that's exactly for people sitting in this room that opens up the door to doing all kinds of fundamental deep learning problems without the resources of Google DeepMind or OpenAI or Facebook or so on, without computational resources.

That's important for academia, that's important for independent researchers and so on.

There's been a lot of work on generative adversarial neural networks and in some ways there has not been breakthrough ideas in GANs for quite a bit.

And I think big GAN from Google DeepMind ability to generate Incredibly high resolution images And it's the same GAN technique So in terms of breakthroughs innovations But scaled So

increase the model capacity And increase the batch size the number of images that are fed

I encourage you to go online and look at them.

It's hard to believe that they're generated.

So that was, so 2018 for GANs was a year of scaling and parameter tuning as opposed to break through new ideas Video to video synthesis This work is from NVIDIA is looking at the problem so there's been a lot of work on going from image to image.

So from a particular image generating another image.

So whether it's colorizing an image or just traditionally defined GANs.

The idea with video-to-video synthesis that a few people have been working on but Nivea took a good step forward is to make the video, to make the temporal consistency, the temporal dynamics part of the optimization process.

So if you look here at the comparison for this particular, So the input is the labels in the top left and the output of the NVIDIA approach is on the bottom right See it's very temporally consistent.

If you look at the image to image mapping that's state of the art, pix2pix HD it's very jumpy, it's not temporally consistent at all.

And there's some naive approaches for trying to maintain temporal consistency that's in the bottom left.

So you can apply this to all kinds of tasks, all kinds of video to video mapping.

Here is mapping it to face edges, edge detection on faces, mapping it to faces, generating faces from just edges.

You can look at body pose to actual images.

As input to the network you can take the pose of the person and generate the video of the person.

The problem of perception sort of began with AlexNet and ImageNet has been further and further developments where the input, the problem is of basic image classification where the input is an image and the output is a classification what's going on in that image.

And the fundamental architecture can be reused for more complex tasks like detection, like segmentation and so on.

Interpreting what's going on in the image.

So these large networks from VGG net, Google net, ResNet, SC net, DenseNet, all these networks are forming rich representation that can then be used for all kinds of tasks, whether that task is object detection.

This here shown is the region-based methods where the neural network is tasked, the convolutional layers make region proposals, so a bunch of candidates to be considered and then there's a step that's determining what's in those different regions and forming bounding boxes around them

in a for loop way and then there is the 1 shot method, single shot method where in a single pass all of the bounding boxes in their classes are generated and there has been a tremendous amount of work in the space of object detection.

Some are single-shot methods, some are region-based methods and there's been a lot of exciting work but not, I would say, breakthrough ideas.

And then we take it to the highest level of perception which is semantic segmentation.

The state-of-the-art performance is, at least for the open source systems, is DeepLab v3 plus

So semantic segmentation to catch it all up started in 2014 with fully convolutional neural networks chopping off the fully connected layers and then outputting the heat map very grainy, very

Then improving that with SegNet, performing max pooling with a breakthrough idea that's reused in a lot of cases is dilated convolution, 8-chars convolutions.

Having some spacing which increases the field of view of the convolutional filter.

The key idea behind DeepLab V3 that is the state of the art is the multi-scale processing without increasing the parameters.

The multi-scale is achieved by quote-unquote, the atrius rate.

So taking those atrius convolutions and increasing the spacing.

And you can think of the increasing that spacing

by enlarging the model's field of view and so you can consider all these different scales of processing and looking at the layers of features

so allowing you to be able to grasp the greater context as part of the upsampling deconvolutional step.

And that's what's producing the state of our performances and that's where we have the

notebook tutorial on GitHub showing this deep lab architecture trained on Cityscapes.

So Cityscapes is a driving segmentation data set that is 1 of the most commonly used for the task of driving scene segmentation.

Okay, on the deep reinforcement learning front.

but I think the excitement really settled in 2018.

See all Lex Fridman transcripts on Youtube

Deep Learning State of the Art (2019) - MIT