Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)

1 hour 25 minutes 16 seconds

🇬🇧 English

S1

Speaker 1

00:00

So thank you very much for the introduction. So today I'll speak about deep learning, especially in the context of computer vision.

S2

Speaker 2

00:07

So what you saw in the previous talk is neural networks. So you saw that neural networks are organized into these layers, fully connected layers, where neurons within one layer are not connected to each other, but are fully connected to all the neurons in the previous layer. And we saw that basically we have this layer-wise structure from input until output.

S2

Speaker 2

00:25

And there are neurons and nonlinearities, et cetera. Now, so far we have not made too many assumptions about the inputs. So in particular, here we just assume that an input is some kind of a vector of numbers that we plug into this neural network. So that's both a bug and a feature to some extent.

S2

Speaker 2

00:41

Because in most real world applications, we actually can make some assumptions about the input that makes learning much more efficient. So in particular, usually we don't just want to plug into neural networks vectors of numbers, but they actually have some kind of a structure. So we don't have vectors of numbers, but these numbers are arranged in some kind of a layout, like an n-dimensional array of numbers. So for example, spectrograms are two-dimensional arrays of numbers.

S2

Speaker 2

01:09

Images are three-dimensional arrays of numbers. Videos would be four-dimensional arrays of numbers. Text you could treat as one-dimensional array of numbers. And so whenever you have this kind of local connectivity structure in your data, then you'd like to take advantage of it.

S2

Speaker 2

01:22

And convolutional neural networks allow you to do that. So before I dive into convolutional neural networks and all the details of the architectures, I'd like to briefly talk about a bit of the history of how this field evolved over time. So I like to start off usually by talking about Hubel and Wiesel and the experiments that they performed in the 1960s. What they were doing was trying to study the computations that happened in the early visual cortex areas of a cat.

S2

Speaker 2

01:48

And so they had cats, and they plugged in electrodes that could record from the different neurons. And then they showed the cat different patterns of light. And they were trying to debug neurons effectively and try to show them different patterns and see what they responded to. And a lot of these experiments inspired some of the modeling that came in afterwards.

S2

Speaker 2

02:06

So in particular, one of the early models that tried to take advantage of some of the results of these experiments was the neocognitron from Fukushima in the 1980s. And so what you saw here was an architecture that again is layer-wise, similar to what you see in the cortex, where we have these simple and complex cells, where the simple cells detect small things in the visual field, and then you have this local connectivity pattern, and the simple and complex cells alternate in this layered architecture throughout. And so this looks a bit like a convnet because it has some of its features, like, say, the local connectivity. But at the time, this was not trained with backpropagation.

S2

Speaker 2

02:44

These were specific, heuristically chosen updates, and this was unsupervised learning back then. So the first time that we actually used backpropagation to train some of these networks was in the experiments of Yann LeCun in the 1990s. And so this is an example of one of the networks developed back then by Yann LeCun, known as LeNet-5.

S2

Speaker 2

03:05

And this is what you would recognize today as a convolutional neural network. It has a lot of these convolutional layers alternating, a similar kind of design to what you would see in Fukushima's neocognitron, but this was actually trained with backpropagation end to end using supervised learning. Now, this happened in roughly the 1990s, and we're here in 2016, basically about 20 years later. Now, computer vision has for a long time kind of worked on larger images.

S2

Speaker 2

03:38

And a lot of these models back then were applied to very small kinds of settings, like, say, recognizing digits in zip codes and things like that, and they were very successful in those domains. But at least when I entered computer vision, roughly in 2011, a lot of people were aware of these models, but it was thought that they would not scale up naively to large, complex images, and that they would be constrained to these toy tasks for a long time.

S2

Speaker 2

04:02

Or I shouldn't say toy, because these were very important tasks, but certainly like smaller visual recognition problems. And so in computer vision in roughly 2011, it was much more common to use these feature-based approaches at the time. And they didn't work actually that well. So when I entered my PhD in 2011 working on computer vision, you would run a state-of-the-art object detector on this image.

S2

Speaker 2

04:23

And you might get something like this, where cars were detected in trees. And you would kind of just shrug your shoulders and say, well, that just happens sometimes. You kind of just accept it as something that would just happen. And of course, this is a caricature.

S2

Speaker 2

04:37

Things actually worked relatively decently, I should say. But definitely, there were many mistakes that you would not see today in 2016, about five years later. And so a lot of computer vision kind of looked much more like this. When you looked into a paper that tried to do image classification, you would find this section in the paper on the features that they used.

S2

Speaker 2

04:56

So this is one page of features. They would use GIST, HOG, et cetera, and then a second page of features and all their hyperparameters, so all kinds of different histograms, and you would extract this kitchen sink of features, and a third page here. And so you end up with this very large, complex code base, because some of these feature types are implemented in MATLAB, some of them in Python, some of them in C++.

S2

Speaker 2

05:19

And you end up with this large code base of extracting all these features, caching them, and then eventually plugging them into linear classifiers to do some kind of visual recognition task. So it was quite unwieldy. It worked to some extent, but there was definitely room for improvement. So a lot of this changed in computer vision in 2012 with this paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton.

S2

Speaker 2

05:41

So this is the first time that someone took a convolutional neural network very similar to the one that you saw from Yann LeCun in 1998. And I'll go into details of how they differ exactly. But they took that kind of network, they scaled it up, they made it much bigger, and they trained it on a much bigger data set on GPUs. And things basically ended up working extremely well.

S2

Speaker 2

06:00

And this is the first time that the computer vision community really noticed these models and adopted them to work on larger images. And we saw that the performance of these models improved drastically.

S2

Speaker 2

06:15

Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years. And we're looking at the top-5 error, so lower is good. And you can see that in 2010, in the beginning, these were feature-based methods. And then in 2012, we had this huge jump in performance.

S2

Speaker 2

06:29

And that was due to the first convolutional neural network in 2012. And then we've managed to push that down over time, and now we're down to about 3.57%. I think the results for the ImageNet challenge 2016 are actually due to come out today, but I don't think they've come out yet. I have this second tab here opened. I was waiting for the result, but I don't think this is up yet.

S2

Speaker 2

06:54

OK, no, nothing. All right, well, we'll get to find out very soon what happens right here. So I'm very excited to see that. Just to put this in context, by the way, because you're just looking at numbers, like 3.57, how good is that?

S2

Speaker 2

07:07

That's actually really, really good. So something that I did about two years ago now is that I tried to measure the human accuracy on this data set. And so what I did for that is I developed this web interface where I would show myself ImageNet images from the test set, and then I had this interface here where I would have all the different classes of ImageNet, there are 1,000 of them, and some example images, and then basically you go down this list and you scroll for a long time, and you find what class you think that image might be. And then I competed against the ConvNet at the time.

S2

Speaker 2

07:39

And this was GoogLeNet in 2014. And so hot dog is a very simple class. You can do that quite easily. But why is the error not 0%?

S2

Speaker 2

07:50

Well, some of the things, like hot dog, seem very easy. Why isn't it trivial for humans? Well, it turns out that some of the images in the test set of ImageNet are actually mislabeled. But also, some of the images are just very difficult to guess.

S2

Speaker 2

08:02

So in particular, if you have this terrier, there's 50 different types of terriers. And it turns out to be a very difficult task to find exactly which type of terrier that is. You can spend minutes trying to find it. Turns out that convolutional neural networks are actually extremely good at this.

S2

Speaker 2

08:15

And so this is where I would lose points compared to the ConvNet. So I estimate that human error based on this is roughly in the 2% to 5% range, depending on how much time you have, how much expertise you have, how many people you involve, and how much they really want to do this, which is not too much. And so really, we're doing extremely well. And so we're down to 3%.

S2

Speaker 2

08:36

And I think the error rate, if I remember correctly, was about 1.5%. So if we get below 1.5%, I would be extremely suspicious on ImageNet. That seems wrong. So to summarize, basically, what we've done is before 2012, computer vision looked somewhat like this, where we had these feature extractors, and then we trained a small portion at the end of the feature extraction step.

S2

Speaker 2

08:59

And so we only trained this last piece on top of these features that were fixed. And we've basically replaced the feature extraction step with a single convolutional neural network. And now we train everything completely end-to-end. And this turns out to work quite nicely.

S2

Speaker 2

09:11

So I'm going to go into details of how this works in a bit. Also, in terms of code complexity, we went from a setup that looked something like that in papers to something where, instead of extracting all these things, we just say: apply 20 layers of 3 by 3 conv, or something like that, and things work quite well. This is, of course, an over-exaggeration, but I think it's a correct first-order statement that we've definitely reduced code complexity quite a lot, because these architectures are so homogeneous compared to what we had before.

S2

Speaker 2

09:45

So it's also remarkable that we had this reduction in complexity and this amazing performance on ImageNet. One other thing that was quite amazing about the results in 2012, and which did not have to be the case, is that the features that you learn by training on ImageNet turn out to be quite generic, and you can apply them in different settings. So in other words, this transfer learning works extremely well.

S2

Speaker 2

10:08

And of course, I didn't go into details of convolutional networks yet, but we start with an image, and we have a sequence of layers, just like in a normal neural network. And at the end, we have a classifier. And when you pre-train this network on ImageNet, then it turns out that the features that you learn in the middle are actually transferable, and you can use them on different data sets, and that this works extremely well. And so that didn't have to be the case.

S2

Speaker 2

10:28

You might imagine that you could have a convolutional network that works extremely well on ImageNet, but when you try to run it on something else, like BIRDS dataset or something, that it might just not work well. But that is not the case, and that's a very interesting finding in my opinion. So people noticed this back in roughly 2013, after the first convolutional networks. They noticed that you can actually take many computer vision data sets.

S2

Speaker 2

10:48

And it used to be that you would compete on all of these kind of separately and design features maybe for some of these separately. And you can just shortcut all those steps that we had designed. And you can just take these pre-trained features that you get from ImageNet, and you can just train a linear classifier on every single data set on top of those features, and you obtain many state-of-the-art results across many different data sets. And so this was quite a remarkable finding back then, I believe.

S2

Speaker 2

11:12

So things worked very well on ImageNet. Things transferred very well. And the code complexity, of course, got much more manageable. So now all this power is actually available to you with very few lines of code.

S2

Speaker 2

11:23

If you want to just use a convolutional network on images, it turns out to be only a few lines of code. If you use, for example, Keras, which is one of the deep learning libraries that I'll mention again later in the talk, you basically just load a state-of-the-art convolutional neural network, take an image, load it, and compute your predictions.

S2

Speaker 2

11:41

And it tells you that this is an African elephant inside that image. And this takes a couple hundred milliseconds, or a couple tens of milliseconds if you have a GPU. And so everything got much faster and much simpler; it works really well and transfers really well. So this was really a huge advance in computer vision.
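To give a sense of the "few lines of code" point, here is a minimal sketch using the Keras applications API. The choice of VGG16, the import paths, and the file name elephant.jpg are assumptions for illustration; this is not the exact snippet shown in the talk.

```python
# Minimal sketch: classify one image with a pretrained ImageNet convnet in Keras.
# Assumes TensorFlow/Keras is installed and an image file "elephant.jpg" exists.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")            # load a pretrained convnet

img = image.load_img("elephant.jpg", target_size=(224, 224))
x = image.img_to_array(img)                  # (224, 224, 3) array of raw pixels
x = preprocess_input(x[np.newaxis, ...])     # add a batch dimension and normalize

preds = model.predict(x)                     # 1,000 class probabilities
print(decode_predictions(preds, top=3)[0])   # top guesses, e.g. "African_elephant"
```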

S2

Speaker 2

11:55

And so as a result of all these nice properties, ComNets today are everywhere. So here's a collection of some of the things that I try to find across different applications. So for example, you can search Google Photos for different types of categories, like in this case Rubik's Cube. You can find house numbers very efficiently.

S2

Speaker 2

12:16

Of course, this is very relevant in self-driving cars, where we're doing perception in the cars; convolutional networks are very relevant there. Medical image diagnosis, recognizing Chinese characters, doing all kinds of medical segmentation tasks. Quite random tasks like whale recognition, and more generally, many Kaggle challenges.

S2

Speaker 2

12:34

Satellite image analysis, recognizing different types of galaxies. You may have seen recently WaveNet from DeepMind, also a very interesting paper, where they generate music and they generate speech. So this is a generative model, and it's also just a convnet doing most of the heavy lifting here. It's a convolutional network on top of sound.

S2

Speaker 2

12:53

And other tasks like image captioning. In the context of reinforcement learning and agent-environment interactions, we've also seen a lot of advances using ConvNets as the core computational building block. So when you want to play Atari games, or AlphaGo, or Doom, or StarCraft, or if you want to get robots to perform interesting manipulation tasks, all of this uses ConvNets as a core computational block to do very impressive things. Not only are we using them for a lot of different applications, we are also finding uses in art.

S2

Speaker 2

13:26

So here are some examples from Deep Dream, where you can basically simulate what it looks like, or maybe what it feels like, to be on some drugs; you can take images and just hallucinate features using convnets. Or you might be familiar with neural style, which allows you to take arbitrary images and transfer arbitrary styles of different paintings, like Van Gogh, on top of them, and this is all using convolutional networks. The last thing I'd like to note, which I also find interesting, is that in the process of trying to develop better computer vision architectures and trying to basically optimize for performance on the ImageNet challenge, we've actually ended up converging to something that potentially might function somewhat like your visual cortex in some ways. And so these are some of the experiments that I find interesting, where they've studied macaque monkeys and recorded from a subpopulation of the IT cortex.

S2

Speaker 2

14:13

This is the part that does a lot of object recognition. And so they record. So basically, they take a monkey and they take a convnet, and they show them images. And then you look at how those images are represented at the end of this network.

S2

Speaker 2

14:24

So inside the monkey's brain, or at the top of your convolutional network. And so you look at representations of different images, and it turns out that there's a mapping between those two spaces that actually seems to indicate, to some extent, that some of the things we're doing somehow ended up converging to something that the brain could be doing as well in the visual cortex. So that's just some intro. I'm now going to dive into convolutional networks and try to explain briefly how these networks work.

S2

Speaker 2

14:49

Of course, there's an entire class on this that I taught, which is a convolutional networks class. And so I'm going to distill some of those 13 lectures into 1 lecture. So we'll see how that goes. I won't cover everything, of course.

S2

Speaker 2

15:01

OK. So a convolutional neural network is really just a single function: a function from the raw pixels of some kind of an image. So we take a 224 by 224 by 3 image.

S2

Speaker 2

15:12

So 3 here is for the color channels, RGB. You take the raw pixels, you put them through this function, and you get 1,000 numbers at the end, in the case of image classification, if you're trying to categorize images into 1,000 different classes. And really, functionally, all that's happening in a convolutional network is just dot products and max operations.

S2

Speaker 2

15:30

That's everything. But they're wired up together in interesting ways so that you are basically doing visual recognition. And in particular, this function f has a lot of knobs in it. So these W's here that participate in these dot products and in these convolutions and fully connected layers and so on, these W's are all parameters of this network.

S2

Speaker 2

15:48

So normally, you might have about on the order of 10 million parameters. And those are basically knobs that change this function. And so we'd like to change those knobs, of course, so that when you put images through that function, you get probabilities that are consistent with your training data. And so that gives us a lot to tune.

S2

Speaker 2

16:05

And it turns out that we can do that tuning automatically with backpropagation through that search process. Now, more concretely, a convolutional neural network is made up of a sequence of layers, just as in the case of normal neural networks. But we have different types of layers that we play with. So we have convolutional layers.

S2

Speaker 2

16:21

Here, I'm using the rectified linear unit, ReLU for short, as a non-linearity, so I'm making that explicitly its own layer. Then pooling layers and fully connected layers. The core building block of the convolutional network is the convolutional layer, and we have nonlinearities interspersed.

S2

Speaker 2

16:42

We are getting rid of the pooling layers, and fully connected layers can be represented as, and are basically equivalent to, convolutional layers as well. And so really, it is just a sequence of conv layers in the simplest case. So let me explain the convolutional layer, because that is the core computational building block here that does all the heavy lifting. So, the entire convnet is this collection of layers, and these layers don't function over vectors.

S2

Speaker 2

17:05

So they don't transform vectors as in a normal neural network; they function over volumes. So a layer will take a volume, a three-dimensional volume of numbers, an array. In this case, for example, we have a 32 by 32 by 3 image, so those three dimensions are the width, the height, and I'll refer to the third dimension as the depth. We have 3 channels. That's not to be confused with the depth of a network, which is the number of layers in that network.

S2

Speaker 2

17:26

So this is just the depth of a volume. So this convolutional layer accepts a three-dimensional volume, and it produces a three-dimensional volume using some weights. The way it actually produces this output volume is as follows: we are going to have these filters in a convolutional layer. These filters are always small spatially, say, for example, a 5 by 5 filter, but their depth always extends through the full depth of the input volume.

S2

Speaker 2

17:51

So since the input volume has 3 channels, the depth is 3, then our filters will always match that number. So we have depth of 3 in our filters as well. And then we can take those filters, and we can basically convolve them with the input volume. So what that amounts to is we take this filter.

S2

Speaker 2

18:09

Oh, yeah. So that's just the point that the channels here must match. We take that filter, and we slide it through all spatial positions of the input volume. And along the way, as we're sliding this filter, we're computing dot products.

S2

Speaker 2

18:19

So W transpose X plus B, where W are the filters, and X is a small piece of the input volume, and B is the offset. And so this is basically the convolutional operation, you're taking this filter and sliding it through at all spatial positions, and you are computing dot products. So when you do this, you end up with this activation map. So in this case, we get a 28 by 28 activation map.

S2

Speaker 2

18:41

28 comes from the fact that there are 28 unique positions to place this 5 by 5 filter into this 32 by 32 space. So there are 28 by 28 unique positions you can place that filter in, and in every one of those you are going to get a single number for how well that filter likes that part of the input. So that carves out a single activation map. And in a convolutional layer, we don't have a single filter but a set of filters. So here is a green filter; we slide it through the input volume; it has its own parameters. There are 75 numbers that make up a filter, and these are a different 75 numbers; we convolve them through, get a new activation map, and we continue doing this for all the filters in that convolutional layer.
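To make the sliding dot product concrete, here is a naive NumPy sketch of this convolution with the shapes from the running example (stride 1, no padding). The random arrays just stand in for a real image and learned filters, and real libraries implement this far more efficiently.

```python
import numpy as np

# Running example from the talk: a 32x32x3 input volume and six 5x5x3 filters.
x = np.random.randn(32, 32, 3)           # input volume (height, width, depth)
filters = np.random.randn(6, 5, 5, 3)    # 6 filters, each 5x5 spatially, depth 3
biases = np.zeros(6)

out = np.zeros((28, 28, 6))              # 28 = 32 - 5 + 1 unique positions per axis
for k in range(6):                       # one activation map per filter
    for i in range(28):
        for j in range(28):
            patch = x[i:i+5, j:j+5, :]   # small piece of the input volume
            out[i, j, k] = np.sum(patch * filters[k]) + biases[k]  # w^T x + b

print(out.shape)  # (28, 28, 6): the re-represented "image" with 6 channels
```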

S2

Speaker 2

19:28

So, for example, if we had 6 filters in this convolutional layer, we might end up with six 28 by 28 activation maps, and we stack them along the depth dimension to arrive at the output volume of 28 by 28 by 6. And so really what we've done is we've re-represented the original image, which is 32 by 32 by 3, into a kind of new image that is 28 by 28 by 6, where this image has 6 channels that tell you how well every filter matches or likes every part of the input image.

So let's compare this operation to, say, using a fully connected layer as you would in a normal neural network. In particular, we saw that we processed a 32 by 32 by 3 volume into a 28 by 28 by 6 volume. And one question you might want to ask is: how many parameters would this require if we wanted a fully connected layer with the same number of output neurons here, that is, 28 times 28 times 6 neurons, fully connected? How many parameters would that be? Turns out that would be quite a few parameters, right?

S2

Speaker 2

20:28

Because every single neuron in the output volume would be fully connected to all of the 32 by 32 by 3 numbers here. So basically, every 1 of those 28 by 28 by 6 neurons is connected to 32 by 32 by 3. Turns out to be about 15 million parameters, and also on that order of number of multiplies. So you're doing a lot of compute, and you're introducing a huge amount of parameters into your network.

S2

Speaker 2

20:50

Now, since we're doing convolution instead, think about the number of parameters that we've introduced with this example convolutional layer. We had 6 filters, and every one of them was a 5 by 5 by 3 filter. So basically, we just have 5 by 5 by 3 filters, and we have 6 of them.

S2

Speaker 2

21:09

If you just multiply that out, we have 450 parameters. And in this, I'm not counting the biases. I'm just counting the raw weights. So compared to 15 million, we've only introduced very few parameters.

S2

Speaker 2

21:19

Also, how many multiplies have we done? So computationally, how many flops are we doing? Well, we have 28 by 28 by 6 outputs to produce, and every 1 of these numbers is a function of a 5 by 5 by 3 region in the original image. So basically, we have 28 by 28 by 6.

S2

Speaker 2

21:35

And then every one of them is computed by doing 5 times 5 times 3 multiplies. So you end up with only on the order of 350,000 multiplies. So we've gone from about 15 million down to a few hundred thousand. So we're doing fewer flops, and we're using fewer parameters.
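For reference, here is the arithmetic behind the numbers quoted above (biases ignored, as in the talk):

```python
# Arithmetic behind the numbers above (biases ignored, as in the talk).
fc_params = (28 * 28 * 6) * (32 * 32 * 3)   # every output neuron sees every input number
conv_params = 6 * (5 * 5 * 3)               # six 5x5x3 filters, shared across all positions
conv_mults = (28 * 28 * 6) * (5 * 5 * 3)    # one 5x5x3 dot product per output number
print(fc_params, conv_params, conv_mults)   # 14450688 (~15M), 450, 352800 (~350K)
```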

S2

Speaker 2

21:50

And really, what we've done here is we've made assumptions, right? A fully connected layer could compute the exact same thing: a specific setting of those 15 million parameters would actually produce the exact output of this convolutional layer. But we've done it much more efficiently.

S2

Speaker 2

22:08

We've done that by introducing these biases, these assumptions. In particular, since we have these fixed filters that we're sliding across space, we've assumed that if there's some interesting feature that you'd like to detect in one part of the image, like, say, the top left, then that feature will also be useful somewhere else, like the bottom right, because we fix these filters and apply them at all the spatial positions equally.

S2

Speaker 2

22:30

You might notice that this is not always something that you might want. For example, if you're getting inputs that are centered face images, and you're doing some kind of face recognition or something like that, then you might want different filters at different spatial positions. Say, for eye regions, you might want to have some eye-like filters, and for the mouth region, you might want to have mouth-specific features, and so on. And in that case, you might not want to use a convolutional layer, because those features have to be shared across all spatial positions.

S2

Speaker 2

22:55

And the second assumption that we made is that these filters are small and local. So we don't have global connectivity; we have this local connectivity. But that's OK, because we end up stacking these convolutional layers in sequence.

S2

Speaker 2

23:06

And so the neurons at the end of the convnet will grow their receptive field as you stack these convolutional layers on top of each other. So at the end of the convnet, those neurons end up being a function of the entire image, eventually. So just to give you an idea about what these activation maps look like concretely, here's an example of an image on the top left. This is a part of a car, I believe.

S2

Speaker 2

23:26

And we have these different filters; we have 32 different small filters here. And if we convolve these filters with this image, we end up with these activation maps. So this filter, if you convolve it, you get this activation map, and so on. So this one, for example, has some orange stuff in it.

S2

Speaker 2

23:40

So when we convolve with this image, you see that this white here is denoting the fact that that filter matches that part of the image quite well. And so we get these activation maps. You stack them up. And then that goes into the next convolutional layer.

S2

Speaker 2

23:53

So the way this looks like then is that we've processed this with some kind of a convolutional layer. We get some output. We apply a rectified linear unit, some kind of non-linearity as normal. And then we would just repeat that operation.

S2

Speaker 2

24:06

So we keep plugging these conv volumes into the next convolutional layer. And so they plug into each other in sequence. And so we end up processing the image over time. So that's the convolutional layer.

S2

Speaker 2

24:18

You'll notice that there are a few more layers. So in particular, the pooling layer, I'll explain very briefly. Pooling layer is quite simple. If you've used Photoshop or something like that, you've taken a large image and you've resized it, you've down sampled the image.

S2

Speaker 2

24:32

Well, pooling layers do basically something exactly like that, but they're doing it on every single channel independently. So for every 1 of these channels independently in an input volume, we will pluck out that activation map, we will down sample it, and that becomes a channel in the output volume. So it's really just a downsampling operation on these volumes. So for example, 1 of the common ways of doing this in the context of neural networks, especially, is to use max pooling operation.

S2

Speaker 2

24:57

So in this case, it would be common to, say, for example, use 2 by 2 filters at stride 2 and do a max operation. So if this is an input channel in a volume, then what that amounts to is that we tile it into these 2 by 2 regions, and we take a max over 4 numbers to produce one piece of the output.
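As a sketch, 2 by 2 max pooling with stride 2 on a single channel looks like this in NumPy (shapes from the running example; deep learning libraries provide this as a built-in op, so this loop is purely illustrative):

```python
import numpy as np

def max_pool_2x2(channel):
    """Downsample one activation map with 2x2 windows at stride 2."""
    h, w = channel.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = channel[i:i + 2, j:j + 2].max()  # max over 4 numbers
    return out

a = np.random.randn(28, 28)   # one channel of an input volume
print(max_pool_2x2(a).shape)  # (14, 14): same channel, spatially downsampled
```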

S2

Speaker 2

25:18

OK, so this is a very cheap operation that down samples your volumes. It's really a way to control the capacity of the network. So you don't want too many numbers. You don't want things to be too computationally expensive.

S2

Speaker 2

25:27

It turns out that a pooling layer allows you to down sample your volumes. You're going to end up doing less computation, and it turns out to not hurt the performance too much. So we use them basically as a way of controlling the capacity of these networks. And the last layer that I want to briefly mention, of course, is the fully connected layer, which is exactly what you're familiar with.

S2

Speaker 2

25:45

So we have these volumes throughout as we've processed the image. At the end, you're left with this volume. And now you'd like to predict some classes. So what we do is we just take that volume, we stretch it out into a single column, and then we apply a fully connected layer, which really amounts to just a matrix multiplication.
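A minimal sketch of that last step, flattening the final volume and applying one matrix multiply; the shapes here are illustrative (roughly VGG-sized), and the softmax that turns the scores into probabilities is the one described next:

```python
import numpy as np

v = np.random.randn(7, 7, 512)              # final volume left after the conv/pool stack
x = v.reshape(-1)                           # stretch it out into a single column (25088,)

W = 0.01 * np.random.randn(1000, x.size)    # fully connected layer: one matrix multiply
b = np.zeros(1000)
scores = W @ x + b                          # 1,000 class scores

probs = np.exp(scores - scores.max())       # softmax (described next) -> probabilities
probs /= probs.sum()
print(probs.shape, probs.sum())             # (1000,) 1.0
```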

S2

Speaker 2

25:59

And then that gives us probabilities after applying a softmax or something like that. So let me now show you briefly a demo of what a convolutional network looks like. So this is ConvNetJS. This is a deep learning library for training convolutional neural networks that is implemented in JavaScript.

S2

Speaker 2

26:19

I wrote this maybe 2 years ago at this point. So here we are training a convolutional network on the CIFAR 10 dataset, a dataset of 50,000 images, each image is 32 by 32 by 3, and there are 10 different classes. So here we are training this network in the browser, and you can see that the loss is decreasing, which means that we're better classifying these inputs. And so here's the network specification, which you can play with, because this is all done in the browser.

S2

Speaker 2

26:45

So you can just change this and play with this. So this is an input image, and this convolutional network, I'm showing here all the intermediate activations and all the intermediate, basically activation maps that we are producing. So here we have a set of filters, we are convolving them with the image and getting all the activation maps. I'm also showing the gradients, but I don't want to dwell on that too much.

S2

Speaker 2

27:07

And then you threshold, so the ReLU clamps anything below 0 at 0, and then you pool, so this is just a downsampling operation, and then another convolution, ReLU, pool, conv, ReLU, pool, et cetera, until at the end we have a fully connected layer. And then we have our softmax so that we get probabilities out. And then we apply a loss to those probabilities and backpropagate. And so here we see that I've been training in this tab for the last maybe 30 seconds, or 1 minute, and we are already getting about 30% accuracy on CIFAR-10.

S2

Speaker 2

27:36

These are test images from CIFAR-10 and the outputs of the convolutional network, and you can see it has already learned that this is a car, or something like that, so this trains pretty quickly in JavaScript. So you can play with this and change the architecture and so on. Another thing I'd like to show you is this video, because it gives you, again, a very intuitive, visceral feeling of exactly what this is computing. There is a very good video by Jason Yosinski from a recent advance that I'm going to play in a bit. This is from the deep visualization toolbox. So you can download this code, and you can play with this.

S2

Speaker 2

28:07

It's this interactive convolutional network demo.

S4

Speaker 4

28:10

Neural networks have enabled computers to better see and understand the world.

S3

Speaker 3

28:14

They can recognize school buses and z-plane. Top left corner, we show the input. In this case, the popular deep.

S2

Speaker 2

28:19

So what we're seeing here are activation maps of a particular layer, shown in real time as this demo is running. These are for the conv1 layer of an AlexNet, which we're going to go into in much more detail. But these are the different activation maps that are being produced at this point.

S3

Speaker 3

28:36

Neural network called AlexNet running in Caffe. By interacting with the network, we can see what some of the neurons are doing. For example, on this first layer, a unit in the center responds strongly to light-to-dark edges.

S3

Speaker 3

28:51

Its neighbor, one neuron over, responds to edges in the opposite direction, dark to light. Using optimization, we can synthetically produce images that light up each neuron on this layer to see what each neuron is looking for.

S3

Speaker 3

29:05

We can scroll through every layer in the network to see what it does, including convolution, pooling, and normalization layers. We can switch back and forth between showing the actual activations and showing images synthesized to produce high activation. By the time we get to the fifth convolutional layer, the features being computed represent abstract concepts.

S3

Speaker 3

29:29

For example, This neuron seems to respond to faces. We can further investigate this neuron by showing a few different types of information. First, we can artificially create optimized images using new regularization techniques that are described in our paper. These synthetic images show that this neuron fires in response to a face and shoulders.

S3

Speaker 3

29:46

We can also plot the images from the training set that activate this neuron the most, as well as pixels from those images most responsible for the high activations, computed via the deconvolution technique. This feature responds to multiple faces in different locations. And by looking at the deconv, we can see that it would respond more strongly if we had even darker eyes and rosier lips. We can also confirm that it cares about the head and shoulders, but ignores the arms and torso.

S3

Speaker 3

30:13

We can even see that it fires to some extent for cat faces. Using backprop or deconv, we can see that this unit depends most strongly on a couple of units in the previous layer, conv4, and on about a dozen or so in conv3. Now let's look at another neuron on this layer. So what's this unit doing?

S3

Speaker 3

30:32

From the top 9 images, we might conclude that it fires for different types of clothing. But examining the synthetic images shows that it may be detecting not clothing per se, but wrinkles. In the live plot, we can see that it's activated by my shirt. And smoothing out half of my shirt causes that half of the activations to decrease.

S3

Speaker 3

30:52

Finally, here's another interesting neuron. This one has learned to look for printed text in a variety of sizes, colors, and fonts. This is pretty cool because we never asked the network to look for wrinkles or text or faces.

S3

Speaker 3

31:06

The only labels we provided were at the very last layer, so the only reason the network learned features like text and faces in the middle was to support final decisions at that last layer. For example, the text detector may provide good evidence that a rectangle is, in fact, a book seen on edge. And detecting many books next to each other might be a good way of detecting a bookcase, which was 1 of the categories we trained the net to recognize.

S3

Speaker 3

31:31

In this video, we've shown some of the features of the DeepViz toolbox.

S2

Speaker 2

31:34

Okay, so I encourage you to play with that, it's really fun. So I hope that gives you an idea about exactly what's going on. There are these convolutional layers, we downsample them from time to time.

S2

Speaker 2

31:42

There's usually some fully connected layers at the end, but mostly it's just these convolutional operations stacked on top of each other. So what I'd like to do now is I'll dive into some details of how these architectures are actually put together. The way I'll do this is I'll go over all the winners of the ImageNet challenges, and I'll tell you about the architectures, how they came about, how they differ. And so you'll get a concrete idea about what these architectures look like in practice.

S2

Speaker 2

32:04

So we'll start off with the AlexNet in 2012. So the AlexNet, just to give you an idea about the sizes of these networks and the images that they process, took 227 by 227 by 3 images. And the first layer of the AlexNet, for example, was a convolutional layer that had 11 by 11 filters applied with a stride of 4, and there are 96 of them.

S2

Speaker 2

32:26

Stride of 4 I didn't fully explain because I wanted to save some time. But intuitively, it just means that as you're sliding this filter across the input, you don't have to slide it one pixel at a time; you can actually jump a few pixels at a time. So we have 11 by 11 filters with a stride, a skip, of 4.

S2

Speaker 2

32:41

And we have 96 of them. You can try to compute, for example, what is the output volume if you apply this sort of convolutional layer on top of this volume. And I didn't go into details of how you compute that, but basically, there are formulas for this, and you can look into details in the class. But you arrive at 55 by 55 by 96 volume as output.

S2

Speaker 2

33:02

The total number of parameters in this layer, we have 96 filters. Every 1 of them is 11 by 11 by 3, because that's the input depth of these images. So basically, it just amounts to 11 times 11 times 3, and you have 96 filters, so about 35,000 parameters in the first layer. And then the second layer is a pooling layer, so we apply 3 by 3 filters at stride of 2 and they do max pooling, so you can compute the output volume size of that after applying this to that volume.

S2

Speaker 2

33:34

And if you do some very simple arithmetic there, you arrive at 27 by 27 by 96. This is the downsampling operation. You can think about what the number of parameters in this pooling layer is. And of course, it's 0. Pooling layers compute a fixed function, a fixed downsampling operation; there are no parameters involved in a pooling layer. All the parameters are in the convolutional layers and the fully connected layers, which are, to some extent, equivalent to convolutional layers.
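The sizing arithmetic behind these numbers uses the standard formula output = (W - F + 2P) / S + 1; here is a quick check for these first two AlexNet layers (assuming no padding, as described):

```python
def out_size(w, f, s, p=0):
    """Spatial output size for input width w, filter size f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

# CONV1: 227x227x3 input, 96 filters of 11x11x3 at stride 4 -> 55x55x96
print(out_size(227, 11, 4))   # 55
print(96 * 11 * 11 * 3)       # 34848, i.e. about 35,000 parameters

# POOL1: 3x3 filters at stride 2 -> 27x27x96, and zero parameters
print(out_size(55, 3, 2))     # 27
```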

S2

Speaker 2

34:01

So we can go ahead and, based on the description in the paper, although it's non-trivial for this particular paper, decipher what the volumes are throughout. You can look at the kinds of patterns that emerge in terms of how you actually increase the number of filters in the higher convolutional layers. So we started off with 96, then we go to 256 filters, then to 384, and eventually 4,096 units in the fully connected layers. You'll also see normalization layers here, which have since become slightly deprecated.

S2

Speaker 2

34:31

It's not very common anymore to use the normalization layers that were used at the time in the AlexNet architecture. What's interesting to note is how this differs from the 1998 Yann LeCun network. In particular, I usually like to think about four things that hold back progress, at least in deep learning: data is one constraint, and compute is another.

S2

Speaker 2

34:52

And then I like to differentiate between algorithms and infrastructure, algorithms being something that feels like research, and infrastructure being something that feels like a lot of engineering has to happen. And so in particular, we've had progress on all four of those fronts. So we see that in 1998, the data you could get hold of would maybe be on the order of a few thousand examples, whereas now we have a few million. So we have three orders of magnitude of increase in the amount of data.

S2

Speaker 2

35:14

Compute: GPUs have become available, and we use them to train these networks. They are roughly 20 times faster than CPUs. And then, of course, the CPUs we have today are much, much faster than the CPUs they had back in 1998. So I don't know exactly what that works out to, but I wouldn't be surprised if it's, again, on the order of three orders of magnitude of improvement.

S2

Speaker 2

35:34

I'd like to actually skip over the algorithms and talk about infrastructure. So in this case, we're talking about NVIDIA releasing the CUDA library, which allows you to efficiently run all these matrix-vector operations and apply them to arrays of numbers. So that's a piece of software that we rely on and take advantage of that wasn't available before. And finally, algorithms is kind of an interesting one, because in those 20 years, there's been much less improvement in algorithms than in all these other three pieces.

S2

Speaker 2

36:02

So in particular, what we've done with the 1998 network is we've made it bigger. So you have more channels, and you have a few more layers. And the two really new things algorithmically are dropout and rectified linear units.

S2

Speaker 2

36:16

So dropout is a regularization technique developed by Geoff Hinton and colleagues. And rectified linear units are these nonlinearities that train much faster than sigmoids and tanhs. And this paper actually had a plot that showed that the rectified linear units trained a bit faster than sigmoids. And that's intuitively because of the vanishing gradient problems.

S2

Speaker 2

36:36

And when you have very deep networks with sigmoids, those gradients vanish, as Hugo was talking about in the last lecture. What's also interesting to note, by the way, is that both dropout and ReLU are basically one or two lines of code to change. So it's about a two-line diff in total over those 20 years. And both of them consist of setting things to 0.

S2

Speaker 2

36:56

So with the ReLU, you set things to 0 when they're lower than 0. And with dropout, you set things to 0 at random. So it's a good idea to set things to 0. Apparently, that's what we've learned.
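Both really are one- or two-line changes; here is a minimal sketch. The dropout version shown is the common "inverted dropout" formulation, which rescales at training time rather than at test time; details vary across implementations.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)           # set things to zero when they are below zero

def dropout(x, p_drop=0.5, train=True):
    if not train:
        return x                        # at test time all units are active
    keep = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * keep                     # set things to zero at random (and rescale the rest)

h = relu(np.random.randn(4))
print(dropout(h))                       # roughly half the entries zeroed, the rest scaled up
```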

S2

Speaker 2

37:06

So if you try to find a new cool algorithm, look for one-line diffs that set something to 0; it will probably work better. And we could add you here to this list. Now, some other things to note, again to give you an idea about the hyperparameters that were used in this architecture.

S2

Speaker 2

37:24

It was the first use of rectified linear units. We haven't seen that as much before. This network used the normalization layers, which are not used anymore, at least in the specific way that they use them in this paper. They used heavy data augmentation.

S2

Speaker 2

37:38

So you don't only pipe these images into the networks exactly as they come from the data set, but you jitter them spatially around a bit. And you warp them, and you change the colors a bit, and you just do this randomly. Because you're trying to build in some invariances to these small perturbations. And you're basically hallucinating additional data.
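A sketch of that kind of augmentation is below (random crop, mirror, and a crude brightness jitter). Treat the details as illustrative: the actual AlexNet recipe used 224-pixel crops of 256-pixel images plus a PCA-based color jitter.

```python
import numpy as np

def augment(img, crop=224):
    """Return one randomly jittered view of an image (H, W, 3 float array)."""
    h, w, _ = img.shape
    i = np.random.randint(0, h - crop + 1)       # random spatial jitter
    j = np.random.randint(0, w - crop + 1)
    out = img[i:i + crop, j:j + crop]
    if np.random.rand() < 0.5:                   # random horizontal mirror
        out = out[:, ::-1]
    return out * np.random.uniform(0.9, 1.1)     # crude brightness/color jitter

img = np.random.rand(256, 256, 3)   # stand-in for a 256x256 training image
print(augment(img).shape)           # (224, 224, 3)
```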

S2

Speaker 2

37:54

It was the first real use of dropout. And roughly, you see standard hyperparameters: batch sizes of roughly 128, stochastic gradient descent with momentum, usually 0.9, learning rates of 1e-2, which you reduce in the normal ways, roughly by a factor of 10 whenever the validation error stops improving, and a weight decay of just a bit, 5e-4.
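A minimal sketch of that update rule with the quoted hyperparameters; this is one common formulation of momentum plus L2 weight decay, and frameworks differ on the exact details.

```python
def sgd_momentum_step(w, dw, v, lr=1e-2, mu=0.9, weight_decay=5e-4):
    """One parameter update; w, dw, v are arrays (or floats) of the same shape."""
    dw = dw + weight_decay * w   # L2 weight decay folded into the gradient
    v = mu * v - lr * dw         # velocity accumulates a running history of gradients
    w = w + v
    return w, v

# Schedule: divide lr by ~10 whenever the validation error stops improving.
```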

S2

Speaker 2

38:21

And ensembling always helps. So you train seven independent convolutional networks separately, and then you just average their predictions. It always gives you an additional 2% improvement. So this is AlexNet, the winner of 2012.

In 2013, the winner was the ZFNet, developed by Matthew Zeiler and Rob Fergus. And this was an improvement on top of the AlexNet architecture. In particular, one of the bigger differences here is that the first convolutional layer went from 11 by 11 with stride 4 to 7 by 7 with stride 2. And they noticed that these convolutional layers in the middle, if you make them larger, if you scale them up, then you actually gain performance. So they managed to improve a tiny bit. Matthew Zeiler then became the founder of Clarifai, and he worked on this a bit more inside Clarifai and managed to push the performance to 11%, which was the winning entry at the time.

S2

Speaker 2

39:17

But we don't actually know what gets you from 14% to 11%, because Matthew never disclosed the full details of what happened there. He did say that it was more tweaking of these hyperparameters and optimizing things a bit. So that was the 2013 winner. In 2014, we saw a slightly bigger diff on top of this.

S2

Speaker 2

39:34

So one of the networks introduced then was the VGGNet from Karen Simonyan and Andrew Zisserman. They explored a few architectures here, and the one that ended up working best was this D column, which is why I'm highlighting it. What's beautiful about the VGGNet is that it's so simple. You might have noticed that in these previous networks, you have different filter sizes, different layers, different amounts of stride, and everything kind of looks a bit hairy, and you're not sure where these hyperparameters are coming from.

S2

Speaker 2

40:00

VGGNet is extremely uniform. All you do is 3 by 3 convolutions with stride 1, pad 1, and 2 by 2 max poolings with stride 2. And you do this throughout, a completely homogeneous architecture; you just alternate a few conv and a few pool layers, and you get top performance. So they managed to reduce the error down to 7.3% with the VGGNet, just with a very simple and homogeneous architecture.
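To show just how homogeneous that "D" configuration written out on the slide is, it can be summarized as nothing but 3 by 3 convs and 2 by 2 max pools; this is the commonly cited VGG-16 layout, with the filter counts from the paper and the list notation being my own shorthand.

```python
# VGG-16 ("D"): numbers are 3x3 conv filter counts (stride 1, pad 1),
# "M" is a 2x2 max pool with stride 2; three fully connected layers finish it off.
vgg16_cfg = [64, 64, "M",
             128, 128, "M",
             256, 256, 256, "M",
             512, 512, 512, "M",
             512, 512, 512, "M"]
fc_layers = [4096, 4096, 1000]
```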

S2

Speaker 2

40:24

I've also here written out this D architecture, just so you can see. I'm not sure how instructive this is, because it's kind of dense. But you can definitely see, and you can look at this offline, perhaps, but you can see how these volumes develop, and you can see the kinds of sizes of these filters. So they're always 3 by 3, but the number of filters, again, grows.

S2

Speaker 2

40:43

So we started off with 64, and then we go to 128, 256, 512. So we're just doubling it over time. I also have a few numbers here just to give you an idea of the scale at which these networks normally operate. So we have on the order of 140 million parameters.

S2

Speaker 2

40:58

This is actually quite a lot. I'll show you in a bit that this can be about 5 or 10 million parameters, and it works just as well. And it's about 100 megabytes per image in terms of memory in the forward pass. And then the backward pass also needs roughly that order.

S2

Speaker 2

41:12

So that's roughly the numbers that we're working with here. Also, you can note, and this is true of most convolutional networks, that most of the memory is in the early convolutional layers, while most of the parameters, at least when you use these giant fully connected layers at the top, are in those fully connected layers. So the winner, actually, in 2014 was not the VGGNet.

S2

Speaker 2

41:31

I only present it because it's such a simple architecture. But the winner was actually GoogLeNet, with a slightly hairier architecture, we should say. It's still a sequence of things, but in this case, they've put Inception modules in sequence.

S2

Speaker 2

41:44

And this is an example of an Inception module. I don't have too much time to go into the details, but you can see that it consists basically of convolutions with different kinds of strides and so on. So GoogLeNet looks slightly hairier, but it turns out to be more efficient in several respects. For example, it worked a bit better than VGGNet, at least at the time.

S2

Speaker 2

42:06

It only has 5 million parameters compared to VGGnet's 140 million parameters, so a huge reduction. And you do that, by the way, by just throwing away fully connected layers. So you'll notice in this breakdown I did, these fully connected layers here have 100 million parameters and 16 million parameters. Turns out you don't actually need that.

S2

Speaker 2

42:22

So if you take them away, that actually doesn't hurt performance too much, and you get a huge reduction in parameters. We can also compare to the original AlexNet: compared to the original AlexNet, we have fewer parameters, a bit more compute, and much better performance.

S2

Speaker 2

42:40

So GoogLeNet was really optimized to have a low footprint, memory-wise, computation-wise, and parameter-wise. But it looks a bit uglier. And VGGNet is a very beautiful, homogeneous architecture, but there are some inefficiencies in it. OK, so that's 2014.

S2

Speaker 2

42:56

Now, in 2015, we had a slightly bigger delta on top of these architectures. Up to this point, if Yann LeCun had looked at these architectures back in 1998, he would still recognize everything. Everything looks very simple; you've just played with hyperparameters.

S2

Speaker 2

43:08

So one of the first bigger departures, I would argue, came in 2015 with the introduction of residual networks. And this is work from Kaiming He and colleagues at Microsoft Research Asia. They did not only win the ImageNet challenge in 2015, they won a whole bunch of challenges, and this was all by applying these residual networks that were trained on ImageNet and fine-tuned on all of these different tasks; you can crush a lot of different tasks whenever you get a new awesome convnet. So at this time the performance was basically 3.57% from these residual networks.

S2

Speaker 2

43:42

So this is 2015. Also this paper tried to argue that if you look at the number of layers, it goes up. And then they made the point that with residual networks, as we'll see in a bit, you can introduce many more layers and that that correlates strongly with performance. We've since found that, in fact, you can make these residual networks quite a lot shallower, like say on the order of 20 or 30 layers, and they work just as well.

S2

Speaker 2

44:05

So it's not necessarily the depth here, but I'll go into that in a bit. But you get a much better performance. What's interesting about this paper is this plot here, where they compare these residual networks, and I'll go into details of how they work in a bit, with what they call plain networks, which is everything I've explained until now. And the problem with plain networks is that when you try to scale them up and introduce additional layers, they don't get monotonically better.

S2

Speaker 2

44:28

So if you take a 20-layer model, and this is on CIFAR-10 experiments. If you take a 20 layer model and you run it, and then you take a 56 layer model, you'll see that the 56 layer model performs worse. And this is not just on the test data, so it's not just an overfitting issue. This is on the training data.

S2

Speaker 2

44:45

The 56-layer model performs worse on the training data than the 20-layer model, even though the 56-layer model could imitate the 20-layer model by setting 36 of its layers to compute identities. So basically, it's an optimization problem: you can't find the good solution once your problem grows that much bigger in this plain-net architecture. So in the residual networks that they proposed, they found that when you wire them up in a slightly different way, you monotonically get better performance as you add more layers. So more layers is always strictly better, and you don't run into these optimization issues.

S2

Speaker 2

45:19

So comparing residual networks to plain networks: in plain networks, as I've explained already, you have this sequence of convolutional layers, where every convolutional layer operates over the volume before it and produces a new volume. In residual networks, we have this first convolutional layer on top of the raw image, then there's a pooling layer. So at this point, we've reduced the original image to 56 by 56 by 64.

S2

Speaker 2

45:41

And then from here on, they have these residual blocks with these funny skip connections. And this turns out to be quite important. So let me show you what these look like. So the original Kaiming He paper had this architecture here, shown under "original".

S2

Speaker 2

45:56

So on the left, you see the original residual network design. Since then, they had an additional paper that played with the architecture and found that there's a better arrangement of layers inside this block that works better empirically. And the way this works, so concentrate on the proposed one in the middle since that works so well, is that you have this pathway where you have this representation of the image x. And then instead of transforming that representation x to get a new x to plug in later, you go off and do some computation on the side, that's the residual block doing some computation, and then you add your result on top of x.

S2

Speaker 2

46:33

So you have this addition operation here going to the next residual block. So you have this x, and you always compute deltas to it. And I think it's not intuitive that this should work much better or why that works much better. I think it becomes a bit more intuitively clear if you actually understand the backpropagation dynamics and how backprop works.
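In code, the difference between a plain block and a residual block is just that one addition. Here is a minimal sketch; the two small fully connected layers inside the block are a schematic stand-in for the actual conv/batch-norm arrangement in the papers.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def block_fn(x, w1, w2):
    """The computation done 'on the side' (schematic: two small layers)."""
    return relu(x @ w1) @ w2

def plain_block(x, w1, w2):
    return block_fn(x, w1, w2)       # the representation x is replaced outright

def residual_block(x, w1, w2):
    return x + block_fn(x, w1, w2)   # the block only computes a delta on top of x

d = 64
x = np.random.randn(1, d)
w1, w2 = 0.01 * np.random.randn(d, d), 0.01 * np.random.randn(d, d)
print(residual_block(x, w1, w2).shape)  # (1, 64): same shape, x plus a small nudge
```

Because the output is x plus the block's contribution, the gradient flowing back reaches x both directly and through the block, which is the "gradient distributor" point made next.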

S2

Speaker 2

46:50

And this is why I always urge people to implement backprop themselves, to get an intuition for how it works, what it's computing, and so on. Because if you understand backprop, you'll see that the addition operation is a gradient distributor. So you get a gradient from the top, and this gradient will flow equally to all the children that participated in that addition. So you have gradient flowing here from the supervision.
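
A tiny autograd check of this "gradient distributor" behavior; the numbers are arbitrary:

```python
import torch

# Addition passes the incoming gradient unchanged to everything that fed into it.
x = torch.tensor(2.0, requires_grad=True)   # the "stream" x
f = torch.tensor(3.0, requires_grad=True)   # stands in for the residual branch output F(x)
y = x + f                                   # the skip-connection add
y.backward(gradient=torch.tensor(5.0))      # pretend a gradient of 5 arrives from above
print(x.grad, f.grad)                       # tensor(5.) tensor(5.): both get the full gradient
```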

S2

Speaker 2

47:10

So you have supervision at the very bottom here in this diagram, and it kind of flows upwards. It flows through these residual blocks and then gets added to the stream. And this addition distributes that gradient identically through to both inputs. So what you end up with is this kind of gradient superhighway, as I like to call it, where the gradients from your supervision go directly to the original convolutional layer.

S2

Speaker 2

47:30

And then on top of that, you get these deltas from all the residual blocks. So these blocks can come online and help out that original stream of information. This is also related to, I think, why LSTMs, long short-term memory networks, work better than plain recurrent neural networks, because they also have these kinds of addition operations in the cell. And that just makes the gradients flow significantly better.

S2

Speaker 2

47:54

Then there were some results on top of residual networks that I thought were quite amusing. So recently, for example, we had this result on deep networks with stochastic depth. The idea here was that the authors of this paper noticed that you have these residual blocks that compute deltas on top of your stream. And you can basically randomly throw out layers.

S2

Speaker 2

48:13

So you have these, say, 100 residual blocks, and you can randomly drop them out during training. And at test time, similar to dropout, you introduce all of them, and they all work at the same time, but you have to scale things a bit, just like with dropout. But basically, it's kind of an unintuitive result, because you can throw out layers at random, and I think it breaks the original notion we had of convnets as these feature transformers that compute more and more complex features over time, or something like that. And I think it seems much more intuitive, at least to me, to think about these residual networks as some kind of dynamical system, where you have this original representation of the image x, and then every single residual block is kind of like a vector field, because it computes a delta on top of your signal.
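
A minimal sketch of the random dropping and test-time scaling just described, assuming a PyTorch-style residual branch; the wrapped body and the survival probability are illustrative choices (the paper actually uses a depth-dependent schedule for the survival probability):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Sketch: a residual block that is randomly skipped during training."""

    def __init__(self, body: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.body = body                    # any residual branch
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.body(x)     # block is "on" for this pass
            return x                        # block dropped: pure identity
        # test time: every block is active, scaled by its survival probability
        return x + self.survival_prob * self.body(x)
```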

S2

Speaker 2

49:00

And so these vector fields nudge your original representation x towards a space where you can decode the answer y, the class of that x. And so if you drop some of these residual blocks at random, so one of these vector fields hasn't been applied, then the other vector fields that come later can kind of make up for it: they pick up the slack and nudge the representation along anyway. And so that's the image I currently have in mind of how these things work.

S2

Speaker 2

49:25

So it's much more like a dynamical system. In fact, another experiment that people are playing with, which I also find interesting, is that you can share these residual blocks. So it starts to look more like a recurrent neural network. So these residual blocks would have shared connectivity.

S2

Speaker 2

49:38

And then you have this dynamical system, really, where you're just running a single RNN, a single vector field that you keep iterating over and over. And then your fixed point gives you the answer. So it's kind of interesting what's happening. It looks very funny.
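
A sketch of that "single iterated vector field" view, with one weight-tied block applied repeatedly; the channel count, the body, and the number of steps are made up for illustration:

```python
import torch
import torch.nn as nn

# One residual body with shared weights, applied over and over, like a
# recurrent net iterating a single update rule.
channels = 64
shared_body = nn.Sequential(
    nn.BatchNorm2d(channels),
    nn.ReLU(),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
)

def iterate(x, steps=10):
    for _ in range(steps):
        x = x + shared_body(x)   # the same "vector field" nudges x at every step
    return x

x = torch.randn(1, channels, 56, 56)
y = iterate(x)                   # same shape as x
```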

S2

Speaker 2

49:52

We've had many more interesting results. So people are playing a lot with these residual networks and improving on them in various ways. So as I mentioned already, it turns out that you can make these residual networks much shallower and make them wider. So you introduce more channels.

S2

Speaker 2

50:06

And that can work just as well, if not better. So it's not necessarily depth that is giving you a lot of the performance. You can scale down the depth. And if you increase the width, that can actually work better.

S2

Speaker 2

50:18

And they're also more efficient if you do it that way. There are more funny regularization techniques. Here, Swapout is a funny regularization technique that actually interpolates between plain nets, ResNets, and dropout. So that's also a fun paper.

S2

Speaker 2

50:31

We have FractalNets. We actually have many more different types of nets. And so people have really experimented with this a lot. I'm really eager to see what the winning architecture will be in 2016 as a result of a lot of this.

S2

Speaker 2

50:41

One of the things that has really enabled this rapid experimentation in the community is that we've somehow, luckily, developed this culture of sharing a lot of code among ourselves. So for example, Facebook has released residual network code in Torch that is really good, that a lot of these papers, I believe, have adopted and built on top of, and that allowed them to really scale up their experiments and explore different architectures. So it's great that this has happened. Unfortunately, a lot of these papers are coming out on arXiv, and it's kind of a chaos as these are being uploaded.

S2

Speaker 2

51:14

So at this point, I think this is a natural point to very briefly plug my arxiv-sanity.com. So this is the best website ever. And what it does is it crawls arXiv, and it takes all the papers, analyzes the full text of the papers, and creates TF-IDF bag-of-words features for all the papers. And then you can do things like search for a particular paper, like the residual networks paper here, and look for similar papers on arXiv.

S2

Speaker 2

51:38

And so this is a sorted list of basically all the residual network papers that are most related to that paper. Or you can also create a user account, and you can create a library of papers that you like. And then Arxiv Sanity will train a support vector machine for you. And basically, you can look at which arXiv papers over the last month you would enjoy the most.

S2

Speaker 2

51:55

And that's just computed by Arxiv Sanity. And so it's like a curated feed specifically for you. So I use this quite a bit, and I find it useful. So I hope that other people do as well.
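
For a sense of the kind of pipeline being described, here is a small sketch using scikit-learn. This is not the actual Arxiv Sanity code, and the paper texts and labels below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

papers = [
    "deep residual learning for image recognition ...",       # placeholder full texts
    "identity mappings in deep residual networks ...",
    "very deep convolutional networks for large-scale image recognition ...",
]

vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(papers)                  # one TF-IDF vector per paper

# "similar papers": rank everything by cosine similarity to a query paper
similar = cosine_similarity(features[0], features).ravel()

# "personalized feed": label the papers in your library as positives, train an SVM,
# and sort new papers by the SVM score
in_library = [1, 1, 0]
svm = LinearSVC().fit(features, in_library)
feed_scores = svm.decision_function(features)
```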

S2

Speaker 2

52:04

OK, so we saw convolutional neural networks. I explained how they work. I explained some of the background context. I've given you an idea of what they look like in practice.

S2

Speaker 2

52:13

And we went through case studies of the winning architectures over time. But so far, we've only looked at image classification specifically. So we're categorizing images into some number of bins. So I'd like to briefly talk about addressing other tasks in computer vision and how you might go about doing that.

S2

Speaker 2

52:27

So the way to think about doing other tasks in computer vision is that really what we have is this convolutional neural network, which you can think of as a block of compute that has a few million parameters in it. And it can implement basically arbitrary, very nice functions over images. And so it takes an image and gives you some kind of features. And now different tasks will basically look as follows.

S2

Speaker 2

52:50

You want to predict some kind of thing, and in different tasks there will be different things. And you always have a desired thing. And then you want to make the predicted thing much closer to the desired thing.

S2

Speaker 2

52:59

And you backpropagate. So this is the only part, usually, that changes from task to task. You'll see that these conv nets don't change too much. What changes is your loss function at the very end.

S2

Speaker 2

53:07

And that's what actually lets you transfer a lot of these winning architectures. You usually use these pre-trained networks, and you don't worry too much about the details of that architecture, because you're only worried about adding a small piece at the top, or changing the loss function, or substituting a new data set, and so on. So just to make this slightly more concrete: in image classification, we apply this compute block, we get these features, and then if I want to do classification, I would basically predict 1,000 numbers that give me the log probabilities of the different classes. And then I have a predicted thing and a desired thing, the particular class, and I can backprop.
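
A minimal sketch of that "predicted thing vs. desired thing" setup for classification, with made-up feature and batch sizes; the same pattern, swapping only the head and the loss, covers most of the tasks that follow:

```python
import torch
import torch.nn as nn

# Treat the conv net as a feature extractor and bolt a task head on top.
features = torch.randn(8, 512)                  # pretend output of the conv net
head = nn.Linear(512, 1000)                     # predicted thing: 1,000 class scores
targets = torch.randint(0, 1000, (8,))          # desired thing: the true classes

loss = nn.CrossEntropyLoss()(head(features), targets)
loss.backward()                                 # gradients flow into the head (and, in
                                                # practice, into the conv net below it)
```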

S2

Speaker 2

53:39

If I'm doing image captioning, it also looks very similar. Instead of predicting just a vector of 1,000 numbers, I now have, for example, 10,000 words in some kind of vocabulary, and I'd be predicting 10,000 numbers, and a sequence of them. And so I can use a recurrent neural network, which you will hear much more about, I think, in Richard's lecture just after this. And so I produce a sequence of 10,000-dimensional vectors, and that's just the description.

S2

Speaker 2

54:03

And they indicate the probabilities of different words being emitted at different time steps. Or, for example, if you want to do localization, again most of the block stays unchanged, but now we also want some kind of an extent in the image. So suppose we don't just want to classify this as an airplane, but we want to localize it with x, y, width, height bounding box coordinates.

S2

Speaker 2

54:24

And if we make the assumption that there is always a single thing in the image, like a single airplane in every image, then you can afford to predict that. So we predict these softmax scores, just like before, and apply the cross-entropy loss. And then we can predict x, y, width, height on top of that. And we use an L2 loss, or a Huber loss, or something like that.
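
A sketch of the localization head just described, with a cross-entropy loss for the class and a Huber-style loss for the box; all sizes and targets are illustrative:

```python
import torch
import torch.nn as nn

features = torch.randn(8, 512)                   # pretend conv-net features

class_head = nn.Linear(512, 1000)                # class scores
box_head = nn.Linear(512, 4)                     # x, y, width, height

class_loss = nn.CrossEntropyLoss()(class_head(features), torch.randint(0, 1000, (8,)))
box_loss = nn.SmoothL1Loss()(box_head(features), torch.randn(8, 4))  # Huber-like; L2 also works

loss = class_loss + box_loss                     # just add the two losses and backprop
loss.backward()
```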

S2

Speaker 2

54:43

So you just have a predicted thing and a desired thing, and you just backprop. If you want to do reinforcement learning, because you want to play different games, you just predict some different thing, and it has different semantics. So in this case, we would be predicting 8 numbers that give us the probabilities of taking different actions. If, for example, there are 8 discrete actions in Atari, then we predict 8 numbers and train this in a slightly different manner. In the case of reinforcement learning, you don't know what the correct action to take is at any point in time, but you can still get a desired thing eventually, because you run these rollouts over time and you just see what happens.

S2

Speaker 2

55:23

And then that helps inform what the correct answer should have been, or what the desired thing should have been, in any one of those rollouts at any point in time. I don't want to dwell on this too much in this lecture, since it's outside of its scope. You'll hear much more about reinforcement learning in a later lecture. If you wanted to do segmentation, for example, then you don't want to predict a single vector of numbers for a single image.

S2

Speaker 2

55:46

But every single pixel has its own category that you would like to predict. So a dataset will be colored like this, and you have different classes in different areas. And instead of predicting a single vector of classes, you predict an array of 224 by 224, since that's the extent of the original image, times 20, if you have 20 different classes. And then you basically have 224 by 224 independent softmaxes here, and you can pose this and backpropagate. This would be slightly more involved, because you see I have deconv layers mentioned here, and I haven't explained deconvolutional layers.

S2

Speaker 2

56:22

They are related to convolutional layers, they do a similar operation, but backwards in some way. A convolutional layer does these downsampling operations, a deconv layer does these upsampling operations. You can implement a deconv layer using a conv layer. So a deconv forward pass is the conv layer backward pass, and the deconv backward pass is the conv layer forward pass.
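
A minimal sketch of that idea for segmentation: a single transposed-convolution ("deconv") layer upsamples the 7x7 features back to 224x224, with 20 illustrative class channels, and the loss is an independent softmax at every pixel. Real segmentation networks typically use several upsampling stages and skip connections, so this is only illustrative:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 7, 7)                                # pretend conv-net output

upsample = nn.ConvTranspose2d(512, 20, kernel_size=32, stride=32)   # 7x7 -> 224x224
logits = upsample(features)                                         # (1, 20, 224, 224)

target = torch.randint(0, 20, (1, 224, 224))                        # one class label per pixel
loss = nn.CrossEntropyLoss()(logits, target)                        # 224*224 softmaxes at once
loss.backward()
```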

S2

Speaker 2

56:45

So they are basically the identical operation; the difference is whether you are upsampling or downsampling. So you can use deconv layers, or hypercolumns, and there are different things that people do in the segmentation literature, but that's the rough idea: you're just changing the loss function at the end. Or say you wanted to do autoencoders, so you want to do some unsupervised learning or something like that.

S2

Speaker 2

57:02

Well, you're just trying to predict the original image. So you're trying to get the convolutional network to implement the identity transformation. And the trick, of course, that makes it non-trivial is that you're forcing the representation to go through this representational bottleneck of 7 by 7 by 512. So the network must find an efficient representation of the original image so that it can decode it later.
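
A minimal encoder-decoder sketch with such a bottleneck and an L2 reconstruction loss; the layer sizes are made up so that the shapes work out to a 7 by 7 by 512 bottleneck:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),            # 224 -> 56
    nn.Conv2d(64, 512, kernel_size=8, stride=8), nn.ReLU(),          # 56 -> 7 (the bottleneck)
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 64, kernel_size=8, stride=8), nn.ReLU(), # 7 -> 56
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=4),              # 56 -> 224
)

image = torch.randn(1, 3, 224, 224)
reconstruction = decoder(encoder(image))
loss = nn.MSELoss()(reconstruction, image)   # L2 loss against the original image
loss.backward()
```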

S2

Speaker 2

57:20

So that would be an autoencoder. You again have an L2 loss at the end, and you backprop. Or if you want to do variational autoencoders, you have to introduce a reparameterization layer, and you have to append an additional small loss that pushes your posterior towards your prior, but it's just an additional layer and loss. And then you have an entire generative model.

S2

Speaker 2

57:36

And you can actually sample images as well. If you wanted to do detection, things get a little more hairy, perhaps, compared to localization or something like that. So one of my favorite detectors to explain is the YOLO detector, because it's perhaps the simplest one. It doesn't work the best, but it's the simplest one to explain, and it has the core idea of how people do detection in computer vision.

S2

Speaker 2

57:57

And so the way this works is we reduce the original image to a 7 by 7 by 512 feature map. So really, there are these 49 discrete locations that we have. And at every single one of these 49 locations, we're going to predict a class. So that's shown here on the top right.

S2

Speaker 2

58:15

So every single one of these 49 locations will have some kind of a softmax. And then additionally, at every single position, we're going to predict some number of bounding boxes. So there's going to be some number B of bounding boxes. Say B is 10.

S2

Speaker 2

58:28

So we're going to be predicting 50 numbers. And the 5 comes from the fact that every bounding box will have 5 numbers associated with it: you have to describe the x, y, the width, and the height, and you also have to indicate some kind of a confidence for that bounding box.

S2

Speaker 2

58:43

So that's the fifth number, some kind of a confidence measure. So you basically end up predicting these bounding boxes. They have positions. They have class.

S2

Speaker 2

58:50

They have confidence. And then you have some true bounding boxes in the image. So you know that there are certain true boxes, and they have certain classes. And what you do then is you match up the desired thing with the predicted thing. So say, for example, you had one ground-truth bounding box of a cat: then you would find the closest predicted bounding box, you would mark it as a positive, and you would try to make the associated grid cell predict cat.
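
A sketch of what such a detection head's outputs could look like on the 7 by 7 by 512 feature map; B = 10 boxes and C = 20 classes are illustrative, and the matching and loss computation described here are omitted:

```python
import torch
import torch.nn as nn

B, C = 10, 20
features = torch.randn(1, 512, 7, 7)

head = nn.Conv2d(512, B * 5 + C, kernel_size=1)     # per-cell predictions
out = head(features)                                # (1, B*5 + C, 7, 7)

boxes = out[:, :B * 5].reshape(1, B, 5, 7, 7)       # 5 numbers (x, y, w, h, confidence) per box
class_scores = out[:, B * 5:]                       # (1, C, 7, 7): one softmax per grid cell
```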

S2

Speaker 2

59:15

And you would nudge the prediction to be slightly more towards the cat box. And so all of this can be done with simple losses, and you just back propagate that, and then you have a detector. Or if you want to get much more fancy, you could do dense image captioning. So in this case, this is a combination of detection and image captioning.

S2

Speaker 2

59:32

This is a paper with my equal-contribution co-author Justin Johnson, and Fei-Fei Li, from last year. And so what we did here is an image comes in, and the model becomes much more complex. I don't want to go into it too much. But the first-order approximation is that it's basically detection, but instead of predicting fixed classes, we predict a sequence of words.

S2

Speaker 2

59:49

So we use a recurrent neural network there. But basically, you can then take an image, and you can both detect and describe everything in a complex visual scene. So that's just an overview of the model.