Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)

1 hour 25 minutes 16 seconds

🇬🇧 English

S1

Speaker 1

00:00

So thank you very much for the introduction. So today I'll speak about deep learning, especially in the context of computer vision.

S2

Speaker 2

00:07

So what you saw in the previous talk is neural networks. So you saw that neural networks are organized into these layers, fully connected layers, where neurons within one layer are not connected to each other, but are fully connected to all the neurons in the previous layer. And we saw that basically we have this layer-wise structure from input until output.

S2

Speaker 2

00:25

And there are neurons and nonlinearities, et cetera. Now, so far we have not made too many assumptions about the inputs. So in particular, here we just assume that an input is some kind of a vector of numbers that we plug into this neural network. So that's both a bug and a feature to some extent.

S2

Speaker 2

00:41

Because in most real world applications, we actually can make some assumptions about the input that makes learning much more efficient. So in particular, usually we don't just want to plug into neural networks vectors of numbers, but they actually have some kind of a structure. So we don't have vectors of numbers, but these numbers are arranged in some kind of a layout, like an n-dimensional array of numbers. So for example, spectrograms are two-dimensional arrays of numbers.

S2

Speaker 2

01:09

Images are three-dimensional arrays of numbers. Videos would be four-dimensional arrays of numbers. Text you could treat as one-dimensional array of numbers. And so whenever you have this kind of local connectivity structure in your data, then you'd like to take advantage of it.

S2

Speaker 2

01:22

And convolutional neural networks allow you to do that. So before I dive into convolutional neural networks and all the details of the architectures, I'd like to briefly talk about a bit of the history of how this field evolved over time. So I like to start off usually by talking about Hubel and Wiesel and the experiments that they performed in the 1960s. What they were doing was trying to study the computations that happened in the early visual cortex areas of a cat.

S2

Speaker 2

01:48

And so they had cats, and they plugged in electrodes that could record from the different neurons. And then they showed the cat different patterns of light. And they were trying to debug neurons effectively and try to show them different patterns and see what they responded to. And a lot of these experiments inspired some of the modeling that came in afterwards.

S2

Speaker 2

02:06

So in particular, one of the early models that tried to take advantage of some of the results of these experiments was the neocognitron from Fukushima in the 1980s. And so what you saw here was an architecture that again is layer-wise, similar to what you see in the cortex, where we have these simple and complex cells, where the simple cells detect small things in the visual field, and then you have this local connectivity pattern, and the simple and complex cells alternate in this layered architecture throughout. And so this looks a bit like a convnet because it has some of its features, like, say, the local connectivity. But at the time, this was not trained with backpropagation.

S2

Speaker 2

02:44

These were specific, heuristically chosen updates, and this was unsupervised learning back then. So the first time that we actually used backpropagation to train some of these networks was in the experiments of Yann LeCun in the 1990s. And so this is an example of one of the networks developed back then by Yann LeCun, known as LeNet-5.

S2

Speaker 2

03:05

And this is what you would recognize today as a convolutional neural network. It has a lot of these convolutional layers alternating, a similar kind of design to what you would see in Fukushima's neocognitron, but this was actually trained with backpropagation end to end using supervised learning. Now, this happened in roughly the 1990s, and we're here in 2016, basically about 20 years later. Now, computer vision has for a long time kind of worked on larger images.

S2

Speaker 2

03:38

And a lot of these models back then were applied to very small kinds of settings, like, say, recognizing digits in zip codes and things like that, and they were very successful in those domains. But at least when I entered computer vision, roughly in 2011, a lot of people were aware of these models, but it was thought that they would not scale up naively to large, complex images, and that they would be constrained to these toy tasks for a long time.

S2

Speaker 2

04:02

Or I shouldn't say toy, because these were very important tasks, but certainly like smaller visual recognition problems. And so in computer vision in roughly 2011, it was much more common to use these feature-based approaches at the time. And they didn't work actually that well. So when I entered my PhD in 2011 working on computer vision, you would run a state-of-the-art object detector on this image.

S2

Speaker 2

04:23

And you might get something like this, where cars were detected in trees. And you would kind of just shrug your shoulders and say, well, that just happens sometimes. You kind of just accept it as something that would just happen. And of course, this is a caricature.

S2

Speaker 2

04:37

Things actually worked relatively decently, I should say. But definitely, there were many mistakes that you would not see today in 2016, about five years later. And so a lot of computer vision kind of looked much more like this. When you looked into a paper that tried to do image classification, you would find this section in the paper on the features that they used.

S2

Speaker 2

04:56

So this is one page of features. They would use GIST, HOG, et cetera, and then a second page of features and all their hyperparameters, so all kinds of different histograms, and you would extract this kitchen sink of features, and a third page here. And so you end up with this very large, complex code base, because some of these feature types are implemented in MATLAB, some of them in Python, some of them in C++.

S2

Speaker 2

05:19

And you end up with this large code base of extracting all these features, caching them, and then eventually plugging them into linear classifiers to do some kind of visual recognition task. So it was quite unwieldy. It worked to some extent, but there was definitely room for improvement. So a lot of this changed in computer vision in 2012 with this paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton.

S2

Speaker 2

05:41

So this is the first time that someone took a convolutional neural network very similar to the one that you saw from Yann LeCun in 1998. And I'll go into details of how they differ exactly. But they took that kind of network, they scaled it up, they made it much bigger, and they trained it on a much bigger data set on GPUs. And things basically ended up working extremely well.

S2

Speaker 2

06:00

And this is the first time that the computer vision community really noticed these models and adopted them to work on larger images. And we saw that the performance of these models improved drastically.

S2

Speaker 2

06:15

Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years. And we're looking at the top-5 error, so lower is good. And you can see that in 2010, in the beginning, these were feature-based methods. And then in 2012, we had this huge jump in performance.

S2

Speaker 2

06:29

And that was due to the first convolutional neural network in 2012. And then we've managed to push that down over time, and now we're down to about 3.57%. I think the results for the ImageNet challenge 2016 are actually due to come out today, but I don't think they've come out yet. I have this second tab here opened. I was waiting for the result, but I don't think this is up yet.

S2

Speaker 2

06:54

OK, no, nothing. All right, well, we'll get to find out very soon what happens right here. So I'm very excited to see that. Just to put this in context, by the way, because you're just looking at numbers, like 3.57, how good is that?

S2

Speaker 2

07:07

That's actually really, really good. So something that I did about two years ago now is that I tried to measure the human accuracy on this data set. And so what I did for that is I developed this web interface where I would show myself ImageNet images from the test set, and then I had this interface here where I would have all the different classes of ImageNet, there are 1,000 of them, and some example images, and then basically you go down this list and you scroll for a long time, and you find what class you think that image might be. And then I competed against the ConvNet at the time.

S2

Speaker 2

07:39

And this was GoogLeNet in 2014. And so hot dog is a very simple class. You can do that quite easily. But why is the error not 0%?

S2

Speaker 2

07:50

Well, some of the things, like hot dog, seem very easy. Why isn't it trivial for humans? Well, it turns out that some of the images in the test set of ImageNet are actually mislabeled. But also, some of the images are just very difficult to guess.

S2

Speaker 2

08:02

So in particular, if you have this terrier, there's 50 different types of terriers. And it turns out to be a very difficult task to find exactly which type of terrier that is. You can spend minutes trying to find it. Turns out that convolutional neural networks are actually extremely good at this.

S2

Speaker 2

08:15

And so this is where I would lose points compared to the ConvNet. So I estimate that human error based on this is roughly in the 2% to 5% range, depending on how much time you have, how much expertise you have, how many people you involve, and how much they really want to do this, which is not too much. And so really, we're doing extremely well. And so we're down to 3%.

S2

Speaker 2

08:36

And I think the error rate, if I remember correctly, was about 1.5%. So if we get below 1.5%, I would be extremely suspicious on ImageNet. That seems wrong. So to summarize, basically, what we've done is before 2012, computer vision looked somewhat like this, where we had these feature extractors, and then we trained a small portion at the end of the feature extraction step.

S2

Speaker 2

08:59

And so we only trained this last piece on top of these features that were fixed. And we've basically replaced the feature extraction step with a single convolutional neural network. And now we train everything completely end-to-end. And this turns out to work quite nicely.

S2

Speaker 2

09:11

So I'm going to go into details of how this works in a bit. Also, in terms of code complexity, we went from a setup that looked something like that in papers to something where, instead of extracting all these things, we just say: apply 20 layers of 3 by 3 conv, or something like that, and things work quite well. This is, of course, an over-exaggeration, but I think it's a correct first-order statement that we've definitely reduced code complexity quite a lot, because these architectures are so homogeneous compared to what we had before.

S2

Speaker 2

09:45

So it's also remarkable that we had this reduction in complexity and this amazing performance on ImageNet. One other thing that was quite amazing about the results in 2012, and which did not have to be the case, is that the features that you learn by training on ImageNet turn out to be quite generic, and you can apply them in different settings. So in other words, this transfer learning works extremely well.

S2

Speaker 2

10:08

And of course, I didn't go into details of convolutional networks yet, but we start with an image, and we have a sequence of layers, just like in a normal neural network. And at the end, we have a classifier. And when you pre-train this network on ImageNet, then it turns out that the features that you learn in the middle are actually transferable, and you can use them on different data sets, and that this works extremely well. And so that didn't have to be the case.

S2

Speaker 2

10:28

You might imagine that you could have a convolutional network that works extremely well on ImageNet, but when you try to run it on something else, like BIRDS dataset or something, that it might just not work well. But that is not the case, and that's a very interesting finding in my opinion. So people noticed this back in roughly 2013, after the first convolutional networks. They noticed that you can actually take many computer vision data sets.

S2

Speaker 2

10:48

And it used to be that you would compete on all of these kind of separately and design features maybe for some of these separately. And you can just shortcut all those steps that we had designed. And you can just take these pre-trained features that you get from ImageNet, and you can just train a linear classifier on every single data set on top of those features, and you obtain many state-of-the-art results across many different data sets. And so this was quite a remarkable finding back then, I believe.

S2

Speaker 2

11:12

So things worked very well on ImageNet. Things transferred very well. And the code complexity, of course, got much more manageable. So now all this power is actually available to you with very few lines of code.

S2

Speaker 2

11:23

If you want to just use a convolutional network on images, it turns out to be only a few lines of code. If you use, for example, Keras, which is one of the deep learning libraries that I'll mention again later in the talk, you basically just load a state-of-the-art convolutional neural network, take an image, load it, and compute your predictions.

S2

Speaker 2

11:41

And it tells you that this is an African elephant inside that image. And this takes a couple hundred milliseconds, or a couple tens of milliseconds if you have a GPU. And so everything got much faster and much simpler; it works really well and transfers really well. So this was really a huge advance in computer vision.
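To give a sense of the "few lines of code" point, here is a minimal sketch using the Keras applications API. The choice of VGG16, the import paths, and the file name elephant.jpg are assumptions for illustration; this is not the exact snippet shown in the talk.

```python
# Minimal sketch: classify one image with a pretrained ImageNet convnet in Keras.
# Assumes TensorFlow/Keras is installed and an image file "elephant.jpg" exists.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")            # load a pretrained convnet

img = image.load_img("elephant.jpg", target_size=(224, 224))
x = image.img_to_array(img)                  # (224, 224, 3) array of raw pixels
x = preprocess_input(x[np.newaxis, ...])     # add a batch dimension and normalize

preds = model.predict(x)                     # 1,000 class probabilities
print(decode_predictions(preds, top=3)[0])   # top guesses, e.g. "African_elephant"
```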

S2

Speaker 2

11:55

And so as a result of all these nice properties, ComNets today are everywhere. So here's a collection of some of the things that I try to find across different applications. So for example, you can search Google Photos for different types of categories, like in this case Rubik's Cube. You can find house numbers very efficiently.

S2

Speaker 2

12:16

Of course, this is very relevant in self-driving cars, where we're doing perception in the cars; convolutional networks are very relevant there. Medical image diagnosis, recognizing Chinese characters, doing all kinds of medical segmentation tasks. Quite random tasks like whale recognition, and more generally, many Kaggle challenges.

S2

Speaker 2

12:34

Satellite image analysis, recognizing different types of galaxies. You may have seen recently WaveNet from DeepMind, also a very interesting paper, where they generate music and they generate speech. So this is a generative model, and it's also just a convnet doing most of the heavy lifting here. It's a convolutional network on top of sound.

S2

Speaker 2

12:53

And other tasks like image captioning. In the context of reinforcement learning and agent-environment interactions, we've also seen a lot of advances using ConvNets as the core computational building block. So when you want to play Atari games, or AlphaGo, or Doom, or StarCraft, or if you want to get robots to perform interesting manipulation tasks, all of this uses ConvNets as a core computational block to do very impressive things. Not only are we using them for a lot of different applications, we are also finding uses in art.

S2

Speaker 2

13:26

So here are some examples from Deep Dream, where you can basically simulate what it looks like, or maybe what it feels like, to be on some drugs; you can take images and just hallucinate features using convnets. Or you might be familiar with neural style, which allows you to take arbitrary images and transfer arbitrary styles of different paintings, like Van Gogh, on top of them, and this is all using convolutional networks. The last thing I'd like to note, which I also find interesting, is that in the process of trying to develop better computer vision architectures and trying to basically optimize for performance on the ImageNet challenge, we've actually ended up converging to something that potentially might function somewhat like your visual cortex in some ways. And so these are some of the experiments that I find interesting, where they've studied macaque monkeys and recorded from a subpopulation of the IT cortex.

S2

Speaker 2

14:13

This is the part that does a lot of object recognition. And so they record. So basically, they take a monkey and they take a convnet, and they show them images. And then you look at how those images are represented at the end of this network.

S2

Speaker 2

14:24

So inside the monkey's brain, or at the top of your convolutional network. And so you look at representations of different images, and it turns out that there's a mapping between those two spaces that actually seems to indicate, to some extent, that some of the things we're doing somehow ended up converging to something that the brain could be doing as well in the visual cortex. So that's just some intro. I'm now going to dive into convolutional networks and try to explain briefly how these networks work.

S2

Speaker 2

14:49

Of course, there's an entire class on this that I taught, which is a convolutional networks class. And so I'm going to distill some of those 13 lectures into 1 lecture. So we'll see how that goes. I won't cover everything, of course.

S2

Speaker 2

15:01

OK. So a convolutional neural network is really just a single function: a function from the raw pixels of some kind of an image. So we take a 224 by 224 by 3 image.

S2

Speaker 2

15:12

So 3 here is for the color channels, RGB. You take the raw pixels, you put them through this function, and you get 1,000 numbers at the end, in the case of image classification, if you're trying to categorize images into 1,000 different classes. And really, functionally, all that's happening in a convolutional network is just dot products and max operations.

S2

Speaker 2

15:30

That's everything. But they're wired up together in interesting ways so that you are basically doing visual recognition. And in particular, this function f has a lot of knobs in it. So these W's here that participate in these dot products and in these convolutions and fully connected layers and so on, these W's are all parameters of this network.

S2

Speaker 2

15:48

So normally, you might have about on the order of 10 million parameters. And those are basically knobs that change this function. And so we'd like to change those knobs, of course, so that when you put images through that function, you get probabilities that are consistent with your training data. And so that gives us a lot to tune.

S2

Speaker 2

16:05

And it turns out that we can do that tuning automatically with backpropagation through that search process. Now, more concretely, a convolutional neural network is made up of a sequence of layers, just as in the case of normal neural networks. But we have different types of layers that we play with. So we have convolutional layers.

S2

Speaker 2

16:21

Here, I'm using the rectified linear unit, ReLU for short, as a non-linearity, so I'm making that explicitly its own layer. Then pooling layers and fully connected layers. The core building block of the convolutional network is the convolutional layer, and we have nonlinearities interspersed.

S2

Speaker 2

16:42

We are getting rid of the pooling layers, and fully connected layers can be represented as, and are basically equivalent to, convolutional layers as well. And so really, it is just a sequence of conv layers in the simplest case. So let me explain the convolutional layer, because that is the core computational building block here that does all the heavy lifting. So, the entire convnet is this collection of layers, and these layers don't function over vectors.

S2

Speaker 2

17:05

So they don't transform vectors as in a normal neural network; they function over volumes. So a layer will take a volume, a three-dimensional volume of numbers, an array. In this case, for example, we have a 32 by 32 by 3 image, so those three dimensions are the width, the height, and I'll refer to the third dimension as the depth. We have 3 channels. That's not to be confused with the depth of a network, which is the number of layers in that network.

S2

Speaker 2

17:26

So this is just the depth of a volume. So this convolutional layer accepts a three-dimensional volume, and it produces a three-dimensional volume using some weights. The way it actually produces this output volume is as follows: we are going to have these filters in a convolutional layer. These filters are always small spatially, say, for example, a 5 by 5 filter, but their depth always extends through the full depth of the input volume.

S2

Speaker 2

17:51

So since the input volume has 3 channels, the depth is 3, then our filters will always match that number. So we have depth of 3 in our filters as well. And then we can take those filters, and we can basically convolve them with the input volume. So what that amounts to is we take this filter.

S2

Speaker 2

18:09

Oh, yeah. So that's just the point that the channels here must match. We take that filter, and we slide it through all spatial positions of the input volume. And along the way, as we're sliding this filter, we're computing dot products.

S2

Speaker 2

18:19

So W transpose X plus B, where W are the filters, and X is a small piece of the input volume, and B is the offset. And so this is basically the convolutional operation, you're taking this filter and sliding it through at all spatial positions, and you are computing dot products. So when you do this, you end up with this activation map. So in this case, we get a 28 by 28 activation map.

S2

Speaker 2

18:41

28 comes from the fact that there are 28 unique positions to place this 5 by 5 filter into this 32 by 32 space. So there are 28 by 28 unique positions you can place that filter in, and in every one of those you are going to get a single number for how well that filter likes that part of the input. So that carves out a single activation map. And in a convolutional layer, we don't have a single filter but a set of filters. So here is a green filter; we slide it through the input volume; it has its own parameters. There are 75 numbers that make up a filter, and these are a different 75 numbers; we convolve them through, get a new activation map, and we continue doing this for all the filters in that convolutional layer.
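To make the sliding dot product concrete, here is a naive NumPy sketch of this convolution with the shapes from the running example (stride 1, no padding). The random arrays just stand in for a real image and learned filters, and real libraries implement this far more efficiently.

```python
import numpy as np

# Running example from the talk: a 32x32x3 input volume and six 5x5x3 filters.
x = np.random.randn(32, 32, 3)           # input volume (height, width, depth)
filters = np.random.randn(6, 5, 5, 3)    # 6 filters, each 5x5 spatially, depth 3
biases = np.zeros(6)

out = np.zeros((28, 28, 6))              # 28 = 32 - 5 + 1 unique positions per axis
for k in range(6):                       # one activation map per filter
    for i in range(28):
        for j in range(28):
            patch = x[i:i+5, j:j+5, :]   # small piece of the input volume
            out[i, j, k] = np.sum(patch * filters[k]) + biases[k]  # w^T x + b

print(out.shape)  # (28, 28, 6): the re-represented "image" with 6 channels
```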

S2

Speaker 2

19:28

So, for example, if we had 6 filters in this convolutional layer, we might end up with six 28 by 28 activation maps, and we stack them along the depth dimension to arrive at the output volume of 28 by 28 by 6. And so really what we've done is we've re-represented the original image, which is 32 by 32 by 3, into a kind of new image that is 28 by 28 by 6, where this image has 6 channels that tell you how well every filter matches or likes every part of the input image.

So let's compare this operation to, say, using a fully connected layer as you would in a normal neural network. In particular, we saw that we processed a 32 by 32 by 3 volume into a 28 by 28 by 6 volume. And one question you might want to ask is: how many parameters would this require if we wanted a fully connected layer with the same number of output neurons here, that is, 28 times 28 times 6 neurons, fully connected? How many parameters would that be? Turns out that would be quite a few parameters, right?

S2

Speaker 2

20:28

Because every single neuron in the output volume would be fully connected to all of the 32 by 32 by 3 numbers here. So basically, every 1 of those 28 by 28 by 6 neurons is connected to 32 by 32 by 3. Turns out to be about 15 million parameters, and also on that order of number of multiplies. So you're doing a lot of compute, and you're introducing a huge amount of parameters into your network.

S2

Speaker 2

20:50

Now, since we're doing convolution instead, think about the number of parameters that we've introduced with this example convolutional layer. We had 6 filters, and every one of them was a 5 by 5 by 3 filter. So basically, we just have 5 by 5 by 3 filters, and we have 6 of them.

S2

Speaker 2

21:09

If you just multiply that out, we have 450 parameters. And in this, I'm not counting the biases. I'm just counting the raw weights. So compared to 15 million, we've only introduced very few parameters.

S2

Speaker 2

21:19

Also, how many multiplies have we done? So computationally, how many flops are we doing? Well, we have 28 by 28 by 6 outputs to produce, and every 1 of these numbers is a function of a 5 by 5 by 3 region in the original image. So basically, we have 28 by 28 by 6.

S2

Speaker 2

21:35

And then every one of them is computed by doing 5 times 5 times 3 multiplies. So you end up with only on the order of 350,000 multiplies. So we've gone from about 15 million down to a few hundred thousand. So we're doing fewer flops, and we're using fewer parameters.
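For reference, here is the arithmetic behind the numbers quoted above (biases ignored, as in the talk):

```python
# Arithmetic behind the numbers above (biases ignored, as in the talk).
fc_params = (28 * 28 * 6) * (32 * 32 * 3)   # every output neuron sees every input number
conv_params = 6 * (5 * 5 * 3)               # six 5x5x3 filters, shared across all positions
conv_mults = (28 * 28 * 6) * (5 * 5 * 3)    # one 5x5x3 dot product per output number
print(fc_params, conv_params, conv_mults)   # 14450688 (~15M), 450, 352800 (~350K)
```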

S2

Speaker 2

21:50

And really, what we've done here is we've made assumptions, right? A fully connected layer could compute the exact same thing: a specific setting of those 15 million parameters would actually produce the exact output of this convolutional layer. But we've done it much more efficiently.

S2

Speaker 2

22:08

We've done that by introducing these biases, these assumptions. In particular, since we have these fixed filters that we're sliding across space, we've assumed that if there's some interesting feature that you'd like to detect in one part of the image, like, say, the top left, then that feature will also be useful somewhere else, like the bottom right, because we fix these filters and apply them at all the spatial positions equally.

S2

Speaker 2

22:30

You might notice that this is not always something that you might want. For example, if you're getting inputs that are centered face images, and you're doing some kind of face recognition or something like that, then you might want different filters at different spatial positions. Say, for eye regions, you might want to have some eye-like filters, and for the mouth region, you might want to have mouth-specific features, and so on. And in that case, you might not want to use a convolutional layer, because those features have to be shared across all spatial positions.

S2

Speaker 2

22:55

And the second assumption that we made is that these filters are small and local. So we don't have global connectivity; we have this local connectivity. But that's OK, because we end up stacking these convolutional layers in sequence.

S2

Speaker 2

23:06

And so the neurons at the end of the convnet will grow their receptive field as you stack these convolutional layers on top of each other. So at the end of the convnet, those neurons end up being a function of the entire image, eventually. So just to give you an idea about what these activation maps look like concretely, here's an example of an image on the top left. This is a part of a car, I believe.

S2

Speaker 2

23:26

And we have these different filters; we have 32 different small filters here. And if we convolve these filters with this image, we end up with these activation maps. So this filter, if you convolve it, you get this activation map, and so on. So this one, for example, has some orange stuff in it.

S2

Speaker 2

23:40

So when we convolve with this image, you see that this white here is denoting the fact that that filter matches that part of the image quite well. And so we get these activation maps. You stack them up. And then that goes into the next convolutional layer.

S2

Speaker 2

23:53

So the way this looks like then is that we've processed this with some kind of a convolutional layer. We get some output. We apply a rectified linear unit, some kind of non-linearity as normal. And then we would just repeat that operation.

S2

Speaker 2

24:06

So we keep plugging these conv volumes into the next convolutional layer. And so they plug into each other in sequence. And so we end up processing the image over time. So that's the convolutional layer.

S2

Speaker 2

24:18

You'll notice that there are a few more layers. So in particular, the pooling layer, I'll explain very briefly. Pooling layer is quite simple. If you've used Photoshop or something like that, you've taken a large image and you've resized it, you've down sampled the image.

S2

Speaker 2

24:32

Well, pooling layers do basically something exactly like that, but they're doing it on every single channel independently. So for every 1 of these channels independently in an input volume, we will pluck out that activation map, we will down sample it, and that becomes a channel in the output volume. So it's really just a downsampling operation on these volumes. So for example, 1 of the common ways of doing this in the context of neural networks, especially, is to use max pooling operation.

S2

Speaker 2

24:57

So in this case, it would be common to, say, for example, use 2 by 2 filters at stride 2 and do a max operation. So if this is an input channel in a volume, then what that amounts to is that we tile it into these 2 by 2 regions, and we take a max over 4 numbers to produce one piece of the output.
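As a sketch, 2 by 2 max pooling with stride 2 on a single channel looks like this in NumPy (shapes from the running example; deep learning libraries provide this as a built-in op, so this loop is purely illustrative):

```python
import numpy as np

def max_pool_2x2(channel):
    """Downsample one activation map with 2x2 windows at stride 2."""
    h, w = channel.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = channel[i:i + 2, j:j + 2].max()  # max over 4 numbers
    return out

a = np.random.randn(28, 28)   # one channel of an input volume
print(max_pool_2x2(a).shape)  # (14, 14): same channel, spatially downsampled
```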

S2

Speaker 2

25:18

OK, so this is a very cheap operation that down samples your volumes. It's really a way to control the capacity of the network. So you don't want too many numbers. You don't want things to be too computationally expensive.

S2

Speaker 2

25:27

It turns out that a pooling layer allows you to down sample your volumes. You're going to end up doing less computation, and it turns out to not hurt the performance too much. So we use them basically as a way of controlling the capacity of these networks. And the last layer that I want to briefly mention, of course, is the fully connected layer, which is exactly what you're familiar with.

S2

Speaker 2

25:45

So we have these volumes throughout as we've processed the image. At the end, you're left with this volume. And now you'd like to predict some classes. So what we do is we just take that volume, we stretch it out into a single column, and then we apply a fully connected layer, which really amounts to just a matrix multiplication.
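A minimal sketch of that last step, flattening the final volume and applying one matrix multiply; the shapes here are illustrative (roughly VGG-sized), and the softmax that turns the scores into probabilities is the one described next:

```python
import numpy as np

v = np.random.randn(7, 7, 512)              # final volume left after the conv/pool stack
x = v.reshape(-1)                           # stretch it out into a single column (25088,)

W = 0.01 * np.random.randn(1000, x.size)    # fully connected layer: one matrix multiply
b = np.zeros(1000)
scores = W @ x + b                          # 1,000 class scores

probs = np.exp(scores - scores.max())       # softmax (described next) -> probabilities
probs /= probs.sum()
print(probs.shape, probs.sum())             # (1000,) 1.0
```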

S2

Speaker 2

25:59

And then that gives us probabilities after applying a softmax or something like that. So let me now show you briefly a demo of what a convolutional network looks like. So this is ConvNetJS. This is a deep learning library for training convolutional neural networks that is implemented in JavaScript.

S2

Speaker 2

26:19

I wrote this maybe 2 years ago at this point. So here we are training a convolutional network on the CIFAR 10 dataset, a dataset of 50,000 images, each image is 32 by 32 by 3, and there are 10 different classes. So here we are training this network in the browser, and you can see that the loss is decreasing, which means that we're better classifying these inputs. And so here's the network specification, which you can play with, because this is all done in the browser.

S2

Speaker 2

26:45

So you can just change this and play with this. So this is an input image, and this convolutional network, I'm showing here all the intermediate activations and all the intermediate, basically activation maps that we are producing. So here we have a set of filters, we are convolving them with the image and getting all the activation maps. I'm also showing the gradients, but I don't want to dwell on that too much.

S2

Speaker 2

27:07

And then you threshold, so the ReLU clamps anything below 0 at 0, and then you pool, so this is just a downsampling operation, and then another convolution, ReLU, pool, conv, ReLU, pool, et cetera, until at the end we have a fully connected layer. And then we have our softmax so that we get probabilities out. And then we apply a loss to those probabilities and backpropagate. And so here we see that I've been training in this tab for the last maybe 30 seconds, or 1 minute, and we are already getting about 30% accuracy on CIFAR-10.

S2

Speaker 2

27:36

These are test images from CIFAR-10 and the outputs of the convolutional network, and you can see it has already learned that this is a car, or something like that, so this trains pretty quickly in JavaScript. So you can play with this and change the architecture and so on. Another thing I'd like to show you is this video, because it gives you, again, a very intuitive, visceral feeling of exactly what this is computing. There is a very good video by Jason Yosinski from a recent advance that I'm going to play in a bit. This is from the deep visualization toolbox. So you can download this code, and you can play with this.

S2

Speaker 2

28:07

It's this interactive convolutional network demo.

S4

Speaker 4

28:10

Neural networks have enabled computers to better see and understand the world.

S3

Speaker 3

28:14

They can recognize school buses and z-plane. Top left corner, we show the input. In this case, the popular deep.

S2

Speaker 2

28:19

So what we're seeing here are activation maps of a particular layer, shown in real time as this demo is running. These are for the conv1 layer of an AlexNet, which we're going to go into in much more detail. But these are the different activation maps that are being produced at this point.

S3

Speaker 3

28:36

Neural network called AlexNet running in Caffe. By interacting with the network, we can see what some of the neurons are doing. For example, on this first layer, a unit in the center responds strongly to light-to-dark edges.

S3

Speaker 3

28:51

Its neighbor, one neuron over, responds to edges in the opposite direction, dark to light. Using optimization, we can synthetically produce images that light up each neuron on this layer to see what each neuron is looking for.

S3

Speaker 3

29:05

We can scroll through every layer in the network to see what it does, including convolution, pooling, and normalization layers. We can switch back and forth between showing the actual activations and showing images synthesized to produce high activation. By the time we get to the fifth convolutional layer, the features being computed represent abstract concepts.

S3

Speaker 3

29:29

For example, This neuron seems to respond to faces. We can further investigate this neuron by showing a few different types of information. First, we can artificially create optimized images using new regularization techniques that are described in our paper. These synthetic images show that this neuron fires in response to a face and shoulders.

S3

Speaker 3

29:46

We can also plot the images from the training set that activate this neuron the most, as well as pixels from those images most responsible for the high activations, computed via the deconvolution technique. This feature responds to multiple faces in different locations. And by looking at the deconv, we can see that it would respond more strongly if we had even darker eyes and rosier lips. We can also confirm that it cares about the head and shoulders, but ignores the arms and torso.

S3

Speaker 3

30:13

We can even see that it fires to some extent for cat faces. Using backprop or deconv, we can see that this unit depends most strongly on a couple of units in the previous layer, conv4, and on about a dozen or so in conv3. Now let's look at another neuron on this layer. So what's this unit doing?

S3

Speaker 3

30:32

From the top 9 images, we might conclude that it fires for different types of clothing. But examining the synthetic images shows that it may be detecting not clothing per se, but wrinkles. In the live plot, we can see that it's activated by my shirt. And smoothing out half of my shirt causes that half of the activations to decrease.

S3

Speaker 3

30:52

Finally, here's another interesting neuron. This one has learned to look for printed text in a variety of sizes, colors, and fonts. This is pretty cool because we never asked the network to look for wrinkles or text or faces.

S3

Speaker 3

31:06

The only labels we provided were at the very last layer, so the only reason the network learned features like text and faces in the middle was to support final decisions at that last layer. For example, the text detector may provide good evidence that a rectangle is, in fact, a book seen on edge. And detecting many books next to each other might be a good way of detecting a bookcase, which was 1 of the categories we trained the net to recognize.

S3

Speaker 3

31:31

In this video, we've shown some of the features of the DeepViz toolbox.

S2

Speaker 2

31:34

Okay, so I encourage you to play with that, it's really fun. So I hope that gives you an idea about exactly what's going on. There are these convolutional layers, we downsample them from time to time.

S2

Speaker 2

31:42

There's usually some fully connected layers at the end, but mostly it's just these convolutional operations stacked on top of each other. So what I'd like to do now is I'll dive into some details of how these architectures are actually put together. The way I'll do this is I'll go over all the winners of the ImageNet challenges, and I'll tell you about the architectures, how they came about, how they differ. And so you'll get a concrete idea about what these architectures look like in practice.

S2

Speaker 2

32:04

So we'll start off with the AlexNet in 2012. So the AlexNet, just to give you an idea about the sizes of these networks and the images that they process, took 227 by 227 by 3 images. And the first layer of the AlexNet, for example, was a convolutional layer that had 11 by 11 filters applied with a stride of 4, and there are 96 of them.

S2

Speaker 2

32:26

Stride of 4 I didn't fully explain because I wanted to save some time. But intuitively, it just means that as you're sliding this filter across the input, you don't have to slide it one pixel at a time; you can actually jump a few pixels at a time. So we have 11 by 11 filters with a stride, a skip, of 4.

S2

Speaker 2

32:41

And we have 96 of them. You can try to compute, for example, what is the output volume if you apply this sort of convolutional layer on top of this volume. And I didn't go into details of how you compute that, but basically, there are formulas for this, and you can look into details in the class. But you arrive at 55 by 55 by 96 volume as output.

S2

Speaker 2

33:02

The total number of parameters in this layer, we have 96 filters. Every 1 of them is 11 by 11 by 3, because that's the input depth of these images. So basically, it just amounts to 11 times 11 times 3, and you have 96 filters, so about 35,000 parameters in the first layer. And then the second layer is a pooling layer, so we apply 3 by 3 filters at stride of 2 and they do max pooling, so you can compute the output volume size of that after applying this to that volume.

S2

Speaker 2

33:34

And if you do some very simple arithmetic there, you arrive at 27 by 27 by 96. This is the downsampling operation. You can think about what the number of parameters in this pooling layer is. And of course, it's 0. Pooling layers compute a fixed function, a fixed downsampling operation; there are no parameters involved in a pooling layer. All the parameters are in the convolutional layers and the fully connected layers, which are, to some extent, equivalent to convolutional layers.
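The sizing arithmetic behind these numbers uses the standard formula output = (W - F + 2P) / S + 1; here is a quick check for these first two AlexNet layers (assuming no padding, as described):

```python
def out_size(w, f, s, p=0):
    """Spatial output size for input width w, filter size f, stride s, padding p."""
    return (w - f + 2 * p) // s + 1

# CONV1: 227x227x3 input, 96 filters of 11x11x3 at stride 4 -> 55x55x96
print(out_size(227, 11, 4))   # 55
print(96 * 11 * 11 * 3)       # 34848, i.e. about 35,000 parameters

# POOL1: 3x3 filters at stride 2 -> 27x27x96, and zero parameters
print(out_size(55, 3, 2))     # 27
```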

S2

Speaker 2

34:01

So we can go ahead and, based on the description in the paper, although it's non-trivial for this particular paper, decipher what the volumes are throughout. You can look at the kinds of patterns that emerge in terms of how you actually increase the number of filters in the higher convolutional layers. So we started off with 96, then we go to 256 filters, then to 384, and eventually 4,096 units in the fully connected layers. You'll also see normalization layers here, which have since become slightly deprecated.

S2

Speaker 2

34:31

It's not very common anymore to use the normalization layers that were used at the time in the AlexNet architecture. What's interesting to note is how this differs from the 1998 Yann LeCun network. In particular, I usually like to think about four things that hold back progress, at least in deep learning: data is one constraint, and compute is another.

S2

Speaker 2

34:52

And then I like to differentiate between algorithms and infrastructure, algorithms being something that feels like research, and infrastructure being something that feels like a lot of engineering has to happen. And so in particular, we've had progress on all four of those fronts. So we see that in 1998, the data you could get hold of would maybe be on the order of a few thousand examples, whereas now we have a few million. So we have three orders of magnitude of increase in the amount of data.

S2

Speaker 2

35:14

Compute: GPUs have become available, and we use them to train these networks. They are roughly 20 times faster than CPUs. And then, of course, the CPUs we have today are much, much faster than the CPUs they had back in 1998. So I don't know exactly what that works out to, but I wouldn't be surprised if it's, again, on the order of three orders of magnitude of improvement.

S2

Speaker 2

35:34

I'd like to actually skip over the algorithms and talk about infrastructure. So in this case, we're talking about NVIDIA releasing the CUDA library, which allows you to efficiently run all these matrix-vector operations and apply them to arrays of numbers. So that's a piece of software that we rely on and take advantage of that wasn't available before. And finally, algorithms is kind of an interesting one, because in those 20 years, there's been much less improvement in algorithms than in all these other three pieces.

S2

Speaker 2

36:02

So in particular, what we've done with the 1998 network is we've made it bigger. So you have more channels, and you have a few more layers. And the two really new things algorithmically are dropout and rectified linear units.

S2

Speaker 2

36:16

So dropout is a regularization technique developed by Geoff Hinton and colleagues. And rectified linear units are these nonlinearities that train much faster than sigmoids and tanhs. And this paper actually had a plot that showed that the rectified linear units trained a bit faster than sigmoids. And that's intuitively because of the vanishing gradient problems.

S2

Speaker 2

36:36

And when you have very deep networks with sigmoids, those gradients vanish, as Hugo was talking about in the last lecture. What's also interesting to note, by the way, is that both dropout and ReLU are basically one or two lines of code to change. So it's about a two-line diff in total over those 20 years. And both of them consist of setting things to 0.

S2

Speaker 2

36:56

So with the ReLU, you set things to 0 when they're lower than 0. And with dropout, you set things to 0 at random. So it's a good idea to set things to 0. Apparently, that's what we've learned.
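Both really are one- or two-line changes; here is a minimal sketch. The dropout version shown is the common "inverted dropout" formulation, which rescales at training time rather than at test time; details vary across implementations.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)           # set things to zero when they are below zero

def dropout(x, p_drop=0.5, train=True):
    if not train:
        return x                        # at test time all units are active
    keep = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * keep                     # set things to zero at random (and rescale the rest)

h = relu(np.random.randn(4))
print(dropout(h))                       # roughly half the entries zeroed, the rest scaled up
```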

S2

Speaker 2

37:06

So if you try to find a new cool algorithm, look for one-line diffs that set something to 0; it will probably work better. And we could add you here to this list. Now, some other things to note, again to give you an idea about the hyperparameters that were used in this architecture.

S2

Speaker 2

37:24

It was the first use of rectified linear units. We haven't seen that as much before. This network used the normalization layers, which are not used anymore, at least in the specific way that they use them in this paper. They used heavy data augmentation.

S2

Speaker 2

37:38

So you don't only pipe these images into the networks exactly as they come from the data set, but you jitter them spatially around a bit. And you warp them, and you change the colors a bit, and you just do this randomly. Because you're trying to build in some invariances to these small perturbations. And you're basically hallucinating additional data.
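A sketch of that kind of augmentation is below (random crop, mirror, and a crude brightness jitter). Treat the details as illustrative: the actual AlexNet recipe used 224-pixel crops of 256-pixel images plus a PCA-based color jitter.

```python
import numpy as np

def augment(img, crop=224):
    """Return one randomly jittered view of an image (H, W, 3 float array)."""
    h, w, _ = img.shape
    i = np.random.randint(0, h - crop + 1)       # random spatial jitter
    j = np.random.randint(0, w - crop + 1)
    out = img[i:i + crop, j:j + crop]
    if np.random.rand() < 0.5:                   # random horizontal mirror
        out = out[:, ::-1]
    return out * np.random.uniform(0.9, 1.1)     # crude brightness/color jitter

img = np.random.rand(256, 256, 3)   # stand-in for a 256x256 training image
print(augment(img).shape)           # (224, 224, 3)
```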

S2

Speaker 2

37:54

It was the first real use of dropout. And roughly, you see standard hyperparameters: batch sizes of roughly 128, stochastic gradient descent with momentum, usually 0.9, learning rates of 1e-2, which you reduce in the normal ways, roughly by a factor of 10 whenever the validation error stops improving, and a weight decay of just a bit, 5e-4.
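A minimal sketch of that update rule with the quoted hyperparameters; this is one common formulation of momentum plus L2 weight decay, and frameworks differ on the exact details.

```python
def sgd_momentum_step(w, dw, v, lr=1e-2, mu=0.9, weight_decay=5e-4):
    """One parameter update; w, dw, v are arrays (or floats) of the same shape."""
    dw = dw + weight_decay * w   # L2 weight decay folded into the gradient
    v = mu * v - lr * dw         # velocity accumulates a running history of gradients
    w = w + v
    return w, v

# Schedule: divide lr by ~10 whenever the validation error stops improving.
```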

S2

Speaker 2

38:21

And ensembling always helps. So you train seven independent convolutional networks separately, and then you just average their predictions. It always gives you an additional 2% improvement. So this is AlexNet, the winner of 2012.

In 2013, the winner was the ZFNet, developed by Matthew Zeiler and Rob Fergus. And this was an improvement on top of the AlexNet architecture. In particular, one of the bigger differences here is that the first convolutional layer went from 11 by 11 with stride 4 to 7 by 7 with stride 2. And they noticed that these convolutional layers in the middle, if you make them larger, if you scale them up, then you actually gain performance. So they managed to improve a tiny bit. Matthew Zeiler then became the founder of Clarifai, and he worked on this a bit more inside Clarifai and managed to push the performance to 11%, which was the winning entry at the time.

S2

Speaker 2

39:17

But we don't actually know what gets you from 14% to 11%, because Matthew never disclosed the full details of what happened there. He did say that it was more tweaking of these hyperparameters and optimizing things a bit. So that was the 2013 winner. In 2014, we saw a slightly bigger diff on top of this.

S2

Speaker 2

39:34

So one of the networks introduced then was the VGGNet from Karen Simonyan and Andrew Zisserman. They explored a few architectures here, and the one that ended up working best was this D column, which is why I'm highlighting it. What's beautiful about the VGGNet is that it's so simple. You might have noticed that in these previous networks, you have different filter sizes, different layers, different amounts of stride, and everything kind of looks a bit hairy, and you're not sure where these hyperparameters are coming from.

S2

Speaker 2

40:00

VGGNet is extremely uniform. All you do is 3 by 3 convolutions with stride 1, pad 1, and 2 by 2 max poolings with stride 2. And you do this throughout, a completely homogeneous architecture; you just alternate a few conv and a few pool layers, and you get top performance. So they managed to reduce the error down to 7.3% with the VGGNet, just with a very simple and homogeneous architecture.
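To show just how homogeneous that "D" configuration written out on the slide is, it can be summarized as nothing but 3 by 3 convs and 2 by 2 max pools; this is the commonly cited VGG-16 layout, with the filter counts from the paper and the list notation being my own shorthand.

```python
# VGG-16 ("D"): numbers are 3x3 conv filter counts (stride 1, pad 1),
# "M" is a 2x2 max pool with stride 2; three fully connected layers finish it off.
vgg16_cfg = [64, 64, "M",
             128, 128, "M",
             256, 256, 256, "M",
             512, 512, 512, "M",
             512, 512, 512, "M"]
fc_layers = [4096, 4096, 1000]
```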

S2

Speaker 2

40:24

I've also here written out this D architecture, just so you can see. I'm not sure how instructive this is, because it's kind of dense. But you can definitely see, and you can look at this offline, perhaps, but you can see how these volumes develop, and you can see the kinds of sizes of these filters. So they're always 3 by 3, but the number of filters, again, grows.

S2

Speaker 2

40:43

So we started off with 64, and then we go to 128, 256, 512. So we're just doubling it over time. I also have a few numbers here just to give you an idea of the scale at which these networks normally operate. So we have on the order of 140 million parameters.

S2

Speaker 2

40:58

This is actually quite a lot. I'll show you in a bit that this can be about 5 or 10 million parameters, and it works just as well. And it's about 100 megabytes per image in terms of memory in the forward pass. And then the backward pass also needs roughly that order.

S2

Speaker 2

41:12

So that's roughly the numbers that we're working with here. Also, you can note, and this is true of most convolutional networks, that most of the memory is in the early convolutional layers, while most of the parameters, at least when you use these giant fully connected layers at the top, are in those fully connected layers. So the winner, actually, in 2014 was not the VGGNet.

S2

Speaker 2

41:31

I only present it because it's such a simple architecture. But the winner was actually GoogLeNet, with a slightly hairier architecture, we should say. It's still a sequence of things, but in this case, they've put Inception modules in sequence.

S2

Speaker 2

41:44

And this is an example of an Inception module. I don't have too much time to go into the details, but you can see that it consists basically of convolutions with different kinds of strides and so on. So GoogLeNet looks slightly hairier, but it turns out to be more efficient in several respects. For example, it worked a bit better than VGGNet, at least at the time.

S2

Speaker 2

42:06

It only has 5 million parameters compared to VGGnet's 140 million parameters, so a huge reduction. And you do that, by the way, by just throwing away fully connected layers. So you'll notice in this breakdown I did, these fully connected layers here have 100 million parameters and 16 million parameters. Turns out you don't actually need that.

S2

Speaker 2

42:22

So if you take them away, that actually doesn't hurt performance too much, and you get a huge reduction in parameters. We can also compare to the original AlexNet: compared to the original AlexNet, we have fewer parameters, a bit more compute, and much better performance.

S2

Speaker 2

42:40

So GoogLeNet was really optimized to have a low footprint, memory-wise, computation-wise, and parameter-wise. But it looks a bit uglier. And VGGNet is a very beautiful, homogeneous architecture, but there are some inefficiencies in it. OK, so that's 2014.

S2

Speaker 2

42:56

Now, in 2015, we had a slightly bigger delta on top of these architectures. Up to this point, if Yann LeCun had looked at these architectures back in 1998, he would still recognize everything. Everything looks very simple; you've just played with hyperparameters.

S2

Speaker 2

43:08

So one of the first bigger departures, I would argue, came in 2015 with the introduction of residual networks. And this is work from Kaiming He and colleagues at Microsoft Research Asia. They did not only win the ImageNet challenge in 2015, they won a whole bunch of challenges, and this was all by applying these residual networks that were trained on ImageNet and fine-tuned on all of these different tasks; you can crush a lot of different tasks whenever you get a new awesome convnet. So at this time the performance was basically 3.57% from these residual networks.

S2

Speaker 2

43:42

So this is 2015. Also this paper tried to argue that if you look at the number of layers, it goes up. And then they made the point that with residual networks, as we'll see in a bit, you can introduce many more layers and that that correlates strongly with performance. We've since found that, in fact, you can make these residual networks quite a lot shallower, like say on the order of 20 or 30 layers, and they work just as well.

S2

Speaker 2

44:05

So it's not necessarily the depth here, but I'll go into that in a bit. But you get a much better performance. What's interesting about this paper is this plot here, where they compare these residual networks, and I'll go into details of how they work in a bit, with what they call plain networks, which is everything I've explained until now. And the problem with plain networks is that when you try to scale them up and introduce additional layers, they don't get monotonically better.

S2

Speaker 2

44:28

So if you take a 20-layer model, and this is on CIFAR-10 experiments. If you take a 20 layer model and you run it, and then you take a 56 layer model, you'll see that the 56 layer model performs worse. And this is not just on the test data, so it's not just an overfitting issue. This is on the training data.

S2

Speaker 2

44:45

The 56-layer model performs worse on the training data than the 20-layer model, even though the 56-layer model could imitate the 20-layer model by setting 36 of its layers to compute identities. So basically, it's an optimization problem: you can't find the good solution once your problem grows that much bigger in this plain-net architecture. So in the residual networks that they proposed, they found that when you wire them up in a slightly different way, you monotonically get better performance as you add more layers. So more layers is always strictly better, and you don't run into these optimization issues.

S2

Speaker 2

45:19

So comparing residual networks to plain networks: in plain networks, as I've explained already, you have this sequence of convolutional layers, where every convolutional layer operates over the volume before it and produces a new volume. In residual networks, we have this first convolutional layer on top of the raw image, then there's a pooling layer. So at this point, we've reduced the original image to 56 by 56 by 64.

S2

Speaker 2

45:41

And then from here on, they have these residual blocks with these funny skip connections. And this turns out to be quite important. So let me show you what these look like. So the original Kaiming He paper had this architecture here, shown under "original".

S2

Speaker 2

45:56

So on the left, you see the original residual network design. Since then, they had an additional paper that played with the architecture and found that there's a better arrangement of layers inside this block that works better empirically. And the way this works, so concentrate on the proposed one in the middle since that works so well, is that you have this pathway where you have this representation of the image x. And then instead of transforming that representation x to get a new x to plug in later, you go off and do some computation on the side, that's the residual block doing some computation, and then you add your result on top of x.

S2

Speaker 2

46:33

So you have this addition operation here going to the next residual block. So you have this x, and you always compute deltas to it. And I think it's not intuitive that this should work much better or why that works much better. I think it becomes a bit more intuitively clear if you actually understand the backpropagation dynamics and how backprop works.
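In code, the difference between a plain block and a residual block is just that one addition. Here is a minimal sketch; the two small fully connected layers inside the block are a schematic stand-in for the actual conv/batch-norm arrangement in the papers.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def block_fn(x, w1, w2):
    """The computation done 'on the side' (schematic: two small layers)."""
    return relu(x @ w1) @ w2

def plain_block(x, w1, w2):
    return block_fn(x, w1, w2)       # the representation x is replaced outright

def residual_block(x, w1, w2):
    return x + block_fn(x, w1, w2)   # the block only computes a delta on top of x

d = 64
x = np.random.randn(1, d)
w1, w2 = 0.01 * np.random.randn(d, d), 0.01 * np.random.randn(d, d)
print(residual_block(x, w1, w2).shape)  # (1, 64): same shape, x plus a small nudge
```

Because the output is x plus the block's contribution, the gradient flowing back reaches x both directly and through the block, which is the "gradient distributor" point made next.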

S2

Speaker 2

46:50

And this is why I always urge people to implement backprop themselves, to get an intuition for how it works, what it's computing, and so on. Because if you understand backprop, you'll see that the addition operation is a gradient distributor. So you get a gradient from the top, and this gradient will flow equally to all the children that participated in that addition. So you have gradient flowing here from the supervision.
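
A tiny autograd check of this "gradient distributor" behavior; the numbers are arbitrary:

```python
import torch

# Addition passes the incoming gradient unchanged to everything that fed into it.
x = torch.tensor(2.0, requires_grad=True)   # the "stream" x
f = torch.tensor(3.0, requires_grad=True)   # stands in for the residual branch output F(x)
y = x + f                                   # the skip-connection add
y.backward(gradient=torch.tensor(5.0))      # pretend a gradient of 5 arrives from above
print(x.grad, f.grad)                       # tensor(5.) tensor(5.): both get the full gradient
```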

S2

Speaker 2

47:10

So you have supervision at the very bottom here in this diagram, and it kind of flows upwards. It flows through these residual blocks and then gets added to the stream. And this addition distributes that gradient identically through to both inputs. So what you end up with is this kind of gradient superhighway, as I like to call it, where the gradients from your supervision go directly to the original convolutional layer.

S2

Speaker 2

47:30

And then on top of that, you get these deltas from all the residual blocks. So these blocks can come online and help out that original stream of information. This is also related to, I think, why LSTMs, long short-term memory networks, work better than plain recurrent neural networks, because they also have these kinds of addition operations in the cell. And that just makes the gradients flow significantly better.

S2

Speaker 2

47:54

Then there were some results on top of residual networks that I thought were quite amusing. So recently, for example, we had this result on deep networks with stochastic depth. The idea here was that the authors of this paper noticed that you have these residual blocks that compute deltas on top of your stream. And you can basically randomly throw out layers.

S2

Speaker 2

48:13

So you have these, say, 100 residual blocks, and you can randomly drop them out during training. And at test time, similar to dropout, you introduce all of them, and they all work at the same time, but you have to scale things a bit, just like with dropout. But basically, it's kind of an unintuitive result, because you can throw out layers at random, and I think it breaks the original notion we had of convnets as these feature transformers that compute more and more complex features over time, or something like that. And I think it seems much more intuitive, at least to me, to think about these residual networks as some kind of dynamical system, where you have this original representation of the image x, and then every single residual block is kind of like a vector field, because it computes a delta on top of your signal.
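
A minimal sketch of the random dropping and test-time scaling just described, assuming a PyTorch-style residual branch; the wrapped body and the survival probability are illustrative choices (the paper actually uses a depth-dependent schedule for the survival probability):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Sketch: a residual block that is randomly skipped during training."""

    def __init__(self, body: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.body = body                    # any residual branch
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.body(x)     # block is "on" for this pass
            return x                        # block dropped: pure identity
        # test time: every block is active, scaled by its survival probability
        return x + self.survival_prob * self.body(x)
```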

S2

Speaker 2

49:00

And so these vector fields nudge your original representation x towards a space where you can decode the answer y, the class of that x. And so if you drop some of these residual blocks at random, so one of these vector fields hasn't been applied, then the other vector fields that come later can kind of make up for it: they pick up the slack and nudge the representation along anyway. And so that's the image I currently have in mind of how these things work.

S2

Speaker 2

49:25

So it's much more like a dynamical system. In fact, another experiment that people are playing with, which I also find interesting, is that you can share these residual blocks. So it starts to look more like a recurrent neural network. So these residual blocks would have shared connectivity.

S2

Speaker 2

49:38

And then you have this dynamical system, really, where you're just running a single RNN, a single vector field that you keep iterating over and over. And then your fixed point gives you the answer. So it's kind of interesting what's happening. It looks very funny.
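
A sketch of that "single iterated vector field" view, with one weight-tied block applied repeatedly; the channel count, the body, and the number of steps are made up for illustration:

```python
import torch
import torch.nn as nn

# One residual body with shared weights, applied over and over, like a
# recurrent net iterating a single update rule.
channels = 64
shared_body = nn.Sequential(
    nn.BatchNorm2d(channels),
    nn.ReLU(),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
)

def iterate(x, steps=10):
    for _ in range(steps):
        x = x + shared_body(x)   # the same "vector field" nudges x at every step
    return x

x = torch.randn(1, channels, 56, 56)
y = iterate(x)                   # same shape as x
```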

S2

Speaker 2

49:52

We've had many more interesting results. So people are playing a lot with these residual networks and improving on them in various ways. So as I mentioned already, it turns out that you can make these residual networks much shallower and make them wider. So you introduce more channels.

S2

Speaker 2

50:06

And that can work just as well, if not better. So it's not necessarily depth that is giving you a lot of the performance. You can scale down the depth. And if you increase the width, that can actually work better.

S2

Speaker 2

50:18

And they're also more efficient if you do it that way. There are more funny regularization techniques. Here, Swapout is a funny regularization technique that actually interpolates between plain nets, ResNets, and dropout. So that's also a fun paper.

S2

Speaker 2

50:31

We have FractalNets. We actually have many more different types of nets. And so people have really experimented with this a lot. I'm really eager to see what the winning architecture will be in 2016 as a result of a lot of this.

S2

Speaker 2

50:41

One of the things that has really enabled this rapid experimentation in the community is that we've somehow, luckily, developed this culture of sharing a lot of code among ourselves. So for example, Facebook has released residual network code in Torch that is really good, that a lot of these papers, I believe, have adopted and built on top of, and that allowed them to really scale up their experiments and explore different architectures. So it's great that this has happened. Unfortunately, a lot of these papers are coming out on arXiv, and it's kind of a chaos as these are being uploaded.

S2

Speaker 2

51:14

So at this point, I think this is a natural point to very briefly plug my arxiv-sanity.com. So this is the best website ever. And what it does is it crawls arXiv, and it takes all the papers, analyzes the full text of the papers, and creates TF-IDF bag-of-words features for all the papers. And then you can do things like search for a particular paper, like the residual networks paper here, and look for similar papers on arXiv.

S2

Speaker 2

51:38

And so this is a sorted list of basically all the residual network papers that are most related to that paper. Or you can also create a user account, and you can create a library of papers that you like. And then Arxiv Sanity will train a support vector machine for you. And basically, you can look at which arXiv papers over the last month you would enjoy the most.

S2

Speaker 2

51:55

And that's just computed by Arxiv Sanity. And so it's like a curated feed specifically for you. So I use this quite a bit, and I find it useful. So I hope that other people do as well.
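
For a sense of the kind of pipeline being described, here is a small sketch using scikit-learn. This is not the actual Arxiv Sanity code, and the paper texts and labels below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

papers = [
    "deep residual learning for image recognition ...",       # placeholder full texts
    "identity mappings in deep residual networks ...",
    "very deep convolutional networks for large-scale image recognition ...",
]

vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(papers)                  # one TF-IDF vector per paper

# "similar papers": rank everything by cosine similarity to a query paper
similar = cosine_similarity(features[0], features).ravel()

# "personalized feed": label the papers in your library as positives, train an SVM,
# and sort new papers by the SVM score
in_library = [1, 1, 0]
svm = LinearSVC().fit(features, in_library)
feed_scores = svm.decision_function(features)
```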

S2

Speaker 2

52:04

OK, so we saw convolutional neural networks. I explained how they work. I explained some of the background context. I've given you an idea of what they look like in practice.

S2

Speaker 2

52:13

And we went through case studies of the winning architectures over time. But so far, we've only looked at image classification specifically. So we're categorizing images into some number of bins. So I'd like to briefly talk about addressing other tasks in computer vision and how you might go about doing that.

S2

Speaker 2

52:27

So the way to think about doing other tasks in computer vision is that really what we have is this convolutional neural network, which you can think of as a block of compute that has a few million parameters in it. And it can implement basically arbitrary, very nice functions over images. And so it takes an image and gives you some kind of features. And now different tasks will basically look as follows.

S2

Speaker 2

52:50

You want to predict some kind of thing, and in different tasks there will be different things. And you always have a desired thing. And then you want to make the predicted thing much closer to the desired thing.

S2

Speaker 2

52:59

And you backpropagate. So this is the only part, usually, that changes from task to task. You'll see that these conv nets don't change too much. What changes is your loss function at the very end.

S2

Speaker 2

53:07

And that's what actually lets you transfer a lot of these winning architectures. You usually use these pre-trained networks, and you don't worry too much about the details of that architecture, because you're only worried about adding a small piece at the top, or changing the loss function, or substituting a new data set, and so on. So just to make this slightly more concrete: in image classification, we apply this compute block, we get these features, and then if I want to do classification, I would basically predict 1,000 numbers that give me the log probabilities of the different classes. And then I have a predicted thing and a desired thing, the particular class, and I can backprop.
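
A minimal sketch of that "predicted thing vs. desired thing" setup for classification, with made-up feature and batch sizes; the same pattern, swapping only the head and the loss, covers most of the tasks that follow:

```python
import torch
import torch.nn as nn

# Treat the conv net as a feature extractor and bolt a task head on top.
features = torch.randn(8, 512)                  # pretend output of the conv net
head = nn.Linear(512, 1000)                     # predicted thing: 1,000 class scores
targets = torch.randint(0, 1000, (8,))          # desired thing: the true classes

loss = nn.CrossEntropyLoss()(head(features), targets)
loss.backward()                                 # gradients flow into the head (and, in
                                                # practice, into the conv net below it)
```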

S2

Speaker 2

53:39

If I'm doing image captioning, it also looks very similar. Instead of predicting just a vector of 1,000 numbers, I now have, for example, 10,000 words in some kind of vocabulary, and I'd be predicting 10,000 numbers, and a sequence of them. And so I can use a recurrent neural network, which you will hear much more about, I think, in Richard's lecture just after this. And so I produce a sequence of 10,000-dimensional vectors, and that's just the description.

S2

Speaker 2

54:03

And they indicate the probabilities of different words being emitted at different time steps. Or, for example, if you want to do localization, again most of the block stays unchanged, but now we also want some kind of an extent in the image. So suppose we don't just want to classify this as an airplane, but we want to localize it with x, y, width, height bounding box coordinates.

S2

Speaker 2

54:24

And if we make the assumption that there is always a single thing in the image, like a single airplane in every image, then you can afford to predict that. So we predict these softmax scores, just like before, and apply the cross-entropy loss. And then we can predict x, y, width, height on top of that. And we use an L2 loss, or a Huber loss, or something like that.
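
A sketch of the localization head just described, with a cross-entropy loss for the class and a Huber-style loss for the box; all sizes and targets are illustrative:

```python
import torch
import torch.nn as nn

features = torch.randn(8, 512)                   # pretend conv-net features

class_head = nn.Linear(512, 1000)                # class scores
box_head = nn.Linear(512, 4)                     # x, y, width, height

class_loss = nn.CrossEntropyLoss()(class_head(features), torch.randint(0, 1000, (8,)))
box_loss = nn.SmoothL1Loss()(box_head(features), torch.randn(8, 4))  # Huber-like; L2 also works

loss = class_loss + box_loss                     # just add the two losses and backprop
loss.backward()
```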

S2

Speaker 2

54:43

So you just have a predicted thing and a desired thing, and you just backprop. If you want to do reinforcement learning, because you want to play different games, you just predict some different thing, and it has different semantics. So in this case, we would be predicting 8 numbers that give us the probabilities of taking different actions. If, for example, there are 8 discrete actions in Atari, then we predict 8 numbers and train this in a slightly different manner. In the case of reinforcement learning, you don't know what the correct action to take is at any point in time, but you can still get a desired thing eventually, because you run these rollouts over time and you just see what happens.

S2

Speaker 2

55:23

And then that helps inform what the correct answer should have been, or what the desired thing should have been, in any one of those rollouts at any point in time. I don't want to dwell on this too much in this lecture, since it's outside of its scope. You'll hear much more about reinforcement learning in a later lecture. If you wanted to do segmentation, for example, then you don't want to predict a single vector of numbers for a single image.

S2

Speaker 2

55:46

But every single pixel has its own category that you would like to predict. So a dataset will be colored like this, and you have different classes in different areas. And instead of predicting a single vector of classes, you predict an array of 224 by 224, since that's the extent of the original image, times 20, if you have 20 different classes. And then you basically have 224 by 224 independent softmaxes here, and you can pose this and backpropagate. This would be slightly more involved, because you see I have deconv layers mentioned here, and I haven't explained deconvolutional layers.

S2

Speaker 2

56:22

They are related to convolutional layers, they do a similar operation, but backwards in some way. A convolutional layer does these downsampling operations, a deconv layer does these upsampling operations. You can implement a deconv layer using a conv layer. So a deconv forward pass is the conv layer backward pass, and the deconv backward pass is the conv layer forward pass.
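
A minimal sketch of that idea for segmentation: a single transposed-convolution ("deconv") layer upsamples the 7x7 features back to 224x224, with 20 illustrative class channels, and the loss is an independent softmax at every pixel. Real segmentation networks typically use several upsampling stages and skip connections, so this is only illustrative:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 7, 7)                                # pretend conv-net output

upsample = nn.ConvTranspose2d(512, 20, kernel_size=32, stride=32)   # 7x7 -> 224x224
logits = upsample(features)                                         # (1, 20, 224, 224)

target = torch.randint(0, 20, (1, 224, 224))                        # one class label per pixel
loss = nn.CrossEntropyLoss()(logits, target)                        # 224*224 softmaxes at once
loss.backward()
```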

S2

Speaker 2

56:45

So they are basically the identical operation; the difference is whether you are upsampling or downsampling. So you can use deconv layers, or hypercolumns, and there are different things that people do in the segmentation literature, but that's the rough idea: you're just changing the loss function at the end. Or say you wanted to do autoencoders, so you want to do some unsupervised learning or something like that.

S2

Speaker 2

57:02

Well, you're just trying to predict the original image. So you're trying to get the convolutional network to implement the identity transformation. And the trick, of course, that makes it non-trivial is that you're forcing the representation to go through this representational bottleneck of 7 by 7 by 512. So the network must find an efficient representation of the original image so that it can decode it later.
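
A minimal encoder-decoder sketch with such a bottleneck and an L2 reconstruction loss; the layer sizes are made up so that the shapes work out to a 7 by 7 by 512 bottleneck:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),            # 224 -> 56
    nn.Conv2d(64, 512, kernel_size=8, stride=8), nn.ReLU(),          # 56 -> 7 (the bottleneck)
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 64, kernel_size=8, stride=8), nn.ReLU(), # 7 -> 56
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=4),              # 56 -> 224
)

image = torch.randn(1, 3, 224, 224)
reconstruction = decoder(encoder(image))
loss = nn.MSELoss()(reconstruction, image)   # L2 loss against the original image
loss.backward()
```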

S2

Speaker 2

57:20

So that would be an autoencoder. You again have an L2 loss at the end, and you backprop. Or if you want to do variational autoencoders, you have to introduce a reparameterization layer, and you have to append an additional small loss that pushes your posterior towards your prior, but it's just an additional layer and loss. And then you have an entire generative model.

S2

Speaker 2

57:36

And you can actually sample images as well. If you wanted to do detection, things get a little more hairy, perhaps, compared to localization or something like that. So one of my favorite detectors to explain is the YOLO detector, because it's perhaps the simplest one. It doesn't work the best, but it's the simplest one to explain, and it has the core idea of how people do detection in computer vision.

S2

Speaker 2

57:57

And so the way this works is we reduce the original image to a 7 by 7 by 512 feature map. So really, there are these 49 discrete locations that we have. And at every single one of these 49 locations, we're going to predict a class. So that's shown here on the top right.

S2

Speaker 2

58:15

So every single one of these 49 locations will have some kind of a softmax. And then additionally, at every single position, we're going to predict some number of bounding boxes. So there's going to be some number B of bounding boxes. Say B is 10.

S2

Speaker 2

58:28

So we're going to be predicting 50 numbers. And the 5 comes from the fact that every bounding box will have 5 numbers associated with it: you have to describe the x, y, the width, and the height, and you also have to indicate some kind of a confidence for that bounding box.

S2

Speaker 2

58:43

So that's the fifth number, some kind of a confidence measure. So you basically end up predicting these bounding boxes. They have positions. They have class.

S2

Speaker 2

58:50

They have confidence. And then you have some true bounding boxes in the image. So you know that there are certain true boxes, and they have certain classes. And what you do then is you match up the desired thing with the predicted thing. So say, for example, you had one ground-truth bounding box of a cat: then you would find the closest predicted bounding box, you would mark it as a positive, and you would try to make the associated grid cell predict cat.
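
A sketch of what such a detection head's outputs could look like on the 7 by 7 by 512 feature map; B = 10 boxes and C = 20 classes are illustrative, and the matching and loss computation described here are omitted:

```python
import torch
import torch.nn as nn

B, C = 10, 20
features = torch.randn(1, 512, 7, 7)

head = nn.Conv2d(512, B * 5 + C, kernel_size=1)     # per-cell predictions
out = head(features)                                # (1, B*5 + C, 7, 7)

boxes = out[:, :B * 5].reshape(1, B, 5, 7, 7)       # 5 numbers (x, y, w, h, confidence) per box
class_scores = out[:, B * 5:]                       # (1, C, 7, 7): one softmax per grid cell
```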

S2

Speaker 2

59:15

And you would nudge the prediction to be slightly more towards the cat box. And so all of this can be done with simple losses, and you just back propagate that, and then you have a detector. Or if you want to get much more fancy, you could do dense image captioning. So in this case, this is a combination of detection and image captioning.

S2

Speaker 2

59:32

This is a paper with my equal-contribution co-author Justin Johnson, and Fei-Fei Li, from last year. And so what we did here is an image comes in, and the model becomes much more complex. I don't want to go into it too much. But the first-order approximation is that it's basically detection, but instead of predicting fixed classes, we predict a sequence of words.

S2

Speaker 2

59:49

So we use a recurrent neural network there. But basically, you can then take an image, and you can both detect and describe everything in a complex visual scene. So that's just an overview of the model.