1 hour 29 minutes 3 seconds
🇬🇧 English
Speaker 1
00:00
Thank you, everybody, and thanks for coming back very soon after lunch. I'll try to make it entertaining to avoid some post-food coma. So I actually have
Speaker 2
00:10
a lot to owe to Andrew and Chris and my PhD here at Stanford for being here. It's always fun to be back. I figured there's going to be a broad range of capabilities in the room.
Speaker 2
00:24
So I'm sorry I will probably bore some of you for the first 2 thirds of the talk, because I'll go over the basics of what's NLP, natural language processing, what's deep learning, and what's really at the intersection of the 2. And then the last third, I'll talk a little bit about some exciting new research that's happening right now. So let's get started with what is natural language processing. It's really a field at the intersection of computer science, AI, and linguistics.
Speaker 2
00:54
And you could define a lot of goals. And a lot of these statements here, we could really talk and philosophize a lot about, but I'll move through them pretty quickly. For me, the goal of natural language processing is for computers to process or, scare quotes, understand natural language in order to perform tasks that are actually useful for people, such as question answering. The caveat here is that really fully understanding and representing the meaning of language, or even defining it, is quite an elusive goal.
Speaker 2
01:25
So whenever I say the model understands, I'm sorry, I shouldn't say that. Really, these models don't understand anything in the sense that we understand language. So whenever somebody says they can read or represent the full meaning in its entire glory, it's usually not quite true. Really, perfect language understanding is, in some sense, AI-complete, in the sense that you need to understand all of visual input and thought and a lot of other complex things.
Speaker 2
01:54
So a little more concretely, as we try to tackle this overall problem of understanding language, what are the different levels that we often look at? It often, and for many people, starts at speech. And then once you have speech, you might say, all right, now I know what phonemes are, the smaller parts of words, and I understand how words are formed.
Speaker 2
02:14
That's morphology or morphological analysis. Once I know what the meanings of words are, I might try to understand how they are put together in grammatical ways, such that the sentences are understandable, or at least grammatically correct to a lot of speakers of the language. Once we understand the structure, we actually want to get to the meaning. And that's really where most of my interest lies: semantic interpretation, actually trying to get to the meaning in some useful capacity.
Speaker 2
02:45
And then after that, we might say, well, if we understand now the meaning of a whole sentence, how do we actually interact? What's the discourse? How do we have spoken dialogue systems, things like that? Where deep learning has really improved the state of the art significantly is in speech recognition, and syntax, and semantics.
Speaker 2
03:05
And the interesting thing is that we're kind of actually skipping some of these levels. Deep learning often doesn't require morphological analysis to create very useful systems. And in some cases, it actually skips syntactic analysis entirely as well. It doesn't have to know about the grammar.
Speaker 2
03:21
It doesn't have to be taught about what noun phrases are, prepositional phrases. It can actually get straight to some semantically useful tasks right away. And that's going to be 1 of the sort of advantages that we don't have to actually be as inspired by linguistics as traditional natural language processing had to be. So why is NLP hard?
Speaker 2
03:42
Well, there's a lot of complexity in representing and learning, and especially using linguistic, situational, world, and visual knowledge. Really, all of these are connected when it gets to the meaning of language. To really understand what read means, can you do that without visual understanding, for instance? If you have, for instance, this sentence here, Jane hit June, and then she fell, or, and then she ran.
Speaker 2
04:07
Depending on which verb comes after she, the definition, the meaning of she actually changes. And this is 1 subtask you might look at, so-called coreference resolution in general, where you try to understand who she refers to, and it depends on the meaning, again, somewhat scare quotes here, of the verb
Speaker 3
04:31
that follows this pronoun.
Speaker 2
04:33
Similarly, there is a lot of ambiguity. So here we have a very simple sentence of just 4 words: I made her duck.
Speaker 2
04:40
Now, that simple sentence can actually have at least 4 different meanings, if you think about it for a little bit, right? You made her a duck that she'd love for Christmas dinner, you made her duck, like I did just now, and so on. There are actually 4 different meanings, and to know which 1 requires, in some sense, situational awareness or knowledge to really disambiguate what is meant here. So that's sort of the high level of NLP.
Speaker 2
05:10
Now, where does it actually become useful in terms of applications? Well, they actually range from very simple things that we take as a given now,
Speaker 3
05:19
we use them all the time
Speaker 2
05:19
every day, to more and more complex ones that are more in the realm of research. The simple ones are things like spell checking, or keyword search, and finding synonyms. Then the medium-difficulty ones are to extract information from websites, trying to extract product prices or dates and locations, people or company names, so-called named entity recognition. You can go a little bit above that and try to classify reading levels for school text, for instance, or do sentiment analysis, which can be helpful if you have a lot of customer emails that come in and you want to prioritize highly the ones of customers who are really, really annoyed with you right now.
Speaker 2
06:01
And then the really hard ones, and I think in some sense the most interesting ones, are machine translation, trying to actually be able to translate between all the different languages in the world. Question answering, clearly something that is a very exciting and useful piece of technology, especially over very large, complex domains. Can be used for automated email replies. I know pretty much everybody here would love to have some simple automated email reply system.
Speaker 2
06:30
And then spoken dialogue systems. Bots are very hip right now. These are all sort of complex things that are still in the realm of research to do really well. We're making huge progress, especially with deep learning, on these 3, but they're still nowhere near human accuracy. So let's look at the representations.
Speaker 2
06:51
I mentioned we have morphology, and words, and syntax, and semantics, and so on. We can look at 1 example, namely machine translation, and look at how people tried to solve this problem of machine translation. Well, it turns out they tried all of these different levels with varying degrees of success. You can try to have a direct translation of words to other words.
Speaker 2
07:16
The problem is that this is often a very tricky mapping. The meaning of 1 word in English might map to 3 different words in German, and vice versa. You can have 3 different words in English all meaning the same single word in German, for instance. So then people said, well, let's try to maybe do a syntactic transfer where we have whole phrases, like to kick the bucket, which just means sterben in German.
Speaker 2
07:38
OK, not a fun example. And then semantic transfer might be, well, let's try to find a logical representation of the whole sentence, the actual meaning in some human understandable form, and then try to just find another surface representation of that. Now, of course, that will also get rid of a lot of the subtleties of language. And so there are tricky problems in all these kinds of representations.
Speaker 2
08:01
Now, the question is, what does deep learning do? You've already seen at least 2 methods: standard neural networks before, and convolutional neural networks for vision. And in some sense, there's going to be a huge similarity here to these methods, because just like an image is essentially a long list of numbers, a vector, and the hidden state of a standard neural network is also just a vector or a list of numbers, that is also going to be the main representation that we will use throughout, for characters, for words, for short phrases, for sentences, and in some cases for entire documents. They will all be vectors.
Speaker 2
08:43
And with that, we are sort of finishing up the whirlwind of what's NLP. Of course, you could give an entire lecture on almost every single slide I just gave. So we're very, very high level. But we'll continue at that speed to try to squeeze this complex deep learning for NLP subject area into an hour and a half.
Speaker 2
09:05
I think there are 2 basic Lego blocks that are the most important ones to know nowadays in order to be able to sort of creatively play around with more complex models, and those are going to be word vectors, and sequence models, namely recurrent neural networks. And I kind of split this into words, sentences, and multiple sentences, but really, you could use recurrent neural networks for shorter phrases and multiple sentences, but we will see that they have limitations as you move to longer and longer sequences and use the default neural network sequence models. So let's start with words. And maybe 1 last blast from the past here: to represent the meaning of words, we actually used to use taxonomies like WordNet that kind of define each word in relationship to lots of other ones. So you can, for instance, define hypernyms, or is-a relationships.
Speaker 2
10:05
You might say the word panda, for instance, in its first meaning as a noun, basically goes through this complex stack, this directed acyclic graph, most of it is roughly just a tree. And in the end, like everything, it is an entity, but it's actually a physical entity, a type of object. It's a whole object, it's a living thing, it's an organism, animal, and so on. So you basically can define a word like this.
Speaker 2
10:28
At each node of this tree, you actually have so-called synsets, or synonym sets. Here's an example for the synonym set of the word good. Good can have a lot of different meanings. It can actually be an adjective, as well as an adverb, as well as a noun.
Speaker 2
10:47
Now, what are the problems with this kind of discrete representation? Well, they can be great as a resource. If you're a human, you want to find synonyms. But they're never going to be quite sufficient to capture all the nuances that we have in language.
Speaker 2
11:05
So for instance, the synonyms here for good were adept, expert, practiced, proficient, and skillful. But of course, you would use these words in slightly different contexts. You would not use the word expert in exactly all the same contexts as you would use the meaning of good, or the word good. Likewise, it will be missing a lot of new words.
Speaker 2
11:29
Language is this interesting living organism. We change it all the time. You might have some kids, they say YOLO, and all of a sudden, you need to update your dictionary. Likewise, maybe in Silicon Valley, you might see ninja a lot, and now you need to update your dictionary again.
Speaker 2
11:45
And that is basically going to be a Sisyphean job. Nobody will ever be able to really capture all the meanings in this living, breathing organism that language is. So it's also very subjective. Some people might think ninja should just be deleted from the dictionary, and they don't want to include it.
Speaker 2
12:03
I just think nifty or badass is kind of a silly word and should not be included in a proper dictionary, but it's being used in real language and so on. It requires human labor. As soon as you change your domain, you have to ask people to update it. And it's also hard to compute accurate word similarities.
Speaker 2
12:18
Some of these words are subtly different, and it's really a continuum in which we can measure their similarities. So instead, what we're going to use, and what is also the first step for deep learning, we'll actually realize it's not quite deep learning in many cases, but it is sort of the first step to using deep learning in NLP, is distributional similarity. So what does that mean? Basically, the idea is that we'll use the neighbors of a word to represent that word itself.
Speaker 3
12:49
It is a
Speaker 2
12:49
pretty old concept, and here is an example, for instance, for the word banking, we might actually represent banking in terms of all of these other words that are around it. So let's do a very simple example, where we look at a window around each word. And so here, the window length, that's just for simplicity, say it's 1.
Speaker 2
13:11
We represent each word only with the words 1 to the left and 1 to the right of it. We'll just use the symmetric context around each word. And here's a simple example corpus with just 3 sentences; of course, we would always want to use corpora with billions of words instead of just a couple.
Speaker 2
13:29
But just to give you an idea of what's being captured in these word vectors, the corpus is: I like deep learning, I like NLP, and I enjoy flying. And now, this is a very simple so-called co-occurrence statistic. You'll simply see here that for I, for instance, the word like appears twice in its window of size 1, in its context, and the word enjoy appears once in the context.
Speaker 2
13:54
And for like, you have I twice to its left, and deep once, and NLP once. It turns out, if you just take those vectors, this could be a vector representation; each row could be a vector representation for a word. Unfortunately, as soon as your vocabulary increases, that vector dimensionality would change, and hence you'd have to retrain your whole model. It's also very sparse, and really, it's going to be somewhat noisy if you use that vector.
Speaker 2
14:24
Now, another better thing to do might be to run SVD or something similar like PCA dimensionality reduction on such a co-occurrence matrix. And that actually gives you a reasonable first approximation to word vectors. Very old method, works reasonably well. Now, what works even better than simple PCA is actually a model introduced by Tomas Mikolov in 2013, called word2vec.
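To make the counting and the SVD step concrete, here is a minimal sketch, not from the talk itself, that builds the window-1 co-occurrence matrix for that 3-sentence corpus and reduces it with an SVD; all variable names are just illustrative.

```python
import numpy as np

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric window of size 1: count the immediate left/right neighbors of each word.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                X[index[w], index[words[j]]] += 1

# SVD / PCA-style dimensionality reduction: keep the top k singular directions
# as low-dimensional word vectors.
U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * s[:k]
print(word_vectors[index["like"]])
```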
Speaker 2
14:49
So instead of capturing co-occurrence counts directly out of a matrix like that, it'll actually go through each window in a large corpus, take the word that's in the center of each window, and use that to predict the words around it. That way, you can very quickly train. You can train almost online, though few people do this, and add words to your vocabulary very quickly in a streaming fashion. So now let's look a little bit at this model word2vec, because, 1, it's a very simple NLP model, and 2, it's very instructive.
Speaker 2
15:27
We won't go into too many details, but at least look at a couple of equations. So again, the main goal is to predict the surrounding words in a window of some length m, which we define as a hyperparameter, around every word. Now, the objective function will essentially try to maximize here the log probability of any of these context words, given the center word. So we go through our entire corpus of length T, a very long sequence.
Speaker 2
15:52
And at each time step t, we will basically look at all the words j in the context of the current word t, and basically try to maximize here this probability of being able to predict the word that is around the current word t. And theta is all the parameters, namely all the word vectors, that we want to optimize. So now, how do we actually define this probability p here? The simplest way to do this, and this is not the actual way, but it's the simplest and first way to understand and derive this model, is with this very simple inner product here, and that's why we can't quite call it deep.
Speaker 2
16:34
There's not going to be many layers of nonlinearities like we see in deep neural networks. It's really just a simple inner product. And the higher that inner product is, the more likely these 2 will be predicting 1 another. So here, C is the center word, O is the outside word, and basically, this inner product, the larger it is, the more likely we are going to predict this. And these are both just standard n-dimensional vectors.
Speaker 2
17:04
And now, in order to get a real probability, we'll essentially apply softmax to all the potential inner products that you might have in your vocabulary. And 1 thing you will notice here is, well, this denominator is actually going to be a very large sum. We'll want to sum here over all potential inner products for every single window. That would be too slow.
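Written out, the objective and the softmax being described here are roughly the standard skip-gram formulation (my notation, not necessarily the exact notation on the slides):

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right),
\qquad
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```

Here v_c is the center word vector, u_o is the outside (context) word vector, and the denominator is exactly the expensive sum over the whole vocabulary V that the clever approximations avoid.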
Speaker 2
17:25
So now the real methods that we would use are going to approximate the sum in a variety of clever ways. Now, I could literally talk the next hour and a half just about how to optimize the details of this equation, but then we'll all deplete our mental energy for the rest of the day. And so I'm just going to point you to the class I taught earlier this year, CS224D, where we have lots of different slides that go into all the details of this equation, how to approximate it, and then how to optimize it. It's going to be very similar to the way we optimize any other neural network.
Speaker 2
18:00
We're going to use stochastic gradient descent. We're going to look at mini-batches of a couple hundred windows at a time, and update those word vectors, and we are going to take simple gradients of each of
Speaker 3
18:16
these vectors as we go through windows in a large corpus.
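As a reminder, a single such update is roughly a gradient ascent step on the maximization objective above (equivalently, gradient descent on the negative log likelihood), with a learning rate alpha:

```latex
\theta^{\text{new}} = \theta^{\text{old}} + \alpha \, \nabla_{\theta} J(\theta)
```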
Speaker 2
18:18
All right, now, we briefly mentioned PCA-like methods, based on singular value decomposition, or standard PCA. We also had this word2vec model. A model that combines the best of both worlds is GloVe, or global vectors, introduced by Jeffrey Pennington in
Speaker 1
18:38
2014.
Speaker 2
18:40
And it has a very similar idea, and you will notice here, there is some similarity: you have this inner product again, for different pairs, but this model will go over the co-occurrence matrix. Once you have the co-occurrence matrix, it is more efficient to predict once how often 2 words appear next to each other, rather than do it 50 times each time that pair appears in an actual corpus. So in some sense, you can go more efficiently through all the co-occurrence statistics, and you're going to basically try to minimize this subtraction here.
Speaker 2
19:16
And what that basically means is that each inner product will try to approximate the log probability of these 2 words actually co-occurring. Now, you have this function here, which essentially will allow us to not overly weight certain pairs that occur very, very frequently. The word the, for instance, co-occurs with lots of different words. And you want to basically lower the importance of all the words that co-occur with the.
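For reference, the GloVe objective being sketched here has roughly this form, with w_i and the tilde-w_j the two word vectors, the b terms biases, X_ij the co-occurrence count, and f the weighting function that caps very frequent pairs:

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```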
Speaker 2
19:46
So you can train this very fast. It scales to gigantic corpora. In fact, we trained this on Common Crawl, which is a really great data set of most of the internet. It's many billions of tokens.
Speaker 2
20:00
And it also gets very good performance on small corpora because it makes very efficient use of these co-occurrence statistics. And that's essentially what word vectors are always capturing. So if in 1 sentence you just want to remember, every time you hear word vectors in deep learning: 1, they are not quite deep, even though we call them step 1 of deep learning, and 2, they are really just capturing co-occurrence counts. How often does a word appear in the context of other words?
Speaker 2
20:30
So let's look at some interesting results of these GloVe vectors. Here, the first thing we do is look at nearest neighbors. So now that we have these n-dimensional vectors, usually we say n is between 50 and at most
Speaker 1
20:44
500.
Speaker 2
20:45
A good general number is 100 or 200 dimensions. Each word is now represented as a single vector. And so we can look in this vector space for words that appear close by.
Speaker 2
20:57
We started and looked for the nearest neighbors of frog. And well, it turned out these are the nearest neighbors, which was a little confusing since we're not biologists. But fortunately, when you actually look up and Google what those mean, you'll see that they are actually all, indeed, different kinds of frogs. Some appear very rarely in the corpus, and others, like toad, are much more frequent.
Speaker 2
21:23
Now, 1 of the most exciting results that came out of word vectors are actually these word analogies. So the idea here is, can there be linear relationships between different word vectors that simply fall out of very simple addition and subtraction? So the idea here is: man is to woman as king is to what? As in, what is the right analogy when I try to basically fill in here the last missing word?
Speaker 2
22:00
Now, the way we're going to do this is with a very simple cosine similarity. Basically, let's take an example here: we take the vector of woman, we subtract the word vector we learned for man, and we add the word vector of king. And the resulting vector, the arg max of cosine similarity to it, turns out to be queen for a lot of these different models.
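A minimal sketch of that analogy computation, assuming you already have a dictionary vecs mapping words to numpy arrays (the names here are hypothetical):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vecs):
    # a : b :: c : ?   e.g. man : woman :: king : queen
    target = vecs[b] - vecs[a] + vecs[c]
    # Arg max of cosine similarity, excluding the query words themselves.
    candidates = ((w, cosine(v, target)) for w, v in vecs.items() if w not in (a, b, c))
    return max(candidates, key=lambda wv: wv[1])[0]

# analogy("man", "woman", "king", vecs)  # expected to return "queen" with good vectors
```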
Speaker 2
22:26
And that was very surprising. Again, we're capturing co-occurrence statistics. So man might, in its context, often have things like running and fighting and other silly things that men do. And then you subtract those kinds of words from the context and you add them again.
Speaker 2
22:44
In some sense, it's intuitive, though surprising, that it works out that well for so many different examples. So here are some other examples similar to the king and queen example, where we basically took these 200-dimensional vectors and we projected them down to 2 dimensions, again with a very simple method like PCA. And what we find, quite interestingly, is that even in just the first 2 principal components of this space, we have some very interesting sort of female-male relationships.
Speaker 2
23:17
So man to woman is similar to uncle and aunt, brother and sister, sir and madam, and so on. So this is an interesting semantic relationship that falls out of essentially co-occurrence counts in specific windows around each word in a large corpus. Here's another 1 that's more of a syntactic relationship. We actually have here superlatives, like slow, slower, and slowest is in a similar vector relationship to short, shorter, and shortest.
Speaker 2
23:49
Or strong, stronger, and strongest. So this was very exciting, and of course, when you see an interesting qualitative result, you want to try to quantify which models do better in trying to understand these analogies, and what are the different modes and hyperparameters that modify the performance. Now, this is something that you will notice in pretty much every deep learning project ever, which is that more data will give you better performance. That's probably the single most useful thing you can do for a machine learning or deep learning system: train it with more data. And we found that too.
Speaker 2
24:23
Now there are different vector sizes too, which is a common hyperparameter, like I said, usually between 50 and at most 500. Here, 300 dimensions essentially gave us the best performance for these different kinds of semantic and syntactic relationships. Now, in many ways, having a single vector for words can be oversimplifying. Some words have multiple meanings, maybe they should have multiple vectors. Sometimes the word meaning changes over time, and so on.
Speaker 2
24:56
So there's a lot of simplifying assumptions here, but again, our final goal for deep NLP is going to be to create useful systems. And it turns out this is a useful first step to create such systems that mimic some human language behavior in order to create useful applications for us. All right, but words, word vectors are very useful. But words, of course, never appear in isolation.
Speaker 2
25:21
And what we really want to do is understand words in their context. And so this leads us to the second section here on recurrent neural networks. So we already went over the basic definition of standard neural networks. Really the main difference between a standard neural network and a recurrent neural network, which I'll abbreviate as RNN now, is that we will tie the weights at each time step.
Speaker 2
25:48
And that will allow us to essentially condition the neural network on all the previous words, in theory. In practice, given how we can optimize it, it won't really be all the previous words. It'll be more like, at most, the last 30 words. But in theory, this is what a powerful model can do.
Speaker 2
26:04
So let's look at the definition of a recurrent neural network. And this is going to be a very important definition, so we'll go into a little bit of details here. So let's assume for now we have our word vectors as given, and we'll represent each sequence in the beginning as just a list of these word vectors. Now, what we're going to do is we're computing a hidden state, ht, at each time step.
Speaker 2
26:28
And the way we're going to do this is with a simple neural network architecture. In fact, you can think of this summation here as really just a single layer neural network, if you were to concatenate the 2 matrices and these 2 vectors. But intuitively, we basically will map our current word vector at that time step t, and sometimes I use these square brackets to denote that we are taking the word vector from that time step in there. We map that with a linear layer, a simple matrix vector product, and we sum up that matrix vector product with another matrix vector product of the previous hidden state at the previous time step.
Speaker 2
27:11
We sum those 2, and we apply, in this case, a simple sigmoid function to define this standard neural network layer. That will be ht. And now at each time step, we want to predict some kind of class probability over a set of potential events, classes, words, and so on. And we use the standard softmax classifier.
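In equations, the recurrent step and the softmax prediction just described look roughly like this, in the notation commonly used in the course (it may differ slightly from the slides):

```latex
h_t = \sigma\left( W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]} \right),
\qquad
\hat{y}_t = \mathrm{softmax}\left( W^{(S)} h_t \right),
\qquad
\hat{y}_{t,j} = P\left( x_{t+1} = v_j \mid x_t, \ldots, x_1 \right)
```

Here x_[t] is the word vector at time step t, and for language modeling the classes v_j are the words in the vocabulary.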
Speaker 2
27:33
Some other communities call it the logistic regression classifier. So here, we have a simple matrix, Ws, for the softmax weights. The number of rows is basically the number of classes that we have, and the number of columns is the same as the hidden dimension. Sometimes, we want to predict the next word in the sequence in order to be able to identify the most likely sequence.
Speaker 2
28:06
So for instance, if I ask a speech recognition system, what is the price of wood? Now in isolation, if you hear wood, you would probably assume it's the W-O-U-L-D, the auxiliary verb would, but in this particular context, the price of, it wouldn't make sense to have a verb following that. And so it's more likely the W-O-O-D, to find the price of wood. So language modeling is a very useful task, and it's also very instructive to use as an example for where recurrent neural networks really shine.
Speaker 2
28:38
So in our case here, this softmax is going to be quite a large matrix that goes over the entire vocabulary of all the possible words that we have. So each word is going to be a class. The classes for language models are the words in our vocabulary. And so we can define here this y hat t; its jth element basically denotes the probability that the word at the jth index will come next, after all the previous words.
Speaker 2
29:09
It's a very useful model, again, for speech recognition, for machine translation, for just finding a prior for language in general. All right. Again, the main difference to standard neural networks is that we have the same set of weights at all of the different time steps. Everything else is pretty much a standard neural network.
Speaker 2
29:31
We often initialize the first h0 here either randomly or as all zeros. And again, in language modeling in particular, the next word is our class for the softmax. Now we can measure the performance of language models with a term called perplexity, which is based on the average log likelihood of being able to predict the next word. So you want to really give the highest probability to the word that will actually appear next in a long sequence.
Speaker 2
30:09
And then, the higher that probability is, the lower your perplexity, and hence the model is less perplexed to see the next word. In some sense, you can think of language modeling as almost NLP complete in some silly sense that if you can actually predict every single word that follows after any arbitrary sequence of words in a perfect way, you would have disambiguated a lot of things. You can say, for instance, what is the answer to the following question, ask the question, and then the next couple of words would be the predicted answer. So there's no way we can actually ever do a perfect job in language modeling, but there are certain contexts where we can give a high probability to the right next couple of words.
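Concretely, perplexity is usually defined as the exponentiated negative average log likelihood over the corpus, so better prediction means lower perplexity:

```latex
\text{Perplexity} = \exp\left( - \frac{1}{T} \sum_{t=1}^{T} \log P\left( x_{t+1} \mid x_t, \ldots, x_1 \right) \right)
```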
Speaker 2
30:55
Now, this is the standard recurrent neural network, and 1 problem with this is that we will modify the hidden state here at every time step. So even if I have words like the and a and a sentence period and things like that, it will significantly modify my hidden state. Now, that can be problematic. Let's say, for instance, I want to train a sentiment analysis algorithm, and I talk about movies and I talk about the plot for a very long time.
Speaker 2
31:25
Then I say, oh, man, this movie was really wonderful. It was great to watch. And then especially the ending, and you talk again for like 50 time steps or 50 words or 100 words about the plot. Now, all these plot words will essentially modify my hidden state.
Speaker 2
31:38
So if at the end of that whole sequence I want to classify the sentiment, the words wonderful and great that I mentioned somewhere in the middle might be completely gone, because I keep updating my hidden state with all of these content words that talk about the plot. Now, the way to improve this is by using better kinds of recurrent units. And I will introduce here a particular kind, so-called gated recurrent units, introduced by Cho and collaborators.
Speaker 2
32:08
And in some sense, we will learn more about the LSTM tomorrow, when Quoc gives his lecture, but GRUs are a special case of LSTMs. And the main idea is that we want to have the ability to keep certain memories around without having the current input modify them at all. So again, this example of sentiment analysis. I say something's great.
Speaker 2
32:31
That should somehow be captured in my hidden state, and I don't want all of the content words that talk about the plot in the movie review to modify that it is actually overall a great movie. And then we also want to allow error messages to flow at different strengths depending on the input. So if I say, great, I want that to modify a lot of things in the past. So let's define a GRU.
Speaker 2
32:55
Fortunately, since you already know the basic Lego block of a standard neural network, there are only 1 or 2 subtleties here that are different. There are a couple of different steps that we'll need to compute at every time step. So in the standard RNN, what we did was just have this 1 single neural network that we hope would capture all this complexity of the sequence. Instead now, we'll first compute a couple of gates at that time step.
Speaker 2
33:23
So the first thing we will compute is the so-called update gate, just yet another neural network layer, based on the current input word vector and the past hidden state. So these look quite familiar, but this will just be an intermediate value, and we'll call it the update gate. Then we'll also compute a reset gate. It's yet another standard neural network layer.
Speaker 2
33:45
Again, just a matrix vector product, plus another matrix vector product, and some kind of non-linearity here, namely a sigmoid. It's actually important in this case that it is a sigmoid. Basically, both of these will be vectors with numbers that are between 0 and
Speaker 1
33:59
1.
Speaker 2
34:01
Now we'll compute a new memory content, an intermediate h-tilde here, with yet another neural network, but then we have this little funky symbol in here. Basically, this will be an element-wise multiplication. So basically, what this will allow us to do is, if that reset gate is 0, we can essentially ignore all the previous memory elements, and only store the new word information.
Speaker 2
34:30
So, for instance, if I talked for a long time about the plot, and now I say this was an awesome movie, then if the whole goal of this sequence classification model is to capture sentiment, you want to be able to ignore past content. And this is, of course, if this was entirely a 0 vector. Now, this will be more subtle.
Speaker 2
34:52
This is a long vector of maybe 100 or 200 dimensions. So maybe some dimensions should be reset, but others maybe not. And then here, we'll have our final memory that essentially combines these 2 states, the previous hidden state and this intermediate 1 at our current time step. And what this will allow us to do is essentially also say, well, maybe you want to ignore everything that's currently happening and only update the last time step.
Speaker 2
35:21
We basically copy over the previous time step and the hidden state of that and ignore the current thing. Again, a simple example: in sentiment, maybe there's a lot of talk about the plot and when the movie was released; you want to have the ability to ignore that and copy over what may have been said in the beginning, that it was an awesome movie. So here is an attempt at a clean illustration. I have to say, personally, I find the equations a little more intuitive than the visualizations that we try to do, but some people are more visual here. So basically, here we have our word vector, and it goes through different layers, and then some of these layers will essentially modify other outputs of previous time steps.
Speaker 2
36:06
So this is a pretty nifty model, and it is really the second most important basic Lego block that we are going to learn about
Speaker 3
36:19
today, and so I want to
Speaker 2
36:20
make sure we take a little bit of time, I will repeat this here. If the reset gate, this R value is close to 0, those kinds of hidden dimensions are basically allowed to be dropped. And if the update gate z basically is 1, then we can copy information of that unit through many, many different time steps.
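Collected in one place, the GRU equations walked through above are roughly the following, with the circle denoting element-wise multiplication and the convention that z close to 1 copies the previous state, matching the description here (the candidate nonlinearity is typically a tanh):

```latex
z_t = \sigma\left( W^{(z)} x_t + U^{(z)} h_{t-1} \right)              % update gate
r_t = \sigma\left( W^{(r)} x_t + U^{(r)} h_{t-1} \right)              % reset gate
\tilde{h}_t = \tanh\left( W x_t + r_t \circ U h_{t-1} \right)         % new memory content
h_t = z_t \circ h_{t-1} + \left(1 - z_t\right) \circ \tilde{h}_t      % final memory
```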
Speaker 2
36:44
And if you think about optimization a lot, what this will also mean is that the gradient can flow through the recurrent neural network through multiple time steps until it actually matters and you want to update a specific word, for instance, and go all the way through many different time steps. So then, what this also allows us to do is to actually have some units that have different update frequencies. Some you might want to reset every other word. Other ones you might really want to keep, like ones that have some long-term context and stay around for much longer.
Speaker 2
37:21
All right. This is the GRU. It's the second most important building block for today. There are, like I said, a lot of other variants of recurrent neural networks.
Speaker 2
37:33
Lots of amazing work in that space right now, and tomorrow Quoc will talk a lot about some more advanced methods. So now that you understand word vectors and neural network sequence models, you really have the 2 most important concepts for deep NLP. And that's pretty awesome. So congrats.
Speaker 2
37:56
We can now, in some ways, really play around with those 2 Lego blocks, plus some slight modifications of them, very creatively, and build a lot of really cool models. A lot of the models that I will show you, and that you can see in the latest papers that are now coming out almost every week on arXiv, will use these 2 components in a major way. Now, this is 1 of the few slides now with something really, really new. Because I want to keep it exciting for the people who already knew all of this stuff and took the class and everything.
Speaker 2
38:33
This is tackling an important problem, which is, in all of these models that you will see in pretty much most of these papers, we have in the end 1 final softmax here. And that softmax is basically our default way of classifying what we can see next, what kinds of classes we can predict. The problem with that is, of course, that it will only ever accurately predict frequently seen classes that we had at training time. But in the case of language modeling, for instance, where our classes are the words, we may see at test time some completely new words.
Speaker 2
39:08
Maybe I'm just going to introduce to you a new name, Srini, for instance. And nobody may have seen that word at training time, but now that I mentioned him and I will introduce him to you, you should be able to predict the word Srini and that person in a new context. And so the solution, which we're literally going to release only next week in a new paper, is to essentially combine the standard softmax that we can train with a pointer component. And that pointer component will allow us to point to previous contexts and then predict based on that to see that word.
Speaker 2
39:46
So let's, for instance, take the example of language modeling again. We may read a long article about the Fed chair, Janet Yellen. And maybe the word Yellen had not appeared in training time before. So we couldn't ever predict it, even though we just learned about it.
Speaker 2
40:03
And now a couple of sentences later, interest rates were raised, and then Mrs., and now we want to predict that next word. Now, if that word hadn't appeared in our standard softmax training procedure at training time, we would never be able to predict it. What this model will do, and we're calling it a pointer sentinel mixture model, is it will essentially first try to see, would any of these previous words maybe be the right candidate? So we can really take into consideration the previous context of, say, the last 100 words.
Speaker 2
40:34
And if we see that word and that word makes sense, after we train it, of course, then we might give a lot of probability mass to just that word at this current position in our previous immediate context at test time. And then we also have the sentinel, which is basically going to be the rest of the probability, if we cannot refer to some of the words that we just saw. And that 1 will go directly to our standard softmax. And then what we'll essentially have is a mixture model that allows us to use either 1, or a combination of both: essentially words that just appeared in this context, and words that we saw in our standard softmax language modeling system.
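Schematically, the mixture being described can be written roughly as follows, where p_ptr is the attention distribution over words in the recent context, p_vocab is the standard softmax over the vocabulary, and the gate g is the probability mass assigned to the sentinel (a sketch of the idea, not an exact transcription of the paper):

```latex
p\left(w \mid \text{context}\right) = g \; p_{\text{vocab}}\left(w \mid \text{context}\right) + \left(1 - g\right) \, p_{\text{ptr}}\left(w \mid \text{context}\right), \qquad 0 \le g \le 1
```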
Speaker 2
41:18
So I think this is a pretty important next step, because it will allow us to predict things we've never seen at training time. And that's something that's clearly a human capability that pretty much none of these language models had before. And so to look at how much it actually helps, it'll be interesting to look at some of the prior performance. So again, what we're measuring here is perplexity.
Speaker 2
41:42
And the lower, the better, because it's essentially the inverse here of the actual probability that we assign to the correct next word. And in 2010, so just 6 years ago, there was some great early work by Tomas Mikolov, where he compared to a lot of standard natural language processing methods, syntactic models that essentially tried to predict the next word and had a perplexity of 107. And he was able to use standard recurrent neural networks, and actually an ensemble of 8 of them, to really significantly push down the perplexity, especially when you combine it with standard count-based methods for language modeling. So in 2010, he made great progress by pushing it down to 87, and now this is 1 of the great examples of how much progress is being made in the field thanks to deep learning, where 2 years ago, researchers were able to push that down even further to 78 with a very large LSTM, similar to a GRU-like model, but even more advanced.
Speaker 2
42:57
Quoc will teach you the basics of LSTMs tomorrow. Then last year, the performance was pushed down even further by Yarin Gal. And then this 1 actually came out just a couple of weeks ago: variational recurrent highway networks pushed it down even further.
Speaker 2
43:17
But this pointer sentinel model is able to get it down to
Speaker 1
43:19
70.
Speaker 2
43:20
So in just a short amount of time, we pushed it down by more than 10 perplexity points in 2 years. And that is really an increased speed in performance that we're seeing now that deep learning is changing a lot of areas of natural language processing. All right, now we have our basic Lego blocks, the word vectors, and the GRU sequence models.
Speaker 2
43:47
And now we can talk a little bit about some of the ongoing research that we're working on. And I'll start that with maybe a controversial question, which is, could we possibly reduce all NLP tasks to essentially question answering tasks over some kind of input. And in some ways, that's a trivial observation, that you could do that. But it actually might help us to think of models that could take any kind of input, a question about that input, and try to produce an output sequence.
Speaker 2
44:22
So let me give you a couple of examples of what I mean by this. The first 1 here is a task that we would standardly associate with question answering. I'll give you a couple of facts. Mary walked to the bathroom.
Speaker 2
44:35
Sandra went to the garden. Daniel went back to the garden. Sandra took the milk there. Where's the milk?
Speaker 2
44:41
And now you might have to logically reason, try to find the sentence about milk, maybe Sandra took the milk there, and you would have to do coreference resolution and find out what there refers to, and then you try to find, you know, the previous sentence that mentions Sandra, see that it is the garden, and then give the answer garden. So this is a simple logical reasoning question answering task. And that's what most people in the QA field sort of associate with question answering. But we can also say, everybody's happy, and the question is, what's the sentiment?
Speaker 2
45:18
And the answer is positive. So this is a different subfield of NLP that tackles sentiment analysis. We can go further and ask, what are the named entities of a sentence like Jane has a baby in Dresden, and you want to find out that Jane is a person and Dresden is a location. This is an example of sequence tagging.
Speaker 2
45:39
You can even go as far and say, I think this model is incredible, and the question is, what's the translation into French? And you get, je pense que ce modèle est incroyable. And that, in some ways, would be phenomenal if we were able to actually tackle all these different kinds of tasks with the same kind of model. So maybe it would be an interesting new goal for NLP to try to develop a single joint model for general question answering.
Speaker 2
46:18
I think it would push us to think about new kinds of sequence models and new kinds of reasoning capabilities in an interesting way. Now, there are 2 major obstacles to actually achieving the single joint model for arbitrary QA tasks. The first 1 is that we don't even have a single model architecture that gets consistent state-of-the-art results across a variety of different tasks. So for instance, for question answering, this is a data set called bAbI that Facebook published last year.
Speaker 2
46:46
Strongly supervised memory networks get the state-of-the-art. For sentiment analysis, you had tree LSTM models developed by Kai-Sheng Tai here at Stanford last year, and for part of speech tagging, you might have bidirectional LSTM conditional random fields. 1 thing you do notice is all the current state-of-the-art methods are deep learning. Sometimes they still connect to other traditional methods like conditional random fields and undirected graphical models, but there's always some kind of deep learning component in them.
Speaker 2
47:22
So that is the first obstacle. The second 1 is that really fully joint multitask learning is very, very hard. Usually when we do do it, we restrict it to lower layers. So for instance, in natural language processing, all we're currently able to share in some principled way are word vectors.
Speaker 2
47:42
We take the same word vectors we trained, for instance, with GloVe or Word2vec, and we initialize our deep neural network sequence models with those word vectors. In computer vision, we're actually a little further ahead, and you're able to use multiple of the different layers. And you initialize a lot of your CNN models with a first pre-trained CNN that was pre-trained on ImageNet, for instance. Now, usually, people evaluate multitask learning with only 2 tasks.
Speaker 2
48:12
They train on a first task, and then they evaluate the model that they initialize from the first on the second task. But they often ignore how much the performance degrades on the original task. So when somebody takes an ImageNet CNN and applies it to a new problem, they rarely ever go back and say, how much did my accuracy actually decrease on the original data set. And furthermore, we usually only look at tasks that are actually related, and then we find, oh, look, there's some amazing transfer learning capability going on.
Speaker 2
48:42
What we don't look at often in the literature and in most people's work is that when the tasks aren't related to 1 another, they actually hurt each other. And this is so-called catastrophic forgetting. There's not too much work around that right now. Now, I also would like to say that right now, almost nobody uses the exact same decoder or classifier for a variety of different kinds of outputs.
Speaker 2
49:12
We at least replace the softmax to try to predict different kinds of problems. All right, so this is the second obstacle now. For now, we'll only tackle the first obstacle. And this is basically what motivated us to come up with dynamic memory networks.
Speaker 2
49:29
They're essentially an architecture to try to tackle arbitrary question answering tasks. When I'll talk about dynamic memory networks, it's important to note here that for each of the different tasks I'll talk about, it'll be a different dynamic memory network. It won't have the exact same weights. It'll just be the same general architecture.
Speaker 2
49:50
So the high-level idea for DMNs is as follows. Imagine you had to read a bunch of facts like these here. They're all very simple in and of themselves. But if I now ask you a question, I showed you these and I ask, where is Sandra?
Speaker 2
50:07
It would be very hard, even if you read all of them, it would be hard to remember. And so the idea here is that for complex questions, we might actually want to allow you to have multiple glances at the input. And just like I promised, 1 of our most important basic Lego blocks will be this GRU we just introduced in the previous section. Now, here's this whole model in all its gory details.
Speaker 2
50:37
And we'll dive into all of that in the next couple of slides. So don't worry. It's a big model. A couple of observations.
Speaker 2
50:45
So the first 1 is, I think we're moving in deep learning now to try to use more proper software engineering principles, basically to modularize, encapsulate certain capabilities, and then take those as basic Lego blocks and build more complex models on top of them. A lot of times, nowadays, you just have a CNN. That's like 1 little block in a complex paper, and then other things happen on top. Here, we'll have the GRU or word vectors, basically, as 1 module, a sub-module, in these different ones here.
Speaker 2
51:18
And I'm not even mentioning word vectors anymore. But word vectors still play a crucial role. And each of these words is essentially represented as this word vector, but we just kind of assume that it's there. OK, so let's walk on a very high level through this model.
Speaker 2
51:32
There are essentially 4 different modules. There's the input module, which will be a neural network sequence model, a GRU. There's a question module, an episodic memory module, and an answering module. And sometimes you also have these semantic memory modules here.
Speaker 2
51:48
But for now, these are really just our word vectors. And we'll ignore that for now. So let's go through this. Here is our corpus.
Speaker 2
51:55
And our question is, where is the football? And this is our input that should allow us to answer this question. Now, if I ask this question, I will essentially use the final representation of this question to learn to pay attention to the right kinds of inputs that seem relevant for given what I know to answer this question. So where's the football?
Speaker 2
52:18
Well, it would make sense to basically pay attention to all the sentences that mention football and maybe especially the last ones if the football moves around a lot. So what we'll observe here is that this last sentence will get a lot of attention. So John put down the football. And now, what we'll basically do is that this hidden state of this recurrent neural network model will be given as input to another recurrent neural network, because it seemed relevant to answer this current question at hand.
Speaker 2
52:49
Now, we'll basically agglomerate all these different facts that seem relevant at the time, in another GRU, into this final vector m. And now this vector m, together with the question, will be used to go over the inputs again if the model deems that it doesn't have enough information yet to answer the question. So if I ask you, where's the football, and it so far only found that John put down the football, you don't know enough. You still don't know where it is, but you now have a new fact, namely, John seems relevant to answer the question.
Speaker 2
53:18
And that fact is now represented in this vector m, which is also just the last hidden state of another recurrent neural network. Now we'll go over the inputs again. Now that we know that John and the football are relevant, we'll learn to pay attention to John moved to the bedroom, and John went to the hallway. Again, those are going to get agglomerated here in this recurrent neural network, and now the model thinks that it actually knows enough, because it basically intrinsically captured things about the football.
Speaker 2
53:54
John found a location, and so on. Of course, we didn't have to tell it anything about people or locations, or give it rules like, if x moves to y, and y is in the set of locations,
Speaker 2
54:05
None of that. You just give it a lot of stories like that, and in its hidden states, it will capture these kinds of patterns. So then we have the final vector m, And we'll give that to an answer module, which produces in our standard softmax way the answer. All right, now let's zoom into the different modules of this overall dynamic memory network architecture.
Speaker 2
54:28
The input, fortunately, is just a standard GRU, the way we defined it before. So simple word vectors, hidden states, reset gates, update gates, and so on. The question module is also just a GRU, a separate 1 with its own weights. And the final vector, Q, here, is just going to be the last hidden state of that recurrent neural network sequence model.
Speaker 2
54:56
Now, the interesting stuff happens in the episodic memory module, which is essentially a sort of meta-gated GRU, where this gate is defined and computed by the attention mechanism, and will basically say that this current sentence s i here seems to matter. And the superscript t is the episode that we have. So each episode basically means we're going over the input entirely 1 time. So it starts at g1 here.
Speaker 2
55:33
And what this basically will allow us to do is to say, well, if g is
Speaker 1
55:40
0,
Speaker 2
55:40
then what we'll do is basically just copy over the past states from the input. Nothing will happen. And unlike before in all these GRU equations, this G is just a single scalar number.
Speaker 2
55:52
It will basically say, if g is 0, then this sentence is completely irrelevant to my current question at hand. I can completely skip it, right? And there are lots of examples, like Mary travelled to the hallway, that are just completely irrelevant to answering the current question. In those cases, this g will be 0, and we're just copying the previous hidden state of this recurrent neural network over.
Speaker 2
56:19
Otherwise, we'll have a standard GRU model. So now, of course, the big question is how do we compute this g? And this might look a little ugly, but it's quite simple. Basically, we're going to compute a couple of vector similarities, a multiplicative 1 and an additive 1 with absolute values of the element-wise differences, between the sentence vector that we currently have, the question vector, and the memory state from the previous pass over the input.
Speaker 2
56:48
On the first pass over the input, the memory state is initialized to be just the question. And then afterwards, it agglomerates relevant facts. So intuitively here, if the sentence mentions football, for instance, and the question is, where's the football, then you'd hope that the question vector q has some units that are more active because football was mentioned, and the sentence vector also has some units that are more active because football is mentioned.
Speaker 2
57:17
And hence, some of these inner products, or absolute values of subtractions, are going to be large. And then what we're going to do is just plug that into a standard single layer neural network and then a standard linear layer here. And then we apply a softmax to essentially weight all of these different potential sentences that we might have, to compute the final gate. So this will basically be a soft attention mechanism that sums to 1 and will pay most attention to the facts that seem most relevant, given what I know so far and the question.
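Putting those pieces together, a simplified sketch of the episodic memory update and the attention gate, with s_i the sentence vector, q the question vector, and m^{t-1} the memory from the previous pass (the paper's feature vector z has a few more terms than shown here):

```latex
h_i^{t} = g_i^{t} \, \mathrm{GRU}\left(s_i, h_{i-1}^{t}\right) + \left(1 - g_i^{t}\right) h_{i-1}^{t}
z\left(s_i, m^{t-1}, q\right) = \left[\, s_i \circ q ;\; s_i \circ m^{t-1} ;\; \left|s_i - q\right| ;\; \left|s_i - m^{t-1}\right| \,\right]
g_i^{t} = \mathrm{softmax}_i\left( W^{(2)} \tanh\left( W^{(1)} z\left(s_i, m^{t-1}, q\right) + b^{(1)} \right) + b^{(2)} \right)
```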
Speaker 2
57:55
Then when the end of the input is reached, all these relevant facts here are summarized in another GRU that basically moves up here. And you can also train a classifier, if you have the right kind of supervision, to decide that the model knows enough to actually answer the question and stop iterating over the inputs. If you don't have that kind of supervision, you can also just say, I will go over the inputs a fixed number of times. And that works reasonably well, too.
Speaker 2
58:26
All right, there's a lot to sink in. So I'll give you a couple of seconds. Basically, we pay attention to different facts given a certain question. We iterate over the input multiple times.
Speaker 2
58:38
And we agglomerate the facts that seem relevant given the current knowledge and the question. Now, I don't usually talk about neuroscience. I'm not a neuroscientist, but there is a very interesting relationship here that a friend of mine, Sam Gershman, pointed out, which is that the episodic memory in general for humans is actually the memory of autobiographical events. So it's the time when we remember the first time we went to school or something like that.
Speaker 2
59:05
And it's essentially a collection of our past personal experiences that occurred at a particular time in a particular place. And just like our episodic memory can be triggered by a variety of different inputs, this episodic memory module is also triggered by the specific question at hand. And what's also interesting is that the hippocampus, which is the seat of episodic memory in humans, is actually active during transitive inference. So transitive inference is going from A to B to C to have some connection from A to C.
Speaker 2
59:36
Or in this case here, with this football, for instance, you first had to find facts about John and the football, and then find where John was, the location of John. So those are examples of transitive inference. And it turns out that you also need these multiple passes in the DMN to enable the capability to do transitive inference.