1 hour 29 minutes 3 seconds
🇬🇧 English
Speaker 1
00:00
Thank you, everybody, and thanks for coming back very soon after lunch. I'll try to make it entertaining to avoid some post-food coma. So I actually have
Speaker 2
00:10
a lot to owe to Andrew and Chris and my PhD here at Stanford for being here. It's always fun to be back. I figured there's going to be a broad range of capabilities in the room.
Speaker 2
00:24
So I'm sorry I will probably bore some of you for the first 2 thirds of the talk, because I'll go over the basics of what's NLP, natural language processing, what's deep learning, and what's really at the intersection of the 2. And then the last third, I'll talk a little bit about some exciting new research that's happening right now. So let's get started with what is natural language processing. It's really a field at the intersection of computer science, AI, and linguistics.
Speaker 2
00:54
And you could define a lot of goals. And a lot of these statements here, we could really talk and philosophize a lot about, but I'll move through them pretty quickly. For me, the goal of natural language processing is for computers to process or, scare quotes, understand natural language in order to perform tasks that are actually useful for people, such as question answering. The caveat here is that really fully understanding and representing the meaning of language, or even defining it, is quite an elusive goal.
Speaker 2
01:25
So whenever I say the model understands, I'm sorry, I shouldn't say that. Really, these models don't understand anything in the sense that we understand language. So whenever somebody says they can read or represent the full meaning in its entire glory, it's usually not quite true. Really, perfect language understanding is, in some sense, AI-complete, in the sense that you need to understand all of visual input and thought and a lot of other complex things.
Speaker 2
01:54
So a little more concretely, as we try to tackle this overall problem of understanding language, what are the different levels that we often look at? It often, and for many people, starts at speech. And then once you have speech, you might say, all right, now I know what phonemes are, the smaller parts of words, and I understand how words are formed.
Speaker 2
02:14
That's morphology or morphological analysis. Once I know what the meanings of words are, I might try to understand how they are put together in grammatical ways, such that the sentences are understandable, or at least grammatically correct to a lot of speakers of the language. Once we understand the structure, we actually want to get to the meaning. And that's really where most of my interest lies: semantic interpretation, actually trying to get to the meaning in some useful capacity.
Speaker 2
02:45
And then after that, we might say, well, if we understand now the meaning of a whole sentence, how do we actually interact? What's the discourse? How do we have spoken dialogue systems, things like that? Where deep learning has really improved the state of the art significantly is in speech recognition, and syntax, and semantics.
Speaker 2
03:05
And the interesting thing is that we're kind of actually skipping some of these levels. Deep learning often doesn't require morphological analysis to create very useful systems. And in some cases, it actually skips syntactic analysis entirely as well. It doesn't have to know about the grammar.
Speaker 2
03:21
It doesn't have to be taught about what noun phrases are, prepositional phrases. It can actually get straight to some semantically useful tasks right away. And that's going to be 1 of the sort of advantages that we don't have to actually be as inspired by linguistics as traditional natural language processing had to be. So why is NLP hard?
Speaker 2
03:42
Well, there's a lot of complexity in representing and learning, and especially using linguistic, situational, world, and visual knowledge. Really, all of these are connected when it gets to the meaning of language. To really understand what read means, can you do that without visual understanding, for instance? If you have, for instance, this sentence here, Jane hit June, and then she fell, or, and then she ran.
Speaker 2
04:07
Depending on which verb comes after she, the definition, the meaning of she actually changes. And this is 1 subtask you might look at, so-called coreference resolution in general, where you try to understand who she refers to, and it depends on the meaning, again, somewhat scare quotes here, of the verb
Speaker 3
04:31
that follows this pronoun.
Speaker 2
04:33
Similarly, there is a lot of ambiguity. So here we have a very simple sentence of just 4 words: I made her duck.
Speaker 2
04:40
Now, that simple sentence can actually have at least 4 different meanings, if you think about it for a little bit, right? You made her a duck that she'd love for Christmas dinner, you made her duck, like I did just now, and so on. There are actually 4 different meanings, and to know which 1 requires, in some sense, situational awareness or knowledge to really disambiguate what is meant here. So that's sort of the high level of NLP.
Speaker 2
05:10
Now, where does it actually become useful in terms of applications? Well, they actually range from very simple things that we take as a given now,
Speaker 3
05:19
we use them all the time
Speaker 2
05:19
every day, to more and more complex ones that are more in the realm of research. The simple ones are things like spell checking, or keyword search, and finding synonyms. Then the medium-difficulty ones are to extract information from websites, trying to extract product prices or dates and locations, people or company names, so-called named entity recognition. You can go a little bit above that and try to classify reading levels for school text, for instance, or do sentiment analysis, which can be helpful if you have a lot of customer emails that come in and you want to prioritize highly the ones of customers who are really, really annoyed with you right now.
Speaker 2
06:01
And then the really hard ones, and I think in some sense the most interesting ones, are machine translation, trying to actually be able to translate between all the different languages in the world. Question answering, clearly something that is a very exciting and useful piece of technology, especially over very large, complex domains. Can be used for automated email replies. I know pretty much everybody here would love to have some simple automated email reply system.
Speaker 2
06:30
And then spoken dialogue systems. Bots are very hip right now. These are all sort of complex things that are still in the realm of research to do really well. We're making huge progress, especially with deep learning, on these 3, but they're still nowhere near human accuracy. So let's look at the representations.
Speaker 2
06:51
I mentioned we have morphology, and words, and syntax, and semantics, and so on. We can look at 1 example, namely machine translation, and look at how people tried to solve this problem of machine translation. Well, it turns out they tried all of these different levels with varying degrees of success. You can try to have a direct translation of words to other words.
Speaker 2
07:16
The problem is that this is often a very tricky mapping. The meaning of 1 word in English might map to 3 different words in German, and vice versa. You can have 3 different words in English all meaning the same single word in German, for instance. So then people said, well, let's try to maybe do a syntactic transfer where we have whole phrases, like to kick the bucket, which just means sterben in German.
Speaker 2
07:38
OK, not a fun example. And then semantic transfer might be, well, let's try to find a logical representation of the whole sentence, the actual meaning in some human understandable form, and then try to just find another surface representation of that. Now, of course, that will also get rid of a lot of the subtleties of language. And so there are tricky problems in all these kinds of representations.
Speaker 2
08:01
Now, the question is, what does deep learning do? You've already seen at least 2 methods: standard neural networks before, and convolutional neural networks for vision. And in some sense, there's going to be a huge similarity here to these methods, because just like an image is essentially a long list of numbers, a vector, and the hidden state of a standard neural network is also just a vector or a list of numbers, that is also going to be the main representation that we will use throughout, for characters, for words, for short phrases, for sentences, and in some cases for entire documents. They will all be vectors.
Speaker 2
08:43
And with that, we are sort of finishing up the whirlwind of what's NLP. Of course, you could give an entire lecture on almost every single slide I just gave. So we're very, very high level. But we'll continue at that speed to try to squeeze this complex deep learning for NLP subject area into an hour and a half.
Speaker 2
09:05
I think there are 2 basic Lego blocks that are the most important ones to know nowadays in order to be able to sort of creatively play around with more complex models, and those are going to be word vectors, and sequence models, namely recurrent neural networks. And I kind of split this into words, sentences, and multiple sentences, but really, you could use recurrent neural networks for shorter phrases and multiple sentences, but we will see that they have limitations as you move to longer and longer sequences and use the default neural network sequence models. So let's start with words. And maybe 1 last blast from the past here: to represent the meaning of words, we actually used to use taxonomies like WordNet that kind of define each word in relationship to lots of other ones. So you can, for instance, define hypernyms, or is-a relationships.
Speaker 2
10:05
You might say the word panda, for instance, in its first meaning as a noun, basically goes through this complex stack, this directed acyclic graph, most of it is roughly just a tree. And in the end, like everything, it is an entity, but it's actually a physical entity, a type of object. It's a whole object, it's a living thing, it's an organism, animal, and so on. So you basically can define a word like this.
Speaker 2
10:28
At each node of this tree, you actually have so-called synsets, or synonym sets. Here's an example for the synonym set of the word good. Good can have a lot of different meanings. It can actually be an adjective, as well as an adverb, as well as a noun.
Speaker 2
10:47
Now, what are the problems with this kind of discrete representation? Well, they can be great as a resource. If you're a human, you want to find synonyms. But they're never going to be quite sufficient to capture all the nuances that we have in language.
Speaker 2
11:05
So for instance, the synonyms here for good were adept, expert, practiced, proficient, and skillful. But of course, you would use these words in slightly different contexts. You would not use the word expert in exactly all the same contexts as you would use the meaning of good, or the word good. Likewise, it will be missing a lot of new words.
Speaker 2
11:29
Language is this interesting living organism. We change it all the time. You might have some kids, they say YOLO, and all of a sudden, you need to update your dictionary. Likewise, maybe in Silicon Valley, you might see ninja a lot, and now you need to update your dictionary again.
Speaker 2
11:45
And that is basically going to be a Sisyphean job. Nobody will ever be able to really capture all the meanings in this living, breathing organism that language is. So it's also very subjective. Some people might think ninja should just be deleted from the dictionary, and they don't want to include it.
Speaker 2
12:03
I just think nifty or badass is kind of a silly word and should not be included in a proper dictionary, but it's being used in real language and so on. It requires human labor. As soon as you change your domain, you have to ask people to update it. And it's also hard to compute accurate word similarities.
Speaker 2
12:18
Some of these words are subtly different, and it's really a continuum in which we can measure their similarities. So instead, what we're going to use, and what is also the first step for deep learning, we'll actually realize it's not quite deep learning in many cases, but it is sort of the first step to using deep learning in NLP, is distributional similarity. So what does that mean? Basically, the idea is that we'll use the neighbors of a word to represent that word itself.
Speaker 3
12:49
It is a
Speaker 2
12:49
pretty old concept, and here is an example, for instance, for the word banking, we might actually represent banking in terms of all of these other words that are around it. So let's do a very simple example, where we look at a window around each word. And so here, the window length, that's just for simplicity, say it's 1.
Speaker 2
13:11
We represent each word only with the words 1 to the left and 1 to the right of it. We'll just use the symmetric context around each word. And here's a simple example corpus with just 3 sentences; of course, we would always want to use corpora with billions of words instead of just a couple.
Speaker 2
13:29
But just to give you an idea of what's being captured in these word vectors, the corpus is: I like deep learning, I like NLP, and I enjoy flying. And now, this is a very simple so-called co-occurrence statistic. You'll simply see here that for I, for instance, the word like appears twice in its window of size 1, in its context, and the word enjoy appears once in the context.
Speaker 2
13:54
And for like, you have I twice to its left, and deep once, and NLP once. It turns out, if you just take those vectors, this could be a vector representation; each row could be a vector representation for a word. Unfortunately, as soon as your vocabulary increases, that vector dimensionality would change, and hence you'd have to retrain your whole model. It's also very sparse, and really, it's going to be somewhat noisy if you use that vector.
Speaker 2
14:24
Now, another better thing to do might be to run SVD or something similar like PCA dimensionality reduction on such a co-occurrence matrix. And that actually gives you a reasonable first approximation to word vectors. Very old method, works reasonably well. Now, what works even better than simple PCA is actually a model introduced by Tomas Mikolov in 2013, called word2vec.
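To make the counting and the SVD step concrete, here is a minimal sketch, not from the talk itself, that builds the window-1 co-occurrence matrix for that 3-sentence corpus and reduces it with an SVD; all variable names are just illustrative.

```python
import numpy as np

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric window of size 1: count the immediate left/right neighbors of each word.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                X[index[w], index[words[j]]] += 1

# SVD / PCA-style dimensionality reduction: keep the top k singular directions
# as low-dimensional word vectors.
U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * s[:k]
print(word_vectors[index["like"]])
```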
Speaker 2
14:49
So instead of capturing co-occurrence counts directly out of a matrix like that, it'll actually go through each window in a large corpus, take the word that's in the center of each window, and use that to predict the words around it. That way, you can very quickly train. You can train almost online, though few people do this, and add words to your vocabulary very quickly in a streaming fashion. So now let's look a little bit at this model word2vec, because, 1, it's a very simple NLP model, and 2, it's very instructive.
Speaker 2
15:27
We won't go into too many details, but at least look at a couple of equations. So again, the main goal is to predict the surrounding words in a window of some length m, which we define as a hyperparameter, around every word. Now, the objective function will essentially try to maximize here the log probability of any of these context words, given the center word. So we go through our entire corpus of length T, a very long sequence.
Speaker 2
15:52
And at each time step t, we will basically look at all the words j in the context of the current word t, and basically try to maximize here this probability of being able to predict the word that is around the current word t. And theta is all the parameters, namely all the word vectors, that we want to optimize. So now, how do we actually define this probability p here? The simplest way to do this, and this is not the actual way, but it's the simplest and first way to understand and derive this model, is with this very simple inner product here, and that's why we can't quite call it deep.
Speaker 2
16:34
There's not going to be many layers of nonlinearities like we see in deep neural networks. It's really just a simple inner product. And the higher that inner product is, the more likely these 2 will be predicting 1 another. So here, C is the center word, O is the outside word, and basically, this inner product, the larger it is, the more likely we are going to predict this. And these are both just standard n-dimensional vectors.
Speaker 2
17:04
And now, in order to get a real probability, we'll essentially apply softmax to all the potential inner products that you might have in your vocabulary. And 1 thing you will notice here is, well, this denominator is actually going to be a very large sum. We'll want to sum here over all potential inner products for every single window. That would be too slow.
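Written out, the objective and the softmax being described here are roughly the standard skip-gram formulation (my notation, not necessarily the exact notation on the slides):

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right),
\qquad
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```

Here v_c is the center word vector, u_o is the outside (context) word vector, and the denominator is exactly the expensive sum over the whole vocabulary V that the clever approximations avoid.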
Speaker 2
17:25
So now the real methods that we would use are going to approximate the sum in a variety of clever ways. Now, I could literally talk the next hour and a half just about how to optimize the details of this equation, but then we'll all deplete our mental energy for the rest of the day. And so I'm just going to point you to the class I taught earlier this year, CS224D, where we have lots of different slides that go into all the details of this equation, how to approximate it, and then how to optimize it. It's going to be very similar to the way we optimize any other neural network.
Speaker 2
18:00
We're going to use stochastic gradient descent. We're going to look at mini-batches of a couple hundred windows at a time, and update those word vectors, and we are going to take simple gradients of each of
Speaker 3
18:16
these vectors as we go through windows in a large corpus.
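As a reminder, a single such update is roughly a gradient ascent step on the maximization objective above (equivalently, gradient descent on the negative log likelihood), with a learning rate alpha:

```latex
\theta^{\text{new}} = \theta^{\text{old}} + \alpha \, \nabla_{\theta} J(\theta)
```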
Speaker 2
18:18
All right, now, we briefly mentioned PCA-like methods, based on singular value decomposition, or standard PCA. We also had this word2vec model. A model that combines the best of both worlds is GloVe, or global vectors, introduced by Jeffrey Pennington in
Speaker 1
18:38
2014.
Speaker 2
18:40
And it has a very similar idea, and you will notice here, there is some similarity: you have this inner product again, for different pairs, but this model will go over the co-occurrence matrix. Once you have the co-occurrence matrix, it is more efficient to predict once how often 2 words appear next to each other, rather than do it 50 times each time that pair appears in an actual corpus. So in some sense, you can go more efficiently through all the co-occurrence statistics, and you're going to basically try to minimize this subtraction here.
Speaker 2
19:16
And what that basically means is that each inner product will try to approximate the log probability of these 2 words actually co-occurring. Now, you have this function here, which essentially will allow us to not overly weight certain pairs that occur very, very frequently. The word the, for instance, co-occurs with lots of different words. And you want to basically lower the importance of all the words that co-occur with the.
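For reference, the GloVe objective being sketched here has roughly this form, with w_i and the tilde-w_j the two word vectors, the b terms biases, X_ij the co-occurrence count, and f the weighting function that caps very frequent pairs:

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```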
Speaker 2
19:46
So you can train this very fast. It scales to gigantic corpora. In fact, we trained this on Common Crawl, which is a really great data set of most of the internet. It's many billions of tokens.
Speaker 2
20:00
And it also gets very good performance on small corpora because it makes very efficient use of these co-occurrence statistics. And that's essentially what word vectors are always capturing. So if in 1 sentence you just want to remember, every time you hear word vectors in deep learning: 1, they are not quite deep, even though we call them step 1 of deep learning, and 2, they are really just capturing co-occurrence counts. How often does a word appear in the context of other words?
Speaker 2
20:30
So let's look at some interesting results of these GloVe vectors. Here, the first thing we do is look at nearest neighbors. So now that we have these n-dimensional vectors, usually we say n is between 50 and at most
Speaker 1
20:44
500.
Speaker 2
20:45
A good general number is 100 or 200 dimensions. Each word is now represented as a single vector. And so we can look in this vector space for words that appear close by.
Speaker 2
20:57
We started and looked for the nearest neighbors of frog. And well, it turned out these are the nearest neighbors, which was a little confusing since we're not biologists. But fortunately, when you actually look up and Google what those mean, you'll see that they are actually all, indeed, different kinds of frogs. Some appear very rarely in the corpus, and others, like toad, are much more frequent.
Speaker 2
21:23
Now, 1 of the most exciting results that came out of word vectors are actually these word analogies. So the idea here is, can there be linear relationships between different word vectors that simply fall out of very simple addition and subtraction? So the idea here is: man is to woman as king is to what? As in, what is the right analogy when I try to basically fill in here the last missing word?
Speaker 2
22:00
Now, the way we're going to do this is with a very simple cosine similarity. Basically, let's take an example here: we take the vector of woman, we subtract the word vector we learned for man, and we add the word vector of king. And the resulting vector, the arg max of cosine similarity to it, turns out to be queen for a lot of these different models.
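A minimal sketch of that analogy computation, assuming you already have a dictionary vecs mapping words to numpy arrays (the names here are hypothetical):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, vecs):
    # a : b :: c : ?   e.g. man : woman :: king : queen
    target = vecs[b] - vecs[a] + vecs[c]
    # Arg max of cosine similarity, excluding the query words themselves.
    candidates = ((w, cosine(v, target)) for w, v in vecs.items() if w not in (a, b, c))
    return max(candidates, key=lambda wv: wv[1])[0]

# analogy("man", "woman", "king", vecs)  # expected to return "queen" with good vectors
```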
Speaker 2
22:26
And that was very surprising. Again, we're capturing co-occurrence statistics. So man might, in its context, often have things like running and fighting and other silly things that men do. And then you subtract those kinds of words from the context and you add them again.
Speaker 2
22:44
In some sense, it's intuitive, though surprising, that it works out that well for so many different examples. So here are some other examples similar to the king and queen example, where we basically took these 200-dimensional vectors and we projected them down to 2 dimensions, again with a very simple method like PCA. And what we find, quite interestingly, is that even in just the first 2 principal components of this space, we have some very interesting sort of female-male relationships.
Speaker 2
23:17
So man to woman is similar to uncle and aunt, brother and sister, sir and madam, and so on. So this is an interesting semantic relationship that falls out of essentially co-occurrence counts in specific windows around each word in a large corpus. Here's another 1 that's more of a syntactic relationship. We actually have here superlatives, like slow, slower, and slowest is in a similar vector relationship to short, shorter, and shortest.
Speaker 2
23:49
Or strong, stronger, and strongest. So this was very exciting, and of course, when you see an interesting qualitative result, you want to try to quantify which models do better in trying to understand these analogies, and what are the different modes and hyperparameters that modify the performance. Now, this is something that you will notice in pretty much every deep learning project ever, which is that more data will give you better performance. That's probably the single most useful thing you can do for a machine learning or deep learning system: train it with more data. And we found that too.
Speaker 2
24:23
Now there are different vector sizes too, which is a common hyperparameter, like I said, usually between 50 and at most 500. Here, 300 dimensions essentially gave us the best performance for these different kinds of semantic and syntactic relationships. Now, in many ways, having a single vector for words can be oversimplifying. Some words have multiple meanings, maybe they should have multiple vectors. Sometimes the word meaning changes over time, and so on.
Speaker 2
24:56
So there's a lot of simplifying assumptions here, but again, our final goal for deep NLP is going to be to create useful systems. And it turns out this is a useful first step to create such systems that mimic some human language behavior in order to create useful applications for us. All right, but words, word vectors are very useful. But words, of course, never appear in isolation.
Speaker 2
25:21
And what we really want to do is understand words in their context. And so this leads us to the second section here on recurrent neural networks. So we already went over the basic definition of standard neural networks. Really the main difference between a standard neural network and a recurrent neural network, which I'll abbreviate as RNN now, is that we will tie the weights at each time step.
Speaker 2
25:48
And that will allow us to essentially condition the neural network on all the previous words, in theory. In practice, given how we can optimize it, it won't really be all the previous words. It'll be more like, at most, the last 30 words. But in theory, this is what a powerful model can do.
Speaker 2
26:04
So let's look at the definition of a recurrent neural network. And this is going to be a very important definition, so we'll go into a little bit of details here. So let's assume for now we have our word vectors as given, and we'll represent each sequence in the beginning as just a list of these word vectors. Now, what we're going to do is we're computing a hidden state, ht, at each time step.
Speaker 2
26:28
And the way we're going to do this is with a simple neural network architecture. In fact, you can think of this summation here as really just a single layer neural network, if you were to concatenate the 2 matrices and these 2 vectors. But intuitively, we basically will map our current word vector at that time step t, and sometimes I use these square brackets to denote that we are taking the word vector from that time step in there. We map that with a linear layer, a simple matrix vector product, and we sum up that matrix vector product with another matrix vector product of the previous hidden state at the previous time step.
Speaker 2
27:11
We sum those 2, and we apply, in this case, a simple sigmoid function to define this standard neural network layer. That will be ht. And now at each time step, we want to predict some kind of class probability over a set of potential events, classes, words, and so on. And we use the standard softmax classifier.
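In equations, the recurrent step and the softmax prediction just described look roughly like this, in the notation commonly used in the course (it may differ slightly from the slides):

```latex
h_t = \sigma\left( W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]} \right),
\qquad
\hat{y}_t = \mathrm{softmax}\left( W^{(S)} h_t \right),
\qquad
\hat{y}_{t,j} = P\left( x_{t+1} = v_j \mid x_t, \ldots, x_1 \right)
```

Here x_[t] is the word vector at time step t, and for language modeling the classes v_j are the words in the vocabulary.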
Speaker 2
27:33
Some other communities call it the logistic regression classifier. So here, we have a simple matrix, Ws, for the softmax weights. The number of rows is basically the number of classes that we have, and the number of columns is the same as the hidden dimension. Sometimes, we want to predict the next word in the sequence in order to be able to identify the most likely sequence.
Speaker 2
28:06
So for instance, if I ask a speech recognition system, what is the price of wood? Now in isolation, if you hear wood, you would probably assume it's the W-O-U-L-D, the auxiliary verb would, but in this particular context, the price of, it wouldn't make sense to have a verb following that. And so it's more likely the W-O-O-D, to find the price of wood. So language modeling is a very useful task, and it's also very instructive to use as an example for where recurrent neural networks really shine.
Speaker 2
28:38
So in our case here, this softmax is going to be quite a large matrix that goes over the entire vocabulary of all the possible words that we have. So each word is going to be a class. The classes for language models are the words in our vocabulary. And so we can define here this y hat t; its jth element basically denotes the probability that the word at the jth index will come next, after all the previous words.
Speaker 2
29:09
It's a very useful model, again, for speech recognition, for machine translation, for just finding a prior for language in general. All right. Again, the main difference to standard neural networks is that we have the same set of weights at all of the different time steps. Everything else is pretty much a standard neural network.
Speaker 2
29:31
We often initialize the first h0 here either randomly or as all zeros. And again, in language modeling in particular, the next word is our class for the softmax. Now we can measure the performance of language models with a term called perplexity, which is based on the average log likelihood of being able to predict the next word. So you want to really give the highest probability to the word that will actually appear next in a long sequence.
Speaker 2
30:09
And then, the higher that probability is, the lower your perplexity, and hence the model is less perplexed to see the next word. In some sense, you can think of language modeling as almost NLP complete in some silly sense that if you can actually predict every single word that follows after any arbitrary sequence of words in a perfect way, you would have disambiguated a lot of things. You can say, for instance, what is the answer to the following question, ask the question, and then the next couple of words would be the predicted answer. So there's no way we can actually ever do a perfect job in language modeling, but there are certain contexts where we can give a high probability to the right next couple of words.
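Concretely, perplexity is usually defined as the exponentiated negative average log likelihood over the corpus, so better prediction means lower perplexity:

```latex
\text{Perplexity} = \exp\left( - \frac{1}{T} \sum_{t=1}^{T} \log P\left( x_{t+1} \mid x_t, \ldots, x_1 \right) \right)
```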
Speaker 2
30:55
Now, this is the standard recurrent neural network, and 1 problem with this is that we will modify the hidden state here at every time step. So even if I have words like the and a and a sentence period and things like that, it will significantly modify my hidden state. Now, that can be problematic. Let's say, for instance, I want to train a sentiment analysis algorithm, and I talk about movies and I talk about the plot for a very long time.
Speaker 2
31:25
Then I say, oh, man, this movie was really wonderful. It was great to watch. And then especially the ending, and you talk again for like 50 time steps or 50 words or 100 words about the plot. Now, all these plot words will essentially modify my hidden state.
Speaker 2
31:38
So if at the end of that whole sequence I want to classify the sentiment, the words wonderful and great that I mentioned somewhere in the middle might be completely gone, because I keep updating my hidden state with all of these content words that talk about the plot. Now, the way to improve this is by using better kinds of recurrent units. And I will introduce here a particular kind, so-called gated recurrent units, introduced by Cho and collaborators.
Speaker 2
32:08
And in some sense, we will learn more about the LSTM tomorrow, when Quoc gives his lecture, but GRUs are a special case of LSTMs. And the main idea is that we want to have the ability to keep certain memories around without having the current input modify them at all. So again, this example of sentiment analysis. I say something's great.
Speaker 2
32:31
That should somehow be captured in my hidden state, and I don't want all of the content words that talk about the plot in the movie review to modify that it is actually overall a great movie. And then we also want to allow error messages to flow at different strengths depending on the input. So if I say, great, I want that to modify a lot of things in the past. So let's define a GRU.
Speaker 2
32:55
Fortunately, since you already know the basic Lego block of a standard neural network, there are only 1 or 2 subtleties here that are different. There are a couple of different steps that we'll need to compute at every time step. So in the standard RNN, what we did was just have this 1 single neural network that we hope would capture all this complexity of the sequence. Instead now, we'll first compute a couple of gates at that time step.
Speaker 2
33:23
So the first thing we will compute is the so-called update gate, just yet another neural network layer, based on the current input word vector and the past hidden state. So these look quite familiar, but this will just be an intermediate value, and we'll call it the update gate. Then we'll also compute a reset gate. It's yet another standard neural network layer.
Speaker 2
33:45
Again, just a matrix vector product, plus another matrix vector product, and some kind of non-linearity here, namely a sigmoid. It's actually important in this case that it is a sigmoid. Basically, both of these will be vectors with numbers that are between 0 and
Speaker 1
33:59
1.
Speaker 2
34:01
Now we'll compute a new memory content, an intermediate h-tilde here, with yet another neural network, but then we have this little funky symbol in here. Basically, this will be an element-wise multiplication. So basically, what this will allow us to do is, if that reset gate is 0, we can essentially ignore all the previous memory elements, and only store the new word information.
Speaker 2
34:30
So, for instance, if I talked for a long time about the plot, and now I say this was an awesome movie, then if the whole goal of this sequence classification model is to capture sentiment, you want to be able to ignore past content. And this is, of course, if this was entirely a 0 vector. Now, this will be more subtle.
Speaker 2
34:52
This is a long vector of maybe 100 or 200 dimensions. So maybe some dimensions should be reset, but others maybe not. And then here, we'll have our final memory that essentially combines these 2 states, the previous hidden state and this intermediate 1 at our current time step. And what this will allow us to do is essentially also say, well, maybe you want to ignore everything that's currently happening and only update the last time step.
Speaker 2
35:21
We basically copy over the previous time step and the hidden state of that and ignore the current thing. Again, a simple example: in sentiment, maybe there's a lot of talk about the plot and when the movie was released; you want to have the ability to ignore that and copy over what may have been said in the beginning, that it was an awesome movie. So here is an attempt at a clean illustration. I have to say, personally, I find the equations a little more intuitive than the visualizations that we try to do, but some people are more visual here. So basically, here we have our word vector, and it goes through different layers, and then some of these layers will essentially modify other outputs of previous time steps.
Speaker 2
36:06
So this is a pretty nifty model, and it is really the second most important basic Lego block that we are going to learn about
Speaker 3
36:19
today, and so I want to
Speaker 2
36:20
make sure we take a little bit of time, I will repeat this here. If the reset gate, this R value is close to 0, those kinds of hidden dimensions are basically allowed to be dropped. And if the update gate z basically is 1, then we can copy information of that unit through many, many different time steps.
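Collected in one place, the GRU equations walked through above are roughly the following, with the circle denoting element-wise multiplication and the convention that z close to 1 copies the previous state, matching the description here (the candidate nonlinearity is typically a tanh):

```latex
z_t = \sigma\left( W^{(z)} x_t + U^{(z)} h_{t-1} \right)              % update gate
r_t = \sigma\left( W^{(r)} x_t + U^{(r)} h_{t-1} \right)              % reset gate
\tilde{h}_t = \tanh\left( W x_t + r_t \circ U h_{t-1} \right)         % new memory content
h_t = z_t \circ h_{t-1} + \left(1 - z_t\right) \circ \tilde{h}_t      % final memory
```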
Speaker 2
36:44
And if you think about optimization a lot, what this will also mean is that the gradient can flow through the recurrent neural network through multiple time steps until it actually matters and you want to update a specific word, for instance, and go all the way through many different time steps. So then, what this also allows us to do is to actually have some units that have different update frequencies. Some you might want to reset every other word. Other ones you might really want to keep, like ones that have some long-term context and stay around for much longer.
Speaker 2
37:21
All right. This is the GRU. It's the second most important building block for today. There are, like I said, a lot of other variants of recurrent neural networks.
Speaker 2
37:33
Lots of amazing work in that space right now, and tomorrow Quoc will talk a lot about some more advanced methods. So now that you understand word vectors and neural network sequence models, you really have the 2 most important concepts for deep NLP. And that's pretty awesome. So congrats.
Speaker 2
37:56
We can now, in some ways, really play around with those 2 Lego blocks, plus some slight modifications of them, very creatively, and build a lot of really cool models. A lot of the models that I will show you, and that you can see in the latest papers that are now coming out almost every week on arXiv, will use these 2 components in a major way. Now, this is 1 of the few slides now with something really, really new. Because I want to keep it exciting for the people who already knew all of this stuff and took the class and everything.
Speaker 2
38:33
This is tackling an important problem, which is, in all of these models that you will see in pretty much most of these papers, we have in the end 1 final softmax here. And that softmax is basically our default way of classifying what we can see next, what kinds of classes we can predict. The problem with that is, of course, that it will only ever accurately predict frequently seen classes that we had at training time. But in the case of language modeling, for instance, where our classes are the words, we may see at test time some completely new words.
Speaker 2
39:08
Maybe I'm just going to introduce to you a new name, Srini, for instance. And nobody may have seen that word at training time, but now that I mentioned him and I will introduce him to you, you should be able to predict the word Srini and that person in a new context. And so the solution, which we're literally going to release only next week in a new paper, is to essentially combine the standard softmax that we can train with a pointer component. And that pointer component will allow us to point to previous contexts and then predict based on that to see that word.
Speaker 2
39:46
So let's, for instance, take the example of language modeling again. We may read a long article about the Fed chair, Janet Yellen. And maybe the word Yellen had not appeared in training time before. So we couldn't ever predict it, even though we just learned about it.
Speaker 2
40:03
And now a couple of sentences later, interest rates were raised, and then Mrs., and now we want to predict that next word. Now, if that word hadn't appeared in our standard softmax training procedure at training time, we would never be able to predict it. What this model will do, and we're calling it a pointer sentinel mixture model, is it will essentially first try to see, would any of these previous words maybe be the right candidate? So we can really take into consideration the previous context of, say, the last 100 words.
Speaker 2
40:34
And if we see that word and that word makes sense, after we train it, of course, then we might give a lot of probability mass to just that word at this current position in our previous immediate context at test time. And then we also have the sentinel, which is basically going to be the rest of the probability, if we cannot refer to some of the words that we just saw. And that 1 will go directly to our standard softmax. And then what we'll essentially have is a mixture model that allows us to use either 1, or a combination of both: essentially words that just appeared in this context, and words that we saw in our standard softmax language modeling system.
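Schematically, the mixture being described can be written roughly as follows, where p_ptr is the attention distribution over words in the recent context, p_vocab is the standard softmax over the vocabulary, and the gate g is the probability mass assigned to the sentinel (a sketch of the idea, not an exact transcription of the paper):

```latex
p\left(w \mid \text{context}\right) = g \; p_{\text{vocab}}\left(w \mid \text{context}\right) + \left(1 - g\right) \, p_{\text{ptr}}\left(w \mid \text{context}\right), \qquad 0 \le g \le 1
```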
Speaker 2
41:18
So I think this is a pretty important next step, because it will allow us to predict things we've never seen at training time. And that's something that's clearly a human capability that pretty much none of these language models had before. And so to look at how much it actually helps, it'll be interesting to look at some of the prior performance. So again, what we're measuring here is perplexity.
Speaker 2
41:42
And the lower, the better, because it's essentially the inverse here of the actual probability that we assign to the correct next word. And in 2010, so just 6 years ago, there was some great early work by Tomas Mikolov, where he compared to a lot of standard natural language processing methods, syntactic models that essentially tried to predict the next word and had a perplexity of 107. And he was able to use standard recurrent neural networks, and actually an ensemble of 8 of them, to really significantly push down the perplexity, especially when you combine it with standard count-based methods for language modeling. So in 2010, he made great progress by pushing it down to 87, and now this is 1 of the great examples of how much progress is being made in the field thanks to deep learning, where 2 years ago, researchers were able to push that down even further to 78 with a very large LSTM, similar to a GRU-like model, but even more advanced.
Speaker 2
42:57
Quoc will teach you the basics of LSTMs tomorrow. Then last year, the performance was pushed down even further by Yarin Gal. And then this 1 actually came out just a couple of weeks ago: variational recurrent highway networks pushed it down even further.
Speaker 2
43:17
But this pointer sentinel model is able to get it down to
Speaker 1
43:19
70.
Speaker 2
43:20
So in just a short amount of time, we pushed it down by more than 10 perplexity points in 2 years. And that is really an increased speed in performance that we're seeing now that deep learning is changing a lot of areas of natural language processing. All right, now we have our basic Lego blocks, the word vectors, and the GRU sequence models.
Speaker 2
43:47
And now we can talk a little bit about some of the ongoing research that we're working on. And I'll start that with maybe a controversial question, which is, could we possibly reduce all NLP tasks to essentially question answering tasks over some kind of input. And in some ways, that's a trivial observation, that you could do that. But it actually might help us to think of models that could take any kind of input, a question about that input, and try to produce an output sequence.
Speaker 2
44:22
So let me give you a couple of examples of what I mean by this. The first 1 here is a task that we would standardly associate with question answering. I'll give you a couple of facts. Mary walked to the bathroom.
Speaker 2
44:35
Sandra went to the garden. Daniel went back to the garden. Sandra took the milk there. Where's the milk?
Speaker 2
44:41
And now you might have to logically reason, try to find the sentence about milk, maybe Sandra took the milk there, and you would have to do coreference resolution and find out what there refers to, and then you try to find, you know, the previous sentence that mentions Sandra, see that it is the garden, and then give the answer garden. So this is a simple logical reasoning question answering task. And that's what most people in the QA field sort of associate with question answering. But we can also say, everybody's happy, and the question is, what's the sentiment?
Speaker 2
45:18
And the answer is positive. So this is a different subfield of NLP that tackles sentiment analysis. We can go further and ask, what are the named entities of a sentence like Jane has a baby in Dresden, and you want to find out that Jane is a person and Dresden is a location. This is an example of sequence tagging.
Speaker 2
45:39
You can even go as far and say, I think this model is incredible, and the question is, what's the translation into French? And you get, je pense que ce modèle est incroyable. And that, in some ways, would be phenomenal if we were able to actually tackle all these different kinds of tasks with the same kind of model. So maybe it would be an interesting new goal for NLP to try to develop a single joint model for general question answering.
Speaker 2
46:18
I think it would push us to think about new kinds of sequence models and new kinds of reasoning capabilities in an interesting way. Now, there are 2 major obstacles to actually achieving the single joint model for arbitrary QA tasks. The first 1 is that we don't even have a single model architecture that gets consistent state-of-the-art results across a variety of different tasks. So for instance, for question answering, this is a data set called bAbI that Facebook published last year.
Speaker 2
46:46
Strongly supervised memory networks get the state-of-the-art. For sentiment analysis, you had tree LSTM models developed by Kai-Sheng Tai here at Stanford last year, and for part of speech tagging, you might have bidirectional LSTM conditional random fields. 1 thing you do notice is all the current state-of-the-art methods are deep learning. Sometimes they still connect to other traditional methods like conditional random fields and undirected graphical models, but there's always some kind of deep learning component in them.
Speaker 2
47:22
So that is the first obstacle. The second 1 is that really fully joint multitask learning is very, very hard. Usually when we do do it, we restrict it to lower layers. So for instance, in natural language processing, all we're currently able to share in some principled way are word vectors.
Speaker 2
47:42
We take the same word vectors we trained, for instance, with GloVe or Word2vec, and we initialize our deep neural network sequence models with those word vectors. In computer vision, we're actually a little further ahead, and you're able to use multiple of the different layers. And you initialize a lot of your CNN models with a first pre-trained CNN that was pre-trained on ImageNet, for instance. Now, usually, people evaluate multitask learning with only 2 tasks.
Speaker 2
48:12
They train on a first task, and then they evaluate the model that they initialize from the first on the second task. But they often ignore how much the performance degrades on the original task. So when somebody takes an ImageNet CNN and applies it to a new problem, they rarely ever go back and say, how much did my accuracy actually decrease on the original data set. And furthermore, we usually only look at tasks that are actually related, and then we find, oh, look, there's some amazing transfer learning capability going on.
Speaker 2
48:42
What we don't look at often in the literature and in most people's work is that when the tasks aren't related to 1 another, they actually hurt each other. And this is so-called catastrophic forgetting. There's not too much work around that right now. Now, I also would like to say that right now, almost nobody uses the exact same decoder or classifier for a variety of different kinds of outputs.
Speaker 2
49:12
We at least replace the softmax to try to predict different kinds of problems. All right, so this is the second obstacle now. For now, we'll only tackle the first obstacle. And this is basically what motivated us to come up with dynamic memory networks.
Speaker 2
49:29
They're essentially an architecture to try to tackle arbitrary question answering tasks. When I'll talk about dynamic memory networks, it's important to note here that for each of the different tasks I'll talk about, it'll be a different dynamic memory network. It won't have the exact same weights. It'll just be the same general architecture.
Speaker 2
49:50
So the high-level idea for DMNs is as follows. Imagine you had to read a bunch of facts like these here. They're all very simple in and of themselves. But if I now ask you a question, I showed you these and I ask, where is Sandra?
Speaker 2
50:07
It would be very hard, even if you read all of them, it would be hard to remember. And so the idea here is that for complex questions, we might actually want to allow you to have multiple glances at the input. And just like I promised, 1 of our most important basic Lego blocks will be this GRU we just introduced in the previous section. Now, here's this whole model in all its gory details.
Speaker 2
50:37
And we'll dive into all of that in the next couple of slides. So don't worry. It's a big model. A couple of observations.
Speaker 2
50:45
So the first 1 is, I think we're moving in deep learning now to try to use more proper software engineering principles, basically to modularize, encapsulate certain capabilities, and then take those as basic Lego blocks and build more complex models on top of them. A lot of times, nowadays, you just have a CNN. That's like 1 little block in a complex paper, and then other things happen on top. Here, we'll have the GRU or word vectors, basically, as 1 module, a sub-module, in these different ones here.
Speaker 2
51:18
And I'm not even mentioning word vectors anymore. But word vectors still play a crucial role. And each of these words is essentially represented as this word vector, but we just kind of assume that it's there. OK, so let's walk on a very high level through this model.
Speaker 2
51:32
There are essentially 4 different modules. There's the input module, which will be a neural network sequence model, a GRU. There's a question module, an episodic memory module, and an answering module. And sometimes you also have these semantic memory modules here.
Speaker 2
51:48
But for now, these are really just our word vectors. And we'll ignore that for now. So let's go through this. Here is our corpus.
Speaker 2
51:55
And our question is, where is the football? And this is our input that should allow us to answer this question. Now, if I ask this question, I will essentially use the final representation of this question to learn to pay attention to the right kinds of inputs that seem relevant for given what I know to answer this question. So where's the football?
Speaker 2
52:18
Well, it would make sense to basically pay attention to all the sentences that mention football and maybe especially the last ones if the football moves around a lot. So what we'll observe here is that this last sentence will get a lot of attention. So John put down the football. And now, what we'll basically do is that this hidden state of this recurrent neural network model will be given as input to another recurrent neural network, because it seemed relevant to answer this current question at hand.
Speaker 2
52:49
Now, we'll basically agglomerate all these different facts that seem relevant at the time, in another GRU, into this final vector m. And now this vector m, together with the question, will be used to go over the inputs again if the model deems that it doesn't have enough information yet to answer the question. So if I ask you, where's the football, and it so far only found that John put down the football, you don't know enough. You still don't know where it is, but you now have a new fact, namely, John seems relevant to answer the question.
Speaker 2
53:18
And that fact is now represented in this vector m, which is also just the last hidden state of another recurrent neural network. Now we'll go over the inputs again. Now that we know that John and the football are relevant, we'll learn to pay attention to John moved to the bedroom, and John went to the hallway. Again, those are going to get agglomerated here in this recurrent neural network, and now the model thinks that it actually knows enough, because it basically intrinsically captured things about the football.
Speaker 2
53:54
John found a location, and so on. Of course, we didn't have to tell it anything about people or locations, or give it rules like, if x moves to y, and y is in the set of locations,
Speaker 2
54:05
None of that. You just give it a lot of stories like that, and in its hidden states, it will capture these kinds of patterns. So then we have the final vector m, And we'll give that to an answer module, which produces in our standard softmax way the answer. All right, now let's zoom into the different modules of this overall dynamic memory network architecture.
Speaker 2
54:28
The input, fortunately, is just a standard GRU, the way we defined it before. So simple word vectors, hidden states, reset gates, update gates, and so on. The question module is also just a GRU, a separate 1 with its own weights. And the final vector, Q, here, is just going to be the last hidden state of that recurrent neural network sequence model.
Speaker 2
54:56
Now, the interesting stuff happens in the episodic memory module, which is essentially a sort of meta-gated GRU, where this gate is defined and computed by the attention mechanism, and will basically say that this current sentence s i here seems to matter. And the superscript t is the episode that we have. So each episode basically means we're going over the input entirely 1 time. So it starts at g1 here.
Speaker 2
55:33
And what this basically will allow us to do is to say, well, if g is
Speaker 1
55:40
0,
Speaker 2
55:40
then what we'll do is basically just copy over the past states from the input. Nothing will happen. And unlike before in all these GRU equations, this G is just a single scalar number.
Speaker 2
55:52
It will basically say, if g is 0, then this sentence is completely irrelevant to my current question at hand. I can completely skip it, right? And there are lots of examples, like Mary travelled to the hallway, that are just completely irrelevant to answering the current question. In those cases, this g will be 0, and we're just copying the previous hidden state of this recurrent neural network over.
Speaker 2
56:19
Otherwise, we'll have a standard GRU model. So now, of course, the big question is how do we compute this g? And this might look a little ugly, but it's quite simple. Basically, we're going to compute a couple of vector similarities, a multiplicative 1 and an additive 1 with absolute values of the element-wise differences, between the sentence vector that we currently have, the question vector, and the memory state from the previous pass over the input.
Speaker 2
56:48
On the first pass over the input, the memory state is initialized to be just the question. And then afterwards, it agglomerates relevant facts. So intuitively here, if the sentence mentions football, for instance, and the question is, where's the football, then you'd hope that the question vector q has some units that are more active because football was mentioned, and the sentence vector also has some units that are more active because football is mentioned.
Speaker 2
57:17
And hence, some of these inner products, or absolute values of subtractions, are going to be large. And then what we're going to do is just plug that into a standard single layer neural network and then a standard linear layer here. And then we apply a softmax to essentially weight all of these different potential sentences that we might have, to compute the final gate. So this will basically be a soft attention mechanism that sums to 1 and will pay most attention to the facts that seem most relevant, given what I know so far and the question.
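Putting those pieces together, a simplified sketch of the episodic memory update and the attention gate, with s_i the sentence vector, q the question vector, and m^{t-1} the memory from the previous pass (the paper's feature vector z has a few more terms than shown here):

```latex
h_i^{t} = g_i^{t} \, \mathrm{GRU}\left(s_i, h_{i-1}^{t}\right) + \left(1 - g_i^{t}\right) h_{i-1}^{t}
z\left(s_i, m^{t-1}, q\right) = \left[\, s_i \circ q ;\; s_i \circ m^{t-1} ;\; \left|s_i - q\right| ;\; \left|s_i - m^{t-1}\right| \,\right]
g_i^{t} = \mathrm{softmax}_i\left( W^{(2)} \tanh\left( W^{(1)} z\left(s_i, m^{t-1}, q\right) + b^{(1)} \right) + b^{(2)} \right)
```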
Speaker 2
57:55
Then when the end of the input is reached, all these relevant facts here are summarized in another GRU that basically moves up here. And you can also train a classifier, if you have the right kind of supervision, to decide that the model knows enough to actually answer the question and stop iterating over the inputs. If you don't have that kind of supervision, you can also just say, I will go over the inputs a fixed number of times. And that works reasonably well, too.
Speaker 2
58:26
All right, there's a lot to sink in. So I'll give you a couple of seconds. Basically, we pay attention to different facts given a certain question. We iterate over the input multiple times.
Speaker 2
58:38
And we agglomerate the facts that seem relevant given the current knowledge and the question. Now, I don't usually talk about neuroscience. I'm not a neuroscientist, but there is a very interesting relationship here that a friend of mine, Sam Gershman, pointed out, which is that the episodic memory in general for humans is actually the memory of autobiographical events. So it's the time when we remember the first time we went to school or something like that.
Speaker 2
59:05
And it's essentially a collection of our past personal experiences that occurred at a particular time in a particular place. And just like our episodic memory can be triggered by a variety of different inputs, this episodic memory module is also triggered by the specific question at hand. And what's also interesting is that the hippocampus, which is the seat of episodic memory in humans, is actually active during transitive inference. So transitive inference is going from A to B to C to have some connection from A to C.
Speaker 2
59:36
Or in this case here, with this football, for instance, you first had to find facts about John and the football, and then find where John was, the location of John. So those are examples of transitive inference. And it turns out that you also need these multiple passes in the DMN to enable the capability to do transitive inference.