Before the Deep Learning Era

Back in the days before the deep learning era, when a neural network was more of a scary, enigmatic mathematical curiosity than a powerful Machine Learning or Artificial Intelligence tool, there were surprisingly many relatively successful applications of classical data mining algorithms in the Natural Language Processing (NLP) domain. It seemed that problems like spam filtering or part-of-speech tagging could be solved using rather straightforward and understandable models.

But not every problem can be solved this way. Simple models fail to adequately capture linguistic subtleties like context, idioms, or irony (though humans often fail at that one too). Algorithms based on overall summarization (e.g., bag-of-words) turned out not to be powerful enough to capture the sequential nature of text data, whereas n-grams struggled to model general context and suffered severely from the curse of dimensionality. Even HMM-based models had trouble overcoming these issues due to their Markovian nature (that is, their memorylessness). Of course, these methods were also used to tackle more complex NLP tasks, but without great success.
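To make these two limitations concrete, here is a minimal sketch in plain Python. The helper names (`bag_of_words`, `ngrams`), the example sentences, and the vocabulary size of 10,000 are assumptions chosen purely for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding word order entirely."""
    return Counter(text.lower().split())

def ngrams(text, n):
    """Return the list of consecutive n-word tuples in the text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "the dog bit the man"
b = "the man bit the dog"

# Bag-of-words cannot tell the two sentences apart.
same_bag = bag_of_words(a) == bag_of_words(b)

# Bigrams do see the difference, but only within a two-word window.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

# The number of possible n-grams grows as vocabulary_size ** n.
vocabulary_size = 10_000  # hypothetical vocabulary size
possible_trigrams = vocabulary_size ** 3

print(same_bag, same_bigrams, possible_trigrams)
```

The two sentences mean opposite things yet share one bag of words, while the trigram feature space for even a modest vocabulary already has a trillion possible entries.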

First breakthrough - Word2Vec

The NLP field saw its first major leap forward in the form of a semantically rich representation of words, an accomplishment enabled by the application of neural networks. Prior to this, the most common representation was so-called one-hot encoding, where each word is transformed into a unique binary vector with only one non-zero entry. This approach suffered greatly from sparsity and didn't take into account the meaning of particular words at all.
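A tiny sketch of one-hot encoding shows why it carries no meaning. The four-word vocabulary and the helper names `one_hot` and `dot` are assumptions made up for this example.

```python
def one_hot(word, vocabulary):
    """Return a binary vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(u, v))

vocabulary = ["cat", "dog", "car", "truck"]  # toy vocabulary (assumption)
cat, dog, car = (one_hot(w, vocabulary) for w in ("cat", "dog", "car"))

# Any two distinct words are orthogonal, so "cat" is no closer to "dog"
# than it is to "car".
print(cat, dot(cat, dog), dot(cat, car))
```

With a realistic vocabulary, each vector would have tens of thousands of entries and exactly one of them would be non-zero, which is the sparsity problem mentioned above.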

Figure 1: Word2Vec representations of words projected onto a two-dimensional space.

Instead, imagine starting with a string of words, removing the middle one, and having a neural network predict the surrounding context given that middle word (skip-gram). Alternatively, ask it to predict the center word based on its context (Continuous Bag of Words, CBOW). Of course, such a model is of little use on its own, but it turns out that, as a side effect, it produces a surprisingly powerful vector representation that preserves the semantic structure of words.
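The two training setups differ only in which side is the input and which is the target. The sketch below builds the training examples for each; it does not train a network, and the function names and the `window` parameter are assumptions for illustration rather than the API of any Word2Vec library.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: predict each neighbor from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: predict the center from context."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Words that appear in similar contexts end up with similar training targets, which is roughly why the learned vectors place them close together.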

Further improvements

Even though the powerful new Word2Vec representation boosted the performance of many classical algorithms, there was still a need for a solution capable of capturing sequential dependencies in a text (both long- and short-term). The first approach to this problem was the so-called vanilla Recurrent Neural Network (RNN). Vanilla RNNs take advantage of the temporal nature of data by feeding words to the network sequentially while using the information about previous words stored in a hidden state.


Figure 2: A recurrent neural network. Image courtesy of Colah's excellent post on LSTMs.
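The recurrence described above can be sketched in a few lines of plain Python. This is a forward pass only, with random untrained weights; the names `rnn_step` and `run_rnn`, the hidden size of 3, and the toy two-dimensional word vectors are all assumptions for illustration.

```python
import math
import random

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One step of the recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(w * v for w, v in zip(W_xh[i], x))
        total += sum(w * v for w, v in zip(W_hh[i], h_prev))
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, hidden_size=3, seed=0):
    """Feed a sequence of input vectors through the RNN, one at a time."""
    rng = random.Random(seed)  # random, untrained weights (illustration only)
    input_size = len(inputs[0])
    W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
            for _ in range(hidden_size)]
    W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
            for _ in range(hidden_size)]
    b_h = [0.0] * hidden_size

    h = [0.0] * hidden_size  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)  # the same weights are reused at every step
        states.append(h)
    return states

word_a, word_b = [1.0, 0.0], [0.0, 1.0]  # toy word vectors
forward = run_rnn([word_a, word_b])
backward = run_rnn([word_b, word_a])
print(forward[-1])
print(backward[-1])
```

Because each hidden state depends on the previous one, feeding the same two words in a different order yields a different final state, something a bag-of-words model cannot express.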

It turned out that these networks handled local dependencies very well, but were difficult to train due to the vanishing gradient problem. To address this issue, computer scientists and machine learning researchers developed a new network topology called long short-term memory (LSTM). LSTM handles the problem by introducing special units in the network called memory cells. This sophisticated mechanism allows for finding longer patterns without a significant increase in the number of parameters. Many popular architectures are variations of LSTM, such as mLSTM or GRU; the latter, thanks to an intelligent simplification of the memory cell update mechanism, significantly decreases the number of parameters needed.
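For a feel of what a memory cell does, here is a minimal sketch of a single LSTM step using the commonly cited gate equations (forget, input, and output gates plus a candidate value). The `params` layout, the helper names, and the hand-picked biases are assumptions for this example; a real network would learn these weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps a gate name to (W, b) acting on [h_prev, x]."""
    z = h_prev + x  # list concatenation of previous hidden state and input
    f = [sigmoid(a) for a in affine(*params["forget"], z)]       # what to keep
    i = [sigmoid(a) for a in affine(*params["input"], z)]        # what to write
    o = [sigmoid(a) for a in affine(*params["output"], z)]       # what to expose
    g = [math.tanh(a) for a in affine(*params["candidate"], z)]  # new content
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hand-picked parameters (assumption): forget gate almost fully open,
# input gate almost fully closed, so the cell should simply remember.
zeros = [[0.0, 0.0]]
params = {
    "forget": (zeros, [10.0]),
    "input": (zeros, [-10.0]),
    "output": (zeros, [0.0]),
    "candidate": (zeros, [0.0]),
}

h, c = [0.0], [1.0]
for _ in range(50):
    h, c = lstm_step([1.0], h, c, params)
print(c)  # still close to the initial value of 1.0 after 50 steps
```

The cell state is updated additively and scaled only by the forget gate, so information (and gradient) can survive many time steps when that gate stays close to 1.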

After the astounding success of Convolutional Neural Networks in Computer Vision, it was only a matter of time before they were incorporated into NLP. Today, 1D convolutions are popular building blocks of many successful applications, including semantic segmentation, fast machine translation, and a general sequence-to-sequence learning framework that beats recurrent networks and can be trained an order of magnitude faster thanks to easier parallelization.
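A 1D convolution over text slides a small filter across consecutive word vectors. The sketch below is a single-filter, "valid" convolution in plain Python (computed as cross-correlation, as deep learning libraries usually do); the function name `conv1d` and the toy embeddings are assumptions for illustration.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide a filter over a sequence of vectors.

    sequence: list of T vectors of dimension d (e.g. word embeddings)
    kernel:   list of k vectors of dimension d (the filter weights)
    returns:  list of T - k + 1 scalars, one per window of k words
    """
    k = len(kernel)
    outputs = []
    for t in range(len(sequence) - k + 1):
        window = sequence[t:t + k]
        total = bias
        for vec, weights in zip(window, kernel):
            total += sum(v * w for v, w in zip(vec, weights))
        outputs.append(total)
    return outputs

embeddings = [[1, 0], [0, 1], [1, 1]]  # three toy 2-dimensional word vectors
bigram_filter = [[1, 0], [0, 1]]       # a filter spanning two words
print(conv1d(embeddings, bigram_filter))
```

Each output depends only on its own window, not on the previous output, so all positions can be computed in parallel. A recurrent network, by contrast, has to process the words one after another.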

👀 Convolutional Neural Networks were first used to solve Computer Vision problems and remain state-of-the-art in that space. Learn more about their applications and capabilities here.

Typical NLP problems

There are a variety of language tasks that, while simple and second-nature to humans, are very difficult for a machine. The difficulty is mostly due to linguistic nuances like irony and idioms. Let's take a look at some of the areas of NLP that researchers are trying to tackle (roughly in order of their complexity):

The most common and possibly easiest one is sentiment analysis. This is, essentially, determining the attitude or emotional reaction of a speaker/writer toward a particular topic (or in general). Possible sentiments are positive, neutral, and negative. Check out this great article about using Deep Convolutional Neural Networks for gauging sentiment in tweets. Another interesting experiment showed that a Deep Recurrent Net could learn sentiment by accident.

Figure 3: Activation of a neuron from a network trained to generate the next character of text. It clearly learned the sentiment even though it was trained in an entirely unsupervised setting.

A natural generalization of the previous case is document classification, where instead of assigning one of three possible flags to each article, we solve an ordinary classification problem. According to a comprehensive comparison of algorithms, it is safe to say that Deep Learning is the way to go for text classification.

Now, we move on to the real deal: Machine Translation. Machine translation has posed a serious challenge for quite some time. It is important to understand that this is an entirely different task from the two we've discussed so far. For this task, we require a model to predict a sequence of words instead of a label. Machine translation makes clear what all the fuss about Deep Learning is, as it has been an incredible breakthrough when it comes to sequential data. In this blog post you can read more about how (yep, you guessed it) Recurrent Neural Networks tackle translation, and in this one you can learn about how they achieve state-of-the-art results.

Say you need an automatic text summarization model, and you want it to extract only the most important parts of a text while preserving all of the meaning. This requires an algorithm that can understand the entire text while focusing on the specific parts that carry most of the meaning. This problem is neatly addressed by attention mechanisms, which can be introduced as modules inside an end-to-end solution.
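The idea of "focusing on specific parts" can be sketched as a weighted average. The example below uses dot-product scoring followed by a softmax, which is one common variant and not necessarily the one used in the linked solution; the function names and toy vectors are assumptions for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score each position against the query, then average the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, context

# Three toy positions in a text; here keys and values are the same vectors.
keys = values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 5.0]  # a query that matches the second position best

weights, context = attend(query, keys, values)
print(weights)
print(context)
```

The weights show where the model "looks": most of the mass lands on the second position, and the resulting context vector is dominated by that position's value.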

Lastly, there is question answering, which comes as close to Artificial Intelligence as you can get. For this task, the model not only needs to understand a question, but is also required to have a full understanding of the text of interest and to know exactly where to look to produce an answer. For a detailed explanation of a question answering solution (using Deep Learning, of course), check out this article.

Figure 4: Beautiful visualization of an attention mechanism in a recurrent neural network trained to translate English to French.

Since Deep Learning offers vector representations for various kinds of data (e.g., text and images), you can build models to specialize in different domains. This is how researchers came up with visual question answering. The task is "trivial": Just answer a question about an image. Sounds like a job for a 7-year-old, right? Nonetheless, deep models are the first to produce any reasonable results without human supervision. Results and a description of such a model are in this paper.

🍔 🍳 🍟 Starving for applications? Get your hands dirty and implement your NLP chatbot using LSTMs.

Recap

So, now you know. Deep Learning appeared in NLP relatively recently due to computational issues, and we needed to learn much more about Deep Neural Networks to understand their capabilities. But once we did, it changed the game forever.