Natural Language Processing deals with the interaction between machines and humans through natural language. The field comprises three main areas: Speech Recognition, Natural Language Understanding, and Natural Language Generation.
In this article, our overview covers only the second and third areas, assuming we already have data, such as a large written corpus, ready to be processed.
The challenge of enabling computers to analyze, process, and respond appropriately to an external language stimulus is far from easy; just a few years ago, the industry could perform comparably to human beings only on a small set of sub-tasks.
This broad area has been segmented into several subfields, each addressing a more specific task.
Some of them have been around since the early days of the computer revolution and were already achieving decent results. Others arose only later, along with Deep Learning techniques. Either way, the entire field has gained a huge boost in performance thanks to the Deep Learning era.
The part of the NLP domain where Deep Learning is currently applied can be divided into ten smaller task-related areas, each grouping together closely related tasks:
Question Answering
Machine Translation
Dialogue
Summarization
Natural Language Inference
Sentiment Classification
Semantic Role Labeling
Relation Extraction
Semantic Parsing
Commonsense Reasoning
Let’s break them down one at a time, and highlight some of the key concepts and aspects.
We assume at least basic knowledge of what a Neural Network and a Recurrent Neural Network are. You may not be familiar with what I will refer to as the “architecture split” in each section: it’s simply a quick way to describe the length of the input the model is fed, compared to the number of outputs the model produces. Moreover, the main ideas and concepts adopted by the models will be explained wherever they haven’t been introduced yet.
Question Answering
The model takes a passage or paragraph and a question about something contained in it, and aims to answer correctly. The question can be either open-domain or not. The baseline for this task is the Bi-LSTM, short for Bidirectional Long Short-Term Memory, a variant of the Recurrent NN that is better at letting information from earlier time steps flow through the model, thus reducing the tendency of a vanilla RNN to be influenced at a given time step mainly by the closest previous time steps. The Bidirectional component lets the model process the sequence in both temporal orders, forward and backward. In terms of inputs and outputs, the architecture is many to many, with input and output lengths that aren’t equal. A key feature to adopt in order to reach acceptable performance is the Attention mechanism, which lets the model focus on different parts of the source text when generating the answer.
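As a minimal sketch, here is what a bidirectional LSTM encoder over a tokenized passage might look like in PyTorch. The vocabulary size, embedding dimension, and hidden dimension below are arbitrary placeholders, not values from any specific system:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encode a token sequence with a bidirectional LSTM (illustrative sketch)."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)   # outputs: (batch, seq_len, 2 * hidden_dim)
        return outputs                              # one vector per token, forward + backward views

# Toy usage: a "passage" of 12 token ids for a batch of 2
encoder = BiLSTMEncoder()
passage = torch.randint(0, 10000, (2, 12))
print(encoder(passage).shape)  # torch.Size([2, 12, 512])
```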
The evaluation metrics for this task vary depending on the dataset. Nonetheless, some common ones are EM (exact match) and the F1 score.
Currently, leaderboards are ruled by Transformer-based models. The Transformer architecture takes the Attention idea even further, applying the same mechanism not only in the answer-generation phase but also in the reading-comprehension phase: the model can learn better dependencies between words in the source text, producing a better understanding of the passage.
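To make the idea concrete, here is a minimal sketch of the scaled dot-product self-attention at the core of the Transformer, written with plain PyTorch tensor operations. The dimensions are illustrative, and a real Transformer would use learned projections for the queries, keys, and values:

```python
import math
import torch

def self_attention(x):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    d_model = x.size(-1)
    # In a real Transformer, Q, K, V come from learned linear projections of x;
    # here we use x directly to keep the sketch minimal.
    q, k, v = x, x, x
    scores = q @ k.T / math.sqrt(d_model)     # (seq_len, seq_len) pairwise similarities
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v                        # every position becomes a weighted mix of all positions

tokens = torch.randn(5, 8)                 # 5 "tokens", each an 8-dimensional vector
print(self_attention(tokens).shape)        # torch.Size([5, 8])
```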
Machine Translation

This task is as simple to explain as it is complex to accomplish, and it has been a benchmark for NLP since the very beginning. The model aims to output the translation into another language of a specific sentence given as input. The baseline is the Bi-LSTM, which we’ve already mentioned. The architecture split is many to many, with input and output of different lengths. A key feature to adopt is, again, the Attention mechanism. Another concept worth introducing is the Encoder/Decoder framework: the structure has two main parts. The Encoder comes first; it takes the source text as input and is responsible for outputting a representation of the text, an understanding of the input sequence. This information is then fed into the remaining part of the model, the Decoder, which takes the Encoder output as its initial input and tries to generate the first word of the translation. The procedure is then repeated: at every step the Decoder takes the sentence predicted so far as its new input and outputs the next word, so the output sentence grows one word longer at each iteration, until the model produces a stop signal, meaning that the generation process is complete.
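A minimal sketch of this Encoder/Decoder loop in PyTorch is shown below. Everything here is an illustrative assumption rather than a production translation system: toy GRU components, arbitrary vocabulary and hidden sizes, assumed start/stop token ids, and plain greedy word-by-word decoding.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128
SOS, EOS = 1, 2  # assumed ids for the start and stop signals

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(VOCAB, EMB)
        self.tgt_emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def translate(self, src_ids, max_len=20):
        # Encoder: read the whole source sentence and summarize it in a hidden state.
        _, hidden = self.encoder(self.src_emb(src_ids))
        # Decoder: start from that summary and emit one word at a time (greedy).
        word = torch.tensor([[SOS]])
        output = []
        for _ in range(max_len):
            dec_out, hidden = self.decoder(self.tgt_emb(word), hidden)
            word = self.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
            if word.item() == EOS:   # stop signal: generation is complete
                break
            output.append(word.item())
        return output

model = Seq2Seq()
print(model.translate(torch.randint(3, VOCAB, (1, 7))))  # untrained, so the output ids are random
```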
The evaluation metric for this task is purpose-built and is called BLEU (BiLingual Evaluation Understudy).
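As a quick illustration, BLEU can be computed for a single sentence pair with NLTK; the sentences below are made-up examples, and the smoothing function is just one of the options the library offers for short sentences:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of tokenized reference translations
candidate = ["the", "cat", "sits", "on", "the", "mat"]   # tokenized model output

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```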
State-of-the-art performance is still achieved by Transformer-based models.
Dialogue
The Dialogue field is probably the widest, spanning a large variety of subtasks. Basically, every type of conversation, in whatever context and on whatever subject, could be grouped under the dialogue area. Holding a conversation with a counterpart is all the model is required to do. Sounds easy, right? Actually, it is absolutely not! Some of the most common deficiencies you might encounter are worth mentioning: genericness, irrelevant responses, repetition, lack of context, and lack of a consistent persona. Those are just some of the pitfalls to avoid when dealing with NLP models, especially in a Dialogue-related environment.
The baseline is the Bi-LSTM, and the architecture split is many to many, with input and output of different lengths. Key features for this task are Attention and the Encoder/Decoder framework, which we have already discussed.
The evaluation metric usually varies depending on the dataset. The F1 score and Perplexity are some of those you might see. Be aware that in the Dialogue field, human evaluation is often used as the main metric when automatic measures aren’t informative enough.
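Perplexity is simply the exponential of the average negative log-likelihood the model assigns to the reference words. A tiny self-contained example, with made-up probabilities, looks like this:

```python
import math

# Probabilities the model assigned to each gold word of a response (made-up numbers).
word_probs = [0.25, 0.10, 0.40, 0.05, 0.30]

avg_neg_log_likelihood = -sum(math.log(p) for p in word_probs) / len(word_probs)
perplexity = math.exp(avg_neg_log_likelihood)
print(f"Perplexity: {perplexity:.2f}")  # lower is better
```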
Summarization

The last task in the Language Generation domain is Summarization. Given an input text, the model aims to produce a summary as output, which is shorter and contains the main information of the input text. The baseline is still the Bi-LSTM, and the key features are Attention and the Encoder/Decoder structure, as in the other Language Modeling tasks. The architecture split is many to many, with different input and output lengths.
A special evaluation metric is used and is called ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
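To give an intuition, ROUGE-1 recall simply measures how many of the reference summary’s unigrams also appear in the generated summary. Here is a hand-rolled sketch with made-up sentences (the real ROUGE suite also covers bigrams, longest common subsequences, and more):

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate (with clipped counts)."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())

generated = "the economy grew quickly last year"
reference = "the economy grew fast last year"
print(f"ROUGE-1 recall: {rouge1_recall(generated, reference):.2f}")  # 0.83
```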
Here too, the top places in the leaderboards are held by Transformer-based models.
Leaving the Language Modeling world, we jump to the next block, which contains two task areas that don’t involve Language Modeling but still hadn’t achieved satisfactory results before Deep Learning techniques arrived.
Natural Language Inference
Given a premise, the model aims to determine whether a hypothesis is true, false, or undetermined.
The baseline model is the Bi-LSTM. The architecture split is many to one, meaning we have many inputs (the tokens of the sentence) and only one output (a label indicating entailment, contradiction, or neutral).
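A minimal sketch of this many-to-one setup in PyTorch: the premise and hypothesis are concatenated into one token sequence, encoded with a Bi-LSTM, and the final hidden states are mapped to the three labels. All sizes below are illustrative placeholders:

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    """Many to one: a whole token sequence in, a single 3-way label out (sketch)."""
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_labels=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)  # entailment / contradiction / neutral

    def forward(self, token_ids):
        _, (h_n, _) = self.encoder(self.embedding(token_ids))
        # h_n holds the last hidden state of the forward and backward directions.
        sentence_vec = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.classifier(sentence_vec)  # (batch, 3) label scores

premise_and_hypothesis = torch.randint(0, 10000, (1, 20))  # concatenated token ids
print(NLIClassifier()(premise_and_hypothesis).shape)        # torch.Size([1, 3])
```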
The evaluation metric used is accuracy.

Sentiment Classification
It consists of classifying the polarity of a given text, or more generally of classifying input text into some predefined classes. The baseline for this task can be either a vanilla RNN or a Convolutional NN. The CNN is a neural network architecture born in the Computer Vision world. A convolution is the mathematical transformation that sits between CNN layers. The peculiarity of a convolution is being able to detect features regardless of their absolute position in the sentence: what makes a certain feature relevant is the local pattern itself, not where it occurs in the sentence.
This approach can sometimes be very effective in sentiment classification, where the model aims to extract features capturing the overall sentiment of the text, and very similar expressions convey the same sentiment even though they appear in different positions in the phrase.
Architecture split is many to one.
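A minimal sketch of a convolutional text classifier in PyTorch: 1-D filters slide over the word embeddings, and global max-pooling keeps the strongest activation of each filter wherever it occurred, which is exactly the position-independence described above. The sizes and class count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CNNTextClassifier(nn.Module):
    """Many to one: position-independent convolutional features pooled into a polarity label (sketch)."""
    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=64, kernel_size=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)  # filters slide over word positions
        self.classifier = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len) for Conv1d
        features = torch.relu(self.conv(x))            # (batch, num_filters, seq_len - kernel_size + 1)
        pooled = features.max(dim=-1).values           # keep each filter's strongest match, wherever it is
        return self.classifier(pooled)                 # (batch, num_classes) polarity scores

review = torch.randint(0, 10000, (1, 15))   # 15 token ids for one review
print(CNNTextClassifier()(review).shape)    # torch.Size([1, 2])
```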
The evaluation metric used is accuracy.
As you might imagine, for this task too the leaderboards are ruled by Transformer-based models.
The next block covers branches that have been present in NLP from the earliest times and already managed to get good results thanks to rule-based algorithms and hand-crafted features. It has two large task areas, and many crucial sub-tasks, sometimes considered preprocessing steps in NLP, tend to be grouped within them.
Semantic Role Labeling

This task aims to model the predicate-argument structure of a sentence and is often described as answering “Who did what to whom”. The baseline is the Bi-LSTM. The architecture split is many to many, where input and output lengths are exactly the same.
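In practice, the many-to-many split with equal lengths means predicting one role label per input token. Here is a hedged sketch of such a tagger in PyTorch, where the label set size and dimensions are placeholders:

```python
import torch
import torch.nn as nn

class SRLTagger(nn.Module):
    """Many to many with equal lengths: one semantic-role label per input token (sketch)."""
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_roles=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.role_scorer = nn.Linear(2 * hidden_dim, num_roles)  # e.g. agent, patient, instrument...

    def forward(self, token_ids):
        outputs, _ = self.encoder(self.embedding(token_ids))  # one vector per token
        return self.role_scorer(outputs)                      # (batch, seq_len, num_roles)

sentence = torch.randint(0, 10000, (1, 9))   # a sentence as 9 token ids
print(SRLTagger()(sentence).shape)           # torch.Size([1, 9, 20]) — a label score per token
```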
The evaluation metric is accuracy.
Relation Extraction
This is the task of extracting semantic relationships from a text. Extracted relationships usually occur between two or more entities of a certain type (e.g. Person, Organization, Location) and fall into a number of semantic categories (e.g. married to, employed by, lives in).

Bi-LSTM and CNN are the baselines for the task. Architecture split is many to many with an equal number of inputs and outputs.
The evaluation metric is the F1 score.
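For reference, the F1 score is the harmonic mean of precision and recall over the extracted relations. A tiny worked example with made-up counts:

```python
# Made-up counts for extracted relations on a toy test set.
true_positives, false_positives, false_negatives = 40, 10, 20

precision = true_positives / (true_positives + false_positives)  # 0.80
recall = true_positives / (true_positives + false_negatives)     # ~0.67
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision={precision:.2f}  Recall={recall:.2f}  F1={f1:.2f}")
```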
The last block comprises two complex tasks that, before the rise of Deep Learning, couldn’t be tackled with end-to-end models.
Semantic Parsing
Semantic Parsing is the task of translating natural language into a logical form: a machine-understandable representation of its meaning. Representations may be an executable language such as SQL, or more abstract ones such as Abstract Meaning Representation (AMR). The baseline is the Bi-LSTM; the architecture split is many to many, with different input and output lengths.

The metric used for performance evaluation is accuracy.
Commonsense Reasoning
This innovative task requires the model to use what humans would call common sense. This involves the ability to go beyond patterns in the text and, in principle, to rely on information that isn’t contained in the actual input text at all. To do that, the model needs to acquire a certain knowledge of the surrounding world, which then becomes permanent. When processing new text, the model is expected to use this general knowledge to draw conclusions and predict outcomes, including judgments about the physical properties, purpose, intentions, and behavior of people and objects, as well as the possible outcomes of their actions and interactions.

The baseline is the Bi-LSTM, and the architecture split is many to many, with input and output of different lengths.
Typically, when the model tries to produce different outputs such as purpose and intention, it uses multiple Encoder/Decoder structures.
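A hedged sketch of what “multiple Encoder/Decoder structures” can look like in PyTorch: one shared encoder whose representation feeds several decoders, one per kind of output. The output types (“purpose”, “intention”), sizes, and GRU components here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiHeadSeq2Seq(nn.Module):
    """One shared encoder, several decoders — e.g. one for 'purpose', one for 'intention' (sketch)."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # One decoder per output type, all initialized from the same encoder state.
        self.decoders = nn.ModuleDict({
            "purpose": nn.GRU(embed_dim, hidden_dim, batch_first=True),
            "intention": nn.GRU(embed_dim, hidden_dim, batch_first=True),
        })
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, hidden = self.encoder(self.embedding(src_ids))
        tgt = self.embedding(tgt_ids)
        # Each head decodes its own output sequence from the shared understanding of the input.
        return {name: self.out(dec(tgt, hidden)[0]) for name, dec in self.decoders.items()}

src = torch.randint(0, 1000, (1, 10))
tgt = torch.randint(0, 1000, (1, 6))
outputs = MultiHeadSeq2Seq()(src, tgt)
print({k: v.shape for k, v in outputs.items()})  # each head: torch.Size([1, 6, 1000])
```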
The evaluation metric is chosen depending on the dataset and on the specific sub-task the model tries to accomplish.
Deep Learning has represented a real breakthrough for the entire NLP field, and especially for certain areas. At the moment we are capable of achieving human-like performance on the vast majority of these tasks. New benchmarks are set at an incredible pace, so much so that the community can hardly keep up with them. As a result, researchers and practitioners are progressing and taking advantage of that, and this is extremely beneficial for everyone.
So, why has deep learning been so successful lately? It may be attributed to four main factors: data abundance and accessibility, the ability of neural networks to scale up in size, functional frameworks, and a huge amount of computational power. Considering that all of these factors are only increasing and improving, it is reasonable to think that this field still has room to grow and that performance will increase even more in the future.