A couple of years ago Google released a groundbreaking model that revolutionized the way language is represented in Natural Language Processing (NLP). This new creature goes by the name of BERT (Bidirectional Encoder Representations from Transformers) and can be seen as the successor of another Muppet-named model, ELMo. In less than two years BERT has inspired a large number of researchers, leading the NLP community to study and analyze the model in depth. The numerous papers published on the topic gave birth to a new area of research known as BERTology.
BERT’s anatomy
The BERT model is a neural network with millions of parameters that aims to grasp as much semantic and syntactic knowledge as possible from raw text and use it to perform specific language-related tasks.
BERT receives sentences as input and returns real-valued vectors as output, which can in turn be fed into a classifier as highly informative features.
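To make this concrete, here is a minimal feature-extraction sketch, assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint (the sentence and variable names are purely illustrative):

```python
# A minimal sketch of feature extraction with a pre-trained BERT model,
# assuming the Hugging Face `transformers` library is installed.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "BERTology is the study of how BERT works."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; the [CLS] vector is commonly used
# as a sentence-level representation fed to a downstream classifier.
token_vectors = outputs.last_hidden_state   # shape: (1, seq_len, 768)
sentence_vector = token_vectors[:, 0, :]    # the [CLS] token
print(sentence_vector.shape)                # torch.Size([1, 768])
```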
The novelties introduced by this model, which coincide with its strengths, can be summarized in two points:
- BERT adopts a two-phase training. In phase 1, called pre-training, the network is trained on a couple of general-purpose tasks (namely masked language modeling and next sentence prediction) in an unsupervised fashion, receiving a huge amount of raw text as input data. In phase 2, called fine-tuning, the same network is trained on a specific task (e.g. question answering, sentiment analysis, etc.) using a supervised approach instead, receiving a relatively small amount of labeled text as data. This double training allows BERT to learn both general semantics (in the pre-training phase) and task-specific knowledge of a language (in the fine-tuning step). It is important to underline that pre-training is a long and computationally expensive operation, since it requires a huge amount of data in order to learn meaningful language patterns. Once pre-training is complete, the model is stored, ready to be fine-tuned for any supervised task whenever necessary; this second phase requires a much lower computational effort (a minimal fine-tuning sketch is shown right after this list).

- The layers of BERT’s network are composed of Transformer blocks. This technology, published in the famous paper Attention Is All You Need, relies on the so-called Attention mechanism to provide high-quality output vectors. The basic idea underlying Attention and Transformers consists of taking the whole sentence into account while processing a term in it: the more a word in the sentence is related to the processed term, the more it influences the output vector. Although Recurrent Neural Networks could already capture such context, the Attention mechanism considers the entire sentence at once, making the computation of the output vectors faster and parallelizable.
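To give an idea of what happens inside a Transformer block, here is a sketch of scaled dot-product self-attention written in plain PyTorch; the shapes, matrices and function name are illustrative assumptions, not BERT's actual implementation (which uses multiple attention heads and learned projections):

```python
# A minimal sketch of scaled dot-product self-attention, the core of a
# Transformer block. Shapes and variable names are illustrative only.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values for every token
    scores = q @ k.T / math.sqrt(k.shape[-1])  # how strongly each token relates to every other
    weights = torch.softmax(scores, dim=-1)    # attention weights over the whole sentence
    return weights @ v                         # each output vector is a weighted mix of all tokens

seq_len, d_model, d_k = 6, 16, 16
x = torch.randn(seq_len, d_model)              # one embedding per token
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                               # torch.Size([6, 16])
```

Note that the whole sentence is processed in a handful of matrix products, which is precisely why Attention parallelizes well where Recurrent Neural Networks must proceed token by token.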

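Returning to the two-phase training described in the first point, a minimal fine-tuning sketch could look like the following. It assumes the Hugging Face transformers library, a pre-trained bert-base-uncased checkpoint, and a toy two-example "dataset" with placeholder hyperparameters:

```python
# A minimal fine-tuning sketch (phase 2): a pre-trained BERT body plus a
# classification head, trained on a small labeled dataset.
# The data and hyperparameters below are toy placeholders.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I loved this movie", "This was a waste of time"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):                      # a real run would loop over many batches
    outputs = model(**batch, labels=labels) # the classification head computes the loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Only this relatively cheap supervised step needs to be repeated for each new task; the expensive unsupervised pre-training is done once and reused.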
The BERTology world
Articles contributing to the BERTology field can be divided into two spheres. The first set of publications focuses on understanding which aspects of language BERT is able to learn. To outline some of the results gathered so far: it has been shown that the model can encode information about entity types and semantic roles but has problems representing numbers. Moreover, BERT seems able to learn some common-sense properties of objects (e.g. houses are big and people can walk into houses) but cannot infer relationships between objects (e.g. it does not know that houses are bigger than people). For an extensive overview of these matters, I would suggest reading the paper A Primer in BERTology: What we know about how BERT works.
The second set of articles proposes novel models that improve on BERT’s performance while still relying on the novelties highlighted above (namely two-phase training and the Transformer technology). Amongst these it is worth mentioning:
- RoBERTa, a BERT network with optimized hyperparameters, pre-trained on a bigger corpus. Thanks to these relatively minor changes, RoBERTa outperformed BERT and achieved state-of-the-art results on 4 tasks of the GLUE benchmark.
- XLNet, a model that addresses the limitations of BERT’s pre-training task. In particular, XLNet avoids the token corruption used in masked language modeling by training on permutations of the input sequence instead. With this trick, XLNet outperformed BERT on 20 tasks, including question answering and natural language inference.
- DistilBERT, a lightweight BERT network (with fewer layers and parameters) that achieves performance comparable to the original BERT architecture thanks to the knowledge distillation technique used during pre-training.
- Electra, a model that proposes a sample-efficient pre-training task (denoted replaced token detection) as an alternative to masked language modeling. In replaced token detection, a fraction of the tokens in the input corpus is substituted with plausible alternatives, and the network is trained to discriminate, for each input token, whether it is the original or a replacement. The substantial improvement brought by this pre-training task is that the loss function is defined over all input tokens (contrary to masked language modeling, where the loss is defined over the masked tokens exclusively), resulting in better exploitation of the whole input corpus. Electra established new state-of-the-art performance on the SQuAD 2.0 task.
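To illustrate the difference between the two objectives in the last point, here is a sketch in plain PyTorch contrasting where each loss is computed. The tensors are made up; mlm_logits and rtd_logits simply stand in for the outputs of the respective prediction heads:

```python
# A sketch contrasting the two pre-training losses. All tensors are toy
# placeholders standing in for the outputs of the respective model heads.
import torch
import torch.nn.functional as F

seq_len, vocab_size = 10, 30522

# Masked language modeling: the loss is computed only at masked positions.
mlm_logits = torch.randn(seq_len, vocab_size)          # predicted token distributions
target_ids = torch.randint(0, vocab_size, (seq_len,))  # original token ids
masked = torch.zeros(seq_len, dtype=torch.bool)
masked[[2, 7]] = True                                  # only ~15% of positions are masked
mlm_loss = F.cross_entropy(mlm_logits[masked], target_ids[masked])

# Replaced token detection: a binary "was this token replaced?" loss at EVERY position.
rtd_logits = torch.randn(seq_len)                      # one logit per input token
replaced = torch.zeros(seq_len)
replaced[[2, 7]] = 1.0                                 # positions where a token was swapped
rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, replaced)

print(mlm_loss.item(), rtd_loss.item())
```

The masked language modeling loss only ever "sees" the handful of masked positions, while the replaced token detection loss receives a training signal from every token in the sequence.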
Although the improvements brought by these models in language representation are impressive, their applicability is limited by the fact that they rely on big neural networks with millions of parameters. Because of this, options such as DistilBERT, which aim to reduce the computational effort required, assume great relevance, making it possible to deploy BERTology’s power on lightweight systems and mobile devices.
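As a closing sketch, using such a distilled model can be as simple as pointing the Hugging Face pipeline API at a DistilBERT checkpoint; the checkpoint name below is a publicly available sentiment-analysis model, used here purely as an example:

```python
# A minimal sketch of inference with a distilled model, assuming the
# Hugging Face `transformers` pipeline API and a public DistilBERT
# checkpoint fine-tuned for sentiment analysis.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("BERTology fits on my phone now."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```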