The interaction between humans and machines is becoming more natural and accessible, making computers part of everyone’s life. What is the tool making this possible? One of the most ancient yet advanced technologies invented by humans: language. The introduction of conversational agents such as Google Assistant, Apple’s Siri and Amazon’s Alexa has allowed people to interact with machines simply by talking, bypassing traditional interfaces. However, teaching a machine to understand and use language is not trivial. Consider that humans are the only animals that rely on words and syntax to communicate: constructing a structured system of communication requires a considerable amount of intellect. Languages are complex and change over time, and yet we have managed to make computers use this resource. One problem researchers faced when teaching machines to speak is that the building blocks of language are words, while computers work with numbers. A clever way to turn words into numbers was needed to achieve this goal.
The keystone that made this possible was the introduction of textual embeddings, vectors of real numbers able to encode the semantic meaning of words or sentences in a Euclidean space. The idea behind these vectors is to represent words or sentences having a similar semantic meaning with vectors “close” to each other.
These vectors are then used as input for machine learning models and, since their introduction, they have become ubiquitous in Natural Language Processing frameworks.
Word embedding vectors are produced by unsupervised machine learning algorithms that take a large amount of raw text as their only input. The idea underlying these algorithms is as simple as it is brilliant: learn the embedding of a term using the neighboring words in the input text as data. This approach relies on the so-called distributional hypothesis, formulated in 1954 by the American linguist Zellig S. Harris, according to which words that occur in similar contexts tend to have similar meanings. These algorithms learn a representative vector for a term using exclusively its context as data, without the need for any supervision. This observation reveals the power of the approach: with a potentially infinite amount of raw text retrievable from the web, computational power is the only limit to word embedding algorithms.
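To make the idea concrete, here is a minimal sketch of how context windows supply the training signal: each word is paired with its neighbors, and these (target, context) pairs are exactly the data that models like Word2Vec’s skip-gram learn from. The toy corpus and function name are made up for illustration; real models train on billions of tokens.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs from a token list: the raw
    training data of context-based word embedding algorithms."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(context_pairs(tokens, window=1))
```

The embedding of "cat" ends up close to that of "dog" simply because the two words tend to appear surrounded by the same contexts.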
Similar to word embeddings, sentence embeddings are vectors of real numbers associated with pieces of text (phrases, paragraphs, documents, etc.), designed so that semantically similar sentences are represented by arrays close to each other. Although the two kinds of algorithms share the same purpose, sentence embedding models take a different approach to producing the representative vectors, using word-level representations as building blocks. Over the last decade, a great number of sentence embedding models have been published, and progress in this field has advanced at an extremely high pace. These algorithms can be divided into two families, depending on the approach adopted:
- Models that obtain sentence embeddings via compositions (mean, concatenation, etc.) of pre-trained word embeddings.
- Models that use word embeddings as input of a deep neural network and adopt its hidden layers as sentence embeddings.
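A minimal sketch of the first family, composing pre-trained word embeddings by averaging; the tiny vectors below are invented for illustration (real embeddings have hundreds of dimensions):

```python
import numpy as np

# Hypothetical pre-trained word embeddings (toy values).
word_vecs = {
    "dog":   np.array([0.8, 0.1, 0.0]),
    "barks": np.array([0.2, 0.9, 0.1]),
}

def mean_embedding(sentence, vecs):
    """Compose a sentence embedding as the mean of the
    embeddings of its in-vocabulary words."""
    rows = [vecs[w] for w in sentence.split() if w in vecs]
    return np.mean(rows, axis=0)

print(mean_embedding("dog barks", word_vecs))  # → [0.5  0.5  0.05]
```

The second family instead feeds the word vectors into a deep network and reads the sentence embedding off its hidden layers.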
Each of these algorithms has its own pros and cons, ranging from computational cost to embedding quality and flexibility. For this reason, no single best-performing sentence embedding model has been crowned, and each model has reasons to be adopted depending on the use case.
A bit of history
Before the introduction of word embeddings, terms were represented by one-hot vectors: huge arrays filled with zeros except for a single non-zero entry at the index of the represented word.
The number of elements in such vectors equals the number of unique words in a language, which is generally on the order of 10^5. Using this representation as input for Natural Language Processing models leads to two principal issues. First, the dimension of the problem is huge: a great amount of memory is required and the so-called curse of dimensionality kicks in. Second, the features of these vectors are totally uncorrelated and redundant. For example, the words apple and pear share a lot of semantic meaning, so their representations should be similar, yet their one-hot vectors are mathematically orthogonal. Word embeddings were introduced precisely to solve these issues: the representations obtained by embedding algorithms have a much lower dimension (typically between 50 and 500) and their features are no longer uncorrelated.
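The apple/pear contrast can be checked directly. The sketch below uses a toy three-word vocabulary and made-up dense vectors (real vocabularies have around 10^5 entries):

```python
import numpy as np

vocab = ["apple", "pear", "car"]
# One-hot vectors: rows of the identity matrix.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# "apple" and "pear" are orthogonal: their dot product is zero,
# so one-hot vectors encode no similarity at all.
assert np.dot(one_hot["apple"], one_hot["pear"]) == 0.0

# Hypothetical dense embeddings, by contrast, can encode similarity.
dense = {"apple": np.array([0.9, 0.8]),
         "pear":  np.array([0.85, 0.75]),
         "car":   np.array([-0.7, 0.2])}

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# apple is closer to pear than to car in the dense space.
print(cos(dense["apple"], dense["pear"]) > cos(dense["apple"], dense["car"]))
```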
Word embedding algorithms were first proposed in 2003 in the publication A Neural Probabilistic Language Model (Bengio et al., 2003), but became famous only ten years later with the introduction of Word2Vec (Mikolov et al., 2013), which remains one of the most widely used models in this field.
New frontiers of text representation
In recent years several embedding algorithms have been proposed, at both word and sentence level, to improve the quality of the information encoded in the representative vectors. Among the most recent word embedding models we can mention:
- FastText (Bojanowski et al., 2016). The novelty introduced by this algorithm is the use of sub-word information to construct the representative vector of a term. The FastText model learns embeddings for entire words but also for the most frequent groups of characters (n-grams) in the input text. With this approach, the embeddings encode information related to the prefixes and suffixes of a language (the most frequent n-grams) and can represent out-of-vocabulary words as an average of n-gram embeddings. Since its introduction, FastText has become one of the most widely adopted word embedding algorithms in the NLP community.
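A sketch of the sub-word idea: FastText pads each word with boundary markers and extracts its character n-grams, so even an unseen word decomposes into units whose embeddings are known. The function name is made up for illustration:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with the boundary
    markers '<' and '>', as in FastText's sub-word scheme."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# An out-of-vocabulary word's embedding would be computed as the
# average of the embeddings of these n-grams.
print(char_ngrams("where"))  # → ['<wh', 'whe', 'her', 'ere', 're>']
```

Note how 'her' as an n-gram of "where" is distinct from the word "her", which would be represented as '<her>'.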
- ELMo (Peters et al., 2018). This algorithm was proposed to handle polysemous words. Previous word embedding models provide a fixed representative vector for each word, even when the word has more than one meaning. ELMo, on the contrary, provides contextual word embeddings that take the entire sentence into account to derive the semantic meaning of the represented term. This result is obtained by relying on a pre-trained bi-directional language model, whose hidden states are combined into the final embedding of a term via a weighted average. ELMo gained considerable attention in the NLP community as the first algorithm to propose an efficient solution to the problem of polysemy.
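The weighted-average step can be sketched as follows: for one token, the language model's layer representations are combined with task-specific softmax-normalised weights and a scalar. All numbers below are invented toy values; real ELMo layers have 1024 dimensions:

```python
import numpy as np

# Hypothetical layer representations of a single token
# from a 2-layer bi-directional language model.
layers = np.array([[0.1, 0.4],   # embedding layer
                   [0.3, 0.2],   # first LSTM layer
                   [0.5, 0.0]])  # second LSTM layer

s = np.array([0.2, 0.3, 0.5])   # softmax-normalised layer weights
gamma = 1.0                      # task-specific scaling factor

# Final embedding: weighted average of the layer vectors.
elmo_vec = gamma * (s @ layers)
print(elmo_vec)  # → [0.36 0.14]
```

Because the hidden states depend on the whole sentence, the same word gets a different vector in different contexts.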
- Flair (Akbik et al., 2018). This algorithm was proposed to combine the properties of the two models above: providing contextual word embeddings that overcome the polysemy problem while also encoding the semantic meaning of prefixes and suffixes. This goal is reached by using a character-level bi-directional language model to produce the representative vectors. This novel word embedding approach reached the state of the art in several Named Entity Recognition (NER) tasks.
On the other hand, the most relevant sentence embedding algorithms of recent years are:
- Smooth Inverse Frequency weighting (Arora et al., 2017). This model obtains sentence representations via a weighted average of pre-trained word embeddings (where each weight depends on the frequency of the represented term), followed by the removal of the first principal component extracted from a set of sentence embeddings. Although this model is very simple and computationally cheap, it achieves results on semantic-level tasks comparable with those of algorithms based on deep architectures. Moreover, the removal of principal components has become a widespread post-processing practice when dealing with sentence embeddings.
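A compact sketch of the two SIF steps, with toy vectors and frequencies: each word vector is weighted by a / (a + p(w)), so frequent words count less, and the first principal component is then subtracted from every sentence vector:

```python
import numpy as np

def sif_embeddings(sentences, vecs, freqs, a=1e-3):
    """Smooth Inverse Frequency sentence embeddings (sketch):
    frequency-weighted average, then first-principal-component removal."""
    emb = np.array([
        np.mean([vecs[w] * a / (a + freqs[w]) for w in s.split()], axis=0)
        for s in sentences
    ])
    # First right singular vector = first principal direction.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    # Remove the projection of every sentence vector onto it.
    return emb - np.outer(emb @ u, u)

# Hypothetical word vectors and unigram frequencies.
vecs = {"dog": np.array([1.0, 0.0]), "barks": np.array([0.0, 1.0]),
        "cat": np.array([0.9, 0.1]), "meows": np.array([0.1, 0.9])}
freqs = {"dog": 0.01, "barks": 0.001, "cat": 0.01, "meows": 0.001}

print(sif_embeddings(["dog barks", "cat meows"], vecs, freqs))
```

The rare words ("barks", "meows") receive weights an order of magnitude larger than the common ones, dominating each sentence vector before the common component is stripped away.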
- Sentence-BERT (Reimers and Gurevych, 2019). This algorithm, based on the BERT language representation model (Devlin et al., 2018), makes use of the Transformer architecture (Vaswani et al., 2017) and relies on a very large deep network. These two characteristics make Sentence-BERT able to provide extremely informative sentence embeddings, remarkably improving the state of the art on several semantic-level tasks.
- Static Fuzzy Bag-of-Words. This algorithm relies on fuzzy set theory to justify a new type of composition for obtaining sentence embeddings: the max-pooling operation. Despite its simplicity and low computational cost, this model obtains highly informative sentence embeddings and reaches results on semantic-level tasks comparable with far more computationally expensive approaches.
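Max-pooling composition can be sketched in a few lines: instead of averaging the word vectors, take their element-wise maximum (toy vectors for illustration):

```python
import numpy as np

# Hypothetical pre-trained word embeddings (toy values).
word_vecs = {"dog":   np.array([0.8, 0.1, 0.0]),
             "barks": np.array([0.2, 0.9, 0.1])}

def max_pool_embedding(sentence, vecs):
    """Compose a sentence embedding as the element-wise maximum
    of its word embeddings (max-pooling composition)."""
    rows = [vecs[w] for w in sentence.split() if w in vecs]
    return np.max(rows, axis=0)

print(max_pool_embedding("dog barks", word_vecs))  # → [0.8 0.9 0.1]
```

Unlike the mean, the maximum preserves the strongest activation of each feature, which is the intuition the fuzzy-set framing formalises.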
The integration of these algorithms into linguistic tools such as machine translators and conversational agents has made it possible to achieve surprising results in the NLP field, making the boundary between human and artificial intelligence increasingly blurred.