Every single one of us plays a role and can make a difference in helping the global community to get through the COVID-19 pandemic.
Given the rapidly growing body of literature on this topic, the need to gather as much useful information as possible and make sense of all the available knowledge in a short time frame is more urgent than ever. To find a solution faster, a coalition of leading research groups issued the COVID-19 Open Research Dataset Challenge (CORD-19): “a call to action to the world’s artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions”.
We decided to contribute with all the resources we had available. Our goal is to use our Natural Language Processing tools to make the search for information more efficient, in order to limit the spread of COVID-19, help flatten the contagion curve and ultimately save lives. We use a Question Answering model that automatically answers natural-language questions with information drawn from a large database.
To validate the AI model our tools run on, we reached out to Centro Medico Sant’Agostino, a Milan-based healthcare facility, which allocated a team of specialized researchers to evaluate the results of our work. The model can eventually be used through a simple conversational interface, such as a chatbot.
Let’s analyze the engine
RECORD makes use of BERT, a natural language processing model pre-trained in an unsupervised fashion that, once fine-tuned, achieves state-of-the-art results on several NLP tasks. The details of how BERT works can be found in this paper by Google AI.
The engine basically works in two steps:
Information retrieval (IR)
When we give the engine a query as input, it searches the dataset for the most semantically related papers. But how can we do this quickly and efficiently? Papers are embedded into vectors using DistilBERT, a lightweight Sentence-BERT network pre-trained on a general-domain corpus and fine-tuned specifically to produce semantically meaningful sentence embeddings. The most related papers are then selected according to the highest cosine similarity between the input query and each document. One may wonder: “Why use a sentence-level representation to vectorize long texts instead of a document-level one?” Because Sentence-BERT achieves impressive performance on the Semantic Textual Similarity task, which led us to choose it as the best embedding model for this purpose.
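As an illustration, here is a minimal sketch of this retrieval step using the sentence-transformers library. The checkpoint name, example texts and variable names are our own assumptions, not necessarily RECORD’s exact configuration.

```python
# Minimal sketch of semantic retrieval with Sentence-BERT (assumed setup).
from sentence_transformers import SentenceTransformer, util

# A lightweight DistilBERT-based Sentence-BERT model fine-tuned to produce
# semantically meaningful sentence embeddings (illustrative checkpoint).
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

papers = [
    "Chloroquine has been evaluated as a potential antiviral treatment.",
    "We study the transmission dynamics of SARS-CoV-2 in closed settings.",
]
query = "What are the potential treatments for COVID-19?"

# Embed papers and query into the same vector space.
paper_embeddings = model.encode(papers, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank papers by cosine similarity to the query.
scores = util.pytorch_cos_sim(query_embedding, paper_embeddings)[0]
best = scores.argmax().item()
print(f"Most related paper (cosine {scores[best]:.2f}): {papers[best]}")
```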
So the question becomes: how do we analyze long texts at the sentence level?
The most challenging part of information retrieval is handling long sequences of text. Several approaches rely on paper abstracts instead of the full text, but this can lead to a loss of information. RECORD addresses the problem by splitting each paper into separate chunks and computing the similarity between the query and a paper as the similarity between the query and its closest chunk (in terms of cosine similarity). This approach makes the embedding step computationally expensive, due to the large number of embedding vectors to compute, but that is acceptable: the embedding is not performed at query time, and it enables a fine-grained, chunk-level search for the papers most semantically related to the query, as sketched below.
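The following sketch illustrates this chunking strategy under the same assumptions as the snippet above; the chunk size and helper names are hypothetical.

```python
# Illustrative chunk-level retrieval: a paper's score is the similarity
# between the query and its closest chunk. Chunk size is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

def split_into_chunks(text: str, chunk_size: int = 100) -> list:
    """Split a paper into chunks of roughly `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

paper_text = "..."  # full text of one CORD-19 paper goes here

# Offline (expensive, but done only once): embed every chunk of the paper.
chunks = split_into_chunks(paper_text)
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

# At query time: the paper's score is the similarity of its best chunk.
query_embedding = model.encode("How does the virus spread?",
                               convert_to_tensor=True)
sims = util.pytorch_cos_sim(query_embedding, chunk_embeddings)[0]
paper_score = sims.max().item()
best_chunk = chunks[sims.argmax().item()]
```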
Question Answering (QA)
In the Question Answering step, RECORD extracts the best answers to the input query from the chunks of the previously selected papers.
Again, we use a BERT-Large architecture pre-trained on a general-domain corpus and fine-tuned on the Stanford Question Answering Dataset (SQuAD) 1.1, which reaches state-of-the-art performance on this task.
For a given paper, RECORD processes all the chunks and, for each of them, provides an answer to the query together with a confidence score. The chunk whose answer has the highest score is then selected and its answer is highlighted, as sketched below.
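Here is a hedged sketch of this step using the Hugging Face transformers pipeline with a public BERT-Large checkpoint fine-tuned on SQuAD 1.1; RECORD’s exact model and selection logic may differ.

```python
# Run the QA model over every retrieved chunk and keep the answer
# with the highest confidence score (illustrative setup).
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

question = "What are the potential risk factors for COVID-19?"
chunks = [
    "Older age and conditions such as hypertension and diabetes have "
    "been associated with severe outcomes.",
    "The virus is thought to spread mainly through respiratory droplets.",
]

# Each result is a dict with 'answer', 'score', 'start' and 'end'.
answers = [qa(question=question, context=chunk) for chunk in chunks]
best = max(answers, key=lambda a: a["score"])
print(f"Best answer ({best['score']:.2f}): {best['answer']}")
```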
Moreover, for each answer RECORD provides additional information, such as the authors and the paper’s reputation, so that the user can assess its reliability.
How does RECORD perform?
To evaluate RECORD’s performance, we created a test set of specific questions together with specialists from the healthcare facility Centro Medico Sant’Agostino. The test set is composed of 112 questions, divided into 11 different tasks. Tasks can be specifically medical, such as “potential COVID-19 risk factors” or “virus origin and evolution”, but they also cover the social and economic impact of the coronavirus.
We defined a scoring mechanism to evaluate the provided answers. An answer receives one (and only one) of the following scores (not to be confused with the confidence scores of the QA step!):
- Score 0: the answer topic is different from the question topic.
- Score 1: the topic of the answer is correct, but the text does not answer the question.
- Score 2: the topic is correct, but the answer is generic and not precise.
- Score 3: the answer is consistent and precise.
Since RECORD’s goal is to correctly answer a given question, we want to evaluate its ability to provide at least one good answer. Hence, for each input query we select the best 5 papers from the IR step and look at the maximum score among the 5 generated answers, as sketched below.
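In code, this aggregation is simply a maximum over the five human-assigned scores per question; the sample annotations below are hypothetical.

```python
# Hypothetical evaluation data: for each question, the human-assigned
# scores (0-3) of the 5 answers returned for the top-5 retrieved papers.
annotations = {
    "What is the origin of the virus?": [0, 1, 3, 2, 0],
    "What are potential COVID-19 risk factors?": [2, 1, 1, 0, 2],
}

# A question counts as answered well if its best answer scores 2 or 3.
max_scores = {q: max(s) for q, s in annotations.items()}
positive = sum(1 for s in max_scores.values() if s >= 2)
print(f"{100 * positive / len(max_scores):.0f}% of questions reach score 2 or 3")
```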
The bar plot reported here aggregates the maximum scores obtained over all the questions in the test set: 78% of the questions achieve a positive score of 2 or 3.
Conclusion
We created RECORD as a practical tool that can support research in the struggle against COVID-19.
Its main advantage is that it filters out only the papers relevant to the request and gives direct access to the specific paragraph containing the answer, saving a great deal of time. This is a non-trivial task, considering the complexity of the topic and the continuous growth of published papers.
The analysis conducted with Centro Medico Sant’Agostino shows that RECORD provides remarkable results in terms of consistency and precision of the extracted answers, except when the information is not contained in the dataset (which could be addressed by alerting the user when the question has no answer in the data).
The great potential of this approach is that it is completely unsupervised, which allows us to apply the engine to any set of documents and to applications across different industries!