Source
Neurointerfaces
DATE OF PUBLICATION
10/19/2022
Authors
Mikhail Burtsev, Vasily Konovalov, Anastasia Chizhikova

Multilingual Case-Insensitive Named Entity Recognition

Abstract

Although capitalisation is an important feature for the Named Entity Recognition (NER) task, NER input data is not always cased. Recent studies suggest two main methods of dealing with this inconsistency: truecasing the input, and training a model on a modified dataset. Furthermore, developers of virtual assistants often need to support interaction in several languages. Multilingual BERT has been shown to work well for cross-lingual transfer, reaching scores on datasets in various languages that are comparable to those of language-specific models. In this paper, we address the task of Named Entity Recognition on inconsistently capitalised data in English and Russian. We demonstrate that multilingual BERT trained on a concatenation of the original and lowercased datasets is the most effective way to solve the task. Our model achieves the highest average result on the CoNLL-2003 and Collection 3 datasets while being robust to missing casing.
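The data preparation step described in the abstract, concatenating the original dataset with a lowercased copy, can be illustrated with a minimal Python sketch. It is not the authors' code: the `(token, tag)` sentence representation, the function names `lowercase_sentence` and `augment_with_lowercase`, and the two toy sentences are assumptions made purely for illustration. The sketch covers only the data side; fine-tuning multilingual BERT on the result is out of its scope.

```python
from typing import List, Tuple

# Assumed representation: a sentence is a list of (token, NER tag) pairs.
Sentence = List[Tuple[str, str]]


def lowercase_sentence(sentence: Sentence) -> Sentence:
    """Lowercase every token while leaving the NER tags untouched."""
    return [(token.lower(), tag) for token, tag in sentence]


def augment_with_lowercase(dataset: List[Sentence]) -> List[Sentence]:
    """Return the original sentences followed by their lowercased copies."""
    return dataset + [lowercase_sentence(sentence) for sentence in dataset]


# Toy English and Russian examples (hypothetical, not taken from the paper's data).
train: List[Sentence] = [
    [("Angela", "B-PER"), ("Merkel", "I-PER"), ("visited", "O"), ("Moscow", "B-LOC")],
    [("Москва", "B-LOC"), ("встречает", "O"), ("гостей", "O")],
]

augmented = augment_with_lowercase(train)
```

Because the labels are copied unchanged, the model sees each entity both with and without capitalisation, which is the intuition behind robustness to missing casing.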
