Source
Neurointerfaces
Publication date
19.10.2022
Authors
Mikhail Burtsev, Vasily Konovalov, Anastasia Chizhikova

Multilingual Case-Insensitive Named Entity Recognition

Abstract

Although capitalisation is an important feature for the Named Entity Recognition (NER) task, the NER input data is not always cased. Recent studies suggest two main methods of dealing with such inconsistency: truecasing and training a model on a modified dataset. Furthermore, while developing virtual assistants there is often a need to support interaction in several languages. It has been shown that multilingual BERT can be successfully used for cross-lingual transfer, performing on datasets in various languages with scores comparable to those obtained with language-specific models. In this paper, we address the task of Named Entity Recognition on inconsistently capitalised data in English and Russian. We demonstrate that using multilingual BERT trained on a concatenation of original and lowered datasets is the most effective way to solve the task. Our model achieves the highest average result on CoNLL-2003 and Collection 3 datasets while being robust to missing casing.
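
The data modification described in the abstract, training on a concatenation of the original and lowercased datasets, can be illustrated with a minimal sketch. It assumes each sentence is stored as a pair of token and BIO-tag lists. The helper names `lowercase_copy` and `augment_with_lowercase` and the toy sentences are hypothetical, introduced only for illustration; this is not the authors' actual preprocessing code.

```python
from typing import List, Tuple

# Assumed representation: one sentence = (tokens, BIO tags of equal length).
Sentence = Tuple[List[str], List[str]]


def lowercase_copy(sentence: Sentence) -> Sentence:
    """Return a copy of the sentence with lowercased tokens and unchanged tags."""
    tokens, tags = sentence
    return [token.lower() for token in tokens], list(tags)


def augment_with_lowercase(dataset: List[Sentence]) -> List[Sentence]:
    """Concatenate the original dataset with its lowercased copy."""
    return list(dataset) + [lowercase_copy(sentence) for sentence in dataset]


# Toy examples in English and Russian (illustrative, not taken from
# CoNLL-2003 or Collection 3).
train: List[Sentence] = [
    (["Moscow", "is", "the", "capital", "of", "Russia"],
     ["B-LOC", "O", "O", "O", "O", "B-LOC"]),
    (["Москва", "столица", "России"],
     ["B-LOC", "O", "B-LOC"]),
]

augmented = augment_with_lowercase(train)
for tokens, tags in augmented:
    print(list(zip(tokens, tags)))
```

The augmented set is twice the size of the original, and the labels are copied unchanged, so the model sees each entity both with and without its capitalisation cue.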
