Context-based text-graph embeddings in word-sense induction tasks

NLP, Word Sense Induction, BERT, Graph clustering

Abstract

Word Sense Induction (WSI) is the process of automatically discovering the multiple senses or meanings of a word. The WSI task can be described as grouping the contexts of a given word by its senses, which are not provided beforehand. Modern WSI systems are given only small text fragments and must cluster them into an unknown number of clusters. In the present work, contextualized word embeddings computed by BERT are applied in conjunction with clustering techniques to the WSI task for the Russian language. We hypothesize that embeddings from recent language models, already viable for sense induction, may be enhanced by graph-based post-processing. We evaluate this hypothesis on three datasets from the Russian-language WSI competition task. The fusion of graph algorithms and vector representations allowed us to beat one of the task's baselines (wiki-wiki, ARI = 0.7513) and demonstrates the viability of further research. This work provides insight into how vector sentence representations can be organized for more efficient sense extraction.
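
The ARI figure quoted above is the Adjusted Rand Index, which compares a predicted clustering of contexts with the gold sense labels and is unaffected by how the clusters are named. The following is a minimal sketch of the standard formula, not the competition's scoring script; the function name `adjusted_rand_index` and the toy labels are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, pred):
    """Adjusted Rand Index between two labelings of the same contexts."""
    if len(gold) != len(pred):
        raise ValueError("labelings must have the same length")

    # Count pairs of contexts that share a contingency cell, a gold sense,
    # or a predicted cluster.
    same_both = sum(comb(c, 2) for c in Counter(zip(gold, pred)).values())
    same_gold = sum(comb(c, 2) for c in Counter(gold).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())

    total_pairs = comb(len(gold), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two contexts: nothing to disagree on

    expected = same_gold * same_pred / total_pairs  # agreement expected by chance
    max_index = (same_gold + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (same_both - expected) / (max_index - expected)


# Toy example (hypothetical labels): four contexts of one word, two senses.
gold_senses = [0, 0, 1, 1]
print(adjusted_rand_index(gold_senses, [1, 1, 0, 0]))  # same grouping, renamed
print(adjusted_rand_index(gold_senses, [0, 1, 0, 1]))  # grouping cuts across senses
```

A score of 1 means the induced clusters match the gold senses exactly, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.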

Full text