BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment
Abstract
In recent years, there has been substantial progress in using pretrained Language Models (LMs) on a range of tasks aimed at improving the understanding of biomedical texts. Nonetheless, existing biomedical LLMs show limited comprehension of complex, domain-specific concept structures and the factual information encoded in biomedical Knowledge Graphs (KGs). In this work, we propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel joint LM and KG pre-training method that augments an LM with external knowledge by simultaneously learning a dedicated KG encoder and aligning the representations of both the LM and the graph. For a given textual sequence, we link biomedical concept mentions to the Unified Medical Language System (UMLS) KG and use local KG subgraphs as cross-modal positive samples for these mentions. Our empirical findings indicate that applying our method to several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves their performance on a range of language understanding tasks and the quality of entity representations, even with minimal pre-training on a small alignment dataset sourced from PubMed scientific abstracts.
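The abstract describes aligning LM mention representations with KG subgraph representations using cross-modal positive samples. The sketch below illustrates one common way such an alignment objective can be formulated, a symmetric InfoNCE-style contrastive loss over in-batch pairs; the function name, the temperature parameter, and the specific loss form are illustrative assumptions, as the abstract does not specify BALI's exact objective.

```python
# Minimal sketch of a cross-modal contrastive alignment objective between
# LM mention embeddings and KG subgraph embeddings. This is an assumed
# InfoNCE-style formulation, not necessarily BALI's exact loss.
import torch
import torch.nn.functional as F

def alignment_loss(mention_emb: torch.Tensor,
                   subgraph_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Treats the i-th subgraph as the positive for the i-th mention;
    all other in-batch subgraphs serve as negatives, and vice versa."""
    # L2-normalize both modalities so dot products are cosine similarities.
    mention_emb = F.normalize(mention_emb, dim=-1)
    subgraph_emb = F.normalize(subgraph_emb, dim=-1)
    # Pairwise similarity matrix of shape (batch, batch).
    logits = mention_emb @ subgraph_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: mentions -> subgraphs and subgraphs -> mentions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Under this formulation, each training batch would pair the LM's embedding of a linked concept mention with the KG encoder's embedding of that concept's local UMLS subgraph, pulling matched pairs together while pushing apart mismatched ones.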