Source
IEEE Access
DATE OF PUBLICATION
08/30/2023
Authors
Ilya Makarov, Dmitry Kiselev, Xinze Li

Predicting Molecule Toxicity via Descriptor-Based Graph Self-Supervised Learning

Abstract

Predicting molecular properties with Graph Neural Networks (GNNs) has recently drawn considerable attention, and compound toxicity prediction is among the most challenging of these tasks. When labeled molecule data are scarce, an effective approach is to pre-train GNNs on large-scale unlabeled molecular data and then fine-tune them for downstream tasks. Among pre-training strategies, node-level pre-training masks and predicts atom properties, while motif-based methods capture rich information in subgraphs. Both approaches have proven effective across a range of downstream tasks. However, current pre-training frameworks face two main challenges: (1) node-level auxiliary tasks do not preserve useful domain knowledge, and (2) fusing motif-based methods with node-level tasks is computationally expensive. To address these challenges, we propose Descriptor-based Graph Self-supervised Learning (DGSSL), a method that uses domain knowledge to enhance graph representation learning. We extract this knowledge from a descriptor language known as the fragmentary code of substructure superposition (FCSS), in which molecules are described by substructures that can serve as centers for weak bonds. Specifically, DGSSL identifies descriptor centers in molecules and encodes motif-like information as special atomic numbers in the pre-training tasks. This enables node-level self-supervised pre-training frameworks for GNNs to also capture rich information in local subgraphs. Experimental results demonstrate that our method achieves state-of-the-art performance on three toxicity-related benchmarks, and an ablation experiment shows the significance of its components.
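
The idea of encoding descriptor centers as special atomic numbers and then masking atoms for prediction can be illustrated with a minimal sketch. This is not the authors' implementation: the center types in CENTER_TYPES, the mask id MASK_TOKEN, the function names, and the hand-labeled centers are all hypothetical, and the real FCSS vocabulary and center detection are not described in this abstract.

```python
import random

NUM_ELEMENTS = 118   # real atomic numbers occupy 1..118
MASK_TOKEN = 0       # assumed id for a masked atom (hypothetical choice)

# Hypothetical descriptor-center types; the actual FCSS vocabulary is not given here.
CENTER_TYPES = {"carbonyl_O": 1, "amine_N": 2}


def encode_centers(atomic_numbers, centers):
    """Relabel descriptor-center atoms with special ids above 118.

    `centers` maps an atom index to a center-type name. Detecting centers
    is out of scope for this sketch, so they are supplied by hand.
    """
    labels = list(atomic_numbers)
    for idx, center_type in centers.items():
        labels[idx] = NUM_ELEMENTS + CENTER_TYPES[center_type]
    return labels


def mask_atoms(labels, mask_rate=0.15, seed=0):
    """Hide a random subset of node labels for a masked-prediction task.

    Returns the corrupted input labels and a dict {atom index: true label}
    that a GNN would be trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(labels)))
    masked_idx = sorted(rng.sample(range(len(labels)), n_mask))
    inputs = list(labels)
    for i in masked_idx:
        inputs[i] = MASK_TOKEN
    targets = {i: labels[i] for i in masked_idx}
    return inputs, targets


# Heavy atoms of acetamide, CH3-C(=O)-NH2: C, C, O, N
atoms = [6, 6, 8, 7]
labels = encode_centers(atoms, {2: "carbonyl_O", 3: "amine_N"})
inputs, targets = mask_atoms(labels, mask_rate=0.5, seed=1)
```

Because the special ids live in the same label space as ordinary atomic numbers, a node-level masking objective of this kind would need no separate motif branch; whenever a masked atom is a descriptor center, the prediction target itself carries the substructure information.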
