Attention-based Models in Self-supervised Monocular Depth Estimation

Convolutional neural nets, Computer vision, Self-supervised learning, Image processing

Abstract

Depth estimation has many real-world applications in which high-quality depth prediction is crucial. Depth maps can be learned from labeled data, but dense, high-quality ground truth is hard to obtain. Unlabeled image sequences, however, can be used in a self-supervised learning setting because they carry information about changes in the camera’s position. This study therefore focuses on monocular self-supervised depth estimation. Three different visual attention mechanisms are compared, and a novel attention mechanism is proposed that aggregates inter-channel and spatial information to capture long-range dependencies. The proposed modifications to the Monodepth2 architecture improve performance on the public KITTI dataset. The experimental results show that adding attention to the depth decoder significantly improves model quality, whereas adding attention to the encoder does not lead to better performance. Additionally, the proposed attention mechanism performs well on all metrics used on the chosen dataset.

Full text