Exploring Efficiency of Vision Transformers for Self-Supervised Monocular Depth Estimation

Human-centered computing, Human computer interaction (HCI), Interaction paradigms, Mixed / augmented reality, Artificial intelligence, Computer vision, Localization, Spatial registration and tracking, 3D reconstruction

Abstract

Depth estimation is a crucial task for creating depth maps, one of the most important components of augmented reality (AR) and other applications. However, the most widely used AR hardware and smartphones have only sparse depth sensors with differing ground-truth depth acquisition methods. Thus, depth estimation models that are robust enough for downstream AR tasks can only be trained reliably with self-supervised learning based on camera information. Previous work in the field mostly focuses on self-supervised models with purely convolutional architectures, which do not take global spatial context into account.

In this paper, we apply vision transformer architectures to self-supervised monocular depth estimation and propose VTDepth, a vision transformer-based model that addresses the lack of global spatial context. We compare various combinations of convolutional and transformer architectures for self-supervised depth estimation and show that the best combination is a transformer-based encoder with a convolutional decoder. Our experiments demonstrate the efficiency of VTDepth for self-supervised depth estimation. Our set of models achieves state-of-the-art performance for self-supervised learning on the NYUv2 and KITTI datasets. All source code and pretrained weights will be released publicly.

Full text