Indoor 3D Scene Understanding Using Depth Sensors
One of the main goals in computer vision is to achieve a human-like understanding of images. This understanding has been recently represented in various forms, including image classification, object detection, semantic segmentation, among many others. Nevertheless, image understanding has been mainly studied in the 2D image frame, so more information is needed to relate them to the 3D world. With the emergence of 3D sensors (e.g. the Microsoft Kinect), which provide depth along with color information, the task of propagating 2D knowledge into 3D becomes more attainable and enables interaction between a machine (e.g. robot) and its environment. This dissertation focuses on three aspects of indoor 3D scene understanding: (1) 2D-driven 3D object detection for single frame scenes with inherent 2D information, (2) 3D object instance segmentation for 3D reconstructed scenes, and (3) using room and floor orientation for automatic labeling of indoor scenes that could be used for self-supervised object segmentation. These methods allow capturing of physical extents of 3D objects, such as their sizes and actual locations within a scene.
Overview
Abstract
One of the main goals in computer vision is to achieve a human-like understanding of images. This understanding has been recently represented in various forms, including image classification, object detection, semantic segmentation, among many others. Image understanding has been mainly studied in the 2D image frame and is motivated by impressive performance in numerous challenges and backed up by challenging and large-scale datasets. Nevertheless, 2D image understanding results are constrained to the image frame, so more information is needed to relate them to the 3D world. With the emergence of 3D sensors (e.g. the Microsoft Kinect), which provide depth along with color information, the task of propagating 2D knowledge into 3D becomes more attainable. The importance of 3D scene understanding lies in extending the knowledge from the image frame to the real world, which enables interaction between a machine (e.g. robot) and its environment.
This dissertation focuses on three aspects of indoor 3D scene understanding: (1) 2D-driven 3D object detection for single frame scenes with inherent 2D information, (2) 3D object instance segmentation for 3D reconstructed scenes, and (3) using room and floor orientation for automatic labeling of indoor scenes that could be used for self-supervised object segmentation.
In the first part, we propose a technique that places 3D bounding boxes around objects in an RGB-D scene. Our approach makes best use of the 2D information to quickly reduce the search space in 3D, benefiting from state-of-the-art 2D object detection techniques. We then use the 3D information to orient, place, and score bounding boxes around objects. We independently estimate the orientation for every object, using previous techniques that utilize normal information. Object locations and sizes in 3D are learned using a multilayer perceptron (MLP).
In the second part, we propose a novel method for instance label segmentation of dense 3D voxel grids. We target volumetric scene representations, which have been acquired with depth sensors or multi-view stereo methods and which have been processed with semantic 3D reconstruction or scene completion methods. The main task is to learn shape information about individual object instances in order to accurately separate them, including connected and incompletely scanned objects. We solve the 3D instance-labeling problem with a multi-task learning strategy. The first goal is to learn an abstract feature embedding, which groups voxels with the same instance label close to each other while separating clusters with different instance labels from each other. The second goal is to learn instance information by densely estimating directional information of the instance's center of mass for each voxel. This is particularly useful to find instance boundaries in the clustering post-processing step, as well as, for scoring the segmentation quality for the first goal.
In the third part, we propose RGB-based semantic segmentation using self-supervised depth pretraining. Although well-known large-scale datasets, such as ImageNet, have driven image understanding forward, most of these datasets require extensive manual annotation and are thus not easily scalable. This limits the advancement of image understanding techniques. The impact of these large-scale datasets can be observed in almost every vision task and technique in the form of pre-training for initialization. In this work, we propose an easily scalable and self-supervised technique that can be used to pre-train any semantic RGB segmentation method. In particular, our pre-training approach makes use of automatically generated labels that can be obtained using depth sensors. These labels, denoted by HN-labels, represent different height and normal patches, which allow mining of local semantic information that is useful in the task of semantic RGB segmentation. We show how our proposed self-supervised pre-training with HN-labels can be used to replace ImageNet pre-training, while using 25x less images and without requiring any manual labeling.
Brief Biography
Jean Lahoud received his Bachelor of Engineering in Mechanical Engineering from Notre Dame University in Lebanon in 2012. In 2014, he obtained his Master of Engineering degree in the same field from American University of Beirut in Lebanon. Before joining KAUST, Jean was a product manager at Murex systems in Lebanon. He is currently a PhD candidate in Electrical Engineering at the Image and Video Understanding Lab under the supervision of Prof. Bernard Ghanem.