One of the main goals in computer vision is to achieve a human-like understanding of images. This understanding has been recently represented in various forms, including image classification, object detection, semantic segmentation, among many others. Nevertheless, image understanding has been mainly studied in the 2D image frame, so more information is needed to relate them to the 3D world. With the emergence of 3D sensors (e.g. the Microsoft Kinect), which provide depth along with color information, the task of propagating 2D knowledge into 3D becomes more attainable and enables interaction between a machine (e.g. robot) and its environment. This dissertation focuses on three aspects of indoor 3D scene understanding: (1) 2D-driven 3D object detection for single frame scenes with inherent 2D information, (2) 3D object instance segmentation for 3D reconstructed scenes, and (3) using room and floor orientation for automatic labeling of indoor scenes that could be used for self-supervised object segmentation. These methods allow capturing of physical extents of 3D objects, such as their sizes and actual locations within a scene.