Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild

Meta AI, Michigan State University, Caltech


Recognizing scenes and objects in 3D from a single image is a longstanding goal of computer vision with applications in robotics and AR/VR. For 2D recognition, large datasets and scalable solutions have led to unprecedented advances. In 3D, existing benchmarks are small and approaches specialize in a few object categories and specific domains, e.g., urban driving scenes. Motivated by the success of 2D recognition, we revisit the task of 3D object detection by introducing a large benchmark called Omni3D. Omni3D re-purposes and combines existing datasets, resulting in 234k images annotated with more than 3 million instances and 98 categories. 3D detection at such scale is challenging due to variations in camera intrinsics and the rich diversity of scene and object types. We propose a model, called Cube R-CNN, designed to generalize across camera and scene types with a unified approach. We show that Cube R-CNN outperforms prior works on the larger Omni3D and on existing benchmarks. Finally, we demonstrate that Omni3D is a powerful dataset for 3D object recognition: it improves single-dataset performance and can accelerate learning on new, smaller datasets via pre-training.

The Omni3D Dataset

Scene Examples (in order: ARKit, KITTI, Hypersim, nuScenes, Objectron, SUN RGB-D).

We curate a large and diverse 3D object detection dataset with the following properties:

  • 234k RGB images
  • 3M oriented 3D box annotations
  • Indoor and outdoor scenes
  • Varying focal lengths and resolutions

To support a new 3D AP metric we have implemented a fast and accurate 3D IoU algorithm (source).
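To give a sense of what a 3D IoU measures, here is a minimal sketch for the simplified axis-aligned case. It is not the algorithm linked above: the dataset's boxes are oriented, which requires intersecting rotated volumes. The function name `aabb_iou_3d` and the (min corner, max corner) box format are assumptions made for illustration only.

```python
def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes.

    Each box is a pair (min_xyz, max_xyz) of 3-tuples. This is a simplified
    stand-in; oriented boxes need a more involved intersection routine.
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_b

    # Intersection volume: product of the per-axis overlaps.
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a_min, a_max, b_min, b_max):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0  # boxes are separated along this axis
        inter *= overlap

    def volume(lo, hi):
        v = 1.0
        for l, h in zip(lo, hi):
            v *= h - l
        return v

    union = volume(a_min, a_max) + volume(b_min, b_max) - inter
    return inter / union


unit_cube = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
shifted = ((0.5, 0.0, 0.0), (1.5, 1.0, 1.0))
print(aabb_iou_3d(unit_cube, shifted))
```

A 3D AP metric would then threshold such IoU values to decide whether a predicted box counts as a match for a ground-truth box.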

Cube R-CNN

Cube R-CNN Overview

We design a simple yet effective model for general 3D object detection that leverages key advances from recent monocular object detection techniques. At its core, our method builds on Faster R-CNN (detectron2) and parameterizes a 3D head to estimate a virtual 3D cuboid, which is then compared to the 3D ground-truth (GT) vertices.
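The idea of comparing a predicted cuboid to ground-truth vertices can be sketched as follows. This is an illustrative sketch, not the model's implementation: the helper names, the mapping of (width, height, length) to the (x, y, z) axes, and the mean absolute vertex error used as the comparison are all assumptions.

```python
def cuboid_corners(center, dims, rotation):
    """Return the 8 corners of a cuboid in camera coordinates.

    center:   (x, y, z) of the cuboid centre
    dims:     (w, h, l) full extents along the local x, y, z axes (assumed order)
    rotation: 3x3 rotation matrix as nested lists, row-major
    """
    cx, cy, cz = center
    w, h, l = dims
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                local = (sx * w, sy * h, sz * l)
                # Rotate the local offset, then translate by the centre.
                rotated = [
                    sum(rotation[r][c] * local[c] for c in range(3))
                    for r in range(3)
                ]
                corners.append((cx + rotated[0], cy + rotated[1], cz + rotated[2]))
    return corners


def vertex_l1_error(pred_corners, gt_corners):
    """Mean absolute difference over all corner coordinates (assumed metric)."""
    total = sum(
        abs(p - g)
        for pc, gc in zip(pred_corners, gt_corners)
        for p, g in zip(pc, gc)
    )
    return total / (len(pred_corners) * 3)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
gt = cuboid_corners((0.0, 0.0, 2.0), (2.0, 2.0, 2.0), identity)
pred = cuboid_corners((0.3, 0.0, 2.0), (2.0, 2.0, 2.0), identity)
print(vertex_l1_error(pred, gt))
```

Comparing corners rather than separate centre, size and rotation values lets a single error term reflect mistakes in any of the three.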

Virtual Depth Visualization
An important feature of Cube R-CNN is that it makes predictions in a virtual camera space, which maintains a consistent effective image resolution and focal length across diverse camera sensors. For example, the two camera sensors (a) and (b) above can produce very similar images even though the metric depth of the object in (b) is nearly twice that in (a). Our experiments show that addressing this ambiguity across camera sensors is critical for scaling to large, diverse 3D object datasets.
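One way to make this concrete is to require that an object occupies the same fraction of the image height in the real and the virtual camera, which gives z_v = z * (f_v / f) * (H / H_v), where f and H are the real focal length and image height and f_v and H_v are the virtual ones. The sketch below applies that relation; the function names and the default virtual values of 512 are assumptions chosen for illustration.

```python
def to_virtual_depth(z, f, H, f_virtual=512.0, H_virtual=512.0):
    """Map metric depth z to virtual depth.

    f, H: focal length (pixels) and image height (pixels) of the real camera.
    f_virtual, H_virtual: virtual camera settings (illustrative defaults).
    """
    return z * (f_virtual / f) * (H / H_virtual)


def to_metric_depth(z_virtual, f, H, f_virtual=512.0, H_virtual=512.0):
    """Inverse mapping: recover metric depth from virtual depth."""
    return z_virtual * (f / f_virtual) * (H_virtual / H)


# Camera (b) has twice the focal length of (a) and sees the object twice as
# far away, so both images look alike and both map to the same virtual depth.
z_a = to_virtual_depth(z=10.0, f=500.0, H=512.0)
z_b = to_virtual_depth(z=20.0, f=1000.0, H=512.0)
print(z_a, z_b)
```

At inference time the predicted virtual depth would be converted back to metric depth with the known intrinsics of the input camera, as in `to_metric_depth`.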

Predictions on COCO Images

Here we show the generalization power of Cube R-CNN trained on Omni3D by predicting 3D objects in unseen COCO images. Camera intrinsics are not known for COCO, so we set the focal length to a constant multiple of the image height and assume the principal point is at the image center; the same settings are used for all examples below.
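A minimal sketch of such default intrinsics is shown below. The function name `default_intrinsics` and the `focal_scale` value of 2.0 are hypothetical; the text does not state the constant that was actually used.

```python
def default_intrinsics(width, height, focal_scale=2.0):
    """Build a 3x3 pinhole intrinsics matrix for an image with unknown camera.

    focal_scale is a hypothetical constant: the focal length is set to
    focal_scale * height, and the principal point to the image centre.
    """
    f = focal_scale * height
    cx, cy = width / 2.0, height / 2.0
    return [
        [f, 0.0, cx],
        [0.0, f, cy],
        [0.0, 0.0, 1.0],
    ]


K = default_intrinsics(width=640, height=480)
for row in K:
    print(row)
```

Because the true focal length is unknown, metric depths predicted with such a matrix are only correct up to the error in the assumed focal length.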

Predictions on Project Aria (+ tracking)

We also demonstrate the zero-shot performance of Cube R-CNN using unseen Project Aria data. The demo uses known camera poses and a simple implementation for object tracking/smoothing.

Tracking uses a custom Hungarian algorithm that turns the per-frame 3D cuboid predictions into smooth tracks, based on the following features:

  • 3D Intersection over Union (source)
  • Chamfer distance (source)
  • Category distribution similarity
  • Slow decay for missing detections
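The matching step can be sketched roughly as below. This is not the demo's tracker: the cost here combines only Chamfer distance and category similarity (3D IoU is omitted for brevity), the weights, thresholds and decay rate are made-up values, and an exhaustive search over permutations stands in for the Hungarian algorithm. The exhaustive search finds the same minimum-cost assignment but is only practical for a handful of objects.

```python
import itertools
import math


def chamfer_distance(pts_a, pts_b):
    """Symmetric mean nearest-neighbour distance between two 3D point sets."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)


def category_similarity(p, q):
    """Cosine similarity between two per-category score vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def build_cost(tracks, dets, w_chamfer=1.0, w_category=1.0):
    """cost[i][j] = cost of matching track i to detection j (assumed weights)."""
    return [
        [
            w_chamfer * chamfer_distance(t["corners"], d["corners"])
            + w_category * (1.0 - category_similarity(t["scores"], d["scores"]))
            for d in dets
        ]
        for t in tracks
    ]


def assign(cost, max_cost=2.0):
    """Minimum-cost one-to-one matching by exhaustive search.

    Stand-in for the Hungarian algorithm; returns (track, detection) pairs
    whose individual cost does not exceed max_cost.
    """
    n = len(cost)
    m = len(cost[0]) if cost else 0
    if n == 0 or m == 0:
        return []
    if n <= m:
        candidates = (list(enumerate(p)) for p in itertools.permutations(range(m), n))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in itertools.permutations(range(n), m))
    best_total, best_pairs = math.inf, []
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_total, best_pairs = total, pairs
    return [(i, j) for i, j in best_pairs if cost[i][j] <= max_cost]


def update_confidence(track_conf, matched, det_conf=None, decay=0.9):
    """Keep unmatched tracks alive, lowering their confidence slowly."""
    return det_conf if matched else track_conf * decay


tracks = [
    {"corners": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], "scores": [1.0, 0.0]},
    {"corners": [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)], "scores": [1.0, 0.0]},
]
dets = [
    {"corners": [(5.1, 0.0, 0.0), (6.1, 0.0, 0.0)], "scores": [1.0, 0.0]},
    {"corners": [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)], "scores": [1.0, 0.0]},
]
matches = assign(build_cost(tracks, dets))
print(matches)
```

Each point set here has only two points to keep the example short; a real cuboid would contribute its eight corners. A track whose confidence decays below some threshold would eventually be dropped, which is what lets objects survive a few missed detections without lingering forever.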


  author        = {Garrick Brazil and Abhinav Kumar and Julian Straub and Nikhila Ravi and Justin Johnson and Georgia Gkioxari},
  title         = {{Omni3D}: A Large Benchmark and Model for {3D} Object Detection in the Wild},
  booktitle     = {CVPR},
  address       = {Vancouver, Canada},
  month         = {June},
  year          = {2023},
  organization  = {IEEE},