Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Rutgers University, Drexel University

Dataset for Open Vocabulary Entity Grounding (DOVE-G)

Abstract

We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries.

Unlike conventional semantic-based object localization approaches, our system facilitates context-aware entity localization, allowing for queries such as "pick up a cup on a kitchen table" or "navigate to a sofa on which someone is sitting". In contrast to existing research on 3D scene graphs, OVSG supports free-form text input and open-vocabulary querying.

Through a series of comparative experiments using the ScanNet dataset and a self-collected dataset, we demonstrate that our proposed approach significantly surpasses the performance of previous semantic-based localization techniques. Moreover, we highlight the practical application of OVSG in real-world robot navigation and manipulation experiments.

Open-Vocabulary 3D Scene Graph (OVSG)

Backbone of our system: OVIR-3D


At the core of our system lies Open-Vocabulary 3D Instance Retrieval (OVIR-3D). Given a 3D scan reconstructed from an RGB-D video and a text query, the method retrieves the relevant 3D instances (see examples a-c). Notably, it can detect instances that are not even part of the original annotations (see examples d-e), such as the cushions on the sofa.
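
As a rough illustration of the retrieval step, the sketch below ranks per-instance open-vocabulary features against a text query by cosine similarity. The CLIP model choice and the instance_features input are assumptions for illustration, not the released OVIR-3D code.

import torch
import clip

# Model choice is an assumption; OVIR-3D fuses Detic region features,
# which live in a CLIP-aligned embedding space.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def retrieve_instances(query, instance_features, top_k=5):
    """Rank 3D instances by cosine similarity between the text query
    and each instance's fused open-vocabulary feature.

    instance_features: (N, D) tensor, one fused feature per 3D instance.
    """
    with torch.no_grad():
        text = model.encode_text(clip.tokenize([query]).to(device)).float()
    text = text / text.norm(dim=-1, keepdim=True)
    feats = instance_features.to(device).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    scores = (feats @ text.T).squeeze(-1)   # cosine similarity per instance
    k = min(top_k, scores.numel())
    return scores.topk(k)                   # (similarities, instance indices)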

Our Pipeline

Overview of the OVSG pipeline.

Our proposed pipeline operates on three inputs: user positional information Pu, user language descriptions Lu, and an RGB-D scan I.

The top section illustrates the construction of the scene graph Gs: Pu and Lu feed into it directly, while the RGB-D scan I is processed by OVIR-3D, yielding a position and a Detic feature for each instance. The language descriptions are then encoded, with poses handled by a dedicated Spatial Relationship Encoder.
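
To make the node and edge structure concrete, here is a minimal sketch of what a Gs entry might look like. The class name and the spatial-relationship encoding below are illustrative stand-ins, not the paper's implementation.

from dataclasses import dataclass
import numpy as np

@dataclass
class SceneNode:
    """One entity in Gs: an object instance, agent, or region."""
    node_id: int
    position: np.ndarray   # 3D centroid from OVIR-3D
    feature: np.ndarray    # per-instance open-vocabulary (Detic) feature

def spatial_relation_feature(a, b):
    """Stand-in for the Spatial Relationship Encoder: map the relative
    pose of two entities to an edge feature (illustrative only)."""
    offset = b.position - a.position
    return np.concatenate([offset, [np.linalg.norm(offset)]])

def build_edges(nodes):
    """Edges of Gs carry spatial-relationship features between node pairs."""
    return {
        (a.node_id, b.node_id): spatial_relation_feature(a, b)
        for a in nodes for b in nodes if a.node_id != b.node_id
    }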

The bottom section covers the formation of the query graph Gq, derived from the example query "I want to find Tom's bottle in the laboratory". An LLM decomposes the query into elements, which are then feature-encoded to form Gq.
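
The decomposition can be pictured as the LLM emitting structured entities and relations that are then feature-encoded. The schema below is a hypothetical rendering of the example query, not the paper's actual prompt format.

# Hypothetical decomposition of the example query into a query graph Gq.
# The schema (target / entities / relations) is illustrative only.
query = "I want to find Tom's bottle in the laboratory"

decomposition = {
    "target": 0,                              # central node of Gq
    "entities": [
        {"id": 0, "type": "object", "text": "bottle"},
        {"id": 1, "type": "agent",  "text": "Tom"},
        {"id": 2, "type": "region", "text": "laboratory"},
    ],
    "relations": [
        {"src": 0, "dst": 1, "text": "belongs to"},
        {"src": 0, "dst": 2, "text": "in"},
    ],
}
# Each entity/relation text is then feature-encoded (e.g., with a text
# encoder) to form the nodes and edges of Gq.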

The remaining challenge is matching Gq within Gs. We address it with a novel proposal-and-ranking algorithm; the desired entity is the central node of the top-ranked candidate.
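
A minimal sketch of the proposal-and-ranking idea, under the assumption that nodes carry comparable feature vectors: propose scene nodes similar to the query's central node, score each candidate's neighborhood against the query context, and return the best center. All function and field names are illustrative.

import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ground(query_graph, scene_nodes, neighbors, top_n=10):
    """Illustrative matcher: rank candidate subgraphs of Gs against Gq.

    query_graph: {"target": node, "context": [nodes]} with 'feature' arrays
    scene_nodes: list of scene-graph nodes with 'feature' arrays
    neighbors:   function mapping a scene node to its neighboring nodes
    """
    target = query_graph["target"]

    # 1. Proposal: scene nodes most similar to the query's central node.
    candidates = sorted(
        scene_nodes,
        key=lambda n: cos_sim(n["feature"], target["feature"]),
        reverse=True,
    )[:top_n]

    # 2. Ranking: score each candidate by how well its neighborhood
    #    covers the query's context nodes (simple best-match sum here).
    def score(center):
        s = cos_sim(center["feature"], target["feature"])
        for q_ctx in query_graph["context"]:
            s += max(
                (cos_sim(n["feature"], q_ctx["feature"]) for n in neighbors(center)),
                default=0.0,
            )
        return s

    # The desired entity is the central node of the top-ranked candidate.
    return max(candidates, key=score)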


Results

Results and qualitative examples of OVSG (ours), followed by results and qualitative examples of the ConceptFusion baseline.

Robot Application

Navigation

Navigation experiments use the ROSMASTER R2 Ackermann-steering robot, equipped with an Astra Pro Plus depth camera, a YDLidar TG 2D lidar, an IMU, and wheel encoders.

We generated a comprehensive map using an Intel RealSense D455 camera and ORB-SLAM3, fed the reconstruction into the open-vocabulary pipeline, and used the Robot Operating System (ROS) navigation stack with the TEB planner for goal-oriented navigation. For real-world navigation testing with OVSG, we designed a language-based object navigation task featuring seven objects in a laboratory, each paired with three queries.
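
For the goal-sending step, here is a minimal sketch using the standard ROS move_base action interface; the grounded (x, y) goal would come from OVSG, and the map frame and fixed heading are assumptions.

#!/usr/bin/env python
import rospy
import actionlib
from actionlib_msgs.msg import GoalStatus
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def navigate_to(x, y):
    """Send the grounded entity's (x, y) map position to move_base;
    the TEB planner handles local trajectory generation."""
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = "map"
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    goal.target_pose.pose.orientation.w = 1.0  # arbitrary fixed heading

    client.send_goal(goal)
    client.wait_for_result()
    return client.get_state() == GoalStatus.SUCCEEDED

if __name__ == "__main__":
    rospy.init_node("ovsg_nav_demo")
    navigate_to(1.0, 2.0)  # goal produced by grounding a language query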

Manipulation

We demonstrate the utility of OVSG in a real-world manipulation scenario: a complex pick-and-place task.

The evaluations utilized a Kuka IIWA 14 robot arm with a Robotiq 3-finger adaptive gripper and an Intel RealSense D435 camera capturing RGB-D data at 1280×720 resolution. With the gripper operating in "Pinch Mode", the robot first positioned the camera above the table, then processed the RGB-D data and object queries via the OVSG system, and finally moved to the target objects through the robot's ROS interface. For real-world testing, we introduced a block-building task in which blocks had to be picked and placed according to natural-language queries; the task is challenging because multiple similar blocks can only be distinguished through spatial context.
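
The query-to-action loop can be sketched as follows; ovsg_ground and the robot wrapper's methods are hypothetical placeholders, since the paper does not expose its manipulation interface by name.

from dataclasses import dataclass
import numpy as np

@dataclass
class Grounding:
    position: np.ndarray  # 3D position of the grounded entity

def ovsg_ground(rgbd_frame, query):
    """Placeholder for the full OVSG pipeline: RGB-D + language -> entity."""
    raise NotImplementedError

def pick_and_place(robot, rgbd_frame, pick_query, place_query):
    """Ground both queries, then drive the arm through its ROS interface.
    `robot` is a hypothetical wrapper around the Kuka's motion commands."""
    pick = ovsg_ground(rgbd_frame, pick_query)    # e.g. "the block left of the red block"
    place = ovsg_ground(rgbd_frame, place_query)
    robot.move_above(pick.position)   # approach the target from above the table
    robot.close_gripper()             # Robotiq gripper in "Pinch Mode"
    robot.move_above(place.position)
    robot.open_gripper()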

BibTeX

If you find our work useful, consider citing:
OVSG
@inproceedings{chang2023contextaware,
  title={Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs},
  author={Haonan Chang and Kowndinya Boyalakuntla and Shiyang Lu and Siwei Cai and Eric Pu Jing and Shreesh Keskar and Shijie Geng and Adeeb Abbas and Lifeng Zhou and Kostas Bekris and Abdeslam Boularias},
  booktitle={7th Annual Conference on Robot Learning},
  year={2023},
  url={https://openreview.net/forum?id=cjEI5qXoT0}
}
OVIR-3D
@inproceedings{lu2023ovird,
  title={{OVIR}-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data},
  author={Shiyang Lu and Haonan Chang and Eric Pu Jing and Abdeslam Boularias and Kostas Bekris},
  booktitle={7th Annual Conference on Robot Learning},
  year={2023},
  url={https://openreview.net/forum?id=gVBvtRqU1_}
}