We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries.
Unlike conventional semantic-based object localization approaches, our system facilitates context-aware entity localization, allowing for queries such as "pick up a cup on a kitchen table" or "navigate to a sofa on which someone is sitting". In contrast to existing research on 3D scene graphs, OVSG supports free-form text input and open-vocabulary querying.
Through a series of comparative experiments using the ScanNet dataset and a self-collected dataset, we demonstrate that our proposed approach significantly surpasses the performance of previous semantic-based localization techniques. Moreover, we highlight the practical application of OVSG in real-world robot navigation and manipulation experiments.
At the core of our system lies Open-Vocabulary 3D Instance Retrieval (OVIR-3D). Given a 3D scan reconstructed from an RGB-D video and a text query, the method retrieves the relevant 3D instances (see examples a-c). Notably, it can detect instances that are not part of the original annotations (see examples d-e), such as the cushions on the sofa.
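To make the retrieval step concrete, here is a minimal sketch of text-driven instance retrieval, assuming each fused 3D instance carries a text-aligned feature vector and that ranking is done by cosine similarity; the function name retrieve_instances and the feature dimension are illustrative, not the exact OVIR-3D implementation.

```python
import numpy as np


def retrieve_instances(query_feat: np.ndarray,
                       instance_feats: np.ndarray,
                       top_k: int = 5) -> np.ndarray:
    """Return indices of the top-k 3D instances whose fused text-aligned
    features best match the encoded text query (cosine similarity)."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    f = instance_feats / (np.linalg.norm(instance_feats, axis=1, keepdims=True) + 1e-8)
    return np.argsort(-(f @ q))[:top_k]


# Toy usage: 10 instances with 512-d fused features and one query embedding
# (in practice both would come from the same text-aligned encoder).
rng = np.random.default_rng(0)
print(retrieve_instances(rng.normal(size=512), rng.normal(size=(10, 512)), top_k=3))
```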
Our proposed pipeline takes three inputs: user-specified positions Pu, user language descriptions Lu, and an RGB-D scan I.
The top section shows the construction of the scene graph Gs: Pu and Lu feed into it directly, while the RGB-D scan I is processed by OVIR-3D, which outputs a position and a Detic feature for each instance. The language descriptions are encoded into features, and the poses are processed by a dedicated Spatial Relationship Encoder.
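The sketch below shows one plausible way to represent Gs under the assumptions above: nodes hold an entity category and a feature vector, and edges hold an encoded spatial relation. The dataclasses and the normalized-offset relation encoding are stand-ins of our own; OVSG's Spatial Relationship Encoder is learned.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Node:
    name: str            # e.g. "bottle", "Tom", "laboratory"
    category: str        # "object" | "agent" | "region"
    feature: np.ndarray  # language/vision embedding for this entity


@dataclass
class Edge:
    src: int             # index of the source node
    dst: int             # index of the target node
    feature: np.ndarray  # encoding of the spatial relation src -> dst


@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)


def encode_spatial_relation(pos_a, pos_b) -> np.ndarray:
    """Stand-in spatial-relationship encoder: embeds the normalized offset
    between two entity positions (OVSG uses a learned encoder instead)."""
    offset = np.asarray(pos_b, dtype=float) - np.asarray(pos_a, dtype=float)
    return offset / (np.linalg.norm(offset) + 1e-8)


# Toy usage: two instances from OVIR-3D plus one spatial edge between them.
gs = SceneGraph()
gs.nodes.append(Node("bottle", "object", np.random.randn(512)))
gs.nodes.append(Node("table", "object", np.random.randn(512)))
gs.edges.append(Edge(0, 1, encode_spatial_relation([0.2, 0.1, 0.9], [0.0, 0.0, 0.7])))
```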
The bottom section details the formation of the query graph Gq, derived from the example query "I want to find Tom's bottle in the laboratory". An LLM decomposes the query into entities and relations, which are then feature-encoded to form Gq.
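As an illustration of this decomposition step, the snippet below shows the kind of structured output an LLM could be prompted to return for this query, and how it maps onto Gq; the JSON schema here is hypothetical, not OVSG's actual prompt format.

```python
import json

# Hypothetical decomposition an LLM might return for the example query
# "I want to find Tom's bottle in the laboratory".
llm_output = """
{
  "target":  {"name": "bottle", "category": "object"},
  "related": [
    {"name": "Tom",        "category": "agent",  "relation": "belongs to"},
    {"name": "laboratory", "category": "region", "relation": "inside"}
  ]
}
"""

parsed = json.loads(llm_output)

# Gq: the target entity becomes the central node; each related entity is
# attached to it by an edge labeled with the relation phrase, which is
# then feature-encoded like the scene-graph edges.
query_nodes = [parsed["target"], *parsed["related"]]
query_edges = [(0, i + 1, r["relation"]) for i, r in enumerate(parsed["related"])]
print(query_nodes)
print(query_edges)
```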
The remaining challenge is to locate Gq within Gs, which we address with a novel proposal-and-ranking algorithm; the desired entity is the central node of the top-ranked candidate subgraph.
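A simplified stand-in for this matching step: assuming the proposal stage has already aligned each candidate subgraph's nodes and edges one-to-one with Gq, candidates can be ranked by summed feature similarity, as sketched below.

```python
import numpy as np


def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def subgraph_score(query_nodes, query_edges, cand_nodes, cand_edges) -> float:
    """Score a candidate subgraph of Gs against Gq as the sum of node- and
    edge-feature similarities, assuming the proposal step has put candidate
    nodes and edges in one-to-one correspondence with Gq."""
    node_term = sum(_cos(q, c) for q, c in zip(query_nodes, cand_nodes))
    edge_term = sum(_cos(q, c) for q, c in zip(query_edges, cand_edges))
    return node_term + edge_term


def rank_candidates(query_nodes, query_edges, candidates):
    """Sort candidate subgraphs best-first; the grounded entity is the
    central node of the first element."""
    return sorted(candidates,
                  key=lambda c: -subgraph_score(query_nodes, query_edges, *c))
```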
@inproceedings{chang2023contextaware,
  title={Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs},
  author={Haonan Chang and Kowndinya Boyalakuntla and Shiyang Lu and Siwei Cai and Eric Pu Jing and Shreesh Keskar and Shijie Geng and Adeeb Abbas and Lifeng Zhou and Kostas Bekris and Abdeslam Boularias},
  booktitle={7th Annual Conference on Robot Learning},
  year={2023},
  url={https://openreview.net/forum?id=cjEI5qXoT0}
}
@inproceedings{lu2023ovird,
  title={{OVIR}-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data},
  author={Shiyang Lu and Haonan Chang and Eric Pu Jing and Abdeslam Boularias and Kostas Bekris},
  booktitle={7th Annual Conference on Robot Learning},
  year={2023},
  url={https://openreview.net/forum?id=gVBvtRqU1_}
}