EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Anonymous Authors

Abstract

Embodied AI represents a critical frontier in robotics, where the capacity to devise coherent action sequences that allow robots to accomplish specific tasks in physical environments is both essential and challenging. Recently, there has been increasing interest in developing large foundation models capable of generating embodied plans from natural language instructions and environmental observations. In this paper, we introduce EmbodiedGPT, a large-scale, end-to-end multimodal foundation model for embodied AI.

Our contributions are threefold. First, we carefully craft a large-scale embodied planning dataset, EgoCOT, encompassing extensive videos and planning instructions derived from the Ego4D dataset, with scrupulous prompt design and quality assurance. Second, we propose a cost-effective training method for end-to-end multimodal large models that generates a sequence of sub-goals conditioned on observations by harnessing the power of the "Chain of Thought." Specifically, we strengthen the 7B language model's capacity to produce high-quality plans by training prefix adapters on the EgoCOT dataset, which keeps the language model's responses from becoming overly divergent. Lastly, we introduce a paradigm for extracting task-related features from LLM-generated planning queries, forming a closed loop between high-level planning and low-level control. Our comprehensive experiments substantiate that our model effectively bolsters performance on embodied tasks, including Embodied Planning, Embodied Control, Visual Captioning, and Visual Q&A.
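To make the second contribution concrete, the snippet below is a minimal, hypothetical sketch of prefix-adapter tuning on chain-of-thought planning text: a small set of learnable prefix embeddings is prepended to the frozen 7B language model's input embeddings and trained with a standard causal language-modeling loss on EgoCOT-style instruction-and-plan token sequences. The module and function names are illustrative placeholders, and the sketch assumes a HuggingFace-style causal LM that accepts inputs_embeds; it is not the released implementation.

import torch
import torch.nn as nn


class PrefixAdapter(nn.Module):
    """Learnable prefix embeddings prepended to a frozen LM's input embeddings."""

    def __init__(self, num_prefix_tokens: int, hidden_dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_prefix_tokens, hidden_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, hidden_dim)
        prefix = self.prefix.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)


def planning_loss(frozen_lm, embed_layer, adapter, input_ids, target_ids):
    """Causal-LM loss on chain-of-thought plan tokens; only the adapter gets gradients."""
    with torch.no_grad():
        token_embeds = embed_layer(input_ids)          # frozen token embeddings
    inputs_embeds = adapter(token_embeds)              # prepend the learnable prefix
    logits = frozen_lm(inputs_embeds=inputs_embeds).logits
    logits = logits[:, adapter.prefix.size(0):, :]     # drop prefix positions
    return nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
        ignore_index=-100,                             # positions masked by the caller
    )

Only the adapter's parameters are optimized, which keeps the cost low while steering the frozen language model toward structured sub-goal plans rather than overly divergent free-form text.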

Framework

The overall framework is shown below:

[Framework overview figure]
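The third contribution, closing the loop between high-level planning and low-level control, can be illustrated with a similarly hedged sketch: embeddings of the generated plan act as queries that attend over visual features to pull out task-relevant information, which then conditions a low-level policy head. Module names, dimensions, and the mean-pooling choice are assumptions made for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn


class PlanConditionedPolicy(nn.Module):
    def __init__(self, feat_dim: int = 768, action_dim: int = 7):
        super().__init__()
        # Plan-token queries attend over visual patch features to select task-relevant ones.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.policy_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, plan_embeds: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # plan_embeds:  (batch, num_plan_tokens, feat_dim) from the language model
        # visual_feats: (batch, num_patches, feat_dim) from the vision encoder
        task_feats, _ = self.cross_attn(plan_embeds, visual_feats, visual_feats)
        pooled = task_feats.mean(dim=1)          # one task-relevant feature vector
        return self.policy_head(pooled)          # low-level action for the controller


if __name__ == "__main__":
    policy = PlanConditionedPolicy()
    action = policy(torch.randn(1, 16, 768), torch.randn(1, 196, 768))
    print(action.shape)  # torch.Size([1, 7])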


Demos in Franka Kitchen

Task animations: Open the cabinet, Turn the light on, Open the microwave oven, Slide the door open.

Demos in Meta-World

Task animations: Assemble task, Place the bin, Hammer the nail, Open the drawer.

BibTeX

@article{anonymousembodiedgpt,
  author    = {Anonymous},
  title     = {EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought},
  journal   = {Under Review},
  year      = {2023},
}