Presentation: LAV-ACT: Language-Augmented Visual ACT for Bimanual Robotic Manipulation, ICARA 2025.
Author: Dhurba Tripathi
Uploaded: 2025-06-09
Project page: https://dktpt44.github.io/LAV-ACT/
ICARA 2025
========================
Abstract:
Bimanual robotic manipulation, involving the coordinated use of two robotic arms, is essential for tasks requiring complex, synchronous actions. Action Chunking with Transformers (ACT) is a representative framework that enables robots to break down complex tasks into manageable sequences, facilitating autonomous learning of multi-step actions. However, we observe critical limitations in the ACT framework: it relies solely on visual observations as input, focusing on task-specific action predictions, and it uses a simple ResNet-based feature extractor for image processing, which is often insufficient for complex, multi-view bimanual arm observations. In this paper, we introduce an enhanced language-driven version of ACT that leverages Voltron—a language-driven representation model—to incorporate both visual observations and language prompts into dense, multi-modal embeddings. These embeddings are used to condition the ResNet backbone feature maps through Feature-wise Linear Modulation (FiLM), allowing our model to integrate contextually relevant linguistic information with visual data for more adaptive action chunking. Extensive experiments show that our approach significantly improves the performance of bimanual robot arms in executing complex, multi-step tasks guided by language cues, outperforming traditional ACT methods.
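
For illustration, below is a minimal PyTorch sketch of the FiLM-style conditioning the abstract describes: a pooled multi-modal embedding (e.g., from a Voltron-like language-vision encoder) is projected to per-channel scale and shift parameters that modulate a ResNet feature map. The class name, tensor dimensions, and pooling choice are assumptions made for this example, not details taken from the paper.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts convolutional
    feature maps channel-wise, conditioned on an external embedding."""
    def __init__(self, embed_dim: int, num_channels: int):
        super().__init__()
        # Project the conditioning embedding to per-channel gamma and beta.
        self.to_gamma_beta = nn.Linear(embed_dim, 2 * num_channels)

    def forward(self, feature_map: torch.Tensor, cond_embed: torch.Tensor) -> torch.Tensor:
        # feature_map: (B, C, H, W); cond_embed: (B, embed_dim)
        gamma, beta = self.to_gamma_beta(cond_embed).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return gamma * feature_map + beta

if __name__ == "__main__":
    batch, channels, embed_dim = 2, 512, 384      # illustrative sizes only
    film = FiLMLayer(embed_dim, channels)
    resnet_features = torch.randn(batch, channels, 7, 7)  # e.g., a ResNet stage output
    lang_vis_embed = torch.randn(batch, embed_dim)         # e.g., pooled language-vision embedding
    modulated = film(resnet_features, lang_vis_embed)
    print(modulated.shape)  # torch.Size([2, 512, 7, 7])
```

In this sketch, the modulated feature maps would then feed the downstream action-chunking transformer in place of the unconditioned ResNet features.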
