Continuously Improving Mobile Manipulation with Autonomous Real-World RL

Russell Mendonca1    Bernadette Bucher2    Jiuguang Wang2    Deepak Pathak1
1Carnegie Mellon University          2Boston Dynamics AI Institute

In Submission

Our approach learns skills directly in the real world via RL, without any demonstrations or simulation training.

Abstract

To build generalist robots capable of executing a wide array of tasks across diverse environments, robots must be able to engage directly with the real world to acquire and refine skills without extensive instrumentation or human supervision. This work presents a fully autonomous real-world reinforcement learning framework for mobile manipulation that can both independently gather data and refine policies through accumulated experience in the real world. It has several key components: 1) automated data collection strategies that guide the robot's exploration toward object interactions, 2) goal cycles for real-world RL, such that the robot changes goals once it has made sufficient progress and the different goals serve as resets for one another, 3) efficient control that leverages basic task knowledge present in behavior priors in conjunction with policy learning, and 4) generic rewards that combine human-interpretable semantic information with low-level, fine-grained state information. We demonstrate our approach on Boston Dynamics Spot robots, which continually improve performance on a set of four challenging mobile manipulation tasks, and show that this enables competent policy learning, obtaining an average success rate of 80% across tasks, a 3-4x improvement over existing approaches.

Approach Overview


Task-relevant Autonomy: The robot needs to collect data with a high signal-to-noise ratio in order to learn efficiently. We use an auto-grasp procedure that relies on segmentation models to identify objects of interest and grasp them before running the neural policy. Further, we use goal cycles and/or multiple robots to automate resets for continual learning.
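
As a concrete illustration, here is a minimal Python sketch of this autonomy loop, assuming hypothetical robot, segmenter, policy, and goal interfaces (placeholders, not the actual robot API): the robot segments and auto-grasps the object of interest, runs the learned policy, and switches to the next goal in the cycle once sufficient progress is made, so that each goal resets the scene for the next.

    from itertools import cycle

    def autonomy_loop(robot, segmenter, policy, goals,
                      progress_threshold=0.9, max_steps=200):
        goal_cycle = cycle(goals)              # goals serve as resets for one another
        goal = next(goal_cycle)
        while True:
            # Language-guided segmentation to find the object of interest.
            mask = segmenter.detect(robot.get_rgb(), goal.object_name)
            if mask is None:
                robot.search_for_object(goal.object_name)   # re-orient until visible
                continue
            # Auto-grasp before handing control to the neural policy.
            robot.grasp(robot.pixel_to_grasp(mask, robot.get_depth()))
            for _ in range(max_steps):
                obs = robot.get_observation()
                robot.step(policy.act(obs, goal))
                if goal.progress(obs) > progress_threshold:  # enough progress: switch goal
                    goal = next(goal_cycle)
                    break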

Efficient Control: We combine a neural controller with behavior priors, which can take the form of planners over simplified models or simple scripts that generate suboptimal behavior. Neural policy learning is driven by model-free RL (Q-learning), sampling data from both the online policy and the prior.
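
As a sketch of this data-mixing idea (assuming a discretized action space, replay buffers of (obs, action, reward, next_obs, done) tensor tuples, and PyTorch networks; an illustration, not the exact implementation), each Q-learning update can draw part of its batch from prior-generated data and the rest from online policy data:

    import random
    import torch
    import torch.nn.functional as F

    def q_update(q_net, target_net, optimizer, prior_buffer, online_buffer,
                 batch_size=256, prior_frac=0.5, gamma=0.99):
        # Draw a batch from both data sources: the behavior prior and the online policy.
        n_prior = int(batch_size * prior_frac)
        batch = (random.sample(prior_buffer, n_prior)
                 + random.sample(online_buffer, batch_size - n_prior))
        obs, act, rew, next_obs, done = map(torch.stack, zip(*batch))
        with torch.no_grad():
            # Standard Q-learning target: bootstrap from the max Q-value at the next state.
            target = rew + gamma * (1 - done) * target_net(next_obs).max(dim=-1).values
        q = q_net(obs).gather(-1, act.long().unsqueeze(-1)).squeeze(-1)
        loss = F.mse_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()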

Flexible Supervision: We use language-guided detection and vision segmentation models to identify objects of interest, and combine them with low-level depth observations for state estimation, which is then used to specify rewards.
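
For example, a generic reward of this form might detect the object with an open-vocabulary model, deproject its pixel location through the depth image to get a 3D position estimate, and score the distance to a goal location. The detector and camera interfaces below are hypothetical, and the exact reward terms are an assumption:

    import numpy as np

    def compute_reward(rgb, depth, detector, camera, object_name, goal_xyz, grasped):
        # Semantic component: language-guided detection of the object of interest.
        box = detector.detect(rgb, object_name)
        if box is None:
            return -1.0                           # penalty when the object is not visible
        u, v = (int(c) for c in box.center())
        # Fine-grained component: depth-based 3D state estimate of the object.
        xyz = camera.deproject(u, v, depth[v, u])
        dist = np.linalg.norm(np.asarray(xyz) - np.asarray(goal_xyz))
        return -dist + (1.0 if grasped else 0.0)  # distance term plus a grasp bonus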

Efficient continual improvement

Our approach leverages both RL and a behavior prior (such as a planner with a simplified model, or a simple script that generates behaviors). The combination learns much faster than vanilla RL and reaches much better performance than the prior alone. Note that we use the task-relevant autonomy procedures for all methods to make learning feasible.
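
One simple way to interleave the prior with the learned policy during data collection is to follow the prior with a probability that anneals as experience accumulates; the schedule below is an illustrative assumption rather than the exact procedure used:

    import random

    def select_action(policy, prior, obs, step,
                      prior_prob_start=0.5, decay_steps=10_000):
        # Probability of following the prior decays as the policy improves.
        prior_prob = prior_prob_start * max(0.0, 1.0 - step / decay_steps)
        if random.random() < prior_prob:
            return prior.act(obs)   # e.g. a planner with a simplified model, or a script
        return policy.act(obs)      # learned neural policy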

Exploration Learning Time-Lapse

We show the evolution of the robot's behavior as it practices and learns skills. These skills are learned over 8-10 hours of practice in the real world.