We propose to solve quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs, and that Transformer-based models further improve generalization across environments.
We propose to incorporate both proprioceptive and visual information for locomotion tasks using a novel Transformer model, LocoTransformer. Our model consists of two components: (i) separate modality encoders for proprioceptive inputs and visual inputs that project both modalities into a shared latent feature space; (ii) a shared Transformer encoder that performs spatial attention over visual tokens, as well as cross-modality attention between proprioceptive and visual features, to predict the actions and values.
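The two-component design above can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact architecture: all dimensions, layer counts, and module names (e.g. `LocoTransformerSketch`, `proprio_encoder`) are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class LocoTransformerSketch(nn.Module):
    """Illustrative sketch of a two-encoder + shared Transformer design.

    All hyperparameters (proprio_dim, embed_dim, layer counts) are
    hypothetical placeholders, not the paper's actual settings.
    """

    def __init__(self, proprio_dim=93, embed_dim=128, n_heads=4,
                 n_layers=2, action_dim=12):
        super().__init__()
        # (i) Separate modality encoders projecting into a shared latent space.
        self.proprio_encoder = nn.Sequential(
            nn.Linear(proprio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # A small ConvNet turns a depth image into a grid of visual tokens.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=5, stride=2), nn.ReLU(),
        )
        # (ii) A shared Transformer encoder: self-attention over all tokens
        # covers both spatial attention among visual tokens and cross-modal
        # attention between the proprioceptive token and visual tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.policy_head = nn.Linear(embed_dim, action_dim)
        self.value_head = nn.Linear(embed_dim, 1)

    def forward(self, proprio, depth):
        p_tok = self.proprio_encoder(proprio).unsqueeze(1)  # (B, 1, D)
        v = self.visual_encoder(depth)                      # (B, D, H, W)
        v_tok = v.flatten(2).transpose(1, 2)                # (B, H*W, D)
        tokens = torch.cat([p_tok, v_tok], dim=1)           # (B, 1+H*W, D)
        fused = self.transformer(tokens)
        pooled = fused.mean(dim=1)  # pool all tokens for the output heads
        return self.policy_head(pooled), self.value_head(pooled)
```

Concatenating the single proprioceptive token with the grid of visual tokens lets one standard self-attention stack realize both attention patterns at once, which is the design choice the component list describes.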
We evaluate our proposed method on challenging simulated environments, including tasks such as maneuvering around obstacles of different sizes and shapes, dynamically moving obstacles, as well as rough mountainous terrain. We show that jointly learning policies with both proprioceptive states and vision significantly improves locomotion in challenging environments, and that policies further benefit from adopting our cross-modal Transformer. We also show that LocoTransformer generalizes much better to unseen environments. Lastly, we qualitatively show our method learns to anticipate changes in the environment using vision as guidance.
We visualize the self-attention weights between the proprioceptive token and all visual tokens in the last layer of our Transformer model. We plot the attention weights over the raw visual input, where warmer colors represent larger attention weights. The agent learns to automatically attend to critical visual regions (obstacles in (a)-(d), steep terrain in (e) and (h), and the goal location in (f)-(h)) for planning its motion.
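A heatmap of this kind can be produced by reshaping the proprioceptive token's attention row back onto the visual token grid and upsampling it to image resolution. The sketch below is a hypothetical post-processing helper, assuming the attention row has already been extracted from the last layer (entry 0 being the proprioceptive token's attention to itself):

```python
import numpy as np

def attention_heatmap(attn, grid_hw, image_hw):
    """Project the proprioceptive token's attention over visual tokens
    back onto the raw input image as a normalized heatmap.

    attn     : (1 + H*W,) attention row of the proprioceptive token taken
               from the last Transformer layer (a hypothetical input here);
               entry 0 is its self-attention and is discarded.
    grid_hw  : (H, W) spatial layout of the visual tokens.
    image_hw : (h, w) resolution of the raw visual input to overlay on.
    """
    h_tok, w_tok = grid_hw
    # Keep only attention to visual tokens, restored to their grid layout.
    vis_attn = attn[1:].reshape(h_tok, w_tok)
    # Nearest-neighbour upsample each token cell to image resolution.
    sy, sx = image_hw[0] // h_tok, image_hw[1] // w_tok
    heat = np.kron(vis_attn, np.ones((sy, sx)))
    # Normalize to [0, 1] so warmer colors mark larger attention weights.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat
```

The normalized map can then be blended over the raw depth image with any standard colormap to obtain the overlays described above.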
To understand how the policy learned by our method behaves in the presence of obstacles and challenging terrain, we visualize a representative set of episodes and the corresponding decisions made by the learned policy. We evaluate all methods in 6 distinct environments with varying terrain, obstacles to avoid, and spheres to collect for reward bonuses.
wide cuboid obstacles on a flat terrain, without spheres
wide cuboid obstacles on a flat terrain, including spheres that give a reward bonus when collected
numerous thin cuboid obstacles on a flat terrain, without spheres
numerous thin cuboid obstacles on a flat terrain, including spheres that give a reward bonus when collected
similar to the Thin Obstacle environment, but with obstacles that move dynamically in random directions, updated at a low frequency
a rugged mountain range with a goal at the top of the mountain
We evaluate the generalization ability of all methods by transferring policies trained on Thin Obstacle to an unseen environment containing chairs and tables.