Learning Vision-Guided Quadrupedal Locomotion
End-to-End with Cross-Modal Transformers

Ruihan Yang*,  Minghao Zhang*,  Nicklas Hansen,  Huazhe Xu,  Xiaolong Wang

Code (coming soon)

We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We transfer our learned policy from simulation to a real robot by running it indoors and in the wild with unseen obstacles and terrain. Our method not only significantly improves over baselines, but also achieves far better generalization performance, especially when transferred to the real robot.

Demo Video


We propose to incorporate both proprioceptive and visual information for locomotion tasks using a novel Transformer model, LocoTransformer. Our model consists of two components: (i) separate modality encoders for proprioceptive and visual inputs that project both modalities into a shared latent feature space; (ii) a shared Transformer encoder that performs cross-modal attention between proprioceptive features and visual features, as well as spatial attention over visual tokens, to predict actions and values.
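The cross-modal attention idea above can be illustrated with a minimal NumPy sketch: a single proprioceptive token and a grid of visual tokens are concatenated into one sequence, and standard self-attention lets each token attend across both modalities. The single attention head, the latent dimension, and the token counts below are illustrative choices, not the exact architecture from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over a token sequence of shape (N, d)."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (N, N) attention weights
    return attn @ V, attn

rng = np.random.default_rng(0)
d = 32                                    # shared latent dimension (illustrative)
proprio_token = rng.normal(size=(1, d))   # stand-in for the proprioceptive encoder output
visual_tokens = rng.normal(size=(16, d))  # stand-in for a flattened 4x4 CNN feature map

# Concatenate both modalities into one token sequence and attend jointly.
tokens = np.concatenate([proprio_token, visual_tokens], axis=0)  # (17, d)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out, attn = self_attention(tokens, Wq, Wk, Wv)

# Row 0 of `attn` holds the proprioceptive token's attention over all tokens;
# its entries for columns 1..16 correspond to weights over visual tokens.
print(out.shape, attn.shape)  # (17, 32) (17, 17)
```

In the full model, the fused token features would then be pooled and fed to policy and value heads; the sketch only shows how one attention layer mixes the two modalities.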


We evaluate our method in simulation and in the real world. In simulation, a quadruped robot traverses a set of challenging and diverse environments. In the real world, we conduct experiments in indoor scenarios with obstacles and in the wild with complex terrain and novel obstacles.

Self-attention from our shared Transformer module.

We visualize the self-attention between the proprioceptive token and all visual tokens in the last layer of our Transformer model. We plot the attention weights over the raw visual input, where warmer colors represent larger attention weights.
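A visualization of this kind can be sketched by taking the proprioceptive token's attention row over the visual tokens, reshaping it into the visual token grid, and upsampling it to image resolution before blending with the depth frame. The grid size and image resolution below are hypothetical placeholders:

```python
import numpy as np

g = 4          # visual tokens form a g x g grid (hypothetical)
H, W = 64, 64  # raw depth-image resolution (hypothetical)

rng = np.random.default_rng(0)
weights = rng.random(g * g)
weights /= weights.sum()  # proprio-to-visual attention row, normalized

# Reshape to the token grid, then nearest-neighbor upsample each cell
# to an (H//g, W//g) block so the heatmap matches the image size.
heat = weights.reshape(g, g)
overlay = np.kron(heat, np.ones((H // g, W // g)))
print(overlay.shape)  # (64, 64)
```

The resulting `overlay` can be alpha-blended onto the raw depth image with any colormap to obtain the warm/cool heatmaps shown above.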

In environments with obstacles, the agent learns to automatically attend to the obstacles.

On challenging terrain, the agent attends to the goal destination and the local terrain in an alternating manner.

Visualization in Simulation

To understand how the policy learned by our method behaves around obstacles and on challenging terrain in the simulated environments, we visualize a representative set of episodes and the corresponding decisions made by the policy. We evaluate all methods in 6 distinct environments with varying terrain, obstacles to avoid, and spheres to collect for reward bonuses.

Wide Obstacle & Sphere

Wide cuboid obstacles on flat terrain, including spheres that give a reward bonus when collected

Thin Obstacle & Sphere

Numerous thin cuboid obstacles on flat terrain, including spheres that give a reward bonus when collected

Moving Obstacle

Similar to Thin Obstacle, but the obstacles now move dynamically in random directions, updated at a low frequency

Our LocoTransformer State-Depth-Concat State-Only


Mountain

A rugged mountain range with the goal at the top of the mountain

Our LocoTransformer State-Depth-Concat State-Only

Generalization to an Unseen Environment with Chairs & Tables

We evaluate the generalization ability of each method by transferring policies trained on Thin Obstacle to an unseen environment with chairs & tables.

Our LocoTransformer State-Depth-Concat State-Only

Visualization in the Real World

To validate our method in real-world scenes beyond simulation, we conduct experiments in both indoor scenarios with obstacles and in-the-wild scenarios.

Indoor – Fixed Obstacles

Fixed obstacles placed in a hallway; the obstacles share a similar appearance with the obstacles used during training in simulation

Our LocoTransformer State-Depth-Concat

Indoor – Unseen Chairs & Desks

Fixed chairs & desks, which do not appear in training, placed in a hallway

Our LocoTransformer State-Depth-Concat

In The Wild - Forest

In a forest with leaves and twigs on the uneven ground

Our LocoTransformer State-Depth-Concat

In The Wild – Near Glass Wall

In front of a set of poles and near a glass wall

Our LocoTransformer State-Depth-Concat

Our LocoTransformer agent in different real-world scenarios

Footpath Courtyard
Outdoor with Chairs Footpath & Grassland
Hallway In Community



@misc{yang2021learning,
  title={Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers},
  author={Ruihan Yang and Minghao Zhang and Nicklas Hansen and Huazhe Xu and Xiaolong Wang},
  year={2021},
  eprint={2107.03996},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}