Learning Vision-Guided Quadrupedal Locomotion
End-to-End with Cross-Modal Transformers

Ruihan Yang*,  Minghao Zhang*,  Nicklas Hansen,  Huazhe Xu,  Xiaolong Wang

Code (coming soon)

We propose to solve quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs, and that Transformer-based models further improve generalization across environments.


We propose to incorporate both proprioceptive and visual information for locomotion tasks using a novel Transformer model, LocoTransformer. Our model consists of two components: (i) separate modality encoders for proprioceptive and visual inputs that project both modalities into a latent feature space; (ii) a shared Transformer encoder that performs spatial attention over visual tokens, as well as cross-modality attention between proprioceptive and visual features, to predict actions and values.
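The two components above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the dimensions, the random linear projections standing in for the learned modality encoders, the single attention head and layer, the per-modality mean pooling, and the helper names (`encode_proprio`, `encode_depth`, `self_attention`) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
PROPRIO_DIM = 48   # length of the proprioceptive state vector
DEPTH_HW = 64      # depth image is DEPTH_HW x DEPTH_HW
PATCH = 16         # each visual token covers a PATCH x PATCH region
D_MODEL = 32       # shared latent feature size
ACTION_DIM = 12    # assumed: one target per joint

# Random matrices stand in for learned weights.
W_proprio = rng.normal(scale=0.1, size=(PROPRIO_DIM, D_MODEL))
W_patch = rng.normal(scale=0.1, size=(PATCH * PATCH, D_MODEL))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(D_MODEL, D_MODEL)) for _ in range(3))
W_action = rng.normal(scale=0.1, size=(2 * D_MODEL, ACTION_DIM))
W_value = rng.normal(scale=0.1, size=(2 * D_MODEL, 1))


def encode_proprio(state):
    """Project the proprioceptive state to a single token of shape (1, D_MODEL)."""
    return np.tanh(state @ W_proprio)[None, :]


def encode_depth(depth):
    """Split the depth image into patches and project each one to a visual token."""
    n = DEPTH_HW // PATCH
    patches = depth.reshape(n, PATCH, n, PATCH).transpose(0, 2, 1, 3)
    return np.tanh(patches.reshape(n * n, PATCH * PATCH) @ W_patch)


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = q @ k.T / np.sqrt(D_MODEL)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return tokens + weights @ v, weights          # residual connection


state = rng.normal(size=PROPRIO_DIM)
depth = rng.uniform(size=(DEPTH_HW, DEPTH_HW))

# Token 0 is proprioceptive; the remaining tokens are visual.
tokens = np.concatenate([encode_proprio(state), encode_depth(depth)], axis=0)
fused, attn = self_attention(tokens)

# Pool each modality separately, then predict an action and a value from both.
features = np.concatenate([fused[0], fused[1:].mean(axis=0)])
action = features @ W_action
value = (features @ W_value)[0]
print(tokens.shape, attn.shape, action.shape)
```

Because all tokens share one attention matrix, the same operation covers attention among visual tokens (spatial) and between the proprioceptive token and the visual tokens (cross-modal).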


We evaluate our proposed method on challenging simulated environments, including tasks such as maneuvering around obstacles of different sizes and shapes, dynamically moving obstacles, as well as rough mountainous terrain. We show that jointly learning policies with both proprioceptive states and vision significantly improves locomotion in challenging environments, and that policies further benefit from adopting our cross-modal Transformer. We also show that LocoTransformer generalizes much better to unseen environments. Lastly, we qualitatively show our method learns to anticipate changes in the environment using vision as guidance.

Self-attention from our shared Transformer module.

We visualize the self-attention weights between the proprioceptive token and all visual tokens in the last layer of our Transformer model. We plot the attention weights over the raw visual input, where warmer colors represent larger attention weights. The agent learns to automatically attend to critical visual regions (obstacles in (a)(b)(c)(d), high-slope terrain in (e)(h), goal location in (f)(g)(h)) for planning its motion.
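The visualization described above can be approximated with a short sketch like the following. It assumes the last-layer attention matrix is available as an array whose row 0 corresponds to the proprioceptive token and whose remaining columns are visual tokens laid out in row-major order on a square grid. The function name `proprio_attention_map`, the grid size, and the nearest-neighbor upsampling are illustrative assumptions rather than details from the paper.

```python
import numpy as np


def proprio_attention_map(attn, grid, image_hw):
    """Return an image-sized heatmap of proprioceptive-to-visual attention.

    attn     : (1 + grid*grid, 1 + grid*grid) weights, token 0 = proprioceptive
    grid     : number of visual tokens per image side (assumed square layout)
    image_hw : side length of the raw visual input, a multiple of `grid`
    """
    row = attn[0, 1:]          # proprioceptive token -> each visual token
    row = row / row.sum()      # renormalize over visual tokens only
    heat = row.reshape(grid, grid)
    scale = image_hw // grid
    # Nearest-neighbor upsampling: each token's weight fills its image region.
    return np.kron(heat, np.ones((scale, scale)))


# Toy attention matrix: uniform rows, except that the proprioceptive token
# concentrates its attention on visual token 5 (grid row 1, column 1).
grid, image_hw = 4, 64
n = 1 + grid * grid
attn = np.full((n, n), 1.0 / n)
attn[0] = 0.01
attn[0, 1 + 5] = 1.0 - 0.01 * (n - 1)

heat = proprio_attention_map(attn, grid, image_hw)
peak = np.unravel_index(heat.argmax(), heat.shape)
print(heat.shape, peak)
```

To overlay the result on the input, one option is to draw the depth image with matplotlib's `imshow` and then draw `heat` on top with a warm colormap and a partial `alpha`.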

Video Demos

To understand how the policy learned by our method responds to obstacles and challenging terrain, we visualize a representative set of episodes and the corresponding decisions made by the policy. We evaluate all methods in 6 distinct environments with varying terrain, obstacles to avoid, and spheres to collect for reward bonuses.

Wide Obstacle

wide cuboid obstacles on a flat terrain, without spheres

Wide Obstacle & Sphere

wide cuboid obstacles on a flat terrain, including spheres that give a reward bonus when collected

Thin Obstacle

numerous thin cuboid obstacles on a flat terrain, without spheres

Thin Obstacle & Sphere

numerous thin cuboid obstacles on a flat terrain, including spheres that give a reward bonus when collected

Moving Obstacle

similar to Thin Obstacle, but the obstacles now move dynamically in random directions that are updated at low frequency

Videos (one per method): Our LocoTransformer, State-Depth-Concat, State-Only


Mountain

a rugged mountain range with a goal on the top of the mountain

Videos (one per method): Our LocoTransformer, State-Depth-Concat, State-Only

Generalization to an Unseen Environment with Chairs & Tables

We evaluate the generalization ability of each method by transferring policies trained on Thin Obstacle to an unseen environment with chairs & tables.

Videos (one per method): Our LocoTransformer, State-Depth-Concat, State-Only

Short Video



@misc{yang2021learning,
  title={Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers},
  author={Ruihan Yang and Minghao Zhang and Nicklas Hansen and Huazhe Xu and Xiaolong Wang},
  year={2021},
  eprint={2107.03996},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}