We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method that leverages both proprioceptive states and visual observations for locomotion control. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We transfer our learned policy from simulation to a real robot by running it indoors and in the wild with unseen obstacles and terrain. Our method not only significantly improves over baselines, but also achieves far better generalization performance, especially when transferred to the real robot.
We propose to incorporate both proprioceptive and visual information for locomotion tasks using a novel Transformer model, LocoTransformer. Our model consists of two components: (i) separate modality encoders for proprioceptive and visual inputs that project both modalities into a latent feature space; (ii) a shared Transformer encoder that performs cross-modality attention over proprioceptive and visual features, as well as spatial attention over visual tokens, to predict actions and values.
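Below is a minimal PyTorch sketch of this two-component design. The encoder widths, token counts, number of Transformer layers, and input dimensions are illustrative assumptions, not the exact LocoTransformer hyperparameters.

import torch
import torch.nn as nn

class LocoTransformerSketch(nn.Module):
    def __init__(self, proprio_dim=93, action_dim=12, embed_dim=128):
        super().__init__()
        # (i) Separate modality encoders projecting both inputs into a shared latent space.
        self.proprio_encoder = nn.Sequential(
            nn.Linear(proprio_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.visual_encoder = nn.Sequential(   # depth image -> spatial feature map
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=1))
        # (ii) Shared Transformer encoder over the proprioceptive token and the visual tokens.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.policy_head = nn.Linear(embed_dim, action_dim)
        self.value_head = nn.Linear(embed_dim, 1)

    def forward(self, proprio, depth):
        # proprio: (B, proprio_dim), depth: (B, 1, H, W)
        proprio_token = self.proprio_encoder(proprio).unsqueeze(1)  # (B, 1, D)
        fmap = self.visual_encoder(depth)                           # (B, D, h, w)
        visual_tokens = fmap.flatten(2).transpose(1, 2)             # (B, h*w, D)
        tokens = torch.cat([proprio_token, visual_tokens], dim=1)   # (B, 1+h*w, D)
        fused = self.transformer(tokens)
        pooled = fused.mean(dim=1)                                  # pool fused tokens
        return self.policy_head(pooled), self.value_head(pooled)

# Example forward pass with random inputs (batch of 2, 64x64 depth image).
model = LocoTransformerSketch()
action, value = model(torch.randn(2, 93), torch.randn(2, 1, 64, 64))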
We evaluate our method in both simulation and the real world. In simulation, we test a quadruped robot in a set of challenging and diverse environments. In the real world, we conduct experiments in indoor scenarios with obstacles and in the wild with complex terrain and novel obstacles.
We visualize the self-attention between the proprioceptive token and all visual tokens in the last layer of our Transformer model. We plot the attention weights over the raw visual input, where warmer colors represent larger attention weights.
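As a sketch of how such a plot could be produced, assuming the last-layer self-attention matrix has already been extracted (e.g. via a forward hook), with token 0 the proprioceptive token, tokens 1..N the visual tokens, and a hypothetical 4x4 visual token grid:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def plot_proprio_to_visual_attention(attn, depth_image, grid_hw=(4, 4)):
    # attn: (1+N, 1+N) self-attention matrix; depth_image: (H, W) raw depth input.
    h, w = grid_hw
    weights = attn[0, 1:].reshape(1, 1, h, w)      # proprio token's attention to visual tokens
    heatmap = F.interpolate(weights, size=depth_image.shape,
                            mode="bilinear", align_corners=False)[0, 0]
    plt.imshow(depth_image, cmap="gray")
    plt.imshow(heatmap, cmap="jet", alpha=0.5)     # warmer color = larger attention weight
    plt.axis("off")
    plt.show()

# Example with random placeholder data (17 tokens = 1 proprioceptive + 16 visual).
plot_proprio_to_visual_attention(torch.softmax(torch.randn(17, 17), dim=-1),
                                 torch.rand(64, 64))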
In the environment with obstacles, the agent learns to automatically attend to obstacles.
On challenging terrain, the agent attends to the goal destination and the local terrain in an alternating manner.
To validate our method in real-world scenes beyond simulation, we conduct real-world experiments both in indoor scenarios with obstacles and in in-the-wild scenarios.
Fixed obstacles placed in a hallway; the obstacles share a similar appearance with the obstacles used during training in simulation
Fixed chairs & desks, which do not appear in training, placed in a hallway
In a forest with leaves and twigs on uneven ground
In front of a set of poles and near a glass wall
To understand how the policy learned by our method behaves around obstacles and on challenging terrain in the simulated environments, we visualize a representative set of episodes and the corresponding decisions made by the policy. We evaluate all methods in 6 distinct environments with varying terrain, obstacles to avoid, and spheres to collect for reward bonuses.
wide cuboid obstacles on flat terrain, with spheres that give a reward bonus when collected
numerous thin cuboid obstacles on flat terrain, with spheres that give a reward bonus when collected
similar to the Thin Obstacle environment, but the obstacles now move in random directions that are updated at a low frequency
a rugged mountain range with a goal at the top of the mountain
We evaluate the generalization ability of all methods by transferring policies trained on Thin Obstacle to an unseen environment with chairs & tables.
@inproceedings{
yang2022learning,
title={Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers},
author={Ruihan Yang and Minghao Zhang and Nicklas Hansen and Huazhe Xu and Xiaolong Wang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=nhnJ3oo6AB}
}