We propose to solve quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver through environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs, and that Transformer-based models further improve generalization across environments.
We propose to incorporate both proprioceptive and visual information for locomotion tasks using a novel Transformer model, LocoTransformer. Our model consists of two components: (i) separate modality encoders for proprioceptive inputs and visual inputs that project both modalities into a shared latent feature space; (ii) a shared Transformer encoder that performs spatial attention over visual tokens, as well as cross-modality attention between proprioceptive and visual features, to predict the actions and values.
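The two-component design above can be sketched as follows. This is a minimal illustrative sketch, not the authors' exact architecture: all dimensions, layer counts, and module names (e.g. `LocoTransformerSketch`, `proprio_encoder`) are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class LocoTransformerSketch(nn.Module):
    """Illustrative sketch of a two-encoder + shared Transformer design.

    All hyperparameters (proprio_dim, embed_dim, layer counts) are
    hypothetical placeholders, not the paper's actual settings.
    """

    def __init__(self, proprio_dim=93, embed_dim=128, n_heads=4,
                 n_layers=2, action_dim=12):
        super().__init__()
        # (i) Separate modality encoders projecting into a shared latent space.
        self.proprio_encoder = nn.Sequential(
            nn.Linear(proprio_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # A small ConvNet turns a depth image into a grid of visual tokens.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=5, stride=2), nn.ReLU(),
        )
        # (ii) A shared Transformer encoder: self-attention over all tokens
        # covers both spatial attention among visual tokens and cross-modal
        # attention between the proprioceptive token and visual tokens.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.policy_head = nn.Linear(embed_dim, action_dim)
        self.value_head = nn.Linear(embed_dim, 1)

    def forward(self, proprio, depth):
        p_tok = self.proprio_encoder(proprio).unsqueeze(1)  # (B, 1, D)
        v = self.visual_encoder(depth)                      # (B, D, H, W)
        v_tok = v.flatten(2).transpose(1, 2)                # (B, H*W, D)
        tokens = torch.cat([p_tok, v_tok], dim=1)           # (B, 1+H*W, D)
        fused = self.transformer(tokens)
        pooled = fused.mean(dim=1)  # pool all tokens for the output heads
        return self.policy_head(pooled), self.value_head(pooled)
```

Concatenating the single proprioceptive token with the grid of visual tokens lets one standard self-attention stack realize both attention patterns at once, which is the design choice the component list describes.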
We evaluate our proposed method on challenging simulated environments, including tasks such as maneuvering around obstacles of different sizes and shapes, dynamically moving obstacles, as well as rough mountainous terrain. We show that jointly learning policies with both proprioceptive states and vision significantly improves locomotion in challenging environments, and that policies further benefit from adopting our cross-modal Transformer. We also show that LocoTransformer generalizes much better to unseen environments. Lastly, we qualitatively show our method learns to anticipate changes in the environment using vision as guidance.
We visualize the self-attention weights between the proprioceptive token and all visual tokens in the last layer of our Transformer model. We plot the attention weights over the raw visual input, where warmer colors represent larger attention weights. The agent learns to automatically attend to critical visual regions (obstacles in (a)-(d), steep terrain in (e) and (h), and the goal location in (f)-(h)) for planning its motion.
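A heatmap of this kind can be produced by reshaping the proprioceptive token's attention row back onto the visual token grid and upsampling it to image resolution. The sketch below is a hypothetical post-processing helper, assuming the attention row has already been extracted from the last layer (entry 0 being the proprioceptive token's attention to itself):

```python
import numpy as np

def attention_heatmap(attn, grid_hw, image_hw):
    """Project the proprioceptive token's attention over visual tokens
    back onto the raw input image as a normalized heatmap.

    attn     : (1 + H*W,) attention row of the proprioceptive token taken
               from the last Transformer layer (a hypothetical input here);
               entry 0 is its self-attention and is discarded.
    grid_hw  : (H, W) spatial layout of the visual tokens.
    image_hw : (h, w) resolution of the raw visual input to overlay on.
    """
    h_tok, w_tok = grid_hw
    # Keep only attention to visual tokens, restored to their grid layout.
    vis_attn = attn[1:].reshape(h_tok, w_tok)
    # Nearest-neighbour upsample each token cell to image resolution.
    sy, sx = image_hw[0] // h_tok, image_hw[1] // w_tok
    heat = np.kron(vis_attn, np.ones((sy, sx)))
    # Normalize to [0, 1] so warmer colors mark larger attention weights.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat
```

The normalized map can then be blended over the raw depth image with any standard colormap to obtain the overlays described above.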
To understand how the policy learned by our method behaves in the presence of obstacles and challenging terrain, we visualize a representative set of episodes and the corresponding decisions made by the learned policy. We evaluate all methods in 6 distinct environments with varying terrain, obstacles to avoid, and spheres to collect for reward bonuses.
wide cuboid obstacles on a flat terrain, without spheres
wide cuboid obstacles on a flat terrain, including spheres that give a reward bonus when collected
numerous thin cuboid obstacles on a flat terrain, without spheres
numerous thin cuboid obstacles on a flat terrain, including spheres that give a reward bonus when collected
similar to the Thin Obstacle environment, but with obstacles that move dynamically in random directions, updated at a low frequency
a rugged mountain range with a goal at the top of the mountain
We evaluate the generalization ability of all methods by transferring policies trained on Thin Obstacle to an unseen environment containing chairs and tables.