Neural Volumetric Memory for
Visual Locomotion Control

1UC San Diego, 2Institute of AI and Fundamental Interactions, 3MIT CSAIL

CVPR 2023 Highlight

Our robot walks fully autonomously, without remote control!


Legged robots have the potential to expand the reach of autonomy beyond paved roads. Difficult locomotion tasks, however, require perception and are often partially observable. The standard way to address partial observability in state-of-the-art visual-locomotion methods is to concatenate images channel-wise via frame-stacking. This naive approach stands in contrast to the modern paradigm in computer vision, which explicitly models optical flow and the 3D geometry of interest. Motivated by this gap, we propose a neural volumetric memory (NVM) architecture that explicitly accounts for the SE(3) equivariance of the 3D world. Unlike prior approaches, NVM is a volumetric representation: it aggregates feature volumes from multiple camera views by first applying 3D translations and rotations to bring them into the ego-centric frame of the robot. We test the learned visual-locomotion policy on a physical robot and show that our approach, learning legged locomotion with neural volumetric memory, produces performance gains over prior works on challenging terrains. We also include ablation studies showing that the representation stored in the neural volumetric memory captures sufficient geometric information to reconstruct the scene.


Volumetric Memory for Legged Locomotion

Legged locomotion using ego-centric camera views is intrinsically a partially observed problem. To make the control problem tractable, our robot needs to aggregate information from previous frames and correctly infer the occluded terrain underneath it. During locomotion, the camera mounted directly on the robot chassis undergoes large and spurious changes in pose, making the integration of individual frames into a coherent representation non-trivial. To account for these camera pose changes, we propose neural volumetric memory (NVM), a 3D representation format for scene features. It takes as input a sequence of visual observations and outputs a single 3D feature volume representing the surrounding 3D structure.
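The aggregation step can be sketched in NumPy. Everything below is an illustrative assumption rather than the paper's implementation: nearest-neighbor resampling stands in for whatever differentiable resampling the model uses, mean pooling stands in for the learned fusion, and the helper names are invented for this sketch.

```python
import numpy as np

def make_grid(D, H, W):
    """Voxel coordinates of the target volume, shape (3, D*H*W)."""
    d, h, w = np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")
    return np.stack([d.ravel(), h.ravel(), w.ravel()]).astype(float)

def se3_warp(volume, R, t, grid):
    """Resample a (C, D, H, W) feature volume under an SE(3) transform.

    Each target voxel p reads the source voxel nearest to R @ p + t, so
    features computed in an earlier camera frame are re-expressed in the
    current ego-centric frame; voxels that fall outside the source get 0.
    """
    C, D, H, W = volume.shape
    src = R @ grid + t[:, None]                 # target -> source coordinates
    idx = np.round(src).astype(int)             # nearest-neighbor lookup
    ok = ((idx >= 0) & (idx < np.array([[D], [H], [W]]))).all(axis=0)
    flat = volume.reshape(C, -1)
    lin = idx[0] * H * W + idx[1] * W + idx[2]  # linearized source index
    out = np.zeros_like(flat)
    out[:, ok] = flat[:, lin[ok]]
    return out.reshape(C, D, H, W)

def neural_volumetric_memory(volumes, poses, grid):
    """Fuse per-frame feature volumes into one ego-centric memory volume.

    `poses` gives the (R, t) relating each past frame to the current one;
    fusion here is a plain mean over the warped volumes.
    """
    warped = [se3_warp(v, R, t, grid) for v, (R, t) in zip(volumes, poses)]
    return np.mean(warped, axis=0)
```

The key property this sketch shares with NVM is that camera motion is undone *before* fusion: each frame's features are moved into the current ego-centric frame, so the fused volume stays consistent as the robot moves.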

Learning NVM via Self-Supervision

Although the behavior cloning objective alone is sufficient to produce a good policy, requiring the memory to be equivariant to camera translation and rotation also yields a standalone, self-supervised learning objective just for the neural volumetric memory.
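The shape of that self-supervised signal can be sketched as a reconstruction loss: transform the memory volume by the estimated relative camera pose, decode an image from it, and compare against the frame actually observed from that viewpoint. The linear decoder and plain L2 penalty below are illustrative stand-ins for the paper's learned decoder.

```python
import numpy as np

def reconstruction_loss(feature_volume, decoder_weights, target_image):
    """Hypothetical self-supervised loss for the memory volume.

    `feature_volume` is assumed to already be warped by the estimated
    relative camera pose; a linear map (stand-in for a learned decoder)
    turns it into a predicted image, and the pixel-wise L2 error against
    the actually observed frame supervises the volume without action labels.
    """
    pred = (decoder_weights @ feature_volume.ravel()).reshape(target_image.shape)
    return float(np.mean((pred - target_image) ** 2))
```

Because the supervision comes from the robot's own future observations, this objective needs no extra annotation on top of the behavior cloning data.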

Visualization in the Real World

We use the same policy for all different scenarios.

To validate our method in real-world scenes beyond simulation, we conduct real-world experiments both in indoor scenarios with stepping stones and in in-the-wild scenarios.

Our NVM agent in different real-world scenarios

Stairs with Obstacles

Moving Objects

Curvy Stairs

Rugged Terrain




Stepping Stones






Visual reconstruction from learned decoder

We visualize the synthesized visual observations from our self-supervised task. In every tuple, the first image shows the robot moving in the environment, the second is the input visual observation, and the third is the visual observation synthesized from the 3D feature volume and the estimated relative camera pose. We apply extensive data augmentation to the input visual observation to improve the robustness of our model.
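The paper does not spell out the augmentations here, so the following is only a minimal photometric example of the kind of perturbation that could be applied to the input observation; the brightness range and noise scale are invented for this sketch.

```python
import numpy as np

def augment(image, rng):
    """Example photometric augmentation (hypothetical parameters):
    random brightness scaling plus additive Gaussian noise, with the
    result clipped back to the valid [0, 1] pixel range.
    """
    scale = rng.uniform(0.8, 1.2)                # random brightness factor
    noise = rng.normal(0.0, 0.02, image.shape)   # per-pixel sensor-like noise
    return np.clip(image * scale + noise, 0.0, 1.0)
```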

Visualization in Simulation

To understand how the policy learned by our method behaves in the simulated environments, we visualize a representative set of episodes and corresponding decisions made by the policy learned with our method.






@inproceedings{yang2023neural,
  title={Neural Volumetric Memory for Visual Locomotion Control},
  author={Ruihan Yang and Ge Yang and Xiaolong Wang},
  booktitle={Conference on Computer Vision and Pattern Recognition 2023},
}