Our robot walks fully autonomously, without remote control!
Abstract
Legged robots have the potential to expand the reach of autonomy beyond paved roads. Difficult locomotion tasks, however, require perception and are often partially observable. The standard way to address partial observability in state-of-the-art visual-locomotion methods is to concatenate images channel-wise via frame-stacking. This naive approach stands in contrast to the modern paradigm in computer vision, which explicitly models optical flow and the 3D geometry of interest. Inspired by this gap, we propose a neural volumetric memory (NVM) architecture that explicitly accounts for the SE(3) equivariance of the 3D world. Unlike prior approaches, NVM is a volumetric format: it aggregates feature volumes from multiple camera views by first applying 3D translations and rotations to bring them back into the ego-centric frame of the robot. We test the learned visual-locomotion policy on a physical robot and show that our approach, learning legged locomotion with neural volumetric memory, produces performance gains over prior works on challenging terrains. We also include ablation studies and show that the representation stored in the neural volumetric memory captures sufficient geometric information to reconstruct the scene.
Video
Volumetric Memory for Legged Locomotion
Legged locomotion using ego-centric camera views is intrinsically a partially observed problem. To make the control problem tractable, our robot needs to aggregate information from previous frames and correctly infer the occluded terrain underneath it. During locomotion, the camera mounted directly on the robot chassis undergoes large and abrupt changes in pose, making it non-trivial to integrate individual frames into a coherent representation.
To account for these camera pose changes, we propose neural volumetric memory (NVM) --- a 3D representation format for scene features. It takes as input a sequence of visual observations and outputs a single 3D feature volume representing the surrounding 3D structure.
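The aggregation step described above can be summarized in a short sketch. The module and helper names below (encoder_2d_to_3d, pose_estimator, se3_grid) are illustrative placeholders, not the authors' implementation; the sketch assumes a learned 2D-to-3D encoder and a learned relative-pose estimator, and fuses past feature volumes by resampling them into the current ego-centric frame.

# Minimal PyTorch sketch of the idea described above. Module and helper
# names are illustrative placeholders, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def se3_grid(T_rel, vol_shape):
    # Build a grid_sample() grid that maps each voxel of the target (current,
    # ego-centric) volume back into the source volume via the inverse of the
    # estimated relative pose. Coordinates stay in normalized [-1, 1] space,
    # which glosses over metric scaling of the voxel grid.
    _, _, D, H, W = vol_shape
    zs = torch.linspace(-1.0, 1.0, D)
    ys = torch.linspace(-1.0, 1.0, H)
    xs = torch.linspace(-1.0, 1.0, W)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    pts = torch.stack([x, y, z, torch.ones_like(x)], dim=-1)  # (D, H, W, 4)
    T_inv = torch.inverse(T_rel)[0]                           # (4, 4)
    src = pts.reshape(-1, 4) @ T_inv.T                        # homogeneous coords
    return src[:, :3].reshape(1, D, H, W, 3)                  # (x, y, z) per voxel

class NeuralVolumetricMemory(nn.Module):
    def __init__(self, encoder_2d_to_3d, pose_estimator):
        super().__init__()
        self.encoder = encoder_2d_to_3d       # image -> (1, C, D, H, W) feature volume
        self.pose_estimator = pose_estimator  # (img_a, img_b) -> (1, 4, 4) SE(3) pose

    def forward(self, frames):
        # frames: (T, 3, H_img, W_img); frames[-1] is the current ego view.
        volumes = [self.encoder(f[None]) for f in frames]
        fused = volumes[-1]
        for k in range(len(frames) - 1):
            # Estimated relative camera pose between past frame k and now.
            T_rel = self.pose_estimator(frames[-1][None], frames[k][None])
            grid = se3_grid(T_rel, volumes[k].shape)
            # Rotate/translate the past feature volume into the current frame.
            aligned = F.grid_sample(volumes[k], grid, align_corners=False)
            fused = fused + aligned
        # A single ego-centric 3D feature volume summarizing all frames.
        return fused / len(frames)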
Learning NVM via Self-Supervision
Although the behavior-cloning objective alone is sufficient to produce a good policy, the equivariance of the memory to camera translation and rotation offers a standalone, self-supervised learning objective just for the neural volumetric memory: a decoder is trained to reconstruct the visual observation from the 3D feature volume after transforming it by the estimated relative camera pose.
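A rough sketch of how such an objective could look is shown below, reusing the hypothetical NVM module and se3_grid helper from the previous sketch; the exact loss and frame pairing used by the authors may differ.

# Sketch of a reconstruction-style self-supervised objective for the memory,
# reusing the hypothetical `se3_grid` helper above. The exact loss and frame
# pairing used in the paper may differ.
def self_supervised_loss(nvm, decoder, frames, t, k):
    # Encode frame t, transform its feature volume by the estimated relative
    # camera pose between frames t and k, then decode and compare to frame k.
    vol_t = nvm.encoder(frames[t][None])
    T_rel = nvm.pose_estimator(frames[k][None], frames[t][None])
    grid = se3_grid(T_rel, vol_t.shape)
    vol_in_k = F.grid_sample(vol_t, grid, align_corners=False)
    pred_k = decoder(vol_in_k)                     # (1, C_img, H_img, W_img)
    return F.mse_loss(pred_k, frames[k][None])     # pixel reconstruction loss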
Visualization in the Real World
We use the same policy for all scenarios.
To validate our method in real-world scenes beyond simulation, we conduct real-world experiments in both indoor scenarios with stepping stones and in-the-wild scenarios.
Our NVM agent in different real-world scenarios
[Videos] Stairs with Obstacles, Moving Objects, Curvy Stairs, Rugged Terrain, Slope, Terrain
[Comparison videos] Stepping Stones: Our NVM vs. Baseline; Stages: Our NVM vs. Baseline
Visual reconstruction from learned decoder
We visualize the synthesized visual observation in our self-supervised task. For each tuple, the first image shows the robot moving in the environment, the second image is the input visual observation, and the third image is the visual observation synthesized from the 3D feature volume and the estimated relative camera pose. We apply extensive data augmentation to the input visual observation to improve the robustness of our model.
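For illustration, a minimal augmentation pipeline of the kind alluded to above is sketched below; the specific transforms and their parameters are assumptions, not the paper's recipe.

# Illustrative augmentation of the input visual observation; the specific
# transforms and their parameters are assumptions, not the paper's recipe.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop((64, 64), scale=(0.8, 1.0)),
    transforms.GaussianBlur(kernel_size=3),
    transforms.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0, 1)),
])

obs = torch.rand(1, 64, 64)   # placeholder ego-centric image, values in [0, 1]
noisy_obs = augment(obs)      # augmented copy fed to the encoder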
Visualization in Simulation
To understand how the policy learned by our method behaves in simulated environments, we visualize a representative set of episodes and the corresponding decisions made by the policy.
[Videos] Stones, Stages, Stairs, Obstacles
BibTeX
@inproceedings{yang2023neural,
  title={Neural Volumetric Memory for Visual Locomotion Control},
  author={Ruihan Yang and Ge Yang and Xiaolong Wang},
  booktitle={Conference on Computer Vision and Pattern Recognition 2023},
  year={2023},
  url={https://openreview.net/forum?id=JYyWCcmwDS}
}