Real-robot data collection for imitation learning has driven significant advances in robotic manipulation. However, the requirement for robot hardware fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models on egocentric human videos. The benefit of human videos lies not only in their scale but, more importantly, in the richness of their scenes and tasks. With a VLA trained on human videos to predict human wrist and hand actions, we perform inverse kinematics and retargeting to convert human actions into robot actions. We then fine-tune the model on a small number of robot manipulation demonstrations to obtain the robot policy, EgoVLA. We also propose a simulation benchmark, the Ego Humanoid Manipulation Benchmark, featuring diverse bimanual manipulation tasks with demonstrations. Fine-tuning and evaluating EgoVLA on the Ego Humanoid Manipulation Benchmark, we show significant improvements over baselines and ablate the importance of human data.
EgoVLA is a vision-language-action (VLA) model that combines the broad diversity of human egocentric videos with the precision of robot demonstrations. It is first pretrained on large-scale human manipulation data, learning to predict future hand and wrist motions from visual observations, language instructions, and proprioceptive signals. By aligning human and robot action spaces through a unified representation based on wrist pose and MANO hand parameters, EgoVLA enables efficient fine-tuning on in-domain robot demonstrations. Rather than replacing robot data, the human video pretraining complements it by improving generalization across diverse tasks, visual scenes, and spatial configurations—reducing the need for task-specific robot data and enabling more flexible and scalable manipulation capabilities.
EgoVLA takes the visual history, a language instruction, and action query tokens as input. The action head decodes the resulting latent features into human actions. We use wrist poses and MANO hand parameters as the unified human/robot action space.
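For concreteness, below is a minimal sketch of the action-head interface described above, assuming a simple MLP head. The latent dimension, prediction horizon, and per-hand action dimensions (9-D wrist pose, 15 MANO parameters) are illustrative assumptions, not EgoVLA's exact configuration.

# Minimal sketch of the action head (illustrative, not the authors' implementation).
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Decodes latent action-query features into wrist poses and MANO parameters."""

    def __init__(self, latent_dim=1024, horizon=8, wrist_dim=9, mano_dim=15):
        super().__init__()
        # Per hand: wrist pose (3-D position + 6-D rotation = 9) plus MANO parameters.
        self.per_step_dim = 2 * (wrist_dim + mano_dim)
        self.horizon = horizon
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.GELU(),
            nn.Linear(512, horizon * self.per_step_dim),
        )

    def forward(self, query_latents):
        # query_latents: (batch, latent_dim) features of the action-query tokens
        out = self.mlp(query_latents)
        # (batch, horizon, per_step_dim) action chunk covering both hands
        return out.view(-1, self.horizon, self.per_step_dim)

# Usage: latents produced by the VLM backbone for the action-query tokens.
head = ActionHead()
actions = head(torch.randn(4, 1024))  # -> shape (4, 8, 48)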
Unified Action Space: MANO hand parameters serve as a shared action space for humans and robots. During training, robot hand configurations are converted into MANO parameters by optimizing the parameters so that the MANO fingertips match the robot hand's fingertip positions. During deployment, a small MLP maps predicted fingertip positions to robot joint commands.
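The sketch below illustrates the two mappings just described: fitting MANO parameters to robot fingertips at training time, and a small MLP that converts predicted fingertip positions into joint commands at deployment. The `mano_fingertips` and `robot_fingertips` callables are hypothetical stand-ins for a MANO layer and the robot hand's forward kinematics; the dimensions and optimizer settings are assumptions.

# Hedged sketch of the human/robot action-space alignment (illustrative only).
import torch
import torch.nn as nn

def fit_mano_to_robot(robot_joint_angles, mano_fingertips, robot_fingertips,
                      n_params=15, steps=200, lr=1e-2):
    """Optimize MANO parameters so MANO fingertips match the robot's fingertips.

    Used at training time to express robot demonstrations in the shared
    human/robot action space.
    """
    target = robot_fingertips(robot_joint_angles)        # (5, 3) target fingertip positions
    theta = torch.zeros(n_params, requires_grad=True)    # MANO hand parameters to optimize
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((mano_fingertips(theta) - target) ** 2).sum()
        loss.backward()
        opt.step()
    return theta.detach()

class FingertipToJoints(nn.Module):
    """Small MLP used at deployment: predicted fingertip positions -> joint commands."""

    def __init__(self, n_fingers=5, n_joints=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_fingers * 3, 128),
            nn.ReLU(),
            nn.Linear(128, n_joints),
        )

    def forward(self, fingertip_positions):
        # fingertip_positions: (batch, n_fingers, 3), expressed in the wrist frame
        return self.net(fingertip_positions.flatten(1))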
Red lines indicate ground truth, while green lines represent EgoVLA's predicted human wrist motion.
A key challenge in learning-based robotics—beyond data scarcity—is the lack of scalable and reproducible evaluation. Real-world testing is expensive, time-consuming, and often unsafe, especially in resource-limited settings such as academic labs. Recent studies show that simulation results often align well with real-world performance, making simulation a reliable evaluation proxy. To support consistent benchmarking in humanoid manipulation, we introduce the Ego Humanoid Manipulation Benchmark, built with NVIDIA Isaac Lab. Rather than enabling direct sim-to-real transfer, our benchmark—similar in spirit to LIBERO and SIMPLER—serves as a reproducible testbed for evaluating manipulation policies. It features the Unitree H1 humanoid with two Inspire dexterous hands and includes 12 tasks, ranging from short-horizon atomic actions to long-horizon, multi-stage skills.
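As a rough illustration of how policies could be scored on such a benchmark, the sketch below computes per-task success rates and average progress over rollouts. The `make_task` factory, the Gym-style step interface, and the `info["progress"]` field are hypothetical conveniences for this sketch; the actual benchmark is built on NVIDIA Isaac Lab.

# Hedged sketch of a benchmark evaluation loop (hypothetical task/env API).
import numpy as np

def evaluate(policy, make_task, task_names, episodes_per_task=20, max_steps=600):
    results = {}
    for name in task_names:
        env = make_task(name)                      # hypothetical task factory
        successes, progresses = [], []
        for _ in range(episodes_per_task):
            obs, info = env.reset()
            for _ in range(max_steps):
                action = policy(obs)               # wrist poses + hand commands
                obs, _, terminated, truncated, info = env.step(action)
                if terminated or truncated:
                    break
            successes.append(float(info.get("success", False)))
            progresses.append(info.get("progress", 0.0))   # fraction of stages completed
        results[name] = {
            "success_rate": float(np.mean(successes)),
            "avg_progress": float(np.mean(progresses)),
        }
    return results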
Pretraining on egocentric human videos significantly boosts both in-domain performance and out-of-domain generalization. EgoVLA outperforms the no-pretraining baseline across all tasks, with especially strong gains on long-horizon and fine-grained manipulation tasks. It also generalizes better to unseen visual backgrounds, maintaining high success and progress rates, while models trained only on robot data see a sharp drop in performance.
Left: Data Mixture Ablation. EgoVLA pretrained on different mixtures of egocentric human datasets, evaluated on Unseen visual backgrounds for short-horizon tasks. Greater diversity consistently improves generalization performance.
Right: Spatial Distribution. Success rate and progress of EgoVLA under Unseen visual backgrounds, visualized across object spawning positions. The model maintains strong performance across a wide area, with higher success in regions commonly associated with effective bimanual manipulation.
@misc{yang2025egovlalearningvisionlanguageactionmodels,
title={EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos},
author={Ruihan Yang and Qinxi Yu and Yecheng Wu and Rui Yan and Borui Li and An-Chieh Cheng and Xueyan Zou and Yunhao Fang and Hongxu Yin and Sifei Liu and Song Han and Yao Lu and Xiaolong Wang},
year={2025},
eprint={2507.12440},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2507.12440},
}