Whole-Body Conditioned Egocentric Video Prediction – The Berkeley Artificial Intelligence Research Blog

https://bair.berkeley.edu/blog/2025/07/01/peva/

Publish Date:

Source Domain: bair.berkeley.edu

Summary

The article introduces Predict Ego-centric Video from human Actions (PEVA), a model that predicts egocentric video from whole-body human motion. Recent world models can simulate future outcomes for planning and control, but many still lack physical grounding. To build a world model for embodied agents, the authors condition PEVA on kinematic pose trajectories structured by the body’s joint hierarchy. PEVA is an autoregressive conditional diffusion transformer that uses random timeskips, sequence-level training, and action embeddings to handle high-dimensional, temporally extended human movement. A hierarchical evaluation shows that the model can generate videos for atomic human actions, simulate counterfactuals, and stay coherent over long prediction horizons. Quantitatively, the article reports that PEVA outperforms baselines on several perceptual metrics and scales well with model size. As future work, the authors aim to extend the model to closed-loop control and interactive environments by incorporating high-level task goals and object-centric representations.
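To make "kinematic pose trajectories structured by the body’s joint hierarchy" concrete, here is a minimal sketch of how a single action could be flattened into a vector that follows a kinematic tree. It is an illustration only, not the authors' representation: the seven-joint skeleton, the names `PARENTS`, `hierarchy_order`, and `flatten_action`, and the choice of three values per joint are all assumptions made for this example.

```python
# Hypothetical toy skeleton: child joint -> parent joint (None marks the root).
# The joint set is an assumption for illustration, not PEVA's actual skeleton.
PARENTS = {
    "pelvis": None,
    "spine": "pelvis",
    "head": "spine",
    "left_shoulder": "spine",
    "left_elbow": "left_shoulder",
    "right_shoulder": "spine",
    "right_elbow": "right_shoulder",
}


def hierarchy_order(parents):
    """Return joint names in depth-first order, each parent before its children."""
    children = {}
    for joint, parent in parents.items():
        children.setdefault(parent, []).append(joint)
    order = []
    stack = list(reversed(children.get(None, [])))
    while stack:
        joint = stack.pop()
        order.append(joint)
        # Push children reversed so they are visited in their listed order.
        stack.extend(reversed(children.get(joint, [])))
    return order


def flatten_action(root_delta, joint_rotation_deltas, parents=PARENTS):
    """Concatenate the root translation change with per-joint rotation changes.

    Joints are appended in kinematic-tree order, so every joint's values
    appear after those of its parent.
    """
    vector = list(root_delta)
    for joint in hierarchy_order(parents):
        vector.extend(joint_rotation_deltas[joint])
    return vector


# Usage: a small forward step with every joint held still.
still = {joint: [0.0, 0.0, 0.0] for joint in PARENTS}
action = flatten_action([0.1, 0.0, 0.2], still)
```

With this toy skeleton the action vector has 3 root values plus 3 values for each of the 7 joints. A fixed, tree-consistent ordering like this means a given slot in the vector always refers to the same joint.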

Key Points:

  • PEVA Framework: PEVA predicts egocentric video conditioned on whole-body human actions, represented as kinematic pose trajectories.

  • Hierarchical Evaluation: The model is evaluated on increasingly challenging tasks to analyze its embodied prediction and control capabilities from a first-person view.

  • Advanced Modeling Techniques: PEVA extends the Conditional Diffusion Transformer with random timeskips, sequence-level training, and action embeddings to model complex human motions.

  • Effectiveness: The model outperforms baselines on several perceptual metrics, stays coherent over long horizons, and scales well with model size.

  • Future Directions: Future work includes extending PEVA to closed-loop control and interactive environments, and integrating high-level goals and object-centric representations.
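The key points mention random timeskips and autoregressive prediction without showing what they involve. The sketch below illustrates both ideas in plain Python under stated assumptions: `sample_timeskip_indices` and `rollout` are hypothetical helpers written for this summary, and `predict_next` is a stand-in for the diffusion transformer, which is not reproduced here.

```python
import random


def sample_timeskip_indices(num_frames, seq_len, max_skip, rng):
    """Pick seq_len increasing frame indices separated by random gaps.

    Each gap is drawn uniformly from 1..max_skip. The gaps are returned as
    well, since a model trained this way would need them as conditioning.
    """
    if num_frames <= (seq_len - 1) * max_skip:
        raise ValueError("clip is too short for the requested skips")
    skips = [rng.randint(1, max_skip) for _ in range(seq_len - 1)]
    start = rng.randrange(num_frames - sum(skips))
    indices = [start]
    for skip in skips:
        indices.append(indices[-1] + skip)
    return indices, skips


def rollout(predict_next, context_frames, actions, context_len):
    """Autoregressively predict one frame per action.

    Each prediction is appended to the history, so later steps are
    conditioned on earlier predictions through a sliding window.
    """
    frames = list(context_frames)
    for action in actions:
        window = frames[-context_len:]
        frames.append(predict_next(window, action))
    return frames[len(context_frames):]


def toy_predictor(window, action):
    """Toy stand-in for the learned model: frames are plain numbers."""
    return window[-1] + action


# Usage: sample a training sequence, then roll the toy predictor forward.
indices, skips = sample_timeskip_indices(
    num_frames=100, seq_len=4, max_skip=5, rng=random.Random(0)
)
predicted = rollout(toy_predictor, context_frames=[0], actions=[1, 2, 3], context_len=2)
```

Varying the gap between training frames exposes a model to both short and long time offsets. The rollout loop shows why coherence over long horizons is hard: each step consumes earlier predictions, so errors can compound.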