1University of Washington, 2Carnegie Mellon University
*Equal Supervision
Humans can effortlessly anticipate how objects might move or change through interaction—imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world/dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of 2 million+ short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes—establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation.
ObjectForesight combines a geometry-aware 3D point encoder (PointTransformerV3) with a Diffusion Transformer (DiT) to forecast future object motion. From an anchor-frame point cloud, a short pose history with normalized boxes, and the object mask, we build a scene embedding and predict a multi-modal distribution over future 6-DoF pose sequences via denoising diffusion.
Model architecture: conditioned on anchor-frame geometry and past poses, ObjectForesight predicts future 6-DoF object trajectories.
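To make the conditioning and denoising interface concrete, below is a minimal PyTorch sketch of a diffusion-based 6-DoF trajectory forecaster. The module sizes, the simple stand-ins for the PointTransformerV3 encoder and the DiT denoiser, the 9-dimensional pose parameterization, and the DDPM-style training step are all illustrative assumptions, not the released ObjectForesight implementation.

```python
# Minimal sketch (PyTorch) of conditioning on anchor-frame geometry and pose
# history to denoise future 6-DoF pose sequences. All dimensions and module
# choices are assumptions for illustration.
import torch
import torch.nn as nn


class PoseTrajectoryDiffuser(nn.Module):
    """Predicts a distribution over future 6-DoF pose sequences via denoising."""

    def __init__(self, d_model=256, horizon=16, pose_dim=9):
        # pose_dim=9: 3D translation + 6D rotation representation (assumed).
        super().__init__()
        self.horizon, self.pose_dim = horizon, pose_dim
        # Stand-in for the PointTransformerV3 geometry encoder.
        self.point_encoder = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                           nn.Linear(d_model, d_model))
        # Embeds the short pose history together with normalized boxes.
        self.history_encoder = nn.Linear(pose_dim + 4, d_model)
        # Stand-in for the DiT denoiser (conditioning via cross-attention here).
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.denoiser = nn.TransformerDecoder(layer, num_layers=4)
        self.time_embed = nn.Embedding(1000, d_model)
        self.pose_in = nn.Linear(pose_dim, d_model)
        self.pose_out = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_future, t, points, history):
        # points:  (B, N, 3) anchor-frame point cloud (object mask applied upstream)
        # history: (B, T_past, pose_dim + 4) past poses with normalized boxes
        scene = torch.cat([self.point_encoder(points),
                           self.history_encoder(history)], dim=1)
        x = self.pose_in(noisy_future) + self.time_embed(t)[:, None, :]
        return self.pose_out(self.denoiser(x, scene))  # predicted noise


# Toy training step with a standard DDPM noise-prediction objective (assumed).
model = PoseTrajectoryDiffuser()
points = torch.randn(2, 1024, 3)
history = torch.randn(2, 8, 13)
future = torch.randn(2, 16, 9)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(future)
alpha_bar = torch.rand(2, 1, 1)  # placeholder noise schedule
noisy = alpha_bar.sqrt() * future + (1 - alpha_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t, points, history), noise)
loss.backward()
```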
We curate large-scale 6-DoF object trajectories from in-the-wild egocentric video using an automated pipeline with quality gates. Starting from action segments, we detect hands and candidate manipulated objects (EgoHOS), refine and track masks over time (SAM2), and filter for clear manipulations and good viewpoints. We reconstruct object geometry (TRELLIS), recover metric depth and camera motion (SpaTrackerV2), and track 6-DoF poses with FoundationPose using bidirectional tracking and re-registration. This yields 2 million+ short, metrically grounded, temporally coherent object-centric trajectories.
Data curation pipeline: from egocentric video to 3D object trajectories using EgoHOS, SAM2, TRELLIS, SpaTrackerV2, and FoundationPose.
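The sketch below lays out the pipeline stages and quality gates as Python-style pseudocode. The function names, gating criteria, and return types are hypothetical placeholders; the underlying tools (EgoHOS, SAM2, TRELLIS, SpaTrackerV2, FoundationPose) each expose their own interfaces.

```python
# Schematic of the curation pipeline: one action segment in, one quality-gated
# 6-DoF object trajectory out. All helper functions are hypothetical stand-ins.
def curate_clip(action_segment):
    # 1) Detect hands and candidate manipulated objects in each frame.
    hands, candidates = detect_hand_object_interactions(action_segment)     # EgoHOS

    # 2) Refine and propagate object masks over time.
    masks = track_object_masks(action_segment, candidates)                  # SAM2

    # 3) Quality gates: keep only clear manipulations with good viewpoints.
    if not (is_clear_manipulation(hands, masks) and has_good_viewpoint(masks)):
        return None

    # 4) Reconstruct the manipulated object's geometry.
    mesh = reconstruct_object_mesh(action_segment, masks)                   # TRELLIS

    # 5) Recover metric depth and camera motion for the clip.
    depth, camera_poses = estimate_metric_depth_and_camera(action_segment)  # SpaTrackerV2

    # 6) Track 6-DoF object poses bidirectionally with re-registration.
    trajectory = track_object_pose(
        action_segment, mesh, depth, camera_poses,
        bidirectional=True, re_register=True)                               # FoundationPose

    return {"masks": masks, "mesh": mesh, "trajectory": trajectory}
```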
We show qualitative results of ObjectForesight on unseen clips from HOT3D and EpicKitchens. The predicted future 3D trajectories are plausible and correspond meaningfully to how the object would be manipulated given the conditioning context frames.
Video models have become very good at generating physically plausible videos given visual and/or text context. We compare against Luma Ray 3, a state-of-the-art video generation model. Conditioned on the context frames, we generate a video zero-shot and apply our curation pipeline to recover 6-DoF object motion from the generated frames. Although these videos exhibit appearance artifacts and provide no explicit 3D constraints, the pipeline can still recover approximate object motion; the resulting trajectories, however, are typically less stable than ObjectForesight's direct predictions, underscoring the advantage of explicitly modeling 3D dynamics rather than inferring them post hoc from generated videos.
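For clarity, the baseline comparison amounts to the loop sketched below. The helper names (generate_video, recover_trajectory_from_frames, predict_trajectory, trajectory_error) are hypothetical stand-ins for the video generation API, our curation pipeline, the ObjectForesight model, and a pose-error metric.

```python
# Sketch of the video-model baseline evaluation; all helpers are assumed names.
def compare_against_video_baseline(context_frames, gt_trajectory):
    # Baseline: generate future frames zero-shot, then lift motion post hoc.
    generated = generate_video(context_frames)                 # e.g., Luma Ray 3
    baseline_traj = recover_trajectory_from_frames(generated)  # curation pipeline

    # Ours: predict future 6-DoF poses directly in 3D.
    ours_traj = predict_trajectory(context_frames)             # ObjectForesight

    return {
        "baseline_error": trajectory_error(baseline_traj, gt_trajectory),
        "ours_error": trajectory_error(ours_traj, gt_trajectory),
    }
```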
We thank our colleagues at the University of Washington and Carnegie Mellon University for helpful discussions throughout this project.
@misc{soraki2026objectforesightpredictingfuture3d,
title={ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos},
author={Rustin Soraki and Homanga Bharadhwaj and Ali Farhadi and Roozbeh Mottaghi},
year={2026},
eprint={2601.05237},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.05237}
}