Decoupled Generative Modeling for Human-Object Interaction Synthesis

Abstract

Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, dynamic long-sequence planning while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and in qualitative evaluations, and perceptual studies likewise favor our results.

Key Contributions

DecHOI decouples trajectory planning from fine-grained action synthesis, reducing optimization difficulty and removing the need for manual waypoints.
Adversarial training with a compact discriminator improves coordination between distal joints and the object and reduces interpenetration.
A long-horizon dynamic planner enables responsive updates to moving counterparts and supports scene-aware interaction.
Across multiple benchmarks, DecHOI achieves state-of-the-art realism, accuracy, and diversity and consistently outperforms prior methods.
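
The adversarial contribution above centers on the dynamics of distal joints relative to the object. As a rough illustration only, the sketch below shows one way such discriminator inputs could be assembled from a motion clip. The joint indices, the feature layout, the frame rate, and the function name `distal_dynamics` are all assumptions made for this example; the page does not specify which joints or features the actual discriminator uses.

```python
import numpy as np

# Hypothetical indices for distal joints (e.g. wrists and feet) in a
# 24-joint skeleton. These are illustrative, not taken from the paper.
DISTAL_JOINTS = [20, 21, 10, 11]


def distal_dynamics(joints, object_pos, distal_idx=DISTAL_JOINTS, fps=30.0):
    """Build per-frame dynamics features for the selected distal joints.

    joints:     (T, J, 3) joint positions over T frames
    object_pos: (T, 3) object position over the same frames
    returns:    (T-1, K, 9) array holding, for each distal joint, its
                offset from the object, its velocity, and its velocity
                relative to the object (finite differences).
    """
    distal = joints[:, distal_idx, :]          # (T, K, 3)
    rel = distal - object_pos[:, None, :]      # joint position relative to object
    vel = np.diff(distal, axis=0) * fps        # (T-1, K, 3) joint velocity
    rel_vel = np.diff(rel, axis=0) * fps       # (T-1, K, 3) velocity relative to object
    return np.concatenate([rel[1:], vel, rel_vel], axis=-1)
```

A discriminator trained on features like these would see only the hands and feet and their motion relative to the object, which is one plausible reading of how a compact discriminator could focus on contact quality.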

Overview

Overview of DecHOI. The architecture decouples trajectory generation from action generation. Conditioned on the text instruction, geometry, current human and object poses, and a goal point, the trajectory generator plans paths, while the action generator produces joint motions along these paths to yield synchronized, contact-aware interactions. The right panels detail the Trajectory and Action Generators.
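
The two-stage data flow described in this overview can be sketched as follows. This is a minimal illustration of the decoupled interface only: both stages are trivial stand-ins (straight-line interpolation and a rigid skeleton offset), not the learned generators, and the text and geometry conditions are omitted. All function names, the 24-joint skeleton, and the default frame count are assumptions for this example.

```python
import numpy as np


def plan_trajectory(start_human, start_object, goal, num_frames):
    """Stage 1 stand-in: straight-line paths toward the goal point.

    The real trajectory generator is a learned model; linear
    interpolation is used here only to show inputs and outputs.
    """
    t = np.linspace(0.0, 1.0, num_frames)[:, None]     # (T, 1)
    human_path = (1.0 - t) * start_human + t * goal     # (T, 3)
    object_path = (1.0 - t) * start_object + t * goal   # (T, 3)
    return human_path, object_path


def synthesize_action(human_path, object_path, rest_offsets):
    """Stage 2 stand-in: attach a fixed skeleton to the planned human path.

    The real action generator produces detailed joint motion conditioned
    on both paths; here the object path is simply passed through.
    """
    joints = human_path[:, None, :] + rest_offsets[None, :, :]  # (T, J, 3)
    return joints, object_path


def generate_interaction(start_human, start_object, goal, rest_offsets, num_frames=60):
    """Run planning first, then action synthesis on the planned paths."""
    human_path, object_path = plan_trajectory(start_human, start_object, goal, num_frames)
    return synthesize_action(human_path, object_path, rest_offsets)


# Example call with a hypothetical 24-joint skeleton collapsed at the root.
joints, object_path = generate_interaction(
    start_human=np.zeros(3),
    start_object=np.array([0.5, 0.0, 0.0]),
    goal=np.array([2.0, 0.0, 1.0]),
    rest_offsets=np.zeros((24, 3)),
)
```

Because the second stage consumes only the planned paths, either stage can be replaced or retrained without touching the other, which is the practical appeal of the decoupled design.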

Video Results

DynaPlan

*For visualization, we render the obstacle (green) using a pre-trained action generation model, which can introduce slight jitter in the obstacle.

Quantitative Results

Table 1. Quantitative comparison on the FullBodyManipulation dataset with CHOIS, HOIFHLI, and OMOMO variants (Lin-OMOMO and Pred-OMOMO) across four categories of evaluation metrics. Arrows indicate direction: (↑) means higher is better, (↓) means lower is better, and (→) means closer to the real data value is better. The real-data DIV reference is 9.02.

Table 2. Quantitative results on the 3D-FUTURE dataset. DecHOI achieves better trajectory accuracy, motion stability, and contact realism than CHOIS and OMOMO baselines.

BibTeX

Citation

@InProceedings{Jung_2026_CVPR,
    author    = {Jung, Hwanhee and Lee, Seunggwan and Yoon, Jeongyoon and Kim, SeungHyeon and Nam, Giljoo and Huang, Qixing and Kim, Sangpil},
    title     = {Decoupled Generative Modeling for Human-Object Interaction Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {2253-2263}
}