IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control


Bytedance Inc.

We present IDC-Net, a novel framework that, given metric-aligned RGB-D inputs and camera trajectories, generates RGB-D sequences with precise camera control. Because the generated depth maps are metrically aligned with the camera poses, the outputs enable direct 3D reconstruction without post-processing.

Abstract

We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and their corresponding depth maps within a unified, geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control over the generated sequences. Extensive experiments show that IDC-Net outperforms state-of-the-art approaches in both visual quality and geometric consistency of the generated scene sequences. Notably, the generated RGB-D sequences can be fed directly into downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework.
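The abstract does not specify how the geometry-aware transformer block consumes camera poses, so the PyTorch sketch below should be read as one plausible realization rather than IDC-Net's actual design: each target pose is encoded as a per-pixel Plücker ray map (a common camera embedding in pose-conditioned video diffusion), and the video latent tokens cross-attend to the resulting ray tokens. The names plucker_ray_map and GeoAwareBlock, and the cross-attention injection point, are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

def plucker_ray_map(K, c2w, h, w):
    """Per-pixel Plucker coordinates (o x d, d) for one camera.

    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns a (6, h, w) ray map encoding the target camera.
    """
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32) + 0.5,
        torch.arange(w, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (h, w, 3)
    dirs = pix @ torch.linalg.inv(K).T                         # camera-space rays
    dirs = dirs @ c2w[:3, :3].T                                # rotate into world
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)                        # camera center
    moment = torch.cross(origin, dirs, dim=-1)                 # o x d
    return torch.cat([moment, dirs], dim=-1).permute(2, 0, 1)  # (6, h, w)

class GeoAwareBlock(nn.Module):
    """Hypothetical camera-conditioning block: video latent tokens
    cross-attend to ray tokens derived from the target pose."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.cam_proj = nn.Linear(6, dim)  # lift Plucker rays to token dim
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, rays):
        # tokens: (B, N, dim) video latents; rays: (B, M, 6) flattened ray map
        cam = self.cam_proj(rays)
        out, _ = self.attn(self.norm(tokens), cam, cam)
        return tokens + out  # residual injection of camera geometry

Other injection mechanisms (additive embeddings, adaLN-style modulation) would serve the same purpose; the essential property is that every frame's tokens are conditioned on an explicit geometric encoding of its target camera.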

Video Results with Given Trajectory

IDC-Net generates RGB-D scenes from input RGB-D data and given camera trajectories. Each row shows videos generated from different prompts under a shared trajectory.

More Video Results

Additional RGB-D results generated by IDC-Net from varying input prompts and trajectories.

The Method

As shown in the figure, IDC-Net jointly generates RGB and depth sequences in latent space, conditioned on an input frame and a target camera trajectory. Camera poses are embedded through a geometry-aware (GeoAware) transformer to enforce spatial consistency across frames. The generated, metrically aligned RGB-D outputs enable direct point cloud extraction, supporting immediate downstream 3D reconstruction.

The framework of IDC-Net.
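Because the generated depth maps are metrically consistent with the camera poses, "direct point cloud extraction" reduces to per-frame unprojection followed by simple concatenation. A minimal NumPy sketch under those assumptions; the function names and the assumption of known per-frame intrinsics K are ours for illustration, not from the authors' code:

import numpy as np

def rgbd_to_pointcloud(rgb, depth, K, c2w):
    """Unproject one metric RGB-D frame into a world-space point cloud.

    rgb: (h, w, 3) image; depth: (h, w) metric depth;
    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns (points, colors), each of shape (h*w, 3).
    """
    h, w = depth.shape
    ys, xs = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous pixels
    cam = (pix @ np.linalg.inv(K).T) * depth[..., None]   # back-project to camera frame
    world = cam @ c2w[:3, :3].T + c2w[:3, 3]              # transform to world frame
    return world.reshape(-1, 3), rgb.reshape(-1, 3)

def fuse_sequence(frames):
    """Fuse (rgb, depth, K, c2w) tuples; metric alignment means the
    per-frame clouds can be concatenated with no scale solving."""
    pts, cols = zip(*(rgbd_to_pointcloud(*f) for f in frames))
    return np.concatenate(pts), np.concatenate(cols)

With scale-ambiguous monocular depth, each frame would instead need a per-frame scale and shift solved against the poses before fusion; the metric alignment IDC-Net enforces is what removes that post-processing step.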

BibTeX

@article{liu2025idcnetguidedvideodiffusion,
      author        = {Lijuan Liu and Wenfa Li and Dongbo Zhang and Shuo Wang and Shaohui Jiao},
      title         = {IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control},
      year          = {2025},
      eprint        = {2508.04147},
      archivePrefix = {arXiv},
      url           = {https://arxiv.org/abs/2508.04147},
}