
Depth Any Video with Scalable Synthetic Data

ICLR 2025

Zhejiang University | University of Sydney | Shenzhen DJI Sciences and Technologies Ltd. | Shanghai AI Lab

Abstract
Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse virtual environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates—even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency. The code and model weights are open-sourced.
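The abstract names flow matching as one of the techniques adopted from the generative modeling literature. As a point of reference, below is a minimal sketch of a standard conditional flow matching training step for a latent video model; the function and tensor names (`model`, `video_latents`, `depth_latents`) are hypothetical placeholders, not the paper's actual code.

```python
# Minimal sketch of a conditional flow matching training step for
# video depth (hypothetical illustration; not the paper's implementation).
import torch
import torch.nn.functional as F

def flow_matching_step(model, video_latents, depth_latents):
    """One training step: regress the velocity field that transports
    Gaussian noise to the target depth latents, conditioned on video.

    video_latents, depth_latents: (B, T, C, H, W) latent tensors.
    """
    b = depth_latents.shape[0]
    # Sample a time t in [0, 1] per example.
    t = torch.rand(b, device=depth_latents.device).view(b, 1, 1, 1, 1)
    noise = torch.randn_like(depth_latents)
    # Linear interpolation path between noise (t=0) and data (t=1).
    x_t = (1.0 - t) * noise + t * depth_latents
    # The target velocity of the linear path is constant: data - noise.
    target_velocity = depth_latents - noise
    # The network predicts the velocity given x_t, t, and the video condition.
    pred_velocity = model(x_t, t.flatten(), cond=video_latents)
    return F.mse_loss(pred_velocity, target_velocity)
```

At inference, integrating the learned velocity field from noise at t=0 to t=1 yields the depth latents, typically in far fewer steps than standard diffusion sampling, which is consistent with the efficiency claim in the abstract.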

Key points: This paper presents Depth Any Video, a model that combines an innovative synthetic data pipeline with generative video diffusion models to estimate depth for videos of arbitrary length, improving spatial accuracy and temporal consistency.

Method: The authors develop a scalable synthetic data pipeline and leverage the strong priors of generative video diffusion models, combining rotary position encoding and flow matching, and propose a mixed-duration training strategy that handles videos of varying lengths and frame rates (a sketch of the temporal rotary encoding idea appears after these points).

Experiments: Trained on 40,000 five-second video clips, each with precise depth annotations, Depth Any Video surpasses all previous generative depth models in spatial accuracy and temporal consistency.
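As referenced in the method point above, the following is a minimal sketch of rotary position encoding applied along the frame axis, one standard way to let temporal attention generalize across clip lengths and frame rates; the details here are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of rotary position encoding (RoPE) over frame indices
# (hypothetical illustration; not the paper's actual code).
import torch

def rope_temporal(x, positions, base=10000.0):
    """Apply RoPE along the temporal axis.

    x: (B, T, D) per-frame features with even D.
    positions: (T,) frame indices; non-contiguous indices can mimic
        clips sampled at different frame rates.
    """
    d = x.shape[-1]
    half = d // 2
    # Per-dimension rotation frequencies, as in standard RoPE.
    freqs = 1.0 / (base ** (torch.arange(half, dtype=x.dtype) / half))
    angles = positions[:, None].to(x.dtype) * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each feature pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: encode a 16-frame clip sampled at every other frame.
feats = torch.randn(2, 16, 64)
pos = torch.arange(0, 32, 2)  # frame indices at half the native rate
encoded = rope_temporal(feats, pos)
```

Because the encoding depends only on relative frame positions, the same attention weights can be reused across clips of different lengths, which is one plausible reason such an encoding suits the mixed-duration training described above.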