STATIC: Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation

Bibliographic details
Published: arXiv.org (Dec 2, 2024), p. n/a
Lead author: Yang, Sunghun
Other authors: Lee, Minhyeok, Cho, Suhwan, Lee, Jungho, Lee, Sangyoun
Publisher: Cornell University Library, arXiv.org
Subjects: Robotics; Consistency; Parameter identification; Similarity; Modules; Frames (data processing); Optical memory (data storage); Optical flow (image analysis)
Access online: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3139000058
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3139000058 
045 0 |b d20241202 
100 1 |a Yang, Sunghun 
245 1 |a STATIC : Surface Temporal Affine for TIme Consistency in Video Monocular Depth Estimation 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a Video monocular depth estimation is essential for applications such as autonomous driving, AR/VR, and robotics. Recent transformer-based single-image monocular depth estimation models perform well on single images but struggle with depth consistency across video frames. Traditional methods aim to improve temporal consistency using multi-frame temporal modules or prior information such as optical flow and camera parameters. However, these approaches face issues such as high memory use, reduced performance with dynamic or irregular motion, and limited motion understanding. We propose STATIC, a novel model that independently learns temporal consistency in static and dynamic areas without additional information. A difference mask derived from surface normals identifies static and dynamic areas by measuring directional variance. For static areas, the Masked Static (MS) module enhances temporal consistency by focusing on stable regions. For dynamic areas, the Surface Normal Similarity (SNS) module aligns areas and enhances temporal consistency by measuring feature similarity between frames. A final refinement integrates the independently learned static and dynamic areas, enabling STATIC to achieve temporal consistency across the entire sequence. Our method achieves state-of-the-art video depth estimation on the KITTI and NYUv2 datasets without additional information. 
653 |a Robotics 
653 |a Consistency 
653 |a Parameter identification 
653 |a Similarity 
653 |a Modules 
653 |a Frames (data processing) 
653 |a Optical memory (data storage) 
653 |a Optical flow (image analysis) 
700 1 |a Lee, Minhyeok 
700 1 |a Cho, Suhwan 
700 1 |a Lee, Jungho 
700 1 |a Lee, Sangyoun 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3139000058/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2412.01090
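
The abstract above (MARC field 520) describes STATIC's core mechanism: a difference mask computed from surface normals separates static from dynamic areas by measuring directional variance between frames. The record contains no code, so the following is a minimal illustrative sketch of that masking idea, assuming depth maps as NumPy arrays. All function names, the threshold value, and the cosine-distance proxy for "directional variance" are assumptions for illustration, not the authors' implementation.

import numpy as np

def surface_normals(depth):
    """Approximate per-pixel surface normals from an (H, W) depth map."""
    dz_dy, dz_dx = np.gradient(depth)
    # Unnormalized normal from depth gradients: (-dz/dx, -dz/dy, 1).
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def difference_mask(depth_t, depth_t_next, threshold=0.1):
    """Flag pixels whose normal direction changes between frames as dynamic.

    Returns a boolean (H, W) mask: True = dynamic area, False = static.
    The threshold and the cosine-distance measure are illustrative
    assumptions, not values from the paper.
    """
    n_a = surface_normals(depth_t)
    n_b = surface_normals(depth_t_next)
    # Directional change as 1 minus cosine similarity of unit normals.
    directional_change = 1.0 - np.sum(n_a * n_b, axis=-1)
    return directional_change > threshold

# Usage sketch: static pixels would feed a module like the paper's Masked
# Static (MS) module, dynamic pixels a module like Surface Normal Similarity
# (SNS), with a final refinement fusing the two, per the abstract above.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d0 = rng.random((64, 64))
    d1 = d0.copy()
    d1[20:40, 20:40] += 0.5  # simulate a moving object region
    mask = difference_mask(d0, d1)
    print("dynamic pixels:", int(mask.sum()))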