LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding

Bibliographic Details
Published in:	arXiv.org (Dec 24, 2024), p. n/a
Main Author:	Li, Hao
Other Authors:	Qin, Roy; Zou, Zhengyu; He, Diqi; Bohan, Li; Dai, Bingquan; Zhang, Dingewn; Han, Junwei
Published:	Cornell University Library, arXiv.org
Subjects:	Language; Feature extraction; Feature maps; Editing; Semantic segmentation; Image segmentation; Scene analysis; Hierarchies; Query languages
Online Access:	Citation/Abstract; Full text outside of ProQuest

MARC


LEADER	00000nab a2200000uu 4500
001	3149107751
003	UK-CbPIL
022			\|a 2331-8422
035			\|a 3149107751
045	0		\|b d20241224
100	1		\|a Li, Hao
245	1		\|a LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding
260			\|b Cornell University Library, arXiv.org \|c Dec 24, 2024
513			\|a Working Paper
520	3		\|a Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works primarily focus on rendering 2D feature maps from novel viewpoints, which leads to an imprecise 3D language field with outlier languages, ultimately failing to align objects in 3D space. By utilizing masked images for feature extraction, these approaches also lack essential contextual information, leading to inaccurate feature representation. To this end, we propose a Language-Embedded Surface Field (LangSurf), which accurately aligns the 3D language fields with the surface of objects, facilitating precise 2D and 3D segmentation with text queries and broadening downstream tasks such as removal and editing. The core of LangSurf is a joint training strategy that flattens the language Gaussians on the object surfaces using geometry supervision and contrastive losses to assign accurate language features to the Gaussians of objects. In addition, we introduce the Hierarchical-Context Awareness Module to extract features at the image level for contextual information and then perform hierarchical mask pooling, using masks segmented by SAM, to obtain fine-grained language features in different hierarchies. Extensive experiments on open-vocabulary 2D and 3D semantic segmentation demonstrate that LangSurf outperforms the previous state-of-the-art method LangSplat by a large margin. As shown in Fig. 1, our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing, which is also supported by comprehensive experiments. Project page: https://langsurf.github.io
653			\|a Language
653			\|a Feature extraction
653			\|a Feature maps
653			\|a Editing
653			\|a Semantic segmentation
653			\|a Image segmentation
653			\|a Scene analysis
653			\|a Hierarchies
653			\|a Query languages
700	1		\|a Qin, Roy
700	1		\|a Zou, Zhengyu
700	1		\|a He, Diqi
700	1		\|a Bohan, Li
700	1		\|a Dai, Bingquan
700	1		\|a Zhang, Dingewn
700	1		\|a Han, Junwei
773	0		\|t arXiv.org \|g (Dec 24, 2024), p. n/a
786	0		\|d ProQuest \|t Engineering Database
856	4	1	\|3 Citation/Abstract \|u https://www.proquest.com/docview/3149107751/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch
856	4	0	\|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2412.17635