AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks

Saved in:
Bibliographic Details
Published in: arXiv.org (Dec 2, 2024), p. n/a
Main author: Li, You
Other authors: Fan, Ma; Yang, Yi
Published:
Cornell University Library, arXiv.org
Subjects:
Online access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3133540486
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3133540486 
045 0 |b d20241202 
100 1 |a Li, You 
245 1 |a AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks 
260 |b Cornell University Library, arXiv.org  |c Dec 2, 2024 
513 |a Working Paper 
520 3 |a Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, such synthetic frameworks are usually designed with meticulous human effort for each task due to varying requirements on image layout, content, and annotation formats, restricting the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality, controllable synthetic images based on the generated layouts. In addition, user-specific reference images and style images can be incorporated into the generation according to task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.
653 |a Large language models 
653 |a Image segmentation 
653 |a Image retrieval 
653 |a Retrieval 
653 |a Image annotation 
653 |a Perception 
653 |a Layouts 
653 |a Controllability 
653 |a Instance segmentation 
653 |a Modules 
653 |a Image quality 
653 |a Object recognition 
653 |a Image processing 
653 |a Data collection 
653 |a Synthetic data 
700 1 |a Fan, Ma 
700 1 |a Yang, Yi 
773 0 |t arXiv.org  |g (Dec 2, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3133540486/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2411.16749