Altogether: Image Captioning via Re-aligning Alt-text
| Published in: | arXiv.org (Dec 12, 2024), p. n/a |
|---|---|
| Main author: | Xu, Hu |
| Other authors: | Huang, Po-Yao; Tan, Xiaoqing Ellen; Yeh, Ching-Feng; Kahn, Jacob; Jou, Christine; Ghosh, Gargi; Levy, Omer; Zettlemoyer, Luke; Yih, Wen-tau; Li, Shang-Wen; Xie, Saining; Feichtenhofer, Christoph |
| Published: | Cornell University Library, arXiv.org (Dec 12, 2024) |
| Subjects: | Image annotation; Image classification; Visual tasks; Annotations; Image quality; Texts; Image processing; Human performance; Synthetic data |
| Online access: | Citation/Abstract; Full text outside of ProQuest |
MARC
| Tag | Ind1 | Ind2 | Content |
|---|---|---|---|
| LEADER | | | 00000nab a2200000uu 4500 |
| 001 | | | 3119817418 |
| 003 | | | UK-CbPIL |
| 022 | | | \|a 2331-8422 |
| 035 | | | \|a 3119817418 |
| 045 | 0 | | \|b d20241212 |
| 100 | 1 | | \|a Xu, Hu |
| 245 | 1 | | \|a Altogether: Image Captioning via Re-aligning Alt-text |
| 260 | | | \|b Cornell University Library, arXiv.org \|c Dec 12, 2024 |
| 513 | | | \|a Working Paper |
| 520 | 3 | | \|a This paper focuses on creating synthetic data to improve the quality of image captions. Existing work typically has two shortcomings: it captions images from scratch, ignoring existing alt-text metadata, and it lacks transparency when the captioners' training data (e.g., from GPT) is unknown. In this paper, we study Altogether, a principled approach based on the key idea of editing and re-aligning the existing alt-texts associated with images. To generate training data, we perform human annotation in which annotators start from the existing alt-text and re-align it to the image content over multiple rounds, thereby constructing captions with rich visual concepts. This differs from prior work that treats human annotation as a one-time description task based solely on the image and the annotator's knowledge. We then train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show that Altogether leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks. |
| 653 | | | \|a Image annotation |
| 653 | | | \|a Image classification |
| 653 | | | \|a Visual tasks |
| 653 | | | \|a Annotations |
| 653 | | | \|a Image quality |
| 653 | | | \|a Texts |
| 653 | | | \|a Image processing |
| 653 | | | \|a Human performance |
| 653 | | | \|a Synthetic data |
| 700 | 1 | | \|a Huang, Po-Yao |
| 700 | 1 | | \|a Tan, Xiaoqing Ellen |
| 700 | 1 | | \|a Yeh, Ching-Feng |
| 700 | 1 | | \|a Kahn, Jacob |
| 700 | 1 | | \|a Jou, Christine |
| 700 | 1 | | \|a Ghosh, Gargi |
| 700 | 1 | | \|a Levy, Omer |
| 700 | 1 | | \|a Zettlemoyer, Luke |
| 700 | 1 | | \|a Yih, Wen-tau |
| 700 | 1 | | \|a Li, Shang-Wen |
| 700 | 1 | | \|a Xie, Saining |
| 700 | 1 | | \|a Feichtenhofer, Christoph |
| 773 | 0 | | \|t arXiv.org \|g (Dec 12, 2024), p. n/a |
| 786 | 0 | | \|d ProQuest \|t Engineering Database |
| 856 | 4 | 1 | \|3 Citation/Abstract \|u https://www.proquest.com/docview/3119817418/abstract/embedded/6A8EOT78XXH2IG52?source=fedsrch |
| 856 | 4 | 0 | \|3 Full text outside of ProQuest \|u http://arxiv.org/abs/2410.17251 |