Modular dual-stream visual fusion network for visual question answering

Bibliographic Details
Published in: The Visual Computer, vol. 41, no. 1 (Jan 2025), p. 549
Published by: Springer Nature B.V.
Description
Abstract: Region features extracted by object detection networks have been pivotal to advances in visual question answering (VQA). However, because these features lack global context, they can yield inaccurate answers to questions that demand such information. Conversely, grid features provide detailed global context but falter on questions requiring high-level semantic insight, owing to their limited semantic richness. This paper therefore proposes an improved attention-based modular dual-stream visual fusion network (MDVFN), which fuses region features with grid features so that the region features gain global context while the grid features are supplemented with high-level semantic information. Specifically, we design a visual cross-attention (VCA) module within the attention network that interactively fuses the two visual features to enhance them before attention is guided by the question features. To reduce the semantic noise produced by the interaction of the two image features in the VCA module, targeted optimizations are applied: visual position information is embedded into each feature stream before fusion, and a visual fusion graph is used to constrain the fusion process. Additionally, we propose a modality-mixing network to combine text information, grid features, and region features. To validate the model, we conducted extensive experiments on the VQA-v2 benchmark dataset and the GQA dataset. These experiments demonstrate that MDVFN outperforms state-of-the-art methods; for instance, the proposed model achieves accuracies of 72.16% and 72.03% on VQA-v2 and GQA, respectively.
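
The abstract's description of the VCA module suggests a bidirectional cross-attention between the region and grid feature streams, with position information embedded before fusion. The following is a minimal, hypothetical PyTorch sketch of that idea only; all class, function, and tensor names are assumptions, and the paper's fusion-graph constraint and modality-mixing network are not reproduced here.

    # Minimal sketch (assumed names/shapes) of cross-attention between
    # region features and grid features, as gestured at in the abstract.
    import torch
    import torch.nn as nn

    class VisualCrossAttention(nn.Module):
        def __init__(self, dim: int = 512, num_heads: int = 8):
            super().__init__()
            # Each visual stream queries the other (cross-attention both ways).
            self.region_to_grid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.grid_to_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_region = nn.LayerNorm(dim)
            self.norm_grid = nn.LayerNorm(dim)

        def forward(self, region_feats, grid_feats, region_pos, grid_pos):
            # Embed visual position information into each stream before fusion.
            r = region_feats + region_pos
            g = grid_feats + grid_pos
            # Region features gather global context from grid features ...
            r_fused, _ = self.region_to_grid(query=r, key=g, value=g)
            # ... while grid features gather high-level semantics from region features.
            g_fused, _ = self.grid_to_region(query=g, key=r, value=r)
            # Residual connection plus normalization on each stream.
            return (self.norm_region(region_feats + r_fused),
                    self.norm_grid(grid_feats + g_fused))

    # Usage with dummy tensors: 36 detected regions and a 7x7 grid, 512-d features.
    vca = VisualCrossAttention()
    regions, grids = torch.randn(2, 36, 512), torch.randn(2, 49, 512)
    r_out, g_out = vca(regions, grids, torch.randn(2, 36, 512), torch.randn(2, 49, 512))
    print(r_out.shape, g_out.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 49, 512])

In the paper, the outputs of this kind of fusion would then be passed to question-guided attention and the modality-mixing network; those components are not sketched here.
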
ISSN: 0178-2789 (print); 1432-2315 (electronic)
DOI: 10.1007/s00371-024-03346-x
Source: Advanced Technologies & Aerospace Database