SwePub
Record: swepub:oai:gup.ub.gu.se/312511

What Does a Language-and-Vision Transformer See: The Impact of Semantic Information on Visual Representations

Ilinykh, Nikolai, 1994 (author)
University of Gothenburg, Department of Philosophy, Linguistics and Theory of Science
Dobnik, Simon, 1977 (author)
University of Gothenburg, Department of Philosophy, Linguistics and Theory of Science
2021-12-03
2021
English.
In: Frontiers in Artificial Intelligence: Identifying, Analyzing, and Overcoming Challenges in Vision and Language Research. Frontiers Media SA, ISSN 2624-8212, vol. 4.
  • Journal article (peer-reviewed)
Abstract
Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their respective interplay, and the task's effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and varies from local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode the knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether the insights from cognitive science echo the structure of representations that the current neural architecture learns. The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve on such problems as pre-training of large-scale multi-modal architectures, multi-modal information fusion and probing of attention weights. In general, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how the input representations and the structure of the multi-modal transformer affect visual representations.
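
The core of the analysis described above is inspecting per-layer self-attention over detected object regions in the visual stream. As a rough illustration only (this is not the authors' code; the function name, tensor shapes and the attention-weighted distance statistic are assumptions), the following Python sketch shows one simple way to probe the local-to-global progression the abstract reports: given the visual-stream attention weights and the centres of the detected objects, it estimates how far, on average, each layer attends.

import numpy as np

def mean_attention_distance(attn, centers):
    # attn: (layers, heads, N, N) row-normalised self-attention over N object regions
    # centers: (N, 2) box centres of the detected objects, e.g. in normalised image coordinates
    # returns: (layers,) attention-weighted mean pairwise distance, one value per layer
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)   # (N, N) distances
    per_query = (attn.mean(axis=1) * dists).sum(axis=-1)                         # average heads, weight by attention -> (layers, N)
    return per_query.mean(axis=-1)                                               # average over query regions -> (layers,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers, heads, n = 6, 8, 36                                                  # e.g. 36 region proposals from an object detector
    logits = rng.normal(size=(layers, heads, n, n))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)           # softmax over each attention row
    centers = rng.uniform(size=(n, 2))
    print(mean_attention_distance(attn, centers))                                # low values = local attention, high = global

If this statistic grows with depth for a real model's attention weights, that would be consistent with the abstract's observation that early layers attend to nearby, semantically similar detections while deeper layers attend across more distant objects.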

Subject headings

NATURVETENSKAP  -- Data- och informationsvetenskap -- Språkteknologi (hsv//swe)
NATURAL SCIENCES  -- Computer and Information Sciences -- Language Technology (hsv//eng)

Keyword

effect of language on vision
information fusion
language-and-vision
multi-modality
natural language processing
representation learning
self-attention
transformer

Publication and Content Type

ref (peer-reviewed)
art (journal article)
