
Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks


Formal Metadata

Title
Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2023
Production Place
Toronto, Canada

Content Metadata

Subject Area
Genre
Abstract
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input – but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model’s output by providing a means of tracing the output back through the verbalised image content.
Keywords
Transcript: English (auto-generated)
Hello everyone, my name is Sherzod Hakimov and today I'll be presenting the paper titled "Images in Language Space: Exploring the Suitability of Large Language Models for Vision and Language Tasks". This is joint work done at the University of Potsdam together with David Schlangen.
The main motivation for this research is to understand whether prompting a large language model is possible for multimodal tasks and, if it is, to what extent. By multimodal tasks we mean tasks that require looking at both an image and a text to classify the sample or answer a question about it.
In this paper we specifically focus on comparing the results of open-source models with commercial ones such as GPT-3. To do this, we use multiple tasks, all of them multimodal, and analyse the performance differences between the models.
Our methodology is the following. On the left side we have a component that we call image-as-text representation extraction. Given an image, it converts that image into a textual representation using two methods. The first uses pre-trained image captioning models.
The second gives the image to multiple image classification models and combines their outputs into what we call a visual text. The resulting text then serves as the representation of that image in the prompt.
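The sketch below illustrates this verbalisation step, assuming the Hugging Face transformers pipelines; the specific captioning and detection models named here are illustrative stand-ins, not necessarily the ones used in the paper.

from transformers import pipeline

# Pre-trained captioning model: produces a natural-language description of the image.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# An object detector stands in for the ensemble of classification models whose
# labels are merged into a "visual text" (illustrative choice, not the paper's).
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def verbalise(image_path):
    caption = captioner(image_path)[0]["generated_text"]
    labels = {d["label"] for d in detector(image_path) if d["score"] > 0.8}
    visual_text = ", ".join(sorted(labels))  # e.g. "backpack, person, teddy bear"
    return caption, visual_text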
On the right-hand side we have a prompt template that starts with a task description. We cover two kinds of tasks, question answering and classification, and each has its own template. In the middle part we insert in-context samples, which are selected from the training split of the dataset.
At the bottom we place the evaluation sample we are currently looking at. Once this prompt template has been filled, we prompt the language model and extract the answer from its output. We tested three image captioning models and four different classification models; the classification models were run separately and their combined outputs build what we call the visual text. In this way we convert each image into two different textual representations.
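A minimal sketch of how such a prompt could be assembled from the verbalised image text, a task description, a few in-context samples, and the current evaluation sample; the exact wording of the template is an assumption, not the paper's prompt.

def build_prompt(task_description, in_context, eval_sample):
    # task_description: str; in_context and eval_sample: dicts with
    # "caption", "visual_text", "question" (and "answer" for the in-context ones).
    parts = [task_description]
    for ex in in_context:
        parts.append(f"Image: {ex['caption']}. Objects: {ex['visual_text']}.\n"
                     f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    parts.append(f"Image: {eval_sample['caption']}. Objects: {eval_sample['visual_text']}.\n"
                 f"Question: {eval_sample['question']}\nAnswer:")
    return "\n\n".join(parts)

# The filled prompt is sent to the language model; the generated continuation
# is taken as the answer for the evaluation sample.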
In this paper we used three different open-source language models and compared them with the results from GPT-3. We did this on five different datasets: MVSA is a sentiment dataset, OK-VQA is a visual question answering dataset, MAMI is a meme-based misogyny detection dataset, NLVR is a reasoning dataset, and Hateful Memes is another meme dataset based on identifying hateful content.
In our experiments we compared random sampling, where we select random samples from the available training data, against a heuristic that selects the samples most similar to the given evaluation sample, and we showed that this adaptive sampling leads to better results. Similarly, captioning models capture more detailed information than the combination of multiple image classification models. We also compared the results of few-shot prompted language models with models that have been specifically fine-tuned on the respective datasets.
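A sketch of one way the similarity-based ("adaptive") selection of in-context samples could be implemented, assuming sentence-transformers embeddings over the verbalised image text; the paper's actual similarity heuristic may differ.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def select_samples(eval_text, train_texts, k=4):
    # Return the indices of the k training samples whose verbalised text is
    # most similar (by cosine similarity) to the evaluation sample's text.
    query = encoder.encode(eval_text, convert_to_tensor=True)
    pool = encoder.encode(train_texts, convert_to_tensor=True)
    scores = util.cos_sim(query, pool)[0]
    return scores.topk(k).indices.tolist()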
We present multiple qualitative examples in the paper; let's look at one from the OK-VQA dataset. For the image in this sample, the BLIP captioning model produces the caption "a person with a teddy bear in a backpack", and the visual tags retrieved are "three people, book, backpack, teddy bear, potted plant". For the question "What toy is this?", the ground truth is "teddy bear" or "stuffed animal", and both the GPT-3 and Flan-T5 models output "teddy bear", which is the correct answer.
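Put into the template sketched earlier, this example would roughly look as follows (in-context samples omitted for brevity); the task description wording is again an assumption.

prompt = (
    "Answer the question about the image.\n\n"
    "Image: a person with a teddy bear in a backpack. "
    "Objects: three people, book, backpack, teddy bear, potted plant.\n"
    "Question: What toy is this?\nAnswer:"
)
# Both GPT-3 and Flan-T5 complete this prompt with "teddy bear".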
In conclusion, we have shown that language models can also be prompted for multimodal tasks by representing the image content in text form. We did this in two ways: captioning, and combining the outputs of multiple classification models into a visual text. We compared the prompted language models with fine-tuned models that were specifically trained on each dataset, and we showed that the choice of in-context samples also makes a difference. Most importantly, we showed that open-source models have certain benefits compared to commercial ones such as GPT-3,
which means that for certain tasks open-source models can also yield good or adequate performance. All the resources that are part of this paper are publicly available. Thanks for listening.