Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Formal Metadata

Title: Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Author: Sherzod Hakimov
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/62517 (DOI)
Language: English
Production Year: 2023
Production Place: Toronto, Canada
Transcript: English (auto-generated)
00:00
Hello everyone, my name is Sherzod Hakimov, and today I'll be presenting the paper titled "Images in Language Space: Exploring the Suitability of Large Language Models for Vision and Language Tasks." This is joint work done at the University of Potsdam together with David Schlangen.
00:20
The main motivation for this research is to understand whether prompting a large language model is possible for multimodal tasks and, if it is, to what extent. By multimodal tasks we mean tasks that require looking at image and text pairs to perform a classification.
00:40
In this paper we specifically focus on comparing the results of open-source models with commercial ones such as GPT-3. To do this, we use multiple tasks, all of them multimodal, analyze the performance, and examine the differences between the models.
01:06
For this we have the following methodology. On the left side we have a component that we call image-as-text representation extraction. Given an image, it converts it into a text representation using two methods. One is image captioning, using pre-trained captioning models.
01:25
The other is giving the image to multiple image classification models and combining their outputs into what we call visual text. The resulting text then serves as the representation of that image.
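To make the idea concrete, here is a minimal sketch of such an image-to-text conversion using off-the-shelf Hugging Face pipelines. The checkpoint names and the use of an object detector for the tag list are illustrative assumptions, not necessarily the exact models used in the paper.

# Minimal sketch: turn an image into two textual views (a caption and a "visual text" tag list).
# Model names below are illustrative stand-ins, not necessarily the paper's checkpoints.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def image_as_text(image_path: str, score_threshold: float = 0.8) -> dict:
    """Return a caption and a comma-separated tag string for the given image."""
    caption = captioner(image_path)[0]["generated_text"]

    # Keep confident object labels, de-duplicate while preserving order, and join them.
    detections = detector(image_path)
    labels = [d["label"] for d in detections if d["score"] >= score_threshold]
    visual_text = ", ".join(dict.fromkeys(labels))

    return {"caption": caption, "visual_text": visual_text}

# Example output (hypothetical):
# {"caption": "a person with a teddy bear in a backpack",
#  "visual_text": "person, backpack, teddy bear"}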
01:41
On the right-hand side we have a prompt template with a task description; we use two different templates, one for question answering and one for classification tasks. In the middle part we give in-context samples, which are selected from the training split of the dataset.
02:01
At the bottom we have the evaluation sample that we are currently looking at. Once this prompt template has been filled, we prompt the language model and extract the answer from it. We tested three image captioning models, and we use four different
02:22
classification models which were run separately, combining their outputs to build what we call visual text. In this way we convert each image into two different textual representations.
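The prompt assembly described above could be sketched roughly as follows. The template wording, the field names, and the Flan-T5 checkpoint are illustrative stand-ins, not the paper's exact templates.

# Minimal sketch of the question-answering prompt template: task description on top,
# in-context samples in the middle, the evaluation sample (empty answer slot) at the bottom.
from transformers import pipeline

llm = pipeline("text2text-generation", model="google/flan-t5-base")

TASK_DESCRIPTION = "Answer the question based on the image description."

def render_example(caption: str, visual_text: str, question: str, answer: str = "") -> str:
    return (f"Image: {caption}. Objects: {visual_text}.\n"
            f"Question: {question}\n"
            f"Answer: {answer}")

def build_prompt(in_context: list[dict], eval_sample: dict) -> str:
    parts = [TASK_DESCRIPTION]
    parts += [render_example(**ex) for ex in in_context]
    parts.append(render_example(**eval_sample))  # answer left empty for the model to fill
    return "\n\n".join(parts)

def answer(in_context: list[dict], eval_sample: dict) -> str:
    prompt = build_prompt(in_context, eval_sample)
    # Extract the answer as the model's generated continuation.
    return llm(prompt, max_new_tokens=16)[0]["generated_text"].strip()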
02:42
In this paper we used three different open-source language models and compared them with the results from GPT-3. We did this on five different datasets: MVSA is a sentiment dataset, OK-VQA is a visual question answering dataset, MAMI is a meme-based misogyny detection dataset, NLVR2 is a reasoning
03:05
dataset, and Hateful Memes is another meme dataset based on identifying hateful content. In our experiments we compared random sampling, where we select random samples
03:20
from the available training data, against a heuristic that selects samples more similar to the given evaluation sample, and we showed that this adaptive sampling leads to better results. Similarly, captioning models capture more detailed information than the
03:40
combination of multiple image classification models. We have also compared the results from prompted language models with models that were specifically fine-tuned on the respective datasets. We present multiple qualitative examples in the paper; let's look at one example
04:02
from the OK-VQA dataset. For this sample, the BLIP captioning model gives an output such as "a person with a teddy bear in a backpack", and the visual tags essentially
04:20
retrieve something like "three people, book, backpack, teddy bear, potted plant". For the question "What toy is this?", the ground truth is "teddy bear" or "stuffed animal", and both the GPT-3 and Flan-T5 models output "teddy bear", which is the correct answer.
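Returning to the in-context sample selection mentioned earlier, one way such a similarity heuristic could look is sketched below. The sentence-embedding model and cosine-similarity ranking are assumptions for illustration, not necessarily the paper's exact selection method.

# One possible similarity heuristic for picking in-context samples instead of random sampling:
# embed the textual image representations and take the training samples nearest to the
# evaluation sample. Illustrative only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_in_context(train_texts: list[str], eval_text: str, k: int = 4) -> list[int]:
    """Return indices of the k training samples most similar to the evaluation sample."""
    train_emb = encoder.encode(train_texts, convert_to_tensor=True)
    eval_emb = encoder.encode(eval_text, convert_to_tensor=True)
    scores = util.cos_sim(eval_emb, train_emb)[0]   # cosine similarity to each training sample
    return scores.topk(k).indices.tolist()          # indices of the best-matching samples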
04:41
In conclusion, we have shown that language models can also be prompted for multimodal tasks by representing the image content in text form. We did this in two ways: captioning, and combining the outputs of multiple models into a visual text. We also compared the prompted language models with fine-tuned models
05:05
that were specifically trained on the respective datasets. We have also shown that the choice of in-context samples makes a difference, and most importantly, that open-source models have certain benefits when compared to the commercial GPT-3,
05:24
which essentially means that for certain tasks open-source models can also yield good or adequate performance. All the resources that are part of this paper are publicly available. Thanks for listening.