
Robot Holmes and The MLington Murder Mysteries


Formal Metadata

Title
Robot Holmes and The MLington Murder Mysteries
Title of Series
Number of Parts
141
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
We will follow master detective Robot Holmes on his way to solve one of his hardest cases so far - a series of mysterious murders in the city of MLington. The traces lead him to the Vision-Language part of town, which has been a quiet and tranquil place with few incidents until lately. For a few months the neighbourhood has been growing extensively and careless benchmark leaders are dropping dead at an alarming rate. Robot Holmes sets out to find the cause for this new development and will gather intel on some of the most notorious of the new citizens of the Vision-Language neighbourhood and find out what makes them tick.
Transcript: English (auto-generated)
Thank you, and welcome to the talk. Thank you for coming here almost at the end of the day. You probably heard a lot, and maybe are a bit tired, like me, but no worries, I will try to not overcomplicate stuff.
I will tell you a story most of the time, and it will be a story of Robot Holmes. But first, a little bit about me, maybe. Who am I? So I'm Johannes, Johannes Kolbe, full name. I'm a data scientist at Celebrate Company.
We are a German company, and we help people to celebrate, you could say, because people can order customizable print products, right, stuff like wedding cards, birth cards for your baby, custom calendars, the stuff you need around Christmas, right, when you're in dire need of presents for your parents
or your grandparents, and then you just order some calendars with nice pictures. We're a small data science team, actually, and my focus is on computer vision. I do have some expertise in natural language processing as well, but I'm completely self-taught in that area, right, so I would like to say,
okay, I know a little bit, but don't ask me too much, like not too many details. Yeah, I got my Master of Science in computer science at the Technical University of Berlin, and my focus was on cognitive systems,
and since last year, I'm a Hugging Face fellow, so maybe in the last talk, the company Hugging Face was already mentioned, and I guess people from machine learning space might know the company because it's quite big by now, and that's also why I'm wearing my shirt with this emoji, right, that's like the mascot, the Hugging Face,
of course, and they've got this fellowship thing where they reward people that are active in the community, so I like to lead a computer vision study group every few months, basically, there, on the Discord channel, you can see there, and yeah, you can also find past study groups down there,
and I like to present papers, so I'm talking about computer vision papers, and I like to give my presentations a little twist, so we had like a Super Mario presentation, a Pokemon presentation, a summer presentation, a medieval presentation, the last one was futuristic
with an evil AI overlord, so you see, today is similar, right, just that it's not one paper I want to talk about, but I want to introduce you to the whole area of vision language models, so if you never heard about vision language models,
that's perfect, you will get like your first steps into it, right, if you've heard about it, if you know some stuff, that's also good because probably you've never seen it wrapped up in a story like today, and you might get some of the inside jokes, right, so yeah, let's jump into the story.
It was a typical MLington day, cold, grey, rainy. Luckily, the sun was setting and taking its light off the misery of the streets. Robot Holmes entered his office and found the chief of police already waiting for him.
There's been a murder, the chief said. There always is, Holmes replied. It's Old San, he has been found dead in his devastated workshop. Old San, one of the people over in vision language village, a reputable citizen.
He's been there for ages, basically started the whole part of town with some other guys. There were rumors of him being some sort of leader in the district's community, but now he's dead and nobody knows why. There's something going on on the other side of the river, Holmes, the chief proclaimed.
Almost every other week, an established citizen is murdered. We have a murder series on our hands and no clues. I want you to find out what's going on over there. Holmes sighed heavily, put on his hat, and stepped out into the dark city streets of MLington.
Okay, so much for the setup. Now you might wonder, what the hell is MLington, all right? Probably you never heard of MLington, and I will give you a little tour of MLington. So this is MLington, and oh, I forgot to say, you will see a lot of AI-generated art here, right?
Basically every image, every slide is full of AI-generated art, so I used Stable Diffusion, the SDXL model for the ones interested, and spent a few hours prompting. So yeah, for example, this one, actually. It's inspired by an old London map.
But yeah, it's a bit different here. So this is MLington, right? The city of MLington, and like every good city, it has a big river in the middle, like in Prague as well, right, and it has some districts. So at the heart of it, there is the city of science, which is not related to the Parisian City of Science,
if you know that, but here the city of science is where the scientists live, right? They've got their laboratories, they build cathedrals for their research fields, sometimes ivory towers are popping up and falling down again, and if they love their research fields really much,
or if they're just under pressure to publish something, you know, models get born. And these models then move out of the city of science. So to the east of it is Visionsworth. That's where the computer vision models live, right? So they move there, and it's quite an old part of town,
right, computer vision has been there for ages, basically. So yeah, it's well established. They even have a big tower over there, which is quite handy for computer vision, if you know, like looking down and stuff, perspective, homography, all those crazy things.
Now, if you're a language model or some other natural language processing thingy, you move west of the city of science, because that is where Language Shire is. So that's where all the natural language processing models live, right? And yeah, it's basically as old as Visionsworth, I would say,
and like from the days of symbolic NLP all the way to the modern large language models, that's where they all live. Now, if you paid attention, there is one more district I will talk about today,
where basically the whole story happens, right, on the other side of the river, because there is vision language village down there. Okay. So that's a fairly new part. I mean, there's been like a small settlement, so there's always been some models living there. And you can see it's actually connected to Visionsworth
and to Language Shire through the bridges, right? So it's got like connections to both of them, while the other ones are separated from each other through the city of science. Vision language village is connected to both of them. So that's where the vision language models live. And recently there's been quite a lot going on there
in vision language village. And the quality of the services that are offered in the village has gotten way better recently. So not everyone might know what kind of services are offered there, so we just take a little tour.
After you've taken the picture. Have you taken the picture? Okay. So yeah, what are some services offered there? Right. One that is pretty well known is image captioning. I will go a bit more into detail about them on the next slides. And another one is visual grounding, which is actually a whole family of tasks.
And then there is image text retrieval, which might not be that well known, but it's pretty handy. Visual question answering, one of my favorites actually. And one that's pretty well known in the last time is text to image generation, right?
Stuff like stay with diffusion, diffusion models in general, mid-journey, whatever you want. So let's have a look, image captioning. I mean, it's pretty self-explanatory, right? So you've got like an image, give it to a model, and the model gives you some text, like an illustration of a city street at night.
Image captioning can be pretty handy, especially when you want barrier-free websites, so automatically create captions and everything. Pretty basic in itself. The next one, image text retrieval. So here we have like a collection of images. For example, these four images, right?
They're all different. And then we have a text, like a prompt. Again, you just take the one from the last slide. An illustration of a city street at night. And then the task of the model is to take these inputs, like the collection of images and the text, and give us the best-fitting image, which is this one.
There basically is a twin task to that. It's called text image retrieval. So it's the other way around, where you have a lot of text and then like one image, and then you have to find out, okay, which text fits this image. And can be pretty handy when you have like a big collection of images and you want to search it for some stuff, right?
So, yeah, quickly have to take a sip. Now the next task is visual question answering. As I said, one of my favorites, because you can have a lot of fun with it actually.
So let's take this picture. It's really similar to the one before. Now you only see there's something in the middle, right? And we ask, okay, what is in the middle of the street? And it will tell us a police box, which is right. So actually visual question answering is about asking the model something about the image.
The good thing is, now that we're in the large language model era, we can even chat with it further, right, if we remember stuff. So I went on and asked, could it also be a TARDIS? So I don't know how many people know Doctor Who
and what a TARDIS is, okay, some, some, some. Oh, I think it's more, I already did this presentation in London in a similar way, and I asked this question in London, and there were just really, really few people who knew what a TARDIS is, and I was surprised. So for those who don't know, a TARDIS is a spaceship
that is disguised as a police box. Of course, it's bigger on the inside, right? It's Time Lord technology, so it's from the TV series Doctor Who, yeah, so it looks just the same from the outside. So it could be a TARDIS, right? So I asked, is it a TARDIS, or could it be a TARDIS? Yes, it could be a TARDIS.
And then I asked, is it a TARDIS? And it said, no, okay. Actually, I went on and asked, okay, why isn't it a TARDIS? And then it said, because it is in the middle of the street, which, well, makes sense for it, apparently. But yeah, you can just have a lot of fun with it.
This particular model was not really chatty, but yeah. Next task, visual grounding. So you basically have a prompt again, for example, a police box, and you give it a text, an image, and then the task is to detect, right? Object detection, basically.
Detect the police boxes. You can also say the police box and a street lamp, and then it detects all the stuff you see there. So that's visual grounding. There are some different tasks within visual grounding, but in the end it's all about object detection in the image from a prompt. Okay. So let's go back to our actual story, right?
So Robot Holmes starts investigating. Of course, first he goes to Vision Language Village. There he goes to the workshop of Old San to snoop around and see if he can find any evidence, and he's lucky. Oh, that is the end.
Wait a second. What? What happened now? Oh, God. Yeah, spoiler alert. No. Which one is the right one? I can't see it. I think this. Okay. No. You haven't seen anything?
Don't look there. Okay, okay, okay, okay. We go back here, back here, back here. Okay. He finds a piece of red clothes. Okay. That might not look like much, but there's one citizen of Vision Language Village that is notoriously known
for her taste in red clothing. Right? And her name is Clip. So Clip is one of the citizens that had the most impact in recent years. Right? So she was brought to life by OpenAI in 2021,
and she changed the whole village, basically. Yeah. So what does she actually do? What does she offer? What's her service? So we have Clip. We have an image. We can give her some texts, and then she will tell us,
okay, which one is the most probable to fit the image? So it is text image retrieval, right, as we learned before. And why is she so good at it, actually? Right? What makes her stand out from all the others? Well, one thing is she knows a lot of people.
She has a lot of connections, and with all these people, they helped her to collect data, right? So she collected a lot of data of text-image pairs, more data than anyone before. In the end, it was about 400 million text-image pairs.
And, well, as you know, in machine learning, if you scale data to a reasonably high point, like 400 million, well, it's probable that you can get good results, and that's what happened, actually. So, yeah, she took those pairs, right, and then she had,
like, images and texts. So now she had to process them somehow. So now she went, for the texts, to Language Shire. There she learned how to turn these things into other representations, embeddings,
as we call them in machine learning, right, which are like, yeah, representations of the texts. And then she took the images to Visionsworth, and the images were also turned into representations. So then she took both of them, and now she could actually compare both of them, right?
Before, they were texts and images. It's hard to compare texts and images. I mean, from a data point of view, they are really different, but now, because they're both in representation form, it's basically a list of numbers, you could say. They have the same dimensions. They have the same form, right? So now she can take both of them, align them,
and then say, okay, those on the diagonal, that's the matching stuff, and the others don't match. We call this image-text contrastive learning, or ITC for short, and that was 400 million samples.
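(As a rough illustration of that diagonal-matching idea, here is a minimal sketch of a symmetric image-text contrastive loss in PyTorch. It is illustrative only, not CLIP's actual training code, and all names are made up.)

```python
import torch
import torch.nn.functional as F

def itc_loss(image_embeds, text_embeds, temperature=0.07):
    # Illustrative sketch of image-text contrastive (ITC) learning, not CLIP's real code.
    # Both inputs are batches of representations with the same dimensionality.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Similarity of every image with every text; matching pairs sit on the diagonal.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull the diagonal (matching pairs) together and push everything else apart,
    # in both directions: image-to-text and text-to-image.
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2
```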
Four hundred million is a lot, and yeah, that's what helped Clip become so successful. And actually the most amazing thing is the zero-shot capability, so she can even tell you stuff that was not in the data she learned from. She's really good at generalizing. Okay, and now Holmes thinks about, like,
okay, how can he interrogate her? And he just starts by showing her the red clothes, right? And then he has some prompts for her. So the first one, like a piece of red clothes, a piece from a murderer's clothing, or a piece of clothes
an innocent citizen lost. That's what she says, right? So most likely it's a piece of red clothes. Well, okay, that was too easy, and that doesn't help Holmes anywhere. So he says, okay, let's leave out the first one, just the other two. Well, a piece of clothes an innocent citizen lost.
Okay, still not really helpful for Holmes, right? Because he really thinks that Clip is involved in this. So he does some prompt engineering. You might know this term from large language models. You can also do it here. And he just changes the text in the middle to a piece
of clothes from a murderer's clothing. And suddenly, oh, it's 50%. And it is from a murderer's clothing. So that's really helpful for Holmes. Now you can say, ah, okay. Now I've got her saying it basically. But, yeah, you know, it's just like 1% difference.
So I need some more evidence. So he thinks about what else he can do. And by investigating Clip, there was one other citizen that looked quite suspicious, right? He came in a lot of times to Clip's workshop, asked stuff, and his name is OWL-ViT over there.
Brought to life in, like, let me go, yeah, 2022 by Google, actually. And, yeah, what OWL-ViT does is exactly this, right? Give it a prompt, a text,
and you get your visual grounding. Or in OWL-ViT's case, it's called open-vocabulary object detection, which is just some sort of visual grounding in the end. And similar to Clip, it's actually good at detecting stuff it's never seen before, right?
These zero-shot capabilities, that's what really makes it stand out. So just a short look at the training of OWL-ViT, how does it work? How is it so good at detecting stuff? Actually, the first step is just Clip, right? If you just take, like, a pre-trained Clip model,
you've got the first step covered. So that's the whole image-text contrastive learning. The second step is a bit more complicated. So we take a prompt, like, for example, these four texts, and then we get our image, and we split the image up into patches.
So here's, like, four parts of the image, right? Four patches. And then we give both of these to Clip. But you might see the difference between the Clip on the right side and the Clip on the left side. So on the right side, actually, I would have to cut off her hat, because that's what we say in machine learning, like,
we cut off the head of a model, but I couldn't do it because I got emotionally attached to Clip, so I just removed her mask instead. Because in the next slide, you will see that when we have this mask removed, or the hat cut off, we can get, like, these representations out of Clip, right?
So we just take these, yeah, representations you've seen before, and then we feed them to OWL-ViT. And the secret of OWL-ViT is that it actually has two hats, right? So it's not just one hat for OWL-ViT, but it's two hats. So the upper one is called the classification hat
and the lower one the regression hat, and then we give these image representations to both of these hats. The text representations only go into the classification hat, and then the classification hat will tell us, for each of these, right, what is most probably there. So, like, that's what is predicted.
And the second hat, the regression hat, is in charge of giving us the bounding boxes that correspond to it. Yeah, and that's the OWL-ViT training.
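(To make the "two hats" a bit more concrete, here is a minimal, purely illustrative sketch of a classification head and a box-regression head sitting on top of the patch representations. This is not the actual OWL-ViT implementation; all names and sizes are invented.)

```python
import torch
import torch.nn as nn

class TwoHatsSketch(nn.Module):
    # Illustrative sketch of OWL-ViT-style detection heads, not the real implementation.
    def __init__(self, dim=768):
        super().__init__()
        self.class_head = nn.Linear(dim, dim)   # the "classification hat"
        self.box_head = nn.Sequential(           # the "regression hat"
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 4)
        )

    def forward(self, patch_embeds, text_embeds):
        # patch_embeds: (batch, num_patches, dim) image representations
        # text_embeds:  (batch, num_prompts, dim) text representations
        # Classification hat: how well does each patch match each text prompt?
        logits = self.class_head(patch_embeds) @ text_embeds.transpose(-1, -2)
        # Regression hat: one bounding box per patch, squashed to the [0, 1] range.
        boxes = self.box_head(patch_embeds).sigmoid()
        return logits, boxes
```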
So now we're in the interrogation part again, right? So Holmes wants to find out or, like, wants to get some info out of OWL-ViT. So his prompts are murderer, lamp, and lamppost.
And he gives OWL-ViT a picture of Clip. And what comes out is this. Lamp and lamppost are detected. No murderer. Pity, right? Well, we go a step back and we say, okay, maybe you can detect a woman. Yep, that works well, a woman.
Okay, maybe you can detect a figure in a red dress. Yep, works as well. Okay, what about a murderer in a red dress? Yeah, works as well. Actually, you can see, like, with language models, if you're good at prompt engineering,
you can always get the output you want by, yeah, just rephrasing stuff a bit, and then you can frame it just as you like, right? So that's a bit of the danger of these models. The same with, oh, anything you can prompt engineer, basically, right? It's about the questions you ask in the end. Yeah. Well, for Holmes, it's great, right?
He says, oh, yeah, okay, now I'm really sure that Clip did it. Clip is the murderer. But there's one thing he still wonders about, right? What was her motive? Why did she kill Old San? And he goes down to the river, sits there and thinks about it, when suddenly a young man walks by, almost a boy,
and he loses a red marble, a red glowing marble, actually. And Holmes calls out for him, and he turns around, but his gaze is missing Holmes in an uncanny way, and Holmes realizes that the boy is blind, actually.
And Holmes tries to talk to him, and he can understand, but he can't talk. So he's blind and mute, but not deaf. So he can hear pretty well, but he can't talk and can't see. So a bit strange for a vision language village, right, where you want to process text and images, actually,
as a blind and mute person. But whatever, Holmes gives him the marble, and they part ways. But he's a bit suspicious. So Holmes starts investigating and finds out that the boy is called BLIP-2, or, as I like to call him, Q.
So Q actually has two hearts in his chest, a vision transformer heart and a text transformer heart, and these two hearts are connected within him. And why is that? Because his ancestors come from Language Shire
and from Visionsworth, right? So he's got a lot of family in the other parts of town, and family is really important to him, actually. So, yeah, family is everything to him. And they can actually help him do his tasks in the vision language village. And how that works, I will show you here.
So we've got Q in the middle, and we take an image. Again, we put it into patches, right? Like four patches again, and we give that to one of the vision transformers from Visionsworth. Really, it's just a plain pre-trained vision transformer.
Here it's called ViT-L/16, a pretty catchy name. It's just whatever, and we give the image to this vision transformer, and this gives us representations again. And then it can give the representations directly to his heart, because it's family, and close to his heart, right?
So, yeah, that's how it works with the vision. Basically, his family are his eyes here, and he can hear pretty well, I said, so we can just take a text, give it to a text transformer. Now, his biggest secret, really, is this.
I like to call it his memory marbles, right? You could also call it learnable tokens, but memory marbles is just a nicer name. And these actually are what helps him really process the information he gets from the text and from the image,
because they pass through the vision transformer, and in the end, they come out again, a bit changed, right? Learned, basically. That's how he learns stuff about images. And these memory marbles, he can then, or learnable tokens,
he can then pass on, for example, to a large language model, and then the large language model basically has the knowledge from BLIP, or from Q, and can then give us an output. Not bad at all, yeah. So, why he's called Q is because the whole thing
in the middle is called a Q-Former, right?
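(To make the "memory marble" idea a bit more concrete, here is a very rough, purely illustrative sketch: a small set of learnable query tokens cross-attends to the frozen image features and comes out changed. This is not BLIP-2's actual Q-Former code, and all names and sizes are made up.)

```python
import torch
import torch.nn as nn

class MemoryMarblesSketch(nn.Module):
    # Illustrative sketch of learnable query tokens, not BLIP-2's actual Q-Former.
    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        # The "memory marbles": a fixed set of learnable tokens.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_embeds):
        # image_embeds: (batch, num_patches, dim) from the frozen vision transformer.
        q = self.queries.expand(image_embeds.size(0), -1, -1)
        # The marbles attend to the image features and come out changed.
        out, _ = self.cross_attn(query=q, key=image_embeds, value=image_embeds)
        return out  # condensed tokens that can be handed on to a large language model
```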
So, basically, now Holmes went to his workshop, met him there with some family members, and started the interrogation. So, he said, oh, yeah, here's the image of Old San, and what is shown in the image, actually? A robot. Okay. Is he an innocent citizen? Yes. Could he be responsible for murders?
No. Why not? Because he's a robot. Can robots not commit murders or be responsible for them? No. So, he could potentially commit a murder.
Yes. Could he also be a potential mob leader? Yes. That's actually the actual output I got, right? Like, with these dots, I didn't, like, insert them. It's just, that's how it talked. Would you be shocked if I told you he's dead?
Yes. Do you think he was a mob leader? Yes. Okay. So, now Holmes has all the pieces in place, and he knows, okay, Old San was, like, a mob leader. So, the whole, like, vision language thing is like a big mob, and Clip just wanted to kill the leader and come out on top herself,
and just as he realizes this, he hears a scream, and he runs into the streets and sees a red dress on the floor, and he knows it's Clip, and he is too late. She's dead. A hole in her heart, her mask ripped off.
He can't do anything for her. So, with his main suspect dead, it began to dawn on Holmes that there might not be any victims at all. Everyone in these parts is striving for the top, and they all have their own ways to get there. One day, they are the murderer. The next one, they are dead.
Who killed Old San? Well, Holmes is pretty sure it was Clip, but now she's dead herself. The chief won't be happy because he wants arrests and not deaths, but things are moving fast, maybe too fast in this city. The sun is rising.
Time for some sleep. The end. Okay, that's the end of the story. I've got the second part. Like, I've told you about all these models now, and you might wonder, okay, how can I use them? Do you want to try them out? That's actually where Hugging Face comes in. So, you might know about this package, Transformers, from Hugging Face.
It just got, like, over 100,000 GitHub stars; sometime last week it crossed that line, and it's pretty good because you can find all these models I talked about there, and you can just use them in just some lines, actually. So, for example, CLIP, it's just the example
from the documentation, and wait, how much time do I have? Yeah, oh, that's pretty good. Yeah, you just really need these lines to actually run CLIP, so you just import from Transformers the CLIP processor and CLIP model, and you can load it, get the processor, get some image, then you define, like,
the input texts down there that you want to prompt against, and yeah, in these few lines, you can already run it.
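(For reference, the snippet described here is roughly the zero-shot example from the Transformers documentation; the image URL and the prompt texts are just placeholders.)

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image will do; this URL is just a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The texts we want to prompt against.
texts = ["an illustration of a city street at night", "a photo of two cats"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # probability for each text
print(probs)
```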
And if you don't want to code it yourself, here are some demos, actually, which I really like. So, the first one, wait, where's my mouse? The first one really well shows this zero-shot thing, right? So, I've got it here, it's a Marvel Heroes classification thing, okay. So, the whole web page is on Hugging Face, it's called Hugging Face Spaces, and you can just, when you register and have an account, you can just go to Spaces, and you can create
your demos for free, at least if you just want, like, CPU powered, you can just create it. You can also get GPU where you have to pay a bit, but you can always stop it, so it's just really pay as you go, basically. Here, you just need a CPU, and the interface is Gradio,
so it's all built with Gradio, and here's an example, you can just click on it, for example, and yep, then you get, okay, the image, and it's pretty sure that it's Black Panther, which is right, as far as I know, right? So, yeah, how is that actually done?
You want to know what is the code? Oh, so easy, you can just go up here, right? Because there, you can find the whole code in this app.py, so it was not created by me, by the way, right? It's on my account now, just to keep it maintained for the presentation, but I just duplicated it,
you can see it up here, I actually duplicated the whole demo from someone else. Yeah, so great, let's go to the original author, and you see, that's all the code that's needed for this demo, actually. It's like, what, 22 lines, and you get your CLIP demo, and you can see that it's actually loading
just really from the OpenAI weights, there's no training on Marvel superheroes or whatever, it just, yeah, just can do it out of the box, basically, which is pretty impressive, I think. Yeah, there's also Pictionary, which is quite fun,
it's basically a game where you have here a sentence that you have to draw a drawing of a cat with a face, and then you can just start drawing here, without a mouse sometimes, I don't know, looks like a cat, maybe, and then it will guess, what does it say, a drawing of a cat with something, maybe a face?
Come on, you can guess it. Oh yeah, it's quite fun, you can try it yourself, maybe you're better than me at drawing a cat with a face. Yeah, for the other ones, so OWL-ViT, for example, oh, it's the same thing again, right, you import the stuff, processor and model,
inputs, outputs, then, because it's this object detection thing, you also have to do some post-processing to get the boxes in the right sizes, but yeah, you can have a look at it yourself in my slides.
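(Roughly, the Transformers side looks like this; a sketch close to the documented OWL-ViT example, with the image URL, queries, and threshold chosen just for illustration.)

```python
import torch
import requests
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)

# Text queries to ground in the image.
texts = [["a photo of a cat", "a photo of a remote"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Post-processing turns the raw outputs into boxes scaled to the original image size.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(texts[0][label.item()], round(score.item(), 3), box.tolist())
```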
And the demos are pretty interesting, I think, so basically this little demo: there you have the image, then some text queries in here, and then the detections over here, right. It was pretty good, actually. I can just as well change one of the prompts to something like helmet, wait a second,
and then, ah yeah, now it detected the helmet down here, right, and what I didn't tell you about OWL-ViT, actually, is that it can also take an image as an input, so it's also a few-shot learner, right, so when you have like your source image, like these cats here, and you want to detect a remote, you just give it like this remote,
and then it can detect remotes, which is pretty handy, yeah, so I mean, you see it's not the same remote as here, right, and it's still able to detect it, so yeah, it's pretty good at that.
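(The image-conditioned variant looks roughly like this; a sketch of the image-guided detection API in Transformers, with placeholder image URLs for the target and query images.)

```python
import torch
import requests
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")

# Target image (where we want to find things) and query image (an example of the object).
# Both URLs are placeholders; use your own images.
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
query_url = "http://images.cocodataset.org/val2017/000000001675.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
query_image = Image.open(requests.get(query_url, stream=True).raw)

inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.6
)
print(results[0]["boxes"], results[0]["scores"])
```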
And then, with my mouse, what's left is BLIP, right, BLIP-2, actually. There's also a BLIP-1, right, but BLIP-2 is a bit better than BLIP-1, and there also is a new InstructBLIP, which is actually even better, and I also put a link here for InstructBLIP. It's a bit more chatty than the BLIP-2 thing, because as you saw, BLIP-2 often says yes, no, yes, no,
InstructBLIP is more like, oh yeah, no, it's because of this or because of that, and yeah, you can also try it here. So actually BLIP-2 needs a GPU, that's why I didn't host it myself, because I would have to pay money for it, so I just didn't do it, but here you can just play with the demo,
just get an image, you can say okay, generate caption, for example, and it will generate a caption, the Merlion fountain at Marina Bay in Singapore. You can also ask it a question right here, and generate an answer, and then you can go on chatting with it, right, like next thing, next thing, next thing.
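(If you want to run it locally instead of in the hosted demo, the Transformers side looks roughly like this; a sketch of the documented BLIP-2 usage. It does want a GPU and a fair amount of memory, and the image URL and the question are placeholders.)

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)

# Image captioning: no prompt, just the image.
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())

# Visual question answering: prompt in the "Question: ... Answer:" format.
prompt = "Question: what is in the middle of the street? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```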
So that's all stuff I would say you can try yourself at home, or when you have time around here, and yeah, I hope you learned a thing or two, and had fun. That's it from me, thank you.
Do we have any questions? And we do not have any questions on the Discord as well, but you can later on ask and reach out to the speaker on the Discord channel, if you're there, the EuroPython channel, and thank you again for giving us such a, oh, there's a question, I'm sorry.
I wanted to ask you about, so you showed that it's possible to massage the data, so somehow you change the prompt, and you get the answer that you want. Is there any kind of practice that is used for this, because in physics, for example,
when you do experimental physics, you want to do a blind experiment, or double blind, so before, you decide all the prompts that you want to give, and once you have given the prompts, you cannot massage them anymore, because otherwise you can always get the answer that you want. So is there anything there in the research
that is interesting to follow up, or ideas? Yeah. I think maybe the whole instruction fine-tuning goes a bit in that direction, right, that you have higher quality data, in a way, and that you have better control, because when you just, a lot of these data sets that they train it on
are just scraped from the web, or whatever, and it's not that controlled, a bit messy, noisy, so that's where I think these things emerge from, that it's so sensitive to these changes, and I think with InstructBLIP, for example, it's already way better in not being that easily manipulated, so yeah,
I think maybe this instruction thing can help, even though I also don't know to which extent, right, probably there's still ways in how you can manipulate it, I mean, everyone knows, like, the ChatGPT jailbreaks, probably, so there's always some way to, yeah, manipulate these models,
and it's like the biggest weakness, I think, right, of this whole thing, and I mean, there's a lot of hype around large language models, but then there are also obvious downsides to them, yeah. We do not want to miss any questions. If there's someone, please. All right then, thank you for being so kind
to give us such a delightful talk, and yes, it's been, yes. Thanks. Thanks.