Building Scalable Multimodal Search Applications with Python
Formal Metadata

Title: Building Scalable Multimodal Search Applications with Python
Title of Series: EuroPython 2024
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/69410 (DOI)
Language: English
EuroPython 2024, talk 124 of 131
Transcript: English (auto-generated)
00:04
So happy to be here. This is my first time at EuroPython, first time speaking and first time attending. As I was introduced, I'm originally from Toronto, Canada, and this has been really awesome so far. So my talk is situated perfectly because the previous talk introduced everybody to the concept of vector search and hybrid search.
00:27
I'm going to build a little bit on top of that where I'm going to talk about how you can include searching over images, video, audio as well. And how you can build all this with Python using the technologies that were talked about before.
00:42
So just to give a quick kind of overview, by the end of this talk you'll understand how this type of application works. And the application is fully open sourced. If you scan that QR code, it will take you to the GitHub repository that will show you how I built this. But let me play this for you guys and we can get an understanding of how this is built.
01:26
So the main idea behind this app is that you're searching here just with text, so your input is text, but what's coming out is multimodal in the sense that it can be any type of multimedia file. It can be a video, an image, it can be a text file as well, but here I haven't put any of those into the database, it can be an audio file as well.
01:47
And in this talk I'll talk about how I built this and how you can build this as well. As I was introduced already, my name is Zain Hasan and I work at Weaviate. I'll talk a little bit about what Weaviate does in a second, but the more important
02:01
thing is that I'm really passionate about AI and the good that it can do for humanity. I'm a trained engineer, I've been working in machine learning for a while now. And Weaviate is an open source vector database, so just to quickly give an understanding of what it can do, it can essentially help you build the hybrid search technologies that the last speakers were talking about.
02:22
In addition to scaling it up. Okay, so let's start off with a question. How many people here know what a vector database is? Okay, great. So that's about, I think 20%, 25%, that's pretty good. I'll do a quick five minute intro to what vector databases are and how I understand them.
02:42
So if you think about all of the data that lives on your computer over here, so these can be any emails, videos, audio files, text files, PDFs. You're going to take all that data and you're going to pass it through some form of AI machine learning based model. And the job of this machine learning model is to convert your data into vectors.
03:03
And vectors here are just arrays of floating point numbers. So every object that you had here, whether it's one email or one piece of one email, gets converted into one vector. And the great thing about this vector is that it preserves the meaning of the data once it's translated. So you've got human understandable data up here.
03:22
You and I can read it, see it, understand it. And then you've got the translation of that data, which is machine understandable. And then to understand this better, we're going to plot out these vectors into some sort of three dimensional vector space. As the last speakers mentioned, this vector is typically from anywhere from 1000 to 4000 dimensional. But here I've shown it to you in 3D.
03:42
And the great thing about these machine learning models that we're leveraging to convert our data into vectors is that they preserve a lot of the meaning behind the data. So if I have an image over here of a chicken and the word chicken, because that's semantically related, those two objects, those two vectors are going to be closer together in vector space.
04:03
Whereas if you take dissimilar concepts like a wolf and a banana, those two things are dissimilar so they're going to be farther apart in vector space. And so you're able to convert your data into a machine understandable format while still preserving a lot of the human semantics that go into determining what composes that data.
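To make that concrete, here is a minimal sketch of that idea using the sentence-transformers library (the model name is just one common choice, not necessarily the one used in the demo):

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model will do; this is a small, common choice
model = SentenceTransformer("all-MiniLM-L6-v2")

vectors = model.encode(["chicken", "hen", "banana"])

# Related concepts end up close together in vector space, unrelated ones farther apart
print(util.cos_sim(vectors[0:1], vectors[1:2]))  # chicken vs hen: higher similarity
print(util.cos_sim(vectors[0:1], vectors[2:3]))  # chicken vs banana: lower similarity
```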
04:22
And you can do this with text usually, but a lot of people are now beginning to do this with images, video, audio files, and you name it. And so every object that you want to search over or you want to store in your vector database gets projected into vector space. One of those green dots is one object.
04:41
And effectively what you're doing here is, if you think about this translation, we're looking at the concept of a data point and we're trying to identify where in vector space it belongs. So it's kind of like the analogy of a library where depending on what your book is about, you'll find it in a different location in the library. If it's about civil engineering, you'll find it in a different section.
05:02
If it's about arts and craft, it'll be in a completely different section of the library. So the way I like to think about vector databases are essentially as gigantic libraries that locate your data in a very specific space, depending on what the vector is. And the vector database, the search engine that's powering all of this, is essentially a superhuman librarian where you
05:23
take the query that you're interested in to them and they'll get you the five most relevant things back. Now it depends, whatever your data is in, whether it's text data, images, audio, if you can turn that data into a vector, you can take advantage of this vector database technology to search over it and add that capabilities into your applications.
05:44
So one example of how vector search works, you're going to take all of your data, you're going to index it and project it into this vector space. So once you've got all of your data, this is typically anywhere from millions of data points to billions of data points. And then the user comes along and asks the query.
06:01
So let's say that red dot over there is our query. And the really cool thing about vector databases is that the query is not necessarily a separate filter or a keyword. It can be any English sentence. It can be anything that can be turned into a vector as well and projected into this vector space. And the act of vector searching is effectively looking around in the proximity of that data point and
06:24
saying which of my indexed objects are the most semantically or similar in meaning to my query object. And then you're retrieving those and sending that back to the user. And so in short, the entire vector search pipeline looks like this. It revolves around some machine learning model, the encoder typically.
06:43
You pass your data through it during the indexing phase. You dump it into a vector database like Weaviate. And then the user comes along with a query. That's going to go through the same pathway of the encoder. You get a vector for that query. And the vector database is going to spit out a bunch of results ranked in order of relevance to the query for the user.
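A minimal sketch of that end-to-end pipeline with the v4 Weaviate Python client might look like this (assuming a locally running Weaviate instance and an API key for whichever vectorizer module you configure; the collection and property names are placeholders):

```python
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()  # assumes Weaviate is running locally

# Indexing phase: objects are passed through the configured encoder on import
client.collections.create(
    "Document",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
)
docs = client.collections.get("Document")
docs.data.insert({"title": "Vector search intro", "body": "How semantic search works..."})

# Query phase: the query goes through the same encoder, nearest vectors come back ranked
results = docs.query.near_text(query="search my own documents by meaning", limit=5)
for obj in results.objects:
    print(obj.properties["title"])

client.close()
```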
07:03
And then you can throw that back to your application and use it however you'd like. So effectively, it's kind of like a Google search over your own personal data. Google had this, I believe, starting in 2015 where they started adding machine learning-based semantic search on top of the keyword search.
07:24
And vector databases give you the ability to do this with your own personal data over your own enterprise documents now. So this talk is mainly about handling multimodal data. So if you've got images, audio, video files, how do you now understand semantically searching over those?
07:41
And so the rest of this talk is going to dive into how that happens and what types of models you can use to do that. So I'll talk a little bit about what multimodal models are. And I know the last talk spoke about multilingual models. A lot of people conflate multimodal with multilingual, but here when I say multimodal, I specifically mean videos, images, different multimedia formats.
08:04
And so about a year ago now, there was a debate in Toronto where they were discussing whether AI was going to pose an existential threat to humans. And then there was kind of an open timeline of when this would materialize. So let me start off with this question.
08:21
How many people here think that AI is going to pose an existential threat in the next five years? Yes? No? Very few hands go up. I guess I'm asking it in the wrong setting. Very educated people here that know about AI. I don't think so either. And the reason why a lot of people that don't believe in this, the reason why they don't believe
08:43
in it, is because we don't have AI that can do all of these very simple things for humans. Things that you and I take for granted. We don't have a successful self-driving car. You see all these videos of weird mistakes it's making. But even simpler, if you go down this stack, these are really simple things. We don't even have robots that can walk naturally.
09:02
So mimic the gait patterns of humans. And so this is known as Moravec's paradox where things that are very, very difficult for us, so language translation, playing chess, calculus, things that we need to be trained on, are very easy for AI. But on the other hand, things that are very easy for us are insanely difficult for AI.
09:25
Walking, running, setting up a table, all of these fine motor actions, these different sensory actions, are almost impossible for AI right now. And this is what people are saying is the missing link for AI to be generally knowledgeable.
09:43
And so in order to understand multimodal models and how they work, we need to understand a little bit better how humans learn. So I have a son who is about a year and a half now. And if you think about how humans develop, the first year, year and a half of their lives, they don't speak a lot. So he speaks maybe 5, 10, 15 words.
10:02
But a lot of the learning that happens in these early years is very kind of sensory-based, smelling, touching, putting things in his mouth, and kind of interaction-based. And then on top of this kind of foundational knowledge, you build more skills using language.
10:21
A lot of the models that we have right now are masters of language, but don't have any understanding outside of language. Slowly but surely we're getting language vision models that are kind of built on top of the language models, but we need a lot more of this sensory input to be able to do cross-modal reasoning. So this talk is more about jumping off points.
10:41
I'm going to talk about multimodality and what resources you can use, and I'm going to touch on it. It's not an in-depth talk. I'll talk a little bit more about that in the end. I've actually made a whole course for people that are interested in this. So the last talk talked about taking this idea of having a text-based input
11:01
and then passing it through some form of text encoder and it generating a vector. But now I want you to think of this as four different models where you've got specialist models for each modality. You have an image understanding model, an audio understanding model, and a video encoder model. And the job of all these models is to generate their corresponding multimedia format into a vector.
11:23
And because the original data points were semantically similar, you know, multimedia-wise they're completely different formats, but they're all about the same concept, the projected vectors should also be similar. So if you notice the vectors, they're quite similar across the dimensions. And that's because semantically all of these things are very similar.
11:41
If they were books in a library, they would all be in the same location in the library. And there's even work that's being done that allows you to digitize smell. So this is work from a company that spun out of Google and they're putting out research. This is not publicly available in terms of usage for now, but they're essentially building up this odor map where if you can take different molecules,
12:02
you can project it into vector space and get an understanding of what is a grapey smell or a musky smell as well. And like I said, if you can turn your data into a vector, then you can take advantage of this vector database and Approximate Nearest Neighbors technology to perform retrieval and search over it. Okay, so once you've got these models that are combined and can understand all of these different multimedia formats,
12:26
you can essentially pass in any type of your data. So typically this is language, but increasingly we're seeing applications where this could be images, especially for e-commerce platforms and social media platforms. You've got a lot of image modality, audio, video, and then you've also got applications where you have sensory motor proprioception data
12:45
that can all be embedded into a unified vector space where now you can start to say, this is the image that I'm interested in, give me anything that's remotely relevant to this image. So your query can be an image, your query can be an audio file as well.
13:01
And so now you can take any one of these objects as input queries and you can project the query into vector space and then the vector space can spit back whatever is relevant, whether that's an audio file, whether that's an image, or a text file. And so this gives you a lot of cross-modal functionality, which is what humans are really good at.
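As a small illustration of text-to-image cross-modal search (here using a CLIP model through sentence-transformers; the file names are placeholders), you could do something like:

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP embeds images and text into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

image_vectors = model.encode([Image.open("lion.jpg"), Image.open("beach.jpg")])
query_vector = model.encode("a big cat roaring")

# The semantically matching image should get the higher score
print(util.cos_sim(query_vector, image_vectors))
```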
13:22
So you can take an audio file and you can retrieve images or video. You can take images and video, retrieve an audio file. And so this is a model that we integrated into Weaviate so that people can build multimodal applications a lot easier. And this model allows you to combine multimedia formats. So if you have an image or an audio file you can project it into vector space
13:43
and you can take the vectors and add them together to do multimodal retrieval. And so this type of technology now gives you the ability to reason over these multimedia formats, more similar to what a human can do. And so a lot of people are interested in large language models,
14:02
but now we're increasingly seeing companies like Google, Anthropic, OpenAI move more towards large multimodal models that can increasingly understand images, video, audio files as well. So in the last part of this talk I want to talk a little bit about who's using these models in production.
14:23
A lot of the use cases that you hear about these days are mainly around text-based search. That's what the last talk was about. But some of our biggest customers are actually using multimodal applications in production. So the biggest use case here is around e-commerce. And typically this is because e-commerce companies have not just text assets for their products,
14:45
but they also have images, they have videos and they've got audio files as well. And the reason why multimodality is revolutionizing e-commerce is quite simple. So if I ask you a very simple question, what type of burger do you like,
15:00
this forms the basis of all recommender system technology. Fundamentally it is this simple, how do you take a customer's likes and dislikes and then how do you rank your catalog, your products based on this like and dislike. Historically this question was answered using one modality, text. So you can describe to me what you like and I've got a description of all of my products.
15:24
And then I can do this ranking based on text and say okay, this product matches your description the most, so it's going to be number one on my recommendations list. But now if you can capture multiple modalities, you can ask the person show me what type of burger you like,
15:42
because now you have a way of capturing that into a vector and now you can rank and search based on that vector. You can ask people what does the perfect burger smell like, what does it sound like, is it crunchy, so on and so forth. And so you can now, you have more dimensions and senses to capture a person's likes and dislikes across.
16:03
So all of these different modalities now kind of identify every user uniquely. And there's also work that's being done on taking tabular data and capturing that. So you can take nutritional facts, turn that into an Excel sheet, into a table, and then you can project that into vector space to use it for recommendations as well.
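One simple way to sketch that idea is to serialize table rows into text and embed them with a generic text model (the columns and model name here are purely illustrative):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Nutritional facts as a small table
df = pd.DataFrame([
    {"product": "Classic burger", "calories": 540, "protein_g": 25, "texture": "soft"},
    {"product": "Crispy chicken burger", "calories": 610, "protein_g": 28, "texture": "crunchy"},
])

# Serialize each row into a sentence, then embed it like any other text
rows_as_text = [", ".join(f"{col}: {val}" for col, val in row.items()) for _, row in df.iterrows()]
table_vectors = model.encode(rows_as_text)  # store these alongside text/image vectors
```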
16:23
But the key point here is that because we now have a way to recommend off of these modalities, because you can turn these modalities into vectors, you can now build them into your platforms. And even better yet, why stop at one vector? You also have multi-vector retrieval systems
16:41
where you can say some products are mainly bought due to their descriptions, but a lot of other products might be bought due to the way they look. Some products might be bought due to the way they smell. So that company that I was talking about called Osmo is quite useful for some products, but completely useless for other products.
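A rough sketch of that multi-vector idea, assuming you already have per-modality vectors for a user's preferences and for each product (random vectors here just stand in for real embeddings):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
user = {"text": rng.random(512), "image": rng.random(512)}
products = {
    "burger_a": {"text": rng.random(512), "image": rng.random(512)},
    "burger_b": {"text": rng.random(512), "image": rng.random(512)},
}

# Weight modalities differently: some products sell on their description, others on their looks
weights = {"text": 0.3, "image": 0.7}
scores = {
    name: sum(w * cosine(user[m], vecs[m]) for m, w in weights.items())
    for name, vecs in products.items()
}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```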
17:01
So the way you buy different products depends on different senses. And now you have the ability to leverage those senses to recommend to those customers. And it's not just us; there's work being done around this by Amazon and Facebook Marketplace as well. And the reason why it's so powerful is because you now can more uniquely identify what a customer likes
17:25
and what they dislike, because you have more senses to do this with. I might like the way something looks, but I don't like the brand. And now you have a way to differentiate between those two senses. You can also compare products more uniquely, because you can compare across modalities.
17:44
You can say this is what it's described as, this is the metadata, but then this is what it looks like as well, and I don't like the way that it looks. And if you've got two very similar products, you now have multiple dimensions to differentiate them across. And so this is work from Amazon actually, and the link is at the bottom there,
18:02
where if a user came in and they passed in or they clicked on this particular product, their previous recommender system would recommend to you things like this. And you can see why it makes sense, because these are all things that are functionally the same. But the problem is that they don't visually match. And this is more of a text-based recommender system,
18:23
and when they added the multimodality component into it, these are the recommendations that came out. And one thing you notice here is that not only are they functionally the same, but they actually look the same. They're in the same position, the visual features are very similar. And this is the type of robustness that multimodality can add to your e-commerce platforms and e-commerce recommender systems as well.
18:49
And then the second application where we're seeing a lot of customers use multimodality is for multimodal retrieval augmented generation. So I'll talk a little bit about what retrieval augmented generation is and then how multimodality plays a role in it.
19:02
So if you think about how all of us are using large language models, typically we have some question and we pass it over to a language model to take in, reason over, and then produce some output that is answering this prompt, hopefully. A lot of people are now saying if you give it reference material, you can actually control the generation a lot better,
19:26
you can reduce hallucinations a lot better. So this is equivalent to being asked the question, but then also being given reference material to say, this information might be relevant for you to read before you have to give me an answer. And so this is a very simple concept.
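Here's a minimal sketch of that prompt-stuffing step; the retrieved chunks would come out of the vector database search described earlier, and the OpenAI client and model name are just one possible choice:

```python
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

def answer_with_context(question: str, retrieved_chunks: list[str]) -> str:
    # Stuff the retrieved reference material into the prompt alongside the question
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the reference material below.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}"
    )
    llm = OpenAI()
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```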
19:41
In your prompt you have space for relevant context material that you add in and now you get a language model to reason over your data and answer questions. And so now you get customized responses because you can stuff in any context here that the language model can read over. And the really powerful thing about this is that these can be your enterprise documents,
20:02
these can be your company documents that the language model was never trained on, but now it can read at inference time and then it can customize the response based on this. And so this is a very simple concept. It has a very complicated name called retrieval augmented generation, but the concept is quite simple. And so how this modifies the vector search pipeline is that everything stays the same,
20:25
except what comes out of the vector database now is being given to a language model to form its relevant context. And so the reason why people are interested and excited about vector databases specifically for augmenting language models is because of the scalability.
20:41
If you have a vector database that you're searching over your data with and you're filtering documents with, you can scale up to billions of documents and you can retain a latency that's real-time sub 50 milliseconds. And so the reason why I'm bringing all of this up is because now we're beginning to see how people are retrieving non-textual context.
21:06
If you have images, video, audio, you can also index that into a vector database and you can retrieve from that. So if the prompt comes in and the most relevant piece of information in your database is not a text document but rather an image,
21:21
now that gets retrieved and you can concatenate the text prompt and the image together. Now you have a large multimodal model that can understand both the image and the prompt, and it will give you a customized response by looking at the image, understanding the question and then generating some answer. And this is the concept of multimodal RAG that we're now seeing people use and start to build with.
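A sketch of that last step, passing a retrieved image plus the question to a multimodal chat model (again with the OpenAI client as one example; the file name and model are placeholders):

```python
import base64
from openai import OpenAI

client = OpenAI()

# Pretend this image was the most relevant object retrieved from the vector database
with open("retrieved_diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any model that accepts image input
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Using the retrieved image, answer: what does this show?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```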
21:47
So that's a quick kind of outline of multimodality. There's a lot more to dive into here. I made a whole course around this and we delivered it with Andrew Ng. If people are interested, there's a QR code if you're interested.
22:01
It's a short course. You can watch it over a lunchtime. And it goes into not just the embedding model component of it but also how you train large language models to see how do you take a language model, turn it into a language vision model, all of those details and then there's a bunch of applications at the end that we build as well.
22:21
I'll leave that up there for a second. And thank you so much. If you have any questions, I'll take those now but if you have questions later on, I'm very happy to connect afterwards online or wherever. We have three minutes for one or two questions.
22:44
So if you have a question, please move to the microphone and ask. Hi. Excellent talk. Thank you so much. So I have a question coming back to your burger example.
23:01
So imagine if I don't want to compare them but I would like to create a golden standard burger like the best ever. How would I go around it? How do I get to extract these features from the vector database? I suppose it might be a very naive question but I'm just very curious about it. So one thing that you can do is actually take a multi-vector approach
23:21
where if you're interested in seeing how user behaviour is taking into account your product, you can actually take those four separate models, you can take text models, image models and you can project your products into vector space using those models individually. So you can have a text vector, you can have an image vector and maybe a nutritional vector and then you can recommend based on those
23:46
and you can perform AB testing to see what type of recommendation system do you often get hits off of and which ones underperform. And there's actually a whole field of explainable AI research that's going into studying which modality is responsible for when a sale happens and which one is not.
24:03
So when people buy food, do they buy it because of the way it looks? And when people buy clothes, do they buy it because of the way it's described with a brand? And so there's actually a whole field of explainability where you can extract how often a text vector is successfully used versus how often an image vector is used. And this is in contrast to how Amazon and Facebook are using it where
24:24
they're taking the text vector and the image vector and they're just adding them together. So now you've got both of those vectors blended into one vector representation. Alright, thank you so much. One other thing, so for people that are interested in this, all of the code is available here. I'll paste these slides afterwards.
24:42
And also if you're interested in building with this, this is where the code is coming from. We have a little present for you. Thank you so much. Unfortunately we can't take any more questions but you can always find Zain outside and ask any questions.
25:01
The next session starts in five minutes.