
How I used pgvector and PostgreSQL® to find pictures of me at a party


License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported

Abstract
Nowadays, if you attend an event you're bound to end up with a catalogue of photographs to look at. Formal events are likely to have a professional photographer, and modern smartphones mean that it's easy to make a photographic record of just about any gathering. It can be fun to look through the pictures, to find yourself or your friends and family, but it can also be tedious. At our company get-together earlier in the year, the photographers did indeed take a lot of pictures. Afterwards the best of them were put up on our internal network - and like many people, I combed through them looking for those in which I appeared (yes, for vanity, but also with some amusement). In this talk, I'll explain how to automate finding the photographs I'm in (or at least, mostly so). I'll walk through Python code that extracts faces using OpenCV, calculates vector embeddings using imgbeddings and OpenAI, and stores them in PostgreSQL® using pgvector. Given all of that, I can then make an SQL query to find which pictures I'm in. Python is a good fit for data pipelines like this, as it has good bindings to machine learning packages, and excellent support for talking to PostgreSQL. You may be wondering why that sequence ends with PostgreSQL (and SQL) rather than something more machine learning specific. I'll talk about that as well, and in particular about how PostgreSQL allows us to cope when the amount of data gets too large to be handled locally, and how useful it is to be able to relate the similarity calculations to other columns in the database - in our case, perhaps including the image metadata.
Transcript: English (auto-generated)
My name's Tibs; I work for a company called Aiven. Our motto is "your trusted data and AI platform", and that is one of my only two adverts for the company. I love our company, and I will enthuse about it at you if you don't stop me after the session. So: we have a company get-together every year, and we have photographers take tons of photographs of us. This is a melange of photographs from last year that happened to have me in them, and of course I sat there for hours going through the photos looking for the ones with my face. Yeah, that's a lot of time.
So what are we going to talk about? Well, I'm going to give you a bit of vague background (I've kind of started that already). I will not explain machine learning at you, because that's a big topic. I'll talk about how we find pictures of me, the subject of the talk, and then I'll talk about why we're using Postgres to do this, and not some other tool.
So: I'm a recovering AI skeptic. I've lived through more than one cycle of AI hype and doom, particularly the AI boom in the 80s and the subsequent AI winter of the 90s, when AI became a dirty word. Every time this happens, however, we get benefit out of it: we got expert systems, knowledge-based systems, and so on; it's just the name "AI" that gets avoided.
We're obviously in another of these cycles, and we'll see how it goes. When we started doing AI stuff at work, a couple of years ago, I was deeply skeptical. However, colleagues began to show me simple examples where I could see an immediate benefit: quick prototypes of boring code, or rewriting a paragraph in a different way (the tone of voice is very formal; I want it to be more informal). That sort of thing is really hard to do yourself, and if you get ChatGPT to rephrase something, it won't get it quite right, but it'll get you over that bump. But what really first caught my interest in this whole thing was that a colleague called Francesco Tisiot did a tutorial on finding photographs of, as it happened, me, from the photographs at our company get-together. I looked at that and thought, first of all, wow, that's really cool, and that's nice simple code; and secondly, I could make a talk out of this. So he gave me permission, and this is the result.
So I'm not going to explain machine learning, because that's kind of like a university course or something like that; it's a lot of study, and also I don't necessarily understand it myself. This is my breakthrough talk; I gave it for the first time last year, and it's about how I learnt just enough about machine learning to do some simple things. Similarly, I'm not going to go into much depth on vector embeddings, but we need some terminology. We talk about vector embeddings, vectors, embeddings; it all sounds very jargony. A vector is just an array of numbers that represents a direction and a size, or a distance. And I looked up "embedding": in this context, it means representing something in a computer. So we're basically just saying that this is an array of numbers, which we can think of as representing a direction and a size, and which we store in a computer. Now, I'll probably talk about vectors and embeddings interchangeably, because we're all careless about this, and I apologize on behalf of our entire industry. So: we're all computer programmers, and we know we can describe characteristics of things with numbers.
For instance, we know about describing colours with R, G and B. So here's a 3D graph; it's got the origin there, at nought, nought, nought, and in this case I'm choosing to interpret (5, 8, 3) as being some colour. I haven't looked up which colour; I honestly don't know. So we can see the red thing here: it's got a direction, and it's got a size, in 3D space.
And because we know about vectors from mathematics, we can do stuff: we can do calculations with them. This is not something we've invented; it's a very long-standing thing. In particular, we can compare their lengths, we can compare their directions, and we can compare two different vectors and ask what the vector between them is. That's as far as I'm going to go with the maths, because this is all stuff I learnt at school, and that was a long time ago.
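To make those operations concrete, here's a minimal sketch (mine, not the talk's; it assumes numpy, which the talk doesn't use):

```python
# Comparing two vectors: length, direction, and the vector between them.
import numpy as np

a = np.array([5.0, 8.0, 3.0])
b = np.array([4.0, 9.0, 2.0])

length_a = np.linalg.norm(a)          # the size of a
between = b - a                       # the vector between them
distance = np.linalg.norm(between)    # straight-line (Euclidean) distance
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))  # direction similarity

print(length_a, between, distance, cosine)
```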
So we need to calculate these vectors, these embeddings. Back in the very early days of natural language processing, people did this by hand.
Nowadays we can luckily throw computers at the problem and have them do it automatically. If we look at early natural language processing, we might calculate the meanings of words by hand. So we would say that a king has an importance of, maybe, nought point nine, on a scale of nought to one. They have a gender of one, because male is one, and we're in the old-fashioned days when gender is binary. They're pretty important: not as important as God, who is above them at one, but pretty important. And there are other characteristics we might decide on. Then we'd look at the word queen and say, well, they're also really important; their gender is nought, the opposite of one; and, oh, sorry, I forgot the typical age there: they're not quite as old as the king; and a variety of other things. A princess would be nought point eight, not quite as important; their gender is still nought; their typical age is a bit younger; and so on. There are problems with this: it doesn't scale well, because words have many meanings, and you need to have those meanings standardized between all the words. And you'll already see that we've got bias built in. I'm assuming that a king is important, though less important than perhaps God. I'm assuming that a queen is a different gender than a king, and that those genders are binary. I'm assuming the ages of these people.
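Written as code, that hand-made scheme might look something like this (an illustrative sketch; the feature values are invented, just as they were on the slide):

```python
# Hand-crafted word embeddings: each position has a human-chosen meaning,
# on a nought-to-one scale. All values invented for illustration.
hand_made = {
    #           importance, gender, typical_age   (... other characteristics)
    "king":     [0.9, 1.0, 0.7],
    "queen":    [0.9, 0.0, 0.6],
    "princess": [0.8, 0.0, 0.3],
}

def distance(a, b):
    """Straight-line distance between two hand-made embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Under this scheme, queen is closer to princess than to king.
print(distance(hand_made["queen"], hand_made["king"]))       # ~1.00
print(distance(hand_made["queen"], hand_made["princess"]))   # ~0.32
```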
There are other assumptions in those dot-dot-dots that I haven't categorized. So bias is always going to be built in, because we're human beings and we have assumptions. Nowadays, with machine learning, large language models and so on, we can train a machine-learning system to "recognize" (those are air quotes) that a thing belongs to particular categories.
And that thing can be bigger than just words. We can do whole texts, we can do pictures, we can do videos. This is wonderful, but of course, remember, we trained the models, so they have had a bias put into them, but now the models are calculating all those numbers,
and we don't know what those numbers mean. In the previous hand-made scheme, we could say: this number is about being important, this one is about being boy or girl. Now we don't know that, so the biases are harder to spot. We also have this terrible tendency to talk about "training" a machine, or about the thing "recognizing" stuff; we anthropomorphize really heavily.
I do this myself; I can't stop. So we begin to think about these things as having intelligence. Well, we shouldn't really impute that; it's not there, and that's another kind of bias we're assuming: that it's actually given some thought to this. But let's get to finding the pictures of me, because that's obviously the important part of the talk. This is a lovely photograph, taken by a company photographer a couple of years ago, before I had blue hair. I apologize for the smirk; I think it's cute. I've used it as my Slack photo for a long time. So our aim is to be able to say to Postgres: I would like to select the file names from my pictures table, ordered by how close each stored vector embedding is to this other vector embedding, and give me the first 10. In other words, I want to find faces that are implicitly like mine, by comparing vector embeddings.
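The query I'm after has this shape (a sketch: the table and column names are the ones set up later in the talk, and the 768-number vector literal is elided):

```sql
-- '[...]' stands for a full 768-number vector literal like '[0.38, 0.12, ...]'
SELECT filename
FROM pictures
ORDER BY embedding <-> '[...]'
LIMIT 10;
```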
So we've got two stages to this. The first stage is that we need to find all the faces in our corpus of pictures and store them somewhere.
Here on the left we have a picture of myself and two colleagues; this is the picture that was used in the tutorial. We can run a program to find the faces in it: here it's found two faces. We can turn those into arrays of numbers, vector embeddings. (The ones shown are not real; they're made-up numbers, because I couldn't be bothered.)
And then we can store them somewhere. We might as well use Postgres, because a database is a good idea. We could put them in a text file, but that really doesn't scale; I could keep them in variables, which scales even less. A database is the obvious solution, so choose your favourite database. The process isn't perfect. It may not find all the faces, and then I wouldn't necessarily find all the photos with me in them. It may also find things that aren't faces, like these two things here. But since those should hopefully not look like my face (although those of you who saw Mart's lightning talk know about people who look like hands), it's probably good enough. We get arrays of 768 floating-point numbers from the model I'm using. That number is particular to the model: if we're using a model that produces 768 floating-point numbers, we must always work with vectors of that same size. We can't compare vector embeddings from different sources, because they'll have different weights, they'll have been produced for different reasons, and they'll typically have different lengths. 768 sounds like a lot, but it's actually relatively small, so we're dealing with simple data here. Assuming we've done all that and got all the faces into our database, we need to look for photos with my face in them.
So we would do the same thing: we take a picture of my face, we look for the faces in it (we hopefully only find one), we calculate the embedding, and then we put that into our SELECT statement and hopefully find all the answers. Now, being a programmer, I wrote code to do this, because the tutorial was just doing things at the command line or in a Jupyter notebook; I wanted programs. For the requirements, I was using click to handle the command line. To be honest, nowadays I'd probably go for Larry Hastings' Appeal, because that's more fun. psycopg2 is a beautiful tool for talking to Postgres. (I'd be very interested if anyone out there is using psycopg3; it's been out for a while, but it doesn't seem to have much market penetration, and I'd like to hear about that.) opencv-python is a wrapper around the OpenCV computer vision library, which makes it easy to do vision-related things. And imgbeddings is a lovely library that wraps up calculating embeddings from an image.
And this is where I look at my crib sheet: imgbeddings wraps OpenAI's CLIP model using Hugging Face transformers. It's known to have issues, and it's not for production use, but it's very convenient and easy to use. We also need to download a pretrained model to recognize faces, which you can get from the OpenCV repository.
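Minimal usage of imgbeddings looks something like this (a sketch based on the library's documented interface; the file name is made up):

```python
# Calculate a 768-dimensional embedding for one image.
from PIL import Image
from imgbeddings import imgbeddings

image = Image.open("my_face.png")          # hypothetical file
ibed = imgbeddings()                       # downloads the CLIP model on first use
embedding = ibed.to_embeddings(image)[0]   # numpy array of 768 floats
print(len(embedding))                      # 768
```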
That pretrained model, by the way, is an XML file. Now, by default Postgres doesn't necessarily have pgvector installed. The pgvector extension teaches Postgres how to handle vector embeddings, extending its existing array functionality. Aiven for PostgreSQL, from the company I work for, does have pgvector installed: an engineer added it last year, almost as soon as pgvector came out. One afternoon, just for the fun of it, he installed pgvector and told the company, and we all went, ooh, we're going to play with that. So he was very popular. If you haven't got it installed in your Postgres, the pgvector README on GitHub
explains how to do it. Once you've got it installed, you need to enable it, which is just CREATE EXTENSION vector. Then we need to create a database table. I've got a very simple table called pictures here: the key is a string representing the name of the face, then there's the file name I found the face in, and then the embedding is the vector for that face. So it's a very simple table. Looking at our first problem, then: we're finding the faces, calculating the embeddings, and storing them in Postgres. I have a program unimaginatively called find_faces_store_embeddings; the table it assumes is sketched below.
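The setup might look something like this (a sketch following the description above; the exact DDL in the talk's repository may differ):

```sql
-- Enable pgvector, then create a table with one row per detected face.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE pictures (
    face_key  text PRIMARY KEY,   -- unique name for this face
    filename  text NOT NULL,      -- the photo the face was found in
    embedding vector(768)         -- imgbeddings output for the cropped face
);
```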
Naming is not my thing. It'll take a Postgres URI as an argument, or actually it will use the PG_SERVICE_URI environment variable, which means I don't have to keep typing the URI at the command line. It would also mean that if I were going to demonstrate this live, which I'm not, I wouldn't have to keep showing you my URI on the screen, and my security people wouldn't grumble at me.
The main is relatively simple. We've imported the cv2 (OpenCV) package. We take in a tuple of image files and the URI for Postgres. We call load_algorithm, which I'll talk about in a moment, to load the Haar cascade algorithm, and we create our imgbeddings binding. Then, for each image file (and please excuse me: we are making a single connection to Postgres for every single file; do not do this at home, it is not production-ready code, it's just really quick and easy, and you should do proper connection management and batch things up, and there are other tools to do this), I connect to Postgres, read the image, find the faces, and write them to Postgres. As for load_algorithm: it takes the XML file and reads it in using the cascade classifier. I check that that worked, because, I'm embarrassed to say, when I got the file name wrong it fell over in an obscure way. And it returns the classifier.
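A sketch of that outer structure (my reconstruction, not the talk's exact code; find_faces and write_to_postgres are sketched further down):

```python
# find_faces_store_embeddings: outline of the outer program structure.
# One Postgres connection per file, as apologized for above -- not for production.
import os
import sys

import cv2
import psycopg2
from imgbeddings import imgbeddings

HAAR_XML = "haarcascade_frontalface_default.xml"  # from the OpenCV repository

def load_algorithm(xml_file=HAAR_XML):
    """Load the pretrained Haar cascade face detector, checking that it worked."""
    classifier = cv2.CascadeClassifier(xml_file)
    if classifier.empty():  # a wrong file name otherwise fails obscurely later
        sys.exit(f"Could not load cascade from {xml_file}")
    return classifier

def main(image_files):
    pg_uri = os.environ["PG_SERVICE_URI"]
    classifier = load_algorithm()
    ibed = imgbeddings()
    for image_file in image_files:
        with psycopg2.connect(pg_uri) as conn:  # one connection per file!
            image = cv2.imread(image_file, cv2.IMREAD_GRAYSCALE)
            faces = find_faces(classifier, image)
            write_to_postgres(conn, image_file, image, faces, ibed)
```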
imread is a convenient function in the OpenCV library; I'm reading the image in as grayscale (you ask for that explicitly; reading in colour is actually the default). find_faces then just calls the Haar cascade's detectMultiScale function: you give it the image, and then you give it some magic parameters. These you are expected to play with to get results that are good for you. The original tutorial had a minimum size of 100 by 100; I found 250 by 250 worked better for me. I do not necessarily understand what these parameters do: they're the things you play with and see where it goes. And then, once we've found the faces, we pass them into our write_to_postgres function.
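A sketch of find_faces (the scaleFactor and minNeighbors values here are typical examples, not necessarily the talk's; the minimum size is the one mentioned above):

```python
def find_faces(classifier, grayscale_image):
    """Return a list of (x, y, width, height) rectangles, one per detected face."""
    return classifier.detectMultiScale(
        grayscale_image,
        scaleFactor=1.05,     # how much the search window scales between passes
        minNeighbors=5,       # how many overlapping hits confirm a face
        minSize=(250, 250),   # the talk's value; the tutorial used (100, 100)
    )
```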
So we're passing in an array of faces. I grab the stem from the file name, and then, for the x, y location and the width and height of each face found, I crop the face out of the image; that's the thing I want to find the embedding for. So I calculate the embedding for the cropped image of that individual face, then calculate a nice key made from the base file name and the location of the face, so that it's unique for each face, and then I write all of that to Postgres. And write_to_postgres does the traditional SQL you'd expect: it inserts those values into the pictures table. The ON CONFLICT is nice because this is play code, experimental code. What ON CONFLICT says is: if the face key already exists, just overwrite the old values with the new ones; EXCLUDED is the pseudo-table holding the values I've just passed in. That's really useful, because it means I can run my program more than once without having to drop the table.
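Sketched out, continuing the reconstruction above (the key format and the exact SQL are my assumptions based on the description):

```python
from pathlib import Path

from PIL import Image  # imgbeddings wants PIL images; OpenCV gives numpy arrays

def write_to_postgres(conn, image_file, image, faces, ibed):
    """Crop each face, embed it, and upsert it into the pictures table."""
    stem = Path(image_file).stem
    with conn.cursor() as cursor:
        for (x, y, w, h) in faces:
            cropped = image[y:y + h, x:x + w]  # numpy slice of the face region
            embedding = ibed.to_embeddings(Image.fromarray(cropped))[0]
            face_key = f"{stem}-{x}-{y}"       # unique per face in this photo
            vector_str = "[" + ",".join(str(n) for n in embedding) + "]"
            cursor.execute(
                """INSERT INTO pictures (face_key, filename, embedding)
                   VALUES (%s, %s, %s)
                   ON CONFLICT (face_key) DO UPDATE
                       SET filename = EXCLUDED.filename,
                           embedding = EXCLUDED.embedding""",
                (face_key, image_file, vector_str),
            )
    conn.commit()
```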
So program two is the thing that finds me, the important bit of this talk. It's called find_nearby_faces, because of course it is. The arguments are very similar, but it also takes an integer that says how many matches I want back. Typically people ask for five matches for this sort of thing; I haven't quite got that message, so I tend to ask for 10, which is way too many, but it worked reasonably well. The main is a bit similar: we load our algorithm, we set up imgbeddings, and we calculate our reference embedding, that is, the embedding for my face (I'll talk about the vector string we put in, in a moment). And then we ask Postgres for the results. The calculate_reference_embedding function is familiar.
We read the file in and use the same find_faces to find the faces in that file. We desperately hope there's only one, because it would be a mistake if there were nought or three. We get out the cropped image, calculate the embedding for it, and return that, in much the same way as we did before. Now, in the real code I didn't actually have a comment saying "we hope there's only one": I checked, because although I'm writing play code, I couldn't resist having actual error messages. The SQL statement that we're using wants the embedding represented as an array of floating-point numbers in square brackets, separated by commas.
Apologies for that; this is just the Python to convert the embedding to that form using an f-string. Then find_nearby_faces (and this is a place where it's reasonable to have one connection for each query, I think) just executes the SQL we had at the beginning: we select the file name from the pictures table, ordered by the distance between each stored embedding and our reference embedding, with a limit of n. That funny triangle-bracket thing in the ORDER BY I'll explain in a moment; and then we print out the results, which is to say the file names, basically.
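A sketch of that (a reconstruction; the vector-to-string conversion matches the format just described):

```python
import psycopg2

def find_nearby_faces(pg_uri, embedding, n=10):
    """Print the file names of the n stored faces closest to `embedding`."""
    vector_str = "[" + ",".join(str(x) for x in embedding) + "]"
    with psycopg2.connect(pg_uri) as conn:
        with conn.cursor() as cursor:
            cursor.execute(
                "SELECT filename FROM pictures"
                " ORDER BY embedding <-> %s LIMIT %s",
                (vector_str, n),
            )
            for (filename,) in cursor.fetchall():
                print(filename)
```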
So that SQL operator, the left angle bracket, hyphen, right angle bracket (<->), is one of the available comparison operators, and that one finds the straight Euclidean distance between the two vectors. pgvector provides four different distance operators; I'm not going to go through them. The <+> one was added just recently, in 0.7.0, so they're still extending the set. So how good is it? Well, here's our Wednesday at crab week (our logo is a cuddly crab, so of course it's crab week). There were 779 photos from the Wednesday, and it found 5,006 faces; I don't know how many of them were real. When I went through them manually, I found 25 with me in them. Some were in a crowd or a bit obscured. Three were of my back; I'm not expecting it to recognize those. And two were with a false moustache.
That was a thing; a silly and wonderful thing. So, when I ran the program: the first time, last year, it took 21 minutes; when I ran it last week, it took 11 minutes. That's to calculate and store the embeddings. It's not fast, but this is just for play; I don't care. Finding the ten nearest faces only took three seconds, which for play code felt reasonably fast. Of the first ten matches, the first nine of them were me, and, amusingly, the tenth was someone else with beard and glasses. Remember I said these classifiers were biased? I don't know what the biases are; I suspect it's going by texture.
It's a beard. The first match was me in oratorical mode. Another one is me in a group; I'm not very good at obscuring faces, so I've done big white blobs. On the Thursday we had another 574 photos and 3,500 faces. I only found seven that had my face visible, and in four of them I had dark glasses (these glasses go dark in the sunshine). This time only three of the first ten matches were me. That was actually quite sensible, I think, but the others were, in general, people with beard and glasses; I'm spotting a pattern here. The very first one I was quite pleased by, because it's me looking downwards and from sideways, so I would not necessarily have expected it to recognize that. And this is the picture from the tutorial; I was so pleased and relieved that it found the reference picture. So, was this a success? Yes. I learnt a lot.
I didn't spend a lot of time on it, and I got not-awful results; I'm British, and "not awful" means "very good". And I know what I would carry on to do. I'd definitely improve the programs if I were going to play more. I would add switches to let me change those face-detecting parameters on the fly, and I'd store the results from different parameters in different tables so I could use them separately. I'd add a switch to make the reference-face calculation a one-off, because I don't need to keep recalculating my face, and for finding the faces, I'd add switches to make it use all of those things. And I'd also sort out that nasty business of using Postgres inefficiently, because that really hurts my soul. Now, in many ways this is actually the most important part of the talk, I think: why Postgres? And why is Postgres surprising? Well, we expect Python to be a good fit for exploring machine learning, but we don't have a good reason for that except history.
Python lucked into being the language that people use. It's partly because it's easy to embed things in it, partly because it's very easy to use for people who don't care about programming; a lot of it is just luck. It could have been Ruby; it could have been all sorts of other things. So why Postgres? Postgres is not a dedicated vector database, and naively you might think you'd want a database designed to do just this one thing.
designed to do just this thing. Well, depending on the metaphor you like, Postgres is either your favorite Swiss Army knife or your favorite hammer. It depends what mood I'm in, whether I want to hit things or unscrew things. Remember, I'm British. It's significantly better than nothing.
That's faint praise for it's pretty useful. There comes a point when you need to store your embeddings. I mean, the first few you calculate, you get these long strings of numbers out, you put them somewhere, store them in a, you copy them. But 768 numbers, a database is a good place to put things. And Postgres is generally a good database to start with if you have no preference.
We probably have it running already, and if you don't, it's really easy to run locally. And also, I happen to work for a company that has it running in the cloud, and you can try it for free. It can SQL all the things: you don't necessarily only want to do vector queries. I want to find the things that are like this (that's a vector query) but are in stock. I want to find the pictures of me that were taken in Portugal between these dates: the pictures-of-me part is vector search, but the other two conditions are straight SQL, and so on. Hybrid queries like that are quite important.
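A hybrid query might look something like this (a sketch: the taken_at column and the dates are invented for illustration, and the vector literal is elided as before):

```sql
-- Vector similarity combined with ordinary SQL conditions in one query.
SELECT filename
FROM pictures
WHERE taken_at BETWEEN '2024-07-08' AND '2024-07-12'  -- hypothetical metadata column
ORDER BY embedding <-> '[...]'                        -- reference face embedding
LIMIT 10;
```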
You can use the standard Postgres optimization techniques you've already learned (EXPLAIN ANALYZE works on these queries), because the whole thing is just SQL; everything you've learned about optimizing your use of Postgres will continue to work. Indexing is essential for anything large-scale,
and pgvector comes with two different types of index at the moment; I suspect more will get added. I don't have time to go into them here: the extended notes have the material about indexing that I cut from the talk (it was 10 minutes; it was too long). But there are two alternatives, they have pros and cons, they're both quite good, and they're both better than nothing.
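(For reference, the two index types pgvector currently provides are IVFFlat and HNSW; a minimal sketch of creating one on the table from earlier:)

```sql
-- HNSW index on the embedding column, for Euclidean-distance queries.
CREATE INDEX ON pictures USING hnsw (embedding vector_l2_ops);

-- The IVFFlat alternative (build it after the table has data in it):
-- CREATE INDEX ON pictures USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
```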
And as Python programmers, we should recognize this recurring pattern: we start in Python, and hopefully we continue with it, and if it turns out not to be suitable, we move to something else once we know why it's not suitable. Postgres, I think, is a similar case. It's easy to start in Postgres, and you can probably do the thing you want; when you realize you can't, move to something else that is more suitable. You have to remember to be prepared to make that move with both of them, so never assume you're at the final solution. When should you not use Postgres? Well, clearly, when it can't cope, and when it doesn't actually do what you want. So: there are some limitations in pgvector.
We can't have more than 16,000 dimensions in a vector. That sounds like a lot, but machine learning gets big. There's also the sparsevec type, added in 0.7, which allows up to 16,000 non-zero elements; that's subtly different. And vectors can be too big to index: the indexes only go up to 2,000 dimensions at the moment. There are techniques we all know about for indexes, sparse indexing and so on, to get around that, but again, that's more complexity. Now, a Postgres table is apparently limited to 32 terabytes, and you can have a partitioned table made up of thousands of those, so I don't know that you would ever hit that limit,
but it's mentioned in the pgvector frequently asked questions, so someone is worried about it. And obviously Postgres is a general-purpose database, so it might not be fast enough; but decide that after profiling, just in case it wasn't the vector-embedding bit that was slow, but your SQL queries that weren't great. Another reason is when you need a distance function that's missing. We've got four now; one was added recently with 0.7, and I'm sure more will come, so keep checking back if you haven't got the one you need yet. I do not give advice on distance functions; I've got a colleague who can talk about them sensibly.
There's a lot of literature out there about how to choose which distance function you want for which purpose; I don't understand most of it. And sometimes you actually don't want a traditional relational database. I've got Mark sitting in the front, who works for MongoDB, so he would definitely argue that case: there are other useful things. We at Aiven also have OpenSearch, which is a document store with built-in indexing, very powerful, and it has vector search. ClickHouse, who are sponsors of this conference (you should talk to them), have a columnar database, which is a very powerful approach for analytics and so on. And Dragonfly, a key-value store using the Redis protocol, has vector search built in as well. So you can get your vector search in other places, and there are many more. And of course there are currently lots of dedicated vector databases out there, betting that a vector database is a new kind of thing worth having separately. We will see: that has worked for other kinds of database in the past. Database technology is an interesting historical game.
The future, however, looks bright. Last year, Jonathan Katz, who's a Postgres expert, wrote an article on how vectors are the new JSON in Postgres. I remember when JSON support was added to Postgres: it was slow and clunky. Then they added JSONB, which gives you binary storage for JSON, and everything got a lot faster; nowadays you just use JSON in Postgres and no one worries about it. I also remember when large blob support was added: I want to store a big chunk of data with no structure to it. That was new, and for a while it was strange and risky. Now TOAST, The Oversized-Attribute Storage Technique (I love that name), is just standard. This year Jonathan posted a follow-up article, and for one type of indexing he saw a 150-times speedup over last year. Now, he's very careful to say this is a benchmark of one particular thing, and that's the most extreme number, but that's pretty impressive in a year. So this is going to get faster in Postgres.
It's going to get more support; it's going to get bigger support. And that's it. Postgres, the word, the Slonik elephant logo and so on are lovely trademarks of the PostgreSQL Community Association. That QR code will lead you to a free trial of our services, if you want to play with them in the cloud. It gives you a free trial of all of the services, but there's also free use of Postgres, MySQL, or Redis (and, in due course, Valkey, because of the trademark issue with Redis); those are the free plans. And the slides and accompanying material are on GitHub if you want to look at them later.
My notes are longer and have a lot of what I said, plus some bits I didn't say. Thank you very much. [Host] Thank you very much for that. Do we have any questions? Feel free to stand in front of the microphone. Fantastic. [Audience member] Hi. Last week I got married, and I actually had the same idea as you had,
but I didn't want to find only my pictures, because there were a lot of guests; I wanted to create a way for each guest to have access to their own pictures separately. So there are some things that I did differently from you. Basically, the first one was not using a Haar cascade because, to be honest, it wasn't good for my purposes. I used MTCNN, which is a neural network that extracts the faces, and then another neural network to create the embeddings. I stored those in a vector database, and then I used DBSCAN to create clusters of faces, so that each person can see their own face. [Tibs] Oh, nice. Yeah.
So this is the "where do you go next" kind of lesson, as you learn more about this. This was very much my "I do not understand machine learning, what is the first baby step I can take?", and you've got a lot more sophisticated. I was also, of course, frustrated to find that, since I'm on Macs, the Photos app on Macs will do this for me, I believe. But it was still more fun to do it myself. [Host] Okay, we're actually running a little bit short of time, so I think we'll just take one last question; I'm sure you'll be happy to answer questions in the corridor afterwards. [Audience member] Okay. Thank you for your talk. I did a similar experiment with Django and psycopg3. [Host] Sorry, can you lean closer to the mic? [Audience member] Yes. Thank you.
I did a similar experiment with Django and psycopg3 that worked very well, and I gave a talk about it at another conference, and I was very fascinated by your application for finding images. What I was wondering was: have you thought about having a large number of pictures from a great range of years, so that your face from maybe 30 years ago and your face now can be different, for recognition purposes? [Tibs] I have not.
My suspicion is that, well, certainly the very naive recognizer we're using would probably not work very well. Since people can write software that will age you or de-age you, you could presumably take that into account and do a sort of stochastic sampling, and look for things that match any of those samples. That would then give interesting mismatches, of someone who is around now but looks like you from 10 years ago, and that might be fun in itself. So, yeah, I don't know. I should also mention I have stickers, cuddly crab stickers, if you want them; come to me afterwards for those. Thank you. [Host] Okay, thank you.
Can we have a big round of applause for Tibs?