Deconstructing the text embedding models
Formal Metadata
Title | Deconstructing the text embedding models
Title of Series | EuroPython 2024
Number of Parts | 131
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/69422 (DOI)
Transcript: English (auto-generated)
00:04
OK, let me just do this small introduction here. So I'm mainly working as a senior developer advocate at Qdrant. And we are also building another open source vector database. We focus a lot on performance and try to make it as fast as possible,
00:21
incorporating some different search modes. I'm not going to get much into details because I've noticed that many of you have already used some sort of vector database, which are ultimately rather search engines. But it doesn't matter a lot. But since you are somehow experienced with using vector
00:41
databases, I also think that you know the concept of the text embedding models. But as a regular user, you may have never really thought about what the internals are and what the process that happens under the hood actually is when you just pass textual input. Back in the day, text embedding models
01:02
were rather a niche that was used by some information retrieval engineers to improve the effectiveness of their search pipelines. That changed a lot when OpenAI introduced ChatGPT, and we realized that it might be actually pretty useful
01:21
to use them to add this semantic search layer to reduce the risk of those models hallucinating and actually to incorporate your own private data, so the large language model can rely on it instead of just using the training data that it has seen in its training phase.
01:42
And as I said, a regular user rarely cares about the way their embeddings are created. This is, however, quite useful if you really care about the quality. Because if you don't understand the process, you can never really improve it at all.
02:00
So generally speaking, the idea behind the embedding models is that given a specific text, it will produce a single vector. So traditional Czech cuisine would be hopefully represented by a single vector. And the most useful property of these vectors is that if you have two different texts that
02:21
represent a similar idea or concept, they should be close to each other in that semantic space. So vepřo knedlo zelo (I guess we cannot get any closer to traditional Czech cuisine than this) should have a similar representation if you just see the individual dimensions here.
02:42
You should see that they are close to each other. They are never identical. But still, if we calculate the cosine similarity or a different metric, the similarities should be pretty high because they actually represent something that is similar in the real world.
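A minimal sketch of that property, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model discussed later in the talk; the example sentences are illustrative and the exact scores will differ:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two different texts that describe a similar concept.
embeddings = model.encode([
    "Vepřo knedlo zelo is a classic of traditional Czech cuisine",
    "Roast pork with dumplings and cabbage is a typical Czech dish",
])

print(embeddings.shape)                             # (2, 384): one vector per text
print(util.cos_sim(embeddings[0], embeddings[1]))   # high, but never exactly 1.0
```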
03:00
And the majority of the embedding models, actually all of the ones that we use nowadays, are based on the transformer model. This diagram is probably the most popular diagram that we have seen in so many presentations that I'm not going to get into the details. But it comes from the Attention is All You Need paper.
03:21
And it presents a general overview of the transformer architecture. But in case of the embedding models, we do not care as much about the right-hand side of that picture because the embedding models are usually encoder-only architectures. They do not predict any kind of output probabilities. So they do not predict the next token in the sequence
03:43
contrary to what LLMs do. But they just take some input and produce vectors, which have these useful properties of being similar to each other. And if we have a closer look at the left-hand side
04:01
of this whole diagram, we can clearly see that everything starts from the inputs, which are not well-defined here. Then those inputs are converted into input embeddings, enriched with positional encodings and then processed by n stacked modules armed with the attention mechanism.
04:22
This is a very generic overview. And it doesn't say much about what the inputs or outputs look like. However, if we consider some examples, you should gather some sort of intuition on how things work under the hood.
04:40
So in order to visualize that a little bit more, I selected one of the available sentence transformers. all-MiniLM-L6-v2 has actually been mentioned today already in one of the previous talks. And that's not the best embedding model available out there.
05:00
But it's quite commonly used for multiple reasons, mainly because it's quite small and might be launched even on a CPU. So even for those who are still GPU poor, like me, this is a fairly reasonable choice. And also, this is an open source tool. So you can easily see all the steps
05:21
and hopefully improve it, fine-tune it on your own data. It also has a reasonably small amount of parameters. That's also why it is so small, because it requires around 30 megabytes of memory, which is enormously small if you compare it even to the smallest LLM available.
05:42
And it works surprisingly well for English data. And today, I'm just going to dive into the internals of it. So one thing that this transformer diagram didn't mention is actually the tokenizer, which is rather a separate component, separate from the transformer itself.
06:02
And the tokenizer defines some sort of contract between the input data and the model itself. And in case of text, the tokenizer is responsible for cutting a long sequence of words into pieces. So these unclear inputs of the model
06:22
are not like whole text, because those are not end-to-end models, but rather sequences of identifiers our tokenizer assigned to each token in that original data. And it's still unclear how this tokenization works,
06:42
because there is no single algorithm available out there. The tokenizer could possibly split our string into individual letters, and there are some attempts to do that. On the other hand, we could possibly split the text just into words, but the usual choice
07:02
for the tokenizer is some sort of sub-word level tokenization, and algorithms such as byte pair encoding, word piece, or unigram are quite commonly used for that. And the main idea behind all of these algorithms is that those are also trainable components.
07:22
So if you are about to train your own transformer for the text embeddings, but also an LLM, you should also train your own tokenizer. And contrary to the transformer models, the tokenizer is not based on any kind of stochastic method. It's rather some sort of statistics
07:42
over the training data. So the tokenizer just selects the best sub-word tokens with a fixed vocabulary size. So let's say we expect the tokenizer to produce a vocabulary with 30,000 tokens, and that's actually the only parameter
08:01
that we have control over. And in the case of this all-MiniLM-L6-v2 model, WordPiece was chosen to be the algorithm for the tokenization. It uses word or sub-word level tokens, which is worth noting here. And if we take this original example,
08:23
this is how this sentence is going to be split into tokens. So each word will get an individual token assigned to it. Obviously, there is some conversion happening, some normalization, like lowercasing applied by default, so the case makes no difference.
08:40
And actually, that's true for the majority of embedding models. Usually, the letters are just converted to lowercase. But apparently, there is a huge difference if you have a look at LLMs, because in LLMs we care about distinguishing between upper and lowercase. Still, those tokens are not directly passed
09:02
to the model yet. However, if we take the other example we had, this time it will be converted into twice as many tokens, even though the length in terms of the number of words is not that different. It's actually identical. That's just because this model from sentence transformers
09:22
was probably never trained on Czech data. And that's actually one of the reasons we struggle with supporting multiple languages if our tokenizer is bad. And that's also an issue that you can see with LLMs,
09:41
that they work pretty well for English data and then struggle with some other languages. The length of that sequence also matters a lot. And the first step that the transformer model actually does is a lookup table, so nothing really fancy,
10:01
because the tokenizer would produce a sequence of numerical identifiers. So each token from the vocabulary has a specific integer behind it that uniquely represents a given word or subword in that sequence.
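A minimal sketch of this step, assuming the Hugging Face transformers tokenizer for the same checkpoint; the sample sentence is illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

text = "Text embedding models produce a single vector per text"
print(tokenizer.tokenize(text))   # lowercased word / sub-word pieces
print(tokenizer.encode(text))     # the integer IDs the transformer actually receives
```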
10:20
So what the tokenizer passes to the transformer model is actually a sequence of integers. And those integers are used as indexes in the first layer of the transformer, in the first module of the transformer architecture, because each ID has a corresponding input token embedding, which is also a trainable parameter.
10:42
Those embeddings are trained in the whole training pipeline of the transformer models, and then they already should capture some meaning of each individual token. So we have a lookup table, nothing more fancy, but that actually is already a good start, because that brings some sort of understanding
11:02
of each individual token in the sequence. But those input token embeddings are actually more similar to word2vec modeling, because there is just a single vector per token, so there is not much understanding of the whole sequence; it's rather a simple mapping exercise.
11:22
But it is also quite an interesting exercise to inspect what those input token embeddings look like, even in a very simple way. So since there is like a one-to-one mapping between the identifier and the token embedding,
11:41
we can easily visualize them, because the context doesn't matter; the token embedding would never change. And before we do that, we also need to make this small distinction between the different types of tokens we can encounter in that process. So our sentence transformer
12:02
roughly divides the tokens into three groups, but we can also split them more. So basically we have some special tokens, like unknown, which is used as some sort of fallback. So if we encounter a character that has never been seen in the training data of the tokenizer, it should be converted into the unknown token,
12:22
just because we don't wanna break things if we just see some new data coming in. There are also some whole words, so obviously those are the most interesting ones, because they should already have some meaning. Subwords, which are marked with this double hash prefix, so those are used when we encounter a word
12:43
that doesn't have a specific token embedding assigned to it. So then we just cut a single word into multiple pieces and use this sub-word level tokenization. But there are also numbers, some special characters, and a bunch of unused tokens that actually I'm not really sure about,
13:01
because I don't see a clear purpose of using them in this sentence transformer, but I will speak about it later on. And now, if we just inspect those input token embeddings, we can, for example, ask what the closest embeddings to Python are.
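A sketch of that kind of inspection, assuming the transformers library; it pulls the input embedding table out of the model and ranks the vocabulary by cosine similarity to the "python" token:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The input token embeddings are a plain lookup table: one row per vocabulary entry.
weights = model.get_input_embeddings().weight.detach()          # (vocab_size, 384)

query = weights[tokenizer.convert_tokens_to_ids("python")]

# Cosine similarity of "python" against every row of the table.
scores = torch.nn.functional.cosine_similarity(query.unsqueeze(0), weights)
top_ids = torch.topk(scores, k=10).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```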
13:21
As you may see, the vocabulary is fixed, so we can just compare the distances here and find those closest neighbors, and in the case of Python, we can clearly see that the model itself was trained on some code, but it also didn't skip biology lessons, because the closest neighbors are
13:40
coming from those two domains. So we have ##ango, which probably comes from Django, we have import, pip, and also some other animals as well. So it's pretty interesting. It doesn't give that much insight into the model
14:02
and its effectiveness on the specific dataset, but still it's a simple idea to just check what are the synonyms of a particular word and how this model works with a very specific terminology. And things get a little bit more interesting
14:22
if we just perform t-SNE dimensionality reduction to compress every single 384-dimensional vector into two-dimensional space. And if we do that, we can clearly see how those tokens are just grouped together. If there are any clusters,
14:41
obviously we are just doing eyeballing, which is not the best technique that exists, but we just want to have some sort of intuition. We don't wanna understand every single token and its closest neighbors. And there are some interesting outcomes if you just look at the plot.
15:02
And here we also used different colors to differentiate those different groups. So we have green points representing full words, we have sub-word tokens represented with blue, and red points are just some special tokens specific to that model.
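A rough sketch of how such a plot could be produced, assuming scikit-learn's t-SNE implementation; running it over a full BERT-style vocabulary takes a while:

```python
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Compress every 384-dimensional input token embedding down to two dimensions.
weights = model.get_input_embeddings().weight.detach().numpy()   # (vocab_size, 384)
coords = TSNE(n_components=2).fit_transform(weights)             # (vocab_size, 2)

# Group the tokens roughly as in the talk: special/unused, sub-word, whole word.
tokens = tokenizer.convert_ids_to_tokens(list(range(len(tokenizer))))
groups = [
    "special" if t.startswith("[") else "subword" if t.startswith("##") else "word"
    for t in tokens
]
```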
15:21
And as you can already see, there are some groups, some sub-areas of that plot in which majority of points belong to a specific group. For example, there is this huge cloud of red points at the very bottom of this plot, but there are also some interesting shapes around.
15:41
So first of all, those special tokens are actually some unused tokens. You need to believe me, but I'm just going to publish that as an interactive visualization pretty soon. And surprisingly, those unused tokens are close to non-English characters. So those blue dots around those red ones
16:04
are actually some Japanese characters and actually lots of characters that are just coming from different languages as well. And this is quite interesting because these unused tokens have never been seen in the training data.
16:22
So they are just randomly initialized and should be really far away from the real data that we've seen during the training of the model. However, these non-English characters may indicate that they were just seen
16:42
during the training of the tokenizer but not in the training of the model. So those two processes were probably separate from each other. And I would not expect this model to work that well for non-English data if non-English characters are just close to unused tokens in that space
17:01
because there won't be any meaningful representation assigned to them. And if we just concentrate on some other subgroups of that plot, numbers are clustered with each other. So there are some pretty interesting shapes created based on the numbers
17:21
and they are separated from the rest of the data. However, there is also a specific subgroup for years. And it's also good to know that those are mainly years between the 17th and the 21st century. So if you have anything from before that period,
17:43
then it probably won't be represented here. Also, inspecting this plot helps to find some outliers. For example, the Domesday token is really close to the token of 1086,
18:04
which is a year, and that might be a surprising fact. However, if you just try to find the reason why it could happen: Domesday is a manuscript record of the great survey of much of England and parts of Wales completed in 1086. So probably this Wikipedia paragraph
18:22
was present in the training data of this model and there was no other context in which this particular word could have appeared. And the other nearest neighbors are Neolithic, 1757, and some other years. But this time I just didn't verify the sources.
18:42
Maybe there are some other great surveys completed in these years. However, those input embeddings are just context independent. So no matter the order of the tokens in the sequence, we can just perform any permutation of the input words. They will always get the same token embedding as an input.
19:04
So this is somehow similar to Word2Vec approach which was an exciting topic, but probably like 10 years ago. But we can also inspect it further and if you just check every single subarea of that chart,
19:22
you can see that there is a subarea in which we have people's names, country names, actually in general some proper names. And this is actually quite intuitive because if you have a sentence, the meaning of that sentence shouldn't change that much
19:43
if you just swap the name of a person, unless we speak about a public figure. But that actually means that the relationship between the names was learned properly. And that the whole process would already start with some sort of meaningful token embeddings
20:01
because they will capture some meaning of the words used and the model will then just incorporate some additional information. So those input token embeddings are supposed to carry the meaning of words or subwords and they are all context free which is actually the reason why we could visualize it.
20:24
And the next step of the model is our positional encodings and those are actually capturing the order of tokens in the sequence. That was not captured by the token embeddings because they were just context free.
20:41
But here we usually use some sort of sine function in order to let the model know that the fourth token in the sequence is closer to the fifth one than to the 10th. So this is a pretty simple way of encoding that, and positional encodings are just selected based on some trigonometric function.
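A minimal sketch of the fixed sinusoidal encodings from the Attention Is All You Need paper, which is the scheme the talk refers to; written with PyTorch purely for illustration:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed (non-trainable) positional encodings based on sine and cosine waves."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)     # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )
    encodings = torch.zeros(seq_len, dim)
    encodings[:, 0::2] = torch.sin(positions * div_term)   # even dimensions
    encodings[:, 1::2] = torch.cos(positions * div_term)   # odd dimensions
    return encodings

# Nearby positions get similar encodings, so position 4 ends up closer to 5 than to 10.
pe = sinusoidal_positional_encoding(seq_len=16, dim=384)
```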
21:04
And they are not trainable parameters so they are just fixed during the whole training process. We just do this kind of projection in order to encode that information. And what happens next is like a sequence
21:20
of stacked modules with attention mechanisms. So after all these lookup tables we used so far, our sequence of input token embeddings enriched with positional encodings is now passed through this model. So this model uses the attention mechanism at various steps and tries to find
21:44
inter-token relationships. So our initially context-free token embeddings become context-aware embeddings because now we care about the sequence itself.
22:01
So ultimately each of these modules (the sentence transformer I chose actually has six of them) takes a sequence of embeddings and produces a sequence of embeddings of the same size. So this is just some adaptation of the original inputs,
22:21
but thanks to this attention mechanism we now take into account the different tokens in that sequence, and that is another way of understanding the meaning of our data. Initially we start with this token-level meaning,
22:41
because every token has a certain meaning assigned, and then the attention can also modify it to put more emphasis on a specific part of the input data. And the way to make it even more powerful is just to stack another layer on top of it,
23:00
because we can easily increase the number of layers. If you happen to build your own model at some point, you can easily do it, and the more parameters, the better the quality of that model should be. And the last missing piece is actually how we take a sequence of embeddings, this time context-aware, to create a single vector per whole text.
23:22
And this is actually pretty simple. There are various ways to achieve that, and the easiest one, which is used by this particular model, is just to take an average of all the output embeddings it generated. This is called pooling, and as mentioned there are various ways to do it, but this one is quite commonly used.
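A minimal sketch of that mean pooling step with the transformers library, assuming the same checkpoint; this mirrors the commonly published recipe rather than any exact internal code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

encoded = tokenizer("Deconstructing the text embedding models", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state    # (1, seq_len, 384)

# Mean pooling: average the context-aware token embeddings, ignoring padding.
mask = encoded["attention_mask"].unsqueeze(-1)                # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                               # (1, 384)
```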
23:41
And the tokenization itself has certain problems. Obviously, if we are not able to represent a single word by a single token, then we need to cut it into pieces, and if you cut every single word in your sequence
24:02
into multiple pieces there will be obviously no meaning assigned to each of them. Each sub-word token probably occurs in multiple contexts so obviously similarly to Word2vec there is no clear meaning for all of them. That makes a lot of sense for word-level tokens
24:20
but for sub-word-level ones we lose that information. So the model has to learn the relationship of the whole sequence of tokens inside of its parameters, so the attention has to take care of it. And a lot of issues related to the embeddings tie back to the tokenization process.
24:41
If you haven't seen this great video by Andrej Karpathy, he implemented the GPT-2 tokenizer from scratch. This is a great exercise to do, and he also points out a lot of issues that might be related to the tokenization. I really encourage you to have a look. And for example, some of the issues may arise
25:03
from just a different language used in the training data compared to the data you would be working with. For example, there are some specific subsets of Unicode that can help you to use bold or italic letters on a social media platform even though
25:21
they explicitly do not allow you to use that. And if you do not perform some normalization it will be all gone, because those Unicode characters were never represented in the data, and there are more and more examples like this. And also, if you use any sort of proprietary tool
25:41
like OpenAI or Cohere, you are probably paying for tokens so you really care if your data is just cut into millions of tokens every single time you pass a longer document because obviously you will be charged for that. And we'll just go through some common misunderstandings
26:01
when it comes to embeddings and tokenization. And hopefully I will also advocate for using some approaches of how to solve that. By the way, when I started using embeddings a few years ago, I thought they were really, like, error-proof, and I was being told
26:22
that they magically solve all the problems of search, like typos, multilinguality, et cetera. But actually it turned out not to be true, especially for the simpler models available out there. For example, if I wanted to analyze social media data,
26:44
I would probably have lots of texts using emojis, like Prague is really sunny in July. That would probably be the emotion of the original writer of that message. However, if you just have a look at the tokens generated by our model,
27:02
this particular emoji is translated to the unknown token, just because it was not seen in the training process of the tokenizer. So every single emoji will always be translated to unknown. And well, that's fine if you work with academic papers;
27:21
you probably won't ever encounter these kinds of characters. But if you analyze Twitter, LinkedIn or any other social media platform, actually even articles in the media, you can probably see a lot of these emojis being used, and they actually carry a lot of information,
27:41
like the sentiment of the person who wrote them. And unfortunately, if you just select a model or tokenizer that was never trained on that data you will lose that information easily. Another thing is all about typos.
28:01
If you have a product that serves multiple users, you will definitely experience a lot of typos in their queries or in the documents generated by them. Also, you may struggle with data quality on your own if you, let's say, have an e-commerce site
28:20
and you have lots of product descriptions; you cannot avoid having some typos and errors in them. And surprisingly, if you just forget to put this additional c in the word accommodate, it will no longer be described by a single token but will be split into four separate tokens.
28:42
Each of them without a clear meaning at the very beginning, because obviously they occur in multiple contexts. So we never know what the closest neighbors to each of them would be. And if you calculate the similarity between them (this is all calculated with cosine similarity),
29:01
it is fairly low, 0.28. This doesn't sound similar at all. Obviously in a real-world scenario there will be some additional words used in the query, so hopefully the similarity will be a bit higher, but still you lose that precision just because of those typos not being supported,
29:22
and that directly traces back to the tokenizer itself. And the same holds for some other typos as well. Basically, if you have real-world data and you wanna use vector search, you probably have to consider some sort of query reformulation that will just get rid of those typos, at least the most commonly occurring ones.
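A small sketch of how to check that behaviour yourself, assuming sentence-transformers and the same tokenizer; the exact pieces and scores may differ slightly from the ones quoted in the talk:

```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer

model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

print(tokenizer.tokenize("accommodate"))   # expected: a single token
print(tokenizer.tokenize("acommodate"))    # expected: several sub-word pieces

embeddings = model.encode(["accommodate", "acommodate"])
print(util.cos_sim(embeddings[0], embeddings[1]))   # surprisingly low for a one-letter typo
```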
29:45
And this is quite a common challenge for our users. This is quite a common question that we get. Well, our semantic search does not work for our use case because we want to combine it with numerical search. Well, if we have two shirts and they have the same price
30:02
they are similar in some way, right? But apparently that's not true for the embedding models. Here is a simple example: this shirt costs $55, and $55 might be expressed in two different ways, possibly in multiple ways as well. Maybe you have a discount of $5 from $60, whatever.
30:24
And the similarity of those sentences seems to be pretty high, like 0.8283. That sounds okay. However, if we reduce the price to just 50 but use digits to represent the second price, it even grows up to 0.9, which is counterintuitive, right?
30:46
We just reduce the price so they shouldn't be that similar. But if you compare the tokenization of both, the overlap of the tokens is higher for the second example because there is just a difference in a single token. Here we have two additional tokens being used.
31:02
And if we increase the price even further, so it doesn't cost $55 anymore but $559, the similarity of the tokenization is even higher, because all the tokens present in the first example are also present in the second one,
31:21
but there's just an additional one, because numbers are just cut into multiple pieces. So phone numbers, product identifiers, et cetera won't be represented at all. It's also funny to see that if you are, let's say, an electronics supplier and your users know exactly the part number
31:41
they are looking for, then maybe semantic search is not for you. That also applies to dates. There are various ways of how to express dates and definitely some normalization is required because here they are just represented in a completely different way. The set of tokens is completely different except for the year which is just cut into the same tokens
32:04
and you would easily get a better similarity to the second example if we just swap the day with the month. So 2024-16-02 would be closer to 2024-02-16
32:20
than to the first representation, even though they are all valid ways of expressing dates. So that doesn't work either. And the main problem is that tokenization rarely captures whole numbers, especially the bigger ones. Maybe if you just have some small numbers or prices, you could easily use one of the existing models,
32:42
if it covers all the cases. Still, it won't be able to express conditions like cheaper than a certain price or before a certain date, et cetera. That just simply doesn't work and we need different means. And also, another problem is caused by the data cut-off. Those models were just trained at some point
33:02
and they don't know the most recent facts. So if you just calculate the tokenization of OpenAI or ChatGPT, both are just split into multiple tokens. OpenAI probably was not that famous when they were collecting the training data for that sentence transformer, or actually the tokenizer.
33:23
And the similarity of both is really, really low. So, like, the world is changing too rapidly to just capture everything, and a model that we trained yesterday may not understand the meaning of some new words. And that also applies to proper names.
33:41
For example, the president of the US changes from time to time and our models are fixed so they would never know the best answer possible. So if you have lots of proper names in your data, then you probably need to consider some additional means
34:01
and hybrid search or sparse vectors based on BM25 or some other methods might be a good idea. However, if you are able to perform an additional step of trying to extract these metadata filters, or letting your users just express
34:21
these additional criteria in a different way (but that's a UI or UX problem), you could easily use some additional attribute filtering, metadata filters, and those are also incorporated in modern vector databases, including Qdrant. So instead of just letting you search
34:41
with the similarity measures, you can also provide this additional filter mechanism that might be incorporated into your semantic search. So for example, you are selling clothes and each piece of clothing is described by the visual appearance that you can easily capture by using vectors. So your users will be able to use reverse image search
35:03
to find a matching piece of clothing based just on the picture they took at some point, but those additional criteria like price, which actually changes over time, the manufacturer, fabric, et cetera, won't be captured at all. So that has to be expressed by payload indexes or metadata filters actually,
35:22
but payload indexes are the mechanism that you can use in Qdrant to perform that, because if you use our search API, you can pass the vector to use for search to find the closest neighbors in that space, and filters actually help to reduce the search space to certain regions in which your points
35:43
fulfill specific criteria. So a pretty unique feature of Qdrant is that this is incorporated into the semantic search, so there is no need for pre- or post-filtering, but this is like a technical thing. As a user, you can just use the payload indexes and add these additional filters on top of it.
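A hedged sketch of such a filtered search with the qdrant-client Python package, assuming a hypothetical "clothes" collection with price and manufacturer fields in the payload; parameter details may differ slightly between client versions:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Placeholder vector; in practice this would come from your image encoder.
image_embedding = [0.0] * 512

hits = client.search(
    collection_name="clothes",                      # hypothetical collection name
    query_vector=image_embedding,
    query_filter=models.Filter(
        must=[
            # Only return points whose payload matches these criteria.
            models.FieldCondition(key="manufacturer", match=models.MatchValue(value="ACME")),
            models.FieldCondition(key="price", range=models.Range(lte=100.0)),
        ]
    ),
    limit=10,
)
```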
36:03
And there is also a problem with handling non-English data. Those are all various forms of my name: in linguistics, declension is the changing of the form of a word to show its specific function in the sentence by some sort of inflection.
36:23
So all of these are actually valid variants of my name, but if you calculate the similarity between them, it might be higher or lower and you can never predict that, just because the model was only trained on English data. And it is pretty useful to realize that the tokenizer itself
36:42
might be used to estimate the quality of the embedding model in your own scenario. You have the data you would like to encode with a specific model, obviously, because you have a bunch of documents you want to be able to search over. And there are very simple means, yet they are not that well documented anywhere.
37:01
I was just trying to figure out if someone is doing that, but people just try to think about the embedding models as if they were magic tools understanding real language, but apparently that's not quite true. Even a simple exercise of calculating the ratio of unknown tokens helps a lot to make sure that your language,
37:22
your terminology, your specific data set is captured well by the tokenization process. And if you see that the number of unknown tokens for your data is relatively high, or there is a discrepancy between the rate of unknown tokens
37:41
in the documents you have and in the queries you have, then you can easily see that there's something wrong with the way you cut your text into pieces. And maybe it's better to evaluate some other models to see if there is an alternative that you could easily use to swap out the existing model.
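A minimal sketch of that unknown-token check, assuming the transformers tokenizer and lists of your own documents and queries (the samples below are only placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def unknown_token_ratio(texts: list[str]) -> float:
    """Fraction of tokens the tokenizer could only map to the unknown token."""
    total, unknown = 0, 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total += len(tokens)
        unknown += sum(1 for token in tokens if token == tokenizer.unk_token)
    return unknown / max(total, 1)

documents = ["Prague is really sunny in July ☀️"]   # replace with your own corpus
queries = ["sunny weather in prague"]               # replace with your own queries
print(unknown_token_ratio(documents), unknown_token_ratio(queries))
```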
38:02
And token frequency: actually, I didn't mention much about the way this tokenizer was trained, but basically this is an iterative process. At every single step of training the tokenizer, we select a pair of tokens to merge. So this pair of tokens is thought to be
38:21
the best merge in some way. And if we just evaluate the order of tokens in the vocabulary, we can clearly see some statistics of how the training data was distributed. So if you just perform the same training procedure for your own data, and you see that it's completely different, the language does not match
38:41
the one you have in the tokenizer, this is also another signal that maybe you are just struggling with the tokenization and need to rethink using a different model. And that's a very simple way of inspecting how a specific model will work
39:02
in this non-benchmark scenario. Because obviously there are some benchmarks like MTEB, which it is a great idea to check before you select any model. But obviously those are all public benchmarks. So it's quite easy to just get to the top of this benchmark,
39:21
because the data is publicly available. So why can't we just fine-tune our model so it works well in that specific scenario? And tokenizers are quite often neglected in fine-tuning the embedding models. I've seen many companies trying to fine-tune just their models without even touching the tokenizer.
39:42
But let's say you would like to take an existing model that works well for English and teach it to speak Czech as well. You can easily do it, but if you just split every single word into multiple tokens, that might not be the best way to do it. We can easily extend an existing tokenizer.
40:01
That's not a big deal. Actually, it should be included in all the fine-tuning processes that we do. We can easily add a new token into an existing vocabulary and then fine-tune a model so it learns this initial representation. So the initial meaning of this particular word will be kept in the model from the very beginning
40:23
so there is no need to learn that there is a certain meaning in a sequence of tokens. And we don't really need to perform a full fine-tuning process if we do that because if we have a model that we are just satisfied with and we just want to put some additional words
40:43
that it could understand, we can possibly just freeze all the parameters, add those new tokens into the vocabulary, and then start with some random vector from the very beginning, perform this fine-tuning process, and at every single iteration,
41:01
our input token embedding will be just getting closer to the area of that space in which it should belong. So it will just try to find the best place for itself, but all the other tokens will be just kept intact. I mean, all the other embeddings will be just kept intact.
41:20
So we'll be just modifying a small subset of the input token embeddings. And obviously, this is a very simple strategy and I couldn't really find a good name for it and this is also not well-documented yet but I'm working on this. I would call it word injection fine-tuning because here, we are just freezing all the parameters
41:43
of the network, so we are not training the attention at all, nor the feed-forward layer that we have here, and we are also fixing the input embeddings of the existing tokens, so we are only fine-tuning the ones that we added into the vocabulary.
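A rough sketch of this "word injection" idea with transformers and PyTorch; the training loop itself is left out, and the gradient-mask trick is just one possible way to keep the pre-existing rows frozen:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Add the new word and create a randomly initialised embedding row for it.
tokenizer.add_tokens(["qdrant"])
model.resize_token_embeddings(len(tokenizer))

# Freeze everything except the input token embedding table.
for param in model.parameters():
    param.requires_grad = False
embeddings = model.get_input_embeddings()
embeddings.weight.requires_grad = True

# Zero out gradients for all pre-existing rows, so only the new token can move.
new_ids = tokenizer.convert_tokens_to_ids(["qdrant"])
mask = torch.zeros_like(embeddings.weight)
mask[new_ids] = 1.0
embeddings.weight.register_hook(lambda grad: grad * mask)

# From here, run any (self-)supervised training loop over your own texts;
# only the "qdrant" row of the embedding table will actually be updated.
```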
42:01
And I did some experiments with this all-MiniLM-L6-v2. Actually, I learned the name by heart. It doesn't understand the meaning of Qdrant; like, vector databases were introduced after the model was trained. So it was just cutting it into multiple pieces: q, ##dran and ##t.
42:24
And all of the sub-word tokens didn't have any specific meaning. But I was able to do this simple exercise. I just added a single token at a time, performed this word injection fine-tuning, and it was able to gather some meaning: that Qdrant is about vectors, Python, Rust,
42:43
because this is a tool that is used by developers. And apparently that improves the similarity in some way. This is fairly easy to do because I just scraped the documentation from our website and performed an unsupervised procedure,
43:03
so the input embedding of Qdrant was just learned from that data, and it took less than 10 iterations of training. So not a big deal to do it. You can also add multiple tokens at the same time and obviously put some additional tokens
43:22
into an existing semantic space defined by these input token embeddings. Obviously, if you have a model that was trained on medical data, I would not expect it to be able to pick up some tokens coming from a different domain, but still, in some very simple cases like adding emojis into an existing model,
43:43
that should work fairly well. And there is still a discussion going on about whether we really need tokenization at all. There are some architectures that hopefully will get rid of it completely, but until that happens, we still need to have a closer look
44:01
at how our texts are really represented by the model that we use. If you have any questions, I'm not sure if you have time right now, but I will be just hanging around because I really enjoyed the conference, and this QR code points to my LinkedIn account so you can easily send me a message. Thank you very much.