Word2Vec model to generate synonyms on the fly in Apache Lucene
Formal Metadata
Title: Word2Vec model to generate synonyms on the fly in Apache Lucene
Series: Berlin Buzzwords 2022 (talk 17 of 56)
Number of Parts: 56
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/67187 (DOI)
Language: English
Transcript: English (auto-generated)
00:07
Welcome, everybody. Good afternoon. We are here to present our open source contribution to Apache Lucene to generate synonyms on the fly using a Word2Vec model.
00:22
Before starting, let me just introduce myself. My name is Daniele. I am an Italian software engineer. I got my master's degree back in 2014 at the University of Pisa. I am passionate about coding. I also love eating, but I do sport as well. Over to Ilaria.
00:43
Hi, good afternoon. It's a real pleasure for us to be here today. We would like to thank Berlin Buzzwords for the opportunity and also the audience for joining our talk. My name is Ilaria Petreti. I'm Italian, but I live in the Middle East. I've been
01:00
working as an information retrieval and machine learning engineer at Sease after earning a master's in data science in 2020. I have a passion for data mining and machine learning techniques and mainly deal with their integration into information retrieval systems. In general, I like all kinds of sport, especially basketball, since I was a basketball player.
01:24
So now, just a quick intro about our company, Sease. It was founded in 2016. It is headquartered in London, but all the employees are distributed worldwide, mainly in Europe. We are open
01:41
search enthusiasts and Apache Lucene, Solr, and Elasticsearch experts. We actively contribute back to the community, for example with our search quality evaluation tool, the Rated Ranking Evaluator, and we are active researchers. We always work on avant-garde topics, since the mission of Sease is to build a bridge from academia
02:06
to the industry through open source software. Since 2019, we have been organizing the London Information Retrieval Meetup. That is an event that, due to the COVID-19 pandemic,
02:20
became a hybrid: it is in person in London and also online. If you want to attend or present your project, feel free to contact us or visit our website. And finally, these are our hot topics. So, we mainly work on
02:42
neural search, natural language processing, learning to rank, document similarities, search quality evaluation, and relevance tuning. Now, this is the agenda, what we are going to cover in this talk today. I will briefly start with an introduction to synonym
03:01
expansion: what is the state of the art, what are the limits of the current approaches, and what are we proposing today? Then, a quick intro to the Word2Vec algorithm. After that, we will explore our contribution to Apache Lucene
03:21
that integrates Word2Vec into the text analysis pipeline. Then we will explore our implementation in detail, showing some practical examples at both index and query time, and finally our future work. So, let's get started with synonym
03:41
expansion. How and why are synonyms used in search? As is well known, when executing a query, the terms generated at index time need to match those of the query. Let's have a look at this example: best places for a walk in the mountains. We know that a walk can be expressed
04:01
with other terms, hiking and trekking, but if the text was indexed using, for example, the term hike and the user enters walk in the query, the document may not be found. That's why it's extremely important to make the search engine aware of
04:20
synonyms. Synonym expansion is a technique used in information retrieval that allows expressing the same information need in different ways, in order to enrich the search keywords and improve recall. The state of the art in Apache Lucene and Solr
04:43
is vocabulary-based synonym expansion. At the moment, the simplest way to implement synonyms is to feed the search engine a vocabulary that contains a mapping between words and their related synonyms. You can simply add a static list of synonyms, comma separated,
05:05
written in a txt file and then let the synonym graph filter read them from there. And here you can see in the example code how you can use the synonym graph filter with Solr. You can manually build the synonym list or download a vocabulary from WordNet,
05:26
for example, the lexical synonym database for the English language from Princeton University, which is constantly updated. It's a very large and high quality list of synonyms.
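For reference, the Solr configuration mentioned above would look roughly like the following field type. This is a hedged sketch: the field type name and exact filter chain are illustrative, while `solr.SynonymGraphFilterFactory` and its `synonyms` attribute are the real Solr factory and parameter.

```xml
<!-- Query-time synonym expansion from a static synonyms.txt -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonyms.txt holds comma-separated rows such as: walk, hiking, trekking -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Applying the synonym graph filter only in the query analyzer, as here, is the common choice: the index stays small and synonym lists can change without re-indexing.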
05:45
Then in 2020, our Sease director, Alessandro Benedetti, contributed to Apache Lucene, integrating this feature and introducing a new filter, the delimited boost filter, which gives the ability to associate a different numerical weight
06:06
to each synonym, in order to boost the ones that are more important and closer to the original concept. But what are the limits of this approach?
06:20
This sentence is a good example for explaining them. The term daemon, in the domain of operating systems, is not a synonym of devil; it's closer to the term process. As we can see, vocabularies don't necessarily match your contextual domain. Also, vocabularies are not always available for all languages.
06:47
They change over time, so they require manual maintenance, and the cost will be higher. Finally, synonym expansion for a word is based on its denotation and doesn't take into account its connotation.
07:01
By connotation I mean the context in which the word appears: in real life, people tend to use words as if they were synonyms even when, by grammar rules, they aren't. So, how can we solve this problem using machine learning?
07:21
Why not use a Word2Vec neural network to generate synonyms on the fly? This is what we are proposing and what we have integrated into Apache Lucene. First of all, we would like to thank the author of the book Deep Learning for Search, Tommaso Teofili, for inspiring this contribution. What are the advantages of this solution?
07:43
For sure, having a search engine that is able to use a neural network to generate synonyms accurately from the data to be ingested, rather than manually building a list of synonyms or downloading vocabularies, will help find more matches and avoid missing relevant search results.
08:04
This approach is language agnostic, so we don't care which language is used, whether formal or informal, because no grammar or syntax is involved: the main idea is to consider the nearest neighbors of a word.
08:22
OK, now just a quick intro to Word2Vec, because going into the details is not the purpose of this talk or of this contribution. But I want to give you an overview, because most of you may know it, but some may not. Word2Vec is one of the most common neural network-based algorithms
08:44
for learning word representations: it takes a corpus as input and outputs a series of vector representations, one for each word in the corpus, called neural word embeddings. They represent words with numbers.
09:01
The main idea behind Word2Vec is the distributional hypothesis: words that appear in the same context tend to have similar meanings. As a consequence, two semantically similar words will be identified by the model with two vectors that are close to each other in the space.
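"Close to each other in the space" is usually measured with cosine similarity. A minimal self-contained sketch; the 2-dimensional vectors and the word choices are invented purely for illustration:

```java
public class CosineSimilarity {
    // Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean the
    // vectors point in almost the same direction, i.e. similar words.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy 2-d embeddings, invented for illustration.
        double[] walk = {0.9, 0.1};
        double[] hiking = {0.8, 0.2};
        double[] daemon = {-0.3, 0.9};
        System.out.printf("walk~hiking: %.3f%n", cosine(walk, hiking)); // high
        System.out.printf("walk~daemon: %.3f%n", cosine(walk, daemon)); // much lower
    }
}
```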
09:23
Word2Vec is an unsupervised learning technique. It is a feed-forward neural network, where the information flows from the input layer to the output layer without any loop, and the input is encoded using one-hot encoding. There is just one hidden layer, which is why it is called a shallow neural network
09:45
rather than a deep neural network. The number of neurons in the hidden layer is a hyperparameter you can set: the value you choose will be the desired embedding size. The output is also in one-hot encoded form, and the word embeddings will be the vectors
10:07
from the network. You will obtain the word embeddings from the weight matrix of the hidden layer. Now I just want to briefly tell you the difference
10:22
between the two architectures you can choose from, continuous bag-of-words and skip-gram. Both algorithms use nearby words to extract the semantics of a word into embeddings, but, as you can see from the picture, they are exactly the opposite.
10:43
In continuous bag-of-words, the distributed representation of a context, i.e. the neighboring words, is used to predict a target word, while skip-gram is the opposite: we use the distributed representation
11:07
of a word to predict the context. But what is a context? How can we define the neighboring words? Through another hyperparameter,
11:20
the window size: for each sentence, Word2Vec reads words in a sliding window of n words. Let's have a look, for example, at this sentence: the cat chases the mouse up to the den. The sentence is split into fragments, and each fragment is fed to the neural network
11:44
as a pair consisting of a target word and its context. For example, let's look at the third row: given a window size of two and the target word, chases in this case,
12:01
we pick the two words before the target word and the two words after it. So the context will be the, cat, the, mouse, and the word pairs for training become (chases, the), (chases, cat), (chases, the) and (chases, mouse). Now, for the Word2Vec implementation, we use the Deeplearning4j library
12:25
which at the moment seems to be one of the best choices for an actively maintained Java deep learning library. It is open source and written in Java,
12:40
but it has interfaces for other languages. It is integrated with Hadoop and Apache Spark, and it has a good developer community. What we have used is an out-of-the-box implementation of Word2Vec based on the skip-gram architecture. It's very easy to use from scratch: you just have to set up the parameters
13:03
and pass the input text. This slide, just for your curiosity, shows the Deeplearning4j Word2Vec model output. After training the model, what you get is a zip that contains several files.
13:22
One of them is a txt file called syn0. It contains a vocabulary in which each token, each word, is Base64 encoded and has a vector associated with it. This is just a simple example, because, in fact,
13:43
as you can see, the vector is very short: the vector dimension is 2, but we know that 2 is too low to capture enough information. In fact, the default value is 100. OK, now I'll leave the stage to my colleague for the contribution part.
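Reading such a dictionary line back is straightforward. A sketch assuming the layout described above, a Base64-encoded token followed by whitespace-separated vector components; the exact Deeplearning4j file layout may differ in details, so treat this as illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ModelLineParser {
    // One vocabulary entry: the decoded word and its vector.
    public static final class Entry {
        public final String word;
        public final float[] vector;
        Entry(String word, float[] vector) { this.word = word; this.vector = vector; }
    }

    // Parse a line shaped like "<base64(word)> <v1> <v2> ..." (layout assumed).
    public static Entry parse(String line) {
        String[] parts = line.trim().split("\\s+");
        String word = new String(Base64.getDecoder().decode(parts[0]), StandardCharsets.UTF_8);
        float[] vector = new float[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            vector[i - 1] = Float.parseFloat(parts[i]);
        }
        return new Entry(word, vector);
    }
}
```

Encoding the word in Base64 keeps the file parseable even when a token itself contains whitespace or unusual characters.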
14:03
Thank you, Ilaria. So let's now get to the heart of our contribution, what we did. First of all, we implemented a Word2Vec synonym filter. This is another token filter, like the synonym graph token filter already in place in Apache Lucene. But this time, we are not getting as input
14:26
a static file of synonyms; we generate synonyms on the fly using Word2Vec. Let's start from the problems we had to face in designing and implementing this token filter. We had to find a way to read the model and parse it. We needed a smart way to store the model,
14:49
and finally, we needed a way to query the model in order to expand the term with its synonyms. But we already used Deeplearning4j to generate the model, so we have the library.
15:06
It implements almost everything: if the file is generated using Deeplearning4j, we also have a way to parse it, and a way to query the model to get
15:20
the synonyms. So it's perfect: it's already done, implemented, tested, and that's it. But unfortunately, life is never so easy. When we implemented the first prototype, we saw that too many dependencies came in from the Deeplearning4j library.
15:42
All these dependencies became dependencies of Lucene, too many conflicts came up, we had to exclude dependencies, and the source code became a mess. Even more important, when we did some preliminary tests,
16:02
we noticed that the search was quite slow: it was taking about 70 milliseconds for each synonym expansion. I did this test on my laptop, not a super-performant computer, but 70 milliseconds is still too much. So let's take a step back.
16:26
What do we have to do? We have a word. We want to extract its vector from the Word2Vec model and place this vector in, let me say, the vector space
16:44
generated by the list of vectors produced by Word2Vec. Then we have to select the vectors that are close enough to our query vector. In this case, we selected a subset containing, for example, a, z, and t, while w is our query term and our query vector.
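Stated as code, this task is a k-nearest-neighbors search: score every vector against the query and keep the k best. A naive O(n) sketch (toy vectors, cosine similarity as the closeness measure), which is exactly the computation that HNSW approximates much faster:

```java
import java.util.Arrays;
import java.util.Comparator;

public class BruteForceKnn {
    // Naive kNN: scan all vectors, sort by similarity to the query, keep k.
    public static int[] topK(double[][] vectors, double[] query, int k) {
        Integer[] idx = new Integer[vectors.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort indices by descending cosine similarity to the query.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> -cosine(vectors[i], query)));
        int[] out = new int[Math.min(k, idx.length)];
        for (int i = 0; i < out.length; i++) out[i] = idx[i];
        return out;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```

With hundreds of thousands of vectors, doing this full scan for every token is what makes the approximate approach below worthwhile.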
17:11
But this is just a kNN search: we want to get the k nearest neighbors given a specific vector. And this is already implemented in Lucene, so we don't even need
17:24
to reinvent the wheel. Lucene implements kNN search using HNSW. So, just a quick introduction to HNSW. We've already seen how this algorithm works.
17:43
We saw it yesterday during the talk of Alessandro Benedetti; I don't know if some of you weren't here, so let me just quickly recap how the algorithm works. A navigable small world graph means that we are dealing with a proximity graph.
18:06
So this is a graph where vectors are represented by nodes. And two nodes are connected by a link if the corresponding vectors are close enough to each other.
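As a toy illustration of "connected if close enough", one could build a proximity graph by thresholding pairwise distances. Note that real HNSW connects each node to a bounded number of its nearest neighbors rather than using a fixed distance threshold; this sketch only conveys the intuition:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProximityGraph {
    // Connect two nodes with an undirected edge when their vectors are
    // within `threshold` Euclidean distance of each other.
    public static Map<Integer, List<Integer>> build(double[][] vectors, double threshold) {
        Map<Integer, List<Integer>> edges = new HashMap<>();
        for (int i = 0; i < vectors.length; i++) edges.put(i, new ArrayList<>());
        for (int i = 0; i < vectors.length; i++) {
            for (int j = i + 1; j < vectors.length; j++) {
                if (distance(vectors[i], vectors[j]) < threshold) {
                    edges.get(i).add(j);
                    edges.get(j).add(i);
                }
            }
        }
        return edges;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int k = 0; k < a.length; k++) sum += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(sum);
    }
}
```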
18:22
The hierarchical idea comes from the skip list. Indeed, the skip list is also organized in multiple layers: on the bottom layer there are a lot of nodes with short links connecting them, while on the top level,
18:42
one link allows you to take a bigger step in the list, and the same goes for the graph. Coming back to the graph, this data structure is indeed organized in different layers. On the lower layer we have many nodes, with shorter edges that allow you to
19:05
refine your search, because you can see the neighbors near the node where you are. On the top layer, there is a small number of nodes, with edges that allow you to make
19:24
long hops, which is useful for fast retrieval. How does the search work? We start from the top level and navigate using a greedy approach through the
19:40
graph. We select the local minimum, that is, the node with the smallest distance to our query vector, and we iterate on the lower levels, refining the search. At the end, you have selected a good approximation of the neighbors of your query vector.
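The greedy descent on one layer can be sketched like this. It is a simplified single-layer version: real HNSW repeats this from the top layer down and keeps a beam of candidates instead of a single node.

```java
import java.util.Map;

public class GreedyGraphSearch {
    // Greedy descent on one layer: from an entry node, repeatedly move to the
    // neighbor closest to the query; stop when no neighbor improves (local minimum).
    public static int search(double[][] vectors, Map<Integer, int[]> neighbors,
                             int entry, double[] query) {
        int current = entry;
        while (true) {
            int best = current;
            double bestDist = distance(vectors[current], query);
            for (int n : neighbors.getOrDefault(current, new int[0])) {
                double d = distance(vectors[n], query);
                if (d < bestDist) { bestDist = d; best = n; }
            }
            if (best == current) return current; // no neighbor improves
            current = best;
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```

On a tiny chain graph 0 - 1 - 2 with vectors {0}, {1}, {2}, a query near 2 started at node 0 walks through node 1 and stops at node 2.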
20:07
So this is the solution we implemented. For parsing the model, nothing was available, so we had to implement our own parser.
20:24
Currently, the parser supports only the Deeplearning4j model, but it is designed to be extendable and may support other models tomorrow. The parser generates a stream that is read by another component that builds the HNSW graph we've already seen.
20:45
Finally, for query expansion, we use another component already present in Lucene that implements the search explained in the previous slide. Having implemented this solution, we noticed a drastic performance improvement
21:07
for query expansion: the search time was reduced from 70 milliseconds to 6 milliseconds, one order of magnitude less, and we didn't even have to add additional dependencies.
21:24
That's a good improvement. So how can we use this Word2Vec synonym filter? We can use it like the synonym filter we already have in Lucene. The difference is that
21:45
instead of a static file, we get a model parameter, that is, the path to the already trained Word2Vec model. We have another parameter, the format; currently,
22:05
only the Deeplearning4j model is supported, so we have just one default. Then we have two other settings to limit the number of synonyms retrieved. Max synonyms per term
22:20
allows you to limit the maximum number of synonyms you want to get, and min accepted similarity is the minimal similarity between two different terms for one to be considered a synonym of the other.
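Putting the two settings together, the selection logic can be sketched as follows. The parameter names mirror the filter settings described here; the class itself is illustrative, not the actual Lucene implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SynonymSelection {
    public static final class Synonym {
        public final String term;
        public final double similarity; // also usable later as a query-time boost
        public Synonym(String term, double similarity) {
            this.term = term;
            this.similarity = similarity;
        }
    }

    // Keep at most maxSynonymsPerTerm neighbors whose similarity reaches
    // minAcceptedSimilarity, best first.
    public static List<Synonym> select(List<Synonym> neighbors,
                                       int maxSynonymsPerTerm,
                                       double minAcceptedSimilarity) {
        List<Synonym> accepted = new ArrayList<>();
        for (Synonym s : neighbors) {
            if (s.similarity >= minAcceptedSimilarity) accepted.add(s);
        }
        accepted.sort(Comparator.comparingDouble((Synonym s) -> -s.similarity));
        return accepted.subList(0, Math.min(maxSynonymsPerTerm, accepted.size()));
    }
}
```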
22:41
For example, say your best neighbor is a term with a similarity of 0.3, but this is not enough to consider the two terms synonyms: it will be excluded. Finally, we have another parameter, similarity as a boost. We've been dealing
23:05
with similarities between vectors, hence between terms, for a while now. So why not use this similarity value as a boost? A lower similarity means the term should have a lower relevance, and the highest similarity,
23:25
highest relevance. So this was the first and the main part of the contribution. But when Ilaria described the limits of the current implementation,
23:43
one of them was not taking into consideration the connotation coming from the contextual domain. This is what we address here. What is the best way to do it? The
24:00
best way is generating a model starting from your own data. That's why we implemented another external tool. This is a command line tool that does exactly this: it gets a Lucene index as input and a field name, so
24:26
where to get the data from, and it outputs a trained Word2Vec model. This is how you can invoke the tool: passing the Lucene index path,
24:42
the field name, and the output file name. Looking at the implementation, this is more or less a wrapper over Deeplearning4j. We first iterate over the
25:05
Lucene index, we generate the model, we train it, and we serialize the model, writing the output file. As you can see here,
25:20
having the best model was not part of our contribution, so we just used the default parameters. Something worth saying is that we kept this as an external tool because, first of all, we don't know the impact on Lucene indexing performance,
25:45
and second, we cannot use Deeplearning4j's Word2Vec inside Lucene: as we saw before, the library brings in a mess of dependencies.
26:03
Okay, I see that we don't have too much time left, so these are the links where you can find both the trainer and our open source contribution. The Word2Vec synonym filter is now part of our fork of Lucene, but in a few days,
26:24
we are going to prepare a pull request to merge our code into the official version. Actually, in theory, we should have 40 minutes, not 30.
26:42
By the way, okay, I'll quickly show you some practical usage. I'm going fast, but anyway, just to see how we can use our whole contribution. We are Italians and we
27:00
want to get some synonyms for Italian. So we downloaded the Italian documents from Wikipedia, about 3.4 gigabytes of data. We stored all this data into an index and used the Word2Vec model trainer to generate a trained model. In this case,
27:24
the model is called Wikipedia model.zip. Okay, we have time, great. We now use this model to expand synonyms at query time. So
27:47
as you can see here, we created a custom analyzer adding a Word2Vec synonym filter factory. In this case, we use all default parameters and only pass the already trained
28:04
model. We open a searcher, ask the client for an input term, generate the query, and search for the documents. We launch our test and pass the term computer as the word.
28:27
It is now part of the Italian dictionary, and everybody can understand it. We got back: microprocessor, controller, microcomputer, desktop, notebook, hub, software, chip, mainframe.
28:40
Each term comes with its similarity value. Generating the query, we can now see how the similarity value is used as a boost. For example, chip has a boost of 0.8994, which is the similarity of the term chip, while computer, our original term,
29:06
has a similarity of one. But if you don't want the synonym expansion to impact performance at query time, you may decide to directly expand your documents
29:25
at index time. So what do we do? We use the same custom analyzer and create an index writer. For this example, we just generated a single document with a single value, with the
29:41
same word, computer, and we run the test. Just a note: if you proceed with this approach, you need to be aware that your index will be bigger, indexing time will be higher, so the process will be slower, and if you have to do some fine-tuning on the model and something changes, you have to re-index the whole collection again.
30:08
the whole collection again. By the way, we did this example and just to verify that everything is working correctly, we used Yuke to open the index and see, indeed we found, you cannot see
30:23
it here very well, but we can find the same terms we saw before in the index. Just a couple of notes: not all the synonyms retrieved are proper synonyms, but this is something not
30:42
depending on our contribution but on the model. Indeed, we didn't spend time fine-tuning the model, because each domain needs a different fine-tuning. But something that actually depends on the contribution is
31:00
what you can see on the left part: reading the file and building the HNSW graph took about two minutes for about 300,000 vectors, and this is because the HNSW graph is
31:21
stored in memory. So, every time you start up a process containing Solr or Lucene, you have to load the model and build your graph again. We can do something better; passing to Ilaria for our future work.
31:50
Yes, okay, let's conclude this talk with our future work. We have already mentioned these points during the talk, but we just want to summarize them.
32:02
As Daniele already said, our current limitation is that the model is kept in memory. What does that mean? For example, in case of disaster recovery, when you have to restart your system, it will take longer, since the loading time will increase. And in the case of multiple processes,
32:23
you have to load the model and build the graph several times, so it will take memory and time proportional to the number of processes you have. How do we plan to solve this?
32:43
What we would like to do is change the model storage, storing the model as a Lucene index. So, we would like to force it to be on disk instead of in memory. This has several benefits: we no longer need to load the model and rebuild
33:02
the graph, because we already have it on the file system and we will load it via memory mapping. Also, in case of disaster recovery, it will be faster. And if you have multiple Lucene instances, they will use the same model, so there is no need to load the model every time.
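The "on disk, loaded via memory mapping" idea can be sketched with plain java.nio. This is illustrative only: the actual plan described here is to use Lucene's own index formats rather than a raw float file.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedVectors {
    // Serialize vectors as raw big-endian floats, row after row.
    public static void write(Path path, float[][] vectors) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(path)))) {
            for (float[] v : vectors) for (float x : v) out.writeFloat(x);
        }
    }

    // Read one vector back through a memory-mapped buffer: the OS pages the
    // data in on demand instead of the JVM holding everything on heap.
    public static float[] read(Path path, int dim, int index) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY,
                    (long) index * dim * Float.BYTES, (long) dim * Float.BYTES);
            float[] v = new float[dim];
            buf.asFloatBuffer().get(v);
            return v;
        }
    }
}
```

Because the mapping is read-only and backed by the file, several processes can map the same model without each paying the full memory cost.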
33:27
Then, just some improvements. They don't depend on our contribution, but they are something we would like to improve. As already said, we haven't really cared much about the training of the model; we used the default parameters, but we would like to introduce
33:43
hyperparameter tuning in our command line tool as well. And we would like to generate synonyms with other language models too, like BERT, for example. At the moment, as we said, the only format we support is Deeplearning4j, but the entire architecture
34:04
is already designed to extend our range of possible models. And finally, Solr, Elasticsearch, and OpenSearch integration. Why the question mark? Because we haven't really investigated that part. We think that when we import the dependency
34:30
that contains our contribution, we will have the new features for free, but we have to check whether there is something to adjust. And finally, we would like to introduce multi-term synonyms,
34:44
because at the moment we can create embeddings only for unigrams. So, stay tuned, because this is only the beginning and the best is yet to come. Thank you very much for your attention.
35:03
Wow, thanks a lot. It was a really interesting talk. Because we are running behind time, I'll just ask a question that was posted online, and then whoever has a question here, I guess they will be around and you can... Yeah, we will be here.
35:21
I'll pick the question with the most votes, just to be fair. The question is: I have often found nearby word vectors rather unrestrained. Queen might lead to monarch, but also to king or crown, which are not synonyms. Did you encounter this? Do you have any strategies to mitigate that? To be honest, can I... sorry, can I see the...
35:47
Yeah, definitely. So... okay. This depends mostly on the documents you have.
36:03
The data you have and how you fine-tune your model. There's no specific answer to this question, actually. You have to try different parameters until you find a good model; these kinds of things you have to try.
36:25
Yeah, it's what they say about machine learning: garbage in, garbage out, right? Okay. As already said, words that appear in similar contexts tend to have similar meanings, but we know that, for example, synonyms and antonyms appear in the same context. So, it's extremely important that
36:45
the algorithm ingests a lot of data, a large set of documents, in order to be able to find sentences where these words appear in different contexts. That way, it can figure out that they aren't similar and will assign them different word vectors.
37:02
Yeah, I guess that makes sense. We could ask how much data they ingested for this example. Anyway, feel free to contact us if you have other questions; we are happy to have a discussion. You can also contact Daniele and Ilaria online and ask your questions.
37:23
So, let's thank the speakers again.