Full-text search with Lucene and neat things you can do with it
Formal metadata

Title: Full-text search with Lucene and neat things you can do with it
Series: NDC Oslo 2012
Part: 64 of 110
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, modify, and reproduce, distribute and make publicly available the work or its content, in original or modified form, for any legal and non-commercial purpose, provided you credit the author/rights holder in the manner specified and pass on the work or content, including modified versions, only under the terms of this license
Identifier: 10.5446/51015 (DOI)
Language: English
Transcript: English (automatically generated)
00:03
Hi, everyone. Hello. Basically, this session here today is about full-text search with Lucene. You probably know what full text search is. You probably have seen it either with Lucene
00:20
or using some sort of SQL server, for example. Some products have some module you can turn on and use full text search with. This conference started with Aral talking about user experience. Today, I'm going to show you how full text search, which is seemingly unrelated to user experience,
00:46
can actually take any product that you have and just enhance the user experience that your users have. It can make your product much more intelligent, and it can maximize your profits, for example.
01:02
Google has been doing this for a lot of years now. They have been allowing you to perform full text searches. You are going to Google, you're searching for just about anything, and you will get results. You will have your results, the most relevant results as the first results. Google can also correct you if you are wrong. It can tell you, did you mean that or did you mean this?
01:29
It allows you to refine your searches. It can show you similar pages for each page that you can see in the results. Basically, it just allows you to find what you are looking for.
01:45
Full text search, basically, the motivation is that because we have a lot of data we want to search on, full text search is a very efficient way to allow you to use all of that information and get what you want out of it with ease and without spending a lot of time making the actual search.
02:09
You don't have to go to libraries and look for books, and you don't have to go all over the internet and seek one document that you are looking for. You don't have to take whole teams of people to just sit and catalog stuff.
02:24
You can just throw it into some sort of an engine, which we'll see in a second, and it will just work for you. This session today, we are going to see how we can actually use full text search in any product that we have. We are going to see how full text search engines work, and then we will see how,
02:44
understanding how full text search engines work, we can take them and enhance user experience with them. My name is Itamar. I've been working as a core developer on RavenDB for quite a while now. I'm also a participant in various open source projects.
03:03
I'm managing the C++ Lucene project, which is taking the Java Lucene project and porting it into C++. It's been a while since we actually did some work there, but it is still alive. I'm a committer to the Lucene.NET project.
03:22
About two weeks ago I contributed a new spatial module. I hope we can demo it later today at the end of the session. I'm also working on some sort of a solution for Hebrew searches. Hebrew is about the hardest language to index, and that's a full-blown project now.
03:43
I also created this update project, which is a bit of marketing here for an open source project. This project basically allows you to take any desktop application and update it very easily. It is used by a lot of, mainly C#, open source projects.
04:04
For this talk here today, I downloaded the English Wikipedia dump. As you can see, it's actually something like 8.5 gigabytes. It's very huge. I took it and just indexed it.
04:20
I built a small application to just create an index out of it. This index, as you can see here, is 73 gigabytes in size. Usually, when you index stuff, the resulting index is going to be about 10% of the size of the data that you indexed.
04:44
I would have expected it to be around 800 megabytes, but I did some extra things there. Basically, because I'm indexing Wikipedia in its compressed form, it's more than 8 gigabytes; I would assume it would be something like 20 gigabytes of XML uncompressed.
05:04
I also stored every page that I indexed. I stored it within the index itself, so that also bloats the index a bit. And I did some extra hacks to make the demo actually work for today. This is why the index is that big.
05:20
Usually, it doesn't get too big. But it's good that it's big. You can actually see how the 73 gigabyte index still operates very fast. Okay, that's a good question, actually, because it took me a week and a half.
05:43
That's the point. I did it in three goes, because on the first go, I was using a parser. To parse the Wikipedia markup stuff, basically, the only code that works is a very ugly PHP code. I won't run PHP, so I took some sort of C# parser,
06:04
and that C# parser got stuck at 92% when I was indexing. It just got to a document where it went into an infinite loop. I solved that. After two and a half days, I discovered that. I killed the application. I did some hack to work around that, and indexed it again.
06:22
This is a laptop, so I forgot it and went and put it to sleep, and then I had to start all over again. Basically, it took me a week and a half, about two and a half days, for indexing and parsing. I put a lot of effort into this talk. The computer put a lot of effort into it.
06:41
Yeah, that too. So, here is our sample application. It holds the entire English Wikipedia. I can basically just look for whatever I want here. Let's look for Oslo. Okay, a couple of milliseconds, basically. It also just started up, so it took a bit longer now.
07:03
You can see I have 1,300 results. And you can see the application is also able to highlight the actual words within the best fragment of the document. That is, where that word occurs most frequently,
07:21
it will just take that fragment and highlight the words for me. So you can actually even read some part of the document that is relevant to you directly from the search results, just as you do in Google. I can page around, of course. Okay, wasn't expected.
07:41
Another thing to note is the score here. Basically, this score is given to me by Lucene. Lucene tells me, okay, this is some arbitrary number, and I'm going to sort the results based on that number. This number has been calculated from the number of terms that Lucene matched in the index for my query in the document.
08:04
It is based on the frequency of that term in all the documents compared to the frequency of that term within this specific document. Really, a lot of calculations. That number doesn't mean anything outside the context
08:20
of this specific query, but it uses that to sort my results. So I'm getting the most relevant results up. This is why I get the actual Wikipedia entry for Oslo first, and I'm getting other Oslo-related entries below that.
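If you ever want to see exactly where such a score comes from for a specific hit, the searcher can break it down for you. A small sketch against the Lucene.NET 3.x API; the helper class and method names here are just illustrative:

```csharp
using System;
using Lucene.Net.Search;

static class ScoreDebugging
{
    // Ask Lucene to explain how it scored a particular hit for a query:
    // term frequencies, inverse document frequency, field norms, boosts and so on.
    public static void ExplainScore(IndexSearcher searcher, Query query, int docId)
    {
        Explanation explanation = searcher.Explain(query, docId);
        Console.WriteLine(explanation.ToString());
    }
}
```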
08:44
It is smart enough to actually do this. If I'm going to click on related content, it's going to take about 10 seconds because we are using a 70-gigabyte index. It is going to calculate similar pages to this page, okay?
09:02
I'm going to see what pages are talking about Oslo and are related to this specific document, and I'm going to get this back in five, four, come on.
09:20
The computer is still working hard. Okay, anyway, I will be getting a list of documents, of other Wikipedia documents that are in the index, and Lucene thinks they are relevant. So if I look here, I'll get Oslo, which is the actual, I don't know why they gave me that, but anyway, I'm going to get this.
09:42
Let's see what it is. Okay, this is the actual neighborhood we are in now, okay? The Oslo spectrum is in this neighborhood. I'm going to get this. I don't know why. Okay, he lived in Oslo. I'm going to get a lot of things.
10:01
The Fjord city, that's probably the nickname for this city. Yeah, it's a project within this city. You see, I'm getting a lot of info that I couldn't have known about, but Lucene is smart enough to actually be able to tell me that. We are going to look at how it does that later.
10:21
I can also do something even more fun. I can actually start typing, and Lucene is going to give me, come on, everything breaks today. Let's put it in release.
10:46
Great. I can actually, I'm starting to type, and I'm getting documents that are relevant. Okay, I'm going to go to the actual JSON stream.
11:02
Lucene can give me the actual terms or titles, Wikipedia document entries, that are relevant to me because I started typing something. So if I started typing OS, I will get OSX, I will get OSI, I will get Oslo, I will get a lot of other things that have a word that starts with OS.
11:22
This is how I can actually provide an autocomplete feature to my users. Actually, I had it implemented very nicely here with jQuery. Apparently, I'm not much of a UI person. So basically, this thing can actually give me a lot of functionality.
11:42
I can also do something a bit different. I can tell it, okay, please don't only search, but also give me suggestions. Okay, if I'm making a typo, please correct that for me.
12:02
Or at least show me what other options I have. So let's type some sort of a word which doesn't exist. And actually, I'm going to get some documents that actually contain that word for some reason.
12:21
I don't know why. But Lucene can tell me, okay, but did you mean deaf? Did you mean ideal? I mean, I don't have this word here. Perhaps you meant another word. And then I can actually provide a link. Again, not much of a UI person, but I can actually give the users a link here. So he clicks and gets to the other search, to the refined search.
12:45
So, how does all of that work? I've been asked before, so I'll just tell the whole story in short. Lucene has been around since around the year 2000.
13:00
It was initially created by a guy called Doug Cutting. He created that based on a lot of white papers. He did really good work. And he created that in Java. That project exists up until today. It is very widespread. It is under the Apache umbrella.
13:22
Over the years, it has become very mature. It works very fast. It has many contributors. The people change, but there is always a lot of people actively working on the project. It provides API for both indexing and searching.
13:44
It has a very large contributions folder, a package, that adds a lot more functionality. As I mentioned, for example, the spatial module. It is very fast, it is very efficient, and it has a couple of ports. Today, I'm going to demo the version that you saw, the Lucene.NET version.
14:04
Currently, Lucene is heading to version 4. The last released version is, I think, 3.6. The .NET version is currently at version 3. That is a bit behind, but again, like the Lucene version, the Lucene.NET version is also trying to keep up.
14:25
Lucene is behind Twitter, it is behind LinkedIn, it is behind very large players. It handles a lot of data, as you have seen. An index of 73GB works very fast.
14:40
This is a code sample for indexing. This actual code sample is in Java, but it's almost the same as the .NET version. It is also a bit outdated, but never mind. We have the writer here. I'm creating a new writer, I'm giving it a directory implementation.
15:03
We have quite a few of them. We have the FS directory, which is a file system directory. I wanted to write this index to the file system. I also have the RAM directory. That RAM directory is used to perform faster searches on the RAM,
15:21
if I don't want to actually persist that index. Then I'm creating something that's called Analyzer. We'll talk about that later. Once I have the writer, I'm ready to index. To actually index, I'll be creating a document object. That document object is going to have several fields in it.
15:43
Each field allows me to index different data and search on it on different queries. We'll talk about that later as well. Here I'm going to index my file system. I'm creating a new field called Contents,
16:01
and I'm creating a field called Filename. I'm storing the file name. It is actually indexed here as well, but I'm storing it so I can actually get it later. I'm reading the file and throwing it into the contents field. That contents field I'm going to search on later.
16:21
After I've added those fields into the document object, I'm going to add that document to the index writer, and that index writer is going to perform the actual indexing operation. I'm done. I'm optimizing it. Optimize was actually recently renamed to merge segments,
16:41
but basically just please wrap up the operation, and then I'm closing the index for writing. And then I'm getting to the search stage. The search stage is basically doing quite the same, but the other way around. I'll be creating a searcher. I'll give it the directory implementation,
17:03
either a RAM one, an FS directory one, and then I'm going to parse the query. I will have to take the query string, which is exactly the same as you know it from Google. I can use operators like OR, like AND, like proximity, a lot of other stuff like that,
17:21
and I'm passing it to an object called Query Parser. That object is going to parse the query, and it's going to return me an object of query. That query object is what I'm going to use to perform the actual search. For me, this entire thing is a black box. I'm just calling this API, I'm getting the query,
17:42
and I'm passing it to the searcher. The searcher returns me the documents that match the query. So here you can see I have the actual document number, okay, each document in Lucene is going to have some sort of a unique identifier,
18:00
and then I'm going to load that document to get the filename field. I got it, I show it to the user as a result, and I'm more or less done.
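For reference, here is a rough C# sketch of the indexing and searching flow just described, written against the Lucene.NET 3.x API. The directory paths, field names and the example query are placeholders rather than the speaker's actual code:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// Indexing: one document per file, with an analyzed "contents" field
// and a stored "filename" field we can read back from the results.
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
using (var directory = FSDirectory.Open(new DirectoryInfo(@"C:\my-index")))   // or new RAMDirectory() for an in-memory index
using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var path in System.IO.Directory.EnumerateFiles(@"C:\docs"))
    {
        var doc = new Document();
        doc.Add(new Field("contents", File.ReadAllText(path), Field.Store.NO, Field.Index.ANALYZED));
        doc.Add(new Field("filename", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.AddDocument(doc);
    }
    writer.Optimize();   // wrap up the operation (merge segments)
}

// Searching: parse a Google-style query string and print the stored filenames.
using (var directory = FSDirectory.Open(new DirectoryInfo(@"C:\my-index")))
using (var searcher = new IndexSearcher(directory, readOnly: true))
{
    var parser = new QueryParser(Version.LUCENE_30, "contents", analyzer);
    Query query = parser.Parse("oslo AND spectrum");
    TopDocs hits = searcher.Search(query, 10);
    foreach (ScoreDoc hit in hits.ScoreDocs)
    {
        Document doc = searcher.Doc(hit.Doc);   // load the stored fields by internal doc id
        Console.WriteLine("{0} (score {1})", doc.Get("filename"), hit.Score);
    }
}
```

Note that the same analyzer instance is used both when writing the index and when parsing the query, which is exactly the point made later about the analysis chain being shared by both sides.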
18:21
Yeah, it works as well. Phrases work as well, yeah. Everything you know from Google basically works in this Query Parser just the same. You can also search on different fields, only on certain fields, on a couple of fields. We are going to see that in a second, I hope so. So as I said, the searches are actually performed
18:46
in two and a half stages. You have the stage of indexing, you have the stage of searching, and just before you search, you have to do some sort of query parsing. Obviously, you can build your own query object, but using the Query Parser is the first choice.
19:02
And you probably want to use a Lucene index in parallel to your database, for example, if you are indexing some sort of data. Let's talk about actual scenarios. If you have a database and you have a lot of products in it and you want to, for example, allow for a very efficient search,
19:23
you are going to create a Lucene index in parallel to your database, and whenever a user, whenever a product changes, you update the index, and whenever a user makes a query, you are going to query the index for full text search queries. That will allow you to perform the suggests stuff,
19:43
that will allow you to recommend other products for that person, and basically, that's going to be very useful to you. Now, there are a couple of tools that you probably want to know about like, for example, Hibernate Search, which will do that for you.
20:02
You can use your data model to tell it how to index it in parallel to your database changes, and it works quite nicely. To understand how all of those features work, let's for a second concentrate on how the actual index works.
20:23
To be able to index and to be able to search on a lot of documents, you basically cannot work like a human. You cannot say, okay, I'm going to scan all of those documents and look for the word I'm looking for. That will take you forever, especially if you're indexing the entire Internet.
20:43
What you actually want to do is to invert the process. You want to have documents and terms inverted. You want to have the computer look for a term and then tell you in which documents that term is.
21:01
So, this is how Lucene works. Lucene creates an inverted index which, given a couple of documents, is going to break them up, basically on spaces, and is going to create an alphabetical dictionary which will contain references to those documents by ID.
21:25
That dictionary looks like this. It is sorted alphabetically, and it has a column, if you want, that will contain the actual document IDs that term is found in. And whenever Lucene wants to perform a query,
21:42
you perform a query with Lucene, what actually happens is Lucene goes to that dictionary, it looks up the word, and it just gives you the document number. This is done in a very efficient manner. It uses a radix trie, or a Patricia trie; it uses a very specialized data structure
22:01
that allows you to first have that optimized in memory, and second, perform very fast lookups. Lucene employs a lot of caches to ensure that quite everything, a lot of things, are done in memory. So, what happens when a user searches for a word,
22:22
he searches for a keeper, he will find the word keeper. Now, if you see here, I'm not sure, it might be a bit too small, but you can see that I have three words that are quite similar, but once the user looked for keeper, he was only able to get one word,
22:40
which is keeper, which is an exact match. But words like keep or keeps weren't found. That's by design, basically. And there is a process that we'll talk about in a second, that will actually allow you to perform this operation and find similar words that are actually sharing the same stem,
23:02
actually sharing the same base word. That is part of the term normalization process. So, there are a couple of words that don't really have any meaning. We call them stop words,
23:20
because they give us nothing, basically. They appear quite a lot. They inflate the index, and we don't really care about them. So, those words, we remove, okay? In English, for example, the word 'and', the word 'did', 'had', words like that,
23:40
we just remove. We don't just ignore them. Words, similar words, like, for example, slip and slips here, in green. Words like that, we are going to merge. We are going to have one entry for. Same thing like keep and keeper,
24:00
or keeps, and stuff like that. Now, this process is basically analyzer dependent. When you index, you provide an analyzer to the indexing mechanism, and it is going to perform some sort of operation. That operation is going to normalize those words. It is going to kick out stop words
24:21
if it was designed to do that, and it is going to merge certain words. Again, depends on how it was coded. For example, we have the Porter stemmer. We have something that's called a Snowball stemmer. We have an S-stemmer. Those three are basically used for English texts. Snowball has implementations for some other languages,
24:42
and each makes somewhat different decisions. Now, what's important to understand is that for some words, we don't actually want to do that, because sometimes when we normalize a bit too much, we are going to hurt our relevance. We are going to get irrelevant results.
25:05
We should really consider what analyzer or what technique we are going to use. There are actually a lot of academic papers on that. I think I'm not quite up to date with that, but basically, last time I checked, the S-stemmer was the preferred way to perform indexing
25:24
on the English language. The S-stemmer basically just removes all the trailing S's, possessive S's and stuff like that. How would the actual indexing code look if we were to write it? Can you see this, or do I need to enlarge it?
25:43
Enlarge? You're sitting in the first row. Okay, you see? It's good. So I would create a sort of dictionary, a dictionary which is sorted alphabetically.
26:02
And the entire process I'm going to do is just take the words, tokenize them. That is, I'm going to break a sentence based on spaces, for example. I might want to add commas, dots, stuff like that, that isn't really a word.
26:20
I'm not sure what I'm going to do with numbers. Again, this is really dependent on my situation. And I'm going to tokenize that, and I'm going to put that into a terms array. And for each of the items in that array, I'm going to throw them into the dictionary. I'm going to normalize them and then add it to the dictionary.
26:43
Now, again, the normalize function is really something that I need to decide on what to do. It is up to me, entirely up to me, entirely up to my corpus. It is nothing. There is no strict rule on telling you what exactly you want to do.
27:00
So here I'm going to do some sort of an S-stemmer, a very simple one. I'm just going to remove S and ES endings. And then that new word is going to be stored in the dictionary. Unlike this indexing code, the search code is going to be a lot simpler. The indexing process might take you a week and a half,
27:23
but it is going to do all the hard work for you. Actual searching, which is what we really care about, is going to be very, very fast. All I'm going to do is get the term from the index based on the actual word, and then I have the entire documents, the document IDs,
27:42
that term will appear in. Before I will do that, I'm going to normalize it again, because I had a text, I normalized it, I threw it into some sort of index, the dictionary, and then the user queries me. Now, when the user queries,
28:01
he might use an unnormalized form of the word, right? He might be using a word with a trailing S. So I want to remove that, otherwise I won't be able to find it. And this is why doing too much normalization might hurt my relevance.
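To make the idea concrete, here is a minimal, self-contained C# sketch of that toy dictionary-based inverted index: tokenize, naively strip S and ES endings, and do an exact lookup after normalizing the query the same way. It is illustrative only; it is not how Lucene is actually implemented internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ToyInvertedIndex
{
    // Sorted term dictionary: term -> set of document ids the term appears in.
    private readonly SortedDictionary<string, SortedSet<int>> index = new SortedDictionary<string, SortedSet<int>>();

    public void AddDocument(int docId, string text)
    {
        // Tokenize: break on whitespace, commas, dots and other punctuation.
        var terms = text.Split(new[] { ' ', '\t', '\n', ',', '.', ';', ':', '!', '?' },
                               StringSplitOptions.RemoveEmptyEntries);
        foreach (var term in terms)
        {
            var normalized = Normalize(term);
            if (normalized.Length == 0) continue;
            if (!index.TryGetValue(normalized, out var postings))
                index[normalized] = postings = new SortedSet<int>();
            postings.Add(docId);
        }
    }

    // A very naive "S-stemmer": lowercase and strip trailing "es" / "s".
    private static string Normalize(string term)
    {
        var t = term.ToLowerInvariant();
        if (t.EndsWith("es")) return t.Substring(0, t.Length - 2);
        if (t.EndsWith("s")) return t.Substring(0, t.Length - 1);
        return t;
    }

    // Search: normalize the query word the same way, then a single dictionary lookup.
    public IEnumerable<int> Search(string word) =>
        index.TryGetValue(Normalize(word), out var postings) ? postings : Enumerable.Empty<int>();
}
```

After adding a few sentences with AddDocument, searching for "keepers" returns the same documents as searching for "keeper", because both normalize to the same term; and that is exactly where over-aggressive normalization can start to hurt relevance.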
28:26
So let's look a bit at the actual code. So this is the code that I used. I'm going to upload this to GitHub later, but let's go through it now.
28:44
So I'm creating a standard analyzer, which is something that ships with Lucene. That's the analyzer you basically want to use when you index English text, unless you have some compelling reason to use another one. I'm creating the index writer.
29:01
This is some sort of a hack to make it work faster. And then I have some sort of a corpus reader. I'm going to read the entire Wikipedia stuff, and for each document, this event is going to be called. This is the hack I want to bypass. That's a very ugly one. To bypass the infinite loop that the parser went into.
29:25
And basically, I'm skipping blank documents, and I'm creating a document for each of the Wikipedia entries I have. So I'm going to have an ID field, which I'm going to store and not index. That is, it won't be searchable.
29:42
I only want to get it when I get results. I want to be able to get the ID back from that document. And then I'm going to have a field title, which is analyzed, which is being indexed. It is also being stored because I don't have the entire Wikipedia available to me because I'm working against a dump. I'm not working against an actual live database
30:03
that I can access to get the info from. We'll see in a second why I need to get that info. And then I'm using something that's called a boost. Because I'm adding another field here, which is basically, it does exactly the same as I'm doing for title. But the title, as far as I'm concerned,
30:22
if a title contains a word, that word is much more relevant to me than the content. Now Lucene does everything already within it. It has some implementation that's called TF-IDF. Basically what that means is that the shorter your field is, the higher each word in it is going to score.
30:42
So I want to boost it a bit. So I'm telling it, okay, this field, this title field, please boost it a bit more. Give it more importance. So I'm playing a bit with how Lucene scores the documents. And then some boring progress stuff. I'm finishing up, wrapping up, and closing the writer.
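As a hedged illustration of the per-page document construction just described — a stored-but-not-indexed id, an analyzed and boosted title, and the analyzed, stored content — something along these lines in Lucene.NET 3.x; the field names and the boost value are assumptions, not the exact demo code:

```csharp
using Lucene.Net.Documents;

// One document per Wikipedia page, roughly as described above. The field names
// and the boost factor are illustrative; they are not the exact demo code.
Document BuildPageDocument(string id, string title, string wikiText)
{
    var doc = new Document();

    // Stored but not indexed: we only need to read the id back from search results.
    doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NO));

    // Analyzed and stored, and boosted so that a match in the title counts
    // for more than the same match in the body.
    var titleField = new Field("title", title, Field.Store.YES, Field.Index.ANALYZED) { Boost = 2.0f };
    doc.Add(titleField);

    // Analyzed body text, stored so snippets can be built from it. Term vectors
    // with positions and offsets are needed if the fast vector highlighter is used later.
    doc.Add(new Field("content", wikiText, Field.Store.YES, Field.Index.ANALYZED,
                      Field.TermVector.WITH_POSITIONS_OFFSETS));
    return doc;
}

var example = BuildPageDocument("42", "Oslo", "Oslo is the capital of Norway ...");
```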
31:05
Now, this is one application I have, this index-build application. This is what I showed you. Now the Lucene neat-things demo is the web application, this web application.
31:20
And the entire magic is happening here in the index.cs. I have this search method here. Again, as I showed you before, I'm creating a query parser. And I also have something that's called multi-field query parser. I'm allowing the user to decide whether he wants me
31:40
to search on both fields, or to search titles only. So if I look for Oslo again, I'm going to see only documents with the word Oslo in their title. I can see also I have only 200 results, 300 results.
32:03
Now I can remove this. And now it is very, it might happen that I get somewhere along the way. Okay, it is broken. I might get somewhere along the way documents that do not have the word Oslo in the title.
32:22
But because they appeared in the actual content, I got them. So after I found the results, I'm creating a fast vector highlighter. That highlighter object is what gives me the actual snippet.
32:43
Okay, it gives me the fragment, the text fragment, and it highlights the more important parts of the document where that word is. So again, just an API. I'm creating it. I'm getting the results. And I'm adding the search results here.
33:00
I'm taking the title and the ID field, which I have stored. I'm storing the score as well. And I'm passing a fragment, which basically I'm creating a fragment object here, getting it from the content field, and then passing, the entire heavy lifting is done by the actual highlighter object,
33:27
which is part of Lucene as well. It is a contribution to Lucene. It's under the contribs folder. Again, it is basically part of Lucene itself. And I'm just passing it along. Now, we had other functionality here,
33:40
like we had the autocomplete one. Now, how do I go about doing an autocomplete? Basically, I can do something like this. I can go and say, please give me, use this table here, that table of all the terms,
34:00
and just look up, I'm starting to type, please give me terms that are relevant to what I'm typing. So I can do that. I can go there and change it. I can tell it get terms. And the get terms will basically do that.
34:20
I'll create a prefix term enum. This is something internal to Lucene. Please give me all the terms from that dictionary, on that field, that start with a certain prefix. So it is going to do that.
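A rough sketch of that term-dictionary walk in Lucene.NET 3.x might look like the following; the member names are assumed to mirror the Java TermEnum API, so treat them as illustrative. The alternative described next in the talk — running a real prefix search and returning the top-scored titles — reuses the normal search API instead:

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;

static class AutoComplete
{
    // Walk the term dictionary and collect terms in a field that start with a prefix.
    // A rough sketch against the Lucene.NET 3.x TermEnum API; error handling omitted.
    public static IEnumerable<string> SuggestTerms(IndexReader reader, string field, string prefix, int max)
    {
        var suggestions = new List<string>();
        using (TermEnum termEnum = reader.Terms(new Term(field, prefix)))   // positioned at the first term >= prefix
        {
            do
            {
                Term term = termEnum.Term;
                if (term == null || term.Field != field || !term.Text.StartsWith(prefix))
                    break;                                                  // we left the prefix range, stop
                suggestions.Add(term.Text);
            } while (suggestions.Count < max && termEnum.Next());
        }
        return suggestions;
    }
}
```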
34:43
But I might be getting incorrect results, or results that don't mean much to me. Because I indexed Wikipedia, and Wikipedia has a lot of garbage in it, again, especially when your parser really sucks, I will be able to get a lot of data
35:02
that is not really relevant to me. Nobody was looking for anything like that. To get relevant results, I really have to go really deep into the word, and I don't really want that. So what I can do, instead of doing that, I'm going to do get terms scored,
35:23
which is something, again, my implementation basically. What it does is a full-blown search. Because Lucene searches are that cheap to make, especially after a couple of searches that the cache has built up, and the index searcher warmed up, I can just perform the search.
35:41
And once I do that, sorry, I need to compile. So once I do that, I get really relevant results. And that's what we saw before. So imagine that. You have an e-commerce website, and the user starts typing in a couple of letters.
36:03
Starting from the second letter, you can actually give him actual products. Now, because we are performing an actual search, I can actually suggest products that are more relevant to him. I can play with the scoring and score documents, score products, which are more popular.
36:21
I can do a lot of hacks in here that will really maximize the user experience here. The same thing goes to the more like this stuff. Again, a very simple method. I only call this.
36:45
And what it does, it creates a more like this object, which, again, is part of Lucene, part of Lucene contributions. And I just tell it what analyzer I used because it does make some difference. As we said, different analyzer
37:00
will give me different results sometimes. And I'm setting a minimum word length, some words I want to ignore, especially short words. What it is going to do, it is going to take the document ID I gave it, it is going to load all the terms of that document into the memory, order them by score.
37:21
That is, the most frequent terms are going to be left in memory. The rest I'm going to ignore. And those terms are going to be used to create a large new query. And that query with all of the frequent terms is going to be sent to Lucene to find matching documents, relevant documents.
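A sketch of wiring up the MoreLikeThis contrib as described. The member names here follow the Java contrib and are only assumed to match the .NET port — the port may expose some of them as Set* methods instead of properties, so check the actual contrib API:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Similar;   // the MoreLikeThis contrib

static class RelatedContent
{
    // Build a "more like this" query from an already indexed document and run it.
    public static TopDocs FindSimilar(IndexSearcher searcher, IndexReader reader,
                                      Analyzer analyzer, int docId, int max)
    {
        var mlt = new MoreLikeThis(reader)
        {
            Analyzer = analyzer,   // must match the analyzer used at indexing time
            MinWordLen = 3         // ignore very short words
        };
        mlt.SetFieldNames(new[] { "content" });

        // Loads the document's terms, keeps the most significant ones,
        // and turns them into one large OR query.
        Query likeThisQuery = mlt.Like(docId);
        return searcher.Search(likeThisQuery, max);
    }
}
```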
37:42
That basically makes sense if you think of it because if I have one document with a lot of frequent terms, all the relevant documents will have those frequent terms as well. This is how it works. Apparently, I discovered a bug because as you saw, it took it quite a while to run.
38:02
So, I don't know, it might be worth looking at, but basically it works very nicely. Or we can always store caches for that. So let's talk a bit about the analysis process.
38:22
The way Lucene works, let's consider the colors. Yellow is the actual data that we want to index. Blue is what we need to write. Green is everything Lucene does for us. I'm only talking about the basic operations now.
38:43
So I have data. I want to index it. I'm throwing it into the gather and parse stage, which is basically the index writer. I have that index writer, and I'm... Sorry, that's even before. I'm gathering it.
39:00
I have, for example, if I'm working against a Wikipedia dump, I'm going to have to extract the XML or work with the actual parsed chunks, compressed chunks, and then I'm going to have to parse it because again, I have some... Each format has some things that I want to ignore.
39:21
For example, if I'm indexing HTML, I want to remove the HTML tags because the actual words mean nothing to me, the actual HTML entities. So I parse it out. I have a stream of pure text now. That text I'm going to pass to a Lucene document. I'm going to create a Lucene document,
39:41
create fields for that, and throw that document later into the writer. Before I'm doing that, I'm passing it through the analysis chain, which is basically the analyzers, which we are going to see in a second. The same thing I'm going to do for my application. I have a query parser. That query parser is going to pass the query string to the analysis chain
40:03
because, again, I want to make sure the terms match. Analysis chain is being used by both sides. After I have that, all I have to do is go search or write a document, and then I have the Lucene index.
40:20
These are the analyzers we have available from Lucene. If I'm going to index this sentence here, I'm going to see different behavior by different analyzers. Let's look down below. We have the keyword analyzer. Keyword analyzer basically does nothing.
40:41
It just takes the entire sentence, the entire data stream that I have, and it will throw it as is into the index. We will have one long term. Sometimes it is useful. We use that, for example, in RavenDB as the basic way of work,
41:00
but usually you would want to tokenize. All the other analyzers tokenize, but every one of them is doing this a bit differently. We have the white space analyzer. The white space analyzer is going to tokenize on white spaces only.
41:20
It is going to leave dots. It is going to leave commas. It is going to leave anything that is not a white space, and it is going to only tokenize on a white space and do nothing about it, nothing else except from that. For example, it will not lowercase. It will not try to do anything smart,
41:41
no normalization, nothing like that. Then we have the simple analyzer. Simple analyzer basically will only leave words, only leave alphabetical words. It is going to remove numbers; it is going to tokenize on the numbers, basically. It is going to kill anything that is not an alphabetical character.
42:03
So, for example, this email address here is now broken up to discrete terms because this is not an alphabetical character. Then we have the stop analyzer, which basically does exactly the same, but now it also removes all the stop words.
42:20
For example, in English, a stop word is the word the. Stop words, by the way, are very language dependent. For example, I can tell you that in Hebrew there is no definite list of stop words. In English, there is. I am not quite sure about other languages. I have been told the Norwegian language
42:41
does have a very definitive list of stop words, but I wouldn't really know. And then we have the standard analyzer. Standard analyzer basically has some smarts. It is going to kick out the stop words. It is going to lowercase.
43:02
But it is going to keep numbers. It is going to keep email addresses intact. IP addresses it is going to keep intact. And it is going to remove the possessive S. It is not exactly an S term because an S term will usually remove the pluralized form S, but it is going to remove the possessive S
43:21
because the possessive S doesn't really change much of a meaning. And these are the analyzers we can actually use. And then we have someone who asked you about the query parser syntax. Basically, these are some examples for that for some of the syntax that we have available to us.
43:41
Just like you know it from Google, it is not here, but we have the AND and OR reserved words, basically, which are operators. They are just like putting pluses or minuses before the word. I can limit the search to a certain field.
44:03
By default, each word is going to be limited anyway to the default field. But I can tell it, okay, please use another field for this word. Only search it on a certain field. I will have to use the multi-field query parser if I wanted to perform one search on a couple of fields. That is basically how I have done the Wikipedia search.
44:24
And then we have wildcard queries. We have phrase queries. We have fuzzy phrase queries. We have ranges. We have just about anything. Okay.
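As a quick illustration of that syntax, here are a few query strings run through the standard QueryParser; the field names are made up for the example:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Version = Lucene.Net.Util.Version;

var parser = new QueryParser(Version.LUCENE_30, "content", new StandardAnalyzer(Version.LUCENE_30));

// Boolean operators, equivalent to the +/- prefixes:
var q1 = parser.Parse("oslo AND spectrum");          // both words required
var q2 = parser.Parse("+oslo -bergen");              // must contain oslo, must not contain bergen

// Limit a word to a specific field (everything else goes to the default "content" field):
var q3 = parser.Parse("title:oslo fjord");

// Wildcards, phrases, fuzzy and proximity, ranges:
var q4 = parser.Parse("osl*");                       // prefix/wildcard query
var q5 = parser.Parse("\"oslo spectrum\"~3");        // phrase with up to 3 words between the terms
var q6 = parser.Parse("bergen~0.7");                 // fuzzy match on the term
var q7 = parser.Parse("year:[1990 TO 2012]");        // range query on a field
```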
44:40
I can either talk a bit about some other hacks, or I can show you the spatial search, which we want. Spatial. Okay. My baby. I'm not going to go into too much detail on how it's implemented. That's really an entire session. But I'm going to show a very cool demo.
45:06
So the problem with spatial searches is that search engines are not really built for them. It is some sort of a hack.
45:21
Basically, it takes longitude and latitude coordinates, and it translates them into something we can use a search engine to look up. It uses a lot of hacks. There are a couple of different implementations you can do that with. The old Lucene way of doing that,
45:42
is in there up until today. I mean, all the current versions of Lucene have that. They use an implementation that, because the way of working is so hacky, was really faulty a lot of times when you were searching for a location. Oh, perhaps I need to say a couple of words about
46:01
what spatial search is. Sorry. Geospatial is basically taking a couple of points on Earth and indexing them, throwing them into the index, and then telling Lucene, please give me... I'm giving you a point. Please give me all the close locations
46:21
you have in the index in a certain radius. Now, basically, this is like point-to-point way of searching, but because of the way geospatial searches work, you can actually make that work with polygons. You can actually make that work with rectangles,
46:40
with circles, with whatever shape you want. But because the previous Lucene implementation wasn't that good, some good people have been working on it for about a year now, and they created a new module, and that module works really great. Two weeks ago, I committed that module into the Lucene.net project,
47:02
and this is what I'm going to show you now. This is a console application, which basically indexes a few locations. This is London, New York, stuff like that. These are the real coordinates,
47:20
and I have the coordinate of this location here, the Oslo Spectrum. So I have the location here, and I'm making a query with a circle; that is, I'm telling it, okay, give me all the locations that you have in your index that are within this circle. This circle, the origin point of the circle,
47:42
is going to be the Oslo Spectrum, which is defined here, with a radius of 100 kilometers. Okay, the kilometer stuff is being defined up here. Now, it is going to work
48:01
and tell me that the location Oslo is within 100 kilometers of the Oslo Spectrum. Okay, I have London, Manhattan, New York, which are quite far away, and then I have Oslo and Bergen. Oslo and Bergen basically are quite near to each other,
48:22
okay, compared to the other locations, but I was using 100 kilometers radius. In my search here, I created a circle with a radius of 100 kilometers. Bergen, I checked it yesterday, the distance between Bergen and Oslo is about 600 kilometers,
48:40
so if I change it to 600, I will find Bergen as well. The API is really nice, really simple to work with. To add a point, I basically have a spatial context,
49:03
which is defined to use kilometers, and I'm telling it, okay, make a point out of these coordinates. That is going to return the point shape, and then I'm telling the strategy object I have, please create fields using that shape,
49:21
and please create a field, create the index fields out of it, and then I'm going to add all these fields that I get to the document, and I'm going to add the document to the writer, and that writer is going to index the locations. The search, we already looked at the search, but basically I'm just creating a query object,
49:44
a Lucene query object, using the strategy again. There are a couple of available strategies. This one is the one that is, I think it's the, yeah, the recursive prefix tree strategy. That's the recommended strategy to use.
50:00
And then you make a query and give it a shape, and tell Lucene, please give me all the locations that you have indexed that are within that shape. The operator I'm using here is within. I have different operations, different available operations. I can tell it, please give me a shape
50:21
that intersects with the shape I'm giving you now, contains and stuff like that, and I am using is within here. That's about it, if you have any questions. Yeah.
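A heavily hedged sketch of what that spatial indexing and querying might look like with the newly contributed module. The type and member names are assumed to mirror the Java spatial module and Spatial4j (ported as Spatial4n), and the coordinates, field names and radius are illustrative, so treat this as a rough outline rather than exact API:

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Search;
using Lucene.Net.Spatial.Prefix;
using Lucene.Net.Spatial.Prefix.Tree;
using Lucene.Net.Spatial.Queries;
using Spatial4n.Core.Context;
using Spatial4n.Core.Distance;

// The spatial context knows how to build shapes (points, circles, rectangles...),
// and the strategy turns shapes into index fields and queries.
var ctx = SpatialContext.GEO;
var grid = new GeohashPrefixTree(ctx, 11);                        // precision of the prefix tree
var strategy = new RecursivePrefixTreeStrategy(grid, "location"); // "location" is the index field name

// Indexing a point: the strategy produces one or more fields for the shape.
Document MakeLocationDoc(string name, double lon, double lat)
{
    var doc = new Document();
    doc.Add(new Field("name", name, Field.Store.YES, Field.Index.NOT_ANALYZED));
    foreach (var field in strategy.CreateIndexableFields(ctx.MakePoint(lon, lat)))
        doc.Add(field);
    return doc;
}

// Querying: everything indexed within a circle of the given radius around a point.
Query MakeWithinQuery(double lon, double lat, double radiusKm)
{
    var circle = ctx.MakeCircle(lon, lat,
        DistanceUtils.Dist2Degrees(radiusKm, DistanceUtils.EARTH_MEAN_RADIUS_KM));
    var args = new SpatialArgs(SpatialOperation.IsWithin, circle);
    return strategy.MakeQuery(args);
}
```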
50:46
Looking at the index, you want to see how the different analyzers will work, or you're given an index, you want to see what analyzer was used.
51:06
You can just do that in memory. You can just create a RAM directory and use the writer. That's the easiest way to do that. Basically, you could use the analyzers directly, but the API is a bit more difficult to work with than just using a writer and looking at the index later.
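A small sketch of what "using the analyzers directly" looks like in Lucene.NET 3.x: run a string through an analyzer's token stream and print the terms it emits. The field name is arbitrary, and the attribute name is the 3.x one (later versions use a different attribute interface):

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

static class AnalyzerDemo
{
    // Print the terms an analyzer produces for a piece of text. Handy for comparing
    // how the different analyzers tokenize and normalize the same sentence.
    public static void PrintTokens(Analyzer analyzer, string text)
    {
        TokenStream stream = analyzer.TokenStream("field", new StringReader(text));
        var termAttr = stream.AddAttribute<ITermAttribute>();
        while (stream.IncrementToken())
            Console.WriteLine(termAttr.Term);
    }

    public static void Main()
    {
        var sentence = "The quick brown foxes, e-mail: fox@example.com";
        PrintTokens(new WhitespaceAnalyzer(), sentence);
        PrintTokens(new StandardAnalyzer(Version.LUCENE_30), sentence);
    }
}
```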
51:22
You have a tool called Luke, L-U-K-E, and that tool can basically, you can point it to an index, and it will just show you the list of the terms. Okay? Any other questions, guys? Yeah.
51:42
Yeah. That's the point. That's the entire point of the new library, the new module. Basically, I don't care what I'm indexing here. I have a shape. Okay? And the context, I'm just telling it, okay, make a point, but you can make a circle,
52:02
you can make a rect, you can use libraries that support polygons. Basically, the Java implementation has something that's called JTS. I don't remember what the acronyms are, but JTS basically lets you use polygons very well, and the project, the actual,
52:20
it is actually a combination of two libraries here, the spatial thing. It has the spatial module for Lucene. The spatial module for Lucene uses what's called Spatial4n, or in the Java version, Spatial4j. Okay, Spatial4j is the library made by those people that I mentioned, David Smiley and Ryan something,
52:40
and they created it, and part of it was the Lucene part. They separated that part out to contribute the Lucene part to Lucene itself, but they left the Spatial4j project outside of it on purpose. That part is using JTS to provide polygon support.
53:00
We still don't have that on .NET. I'll be happy to assist if you want to port it. Yeah, several what? Different indexes on different nodes.
53:22
Yeah, the question was, is it easy to scale? The answer is yes, but you basically want to use Solr, because there is something that's called Solr Cloud now. It's Java, but Solr basically operates on HTTP. You can just talk to it with an HTTP API.
53:42
You would want to use something that is ready for you. There are some directory implementations that support the cloud. I don't think, don't take my word on it, but I don't think the Lucene .NET will be able to scale easily. You have to do some work on that.
54:01
The Java version, if you just run Solr, it will just run fine. Solr is S-O-L-R. Any other questions? Yes, there is a Norwegian analyzer.
54:24
I think it is within Lucene. I don't think it was ported to .NET, but it is going to be fairly easy to just port it. The Java and .NET versions are really similar, and the Lucene.NET API is changing a bit now, but it is going to be very trivial to port it.
54:46
I'm going to upload this code to GitHub. My username there is synhershko. Just the same, sorry, wrong slides.
55:02
My username is just like my Twitter username. Feel free to ping me if you need anything, and the source will be there in a couple of hours. I'm done, right? Five minutes for more questions, if you guys have anything. Six.
55:22
Okay, thank you, guys.