Bleve - Text indexing for Go
Formal Metadata
Title: Bleve - Text indexing for Go
Number of parts: 150
License: CC Attribution 2.0 Belgium: You may use, modify, reproduce, distribute, and make publicly available the work or its content, in unchanged or modified form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/34401 (DOI)
Production year: 2015
Language: English
Transcript: English (automatically generated)
00:05
Okay, so our next talk is Marty Schoch. He's going to be talking about Bleve, which is a pretty cool open-source search library written in Go.
00:25
Good afternoon, everybody. My name is Marty Schoch, and today I'm going to talk about Bleve, which is a text indexing library for Go. [Audience question: are you the author of Bleve?] I'm the author, yeah. We're trying to make a real community of it, though, so I don't want it to just be me, and I'll sort of hit that theme a couple of times later on.
00:41
Some of you, let me get the tabs out of the way, some of you might be wondering: Bleve, what is this, and how do you pronounce it? I'd say it's 50-50 how the people I run into pronounce it. I pronounce it one way, but however you want to pronounce it is fine with me. So, who am I? I work for a company called Couchbase. We make a distributed NoSQL database.
01:04
We do have an official Go SDK out now, but I'm really not going to talk too much about Couchbase today. I just want to highlight, you know, very much like MongoDB, we're using Go a lot internally, and just to give you a sense of that, the N1QL query language, secondary indexing, and cross-data center replication are three of the biggest features in Couchbase,
01:22
and those are all either being written in Go from scratch now or being rewritten in Go. So Go is a big part of the future of Couchbase. But I'm not going to talk about Couchbase today. If you want to ask me any questions, just grab me afterwards. The first question I almost always get about Bleve is why. There's already Lucene, Elasticsearch, Solr, there's an ecosystem, they're all really good.
01:44
And I'll be the first to admit, I think these are all awesome products. A lot of the inspiration for Bleve has come from looking at the way Lucene works. And so, if you're already using Java, the JVM, and you're happy with Lucene, by all means, keep using them. They're great.
02:00
But sometimes you don't already have a JVM in your architecture, and adding that to your architecture is maybe a more heavyweight thing than you're interested in doing. So we really started asking ourselves the question, could we build 50% of Lucene's text analysis, pair that with some off-the-shelf KV stores, and maybe something interesting comes out of that. And so that's really the experiment we pursued.
02:24
That led to these sort of four core ideas. One, there's this text analysis pipeline. And the idea is, you know, we just have to build the most important pieces first. If we get the interfaces right, users will come along and say, hey, we need this other, you know, stemmer or something. And if we get the interfaces right, it's easy for them to contribute that, and then add that to the ecosystem.
02:45
The next part was this idea of pluggable KV storage. It meant we didn't have to start off by writing some binary file format to get, you know, sort of squeeze out the maximum performance up front. There's a lot of interesting ones out there right now. Like I said, we have plugins for BoltDB, LevelDB, and ForestDB.
03:01
Looking into adding one for RocksDB as well. And on the one hand, this lets users sort of choose whichever one meets their needs best. But it also has some interesting properties that, you know, some of the KV stores work better for different use cases. So if you have a very read-heavy or search-heavy use case, or if you have a very, you know, real-time indexing use case, you may end up wanting to choose a different KV store.
03:24
So that's something that sort of fell out for free from what we were building. And finally, the idea was, if we can make term search work, almost all the more complicated searches you want to run later on sort of build on top of that term search. So we could, by building a small amount of functionality, get something up and running quickly.
03:42
Now, before we go too much further, it's helpful just to get all on the same page in terms of what is search. This is obviously what most people think of first, when you say search. A box where I can just type in what I want, and I hopefully get what I want back. Sometimes it also means a more advanced search. You maybe want to be able to describe phrases, maybe you want to restrict things to certain fields.
04:03
Maybe you want to also search, sort of augment your text search with numerical or date searches as well. And then when we look at the search results, these days people expect spelling suggestions if there's some sort of misspelling. It's helpful for the user to see little snippets of document content coming back with their results.
04:23
That helps the user understand the context of why this result was returned. And taking that a step further, even highlighting the search terms inside of the snippet. It's a very powerful feature for users to understand why did this document match. And another really popular feature that's sort of come out of the Lucene and Elasticsearch world is this notion of faceted search.
04:44
So here you see me doing a search for Golang books on a major retailer. And on the left hand side you see it's broken down by categories. And then in parentheses are counts, telling you how many books are in that particular category. And in a sort of well-designed site this allows you to do faceted navigation
05:02
where just by clicking on those you sort of drill deeper and deeper into categories. So we're going to look at that capability as well. So enough with the high level stuff, let's look at some code to get started. One of the earliest decisions we made was we wanted the Bleve library to be go-gettable. So the good news is, to get started you can just do a go get of our GitHub repo.
05:24
And if you add the dot dot dot you'll also get some command line utilities we built installed along with the package as well. Once you've gotten the package, again with that same theme of making it easy for users to use, we wanted it so that for the simplest use case of the system, you just have to import a single package.
05:41
Which you see here, again it's the same GitHub repo. Now behind the scenes Bleve has many other packages that the functionality is layered through. And so more advanced users or people extending the functionality would maybe need to use more packages. But for simple use cases the single package is all you need.
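For reference, a minimal sketch of those two steps with the current public repository path (the blank reference at the end is only there so the fragment compiles on its own):

```go
// go get github.com/blevesearch/bleve/...
//
// The trailing "..." also builds the bundled command-line utilities.

package main

// For the simple use cases in this talk a single import is enough; more
// advanced users reach into the sub-packages (analysis, mapping, search, ...).
import "github.com/blevesearch/bleve"

// Blank reference so this fragment compiles on its own; real programs call
// into the package as shown in the later examples.
var _ = bleve.NewIndexMapping
```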
06:01
Once we've imported the package we're going to look at the data model. So again we're building a really simple example here. So I've just defined a structure named person and that has a field called name of type string. So the very simplest thing we could work with. Bleve is internally going to use reflection to sort of discover what's going on and try and make the most sense out of your object. This is again just a very simple structure to get us going.
06:23
Now the next step is creating what we call a mapping. The mapping is what essentially takes your document, your data model, and turns it into what's going to be put into the index. This is where eventually you can configure a lot more details. What we have here is if we just use that new index mapping you get a default mapping.
06:40
And we've done a lot of work to make the default mapping as useful as possible. And you might wonder why do we even expose this? If it could be completely optional maybe we should have just hidden that in the API. And we went back and forth on this topic internally. Ultimately what we decided was the mapping is so important for getting high quality results that it was helpful to keep it in front of the user. So you're not ever going to forget that mapping is there, and if you're not getting the results you're looking for you may need to tweak that mapping.
07:05
And we'll see an example of that later in the slides. But for now we're just going to proceed with the default mapping. And here on line 18 we open the index or create a new index in this case. Now when we create a new index we provide the path and then a reference to the mapping that we just created.
07:20
That's going to return either an index or an error. And once we've got the index open now we're going to go ahead and create an instance of that person structure with my name, Marty Schoch. And then on line 24 we actually invoke the index method. Now the first parameter here is a string which is just sort of a unique identifier for the document you're putting in the index.
07:40
And then the second argument is that instance we just created on line 23. And again that's going to either succeed or return an error. And we can actually run this one. So we're going to run that and then again nothing exciting happens other than it gets to the end and prints out that it indexed the document. So we now have an index with a single document in there.
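Pulled together, the indexing walkthrough above looks roughly like this (a sketch with the current bleve API; the path and identifier follow the talk's simple example):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

// Person is the minimal data model from the talk: a single string field.
type Person struct {
	Name string
}

func main() {
	// The mapping turns your data model into what goes into the index; the
	// default mapping is deliberately kept visible because tweaking it is
	// how you improve result quality later.
	mapping := bleve.NewIndexMapping()

	// Create a new index on disk at the given path with that mapping.
	index, err := bleve.New("example.bleve", mapping)
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// Index one document: the first argument is the document's unique
	// identifier, the second is the value Bleve inspects via reflection.
	p := Person{Name: "Marty Schoch"}
	if err := index.Index("m1", p); err != nil {
		log.Fatal(err)
	}
	fmt.Println("indexed document m1")
}
```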
08:00
Now the next thing you might want to do is search through it, right? So here we're going to open the index using the Bleve open function. And the open function differs slightly because now you're only giving it the path. You don't need to provide the mapping because we actually serialized the mapping into the index. So we actually have that persisted, and that affects sort of how you use the index going forward, with that initial mapping you provided.
08:21
So when you open it again you're either going to get the index reference or an error. And proceeding on to line 21. Here I'm creating a query object. This is the simplest kind of query possible. It's called a term query. The term query doesn't do anything fancy. It looks for an exact match of the term you specify in the index. It's a lot of times not very useful on its own.
08:42
But it's the simplest one to demonstrate here. So the query, it's helpful to think of the query as describing what we're looking for. So in the next line, the request is describing how we want to get it. So this is where we can control how many results we want to bring back. Do we want to skip over any of the results? Do we maybe want to pull back fields that we stored as well?
09:01
That kind of stuff will be added to the request. Here we're just, again, creating a default request with none of the other options specified. And of course now we can run it. We do that by using the search method and passing in the request. And we can go ahead and run that. Now again, the document we indexed had my name, Marty Schoch.
09:21
That's going to match the term Marty. And that corresponds with the output we see here. We see that there was one match. There's only one document total. M1 is that identifier we provided. So that's the identifier being returned to us. And the number you see on the right there is the score. So we're doing TF-IDF scoring. We certainly would like that to be more configurable in the future. But this is certainly the baseline that most people would expect
09:43
getting started with search. So that really is just how easy it is to use. I really want to emphasize, in about 20 lines of code, we were able to create an index. And then another 20 lines of code, we were able to go ahead and search through that index. But you'll probably want to see a more realistic example.
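And the search side of that, sketched with the current bleve API (it reuses the index and identifier from the example above):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// Open only needs the path: the mapping was serialized into the
	// index when it was created.
	index, err := bleve.Open("example.bleve")
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// A term query looks for an exact match of an already-analyzed term;
	// the standard analyzer lowercased "Marty" to "marty" at index time.
	query := bleve.NewTermQuery("marty")

	// The query describes *what* we're looking for, the request describes
	// *how* to return it (size, skip, stored fields, highlighting, ...).
	request := bleve.NewSearchRequest(query)

	result, err := index.Search(request)
	if err != nil {
		log.Fatal(err)
	}
	// Prints the total hits plus each document id and its TF-IDF score.
	fmt.Println(result)
}
```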
10:01
And I was trying to find a good data set to work with. And I wanted something that the audience would be familiar with. I don't have time to teach everyone some new domain. And I came across the FOSDEM schedule of events, which they happen to publish in a variety of formats, and one of them, iCal, was pretty easy to parse. And so here's a sample record describing this talk. There are actually 550 or so of these on the full feed.
10:25
So I decided to go ahead and use this for our example data set. I have some parsing code that's not really interesting to today's talk, that returns essentially events in the structure you see here. So it has a few more fields than we saw in the previous one. Most of them are strings.
10:41
I'll just highlight two in addition. One uses the time.Time structure. And the final one is a duration, where we store the duration of a talk in minutes, and that's in a float64. Now we also have the JSON struct tags. That's just sort of a convenience here. Bleve does understand those tags and will allow us to refer to those fields using the lowercase name.
11:01
It's just sort of a preference to simplify things here. [Audience question, about whether a time.Duration field could be used directly.] It's a good question. I'm not exactly sure. I would say probably not. So the way numbers and dates are handled is a little bit tricky. It uses a different encoding scheme inside the index for those fields.
11:21
So most likely we could support time.Duration if we just have the right checks in place. I mentioned that part where we're using reflection to sort of discover what the documents are and how to handle those. That's something we could probably augment to do a better job for time.Duration. So here I'm going to go ahead and index the documents. Again, I'm doing it a little differently now.
11:40
I'm going to index in batches. So again, on line 38 you see there's this parse events function. Again, we don't need to worry about the details of that. That's just a channel that's returning pointers to these events as I'm creating them. I create a new batch and then I keep adding them. So on line 39 I'm adding a document to a batch. Every time I get 100 in a batch I go ahead and execute the whole batch.
12:01
A little bit of boilerplate at the end to clean up the last batch, which is not a full 100. And we can go ahead and run this. Again, when you're indexing larger amounts of documents you can speed things up by putting them in batches like this as well. So that indexed 550 events. And now we should be able to do a lot of interesting searches on this data because there's a lot of good text in that data set.
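A sketch of the event structure and the batching loop just described (the field names, the UID used as the document identifier, and the batch size of 100 are illustrative; the iCal parsing itself is omitted):

```go
package fosdemindex

import (
	"time"

	"github.com/blevesearch/bleve"
)

// Event approximates the structure parsed from the FOSDEM iCal feed.
// The JSON struct tags let queries refer to the lowercase field names.
type Event struct {
	UID         string    `json:"uid"`
	Summary     string    `json:"summary"`
	Description string    `json:"description"`
	Speaker     string    `json:"speaker"`
	Room        string    `json:"room"`
	Start       time.Time `json:"start"`
	Duration    float64   `json:"duration"` // talk length in minutes
}

// indexEvents drains a channel of parsed events and indexes them in
// batches of 100, which is much faster than one Index call per document.
func indexEvents(index bleve.Index, events <-chan *Event) error {
	batch := index.NewBatch()
	for ev := range events {
		if err := batch.Index(ev.UID, ev); err != nil {
			return err
		}
		if batch.Size() >= 100 {
			if err := index.Batch(batch); err != nil {
				return err
			}
			batch = index.NewBatch()
		}
	}
	// A little boilerplate to flush the final, partially filled batch.
	if batch.Size() > 0 {
		return index.Batch(batch)
	}
	return nil
}
```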
12:23
So to start with I'm going to do a very simple term query just like the ones we saw earlier, but I'm going to search for the term bleve, which is the name of this talk, or this library. And I'm going to also add one other thing on line 20, which is I'm going to ask it to highlight with a style called HTML in the results. So this is going to highlight matching terms
12:42
in the results set. And it's probably easier to show it than it is to explain it. So here you notice I've got it highlighting, in sort of a yellow highlighter color, all the matches for that term bleve. And of course it did match this particular talk that we're in right now. So that's a really cool feature. Again, we have two different ones. One, we have like an ANSI formatting for terminals
13:03
and we also have the HTML one that you see here. But again, it's designed to be pluggable so you could map that to other things. Now, that was a simple term search. Let's do something a little bit more advanced called a phrase search. On line 18, I'm going to build out the phrase using an array of strings. So this is looking for the phrase advanced text indexing.
13:24
And on line 19, I sort of create this phrase query by passing in that array of terms. And I'm also, in this case, restricting it to the description field. So if we run this one, this should also match the one talk that we happen to be in now. And again, if I sort of scroll over here a little bit,
13:41
we do see that that phrase now is highlighted as well, since that was the match we were looking for. Now, you don't get too far before you want to be able to combine simpler queries into more complex queries. We have conjunction and disjunction. What I'm going to show here is a conjunction query combining a term search for text
14:00
and a term search for search. So it's sort of like the phrase text search, but there's no position requirement. They don't have to be side by side. They just have to both be in that particular document. And when we run this one, again, it's going to match slightly more documents now because this is sort of a more general query. So this matched four documents, all of them about text search.
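Sketches of the three query types just shown, with the current bleve API (the index path is illustrative; the terms and the description field follow the talk):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	index, err := bleve.Open("fosdem.bleve") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// 1. Term query, asking for HTML-style highlighting of the matches.
	term := bleve.NewTermQuery("bleve")
	req := bleve.NewSearchRequest(term)
	req.Highlight = bleve.NewHighlightWithStyle("html")
	res, err := index.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)

	// 2. Phrase query: the terms must appear side by side, in order,
	//    and here only the "description" field is searched.
	phrase := bleve.NewPhraseQuery(
		[]string{"advanced", "text", "indexing"}, "description")
	res, err = index.Search(bleve.NewSearchRequest(phrase))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)

	// 3. Conjunction query: both terms must occur somewhere in the
	//    document, with no positional requirement.
	conj := bleve.NewConjunctionQuery(
		bleve.NewTermQuery("text"),
		bleve.NewTermQuery("search"),
	)
	res, err = index.Search(bleve.NewSearchRequest(conj))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
```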
14:22
Now, let's say we wanted to combine more queries. Maybe somebody was out in the hallway before this talk and they said, I heard about this library called bleve, something to do with text search. But they type it in as they heard it, the way an English speaker might type it in: believe. So if we run that, again, these are just all ANDed together,
14:40
That's not going to match anything because none of the documents have text and search and believe. But what if instead of a text search, or sorry, a term search for believe, what if they'd done a fuzzy search instead? If we run that, the fuzzy search is going to ultimately match the particular document. And what happens is the term believe
15:01
doesn't have any exact matches. But the fuzzy search is going to essentially find anything within a Levenshtein distance of two and match it. So in this case, bleve is a Levenshtein distance of two away from believe. And so that's why it ultimately highlighted that bleve term inside the document. So again, for users that are sort of manually typing things in, the fuzzy search is often very useful.
15:35
So what should happen is the score should ultimately take into account the Levenshtein distance for the fuzziness.
15:40
So one that matched less fuzzy should score higher. I don't know that it's 100% implemented that way right now, but that would be the desired behavior. Now I mentioned earlier those two fields. One had the duration. And so what I'm going to take advantage of here is a numeric range query. So the two parameters are essentially a minimum and a maximum value.
16:02
So I defined the long talk to be 110 minutes. I think you could probably agree that would be a pretty long talk. So let's go ahead and run this. This should find any of the talks that are longer than 110 minutes. And in this case, it just matches two. And these are both exams. And we see they have a duration of 120. So that sort of makes sense
16:23
sometimes. So I have a date range query here. Again, just like before, it's a start and an end date. In this case, we're using the, I guess, RFC 3339 format of the timestamp. That's all configurable inside the library, but that's just the default here. So this is going to look for today,
16:41
5:30 and later. So that's a very late talk on Sunday. And if we run that, that's going to go ahead and match just one talk, which is the closing talk. And finally, again, I've done all the different kinds of queries, but I did want to highlight one more, which is query strings.
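Before moving on to query strings, a sketch of the fuzzy, numeric range, and date range queries just demonstrated (the index path, field names, and exact dates are illustrative; current bleve API):

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/blevesearch/bleve"
)

func main() {
	index, err := bleve.Open("fosdem.bleve") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// Fuzzy query: matches terms within an edit distance of 2, so the
	// misheard "believe" still finds documents containing "bleve".
	fuzzy := bleve.NewFuzzyQuery("believe")
	fuzzy.SetFuzziness(2)
	show(index, bleve.NewSearchRequest(fuzzy))

	// Numeric range query on the duration field: "long" talks of 110
	// minutes or more (nil means the upper bound is open).
	minDur := 110.0
	long := bleve.NewNumericRangeQuery(&minDur, nil)
	long.SetField("duration")
	show(index, bleve.NewSearchRequest(long))

	// Date range query on the start field: Sunday 17:30 onward.
	from := time.Date(2015, time.February, 1, 17, 30, 0, 0, time.UTC)
	to := time.Date(2015, time.February, 2, 0, 0, 0, 0, time.UTC)
	late := bleve.NewDateRangeQuery(from, to)
	late.SetField("start")
	show(index, bleve.NewSearchRequest(late))
}

func show(index bleve.Index, req *bleve.SearchRequest) {
	res, err := index.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
```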
17:00
Everything you've seen so far has been a very programmatic way of inquiring about your data. But sometimes you want to expose to end users something they can just type in, but still have all the power of the programmatic queries we saw earlier. So I highlight some of them here. Again, the syntax is very similar to Lucene. I would say it's not quite as complete as Lucene's, but the syntax is certainly designed
17:20
to be similar or the same. I'm using strings here to make it a little easier to read. description:text is going to say: in the field description, it has to match the term text. The plus in front is going to mean it must satisfy that particular clause. On the next one, we see text indexing in quotes,
17:41
which triggers a phrase match in this case. And again, that's also restricted to the summary field. summary:believe~2 is going to do a fuzzy search, and that two there is the edit distance. The next one is -description:lucene.
18:00
So this one's going to say it must not contain the term lucene. And then the last one we see is a numeric range query, where we have duration:>30. That's going to look for talks greater than 30 minutes. The syntax is still not as complete as Lucene's. We're missing the date one, for example, and there's a couple of things we need to improve there. But the idea is this is really useful
18:21
for exposing to end users, and we can run this one as well. And this is going to match five talks. Again, I think the first one is the Bleve talk in this case. Yeah, that has the highest score as well.
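A sketch of a combined query string along the lines of the clauses just described (the exact mix of clauses is illustrative; current bleve API):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	index, err := bleve.Open("fosdem.bleve") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// One user-typeable line covering the clauses above: a required ("+")
	// field term, a field-scoped quoted phrase, a fuzzy term ("~2"),
	// a prohibited ("-") term, and a numeric range on duration.
	q := bleve.NewQueryStringQuery(
		`+description:text summary:"text indexing" summary:believe~2 -description:lucene duration:>30`)

	res, err := index.Search(bleve.NewSearchRequest(q))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
```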
18:41
So I mentioned the mapping at the beginning of the talk and how it was important. And as you can see, we've gotten pretty far with just using the default mapping. A lot of the stuff you expected to work did work out of the box. But it's not perfect and I wanted to highlight that here
19:01
with one example: haystacks. And so let's say, what if we wanted to search for just haystack? Is it going to find it? Let's try it. It's not going to find it right now. Anyone have a guess as to why it didn't find it? Right.
19:21
So fuzzy would be one approach that would also accomplish the same goal. But ultimately, what I want to focus on in this slide is changing the mapping to get better results here. So as human beings, we're able to look at all this text and recognize it as English, and working on it as English text gives us essentially a higher recall in our search results.
19:42
And what we're going to do is we're going to take advantage of that in this mapping and essentially tell it, hey, this field's in English. So on line 28, again, this looks a little complex, but we're not going to go through all of it. On line 28, I say, hey, I want to use an analyzer called EN, which is the name that the English analyzer is registered under. And then on lines 31 and 32,
20:07
there's a couple other tweaks to the mapping. But if we go ahead and run this, this is going to essentially re-index all the 550 data events and produce a new index using this updated mapping.
20:20
And so all of that is in a file named custom.bleve. So all of the subsequent examples are going to switch to that now. And so now if we use this custom mapping, if we run that same search for haystack again, that is going to match. And if you notice, even though we searched for haystack, it highlighted the whole term haystacks. And the reason is the English analyzer
20:40
does stemming on the input, which was able to essentially stem Haystacks just to a single term Haystack, which ultimately matched in the search. Normally when you're doing search results, you're actually doing the same analysis on the input and on the document fields. And that way, when you're searching, the terms are going to line up inside the index.
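A sketch of the kind of mapping change described above, with the current bleve API (which fields get the English treatment is illustrative; "en" is the name the English analyzer is registered under):

```go
package main

import (
	"log"

	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/analysis/lang/en"
)

func main() {
	// A text field mapping that runs the English analyzer: tokenizing,
	// lowercasing, stop-word removal, and stemming, so "haystacks" and
	// "haystack" end up as the same term in the index.
	englishText := bleve.NewTextFieldMapping()
	englishText.Analyzer = en.AnalyzerName // "en"

	// Attach it to the fields that hold English prose.
	eventMapping := bleve.NewDocumentMapping()
	eventMapping.AddFieldMappingsAt("summary", englishText)
	eventMapping.AddFieldMappingsAt("description", englishText)

	indexMapping := bleve.NewIndexMapping()
	indexMapping.DefaultMapping = eventMapping

	// Changing the mapping means re-indexing: create a fresh index and
	// feed all 550 events through it again (batching as shown earlier).
	index, err := bleve.New("custom.bleve", indexMapping)
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()
}
```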
21:00
[Audience question about how many languages have analyzers.] I don't have the count in front of me. It's like 13 or so. It's basically the ones where there was an available stemmer. And some of them aren't complicated. So it's an area where, if you have a particular language you're interested in, let me know. Between looking at what Lucene does and everything, you can pretty easily put something together. It's not that complicated.
21:25
Possibly. I mentioned earlier, we wanted it to be go-gettable. So we tried to have as much pure Go stuff as we could to get that great out-of-the-box experience. For some of that functionality, we'd be relying on things we don't have in pure Go yet.
21:40
So we'd be open to optional add-ons for some of that capability, but we probably wouldn't want it in the core for that reason. So there are different strategies to handling mixed languages. Again, same basic thing you run into in Lucene.
22:00
Sometimes you create separate fields for each language, and you index the text into the field for its language, and then on the search side you search the right field. That's one option. We do have a plug-in for CLD2, a language detector library from Google.
22:21
It's a little bit of a dead end. We had ideas that we could figure out the language and then just do the right thing. And we may be exploring that in the future. But we ultimately found that's not quite as simple as it sounds. What happens is it maybe works fine on documents when you're indexing them, but at the search time,
22:42
you really have to guess what language that is then. And so I would say that's an area we're still exploring. We're open to ideas. But it's not quite as simple as magic, really. Now I do want to mention this. Getting this analysis right is like the most important thing. And so with that in mind, we created a tool we call the Bleve Text Analysis Wizard.
23:02
So this is just, and it's hosted on the internet, so you can just go right to it. You don't have to actually install it. Here I've typed in "bleve indexes the text quickly", just as an example. And I'm gonna start with an analyzer called the Keyword Analyzer.
23:21
The Keyword Analyzer is really the simplest one because it doesn't do any analysis at all. It treats the entire thing as a single token. And so you notice that whole phrase, even though it has spaces that look like words, that's all indexed as a single token. That's useful for things that are
23:42
The next more complicated one is what's called the Simple Analyzer. This, you notice, gives us five separate tokens. And really the only thing it did, in addition to tokenizing it into separate words, is that it also lowercased: the first term, Bleve, became lowercase bleve. The next more complicated one is something called Standard.
24:01
And Standard was the default that we were using earlier, when we just had the default mapping. It doesn't do any really English-specific things, like stemming would require. You could argue that the stopword list is English-biased as well, but the stemming itself
24:21
is very language specific. So we have a special analyzer called EN, and this is gonna do the additional steps. So now, in addition to removing the stopword, you notice indexes became index, and quickly with a y became quickli with an i.
24:42
So the thing is that they match up, so that they stem to the same thing on input and on output. So this is a tool, again, if you're not getting good enough results, this is a great tool for figuring out how you can tweak your mapping to get better results. Now, that being said, it's a good point to remind everyone that when you're doing a search,
25:12
you may be making it incredibly worse for other searches that you're not running at the moment. So just sort of keep this in the back of your mind, that it's not this easy thing to just sort of hone in on the right behavior.
25:23
Now, I also mentioned faceted search at the beginning of the talk, so I wanted to demonstrate that capability as well. In that data structure we got, there was a field with the category of each talk, and I set up a term facet on that field,
25:44
and so the results are going to give us the name of the category and then the count of how many would have been in that category. The search itself, you notice on line 20 setting the size to zero, that's basically saying, I'm going to run a search that matches everything, which is that new match all query, but I'm not actually interested in the search results themselves
26:01
because it's every document in sort of an arbitrary order. So don't give me any results back. So you can see, you know, Lightning talks with 41, Java 26,
26:22
if we scroll down a little bit, Go has nine. So again, this is sort of just looking at the raw output, but you'll see later how we can use this to build some cool UI. And I also want to mention, we do have some optional HTTP handlers. They're in a separate package. Basically, I would say, all of the major operations
26:47
are going to be JSON, and that allows it to just use the normal JSON serialization and deserialization to put those into the index. We do have a sample app that does exactly that. It's called Bleve Explorer. And again, that gives you sort of a point and click GUI
27:00
to sort of play around with it. So if you play around for the first time, this is a great way to sort of get started in a more visual way. And this is really, jump to the right tab,
27:22
this is really all UI work. I mean, and it mainly took a day because I'm not that good on front end stuff. So this is like an AngularJS app. It's using the HTTP APIs behind the scenes, and it's using the full data set that we've seen earlier. So we could search for something like, I don't know,
27:40
open source kernel, and if we run that, that matched 171 results. As you can see, we have the title of the talk, the author, those are both links that take you back to the schedule. We have the start time, the duration, the room, and then the description. In the little bubble on the right, you see the score.
28:01
And the cool thing is, we've added this refine results section on the right hand side, which will allow us to sort of drill into the data. So let's say we wanted to talk later today, so I'm gonna check this Sunday box, which should have 89 results, and you notice that did scope us down to then only 89 results. Maybe I have sort of a short attention span, and so I want to talk less than 30 minutes.
28:21
So that took us down to 63 results. And maybe, I don't know, software-defined radio is a lot of work. This is mainly just throwing some HTML and JavaScript in front of Bleve and its HTTP handlers, and you can pretty quickly
28:41
get some pretty cool functionality. [Audience question about updating the index after it's built.] You can update the index. So what you can't do is change the mapping. So you commit to your mapping, and it's very much like other products in that way. So the version that's online, this is one that's just
29:07
I mean, index updating is very similar to the way you initially insert it. [Audience question about how the facet buckets are defined.] So at query time, you can define what the buckets are.
29:21
I don't have any code that necessarily shows that, but actually, I think let me jump back to the presentation here. The facet you saw here, let me just close this. Okay, so I guess the only facet I have here is the term facet, which is the default,
29:45
so what I did in the example app was, and let me just uncheck that box so we can see that there's more than one. All right, let's just reload that. Oh, what did I break?
30:01
Okay, so you see there's two buckets here. There are actually three that I defined. So one was for short talks under 30 minutes, one was for 30 to 60, and one was 60 plus. So those are defined with a minimum and maximum, and they don't have to cover the whole range, either, so that you can have optional endpoints if you want. So it's pretty flexible
30:20
in terms of how you define those. And again, like I said, those are at query time. I should also mention, there are other approaches to doing facets, right? So this is, if you're familiar with Elasticsearch, this is more like Elasticsearch's original facets.
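A sketch of both kinds of facet requests described in this section (the field names and bucket boundaries are illustrative; current bleve API):

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	index, err := bleve.Open("custom.bleve") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// Match everything, but ask for zero hits back: we only want the
	// facet counts, not the (arbitrarily ordered) documents themselves.
	req := bleve.NewSearchRequest(bleve.NewMatchAllQuery())
	req.Size = 0

	// Term facet: bucket events by the value of the "category" field and
	// report the top 10 values with their counts.
	req.AddFacet("categories", bleve.NewFacetRequest("category", 10))

	// Numeric range facet on duration, defined at query time; the ranges
	// need not cover everything and the endpoints are optional.
	short, medium := 30.0, 60.0
	durations := bleve.NewFacetRequest("duration", 3)
	durations.AddNumericRange("short", nil, &short)      // under 30 minutes
	durations.AddNumericRange("medium", &short, &medium) // 30 to 60 minutes
	durations.AddNumericRange("long", &medium, nil)      // 60 minutes and up
	req.AddFacet("durations", durations)

	res, err := index.Search(req)
	if err != nil {
		log.Fatal(err)
	}
	// res.Facets holds, per facet name, the buckets and their counts.
	fmt.Println(res.Facets)
}
```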
30:46
I couldn't get away without someone asking about performance. It seems to come up all the time. What I will say is, we really focused on getting features sort of right first. Then our next focus was on getting an API that we thought was sort of
31:00
the right level of granularity for people to use. But we are finally getting to the point where we're starting to look at performance a little bit. So what I would say, and again, Go has built-in capabilities for benchmarking, and this is great for testing small blocks of code, right?
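That kind of small-block benchmark, sketched with Go's built-in testing support (the index path and the query are illustrative):

```go
package bleveperf

import (
	"testing"

	"github.com/blevesearch/bleve"
)

// BenchmarkTermSearch times one small operation so that two candidate
// implementations can be compared quantitatively with `go test -bench=.`.
func BenchmarkTermSearch(b *testing.B) {
	index, err := bleve.Open("fosdem.bleve") // illustrative path
	if err != nil {
		b.Fatal(err)
	}
	defer index.Close()

	req := bleve.NewSearchRequest(bleve.NewTermQuery("bleve"))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := index.Search(req); err != nil {
			b.Fatal(err)
		}
	}
}
```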
31:21
So someone says, I have a faster way to do facets, right? So this is great for being able to say, this is how we do it now, this is how we're thinking about doing it, and sort of quantitatively getting a sense of if it's better or not. But this is not always the best way to get at some of the things that we need to deal with. So we also created a utility called bleve-bench.
31:45
And this is done in different ways. So we do some individual inserts, then we do some batch inserts, then we run some queries. So this helps us answer questions like, is the indexing performance degrading over time as we index more and more documents? Or how does the search performance relate to the number of matching documents?
32:00
And again, this is one snapshot that I just included here. I wouldn't focus too much on the details. But it gives you a sense of the kind of data we can look at and see, okay, all right, well, again, this is just a highlight. This is a new tool. We just built this in the last couple weeks. But this is something we're gonna be using
32:20
to sort of, as we're making improvements to the library, we can sort of use this as a reference to see if we're getting better or worse, the same, and so on. And then ultimately, the million dollar question is, well, how does it necessarily compare to Lucene? I don't have anything for that today. It's really, it's not impossible, but it's difficult to get good answers to that kind of thing. This was started in that direction.
32:45
There was something where we could start to answer that question. But it's difficult to answer, other than I can tell you Lucene is faster right now. That much we're sure of. So finally, there was a question earlier about the community. This is really important to me.
33:02
Someone asked if I was the main developer for Bleve. I am. But I don't want it to be that way. I want more contributors. I want people to sort of help out and make this better. So in terms of joining or participating: on IRC, it's a small and quiet room right now, so you can pretty easily get my attention
33:21
if you need to. Google Groups we use for general discussion. If you have a use case, you're not sure how to implement it, that's the right way to sort of do that. If you're interested in planning a larger feature, right, this would be a great place to join in and talk about your ideas before you sit down and write a bunch of code. And then of course we use GitHub, right?
33:40
So this is Apache licensed, report issues, submit pull requests through GitHub. I am happy to say we do already have eight contributors other than myself. That's not a huge number, but it's not zero. And to be honest with you, they range from minor typo fixes in the README up to performance improvements and new features.
34:03
So it spans a range, but I'd encourage anyone that's interested to get involved, because we're at an early stage with this still and we need people to get involved. A little bit of a roadmap. These are some features I would say are on our radar right now.
34:21
Results sorting. Right now everything you see is always sorted by the score, which is useful for a majority of cases but not all cases. Better spelling suggestion and fuzzy search. This is an area where there's a lot of interesting ideas about finite state machines, finite state automata, that you can use to essentially trade memory for speed,
34:43
and so that's an area where we know there's a lot of cool stuff out there that we need to hook into. And then performance. I think performance is not something we're gonna solve pre 1.0 but we wanna understand the bounds and the parameters of it a little bit better so that we know where we stand and where we're trying to get to.
35:01
And then ultimately we wanna prepare for a 1.0 release. So right now the file format is sort of in flux. The API is still changing sometimes. But when we get to a 1.0 release that's where we would have backward compatibility for file formats and stabilizing the API, things like that.
35:24
In terms of other speaking engagements, I mean I'm here today because we really wanna get the word out about this library. I'm headed to India in less than a month now to speak at GopherCon India and I'll definitely be at GopherCon in Denver in July. I'll be submitting a proposal. I don't know if that'll go through or not
35:41
but I'll be there one way or another. So if you're there and you wanna meet up and talk about this stuff, definitely let me know. And if you have other meetups or stuff where you want us to talk or someone else in the community to talk, we're definitely interested in doing that as well. So that's all I have prepared today so I'm happy to take any questions you guys have.
36:01
Yep. [Audience question about running Bleve distributed across multiple nodes.] So what I would say is nothing out of the box handles that. So that Bleve Explorer program I mentioned, if you boot that up, it looks very much like a single-node Elasticsearch, right? So it's like there's nothing distributed at all. I work for Couchbase, right?
36:21
So we're a distributed database. So you can imagine their interests are in distributing this. And so all I can really say is we're building on top of the library. We're trying to focus on making this a useful library in general and very much like Lucene, right? And then we can sort of build our own distributed capabilities.
36:41
So that's something Couchbase is working on and I would say expect to see that in products in the future but today this library is as it is. Yep. Not right now, so I would say oh, sorry, the question was
37:00
does it support spatial search? I would say what you frequently do with Lucene is you can do a geohash on a coordinate pair or something and then you can insert that into the index and use basically the same approach we use for the numeric ranges. Use that for geo stuff as well. It's not great, right?
37:20
That works in some use cases and not in others. But that's sort of like, again, I would say low-hanging fruit. Someone in this room could probably say in half an hour, say okay, I coded it up, now we support geo. So that's the cool part. We could get that sort of low-hanging fruit but it's not a dedicated geo solution.
37:45
I'd have to check. I think it's actually decent. I'd have to double check. I know I wrote part of it because I remember copying and pasting characters from a map into the thing so I know I worked on it some. I don't know if I finished it because I think some of them
38:01
had a common ancestor that was being reused amongst a bunch of the analyzers and so I did that, figuring hey, I'll get four languages if I do this one piece here. So I did some of it but absolutely, let's talk and let's make it better. And I'd also mention, just real quick, it's surprising, we've gotten a lot,
38:21
I would say more interest from people outside of the United States than anywhere else. A lot of users in Europe are already using this, and a lot of interest in our support for Chinese and Japanese text as well. So we're gonna go in the direction the community takes it, really.
38:43
Oh, is there support for any phonetic algorithms? Not right now, but again, I would say that text analysis pipeline would be where that would fit in. So definitely interested in doing that as well. Any other questions? Last question.
39:03
Deleting items, so the question is how do we delete items from the index? So like a lot of things, we maintain a back index of all the entries that are maintained for a document, and we've got to either overwrite or delete the previous rows corresponding to that document
39:20
and then handle the new ones. But in the case of just deleting, we have a row which says these are all the rows corresponding to this document and we need to delete all of those as well. Alright, thank you all very much.