Bleve - Text indexing for Go

Video in TIB AV-Portal: Bleve - Text indexing for Go

Formal Metadata

Bleve - Text indexing for Go
Alternative Title
Go - Bleve
Title of Series
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
Production Year

Content Metadata

Subject Area
Context awareness Code Decision theory Multiplication sign 1 (number) Real-time operating system Parameter (computer programming) Mereology Formal language Data model Different (Kate Ryan album) Forest Core dump Cuboid Damping Endliche Modelltheorie Error message Physical system Mapping Reflection (mathematics) Data storage device Maxima and minima Instance (computer science) Demoscene Category of being Type theory Arithmetic mean Data model Process (computing) Software repository Drill commands Interface (computing) Website Right angle Quicksort Resultant Spacetime Web page Functional (mathematics) Identifiability Service (economics) Division (mathematics) Binary file Field (computer science) Power (physics) Element (mathematics) Product (business) Term (mathematics) Googol String (computer science) Subject indexing Energy level Utility software Data structure output Plug-in (computing) Computer architecture Default (computer science) Matching (graph theory) Cellular automaton Interface (computing) Content (media) Mathematical analysis Core dump Database Limit (category theory) Subject indexing Word Personal digital assistant Object (grammar)
Scheduling (computing) Multiplication sign Correspondence (mathematics) Numbering scheme Coma Berenices Mereology Computer configuration Damping Electric generator Broadcast programming Mapping File format Reflection (mathematics) Fitness function Sampling (statistics) Open set Process (computing) Sample (statistics) Right angle Quicksort PRINCE2 Data structure Resultant Row (database) Functional (mathematics) Identifiability Variety (linguistics) Event horizon Graph coloring Field (computer science) Number Goodness of fit Term (mathematics) Operator (mathematics) String (computer science) Subject indexing Data structure Domain name Default (computer science) Addition Stapeldatei Matching (graph theory) Inheritance (object-oriented programming) Total S.A. Line (geometry) Inclusion map Subject indexing Pointer (computer programming) Event horizon Query language Personal digital assistant Object (grammar)
Area Stapeldatei Matching (graph theory) File format 1 (number) Mathematical analysis Set (mathematics) Bit Event horizon Graph coloring System call Field (computer science) Subject indexing Radical (chemistry) Goodness of fit Event horizon Term (mathematics) Personal digital assistant String (computer science) Core dump Subject indexing Resultant Descriptive statistics Library (computing)
Matching (graph theory) Query language Term (mathematics) Subject indexing Quicksort Position operator Library (computing)
Group action Matching (graph theory) Range (statistics) Set (mathematics) Menu (computing) Range (statistics) Maxima and minima Parameter (computer programming) Distance Field (computer science) Plane (geometry) Word Numeral (linguistics) Term (mathematics) Query language Personal digital assistant Subject indexing Fuzzy logic Quicksort World Wide Web Consortium
Multiplication sign Range (statistics) Least squares Parameter (computer programming) Distance Field (computer science) Power (physics) Term (mathematics) Different (Kate Ryan album) String (computer science) Subject indexing Descriptive statistics Default (computer science) Matching (graph theory) File format Closed set Range (statistics) Bit Timestamp System call Subject indexing Message passing Numeral (linguistics) Personal digital assistant Query language String (computer science) Quicksort Reading (process) Library (computing)
Default (computer science) Slide rule Mapping Transformation (genetics) Stress (mechanics) Bit Line (geometry) Field (computer science) Subject indexing Category of being Personal digital assistant String (computer science) Subject indexing Moment <Mathematik> Right angle Resultant
Functional (mathematics) Code Multiplication sign 1 (number) Mathematical analysis Event horizon Field (computer science) Formal language Strategy game Term (mathematics) Computer configuration Single-precision floating-point format Subject indexing Cuboid Extension (kinesiology) Plug-in (computing) Area Mapping Mathematical analysis Counting Bit Line (geometry) Flow separation Subject indexing Word output Right angle Resultant Library (computing)
Point (geometry) Trail Group action Identifiability Token ring Multiplication sign Execution unit Insertion loss Function (mathematics) Data dictionary Field (computer science) Twitter Number Latent heat Term (mathematics) Single-precision floating-point format Subject indexing Data structure Addition Standard deviation Matching (graph theory) Mapping Key (cryptography) Mathematical analysis Electronic mailing list Symbol table Subject indexing Category of being Word Personal digital assistant Query language Order (biology) output Resultant Spacetime
Point (geometry) Laptop Functional (mathematics) Server (computing) Scheduling (computing) Presentation of a group Open source Java applet Code Multiplication sign Sheaf (mathematics) Mögliche-Welten-Semantik Software-defined radio Food energy Data transmission Product (business) Revision control Computer configuration Operator (mathematics) Subject indexing Authorization Cuboid Video game console Graphics processing unit Stapeldatei Trail Mapping Open source Sampling (statistics) Software-defined radio Grand Unified Theory Open set Demoscene Subject indexing Kernel (computing) Drill commands Query language Surreal number Website Right angle Quicksort Resultant
Point (geometry) Code Multiplication sign Range (statistics) Maxima and minima Ripping Word Numeral (linguistics) Bit rate Term (mathematics) Subject indexing Cuboid Graphics processing unit
Building Group action Code Multiplication sign Direction (geometry) Correspondence (mathematics) Range (statistics) Cloud computing Insertion loss Parameter (computer programming) Mereology Computer programming Formal language Benchmark Hooking Semiconductor memory Single-precision floating-point format Core dump Elasticity (physics) Cuboid Area Algorithm Mapping File format Block (periodic table) Fitness function Bit Benchmark Right angle Figurate number Quicksort Freeware Geometry Row (database) Laptop Real number Product (business) Number Term (mathematics) Googol Subject indexing Energy level Software testing Traffic reporting Maß <Mathematik> Boolean algebra Stapeldatei Matching (graph theory) Mathematical analysis Database Group action Limit (category theory) System call Subject indexing Word Explosion Software Personal digital assistant Query language Function (mathematics) Fuzzy logic Speech synthesis Finite-state machine Library (computing)
[Introduction of the speaker, Marty Schoch, largely inaudible.] I'm going to talk about Bleve, which is a text indexing library for Go. One of the other things we're trying to do is build a community around it, so I don't want it to just be me. [inaudible] So you might be wondering: Bleve, what is this? About fifty-fifty, the people I run into pronounce it "Bleve" or "believe". I pronounce it "Bleve", but however anyone pronounces it is fine with me. I work for a company called Couchbase. We make a distributed NoSQL database. We do have an official Go SDK out now, but I'm really not going to talk too much about Couchbase today. I just want to highlight that, very much like many of you here, we are using Go internally, and to give a sense of that: the new query language, secondary indexing, and cross-datacenter replication, three of the biggest features in Couchbase, are all either being written in Go from scratch or being rewritten in Go. So Go is a big part of the future of Couchbase, but that's not why I'm here today; if you want to ask me any questions, just grab me afterwards. The first question I almost always get about Bleve is: why? Why not Lucene, Elasticsearch, Solr? That ecosystem is all really good. I'm the first to admit these are awesome products, and a lot of the inspiration for Bleve has come from looking at the way Lucene works. So if you are using Java and the JVM and you're happy with Lucene, by all means keep using it. But sometimes you don't already have a JVM in your architecture, and adding one is maybe a more heavyweight thing than you're interested in doing. So we really started to ask ourselves the question: can we build 50% of Lucene's text analysis, pair that with some off-the-shelf KV stores, and maybe something interesting comes out? And so that's really the experiment we pursued.
That led to four core ideas. One was this text analysis pipeline, and the idea is that we just have to build the most important pieces first. If we get the interfaces right, users will come along and say, "Hey, we need this other stemmer," and it's easy for them to contribute that and add it to the ecosystem. The next part was this idea of pluggable KV storage, in that we didn't have to start by inventing some binary file format to squeeze out the maximum performance up front. There are a lot of interesting stores out there, so we have plug-ins for BoltDB, LevelDB, and ForestDB, and we're looking into adding one for RocksDB as well. On the one hand this lets users choose whichever one meets their needs best, but it also has some interesting properties, in that some of the KV stores work better for different use cases. So if you have a very read-heavy search use case, or if you have a very real-time indexing use case, you may end up wanting to choose a different KV store, and that's something we sort of got for free. And finally, the idea was that if we can make term search work, almost all the more complicated searches you want to run later on build on top of that term search, so by building a small amount of functionality we can get something up and running quickly. Now, before we get too much further, it's helpful to get us all on the same page in terms of what search is. This is obviously what most people will think of first: users see a search box, right? I can just type in what I want, and hopefully I get what I want back. Sometimes it also means a more advanced search: maybe you want to search for phrases, maybe you want to restrict to certain fields, maybe you want to combine your text search with numeric or date searches as well. And when we look at the search results, these days people expect spelling suggestions if there's some sort of misspelling. It's helpful for
the user to see snippets of document content coming back with the result; that helps the user understand the context of why this result was returned. Taking it a step further, highlighting the search terms inside of the snippet is a very powerful feature for helping users understand why this document matched. Another really popular feature, and this sort of model is seen in the Elasticsearch world, is this notion of faceted search. Here you see me doing a search for books on a major retailer, and on the left-hand side you see it's broken down by categories, and then in parentheses there are counts that tell you how many books are in that particular category. On a well-designed site this allows you to do faceted navigation: by clicking on those you sort of drill deeper and deeper into the categories. So we'll look at that capability as well. So, enough of the high-level stuff; let's look at some code. To get started: one of the earliest decisions we made was that we wanted Bleve to be go-gettable. So the good news is that to get started you can just "go get" our GitHub repo, and if you add the "..." you also get the command-line utilities we built installed along with the packages. Once you've got the package, again with that same theme of making it easy to use, we wanted it so that for the simplest use case of the system you just import a single package, which you see here. Behind the scenes Bleve has many other packages that sort of layer the functionality, so more advanced users, or people extending the functionality, may need to use more packages, but for simple use cases the single package is all you need. Once you've imported the package, we'll look at the data model. Again, we're building a really simple example here, so I just define a struct named Person with a field called Name of type string; it's the very simplest thing we could work with. Bleve essentially uses reflection to discover what's
going on and tries to make the most sense of your object, but this is again just a very simple structure to get going. Now the next step is to create what we call a mapping. The mapping is what essentially takes your document, your data model, and turns it into what can be put into the index. This is where you can eventually configure a lot more details. What we have here is that if we just use NewIndexMapping you get a default mapping, and we've done a lot of work to make the default mapping as useful as possible. You might wonder why we even expose this if it could be completely optional; maybe we should hide it. We went back and forth on this topic internally, and what we decided was that the mapping is so important for getting high-quality results that it was helpful to keep it in the face of the user, so you're never going to forget that the mapping is there, and if you're not getting the results you're looking for, you may need to tweak that mapping. We'll see an example of that later in the slides, but for now we're just going to use the default mapping. Here on line 18 we open a new index; to create a new index, we provide the path and reference the mapping that we just created. That's going to return either an index or an error. Once we've got the index open, we create an instance of that Person struct with my name, Marty Schoch, and then on line 24 we invoke the Index method. The first parameter here is a string, which is just a unique identifier for the document we're putting in the index, and the second argument is that instance we just created on line 23. Again, that's going to either succeed or return an error. In this case, if I run this one, nothing
exciting happens, other than that it gets to the end and prints that it indexed the document. So we now have an index with a document in there. Now the next thing I want to do is search it, right? So here we're going to open the index using the Bleve Open function. The Open function differs slightly, because now you're only giving it the path. You don't need to provide a mapping, because we serialize the mapping into the index, so we actually have that persisted, and the initial mapping you provided affects how you use the index going forward. When you open it, again you get either the index reference or an error. Proceeding on to line 21 here, I'm creating a query object. This is the simplest kind of query possible; it's called a term query. The term query doesn't do anything fancy: it looks for an exact match of the term you specify in the index. So a term query is not very useful on its own, but it's a simple one to demonstrate here. The query is essentially the thing describing what we're looking for, and on the next line the request is describing how we want to get it. The request is where we can control how many results we want to bring back, whether to skip over any results, and maybe whether we want to pull back fields we stored as well. That kind of stuff would be added to the request. Here we're again just getting a default request without any other options specified. Now we can run it, and you do that by using the Search method and passing in the request. And we can go and run that now.
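The exact-match behavior of a term query can be sketched with a toy inverted index. This uses only the Go standard library and is an illustration of the idea, not Bleve's API or storage format; the names `TermIndex`, `Add`, and `TermSearch` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// TermIndex is a toy inverted index: it maps each lower-cased term to the
// IDs of the documents that contain it. Illustration only, not Bleve.
type TermIndex map[string][]string

// Add splits text on whitespace, lower-cases each token, and records id
// under every distinct term.
func (ix TermIndex) Add(id, text string) {
	seen := map[string]bool{}
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		if !seen[tok] {
			seen[tok] = true
			ix[tok] = append(ix[tok], id)
		}
	}
}

// TermSearch returns the IDs of documents containing exactly this term.
// The query is not analyzed, so "Marty" or "mart" will not match "marty".
func (ix TermIndex) TermSearch(term string) []string {
	return ix[term]
}

func main() {
	ix := TermIndex{}
	ix.Add("1", "Marty Schoch")
	fmt.Println(ix.TermSearch("marty")) // exact term: finds document "1"
	fmt.Println(ix.TermSearch("mart"))  // prefix only: finds nothing
}
```

Because the lookup is a plain map access, anything that differs from the indexed term, even by case, misses. That is the limitation the more advanced query types in the talk build on.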
Again, the document we indexed had my name, Marty Schoch, so that's going to match the term "marty", and that corresponds to what we see here. We see that there was one match out of one document total, and "1" is the identifier we provided, so that's the identifier being returned to us, and the number on the right there is the score. We're doing TF-IDF scoring. I'd certainly like that to be more configurable in the future, but this is the baseline that most people would expect when getting started with search. So that really is just how easy it is to use. I really want to emphasize that in about 21 lines of code we were able to create an index, and then in another 21 lines we were able to go and search the index. But let's look at a more realistic example. I was trying to find a good dataset to work with, and I wanted something that the audience would be familiar with, so I didn't have to teach you some new domain. I came across the FOSDEM schedule of events, which they happen to publish in a variety of formats, one of which was pretty easy to parse: iCal. So here's a sample record describing this talk. There are actually about 550 of these in the full feed, and I'll use this as the example dataset. I had some code to parse this which is not really interesting for today's talk; it essentially returns events in the structure you see here, which has a few more fields than the previous one. Most of them are strings, but I'll highlight two in addition: one is Start, which uses the time structure, and the final one is Duration, where we store the duration of the talk in minutes, and that's a float64. You'll also notice we use JSON struct tags; that's just sort of a convenience here. Bleve does understand those tags, and they allow us to refer to those fields using a lower-case name, just a preference to simplify things. [Audience question about supporting time durations.] Good question. I'm not exactly sure; I would say probably not as we have it now. The way numbers and dates are handled is a little bit tricky; it uses a different
encoding scheme inside the index for those fields. So most likely we could support durations if we just had the checks in place. I mentioned that we're using reflection to discover what the documents are and how to handle them; that's something we could probably augment to do a better job for time durations. So here we're going to go and index the documents, and I'm doing it a little differently now: I'm going to index in batches. On line 38 you see there's this parse-events function; we're not going to look at the details of that. It's just a channel that's returning pointers to these events. I create a new batch and keep adding to it, so on line 39 I'm adding a document to the batch. Every time we get a hundred in a batch, we execute the whole batch, and there's a little bit at the end to clean up the last batch, which is not a hundred. I'm going to run this.
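The batching pattern described here (read from a channel, execute every hundred, then flush the remainder) can be sketched as follows. This is a standard-library sketch, not the code from the slides: `Event`, `indexInBatches`, `produce`, and the `flush` callback are hypothetical stand-ins for the schedule parser and for executing a batch against the index.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for the schedule record from the talk.
type Event struct {
	ID, Summary string
}

// indexInBatches drains events, groups them into batches of batchSize, and
// hands each full batch to flush. A final partial batch is flushed at the
// end. It returns the number of flush calls made.
func indexInBatches(events <-chan *Event, batchSize int, flush func([]*Event)) int {
	batch := make([]*Event, 0, batchSize)
	flushes := 0
	for ev := range events {
		batch = append(batch, ev)
		if len(batch) == batchSize {
			flush(batch)
			flushes++
			batch = make([]*Event, 0, batchSize) // start a fresh batch
		}
	}
	if len(batch) > 0 { // clean up the last batch, which is not full
		flush(batch)
		flushes++
	}
	return flushes
}

// produce sends n synthetic events on a channel and then closes it.
func produce(n int) <-chan *Event {
	ch := make(chan *Event)
	go func() {
		defer close(ch)
		for i := 0; i < n; i++ {
			ch <- &Event{ID: fmt.Sprint(i), Summary: "talk"}
		}
	}()
	return ch
}

func main() {
	total := 0
	n := indexInBatches(produce(550), 100, func(b []*Event) { total += len(b) })
	fmt.Println(n, total) // 550 events: five full batches plus one of 50
}
```

With 550 events and a batch size of 100, the final flush handles the 50 left over, which is the clean-up step mentioned above.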
Again, when you're indexing large numbers of documents, you get some efficiencies by putting things in a batch. So that indexed 550 events, and now we can do a lot of interesting searches, because there's a lot of good text in this dataset. To start with, I'll do a very simple term query, just like the one we saw earlier, but I'm going to search for the term "bleve", which is in the name of this talk and is this library. I've also added one other thing, on line 20, which is to ask it to highlight the results with a style called HTML. This is going to highlight matching terms in the result set.
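As a rough illustration of what HTML-style highlighting does to a stored field, the sketch below wraps each match of a term in a tag. It is a standard-library sketch under assumptions: the function name `highlightHTML` and the choice of the `<mark>` tag are illustrative, and it does no HTML escaping or fragment selection, which a real highlighter would need.

```go
package main

import (
	"fmt"
	"regexp"
)

// highlightHTML wraps every whole-word, case-insensitive occurrence of term
// in <mark> tags. The tag is an assumption made for this illustration.
// The input text is not HTML-escaped here.
func highlightHTML(text, term string) string {
	re := regexp.MustCompile(`(?i)\b` + regexp.QuoteMeta(term) + `\b`)
	return re.ReplaceAllStringFunc(text, func(m string) string {
		return "<mark>" + m + "</mark>"
	})
}

func main() {
	fmt.Println(highlightHTML("Bleve - Text indexing for Go", "bleve"))
	// prints: <mark>Bleve</mark> - Text indexing for Go
}
```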
So here you notice I've got highlighting turned on, so in the yellow highlighter color are all the matches of the term "bleve", and of course it did match this particular talk that we're in right now. It's a really cool feature. We have two different ones: one does fancy formatting for terminals, and we also have an HTML one, which you see here. But again, it's designed to be pluggable, so you can add other things. Now, that was a simple term search. Let's do something a little bit more advanced, called a phrase search. On line 18 I build up the phrase as an array of strings, so this is looking for the phrase "advanced text indexing", and on line 19 I create a phrase query by passing in that array of terms. Notice that in this case I'm restricting it to the description field. So if we run this one, this should also match the
one talk that we're in. And again, if I scroll over here, we do see that the phrase is now highlighted as well, so that was the match we were looking for. Now, you don't get too far before you want to combine simpler queries into more complex queries. We have conjunction and disjunction; what I'm going to show here is a conjunction query of a term search for "text" and a term search for "search". So it's sort of like the phrase search, but there's no position requirement; they don't have to be side by side, they just both have to be in a particular document. I'm going to run this one; again, it's
going to match slightly more documents now, because this is a more general query. It matched four documents, all of which were about text search. Now let's go and combine more queries. Maybe somebody was out in the hallway before this talk and says, "I heard about this library called 'believe' something, text search." So I've added that in as I heard it; as an English speaker, I might have heard it as "believe". So if we run that, again, it's all ANDed
together, so it's not going to match anything, because none of the documents have "text" and "search" and "believe". But what if, instead of a term search for "believe", we did a fuzzy search instead? If we run that, the fuzzy
search is ultimately going to match the Bleve document. What happens is that the term "believe" doesn't have an exact match, but the fuzzy search is going to essentially find anything with a Levenshtein distance of 2 or less and match it. In this case, "bleve" is a Levenshtein distance of 2 away from "believe", and that's why we highlighted "bleve", the term inside the document. So for users who are manually typing things in, fuzzy search is often very useful. [Audience question, inaudible.] What should happen is that the scoring ultimately takes into account the Levenshtein distance for the fuzziness, so a term that matched less fuzzily should score higher than one that matched more fuzzily; that's what would be the desired behavior. Now, I mentioned earlier these two fields; one was the duration. So what I'm going to take advantage of here is a numeric range query. The two parameters are essentially a minimum and a maximum value. I define a long talk to be 110 minutes; I think that's probably a pretty long talk. So let's go ahead and run this; it should find any of the talks that are longer
than 110 minutes, and in this case it matches just two, and these are both exams; we see they have a duration of 120, so that makes sense with what we expected to see. We also have the start time, so I have a date range query here. Again, you just pass a start and an end date; in this case we're using the RFC 3339 format for the timestamp. That's all configurable inside the library, but that's just the default here. So this is going to look for today, 5:30 or later, so that's a very late talk on Sunday, and if we run
that, that's going to go ahead and match just one talk, which is the closing talk. And finally, I haven't touched on all the different kinds of queries, but I want to highlight one more, which is query strings. Everything you've seen so far has been a very programmatic way of querying your data, but sometimes you want to expose to end users something they can just type in, but that still has all the power of the programmatic queries we saw earlier. So I'll highlight some of them here. The syntax is very similar to Lucene's; it's not quite as complete as Lucene's, but the syntax is certainly designed to be similar or the same. So here, on the first one (and I'm just concatenating strings here to make it a little easier to read), "description:text" is going to say that the field description has to match the term "text", and the plus in front is going to mean it must satisfy that particular clause. On the next one we see "text indexing" in quotes, which makes it a phrase match in this case, and that's also restricted to the summary field. "summary:believe~2" is going to do a fuzzy search, and the 2 there is the Levenshtein distance that you pass as a parameter. The next one is "-description:lucene", so this one's going to say it must not contain the term "lucene". And the last one is a numeric range query: we have "duration:>30", and that's going to look for talks over 30 minutes. The syntax is still not as complete as Lucene's; we're missing a date one, for example, and there are a couple of things we need to improve there. But the idea is that this is really useful for exposing to end users so they can construct more powerful queries. So we can run this one as well.
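The fuzziness value used in the fuzzy queries above is a maximum Levenshtein distance. A minimal standard-library sketch of that distance, and of the "distance of 2 or less" test, is shown below. It illustrates the metric only; how Bleve enumerates candidate terms is not covered, and the names `levenshtein` and `fuzzyMatch` are hypothetical.

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// fuzzyMatch reports whether candidate is within maxDist edits of term.
func fuzzyMatch(term, candidate string, maxDist int) bool {
	return levenshtein(term, candidate) <= maxDist
}

func main() {
	fmt.Println(levenshtein("believe", "bleve"))   // 2: delete one "e" and the "i"
	fmt.Println(fuzzyMatch("believe", "bleve", 2)) // true
}
```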
And this is going to match five talks. Again, I think the first one is Bleve in this case, and yeah, that has the highest score as well. So I mentioned at the beginning of the talk that the mapping is important. You can get pretty far just using the default mapping; a lot of the stuff you'd expect to work works out of the box. But it's not perfect, and I want to highlight that here. This is, I would say, a very important topic for successfully using Bleve. There was a talk earlier today, "Finding Bad Needles in Worldwide Haystacks", so let's say we want to search for just "haystack". If I run that, it's not going to find it right now.
Does anyone have a guess as to why it didn't find it? [Audience answers.] Right, so fuzzy would be one approach that would also accomplish the same goal, but ultimately what I want to focus on in this slide is changing the mapping to get better results. As human beings, we're able to look at all this text and realize it's all in English, right? And so there are additional transformations we can make on English text to give us higher recall in our search results. What we're going to do is take advantage of that in this mapping and essentially tell it, "Hey, this field is in English." So on line 28 (this looks a little complex, and I'd rather not go through all of it), I say I want to use an analyzer called "en", which is the name that the English analyzer is registered under, and then on lines 31 and 32 I'm saying, "OK, for this particular field I want to use that mapping." We won't look at the rest of the lines, which are making a couple of other tweaks to the mapping. I'm going to go ahead and run this; this is going to index all of
the 550 events and produce a new index using this updated mapping. This new index is called "custom", so all the subsequent examples are going to switch to that. And so now, using this custom mapping, if we run that same search for "haystack" again, that
is going to match. And if you notice, we searched for "haystack" and it highlighted the whole term "haystacks". The reason is that the English analyzer does stemming on the input, which essentially stems "haystacks" to just the single term "haystack", which ultimately matched in the search. Normally, when you're doing search, you're actually doing the same analysis on the input and on the document fields, right? That way, when you're searching, the terms line up with what's inside the index. [Audience question about supported languages.] I don't have the count with me; it's like 13. Those are the ones where there was an easily available stemmer, and some are complicated. So it's an area where, if there's a particular language you're interested in that we don't support, between looking at what Lucene does and everything else, you can pretty easily put something together, and it's not that complicated. [Audience question, inaudible.] So with Bleve, I mentioned earlier, we wanted it to be go-gettable, so we tried to have as much pure Go as we could, to get that great out-of-the-box experience. But we do have existing C extensions that we're using for some of the functionality that wasn't available in Go. So we'd be open to optional add-ons for something like that, but we probably wouldn't want it in the core, for that reason. [Audience question about mixed languages.] So the different strategies for handling mixed languages are, again, the same basic thing you run into in Lucene. You could create separate fields for each language and index them in a separate field per language, and on the search side you analyze the query for each of the different languages and then match those up to the corresponding fields. That's one option. Google has a library, CLD2, which is a language detector, and we have a plugin for that. [inaudible] We had the idea that we'd like to figure out the language and then just do the right thing, and we may be
exploring that in the future, but we ultimately found that it's not quite as simple as it sounds. What happens is that it maybe works fine on documents when you're indexing them, but at search time the query is usually a much shorter run of words, and it's harder to accurately guess what language that is. So I would say that's an area where we're still exploring ideas, but it's not quite as simple as having it magically work by itself. Now, I do want to mention that getting this analysis right is like the most important thing, so with that in mind we created a tool we call the Bleve
text analysis wizard. [inaudible] You can go right to it and run it. This lets you see how your text is going to essentially end up in the index. I'm going to use the phrase "Bleve indexes the text quickly", just as an example, and I'll start with the keyword analyzer. The keyword analyzer is really the simplest one, because it doesn't do any analysis; it treats the entire thing as a single token. So you notice that the whole phrase, even though it has spaces and multiple words, is all indexed as a single token. That's useful for things like identifiers, or maybe URLs sometimes; things you don't necessarily want to treat as words. So that's the keyword analyzer. The next more complicated one is what's called the simple analyzer. This one gives us five separate tokens, and really the only thing it did, in addition to tokenizing into separate words, is that it lower-cased: the first term, "Bleve", became "bleve". The next more complicated one is something called standard, and standard was the default we were using earlier: when we used the default mapping, that was using the standard analyzer. You'll notice that the word "the" was removed, because that's a stop word in the default dictionary. But it still doesn't do any really English-specific things like stemming would require. You could argue that the stop word list is English-biased as well, but the stemming itself is very language-specific, and so we have a special analyzer called "en", and this is going to do the additional steps. Now, in addition to removing stop words, "indexes" became "index" and "quickly" with a "y" became "quickli" with an "i". Once again, that's just the output of the stemmer, and it doesn't really matter what the specific terms are; the important thing is that they match up, right? So that the same thing happens on input and output. So this is a tool that, again, if you're not getting good enough results, is a great tool for figuring out how you can tweak your mapping to get better results.
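The progression from keyword to simple to standard analysis can be sketched with three small functions. This is a standard-library approximation, not Bleve's analyzers: it tokenizes on whitespace only, the stop list is a tiny illustrative one, and the stemming step of the "en" analyzer is deliberately left out because a real stemmer is considerably more involved. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a tiny illustrative stop list; real lists are much longer.
var stopWords = map[string]bool{"the": true, "a": true, "an": true, "of": true}

// keywordAnalyze treats the whole input as a single token.
func keywordAnalyze(text string) []string {
	return []string{text}
}

// simpleAnalyze splits on whitespace and lower-cases each token.
func simpleAnalyze(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// standardAnalyze does what simpleAnalyze does and then drops stop words.
func standardAnalyze(text string) []string {
	var out []string
	for _, tok := range simpleAnalyze(text) {
		if !stopWords[tok] {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	phrase := "Bleve indexes the text quickly"
	fmt.Printf("%q\n", keywordAnalyze(phrase))  // one token: the whole phrase
	fmt.Printf("%q\n", simpleAnalyze(phrase))   // five lower-cased tokens
	fmt.Printf("%q\n", standardAnalyze(phrase)) // four tokens: "the" is dropped
}
```

Running the same function over both the indexed text and the query is what makes the terms line up, which is the point made above about input and output.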
Now, that being said, I think it's a good point to remind everyone that when you're doing search there's precision and recall, right? Typically what happens is that you have one search that doesn't give you the result you're looking for, and you tweak the mapping to get what you're looking for, but you always need to be mindful that you may have actually made things significantly worse in other searches you're not running at the moment. So just keep in the back of your mind that it's not an easy thing to home in on the right behavior. Now, I also mentioned faceted search at the beginning of the talk, and I want to demonstrate that capability as well. In that data structure we had there was a field called category, which corresponds to the track of the talk. So what I've done is I've added a facet request here called "categories". You can imagine it creating a bucket for each of the different categories and just tossing each document into the correct bucket, and so the results are going to give us the name of a category and a count of how many were in that category. For the search itself, you'll notice on line 20 I'm setting the size to 0. I'm doing a search that matches everything, which is that new match-all query, but I'm not actually interested in the search results themselves, because it's every document in some arbitrary order. So I get no document results back, just the facet results.
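The bucket-and-count idea behind a terms facet can be sketched in a few lines. This is a standard-library illustration with hypothetical category values, not Bleve's facet implementation; `FacetCount` and `facetCounts` are names made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// FacetCount is one bucket: a category name and how many documents fell in it.
type FacetCount struct {
	Term  string
	Count int
}

// facetCounts tosses each category value into its bucket and returns the
// top `size` buckets, ordered by count descending and then name ascending.
func facetCounts(categories []string, size int) []FacetCount {
	buckets := map[string]int{}
	for _, c := range categories {
		buckets[c]++
	}
	out := make([]FacetCount, 0, len(buckets))
	for term, n := range buckets {
		out = append(out, FacetCount{Term: term, Count: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Term < out[j].Term
	})
	if len(out) > size {
		out = out[:size]
	}
	return out
}

func main() {
	// Hypothetical category values, one per matching document.
	cats := []string{"Go", "Java", "Go", "Lightning talks", "Java", "Go"}
	for _, f := range facetCounts(cats, 50) {
		fmt.Printf("%s (%d)\n", f.Term, f.Count)
	}
}
```

Asking for more buckets than exist simply returns all of them, which matches the behavior described next, where the top 50 were requested and fewer came back.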
And when I run it here, you see again that it returned the top 50; I think there are only about 44. You can see lightning talks has 41, Java 26, and if I scroll down, Go has 9. Again, this is just looking at the raw output; we'll see later how we can use this to build some simple UI. I should also mention that we have some optional HTTP handlers in a separate package. Basically all the major operations are mapped to these, so it's really easy to set up a server that exposes this functionality. They make the assumption that the document bodies are JSON, and that allows them to just use normal JSON serialization and deserialization to put those into the index. We actually have a sample app that does exactly that, called Bleve explorer, and it gives you a point-and-click GUI to play around with. So if you're playing around for the first time, this is a great way to get started in a more visual way. So, when you put all this together, what kind of things can we actually build? This is really the end goal.
This was about a day's worth of work, and it was really all UI work, mainly because I'm not that good at front-end stuff. This is an AngularJS app; it's using the HTTP APIs behind the scenes and the full dataset that we saw earlier. So we could search for something like "open source kernel", and when we run that it matched 177... 171 results. As you can see, we have the title of the talk, the author, links to take you back to the schedule, the start time, the duration, the room, and then the description, and in a little bubble on the right you see the score. The cool thing is we've added this "refine results" section on the right-hand side, which allows you to drill into the data. Let's say we wanted a talk later today, so I check this Sunday box, which should have 89 results, and indeed it scopes us down to only 89 results. Maybe I have a short attention span, so I want a talk of less than 30 minutes; that takes us down to 63 results. And maybe software-defined radio sounds kind of interesting, so let's check that box as well; that takes us to 5 results. Again, I just want to highlight that this is not a whole lot of work: it is really just some HTML and JavaScript in front of Bleve and the HTTP handlers, and you can pretty quickly get some pretty powerful functionality.
Yes, you can update the index. What you can't do is change the mapping, so you need to commit to your mapping; it's very much like other products in that way. There are two versions of this: this one is running on my laptop, and there is a hosted version that's online. Every hour it polls the schedule and essentially updates all of the talks from the website in batches. But yeah, index updating is very similar to the way you initially insert.
Yeah, the ranges. At query time you can define what the buckets are. I don't have any code that nicely shows that, but actually, I think, let me jump back to the
presentation here.
The facet you saw here, let me close this, because I guess the only facet I have here is the
term facet, which isn't quite what you're looking for. I'd have to find the code that created the numeric range facets to show you, but essentially you define buckets, which is what I did in this example.
Let me uncheck that box so you can see there's more than one. OK, so you see these two buckets here; there were actually three that I defined. One was for short talks under 30 minutes, one was for 30 to 60, and one was 60 plus. Those are defined with a minimum and a maximum, and they don't have to cover the whole range; you can have optional endpoints if you want. So it's pretty flexible, and again, it accepts those at query time. I should also mention that there are other approaches to doing facets. If you've played with Elasticsearch, this is more like Elasticsearch's initial facets. There are some index-time faceting approaches as well, which we don't have.
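The three duration buckets described above can be sketched like this. It is a plain-Go illustration rather than Bleve's API: `NumericRange`, `rangeFacet`, and `ptr` are hypothetical names, and treating the minimum as inclusive and the maximum as exclusive is an assumption made so the buckets do not overlap.

```go
package main

import "fmt"

// NumericRange is one bucket; a nil endpoint means "unbounded on that side".
type NumericRange struct {
	Name string
	Min  *float64 // inclusive lower bound (assumption), nil for none
	Max  *float64 // exclusive upper bound (assumption), nil for none
}

// contains reports whether v falls inside the range.
func (r NumericRange) contains(v float64) bool {
	if r.Min != nil && v < *r.Min {
		return false
	}
	if r.Max != nil && v >= *r.Max {
		return false
	}
	return true
}

// rangeFacet counts how many values land in each range. Values that match
// no range are simply not counted, so ranges need not cover everything.
func rangeFacet(values []float64, ranges []NumericRange) map[string]int {
	counts := make(map[string]int, len(ranges))
	for _, r := range ranges {
		counts[r.Name] = 0
		for _, v := range values {
			if r.contains(v) {
				counts[r.Name]++
			}
		}
	}
	return counts
}

// ptr is a small helper for building optional endpoints.
func ptr(v float64) *float64 { return &v }

func main() {
	durations := []float64{5, 15, 25, 30, 45, 60, 120} // minutes, made-up data
	ranges := []NumericRange{
		{Name: "short", Max: ptr(30)},
		{Name: "medium", Min: ptr(30), Max: ptr(60)},
		{Name: "long", Min: ptr(60)},
	}
	counts := rangeFacet(durations, ranges)
	for _, r := range ranges {
		fmt.Printf("%s: %d\n", r.Name, counts[r.Name])
	}
}
```

Because the ranges are just data passed alongside the query, they can be chosen per request, which is the query-time flexibility mentioned in the talk.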
So, I couldn't get away without someone asking about performance, so let's spend a couple of minutes on that. What I will say is that we really focused on getting the features right first, and then our next focus was on getting an API that we thought was the right level of granularity for people to use. We are finally getting to the point where we're starting to look at performance. I don't want folks to focus on the numbers; this was just run on my laptop.
I would characterize it this way. First, we use what I'm calling micro-benchmarks. This is just using the built-in Go tooling for benchmarking, and it's great for testing small blocks of code. So if someone says "I have a faster way to do facets", this is great for being able to say: this is how we do it now, this is how we're thinking about doing it, and to get a quantitative sense of whether it's better or not. But this is not always the best way to get at some of the things we need to know, so we also built a tool called bleve-bench. This is for longer-running tests. In this case we take real data from Wikipedia and build up an index. We insert the documents in different ways, using individual inserts and batch inserts, and then we run some queries. This lets us answer questions like: is the indexing performance degrading over time as we index more and more documents? Or, how does the search performance relate to the number of matching documents?
Again, this is one snapshot that I've included. I wouldn't focus too much on the details; it's to give a sense of the kind of data we can look at. You can see that indexing looks pretty good: it's a little erratic at times, but it's not necessarily getting worse as we go. This is just to highlight a new tool we published in the last couple of weeks. It's something we're using as we make improvements to the library; we can use it as the reference to see whether we're getting better, getting worse, or staying the same. And then ultimately, the main
question we'll have is how this compares with Lucene. I don't have anything for that today. It's not impossible, but it's difficult to get good answers to that kind of thing. This tool was started in that direction: they have some benchmarks that work on the same Wikipedia dataset, so I could see this eventually turning into something where we can start to answer that question. But it's difficult to answer, other than to say that Lucene is faster right now; that much we're sure of.
So finally, there was a call earlier about community, and this is really important to me. As of now I'm the main developer for Bleve, but I don't want it to be that way. I want more contributors; I want people to help me make this better. In terms of joining and participating in the community, we have an IRC room on Freenode, but it's a small and quiet room right now, so you can pretty easily get my attention. There is a Google Group we use for general discussion: if you have a use case that's not yet implemented, what would it take to do that? If you're interested in planning a larger feature, this would be a great place to join in and talk about your ideas before you sit down and write the code. And of course we use GitHub. This is Apache licensed: report issues, submit pull requests. I am happy to say we already have contributors other than myself. It's not a huge number, but it's not zero, and the contributions range from minor typo fixes up to performance improvements and new features, so they span a range. I really want to encourage anyone who's interested to get involved; because it's at a rather early stage, it's easy to get involved.
A little bit of a roadmap: these are some features that are on our radar right now. Result sorting: right now, results are always sorted by the score, which is useful for a majority of cases but not all cases. Better spelling suggestions
and fuzzy search: this is an area where there are a lot of interesting ideas about finite state machines, finite state automata, that you can use to essentially trade memory for really fast matching of text. So that's an area where we know there's a lot of cool stuff out there that we need to hook into. Performance: I think performance is something where we want to at least understand the bounds and the parameters a whole lot better, so that we know where we stand and where we're trying to get to. And we also want to prepare for a 1.0 release. Right now the file format is sort of in flux, and the API is still changing sometimes. We try to minimize that a little bit, but it is still changing. When we get to a 1.0 release, that's where we'll have backward compatibility for file formats, a stabilized API, things like that.
In terms of other speaking: I'm here today because I really want to get the word out about this library. I'm headed to India in about a month to speak at GopherCon India, and I'll definitely be at GopherCon in Denver in July. I'll be submitting a proposal, but I'll be there one way or another. So if there's any meetup where you want a talk about this stuff, definitely let me know, and if you have other meetups where you'd want a talk from someone else in the community, we're interested in doing that as well. That's all I have prepared today, so I'm happy to take any questions you have.
Yeah? So, on distributing all this, what I would say is there's nothing out of the box that handles that. That Bleve explorer program I mentioned, if you boot that up, looks very much like a single-node Elasticsearch, so there's nothing distributed at all. I work for Couchbase, and we're a distributed database, so you can imagine our interests are in distributing this. All I can say is that we're building on top of the library. We're trying to focus on making this a useful library in general, and very
much like Lucene, right, and then in a layer on top of that we add the distributed capabilities. That's something Couchbase is working on, and I would say I expect to see that in a product in the future, but today this library is as it is. For that question: not right now.
Oh, sorry, the question was whether there's support for spatial search. I would say, what you'd frequently do with Lucene is compute a geohash on something like a coordinate and then insert that into the index. It's basically the same approach we use for the numeric ranges; you can use that for geo stuff as well. It's not great: it has limits, and it works in some use cases and not in others, but it's sort of the low-hanging fruit. Someone in this room could probably, in half an hour, say "OK, Bleve now does geo". So that's the part where we could get the low-hanging fruit, but it's not a dedicated geo solution.
For Arabic, I would have to check. I think it's actually decent. I know I wrote part of it, because I remember copy-pasting characters, and I'm not sure I finished it. I think some of those languages had a common ancestor that was being reused in a bunch of the analyzers, so I did that to figure out how to get multiple languages done with this one piece. I did some of it, but absolutely stop by and we'll make it better. I'll also mention, and this was surprising, that we've gotten a lot more interest from people outside of the United States than anywhere else. We have users in Europe already using this, and a lot of interest in our support for Chinese and Japanese text as well. So we're going to go in the direction the community takes us, really. If there's more of that... but not right now.
Is there support for any phonetic algorithms? Not right now, but again I would
say that the text analysis pipeline would be where that would fit in. So we're definitely interested in doing that as well. Any other questions? Last question.
Right, deleting items. So the question is how we delete items from the index. Like a lot of things, we maintain a back index of all the other rows that are maintained for that document ID. The simple thing is, if you're updating a document, you either overwrite or delete the previous rows corresponding to that document, and then handle the new ones. In the case of just deleting, we have a row which says "these are all the rows corresponding to this document", and we need to delete all of those as well. Thank you all very much.
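The back-index idea from that last answer can be sketched in plain Go. This is a simplified in-memory model, not Bleve's storage layout: the `Index` type, its method names, and the use of maps in place of key/value rows are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Index is a toy inverted index plus a back index.
type Index struct {
	postings map[string]map[string]bool // term -> set of document IDs
	back     map[string][]string        // document ID -> terms it was indexed under
}

func NewIndex() *Index {
	return &Index{
		postings: map[string]map[string]bool{},
		back:     map[string][]string{},
	}
}

// Delete uses the back index to find every posting that belongs to the
// document, removes those postings, then removes the back index entry itself.
func (ix *Index) Delete(id string) {
	for _, term := range ix.back[id] {
		delete(ix.postings[term], id)
		if len(ix.postings[term]) == 0 {
			delete(ix.postings, term)
		}
	}
	delete(ix.back, id)
}

// Index inserts or updates a document: old postings are removed first,
// then the new terms are written and recorded in the back index.
func (ix *Index) Index(id string, terms []string) {
	ix.Delete(id)
	for _, term := range terms {
		if ix.postings[term] == nil {
			ix.postings[term] = map[string]bool{}
		}
		ix.postings[term][id] = true
	}
	ix.back[id] = append([]string(nil), terms...)
}

// Search returns the sorted IDs of documents containing term.
func (ix *Index) Search(term string) []string {
	ids := make([]string, 0, len(ix.postings[term]))
	for id := range ix.postings[term] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	ix := NewIndex()
	ix.Index("a", []string{"bleve", "search"})
	ix.Index("b", []string{"search"})
	fmt.Println(ix.Search("search")) // both documents

	ix.Index("a", []string{"go"}) // update: old postings for "a" are removed
	fmt.Println(ix.Search("bleve"), ix.Search("go"))

	ix.Delete("b")
	fmt.Println(ix.Search("search"))
}
```

Without the back index, an update or delete would have to scan every term to find the document's old postings; with it, the work is proportional to the number of terms in that one document.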