
#bbuzz: Ask Me Anything: Lucene 9


Formal Metadata

Title
#bbuzz: Ask Me Anything: Lucene 9
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Uwe Schindler is deeply involved in the development of Apache Lucene and is happy to answer all your questions about the upcoming Lucene 9.
Transcript: English(auto-generated)
Hi, everyone. Welcome to the AMA session with Uwe Schindler about what's new in Lucene 9. Uwe is a committer and PMC member
of the Apache Lucene and Solr project. He's worked on a lot of interesting Lucene features, like the fast numerical search, and he maintains the attribute-based text-analysis API. He works as the managing director of SD DataSolutions in Bremen
that provides consulting and support for Apache Lucene, Solr, and Elasticsearch. He also works for Pangaea, where he implemented the portal's geospatial retrieval functions with Lucene. So, welcome, Uwe. Hi.
So, thank you for the introduction, and I'm glad to answer all the questions about Lucene 9 that you could imagine today. So, in general, maybe the first thing that I want to start with, before the questions,
I hope they are coming in already, is to give you a short overview of what will come with Lucene 9, and maybe also a little bit of background on Lucene 8, because the interesting thing with Lucene 9,
in short, if you would put it that way, is that there are two main new features which are the important changes. One is moving to Java 11, yeah, and the second one is that there is currently work going on to adopt Gradle as the build system,
so that's something which is more interesting for the developers. Those are the most visible new features. The other ones, which are also included, are in most cases also backported to the Lucene 8 series, generally in Lucene 8.6.
So, in most cases, when we have a new Lucene version, it's just the breaking changes coming in, and more or less, the breaking change here is that you have to switch to Java 11, and for the developers, it's Gradle, but we also updated a lot of stuff like,
for example, analyzers. We have new Snowball analyzers for new languages, and also the old ones were updated, so they should be faster now, because the auto-generated code was updated, and there are some changes, but because they break backwards compatibility,
they are new in Lucene 9. Most of the other stuff was already backported to Lucene 8, and there are three main things there that you can ask questions about today. The first is the so-called early termination of queries,
which is called Block-Max WAND: if you don't need the exact number of results of a query, you can short-circuit if you know that some of the documents will never make it,
because they cannot be in the top 10. That's a new feature which is steadily evolving to affect more queries, and that's very, very important. The second one, which is an ongoing process in Lucene 8, is moving, step by step, the index parts
that are currently mostly, or in previous versions were mostly, located on the heap, off the heap. That means the index files are only memory mapped, and to run, for example, your Solr or Elasticsearch cluster, you only need very little heap space;
mainly you only need the heap space for managing your cluster, like the cluster state and all that stuff, but for executing queries and doing aggregations or faceting, the amount of heap needed gets reduced. And another new feature in Lucene 8
are the interval queries. And I also have something, because there are also some Solr users here: the next version of Solr will not be Solr 9, it will be Solr 8.6 with Lucene 8.6, and there are some smaller changes.
As said before, Block-Max WAND is now finally also coming to Solr, so the short-circuiting of queries is coming, and you also have some minor stuff, like a security info panel in the admin UI, which is now important, because we got a lot of security requests
for non-secured Solr clusters, so you see a little bit more information, and there are also some streaming API improvements coming. And, yeah, sorry, I forgot about that earlier. But I think, with the introduction,
that's almost all, and now I want to get some queries from you, and I'm sure you have already collected some in the session. Yeah, so I'm gonna get back to the new features in a bit, and all of them look really interesting.
But with the new version, what does it really mean for existing users? Is there something that they should look out for, prepare themselves for? Or is there something you want to highlight? Yes, so when you're upgrading to a new Lucene version, and the same applies to Solr and Elasticsearch,
there are, of course, some breaking changes, and one of those is, of course, that you have to upgrade to Java 11. I hope most of you already did that for your current clusters, because this is the last chance to do that.
But you also have to keep in mind some other stuff. So for example, if you have older indexes that you haven't re-indexed for a long time, it might happen that you cannot use them with the new versions of Solr or Lucene, because the backwards compatibility layer
only allows reading indexes from the previous major version. So that means once you migrate to Lucene 9 or Solr 9, you can only read an index which was originally created, and that's very important, originally created with Lucene 8. That's something which is new since Lucene 7
and was made stricter in Lucene 8. Previously it was possible to simply merge an index or optimize it to get it to almost the latest version, but there were some changes between Lucene 6, Lucene 7 and Lucene 8
around internal statistics that are used for scoring and so on, and also offsets which were partly negative in older indexes, and because of that, we had to put a stop to it. You cannot upgrade those old indexes. You can still upgrade them to Lucene version 8,
but the problem is Lucene 9 will still refuse to open that index, because the original version with which the index was created is also stored in the index, and if Lucene, on opening the index, figures out that the index was created
with a very old version and just upgraded sequentially from version to version, it will refuse to open it. There are some workarounds available, but they need manual work. So you have to somehow change the index files
on your own by hacking something in Lucene, but in general, that's not a supported configuration, so you should be prepared to re-index. That's mainly what affects all major Lucene versions, and then you have to install Java 11 or possibly later,
and when you're upgrading Solr, the steps from Solr 8 to Solr 9 are not as complicated as they were from Solr 7 to Solr 8, where you also had
to take care of the HTTP version 2 changes, so you can easily do a rolling upgrade. If I'm not right here, Anshum, you may correct me. That sounds absolutely right, but just to summarize, there's no easy way to do this
for anyone who's on a Solr, Lucene or Elasticsearch version that uses anything prior to a Lucene 7 release, right? Other than re-indexing everything and making it work. Yeah, as I said, there are some other possibilities,
but they need hacking and writing some Lucene code on your own. It's very funny, you sometimes have customers which have really, really old indexes and still want to get onto newer versions, and then it really gets funny, so yeah.
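For readers who want to see what this version constraint looks like in code, here is a minimal sketch (my own illustration, not from the talk, assuming the Lucene 7/8-era APIs) of checking which major version originally created an index and running the stock upgrader:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CheckAndUpgrade {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
      // The major version that originally created this index is recorded in the segments file.
      SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
      int createdMajor = infos.getIndexCreatedVersionMajor();
      System.out.println("Index originally created with Lucene major version: " + createdMajor);

      // IndexUpgrader rewrites all segments to the current format, but it cannot lift the
      // "created version": an index originally created before Lucene 8 will still be
      // rejected by Lucene 9, exactly as described above.
      new IndexUpgrader(dir).upgrade();
    }
  }
}
```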
That's something we can talk about another time. So we've been getting questions on the channel. The question says: hey, I was attending ECIR this year, and they discussed how great it is
to see Block-Max WAND implemented in Lucene, but they also discussed why it takes so long for researchers to see their ideas implemented in real-world products. What's your take on this? That's an interesting question. I was expecting that I should explain how that works.
Feel free to, we can take that offline. I can do that later. So, but basically, it started already in, I think, 2011 or 2012, I'm not 100% sure which year it was. We had a talk here at Berlin Buzzwords.
The first version of block-max was already proposed back then, quite early in Lucene, I think it was with Lucene 3.6 or maybe at the early time of Lucene 4, and we had those new codecs at that time,
so there were already some ideas how to implement that, but the biggest problem was, as always with theoretical stuff: in theory, it's in most cases quite easy to implement, because you have something like an isolated problem
where you can create your index just based on the requirements for the feature. But you know Lucene has a lot more features than only scoring documents, so we have numeric fields and all that stuff
that is also used in the scoring, and the other thing is, in Lucene, it's possible to easily change the scoring model that you're using. So, for example, in the past, we were using TF-IDF, but now the default is BM25,
but there are also other scoring algorithms that you can swap in, and that works without re-indexing the whole index, and because of that, the encoding of the index matters. So the approach at that time was called MaxScore, and that was to store the maximum score of every document somehow in the posting list,
and the problem here was, unfortunately, that to calculate the score, you need to know the scoring algorithm that you want to use, and if you bake those statistics into the index, you cannot switch it afterwards, and that was something we would have had to change, and also from the index file format side,
if you wanted to create the max-score index, you had to do that globally, so merging was not working well because there was no easy way to save that information. So that's basically the reason why it took so long, and then I think it was Adrien Grand and some other people, two or three years ago,
who were reading some newer papers about that, where people were picking up that old block-max stuff, and they tried to implement it. But actually, the implementation in Lucene is still different from what was originally proposed there,
because those papers were still using something like scores and storing them in the index, and our approach was to make it easier, and we also figured out that the index file format of Lucene is perfectly suited to put a second step on top of this MaxScore algorithm, which is the block stuff,
and because of that, it's called block-max, and the block-max algorithms were proposed much later than MaxScore. But we have now implemented that, and it's part of the Lucene code base, and because the Lucene index already has something like a block structure in the postings,
it was very, very easy to add. The more complicated work was to change the queries, and yes, it's still ongoing work to change more and more queries to use the block-max stuff, so there are various queries and all that stuff which still have to be refactored and changed.
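As a side note on the point above, that the scoring model can be swapped without re-indexing: a minimal sketch (my own illustration, not from the talk) using the stock Lucene similarities shows what that looks like for an application:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.ClassicSimilarity;

class SimilaritySwap {
  // The same index can be searched with different scoring models, because Lucene stores
  // raw statistics (term frequencies, norms) rather than precomputed scores.
  static IndexSearcher bm25Searcher(IndexReader reader) {
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new BM25Similarity());      // the default since Lucene 6
    return searcher;
  }

  static IndexSearcher tfIdfSearcher(IndexReader reader) {
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new ClassicSimilarity());   // the old TF-IDF model, no re-indexing needed
    return searcher;
  }
}
```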
Great, thank you so much. So, Mayya Sharipova, who's also the newest Lucene committer now, has a question. Can you discuss how Lucene is taking advantage of SIMD and processors that support this?
Ah, okay, yeah, that's also something newer. It's not strictly a new Lucene 9 feature; you can already try it out, I think. I'm not sure in which version it came,
so maybe she can answer that question on her own. But so, the new instructions in the CPUs, we can use them in Lucene, but at the moment, we cannot do that directly. That's one of the downsides of using Java as your backend system,
because we don't have any way to control how the HotSpot engine compiles the Java code when it creates machine code out of it.
But recently, in newer Java versions, there were significant improvements. I'm not sure which Java version you minimally need, so it might be that it won't work with Java 8, I'm quite sure about that, but with Java 11, you can be sure
that the code is compiled in a better way, and the trick here was the usual Lucene trick; I don't want to get too much into the details. It was also Adrien Grand, I'm not sure if he's also here at Berlin Buzzwords,
but he started to work on it and tried to optimize the index format a little bit, and also the algorithms, so that the virtual machine, the Java virtual machine, is able, based on the structure of the Java code,
to convert it into those new instructions. So we have to make sure that our compression algorithms are compatible with that. Sometimes we had to change the on-disk encoding a little bit so it needs a little bit more space,
with longs and all that stuff, and also in memory we are now using byte buffers to load those posting lists, but in that case, you can parallelize a lot of stuff. This is something like an ongoing process. It does not work with all CPUs
that are currently on the market, but luckily, we don't need to compile our code ourselves; we are hoping that the JDK is doing that, and in the past, this was already done. For example, a lot of code, like filters where we are working with bit sets, is already highly optimized to work with those new instructions,
although in older versions; it started with Java 7. And just as some funny background: those horrible bugs in Java 7 were caused exactly by the virtual machine doing the wrong stuff in that regard and creating assembly code which was not correct, and that's what led to those crashes
on some platforms, or corrupt indexes and all that stuff. So we need two parts there: we need to optimize our own code so the JDK can vectorize it, but there's also something else which might come a little bit later, and that's going in the direction of Project Panama.
That's a new project of OpenJDK that allows calling non-Java methods from inside Java, so it's possible to create a method handle to call something like a library that is compiled from C code, and that might be another possibility.
So we could optionally ship Lucene with some pre-compiled binary blobs in the JAR file that can be directly accessed without a speed penalty. You can already do that at the moment with Java, but the problem is that the switch from the Java code
to the C code is very, very slow at the moment, and the new APIs, like method handles on C code, look promising in that case; I also talked about that before. This will also improve the memory-mapped directory, hopefully.
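For the curious, here is a minimal sketch of the kind of Panama-style downcall being hinted at. It uses the java.lang.foreign API as it was eventually finalized, much later than this talk, so the exact classes and the required Java version are my assumption, not something Lucene ships:

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class NativeCallSketch {
  public static void main(String[] args) throws Throwable {
    // Look up a symbol from the standard C library and bind it to a MethodHandle,
    // no JNI glue code required. (getpid exists on Linux/macOS; pick another symbol on Windows.)
    Linker linker = Linker.nativeLinker();
    MethodHandle getpid = linker.downcallHandle(
        linker.defaultLookup().find("getpid").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_INT));

    int pid = (int) getpid.invoke();   // calls the C function directly
    System.out.println("pid = " + pid);
  }
}
```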
But this is something which may take one year more, I think. I would expect Java 17 may have that feature coming. There's a question with the new Java abilities to call C code.
Just a curious question on when can we see Lucene on GPU? Can you repeat? With the new Java abilities to call C code, the question is like when can we see Lucene working on GPUs?
Oh, God. Yeah, I think there are some issues open, but as always, the answer to all these questions is: nobody's working on it at the moment, but if you're interested and you have the infrastructure to do that with your Java versions,
then just step in, and if you have some good ideas and experience in using GPUs, then write some code, maybe the C code layer that we might use, and then we can see how to improve that.
Perfect, yes. I would have guessed so. I just asked that question because it just came up and had like a ton of interest. A lot of people just said, yes, we wanna ask the same question. So, okay, now- We already posted the Lucene issue with the GPU. So that's a long discussion issue.
It's LUCENE-7754, so it seems to be four or five years old. Yeah, so the next question is from Matteo: please explain the block-max part of Block-Max WAND. Okay, yeah, so I think I can do that
because I have some slides for that already prepared because that's not so easy to explain with just talking, so okay, the slides are coming in. So that's 10 times faster queries. So what's going on there? I will not go too much into the details
because I think most people here would stop thinking and just fetch their beers and drink too much of that. So, the main change, as said before, what's the background of Block-Max WAND:
the idea is to short-circuit all the queries where the total count is not needed. You might know that already from Google. For example, if you're searching in Google, you see on top of your results some information like: we found a lot of documents. Sometimes it's an exact number, like we found
exactly this many documents, which is very seldom the case, but if you're very specific with keywords, it works; at some point, though, it will simply tell you: we found a lot of documents, we won't tell you exactly how many there are. And the reason for that is that currently, when Lucene is running and executing a query,
the problem is to actually get the top document, the top 10 documents that you're displaying in your search results, it has to process all the hits because it has to calculate the score and put the document into a priority queue and the priority queue,
so whenever a new document comes in, it just puts it into a priority queue and at the end, only the top ranking documents are left over because everything that's not part of the priority queue is falling out at the end. There are already some optimizations in that process, so for example, if it figures out
the calculated score is already too low, it doesn't even try to insert it into the priority queue and just throws it away immediately, but you still have to calculate the score of that document. And the idea here is: how can we make Lucene skip
those hits which are not competitive at all? And the idea, as told before, was to put some additional information into the index, so during execution of the query, you can figure out, without actually calculating the score and collecting the hit, that a bunch of documents
which are in most cases somehow in a block together in your index, so you see something like the posting list consists of blocks of 100 documents. For example, you can say, okay, the next 100 documents are not interesting at all because the score that would be calculated based on the statistics of that block is too low
and so I can just jump over the block. This new change was implemented for some queries like the term query, which is easy, you're just looking up a term and then you're consuming the posting list, but all the other stuff, like the WAND in Block-Max WAND,
relates to the Boolean query, and that's very important for disjunctions. If you have a conjunction, in most cases it's not so bad, because you're just ANDing together your terms, but if you have an OR query,
where you are ORing, it's enough to have something like a stop word in your query, and that's also the reason why, in the past, most people said we don't want to have stop words in the index, because on an OR query, that gets horribly slow: you have to iterate the posting lists of all those terms and then it takes
a very, very long time, so you have queries taking two, three, four seconds sometimes for large indexes if you are asking for a stop word. It also applies to phrase queries, and there was also recently the addition for constant-score queries. But how does it work now? So, what we are doing is: we do not add something
like the score to our posting list; that was the original proposal, and this is why it took a little bit longer to do it. The idea here is that we still want to keep the information and be able to change, for example,
the scoring model, like switching from TF-IDF to BM25 or something completely different, and because of that, we did not store something like a single score value in our posting list, but instead we are storing the maximum term frequency
for a block in the posting list, and we are putting in the norm, and with that information for a whole block, you can of course calculate the maximum score of that block and then say, okay, those documents are not interesting, and we just skip over them.
Because of that, there are still some requirements, so you are not completely free with your scoring algorithm: if the TF rises, so if the TF goes up monotonically, the score must also go up monotonically. So if you have something like a scoring algorithm
which doesn't, and interestingly, yesterday evening in the talks there was some discussion that maybe people repeating their words should get a bad score at the end, because you know, in some shops this is abused,
if you have something like that, you cannot use this approach. So if the TF goes up, the score must not go down, and for the norm it's very similar, so the document frequency and all that stuff need to be somehow predictable, it must be monotonic, so the score is predictable,
because otherwise the algorithm would not work and the cool thing now with the block stuff is that this can also be done on a multi-level approach and if you see that multi-level, you can think of storing that in the skip list and that was a great idea that we had at that time,
so the original paper for that, I think it was the SIGIR 2011 paper on top-k retrieval using block-max indexes, and so how does it work at all?
So basically, what is a skip list? And that's the main slide, which hopefully explains what is happening. What you see here: you have, for example, the posting lists for 'lucene' and for 'search', so your query is 'lucene search',
and when you're searching, you first look up 'lucene' and 'search' in your term dictionary, and then you get a posting list, and the posting list is just a list of numbers. So in that case, 'lucene' is included in documents 3, 7, 8, 15, 16, 19, 32, 49, 51, 56
and so on, and the same for the 'search' term: you also have some document numbers. Now, a lot of people know that for AND queries you have this leapfrog approach,
where you're jumping forward in the lists, and that's why the skip list is there. For example, for an AND query, you want to find the overlap, so the first hit is document number 7, where both are inside. That means you first move the 'lucene' iterator forward and then you are on 7,
but then you ask the other iterator: please go forward, but the minimum document I'm interested in is document number 7. And then the second iterator, which is positioned on 'search', moves forward, and so it already skips documents 4 and 5,
and then, once they both landed on 7, the next one is 15, and so they are sometimes jumping. To help with that jumping, you have those skip lists: when something is in the first block, which starts at 3,
there is information that you can jump to document number 15 by just going there, and so it can skip over all those documents, because they are compressed and the compressed size is otherwise not known, so it knows where to go, and then it's on document 15. And the idea now is to add to that skip list
also that extra information. The skip list is also offered in multiple levels, so if somebody is on the first block, it can also say, okay, I need to go to document 49, so it can just jump over everything
and then it lands on 33, or maybe on 46, and from there it can iterate further. You see, that's an easy way to skip. And now what we are doing here is we simply store the maximum term frequency as additional information
inside that skip list. So we know that the maximum term frequency of 'lucene' in documents 3, 7 and 8 is three. That means, already while collecting the documents,
if we know that the current score needed to be competitive is something like 25, and based on the norm we know that we need a minimum term frequency of three, then we can ask the skip list: okay, let's jump forward,
get me to a document where the term frequency is greater than three and in that case, it can also use the multi level skip list, in that case, it can jump to document 33 and jump over two blocks already and in that case, it's then possible
to skip over those documents. The only information that you lose, of course, is the total number of documents that were hit, and that's why sometimes you cannot use this at all, but if you are not interested in the total number of documents, you can quickly go over them, yeah.
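To connect this to the API: since Lucene 8 the top-hits collector takes a total-hits threshold, and the count in the result carries a relation flag instead of always being exact. Here is a minimal sketch (my own, assuming the Lucene 8.x signatures of TopScoreDocCollector and TotalHits), not something shown in the talk:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.TotalHits;

class BlockMaxWandSketch {
  static TopDocs topTen(IndexReader reader, Query query) throws Exception {
    IndexSearcher searcher = new IndexSearcher(reader);

    // Count hits exactly only up to 1,000; beyond that the collector may use the
    // block-max information in the skip lists to jump over non-competitive blocks.
    TopScoreDocCollector collector = TopScoreDocCollector.create(10, 1_000);
    searcher.search(query, collector);

    TopDocs top = collector.topDocs();
    if (top.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
      // Like Google's "about N results": the count is a lower bound, not exact.
      System.out.println("at least " + top.totalHits.value + " hits");
    } else {
      System.out.println("exactly " + top.totalHits.value + " hits");
    }
    return top;
  }
}
```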
So if there are any other questions, we can also talk about that later. So thanks, Uwe, for that great explanation, super useful. There's a question: do you have to change the version in solrconfig.xml for Solr to 9.0 or 8.6
in order to use this feature? Does the config in Solr require any changes in terms of explicitly specifying the version to use the feature? No, not necessarily.
So basically, there's in most cases, a problem with upgrading Solr. So that's maybe some addendum to what I said before and I think, Anshum, you will confirm that. So whenever you are upgrading Solr, you should look into the version numbers
of your schema.xml, because sometimes, if you're keeping the old version, you get very, very bad defaults. I have a lot of customers with very, very old schemas where the default is to not enable doc values on their indexes, and then they complain that their heap space overflows because they just upgraded Lucene.
They're expecting that doc values are enabled and everything works, and then they get heap-space out-of-memory errors and all that stuff. But basically, for that feature, you don't need to change anything. You just need to upgrade to the new Lucene version. So for the block-max stuff,
to get that feature in Solr, you have to wait for 8.6, coming out hopefully soon, and then it will be enabled. I think the default is still that all the results are counted, I don't want to misrepresent that, but the response format will also change a little bit, so you get the information about the number
of hits collected and some information about whether it's an exact number or not. So it's something like a Boolean, and I think you can also control it during query execution with a query parameter. I'm not sure if this is implemented yet. I think the default is 1,000, so yeah.
Yeah, I think it's supposed to be done. It's not done yet. It's not done yet, yeah. So basically, you can simply say, I want to get exact numbers up to maybe 10,000 documents and then once the 10,000 documents are collected, it simply stops counting exactly
and then it will use the skip lists to jump over the documents. But that's something you have to explicitly enable, and you need to explicitly enable it because it doesn't work with all types of queries. For example, if you enable faceting on your queries,
it won't work, because then you would also get incomplete facet counts and all that stuff, which is not really wanted. In that case, you have to think a little bit about how to do that in your code. One recommendation would be to just execute the query without facets,
show the results very, very early, and then in the background calculate the facets and deliver them to your website, maybe with an Ajax call or something like that. That would be one possibility, but in general, I think everything keeps the default.
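For Solr users, the knob that eventually shipped for this is, to my knowledge, the minExactCount request parameter together with a numFoundExact flag in the response (Solr 8.6). Below is a hedged SolrJ sketch; the host URL, collection name and query are placeholders, and the parameter names are my reading of what landed in 8.6 rather than anything confirmed in the talk:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

class ApproximateCountSketch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("title:lucene search");
      q.setRows(10);
      // Count exactly only up to 10,000 hits, then let Lucene skip non-competitive blocks.
      q.set("minExactCount", 10_000);

      QueryResponse rsp = solr.query(q);
      SolrDocumentList results = rsp.getResults();
      // numFound is a lower bound whenever numFoundExact is false.
      System.out.println("numFound=" + results.getNumFound()
          + " exact=" + results.getNumFoundExact());
    }
  }
}
```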
Yep, we have another question. I remember some discussion about vector ANN search in a JIRA issue. Is there anything in the works for Lucene 9.0 or is this still being discussed? Happy to get an update about the ongoing discussions as well.
Oh, no. So, vector ANN search. Yeah, Alexander, I saw it, because I have the channel open. So basically, the discussion is still going on, and it's very, very funny. The issue is still open, and recently a new approach came in to do the same thing,
and I know, for example, that Elasticsearch has its own approach at the moment so they are also doing something like indexing of vectors but they have a different approach and everything is somehow waiting for a Lucene internal implementation but currently, people are discussing
So there's an issue open on Solr. On the Solr issue, there's a proposal to implement it only in Solr, which is somewhat similar to what happens in Elasticsearch, and then the people around Mike McCandless and Mike,
the two Mikes from Amazon, are looking into that issue, and they have already implemented something. And the final step now is, as always: it sounds cool in isolation, but at the end you want to have something generic for how we want to index those vectors.
The newest issue, opened I think on May 20 or something like that, was about: we need something like an abstract API for the postings format, and something like a standard interface to access it, so we can implement both proposals of postings formats
and find the overlap, so we are not glued too tightly into the queries somehow; there also needs to be something like that on the query side, because the current approach that was posted for Lucene is not a new index format.
The original version, I think, that was opened one year ago, was using payloads or doc values; I'm not 100% sure, I think doc values in that case, and I think the Elasticsearch version is also using doc values for that,
which is just a workaround to store that information, but it has some overhead at query time. So we would need something like a special index format, something that describes the index format and an API to access those structures at the lower level,
so it's a hot topic at the moment again, but I would not expect that to come for Lucene 9.0. But as always, help us with that, and I think there are a lot of problems to solve; for example, Lucene is using
this segmented index structure so merging gets very, very expensive and all that stuff so we have to find solutions for that and we were also discussing yesterday in the other question round where it was already discussed so it would be good to have the vectors in Lucene
but you cannot simply query only on the vector so you always need something like a combined approach so that looks like learning to rank or something like that so you're first getting something like 1,000 top ranking results by conventional search and then the top 1,000 results, you would go and use the vectors in the index
to get the final top 10 or something like that, because doing that on the whole index is likely to blow up. I think, does this explain it a little bit? Yes, yes, sounds good to me. I'm gonna switch tracks
and ask you about the build system, considering a lot of effort has gone into switching over from Ant, that's something that we've been using forever, over to Gradle. Where does it stand right now, and what does it mean for the users? Yeah, as I said, I hope it will come with Lucene 9;
we are really putting a lot of effort into that. There are some other questions; I've seen another question coming a little bit later which is also related to this, and it somehow links together. So, in very short, we have a working Gradle build
at the moment so the only thing that doesn't work yet is you cannot build a release out of it but a lot of other stuff is already working so it is for the daily work of a developer so for example, if you want to submit a new feature
into Lucene and you have a patch or a pull request, when you're developing that, I would really recommend you to use the new Gradle build. You just have to call gradlew and then you can run everything; we also already have precommit and all that stuff,
and you will see a significant improvement in test running times, because, you know, the Lucene build was already doing a lot of stuff in parallel when executing the tests, but if you have looked at the Lucene builds, especially when you're building Solr, it takes something like two minutes
until it starts to compile Solr, although it does nothing on the Lucene side at all; it just iterates through all the Ant projects, which all find out there's nothing to do and compile, and then it starts to maybe compile the changes, and then you compile one file in Solr. So this should be really, really much faster,
because Ant does not really work incrementally, so developing with Gradle should be fun again. We also have some new checks which are only implemented in Gradle. For example, in Solr, if you have wrong log usage; Erick Erickson committed something that immediately warns you, like forbidden-apis does,
for example, to not concatenate strings in your log messages and instead use those curly brackets and pass parameters, and to never call a method inside those logging parameters, and all that stuff, because that was already slowing things down.
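To illustrate the logging pattern that this new Gradle-only check is after, here is a small sketch of my own, assuming SLF4J as the logging facade (as Solr uses); the class and method names are made up for the example:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LogUsageSketch {
  private static final Logger log = LoggerFactory.getLogger(LogUsageSketch.class);

  void onCommit(int numDocs, long elapsedMillis) {
    // Rejected style: the string is concatenated (and any method call evaluated)
    // even when the log level is disabled.
    // log.info("committed " + numDocs + " docs in " + elapsedMillis + " ms");

    // Preferred style: curly-brace placeholders; arguments are only formatted
    // if the message is actually logged.
    log.info("committed {} docs in {} ms", numDocs, elapsedMillis);
  }
}
```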
So that's a new check only working with Gradle. But you cannot do a release at the moment, because the packaging is not yet working; I was working on the documentation tasks something like two or three weeks ago, and that's already very, very fine,
so I think the packaging is more or less the last step, and then we can release. So I would suggest you just try it out. There's also the other problem of backporting patches to earlier versions, because at some point we want to change
the build infrastructure a little bit, like the directory structure, to be more aligned with Gradle conventions, which we cannot do at the moment because both build systems exist at the same time. So there's a lot of customization in Gradle at the moment, so if you import it into your favorite IDE
you might get some problems with it, so yeah. Okay, great. I have two questions on my mind right now, and one of them is the hard one, the other one not so much. I'll put the hard one out there first,
so that we can use all the rest of the time for the other one. So, what is the background behind the plan to decouple Lucene and Solr? Is this the simple or the hard one? This is the hard one; the other one is a nice little one. Yes. Okay, yeah, so that's really something.
Please excuse that in some cases I'm also a little bit biased, because everybody has their own personal opinion about that. Actually, I'm quite open, and I also voted plus one
for the split of Lucene and Solr, but actually I'm somewhere in between; to me, since I was working on the Gradle build, it doesn't matter. But actually, the whole thing started like that, so you could also ask: why were Lucene and Solr put together, what is it now, 10 years ago,
and now we are suddenly trying to split from each other? That's very, very hard to explain to somebody, and so I was already expecting that this question would come up here. And I can tell you one thing: the split won't happen before Solr 9 comes out
and Lucene 9 comes out, because that would make it even more complicated, and I think it will also wait a little bit until the Gradle build is finished. That's only my personal opinion in that case, but most of the other committers agree with that. But how did it happen?
So, the reason; the question came up again, but it came up already in 2014. I remember when we were sitting in the restaurant after an earlier Berlin Buzzwords, in 2014 and 2015, and we were discussing whether Lucene and Solr might split or might not split.
Of course there are always some people who don't like Solr, so they want to split; on the other hand, Solr is really something which is also good for Lucene, because we get more extensive testing. So there were always those conflicting parties, and I was saying I would keep Lucene and Solr together,
but what happened is when we started to do the Gradle build which is already working on Lucene and Solr perfectly so I have no problem I see no reason to split because of the Gradle build but the Gradle build very very clearly showed that Lucene and Solr although they are the same project
have a completely different style of building the project. In fact, and that's also one reason why the Ant build takes so long, the first thing that happens is that Ant iterates over all the Lucene sub-projects and generates all the JAR files,
and then the JAR files are copied over to the Solr directory, and then they are consumed from there to build Solr. So at first sight that really looks like, yeah, that's somehow separate, and when you look closer into the Gradle build, and when you're doing a lot of stuff, you see there are other checks in it,
so it's really separate. And then there were also some statistics done a little bit later, like the number of persons who are working on that, and it turned out there are a lot of committers working on both projects, some of them only because they are forced to work on both projects;
so in that case if you do a backwards incompatible change in Lucene of course you have to touch the Solr code so this is not really an argument but it has shown that now in the newer time most developments are really separate from each other so Solr has its own iteration of new features
You also see, for example, as I said before, that Block-Max WAND was first there in Lucene; it came out with version 8.0, and I would have wished that with version 8.0 or 8.1 the block-max stuff
would also have appeared in Solr, because if you look at the patch, it's not really complicated. But unfortunately it took until 8.6, and it's still not out, to get that short-circuiting stuff into Solr. So you see here, if it were really one project,
that would have been one single commit changing Lucene and Solr at once. Okay, that way it's a little bit easier, but now the question: why did we move together in 2011? The reason for that was a little bit different, on the contrary; the problem at that time was
that Solr and Lucene were from two completely different persons and they had a completely different history and before Solr even joined the Apache Software Foundation there were a lot of implementations that should have been better placed in Lucene at that time
they were going into Solr, and this didn't stop; it went on all the time, so things were only developed in Solr, and it was not merged, or not even thought of, to do that in Lucene, because of the release schedule; Lucene had a very, very slow release schedule,
I think, and Solr was faster, or it could also be the other way around, I don't remember. Because of that, some of the stuff was implemented mostly in Solr, and one famous example is that in Solr there were tons of analyzers for different languages, and one example is the word delimiter filter,
which Lucene people still hate. And of course all that stuff was not usable for projects outside, although they are really Lucene-internal features,
because an analyzer has nothing to do with Solr so the idea here was to clean up the whole code base move the stuff where it belongs so for example function queries were moved to Lucene analyzers were moved to Lucene and that really helped to clean up the whole project
and it also was a great success for other people using Lucene. A lot of customers of mine were really happy to no longer need to import Solr dependencies into their own Lucene implementation just to get the analyzers, and of course this was also a good step forward
for Elasticsearch because Elasticsearch was able to use all those analyzers so you can see that from both cases so yeah word delimiter filter WDF it means VDF in German, yeah, funny. Yeah, so but Alan also hates it I think
yeah, I was just looking it up here. Yes, so those are some of the reasons, and now we are at a stage where the whole thing is really good. Of course there is still the risk that it can diverge anyway, so the same can happen,
but there's also the other thing: now that we are getting more and more Lucene committers, it's a good time to split. But it's not only about splitting the projects; we could do the same by just forking, and both would be under the same Lucene top-level project,
but the idea here is to make Solr its own Apache Software Foundation top-level project, so at the end it will also be solr.apache.org and all that stuff. So that's a good way forward. And about the dependencies, I think the approach will be very similar to what Elasticsearch is currently doing,
because a lot of people were complaining: yeah, if not everything from Lucene is immediately tested in Solr, that will be very, very hard. So that means, in contrast to the early versions of Solr where this was not the case, we must go some route to say:
okay, we are using snapshot versions from some shared repository, like the Apache snapshot repository, just with some date code, and it's regularly updated, and something like that. That's the approach Elasticsearch is taking, and I think that might also happen with Solr. And the other thing is
whether you want to align major releases or not. My personal opinion is we won't align minor releases anymore, because whether there's a Lucene version 8.1, 8.2 or 8.3 doesn't matter at all, it can still be Solr 8.5, 8.6, 8.7,
doesn't matter. But whenever the major version changes, there needs to be communication, because, as you remember from my earlier talks, we are only backwards compatible to the major version before, so with Lucene 9 you can only read Lucene 8 indexes, and so there's somehow a problem.
So this means we also at least need something like a major version update in Solr, etc. So I would think of it as: the major versions will stay aligned, but the minor versions not. I think that's everything I can say about the split here,
and I would tend to wait a little bit until Lucene 9 is released and the development of Lucene 8 and Solr 8 stops, or at least only gets backwards-compatible changes; at that point, I think, the split will happen.
So that's my best guess, but as always, it's never true. Anshum, you are muted. Oh, sorry. He was doing the estimates, I think.
So, the last question; you covered the previous one pretty well. Considering Lucene is a pretty complex system as a product, do you have any pointers or suggestions on how new contributors could engage
and contribute to the project and to the community? Yeah, so I think for some new contributors it's really hard because there is not too much
going on on the development mailing list; most of it is done in JIRA tickets, and you're flooded by JIRA tickets. We split those mailing lists not long ago because of that, so the real discussions are now going on on the development list, and the JIRA issues and the GitHub notifications are going to another mailing list,
so that helps a little bit, it makes it easier for new contributors. But there's still the problem that, because it's a huge, really huge code base, you don't know where to start, and because of that we have something like
tagging on our JIRA issues for the ones which are useful for beginners; mostly this was done for Google Summer of Code and something similar, so there are some issues which are a little bit easier to start with. Just start and submit a pull request,
which is much easier than before, now that the Apache Software Foundation has included full support for GitHub, so you don't need to muck with the Git repositories hosted by the ASF; just fork it on GitHub and send a pull request, and we, the committers, can also merge it for you very, very easily,
so I think it's much easier to do that, so just come there and submit a pull request. But of course, don't forget to register for the mailing list and ask your questions there; you can also talk in the discussions,
or maybe go to other issues and talk with us, so that should be something. And if you have something which was really useful for you, just give it to us; we are happy to see something like the recent contributions, like the talk before about that new postings format
for those indexes with many, many fields, which is really something only a few people need, but that was really something, and the person who proposed it originally is now also a committer. So that's also something you can keep in mind:
if you're working together with us, we are all friendly; sometimes we are a little bit harsh, and the Policeman is arguing with you about the bad code you are writing, but don't take that too seriously. Great, yes, I hope that's gonna motivate enough people
to come in and start contributing to the projects, that'd be great to see. Okay, I think we don't have any more questions, and we're also out of time, so thank you so much, Uwe, for answering all those hard and not-so-hard questions, all sorts of them,
so yes, thank you so much.