
Bug hunting with Apache Lucene


Formal Metadata

Title
Bug hunting with Apache Lucene
Series title
Part
42
Number of parts
110
Author
License
CC Attribution 2.0 Belgium:
You may use, modify, reproduce, distribute, and make the work or its content publicly available, in unchanged or changed form, for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication year
Language

Content Metadata

Subject
Genre
Transcript: English (automatically generated)
OK, this talk will be something like a follow-up to Andrew Haley's talk about bug hunting, in this case more from the user perspective: not how to debug those bugs, but how to detect them when they happen in the JVM. So for example, you are running your program, and suddenly you get a segmentation fault. You try again several times, and the second time it doesn't happen. So the problem is, it's very, very hard to reproduce.
So your test cases won't really catch them. Yeah, and you might know the Apache Lucene project. I'm part of the project management committee, and I'm an Apache Software Foundation member. And I'm mostly working on Apache Lucene, which
is a full-text search engine. We'll come back to that in a minute. And I'm working as a consultant mostly in this world, so providing support for full-text search engines based on Apache Lucene, Solr, and Elasticsearch.
So first: what's Lucene about? This is about Java, so maybe not everybody knows Lucene. Apache Lucene Core is a high-performance, full-featured text search engine, which is completely written in Java from the beginning. It started in about 1998, I think.
And at that time, it was really innovative to do something like a full-text search engine completely in Java. So now it's very mature, and it's used in a lot of products.
And basically how it works, just to show it in a very quick way: a full-text search engine uses an inverted index, which is nothing more than what you see here on the screen. It's basically what you have at the end of a book. And it's called an inverted index because you can look up a term in that index, something like "cherries", and you get the page numbers where this term occurs. So basically, this is not very complicated. Only, in a book you have just a few terms; in a full-text index, like, you know, Google has the biggest one somehow, but also in your shopping system, you have an index which consists of millions or even billions of terms. And you also have many, many documents. The page numbers are simply the documents where the term exists. These are the postings, the posting list in the full-text search engine.
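To make that concrete, a toy inverted index in Java (nothing like Lucene's real, far more compact on-disk format) could look like this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy sketch of an inverted index: each term maps to the list of
 *  document IDs (the "posting list") in which it occurs. */
public class ToyInvertedIndex {
  private final Map<String, List<Integer>> postings = new HashMap<>();

  void addDocument(int docId, String text) {
    // split the text into terms and append this document to each posting list
    for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
      postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
    }
  }

  List<Integer> search(String term) {
    return postings.getOrDefault(term, Collections.emptyList());
  }

  public static void main(String[] args) {
    ToyInvertedIndex index = new ToyInvertedIndex();
    index.addDocument(1, "Cherries and apples");
    index.addDocument(2, "Cherries and plums");
    System.out.println(index.search("cherries")); // [1, 2]
  }
}
```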
So Apache Lucene is a library which lies behind two products that, I think, more people know. One is Elasticsearch.
And the other one is Apache Solr. Both use Lucene behind the scenes. As for users, I just chose some that everybody knows. Wikipedia, for example, is using Elasticsearch for its full-text search. GitHub, which I think everybody uses, uses Elasticsearch for the code search when you are entering something there. There's also the iTunes shop using Apache Solr, and Instagram. Another user which directly uses Lucene is Twitter. They have a fork of Lucene, which is very special, changed a little bit to work better with their real-time infrastructure. And of course, every JDK developer also knows Jira; they use Lucene internally, although the implementation inside is not really the best one. If you are searching for an issue, it's in most cases a bit hard with the full-text search box.

So what makes Lucene interesting for testing for JVM bugs? I think the first thing you might think of is: OK, maybe the algorithms. That's partly true, because in Lucene, as I said,
you have huge indexes with billions of terms, partly gigabytes in size. You have a lot of very performance-critical tight loops, so this is optimal for the HotSpot optimizer to start doing optimizations, because many of those loops are executed millions of times for each search query. So it's very, very low-level Java code. It's not something like a Java EE application where you're just coding some interfaces; here you're processing data all the time.
The second thing is, you might think: yeah, we are here in a university building, so think of your studies at the university. Lucene, this term index you have seen before, is mostly based on structures like finite state automata and finite state transducers. You might know those from regular expressions and how they are internally implemented: when you're looking up a term, you are going through the state machine, and you have a query whose terms you want to match. And those FSTs, like the term index, are very, very huge; we have this term index inside to look up the terms and find the postings based on them. So we are also handling a lot of data, and together with all those loops, there's a lot for HotSpot to optimize. And it does that really, really well.
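As a toy illustration of that idea, here is a hand-built deterministic automaton over characters (Lucene's real FSTs are byte-based, far more compact, and also carry outputs; this only shows the "walk a state machine per character" lookup):

```java
import java.util.Arrays;

/** Tiny sketch of a term index as a finite state automaton.
 *  This particular machine accepts exactly the terms "tea" and "ten". */
public class TinyTermAutomaton {
  // NEXT[state][char] is the next state, or -1 if there is no transition.
  private static final int[][] NEXT = new int[5][128];
  private static final boolean[] ACCEPT = new boolean[5];
  static {
    for (int[] row : NEXT) Arrays.fill(row, -1);
    NEXT[0]['t'] = 1;          // state 0: start
    NEXT[1]['e'] = 2;          // state 1: after "t"
    NEXT[2]['a'] = 3;          // state 2: after "te"
    NEXT[2]['n'] = 4;
    ACCEPT[3] = ACCEPT[4] = true;  // states 3 and 4 accept "tea"/"ten"
  }

  static boolean contains(String term) {
    int state = 0;
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      state = (c < 128) ? NEXT[state][c] : -1;
      if (state == -1) return false;  // no transition: term not in index
    }
    return ACCEPT[state];
  }

  public static void main(String[] args) {
    System.out.println(contains("tea")); // true
    System.out.println(contains("ten")); // true
    System.out.println(contains("t"));   // false
  }
}
```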
The other thing is, there are also some code patterns which are repeated very often. Everything is based on iterators in Lucene; we don't use Java's iterators here, it's just the pattern of iterators. For example, when you execute a query, you are collecting the documents that are hits. Then you have several queries that you want to combine together, so every query returns the documents it has seen as something like an iterator over ints. And if you then run a second query and want to intersect the results, for example to find all documents where term A and term B occur, you have to somehow leapfrog; that's maybe a good explanation of the algorithm. So this iterator also has the possibility not only to jump to the next item, but also to jump to the next item after a specific document ID it has already seen, so the two iterators can work together.
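Here is a sketch of that iterator pattern, modeled loosely on the idea behind Lucene's DocIdSetIterator but simplified (these are not the real classes):

```java
/** Iterators over ascending document IDs with nextDoc() and
 *  advance(target), intersected by leapfrogging. */
public class LeapfrogSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  interface DocIterator {
    int docID();
    int nextDoc();            // move to the next document ID
    int advance(int target);  // move to the first document ID >= target
  }

  /** Iterator over a sorted int[] (a posting list in this toy model). */
  static class ArrayIterator implements DocIterator {
    private final int[] docs;
    private int pos = -1;
    ArrayIterator(int... docs) { this.docs = docs; }
    public int docID() {
      return pos < 0 ? -1 : pos >= docs.length ? NO_MORE_DOCS : docs[pos];
    }
    public int nextDoc() { pos++; return docID(); }
    public int advance(int target) {
      // linear scan for simplicity; real implementations skip in blocks
      while (docID() < target) nextDoc();
      return docID();
    }
  }

  /** Leapfrog intersection: each side jumps directly to the other's doc. */
  static void intersect(DocIterator a, DocIterator b) {
    int doc = a.nextDoc(), other = b.nextDoc();
    while (doc != NO_MORE_DOCS && other != NO_MORE_DOCS) {
      if (doc == other) {
        System.out.println("match: " + doc);
        doc = a.nextDoc();
        other = b.nextDoc();
      } else if (doc < other) {
        doc = a.advance(other);
      } else {
        other = b.advance(doc);
      }
    }
  }

  public static void main(String[] args) {
    intersect(new ArrayIterator(1, 3, 5, 8, 13),
              new ArrayIterator(2, 3, 8, 21)); // prints match: 3, match: 8
  }
}
```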
Those algorithms are, as you see here, also very low-level. And another thing which is done here is, for example, applying filters when users are clicking on facets and all that stuff. There are a lot of bit set operations inside, like AND and OR, and for the iterators we need stuff like nextSetBit. We have a lot of our own implementations; we don't use java.util.BitSet, because it's not fixed-size and has some performance problems. We also have things like sparse bit sets, where you have gaps inside the bit set, so that you can save memory. And of course, there's a lot of stuff there for HotSpot to optimize.
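A minimal sketch of the kind of fixed-size bit set intersection meant here: a tight loop over 64-bit words, which is exactly the shape of code HotSpot's auto-vectorization likes (and which comes up again with the AVX bug later in the talk):

```java
/** Sketch of a fixed-size bit set backed by a long[], as such
 *  implementations typically are (not Lucene's actual class). */
public class FixedBitSetSketch {
  static void and(long[] dest, long[] other) {
    // tight loop over 64-bit words; each AND intersects 64 documents at once
    for (int i = 0; i < dest.length; i++) {
      dest[i] &= other[i];
    }
  }

  static boolean get(long[] bits, int index) {
    // index >> 6 selects the word; Java's long shift is implicitly mod 64
    return (bits[index >> 6] & (1L << index)) != 0;
  }

  public static void main(String[] args) {
    long[] a = new long[2], b = new long[2];
    a[0] = 0b1011; b[0] = 0b0110;   // docs {0,1,3} and {1,2}
    and(a, b);
    System.out.println(get(a, 1));  // true: only doc 1 survives
  }
}
```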
One thing I have to mention is that we rarely use String, only in user-facing APIs. Everything internally works on byte arrays, which contain the terms as UTF-8 bytes. All that stuff like the FSTs works only on more or less huge byte arrays and slices on top of them. And the last thing Lucene is using is memory mapping: on 64-bit machines, the whole index is completely memory-mapped using MappedByteBuffer, and this makes it really fast. So whenever it's possible, we are using that on the searching side to read all that information from disk, with random access while sorting, and all that stuff.
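For illustration, here is memory-mapping a file with the plain JDK APIs that sit underneath this (the file name "index.bin" is just a placeholder):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Sketch of memory-mapping an index file for random-access reads. */
public class MmapSketch {
  public static void main(String[] args) throws IOException {
    Path file = Paths.get("index.bin");
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      MappedByteBuffer buffer =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      // Random access anywhere in the file without explicit read() calls;
      // the OS pages the data in on demand.
      if (buffer.remaining() >= 8) {
        long firstLong = buffer.getLong(0);
        System.out.println(firstLong);
      }
    }
  }
}
```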
But all this stuff you see here doesn't really make Lucene special if you want to run tests. The problem is that when you run your unit tests, you only have a very limited view, and HotSpot cannot really start to optimize, because when a unit test is only executed once,
it's very unlikely that the HotSpot optimizer starts to do something. So you need something in addition. One thing is, you need to run your tests on huge amounts of data; that's what Lucene is doing. And the second thing: if you just want to test that your code is correct, it's perfectly fine to write classical unit tests. But if you have so many combinations of algorithms you can glue together, and you also have the HotSpot optimizer kicking in, you need something else in your test framework.
And that's what Lucene started in 2011, and it's called "Randomize your tests, and it will blow up your socks". This was a talk by Dawid Weiss, who invented that framework.
The idea here is to not just have static unit tests; you are modifying the input data. Of course, you do not completely randomize the input data; you have some special constraints on it. You also try to exchange some software components. For example, in Lucene we have different directory implementations for doing the disk access, so you run the test suite each time with another implementation of the disk I/O access, and all that stuff. Then there's also the environment, like the local time zone; you can exchange that. Another thing, which is only in italics here, is exchanging the JVM and the operating system, but also the options of the JVM. And the last thing is inserting, through mocks, things like I/O problems or network problems into your tests.
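A hedged sketch, in plain Java, of what randomizing the environment could look like (the real Lucene test framework does this in a much more controlled, reproducible way):

```java
import java.util.Locale;
import java.util.Random;
import java.util.TimeZone;

/** Illustrative only: pick a random default locale and time zone for a
 *  test run, so locale- and zone-sensitive bugs get exercised. */
public class RandomEnvironment {
  public static void main(String[] args) {
    Random random = new Random();

    Locale[] locales = Locale.getAvailableLocales();
    Locale.setDefault(locales[random.nextInt(locales.length)]);

    String[] zones = TimeZone.getAvailableIDs();
    TimeZone.setDefault(TimeZone.getTimeZone(zones[random.nextInt(zones.length)]));

    // log the choices so a failing run can at least be re-created by hand;
    // the seed mechanism described next makes this fully reproducible
    System.out.println("locale=" + Locale.getDefault()
        + " tz=" + TimeZone.getDefault().getID());
    // ... run the actual tests under this environment ...
  }
}
```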
Now, randomization in itself is not bad, but unreproducible randomness is not really good for tests. So for that, we have a framework which allows us to make it predictable what happens there. It's an extension to JUnit, almost compatible with JUnit, and it supports everything, so you can run it from inside Eclipse. But it has some additional features: it allows you to do randomization, it allows you to isolate the tests and reset state, and we also have the possibility to change the random seed. I'll come back to that now. And the question is: how do you want to reproduce a test failure when you make everything random in your tests? Because of that, in the test framework,
for every test execution, the test gets an initial random seed, which is calculated by the test framework and is available through utility methods. And then the test can do something like randomizing its input data and do some stuff on it. When a test fails, this random seed is printed
during the test execution, and it's also included in the stack traces. So for example, we have a test here running in Eclipse, and this test fails. You see here you get the assertion error, but you also see we have a random seed inside, which allows us to execute the test runner with exactly that random seed, and it will reproduce exactly the same test environment based on it. And sometimes, of course, the JVM freezes; that can happen, and then you can also use jstack to get the test seed, to reproduce the same stuff.
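A simplified stand-in for that seed mechanism (this is not the actual randomizedtesting framework; the "tests.seed" property name is only illustrative):

```java
import java.util.Arrays;
import java.util.Random;

/** Derive all randomness from one seed so a failure can be replayed. */
public class SeededTestSketch {
  public static void main(String[] args) {
    // Take the seed from a system property to reproduce a failed run,
    // otherwise pick a fresh one.
    long seed = Long.getLong("tests.seed", new Random().nextLong());
    Random random = new Random(seed);
    try {
      // Toy "code under test": random data in, invariant checked afterwards.
      int[] data = new int[1 + random.nextInt(1000)];
      for (int i = 0; i < data.length; i++) data[i] = random.nextInt();
      Arrays.sort(data);
      for (int i = 1; i < data.length; i++) {
        if (data[i - 1] > data[i]) throw new AssertionError("not sorted at " + i);
      }
    } catch (AssertionError e) {
      // Print the seed so exactly the same run can be replayed:
      //   java -Dtests.seed=<value> SeededTestSketch
      System.err.println("Reproduce with: -Dtests.seed=" + seed);
      throw e;
    }
  }
}
```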
The problem is how to do assertions in your randomized test code. OK, you can do something like comparing against a reference: if you have a different algorithm doing the same thing, you can just compare the results. You randomize your input data; you have one implementation where you know that it really works, and one where you are not 100% sure, and you just compare the results. You can also do something like checking for boundary conditions and all that stuff. And finally, you can also do nothing: just run your tests and wait until your JVM crashes.
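To make the compare-against-a-reference idea concrete, here is a sketch that drives a hand-rolled fixed bit set and java.util.BitSet (the trusted oracle) with the same random operations:

```java
import java.util.BitSet;
import java.util.Random;

/** Differential test sketch: the long[] stands in for an implementation
 *  under test; java.util.BitSet is the reference it must agree with. */
public class DifferentialBitSetTest {
  public static void main(String[] args) {
    long seed = new Random().nextLong();
    Random random = new Random(seed);
    BitSet reference = new BitSet(1024);
    long[] custom = new long[1024 / 64];

    for (int op = 0; op < 100_000; op++) {
      int bit = random.nextInt(1024);
      if (random.nextBoolean()) {
        reference.set(bit);
        custom[bit >> 6] |= 1L << bit;
      } else {
        reference.clear(bit);
        custom[bit >> 6] &= ~(1L << bit);
      }
      // the oracle check: both must agree after every operation
      boolean expected = reference.get(bit);
      boolean actual = (custom[bit >> 6] & (1L << bit)) != 0;
      if (expected != actual) {
        throw new AssertionError("mismatch at bit " + bit + ", seed=" + seed);
      }
    }
    System.out.println("OK, seed=" + seed);
  }
}
```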
And that's what we are doing in the testing of OpenJDK early access builds at the moment. So, the idea of how to do that: we have learned about a lot of randomization here; what's missing now? What we need now is something like JVM randomization, so all our test suites are running with different JVMs and different versions. We are also running IBM J9, for example, or preview releases of JDK 9 at the moment. And a second thing that we are somehow randomizing,
okay, we are cycling through it, is this: the test runs are going 24/7. It's not that the test suite only runs when somebody commits code; no, the test suite is running all the time, the machine never stops, running several processes in parallel. And for every test run, we are changing the garbage collector, we are using a different JVM, like 32 or 64 bits. For 32 bits, we are also sometimes using the client VM, and sometimes we enable compressed
ordinary object pointers. So we are randomizing everything, running the tests until we find some issue. And the last thing we are randomizing is, of course, the operating system: Linux, Windows, Mac OS, and Solaris. Linux, Mac OS, and Solaris are not so interesting.
What's interesting, for example, for a lot of users is to also sometimes run your tests on Windows, because with its file system things often blow up: you cannot delete files which are currently open, and all that stuff. So you often see test failures on Windows which you would otherwise not see at all.
So basically, here is one of our servers which is doing the test runs. We have several jobs for different JVM versions and different operating systems.
So for example, here you see one for trunk on Linux. On every run, you see which parameters were used. Maybe it's a little bit too small, but for example, this one was JDK 32 bits, JDK 9 early access build 102, which is the current one, I think. Or is there a new one already? No? Yeah. And you can see that when the tests are executed, it runs something like a Groovy script which picks the right settings, like JAVA_HOME and everything else. After that, it starts the whole build process
and runs all the tests. At the beginning, it prints out everything again. And if you're watching the whole thing, you might see a sudden failure. The failure can of course be caused by your own project, which we have seen quite often in the last week because we tested some new component,
and whenever it was used, some tests were failing strangely. So you're also finding a lot of bugs in your own code by doing these randomized runs. But sometimes, like this morning, you see something like that: a segmentation fault.
Actually, on Wednesday I updated from build 95, I think, of JDK 9 to build 102, and this morning we saw the first issue, with 32 bits and the G1 GC, or something like that. So that's very recent, yeah.
Finally, I wanted to show some bugs we found that way. Initially, the whole thing started with the Java 7 GA release; we don't want to talk about that, it's very old already. And the second time something happened,
when we did not do early access tests, or not really early access tests, was Java 7 update 40. At that time, there was a very, very strange bug which only hit some users, and we finally found out that it was because in update 40 there were some improvements
to the superword optimization, which uses the CPU's AVX instructions. As I told you, we are using a lot of bit sets, and those bit sets are ANDed together or ORed together; the vector extensions make this faster, by doing stuff like ANDing four words at once
and all that stuff. The problem was this only happened on recent CPUs, and it was not easy to figure out. At that time, Rory O'Donnell came to us and asked if we could start to do the early access testing. It was finally fixed in 7 update 55.
For example, we also had some other smaller bugs found by the locale randomization: we found out that Runtime.exec doesn't work in the Turkish locale, because of a stupid bug. It was using String.toUpperCase or String.toLowerCase with the default locale. In English, the small i has a dot and the capital I has none, but in Turkish there are both dotted and dotless variants of each, so the case mappings are different. That's a problem, and because of that, something in the class initializer of ProcessBuilder didn't work.
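This is easy to demonstrate with the JDK's own locale-sensitive case mapping:

```java
import java.util.Locale;

/** The classic Turkish-i problem: case mapping depends on the locale, so
 *  code that upper/lower-cases ASCII identifiers must pass an explicit
 *  locale such as Locale.ROOT instead of relying on the default. */
public class TurkishLocaleDemo {
  public static void main(String[] args) {
    Locale turkish = new Locale("tr", "TR");
    System.out.println("i".toUpperCase(Locale.ENGLISH)); // I  (dotless capital)
    System.out.println("i".toUpperCase(turkish));        // İ  (dotted capital)
    System.out.println("I".toLowerCase(Locale.ENGLISH)); // i  (dotted small)
    System.out.println("I".toLowerCase(turkish));        // ı  (dotless small)
    // So "ID".toLowerCase() is not "id" when the default locale is Turkish,
    // which breaks string comparisons that assume ASCII behavior.
  }
}
```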
So this was fixed in Java 8. There's also another bug which is not fixed until today; this would be something for Andrew to look into. It's still not fixed: we have an assert in the machine class that trips since 7 update 25, but it's mostly impossible to reproduce, so the cause is still unknown. And finally, for the Java 9 part: during this testing, like the bug you have seen today, we found a lot of those array-copy bugs
which were due to the improvements around that; I think it was Roland doing all that stuff. They were almost easy to reproduce, but they were only found by this random testing. There was also something else:
String.toLowerCase did not work with some concatenated strings; that was an actual HotSpot issue. And there's another one ongoing since build 93: for example, you have seen that we are not using Compact Strings at the moment, because they fail horribly with Lucene right now.
This one is easy to reproduce but was fixed recently; I think there was something like a memory barrier missing, so getting data out of the string just failed in that case. And there was something else we found out: build 54 broke compiling with source
and target 1.7 together with the diamond operator. This was a bug in the type system at that time. Oh, I'm not sure; it was not really a bug in the type system. The problem was that the type system was updated and it was not taking care of the old source and target versions.
And Lucene is also fixing things to work together with the JDK, so we are always updating our code. As for the next version: currently, Lucene is already compatible with Jigsaw as it stands at the moment. We are not yet running the tests with it all the time, just in our development environments. And we removed, for example, a lot of that setAccessible stuff; setAccessible is removed everywhere from the code, we only have one instance left. And that's another recent discussion
where we are working together, about the sun.misc.Cleaner removal in JDK 9. The problem is, as I said, we are using those memory-mapped files, and we have a little bit of a problem with long-running processes, so we need to do this unmapping ourselves, and that would have failed with the change.
So we have a workaround for that at the moment, and we are also working together on it. But we are prepared for Jigsaw, and I think you should really start to try to work with Jigsaw too. Yeah, and that was it, thank you. I also wanted to thank the people mentioned here:
Vladimir, who is always fixing all the HotSpot bugs, and the same for Roland and the others. Rory O'Donnell is sitting here. Yeah, thank you.