
A lightning intro to re-Isearch


Formal Metadata

Title
A lightning intro to re-Isearch
Subtitle
re-Isearch, the 27 year old new kid on the search block
Series Title
Number of Parts
287
Author
Contributors
License
CC Attribution 2.0 Belgium:
You may use, modify and reproduce the work or its contents for any legal purpose, distribute it and make it publicly available in modified or unmodified form, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
Project re-Isearch is a novel multimodal search and retrieval engine using mathematical models and algorithms different from the all-too-common inverted index. The design allows it to have, in practice, effectively no limits on the frequency of words, term length, number of fields or complexity of structured data, and it even supports overlap, where fields or structures cross each other's boundaries (common examples are quotes, lines/sentences, biblical verses, annotations). Its model enables a completely flexible unit of retrieval and modes of search. Developed using a highly portable C++ subset to be RAM efficient, the engine also provides bindings to a number of other languages such as Python, Tcl, Java, etc.
Transcript: English (automatically generated)
Welcome everybody to FOSDEM 22 online. Great to be here. This is a lightning talk on re-Isearch, the 27 year old new kid on the search block.
And as you can see, there was some support from the NLnet Foundation and NGI0. So, a little bit of quick history. Basically the engine goes back to '94, when it started off, and development was then split between North Carolina and Germany.
It was deployed in lots and lots of sites, it became a proprietary fork, and at one point, basically in 2011, it kind of stopped. And despite the lack of support, a lot of servers continued to run, and I guess they don't want to break a working thing. And as I said, lots and lots of sites: patent offices, NASA; you see this long list on the slide.
We'll upload the slides somewhere else so you can have a more detailed look, because I'm going to be going really quick; this is a really quick talk. So anyway, development terminated. Lots of people asked if I could open-source IB, which was the proprietary version of iSearch, and we were really surprised at ApacheCon at how primitive the offerings were.
So with the kind support of NLnet and NGI Zero, in the middle of the COVID pandemic, we decided to basically bring this thing out and push the envelope. So anyway, what's the difference? Mainstream search engines are about finding information: a list of documents, a list of offerings.
So basically a search engine gives you a long list instead of the content, and sometimes you can't find things, and then the sorting sort of screws things up. So basically a lot of stuff is actually not really findable, or not really searchable; I mean, it's there, so it's searchable, but you just can't find it.
What this engine is about is looking at the structure, let's say XML or other kinds of markup, and also structure such as paragraphs, because it has some logic that tries to identify, for certain kinds of doctypes, some basically visual context, and to find this sort of stuff.
So in contrast to normal search engines, what we have here is the possibility of a different kind of level, a different unit of retrieval. The standard unit of retrieval, if you look at something like your standard internet search engine, is basically the object that was itself indexed.
So that's the page, the PDF, the Word document or whatever. Here we have the possibility of a different kind of granularity, and we also have the structure. We have a document which is part of a collection, which is part of a collection. So we can search and then say: OK, actually, this is not the document that's relevant,
this is the collection, or this is a section of a document. So we can walk around the structure and try to figure out what actually is relevant. And I think that's pretty cool. And as I said, it's virtual, so basically we can create these collections to transcend these sort of bubbles so we
can dig deeper to find new insights, and not something that was cut ahead of time. So basically the users can redefine things at search time, and do multimodal research. We actually applied this back in 2008 as part of an art project by Isaiah, and together with the Dutch design group Metahaven presented this in the context of Internet search,
using this sort of multimodal recursive search to find layers of things of the same kinds of items. So the design of this thing is basically: we've got a core engine, we've got the documents, and we've got doctypes, and doctypes are the things, basically the interface, that understand the various document formats.
The doctypes help with the indexing. They also help on the retrieval side, basically doing some of the conversion. And here's sort of an outline of the algorithm. I will skip this because we don't really have time in this sort of quick talk.
But basically what I do is I index every word, I understand all the structures, I understand what word belongs to what structure, and I can do this really quickly. So basically we've got all the specific paths to various objects, and the little objects can get their own index structures.
And so we've got, like, a polymorphism. So we can also, for example, like in this case, have a date field, 6 February 2022, which can also be encoded as 02 06 2022, or even as luty 6, which is, I guess, Polish.
I don't know Polish, but luty is February. But we can also look at it from the point of view of text, where if we were looking for Feb as text, it won't match Luty. So we have different ways of looking at the same fields and objects. And we have some tricks for how we actually get really good IO, because this thing is IO limited.
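The date-field polymorphism described above can be sketched in a few lines of Python. To be clear, this is not the re-Isearch API; it is a toy reconstruction of the idea that one field is indexed under several encodings at once, so a date-typed query matches any of them while a plain-text query only matches its own surface form.

```python
from datetime import date

# Hypothetical sketch (NOT the re-Isearch API): index a single date field
# under several surface encodings at once.
MONTHS_EN = ["jan", "feb", "mar", "apr", "may", "jun",
             "jul", "aug", "sep", "oct", "nov", "dec"]
MONTHS_PL = ["styczen", "luty", "marzec", "kwiecien", "maj", "czerwiec",
             "lipiec", "sierpien", "wrzesien", "pazdziernik", "listopad",
             "grudzien"]

def date_keys(d: date) -> set[str]:
    """All surface forms a single date field is indexed under."""
    return {
        d.isoformat(),                                  # "2022-02-06"
        d.strftime("%m %d %Y"),                         # "02 06 2022"
        f"{MONTHS_EN[d.month - 1]} {d.day} {d.year}",   # "feb 6 2022"
        f"{MONTHS_PL[d.month - 1]} {d.day} {d.year}",   # "luty 6 2022"
    }

keys = date_keys(date(2022, 2, 6))
# A date-typed query normalises first, so it hits regardless of encoding:
assert "luty 6 2022" in keys and "feb 6 2022" in keys
# A plain-text search for "feb" only matches the English surface form:
assert not any(k.startswith("feb") for k in keys - {"feb 6 2022"})
```

The point is that the same stored value answers both a typed date query (match on any encoding) and a literal text query (match only on the exact surface form, so "Feb" does not hit "luty").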
We use mmap heavily, and a lot of little tricks for basically getting things in. And this goes down to low-level details of page translation, but what it basically means is that when we have multiple processes running on the index, we don't have to do the IO all the time:
the page is probably already in memory, so our little index reads are pretty fast. So, as I mentioned before, for every word we have an address; we can open the file and read it. And we also use the doctype to see how it's encoded.
So we understand all the paths, we understand all the words. Now, within our algorithm, the first X characters or octets of the string we sort of cache. That's a sort of look-ahead, so that we can understand what it is without necessarily having to open the file.
But it also means that we do not have limits on the length of any word: if it goes beyond that limitation, we open up the file and continue there. So we can look at the original file, and we can do unlimited-length literals, using any wildcard or regular expression of our dreams.
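That look-ahead cache can be sketched as follows. This is my reconstruction of the idea, not the engine's actual code: the dictionary keeps the first K octets of each indexed word inline next to its file offset, so short terms are answered from the cache alone, and the original file is only consulted when a query term outgrows K octets.

```python
# Illustrative sketch of the prefix look-ahead cache (not engine code).
K = 8  # cached prefix width, in octets (illustrative value)

corpus = "supercalifragilisticexpialidocious is a long word"
offsets = [0, 35, 38, 40, 45]                  # word start positions
entries = [(corpus[o:o + K], o) for o in offsets]  # (cached prefix, offset)

def lookup(term: str) -> list[int]:
    hits = []
    for prefix, off in entries:
        if len(term) <= K:
            # The cache is enough: prefix match without touching the "file".
            if prefix.startswith(term):
                hits.append(off)
        elif term.startswith(prefix):
            # Term longer than the cache: fall back to reading the file,
            # so word length is effectively unbounded.
            if corpus[off:off + len(term)] == term:
                hits.append(off)
    return hits

assert lookup("word") == [45]                  # answered from the cache
assert lookup("supercalifragilistic") == [0]   # needed the file read
```

The short query never opens the file; the long one does, which is exactly why the scheme imposes no hard limit on word length.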
And we can even go the other way: given a path, we can ask what's inside it. We can also reconstitute it any other possible way. And, as I said, we have these things called virtual indexes, which allow us, with a little file, to create other indexes, to lump indexes together, whatever.
And in fact, if we want, there's also the possibility within a real index to actually import another index, which is not the same as just having two indexes, because if I'm making a virtual index, obviously I have the cost of searching the one index plus the cost of searching the other index.
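The cost difference between a virtual index and an imported one can be sketched like this. It is an illustrative toy, not engine code: a virtual index pays one binary search per member index, while importing merges the terms so that a single binary search covers everything.

```python
import bisect

# Illustrative sketch: virtual index vs. imported (merged) index.
idx_a = ["apple", "mango", "pear"]        # each member index is sorted
idx_b = ["banana", "cherry", "mango"]

def contains(idx, term):
    i = bisect.bisect_left(idx, term)     # one binary search
    return i < len(idx) and idx[i] == term

def search_virtual(term):
    # Cost: log|a| + log|b| -- one binary search per member index.
    return contains(idx_a, term) or contains(idx_b, term)

merged = sorted(set(idx_a) | set(idx_b))  # the "import" step

def search_imported(term):
    # Cost: a single log|a|+|b| binary search over the merged index.
    return contains(merged, term)

assert search_virtual("mango") and search_imported("mango")
assert not search_virtual("kiwi") and not search_imported("kiwi")
```

Both answer the same queries; the imported form just collapses several searches into one, which is the speed difference the talk is pointing at.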
So basically the complexity adds up that way. But when I import it, then it's a single binary search within that one index, and importing is also pretty quick. OK. And as I said, we have this doctype registry. The indexer goes to the doctype registry, and the doctypes can call each other.
Because sometimes that's needed. So, for example, say I'm using the autodetect doctype, which is the doctype that tries to guess what kind of file I'm looking at if I don't tell it what it is. It looks at the file and goes: oh, this looks like a mail file. And it passes it off
to the mail doctype, and the mail doctype says: OK, well, it's not quite a mail file, this is a mailing list. So it will pass it on to another doctype, the one that handles lists. And the list one may say: ah, but this is not this kind of list, it's this other kind of list, and pass it on, or whatever.
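The hand-off chain just described can be sketched as plain function delegation. The doctype names and detection rules below are my invention for illustration; the real registry is C++ and far more elaborate.

```python
# Toy reconstruction of the doctype hand-off chain (names and rules are
# hypothetical, purely for illustration).
def autodetect(text: str) -> str:
    # Guesses the format, then delegates to a more specific doctype.
    if text.startswith("From:"):
        return mail_doctype(text)
    return "plaintext"

def mail_doctype(text: str) -> str:
    # "Not quite a mail file"? Hand it on to the list doctype.
    if "List-Id:" in text:
        return list_doctype(text)
    return "mail"

def list_doctype(text: str) -> str:
    # A real list doctype would discriminate further between list formats.
    return "mailing-list"

assert autodetect("From: a@example.org\nList-Id: dev") == "mailing-list"
assert autodetect("From: a@example.org\nHi!") == "mail"
assert autodetect("just some text") == "plaintext"
```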
We also have the possibility, via the various hyper-parameters of these doctypes, to create what I'll call virtual doctypes, which are the same as the doctype but with slightly different hyper-parameters. And again, it's just a file where you define this stuff, not a completely new parser. It's all in the handbook, in the documentation. And we've got lots and lots
of doctypes; you can see this list here, and some of them are also filters. So, for example, there's a filter that handles XML; we have an external file that does that. We have pandoc, we have all kinds of other existing filters here. And we handle all these different kinds of things here natively,
so we can reconstruct it. Then we also have something called a plugin architecture, which allows you to extend easily, using C++ code, to add additional high-speed doctypes to the system. When it finds them, it loads them up. As you can see here, I have an EXIF plugin, and I have an MS Office plugin, etc.
So we also have a bunch of tools: Iindex, Iutil, etc. And just to give an example: to index files, we run Iindex with the database name and the files; that's the simplest way to go. It will autodetect and then basically index them. But we have lots of options.
You can imagine: again, consult the handbook; there's lots of stuff here, and options alone just to select the files to walk in the tree. It's basically like find on steroids. Now, let's talk a little bit about performance, on the machine that I'm currently sitting in front of, which is an Intel Core i7, using 512 MB of memory.
Yes, you read that correctly. And it will actually run in as little as eight MB, or four; it'll run on your wristwatch. We get about 56,000 words a minute; we do about half a million emails in under 20 minutes. But that's because a lot of the parsing of mails is kind of complicated.
So if we're just indexing full text, basically not looking at the structure and the parsing, we get up to about 20 times that speed. We've clocked that on this notebook at 70 million words a minute, and we've put this up on servers where we're getting well beyond that. So, again, most of the time is spent analyzing document structure and parsing.
And there's a lot of room to actually improve that software. For searching, there's an Isearch tool; normally you'd write C++ code, or use a scripting language such as Python. But it's still quite useful, and it can even run as a daemon.
There are lots of features in there; we've done it as a back end, et cetera. And you can imagine it's got loads and loads of options, but I'm going to spare you that, because we have a limited amount of time, and I want to talk about the query languages we have, because that's where some of the real power is, beyond all these things I can do.
So, you know, what would a powerful engine be without fine-grained search, a number of query methods, and a rich query language? We basically have a number of query languages here. We've got CQL, which is maintained by the Library of Congress, things like SRW, Z39.50; if you don't know about it, don't worry about it.
Then we've got something I call smart queries, relevance feedback ("find me something like this"), and we also support infix notation and RPN notation. Now, smart queries are pretty cool, because they are for the non-technical user and easy to use. And they get around some of the problem of, well, should it be an OR or an AND: some engines basically take all the words and AND them,
some of them do OR, and either it gives you too much, depending upon the way it's sorted, or it misses a lot of important stuff. So what we have here is something called smart query logic, and it basically does a number of things. It looks at: maybe it's a literal phrase.
Are the terms in the same node? Maybe that would be kind of cool. Or an AND within this area here, or an OR but within this sort of context; that's sort of the function here. So I'll give you a quick example, searching through Shakespeare. I look up "rich water", and it finds no phrase like "rich water", but it finds lines where rich and water occur together.
So basically they're within the same container, which we call a peer. "Hate Jews": again, there is no line or container with that, but in The Merchant of Venice we have a lot of scenes talking about hating Jews within the same thing. And so "hate Jews" gets reduced to "I hate him for he is a Christian", found basically in the lines spoken by Shylock.
And as you see here: "I hate him for he is a Christian". Now, what we can also do when we're looking at this is, since we know our position within the structure, we can also go and walk back up, keeping it within the scene.
So we can ask, for example, who is on the stage, because we have these stage directions, you know, enter, exit, etc. And that's also pretty cool; we actually do a demo of that. But anyway: "out, out". It finds that it's actually used in a number of places.
So it finds here the confirmed phrase "out, out". And we have a general query language expression here, and it supports, as you can imagine, wildcards and Boolean operators: lots of Boolean, AND, OR and NOT, basically. Yeah, it keeps going here; long list. Consult the handbook: unary operators, NOT WITHIN.
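The smart-query fallback described a moment ago can be sketched like this. It is my reconstruction of the idea, not the actual engine logic: try the terms as a literal phrase first, then as an AND within the same container (a "peer", like a line of Shakespeare), then as a plain OR.

```python
# Toy sketch of "smart query" logic (not the actual engine code).
# Each container ("peer") here is simply a line of text.
lines = [
    "out out brief candle",
    "i hate him for he is a christian",
    "to be or not to be",
]

def smart_query(terms):
    phrase = " ".join(terms)
    # 1. Literal phrase match.
    hits = [l for l in lines if phrase in l]
    if hits:
        return "phrase", hits
    # 2. All terms somewhere within the same container.
    hits = [l for l in lines if all(t in l.split() for t in terms)]
    if hits:
        return "and-in-peer", hits
    # 3. Fall back to OR.
    return "or", [l for l in lines if any(t in l.split() for t in terms)]

assert smart_query(["out", "out"])[0] == "phrase"
assert smart_query(["hate", "christian"])[0] == "and-in-peer"
assert smart_query(["candle", "dagger"])[0] == "or"
```

This mirrors the Shakespeare examples: "out out" matches as a phrase, while "hate" and "christian" only co-occur within the same peer, so the query silently relaxes instead of returning nothing.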
Yeah, it keeps going; more things. We've also got the possibility to do sorting. And, you know, you're of course interested in search performance. Basically the performance is big O of log n plus log m,
where n is the number of unique words and m the number of instances in a field search. So basically the rule of thumb is: the smaller the result set, the faster the search. And we have a number of features to try to limit this in a practical way. We've got semantic search, we have personalized sorting, and we've got all kinds of objects which, as I mentioned, go beyond text.
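That O(log n + log m) cost can be made concrete with two binary searches over synthetic data. This is only an illustration of the stated complexity, not the engine's data layout: one search over the n unique terms, then one over that term's m instances (for example, to clip a posting list to a range).

```python
import bisect
import math

# Illustrative sketch of the stated O(log n + log m) search cost.
n, m = 1_000_000, 10_000
terms = [f"w{i:07d}" for i in range(n)]   # sorted unique words
postings = list(range(m))                 # sorted instances of one term

i = bisect.bisect_left(terms, "w0123456")  # ~log2(n) = 20 probes
j = bisect.bisect_left(postings, 5_000)    # ~log2(m) = 13-14 probes

assert terms[i] == "w0123456" and postings[j] == 5_000
# Total probes stay tiny even for a million-term vocabulary:
assert math.log2(n) + math.log2(m) < 35
```

A million unique words cost about 20 probes, and ten thousand instances about 14 more, which is why smaller result sets search faster.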
Here's some of the object list here. And we've got all kinds of relations, which is great; they don't even have to be in the object, they can actually be external; we allow that. And we also have dynamic presentation, which allows you to convert things back, so the content can come out of its place. And we've got all kinds of scoring, normalization and ranking, with which we can also do stuff.
So anyway, bullet points: ETL, a wide range of document types, data storage, dynamic search-time analytics, recommendation, customization. And it's really available. And, you know, anything you haven't dreamed of? Oh, yeah,
I could go on for hours; I'd make that a week. So anyway, visit nonmonotonic.net to learn more about re-Isearch. The software is all freely available on GitHub. So thank you very much.
Will there be a PDF or PowerPoint file of your slides? And then, what do you say about the query performance, I wonder? And as I saw, for the semantic search, you have to connect items, predicates and items together.
Do you define for the system that you are doing semantic search, or are they composed autonomously? You have different approaches. Number one: I will put up the slides on GitHub.
There's the handbook, there are the sources, there's all kinds of documentation, there's a comparison to other engines like you've seen; it's all on GitHub. Under re-Isearch you can find it on GitHub; the link that I published should actually go there.
And in terms of semantic search, yeah, there is actually a file. The file can be created by hand, because basically what you need on the semantic level has to be context dependent. So if I'm, for example, searching a collection that has to do with automobiles, obviously certain words have certain associations, which would be quite different than if I'm looking at, say, biology or whatever.
And so either it's created by hand, or you can use vectorization; there are a number of algorithms you can use to create this file, by actually doing some bag-of-words vectorization to get that.
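The kind of context-dependent association file described here can be approximated with simple co-occurrence counts. This is a crude stand-in for the bag-of-words vectorization mentioned, purely for illustration; a real pipeline would use proper weighting or embeddings.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: two "collections" mixed together, automotive and biology.
docs = [
    "engine piston turbo engine",
    "piston engine oil",
    "cell membrane protein",
    "protein cell dna",
]

# Count how often each word pair co-occurs within a document.
pair_counts = Counter()
for doc in docs:
    words = sorted(set(doc.split()))
    for a, b in combinations(words, 2):
        pair_counts[(a, b)] += 1

def top_associate(word):
    """Strongest co-occurrence partner of `word` in this corpus."""
    scored = Counter()
    for (a, b), c in pair_counts.items():
        if a == word:
            scored[b] += c
        elif b == word:
            scored[a] += c
    return scored.most_common(1)[0][0] if scored else None

# The same procedure yields different associations per domain:
assert top_associate("engine") == "piston"
assert top_associate("protein") == "cell"
```

The output depends entirely on the collection it was built from, which is the point: an automotive corpus associates "engine" with "piston", while a biology corpus would produce a completely different file.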
I think it's quite flexible. OK. And in terms of queries, you've got, as I tried to show here, the smart query, but you also have a very rich query language, and you can go back and forth; you can do even crazy stuff. So, for example, like I showed, searching in Shakespeare and getting back a line:
I can say, OK, now I want that line, I want to know in what speech that is, or in what act that is. And then I could even walk up to the stage directions and actually, at runtime, say who is on the stage when this line is being said. And I can do all of this as queries. So, more questions? Go ahead. OK: do you have system limits? Limitations of your system, indexing, indexer limitations?
And of course, query limitations. Do you have any limitations? I mean, there are limitations, yeah. Sorry; no, there are, of course, limitations right now.
The current version is, I think, limited to 32 million files and hundreds of terabytes of data. But that limitation is artificial, just because I wanted to keep the addresses within.