Optimization
Formal Metadata
Part Number: 12
Number of Parts: 12
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/16256 (DOI)
Transcript: English(auto-generated)
00:01
How can we optimize information retrieval systems? There are a few techniques that we'll discuss today. First, do we need that? I always like to bring some examples from real world systems. For example here, very nice systems, very nice example from, what is it, Continental.
00:23
Very basic things that don't work here. For example, the linguistic pre-processing: the adjective with an -en at the end, and there is no result for the base form, but if you, what is it?
00:44
Interestingly here, you don't find the result for the basic form, but for the modified form with the -en at the end, piezoresistiven: you find something with piezoresistiven, but not with the basic form.
01:01
Quite strange, right? It shouldn't happen. The stemming went somehow wrong for, well, a rather large company. Another example from the same website, Druk en finnige germann, aha, no results.
01:24
And then, if you search on Google, you will find that there is, in fact, a document with Druk en finnige germann on the website of Continental. And again, either it's a linguistic pre-processing problem
01:42
or a general problem, we don't know. Simply, you cannot find it, right? So, a lot of investment, make a nice website, but then the search will not lead the user to this document, to this PDF document, so a lot of users won't find it,
02:01
and the investment is basically lost. Also, what do we have here? Aha, probably a special character problem here: innovation on the Portuguese site here, inovação, as you may know,
02:21
there's a special character for the C here in Portuguese, and you don't find anything. Innovation should be a term for which any company has something to offer, right? Could be due to special characters. There are other issues here. Named entities: finding José Avila,
02:44
somebody who is, not just any name, this is somebody who is on the board of the company, and you find some results, but we don't find the list of the board members here with this query, which is strange, right?
03:02
The person should be well-known in the company, and you should be directed to this result. Anyway, so a lot of issues here. Another issue, for example, at Allianz, also a big website: Reiseschutzversicherung, so insurance for traveling, something that people might look for,
03:20
people might want to buy, and strangely enough, an insurance provider does not have any result for Reiseschutzversicherung. Why could that be the case? Of course, we don't know, there could be many issues, yeah? What is the one for the ones
03:43
who book the whole thing, sir? Aha, the travel agencies. This is, well, there are different types of travel insurance, something that we, it's a travel health insurance,
04:00
that's actually what I'm looking for, and there is a travel, what is that called? Reiserücktrittsversicherung, I don't even know the English name. If you cannot travel, you get reimbursed your money; I don't know the English term for that. Anyway, it could be a problem of terminology, right?
04:20
I'm looking for Reiseschutzversicherung, I know what I want, but indeed, and that's the issue, so Reiseversicherung, where is it? Here we see that they actually offer something, but it's simply a different terminology: Reiseversicherung instead of Reiseschutzversicherung.
04:45
Not such a bad term, Reiseschutzversicherung, is it? They should look at what people, how people call this, right, and add this to their index, at least. Also Reisekrankenversicherung, right? Even if it is not a technical term,
05:00
or doesn't exist there, somebody might query it, and you'll be able to sell one more insurance. Okay, so there's a lot of need for improvement in today's search technology, and we'll briefly discuss these optimization methods: important is relevance feedback,
05:21
and query modification, where we will focus on query expansion methods, and also fusion. That's at least what we should cover today. Let's see if we even manage to get to diversity and question answering, what time do we have? Okay, relevance feedback,
05:40
very important technique that is known to work extremely well, so if you can, if a system would offer that, it would really improve the quality of search. Usually the system knows very little about what you really want, right? The average length of web queries is two point something,
06:03
so two or three words, that's not very much, so that it would be better to give more information. I think I mentioned that before, right? Longer queries have a better chance of getting better, good results. So if you can think of more terms, just add them and then hopefully you might get a better result.
06:23
Now, relevance feedback is a bit similar. It also tells the user: please provide more information to the system. But it's troublesome to think of more terms, it takes time, typing effort, so people don't like to do that. They want to type as little as possible,
06:43
least effort principle, right? And I have to think of terms, what could I call this again, right? I have to come up with terms from my memory, not so convenient.
07:01
And relevance feedback says, okay, it's okay to just add a few terms, but then once you see the result, please tell the system which ones are relevant and which ones are not. So you don't have to come up with new terms, you just say this was relevant, this is relevant, this is not relevant.
07:20
Send this information back to the system. The system will say, aha, this was relevant, this was not relevant, this it didn't say anything about. How, what does he really want, right? How does he, what does he mean when he types these two terms? Wolf-Niedersachsen maybe, right? As we heard from him many times.
07:41
And now he thinks these two documents are relevant. What is typical for these two documents? Let's find more documents that also match the initial query, but that are also like the documents judged as relevant. So the system knows more about your need, about your information need,
08:01
by you saying this is relevant, this is not relevant. So there is a loop, there are two iterations now. There can be more, right? There's no need to stop if you don't want to. Reformulation, formulation of a query: the first time it's really a formulation, you view the results and judge them,
08:21
something that people do anyway, right? You look at something, that's not good. Oh, this is nice. And this is a mental load that you'll carry anyway. Now just let the system know, right? And you mark this, maybe there needs to be some additional interaction methods like a checkbox or something.
08:42
Mark the relevant ones and the system will be informed. So this can give more information. Here the thing in the relevance feedback process is our well-known graph, well-known overview.
09:03
We get results in our interface, we do relevance judgments, and this gives the system more information. In addition to the query, we now know, aha, these two documents are also typical, and of course what will we extract? We have example documents,
09:21
and from that we can get further additional query terms. We say, oh, in these two documents, term X and term Y are very frequent, more frequent than on average in the collection, so maybe these are interesting for the user. So we add these terms to the query,
09:43
we have an initial query, now we add some terms that we extract from the relevant documents, we get a longer query, and send it again to the system for the second iteration. So the distribution of terms in relevant and non-relevant documents is of course the real information for the system.
10:06
Add terms that appear in relevant documents, maybe exclude terms that are in the non-relevant document. This is optional, right? More important is of course the relevance.
10:21
Typically, relevance feedback leads to better results, but unfortunately almost no system really offers relevance feedback. Why? Because users just don't like to do it. It's again another interaction step. Once people have their results, they wanna look at them, they don't wanna go through a second iteration loop.
10:42
Unfortunately, right, it would lead, it's well known that this would give much better results. So obviously even such a small burden of interaction is not accepted by most users. And that is why system developers came up with the idea of doing blind relevance feedback.
11:03
What is blind relevance feedback? It's the same thing, just without real user information. We do the iteration, we get the result set, now we could show it to the user and stop there. No, let's assume we have relevance information,
11:22
and we assume now the first 10 documents are just relevant. And this we give back to the system, and then after the relevance feedback we get a second result with a new query and this is presented to the user. So automatically we extend the query,
11:41
this is called query expansion; we get more terms without really asking the user, because the user is not gonna do it. That way we might also improve our system. Blind or pseudo relevance feedback, as I said: no user interaction, the system just assumes the top 10, 20, 30 hits
12:03
are relevant and extracts some 10, five, or whatever terms from that, and adds them to the query. Okay, just the illustration of searching, and this I think ends, yeah,
12:20
the relevance feedback information. Questions on relevance feedback? Not yet, okay, so relevance feedback is just one technique within a family of techniques that are all called query expansion.
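A minimal sketch of the blind (pseudo-)relevance feedback loop described above could look like this; the toy collection and the crude search function are invented for illustration and stand in for a real index and weighting scheme:

```python
from collections import Counter

# Toy collection; a real system would use its inverted index instead.
DOCS = {
    1: "travel insurance for health problems abroad",
    2: "travel cancellation insurance reimburses your money",
    3: "board members of the company",
}

def search(query_terms, k=10):
    """Very crude ranking: count how many query terms a document contains."""
    scores = {d: sum(t in text.split() for t in query_terms) for d, text in DOCS.items()}
    return [d for d, s in sorted(scores.items(), key=lambda x: -x[1]) if s > 0][:k]

def blind_relevance_feedback(query_terms, top_k=2, n_expansion=3):
    """Assume the top-k hits are relevant and add their most frequent terms to the query."""
    assumed_relevant = search(query_terms, k=top_k)
    counts = Counter()
    for d in assumed_relevant:
        counts.update(w for w in DOCS[d].split() if w not in query_terms)
    expansion = [w for w, _ in counts.most_common(n_expansion)]
    return search(list(query_terms) + expansion)

print(blind_relevance_feedback(["travel", "insurance"]))
```

In a real system the expansion terms would of course be chosen by a proper weighting scheme rather than raw frequency, but the loop itself is exactly the one described: assume the top hits are relevant, extract terms, re-query.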
12:43
The goal is always to give more information to the system, to have more query terms than the user types, because we know more information is typically better. So we have a very short query, we try to get different terms that hopefully are useful for this user intent.
13:05
So we have user-based modification, relevance feedback, but we can also have query modification without user involvement. Don't forget this can happen with blind relevance feedback, as we discussed,
13:21
but there are also different ways to do that. Knowledge-based methods involve a thesaurus. We've talked about thesauri in the second class, I think, on manual indexing. Remember, a thesaurus is a collection of terms with their logical and semantic relations.
13:42
And we could do it automatically, we'll discuss it in a second. All these things I can do again with and without user involvement. I could say, oh, I have a thesaurus, let's suggest something, and I can either automatically include that, or first show it to the user and say,
14:00
hey, do you really want this term to be included or not? So suggest, let the user select the expansion terms, or not, right? This is optional. User involved or not, here is involved,
14:21
and now we have these two methods. Let's discuss these two methods, knowledge-based and automatic expansion of similar terms that have something to do with the query and might be useful. Ah, here first we have an interpretation of the
14:40
relevance feedback, well, a bit too late now, in the vector space model. Basically, we could say that relevance feedback means moving the query more towards the relevant results.
15:00
Let's assume these are documents that were judged as relevant, these are irrelevant documents, and this is my initial query. By saying, aha, this is relevant, this is relevant, this is relevant, the new query should be pushed closer towards the relevant documents.
15:23
It's still close to my original query, but closer to my relevant documents, and further away from the non-relevant ones. Okay, this is a graphical interpretation of the relevance feedback technique.
15:42
And remember, again, relevance feedback was something that we already discussed in the context of our network model, the spreading activation model. And remember, there we didn't have to think of it as a new technique, there it was inherently involved in the network functions
16:02
of the spreading activation model, right? I could inject interest at any point, it spreads out, and I can do it at any time, even after first iteration, after second iteration, whatever, it's inherent. So, second class of models, there you can look this up if you've forgotten about it.
16:24
What do I use for relevance feedback? Well, we'll discuss that: how many documents do I check, how many terms do I check, which terms do I select? And there are very simple models, like the Rocchio formula,
16:41
that would say, uh-huh, my new query is the old query plus all terms from the relevant documents with their weights, minus all those from the irrelevant ones. So we simply add the term vectors of the relevant documents and subtract the ones from the irrelevant documents.
17:03
That's a rather simple technology, there are other ways to do that, but we don't have to really go into this in detail. Also something you might, you should be able to explain what relevance feedback is, but you don't have to know the formulas for the exam.
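For reference, the usual textbook form of the Rocchio reformulation looks roughly like this; the weights alpha, beta, gamma are the standard textbook parameters, not values from the slide:

```latex
\vec{q}_{\text{new}} = \alpha \, \vec{q}_{\text{old}}
  + \frac{\beta}{|D_r|} \sum_{\vec{d} \in D_r} \vec{d}
  - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d} \in D_{nr}} \vec{d}
```

Here D_r are the documents judged relevant, D_nr the non-relevant ones, and the weights control how strongly the query vector is pushed towards the relevant documents and away from the non-relevant ones.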
17:21
What is term expansion, knowledge-based term expansion, based on the thesaurus? Here is an example. Of course we have different, remember which thesaurus? Ah, here we say it already, okay. Remember the thesaurus, the original function of thesaurus,
17:41
as we discussed it in manual indexing, was consistent indexing among a group of people. I'm telling, uh-huh, we don't call it car, we call it vehicle, or we call it automobile. So in case somebody wants to enter car in the library, for example, the system should then say, oh no, no, no, you can't enter car,
18:02
we all agree that it's called automobile. And then, if the user searches for car, we can also direct him to the preferred term. More interesting, of course: synonyms, related terms, broader and narrower terms.
18:24
Synonyms: I can expand a query with synonyms. You see, the user typed this word, but there are additional words that I know mean the same thing. Take the example car: I also know that automobile is more or less the same, and vehicle, so let's add these to the query,
18:42
maybe without letting the user know, and hopefully then he will get better results. Similar things happen also at Google sometimes. I could also do that only in case where I don't find anything, right? If I have a zero result set, I could say, oh, should I tell the user really that he didn't find anything?
19:01
How can I help him, right? Let's add some synonyms. Or if I don't find anything with synonyms, use related terms, or even broader and narrower terms. I don't find anything with bank, okay, let's give him something with financial institutions. And there are many thesauri,
19:21
I think I put some on the LearnWeb also. Interesting results, for example: different thesauri call things differently. We had the example of car, vehicle. For example, Staatsquote, Staatsanteil. Different thesauri use this,
19:43
this thesaurus from the social sciences in Germany says this is a synonym, but we prefer, Staatsanteil, no, Staatsquote. Staatsquote is the preferred term. And if somebody goes with Staatsanteil, he should be directed to Staatsquote.
20:02
And the same thing, of course, for the indexer, okay? Again, looking at the spreading activation networks, something that happens automatically there. Remember that we enter a query in the spreading activation,
20:25
the interest spreads out into the network, to the document layer, spreads back, and all of a sudden different terms are activated as well. And those are the related terms, terms that are similar.
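To make the knowledge-based variant concrete, here is a small sketch of thesaurus-style expansion; the miniature thesaurus, the preferred-term mapping and the broader-term fallback are made-up stand-ins for a real resource:

```python
# Made-up miniature thesaurus: preferred terms, synonyms, and broader terms.
PREFERRED = {"car": "automobile", "staatsanteil": "staatsquote"}
SYNONYMS = {"automobile": ["car", "vehicle"]}
BROADER = {"bank": ["financial institution"]}

def expand(query_terms, use_broader=False):
    expanded = []
    for term in query_terms:
        term = PREFERRED.get(term, term)          # map to the preferred term
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))   # add synonyms
        if use_broader:                           # fall back to broader terms, e.g. on zero results
            expanded.extend(BROADER.get(term, []))
    return list(dict.fromkeys(expanded))          # keep order, drop duplicates

print(expand(["car"]))                    # ['automobile', 'car', 'vehicle']
print(expand(["bank"], use_broader=True))
```

Whether the expansion terms are added silently or first shown to the user for selection is exactly the design choice discussed above.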
20:43
Okay, the next step now: we don't have a thesaurus. We want to find related terms automatically, without such a resource. How can we detect related terms?
21:11
When are two terms similar in terms of information retrieval? Any ideas, any suggestions?
21:30
Yes, that will be a possibility. I can say, aha, I will find similar terms
21:43
that look similar, that also have a similar spelling, right? But I will have a hard time finding something like car and automobile, or vehicle and automobile, right? Because they are spelled completely differently. I can find things like, what would be the example here?
22:11
Before? Maybe the results that you have clicked before, like putting in the new and even the results.
22:22
So considering the query, the click logs, considering what people accepted as a kind of a search result or viewed as a result, right? That is possible to assume that clicking and viewing a result is some sort of relevance,
22:42
and then we have some relevance information, then we can use this. We can use this exactly in the same way as a relevance feedback, as a long-term relevance feedback that we use for future interactions. Generally, also adding this, be aware that relevance feedback is a one-session approach as we presented it.
23:03
The next user will not profit from the relevance feedback. There is a different approach using the query logs, which probably some search engines use. Yes, that is possible, but we don't talk about it,
23:20
we talk about much more simple things. Uh-huh, but it's definitely a valid approach. How can I detect that a term is similar to another one? Well, I'll put them as a one and a related one.
23:42
Yes, that is almost like a thesaurus. A very simple form of thesaurus, right? But how do I get it, how do I find the related ones? The similar ones. Okay, I have the word bank. The user entered bank. Now I wanna find related terms to expand the query.
24:01
It's not so much, I wanna expand that. Then I can have a list, I can have a thesaurus, but how do I get it? Either manual work, a lot of cognitive work goes into a thesaurus, but we wanna do it automatically. Yeah? By checking the most frequent terms
24:20
from the top score documents, and then we could see the documents which have the term bank in it would most likely have something like money or something different. Yes, that's possible. How is that called? Something, something.
24:41
Ha ha ha ha ha. We have learned it today. It's exactly a good description. I know you already mentioned it. Blind relevance feedback. Pseudo relevance feedback. Perfect description of pseudo relevance feedback is a possible way to find similar terms once I enter the query. Very good. But then of course, if the match is not good and things like that, there is a more general approach.
25:02
Think of the simple basic data structure that we have. What was that? What describes our collection after indexing? Document term matrix, right?
25:21
We have a value for each term, for each document that describes its importance. Like this. And how can we find similar terms
25:41
looking at the simple document-term matrix? So far what we do with the document-term matrix is that we add another document, which we call the query, which maybe has a one here because the user typed bank. And then we compare this query to all the documents.
26:01
So we calculate similarities between these objects. And what can we also do with these objects, of course? We can, we could, the idea is instead of looking
26:21
at the rows, we could look at the columns. We can calculate the similarity between these columns. Think of all the similarity metrics that we've studied. I can say: how similar is house to bank? And that means for these two terms, like in this case,
26:45
I compare these two columns, I say aha, both have a high medium value. That's good for the similarity. Again, good for the similarity, how high, how bad. Oh, here it's quite different. That's not good for the similarity.
27:00
Just like when we calculate the hit value, the RSV, between document and query. Aha, low, high, they are not similar this time. That means I check whether the terms are similar based on their appearance in documents. Terms are similar if they appear in similar documents,
27:23
document contexts. If they appear in the completely same documents, they have a higher chance of being similar. They won't have exactly the same values, right, because of differences, but if they appear in completely different contexts and never are together
27:41
in the same documents, then we assume these terms are quite different. That's a very mathematical way to look at semantics, right? We don't know anything about the words. We just assume that, and that means associations. We call this association relations.
28:02
Objects, once experienced together, tend to become associated in the imagination, so that when any one of them is thought of, the others are likely to be thought of also. That is an insight from psychology: we remember things by associating them,
28:21
even if they don't have anything to do with each other objectively, right? And in retrieval, in a mathematical way, we think of associations as co-occurrence, the occurrence of terms together in documents. Association rules is a term, something that you know from e-commerce,
28:41
that things are bought by similar users. This is also an association that the system uses to recommend things to you. And here, also something we can have in Google, I don't know if it's activated the current version, but in previous version, we could look for something like similar pages and that the system did exactly
29:03
calculated this between rows, and now we look at the columns. So we can have the exact same similarity measure applied to columns, and that means we can have a similarity matrix between all the terms.
29:24
And that means frequent joint occurrence means similarity. So we don't have to know what these words really mean. In this case, we can easily, of course, say yes. There is a, bank has a high similarity with this word
29:45
and also with this word, but these two have completely, are completely dissimilar. They represent the two contexts in which bank can occur. One is the park bench, the furniture, and one is the financial institution bank.
30:01
And these contexts are probably exclusive, right? We have some similarity between bank and all of those documents, but little similarity between these. Understood? Clear?
30:20
Okay. So here we have the calculation, and now we end up with a similarity value, similarity matrix between all the terms. This is, it's not normalized, so we have values that are higher than one because we have very simple similarity matrix,
30:41
just the overlap, an inner product, but that doesn't matter, and we see that bank has a high similarity to money, park, and bank, but park and house and money and house have a,
31:00
what is it, money and park have a low similarity, for example, right? The financial institution reading and the furniture reading. Yeah, please. Excuse me, so you said that terms are similar if they appear in similar documents, is that right?
31:22
In the same documents, in similar document contexts. So if there are like a thousand documents, and there is a term that appears in 100, and there is another term that appears in 90 of those, and in five others, then they have a very similar co-occurrence pattern, right?
31:44
They have 90% of their documents are identical, plus five that only one has and five that the other has, so they have a high similarity. If each appears in 100 documents, two terms, and they have an overlap of only one, for example,
32:00
and they have a low similarity, right? So just where they appear, a mathematical score, again, appearance of documents determines the association value, the similarity value between terms, between words, basically. Okay? Further questions?
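A minimal sketch of this column-wise comparison, with an invented document-term matrix loosely along the lines of the board example; the plain inner product is used first, and cosine is shown for contrast:

```python
import math

# Invented document-term matrix: rows are documents, columns are terms.
TERMS = ["bank", "money", "park", "house"]
DTM = [
    [2, 1, 0, 0],   # document about a financial institution
    [1, 2, 0, 1],   # another financial document
    [1, 0, 2, 0],   # document about a park bench
    [0, 0, 1, 1],   # document about houses and parks
]

def column(term):
    j = TERMS.index(term)
    return [row[j] for row in DTM]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0

# Term-term similarity: compare columns instead of rows.
for t in ["money", "park", "house"]:
    print("bank vs", t, inner_product(column("bank"), column(t)),
          round(cosine(column("bank"), column(t)), 2))
```

With the plain inner product the diagonal entries are not necessarily the largest values, which is exactly the artifact discussed in the following question; cosine avoids that by normalizing.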
32:20
Please. But how come money actually has a higher similarity to house and bank than to money itself? Mm-hmm, ah yeah, that's an artifact of the inner product. The inner product is simply the sum of the products
32:43
without any normalization, like in cosine, and I use it because we could calculate it here on the board easily, right, but for time reasons, and since it's not normalized, it has these artifacts. If we would use cosine, we would spend more time
33:01
on the board, but we don't have that time, and then we would have a normalized measure. For a normalized similarity metric, we could, of course, expect that the diagonal values have a value of one, and the inner product doesn't fulfill this criterion.
33:21
this criterion. So, good question, hopefully resolved, because we use such a simple metric here, so we could explain it, if there are further questions. Further questions?
33:42
Not the case, okay, we move on. So, these statistical associations, they could even look useless for humans or be senseless by meaning, basically, when we see the word meaning of a term, we think of semantics and we associate
34:02
some meaning with it, I have maybe an image of what something means, but here we just calculate the context of the word, the documents in which it appears, and this is very close to what the language philosopher Wittgenstein says,
34:20
the meaning of a word is its use in the language. This is basically what we do here: the use is the distribution over documents, and that's what we use for the meaning of a word, to calculate similarities. We don't know anything about the real meaning, what we, as humans, would call meaning, right?
34:42
Okay, that is so much to term expansion. Then, of course, I could look at the similar words and add them to the query, right? So, for term expansion, we have relevance feedback, involves the user, pseudo-relevance feedback does not involve the user, we have expansion
35:02
based on knowledge in the thesaurus, or on the knowledge that we extract from the collection by calculating similarities. All that is term expansion, query expansion: adding more terms to the query, because then we have a better chance of being successful.
35:24
Okay? Any questions? If not, we move on to the next topic for optimization, and that is fusion. Also music style, isn't it? But it has nothing to do with music here.
35:41
Fusion is closely related to meta-search, it's also called rank aggregation, but it's a bit different from meta-search. We already had, basically already talked about one fusion method. Fusion means we have different ranking algorithms,
36:02
different indicators of relevance, from different sources maybe, and we want to bring them together, right? We want to have several pieces of evidence and fuse them. This is similar to meta-search engines, or even multilingual retrieval,
36:20
but here the things that we fuse, our indicators or evidence, come from different ranking algorithms. And we had one case already where we did, basically, bring together two elements, and that was when we talked about link analysis. We said we have a PageRank value,
36:41
and we have a normal similarity value, we have to bring them together to come up with some final result for a web page, okay? And we simply did that by adding them, I think, in the class on web retrieval.
37:00
Or multiply them, what did we do? I don't remember where. I think we added them. Now we will see that there are different ways to bring results together, very simple ones. Why do we do Fusion? We've already seen this image, I think, and we know that, aha, there are some.
37:22
This is a very typical result. We know that there are quite good systems, and the best systems are quite similar. Their performance is quite similar. We can hardly tell the difference between the quality of these systems.
37:41
The MAP values are very close together; probably between the second and the third there's not even a meaningful difference, a statistically significant difference, and we are faced with a situation where we have to select one of these algorithms.
38:00
And looking closely into these algorithms will show that, yes, on average, they're very similar in their performance, but system A might find different relevant documents than system B, and that is very typical. So we could simplify and say that, yes, different algorithms find different hits,
38:22
but overall they have the same quality. So the idea would be to bring them together and then have a chance to find the relevant documents that only A finds and the relevant documents that only B finds, and bring them into one joint list. That's why fusion algorithms are promising.
38:45
Ah, here it's, again, the text for that. And this is related to so-called committee machines, typical algorithms where several algorithms work together, like in machine learning.
39:01
We have different experts. The metaphor is an expert meeting, right? So different people come together, bring their opinion, and then hopefully the committee comes up with a better decision than any individual. And that is what we try for algorithms in AI
39:21
and machine learning. There it's called committee machines; in information retrieval, we call it fusion, or sometimes rank aggregation. In web retrieval, fusion is also popular in the form of so-called meta-search engines, where you can query several search engines at once.
39:42
But there is a different reason behind that. The reason there is that each individual search engine has a low coverage, around 30 to 50% of the web, which has been a fairly constant value, even as the web grows. And there the idea is: if we query several search engines,
40:01
we get a higher coverage, all right? In fusion, the idea is different. Each algorithm gives me an RSV, an estimate for each document, and I want to get higher quality, not higher coverage. For example, very simple:
40:20
I have three results from three algorithms. We know how we can create different algorithms, right? Different weightings and whatever. And here we have a situation: algorithm one says this is the top-scored document, A, with this value, the retrieval status value; the second says, oh no, it's B.
40:42
B is with 0.9, must be the best one, and algorithm three would say, no, no, no, no. A is best, but only with this value. So now I have three experts, or three algorithms giving their vote, but I want to bring this knowledge together
41:01
in one ranked list for the user. What would be the top document in this case?
41:22
What do you think? How do we calculate one list from three or four lists? What is the number one hit?
41:43
Of course there's no perfect solution, there are many, many solutions now, but what is your initial feeling? Is it A, is it B, is it C?
42:05
C, C is not a top document in any of the three algorithms.
42:28
Okay, uh-huh. I said the whole column, this one is A, the other one is a B and G, and the other one is C. Ah, okay, it's a misunderstanding then, okay.
42:44
That's, but any other solutions? Anybody else thinks C is good? Who thinks A is a good hit? Who thinks B is a good hit? Okay, B is popular, why?
43:04
Must be because of its high, Retrieval status value, uh-huh. If I sum up all the RSVs, actually that's what we learned in the web retrieval case for aggregating retrieval status value plus page rank,
43:25
so it's a good idea to show that you paid attention. I end up with 0.4, five, plus nine, plus da-da-da, six, 1.5, quite a high number. If I do the same for A, I will get a much lower number, four, 0.9.
43:45
Okay, good idea. What about the advocates of A? What could be reasons for them to challenge this? It has the top position twice.
44:01
Yeah, I could simply count the top positions. Two algorithms think it's the best, right? What about B? It's only there once. Other reasons? Same reason. Same reason, okay. Other reasons for B?
44:22
For A? For A. If I calculate only difference, I can normalize the counting, maybe then I can sum up better.
44:42
How can I compare this first hit, that only 0.5, right, to this one with 0.9? The numbers don't mean anything to me, really. They are only used, meaningful in comparison. So I would have to normalize these lists and say, aha, this is the top, so it gets the highest score.
45:03
The highest score is different in these two cases, but I cannot compare it in these algorithms. So I would have to normalize these lists, and then I can check, and then probably, maybe I did it, let's see, maybe the next slide, I would end up with higher values for A, because if I normalize, this is one,
45:22
this is one, and this is one, right? So I end up with a higher value for, exactly. Other arguments?
45:41
I could also check, what is the minimum value? Maybe A would have, in this case, the minimum value is identical, right? I don't have an indicator. I could also check what is the lowest position. I'd say, aha, A is on four, aha, no, B is always,
46:01
is at least at position three. So there are lots of different ways to do this now, of course. Let's check some of them. I could also filter something and treat fusion as a filtering approach, but basically, what we really want is: how can we make one ranking from several lists?
46:23
There are several different ways. Very simple is the round-robin approach: you simply take one hit from each list at a time. If it's already in the list, then skip it. Then, what we just did was a sum; automatically people summed up, right? The sum of RSVs, and you said you should normalize first.
46:44
We could also take the minimum or maximum. We could take votes. How many votes are there for top position for ABCD? And, also popular would be a multiplication. If I multiply, then low values will have a higher impact. All right?
47:00
If I have a zero, if some algorithm says, or expert says, this is really not relevant, then I can't win it all, right? I'm out, if I have a zero value in multiplication. For adding up, it doesn't matter so much. If the other thing is really good, it's a really good document, it still has a chance to get there.
47:20
So, different methods. Here is an example for normalization. We have two lists, I normalized them, so we have a one and a zero, one on top, zero at the bottom, and then we can do the, we can sum up the values and compare.
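A minimal sketch of these fusion steps with invented scores; it shows min-max normalization, a summed (CombSUM-style) fusion of the normalized values, and round-robin, roughly in the spirit of the slide but with made-up numbers:

```python
# Invented ranked lists: document -> retrieval status value, per algorithm.
RUN_1 = {"A": 0.5, "B": 0.45, "C": 0.4, "D": 0.2, "F": 0.1}
RUN_2 = {"B": 0.9, "A": 0.6, "D": 0.5, "C": 0.3, "E": 0.2}

def normalize(run):
    """Min-max normalization: top document gets 1, bottom document gets 0."""
    hi, lo = max(run.values()), min(run.values())
    return {d: (s - lo) / (hi - lo) for d, s in run.items()}

def comb_sum(runs):
    """Sum the normalized scores of each document over all runs."""
    fused = {}
    for run in runs:
        for d, s in normalize(run).items():
            fused[d] = fused.get(d, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)

def round_robin(runs):
    """Take one document from each list in turn, skipping duplicates."""
    ranked_lists = [sorted(r, key=r.get, reverse=True) for r in runs]
    fused = []
    for i in range(max(map(len, ranked_lists))):
        for lst in ranked_lists:
            if i < len(lst) and lst[i] not in fused:
                fused.append(lst[i])
    return fused

print(comb_sum([RUN_1, RUN_2]))
print(round_robin([RUN_1, RUN_2]))
```

Minimum, maximum, vote counting, or multiplication would just replace the addition in the fused score; the normalization step stays the same.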
47:47
Okay? So, quite easy. If you know which function to use, you can simply do fusion. Questions? Ah, and we are, in this case, just to show,
48:02
of course, the fusion method has an effect. Here I use the minimum, and here the sum. What else? Normalized sum, maybe. Ah, round-robin, max, and normalized sum. Now, we have two initial searches,
48:20
and here we have five different methods for fusion, and we see sometimes A wins, sometimes it doesn't. Now we can think which one is the best result, but we won't do it here.
48:40
You should be aware, you should know, that different fusion functions can of course end up in different rankings. Okay? We don't have to go through the list in detail. Yes, please? No, I didn't really get the round-robin. Ah, yeah. Round-robin says: take one at a time.
49:01
So, start somewhere, in this case, we started with this one. Okay, the first one is A. The second one will be from this list, B. The third will be, again, from this list, our C, B, we already got, so forget about it, take the next one, C. C, oh, we already got it, so take B.
49:21
Take one from each basket at a time. Ah, F. Sure, good point. Round-robin should also have F, yes. Forgot it. Yeah, yeah, I mean, each.
49:40
In each fusion method, you only have five hits? Oh, right, okay, I forgot to say that. No, F would not be in the list. We have here, we have six, no, five? Five? Ah, all of them are five. F should be there, that's true.
50:02
Isn't it? Good point. Further questions? Anything that you cannot understand, or where you would say, aha, how does this work? Let me come to this. Yes, please? I'm sorry about that. The example for the sum of the RSVs, A, 1.4, we got?
50:22
Let's check. Oh, wrong, good point. A is actually five plus five, five, so 1.05. Forget this.
50:41
And B is also wrong, wrong. 1.35, so we should, we should, we should. Your question? Is at least C, C is also wrong. Let's forget this slide. Confused, too much confusion.
51:01
But basically, you know, you know how to calculate sums and multiply and minimum, maximum, right? Very straightforward. Further questions? Fusion can also be a linear combination
51:20
where we give a weight to each algorithm and say, this is a good algorithm, this is not such a good algorithm. And over time, these weights could change, and we could have different scoring methods, more advanced scoring methods than simply doing a sum, but for our purposes, that is okay.
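A weighted linear combination like that could look roughly as follows; the weights are invented and would in practice be tuned, for example from user votes over time:

```python
# Invented per-algorithm weights, e.g. adapted from user votes over time.
WEIGHTS = [0.7, 0.3]

def weighted_fusion(runs, weights):
    """Weighted linear combination of (already normalized) scores."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + w * score
    return sorted(fused, key=fused.get, reverse=True)

print(weighted_fusion([{"A": 1.0, "B": 0.9}, {"B": 1.0, "A": 0.3}], WEIGHTS))
```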
51:41
Interestingly, there are also systems that actually do fused search or meta-search, for example AfterVote. I think it's not online anymore, but it was quite a nice system: you could search, and it queried three search engines.
52:02
It can aggregate the result and even told you, this is the first result at Yahoo and the first result at Bing. And this is the first result at Google and my ranking is like this. And this is the third result of Google and so forth.
52:21
And then you could say, oh, that's nice, but I like this better. This should be on top. Then it would say, uh-huh. Then you trust more the Google ranking and you would upvote this and say, ah, for me, Google has a higher relevance and over time, your Google ranking would make,
52:41
has a higher weight within the fusion process. AfterVote, because your vote will weight this thing. Unfortunately, I don't think it's online anymore. Ah, here, vote, ah, there we see it. This is more important for me.
53:01
Next time, this should be on top. This is something like you said, the click information, right? But from a different perspective. So, fusion can improve results
53:22
even if one of the participating systems is not such a good contributor, and fusion has a high potential when the single systems are very different. For example, I combine an n-gram approach with a word-based approach. I combine completely different ranking algorithms.
53:42
Then there is some potential in the fusion. Questions on fusion? Not the case. Okay, we had query expansion, we have fusion. Then we skip result diversity
54:03
and we come to the last topic, to question answering. Question answering advocates say: yes, information retrieval sounds quite interesting,
54:21
but basically what most people do is only document retrieval. You end up with a document list as a result and you don't end up with real information. Then you have to read the document and find information yourself. Why not present the, allow the user to ask a question, to specify a real natural language question as a query
54:44
and don't give him a document, but find the result for him, extract it and give him the fact as an answer. Frage-Antwort-Systeme in German; since the 1980s there has been research on this topic. So give him a small,
55:01
the basic idea is to do retrieval and then extract a small portion, maybe a sentence, maybe a word from the document that is the answer to the question. Not the query, the question, right? So here we find the document and then we extract the correct part that the user wants to know and present it to the user.
55:26
So the challenge that QA systems have is that they have to determine what kind of information is queried. For example, if the query goes like 'where is something', then I have to know, aha, 'where', that means
55:43
I have to find a place name. Probably the answer must be a place name. I have to use named entity recognition tools to find where there are names and which ones are place names, extract them and find the correct one.
56:01
And of course that's not so trivial. We have to have an answer type taxonomy. What happens if I query where? What happens if I query who? And we have to be able to not only do indexing and retrieval, but also to extract the correct sentence and extract the answer.
56:25
So in addition to the standard technology that we need for retrieval, we need to identify factual information, recognize the entities and extract the real semantic information. Okay, we can quickly jump over this.
56:43
For example, question taxonomies, we have different things. These are examples for questions: in which city was Emil Fischer born. And then we have: how many inhabitants does a place have, and we know, aha, the answer should be a number.
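A toy sketch of such an answer type taxonomy and the extraction step; the question cues and the regular expressions are simplified stand-ins for real question classification and named entity recognition:

```python
import re

# Simplified answer-type taxonomy: question cue -> expected answer type.
ANSWER_TYPE = {"where": "PLACE", "who": "PERSON", "how many": "NUMBER", "when": "DATE"}

# Crude extraction patterns per answer type (a real system would use NER).
PATTERNS = {
    "NUMBER": r"\b\d[\d.,]*\b",
    "DATE": r"\b1[89]\d\d\b|\b20\d\d\b",
    "PLACE": r"\bin ([A-Z][a-z]+)\b",
    "PERSON": r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b",
}

def classify(question):
    q = question.lower()
    for cue, answer_type in ANSWER_TYPE.items():
        if q.startswith(cue) or cue in q:
            return answer_type
    return "UNKNOWN"

def extract_answer(question, passage):
    """Retrieve-then-extract step: pull the expected answer type out of a passage."""
    answer_type = classify(question)
    if answer_type not in PATTERNS:
        return None
    match = re.search(PATTERNS[answer_type], passage)
    return match.group(0) if match else None

print(extract_answer("How many inhabitants does Hildesheim have?",
                     "Hildesheim has about 100000 inhabitants."))
print(extract_answer("When was Wittgenstein born?",
                     "Ludwig Wittgenstein was born in 1889 in Vienna."))
```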
57:02
All I need is a number, I don't need a document. And well, these things can be quite effective. Some systems reach between 10 and 60%, and interactive systems can be even higher. Okay, we have here, for example, a result of a system that is in this case cross-lingual,
57:23
it goes from English and looks for German answers. Here the question was who is the minister for the environment in Sachsen-Anhalt, and it's not that perfect, as we see. A whole sentence has been extracted,
57:41
not only the real named entity. Also, here we have systems on the web that you can use; look at Lexicom, for example, where you can type your query, what is information, question mark, the full sentence,
58:01
full natural sentence question, and then you get some explanation. Let's see if we have other systems. A question here, what is the size of the new iPod? Then I get some answers,
58:21
plus their supporting documents. Where is this answer contained? The first answer might be correct, 0.4 inches. The next answer, 1.10 inches, doesn't really help me a lot. And then there are even worse answers. Okay, but this technology is out there on the web.
58:44
Another system, the START system, that's still online. Again here, when was Wittgenstein born, I can query, and it gives me a result with the source Wikipedia, the source Gutenberg, and we see the answer here, and Wikipedia gives me a real date.
59:03
Whereas for this source, it can only extract a year. But at least in this case, they don't contradict themselves. Again, you get another result, same actually. And then if I have three sources
59:21
giving the same answer, it might be true, right? So try these things. Scour is also a system that you can use. There is another homework, in case you still need to do a homework to be able to take part in the exam. You can do it until tomorrow and submit it online.
59:47
So that basically concludes what we have to talk about today. Are there any further questions on the exam?