
Evaluation


Formal Metadata

Title
Evaluation
Title of Series
Part Number
9
Number of Parts
12
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
This lecture gives an overview of Information Retrieval. It explains why documents are ranked the way they are. The lecture covers the most relevant approaches to content representation: automatic indexing and manual indexing. For automatic indexing, the frequency of words is of special relevance, and its influence on the weighting of terms is discussed. The most relevant models are introduced. The session on evaluation discusses newer metrics such as the Normalized Discounted Cumulative Gain. The session on information behavior provides a brief overview and explains the relation to IR. The session on optimization mainly introduces term expansion and fusion methods. The session on Web retrieval is concerned with quality aspects and gives a basic insight into the PageRank algorithm.
Keywords
Transcript: English (auto-generated)
Evaluation. Why do we need evaluation? Very briefly: in the various sessions we have talked about very different ways to design retrieval systems. Different ways of language processing and stemming, different ways of weighting, different ways of modeling and implementing search,
different ways of implementing web search additions like PageRank and so on. So it's impossible to predict what effect a change in the algorithm will have. Some people might say, oh, my algorithm is better
because of that, and give arguments. But you never know; you have to check empirically whether it really changes the performance, and whether for better or worse. So evaluation is really a core topic in information retrieval. And obviously, the most holistic evaluation:
Is the user successful? Can he work better now? Can he perform his job, his overall task, better? Remember information behavior last week? That is quite difficult, basically almost impossible to measure.
What we can check is how good a retrieval system is at finding relevant documents. User-oriented evaluation is quite complex and difficult, so the usual evaluation paradigm in information retrieval says: okay, let's assume the user is somewhat constant.
We assume a typical user that is looking for, whatever, forest fires, or let's take the wolves in Niedersachsen. And we ignore that different people have different ideas about that: one person has already read 10 articles, and for him or her those would not be relevant.
Articles that might not be relevant for me could be relevant for somebody else. The Cranfield paradigm more or less says: well, that's kind of unfair to the machine. Let's assume there's an average user with average knowledge who has this information need. He wants to know about wolves in Niedersachsen,
and certain aspects of wolves in Niedersachsen that he is interested in. And let's replace this prototypical user by an expert who can judge whether a document is interesting given this information need. So if somebody is given an information need explained in, say, half a page,
then this person can look through a lot of documents and say: oh yes, for the typical average user this would be relevant; this one is not relevant. So we have an abstraction from the real user and replace him with this prototypical user, an expert,
who might know a lot about the topic. For the expert himself it may not be relevant; he doesn't have this information need right now. Of course this is a problem, but it is the standard way of evaluating retrieval systems. And this is called the Cranfield paradigm,
which basically says: all of this we have talked about, the linguistic processing, the weighting, all these matching functions, is treated as one.
The retrieval system as a whole; we don't look at one component only, we look at the thing as a whole. We compare the information need that a potential user might have to the results. And here the expert says: yes, relevant, relevant, relevant. That's the way it is done.
Of course, we could also evaluate other components. How well does the user interaction work, for example? That could be a separate evaluation for a retrieval system, also something necessary and very important, but it's a different way of evaluating things. Here we look at the results.
We can also include everything, but as we said, that is quite challenging and difficult. So the so-called Cranfield paradigm is characterized by these facets here:
to find objective evaluation standards for the comparison of systems. We have formulated it as a comparison of systems: there are two or three or four systems, and we want to say which one is better, which one is superior to the others. And of course this comparison should be fair, so the conditions should be identical.
That means all the systems are given the same documents: here are the same documents for every system, here are the queries, process them. Then we get different result sets, and these can be compared by our experts.
So we have an abstraction from the real user situation: we don't have a dynamic user situation, we have a very static, constant model of using the retrieval systems. And last week we learned that, of course, retrieval systems in reality are used in a very dynamic, iterative way.
That does not quite coincide with the typical evaluation, which is somewhat of a problem, but well, that's the way it is. So we have objective relevance as judged by a neutral juror.
Something that really doesn't exist; there is no objective relevance in reality. The juror judges the relation between an expressed information need, something somebody has written down, and a result document from the collection. And this is the basis
of all evaluation initiatives. We probably won't get to all the initiatives in the end, but there are some that are very important. The Cranfield paradigm is based on the Cranfield experiments, first carried out at Cranfield in the UK in the 1950s.
Before that, people discussed which information representation is the best; those were the times of, we have briefly mentioned it,
the Dewey Decimal Classification and things like that. We have mentioned pre- and post-coordination. And people discussed which is the better way to realize a retrieval system, library systems at that time, in the 30s and 40s: my system is better because..., and these were philosophical arguments.
And only in the 1950s somebody said: this is all interesting, but let's just test empirically which system is better. The first empirical study was done at Cranfield, using the method we have just described,
and that method has remained basically unchanged since then. At that time it was used with very small collections, which didn't really help very much. Then, in the early 1990s, there was for the first time a large-scale experiment,
with the documents of one newspaper for one whole year. All the articles, and the evaluation was carried out in this way, by a neutral organization, NIST (N-I-S-T)
in the US. They invited research groups and said: we now give you all the evaluation resources you need; you can focus on your system development, and we will compare all your systems in a standardized, fair comparison.
And this really has changed the way information retrieval research has worked since the 1990s: the TREC campaigns, the Text REtrieval Conference, initiated by Donna Harman. Later, in 1999, European researchers said:
yes, that's all very interesting, it really has changed everything, but it's only for English, and there are some other languages in the world as well. We need a European counterpart that does the same thing for European languages. That was the Cross-Language Evaluation Forum, CLEF, started in 1999 and also still running,
which provided evaluation resources for many European languages. At about the same time, Japanese research institutes said: okay, we also need to do this for East Asian languages, which have quite different requirements.
No stemming is required, but there are many other issues; word boundary detection, for example, is not trivial in Chinese, Korean, and Japanese. So another initiative, NTCIR, was developed for Asian languages. And since about the last 10 years, I think there is also an initiative
for Indian languages. India, as you might know, has more than 300 languages, and there is an initiative that, I think, has some 10 of the biggest Indian languages integrated, again, in an evaluation benchmark. Okay. So, the basic notion in all we talk about now:
there is this term relevance. What does it really mean? Remember recall and precision; we'll show them in a second. You might have some vague idea: there was an intersection of the relevant documents with the found ones, and there is a formula which, of course, you should probably know for the exam,
something you've also studied in the introduction. But what does relevance really mean? For the found documents that is very clear, but otherwise? There are different definitions. One is pertinence: relevance defined as pertinence is really the usefulness as observed by the user himself.
That might really help me in my future work; I find this document interesting; I might be able to solve my problems with it. An even more challenging notion would be situational relevance: well, maybe the person isn't really good at judging whether a document is useful for her or not, so we would have to check how her work would change,
how she would perform having this document, and how she would perform without this document. Would she be able to perform her task better, the work task, not the search task? That is basically impossible to measure. We cannot tell a person: once you do your job without knowing this document,
and once you do your job knowing this document; it is, of course, impossible to do such an experiment. So what we can really work with, what we can decide in an experiment, is
the objective relevance as defined by the Cranfield paradigm, by the juror who says: yes, for the average user, this document would be relevant for this information need. And then there is what we shouldn't really call relevance at all: system relevance, basically the metric that the system comes up with,
the retrieval status value or similarity; the system thinks this is relevant and puts it on top, basically the calculation result. Okay. So typically, when relevance is mentioned, people talk about the objective relevance as it is in the Cranfield paradigm.
Now this has a lot of drawbacks, of course. For example, just to mention one: relevance is always, or mostly, basically 99% of the time, defined as binary, a Boolean relevance model:
a document is relevant or not relevant. And in real life, if you think about it, you would probably come up with finer-grained definitions: yes, this is somewhat relevant, or this is really very relevant,
this is my best option, it is highly relevant. This one is somewhat relevant; and that one, well, it's not completely relevant but has something to do with the topic; and this one is completely off. So we would have different levels of relevance. If you are given 100 documents and have to judge them,
you might want to say: why should I only say yes or no? This one is a bit more interesting, and things like that. And of course this is theoretically and practically obvious, everybody knows about it, but it is simply not taken into account.
The Cranfield paradigm still uses binary judgments. Why? Well, one reason is mentioned here: they simply lead to similar results. If we compare four systems, we can do it with binary judgments. Some people have also tried to do it with, say, four levels of relevance:
irrelevant, a bit relevant, and very relevant, for example. NTCIR, the Japanese initiative, has tried that, and in the end they came up with the same results. It didn't change the ranking of the systems. The best system was the best system
for both types of relevance judgments. So is it really an advantage? And of course, if you tell people to judge 1000 documents into two classes, relevant and non-relevant, versus judging 1000 documents into four classes,
the second task will take much more time. So more levels of judgment are an investment: you will think more about each document, is it really relevant? Whereas spotting a non-relevant document typically takes seconds in the judgment process and is quite cost-efficient. So the two-level binary judgment is still the standard,
despite the fact that it's unrealistic and so forth. Now, what is a good metric? You have heard about recall and precision, but before we come to them: these are metrics to compare systems.
My system has a better recall than system A, or Bing has a better recall or precision than Google: that would be a conclusion we reach after doing an evaluation. But before we talk about the metrics, and most of this class today we will talk about different metrics, different ways to calculate the quality
of retrieval systems, let's talk about quality requirements for these metrics. If there is a number we come up with, something like recall, we should ask: is recall a good metric,
a good way to judge the system? What are the requirements for such metrics? And here they are. For example, correlation with user satisfaction. Of course that must be number one. If the user says: oh, I looked at your systems, this one is better; or 20 users look at this system and 20 at that one,
and most say system A is better; then your metric, the way you calculate the quality, should of course also show that system A is better. If users say system B is better and your metric says system A is better, then you are doing something wrong in your calculation. It's not a good metric.
Next, easily interpretable. As you know, recall and precision are very simple metrics; they are very easy to calculate and very easy to understand. Aha, a recall of 0.5: that means the system finds half of the relevant documents
and doesn't find the other half; I have no chance of finding 50% of the relevant documents with my system. That is what recall means. 0.5, 0.1, 0.9: I know exactly what each means. Easily interpretable. Unfortunately we will have to study a few
metrics that don't really fulfill this requirement but are good for other reasons, for example a better correlation with user satisfaction, or more robustness. Robust, for a metric, means it works under different conditions: for example, for English and for German; for situations where we have many relevant documents
and situations where we have very few relevant documents. These lead to different situations, different numbers, and a robust metric will always say: the best system is the best, and nothing else.
So, robustness for metrics. We'll also talk about robustness for systems today; that means something a bit different. Hopefully we'll get there. Capability to discriminate between systems, also very important, of course: if one system is better, and I can explain why
it is much better, then the metric should also give it a higher value. We'll see an example of simple metrics that are often used because they are so easily interpretable, but if you think about them, you'll see they don't discriminate: a good system and a bad system get the same number, the same label.
And, on the other hand, absolute values, whether it's 0.3 or 0.9, don't matter. What's important is only the comparison between systems: the better system should get a better value than the worse system. Other things we have to keep in mind: remember, we talked about this in the web search session
and in the information behavior session: the very different information needs of users, right? For example, we have navigational queries, informational queries, factual queries, ad hoc search. And why do we talk about this here? Because, of course, if we have different
information needs, we also end up needing different metrics. For example, for a navigational query, just find the homepage: where is the homepage? There the information need is satisfied
as soon as the system answers the query with the homepage of the company; then I'm done. On the other hand: I want to study a field, I want to learn everything about evaluation metrics, that's ad hoc search; tell me everything about topic X, or everything about the wolves in Niedersachsen,
that's a different kind of story. And what is the core difference between these two search or information needs? What is different from the user's perspective, between looking for the homepage of company X
and looking for everything about the wolves in Niedersachsen? For the person who does the navigational search, it's more important that the hit gets a good position, and for the informational search he needs a better recall. That's a good example.
We call these recall-oriented and precision-oriented searches. For the navigational query, I want a good precision: the result should be right at the top. And for the wolves in Niedersachsen, I want to learn a lot about the topic. And what else would be different?
Even more basic: think of asking this query, and then, what is your subsequent behavior when you look at the result set? Typically Google, or Bing, gives you eight or ten results back. What are you doing in the first case,
and what are you doing in the second case? In the second case, you have to keep searching for information, maybe on the second or third result page. Exactly, and why? Because in the first case, as you said, I only expect one hit.
I want that one hit; hopefully it's on the first or second result page, which is an optimistic perspective, but we know that search engines are typically good at answering navigational queries, so it's often the case. But I'm interested in one hit. I don't care about recall, as you said, right?
Basically recall is 100% if I get that one hit. So I'm interested in one result document. For the wolves in Niedersachsen, I would be interested in learning more about them, so I will read several documents, behave differently on the results page, and since I'm interested in more things, recall is important.
So the number of documents that I might want determines how we should determine quality. In what follows we will discuss different metrics, different ways to calculate quality, some of which are adequate for navigational search and others for informational search.
Recall is the standard measure. It measures the ability of the system to find relevant documents. I think the definitions have already come up earlier, and here are the definitions of recall and precision as you know them: recall is the fraction of found and relevant documents
in relation to the number of relevant documents that exist in the collection. And precision: I check a result list of 20 documents, and how many of them are relevant? If I find five, then five out of 20, five divided by 20, is my precision for this specific situation.
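This set-based calculation can be sketched in a few lines of Python; the document IDs below are made up, and only the counts (20 retrieved, 5 relevant hits, 10 relevant documents overall) match the example:

```python
# Set-based recall and precision (hypothetical document IDs; the counts
# match the example: 20 retrieved, 5 relevant hits, 10 relevant in total).
relevant = {f"rel{i}" for i in range(10)}       # relevant docs in the collection
retrieved = {f"rel{i}" for i in range(5)} | {f"other{i}" for i in range(15)}  # 20 found docs

hits = retrieved & relevant                     # found AND relevant
precision = len(hits) / len(retrieved)          # 5 / 20 = 0.25
recall = len(hits) / len(relevant)              # 5 / 10 = 0.5
print(precision, recall)                        # 0.25 0.5
```

The intersection of the found set with the relevant set is exactly the "found and relevant" fraction named in the definitions above.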
Now again, five out of 20: that looks like a set-based approach, right? In this case, when we talk about plain recall and precision, we are talking about Boolean retrieval.
I have a set of documents that I look at, and I'm not worried about the position of a document in a ranking.
That means: are recall and precision good metrics to evaluate a system that uses ranking? Obviously not, right? Because they ignore the position of the document.
They are set-oriented. Recall and precision come from the Boolean world, and they have survived; but nowadays, of course, we mostly use ranking systems, so we need other metrics that bring recall and precision into the ranking world and help us work with them.
Another problem with recall, of course, is that I cannot really know how many relevant documents there are. Nobody can look through one million documents, right, just to get a recall estimate that only researchers are interested in. So recall is not really evident to the user.
He doesn't know: did I find 30% or 10% or 90% of the relevant documents? Basically, you will never know, and even in research situations we don't know; we have to find ways to deal with this uncertainty.
And of course this is difficult for high-recall-oriented situations like patent retrieval, where I have to know: is this really a new invention? Should I invest a lot of money in developing some new technology? If somebody has already patented it,
it is economically not interesting, and I might lose a lot of money if I miss only one document. So how do we estimate recall? The standard way is so-called pooling, which goes hand in hand with the Cranfield paradigm.
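A rough sketch of such a pooling step, assuming toy result lists (the system names and document IDs are invented): take the top documents from each system's run, merge and deduplicate them, and shuffle so the juror can see neither which system returned a document nor at which rank:

```python
import random

def build_pool(runs, depth):
    """runs: dict mapping a system name to its ranked list of document IDs."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])        # top-`depth` documents of each run
    pool = sorted(pool)                    # merged and deduplicated
    random.shuffle(pool)                   # hide origin system and rank from the juror
    return pool

# Toy runs: both systems retrieve d2, so it appears in the pool only once.
runs = {"systemA": ["d1", "d2", "d3"], "systemB": ["d2", "d4", "d1"]}
print(build_pool(runs, depth=2))           # some ordering of ['d1', 'd2', 'd4']
```

Only pooled documents are judged; everything outside the pool is treated as not relevant, which is why recall over a pool is an estimate rather than an exact value.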
As I said, several systems are compared in the Cranfield paradigm. We have a comparison; we only want to know which one is better; absolute values are not important. Several systems are given the same documents, index them, and bring back result lists. We then take all these results and mix them,
so nobody knows which system found which document. We give this pool, and we call this pooling, bringing everything together, as one whole set to a juror and say: please judge all these documents. The juror doesn't know which document was found by which system or at which position it was ranked,
so he just goes through them independently. That is agreed to be the most fair way to judge things. So here we see again the different sets
that we have: relevant, non-relevant, found, not found. And if we have a ranked list with different values, we can say relevant, not relevant, not relevant, relevant; and now we have to find a way to bring
this ranking information into recall and precision. What is the issue? One user looks at one document, another looks at two documents, or at three; somebody looks at all 10 documents.
I don't know how many documents a user will look at. You might even look at them in a different order; mostly people don't do that, as we've learned. And we cannot know about all these situations in advance. So, as you see, the further down we go,
the larger set we get, if I look at two documents, I can have a set of two documents, I can calculate a recall, 0.5 now, one of them, no, precision 0.5, I can look at the third and the fourth document, again I will have a different precision
or a same precision, and the interesting thing is is that I can, in each step, looking at another document, I have a different set, and can again calculate recall and precision. So instead of having one static set in the ranking system, depending on where the user stops,
for each step, I can have a recall and precision calculation, an recall and precision pair. Let's see if we did that here, yeah? For example, at precision at document, after one document, precision after two documents, and we assume that we have a number of relevant documents that is 10,
we need this assumption for the recall calculation. Might be unrealistic, but it doesn't give an advantage to any of the systems, so for a comparative evaluation, that's just okay. So at each step, we can calculate a new precision.
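This per-step bookkeeping can be sketched as follows. A minimal Python sketch: the ranking and the assumed total of 10 relevant documents are hypothetical, matching the example discussed here.

```python
def precision_recall_by_rank(judgments, total_relevant):
    """For each rank k, compute (recall, precision) over the top-k documents.
    judgments: booleans in ranked order, True = relevant."""
    pairs, found = [], 0
    for k, rel in enumerate(judgments, start=1):
        found += rel                     # count relevant documents seen so far
        pairs.append((found / total_relevant, found / k))
    return pairs

# Hypothetical run: 4 relevant hits among the first 10 results,
# 10 relevant documents assumed to exist in the whole collection.
ranking = [True, False, False, True, False, True, False, False, True, False]
pairs = precision_recall_by_rank(ranking, total_relevant=10)
# After rank 1: recall 0.1, precision 1.0; after rank 10: recall 0.4, precision 0.4
```

Each pair describes the situation of a user who stops reading at that rank.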
We can tell the user: if you look at only one document, you will have a recall of 0.1 for this system, and so forth. So we end up with a situation like that: I go down my list and have a different recall at each position, and a different precision
at each position, okay? Do we have to go into the details of the calculation? The formulas are quite simple, right? Here I have looked at 10 documents; there are 10 relevant ones in the collection, and I have found four.
That means 40% of the relevant documents were found, a recall of 0.4. And I have looked at 10 documents: four of them were good, six of them were bad, so a precision of 0.4. And now, what can we observe here in the development of the two metrics?
What is happening with recall as we look at more documents?
Of course, it's increasing. The more documents I look at, the more relevant ones I will find, right? Even if I have to look at 10 or 100 non-relevant documents, at some point there will be another relevant one, so it can only increase.
Recall can never decrease. On the other hand, if I look at many documents, there will be more and more irrelevant ones among them. Because hopefully, if the system is worth anything, it will put more relevant ones at the top,
and fewer at the very low positions. And we can see that too: except for some jumps here, the precision is generally decreasing, okay?
You have a question? No, you did say recall is decreasing. Ah, precision is decreasing, excuse me. Precision is decreasing, decreasing, decreasing. Now we can tell a user: if we use system X here,
if you look at eight documents, you will have a recall and precision of 0.3 and 0.3. And what happens if I look at six, and so on? Well, obviously that's not a good way to define the quality of a system, telling the user the quality at each step: too much information.
We want one number for the quality of the system, not one for each position, not 1,000 or 2,000 numbers to describe the quality of the system. That doesn't work. So we need to somehow bring these numbers together.
What do we typically do if we have too many numbers and we need to aggregate them? Same thing as with the performance of a class. Somebody might ask me, and I could say:
okay, student A has a 1.0, student B has a 2.0, and so on. I can give him a lot of information, or I can simply tell him some kind of average, right? Ah, the performance of this year's IR class is a bit better than last year's. We hope, of course, that this will be true. And then I have given some information to this other person.
And here too we have to find some way to aggregate and bring these values together. How is it done? We can illustrate this with a graph. If we plot the recall and precision pairs from the previous slide,
we get something like this. We see: aha, the recall is increasing. Moving right means looking at more documents: we start with one document and then we move on. So recall is always increasing. Precision is somewhat decreasing.
Sometimes it jumps up, because we find one more relevant document; then it jumps up. But then we look at further irrelevant ones and it goes down. And overall, it will decrease. Now these jumps upwards are not so nice.
Ah, here we have it, right? It's correct, isn't it? Yes. And here's another example. We assume we have nine relevant documents; we have a different query now. For one query we have certain recall and precision values; for another query, the system might be better or worse.
That's why we get other recall and precision pairs for it. So for each query, at each document position, we have a value pair. For the next query, we again have different value pairs. Again, much too much information; we have to bring this together. Now we want to average that somehow.
First we average within one query, and then we average over queries. As a way to do that, we cannot simply take the average over a thousand documents or so; one system finds 10, another might find 100 documents. We cannot simply take the normal average, as for a class.
For retrieval systems, at some point it was decided, and there is no strong theoretical basis for this, to take the average of eleven values, or nine values, measured at each interval of 0.1 recall.
So we take a recall of 0.1 and measure the precision: at recall 0.1, what precision did we get?
Here we have a precision of 0.55, there a precision of 0.33 or so. What precision did we reach at recall 0.2? And so forth. And then we take the average of all of these, nine observations.
That means we average over different user behaviors. One user might just look at the first hits; other users might look at many, many more documents. By doing that, we take an average over different user situations, different user types.
And to avoid these jumps, we have the interpolation introduced by Gerard Salton, which says: well, we cannot really accept these jumps up. We take the higher value and assume the curve goes like this, and when we measure at 0.1,
we measure this value and not that one. Then we take the average of these measured values, and that is the quality of the system. An example, a real result from an evaluation initiative,
in this case. Okay, we see the measuring points, all of them; we have what we call a recall-precision curve. Here we can plot the curve; the curve is an average for the system of, in this case, Johns Hopkins University:
an average recall-precision curve like that, for 50 queries in this case. So 50 queries are given to the systems with the same collection, jurors judged, all this stuff, and now we see this curve. And then, to get one value,
we take the average of nine measuring points. And we see that system A, UC Berkeley, is better than system B, Johns Hopkins University, and that's all we know here.
But the curve, of course, contains more information than this single number. For example, we see that the curve for UC Berkeley is very high here, but over there it's lower
than the other systems. What does that mean? How do we interpret this? You might say that for a longer search,
the UC Berkeley system is a very good option, but for users with a rather short attention span, who check just 10 documents,
we might choose the JHU system. Exactly, exactly. A longer search, meaning people who look at more documents, who judge more documents,
maybe 50 or 100: they will do better with the Berkeley system. If you're looking at a high-recall search, Berkeley is better off, and by much:
a huge difference. But if you have a web search scenario, or even a navigational search, where people look at very few documents and are not interested in recall, the precision at low recall is better with the other systems, right?
Berkeley is better on average because it has a big advantage at high recall. So it's better for patent search. Okay, very good. And this number,
what is the name of the metric? The metric is called the average precision. If I measure the nine measuring points,
each at recall 0.1, 0.2, and so on, and take the average, we call this the average precision. And if we do this for several queries, we end up with the mean average precision. We'll talk about this later. The E-metric we don't have time to talk about.
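The nine-point interpolated average precision just described can be sketched like this. The (recall, precision) pairs below are hypothetical, and the interpolation follows the Salton-style rule: take the highest precision at any recall at or above the measuring point.

```python
def interpolated_precision(pairs, level):
    """Salton-style interpolation: highest precision at any recall >= level."""
    return max((p for r, p in pairs if r >= level), default=0.0)

def average_precision_nine_point(pairs):
    """Mean of interpolated precision at recall 0.1, 0.2, ..., 0.9."""
    levels = [i / 10 for i in range(1, 10)]
    return sum(interpolated_precision(pairs, l) for l in levels) / len(levels)

# Hypothetical (recall, precision) pairs from walking down one result list
pairs = [(0.1, 1.0), (0.1, 0.5), (0.2, 0.5), (0.2, 0.4),
         (0.3, 0.5), (0.3, 0.43), (0.4, 0.44), (0.4, 0.4)]
ap = average_precision_nine_point(pairs)
# Interpolated precision: 1.0 at 0.1, 0.5 at 0.2 and 0.3, 0.44 at 0.4, 0 above
```

Averaging this value again over all queries of a test collection gives the mean average precision (MAP).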
Let's briefly talk about mean reciprocal rank. This is a completely different metric. Let's say I decide which system is better based on the following decision. MRR is the reciprocal of the rank of the first relevant document. So if the first relevant document is at position three,
MRR is one-third. If it's at position one, my MRR is one, that's the best I can give you. If it's at position 10, it's only 0.1. Now, when could we use this metric
and when could we use the mean average precision? Basically,
MRR would be good for searches with only one relevant document, because it doesn't matter where the second and the third relevant documents are. So for navigational search, this would be an acceptable metric. It fulfills most of the requirements: very easily interpretable,
corresponds to user satisfaction, and a better system will get a better value. But for an informational query, it's completely unreliable, not good.
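The reciprocal rank and its mean over queries can be sketched as follows; the rankings in the example are hypothetical.

```python
def reciprocal_rank(judgments):
    """1 / rank of the first relevant document; 0 if none is found."""
    for k, rel in enumerate(judgments, start=1):
        if rel:
            return 1 / k
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over the result lists of several queries."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs)

reciprocal_rank([False, False, True])           # first hit at rank 3 -> 1/3
mean_reciprocal_rank([[True], [False, True]])   # (1 + 0.5) / 2 = 0.75
```

Note that everything after the first relevant document is ignored, which is exactly why the metric suits navigational queries and not informational ones.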
Now, of course, what we have just learned, the mean average precision, took a while to explain. So it's not so easily interpretable: we have to know a lot about IR, how it is measured, how the metric works. But a better system will really get a better metric,
a better number. Something that is also used very often in publications is the metric precision at N. People say: ah, users often look at 10 results,
or the first result page of a web search shows 10 results. Then I can just check which of these 10 documents are relevant and which are not, and calculate the precision after 10. Very easily interpretable, a reasonable format.
But it has some disadvantages. Is it stable? We'll see about that; we'll talk later about what robustness means. It relies on very little information about the performance: it only looks at 10 documents and ignores the other 1,000. Not good.
And the question we want to answer now is: does it really distinguish well between the better and the worse system? Many publications use precision at 10.
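Precision at N is trivial to compute. The two rankings below are hypothetical; they preview the problem discussed next: both score 0.2 at N = 10, even though one of them puts its relevant documents on top.

```python
def precision_at(judgments, n):
    """Fraction of relevant documents among the top n results.
    judgments: 1 = relevant, 0 = not relevant, in ranked order."""
    return sum(judgments[:n]) / n

system_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # relevant docs at ranks 1-2
system_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # relevant docs at ranks 9-10
precision_at(system_a, 10)   # 0.2
precision_at(system_b, 10)   # 0.2 -- identical, positions are ignored
```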
Now let's take two different systems that we can compare, system A and system B.
Which system is better?
Probably system A, but the metric doesn't really care about the positions, and that's the problem: the precision at N would be equal for both.
But obviously, everybody would say: system A is much better, I will use system A, I'm much more satisfied, obviously it's better. But the metric precision at N gives me 0.2 for both. So it doesn't really distinguish well
between the better and the worse system, and it doesn't correlate well with user satisfaction, so it's a bad metric. It shouldn't be used. Still, people like to use it because it's so easily interpretable and easy to calculate. But, not nice. Okay, precision at R, R-precision,
is an extension of this metric, but for time reasons we skip that. We move to a metric that is now standard at the TREC conference, the big evaluation initiative in the US.
And that is binary preference, or bpref. The motivation behind introducing a new metric was that in recall and precision, we have judged documents and unjudged documents. The system found something,
and among what is counted as non-relevant there can be two different cases. In one case, a juror has read the document and said: irrelevant. In the other case, maybe nobody has read the document,
because there are only so many human resources available; people really read and judge the documents, and it's too expensive to read all of them. So nobody looked at it, and basically it gets a zero. We have plus and minus, where people really looked at the documents, and we have documents that nobody judged,
which get a zero. But in recall and precision, an unjudged document is assumed to be non-relevant. We just say: nobody has looked at it, we don't know anything about this document, let's assume it's not relevant. And initially, that was not such a big problem.
But it became a big problem as collections grew from 100,000 documents to the millions. We have TREC web evaluation collections now. And it's obvious that the human resources, the effort that goes into the judging, the money that is invested in jurors,
cannot grow at the same rate. So smaller and smaller portions of the collection are judged, and that means these unjudged zeros make up the biggest chunk of the collection.
And binary preference says: let's accept that we really don't know anything about these documents, and let's exclude them. We only compare the systems based on documents that are judged as positive or negative, that we know something about.
And here is the formula, which we don't have to memorize; we can just note that we have plus, minus, and unjudged, and we assume the unjudged documents aren't there. We don't know anything about them, so we ignore them. And we end up with something like that.
And now, using binary preference, we ask: are these documents in the right order? A perfect search engine
would put plus, plus, plus above minus, minus. That would be the best order of these documents; that's all we know. We don't know whether one relevant document is better than another, so the order among them doesn't matter. But we know that a non-relevant document
should be ranked below all three relevant documents. This one is ranked lower than this relevant one, and lower than that relevant one, but it's ranked higher than the third one. So there, distinguishing between these two documents,
the search engine made an error, based on what we actually know. That's a minus. Whereas this plus above that minus, for example, is correct. So we compare the pairs of documents.
How many comparisons are there? Six, because we don't have to compare
pairs where both documents are judged negative or both positive: we don't know which of those is really better. So we end up with six pairs from the two non-relevant and three relevant documents, and we check each of them. For one and two: is the order correct? Yes.
For four and five: ah, no, they are in the wrong order; the system made an error. And when we do this, we see: aha, 50% of the orderings are okay, 50% are wrong.
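The pairwise check just walked through can be sketched as follows. This follows the pairwise reading given here; the official TREC bpref formula is stated differently, and the example run is hypothetical.

```python
def bpref_pairwise(run):
    """Fraction of (relevant, judged non-relevant) pairs that the system
    orders correctly, i.e. relevant above non-relevant.
    run entries: '+' = judged relevant, '-' = judged non-relevant,
    None = unjudged (ignored entirely)."""
    judged = [j for j in run if j is not None]
    rel = [i for i, j in enumerate(judged) if j == '+']
    non = [i for i, j in enumerate(judged) if j == '-']
    if not rel or not non:
        return 0.0
    correct = sum(1 for r in rel for n in non if r < n)
    return correct / (len(rel) * len(non))

# Hypothetical run: three relevant, two judged non-relevant, one unjudged
bpref_pairwise(['+', '-', None, '+', '-', '+'])  # 3 of 6 pairs correct -> 0.5
```

Unjudged documents simply drop out of the comparison, which is the whole point of the metric.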
So we give a binary preference of 0.5. Quite easy to interpret, and that is expressed in this formula, but we don't have to go into it in detail. Questions? Binary preference: easily interpretable,
and a very nice metric with some other good qualities. Okay, here is another way to calculate it. Some observations: bpref is very stable.
What does that mean? Let's briefly look at this: in this case we have 50 topics, 50 queries.
And here is the quality metric: MAP, mean average precision, the one with the Salton interpolation, and bpref. So we scored each topic with both metrics.
Now we can say: aha, we order them according to MAP, from the easiest topic, where the systems have the best quality, to the most difficult topics, where the systems have very low quality. And then we check what happens to bpref if we order them by MAP,
and we see that bpref gives very similar results, also for another collection; only for a few topics is there a big difference in the ranking of the topics. So a difficult topic for MAP is also a difficult topic for bpref.
That is what we call stability, robustness: the results are really comparable. And as you know, MAP gives more information than bpref. This is another way to calculate it; for time reasons
we cannot really talk about this now. But the basic result is that it's very stable. Rank correlation coefficients are another possibility: thinking about comparison with the best ranking.
We have seen that bpref is a way to compare rankings: how many of the ranking decisions are correct? We could also say: we have a perfect ranking, and how different is the ranking of my search engine X? Does it reach the best ranking, or does it produce a different one?
Here we have the so-called rank correlation coefficients, which measure the correlation between two rankings: how different are two rankings? If I order objects in one order and then reverse that order, the rank correlation coefficient would be minus one.
If I have very similar rankings, where maybe the positions of two objects are swapped, the rank correlation coefficient has a high value close to one. Rank correlation coefficients you know at least from the methods class, so this is something familiar. For example, you can ask how much the standings of the soccer teams
that we read in the newspaper, in the Bundesliga maybe, change from week to week. Probably not so much; not many teams change position. Maybe in the early weeks of the championship the changes are more frequent, so the correlation is lower, and at the end of the championship
the correlations will be higher. So, there are different correlation coefficients. One is Spearman's; that's used a lot. Here is the way to calculate it.
n is the number of objects we have; for the Bundesliga we would use 18, or maybe 10 for our ranking system. And it's calculated as one minus six times the sum of the squared differences in position, divided by n times (n squared minus one): rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
So we look at the position changes. If an object moves by one position, the difference is one squared, which is again one, right? If an object moves up or down two positions, we have two squared, four,
and in the denominator we have n times (n squared minus one). Oops, the rest is in German; I hope you understand it more or less. This is a very standard way to calculate the correlation between two rankings. Another way is Kendall's tau,
a very similar way to calculate the correlation. That will be part of the homework as an experiment. We don't have to memorize the formula; we can just look it up in the slides. For example, we have five documents.
So this is a ranking; maybe this is the optimal ranking that the juror says would be best. And we have two systems, A and B, that come up with these rankings. Now we can check which system is better; of course, the better one should have the higher correlation with the best ranking, which you can check later.
But which system actually is better? ABCDE is the best ranking; here we have one system's solution, here another. Perfect.
System one, maybe? Which is better? Quite straightforward, almost too easy. It's system one. This one is better, right? Exactly, because there are just a few swaps;
it doesn't really matter much. Whereas here, the best document is down at position three, a bad one from position four comes up to two, lots of changes. A user would be more satisfied with solution one, the one in the middle.
And how do we go on? I have to stop a bit earlier today, so what can we still talk about? Not the robustness, maybe, but the cumulative gain, which is also important. Cumulative gain is another way to really include the ranking of systems.
We have seen that mean average precision is a way to get more information than plain recall and precision, to bring them into the ranking world: averaging over different rankings, averaging over different recall levels. Cumulative gain is a family of metrics
that works a bit differently. Here we assume that the further we read, the more we learn from the system; we invest our reading time and effort, and in return we
gain some benefit. This can later be compared to an optimal ranking. And a nice property of cumulative gain is that we can use graded relevance. In this case we have a four-grade relevance scale:
three is highly relevant, two relevant, one somewhat relevant, zero irrelevant. But you can do the same thing for binary judgments, one and zero, okay? Here we have the ranking of a system, and it says: aha, I am lucky, the first document is highly relevant,
then comes a relevant or highly relevant one, but ah, not so nice, two irrelevant ones, and so on. And now the first idea is to calculate a cumulative gain: I just sum up my relevance judgments
as far as I read. If I go to position three, I sum up three, plus two for another good one, plus three for another good one: five, then eight. If I then invest in two more documents, the user who reads five documents
is not so lucky; the sum stays at eight. Then it goes up again. So it can go up or stay equal, but it can never decrease. The more time and effort I invest, the more gain I get.
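The running sum just described can be sketched as follows; the graded judgments in the example are hypothetical, chosen to match the walkthrough.

```python
def cumulative_gain(gains):
    """Running sum of graded relevance values down the ranking:
    the benefit accumulated after reading each position."""
    out, total = [], 0
    for g in gains:
        total += g
        out.append(total)
    return out

# Hypothetical graded judgments (3 = highly relevant ... 0 = irrelevant):
# the gain climbs to 8, stays flat over the two irrelevant documents,
# then rises again -- it can never decrease.
cumulative_gain([3, 2, 3, 0, 0, 1])   # [3, 5, 8, 8, 8, 9]
```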
And this connects to how people actually look at search results; this is just a short excursion into web search behavior again. These are the result positions, and we know that people look much more at the first ones; this is data from eye tracking.
Looking at the arrival time: they might still look almost one second at the fifth position, but they arrive there quite late, only after four seconds. So not much attention is given to the lower positions,
and we want to know: investing so much time, how much gain, how much benefit do I accumulate? Cumulative gain. And to bring the effort into this metric,
people introduced the discounted cumulative gain. They say: okay, it's fair to sum up, but if I find a document of value one at position one or two, that's good; if I find it at position three, it's not as good; if I find it at position ten, it's even worse.
So we divide the real value of the document, the relevance judgment three, two, one, or zero, by the logarithm of the position. Remember, the logarithm is used to downscale, to downgrade
positions further down, higher position numbers basically. We give no discount at position one, but from position three on we do not fully count the document; we discount it by the logarithm of the position. And of course the base of the logarithm determines the strength of this discount; often two is used.
We could use this for user modeling, but we don't have time to go into that, and later we can normalize this. So let's look at this: here we have another result vector, and we simply cumulate the gain
over this vector: three plus three plus three plus two, and so on; that would be the cumulative gain. Now for the discounted cumulative gain: careful, the third document is at position three, so it's not worth so much anymore.
It's only worth three divided by the logarithm of position three, which is 1.89. The first two are not discounted, three plus three, but the third is only worth 1.89, so we go from six to 7.89.
If that document were further down, it would be worth even less. And then, as we see, we have a real system; we calculate the discounted cumulative gain
of the real system, and we have the ideal system. The ideal ranking puts all the threes first, then all the twos, then all the ones, then all the zeros, depending on what scale we have. And we can then divide the DCG of the real system
by the DCG of the ideal system, and we end up with the normalized discounted cumulative gain: a metric that describes the quality very well, because it brings the ranking positions of the documents into the
evaluation. There would be a lot of other things, but unfortunately, due to time restrictions, we just jump to the homework.
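As a wrap-up, the DCG and nDCG computation walked through above can be sketched like this: log base 2, no discount before position two, and a hypothetical gain vector.

```python
import math

def dcg(gains, base=2):
    """Discounted cumulative gain: rel_1 undiscounted,
    rel_i / log_base(i) for positions i >= 2."""
    return gains[0] + sum(g / math.log(i, base)
                          for i, g in enumerate(gains[1:], start=2))

def ndcg(gains):
    """DCG of the run divided by the DCG of the ideal (sorted) ranking."""
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal)

dcg([3, 3, 3])    # 3 + 3 + 3/log2(3), about 7.89 as in the example
ndcg([3, 3, 3])   # already the ideal order -> 1.0
```

Dividing by the ideal DCG normalizes the metric to the range 0 to 1, so systems can be compared across queries with different numbers of relevant documents.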