Information Retrieval - Definitions
Formal Metadata
Part Number: 1
Number of Parts: 12
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/16257
Transcript: English(auto-generated)
00:00
So, for the lecture Information Retrieval, there will be a screencast available for you. But more about that later. What are we going to do today? Just a brief overview. What do we talk about? What is the content of this class? Then, as usual, we start with some definitions, with an introduction,
00:21
and get a first motivation for this class. And that should be it. Well, we have a homework; we won't go into representation too much today. That will be left for the next class. Okay, overview, what are we going to talk about?
00:42
Any questions? Questions up there? Hello? First today, definitions and basic terms.
01:05
Then we have both sides of the coin. You have heard a lot about information retrieval already in the introduction. It's about representation and search. And for both of these aspects, we have two to three sessions.
01:22
Representation, three sessions at least. And two sessions on search models. We round up this first half then with evaluation, which is of great importance in retrieval. Talk about user behavior, probably. And then we enter a second half of topics
01:43
with optimization technologies. Web retrieval, where we talk mostly about quality aspects of web retrieval. And hopefully we have some time for multimedia and maybe user interfaces.
02:01
Okay? Questions so far? Not the case. What is the goal of the lecture? Similar to the HCI class: you should know the basic terminology. You should know, if an IR problem is discussed or something is written about it:
02:22
What does this refer to? Is this a search model or is this representation? Be aware of typical problems that occur in IR. Be aware of methods for evaluating, knowing methods, knowing metrics, the hundreds of metrics around,
02:42
and you should know the most common ones. So if you read about something in IR, you should be able to contextualize this word.
03:00
The same in IR and data classes. This refers to topic X, or this has something to do with this. Okay? So this is the basic idea of this class, to get an introduction, to get a first idea about what is retrieval, what are the topics, what are the issues here.
03:21
And there are other classes to further elaborate this knowledge, to work on this, the seminar class. And the lab class is for really doing IR, working with IR programs. And using all this knowledge that we talk about.
03:44
Okay? Oh yeah, there is also a curriculum, a general curriculum for Germany that the Gesellschaft für Informatik developed, and they defined four target groups, which is quite interesting. They said there are so many classes on IR
04:01
being taught in Germany, but they sometimes don't have anything to do with each other, because they are obviously directed towards different target groups. One target group is people doing, really, only using systems, being experts in using systems.
04:20
Others, as it says here, are for the management of IR systems. And some others, especially in computer science, are taught to develop IR software. All right, and we are in between. We don't develop IR systems, but we're also not just users. At least, we are, of course, users,
04:42
but we don't teach you to be expert users for some specific domain, for example, medical information, or something like that. So, the approach of this lecture is basically to get a basic idea about all of this, but our focus is on the management and administration
05:02
of IR systems. So, you should know about these things, because you should be able to work with an installed system, or install and deploy a system within a company, and know what the issues are that occur there if a new system is introduced or being developed.
05:22
Okay? You won't be able to develop a system, and you won't be experts after this class in using an IR system. Of course, you should know more about using an IR system, though. This is what I wanted. Okay, there is a Learn Web course.
05:43
Here is the key. It's Salton. More questions? Ah, I see, okay. That's why everybody's talking.
06:01
So, what is it called? Learn Web. The key for the Learn Web course will be Salton. Gerard Salton was a famous IR researcher. He basically invented the
06:22
approximate, non-exact match models, the vector space model. So, it's quite important. Remember his name when you enter the Learn Web. Questions? There's quite a lot of discussion here on the right side.
06:41
Are there any questions? No, okay. Okay. What will you find in the Learn Web? That's here. Screencasts will be available there.
07:01
As you see, I will record the screencast. But, okay. First, it might not always work. The computer might not work, or I might do something wrong, and there is no guarantee that there will be a screencast every time. If it doesn't work, there might not be one for a week. Then, in addition: screencasts
07:22
might work quite well, maybe for HCI, but in IR we work with models, a little bit more formal things, so we will develop formulas or graphs. And that is quite inefficient if they are just shown on the slide. We have to develop them on the board sometimes. And they might look very complicated if you just look at them,
07:41
but if you see them being developed on the board, you will see that it's actually quite easy. That is it. There's nothing really mysterious about it, and it's much easier to learn it that way. And that is something we cannot capture with the screencast, so that is something you will miss if you rely only on the screencast.
08:00
Okay, so I don't recommend relying only on it. It might be helpful to see what was said about a slide, but it's probably not always the best learning method. So attendance is highly recommended, but of course not enforced. Up to you what you do.
08:20
Also, the presentation slides will be available. And similar, they are not a textbook. They are not intended that you can use them stand alone to study. They need to be accompanied by the lecture, by the class. If you visit the class,
08:40
if you attend the class, then it's helpful to see the slides. If you don't, they're probably not as helpful. So if you just rely on them, it might work for you, but I wouldn't recommend it. Okay, and it's probably not enough to pass the exam just by reading these slides.
09:01
People have tried and failed, but anyway, up to you. How does the course work? Similar to the HCI class that probably most of you have visited, there will be occasional homeworks, seven assignments over the term. So every other week, there will be a very small assignment. And we will discuss and present your solutions here,
09:24
and see what was the right solution, if it was not. And if you have at least five of these fairly easy homeworks done correctly, you can participate in the exam, okay? So this is the way you're familiar with this method,
09:44
this way of running the lecture. The exam will be after the semester term. I scheduled it for the first Monday after classes end. So after the last week,
10:02
on the Monday, we have this exam. Okay? Are there any conflicts already for anybody? Other exams on the Monday after classes end? Doesn't look like it, okay, so we're all fine with this date and we can fix it.
10:22
And soon, of course, you want to know what does the exam look like? There will be some text work, some binary, what is it called? Tasks, multiple choice and single choice tasks. Also, typically something to calculate,
10:41
and for you to get an idea, I will put the last exam on the Learn Web soon. Okay? Good. All right, the resources, slides, and stuff. Any questions on this formal stuff? Not the case, okay. So we can get started immediately.
11:03
Retrieval systems, everybody knows retrieval systems, everybody uses them more and more often, and everybody might think they know what happens in these systems. We'll see if that is true, or you'll see if that is true during the class.
11:20
We also have systems for different modalities, right? For pictures, for images. We can work with all kinds of documents.
11:40
There are many, many special specific domain information retrieval systems, like systems for scientific literature, like Web of Science, which is a database for scientific papers, and there you have some different ways
12:03
to do retrieval there. Anyway, but all these things look familiar. We are accustomed with these things, we use them on a daily basis, so it's good to know a little bit more about it. Start with a definition. What is information retrieval? Information retrieval deals with the search
12:21
for information, and with the representation, storage, and organization of knowledge. Okay? That's a very general definition. Big scope here, and now we can see what we do with this definition.
12:40
This should remind you of, of what? We have two different terms in this definition. Which terms are these? You're in the fourth semester? Who is not in the fourth semester?
13:01
A few? Okay, only a very few. So you all remember the introduction lecture quite well. Introduction to information science. We have a few basic terms in information science that are basic terms.
13:26
Yes? Knowledge management, yes. We have management here, but we have knowledge, and? Knowledge information data, yeah? And of those two, again, we don't have data here,
13:43
but of course, knowledge management is of course also a very important term. But the very basic terms are information and knowledge. Right? Information und wissen. And here? We can check if they are used correctly from the definition that you learned.
14:04
And what was the definition of information? What comes first, yeah? Knowledge in action. Knowledge in action, yeah, very good. The short formula that Kuhlen coined, right, and in short, we can say knowledge is the big thing, right?
14:25
There's lots of knowledge. We can talk about knowledge, what could be knowledge, and then we look for information for a small chunk of it which is relevant for our current information, for our current situation, and then we get information.
14:42
So basically, information retrieval looks like the pure definition of information science, right? So it's really one of the core disciplines here that we look at. And here we have it both in context. Search for information within a lot of knowledge.
15:00
So but to be able to search for information, to search knowledge for information, we need organization and representation. And what is representation? How can I represent knowledge? Why do I have to represent it?
15:23
Difficult question, huh? Yes, uh-huh, of course, yes, uh-huh. That is the first step, yes, uh-huh.
15:41
That is, we would call this, and this basically again directed towards knowledge management. It's not a coincidence that you said the term because within knowledge management, I have an idea, right? I think of something, or I do something, I create new knowledge for myself, then I know it, then I have to share it, yes?
16:07
I can share it, how can I share it? Yes, yeah, okay. Yeah, that's a different definition of representation
16:20
that we talk about, but basically you're right. What we could use, what term would we use in knowledge management? I know something, now I want you to know that. How can I transmit it to you, let's say? I can simply, huh?
16:47
I know something, I want you to know it. What could I do? How can I transfer this knowledge to you? I can just tell you, right?
17:00
I can say it to you. So I use natural language. I represent the knowledge in natural language, we could say, yes. Here we talk about different representation to you, but I can only reach a few people, right? In my lifetime I can only talk to so and so many people, so if I think this is very important, I can also write it down, yeah.
17:26
I can write it down and give it to others and hope that they read it, maybe, right? So we could call this externalization of knowledge. This would be the term in information management. I externalize what I have, my internal knowledge is put on some medium, right?
17:45
A drawing, or a text, or I just say it, I use natural language again. I record it like here, I could record, I could also write this down if I had time. And it's externalized.
18:04
I can die and this knowledge would still be around, okay? Externalization. And then we have, in the history of mankind, a lot of knowledge has been externalized. A lot of books, we have a lot of text, a lot of websites,
18:20
a lot of information, a lot of knowledge we can say, sorry, and now we need to find the right chunk for us. That is what information retrieval is about. And now we have a lot of externalized knowledge and now we need to search it. How can we search text?
18:42
How can we now represent text? What do we have to know about the text in order to find it when we have a search problem? That's the technology that information retrieval uses. Let's start with an example. We have a query.
19:02
Wolf Niedersachsen, a wolf in lower Saxony. Now maybe I have 10,000 documents, or millions of documents on the web, for example. What do I have to know about them to say? Yes, this is a relevant document for this user.
19:28
Like a search engine, we'll look at the text now. First, maybe we'll deal with text here. I have a text document.
19:40
Now I have to look at it and see: is it relevant for somebody who entered the query Wolf Niedersachsen? What do you think? What would you search for? Keywords that fit the query Wolf Niedersachsen? Which keywords could those be?
20:08
So I will search within this text for wolf. And I will search within this text for Niedersachsen. For something else, maybe also?
20:29
For someone, a person named Wolf, or for the animal wolf?
20:42
You can ask a user. Search engine, you can say, well, we don't give you answers, we ask you. What would you do as a user? You would just say, what is that? I go to another engine. I don't want to be asked questions. I want to ask questions, right? You want to have it simple.
21:02
Search engine then asks you back. One more step for you. You don't want that, right? So the search engine doesn't really know. Is it a name, like Niedersachsen's a name, or is it a animal concept?
21:21
So we have to know if there is wolf in this document or not. Or any other ideas? Niedersachsen, yeah. We could check the documents. Yes. If there is a animal or some people.
21:41
Yeah, we could do that, okay? We could say, yes, I see there is wolf, but I say this is a person. Don't confuse this with the animal.
22:00
But who would do that, and how? Of course, this cannot be done manually. Could be done manually. For images, it's done manually, right? Describe your images. If you want to find images, it can be done manually. That's too costly. We talk about manual representation next time.
22:24
Next session. So keep that in mind. And for the moment, we don't have anything. We just know, aha, there is a word, wolf, and there is a word, Niedersachsen.
22:42
We know anything else. What else could we look for? Look for the ideas? How often do the words appear, and then they're together in one sentence?
23:03
Okay, yeah, very good. So if it's, oh, they have wolf and Niedersachsen, both would be good hits, right? But now I find, oh, again, there is Niedersachsen here. So I could count these words and say, ah, this looks more promising.
23:22
Or this has two wolf. I can count the words. I can look for them. Is it there or not? I can count them. And then what else did you say? Ah, the closeness, yeah. Proximity operators, yes. Those who are in the lab class on patent retrieval, we work with proximity operators the next time.
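As a rough illustration of what such a proximity check could look like, here is a minimal Python sketch; the example sentence and the simple token-distance measure are assumptions made for illustration, not what any particular engine actually does:

```python
def min_distance(tokens, term_a, term_b):
    """Smallest distance (in token positions) between an occurrence of term_a and one of term_b."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None  # one of the query terms does not occur at all
    return min(abs(i - j) for i in pos_a for j in pos_b)

# Invented example document: wolf and Niedersachsen fairly close together.
doc = "ein wolf wurde in niedersachsen gesehen".split()
print(min_distance(doc, "wolf", "niedersachsen"))  # prints 3: the positions differ by three
```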
23:42
Um, for this discussion, we could also say, in case we have only wolf and Niedersachsen,
24:03
we have two documents that have wolf and Niedersachsen, and in one they happen to be very close together, while in the other not; we could say this one looks a bit better than that one. Our result in retrieval is typically a ranked list of results, so we could say, ah, we just have to know,
24:22
this one should be in front of that one. And that's all we need for the presentation. Okay, but now we count. Basically, these are not a lot of ideas, but they are all the ideas that the information retrieval community has had so far. Basically, that's what is done.
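A minimal sketch of this counting idea in Python, with a made-up mini collection rather than the six documents on the slide; the scoring is nothing more than summing the raw counts of the query words:

```python
from collections import Counter

# Invented toy documents, not the example from the slide.
docs = {
    "A": "wolf und wolf in hannover",
    "B": "urlaub in niedersachsen",
    "E": "wolf in niedersachsen wolf und niedersachsen",
}
query = ["wolf", "niedersachsen"]

def score(text, query_terms):
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)  # just add up the raw term counts

# The result in retrieval is a ranked list: best hit first.
for doc_id in sorted(docs, key=lambda d: score(docs[d], query), reverse=True):
    print(doc_id, score(docs[doc_id], query))
```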
24:42
We count words; we look for words and count them. Not very elaborate. So let's look at this example. Here we have six documents. They contain wolf and Niedersachsen; we counted those already. They contain other words,
25:00
and one of these other words is spelled out here. And now, you are the search engine, and you make up your ranking. You have a piece of paper, you think about what could be the best hit now for the user.
25:22
What is the second hit? Quite tilted, huh? But, probably it's okay.
25:44
So think about it. You're the search engine. The user is typing, he sent the query. Now, you need your response. What is the best hit for this query?
26:06
Rank list. Anybody, any suggestion? Anybody else? Oh yeah, one, two. I'm trying E for number one. Let's write it down. So we have E, that's okay. Followed by A.
26:23
Followed by B. Well, maybe followed by F, B, C. F? E. E and C? Like that? F, B, C. B, sorry. Ah, E, we have E already.
26:47
So I agree, we have hit number one. Who would put something else on number one? We would do F. F. Okay. And then E. And then E.
27:01
Another alternative? Yeah, I would also say F first place and then followed by E. So who puts F and who puts E on the first place? Who is for E? As it turns, who is for F? Looks like a big move.
27:20
Who is neither for E nor F? Who is for something else? Oh, that wasn't everybody yet. I have to get an attendance list to check if anybody hasn't given a vote. What else? No? Nothing else? Then E and F. Let's see, why would you put E and F?
27:41
Let's look at the numbers. And what was the suggestion? E and F.
28:00
We see that the numbers are quite different. If we sum up the occurrences, we have three here. Most people put this one way down on the list, right? This one has two. Then we have some that do not have one of the two words, but have a lot of the other words.
28:22
But that doesn't really compensate. I would say, I'm sorry, if you don't have wolf, you have to go to the back. It doesn't help that you have so many occurrences of the other word, or vice versa. So C and D were quite low, is that right?
28:42
No. Okay, we can also have different opinions. Why could we have different opinions about document D?
29:01
Well, let's discuss the top ranks first, okay? That's not too complicated. E and F. Here we see, ah, these have three and three, four and two, six occurrences. This one also has six occurrences, A.
29:21
So if we look at the absolute numbers, we cannot make a decision. If you just look at the sum. So why can we make a decision for one of these two? What are these two documents? These three documents.
29:47
We do not have some kind of ranking within our query, so I would take document F, because it has the same amount of Wolf as the other.
30:03
So it has the same weight for both of the... It's balanced, the word. Yeah. I could say it's balanced. It gets a lot of wolf and it gets a lot of Niedersachsen. Balance is nice. Why not? My approach was that you take a look
30:22
at the abbreviation for the city, and when you figure out that the city is in Lower Saxony, you can basically add a one or something, and if it's not in Lower Saxony, you add nil, and so F would be six points and A would be seven points. Yeah. Now you use this additional information
30:41
that I put there, right? Hanover. You know that Hanover is in Niedersachsen. So it's quite interesting, right? We could say, oh, that's what I want. Graz is obviously not in Niedersachsen.
31:00
At least there's no city called Graz in Niedersachsen, to my knowledge, right? Doesn't mean there isn't one, but I doubt it a little bit. So obviously this one is quite misleading. This might be mostly about wolves in Austria. Could be Niedersachsen is in there just by accident.
31:24
So we could say this goes to the, this is second place. And we could have similar arguments for this one here that also has Hanover and also six occurrences.
31:41
But it's not as balanced. Okay. So what would be, well, first of all, what about this additional information?
32:01
Can we use it? Can the system use it? Like, we know that Hanover is in Niedersachsen and Graz is not. Does Bing or Google know that? Do they know it?
32:21
Can they use it? What about your mental models when you use these systems? What do you expect? Basically we can say this requires quite a lot
32:43
of processing to do that. Typically it's not done to use this information. Both Hanover and Graz are words that are not in the query, so they're ignored. Okay. I just put it there to distract you a little bit.
33:03
Confuse you. Of course as humans we would use it immediately. Even if there were no Niedersachsen and there were Hildesheim or whatever, Lüneburger Heide, and wolf,
33:20
we would say, ah, that's interesting for the person, really, because there are a lot of places in Niedersachsen, and wolf. But we think about it. We process and analyze these words; a system typically does not do it.
33:41
It can be done though, right? Especially in geographic retrieval, we have more processing of that kind. We try to disambiguate and find places. But typically for text retrieval, we wouldn't do it.
34:00
We could also have words related to Wolf, like, Wolf is a bad example, like dangerous animals or something like that, like predators or things like that. We could just use other words, maybe instead of Wolf,
34:22
to be stylistically nicer. And we know how, he talks about the Wolf. But a system would not typically use it. So we can simply say most systems would ignore other words and just rely on this.
34:41
So we have the decision between six, oh, so six counts, right, for three documents, and now we still have to make a decision which one is the best. Good argument is balance. That's quite nice. We will see in the third class,
35:01
and I don't have to talk about everything today, that there is a similar model that is not called balance, but basically a formula is used that has the same effect, so that this one would probably be the highest. What would happen if we had one document
35:25
with four wolves and two Niedersachsen, like this one, and one with two and four? Let's say the document with four wolves would be in a higher place than the other one, because wolf is the first word in the query.
35:43
So it has more relevance. That's what I would think. There's an option to say: you wrote wolf Niedersachsen, and so wolf is more important for you. That means when you type in wolf Niedersachsen
36:04
versus when you type Niedersachsen wolf, you would expect a different result, right? Is there a different result? No? Have you tried it?
36:21
Yes, I think I remember that there was no difference. No difference that you could observe, okay. So everybody can try that at home to see if this mental model that you have, which is justified because there is an order in the query, is really how the system works.
36:41
Typically, no. Typically, these things would be... we would have to look for other things about wolf and Niedersachsen to make a decision here. What could we use? Could be found.
37:03
Oh yeah? How often we see this word... I mean, how often this document would be found. Ah yeah, okay, we could use that; this is quite an advanced system.
37:21
We observe user behavior. All right, we see, aha, this stupid document is clicked very often. But in order to do that, we have to track, for like 10 million web documents, for each of them, how often it is clicked, and for which query. Also quite elaborate processing, right? Google probably does some things related to that.
37:42
We're not sure. But if these documents are new, we still wouldn't have that data, right? What do we present for the first time? Two and four, four and two. How do we make a decision?
38:04
The first part of the answer was quite going in the right direction. Of course the second one, observing users, is also right. But we could look at: what about these words, wolf and Niedersachsen?
38:23
Would you want the documents with more wolf or more Niedersachsen? Well, don't you have any demands for the search engines? What should they give you?
38:48
So, Niedersachsen is just the location, the setting; what I'm really looking for is the wolf in Niedersachsen. I just, I don't know, I just have to be told once
39:02
that it's in Niedersachsen, and maybe wolf can be repeated in the text. Yeah, okay. Is it a nice argument? I think so. Can we follow that? But what does a search engine do with this?
39:20
How can you calculate with this kind of knowledge, right? Not really easy. Wolf is in a way more specific, we could say. All right, and Niedersachsen might be a bit more general. And how can we see what kind of word we have? Is it very specific or more general?
39:42
Can we make a general rule from this? Again, what do we do with words in a retrieval? We don't try to understand them, we don't try to do anything, we just count them. All right? In this case, we could count them where?
40:08
In the full collection. We could say how often the word wolf is there and how often Niedersachsen is there. And that's typically what is done here. Is it a frequent word or is it a rare word?
40:22
Is it very specific or is it very general? Does it really help anything? Well, it depends; I have no idea whether in this collection, or in the whole web, wolf or Niedersachsen is more frequent. But the idea is that a word that is rare
40:47
gives you more information because it defines a smaller set of documents. Okay, so basically it looks simple, we just count.
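This "rare words carry more information" idea is what the textbooks formalize as inverse document frequency. Here is a small sketch of how such a weight could be combined with the in-document counts; the collection is invented and the exact formula differs from textbook to textbook:

```python
import math

# Invented toy collection: document id -> list of terms.
docs = {
    "d1": ["wolf", "wolf", "niedersachsen"],
    "d2": ["niedersachsen", "niedersachsen", "hannover"],
    "d3": ["graz", "berge", "wandern"],
}

def idf(term):
    # A term that occurs in few documents of the collection gets a higher weight.
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / df) if df else 0.0

def score(doc_id, query_terms):
    terms = docs[doc_id]
    return sum(terms.count(t) * idf(t) for t in query_terms)

query = ["wolf", "niedersachsen"]
for d in sorted(docs, key=lambda d: score(d, query), reverse=True):
    print(d, round(score(d, query), 2))  # the rarer word "wolf" dominates the ranking
```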
41:02
But very soon we encounter a lot of the problems that we discussed here. And we don't know what to count, right? Do we count how often it is in a document, how often it is in the collection? And these things are something that we deal with
41:22
in the class, so in three weeks you should be able to calculate what would be the hit, the first hit according to the traditional typical IR models as we find them in the textbooks. All right, we don't know what Google does. We can look at this and of course it doesn't help us
41:43
a lot, right, we don't know how often wolf is there. We see two wolf, I see two Niedersachsen in the first and in the second only one wolf. So I don't know why the position has been switched,
42:01
right, maybe if we look at the URL then wolf would be more frequent in the first one also. Okay, anyway, but Google doesn't have to do what the typical traditional IR textbook says and sometimes they do something else. We also see examples in the class.
42:20
So we have a lot of retrieval object as I said and now we can have text documents, web documents, images, movies, audio documents, music. All of that could be retrieval objects.
42:42
And we had a, you already have studied a query language in the introduction for some other kind of objects. What did you study, remember?
43:01
Hopefully, yeah? Spelled out, yes, SQL, typically spelled out in English, and what does SQL do? What does it mean anyway?
43:24
Yes, very good, Structured Query Language. Query language, that sounds like something good for retrieval. So maybe you have already studied some information retrieval without even knowing. So what is the difference between SQL
43:40
and typical web search, web retrieval that we have just studied. This brings us to some of the core problems of IR. SQL and let's say web, IR, web search.
44:05
What could be differences? Yes? Could you maybe use a black pen? Oh, sure. We can use this, oh, so it's closer to this side too.
44:21
Ah, we have, should have made a picture of the ranking. Because the next, in three weeks we can check if we had a good intuition or not. Was quite close, of course. So SQL, structure query language, and typically web IR.
44:41
We want to know what are differences. Yes, formal, natural, let's put a small question mark
45:03
because what, it's not something I would talk, it's not the way I talk, right? So it's closer to natural, of course, than SQL. Definitely, yes? As operators.
45:22
Do we have operators in web search? No? Never? We have some, which ones? Google uses minus, right? Minus, plus, what else do we have?
45:42
Quotation marks, yes? Interesting, ah, we can have quotation marks for phrase search, like "Wolf in Niedersachsen"; then it would just look for this phrase. We can have operators like plus and minus; plus would say, plus wolf, Niedersachsen,
46:02
that would emphasize wolf by what you suggested earlier. Other operators, lots, even Google has lots of operators like site search, like link search,
46:25
these are all important for information professionals, so you should at least know the basic Google operators as information professionals, information specialists.
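Just to make the phrase-search operator concrete: conceptually, quotation marks reduce to checking for the whole word sequence instead of the single words. A tiny illustrative sketch of that idea, not how any real engine implements it:

```python
def matches_phrase(text, phrase):
    # Phrase search: the words must occur in exactly this order, directly after each other.
    tokens = text.lower().split()
    words = phrase.lower().split()
    return any(tokens[i:i + len(words)] == words for i in range(len(tokens) - len(words) + 1))

print(matches_phrase("ein wolf in niedersachsen gesichtet", "wolf in niedersachsen"))  # True
print(matches_phrase("niedersachsen hat wieder einen wolf", "wolf in niedersachsen"))  # False
```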
46:44
But what can we write here? We cannot write no, but somehow the answer is, of course, right, who got it, who was it? What should we write? Just a few of the operators. Fewer, yes, we can write fewer, but, basically, okay.
47:05
They are not needed, are they optional? They are optional, yes, that's good, optional. So they can be used or not, and what do most people decide? They don't use it. What would be the percentage range of queries without operators, what is your guess?
47:22
Well, over 90%, 95%, at least over 95%. So here they are rarely used and optional, and here they are mandatory, absolutely necessary, right?
47:41
Other differences? Quite nice, quite a few differences already. One difference is that this is something you're studying in a class,
48:00
while you didn't have to study to use web IR, right? You just use it without education; here you need to study.
48:24
And what kind of data, what kind of knowledge can we query with SQL, and what kind of knowledge can we query with web IR? Now we come closer to the core differences: the kind of knowledge.
48:55
Yeah, very good, that's exactly the terminology that we need, structured data, unstructured data.
49:18
Structured data is basically, for example,
49:21
relational tables in a database. Employee data: select name from table employee where salary is higher than 45,000, and you will get a list of names of employees.
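To make this concrete, here is a small self-contained sketch using SQLite from Python; the table and the numbers are made up, only the shape of the query matters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, age INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Meyer", 42, 52000), ("Schmidt", 31, 39000), ("Wolf", 55, 61000)],
)

# The structured query from the example: the result set is fully defined,
# and every conforming database system must return exactly these names.
for (name,) in conn.execute("SELECT name FROM employee WHERE salary > 45000"):
    print(name)
```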
49:50
Because somebody has prepared the database in a certain way, all the data is structured, right? Variable one, age, next,
50:01
name, next row, salary and so forth. Yeah, this is really the core difference. If we talk about unstructured data, what is unstructured data?
50:25
That the search engine just analyzes it at the moment of the search, not before the search? No, that's not quite right.
50:40
Because in order to be able to process 10 million documents, I have to preprocess them ahead of time. So when you query, all these pages have already been analyzed. But they have to be analyzed. What do they consist of?
51:01
What is web search? Okay, what kind of data? We have some examples here. Now we can, what of these is unstructured? What kind of this data is unstructured?
51:25
Images, unstructured, yes. There is no rule that if there is an employee, or a face, it has to be in a certain position or something. What else?
51:43
Music, unstructured, yes. Also unstructured. A musician would say, well, wait a moment, no? There is some structure in my music. But not from a retrieval point of view, from a recording. Music is an interesting case,
52:01
but we cannot do it in the class, really, too much. What else is unstructured? Well, if images are, then movies have to be unstructured as well, right? Just a bunch of images. And nobody wants to say text, what about text?
52:24
All of this is unstructured. Text is unstructured, you would say. Of course, linguists would disagree. Well, wait a minute, there are sentences, there are words, but this is a structure that we cannot exploit for mass data.
52:41
We cannot analyze mass data in text. We cannot understand what 10 million documents are about. We don't know if Wolf is a name or not. We have some technology to do that, but it doesn't work 100%. It doesn't really work for mass data.
53:00
So, text, most important, unstructured data. Here we have database. Further, differences. What else do we have?
53:25
Yeah, okay, here we don't have any problems because we don't have really syntax, right? Good. We say here retrieval objects,
53:43
and we call this class information retrieval. If you want to learn SQL, which class would you visit? When you look at the course book, oh, I need to know more about SQL.
54:02
I want to study more about SQL. No, that's a little different. That's a completely different area within computer science or information science that deals with structured data, and has very little overlap with information retrieval.
54:26
You would visit a completely different class, which is called, maybe? Discussions up there? Which one? Which class would you visit for SQL?
54:47
Cool, and it's used for what kind of data, again? Nobody knows it; that also shows that we don't offer this kind of class here. The class we would need would be databases, okay?
55:05
Databases. Databases and retrieval are basically different areas.
55:20
We don't talk about retrieval within a relational database. Sometimes people do, but retrieving something means something different to the database community. So we have different kinds of objects, and with this,
55:41
we have different kinds of methods and sciences, databases. Different kinds of systems, IR system. And now what does it mean for the result? We have talked about what could be the real, the good, the best, the real ranking,
56:00
the correct ranking for Wolf Niedersachsen. What about an SQL query in the introduction? Did you also discuss what could be the best solution for a query? No, why not?
56:25
There is no ranking, that's true. And? Can we debate whether this person should be in the result set or not? No, why can't we discuss that?
56:41
And here we have a major difference again. If I'm putting a SQL query, what is the result? It's absolutely clear, right? It's fully defined. If I write a SQL query, get select employees
57:03
from table employees, salary larger than 40,000, then every SQL, every database system has to give me the same result. There's a clear definition, a standard, an ISO standard; it can be Oracle, it can be Sybase, whatever database system
57:20
has to give the same result back. So it's clearly defined; it doesn't matter which system, the result is defined. For databases we have a defined result. In retrieval, what is the real result for Wolf Niedersachsen?
57:42
What is the correct result? I'd say there is none, it depends all the time. Depends, depends on what you want, right? Depends on every, it could be different for many users.
58:03
If you type it into Bing and into Google, they will have different opinions too. And nobody can say, ah, that's wrong, right? There just is no clearly defined result, right? Because we deal with unstructured data. We cannot guarantee one correct result.
58:23
What will be the correct result for query auto car? Nonsense to talk about that, right? And here we can see the real major difference. If we deal with structured data, we have a defined set. If we deal with unstructured data, we don't have it.
58:47
So because we deal with documents in natural language, which still is the most important resource or form of knowledge on the web, it's unstructured, dominant,
59:00
there is no way to get a guaranteed result. Ah, like this person, maybe he has a good result. We don't really know. And now that's what we already talked about. These are really the core differences between SQL and IR. We talk about different disciplines
59:21
and a nice way to illustrate this, to think of this, is: here there is a defined, clear result, and there, there is not. Why is that the case? If we look at the IR process here... ah, let's look at this a bit later.
59:41
Let's see how we're doing on time. When did we start? 2:14, right? 2:15. We have to hurry a bit. First, look at this definition to illustrate what we talked about, the difference between SQL and retrieval.
01:00:00
This definition, which is quite old, of the Fachgruppe, of our special interest group that we have for information retrieval in Germany: it's old, but it still gets it very nicely to the point. It says information retrieval deals with those kinds of queries, those kinds of issues,
01:00:20
that are characterized by vague queries and uncertain knowledge. A vague query is something like wolf: I don't really know what is meant, right? It can be a person's name, it can be an animal, it can be anything. So a natural language query is typically
01:00:41
almost necessarily vague, right? And uncertain knowledge, that stems from the way data is processed. SQL, there is no opinion, I don't need any representation,
01:01:04
it's clearly in the table, it's structured. I get my perfect result. I have certain knowledge. I know this employee earns so much, and so forth. And in retrieval I have knowledge that is only in vague natural language. I cannot really extract the knowledge perfectly.
01:01:26
There is no 100% system that tells me this wolf is a name and this wolf is an animal, right? These are things that you study with Professor Heid, and there is no 100% performing
01:01:42
natural language processing component. And that's why we have uncertain knowledge. We don't know which wolf, we don't know a lot of things if we just read text, if a machine processes text, and we have uncertain knowledge, uncertain representations
01:02:05
of our documents. And the definition goes on; it says there are queries that deal with uncertain criteria,
01:02:24
vague criteria, and those can only be answered iteratively, in dialogue processes: I query, get a result, I query again. For example, I query wolf, I get answers back,
01:02:49
and oh, that's a person's name. No, I want the animal. I put in some other word to make clear that it's the animal I want. Okay?
01:03:01
So, iteratively, in a search process, which is typical. Let's get to it again. Okay. Does anybody else want to leave within the next five minutes? Then maybe everybody leaves at once, so we don't have constant movement. Everybody else stays for the rest of the class.
01:03:28
This is something that is very typical for search processes; probably you can observe it in your own behavior: you type a query, you look at the results, oh no, that's not what I wanted. I need to change, modify my query.
01:03:43
Then I work toward a better solution. So I don't know what to expect if I put type in wolf, it can be animal, it can be a person, and depending on what I get back, I will change my query. And here we have differences,
01:04:03
the representation of the knowledge within the IR system is limited: because we have unstructured data, we have uncertain knowledge about the real meaning of it.
01:04:20
I'm not sure about the meaning of wolf, right? We have limited representation of the semantics, semantic analysis as I said before, is not perfect yet for mass data and we have to deal with current technology where the system does not understand the text, it just counts words.
01:04:42
Okay, that's why evaluation is of really key importance in retrieval; we have to know how good the system really is. Recall, precision, something you already know; we will learn more metrics. That's something that doesn't make sense for SQL, for databases, because here
01:05:06
we have a defined result; we don't have to worry about recall and precision. Okay, so far so good. We look again at the IR process,
01:05:23
retrieval process, we will have this slide more often with more additional things coming in over time. We have the user, maybe yourselves, in front of a user interface,
01:05:40
types in a query, gets results back; he doesn't see anything else. And what happens in the information retrieval system is that, long before, people write texts, documents, or images or whatever; they are stored somewhere, indexed. How is that done?
01:06:03
We talk about it, representation, this is a representation part. Then we have a representation like a document term matrix, a typical one, and we have our query, it's also processed by indexing, and compared to the representation.
01:06:24
So the query is compared to the documents, that's the search part, or matching part, where you have learned different matching algorithms, like which ones, Boolean match, right,
01:06:45
vector space model, approximate search, approximate matching, this is the search, so we have representation and the search, and then the system decides on some hits based on probably some quantitative,
01:07:00
on some basic word counting, and these are presented, and that's the retrieval cycle. We have our two steps in there, and of course the system and the user interface, which is not at the core of retrieval research or retrieval topics. Okay, questions so far?
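To make the two steps a bit more tangible, here is a very small sketch of a document-term representation and a vector-space style matching step (cosine similarity). The documents are invented and real systems do much more, but the cycle is the same: represent, compare to the query, present a ranked list:

```python
import math
from collections import Counter

# Representation phase: each document becomes a term-count vector
# (one row of a document-term matrix). The documents are invented examples.
docs = {
    "doc1": "wolf in niedersachsen wolf gesichtet",
    "doc2": "hannover ist die hauptstadt von niedersachsen",
    "doc3": "der wolf ist ein raubtier",
}
vectors = {d: Counter(text.lower().split()) for d, text in docs.items()}

def cosine(a, b):
    # Vector-space matching: compare query vector and document vector.
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Search phase: the query is indexed the same way and compared to every document.
query_vec = Counter("wolf niedersachsen".split())
hits = sorted(vectors, key=lambda d: cosine(query_vec, vectors[d]), reverse=True)
print(hits)  # the ranked hit list that would be presented to the user
```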
01:07:26
Not the case? Then we move on. Hopefully we can come to the end quickly. The typical IR process, now in textual form: representation and search;
01:07:41
search means analyzing queries, matching queries to documents representations. These representations have to be developed in the representation phase. Documents are analyzed, linguistic pre-processing, indexing, and we get a representation.
01:08:00
Before that, for example on the web, we have to find these documents first. We don't know them; we have to find out that there is a web document, go there, crawl it, index it, and process it. Okay, further definitions; here is one by Ingwersen:
01:08:21
Information retrieval covers the problems related to the effective storage, access, and searching of information required by individuals. Here we see the user, the individual, in the focus. We already had our definition of the Fachgruppe,
01:08:41
the special interest group, and here we have one that's also quite nice, by Robertson, Stephen Robertson. He said: leading the user (shh, not louder than me), leading the user to those documents that will best enable him to satisfy
01:09:01
his or her need. So the user has an information need, he needs to know something, and the system should support him in this regard. Okay, ah, just some examples of systems that don't work,
01:09:21
You have probably come across this. If we look at retrieval systems, we can quite easily find problems. A few years ago we did a study with a Swiss company on site searches, the search boxes that companies typically have on their websites.
01:09:41
We had, for example, the Toshiba Europe website, if you searched for laptop, you get zero hits. Very strange, that's the main product.
01:10:00
So that's the product they sell, and you don't find it with the site search, quite strange. Other issues: here, if we type the word Toshiba into the site search, we find 26 hits. If we type it into Google, and restrict our search to the Toshiba site
01:10:21
with the site operator that I mentioned here, we got 6,000 hits. So either Toshiba or Google does not count correctly, and the differences are quite dramatic, 26 to 6,000, strange.
01:10:40
Yahoo had 9,000 at the time. So probably none of them was correct, but which was the closest? Nobody knows. Other things: completeness of PDFs is also an issue, problems with dates,
01:11:01
like, is the newest press release already indexed? Can you find it with the search? Often not. And interesting also: special characters are still a problem. Here we search for the name Jürg, a Swiss name, and often you just cannot find it
01:11:22
because there is an umlaut, and that's not processed appropriately. So, let's finish this class, if there are no more questions. There's always time for questions now. Can also ask next time.
01:11:41
Homework for the first week is on the Learn Web. The key for the Learn Web is Salton; I already erased it, but hopefully you memorized it. On the Learn Web, you find a list of textbooks, meaning Lehrbücher,
01:12:02
to study information retrieval. They're always quite useful if something you don't understand in class, you want to read more about it. It's nice to look at these textbooks. Most are freely available on the web. Some are freely available within the Hildesheim website,
01:12:20
so you can use them on campus. They're all online. And the task, the assignment, is now to search for definitions, not somewhere on the web with Google, but within these books. So find a definition of IR, a small one,
01:12:40
give your source reference from one of these books, and then turn it in in the Learn Web. I will add a homework page within the next, well, next few days, probably tomorrow. Deadline,
01:13:01
since we have seven homeworks, we have one more or less every other week. We make the deadline always two weeks. So in two weeks, I will announce the solution, or we discuss the homework here, and afterwards there is no use in turning it in. So the 26th of April will be the deadline
01:13:23
for this first homework. Got stuck somewhere here.
01:13:42
Everything okay? Hopefully you're not surprised by the homework. Anyway, questions? Clear? What do you have to do? What is the first question when people get homework?
01:14:01
They ask: can we do it in groups? Of course, the numbers just add up, right? There are seven books, online books, so if you do it in a group of two, you have to hand in four definitions. If you do it in a group of three, you can find seven definitions easily.
01:14:23
Okay? I don't want to discourage group work. It's always useful. You're much more powerful, much more capable within a group. So see you next week.