
Principles of Information Seeking and Retrieval


Formal Metadata

Title
Principles of Information Seeking and Retrieval
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2023
Production Place
Hildesheim

Content Metadata

Subject Area
Genre
Abstract
This course aims to offer an introduction to information retrieval systems as an important area of information science. It contains some basic information on:
● definitions
● state of the art
● examples (focus on digital libraries, differences between metadata and full-text indexing approaches)
● automatic indexing (ranking and weighting basics)
Keywords
Transcript: English (auto-generated)
Welcome! In this video I will talk about principles of information seeking and retrieval. One of the main take-home messages is, well, how does a search engine sort its results? There must be a reason that one document is at position 1 and others are below it. How is that being calculated?
This all falls into the area, into the scientific area, information retrieval. Information retrieval deals with the search for information and the representation, storage and organization of knowledge. Most knowledge is represented in natural language,
so information retrieval deals with the processing and analysis of written documents. Other definitions are more user oriented. Here we see a definition by Robertson that says, well, IR systems are those that lead a user to documents and help him solve an information need.
Or Belkin, even more user oriented, says it's not about searching, it's about solving my own problem, my own problem management, and finding good knowledge resources that help me do that.
This is a brief overlook of the information retrieval process. We have somebody who is looking for information, who has a problem, who needs some documents, some information to solve those. There are authors, people who have written some documents. This could be scientific texts but also social media posts
and that could potentially solve this problem. People don't know each other, so it's impossible for the seeker to ask all these people directly; we have to reach them indirectly, via the written text. So documents are collected, and they are indexed, as we will see, and represented.
This is the main activity or the main technology that information retrieval carries out. Then, on the other hand, the user who is looking for information is posting a query that is also being indexed
and it's compared to the representation of the documents and a result set is returned. Question is, of course, do we really have to understand search engines? Do we always understand technology that we use when we sit in an airplane,
when we go to a medical doctor, when we use complex systems, complex machines? We might not understand them, but we still trust them. However, search engines also have a huge impact on society. They are regulated, and there are a lot of dangers involved: they need to lead us to correct information,
for example, and not to misinformation. So, especially for information professionals, I think it's very important to understand search systems, to use them well, but also to explain them,
to explain the results and limitations to other users, for example, in a library or wherever, and also to improve them. So what do we know about ranking of the documents? What features do we have to extract from the documents to say,
aha, this is how we want to represent it? Text typically is represented by words and this is simply some words that we take from the documents. Other things could be other documents, other entities could be represented differently,
for example, an image by its color distribution or other features. And we have two core problems in language, not a problem in everyday life, but a problem for automatic processing: homonyms and synonyms. Homonyms are words that have different meanings,
that can refer to different things, like bank or jaguar; synonyms are words that mean more or less similar or even the same thing, the same concept. And well, if the system finds the word bank, it's not clear which of its meanings
the document refers to; and if a user wants to find documents about money, a simple system would not return the documents on capital, and the user might miss some important relevant information. These are the core issues.
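To make the homonym and synonym problem concrete, here is a minimal sketch of how a naive exact-match system behaves; the documents and queries are invented for illustration:

```python
# Naive exact matching: a document is returned only if it contains
# the literal query word. (Toy documents, invented for illustration.)
docs = {
    "d1": "the bank raised interest rates on money deposits",
    "d2": "investors moved capital into new funds",    # synonym: capital ~ money
    "d3": "we walked along the river bank at sunset",  # homonym: bank (river)
}

def exact_match(query, docs):
    """Return ids of documents containing the query word literally."""
    return sorted(d for d, text in docs.items() if query in text.split())

print(exact_match("money", docs))  # d2 is missed although it is relevant
print(exact_match("bank", docs))   # d3 matches although it is about a river
```

The synonym is missed and the homonym produces a false match, exactly the two failure modes described above.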
Let's have a brief look, a very brief overview of indexing. I'll briefly talk about manual indexing, but most important in this class is automatic indexing. There's also abstracting: writing a short abstract for a text to represent its information, its content, to some reader,
which gives the reader a briefer, quicker access to this information. And there's also clustering, grouping documents that belong together that are somehow similar that also facilitates access. But we'll mainly talk about indexing in this video.
Manual indexing is an activity that is done by people and that is still widespread. Some information experts that are specifically trained select these representation items for objects, for documents, for example. So they will select some words that refer to these documents.
And only with those words am I able to find these documents in manual indexing. Typical examples are metadata, such as HTML metadata or metadata for an image. This will be the main access to these documents.
And as I said, this is still widespread. Libraries, information centers, news agencies and company archives often employ manual indexers, information specialists, as do patent offices. And even lay people do that when they, for example, tag,
use a hashtag in social media. That can also be considered as kind of a manual indexing. This has a high value. It's typically of good quality. And it should be used whenever it's available. So what are the issues related to manual indexing?
As I said, we have to select representation terms, as specific as necessary, but also as exhaustive as possible. And people who do this have to have some knowledge of the tools and resources that are available. For example, if there is a library catalog, they have to understand the terms that they can use
from the library catalog and know how to find them. Not too many. Quantity is also an issue when it comes to manual indexing. What are types of controlled vocabulary? As I mentioned before, typically users
or information specialists are not free to select any word. So there has to be a collection of allowed terms, of allowed words. This can be a simple list. It can also be a thesaurus, which contains more systematic information: for example, that two terms are similar,
that these are synonyms, or that there is a broader and narrower term relation. And for synonyms: please always use this term when you refer to this concept. This we call a preferred term. And that helps the information specialist.
Furthermore, we can have classifications where we have hierarchical structure within our catalog, within our allowed vocabulary. This is typically the case in library systems where we have the scientific domains maybe in the university library and even in more specialized areas.
For example, we can have biology, we can have plant biology, and further subdivisions of biology, and so forth. An even more elaborate system is an ontology. That describes not only the concepts and objects, but also what the allowed values are.
What do these things contain? What are they composed of? For example, a car could be something that typically has wheels and typically has four wheels. This already is very close to data modeling and it would allow logical reasoning.
Let's move on to automatic indexing. Automatic indexing requires software that selects these representation terms for our knowledge objects. For example, our text documents. These words have to be selected.
Now we move to a situation where this is rarely done with a controlled vocabulary; rather, any word can appear as an index term in automatic indexing. The most typical approach is the so-called bag-of-words approach.
What does that mean? Well, all words are kind of thrown into a bag, and within the bag they can come out in any order. Think of a bag when you go shopping. Things might be in a different order after you throw them in. So the context gets lost.
We cannot reconstruct the entire document when we only look at the words that got thrown into this bag. Why is that done? Why don't we analyze syntax and semantics more elaborately? Like bank, we have this problem of different meanings
and we cannot solve this with a bag-of-words approach. Well, the answer is simply technology is not advanced enough to solve this problem yet, to completely analyze syntax and semantics of natural language. So for mass analysis, like we have in a search engine,
we still need the bag-of-words approach. A typical first step is stopword elimination, articles, pronouns, prepositions. They are important in a language. Of course, otherwise we wouldn't have them. However, they don't have any content only by themselves.
If we only look at a pronoun or at an article, it doesn't carry any meaning, unlike a noun. And these are the most frequent words in a typical collection. When we take them out, we really reduce it a lot,
and a few hundred words are a typical list for a language. And of course, we have to define such a list for every language. The next problem is morphology. We have a word like try. A user might query with this word,
and the search engine then would also be expected to find other forms of this word: tries, tried, trying, maybe even trial; otherwise the user would be disappointed. That is also the underlying idea: just the inflected form doesn't change the meaning of the word,
in the sense of the bag-of-words approach. And one question that we need to ask is: how does information retrieval differ from database retrieval? Maybe some of you have already heard about database retrieval: SQL, languages that query relational databases,
and sometimes also referred to as database retrieval, but they have some inherent differences. If you haven't heard about database retrieval, compare it to a query within a flight search engine, for example.
That is different from a retrieval system, like a search engine, a general web search engine. And here we have two paradigms, the exact match and the partial match. The exact match is the database approach where we have a very exact specification,
a very exact SQL query. We specify a set of results that we want to have, and all of those are in the result set. And we typically don't have an order. We can only find this with some metadata. We can use Boolean logic: AND, OR, NOT.
And typical examples are library search catalogs, file search and bibliographic search, and also the flight searches I mentioned before. A typical Boolean retrieval system would be one where I can search for an exact person, an author, and I'll find exactly the articles that this person wrote
or that referred to him. Partial match is different. That is what we mainly talk about in information retrieval. If we query with a word like hotel, we really have a very vague definition of what we actually want.
The order of the documents is defined by the probability of relevance. More relevant documents should appear on top of the list. And the user can go down as far as he wants and look at documents. We don't have to follow any syntax. And a typical example would be our web search engines
and also site searches that we have on a website or within a company. An example here would be Baidu. There are also mixed forms, like the Google Play Store: here we might have a web search,
but there might also be some information retrieval, and we might also have a database approach; we cannot be really sure. So the IR process has several steps. First we have to collect documents. In a company we would have a database of them,
a collection. On the web we just have to find them, crawl them. Then we have to analyze them: linguistic preprocessing, so-called stemming, where we have to go from trying, tried, trials to try. And then we do our indexing.
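The preprocessing steps just mentioned, stopword elimination, a crude form of stemming, and the bag-of-words representation, can be sketched like this; the stopword list and suffix rules are tiny invented stand-ins, not a real stemmer:

```python
from collections import Counter

# Toy stopword list; real lists contain a few hundred entries per language.
STOPWORDS = {"the", "a", "an", "and", "is", "in", "of", "it"}

def crude_stem(word):
    """Strip a few common English suffixes. A real system would use a
    proper stemmer; this crude version conflates imperfectly (tried -> tri)."""
    for suffix in ("ing", "ied", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Tokenize, drop stopwords, stem, and count: word order (context) is lost."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return Counter(crude_stem(t) for t in tokens)

print(bag_of_words("Trying and trying the user tried a query"))
```

The resulting counter is exactly the "bag": only words and their counts survive, not their order.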
And this is what we will focus on in the next few minutes. Later when the search engine is running we also have to analyze queries. Users will bring in queries. We have to look at them and connect them to the relevant documents. So what should we know about the words in a document?
Our indexing has to collect information about text documents to represent their content. What information could be useful for the ranking? Well, mainly what computers and algorithms can do very well is count and calculate.
So for example, if you have a query like Wolf and Niedersachsen — Niedersachsen is the state where the University of Hildesheim is located — and we have six documents, how would we best rank them? Here we have already counted the words, the occurrences of Wolf and Niedersachsen.
And basically all the ranking decisions are based on these word counts. We may also have some other words in the documents. Here are some cities: some are in Lower Saxony, in Niedersachsen, others are not.
Typically an algorithm would ignore that, even if it would be useful and might be relevant for the user. Look at these documents and think about how you would rank them. Which document would you prefer at position one? Later we'll see how it works.
We'll have to find some sort of weighting. We have to define a weight for a word in a document. And this is somehow related to the frequency of word occurrence, based on an idea by Luhn.
And this weight describes how well does a term represent a document. And this is all about counting words, as I said. We have to count the words in the document and also in the whole collection.
And we get our first ranking criterion. That is term frequency, called TF: how often does a term t, a word, appear in a document d? That count is our term frequency.
So we could simply count the word occurrences in all of these documents. We have 4 plus 2 in one document and 3 plus 3 in another; both end up with 6. That does not really give us a clear decision criterion yet. So we need further information.
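Assuming the two tied documents from the example contain Wolf and Niedersachsen 4+2 and 3+3 times respectively (the counts are from the lecture example; the document contents are otherwise hypothetical), simply summing raw term frequencies cannot separate them:

```python
# Raw term frequencies of the query terms in two hypothetical documents.
doc_a = {"wolf": 4, "niedersachsen": 2}
doc_b = {"wolf": 3, "niedersachsen": 3}

def raw_tf_score(query_terms, doc):
    """Score a document by summing the raw term frequencies of the query terms."""
    return sum(doc.get(t, 0) for t in query_terms)

query = ["wolf", "niedersachsen"]
print(raw_tf_score(query, doc_a), raw_tf_score(query, doc_b))  # 6 6 -> a tie
```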
Term frequency is obviously not enough. How can we weigh the contribution of a word? Let's think about each further appearance of a word: it appears once, twice, three, ten times.
Well, each further appearance shouldn't count as much as the first one. The first appearance is very important, the second also, the third less so, and the fourth should count even less. And that we can model with a logarithm. Here's an example of a logarithm curve.
There we see that the more often the word appears, the smaller the gain in its contribution grows. So additional occurrences, like a fourth one,
don't count as much: the step from two to three counts more than the step from three to four. So if we go back to our example, we would be able to make a calculation now, but we still couldn't decide such a case.
Well, we have four and two here, and two and four there. We'll have a logarithm of four plus a logarithm of two, and here also a logarithm of four and of two. That would give us an equal result, and we need a decision criterion again. Now we have to look at these two words,
Wolf and Niedersachsen. How are they different? And here we have the harmonic sum, which is an approximation of a logarithm. The first appearance counts fully; with the second, the total counts one plus one divided by two; with the third, one plus one divided by two
plus one divided by three, and so on and so on. So which words are better for retrieval? Well, let's look at the distribution of words and their frequency in a collection.
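The diminishing contribution of repeated occurrences, modeled above with a logarithm and the harmonic sum, can be sketched as follows; note that the harmonic sum only loosely tracks the logarithm and serves here as an intuition aid, not as the same curve:

```python
import math

def harmonic(n):
    """Harmonic sum: the first occurrence counts 1, the second adds 1/2,
    the third adds 1/3, and so on."""
    return sum(1.0 / k for k in range(1, n + 1))

# Both curves grow, but each additional occurrence contributes less.
for tf in (1, 2, 3, 4):
    print(tf, round(harmonic(tf), 3), round(math.log(tf + 1), 3))
```

The marginal gain shrinks: the step from 2 to 3 occurrences adds 1/3, the step from 3 to 4 only adds 1/4.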
And what words would be good words for modeling content representation? Words that appear very often, words that are middle frequency, and words that appear very little, very rarely in an entire collection.
And we have to look at a typical distribution. If we measure the frequency of words in a collection, we typically end up with a distribution that looks like this. Here words are ranked by frequency: the most frequent word first, then the second, the third, and so on. And here are the least frequent words,
and here we see their distribution. And this does not create a linear curve, but a curve that steeply decreases. So we have only a few words that are very, very frequent. We have some that are somewhere in the middle,
but appear much, much less frequent than the top words. And we have many words that have very few appearances. That's how language works. That's what we can see empirically. Now which of these words are good for retrieval?
Think about the very frequent words: they appear in almost every document, so they don't help us to distinguish relevant documents. Think about the rare words: yes, rare words are good for finding relevant documents, but we will not find many.
So the good resolving words are those of middle frequency. And we see that the less frequent words are, the better they basically are for content representation; we don't find a lot with them, but they are more specific.
And this leads us to another formula, another criterion for ranking: the inverse document frequency. This is how we distinguish between words. We look at the number of documents that are in the collection
and in how many of them the word, the term t, appears. We might have a billion documents; one word might appear in 100,000 of them, another in only 10. And this gives us different values for IDF, the inverse document frequency.
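Using the lecture's numbers (a collection of one billion documents, one term occurring in 100,000 documents and another in only 10), the IDF can be computed as log(N / df). The base-10 logarithm here is an arbitrary choice; any base only rescales the values:

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency: log of (collection size divided by the
    number of documents containing the term)."""
    return math.log10(n_docs / doc_freq)

N = 1_000_000_000          # one billion documents
print(idf(N, 100_000))     # frequent term -> 4.0
print(idf(N, 10))          # rare term     -> 8.0: rarer terms weigh more
```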
And again, we have a very steep decrease, so we use a logarithm. So a very often mentioned weighting scheme is TF-IDF. A weight for a term, for a word in a document,
is measured by logarithm of TF, term frequency. We will take the logarithm to smooth it, cut off the high impact of high values, and we multiply it with the logarithm of the IDF to see,
we have different words that have a different kind of usefulness for the retrieval, and that is expressed by the IDF. And again, we have the harmonic sum that can help us to get an idea of how a logarithm works.
If you have the logarithm of 4, a TF of 4 for example, we would be able to approximate it by calculating 1 plus 1 divided by 2 plus 1 divided by 3 plus 1 divided by 4, and adding up these contributions.
And we get a lower value than 4, obviously. And this shows us how the logarithm cuts off the effect of the high values. The third important issue for ranking is the document length.
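Putting the two factors together gives the TF-IDF weight. The exact formula varies between systems; the sketch below is one common variant, using log(1 + tf) as the damped term frequency so that the weight stays defined for tf = 0:

```python
import math

def tf_idf(tf, n_docs, doc_freq):
    """Weight of a term in a document: damped term frequency multiplied by
    inverse document frequency. One common variant, not the only one."""
    return math.log(1 + tf) * math.log10(n_docs / doc_freq)

N = 1_000_000_000
# Same TF, different rarity: the rarer term gets the higher weight.
print(tf_idf(4, N, 100_000))
print(tf_idf(4, N, 10))
```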
We can have two documents, each with a TF of 10, but, as I tried to show here, one can be very short and one can be very long. And now we have, of course, a problem. This term, within 100 words maybe,
is more specific for this document. So I probably want to find this document rather than a very long document, maybe with 1,000 words, where the word is scattered around and the percentage of words that are my search term is much lower.
On the other hand, a long document simply contains more information. So here we also have to make some sort of decision. Look at the simple example. We have a longer document and a shorter document, and the longer one will win simply because it has more words.
These are simple examples, right? But this would be a document with 18 words, and this a document with only five words. And here we have the contributions of one term. And here it is less specific;
it's probably more specific for document A. And here we can use length normalization. We simply calculate the length of the document, divide all the counts by the entire length, and we find the contribution of the term
relative to the document. So here we have another example, with lengths 5 and 15. And we would get a normalized weight by simply dividing all TF word counts, the term frequency counts, by the length that we have found.
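Length normalization as described, dividing each term frequency by the document length, can be sketched as follows; the lengths 5 and 15 follow the example, the TF value is invented:

```python
def normalized_tf(tf, doc_length):
    """Divide the raw term frequency by the document length,
    so long documents lose their built-in advantage."""
    return tf / doc_length

# The same raw count of 3 is more telling in a 5-word document
# than in a 15-word document.
print(normalized_tf(3, 5))   # 0.6
print(normalized_tf(3, 15))  # 0.2
```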
So before, the long document would be the one that we would find more often, simply because longer documents have more chances to be found: they have more words, and the words appear more often. They have a higher TF. We will find them more in the top positions.
Now we can eliminate that effect by length normalization, and then all documents would have a chance to be found. However, there's also a problem here. Consider this example. When we have very long documents,
we will end up with very low normalized term frequency values. That means short documents will now have higher values, because we divide their term frequencies by smaller numbers. That means now we have a bias toward short documents.
If we do a full length normalization, we will rather find short documents with a higher probability in the top ranks. So here the algorithms have to make some sort of decisions, have to make compromises, and there are ways to balance this out. However, there is no perfect solution for that.
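One standard way to balance the two biases, used for example in BM25-style weighting, is to normalize only partially via a slope parameter b between 0 (no normalization) and 1 (full normalization). This is a sketch of that compromise, not the lecture's own formula:

```python
def partial_norm_tf(tf, doc_len, avg_len, b=0.75):
    """Partially length-normalized term frequency:
    b = 0 -> no normalization, b = 1 -> full normalization by relative length.
    A document of average length is left unchanged."""
    return tf / ((1 - b) + b * doc_len / avg_len)

avg = 100
# A long document is penalized, but less than under full normalization;
# a short one is boosted, but less than under full normalization.
print(partial_norm_tf(10, 1000, avg))  # long document
print(partial_norm_tf(10, 20, avg))    # short document
print(partial_norm_tf(10, 100, avg))   # average-length document: unchanged
```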
This is one of the design decisions that information retrieval systems have to make. So we go back to our system. We had a brief overview on representation.
How are these values for the terms calculated? We do this with weighting. And we have three main factors: term frequency, inverse document frequency, and the document length. There are other topics in information retrieval.
We have representation. There's also models. How do we now find these documents? How do we calculate a ranking list based on these representations? And there's, of course, a user perspective. How do users interact with these systems? And we always need an evaluation for all our systems.
The length normalization, the stopword list, and many other factors, like the kind of logarithm used, are decisions that people make. And we don't know in advance which decisions lead to good results, so we always need some sort of evaluation.
Thank you for your attention.