
Introduction and fundamental notions (6.4.2011)


Formal Metadata

Title
Introduction and fundamental notions (6.4.2011)
Title of Series
Part Number
1
Number of Parts
13
Author
Contributors
License
CC Attribution - NonCommercial 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Producer
Production Year
2011
Production Place
Braunschweig

Content Metadata

Subject Area
Genre
Abstract
This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course.
Transcript: English (auto-generated)
Hello everyone, and welcome to the lecture series Information Retrieval and Web Search Engines. We will be teaching this course in English because we have a new international master's program, and in any case it's good for you to know some English for your future business life anyway,
because companies are getting more global, and it's likely that interdisciplinary teams will rather speak English than German. So that shouldn't be a problem for us; we should be more concerned with the topics of what we're going to do in this lecture.
And what I want to do today is basically give you a short introduction to what the course will be about: what topics will we cover, why is it important that we cover those topics, and how do these topics compare to whatever else is out there in databases and information retrieval.
So I very often will use the term information retrieval during this course, obviously. So let me tell you what information retrieval actually is.
There are a couple of definitions from different sources. For example here, this excellent book by Christopher Manning and others, which is kind of like one of the textbooks for the course. We don't have a single textbook that we follow too stringently,
but we'll show you a couple of books that might be helpful for one topic or the other, and that one is a very good introduction. And Christopher Manning says that IR is finding material, usually documents, of an unstructured nature, usually text,
that satisfies an information need from within large collections usually stored on computers. So we find a lot of terms here. Finding: we look for something, and we also want to find it. Unstructured: it's not like in databases, where we have tabular data and can just write a select-from-where query,
but it's text documents, it's something like that, and I have to find information within these pages, within these texts. That satisfies an information need. What does that mean?
Well, an information need means we have a question when we look for something, a task to fulfill. It can be something very abstract that we don't know how to solve yet, or it can be something very clear. I want to know the birth date of some person: a very clear task.
I want to know about global warming: a very open task. What does that mean? What is a good definition of global warming? What are documents that are interesting with respect to global warming? Are there different opinions, different biases in the documents? That is what characterizes an information need: a very vague, very open concept.
And the last point is large collections, and large in this case means huge. Information retrieval has grown from collections of a couple of pages to the terabytes
and petabytes of documents that are in existence today. Let's look at Wikipedia, always good for definitions. IR is the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web.
Okay, that sounds somehow different, doesn't it? So we don't find, we search. That's always a good distinction to make. For documents or information within documents, so we can either address a whole book,
like: I want to have this book, does it help me? Well, not in particular. But if I want to know how theorem one goes or something like that, then I have to look inside the book. So: the whole document, information within documents, and metadata about documents.
Ooh, metadata, what is metadata? Any guesses? Exactly, so it's information about information on a meta level.
So, this book is about information retrieval; that alone doesn't help me very much. If I know the topics that are covered, that might help me a little bit more. Also, if I know whether the author is an expert in the field, or some guy next door who just wanted to get his book published,
that's metadata, that's information about the information that is contained in this book. I can judge the quality from it, and that is what is needed. So this is a very interesting thing. Well, as to searching relational databases, that is a little bit questionable here because usually metadata can be exposed in relational databases.
Unstructured documents are usually not exposed in relational databases. More likely, the World Wide Web. This is one of the largest document collections currently in existence.
And let's look at Merriam-Webster, also very nice for definitions. IR: the techniques of storing and recovering, and often disseminating, recorded data, especially through the use of a computerized system. That seems to be totally different from the others, doesn't it?
So one wants to find, the other wants to search, and this one wants to store and recover. And disseminate: that sounds like a European Community proposal; there's always a work package about dissemination.
All these are aspects of IR, different aspects. Of course, when searching for something, I use indexes. Indexes need the storage of things. On the other hand, indexes need metadata to organize things.
So all of these are correct. There is no one definition of information retrieval that satisfies all the needs. But they all raise important and interesting points. That is the thing here.
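To make the index idea concrete, here is a minimal sketch (my own illustration, not code from the lecture) of the simplest such structure, an inverted index: map every term to the set of documents that contain it. The toy documents and the whitespace tokenization are invented for this example.

```python
# Minimal inverted-index sketch (illustration only, not the lecture's code):
# map each term to the set of IDs of the documents containing it.
from collections import defaultdict

docs = {  # toy collection, invented for this example
    1: "the semantic web extends the world wide web",
    2: "information retrieval finds relevant documents",
    3: "web search engines index the world wide web",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():      # naive whitespace tokenization
        index[term].add(doc_id)

# A lookup is now a dictionary access instead of a scan over all texts:
print(sorted(index["web"]))        # -> [1, 3]
```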
Ah, my friend, the storage engine, yes. So this is about storing, recovering, and disseminating the lecture. That will happen from time to time, sometimes more, sometimes less. I don't know how to suppress it; it's just what PowerPoint likes to do. Yes, here we go. So, another definition is: IR is the part of computer science
which studies the retrieval of information. So information retrieval is the part of computer science that studies the retrieval of information. That's kind of circular, but very cleverly written: information, not data. Well, this makes it much clearer.
From a collection of written documents; the retrieved documents aim at satisfying a user information need. Aha, we had that before. And: usually expressed in natural language. That's also interesting. So it's not structured as in numbers or names or addresses or something like that,
but rather: you tell me something and I have to derive, to sift out, the information from what you tell me. That's the natural language part. However, that's about the only interesting part this definition offers.
So we can kind of come to a conclusion. Information retrieval is about documents, about unstructured data, about text, about large collections. This is something that is very, very, very prominent in information retrieval.
We have to have an information need. So users addressing an information retrieval engine will have at least a vague idea of what they're going to do or what they're trying to do. It is about storing documents, searching for documents, finding documents,
and it has something to do with the World Wide Web rather than relational databases. Okay? Good. If we look at the relational database world: many of you have heard of Relational Databases 1?
Who has attended Relational Databases 1? Not so many, actually. Interesting. But most of you are familiar with relational databases and relational database techniques. Yeah? Good.
So where's the difference? A relational database looks like this. You have an SQL statement, okay? SELECT id FROM document WHERE title LIKE '%Semantic Web%', or something like that, and you get an answer that shows you that the document with ID 45,000-something
is about the Semantic Web. It has something to do with the title: the title, matched against that pattern, contains the words Semantic Web.
Okay? What does it do? A database is a structured way of storing information. In a database, I know exactly what the title field is, because I have an attribute defined saying: this is the title of a book,
of a document, of whatever. Okay? And when I look for something, for example all the books about the Semantic Web, I can build a regular expression saying: Semantic Web should occur somewhere in the title. Then I get a well-defined set of books
where Semantic Web occurs in the title. I don't know if it's at the beginning of the title or the end, whether it's an important part of the title, or whether the book is about the Semantic Web but the term doesn't occur in the title at all. In the latter case, I don't retrieve this book.
Still, it might be relevant, mightn't it? Even without having Semantic Web in the title. And that is what IR, in a way, does better. What we do is retrieve all objects that are relevant to this information need.
And the notion of relevance is one of the notions that will stay with us for the rest of the course. What is relevant, what is not. It's not like in relational databases where you say, okay, the ID has to be 14 and that's it.
That's well-defined. Here we have a vague information need and a vague description of the document. Nothing is well-defined. We just have this somewhat strange notion of relevance: this is more relevant, this is less relevant.
This is somehow relevant. There are a lot of things we have to consider. And that is also reflected in the queries that we pose. It's not select-from-where, like in typical SQL; it's free text.
We just say it should be relevant with respect to Semantic Web. In any way. May occur in the title, may occur in the abstract. The book should have some insights on the topic, whatever.
And what we get is a result list, whereas what we get in the SQL style is a well-defined result set. We know these result lists, for example, from Google: if I type in Semantic Web, I get a result list ordered by relevance,
and the relevance is somehow determined by Google. They have a lot of funky algorithms, and we will revisit some of them during this lecture to see how they actually do it, to see what's inside a web search engine. Very, very exciting. And here we get articles like the Wikipedia entry
for the Semantic Web or the World Wide Web Consortium has a website about current Semantic Web approaches or activities that they're supporting. And those are somehow relevant. In what way? Google does not tell us.
Even this ranking is totally opaque. We don't know what made Google return the Wikipedia entry as the first item. We can only guess.
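To make this contrast concrete in code, here is a small sketch; the table, the document IDs, and the scoring function are all invented for illustration, and the toy score has nothing to do with what Google actually does. The point is only that the SQL query yields a well-defined set of exact matches, while free-text search yields a list ranked by some notion of relevance:

```python
import sqlite3

# The database way: a well-defined result set via structured matching.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER, title TEXT)")
conn.executemany("INSERT INTO document VALUES (?, ?)", [
    (45000, "A Semantic Web Primer"),
    (45001, "Ontologies for the Semantic Web"),
    (45002, "Linked Data in Practice"),   # relevant, but no title match
])
rows = conn.execute(
    "SELECT id FROM document WHERE title LIKE '%Semantic Web%'").fetchall()
print([r[0] for r in rows])               # -> [45000, 45001]; 45002 is missed

# The IR way: a ranked result list via a (deliberately crude) relevance score.
def score(query, text):
    """Toy score: fraction of query terms occurring in the text."""
    terms = query.lower().split()
    words = text.lower().split()
    return sum(t in words for t in terms) / len(terms)

texts = {  # pretend we indexed title plus abstract
    45000: "a semantic web primer",
    45001: "ontologies for the semantic web",
    45002: "linked data in practice builds on semantic technologies",
}
ranked = sorted(texts, key=lambda d: score("semantic web", texts[d]),
                reverse=True)
print(ranked)                              # -> [45000, 45001, 45002]
```

Note that the third document still shows up in the ranked list, just further down, while the exact-match query loses it entirely.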
Okay. The second part of our lecture will be web search, information retrieval and web search. And web search is kind of a very similar problem because from the early days on, the web was just seen as, well, this is a large repository of documents
that is distributed over multiple servers. So basically the problem of getting relevant documents from any of these servers is an information retrieval problem and since it's distributed, it's distributed information retrieval. And that was how it was seen for quite a long time.
Recently there have been novel approaches, which we will also go into, that change this paradigm a little bit. But basically, what it comes down to is a kind of distributed information retrieval.
However, there are some differences from classic information retrieval, and a very important one is the link structure of the web. We have the hyperlink structure: whenever there is an item that has something to do with some other item,
I may put a link between those two items. Well, if you want to read something more about the topic, click here. And this leads me to a document that has a higher level of detail, for example.
Or: I don't want to go into this here, but if you want to know about it, click here; it's on a related topic. This leads me to a document that is somehow related, probably not more detailed, but a little bit off the original topic. Or it could be what you do in books with a glossary.
Oh, if you don't know this term, click here for a definition. Links are used for that, too. And except for the glossary or the footnotes in books, we don't have anything like that in normal documents.
This was a very clever idea. And this link structure, of course, can and should be exploited when looking for web documents. So web search is slightly different from just being distributed information retrieval.
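As a first taste of how the link structure can be exploited, here is a deliberately naive sketch over an invented four-page graph: count how many pages link to a page and treat that as a popularity signal. Real algorithms such as PageRank, which we will meet later in the course, are far more refined than this.

```python
# Naive link-analysis sketch: in-link counts as a popularity signal.
# The four-page graph is invented; real algorithms (e.g. PageRank) go further.
links = {                       # page -> pages it links to
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": [],
    "d.html": ["b.html", "c.html"],
}

in_links = {page: 0 for page in links}
for targets in links.values():
    for target in targets:
        in_links[target] += 1

# c.html collects the most in-links, so this naive signal ranks it first.
print(sorted(in_links, key=in_links.get, reverse=True))
# -> ['c.html', 'b.html', 'a.html', 'd.html']
```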
The same goes for the way the stuff is stored. If we have a library, it collects data and documents in a very structured way: everything is indexed and you know where it is. On the web, that is somehow difficult, because it's all at the whim of the web server owners.
You take down web pages, you put new pages up, you change them, you update them. You do a lot of things that introduce a large degree of flexibility
and dynamics into what you do with web documents, which you probably would not do with a normal document collection, because changing a book, making a new edition, is a lot of work. Changing something on your website is easy and can be done in minutes.
On the other hand, consider how many people use the library and how many people use the web. The web is used all over the world for all kinds of purposes. It's currently the single preeminent information source throughout the world.
Actually, when I went to university in the US, one of my first questions was: where's the library for computer science? And the guy looked at me and said: the library? I know we have one, but I really can't remember.
Is it with the math library or with the engineering library? It should be somewhere over there. I haven't visited for some time now, because you get everything from the web, everything that is new, everything that is cool. You don't go to libraries anymore.
Totally different with other classes, with other disciplines. If you're in history or something, the library will still be the first point of entry. Or if you're in the humanities department, they do a lot of library work.
In computer science, probably less. Well, and then we have a big friend: spam. Usually, before somebody writes a book like that and gets it indexed by a library, they invest some work. And they will make sure that it's very interesting for people to read,
because otherwise nobody will buy it, nobody will print a new edition, and you won't find a publisher at all. It's a lot of work, so you should have something to say. The web is a totally different area, because putting up a website doesn't require much work.
So you can put up the most nonsensical websites you can imagine, or you can try to sell some products on the side, which you probably wouldn't do in books. And this introduces a lot of noise,
because if you're selling something, or if you just want to poke fun at people, you want people to visit your website. And of course there are a lot of tricks for getting people to visit your website by just pretending there is information that they might want to know
for satisfying their information need. Interesting, isn't it? We will review some of these techniques in this course, too. Good. Why should you know about all this? What's in it for you? Well, the point is that it is getting more and more important
to deal with this unstructured information in daily business life. So Gartner Group, for example, says that 80% of business is conducted on unstructured information. 85% of all data stored is held in an unstructured format.
7 million web pages are being added every day. Or the Butler Group, unstructured data doubles every three months. And if you imagine the typical cases, you work for some insurance company. How is an insurance claim filed? In a very unstructured way.
Even if there is a web form, which there hardly ever is, you have to describe what actually happened: well, you know, I wasn't paying attention, and then I dropped the Ming vase, and now the owner is kind of pissed.
You write things like that to get an insurance claim through. Well, what do you do with it? Are you storing it in a database? Can you? What is the idea? What is the structure behind it? So in the database you would have some columns like, oh, the item that was destroyed, well, a Ming vase.
The value of the item that was destroyed? Well, it doesn't say. So you would have to extract the relevant information from this insurance claim manually. That's not going to happen, because it's just too much work.
So you're still left with files and files of insurance claims; you just go through them and basically add some metadata. And on the other hand, this data doubles every three months.
Kind of interesting, isn't it? That's exponential: doubling every three months means a sixteenfold increase every year. But is it really information? There's actually a big discussion going on about whether the data doubles or whether the information contained in the data doubles. A bit difficult.
Do I just get more of the same, or do I really get new things, new insights, new knowledge? Probably that is growing at a much slower rate than the actual number of publications, the actual amount of data that is produced every day. It has very often been compared to trying to drink from a fire hydrant.
So you have this huge gush of water coming at you, and you can only sip gently from it, and you need to get exactly the sips that are interesting for you. To a fire hydrant, that doesn't matter: it's all water.
But looking at what new information is published, most of it is totally irrelevant to what you're interested in. Finding those small bits and pieces that might be interesting is very valuable and actually has some business value.
So it's very valuable for companies to have people, analysts, who detect exactly the right information, who detect where fraud is involved, who detect patterns in the information.
And those are very highly paid professionals. So if you pay attention, you will learn all about that and can, at the end of the course, offer yourself as an analyst to the top information gathering companies. Cool, isn't it?
Yeah, well, the top information gathering companies, here they are. Google: number one employer in the US, voted best employer in the whole US for, I think, the 10th or 12th time in a row.
And they do quite interesting things. I mean, one can say about Google what one likes, and there are many controversial topics, like Google Street View and you name it. But still, they are good and they know what they are doing.
It's a player to be reckoned with. The same goes for Yahoo in America: a lot more market share there than it has here in Europe, especially in Germany. In Germany, Google has the highest market share, I think around 93% or something like that.
So that picture looks a little different in other countries; it's definitely a difficult market. Bing is the new Microsoft engine, the follow-up to MSN Search, which never really took off. There are also smaller and more specialized companies: FAST, for example, or Ask.com for question answering tasks.
And if you see the smile on the faces of these men: does anybody know who they are? Any idea? They are the Google founders. This is Larry Page and Sergey Brin, exactly.
And they are young, intelligent, and successful. You are young and intelligent, and this course will probably make you successful. And then you will have bundles of money to go street-viewing or something. That is basically very good.
Of course, it shouldn't be about that: the pursuit of knowledge is what you should be interested in, not so much the money that is in it. But it can't hurt; it's a nice thing to know. So, some organizational issues. We have 13 lectures in this course.
And we integrate the exercises into the lecture. So it's kind of a two-plus-one lecture, but we will integrate the exercises in what we call detours. The lecture will be Wednesdays, 10 o'clock, obviously.
Okay, Wednesdays, 10 o'clock till 12:20, including a five-minute break. Why is it doing that? Nobody knows. Anyway. And we will have a final exam
for both bachelor and master students. How many bachelors do we have? One? Masters? Diploma? Haha, cool. Good. So, there will be an oral exam, except for diploma guys
who do the diploma exam at some point. An oral exam at the end of the course. Homework: there will be homework exercises published every week, but we can't grade them anymore. So, it used to be the idea
that you should try to get 50% of the homework points to prepare yourself for the exam. And this is what I recommend. But we can't do it with the current guidelines for examination from the ministry.
So, we can't make it mandatory; we have to do it on a voluntary basis. So, if you do it, very good. We will show the solutions and discuss them with you at the beginning of each lecture. But if you don't do it, I mean, I can't tell you to do it.
It's your own responsibility. In any case, I've seen a very large degree of correlation between regularly doing homework and assignments and passing the oral exam. Because if you don't do your homework,
you will have to read through all the lectures at the end of the term and say: oh my god, it's so much stuff. I see this in oral exams quite a lot. And then I'm not pleased, and you will not be pleased afterwards. And that doesn't help us.
Good. Sometimes we also have practical exercises where you have to program something, try some algorithms on existing corpora. The idea is that by playing around, fiddling with the parameters of an algorithm,
you will get to know how it actually works, what it does. And that's a very important thing. So, I would also recommend to do that. The interesting thing, of course, is the final exam because that will give you your grades. And I always have to say that preparing for an oral exam
is somewhat different from preparing for a written exam. Who has ever done oral exams? Uh-huh, quite a lot. Most people will agree that in an oral exam, it's very important to talk about topics,
to be able to talk about topics. And how can you talk about topics if you've never talked about them before? So, just doing the exercises and reading the slides does not prepare you very well for an oral exam. You should talk about the topics. You should discuss the topics.
You should explain the algorithms. Only if you can explain something have you really understood what it does. So, form small groups: two people, three people, four people, it doesn't really matter. But form groups for learning, for discussing with each other, and that will be the best preparation for your exam.
And then the rest is kind of a piece of cake. Easy. And, of course, do the homework exercises. They are specifically designed to help you understand some important point about the material. And the more you understand,
the less you have to learn in the end. And we know: at the end of the semester, every lecture you have has an exam. So you will have all the exams at once, and that's very annoying if you didn't work during the term. Good. The main volume for this course is this one here:
Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval. It's also available online, which is a very good thing:
a very good starting point if you can't follow any of the algorithms, or want some more detailed knowledge, or want to see some more. The second book that I always like is Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.
Ricardo Baeza-Yates is actually the head of Yahoo's European research labs in Barcelona. So he knows what he's talking about, because he has tried it firsthand in industry. This book gives a rather practical perspective on things.
A rather different take is Richard Belew's Finding Out About, a cognitive perspective on search engine technology and the World Wide Web.
Also a nice book, with a different way of tackling the problem, more from a user's perspective. Especially the notions of what is relevant and what an information need is are covered very well in this book. And there's of course the classic book,
Keith van Rijsbergen, Information Retrieval. Ah, you see the date: 1979. And now you ask, why is this guy recommending a book that is, what, 30 years old? Because most of the old algorithms for information retrieval are still in place.
They have been adapted, they have been worked on, and there have been some entirely new things. But most of the old things remain, and the heyday of information retrieval research was basically the 60s and 70s.
Some of the big breakthroughs that were made during that time are covered in this book, very well explained and from a foundational perspective. And this one is also available online, so you can just have a look and read up on some of the algorithms.
Very interesting to see. Good. What will we be doing? Today we're doing the introduction, some fundamental notions of what we want to do. But then we will hop right into the thing: retrieval models. We will talk about fuzzy retrieval, coordination level matching, the vector space model, probabilistic retrieval models,
indexes to speed up the retrieval, latent semantic indexing, which is a factorization technique, the clustering of related documents, language models, the evaluation of retrieval models, and relevance feedback, that is, how the user in the loop can help the retrieval process.
Then the classification of documents: relevant versus non-relevant is the simplest classification, one that every retrieval paradigm has to deal with, but there might be different views or biases that are interesting to classify as well.
We'll talk about some machine learning techniques, especially support vector machines. And in the last couple of lectures, we will go from the basic retrieval techniques to web retrieval techniques. That is quite natural, because we can directly transfer some of the techniques used for distributed information retrieval to the web.
We just have to enhance them with some important things; web crawling and link analysis, for example, are typical problems there. And then we'll have a catch-all lecture for miscellaneous topics, where we'll do a little bit of spam detection and how to improve quality, the remains of the day, so to say.
This is the plan for this semester. Good. So this was my preamble. Let's start basically with the lecture. Are there any questions? Yes.
You may also form groups to do the homework, yes. You don't have to.
So as for this course, this is a very interactive course.
If you have any problems, if you don't understand anything, just raise your hand, tell us about it. We'll directly interact here. Yes, please.
That's kind of difficult for me because I do have obligations before that, so 10 o'clock is the only way I can make it on time. Okay. Other comments, other questions? Good.
Then we will start right into the thing with a brief history of libraries, information retrieval, and web search, and then go on to have some fundamental notions and show the first, yeah, we will actually introduce the first IR model today. A very simple one, but still. So it all goes back to the land of Sumer.
That was around two to three thousand years before Christ, and it is the first recorded instance of a library being built, of people collecting information. In this case, it is known to have comprised about 25,000 clay tablets
that were stored in a temple; religion and science were intertwined for quite a long time. And this is the first recorded instance of a collection that big.
25,000 tablets, that is quite a number of things, and what's interesting is what they contained. It wasn't really science or philosophy or art or something like that,
but rather practical things: records and inventories of foodstuffs and commercial transactions, treaties, contracts, that kind of stuff. So it was the first instance of written law, in a way,
and we know the stela of Hammurabi, one of the kings a little bit later. That is actually one of the first written legal texts that everybody could point at and say: no, look, it says here what I am entitled to.
So this is the first record, and the same happened a little while after that in ancient Egypt. Here, too, the temples collected information, collected knowledge,
and at some point, around 300 before Christ, the Egyptians, or at that time it was actually the Greeks in a way, had a wonderful idea: what if we put all the knowledge in the world together in one place?
This was the Hellenistic age, you know: Alexander the Great was conquering half the world, and he was surrounded by very clever people.
He was actually tutored by Aristotle, one of the great Greek logicians, and he had very many people around him interested in science and technology. And so he said: well, Egypt is the land of wisdom,
why not choose Alexandria, a port city in Egypt, as the place where every piece of knowledge, every scrap that has ever been written, has a copy? And so, at its height, the library held some 750,000 scrolls.
For 300 before Christ, that's unimaginable. That's definitely a feat worthy of Alexander the Great. Going on, it's kind of interesting that the great Library of Alexandria was destroyed in Roman times;
there were several fires that destroyed some of the scrolls, and most of the texts that were collected in Alexandria are lost today.
Some of the texts were rescued, however. For example, many of the Greek philosophers' large volumes remain, because they were copied over time after the fall of the Roman Empire, usually in monastic libraries. However, the monks were very selective. They didn't copy everything, but only the great works and the authors
that were in tune with Christian ideas; many of the ancient heathen texts, the pagan texts, are lost forever, because they were simply not copied.
But some of them were saved, and that is good. It was very important work to save them, and very strenuous manual work too, because what you basically had to do was copy them by hand. So there were scriptoria, which were basically part of a monastery,
part of a cloister, where monks who could read and write, a skill that was not mandatory for all monks, since not too many people were actually able to read and write well, were copying some of the volumes and exchanging them
between the different libraries. So this was basically the first time the information web became distributed. The old idea was: let's carry everything off to Alexandria, and there it is.
After the first fires, people learned: well, probably not such a clever idea. At least we should have some other places with copies of the things. And so there was a kind of network of monasteries during the Middle Ages, where you also had some central points, some focal points. For example, we see the Bibliotheca Palatina at the University of Heidelberg,
we see the library of the Sorbonne in Paris, and one of the biggest was, of course, the Vatican Library, founded in 1475, but in fact containing much older collections.
For example, the Bibliotheca Palatina, or what's left of it, is also a part of the Vatican Library today. And it all took off in 1450 when Johannes Gutenberg invented a wonderful thing. Well, it's not totally correct what's written here.
He did not invent the printing press; the printing press is a technique that was already known in ancient Egypt. But what you basically had to do was carve out a wooden plate
and then print your books from it. And after a couple of books, the wooden block was worn out and you had to carve a new one. So Gutenberg's actual invention was movable type.
You took lead and cast types from it, single letters. And he had a way of putting those types together in a certain order and building a printing block from that.
Then he used it for printing, and when it was worn out after a couple of prints, you just melted down the lead and cast new letters from it. You re-formed the letters, put them again in the same order, and started printing over.
And since it was metal, it was somewhat more durable than the wooden printing blocks, and you could print more copies, easier and at little cost. So the copying of books became less expensive,
and it became possible even for private collectors to own books, which had been a very expensive thing before, because you actually needed monks to copy the books by hand and illuminate them; they usually painted beautiful pictures in them.
If you want to see some of these handwritten books, it's a very good idea to go to Wolfenbüttel, which is like 10 or 15 kilometers down the road from here, and see the old library there. They have one of the most beautiful handwritten volumes of the Middle Ages on display.
It's a very good thing to see. Good! Well, in comes the German National Library. At some point, people saw that collecting all this information is basically something every state should do.
Every country has to have a library system to educate its people, because knowledge is power, and the more knowledgeable engineers and scientists and highly skilled workers you have, the more money you can make and the better your economy flourishes. So it was considered basically a governmental duty
to collect knowledge and information and to hand it to your citizens. This is why many of the national libraries were founded. So the German National Library is distributed between Leipzig, Frankfurt and Berlin
with the biggest part actually in Frankfurt. And today it covers over 25 million items. The biggest library of that kind is the Library of Congress, the national library of the US, with 150 million items.
And it's not only the number that is interesting, it's the growth that is interesting. Because it now contains about 150 million items, but since 2009, 20 million new items were added.
Growing at that rate is incredible. Being able to handle that growth is impossible, at least manually. With some automatic means, with some computer science,
you can lighten the load. You cannot solve the problem really. And it's getting worse. So according to the Guinness Book of Records, this is actually the world's largest library. And for indexing all these items,
it introduced a classification system, an indexing system, the Library of Congress Classification, which you will find in the front pages of any book that is worth something. And there's a certain classification.
Where is it? So down here is the interesting part: the topics that are covered in the book, shown in a numerical system. So for example here it says: text processing (computer science), information retrieval, document clustering, semantic web,
and the numbers for the authors, and some weird numbers that you can't really recognize here, but that do have a meaning in this classification system, just for finding, just for indexing the information that is given. And here's the interesting point.
The items are usually cataloged by the metadata, and that's a very important topic, a very important term. Metadata is information about information. It describes the information at hand. It does not refer completely to the text that is there, but it says, who are the authors?
Who edited the volume? How can it be addressed? So you usually have the ISBN number for books to identify them. Recently you have the digital object identifiers, the DOI system where you register objects that cannot only be books,
but also videos or experimental sets, or any data set that you might find can have one of these identifiers. Apart from the information that basically describes the book, you also have the very interesting information of what it is about,
what is the content of the document. So for example here, information retrieval. And because content can differ between areas, or can be seen from a certain point of view, you have to classify what is actually meant.
So the subject area, for example information systems or computer science, further classifies the term information retrieval. There might be a discipline in chemistry called information retrieval; it would probably be totally different from the information retrieval
we use here in computer science. And to make it clearer, that is also given. And then you have the specialized classification systems, for example here, the Library of Congress. This is what I just showed you in the front matter of the book. It says the title, who wrote it, and when it was released.
It has 482 pages. It's about document processing, information retrieval, query processing, text processing as in computer science, information retrieval, document clustering, semantic web.
The areas covered are databases and information systems. These are the identifiers, the numerical identifiers, that can be resolved using the Library of Congress system. And so on, some of the links. And then you might have some information about the library
that contains the book, for example, can it be borrowed or is it kept, you know, and stuff like that so that you can actually find it, the physical copy of the book.
All this information is made available not only to users but also to computer systems. So it's kind of, in a way, it's machine readable or to some degree machine understandable. That's a very important point for handling the information flood
because a human can't handle it anymore; a computer still can, to some degree. And of course you see that the metadata is the primary handle to the book. I can tell you about the book and say it's a good book
and you should read it. That's kind of a personal opinion of the book. But when you have to describe the book, when you have to find the book, you need some more information than that. And as soon as there's low quality metadata,
somehow somebody has not said this is about information retrieval but, well, maybe made a misspelling, or said it's about information searching, then the whole thing gets lost. And there's one of the sayings, one of the proverbial sayings from librarians:
once you take a book out of its shelf and just randomly put it somewhere else, it will never be found again. There's no chance, because you would have to go through all the volumes in the library, and actually libraries do that once a year.
They kind of clear out all the bookshelves and renew the ordering, because otherwise they would not know where half of their books are, because somebody just took them out and put them somewhere else. And you'd have no chance of finding them then. So this metadata is very important for finding things.
And I want to show you an example of how that is done, with respect to the topic of medicine. This is a detour of our lecture, clearly indicated here. These will be the parts where we show you some more practical examples
of how the things we just discussed are usually done. And I want to present the MEDLINE database. MEDLINE is an acronym for Medical Literature Analysis and Retrieval System Online. The idea is that all literature, all knowledge published in medicine,
all scientific results are collected in a large digital database and each document is manually classified by a lot of experts according to some very complicated and sophisticated classification system.
So the idea is that, if you are doing research in the life sciences or medicine, you can now search systematically, exactly on the topics you are currently working on. And the classification system used by MEDLINE is so complex and gets updated every year, so you really can rely on it:
you can find a comprehensive overview of the current state of knowledge in your particular part of the discipline. So MEDLINE is funded by the US government, to be more specific by the National Library of Medicine,
and currently has about 18 million records from 5,000 publications from the field of medicine. So all journals, books and conference proceedings that are published in this field are delivered to the National Library of Medicine
and there are hundreds of experts, literally hundreds of experts, reading this material and classifying it according to the classification systems. So usually, when you want to work at MEDLINE, you at least have to have a master's degree in some life-science-related field,
or better a PhD. Then you usually get trained for at least half a year in how to use the MEDLINE classification systems, as they are so complex and there are very clear rules for how specific documents have to be classified according to the system.
And it's really important to them that classification is consistent and done correctly across different experts, because this is the only way to find the information later. So as I said, each document is manually indexed, and usually 12 keyword terms are assigned to each document.
So it's not just putting one word on them, like 'this is a document about medicine'; it's usually very, very specific, on a very detailed level. The whole database is available online, and we can take a look at it.
Just to let you get an impression of how this works. So basically, MEDLINE is an approach to transfer classical library ideas into the digital age. So let me make this
maybe a bit smaller, and there you can search for, I don't know, swine flu. Yeah, this is possibly related, and I think the problem
why we don't see swine flu in the title is just that scientists would never use the colloquial term. I think the correct term is somehow related to H1N1 or H1N-something. There's actually a whole class of viruses here, and
you also get some hints on what possible query refinements could be, and these hints are generated from the classification system. So they have the term swine flu stored, and then they have a label associated to it: swine flu really is called
H1N1 or something like this, and then you get these recommendations. So, here are a lot of documents; let's take a little look at this one. And here you can see where this document was initially published, so I guess this would be some journal, when it was published, all the metadata we just saw
for the classical library: the title, the authors, the abstract, and the classification. I think this is what 'in process' means: it has not been performed yet. Otherwise we would have seen some more detailed
classification information. Let's see whether I can find a document that's a bit older than this one, which has probably been processed already, or let's try a different search term. 2010 is also
'in process'. So obviously they currently have a lot of problems indexing all this data quickly enough. So maybe I can find a way to search for older publications.
Advanced search, maybe. It's not very easy to use. Managed filters, maybe. I guess for
advanced searching I'd need to pay a lot of money, which I do not want to do right now. Maybe age prevention, I have no idea. Supplied by publisher.
Ahead of print, also very new. Usually there should be some terms associated to each document, and as they sort by date, maybe we should look at the last page. 1952,
indexed for MEDLINE. Yes, this should be what we are looking for: MeSH terms. MeSH is the vocabulary used by MEDLINE, and currently there are two terms associated. No... because this is a quite old document, maybe they used far fewer terms at the time
of indexing than they do today. So it's foot diseases somehow. We can look it up in MeSH, and we will see a description of this term: any disorders
that are related to the foot. And here you can also see where this term, foot diseases, is classified in the hierarchical order of the MeSH classification system. So it is in the diseases category,
and within the diseases category it falls into some diseases class, and then foot diseases, and there could be different kinds of foot diseases that can be distinguished in MeSH. It can also appear in some other categories, so usually each keyword
may occur in different branches of the tree, and all is related somehow, and this tree is really, really large. So this is the top level, and you can open the humanities category, and then you will see all subtopics of
the humanities, occultism for example. Here again are subtopics, and occultism also occurs in a different branch of the classification system. So this is how MeSH usually looks.
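Just to make the structure tangible, here is a minimal Python sketch of such a hierarchy. The tree codes and paths are made up for illustration, not real MeSH data; the point is only that one heading can sit in several branches:

```python
# Toy MeSH-like hierarchy: tree position -> path of headings.
# Codes and paths are invented stand-ins, not real MeSH tree numbers.
tree = {
    "C05.360": ("Diseases", "Musculoskeletal Diseases", "Foot Diseases"),
    "C17.800": ("Diseases", "Skin Diseases", "Foot Diseases"),
    "K01.556": ("Humanities", "Occultism"),
}

def positions(heading):
    """All tree positions whose path ends in the given heading."""
    return [code for code, path in tree.items() if path[-1] == heading]

print(positions("Foot Diseases"))  # ['C05.360', 'C17.800'] -> two branches
```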
So MeSH means Medical Subject Headings; this is the controlled vocabulary we just saw, and currently there are about 25,000 individual keywords in this large tree. So now it isn't surprising that you usually need a lot of training before you become an expert at classifying
MEDLINE documents. You really need to understand how this system works, what is really meant by each individual keyword, and how to assign it to documents in a consistent way. So as you have seen, it is hierarchical, and in addition to these 25,000 base headings,
let's call them that, there are 140,000 additional words that are synonyms, for example. So swine flu would be a typical synonym that occurs somewhere in the database and is associated with the correct medical term
for swine flu. So H1N1, for example, maybe would be a subject heading, and each subject heading has a lot of explanations and additional keywords like swine flu. MeSH gets updated every year.
And there's this browser, I'm not sure whether it's the same one we just saw. Hmm, interesting, let's use the MeSH browser.
See, again you can search in all of the concepts,
and when I'm now searching for swine flu, I hope they have indexed it, then they have here the correct index term. The correct index term would be influenza A virus, H1N1 subtype; swine-origin influenza. And yeah, this is the location
in the tree, and there are two positions in the tree where you can find this heading. There is an annotation of this heading, what is really meant by it, and some additional descriptions. So maintaining this one MeSH heading alone is a lot of work, and as I said, it gets updated every year. So swine flu was added to MeSH in 2005
and has persisted in the tree since then. There are also terms that get removed at some point. Or here you can see that until 2005 there only was the index term influenza A
virus, and in 2005 they decided to split it up into a more sophisticated distinction between different kinds of influenza A viruses. So now it's a special category, because that was the time when swine flu first
occurred in the public. So this is MEDLINE and MeSH: a lot of manual work, but this way you are really able to find what you're looking for in a comprehensive way. And since medical science is a huge field,
it's very hard to keep track of it. If you're a medical scientist, and usually if you're working in this field, you have a subscription to MEDLINE and MeSH, and you can use this data to find what you are looking for. Another approach to using metadata is the Dublin Core Metadata Initiative.
That's a rather general framework for annotating all kinds of resources: web pages, books, cars, buildings, whatever. It's a relatively general approach to doing that, and Dublin Core basically also defines a way of
storing this data in a structured format. Currently it's an ISO standard, and Dublin, that's not Dublin in Ireland but Dublin, Ohio, where the workshop was held at which this metadata set was defined. So there are 15 metadata elements:
the title of the resource, who created the resource, the subject of the resource, and other things, up to who currently holds the rights on using this resource or what subject areas are covered by this resource. And this is a rather general way of describing it.
Dublin Core metadata can be stored in so-called RDF syntax; sometimes you also find Dublin Core metadata at the beginning of individual web pages, in the head element of an HTML page,
and usually it's stored like this: you have just the document or the resource, and then for each of the categories I just showed, there can be values provided. So one title, two creators, one description, and so on and so on.
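To make that concrete, here is a minimal sketch of such a record in Python; the element names are from the standard Dublin Core element set, but all the values are made up:

```python
# One resource described by (a subset of) the 15 Dublin Core elements;
# each element may carry several values. All values here are invented.
dc_record = {
    "title":       ["Some Book on Information Retrieval"],
    "creator":     ["Author One", "Author Two"],      # two creators
    "subject":     ["information retrieval"],
    "description": ["An introductory textbook."],     # one description
    "date":        ["2011"],
    "language":    ["en"],
    "rights":      ["All rights reserved."],
}

for element, values in dc_record.items():
    for value in values:
        print(f"dc.{element}: {value}")
```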
Really easy, and this is the way metadata is currently used in the web. So metadata is not all; we are also doing full text search in this lecture, but I think we'll do this after a five-minute break, so we will continue at 11:20.
So of course all this lecture is also recorded and will be available on the course home page for download and direct streaming in the browser so you can see any of the comments and any of the slides
and stuff for this lecture, if you want to revisit some parts of it. Good. So the obvious idea that comes to mind when looking for information is to say, okay, these catalogue cards that libraries usually make,
they are kind of like a minimal representation of the book, of the document. And usually, I mean, if a book is on the semantic web, the term semantic web should occur somewhere in the title.
A kind of easy heuristic to get what you want. But on the other hand, when you classify things, you have to use a controlled vocabulary. You have to decide on some terms that you will use for indexing
and other terms that you will not use for indexing. Determining this vocabulary is a very strenuous process, because first you need to discuss among experts what is a good term and what is a, well, not-as-good term. Then you need to train all the indexers
in how to use this term, how to distinguish subtleties with respect to the index term. And then you have to go through all the documents manually to find out whether some term applies or does not apply. So the problem is this makes for a lot of manual work that you will have to
do, and having a computer at hand, it might be much easier not to use the classification at all, but just to look for the word in the text. So as I showed you before, you can have the semantic web, so the book that contains
the term semantic web in the title, in the abstract, or somewhere in the full text. This is what full text retrieval basically does. This idea is not entirely new; I mean, the monks in the Middle Ages also came to see
the benefit of that and what they did was so-called concordances. And a concordance is the alphabetical list of words that are used in a book. And usually you restrict to some important works and there have been concordances for many of
the interesting books in the Middle Ages, which was usually the Bible. So there are kind of like books where you see where certain terms, for example immortality, actually occur. So for example, immortality occurs in this
Bible part, and the sentence where it occurs is: Christ alone has immortality. And in this part, this is part of the Corinthians: the mortal must put on immortality. So you have a cross reference when looking for some
words: where does it occur in the Bible? And such concordances don't only exist for the Bible, but also for other important works. However, it is very difficult to build such a concordance, because you have to read the book, you have to note every
occurrence of every term that is of some importance. First you have to discuss, of course, what is of importance and what is not, and then you have to look through it all and always record what you find, in a lexicographic style so that you can find it again.
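For a computer, by contrast, building a concordance is a few lines of work. A minimal sketch, where the passage labels and wordings are only approximate stand-ins for the examples above:

```python
from collections import defaultdict

# Map each word to the labeled passages in which it occurs.
verses = {
    "Timothy":     "christ alone has immortality",
    "Corinthians": "the mortal must put on immortality",
}

concordance = defaultdict(list)
for label, text in verses.items():
    for word in set(text.split()):
        concordance[word].append(label)

print(sorted(concordance["immortality"]))  # ['Corinthians', 'Timothy']
```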
The first Bible concordance was made by Hugo de Saint-Cher, and he used about 500 monks to do it, to go through the Bible and record the occurrences of every word.
A kind of interesting work, but the idea spread from there: not only indexing the Bible, but indexing everything that you have ever seen. And one of the first persons who pursued that was Vannevar Bush. Vannevar Bush was the Director of the Office of Scientific Research
and Development, beginning of the 40s, in the US. So he was a direct advisor to President Roosevelt on how science could influence the education, the engineering, the industrial development of America
as a whole. And he had this vision of something that is, well, remember, this is the 40s of the last century, and he had this idea: what if we have a hypertext-based PDA? Well, he didn't know about mobile devices, he didn't know about
Palm Pilots and other PDAs yet, or your Blackberry; he didn't even know about hypertext. But he had the basic notions of libraries: libraries had these concordances, libraries had glossaries, libraries had index terms. So in
1945 he wrote an essay, As We May Think, that was published in the Atlantic Monthly, and his vision was a device which he called the Memex: a personal document repository,
a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. So it should be a machine
that searches through your document collections and records everything that you do, that you say, that you communicate, and the selection should be by association rather than by indexing. So you don't have an indexing scheme on top of it that you then search by keywords;
rather, you can follow up on conversations: who wrote what at what time, and what was the answer to what? These associations are a very modern concept; basically those are hyperlinks, and he invented them in a way. So this is the first drawing that he made.
It was of course based on mechanical technology; there were no computers at that time. These are actually films, so there was microfilm, microfiche. I don't know if any of you have actually seen it: when you go to the library and
want to research something that is very old, you might actually be led to the microfiche readers, where you have these sheets of microfilm that you can put under the reader, and then you can see some of it,
enlarged for you. And that was basically the idea: to build into a desk projectors that show you documents recorded on microfiche, and that you can basically search through.
That was the basic idea of the Memex, and building on that, the idea of searching through your documents became very present. So for example, at the end of the 1950s, Hans Peter Luhn from IBM was the first one to use words as indexing units for documents, and he then went on to
determine the similarity of documents by how many words they shared. And that was basically the birth of the information retrieval field, with Gerard Salton, who is often referred to as one of the first information retrieval specialists. He was at Harvard,
later at Cornell, and created the SMART system. The SMART system is a retrieval system that is still in use today and can be extended; it has the basic functionalities of information retrieval technologies, in a very early way.
It actually introduced two models that we will see: the Boolean model was in the SMART system, and it introduced an all-new model that we will cover next week, the vector space model for document retrieval, one of the most important models still in use today. He also considered the
idea of relevance feedback. So taking the user into the loop and finding out more about the user's information need by getting explicit feedback on items and how well they match. We will also cover that in a later lecture.
From then on the success of IR was basically unstoppable, and the ACM, the Association for Computing Machinery, kind of like one of the computer science clubs of the US, founded a special interest group on information retrieval,
SIGIR, and since 1978 it has held an annual conference. It even has a prize for the best contribution to the field, which was named after one of the founders of information retrieval, Gerard Salton, and he was even the first
one honored with this prize. So it's one of the rare occasions where somebody gets a prize that is named after himself. Interesting. The problem in information retrieval is always the quality. So how good is the result? And of course that became
clear at quite an early stage, also for the algorithms in information retrieval. So how well does an algorithm perform? In databases it's clear what you do, and then you can measure the time or the storage
space that you need, or whatever. So these are kind of objective measures. But what if, in information retrieval, your algorithm takes a little bit longer but produces better results? That's a problem. That was pretty new, because the result as such was not
fixed. It wasn't: you have to retrieve that set, and if one item is missing, it's wrong, it's incorrect. Rather, you had ranked result lists, and the ranking reflects the quality of your retrieval engine,
the quality of your algorithms. That was a new concept, and of course for that you should know what is relevant and what is not relevant. And for this we have the TREC conference. TREC is the annual Text REtrieval Conference. So at the beginning of the 90s,
people began to design benchmarks where you say, okay, I have a certain corpus of texts, and within this corpus I have 5, 6, 7 typical queries, and these documents in this ranking
are the correct answer to that query. And you just take a couple of experts from the field and you let them sift through the documents, a lot of manual labor, but in the end you get what we call ground truth: the true result that should come out if you have a good retrieval engine.
And the closer you are to that result, the better your algorithm is. This was sponsored by the US National Institute of Standards and Technology and the Department of Defense, so that was kind of a DARPA grant, and it just snowballed, because it wasn't only the
problem of how to find relevant text for some queries, but then there was, for example, how to find relevant blog entries, how to find stuff in genomics, how to find stuff in chemistry, how to detect spam, you know, all different problems that share the same difficulty of being very
hard to evaluate without ground truth, without established benchmark collections. So this is what TREC is all about. Turning to web search from the information retrieval part, we'll cover that in the last lectures.
In 1991 Tim Berners-Lee actually invented the World Wide Web, and it was by no means a planned invention; it was just a good idea that suddenly snowballed. And the first web search engines,
like Archie or the Architext engine that later became Excite, came very closely after the first couple of web pages emerged. Because at first you had directories; everybody knew that from Gopher and some of the older
precursors of the modern web, and usually those were only maintained by universities: a network of FTP servers or the Gopher network. And it soon became clear that the web needs new technology to cover
the same ground. You can't just take every web server and catalog it and then have kind of an index of that. With the growing web you needed scalable techniques to cover all the information and to index all the information that was on the web.
And in the first years people were quite clear about what web search was, because web search was just the idea: documents are distributed all over the place, so let's just use information retrieval techniques on them. It's a scalability problem, yes; it's a
distributed problem, yes; but that's all there is to it. That was until 1998, and 1998 was kind of a breakthrough. It was the year that Google was founded, and Google
introduced something entirely new to the market, and that was basically the link structure. Exploiting the link structure of the web, especially with the PageRank algorithm, was such a novel idea that it made all the
web search engine providers rethink their architectures and realize that web search is not just a different kind of information retrieval. And today we have very complex retrieval
engines. So the core problems of information retrieval and web search are: how to store and update large document collections (it should not take too much storage space, and it should be scalable, new websites are added every day); how to retrieve
the interesting sites efficiently (it should be fast, you know, you don't want to type in your Google query and then wait for a quarter of an hour); and it should be effective: you should see the best pages, the first 10, the first
20 results should be sufficient to cover your information need, because hardly anybody ever goes beyond the first page of the Google results. You click on the first couple of results and that's it. I mean, Google always returns this number,
you know: you have 19 million pages in your result list. Has anybody ever looked at them all? No, you just look at the first 10. So quality is definitely an important point here. Good. Let's now get to some fundamental notions
of information retrieval, and towards the first algorithms for dealing with the retrieval problem. A document is the central unit of information retrieval. It's a coherent passage of free text. It may be a document like
this book, but every paragraph in here might also be considered a small document; so any coherent passage of words is a document. Coherent means that just one topic is covered,
not random words from some dictionary, and free text means it has a certain structure, usually meaning a grammar, a certain language it is written in, so it's understandable. Typical documents are,
for example, newspaper articles or scientific articles, dictionary entries explaining the meaning of some word; a web page is usually a document, or an email message, but of course it might be split into some subdocuments, which makes the topics
more coherent. A document collection, then, is a set of documents, and librarians usually call it a corpus, a corpus of books or whatever it is. A corpus also has a certain topic that it covers, and usually the documents in one corpus
are all written with respect to that topic. If you have some documents on a different topic, you will introduce a new corpus, and then you have several corpora. Typical examples of such corpora are MEDLINE, or the articles covered by Google News, which basically is an archive
of some renowned newspapers, or even the web. That's a very big corpus, and not a very coherent corpus, but usually it counts as a corpus. Then we have the notion of the information need. The information need
is the topic about which the user desires to know more. That's a very difficult part of information retrieval, because unlike the database stuff, you don't really specify what you want. And why don't you do that? Why don't you specify what you want?
Exactly! The problem is really that you hardly ever know what
you're looking for, because if you knew what you're looking for, you wouldn't search for it. So many queries that you post to information retrieval engines are not of a navigational type (give me that file, I know the file, I know the number, and I just want to have it to look up something),
but most are of an informational type: I want to know about some topic, and I don't know what it means. I mean, when we looked up the swine flu, I didn't really know what the exact type of virus was called; it was totally new to me. Now I know. It's wonderful,
this is how information retrieval works: you have a very unclear query and you have a very unclear information need. Do you want to get documents that explain something, so you might get a first
intuition? On what level? On a beginner's level, in easy words? Or are you a scientist working on, I don't know, swine flu medication? Then you might need totally different documents. You don't like the Wikipedia entry on what
swine flu is, because you've been working on vaccines for the last 15 years or something; there's nothing new for you there. But the latest issue of, I don't know, some medical journal saying, oh, we had a wonderful success rate with this
new medication or with this treatment of swine flu victims, that might be interesting to you. So your information need does not only reflect your query, it also reflects your background, your context. How can a computer know that? Not from your
query. Interesting. So we have different kinds of statements. For example: what is the capital of Uganda? It's easy to answer; anybody have any idea? Very good.
Is it really true that McDonald's hamburgers contain worm meat? Rumours; on the web you will find pages about that. What is cloud computing? A novel term. I mean, these are all information needs that are different. The first one
is a factual question: there is one capital of Uganda, and it has some name, tell it. Is it really true that...? That's rather speculative, isn't it? Is there proof for that, or is it just a bunch of assumptions, or one of these nice theories that people always have about
companies or governments? What is cloud computing? Well, in what way? Very vague. Do you want a basic introduction to cloud computing? Do you want to know about the economic impact of cloud computing? What are you aiming at? It's totally unclear.
The information need is one of the big problems of information retrieval, and all you usually get to determine what the information need of a user is, is a query. The query is the communication with the user. The user says: I want something about cloud computing,
I want the capital of Uganda, or something like that. And you usually do that in a query language; in the easiest case it's just a keyword or a bunch of keywords. You can also put in some
information about where things should occur: it should occur in the title of a document, or it should occur near some other term. So I want something about Panda near to Jaguar, but not animal; you know, I mean the car brands, actually there's the Fiat Panda and the Jaguar
car company, so I want a document combining them both somehow. So these operators can steer the search, but basically it comes down to somehow getting the retrieval engine
to grasp the meaning of your query, to grasp your information need. That's a very difficult thing to do actually. So why is it so difficult? Because relevance is all that counts. You want information
that is relevant with respect to your query. If I get the Wikipedia entry for the cloud computing question, that might be very relevant. If I get the ramblings of some professor about the newest developments and his greatest
inventions in cloud computing, that is probably relevant in a way, basically because the topic is cloud computing, but it is probably not relevant for you, because you want to know some basics about
cloud computing. So relevance is kind of a strange concept. It's a subjective concept. There is no relevant or irrelevant as an objective term; you, the user, are in the loop, and something is relevant if you perceive it as being
relevant. And usually it's kind of a binary concept, something is either relevant or not, but of course there could be degrees of relevance: this is very relevant, or this is not as relevant. And it's very hard to
formalize what relevance actually means. There are different notions of relevance, and we will go through them when we talk about some of the issues later. Now, after having covered some of the fundamental notions, I want to go into actual IR systems and present some models, and then we
will really get into the Boolean retrieval model, which is the easiest retrieval model. So every IR system basically has a number of components. It has a query; it has to represent the query somehow, to make it machine readable, machine
understandable. On the other hand it has a document collection usually very big and has to represent the documents in the collection either by indexing or in some other clever way. And then what is basically done
is that the representation of the document collection and the representation of the query are compared somehow. From the comparison you get a, usually ranked, result, and you deliver that result to the user, and maybe start a feedback cycle and ask: is that what you wanted? Is that relevant
for you? And then the user may say: yes, that covers my information need, or: no, that was totally not what I was looking for, rather go in this or that direction. This is basically the cycle that information retrieval uses all the time.
And this IR model basically has certain components: the query language, how to represent queries, how to represent documents, and the ranking function that associates
a score with a query-document pair: how well does some specific document match the query? And that's the basic part of what you need to build a retrieval engine. Then you may
still want a relevance feedback engine, where you close the cycle by judging the relevance of the results and feeding that back as query refinements to get even better results. And one of the representations that is most famous and
directly springs to mind is the bag of words representation. Bag of words basically means that a document is just a collection of words. And we just ignore grammar. We just ignore everything that structures
the document. We just focus on the individual terms that are in it, because those are very similar to the individual terms, the keywords, that you use in a query. So basically I know from some document that information retrieval, or the term information retrieval, will occur
very often in this book here. That might mean something. I know that the word biology will occur very rarely in this book. I know that the term rocket science for example may not occur in this book at all. And that may be helpful
when considering the query. So the standard case is that you just take a vocabulary, the set of all words in the document collection, and sometimes you also count the words: you take the multiset
of the words, in the bag of words model, and just say, okay, how often does some term occur? Also very interesting. So what you basically do is you take a document, that's 'one small step for a man and a giant leap for mankind', and say, well, what does it contain? It contains 'that's' one time, 'one' one time, 'small' one time, 'for' two times,
'a' two times, and so on. Okay? And this multiset is basically the representation of the document. Easiest representation ever. The bag of words model.
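As a minimal sketch of this representation in Python, using the quote from the slide:

```python
from collections import Counter

# Bag of words: the document becomes a multiset of its terms;
# grammar and word order are thrown away completely.
doc = "that's one small step for a man and a giant leap for mankind"
bag = Counter(doc.split())

print(bag["for"])   # 2
print(bag["a"])     # 2
print(bag["step"])  # 1
```

Well, pros and cons for this model?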
Well, you totally ignore the word order. You totally ignore the grammar. I can give you a wonderful example. I have a very small document saying: this document is not about rocket science.
Ha, the term rocket science occurs, so it must be relevant for a query about rocket science. No, it's not. But the bag of words model ignores the negation; it does not recognize that 'not rocket science' is something totally different.
Okay? Very different documents with respect to the grammar, expressing exactly such things, might, as I said, have similar representations in the bag of words model, because you're losing semantics. And you ignore all the visual clues
in the document. So for example, the title, or the keywords in the title, might be much more interesting, much more relevant than those occurring in the last section of the document, where it says 'and other related tasks are blah blah blah'. Those are usually negligible.
On the other hand, you get a good mathematical representation of the document, something you can build on and work with. And you can store individual terms and retrieve them afterwards in a very efficient way, because you just count the occurrences of each term in
the document, and that is basically your representation. And if you have another document, you just take the same terms and add the new ones. So how many terms will there be in total? I mean, most documents in the German language can be written using a vocabulary of one or two hundred thousand
terms that recur very often, and you just represent the documents with respect to that. And one of its best advantages is that it actually works quite well. So just using this simple
model already gives you a very good retrieval engine, a retrieval engine that can do a lot of things. So what you basically do is you take a vocabulary, basically all the terms that occur somewhere in any of your documents,
usually called the corpus vocabulary. Then you take all your documents, so these are different documents, and wherever some word occurs, you just add one to the vector
that is built over all these different indexing terms. So for example, 'that's one small step for a man and a giant leap for mankind': one occurrence of 'that's', one occurrence of 'one', two occurrences of 'for', zero occurrences of 'taikonaut'.
This vector describes the document in a very compressed fashion, in a very easy-to-work-with fashion. You want a different document? Okay: 'a taikonaut's small step is a giant leap for China'.
Doesn't contain 'that's', does not contain 'one', does contain 'small', does contain 'step', does contain 'taikonaut', okay? Different vector, different document.
Okay, and now you can compare the vectors instead of the documents. That is the trick of the bag of words model: comparing the vectors is much easier, much more structured, than comparing the documents.
And the vector immediately abstracts from everything grammatical, from word order and the like in the document. Consider for example the task of finding out which words occur in both documents: what is the overlap of the documents?
This is easy: you just do a vector-wise comparison, okay? And then I know, okay, this does not occur, this does not occur, this does not occur; it's easy to do. Doing it on the text is: well, I have 'that's', let's look whether
there's 'that's': no, there's not. Then I have 'one', let's look whether there's 'one': no, I don't have it, okay. 'Small': ah, here it is, so this is one of the terms both contain. It's very tedious to do, a large overhead;
with the vectors, it's easy. This is why it's a good representation. And if you put all these vectors together
as rows in a matrix, with the documents here, document one to document n, so you have n documents in your corpus, then you get what's called a term-document matrix, or the incidence
matrix. On one side you have the terms, on the other side you have the documents, and they form a matrix, okay? And we will see that you can do lots of stuff with this matrix that actually leads to impressive results. So the term-document matrix
will be one of our friends for the weeks to come. Okay. Let's start with our first retrieval model, and that goes back to this nice gentleman over here. This is good old Boole.
And he was one of the first to define a mathematical logic, Boolean algebra, for the logical combination of logical propositions. And
actually the Boolean retrieval model is the simplest IR model there is. You take sets of words as the document representation, and you even ignore how many times each word occurs in the document;
you just say it occurs or it does not occur. Then you use the Boolean connectors and, or, not to combine the term vectors. You have a binary ranking function that is just 0 or 1.
If the query asks for two words, word A and word B, you get a ranking value of 1 only if both occur. If one of them, or both, do not occur, you get a ranking value of 0. It's easy.
And the retrieval is based on membership in the sets, in the word sets. So for example: find all documents indexed by the word mankind. How do you do that?
Exactly. You just take the respective column corresponding to the word, go down the documents, and whenever there is a 1, you return the document. Easy retrieval engine. Find all documents indexed by the word
man or mankind, a Boolean or here. You just pick the two columns, and whenever there is a 1 in either column, you return the document. What will you do with: find all documents
indexed by the words man and mankind? You look at both columns, and when there is a 1 in both columns at the same time, you return the document. It's just a Boolean operation: 1 and 1 is 1,
1 and 0 is 0. These are just the normal Boolean operators, so nothing very clever is needed. For a conjunction, both columns have to be 1 to yield a 1. For a disjunction, either of the columns has to show a
1 to retrieve a 1, and the not, the negation, just flips the bit. That's very efficient and very easy to work with. Let's make an example. We have very small documents here,
for better handling. So document 1 says 'step mankind' and document 2 says 'step China'. And we just ignore how often it says step or China or whatever; we just say the word occurs. So, step and mankind: we just pick the columns
and look at them. Here is the 1; 1 and 1 makes 1. And here, 1 and 0 makes 0.
Very easy to do. The same goes for the or case, only that we don't do an and here but an or: step or mankind. Both contain step, that's a 1. 1 or 1 is 1, 1 or 0 is also 1. So both are returned.
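As a minimal sketch of this whole machinery in Python, with the two toy documents from the example, their incidence columns, and the Boolean evaluation:

```python
# Term-document incidence: for each term, a 0/1 entry per document.
docs = {1: "step mankind", 2: "step china"}
vocab = {t for text in docs.values() for t in text.split()}
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()}
             for t in vocab}

def AND(a, b):  # bitwise and over the two term columns
    return {d: incidence[a][d] & incidence[b][d] for d in docs}

def OR(a, b):   # bitwise or over the two term columns
    return {d: incidence[a][d] | incidence[b][d] for d in docs}

print(AND("step", "mankind"))  # {1: 1, 2: 0} -> only document 1
print(OR("step", "mankind"))   # {1: 1, 2: 1} -> both documents
```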
Everybody understood? Amazingly easy, actually. Well, the interesting part is the negation, because the negation has to be safe. If you say, well, I want all the documents that do not contain mankind,
that's all of them where mankind is a 0. Probably a lot of documents will come down on you. So if you say something like, oh, I want all
documents that contain Neil Armstrong but not mankind, that might be a little bit better: you first cut down to all those that have Neil Armstrong, and only then you complement the bits. To match natural language better, you usually don't say and not,
like a and not b or something like that, but you say but not: I want step but not China. It's just language, you know; it's the same operator as and not. And you can also use 'n of', which is
very often done. So you say, well, I basically want two of: step, mankind, China. Which basically translates into: I want either step and mankind, or step and China, or mankind and China. Okay? Basically the same theory, just an abbreviated way of saying it.
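Mechanically, the expansion of such an 'n of' query is just all n-way combinations; a small sketch:

```python
from itertools import combinations

def n_of(n, terms):
    """Expand 'n of <terms>' into a disjunction of n-way conjunctions."""
    return " or ".join("(" + " and ".join(c) + ")"
                       for c in combinations(terms, n))

print(n_of(2, ["step", "mankind", "china"]))
# (step and mankind) or (step and china) or (mankind and china)
```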
Okay? In any case, we can process queries with that, and to speed it up we use an inverted index. That's a very clever notion of information retrieval,
because if you want to know in which documents some term occurs, it's much easier to go from the term to its documents than the other way around. And usually, for most words, you will see very
many zeros in the term's column. Because, I mean, take the web and give me all the documents that contain, no idea, Christmas pudding. Okay? That will be very few documents. Still, going through all the documents doesn't help much,
just to find out that almost all entries are zero. So better record only those documents that contain a one for the term. So if you say Christmas pudding, okay: document one, document two and document five,
and that's it. Okay? This is what is called an inverted index. I just look up the word, and for every word, like in the concordances that I described before, I record in which documents the term occurs, and I just skip all the documents
where the term does not occur. Okay? It can be precomputed, so it's very easy to build such an inverted index in one pass through the documents, and of course the query processing is very fast, especially if you keep the document lists ordered. Because now you can say: okay, I want all the
documents where Christmas pudding and whatever else occur, so I go through one list and check for each document whether it is in the other list. That's it. So for example, if you have 'step mankind' and 'step China', then the inverted index would read: okay, for all the different terms
I have an entry, step, mankind, China, and then I have a list of the documents where each term occurs. Okay? Everybody knows the idea of the inverted index? It's not too difficult.
Using such an inverted index, queries of the type 'show me all documents containing term x' can be answered quickly, because you just go to the term and iterate over its list of documents. Easy.
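A minimal sketch of building such an index and intersecting two of its sorted posting lists (the toy documents one, two and five echo the example above; the merge walks both lists once):

```python
from collections import defaultdict

docs = {1: "step mankind", 2: "step china", 5: "step mankind china"}

# Inverted index: term -> sorted list of the documents containing it.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge two sorted posting lists, keeping the common documents."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(index["mankind"])                            # [1, 5]
print(intersect(index["step"], index["mankind"]))  # [1, 5]
```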
Also, unions and intersections of sets can be done very quickly, and this is most of what Boolean queries are about: union and intersect. The query mankind and step means you take the result set of mankind and you intersect it with the result set of step.
Or you take the result set of mankind and you union it with the result set of step. What you can do with all Boolean expressions is convert them into a conjunctive normal form or a disjunctive
normal form. So you just consider where the ANDs and the ORs are supposed to be and which parts of the Boolean expression should be evaluated first. In the conjunctive normal form the ANDs are considered last; in the disjunctive normal form
the ANDs are considered first. So let's see. A propositional formula in conjunctive normal form is a conjunction of clauses, where every clause just contains disjunctions of literals.
So everything of the type (A or B or C) and (D or E) is in conjunctive normal form, because you first have the disjunctions of literals
and afterwards you put the and on top of them. And any Boolean formula can be converted into conjunctive normal form; you just shift things around. There are rules, like De Morgan's rules, that tell you how to do it.
It's just an algebra; you can transform any of the expressions. With the disjunctive normal form it's just the other way around:
you have the brackets containing the conjunctions, (A and B) or (D and E), something like that. First you do your conjunctions, and then you put the disjunctions on top. And as with the conjunctive normal form, you can transform any propositional formula, any Boolean expression, into a disjunctive normal form.
So why is it important to have the conjunctive or disjunctive normal form?
So now comes the first surprise of the lecture. The interesting part is that rearranging the queries actually has something to do with what you evaluate first
and what the cost of evaluating certain subexpressions first is. You saw that for a conjunction we have to do an intersect, for a disjunction we have to do a union. The costs of union and intersect may be different.
So sticking to either one of the normal forms is a good idea. What can we do, for example, if we have this here: step and ((China and taikonaut) or man)? We can translate it. For example, here we want the conjunctive normal form,
but there is a subexpression that has to be evaluated first and contains an and. This is not allowed in CNF. The distributive law tells us that we can push the or inside:
so we get (China or man) and (taikonaut or man). And this is not a problem, because neither bracket contains an and anymore. So then we have step and (China or man) and (taikonaut or man). The or parts are considered first, and afterwards everything is connected with an and.
Of course we can do the same thing for the disjunctive normal form; then we have to do the and parts first. So, step and ((China and taikonaut) or man): we use the same rule, distributing the and into the bracket.
What does it mean? So China and taikonaut and step is a possible solution, or man and step is a possible solution. These are evaluated first.
The or part comes later. Okay? A simple idea of how you can transform Boolean algebra expressions. And now, why is it so important? I told you it has something to do with efficiency,
because if you use the conjunctive normal form, you compute unions first and then compute the intersections for the ANDs. With the disjunctive normal form, you compute the intersections first and then compute the unions.
But in terms of intermediate results, a union will usually give you a lot of results and an intersection will usually give you a small number of results. So if you do it the first way, you will blow up the set and then shrink it down again.
If you do it the second way, you will shrink the sets from the beginning and then grow them to the actual result set size. The two forms are perfectly equivalent, the result set is the same, but the intermediate results that you get from every operation will be large in the conjunctive normal form case
and small in the disjunctive normal form case. So it's a much more clever idea to have everything in disjunctive normal form and not in conjunctive normal form. Okay? This is our rationale for actually doing it.
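A small sketch of that effect, with made-up posting sets for the query step and ((china and taikonaut) or man); both forms give the same answer, but the intermediate sets differ in size:

```python
# Made-up posting sets: document ids are just integers.
step      = set(range(0, 100))
china     = set(range(0, 60))
man       = set(range(50, 90))
taikonaut = set(range(20, 40))

# CNF: step AND (china OR man) AND (taikonaut OR man)
u1, u2 = china | man, taikonaut | man          # unions first: large
cnf = step & u1 & u2

# DNF: (step AND china AND taikonaut) OR (step AND man)
i1, i2 = step & china & taikonaut, step & man  # intersections first: small
dnf = i1 | i2

print(len(u1), len(u2))  # 90 60 -> large intermediate results
print(len(i1), len(i2))  # 20 40 -> small intermediate results
print(cnf == dnf)        # True  -> same final result set
```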
Well, that about covers the Boolean retrieval model. Very easy to use. And you can retrieve subsets by a suitable query, just by cutting out those terms that you are interested in. And if two documents have different representations,
you can distinguish between them, because either you use the distinguishing term in your query or you don't; that means either a document shows up in your results or it doesn't. But of course this is rather theoretical, because you would have to know the right query for distinguishing documents that are relevant from those that are not.
A very hard topic. Very hard to find a good query that will show off the advantage of the retrieval paradigm. On the other side, of course, we have a result set, a term is contained or it is not contained,
but there is no ranking. A term is not more important or less important; it's just there or not there. And you might even get empty results: if you put too many terms in your query, you might not have a single document containing all of them.
And you will kick out of your result set even those documents that contain all the query terms except for one.
You will kick out those documents. You don't have second-grade or third-grade documents; the Boolean retrieval model cannot deal with that. And this is why controlling the result size is exceedingly difficult:
if you put in too many query terms, you will end up with no results at all; if you put in too few query terms, you will be flooded with results. What is the correct number of query terms to get a good result set? It depends on the corpus, which you don't know.
There's no way of finding that out. Difficult. As I said, similarity is not supported: either a document matches or it doesn't. And those are most of the interesting cons.
Still, if you think about synonyms, about documents that deal with some topic but describe it in an unusual fashion, you will not find these documents with the Boolean retrieval model.
We will have a look at document retrieval models later that can also deal with these problems. Everybody fine with Boolean retrieval? Sure? Good. And then we have our last detour for today.
So let's have a quick look at an example of an information service that actually relies heavily on Boolean queries. It's called Westlaw, and it basically covers US law. It's a legal research service
that covers a lot of legal publications in the US: law documents, commentary on law documents, and newspaper coverage of legal topics. Everything is indexed using a special indexing system, similar to the MeSH system,
and until very recently they used Boolean search as the default method for searching documents. So it may seem not very modern, but it can be very useful for searching legal documents. I really would like to show you the website,
but using Westlaw costs a lot of money and we don't have a subscription, so I brought some examples. Let's assume some search engine expert from Google goes to Microsoft, and then Google wants to find out how they can prevent their former employee
from telling Microsoft all their search engine secrets. So this is the kind of information need a lawyer might have, and then the query could be: it's all about trade secrets, and in the same sentence the word disclosure should also occur,
or some word that starts with 'disclos', so disclosing, disclosure, discloses, something like this. Also, in the same sentence, the word prevent should occur,
and some word starting with 'employ': employee, employer, employment, something like this. And then Westlaw returns all documents that satisfy all these four conditions, four terms in the same sentence. So if you really know how legal texts are written,
you can construct queries like this and find all the relevant documents that you need to know about. Another example: here the information need is the requirements for disabled people to be able to access a workplace, and the query could be:
look for all documents in which a term starting with 'disab' occurs in the same paragraph as a term starting with 'access', which in turn occurs in the same sentence as either the word work-site or the word workplace,
or the document contains the term employment within three words of the term place.
So it's awfully complicated to build these queries, but trust me, if you are a lawyer, you have learned in your work, and while studying law, how legal terms are used and how lawyers build sentences out of them.
And there are some rules for how this usually is done, and if you know what sentences to expect, you can build these kinds of queries and find exactly the documents describing what you are looking for.
There was some kind of study performed on what queries are asked on Westlaw, and in fact, on average, a query to Westlaw contains ten words. So these queries are really complicated, and the system is not designed for use by people who know nothing about law,
but it's designed to be used by real experts of the field. And the reason why they use Boolean search is that professionals really want to know what they are doing and what's happening there
and what documents are returned. So if I type some query into Google, I get a result list, but I have no idea where this result list comes from. And I don't know whether there is some really relevant document on page 1000, for example, and I don't want to look through all the pages
that Google thinks might be relevant in some way. I want to be sure that what is returned really matches my query. So I have total control over what I'm querying for, and I really know what the system is doing. This can be very important when confronted with legal issues.
Because in the U.S. there is case law. That means if some court, especially one that is quite high in the hierarchy, makes some decision, then it becomes law. And when a similar situation
to the one decided in that court occurs later, then the judge in the second case should decide exactly as in the first case. So it's really important here to find all these similar cases, and this can only be done if you really, really can control what you're looking for
and can really be sure that you found every document that matched your query, and not get thousands of documents that are slightly similar, with no idea why Google is of the opinion that they might be similar. So some years ago there were some experiments
comparing Westlaw queries with traditional free text queries as found in Google, and the result was that free text queries are usually comparable in result quality,
but it really depends on what you want to do. Currently Westlaw also supports free text queries, and this can be helpful in some cases to get an overview of a field; but usually, if you really want to go very deep into the details
of some legal problem, then you still need Boolean search. So that's what Boolean search currently is used for. In the next lecture we will take a close look at some retrieval models that are of a more modern fashion
and that really come close to what Google currently is doing: fuzzy retrieval, coordination level matching, and the vector space model that has also been used in the SMART system. So if there aren't any questions now, then I would like to say thank you and see you next week.