
Miscellaneous (13.7.2011)


Formal Metadata

Title: Miscellaneous (13.7.2011)
Part Number: 13
Number of Parts: 13
License: CC Attribution - NonCommercial 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year: 2011
Production Place: Braunschweig

Content Metadata

Abstract: This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course.
Transcript: English (auto-generated)
All right, then welcome to our last lecture about information retrieval and web search engines. Last week we talked about the page rank algorithm, so how links can be used to rank pages on the web to estimate their prestige. And we stopped right at the beginning of this detour here, which will show the Google toolbar. Does any one of you
know the Google toolbar? Yeah, I hate toolbars too. I also hate Google's toolbar, but you can use it to find out the page rank of a given page. So I'm going to install it now. Hopefully it won't transfer all my private data to Google. So basically, the page rank is hidden somewhere deep inside Google's ranking algorithm, but using the toolbar, you can see it on every page you visit. Years ago this toolbar used to be really simple, just showing the page rank of the page and a search form, which currently is just integrated into modern browsers. So Google decided to include a lot of helpful enhanced features into its toolbar: automatic
translation, all you need or never need. But if you are visiting a site, then here you can see the page rank of the site. I don't know whether you can see it, but I hope so. The page rank of the IfIS institute page is five of ten, so they use a rather rough eleven-point scale, starting at zero and ending at ten, and IfIS is a site of medium prestige. But microsoft.com, I bet, would have a really high page rank. A page rank of eight, yeah, that's not too impressive. So what about Adobe, for example, also very popular because people link to Adobe because of the Adobe Reader. Yeah, they have a page rank of nine. So then let's try probably the most-linked page in the whole web. Here we are, page rank ten of ten, so one of the most prestigious web pages in the whole web is the download site of Adobe's Reader. So it's just nice to see how Google rates these pages; if you are interested in how your own pages are perceived by the people out there, what the prestige of your pages is, just use the Google toolbar and check
it yourself. On the slide there's another link, so let's open it. Someone, I don't know who he is, took the effort to surf through a lot of pages, found the page rank that Google shows in the toolbar for each of them, and then listed all pages having a page rank of ten, so this is a list of the web's most prestigious pages. I don't know how current it is, but different pages from Adobe.com are listed, which, as I already showed, are very often linked by other people; of course Apple, with very many in-links; Froogle, Google Groups, Google Catalog Search, all kinds of Google stuff, of course, is very prestigious, so I don't know whether Google tweaks it a bit, but Google indeed. Job opportunities at Google, yeah, also a very high page rank; some government agencies, like the National Science Foundation; RealPlayer; lots of stuff here. I think he also did it for other page rank levels; no, he didn't, never mind. So if you are interested in what is really important in the web, check it out yourself. All right, so this is how the page rank is used in the web,
but the page rank can also be used for other purposes. One purpose would be crawling: two weeks ago we heard about focused crawling, where you can crawl pages belonging to just a certain topic, but you can also use page rank for some kind of prestige-focused crawling, if you will. For example, you could decide to crawl deep into sites having a very large page rank at their home page, or you could decide to update those pages very often that have a high page rank, because they seem to be important to many people. So the page rank can be used to steer your crawling process and focus on those pages that are generally more important or more prestigious than all the other pages. The idea behind page rank can also be used in other fields of application. Last week we already saw some examples from social networks, how people are linked, and co-citation and all this stuff in scientific literature. There are many ways to exploit the idea of page rank: generally, you can use it whenever you have some kind of directed graph and links mean some kind of recommendation, that some resource is perceived as good by other people, or some kind of flow or direction. There you can use the page rank to find out where all the recommendations in your network go to, and of course, you can also use the
page rank to estimate the so-called impact factor of scientific journals. Scientists usually publish their results at conferences or in scientific journals, and there are a lot of journals out there, so of course it is important to measure which journals usually publish articles of good quality and which do not. There are thousands of journals in computer science alone, and finding out which are the most important ones matters to researchers, because you want to publish your good results, of course, in prestigious journals. Here are some sites, I won't open them right now, you can take a look yourself, which analyze citations in these journals, find out in which journals articles are very often cited, and, by using the page rank algorithm,
they calculate some kind of prestige score for these journals, and you finally get a list of journals where the ones on top are probably the most important in the field. But what's more interesting, in my opinion, is the ranking of doctoral programs. This is a research project conducted by a student at Harvard, who wanted to find out to which university you should go if you want to do your doctorate. Of course, in particular in the US, it is very important at which university you get your degree, especially your doctoral or PhD degree. In Germany that's usually not too important, because all universities are basically at the same level, or at least very similar, but in the United States there usually are rankings, published by some agencies, that try to find out which are the best universities out there. So let's take a look at this ranking: he used the page rank to find out which are the best universities for PhD studies. How did he do it? He asked people currently participating in some PhD or doctoral program at which university they did their master's or bachelor's degree, and then drew a graph of people's movements: if you studied in Braunschweig and go on to do your PhD in Hannover, then there would be one single link from Braunschweig to Hannover. Of course, those universities having a good prestige for doctoral or PhD studies would receive many in-links from many other universities, and those with a bad prestige won't have many links, and then you can use the page rank algorithm on this graph structure to find out, so blah,
blah, blah, some fancy math; we already did this last week, but here's the most important part, the ranking. As I said, he conducted a survey of people who received their PhDs in the 90s and between 2000 and 2004, created this graph, applied the page rank algorithm, and normalized it to get some scores, and finally this is the ranking he got: Harvard is the best university for your PhD. I think this has been limited to a certain discipline, I'm not sure which, maybe political science or something like that, so it's not some kind of overall score; you probably wouldn't want to go to Harvard for doing just any PhD, but if you were a political scientist, then you should, or go to Stanford, Michigan, Rochester, Chicago, the University of California at Berkeley, and so on, MIT. So it fits the intuition, because the famous universities are on top. If we scroll down the list, yeah, the University of Florida, pretty warm and nice, but not too recommendable for doing your PhD studies; the famous University of Kentucky; some more, Texas Tech, Temple, the University of Nebraska, so at least in this discipline not too good. So this is one way to use the page rank.
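The page rank computation behind all of these examples (journals, universities, web pages) can be sketched as a simple power iteration. This is a minimal sketch: the damping factor 0.85, the iteration count, and the tiny movement graph are illustrative assumptions, not the exact setup used in the study mentioned above.

```python
# Minimal PageRank power iteration over a directed graph.
# An edge points from the recommending node to the recommended one.
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start with a uniform vector
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                # Distribute the damped rank evenly along the out-links.
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical movement graph: an edge means "students moved from here to there".
graph = {
    "Braunschweig": ["Hannover"],
    "Hannover": ["Braunschweig", "Harvard"],
    "Harvard": [],
}
scores = pagerank(graph)
```

Here Hannover ends up with the highest score, since it receives all of Braunschweig's rank while the other two nodes only receive a half share each; the scores always sum to one, so they can be read as relative prestige.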
As I said, as long as you have some kind of graph structure and your links mean something like a recommendation, you can always apply page rank and find out many interesting things. All right then, let's go on with the HITS algorithm, the second big algorithm for analyzing link structure on the web. As Professor Balke is not here right now, but I hope he will be soon, I will just begin explaining the topic, and he will dive in when he arrives. Okay, the HITS algorithm, so what
does HITS mean? HITS is Hyperlink-Induced Topic Search, so obviously it does something with hyperlinks. It was invented by Jon Kleinberg, whom we already discussed briefly; he is in computer science and does a lot of work on networks. In the middle of the 90s, right in parallel to the work done by Brin and Page on page rank, he also worked on this kind of network analysis, but his problem setting was a bit different, so he didn't
want to compute an overall prestige score to each page, but he recognized that usually, if there is an information flow, or when there is a social network, then you can distinguish two types of nodes in this network, or two types of people in a social network, so one
type are authorities, authorities are people or pages who know something about a given topic, who are the experts on a topic, and can provide all the information you are looking for, so on the other hand, there are hubs, hubs are people that usually are
not experts in the field, but do know a lot of other people, so if you would ask a hub, then you could be quite sure that the hub would give you the right pointers to experts in the field, so there are experts who know everything, and hubs who know the experts,
and concentrate the knowledge of the experts and can refer you further, so of course the same is true to web pages, there are web pages providing important content, and there are web pages providing link collections about a certain topic, so and of course every page could be to a certain degree, be both an authority and a hub, so usually you don't
have a clear distinction between authority pages and hub pages, but it could be mixed as well, so and then this was the basic problem setting he recognized,
and his problem simply became: given a web query and given a web crawl you have in your database and your index, estimate the degree of authority and hubness of each web page in your index, so just estimate a score that tries
to determine to what degree a page is an authority or a hub, so and of course it's again related to how links are distributed in this network, and this is an important
difference to the page rank algorithms, because in page rank we assumed that every page has a single global page rank that does not depend on a given query, so in hits you have a certain web query, for example the topic information retrieval and the scores
to be estimated by the hits algorithm are only relative to the topic chosen, so if the query is information retrieval, we are looking for the scores of hubness and authority with respect to this specific topic, this specific query, so a definite consequence of that is that you have to calculate all your scores at query
time, so in page rank you can pre-compute it once a month for example and then if a user wants to start to ask web query then you can just use the page rank in your ranking algorithm, but in hits you have to compute the score after each query or
you have to pre-compute it for many, many, many different topics, so hits is a bit more personalised I would say. Okay, what is the idea of hits? We are given some query of course, as I have said this is what enters the system, here is the nice user,
has some information need, asks his query to the system and then in the first step we send the query to some standard IR system, one of the systems we have already seen which does not use any kind of link analysis, so for example a simple vector space model,
here is some IR algorithm and this IR system returns a set of pages that the system deems as being relevant, so and then we are creating from this set of pages which
is called the root set, so big R is root set, we create a set of pages called the base set which includes all pages contained in the set R and all pages that are connected
to R in some way by a direct link, so the base set contains R plus all pages that link to a page in R plus all pages that are linked to by some page in R, so we are trying
to boost up our set R by including the neighbourhood in some way because this is deemed as being the area that is connected to the topic we are looking for, so the IR system returns
the core of the topic we want to retrieve information about and by extending the set to the base set we try to include also all hub pages linking to these topic pages and so on, so this is the basic idea. So what can we do now with our base set?
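This root-set-to-base-set expansion can be sketched in a few lines; the graph representation (a dict mapping each page to the set of pages it links out to) and all names here are illustrative assumptions, not part of any real engine:

```python
# Sketch of the root-set -> base-set expansion, assuming the crawled web
# graph is given as a dict: page -> set of pages it links out to.
# All names here are illustrative, not from a real system.

def build_base_set(root_set, out_links):
    """B = R + pages linked from R + pages linking into R."""
    base = set(root_set)
    # add all pages that some page in R links to
    for page in root_set:
        base |= out_links.get(page, set())
    # add all pages that link to some page in R
    for page, targets in out_links.items():
        if targets & set(root_set):
            base.add(page)
    return base

# toy graph: a -> b, c -> a, b -> d; root set is {a}
graph = {"a": {"b"}, "c": {"a"}, "b": {"d"}}
print(sorted(build_base_set({"a"}, graph)))  # ['a', 'b', 'c']
```

Note that in a real engine the in-link lookup needs a link index over the crawl, since you cannot scan the whole web for pages pointing into R.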
Of course now we want to compute the hub and authority scores on this base set, so then of course the problem is how to define hubness and authority of a page and John Kleinberg
decided to do it in the following way, so it's very similar to the prestige, so again let A be our base set's adjacency matrix, so it contains all the links, so if page i links to page j then the entry at position i, j is one, as you might remember, and we define
two vectors, so the hub scores of all pages is a vector H and the authority scores are vector A, so there's a vector H with so many components as there are web pages and
again there's a vector A with so many components as also there are web pages and the ith entry of each vector is the hubness score or authority score of the ith page in your base
set. Okay now here the recursive definition of hubness and authority and the ideas similar to the page rank idea that those pages get a high authority that are linked by many
hub pages, so you might remember in page rank we just had the prestige vector p equal to some constant times the adjacency matrix times the page rank vector, and now we split up the page rank into hubness and
authority and make two definitions that relate both scores to each other, so on the other hand the hubness is defined in terms of authority so those pages are deemed as being good hubs that link to many authorities so that they link to pages that have a high
authority score, so now we have two equations and we want to solve it, we want to find the correct H and the correct A, yeah as I said so the authority score of a page is proportional to the sum of hub scores of the pages linking to it and the hub score
of a page is proportional to the sum of authority scores of the pages to which it links and alpha and beta are these proportionality constants that can be used, very similar to the page rank, so now we can combine these two equations simply put this equation into
this one and then we arrive at a recursive definition of A and H and now we're going to simply solve these equations by means of linear algebra and again we have the form
we know, this eigenvector thingy here this is simply a real number, this is some kind of matrix and the form is vector equals constant times matrix times vector and
so the authority vector A is an eigenvector of the product of A transposed times A and the other way around the hub vector is an eigenvector of the matrix A multiplied by A transposed, so it's a bit more complicated than the page rank because there we just
had the matrix A and not these products and again we can apply results from linear algebra to efficiently compute these vectors, things are not so easy as with the page rank algorithm, so in page rank we had the fact this due to the Perron-Frobenius theorem
that there always is an eigenvector to the eigenvalue one and that has some other nice properties, so this is not so easy in HITS, so Kleinberg had to make some decisions about which eigenvectors he wanted to take, and so he decided to simply take
the so-called principal eigenvectors in these equations, and principal eigenvectors simply means that these are the eigenvectors corresponding to the eigenvalues with the highest absolute value, so again this equation could have
several solutions, this matrix could have several eigenvalues and corresponding eigenvectors but Kleinberg decided that he only wants to take the eigenvector having the highest
eigenvalue, so it's just an arbitrary definition which seems to work in practice, so computation again is very similar to page rank that we had the power iteration that we simply multiply the matrices together and combine it with the vector and then we
finally arrive at the eigenvector, so we don't do that in detail here, we'd better skip to an example, so here's an example of how the HITS algorithm works and what results it produces, this is one of Kleinberg's own examples I think, so the query is Japan
elementary schools, so we are looking for elementary schools in Japan and now we see some pages on this side ordered by hub score and on this side ordered by authority score,
so good hubs that means page that link to many pages being relevant to the topic, we can only guess what's in there but here's some link collection of school home pages that also seems to be some page linking to many, many schools, education could also
be some kind of link collection, so looks quite good, looks more like link collection than like real content and which is exactly what hub pages are intended to do, so on the other hand the authorities, this should be pages being elementary schools in Japan
or having very important information about elementary schools in Japan and of course here's the American school in Japan, so I hope it's an elementary school, yeah the link page doesn't seem too authoritative to me, so it's maybe a hub page but this
seems to be pages of different schools, primary school, elementary, elementary school, so basically you see this divide into hubs and authorities seems to work quite good,
so it's not perfect and this example also isn't too convincing to me because I don't know Japan, but in general it seems to work, so at least we can now distinguish between link pages or link collections relevant to a topic and pages themselves being relevant to
a topic, yeah so this is basically the HITS algorithm and yeah, as is usual in the US, if you have invented something novel, something important which could bring you a lot of money, then you are going to patent it, and of course Jon Kleinberg did too, so
it's patented under this number and the title of the patent is method and system for identifying authoritative information resources in an environment with content-based links between information resources, so quite a title and the thing basically is that
he had to patent the system components, so usually you cannot simply patent a single algorithm but you always have to patent a system and this is what he did, so now IBM is the proud owner of this technology, I have no idea whether they use it in some way but I strongly believe that many search engines are using ideas very similar to HITS
when computing relevant scores, okay there is some connection also to LSI and singular value decomposition in the HITS algorithm, so remember in the HITS algorithm we need
to do an eigendecomposition of these two matrix products and essentially that's the same as doing a singular value decomposition of the original matrix, just a thing that
is quite nice to know if you are really working with these techniques and you have to decide which algorithm to use and yeah short recap from the lecture on LSI where we had a result that when we decompose our term document matrix into this product U,
S and V by using the singular value decomposition, then the columns of U are the eigenvectors of the product of the matrix with its transpose and the matrix S squared contains the corresponding eigenvalues, and similarly
the columns of V (the rows of V transposed) are the eigenvectors of the transpose times the matrix, so knowing this you could simply compute all the hubness and authority scores that you need in HITS by running
a singular value decomposition of the adjacency matrix of our base set and then by looking at U and V we directly have the vectors we are looking for in the HITS algorithm,
so there is some connection, so everything is connected in information retrieval yeah, so maybe nice to know if you needed some time. There are many extensions also to the HITS algorithm, so in the PageRank algorithm we had some extensions for example these topic
sensitive PageRank where we limit ourselves to given topics and there are many, many versions of doing the PageRank and of course the same is true for the HITS algorithm, so here the idea is to separate link communities, so for example it could be true that a query
is ambiguous, so for example Java could refer to the programming language or to the island, Jaguar could be a car or an animal, the queries could also be polarized
in some way, for example cold fusion or nuclear power, something like this, there is usually a large group of people who really like the topic, so are real enthusiasts about nuclear power, so maybe some large companies will have their link network and there will be some enemies of this technique and they also have their link network, so usually
it looks like that many, many pages all linked to each other in some way all about the topic of nuclear power, but usually the enemies won't link very often to people
who really like nuclear power or the other way around, so usually these groups are really separated in the link structure, but you can use the HITS algorithm to find these two different components, so the idea, as you remember, in HITS was just to take the
principal eigenvector of the link matrix and this would correspond to the largest link community, so if this would be a link community having many, many pages and many, many links and they would have some kind of related structure with authorities and hubs linked to
each other then the HITS algorithm would compute the hub and authority scores only basically for this network and all pages in here would receive rather low authority and hub scores because these pages are not connected very well to the large community,
so but if we do not only take the principal eigenvector, so the eigenvector having the largest eigenvalue, but maybe also the second or third largest, then we would come up with hub and authority scores with respect to these other communities and then we would
find there is the first communities of people who like nuclear power and here the second community of people who hate nuclear power and maybe some more communities and this way you can really analyse topics that are polarised in some way or there are different opinions where there are people around that usually do not talk to each other or do not link
to each other at least, so all the nice application of the HITS algorithm. So finally let's compare the HITS algorithm and the PageRank algorithm, so as I already indicated PageRank can be pre-computed because PageRank is not dependent on the query,
so there is usually only one PageRank per page and it does not depend on anything, it only depends on the link structure as such, so HITS in contrast has to be computed at query time, so if a user asks a query then you have to find this root set, do the
expensive analysis and then find the scores you need, and then you can use these scores for ranking your pages, which isn't very easy, so HITS is very expensive when you do it at query time. So of course you can also try to pre-compute HITS in a way similar to the topic-sensitive PageRank by pre-defining some large topics
and then computing the HITS scores could be an alternative, but HITS in the original is meant to be done at query time. So a second difference of course is the choice
regarding which pages to take into consideration or what exactly to model, so HITS models hubs and authorities separately, PageRank is about prestige in general and prestige usually
is a mixture of both, so if a page has high prestige then it could be a hub, could also be an authority, so PageRank scores really try to combine both ideas and in HITS they are kept separated from each other. So another difference is that HITS only works
on a subset of the web graph, so on this root set and then on the base set which usually is much much smaller than the large web graph, so it is possible or at least in theory to compute the HITS scores really at query time because the amount of data to be analysed is not as large as the PageRank algorithm, so in PageRank we
have to do these calculations on the whole web graph and in HITS it's much smaller usually. So yeah, when trying to compare HITS and PageRank, I found that on web
pages usually hubness and authority are kind of correlated, so there is, as I said, no clear separation into hub pages and authority pages, it's usually a mix of both, so for example a Wikipedia page for some topic definitely is an authority but they
usually provide a lot of links to relevant pages about the topic, so they try to cite their sources, so the Wikipedia page also is a good hub and this is true for most other web pages, so there is no clear separation into link pages and content pages, so at least on the web PageRank and HITS are very very similar, so depending on
what you are trying to do or trying to analyse and what kind of network data you have, you either could use HITS or PageRank or version of both or mix of both, depends on what you are going to do, at least you now know what you can do, what is prestige,
what is hubness, what is authority and then you try to use it for good applications. So in the next lecture that will be today, we talk about some miscellaneous stuff that remains to be done, we will now talk about spam detection, how to do meta search, how
to combine the results of different search engines and finally let's talk about some privacy issues related with web data. Alright, yeah, professors right on time for the lecture,
but first we are going to discuss the homework, so very briefly, what is a social network? Okay, I will skip that, it's too easy I think. Prestige in social networks, what is the basic idea, who can explain it in two sentences? What is the key idea of prestige?
Yes, I think you mean that links from high prestigious pages are better for your own
prestige than links from pages having low prestige, so that's basically the idea, okay, yeah and then you can use this for ranking. Okay, what is the co-citation graph and
what is an Erdos number? Anyone? Yeah co-citation usually means that two pages or two items
are linked if they appear together in the same context, so if in some paper there is a list of references, then all these pairs of references are connected to each other
because they seem to be related as they have been used in the same context, so and this could also be done for authors, so people who write a paper together could be linked and this brings us to the Erdos numbers, so what's an Erdos number? Yeah, so Erdos,
the famous mathematician, worked with a lot of people and for fun, mathematicians compute how far they are away from the famous Erdos, so next question, what is the Bacon number? Did any one of you take a look at it? Yeah, it's the distance to the famous Kevin Bacon,
yeah, so and what is Paul Erdos's Bacon number? No, that's the Erdos-Bacon number, that's
the sum, but what is the Bacon number of Paul Erdos? He has none? Actually, he has one, because he played himself in a documentary about his life and there were also other people
starring in it, and these people are connected to Kevin Bacon, and I don't know what exactly his number is, but Paul Erdos has a Bacon number, yeah, usually in this Oracle of Bacon the search is limited to non-documentaries, but if you include documentaries then even
the famous Paul Erdos has a connection to the famous Kevin Bacon, so all things are connected, yeah, six degrees of Kevin Bacon. Finally, the random surfer model, what is it and is it a reasonable model of how people surf the web? Random surfer, what does he
do? Yeah, really, really nice model, so is it plausible, is it possible, do people really
surf the web like that? So then why do we use this stupid model for page
rank? Yeah, it's easy to model, yeah, yeah, it works, yeah, I think the main idea is
that it simply corresponds to this recursive definition of prestige, so simply the idea that the prestige of a page is the sum of the prestige of the pages linking to it, and this directly corresponds to this random surfer model, so it isn't a very good way to model
actual behavior of people, but you don't want to model the behavior of people, you just want to model this prestige transfer in the network, yeah, and it makes a good matrix and you can apply the Perron-Frobenius theorem, yeah, so easy thing.
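This prestige transfer can be sketched as a small power iteration with a random-jump part; the damping factor 0.85 and all names are illustrative assumptions, and the sketch assumes every page has at least one out-link (dangling pages would need extra handling):

```python
# Minimal PageRank power-iteration sketch: the random surfer either jumps
# to a random page (probability 1-d) or follows a random out-link (d).
def pagerank(out_links, d=0.85, iterations=50):
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start with a uniform score
    for _ in range(iterations):
        new = {}
        for p in pages:
            # prestige flowing in over links; each linker splits its score
            incoming = sum(pr[q] / len(out_links[q])
                           for q in pages if p in out_links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# toy graph: a <-> b, c -> a; c has no in-links, so it only gets jump mass
pr = pagerank({"a": {"b"}, "b": {"a"}, "c": {"a"}})
```

In this toy graph a ends up with the highest score, since it is linked from both b and c, while c gets only the random-jump share (1 - d)/n.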
Oh, another question, what do you need the topic sensitive page rank for? Yeah, but for
this query then I would use the hits algorithm, wouldn't I? Yeah, so pre-computation is
the key here, just define some hot topics you want to cover and you can pre-compute your page ranks, and then, yeah, you can integrate this new knowledge
into your answer, it's always a good thing, okay, yeah, today we talk about many, many different things and this will be done by Professor Balke, I guess. So, hello also
from my side and we have a bit of work today, so I will hop right into the middle, and the interesting thing that everybody knows about the web is, A, 90% of the web is pornography
and, B, 90% of what you get from the web is spam, so the question is kind of how do
you deal with that, because it has become quite a common place problem and of course we've discussed search engines, search engines are kind of a dangerous thing, aren't they? Because they somehow restrict your view to what they show you, I mean in the early days
of the web you had kind of like this network of web servers where you could navigate through the network and everything is okay, but now we have seen that more often than not people directly use Yahoo, Google, Ask, whatever as an entry page, as a portal page to the web, so first they will only see what the web engine shows and then at least
at the entry point can start navigating, but they will usually not directly start navigating unless they have bookmarks or something like that and we have seen that even for well-known pages like Facebook or something, more often than not people type Facebook
into Google to get to the entry point, which saves them from remembering whether it's .org or .com, so in a way search engines have restricted our view to what they show us and of course that has
been exploited by some people that either want to show off their results or put their pages to the best potential, but of course also by malicious people that want to sell
us Viagra because they know the web is about porn, so you could do with a bit of Viagra and that is often called spamdexing, so what you want to do is you want to employ web search engines to promote your own site and it has nothing to do usually with what
you are actually querying or where you want to go, but as a spamdexer I think everybody should have Viagra, not just the people who type in Viagra, so the idea is I modify
the web to get high rankings of pages that don't deserve their rank and very often this is also called in the little bit more optimistic term search engine optimization, so you optimize your page to be indexed by the search engine correctly, and we have
seen some white techniques already with the robots.txt and how you can help search engines, but if you type the term search engine optimization into Google you get
advice on how to do it and what to do, and you get a lot of people offering to do it for you for consulting contracts or something like that. How do they do it? Well, nobody exactly knows how Google works or what the exact algorithm behind Yahoo is, but
you can do a bit of reverse engineering, you know it has something to do with the terms that occur on the page or what the name of the page is, whether there are images that are captioned with some topic or stuff like that, so we know a little bit
about that and that can be exploited by people using these loopholes. Of course, if the search engine creator or maintainer finds out that people start using it, they
will immediately cut down and say, well, you may not do this, this is a black technique now, but then the spammer finds something new, so it's like a race between the search engine providers and the spammers and one tries to outperform the other, it's kind
of interesting. Basically, there are two classes, one is the content spam where you alter a page's contents such that it gets higher ranks in terms of IR techniques and the link spam
where you alter the link structure between pages or deliberately create a good link structure between pages such that page rank and hits and similar algorithms are kind of tricked. One of the very early examples of content spam was that you just add some terms that are
interesting for the query in white color on a white background and then you can still show the spam content, the spam pictures and stuff like that. Very easy technique. It looks like a Viagra advertisement but has some words in it that the users can't see but
that the engine is going to index. Very obvious has been closed down at a very early stage. One of the points of link spam or typical example for link spams are so-called link farms where you kind of like have a couple of sites that interlink each other
and thus transfer prestige from one to the other and having a large enough link farm you can do a lot of things. So for the content spam you exploit basically the textual, the information retrieval ideas which means term frequency on one hand. So you repeatedly
place keywords that people search for into the text, into the head of the page, into the captions of images, into the link text and also on other
pages in the anchor text of links that point to your page. So if you have a link saying oh this is wonderful information about how to build a house or something, this link
and the meaning of the link is transferred to the site. If that is a Viagra site, tough luck. Still you get it. What is also very often done is that you mix content. So there's advertisement
on a page. We know it from Google, we know it from sponsored pages. Every time there's a small bar, you know, and then I place the advertisement. What if I copy a correct page, a good page that people like to visit and put only my own advertisements,
my Viagra postings in the sidebar? Nobody will notice and this can also be done. Well as a countermeasure of the search engine providers, it is usually assumed that at
the stage where we look for duplicate content, where we check whether the structure of a page is correct, a classification can also be done. So as soon as the word Viagra occurs and it's not the Wikipedia page on Viagra or the Pfizer page about Viagra but
it's kind of like sell, buy, whatever. Then you can deduce that this will be spam. So very often you have Bayesian classifiers that somehow learn to find out what is spam
and what is not. As far as the features go that describe a page and can be exploited there, that's kind of a little bit tricky. As I said, the similarity to other pages that
were already classified as spam can serve as a training set and then you train a Naive Bayes classifier or whatever it is, support vector machine that kind of tries to figure out whether it's spam. But also the degree of term repetitions. I mean a high term frequency
is a wonderful thing, but if a page has 90% of its terms occurring as Viagra, it cannot be a sensible page. So a too-high term frequency is a sign for spam. You can train
that also and work with natural language models where you look at texts, how often words occur and what is the normal distribution of texts. So for example, staying with the example of Viagra, it may occur in medical texts but for medical texts you usually know
what the word occurrence, what the typical word occurrence of drugs or kind of illnesses is and then you can use that information to find out whether something is spam or not.
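The too-high-term-frequency signal just described can be sketched in a couple of lines; the 30% threshold is a made-up illustrative value, and a real engine would combine many such features in a trained classifier rather than use a single cutoff:

```python
# Crude keyword-stuffing check: flag a text if one single term makes up
# more than `threshold` of all its words. The threshold is an assumption.
def looks_like_keyword_stuffing(text, threshold=0.3):
    words = text.lower().split()
    if not words:
        return False
    top = max(words.count(w) for w in set(words))
    return top / len(words) > threshold

print(looks_like_keyword_stuffing("viagra viagra viagra buy viagra now"))  # True
print(looks_like_keyword_stuffing("a normal page about tropical fish"))    # False
```

In a real Bayesian spam classifier this ratio would just be one feature among many, next to the language-model comparison against normal word distributions.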
One of the best examples of content spam is so called Google bombing. Google bombing works, exploited one of the loopholes of Google that were very clever in the design where Google said well if somebody links to a page, he will usually describe the contents
of the page together with the link. More information about tropical fishes can be found here, or there was this guy, Carl Friedrich Gauss, with a link, and then you know what the linked page is about. This was
exploited by Google bombing, where people that were kind of pissed about American politics in the era of George W. Bush started to build pages where the link text, miserable
failure, directly linked to the White House page with George W. Bush's CV, and so at some point Google started to index the whole thing and indeed typing in
the query miserable failure led as a first result to the White House pages with George W. Bush's biography and everybody knew he was a miserable failure. Fun. So obviously
we took this screenshot when this was kind of already known so it was the newspapers reporting about the thing and stuff but those were real pages about miserable failure
where people described something as miserable failure. The words miserable failure do not occur in the, obviously, in the biography of George W. Bush because he wrote it himself. Okay, this is what was very often called Google bombing and actually Google had very
big problems of getting rid of this phenomenon because if enough people independently of each other start doing this you basically have very hard time deciding whether this is really about tropical fishes and many people talk about them or whether this is
an attack of some kind because you don't see the coordination. Attacks are usually coordinated somehow and if I have one server that starts spamming I immediately know what's happening you know. If several servers start doing it then the thing becomes more complex and
a further aspect of content spam is that the user, the reader of a web page, is not
very happy about being tricked, and if I have the words, I don't know, building
a house or miserable failure 500 times on the page and then comes the Viagra ad, I'm not staying long on this page to think about getting Viagra. So the page must look like what it actually is, the Viagra advertisement, and it should not look like I'm being
tricked because then I'm not, I mean, spam is not really trustworthy at its best but if I'm seeing that I'm obviously being tricked by this page then I get even more
angry about it. So to keep up the chances of actually selling Viagra, spammers try a lot of interesting techniques. So for example, placing the text that you don't want users to see behind images on the web page, write the text in the background color,
font sizes of zero, using the text only in scripts and deleting it afterwards, delivering different web pages to web crawlers than are shown to the users.
This is very often called cloaking. So if somebody serves on your web page and says, well, I'm an indexer, I'm the Googlebot, then show him a totally different site. And once somebody says, well, I'm not a Googlebot,
then show him your Viagra advertisement. So just an if-then statement, nothing more. Very simple to do. Or so-called doorway pages, you could just immediately redirect the user. So the user won't have time to see the actual site that was indexed.
Good. Most of these techniques are obviously easy to detect. I mean, things like that cannot be on a sensible page. Why should somebody write about tropical fishes with a font size of zero?
It does not happen. So immediately exclude them. And of course, the spammers became aware of that. It's a little bit more difficult to deal with these two. But still, you can detect it to some degree. I mean, just pretend you are the Googlebot
and then surf to the site, the second time pretending you're not the Googlebot. But, ha ha, you are. Then you will see whether this is cloaked. On the other hand, you have to access every page twice. Or you have to train a classifier
looking out for something. Costs time. Costs time in indexing. Time in indexing that you need to recrawl the web to get a good index. So the more time you invest, the more effort you invest in actually finding out whether something is spam or not.
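The access-every-page-twice idea for detecting cloaking might look like the sketch below; fetch() is a stand-in that simulates a cloaking web server locally instead of making real HTTP requests, and all names are illustrative:

```python
# Cloaking detection sketch: request the same URL once as a crawler and
# once as a normal browser, then compare the two responses.
def fetch(url, user_agent):
    # stand-in for an HTTP request: this simulated server cloaks,
    # i.e. it checks the user agent and serves different content
    if "Googlebot" in user_agent:
        return "Everything about tropical fish ..."
    return "Buy cheap Viagra now!!!"

def is_cloaked(url):
    as_crawler = fetch(url, "Googlebot/2.1")
    as_browser = fetch(url, "Mozilla/5.0")
    return as_crawler != as_browser

print(is_cloaked("http://example.com"))  # True for this simulated spammer
```

Real detection is fuzzier, since legitimate pages also vary between requests (ads, dates), so in practice you would compare content similarity rather than exact equality.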
The better your index will be, but the staler your content will be. Trade-off. Also for the web search engine provider. Good. As I said, cloaking.
Typical is a request sent by a crawler. You can check for the IP address. Then send the cheat page. If it's not a crawler, send the Viagra ad. Very easy to see. Doorway pages.
If you create a page that is a wonderful answer for some query, immediately redirect to your Viagra ad. And if you do that with enough topics, you get a lot of Viagra pages for a lot of different queries.
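A doorway page of this kind, a page whose indexed content the visitor never sees because of an instant redirect, might look like the following sketch; both the HTML and the crude detection rule are illustrative assumptions:

```python
# A doorway page: query-optimized text plus an immediate meta-refresh,
# so the crawler indexes the text but a human is redirected at once.
doorway_html = """<html><head>
<meta http-equiv="refresh" content="0; url=http://spam.example/ad">
<title>Everything about elementary schools in Japan</title>
</head><body>elementary schools Japan elementary schools Japan ...</body></html>"""

def is_doorway(html):
    # crude detector sketch: an instant (0 second) meta-refresh redirect
    return 'http-equiv="refresh" content="0' in html

print(is_doorway(doorway_html))                       # True
print(is_doorway("<html><body>hello</body></html>"))  # False
```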
This is what you want. Actually, the interesting part is that this technique of doorway pages has recently been discovered by Google being used even by big companies
that try to do a little bit of search engine optimization. For example, BMW and Ricoh, so cars and copiers, have been banned by Google for some time because they used doorway pages, and they were totally pissed about that and Google just said, well, in our FAQs
and how you should prepare your pages to get a good ranking result in Google, we said not to use doorway pages. If you don't adhere to our policies, we can kick you out of our index and nobody will find you ever again.
And actually, BMW tried to sue, and there's just no way that they can sue themselves back into the index. Google has since put them in the index again, but I think they have learned their lesson. So sometimes the search engines
can teach the optimizers a lesson, but more often than not, spam is very hard to avoid. So talking about link spam, the idea is that the more in-links your page gets, the better its page rank will be, and the better ranked your page will be. If the in-links come from good pages,
from pages with high authority or high prestige, your page rank will start to rise. So what is very often done is so-called comment spamming. You have high-quality sites —
some news archives, for example the New York Times, or some newsgroups, some wikis that are highly ranked — and you insert into those pages, as a comment, a link to your page.
Since Google will not distinguish between comment and the original content of the page, but just see the structure, the words on the page, the New York Times will link to your Viagra site. So it must be important. Yeah?
Or the newsgroup that everybody reads in the Unix community will link to your Viagra page. Very difficult problem to get rid of. You can even automate that to some degree, writing bots that put comments into open wikis,
and we very often have the problem here at the institute that robots put spam links into our wikis, links to different pages, just to improve their page rank —
and we actually disabled the commenting function because it's just not valuable enough to justify the spam. One of the countermeasures is that you usually try to keep bots out with CAPTCHAs.
Who knows the concept of CAPTCHAs? Okay, well, it's basically you show a word in some distorted way or a certain number in some distorted way so it's not easy to OCR it, it's not easy to recover it,
but you as a human can see it. And then you have to type in this word before you can comment on something, which excludes bots from posting comments. Good idea. CAPTCHAs, as I said, well, it's kind of like sometimes you can read them, sometimes you cannot read them, but the idea is really it's images.
You can see it, but a robot cannot read it. And even if you give that to some automated OCR engine, I mean, I have problem reading that. I'm very much in doubt that it can be OCRed properly.
And as soon as your CAPTCHA answer is wrong, you just cannot comment. CAPTCHA is actually an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart: on one side are the bad bots that spam your comment features,
and on the other side are the commenters that are very valuable for your page, the ones you really want to have. And the Turing test between them is recognizing what is written in the CAPTCHA. However, the spammers already use countermeasures against it,
because if you know how the CAPTCHAs are generated — so I take the word and then I distort it in that way — then you can reverse engineer the process of generating the CAPTCHA: you undo the distortion,
and then the image is normal again and it's an easy OCR task to recognize it. It's also interesting that the intelligence of the crowd
and the Mechanical Turk and things like that have been used for solving CAPTCHAs: you just employ the Mechanical Turk workers
and say, well, people, please solve CAPTCHAs for me — every CAPTCHA you solve gets you one cent. So I pay 1,000 euros and I can spam 100,000 pages. That can be worthwhile, depends on how big your business is.
But of course, spam sells, so somebody is buying the Viagra, otherwise people wouldn't do it. Another method is so-called link farms. The idea behind it is that you have a large group of pages
that link to each other in some kind of a cycle and that when one page transports its authority to another, over some way it comes back and you get into this recursive cycle and it's kind of, you just profit,
you just benefit from each page that links to you and the respective prestige. What has also been done is so-called link exchange programs where you say, well, it's probably easy to recognize a link farm
if it's one server and kind of very local and they link to each other. That is obviously easy to detect. But what if I ask other people on other continents to give me some links? What if I get Wikipedia to get me some links?
Interesting, isn't it? Because then I can do a lot of things and Google has no way of finding out whether these links were deliberately put to a Viagra page or to some sensible page.
It's difficult. In any case, you have to create link patterns that look normal, because obviously the web search engines will look for abnormal structures, and once they find one, they will classify it immediately as spam and you are out.
You can also, if they use the HITS algorithm, for example, try to act as a hub and say, well, this is a Viagra spam page, but the links that I give are real. They point to pages that are really important,
that are really interesting. Well, so I place the ad and act as a portal which is well received by people. They're likely to buy my Viagra, and Google has no way of finding out, because you just benefit from the pages
that you link to, pages that are popular, that have a high prestige — and that's kind of it. So very often this is done by cloning directories or cloning Wikipedia or things like that.
Very often this is the way. On the other hand, if you have a higher hub score, the other pages that you link to will get a higher authority score. Also beneficial for them and very hard to find out what it is. One of the methods that uses this principle
basically is so-called honeypots. A honeypot is you put onto the page something that is sensible, that people like to read, that helps people and usually at the side of the page or whatever, you put your spam, you put your Viagra ad.
So for example, you have the Wikipedia pages that everybody loves. You just make a copy of Wikipedia, you can crawl it, it's easy. The content quality is very high and then you sprinkle it with a couple of Viagra ads or whatever you're trying to sell.
I mean, for the user, it's usually not too important which of the copies is used. And if you get one, two, three people buying your Viagra, it has been worthwhile, because obviously you don't pay for the content.
Wikipedia is open content, you can do that. And if people find it practical, that can be done. On the other hand, you don't even have to show your Viagra ad on the page. You can also hide links to your Viagra page on this page,
because then you have a page with high authority which is used by many people, because they don't care whether they use Wikipedia or the Wikipedia clone that you just copied.
And if that is used by many people, then placing the links there is a good idea of promoting the page rank of your spam pages as an authority. So this is also a boost.
So the idea is basically you have this Wikipedia or whatever it is, a copy, the people link to it because it's very nice and has wonderful content. The content is ripped off, but who cares?
And then you add some links to your Viagra site. This site gets a high prestige, thus this site gets a high prestige because it's transferred by the page rank, okay?
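The prestige transfer the honeypot exploits can be seen in a toy PageRank computation — a minimal power-iteration sketch with made-up page names, using the usual textbook damping factor of 0.85. Once the popular clone links to the spam page, the spam page's rank jumps.

```python
def pagerank(links, d=0.85, iters=50):
    """Plain power-iteration PageRank. `links` maps page -> list of
    out-links; dangling pages spread their rank uniformly."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - d) / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                for t in out:
                    nxt[t] += d * pr[p] / len(out)
            else:  # dangling page: distribute its rank over all pages
                for t in pages:
                    nxt[t] += d * pr[p] / n
        pr = nxt
    return pr

# Many honest pages link to the popular "clone" page; "spam" is isolated.
base = {"a": ["clone"], "b": ["clone"], "c": ["clone"],
        "clone": ["a"], "spam": []}
# Same web, but now the clone also links to the spam page.
boosted = dict(base, clone=["a", "spam"])

pr_base, pr_boosted = pagerank(base), pagerank(boosted)
# The spam page now inherits half of the clone's (high) rank.
```

The same mechanics explain link farms: any structure that funnels rank from well-linked pages into the target raises the target's score.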
Well, that can be done. What is also fun is to look for expired domains. There are some pages that are popular for some time
and often linked to, and at some point the owner of the page — the maintainer — may decide not to prolong the domain, not to use it anymore, because it has been hosted by some provider and you don't want to pay for it anymore,
or you move to some other site or you switch providers or whatever. Still, the domain that just expired may still have a high page rank. So look for expired domains, buy them,
put links to your Viagra page onto them, benefit from their high page rank, benefit from their prestige. Before everybody sees they are outdated, you have benefited just for buying a domain.
Doesn't cost much. So that's also kind of a fun idea. Well, link spam is even harder to detect than spam by the usual textual measures, because web pages and websites — so, clusters of pages — can show some regular patterns
but can be arbitrarily different depending on what you want to read and how creative you are and how chaotically it has been created. And the point is if you create a text chaotically,
nobody's going to understand it so it reflects on the quality. If you create the navigation structure of a site chaotically, it doesn't influence very much because people can navigate it. So while creating some text
in an unusual fashion really lowers the value of the text and you can easily recognize that as a web search provider and say, well, this cannot be sensible because it doesn't adhere to what is normally understandable by humans.
It's much harder to say, well, this page's structure is unusual, it doesn't correspond to the usual navigability by humans. That's very difficult to see. So this is kind of the idea. In general, you have some heuristics where you can say,
well, if the anchor texts of a lot of in-links look the same, then this is probably some Google bombing taking place. So for example, the famous "miserable failure".
If you find, doing your shingling, that you have copies of high-quality content, look very closely at what is the original and don't count the links from pages
that you decide are not the original. That kind of detects the honeypots. And of course, it's always good to add some manual effort by creating a white list of pages that you know are good, that are not spam, and trying to figure out the link distance
from every other page to this page. I mean, we know that everything is kind of connected over six links, but nevertheless, if something is very close to one of these good sites, it's probably not a spam page. If something is far apart from the good site,
well, it might be a spam page. All heuristics, all just help you to find something. And more often than not, you will delete good pages, and more often than not, spam pages will not be detected. But it's better than just accepting everything
at face value. So from the perspective of a page creator: of course you, too, want a high page rank for your tropical fishes page or your how-to-build-a-house page — because, I mean, it's very important to talk about tropical fishes, obviously. As long as you don't want to sell Viagra,
invest the time in creating good content, because the better and the more natural your content, the higher the page rank will be — the higher the Google ranking, Yahoo ranking,
whatever ranking will be, because Google, Yahoo, and the others invest a lot of time, a lot of effort in detecting spam. As soon as something looks fishy, you might be punished. So that's kind of the idea. The costs of cheating search engines
are usually higher than the benefits. I have no idea what drove BMW, for example — kind of the leader of its market segment — to try doorway pages to increase their page rank. I mean, people looking for a BMW
will more often than not end up on bmw.de or bmw.com anyway. Good, so create high-quality content. A link should be a recommendation, so adhere to that too. Build a crawler-friendly website,
also with a good robots.txt, and you can use white-hat techniques. So last time, we very briefly discussed Google AdWords, where you can actually pay for your page being ranked higher if it is a commercial page, and then you will not be banished from the ranking altogether.
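The two anti-spam heuristics mentioned above — shingling to spot near-duplicate honeypots, and link distance from a white list of trusted pages — can be sketched like this. A toy sketch: the shingle width, and the six-link cutoff echoing the "six degrees" remark, are illustrative choices, not values from any real engine.

```python
from collections import deque

def shingles(text, w=4):
    """Set of overlapping w-word windows (shingles) of a text."""
    words = text.lower().split()
    if len(words) <= w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard resemblance of two texts' shingle sets; near 1.0 means
    one page is probably a copy of the other."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

def link_distance(links, whitelist):
    """BFS distance (in links) from the trusted pages to every page
    reachable from them; `links` maps page -> list of out-links."""
    dist = {p: 0 for p in whitelist}
    queue = deque(whitelist)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist

def probably_spam(page, dist, cutoff=6):
    """Heuristic only: far from (or unreachable from) every trusted page."""
    return dist.get(page, cutoff + 1) > cutoff
```

As the lecture stresses, both are heuristics: they will miss some spam and occasionally flag good pages, but that beats taking every page at face value.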
So, here we go — let's look at some investigations of how search engine optimization works. Actually, there's a way to have a lot of fun with search engines,
so-called SEO contests, search engine optimization contests, and they are started either by private people or by some magazines or companies who just want to find out how search engines work. So this is a contest started
by the German computer magazine c't. Some years ago, they started a contest and invented a new animal, the "Hommingberger Gepardenforelle". So what's "Gepardenforelle" in English? It's a cheetah and a trout — a cheetah trout, so it looks like this.
So the idea is to use some term that has never been used on the web before — if you had entered this term into Google when they started the contest, no pages would have been returned — and the task was, within half a year or so, to create pages about the Hommingberger Gepardenforelle,
and then after half a year, the people at c't magazine would pose some queries to the major search engines — Google, Yahoo, MSN, and so on — and look at which pages are returned at the top of the list, and the people who created these pages
get some kind of prize from c't. So a lot of people participated in this contest and tried to build really nice pages about the Hommingberger Gepardenforelle, created nice pictures — here you can see the URL where this picture is from, hommingberger-gepardenforelle.de — and they tried to find out
what a page must look like so that search engines think it is really rich in content and has a really high prestige and so on. So we can take a look at the Wikipedia entry, which summarizes the contest a little bit.
So the contest actually ran from April to December 2005. The article is only in German, because it was a German contest and there's no English page, but what's quite interesting to see is the number of pages
that have been returned by the different search engines over time. At its peak, in October 2005, there were several million pages on the web about the Hommingberger Gepardenforelle. So a lot of people had a lot of fun doing this and created pages, and of course, building millions of pages
is not a manual task — they created scripts that automatically build link farms and all sorts of things about this crazy animal, and you can see how many pages there have been. Take a look at the article, it's pretty nice. So this has been the winner according to Google.
Let's have a look at that. It really looks nice — it seems to be an article explaining what the Hommingberger Gepardenforelle is, where it lives, how much it costs when you try to buy it at the market, nice recipes, links —
really a nice-looking private website with all the information about this topic, by a Forellenzüchter, a trout breeder. So this is what a page must look like if you want it ranked high in Google. Very interesting to see.
Another contest has been "Schnitzelmitkartoffelsalat" — Schnitzel with potato salad — started in some German newsgroup. There wasn't any prize money, it was just for fun. Take a look at it, really a nice thing.
There was also an English contest. Maybe we should look at that too — what was the search word? "Nigritude ultramarine"? Yes, something like this; I haven't found much information about it.
So that's one term for it. There have been different ones —
"seraphim proudleduck" and so on. The idea is to just create some new words, something like that. This is not only a German thing. Nigritude ultramarine, exactly.
So you will find it if you're diligent enough, and there was a lot of attention for that. Actually this is mostly fun
and not really scientific, but a good way to try out different ways to build websites and see how search engines react to them — it gives some new insights. Google also has some hints for webmasters on how to build good websites,
and essentially these are quite intuitive hints: make a site with a clear hierarchy and text links; every page should be reachable from at least one static text link; keep the links on a page to a reasonable number.
So a lot of hints Google collected for creators of websites. So basically, all these hints sum up to build websites that have good content that are made for humans, and then Google will have no reason to hurt you because Google obviously is designed,
or the ranking in Google is designed to detect exactly those pages built by humans with good content for other humans, and if you follow the rules of building these websites, then you will get a high ranking by Google. So no tricks necessary. Usually, focus on creating good content.
That's usually the best thing you can do, and you don't have to worry about anything. All right, let's have a short break for a moment, and then we continue with how Google builds its data centers.
So then let's continue with our next detour. It's about what hardware you can use for large-scale web search, for example, as Google does it. So the most surprising thing is when you try to find out how Google builds
their data centers is that they do not use some fancy, modern, large supercomputer, as you read from time to time that HP or IBM or Cray has just created a big new machine for computing weather data or doing nuclear testing
on a theoretical level or doing simulations. No, Google takes a completely different approach. They do not use a single large machine, or even a small number of large machines. They basically use really crappy hardware — crappy hardware meaning computers as you can
buy them at every store. So very, very cheap and very unreliable hardware. They plug and switch them together in some way, and it just works. So people are not really sure how Google does it, but that's Google's idea how a data center
should look like. Use very, very simple machines, plug them together in a systematic way, in a fault-tolerant way, and then you get very high performance for very little money. So what's exactly done at Google has been a large secret for many, many years,
but from time to time, you get some information. For example, in 2007 and 2009, there have been some presentations about Google's hardware, and since then, people know, yeah, approximately what Google is doing. So as I said, Google uses only custom-built servers, so they buy standard hardware
and build standard servers from it and connect them via network. So of course, as they do not have very large machines, but very small ones, they need a lot of them. So actually, although Google does not sell its servers,
Google, in fact, is the world's fourth-largest server producer — roughly on par with HP and other big companies who build servers for a living — but Google is building all these servers only for itself. So in 2007, there was an estimation
that Google operated about a million servers then. I'm pretty sure that four years later, in 2011, it's many, many more. These servers are distributed all over the world. In 2007, there were 34 major or minor data centers all over the world.
So of course, if you have some users in Asia who want to query Google, you do not want to transfer all the query over to the US and send the answer back. You want to be able to process the query results directly in Asia or in South America.
So Google's servers are distributed all over the world and connected in large networks, with much of the data replicated all over the world for reasons of performance. As I said, they are connected by data connections —
massive fiber lines, usually. And the funny thing is that about 7% of all internet traffic is generated by Google alone. So Google's traffic is a large share of everything that's happening on the internet, and Google owns a lot of lines itself.
So 60% of the traffic that Google is having with their customers is going completely over Google's own lines. So of course, this is much cheaper than having to rent lines from some global providers.
So Google has many lines and is spending a lot of money in web infrastructure, and again here, if Google was an internet service provider, such as Deutsche Telekom, Google would be the third largest global carrier.
So they are doing it all for themselves, but they are really, really, really large. And by doing it themselves, they can provide high performance, high quality for a low price and have full control over their own network.
So here are some facts about what the Google data centers look like. Again, in 2007, they created four new data centers; the costs were about $600 million. So it's really, really expensive doing this, but as we have seen some weeks ago when
I talked about AdWords, Google can easily earn this money back with this infrastructure. The costs of operating the hardware and software have been estimated at $2.5 billion in 2007. Again, it is really, really expensive.
It's also expensive in terms of energy consumption: each data center has an energy consumption of 40 to 50 megawatts. For comparison, the whole region of Braunschweig — all the people living here and all the industry —
has an energy consumption of 225 megawatts, which is roughly double that of Google's largest data center, in Oregon. So by energy consumption, Google's data centers are like small cities.
And of course, they need their own power plants and power stations — that's also what Google is doing. They're basically trying to do everything they can by themselves. They don't want to rely on third parties to provide energy, lines, or hardware. They want to have full control over their own business.
Okay, some more facts about Google's servers. They build their servers into large racks, and each rack usually contains 40 to 80 commodity-class PC servers — standard hardware with a custom-designed Linux.
So they don't have any expensive license fees to pay. They use their own Linux, and they use a specialized network file system where large numbers of servers are connected and it looks from the outside as if it were one big single file system. And data is automatically transferred
from machine to machine. And the hardware is slightly outdated — that's because slightly outdated hardware usually is a lot cheaper than modern hardware. There's this kind of price curve you might know from CPUs:
the very modern ones are really, really expensive, and if you wait a year or two, then prices are more reasonable and correspond better to the performance you really get. So they try not to use the most up-to-date stuff,
but try to optimize their cost balance. Okay, each of their servers here, some pictures below, has some 12-volt battery connected to them because Google found out that power supplies are usually unstable in a way
and just using a battery as additional power source can counter some fluctuations in supply. Yeah, so they don't need any specialized cases for their hardware. In fact, they use standard shipping containers.
Just as you would use them on trucks or on ships in harbors — they are simply fitted with all this hardware and can be deployed anywhere in the world: just put power in and large network cables,
and then things really work. These containers are custom-made at the Google headquarters and then shipped all over the world, where they can just be used. Okay, of course, since they use very cheap hardware, Google's servers tend to be really, really unstable, but have a lot of performance for very little cost.
So that's basically Google's idea: a high bang-for-the-buck ratio. And here are some typical events that occur in the first year of a new data cluster, of a new data shipping container. There will be an overheating every two years,
approximately, so that this will power down most machines in the new cluster for about five minutes, and then you need one or two days to recover to repair your hardware, so this can happen.
Then your power distribution units will fail, and 500 to 1,000 machines are suddenly not there anymore. These are events that happen, and usually, in some company infrastructure, if such a large number of machines just breaks down, then your whole infrastructure isn't working anymore.
But Google's clusters are designed that they can easily handle those events. So even if thousands of machines suddenly disappear, Google is still working. You can still post queries, and queries get answered very efficiently.
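The reason thousands of machines can disappear without queries failing is replication. Here is a toy sketch of the principle — illustrative names only, not Google's actual GFS or Bigtable API: every chunk of data is written to three servers, so any one or two failures still leave a readable copy.

```python
import random

class ChunkStore:
    """Toy sketch of replicated storage: each chunk is written to
    `replicas` distinct servers, so reads survive machine failures."""

    def __init__(self, n_servers, replicas=3):
        self.servers = [{} for _ in range(n_servers)]  # server -> {chunk: data}
        self.alive = [True] * n_servers
        self.replicas = replicas

    def write(self, chunk_id, data):
        # Place copies on `replicas` distinct live servers.
        live = [i for i, up in enumerate(self.alive) if up]
        for i in random.sample(live, self.replicas):
            self.servers[i][chunk_id] = data

    def fail(self, server):
        self.alive[server] = False  # a cheap machine just died

    def read(self, chunk_id):
        # Any live replica will do.
        for i, up in enumerate(self.alive):
            if up and chunk_id in self.servers[i]:
                return self.servers[i][chunk_id]
        raise IOError("all replicas lost")

store = ChunkStore(n_servers=10, replicas=3)
store.write("index-part-0", b"tropical fishes")
store.fail(0)
store.fail(1)  # two machines die; with three replicas, a copy survives
assert store.read("index-part-0") == b"tropical fishes"
```

In the real systems, a master additionally tracks which servers hold which chunks and re-replicates them in the background when a machine dies — roughly the machinery a distributed-databases course covers.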
Sometimes an entire rack gets moved — again, thousands of machines are down, and you need to rewire your network, removing machines one after another over many days. And racks go crazy about 20 times a year.
Again, hundreds of machines disappear. You need some hours to get back to the normal state. You need to do network maintenance. Your routers have to be reconfigured. Your routers do not work anymore, and so there are many, many things that happen,
and there is almost no day without a major incident in such a data center. But as I said, the fascinating thing is that Google keeps on working even under these conditions. So if you just remove an entire data server,
the Google infrastructure as such won't notice. So of course, if you use standard hard drives, they will fail many, many times. And essentially, you have some people running around in your data servers doing the whole day nothing but switching hard drives, nothing but switching your power supplies
and your machines, doing maintenance operations, because yeah, there are so many problems in standard hardware that just occur and need to be fixed, but it can be done. So the main challenges that Google has in the infrastructure is the need to deal with all these hardware failures
that you won't have with large enterprise-scale servers, but that you will definitely have with all this crappy and cheap hardware. At the same time, you need to avoid any data loss. You don't want your index to disappear or disintegrate completely, because then you cannot do business anymore. So you must be highly tolerant to all these kinds of faults.
You need to guarantee global uptime. As we said, there are many hundreds and thousands of queries to Google every second; if Google is down for a minute, then there will be many thousands of customers who are really pissed off and will go to another search engine. And you want to decrease
your maintenance costs to a minimum. You don't want very expensive repairs on your machines — you just want to switch a hard drive by pulling the old one out and pushing the new one in, and then things should work. You don't want any expensive reconfiguration. You want to be able to extend your data centers
just by building a new data center, plugging the network cable in and the power, then the whole system should be able to reconfigure itself and use the new hardware and new possibilities that are there. So at least no manual configuration necessary. Everything should be done automatically
by your infrastructure. Okay, what's the solution to all these requirements? Use the cloud. Cloud technologies are very flexible, highly distributed, and high-performance. As I already indicated, this is done by the Google File System
and the Google Bigtable storage system — systems that are optimized for distributed environments and that really look like a single file system as you know it from your home PC. How this really works, and why these two guys are always smiling,
this can be learned in the lecture Distributed Databases next semester, which I can definitely recommend. It will be held by Christoph Lofi. We will come back to that later, but if you want to know what Google's hardware and software secret is, then this is a lecture
you definitely want to attend. All right, next topic is meta-search. So meta-search is one of the points where you say, well, maybe Google kind of has a special way
of delivering their results or calculating their results and Yahoo also has a special way and Bing has a very special way. So why do I trust one of these engines and not the others?
And why can't I connect their strength and kind of wipe out their individual weaknesses? This is basically the idea that if you have different ways of ranking the content
and each has individual strengths and weaknesses, then the combined results of them, the combined effort of them should be better than each individual. So it's a little bit like what we did with the wisdom of the crowds. If I have a thousand people stating something
and I take the average, then everything is wonderful because I know how many beans are in the pot or how big the ox is or whatever. And this is exactly the same idea. So I have the query and I pose the query
to the so-called meta-search engine, which distributes the query to the different search engines, collects the results, re-ranks them somehow, and gives them back. So this is kind of the idea. And one problem that the meta-search engine faces is that
it has to rely on the underlying engines, because obviously Google is not allowing any meta-search engine to reach into the index and say, well, let's see what kind of page this is, how high the term frequency was,
or how high the ranking factor transported by the page rank was, or something. It just gets the pure ranking: this is the best match, this is the second-best match, and so on. There used to be, with Google, some of these numbers — 0.8 or 0.875, whatever —
that kind of graded the ranking by assigning some value. But almost all search engines have given up on that, because these numbers mean nothing beyond the pure ranking. And so it's not important information for the customers,
so why show it? So you're not able to exploit the internal information of a search engine. Thus, the problem can be defined quite mathematically: we have a set of ordered lists
returned by the underlying search engines, and we have to integrate these rankings into one list, and then return this one list. How do we integrate rankings?
I mean, each of the search engines has a best candidate. Which of them is the best one? Any ideas?
Take it from the best search engine — so always ask Google? But then why do meta-search at all? Yes? Maybe you cross-validate the results
by looking where a page was ranked in the other search engines, yes? Or maybe you have some topic-driven preferences for taking one engine's result over the other's.
Yeah, it's difficult, isn't it? It sounds like an easy problem, but it's difficult to actually do it. And actually the problem is not too novel. So social choice theory and voting theory has been around since the Middle Ages.
And the problem of building a fair voting system is as old as democracy. So we should have some idea of what a good system should grant us.
And I guess there are a couple of things that we can agree on. For example, so-called Pareto efficiency. If we say that we have two pages and the one page is ranked higher by all our engines than the other page,
there should be no way that it is ranked lower in the aggregate ranking, right? Sounds good. I mean, what reason can there be? Choosing one search engine doesn't work
because it ranks it higher. Counting where it is in the other search engines doesn't work because they all rank it higher. Even if you go topic-driven or something, it ranks it higher. Interesting. Pareto optimality. So we can kind of agree on that.
Non-dictatorship. So always choose Google. Sounds like a good idea, but that's not meta-search. So non-dictatorship means we should not always choose one engine to state the result,
but the other engine should at least have some effect. Can we agree on that? Sounds good, doesn't it? Last one I want to talk about is the independence of irrelevant alternatives. So I decided for some ranking between page A and page B.
Now I add page C into the picture. Then page C can be ranked higher than both, lower than both, or in between them.
We have several possibilities of assigning a rank to C. But the mere existence of C should not change our previously derived ranking of A and B. Either A is better than B or it is not. That's not dependent on C, is it?
It's often called the independence of irrelevant alternatives. So can we build an algorithm for re-ranking our pages that kind of takes all these things into account?
Can anybody think of one? Any ideas? Nobody can think of an algorithm? Why is that?
It's not only complicated, but it's impossible. Even with these little small characteristics that all sound so familiar and all sound so sensible, it cannot be done.
And there's actually a mathematical result about that. It's Arrow's impossibility theorem, and he's smiling in such a way because he told all the voting people that they can go do something different, because it's just impossible to design any ranking scheme
based on different input sources that realizes Pareto efficiency, non-dictatorship, and the independence of irrelevant alternatives.
You will always violate one of these properties, and that can be proven. Interesting, isn't it? So whatever we do, it will have weaknesses. We have to live with it. On the other hand, this is not rocket science.
There will be no nuclear explosion if our ranking is slightly off the mark, or if there is a little bit of dictatorship, or it is not quite as Pareto efficient as we would want. So we can relax some of the requirements,
and I will show two basic ways of doing it that are very often used. One is the majority rule, the other is the Borda count, which are actually quite old. Borda was when? Around 1770, the late eighteenth century.
So these are very old systems, and I will show their strengths and their weaknesses, and then we'll see. So let us assume for the minute that we have, I don't know, like three or five search engines,
and every search engine ranks every page. Obviously not true, because they use different crawlers and may crawl different parts of the web, but let's just assume that for a minute, okay? Then the first one, the majority rule, just says, well, if I have a pair of pages, so A and B,
then I will ask every search engine whether A is better than B, or B is better than A. And since this is a total ranking that search engines provide, every search engine has an answer for that, okay?
If I take an odd number of search engines, the answer is even unique, because the vote can obviously not be split evenly. So assume engine one ranks A higher than B,
engine two ranks A higher than B, and engine three may rank B higher than A, doesn't really matter, because we have two engines ranking A higher than B, and only one engine ranking B higher than A. Thus, A in the aggregate ranking
should be higher than B, okay? Let's do the same for C. So A and C, and A and C, and A and C. So all engines agree that A is better than C. So A is better than B, and A is better than C. Still, we have B and C. Let's see about it.
B is better than, oh, I take a different color. B is better than C, B is better than C, and only one engine says that C is better than B. So again, two to one, B is better than C, that's it. Okay? Nice aggregate ranking. However, things are not that easy.
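The pairwise majority rule just worked through can be sketched in Python. This is a minimal, hypothetical sketch, not code from any real engine; the page names and the three engine rankings are made up to mirror the example (two engines prefer A over B, one prefers B over A):

```python
from itertools import combinations

def majority_preferences(rankings):
    """Pairwise majority rule over several engine rankings (best page first).

    For every pair of pages, each engine casts one vote. With an odd number
    of engines the outcome for each pair is always unique, since the vote
    cannot be split evenly."""
    positions = [{page: i for i, page in enumerate(r)} for r in rankings]
    winner = {}
    for a, b in combinations(sorted(rankings[0]), 2):
        votes_for_a = sum(1 for pos in positions if pos[a] < pos[b])
        winner[(a, b)] = a if votes_for_a > len(rankings) / 2 else b
    return winner

# Three hypothetical engines, as in the example above:
engines = [["A", "B", "C"], ["A", "B", "C"], ["B", "C", "A"]]
print(majority_preferences(engines))
# two engines rank A over B, all three rank A over C, two rank B over C
```

Note that the function only answers pairwise questions; as discussed next, stitching these answers into one total order is where the trouble starts.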
I mean, I wouldn't have shown you Arrow's impossibility theorem before if it were so easy to just take a majority vote. The big problem of majority votes are cycles. And to show you what happens,
I do it for you once again. These are the three search engines, okay? And let's first focus on A and B. A is better than B, A is better than B. Two against one, A is better than B. Okay? Good, let's look at B and C.
B is better than C, B is better than C. Two to one, B is better than C. Let's look at the last one. C and A, C is better than A, C is better than A. Two to one, C is better than A.
Okay? A is better than B, which is better than C, which is better than A. Well, we have to live with it. Actually, the literature lists a lot of methods for dealing with these cycles,
how to break them and how to nevertheless decide on a ranking, and so on. But it's an inherent flaw. It happens, and you have to figure out ways to deal with it. Let's look at the other one, the Borda count. The Borda count is actually acyclic. It avoids cycles.
And the idea of the Borda count is that you allow ties in the final ranking by letting each search engine cast a vote
weighted by some number correlated with the rank. So if I have a search engine, the first ranked result gets the highest score. If I just take the number of documents
that I have as possible scores, so every document gets a different score, I would just assign a three to the first ranked. The second ranked gets a two, the third ranked gets a one, since I just have three documents. If I used Google, I would start with 60 billion
and then count down to one. I do the same for the other search engines, again one, two, three, and again two, three, one. Good. Now, to get the ranking for every page,
I will just look at their scores and sum it up. So A has three, three, and two, gets a total of eight.
Yes. Well, in the beginning, we kind of assumed that every search engine ranks every page.
Thus, the totals would be comparable. This obviously is not valid; for example, we know that Yahoo indexes fewer pages than Google. Dealing with that could be done in such a way
that you say, well, we will just consider the first hundred results of each search engine. Because how many results are you going to show in your final ranking? Probably just a couple of pages. If we assume every result page has 10 entries and you want to finally create 10 pages,
then we need, modulo duplicates, about 100 documents from each search engine. And then we start counting back from 100 to one. And everything that is below rank 100 in every search
engine gets a zero. Can be done. So of course, you have to normalize the search engines against each other. Otherwise, you will get a dictatorship if one search engine assigns much higher scores. Just assume that you have a search engine that
ranks 10 times the number of pages of any other search engine. Then, if you simply use the raw number of ranked pages as scores, you would need 10 other engines just to contradict this one engine. That's not sensible.
But you could do it like I said. You do the same for all the other pages. So B has a 2, B has a 1, B has a 3; 2 and 1 is 3, and 3 is 6. That's it. And finally, you have the C. C is 1, C is 1, C is 1.
That gives it a 3. The advantages of the Borda count: it's obviously very easy to compute. Just take the rankings, assign the rank numbers, add them up. That's it.
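The Borda count just described can be sketched like this. Again a hypothetical sketch with made-up rankings, assigning score n down to 1 within each engine's n-page list:

```python
def borda_count(rankings):
    """Aggregate rankings (each a list, best page first) via the Borda count.

    Within each engine's list of n pages, the top page scores n, the next
    n - 1, and so on down to 1. Totals are summed across engines, and the
    aggregate ranking sorts by total score; ties are possible."""
    totals = {}
    for ranking in rankings:
        n = len(ranking)
        for i, page in enumerate(ranking):
            totals[page] = totals.get(page, 0) + (n - i)
    return totals

# Three hypothetical engines ranking three pages each:
engines = [["A", "B", "C"], ["A", "B", "C"], ["B", "A", "C"]]
totals = borda_count(engines)
print(sorted(totals, key=totals.get, reverse=True))  # aggregate ranking
```

With these inputs, A collects 3 + 3 + 2 = 8 points, B gets 2 + 2 + 3 = 7, and C gets 1 + 1 + 1 = 3, so the aggregate order is A, B, C.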
It can also handle pages that have not been ranked by all the engines. Just assume such a page is somewhere very deep in the ranking, so it will get 0 as a score. Or you could say, well, maybe the page is just not indexed by the engine, so I will assign it some medium score
if I take the first 100 documents of each search engine. I could say, well, if it's not among the first 100, then I will give it a 50, or I don't know, a 20 or something like that. Something to make sure it is not
dropped just because it's not indexed, but that also doesn't promote it. Can be done. You can have ties in the aggregate ranking, and you don't have the cycle problem, which is good. And you can also weight the individual engines.
So if you say, well, I trust Google more than other engines, then I can add some 10% or something like that to the Google voting, to the Google assessment. And it kind of is not a dictatorship, but it will fold into the common result set.
Can be done. Disadvantage? Of course, if you just assign these rank numbers, you assume that the degradation along the ranking is uniform.
Assume that one engine has a total favorite and says, well, this is the best page ever, and all the other pages are crap. Then this page should be promoted much more strongly
over the second ranked page than if the engine says, well, actually, I can't distinguish between these 10 pages, I have to put them on one result page, so I will just use a random order; they're on the first result page anyway.
There should be a difference, but with the Borda count, you cannot tell. As I said, it's kind of difficult for all these voting, ranking, and re-ranking schemes to actually work. So Borda count, majority vote: very, very difficult to say.
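To make the different sensitivities concrete, here is a small sketch with four hypothetical pages: swapping two adjacent pages in one engine's list changes the Borda totals, while every pairwise majority decision stays exactly the same. The helper functions and rankings below are illustrative, not from any real engine:

```python
from itertools import combinations

def borda_totals(rankings):
    """Borda scores: within each n-page list the top page scores n, down to 1."""
    totals = {}
    for r in rankings:
        for i, page in enumerate(r):
            totals[page] = totals.get(page, 0) + len(r) - i
    return totals

def majority_winners(rankings):
    """For every pair of pages, the page preferred by most engines
    (assumes an odd number of engines, so there are no ties)."""
    pos = [{p: i for i, p in enumerate(r)} for r in rankings]
    return {
        (a, b): a if sum(e[a] < e[b] for e in pos) * 2 > len(pos) else b
        for a, b in combinations(sorted(rankings[0]), 2)
    }

before = [["A", "B", "C", "D"], ["A", "C", "B", "D"], ["B", "A", "D", "C"]]
# Engine 3 swaps C and D, a single position:
after = [["A", "B", "C", "D"], ["A", "C", "B", "D"], ["B", "A", "C", "D"]]

print(borda_totals(before), borda_totals(after))            # C's total changes
print(majority_winners(before) == majority_winners(after))  # True
```

The one-place swap moves C from 6 to 7 Borda points and D from 4 to 3, yet every pairwise majority (including C versus D, still won by C) is untouched.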
And as I said, if you have different search engines, so for example here I take three search engines and their results, the results of Borda count and majority vote may be different. For example, if you take the Borda count, we add the four here, the three, the two, the one,
the four, the three, the two, the one; four, three, two, one. Then for A, for example, I have the four plus the three plus the two, which makes a nine, and so on.
Let's look at the majority vote. So for example, A and B: we have A before B here, A before B here, and B before A here. Two to one, A is better than B. The Borda count makes the same comparison via its scores.
They just disagree here, the majority vote and the Borda count. There's nothing you can do about it. And as I said, cycles are a problem that does not occur in the Borda count. However, the scores are often pretty close in the Borda
count anyway. So just shift a page one place more or less: add a couple of documents and put the C deeper in this search engine, and its score becomes worse. That will not change the majority vote at all,
I mean, not this part of the majority vote, but it will change the Borda count drastically. Same here: shifting the B just one step, just exchanging it with the D,
introduces a change in the Borda count ranking, but doesn't change the majority vote at all. They both have their stabilities, and they both have their faults, their drawbacks. Decide for either of them. There's nothing you can do, because Mr. Kenneth Arrow
wanted to be clever and proved the impossibility theorem. Now you know. Good. So sometimes it's very helpful, to see if one engine performs better than another, to find out whether they agree on something.
So if you have a gold standard and say, well, let's say Google is a good ranking and I want to see if my engine is better than Google or similar to Google or whatever it is, then you need the degree of agreement between them as a measure. And actually, that measure has been built.
It's called Kendall's Tau. Very, very popular, very, very often used to compare rankings that are created by different engines or that are created against some gold standard
where you say the gold standard is perfectly correct. And whatever my engine does, it should be not too far from the gold standard. It should be very similar to the gold standard ranking. This is usually done. The idea of Kendall's Tau is that for each pair of pages
that are ranked by both engines, you determine whether the both engines agree on the ranking or not. If one engine says A is better than B and the other says, yes, I agree A is better than B, then you count one in Kendall's Tau. If one engine says, well, A is better than B
and the other says, no, B is better than A, then you count zero for Kendall's Tau. You do that for all the pairs and, in the end, normalize by the number of pairs. So if you have perfect agreement, a one for all the pairs
that you tested, then you have a Kendall's Tau of one. Perfect. If you have total disagreement, always zero, you have a Kendall's Tau of zero. The higher the Kendall's Tau, the closer it is to one,
the more your lists agree. So it's basically the ratio of agreeing pairs compared to all pairs that are ranked by both engines. Can do it a little bit more formal. So if you have the number of pages ranked by both engines as one, you count the agreements and the disagreements.
And then you take the agreements minus the disagreements by the number of possibilities to draw to out of all pairs.
Yeah, exactly. This is basically given by the binomial coefficient. And this is what gets out of this. So m times m minus one divided by two is basically just this coefficient here.
And that's it. Yes? Oh, that's right. That's right. So in this way, it's negative. So then the Kendall's Tau is minus one, if they perfectly disagree.
Right. So it's correlation and anti-correlation, basically. If one engine always says the opposite of the other engine, then you get a Kendall's Tau of minus one. You're perfectly right. And if you have zero, then it's just random.
Because if you just, for every pair, pick bigger or smaller, then you will have a 50% chance of being bigger and guessing correctly. This will result in Kendall's Tau zero. So they're either correlated, anti-correlated,
or totally independent, totally random. Yes? Yes. That's right. So as I said, it's usually always done against the gold standard. Because otherwise, I mean, if I compare my engine to Google,
for example, there are different possibilities of making errors. Either Google was wrong or I was wrong. Just assuming that I was wrong is not a good idea. So Kendall's Tau doesn't make sense in the case where Google is often wrong. But if I have a gold standard where I say, this is correct,
like we did in the precision recall analysis. We had a manual kind of classification whether something was relevant or not. And then we compared our engine, our IR engine, to the manual classification and said, well, this is the precision of my result.
And this is the same idea here. I have a perfect ranking that is induced somehow. And then I compare my ranking against this perfect ranking. And so I just need to compare two engines pairwise. Good. So if I have two engines here, A is better than B.
This one agrees. But with B and C, it disagrees. And with A and C, it again agrees. Flump. Then it has two agreements, one disagreement.
And the number of possible pairs is basically three. So Kendall's Tau will be a third. I have more agreements than disagreements. Thus, there's a positive correlation.
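The Kendall's Tau computation just worked through can be sketched as follows; the two example rankings are hypothetical, chosen so that the engines agree on two of the three pairs:

```python
from itertools import combinations

def kendalls_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings (lists, best page first).

    Only pages ranked by both engines are compared. Tau is
    (agreements - disagreements) / (m * (m - 1) / 2), so it ranges from
    -1 (perfect disagreement) through 0 (random) to +1 (perfect agreement)."""
    pos_a = {p: i for i, p in enumerate(ranking_a)}
    pos_b = {p: i for i, p in enumerate(ranking_b)}
    common = sorted(set(pos_a) & set(pos_b))
    agree = disagree = 0
    for x, y in combinations(common, 2):
        # The same sign of both position differences means both engines
        # order this pair the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            agree += 1
        else:
            disagree += 1
    m = len(common)
    return (agree - disagree) / (m * (m - 1) / 2)

print(kendalls_tau(["A", "B", "C"], ["A", "C", "B"]))
# two agreements, one disagreement over three pairs: (2 - 1) / 3
```

With two agreements and one disagreement over three pairs, this returns one third, exactly as in the example.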
But since I don't have many more agreements than disagreements, it's only a small correlation. Yes? Good. So meta search is not used very often these days.
And why is that? Because, well, with all the ranking problems involved: do they actually rank the same documents? Do I have to find out who ranked what and where it occurs in the other ranking? That's kind of hard. But there's one area where meta search
is the killer application for web search engines. And that is if you have so-called maximum recall searches, where you want to have everything that is on the web. And it doesn't really matter which engine I ask, I want to have it. Can somebody think of an application for that?
Why would I want to have a maximum recall search? Any ideas? Any applications?
OK, that would be one application, yes. Oh, yes, but that's kind of similar to estimating
the size of the web. Now, one application that immediately springs to mind is patent search. You've got to be sure, when inventing something or when filing for a patent, that nobody has done that before. So you cannot afford to overlook anything.
The same goes for scientific results. You did something wonderful, and then somebody pops up and says, well, that has been done in 1965 by Kenneth Arrow. Well, is that still my result? No, it is not.
Better make sure in literature search or something like that that you have a high recall. Better to look at something that doesn't hurt you than to ignore something that does. This is kind of the typical idea. Well, for most other types of queries,
not the maximum recall queries, meta search usually fails to increase the search quality. And what I just said is that meta search really works well only if the engines are completely independent. Well, they all use PageRank.
They all use term frequency. So how can the rankings be completely independent? The errors are usually systematic, to some degree at least. The other assumption is that the engines used are of similarly high quality. But we do have a competitive advantage for some engines that crawl bigger parts of the web
or that use better technology. We know that. So this assumption does not really hold either. Kind of difficult, and meta search is not the killer app it was originally thought to be.
So when there were only a couple of search engines, meta search seemed to be the application, the wonderful way to solve it all by piggybacking on a lot of different search engines. It has not really delivered what it promised.
So this brings us to our second to last detour for today. But actually, I think we should skip that, because MetaGer is a typical German meta search engine. Whoever wants to have a look at it, just follow the links and try it out. It's not too impressive.
That's all I want to say here. The last part of the lecture today is privacy issues on the web. And this will be done very briefly by Joachim. So actually, it's just to give an impression of what problems could occur when dealing
with user data on the web. A very popular and very devastating example has been the so-called AOL query log. So when doing research, of course, scientists want to have access to real data. For example, real queries sent to some search engines. And so AOL, large internet provider
at the time, which had its own search engine, decided to make their query logs publicly available to scientists. So basically, a list of all queries sent to AOL over a three-month period in 2006.
So of course, they did not ask their users. But they thought, hey, by just publishing the list of queries, nobody would ever be able to find out who the people asking the queries were. So what did they publish?
20 million queries from 650,000 AOL users. And all these queries were within a three-month period. And some technical data regarding these queries that could be used by scientists to make better search engines.
So yeah, user names in AOL have been replaced by random user IDs, so there should have been no way to find out who searched for what. Unfortunately, here's the query log of one user. Obviously, he drives a Scion xB, whatever kind of car
this might be. Had some problems with brake pads. Lives in Florida. Also had some specific things he wanted to repair. Is somehow related to the Florida Department of Law Enforcement. And wants to get revenge on an ex-girlfriend. So with all this information, it might
be able to find out who he is. So if you have access to some database on car repairs in Florida and know the time when he asked these queries, then you will definitely be able to find out that he has some trouble with his ex-girlfriend.
So another thing is here is some user has some problem with cocaine, obviously.
Again, comes from Florida. Is interested in marriage in some way. Legal problems in Florida. Currently is in New York, because he is worried about whether the New York authorities will
extradite him to Florida. Looking for cooking jobs in the French Quarter in New Orleans. And yeah, again, with some more information, you will be able to find out what he does, what he likes, and what problems he has.
And these are not the things you want to have published. Here's someone who has a job interview with Comcast. Drives the Ford Focus. Is somehow related to the city of Joliet in Illinois. Yeah, has a criminal past. Cheating spouse.
Yeah, again, Illinois. Again, truly some problems. Yeah, these are this kind of information you don't want to have published. Again, very personal.
Yeah, also very nice: how to kill your wife, wife killer, pictures of dead people, murder photo. Well, steak and cheese, car crashes. Maybe if there is some more information about this user, for example if he or she searched for his or her own name, then yeah,
it's truly a problem. And definitely a problem also is: if you find out who this person is, should you report him or her to the police? Are you allowed to do that? Interesting question.
Yeah, first checks the wife. All right, next one, this user. This user has been tracked down by the New York Times. So they looked at the queries.
This nice lady here searched for the city she lived in and obviously had some very strange problems with her dog, which urinated on her sofa and made a lot of trouble. And this person is Thelma Arnold, a 62-year-old widow living in Lilburn, Georgia.
She also did research on her friends and on their medical history. And just by looking at the queries she posed to AOL, the journalists have been able to find her and to find out many details
about her personal life and the personal life of her friends. So the journalists approached her, and she immediately said, yeah, well, these are my queries. Interesting, how did you find me? So that definitely proves it is possible to find these people, and there's no way to provide sufficient
anonymization on this kind of data. So generally, be very, very careful if you have access to this kind of data. Of course, AOL realized that there was a problem. Just a day after the release of the data,
they removed the data again and they apologized. Yeah, this was a screw-up. And then they claimed it was just the work of some single guy working at AOL and nobody knew about it; it was all a big mistake and they are very sorry. But yeah, they probably always will be sorry,
because once you have published such data on the web, it's always out there. So this is an address of someone who just collected where the data can be downloaded. It's easy to find, so no problem here in showing this address. This is a problem AOL always will have
and never will be able to fix. And of course, the AOL users who can be found in the data set will have this problem for all of their lives. So a very, very big issue here. A similar problem occurred at Netflix, a large DVD rental service.
They faced a very similar problem. They also released a data set some time ago about what DVDs have been rented by each of their users. Of course, they anonymized the data by replacing the users by random user IDs.
So yeah, again, they thought that it would be really difficult to find out who rented what, or which person is which. But some team of researchers just took a look
at IMDb, the Internet Movie Database, where people can post textual reviews about movies and also post star ratings about movies. And then they tried to find out which users wrote IMDb ratings about movies approximately at the same time they rented a movie
at Netflix and gave similar ratings to these movies. And indeed, they identified some people in the Netflix data set, sent them a mail, and asked, hey, did you rent these and these movies at Netflix
at this and that time? And it turned out that they had guessed correctly. And of course, this is a big problem, because there definitely are movies you rent from a DVD rental service that you don't want people to find out about. Yeah, there are some movies people don't want to talk about.
And they have their reasons. So again, this is a privacy issue, and Netflix also has decided not to publish such a data set again, due to legal reasons. This makes research quite a problem. But yeah, this is all you can do when you have this kind of data:
Keep it for yourself. The last example are services such as 123people. You just enter a name, and these services are designed
to find all information about this person available on the web. So they are looking through Facebook, they are looking through address books, they are looking through picture galleries. And yeah, here we are. All your Achim Selkis, more or less. So these four definitely are.
Some addresses of people with this name, very convenient, and domains with this name, email addresses, web pages. So very, very well ordered here. A tag cloud about what I'm doing. So yeah.
Yeah. So very interesting to see. And then you don't even need an evil company publishing your private data. There's a lot of data already out there.
And if some service that, I think, is not a very big company is able to find all this data, then Google or some other large search engine is definitely able to find out much, much more about your private life than you would want.
So even, and particularly, if you have a private profile at Google, then yeah. You can just trust them; they promise not to be evil, but yeah, who knows? All right, yeah.
That's been our last detour for this semester. These are the lectures we will offer next semester, as soon as I can start up the thing here. So the first one is distributed databases and peer-to-peer data management. We talked about that: how to build distributed systems, how to store data in a reliable and efficient way,
and how peer-to-peer networking works, so you have no central storage anymore, but all servers in your network have the same rights and have to cooperate.
So next one is data warehousing and data mining techniques.
So data mining, we discussed this in part; classification, for example, is a data mining technique. Data warehousing is more about analyzing business data in companies. Also very interesting, again, next semester. We will have, yeah, yeah, here's the money.
Remember that. The next one is spatial databases. Yeah, it's about how to deal with geographic information. It's also getting more and more important on the web: location-based services. So if you're standing in the middle of a city and you want to go dining,
just get your smartphone and see what's near, and what star ratings other users have given to all the places around. But of course, for this, you need maps. You need localized services, and this is a topic of this lecture. And finally, we have digital libraries,
which is, yeah, there's some overlap to this lecture here. Digital libraries have a different focus. Of course, digital libraries are about storing and retrieving information and retrieving textual information mainly, but here an important aspect is long-term preservation, for example, providing a good classification.
Yeah, so it's a different aspect, a different perspective, a different focus, and I think it's a good addition to this lecture. So if you are interested in how libraries deal with their information, this could be a good follow-up. All right, so,
here's the money, here are the maps, here are the books, and here is the cloud. All right, any more questions? All right, then thank you very much,
and maybe see you next semester.