
Retrieval in the Web I


Formal Metadata

Title
Retrieval in the Web I
Part Number
10
Number of Parts
12
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
This lecture gives an overview of Information Retrieval. It explains why documents are ranked the way they are. The lecture explains the most relevant ways for content representation: automatic indexing and manual indexing. For automatic indexing, the frequency of words is of special relevance, and their influence on the weighting of terms is discussed. The most relevant models are introduced. The session on evaluation discusses new metrics like the Normalized Discounted Cumulative Gain. The session on information behavior provides a brief overview and explains the relation to IR. The session on optimization mainly introduces term expansion and fusion methods. The session on Web retrieval is concerned with quality aspects and gives a basic insight into the PageRank algorithm.
Transcript: English (auto-generated)
information retrieval on the web, web search engines, and basically we will deal with something that is specific to web search engines and that we have not talked about so far. Of course there's also weighting and all these things, but the web especially has an issue with quality, with the quality of the documents,
and the most well-known heuristic for quality control, for quality measurement, will be the main topic of today's class. So before we start, just a brief hint to other lectures
and other resources. There are some resources on the web, mainly the textbooks that were also, I think, part of the homework, if I remember well, yes. Of course there are also other recordings, some are even openly available; ours is only in the closed Learnweb. But for example, the nearby University of Braunschweig
offers its IR class, which seems to be quite okay, as a video, as a screencast, on the service of the Technical Library in Hannover, so that you can access the lecture and it's citeable,
you can cite every section of the lecture, you have speech recognition on the recording and can search within the class or within this lecture. So also quite good; this is just one for web retrieval,
also similar topics, and if you find something else, some interesting resources on the web, just send me an email and we can post them on the Learnweb for the others to see. Okay, so the importance of search engines is quite obvious,
I guess, we don't have to spend too much time on that. A lot of people invest in content, companies create websites and content, lay people create content, and the issue is:
what if nobody can access it, nobody can find these web pages? What about the investment, is it lost? Well yes, of course, and web search engines are one of the main vehicles to access this abundant information. Without search engines,
it would be hard to use the web, the information on the web, so web search really plays a big role, sometimes even for censorship: censoring the internet sometimes just means censoring web search engines, which is sufficient to control access to a large extent.
Since web search engines have such a big role, there are more than three billion queries per day; this number was already mentioned two years ago, and it's quite a significant number, probably even a bit more today, so people use this resource quite a lot.
Here we have another number: more than half of all US citizens use search engines at least daily, probably a similar number for most developed countries, so here we have a tool that people access in their daily lives. Web search basics: the first step that is different from
regular information retrieval is that we do not start with a controlled collection; we have to find the documents first. The search engine has to look for these documents on the web, and this is done by programs, by automated systems, called web spiders or crawlers.
They visit the pages and go through the hyperlinks between the pages; this is the main vehicle to find new pages, collect them and send them to an indexer so they can be indexed, and then finally you can search in your search entry field,
relying on all the work that has been done by the indexer and by the web spider. The indexer is just a regular indexing system as we have mentioned before: stop word removal to some extent (you know that web search engines don't remove stop words so much), then we have stemming, again a bit different
for the web search engines, they keep the stem and the original form, remember that, it's also a bit different in web search but not so much. But the web spider is something that is new: usually we start with a collection, like all the articles from a journal or whatever, and they are indexed, and here we don't even know
how many documents we have on the web, we have to find that, and each person will find a different number probably. So this crawling process is one of the problems here, efficiency of course is also something specific for web search, the size and dynamics of the web
is a problem, size for crawling, also for processing, but also the dynamics, sometimes web pages change, well how often do they change, how often do I have to go back to the same page, it's not sufficient to visit the page once to index it, tomorrow there might be new information,
so I have to go back tomorrow; for every web page, the search engine has to decide what to do. The type of content, of course: there can be HTML and many other formats, so we have a large heterogeneity, multimedia, multilinguality,
something we'll talk about in the master course on information retrieval, also a little bit about multimedia. And the big issue is of course quality. In the beginning it was already very clear: there is much high quality content on the web, but there's also a large percentage of low quality content that nobody's really interested in,
so that is really a challenge, and systems have been developed to tackle that, and that's what we're going to talk about today. Just a few figures for the size of the web; obviously it is growing very fast:
we have an estimate from about 10 years ago of already over 11 billion pages, and from a few years ago of 100 billion pages. Since we don't know the real numbers, these are estimates that we can either believe or not, but we know that the size is growing tremendously,
we have here the host names, the figure for the host names, for the URLs basically, homepages, and we see there is a large continuous growth, then there is another discussion when we talk about the size of the web typically,
most people just talk about the publicly accessible web, HTML pages that you can visit, but there's a lot of what people call the dark web, or hidden web: things that we cannot access easily, that are behind login entries, or company in-house pages,
private webs, proprietary webs, and those are hidden. Also hidden to some extent is the dynamically generated web. Think of the railway information system: when you send a query about the trains,
documents about the trains that run on one day are created automatically from a relational database, but we cannot access this relational database directly, so we cannot index it. These dynamically created pages are also a challenge for the web, and some people estimate that this deep or hidden web
is even much bigger again, 100 times bigger than the visible web, and even that is already quite big, so this is a challenge. Crawler, basically as we said, is a small piece of software that collects pages on the web,
to send them to search engines, or for other purposes, maybe to analyze them. These things are also called robots, spiders, wanderers, walkers. There's a file on almost every website, for example www.uni-hildesheim.de/robots.txt;
if you type that, something you wouldn't access typically, you see that a website tells these crawlers and automatic programs where they can go, what they can index and what not, and sometimes they follow these rules,
the robots standard. And here again a definition, by Ricardo Baeza-Yates: crawlers are programs that traverse the net, sending new or updated pages to a main server, where they are indexed or maybe analyzed.
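As a small illustration of how a crawler can respect these rules, Python's standard library ships a parser for robots.txt files; the URL and the crawler name below are just assumptions for the example.

    # Check robots.txt before fetching a page (sketch; URL and agent name are made up).
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://www.uni-hildesheim.de/robots.txt")
    rp.read()  # download and parse the robots.txt file

    if rp.can_fetch("MyCrawler", "https://www.uni-hildesheim.de/somepage.html"):
        print("allowed to crawl this page")
    else:
        print("the site asks crawlers not to visit this page")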
So what the crawler needs are so-called seed pages, something where it can start. It is given a list of URLs that somebody knows (ah, I remember this URL), and it can start there. What the system does is crawl these URLs: basically visit them, download them, parse them,
and the second step is to extract all URLs that are in these pages. So download the page, extract all URLs from one page, and take these URLs and add them to my list, where before only the seed pages were.
I find another page, extract all URLs, add them to my list, and so forth. So my list of unseen URLs that I haven't visited yet will probably grow for a long time, and this is called the URL frontier: those pages that I have found a link to,
but that I haven't visited yet. Next step, once there is time, the crawler visits them, extracts URLs again, and maybe moves this frontier further down into the unseen web, until it hopefully has visited a large portion of the web,
and seen a lot of the visible pages. The crawling algorithm is basically very simple: step one, download the page; step two, parse it and extract all linked URLs; then repeat these two steps for all extracted URLs. That's it, okay? A very simple piece of program.
By the way, this semester we have a lab class on Lucene and patent information; last year too, and sometime in the future maybe we'll have a lab class on crawling programs and Lucene, where we crawl pages and analyze them.
We also have some theses on that, a very interesting topic of course. We can download a large number of pages automatically, analyze them and find some interesting relations in them. And it is such a simple piece of software that you can easily work with it in one class, in one year.
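To make the two steps concrete, here is a minimal sketch in Python of the loop described above: download a page, extract its links, and add unseen URLs to the frontier. The names are made up for illustration, and a real crawler would add politeness delays, robots.txt handling and many parallel threads.

    # Minimal crawler sketch: seed URLs, a frontier, download, parse, extract links.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def crawl(seed_urls, max_pages=10):
        frontier = deque(seed_urls)       # the URL frontier: found but not yet visited
        seen = set(seed_urls)
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except Exception:
                continue                  # skip pages that cannot be downloaded
            pages[url] = html             # here the page would be sent to the indexer
            extractor = LinkExtractor(url)
            extractor.feed(html)
            for link in extractor.links:  # add newly found, unseen URLs to the frontier
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return pages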
So, here again, the URL frontier: it's a list of URLs that we haven't seen yet. The crawler processes it in sequence. Now we see, aha, we have several threads,
so we have several parallel running programs, because crawling the web with only one program, one page at a time, takes too much time; we need several, probably hundreds or thousands of parallel programs that are running and that continuously download pages from different servers. Then of course we have the problem: how do all these programs organize
a common URL frontier? That's something for parallel programming, we don't have to worry about it here, but it is an interesting challenge. And now, also interesting, a very pragmatic issue that we've talked about: the revisiting frequency. How often do we have to visit a page?
There are a lot of pages, news sites for example, that change every hour; some other pages, like for example my home page, change very infrequently, when I have time, maybe two or three times a year, so it would be a waste to visit it every hour and check: has it posted something new? And that's exactly what search engines do,
what big crawlers do: they observe how often a page typically changes, visit it a few times and see, does it change hourly, or daily, or monthly, and then adapt their revisitation behavior to this modification frequency.
And how often do pages change? There's a lot of analysis on that. An interesting, large study was done by Microsoft in 2003; it tried to analyze how large the change in web pages is overall, and to correlate it,
and also to check: if a page changes in one week and in the second week, what happens in the third week? Do these changes really correlate? This is the study by Fetterly et al., on 150 million web pages,
which was at the time quite a challenge for the systems, and interestingly what they found out was, here we see the levels of change, that a huge share of the web pages in each top-level domain, almost 50% or more, don't change
at all in these 11 weeks; this is shown per top-level domain, and on average we have 65%
with no change at all. The next two categories are no text change and small change, like maybe a changed date or something like that. Medium and large changes are almost invisible here, and complete changes
we observe very rarely; only, interestingly, for the DE domain the study observed a much larger change pattern than for the other top-level domains: here 10% of the pages changed completely in the course of the study.
So overall this is quite surprising, and for the German top-level domain, Fetterly et al. actually checked this result a bit more closely, and they found that there was a big link farm in there,
some artificially created pages that really skewed the results; once they excluded this big chunk of millions of commercial pages that had just no information at all, they ended up with a very similar pattern
as the other top-level domains, so this was just due to relatively few strange pages overall. Here is a plot that is a bit difficult to read. It's a three-dimensional plot
of the amount of change over three weeks, and what we can observe is the correlation: if there is a big change, or no change, in one week, then typically the same happens in the following weeks.
They could show that there is a high correlation: if there's a change in one week and in the second week, then it's very likely that there's a change in the third week, and vice versa, if there's no change in two weeks,
we don't have to go there, it won't change in the third week either, the probability is very high. So this is an empirical confirmation that the heuristic we've mentioned here works quite well. So much for crawling, the size of the web, revisitation
behavior and change frequency; now we move to the central topic, the lack of quality. Any questions on crawling before we leave this topic? No? There are a lot of statements, of course,
over the years: information quality varies widely on the web; for many topics, the web contains many, many documents of widely varying quality. And Larry Page, in 1998, I think in the first publication of the algorithm
that we will talk about now, one of the Google founders, wrote that there is a large fraction of low quality web pages that users are unlikely to want to read. So he says quality means: do people want to read it or not? That is his definition of quality,
and he says it's a very large fraction, so many, many pages that nobody's interested in. And Weinstein-Leumann, again, give more detail about what low quality could mean: sometimes it even means false information, either accidentally wrong,
or sometimes people publish wrong information because they intend to do that. So we have a lot of pages of bad quality that nobody's interested in; probably everybody has seen that in their own experience. There are frameworks
for information quality. Here is an interesting framework, by Nguyen et al.; they said there is intrinsic information quality and contextual information quality, these are the main categories, which are basically
opposites, we could say, and there are also representation and accessibility, something that can apply anytime. Basically, intrinsic quality means something that is true or good or high quality independent of the usage context: we can see that a page
is accurate, objective, believable, has a high reputation. The contextual information quality is more of a usage category; it takes the context of use into account and says the page has to be relevant, and we know what relevance means
from the introduction class, it should add value, be timely, complete, and the amount of information should be appropriate. I might have a perfectly true page, but if it's too much information for me, it's not really that good. So that would be the difference between intrinsic quality, which,
let's say, is always there, and the more contextual definition. And in the end we can check what kind of definition the algorithm that we'll talk about today adopts: is it more on the intrinsic side or more on the contextual side,
if we don't forget about it. So here we have situative factors, whereas for intrinsic quality we talk more about the truth of what is said there. Representation, of course: can I easily read and understand these things, also something important for information quality. And access
takes usability and accessibility issues into account. So, we come to the basic notion, what is the basic idea for quality control, that comes from the so-called area of link analysis,
so we take links into account. Links, hyperlinks, connections: they're pointers from one webpage to another. Okay, everybody uses links all the time, so there are links between webpages. And now the basic idea is to say, well, these links could be an indicator for quality.
If a page is of low quality, nobody would put a link to it. And if a lot of people put a link to a certain page, they take the effort to sit down at their webpage editor and say, oh, I'll add a link to this page; that takes some time, and if they do it, that could mean the page is of good quality.
And if a lot of people do that independently from each other, then it's probably a very good page, because people link to it a lot. And that is the basic idea. The best-known algorithm is PageRank from Google,
published by Brin and Page, and maybe it's still applied by Google, but we don't know; they claim it was the success factor for the search engine, but we don't know. There are a lot of factors that account for the quality of a search engine.
Well, we can never be sure what the algorithm really does today. It probably still has some elements of link analysis close to what we'll talk about, but of course we do not know the current algorithm.
So this was an idea brought in by the Google founders, but they were, of course, also not the first ones to talk about link analysis. Other people have talked about that. Kleinberg, for example; we will also look at his algorithm, published, I think, in the same year,
his link analysis algorithm. And this idea is not even new in the web. Things like that have been discussed a long time before in the area of bibliometrics or scientometrics, where links to an object were also considered
to be indicators for quality. So if somebody links to a book: how did we link to a book in the old times? What was the quality indicator for publications, for books or articles, and still is today?
It's quoted very often? Yes, quoted, referenced, exactly: references. Other people do not only read it, they read it and think, oh, that sounds good, I'll quote this, I'll base my work on this work, this is an excellent explanation of whatever,
and put in their references a link, or a reference in this case, to a scientific publication. Then people think: that must be good, it is cited, I'll read this too. If nobody cited it, well, it's a publication that nobody seems to read, or at least to cite. And if a paper accumulates a lot of references,
a lot of citations, then people say, well, that must be a good publication. And consequently, a good scientist who published it, a good journal, because maybe there were more good articles in a journal, then you would say, this is a good journal. The average of the citations of all papers in one journal
for a year is called the impact factor of that journal. That is a very important value in the scientific community. So, scientometrics is an old field; for a few decades these things have been discussed:
how can we calculate the impact factor, what does it mean? And basically behind this is the idea that many links to an object support its authority or quality. This basic idea we now adapt to the web. We don't have references, we don't have citations, but we have hyperlinks.
So page rank is as simple as that, in two or three steps. The more links point to a page, the higher its authority. This is the first step. That is what has been done in scientometrics,
in bibliometrics; this is basically the journal impact factor, something like that. And now there are two new ideas that page rank added, and the Kleinberg algorithm from IBM added this as well. The first says: well, not every citation is worth the same.
We first have to check whether the page that links to me is itself of good quality or not. If the linking page already has a high quality, a high page rank, then the link is worth much more.
If it's an insignificant page that doesn't have good quality and links to me, well, it doesn't hurt, but it doesn't really contribute to my page rank. But to know whether that page has a good page rank, I have to go back to its back links, and to the back links of those pages, and so on. So I have a circular algorithm; the algorithm starts
with all pages in the web having the same page rank, one or whatever, and then it goes stepwise, iteratively, through the whole web and recalculates these values; at one point I have to stop and say, okay, now I'm done, more or less nothing changes anymore.
This is the page rank that I have got for each page. Here is an overview, and one more element, maybe a fourth element, that we have to know. We start with this simple example. At one point we reach the conclusion that this page has a page rank of nine.
It has three out links, so it distributes its page rank, we can say, to three other pages. And now, because a page can have many, many out links, it would be unfair to give all the same value, we divide the page rank by the number of links we have.
So nine divided by three, three out links makes it three for each of the receiving pages. So this page here gets a contribution of three to its page rank from this page, and a contribution of 50 from this page.
It's added up simply, summed up, and we have 53. Okay, and this one has 100, it's better quality obviously at this point. Two links, 100 divided by two makes it 50. 50 is distributed to both pages. This one receives 50, continues to distribute 25
to some other pages, and so forth. And we go over the whole web several times until we reach a situation where the page rank values don't change too much, okay? So it is simply a number: the more links
you get, the more it helps, but some links are worth more than others because they come from good pages, and you divide the value you pass on by the number of out links. That's it, and several iterations. And then you end up with a number for each page,
and this is the page rank vector for the whole web. At a certain point you have a vector with a number for each page, and that is the page rank vector.
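As a small sketch of this iteration in Python; the four-page graph is a made-up toy example, not the one from the slides, and the extra parameter mentioned later (the random jump) is left out for simplicity.

    # Iterative page rank sketch on a tiny hypothetical web of four pages.
    # Every page starts with the same value; in each iteration a page passes its
    # current rank, divided by its number of out links, to the pages it links to.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["A", "C"],
    }
    pages = list(links)
    rank = {p: 1.0 for p in pages}               # start: every page gets the same rank

    for _ in range(20):                          # a fixed number of iterations
        new_rank = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)   # divide the rank by the number of out links
            for target in outlinks:
                new_rank[target] += share        # receiving pages sum up the contributions
        rank = new_rank

    print(rank)  # pages with many (and good) in links end up with higher values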
Okay, we don't have a matrix here, much easier than term weights and so forth, right? Only one number for each page, very easy. We can rank the whole web according to the page rank: the best page would be number one, and so forth. Here is the very same thing now as a formula, right? Nothing different, just one little thing
that we'll talk about later, one parameter that we'll explain later. The page rank of page p
is basically this sum; as we said, we added up 50 plus 3 in the previous example. It is the sum of r(q), the rank of each page q that links to me: I sum up over all pages q that have a link to me, and divide by the number of out links
of q, because, as we said, we divide by the number of out links. So everything is in this formula here. Now we can also express the web as a matrix; we like matrices, right,
from the term-document matrix, you know. Did we have a term-term matrix already? I don't think so, I don't remember. But this would be a page-page matrix of the web, and we have two representations of the links between pages: as a graph and as a matrix. We'll talk about graphs a bit more later.
So the page rank basically depends on how the pages on the web are connected: a page that gets more in links gets a higher page rank. So it depends on the overall linking between the pages.
In this small example I have four pages on my toy web here, and I have two representations that are completely identical. Here I have the graphical one. I see, okay, page A receives a lot of in links. It would end up having a higher page rank
than the other ones. And I know there is a link from D to C, I can see that here, and there is a link from D to A. And the very same thing I can see in this matrix here, completely identical. Here I can see that I have a link from D to A, reading from the rows here to the
columns on the top. We should also see that there is a link from D to three other pages, maybe A, B, C; is it true? A, B, C, yes, right. And I can see that D has no in links.
There's no incoming link. Whereas A has three incoming links, we can also see that. So, completely identical representations, formalisms of the very same thing. A specifically connected web of four pages in this case. We could have different connections, right? We just change the ones in the matrix.
Then I end up with a different connectivity, and I get a different page rank. Okay? Still all very easy, quite obvious, right? And here is something that we'll see in the end; you can think about it now: how is the web really connected?
We said we have maybe 100 million pages. What is their connectivity matrix? What is the typical number of in links? How many out links does a page have? How often is it referenced, we could say, how often does it receive links? If we take two pages,
will there be a link between them or not? On average, probably mostly not, right? We can guess that with 100 million pages. But if I select two random pages, can I find a path between them? Can I move through a certain number of links
and reach the other one? Is that always possible or not? What is your guess? No? Maybe not? How often could it be possible? Well, good question then. Okay, we'll see in the end. What does the connectivity matrix look like? Because that has an impact on the page rank, which is used for retrieval, right?
So, the page rank vector, which is just one number for each page, right, for all pages, we write it as a vector R with the arrow on top, and it is a function of only one thing:
it's only a function of the connectivity matrix (matrices are written here with German Fraktur characters). We don't need to know anything else; with just the connectivity matrix and our algorithm we can calculate the page rank for a graph
like the web: objects connected through links. We can also write it iteratively, because that is what we do: we do it once, and then several times, because we have to go back in a circular manner; we have to find out how the page rank of this page depends on the ones that link to it.
The page rank of the linking page, again, depends on those that link to it, and so forth. I could do this indefinitely, but we stop at some point. So the page rank at a certain point in time, the next, new page rank vector, is a function of the old page rank vector
from the last iteration, and of my matrix. That's all we need to know. So, we have to have the connectivity matrix, and then we can run several iterations, typically three to five or some other number of iterations
that you can read about in publications; we don't know how often it's really necessary. And this can be written in a way that leads to the so-called eigenvector: once we reach a situation where the next vector, as a function of the previous one
and the connectivity matrix, is identical to the previous one (this is an optimization problem well known in mathematics for a long time, and we can only approximate it), then we can call the page rank vector the eigenvector of the connectivity matrix,
just so that you have heard the term. Sometimes we could even see the page rank in the Google toolbar. Anybody use the Google toolbar? No? Maybe it doesn't exist anymore; the screenshot is older. But there you could see the page rank
as a bar that would get filled with color, right? It had, I think, 10 steps; it was not a number but 10 categories, from bad pages to better ones. And here we see the Yahoo page with a very high page rank.
In order to know the page rank, you have to know the connectivity matrix. So web search engines, if they want to do link analysis, have to store the connectivity matrix: which pages are connected to each other, is there a link or not, right?
And since this is stored, which also takes up some storage, many of the web search engines also allow the users to access this information. You can search in the connectivity matrix: you can ask, is there a link to a certain page? Anybody ever done that?
No? A very basic function in Google or Yahoo or Lycos. So you should know that; you have to know it eventually, to solve the homework of this class.
So, search for the links that point to a certain page. The typical syntax is link, colon, and then the URL. And then you hopefully get hits, if it's correct: pages that have a link to this page, okay? So you can do backwards chaining,
basically, in the connectivity matrix. Forward chaining is easy, right? You just download the page and check its links; that's something the crawler does and that you can do too. But backwards, you have to build the whole matrix first. Anyway, so how is it used in retrieval?
Typically what we assume is that there is a function combining the similarity value that comes from our weights and our query. Remember, query, terms, and weighting.
And we end up with a similarity value that we call RSV, retrieval status value. So we have a query dependent retrieval status value for a query and a document. But now for each document on the web, we also know this quality,
or have calculated a quality, based on the page rank. You can see whether that makes sense; we'll talk about that later. And we add the page rank of the document; of course here we don't have a query dependent part. There's also the topic of query dependent page rank, but we won't go into that. We just assume a query dependent RSV and a
query independent page rank. And the function, we have to have some function that brings these two values together. Maybe the sum or the product, right? There are different ways to combine things.
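Purely as an illustration, one of the simplest possible combination functions would be a weighted sum; this is a hypothetical sketch, not the method of any real search engine.

    # Hypothetical combination of a query dependent similarity value (RSV) and a
    # query independent page rank into one final ranking score.
    def final_rsv(rsv: float, page_rank: float, weight: float = 0.7) -> float:
        # weight controls how much the query match counts compared to page quality
        return weight * rsv + (1 - weight) * page_rank

    print(final_rsv(0.8, 0.2))  # a well-matching page with a modest page rank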
In one of the coming sessions we talk about the combination of values; we mentioned that under optimization. So we have a final RSV, which is some function of the RSV, the normal similarity value, and the page rank. Okay, how is this combined in the real search engines?
We don't know, but we assume that they use some sort of link analysis. Questions? The core is really, I think, this slide;
this is where you can most easily see how it works and understand the page rank. That is what you should really take home today. But of course it's not that easy; there are some more advanced techniques.
One idea is the Kleinberg algorithm, which was published at the same time as page rank, or maybe even a bit before, by, what's his name, Jon Kleinberg I think, who worked for IBM. And he said, well, this single quality value, that's a bit tricky, because pages have different functions.
He said there can be a page that has content, and other pages, they are maybe just collections of links, like somebody could collect all the best links on information retrieval and set up a page and maybe not write anything himself,
because that's all on the web, everything has been written, I just bring together the best 20 pages on information retrieval. Then his page would have a different function than a content page. And Kleinberg said, well, we have these two roles, and you can have quality value for each of them.
Each page has a content value, which he calls authority value. And each page has a value, a quality value, for its kind of linking quality. And he calls that the hub. Hub is a connector in a network, right, a big hub, like people talk about hubs for airports, right,
an airport that is central and connects to many other airports can be called a hub. Similarly, a web page that is in the middle of a network and links to many others can also be called a hub. So he says a page can have either a good hub value or a good authority value.
And we calculate this very similarly to the page rank, but the roles are different. There is a backward and a forward combination. Now we have a page that links
to three other pages. And now we have a relation between the hub value of the origin of the link, the linking page, and the authority value of the receiving page. We look at the authority value of the receiving page
and say, aha, this is a good authority, good content; so the page that links to it is a good hub, let's augment its hub value. And vice versa.
Basically that's it, right? That's all we need. Now how do we determine the hub value? The hub value is determined,
well, let's look at the formula, that makes it easier. All right, basically, that's it: the authority of p is the sum of the hub values of the pages that link to it. So if a good hub links to you,
you seem to have good authority. And the hub value is influenced by the authority values: it's the sum of the authority values of the pages you link to. So your authority is fed by the hub values of the pages linking to you, and your hub value by the authority values of the pages you link to.
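Written as update formulas, in standard HITS notation (a sketch; q -> p means that page q links to page p):

    $\mathrm{auth}(p) = \sum_{q:\, q \to p} \mathrm{hub}(q)$          $\mathrm{hub}(p) = \sum_{q:\, p \to q} \mathrm{auth}(q)$

Both values are usually normalized after each step, and the update is repeated until they stabilize, analogous to the page rank iteration.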
So the roles interchange their quality. If you're a good linker, if you're a good connector between pages, then you augment the quality value of this page, content quality, authority. And if your authority is good, that is good for the linking page.
It'll increase its hub value. Okay? All right? And each page has both values, but some pages can have a high hub value, other pages can have a high authority value. For search, maybe sometimes we want the authority, sometimes we want the hub,
depends on what we're searching for. Yes? Could it happen? It could happen, but it's not so likely, I think. I never checked an example; it would be interesting. We could imagine a page
that has good information, good authority, and at the bottom has some excellent links. All right, then we could say, ah, this has both, for both qualities it's very good. But I've never checked an example; I don't know if anybody really calculated whether, in a real, normal network,
it's very likely that this occurs, because the two values influence each other. What was the question? For example, if you have some kind of scientific research page with a lot of references and a lot of content,
which is informative, you may have a high authority and a high hub value, if the content is... Exactly, we can easily think of such a page, right? We can imagine it and say, well, this is good for me as a link page, I find good references, and it's also interesting for me to read.
But, as I said, in a real network I'm not sure if this occurs very often, or whether it's typical that the hub value is low and the authority is high, and vice versa, or whether both are high. For most pages, of course, if we agree with what Larry Page said,
there's a high fraction of low quality pages, so both values would be very low. And for some, they would increase. And now the question would be: is it possible that a page excels in both roles? Unsure. A good topic for a master's thesis.
Anybody interested? Still open? Think about it until tomorrow; maybe somebody wants to do it. Very good. Here is an example, to illustrate this and also, of course, to show one of the problems here:
we compare two hubs. We have two hubs that link to four authority pages. And now we see that we have three very good authority pages, with 90%, we'd say, or 0.9; well, actually we don't have probabilities,
so let's just say we have 0.9 for these three pages. And we have another page that has low authority, 0.1, compared to these other pages. To put it simply, right? Now, we have one hub, and another hub.
And remember that these values would be channeled into the hub value. The first hub links to the first three pages. The second hub links to all four pages.
Now the simple Kleinberg algorithm would add up the authority values as the hub value. For the first hub, 0.9 plus 0.9 plus 0.9 makes 2.7, hub value 2.7.
The second one, remember, also links to the first three, so we have the same calculation, 2.7, but also plus 0.1, which makes 2.8. Hub value 2.8, right? It sums up all the authority values of the receiving pages.
Now, which one is the better hub? Which one is the better hub?
Why? Because it links to the three good ones and doesn't lose anything on the bad page; the best hub should not spread over too many pages, just the three good ones. Uh-huh, yeah. The other one links to more pages and gets a better hub value,
but we're not convinced that it's good because we say, well, it links to more, but one of them is really bad. So a quarter of the users might click on this page and end up with a low-quality page, whereas here, all the users would get a high-quality page.
But still, since we take the sum, the Kleinberg algorithm, as we've shown previously, would say this is the better hub. So, counter-intuitive, right? What could we do?
The average quality value. Yeah, exactly. We could take the average, or rather the average between the min and the max, and also multiply that average
with the number of links, right? And then we would end up with the following situation. If we had had that idea before 2002, before Borodin and others published it, we could have published this idea ourselves, right? Simply don't take the sum,
but take the average here. Here the average is 0.9, easy. Here the average is 0.5; basically not the real average, but the average between min and max, right? Then take this spread, we could say,
between min and max, and multiply it with the number of links, and then we see: aha, here we have 2.0, here we stay with 2.7, and now we have a situation where the hub that we as users would consider the better one wins. So it would be a simple improvement
of the Kleinberg algorithm. Okay? Understood? No more explanation needed.
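The two scoring variants from this example side by side, as a quick sketch; 0.9 and 0.1 are the authority values from the slides.

    # Compare the plain Kleinberg hub score (sum of authorities) with the
    # modified score (spread between min and max, multiplied by the link count).
    hub_1 = [0.9, 0.9, 0.9]        # links only to the three good authorities
    hub_2 = [0.9, 0.9, 0.9, 0.1]   # links to the bad page as well

    def hub_sum(auths):
        return sum(auths)

    def hub_spread(auths):
        return (min(auths) + max(auths)) / 2 * len(auths)

    print(hub_sum(hub_1), hub_sum(hub_2))        # 2.7 vs. 2.8, so hub 2 "wins"
    print(hub_spread(hub_1), hub_spread(hub_2))  # 2.7 vs. 2.0, so hub 1 wins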
How are we doing in time? Let's see, there's a lot of stuff. 8.30? 3.15 already? Okay, let's see if we can manage. Next step: the page rank value can be interpreted quite differently. We'll talk about that now for a while. One interpretation is the so-called random surfer.
Given a certain connectivity between web pages, we could imagine somebody clicking on links forever, until eternity; sometimes he would end up in a dead end, at some page that doesn't link anywhere,
and then he would jump randomly to some other page. And to make it more general, sometimes we stop following links and just jump to some other random page, we land on a random page. And from there we surf on for hundreds and thousands of clicks,
something of course that nobody would really do, this is just a model, and we visit web pages. Okay, quite an easy thought experiment: somebody clicks forever. And now we want to know: what is the probability that he reaches a certain page?
And now we can think, well, a page that has many links pointing to it, he will reach more often. Right? If we have like 10 billion pages on the web, let's say, and we click, whatever, a thousand times more than that,
a thousand times more clicks through this graph, we will end up more often on pages that have a lot of in-links. A page that has few in-links we won't reach so often. Pages with no in-link we reach
only with very low probability; only with the random jump can we get there. All right? So a page that is reachable by many paths will be visited more often. And this probability that the random surfer reaches a page is exactly the page rank value.
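To see this equivalence, here is a small, purely illustrative simulation on the same made-up four-page graph as in the earlier sketch: we count how often a random surfer lands on each page, and the visit rates approximate the page rank values.

    # Random surfer sketch: follow a random out link, or with a small probability
    # "teleport" to a random page; count the visits per page.
    import random

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
    pages = list(links)
    visits = {p: 0 for p in pages}
    teleport = 0.15                                   # probability of a random jump

    current = random.choice(pages)
    for _ in range(100_000):
        visits[current] += 1
        if not links[current] or random.random() < teleport:
            current = random.choice(pages)            # dead end or random jump
        else:
            current = random.choice(links[current])   # follow a random out link

    print({p: visits[p] / 100_000 for p in pages})    # visit rates, roughly the page rank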
Because again: the more links a page has, the higher the probability of a visit will be, and the more visitors the pages that link to me attract themselves,
the more likely it is that this path will be traveled to me; so again we have the same definition as for the page rank. The only thing we have to change in our original definition, and this also explains the parameter that we had in the other formula,
is that we add a very small probability of a random jump, sometimes called the teleportation parameter: we just transfer the surfer to somewhere else on the web, and this is the parameter alpha.
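In a formula, with the small jump probability included (a sketch in common notation; N is the total number of pages and out(q) the number of out links of page q):

    $r(p) = \frac{\alpha}{N} + (1 - \alpha) \sum_{q:\, q \to p} \frac{r(q)}{\mathrm{out}(q)}$

So with probability alpha the surfer jumps to a random page, and with probability 1 minus alpha he follows one of the out links of the current page.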
That is the only thing we have to change, so that we don't get stuck in a dead end and the algorithm can still be calculated. Then the random surfer probability is exactly identical to the page rank value that is used for retrieval. Okay, link analysis comes with a lot of parameters,
like everything in IR comes with a lot of parameters that can be changed. We've seen the alpha parameter; the jump probability can be higher or lower. We have talked so far about global approaches, but we could also do link analysis on a smaller subset.
We could say: I'm not interested in everything, I'm interested in these thousand pages, maybe the ones that I found in my search, right? And for these thousand pages I calculate the hub value, or the page rank value, or the authority value, whatever. This is what Kleinberg originally did; he had a vicinity graph
for a small subset of the web, and this would be a query dependent page rank, just the results of a query, or topic dependent, or things like that. Another parameter is whether we have one value, the page rank, or two values, hub and authority. Then we can consider local links or not,
like the links from the University of Hildesheim to its own sub-pages: are they necessary, do they really express something about quality? Maybe not, because they are not an independent vote, we could say, for these pages, and so forth;
and we have the combination: how do we combine the RSV and the page rank, again, another parameter. Here is just an illustration of what we could call a community link: if we had thematic clusters on certain topics,
we could have a link that is within the community, and a link that comes from outside of the community, and we can debate which one is more valuable. We don't know, right? Interestingly, now we come to some empirical results. This is a cross-topic link matrix.
What we illustrate here is an empirical analysis of where most links really lie: are they within a community, within a collection on a certain topic,
or between topics? Here is some interesting work; not very recent, but it probably hasn't changed very much. A large number, a few million web pages, were analyzed, and they were sorted according to the Yahoo top-level
or second-level topics. All the web pages were in the Yahoo directory, which is organized by categories; something like 160 or 180 categories were extracted, and the links on these pages were analyzed,
and people checked: do they link to pages in another topic category or in the same topic category? And as we can easily see, the cross-link matrix is almost empty between the topics; only
the diagonal has really a lot of links. There are a few exceptions; here is a topic that has received some links from another topic, but typically most links are received from the same topic: topic 100 links to pages of the same topic and not to other topics, the rest is almost empty.
That means most, probably 99%, of the links that were checked in this study link to something that is more or less about the same topic; so almost all links are topic dependent, not topic independent.
And this is one of the first such studies; there is a lot of quantitative research on web links: where is a link, when is there a link and when is there none? The reasons for setting a link are of course very individual for each author who puts up a page and says, I put a link here, I put a link there; we cannot determine who will set a link to whom. But we observe extremely stable patterns if we look at a large number of links. There is a publication called the Web's Hidden Order, and this is something that we can easily observe. Let's check these things. First we come to a bit of graph theory, which is nothing new; in the beginning we had already seen different representations of a graph, as a matrix, as a table we could say, and as a drawing. Graph theory is basically a field of mathematics concerned with network structures, objects that are linked by connections, and for link analysis graph theory methods can be applied. We have nodes, or units, and links, connections or edges; the neural networks that we had last week can also be seen as a graph. Links can be directed or undirected; for the web we have a direction, a link goes from one page to another. And we can have some basic patterns like rings, transitive relations, mutual relations, a simple link or no link, co-citation, where pages cite or link to the same object, and so forth. And we can have something like hubs or fans, we can have communities, we can have certain authorities that are linked to, and we can have pages that link to a similar group of pages. So this one links to similar pages, this one in this case links to the same ones, and people have used this kind of link structure to identify communities.
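As a small illustration of these patterns, here is a sketch of a directed link graph stored as an adjacency list, from which in-degree (authority-like pages), out-degree (hub-like pages), and pairs of pages linking to the same targets can be read off. The toy graph is an invented example.

```python
# Sketch: a tiny directed link graph as an adjacency list (page -> pages it links to).
# High in-degree hints at authorities, high out-degree at hubs; pairs of pages
# linking to the same targets are candidates for belonging to one community.
from collections import defaultdict
from itertools import combinations

graph = {"A": ["C", "D"], "B": ["C", "D"], "C": ["D"], "D": []}

in_degree = defaultdict(int)
for page, targets in graph.items():
    for target in targets:
        in_degree[target] += 1

print("out-degree (hub-like):      ", {p: len(t) for p, t in graph.items()})
print("in-degree (authority-like): ", dict(in_degree))

# pages that link to the same targets ("co-linking" pairs)
for a, b in combinations(graph, 2):
    shared = set(graph[a]) & set(graph[b])
    if shared:
        print(a, "and", b, "both link to", sorted(shared))
```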
Not only are there links between these pages, but a certain group links to a certain other group. Again, we have seen this before; it is the same picture. Now we come to an interesting question: if I am on one page on the web, can I browse from there to any other page on the web?
In the terminology of graph theory, that would be called a strongly connected component. If I take the web, or a subset of the web, say I extract a few thousand or a few million pages, I can check: can I really travel from everywhere to everywhere? If so, this is a strongly connected component, an SCC: each node can be reached from any other. So please check: if I land on any node, can I reach any other node? Hopefully yes: I go here, I can travel here, travel there, and there, and to there, from here. This graph is very highly connected, it has more links than the other one, we easily see there are more blue lines; it is strongly connected, we can reach everything.
Other components could be so-called weakly connected components, where we can also visit every page starting from every page, but we have to ignore the direction of the links; we assume that we can travel them in both directions. If that is true, we call this a weakly connected component, again in the terminology of graph theory. So let's see, is it true here? Of course everything is connected, but why is it not a strongly connected component? There must be some node from which we cannot reach every other one if we follow the directions. See, who has found it? I cannot go from here to there, right? So this one is only weakly connected, not strongly connected, whereas here I can go both ways, so I can reach every point from everywhere.
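For a small extracted link graph, these properties can be checked directly, for example with the networkx library; the edges in the sketch below are a made-up example, not the graph on the slide.

```python
# Sketch: checking strong vs. weak connectivity of a small directed link graph
# with networkx. The edges are an invented example.
import networkx as nx

G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"),   # a cycle A -> B -> C -> A
                ("C", "D")])                           # D is reachable but has no link back

print(nx.is_strongly_connected(G))   # False: from D we cannot follow any link back
print(nx.is_weakly_connected(G))     # True: ignoring directions, every page is reachable
print(list(nx.strongly_connected_components(G)))   # the SCCs: {A, B, C} and {D}
```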
So now we can ask the question we had before in today's session: what is the web? Is the web an SCC, a strongly connected component? Or maybe not? Somebody said maybe not, right? We could ask: is it at least a weakly connected component? Or is it not even that? Are there so few links that it cannot even satisfy the conditions for being weakly connected?
In terms of web pages, that is. But first, before we answer that, let's check the number of links that are out there.
The PageRank value depends on that, right? We integrate the PageRank value into our ranking model, so it matters what kind of values we have to expect. Remember that we also looked at the probability of a word occurring:
what is the distribution of words, how probable are they? We should ask the same question for the number of links, to know what kind of values we get. And here we find one of these hidden orders of the web. Very interestingly, and similar to the distribution of words in a language, links are also distributed according to a power law, similar to the Zipf distribution. We have a very extreme distribution of in-links over the web: very few pages with amazingly high numbers of in-links, and then many, many pages, millions of pages, with no or maybe only one in-link. Interestingly, we can observe this power law distribution, which is similar to the Zipf distribution, for many metrics on the web; we'll see a few examples maybe later on, maybe next week. And something that is typical for these distributions, and which we didn't talk about yet when we discussed Zipf, is that the average, the mean,
is much higher than the median. The median, remember from statistics, is the typical value. And the average, you know how to calculate the average, right? We don't have to talk about that: take the sum over all pages and divide it by the number of pages. Now think of incomes in Germany: what kind of distribution do we have there?
Think of the typical income and think of the average income. Are they close to each other or are they far apart? And how far apart?
And which is higher, what do you think? Imagine we line up all the people, from those with zero or low income up to the highest income, a billionaire, whatever. Now we ask: how many people have a low income, how many have an average income, how many have a high income? What kind of curve do we get?
And where is the average, where is the median? Is it on the left or on the right in this graph? Of course you aim for a higher salary, right, because you're studying. Will you have an average salary or a salary close to the median? What do you think: which is higher, the median or the average?
Good question, nobody knows for sure. You live in this society, so how is income distributed? Probably the average is higher than the median, since income is probably distributed something like a power law, the Zipf curve.
So we'll say the average is higher than the median? Yes.
So the median would be the typical income, the highest point in the curve, the income most people have; strictly speaking the highest point is the mode, but for a roughly symmetric distribution it coincides with the median. And you would say the average income is higher, and of course not so many people can have it. So the average is probably a bit higher: there must then be enough higher incomes to make up for the low ones, while the median is not zero. It is also possible that most people earn nothing or have a very low income, but I'd say no. Still, there are enough people who make a lot of money that they skew the average away from the median, right?
Let's take another example with only a few possible values: grades for an exam. Who will get an A? That is a one in the German system, and a five is failed, with grades like three in between. What about the median and the average here? What kind of distribution should a typical exam at a German high school produce?
If not, the director of the school will say, ah, that's not so nice, maybe even redo the exam. What is the goal in secondary education? Maybe we get something similar here.
Which distribution is named after the mathematician who used to be on the 10-mark note, on the bill for 10 marks? How is it shaped? What is the typical grade?
At school we of course also have a six, right? Six is failed, five is failed. What is the typical grade that most students should get? No, not a one. If all students get a one, then the director will say it was too easy, we have to redo the exam.
The median? Yes, the median should be a three. For the grades, the most frequent grade is also the median here, since the distribution is symmetric. Most students should get a three, fewer should get a two or a four, and even fewer should get a one or a six, right? What kind of distribution is this? A Gaussian distribution, named after the mathematician Gauss: a bell-shaped curve, also simply called the bell curve. And this is a shape where the median and the average are almost identical; the typical grade is also the average grade. In practice, of course, there are always small differences, but in a Gaussian distribution the median is almost the average. Now back to income, and to the power law.
Is income distributed according to a power law or according to Gauss? Which one is it closer to? You said it's not Gaussian, right? What would it look like if it were a power law?
Let's compare it to the Zipf-like shape rather than the Gaussian. In a power law, a lot of people would have a very low income and very few people would have a high income, just like the Zipf distribution we saw for words. There is a mathematical difference between a power law and the Zipf distribution, but for our purposes they look the same. Income would be very unequally distributed: very few rich people, very rich people, and most people would earn zero or the lowest income that is possible. That's not a socially nice society, right? We wouldn't want to live in it, because most of us would belong to the group that makes hardly any money.
And where are the median and the average here? The median is at the lowest value, right, down at the zero end of the scale. And the average would be somewhere higher up, probably around here; there are incomes so high that they pull the average far away from the median in a power law distribution. But as I said, we wouldn't want to live in such a society. Our actual income distribution is a bit more like a Gaussian, though of course also skewed, I would say, towards the high incomes, so my guess would be that the average lies somewhat above the median. The distribution of income in our society is not a Zipf distribution, not a power law; but in a power law we do have an average that is much higher than the median. Let's look at the distribution of links.
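The difference between mean and median can be checked quickly with simulated data; the distribution parameters and sample sizes below are arbitrary choices, only the qualitative contrast matters.

```python
# Sketch: mean vs. median for a heavy-tailed (Zipf/power-law) sample and a
# Gaussian sample. Parameters are arbitrary; the point is the qualitative contrast.
import numpy as np

rng = np.random.default_rng(42)

zipf_sample = rng.zipf(a=2.0, size=100_000)                    # power-law-like, heavy tail
gauss_sample = rng.normal(loc=3.0, scale=1.0, size=100_000)    # bell-shaped, like exam grades

print("Zipf:     mean =", zipf_sample.mean(), " median =", np.median(zipf_sample))
print("Gaussian: mean =", round(gauss_sample.mean(), 2), " median =", round(np.median(gauss_sample), 2))
```

For the Zipf-like sample the mean comes out far above the median, while for the Gaussian sample the two are practically identical, which is exactly the contrast discussed above.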
Here is a plot from a very small number of pages that I analyzed at some point, just as an example, with the real number of in-links that they have. And here is a publication by Broder et al. that shows the power law distribution of web links.
Here we have the in-degree on one axis and the number of pages on the other. We start with a low in-degree: not many people liked this page and set a link to it. As the in-degree increases, we say, aha, here is a page with, wow, 100,000 in-links: 100,000 web pages set a link to this page. Good quality, right? PageRank says good quality. And it sits very far on the right-hand side. But of course there are hundreds of thousands of pages, even millions, with very, very few in-links: millions and millions with fewer than 10 or fewer than 100 in-links. If you have 100 in-links you are already doing very well. So the median is basically the lowest value, 1 or even 0, just like for the words, basically.
And of course we see what we already talked about two weeks ago: the axes are logarithmically scaled. Both axes are logarithmic, so each step is a factor of 10, not a linear step. And when we plot all the links of the web, which we could think of as essentially random, something that nobody tells people to do, we end up with an extremely stable, almost perfectly linear trend in the log-log plot. So this is part of the hidden order: we get a linear trend for an apparently chaotic quantity, the number of in-links of a web page.
Nobody tells people what to do; they just do whatever they want, set links to some pages and not to others. And in the end, everyone who has looked at this so far has observed a linear trend. Only at one end is it broken a bit. No, that is actually not the long tail down there; those are the pages with high in-link counts, and the long tail would be over here. At that end we don't get such a nice fit, the linear trend is broken a bit, but for most of the distribution it holds extremely well.
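To see this hidden order in numbers rather than in a picture, one can fit a straight line to the in-degree distribution in log-log space; the sketch below does this for simulated in-degrees, since the Zipf exponent and sample size are arbitrary assumptions rather than real crawl data.

```python
# Sketch: checking the linear trend of an in-degree distribution in log-log space.
# Simulated in-degrees stand in for real crawl data; the fitted slope is the
# (negative) exponent of the power law.
import numpy as np

rng = np.random.default_rng(0)
in_degrees = rng.zipf(a=2.1, size=200_000)     # simulated in-link counts, one per page

degrees, counts = np.unique(in_degrees, return_counts=True)
keep = counts > 5                              # drop the sparse, noisy end of the tail
slope, intercept = np.polyfit(np.log10(degrees[keep]), np.log10(counts[keep]), 1)

print("fitted slope on the log-log scale:", round(slope, 2))   # a falling straight line
```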
The question would be: why is that? How can it be that in such a chaotic system, where nobody tells people what to do, we end up with such a linear trend? We saw the same for language, and we couldn't explain it there either; we will at least try to find an explanation.
But unfortunately, we cannot do it today. Thank you.