
Retrieval in the Web II


Formal Metadata

Title
Retrieval in the Web II
Series Title
Part
11
Number of Parts
12
Author
License
CC Attribution 3.0 Germany:
You may use, adapt, and copy, distribute and transmit the work or the content for any legal purpose, in unchanged or adapted form, provided that you attribute the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
This lecture gives an overview of Information Retrieval. It explains why documents are ranked the way they are. The lecture covers the most relevant approaches to content representation: automatic indexing and manual indexing. For automatic indexing, the frequency of words is of special relevance, and its influence on term weighting is discussed. The most relevant models are introduced. The session on evaluation discusses newer metrics such as the Normalized Discounted Cumulative Gain. The session on information behavior provides a brief overview and explains its relation to IR. The session on optimization mainly introduces term expansion and fusion methods. The session on Web retrieval is concerned with quality aspects and gives a basic insight into the PageRank algorithm.
Keywords
Transcript: English (auto-generated)
Okay, so let us finish the web retrieval topic. Remember, this is really the core distinction, something that we don't have in traditional IR; similar quality metrics are sometimes used in IR, but not so much, whereas on the web the quality metric is a really important aspect. As the most well-known quality metric we have mentioned the PageRank algorithm, which basically has three aspects. First: the more objects linking to my own object, to my web page, the better.
Many people voting for quality is good. Second, if the voter is of high quality, he gets a higher vote, right? It's not democratic, it's meritocratic, we could say: not every web page has an equal vote; it depends on their own qualification, we could say their own PageRank, basically. This makes it iterative, because we have to step back. And the third aspect is illustrated here: the PageRank of one page is divided by the number of outgoing links when we calculate the PageRank of further pages, of the recipients. So those three aspects are relevant; if you know those, you know basically everything about PageRank
or almost everything. Questions? No? Very straightforward. And remember, the same three steps are included in one formula, as it is written here: the PageRank of p depends on the outlinks of the pages q that link to me, and if q has several outlinks, then the PageRank of the sending page is divided among them, and then it's simply summed up. We have the sum there, and we have one parameter in front of this sum that we could almost ignore, but it's something that the formula has. We have interpreted this with the random surfer model, remember: somebody surfing forever over web pages, who gets more often to pages with a high PageRank, because they have more inlinks, and they have more inlinks from pages that are also often visited, while other pages are not visited so frequently. And since we may end up a lot of times in dead ends, there is a random parameter for jumping: every n steps we jump to some random page, so every page has a chance to receive some PageRank, or visitors, in the random surfer model. This teleportation parameter, the probability that you jump to a certain page every n steps, is basically the parameter epsilon in front of the sum. What you need to know is really the sum with the three elements.
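The sum with its three elements can be sketched in a few lines of Python. This is an illustrative toy, not the exact formula from the slide: the damping value of 0.85 and the way dead ends are handled are my own assumptions.

```python
# Toy PageRank iteration: the rank of a page p is the sum, over pages q
# linking to p, of PR(q) divided by q's outlink count, plus a small
# teleportation ("random jump") share for every page.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                   # start uniform
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}  # random-jump share
        for q, outlinks in links.items():
            if outlinks:
                share = pr[q] / len(outlinks)          # PR(q) split over outlinks
                for p in outlinks:
                    new[p] += damping * share
            else:                                      # dead end: spread evenly
                for p in pages:
                    new[p] += damping * pr[q] / n
        pr = new
    return pr

# Three pages: "a" and "b" both link to "c", "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

With this tiny graph, "c" (two inlinks) ends up ranked above "a" (one inlink from a well-ranked page), and "b" (no inlinks) gets only the teleportation share.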
That's really the core. Okay, and then we came to our quantitative research; that's where we ended up, right? We ended with a discussion of the distribution of this quality. We said, aha, how is it distributed? What is happening? And we see, oh, it's very similar to the frequency of words, even more extreme: we have a power-law distribution, very few pages with many inlinks and, vice versa, a lot of pages in the long tail that have very few inlinks, basically one or zero. The most frequent value will be one or zero: pages that have no inlink, or that have one inlink. But the average? Actually, I don't know the current number for the average inlinks of a web page; the only number I've read, which was several years ago, was eight. Don't take that for granted, but that was the average number of inlinks of web pages. So if we have 10 million web pages, we might have 80 million links, and the average could be eight, but the median will be zero or one, probably zero. On the other hand, of course, we have some pages with millions of inlinks, which bring up the average but not the median. This can be observed for many, many web metrics, and we'll try to explain it in a few slides.
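The gap between average and median can be seen on a tiny made-up inlink sample; the numbers below are purely illustrative, not real web data.

```python
# A miniature long tail: eight pages with zero or one inlink, one hub
# with many. The hub drags the average up to 8, but the median page
# still has no inlinks at all.
from statistics import mean, median

inlinks = [0, 0, 0, 1, 1, 0, 1, 0, 69]   # 9 pages, 72 links in total

print(mean(inlinks))    # the hub dominates the average (72 / 9 = 8)
print(median(inlinks))  # the typical page has zero inlinks
```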
So remember, it looks like this here, a simple example, and here is one of the first publications, very often cited, Broder et al., who worked for Yahoo at that time. And we see: on a logarithmic scale, the distribution appears as a linear mapping. What we tried to illustrate last time was that if we think of this quality as income in a society, for example, then this would be a very unequal society, where income would be distributed very extremely: a lot of poor people and a few extremely rich people, maybe something we had before the French Revolution, but not today. And if it were grades in an exam, it would also be very strange, right? Most people would fail or have a bad grade, and very few people would have all the points, which would be very wrong. As we said, in the typical school scenario we should have a Gaussian bell-shaped curve, a normal distribution of the probability of the grades over a student population, whereas the distribution of inlinks over a realistic web page population looks like this. Now,
why is it that way? Why do we observe such power-law distributions? Why do we also observe Zipf distributions? Does anybody have an explanation for that? I think similar things are observed in bibliometrics, right? Most papers that are published have zero citations, or one, in this case; so if you get cited at least once, it's already okay, you're at the median, and if you get cited more often, it looks pretty good. What could be the reasons that we end up with such an extreme distribution? Well, there has been a growth model of the web, published by other people, who said: let's assume we start from zero, then we add a page, we add a page,
we add a page, and each time we add a page, we add several links. Then, when we have 10 billion pages, we stop and look at what the distribution looks like. If it looks like the real one, then we have found a model, for growth and for the reasons to add links, that is realistic. So each time we add a page to this structure, we draw some links for it, and then we see: how can we end up with such a distribution? One possibility, of course, would be: I add a new link, and all existing pages have the same probability of receiving it, an equal distribution. Each time I add a link, I randomly assign it; everybody has the same chance. But then I would not end up with a distribution like the real one; I would end up with something bell-shaped, a Gaussian curve. So that was one factor that was tried out: everybody has the same chance. Of course, units that entered earlier in our growth model have a higher chance of accumulating links over time, but at any one point in time, everybody has the same chance. What else could we include in our model when we add a new link? There has to be a probability distribution over the pages.
The model that ends up with such a distribution has only two factors, basically. Maybe most websites are more interested in creating content than in linking to other content? It could depend on the content, yes; obviously it depends on the content, right, on whether you're interested in it, and last time we talked about the cross-topic citation matrix, where we saw that people typically cite the same stuff. But if we go into content, it gets too complicated. Yes, of course it's always content, and it seems surprising that we end up with a power-law distribution, but let's think in purely statistical terms, in probability terms. Who could have a higher probability of receiving more inlinks? It could depend on what?
Exactly: how many inlinks do I have already. That could be another factor, proportional to the existing inlinks. So if I already have 1,000 inlinks, I have a 1,000 times higher probability of receiving a newly created inlink than a new page, or a page that has only one inlink so far. So this page that only has one inlink has a very hard time gaining quality, or PageRank, or inlinks, however we talk about it, but the page that is already at the top will very likely remain there, right, because
it has these two factors. This is exactly what has been done: simulations of the development of link structures. As I said, let's assume we grow a graph, the web, page by page, and for each page we add something between 0 and 20 links, with these two factors, right? One is the equal distribution, and one is what you said: the more you have, the more you get. And this is exactly what this model does: the probability that a newly created link reaches a page depends on two factors. One over u means everybody has the same chance, u being the number of units; the other factor is the link count I already have, divided by the total number of links in the network. How many links do I already have, say 1,000, divided by how many links there are, say 1 billion; that is a probability, right? It needs to be normalized to be a probability. It could be the case in the early stage that there are 10 links in total and I already have all 10, so the probability is 1; that's why it is normalized. And we see there is an alpha and a 1 minus alpha, and the alpha is like a slider that controls how much influence each of these two factors has: the equal distribution and the unequal distribution, we could say. People have run this simulation, and only for one value of alpha did they end up with a power-law distribution, and they said: aha, that's probably how the real distribution on the web arises. And what do you think alpha could be? Something between 0 and 1, and we already assumed that it leans towards the unequal factor, right? The weight of the equal chance is very low; in fact, it's 10 percent: alpha is 0.9. If you put alpha at 0.9, you end up with a power law; if not, you end up with something else. And then you see: hmm, probably the web works like this. So 90 percent of the probability that a page will receive an inlink is determined by, as you said, the inlinks it already has, and only 10 percent is random.
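A rough sketch of that growth simulation, assuming the two-factor rule just described: each new link goes to a uniformly random older page with probability 1 - alpha, and to a page chosen proportionally to its current inlink count with probability alpha. The alpha = 0.9 value follows the lecture; every other parameter and name here is my own illustrative choice.

```python
# Toy preferential-attachment growth: pages are added one by one, and
# each new page creates a fixed number of links to older pages.
import random

def grow(n_pages=5000, links_per_page=5, alpha=0.9, seed=42):
    random.seed(seed)
    inlinks = [0]            # inlink count per page; page 0 exists first
    targets = []             # one entry per existing link, so sampling from
                             # this list is proportional to inlink counts
    for new_page in range(1, n_pages):
        inlinks.append(0)
        for _ in range(links_per_page):
            if targets and random.random() < alpha:
                t = random.choice(targets)       # "the more you have, the more you get"
            else:
                t = random.randrange(new_page)   # equal chance over older pages
            inlinks[t] += 1
            targets.append(t)
    return inlinks

counts = grow()
```

Plotting `counts` on a log-log scale should show the roughly linear, heavy-tailed shape discussed above: a few early pages accumulate a large share of all inlinks, while most pages stay near zero.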
Why? Can we explain that somehow? Basically it's somewhat an economic principle. Any ideas? That was the science part; now comes the interpretation, right? Well, of course people will link to pages that are well known, because they know them. You have a higher probability of knowing a page like Google.com, or the page of some other famous company, and a lower chance of knowing an unknown site. Because, as we said, there are billions of sites, and most of them you've never heard of, and you will never set a link to them, right? People can only set links to pages that they know, and for well-known pages, the knowledge about them will be distributed much more widely than for unknown pages. This is a typical network effect: people will talk about famous pages, and those will receive more inlinks and even more fame, more people will hear about them, whereas for unknown pages, who knows about them? And that's an economic principle: my head is too small, I cannot know all the pages, all the objects; I have very limited knowledge, some I will forget, and those that people keep talking about will stick in my memory. So it's an economic principle to keep the memory load low: I can only remember a few things, and to those I will set a link, and about the others I don't even care. Probably something very similar is behind the Zipf distribution: everybody has a limited active vocabulary, the amount of words they use every day, I think 10,000 or so, some people a little less or much less, and of course it's sufficient, right? If I hear words that others frequently use, that are constantly around, I will also use them in my own language, so the words that everybody uses all the time will get used a lot, while others might not be used so much. So that's probably a memory-based way to explain these distributions.
Okay, questions so far? We now have a simulation model that ends up with these parameters, and we could say: aha, in the real world, probably only 10 percent of the probability that a link created at one point in time reaches a certain page is determined by randomness; the rest is determined by what that page already has. This is called the Matthew effect, something that is also reported in bibliometrics. We have the same phenomenon there: a good paper that has already received a lot of citations will receive even more citations in the future, whereas unknown papers need a few citations first before they can really take off, and it's hard to get the first ones. Why is it called the Matthew effect? Where does it come from? Has anybody ever read the book in which the Matthew effect is explained? No? It's the Bible; in German we call it the Matthäus effect, right? Jesus says in the Gospel of Matthew, towards the end, you can read it there: for everyone who has, more shall be given, and he shall have an abundance; but from the one who does not have, even what he does have shall be taken away. So there it's even more negative, as things are taken away, which is not really the case in link analysis; you can only receive links, but you may have zero. So: who has shall be given more. That is the principle behind link analysis and
also bibliometrics. Other distributions that again follow a power law, maybe for similar reasons, are outlinks: we have a few pages that have thousands of outlinks, which is kind of strange, since nobody can go through them, and we have lots of pages without outlinks, or with very few. Then in- and outlinks between hosts: so far we have talked about links in general. Sometimes you could say, well, the link from the University of Hildesheim homepage to some institute is not a quality indicator; it's just there for navigational purposes, it has to be there, otherwise you can't get there. We ignore those for the link analysis, for the PageRank calculation: only external links count, and you end up with a very similar power-law distribution, ignoring the navigational stuff. Sizes of sites are also similar: most websites have a very limited number of pages, and there are some, like IBM, that have millions of pages, and Facebook now probably has millions of pages, right?
Sizes of WCCs and SCCs: remember what SCC means? Last time we had it: strongly connected component, a set of pages with a link path from everything to everywhere. And of course we have the web, and we still don't know whether the whole web is one strongly connected component; we'll resolve that in a minute. But I can always find a few pages on the web that form an SCC, and then other pages that form an SCC; maybe I can get bigger portions, a hundred, a thousand, a million pages that form an SCC. And if I find all the SCCs in the web, they will also follow a power-law distribution: there are millions of small SCCs, and fewer and fewer as the size gets bigger. The number of friends in social networks is also power-law distributed, it has been said, and so forth.
And this is, as we said, quite amazing: a linear distribution on a log-log scale, a logarithmic plot, and probably all due to reasons similar to the ones we explained with our simulation. And now, of course, we can ask: does link analysis really make sense? Does it really make sense to include the PageRank as a quality indicator in the retrieval? We can say: does it really mean quality, or just being well known? Being the most well-known page is determined 90 percent by the network effect. Once people assume you're good, you will get links anyway, right? Nobody will really check whether this is a good page; I will just set a link because I know it. Then we have things like recency: what about really new pages that might be very good, but nobody has had a chance to link to them yet? And we had thematic similarity; remember the cross-topic citation matrix again, which would say a link has nothing to do with quality, it only has to do with similarity of topic. And the next thing, which we'll also talk about in a while: the PageRank can also be manipulated. At this point in time that doesn't play such a big role, but five to ten years ago manipulating link-based analysis was a big business. People created link farms, huge collections of pages that had no information by themselves, that were just there to link among each other and create pages with a high PageRank, and to sell links from these pages.
So, also an issue: does it really make sense? Now we come to the final question: is the web an SCC? Can I go from every page to every other page? And the answer is simply no, not really, only within a part of the web: the biggest SCC is around 20 to 30 percent, depending on the publication you read. Then there is what is called the in component; we'll see the structure in a second. And is there a path? If there is a path, what is the average length? It's 19 clicks, but there are again different publications, and it's the average; maybe the average doesn't always tell us so much. We've talked a lot about the difference between average and median, and the path length might also be power-law distributed. Here is the structure of the web, as it has been published by many people now, also for national webs and so forth. We have a strongly connected component that, typically, people say makes up 30 percent of the web. Then we have two similarly big components, the in and the out component, and some disconnected islands and tendrils that branch off; if we sum up all these islands and stuff, they make up maybe 15 percent, and in and out are each between 20 and 30 percent, depending on the publication. The in component contains pages from whose links you can always reach the SCC, but there is no way back: once you are in the SCC, you cannot come back to the in component. From the SCC you can reach any page in the SCC and in the out component, but you cannot go back into the in component; and wherever you are, if you are not on one of those islands, if you select a page in the in component or in the SCC, you can always reach out, but you cannot go back. That means in, SCC, and out together form a WCC, a weakly connected component: ignoring the direction of the links, I can go anywhere. And then the WCC is quite big,
probably 80 percent. The typical explanation for the out component is that these might be commercial pages that have little interest in putting out links, because they want visitors to stay on their company site and not be directed somewhere else. The in component maybe contains fans, or new pages that try to become well known and link somewhere, or personal pages that link to others while nobody links to them, because nobody knows them. Those are the millions of pages that have zero or one inlink; they could eventually have few or many inlinks, we don't know, but once we are in the SCC, we cannot navigate via links back to this area. So now the random surfer model also makes sense, right? If I go into the out component, I will end up in a dead end somewhere, and I will need a random jump parameter; or if I end up on an island, I will not be able to go anywhere, and at some point I need to jump somewhere, maybe into the SCC, where I can again navigate to many, many pages. What does this image remind you of? What does it look like? That gives the name to this structure: it is the so-called bow tie structure of the web, remember? "Fliege" in German, what men put around their neck. So the bow tie structure of the web is now something you have already heard about. Okay, now a few other topics related to web retrieval. We leave, more or less, the topic of quantitative link analysis: PageRank, the models of the quantitative behavior, and the bow tie structure.
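The bow-tie classification just discussed can be sketched in code: find the largest strongly connected component (here with Kosaraju's algorithm, one standard way to compute SCCs), then mark pages that can reach it (the in component) and pages reachable from it (the out component). The five-page example graph is made up for illustration.

```python
# Classify a tiny directed graph into core SCC, in, and out components.
from collections import Counter

def sccs(graph):
    """Kosaraju's algorithm: map each node to a component id."""
    order, seen = [], set()

    def dfs(u):
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                dfs(v)
        order.append(u)                  # record finish order

    for u in graph:
        if u not in seen:
            dfs(u)

    rev = {u: [] for u in graph}         # reversed graph
    for u, vs in graph.items():
        for v in vs:
            rev[v].append(u)

    comp, label = {}, 0
    for u in reversed(order):            # decreasing finish time
        if u not in comp:
            stack = [u]
            while stack:
                x = stack.pop()
                if x in comp:
                    continue
                comp[x] = label
                stack.extend(w for w in rev[x] if w not in comp)
            label += 1
    return comp

def reachable(graph, start):
    """All nodes reachable from the node set `start` by following links."""
    seen, stack = set(), list(start)
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(graph.get(u, []))
    return seen

# Tiny bow tie: "i" links into a 3-page core {a, b, c}, which links out to "o".
g = {"i": ["a"], "a": ["b"], "b": ["c"], "c": ["a", "o"], "o": []}
comp = sccs(g)
core_id = Counter(comp.values()).most_common(1)[0][0]
core = {u for u, c in comp.items() if c == core_id}
rev = {u: [] for u in g}
for u, vs in g.items():
    for v in vs:
        rev[v].append(u)
out_component = reachable(g, core) - core   # reachable from the core
in_component = reachable(rev, core) - core  # can reach the core
```

As in the lecture's picture, "i" can reach the core but not come back, and from the core you can reach "o" but not return.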
Or are there any questions still remaining there? Another interesting issue that has been discussed and used a lot for web retrieval is anchor text. So far we have said: okay, there is a page, I count the words that are there, and only those words are indicators for the content; only what the people themselves write, what the page itself contains, indicates what its topic is, what I can search for. On the other hand, some pages might have a lot of inlinks, like ibm.com, and now I can look at the text of the link. It doesn't necessarily have to be the URL; the URL is the URL, but there is an anchor text. The a tag in HTML, for those who are familiar with HTML, gives me the possibility to add a text
that is different from the URL. That can be maybe something trivial as IBM or somebody writes IBM homepage or somebody might even write IBM computer company homepage
Now the idea is: well, that's a kind of content description, and it might be an independent one. IBM itself wants to be found for some topics, for some keywords
that they decide on and put on their website, but the anchor texts are independent votes on what the content could be about, right? Somebody might write whatever for another page, 'German news agency', right? And the news agency
page itself doesn't necessarily have to contain that phrase, but it's described by outsiders as a news agency, and that gives me a very good, high-quality content description. Sometimes it can also
of course be influenced by opinions and things like that, negative comments, but basically this is how anchor text can be used, and it resembles a kind of content representation that we have talked about before.
Without the people being aware of what they are really doing, it's a kind of information work: I have a URL, I decide to add a link, and then I add the text describing this link.
What am I doing there, in terms of what we have talked about so far in the lecture? Indexing, exactly. Not automatic, obviously: manual indexing. I decide, I think about what I could write there, right? Ah, it's the IBM homepage, big deal.
But I'm creating an index, I'm doing some kind of knowledge work, without thinking that somebody might search for it one day and that I should put a very good description there. Somebody could also write 'International Business Machines' or something
and expand the acronym. That means here we have basic manual indexing that we can exploit; there is a lot of knowledge in anchor text, so search engines can easily exploit it. So you stop me when the time is over.
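The manual indexing hidden in links can be exploited roughly like this. A minimal sketch, assuming the anchor texts of the in-links have already been collected during crawling; the page, the anchors, and the 0.5 anchor weight are made-up illustrations, not values from any real engine.

```python
from collections import Counter

# Hypothetical anchor texts of links pointing at ibm.com, collected
# while crawling the pages that link there (assumption).
anchors = ["IBM homepage", "IBM computer company homepage",
           "international business machines", "computer giant IBM"]
page_text = "welcome to IBM solutions services"

def index_with_anchors(page_text, anchors, anchor_weight=0.5):
    """Term weights for a page: its own words count fully, in-link
    anchor words are added with a smaller weight."""
    weights = Counter()
    for term in page_text.lower().split():
        weights[term] += 1.0
    for anchor in anchors:
        for term in anchor.lower().split():
            weights[term] += anchor_weight
    return weights

idx = index_with_anchors(page_text, anchors)
```

Note that terms like 'homepage' or 'machines' become searchable even though they never appear on the page itself; that is exactly the point of anchor text indexing.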
Anchor text can be used for indexing. Here we have another example: 'computer giant IBM'. If I take a broader window, if I don't only use the link text itself but also four or five words after it,
or before it, I might get a lot of additional terms that really have to do with IBM, for example 'Big Blue',
right, the common nickname, which might not be on the IBM page itself, we don't know. Or something like 'profits', 'records': we still learn that it's probably a company,
economic terms, computer terms. The link could sit in a list of other computer resellers or computer companies, and it might be said that it's a 'New York-based computer giant'; that gives me a lot of information. So when we index, we can take the anchor text in addition and index it too.
Of course we can also have unexpected side effects like 'evil empire'. I don't know which companies could have been called that: IBM at one point, or Microsoft, or maybe now it's Facebook, who knows, or the NSA. Somebody could write 'evil empire' as anchor text, and that might be a bit misleading for most users. But of course I give a smaller weight to each individual anchor text, so that only anchor texts that are used
more frequently will really prevail. And obviously, to connect this to the knowledge we have so far: I cannot assume there is always anchor text, because as we know, most pages have no in-links or very few, maybe one or zero.
Most pages, maybe 90%, will have no anchor text at all, so it's not a solution for everything. The next issue is manipulation. What does manipulation mean on the web, and who is interested in
web retrieval and its results? Any ideas about that? Yes: everybody who wants to
make money wants to be retrieved, exactly, yes. People who invest: ah, I'm going to build up a website, I'm going to put up some information. All of that costs money, or at least it costs time, and probably money if you do it professionally.
They will think: well, if nobody finds my page, that's not so good, all this investment is lost. So people are interested in being found, and that's quite a new role that we have not had in retrieval so far. If you remember
the sketch of the IR system that I showed at the beginning: we had an author who writes a text, a search engine, and a user. The search engine and the user were the main actors so far. Now, all of a sudden, this author, or the company that invests in the website,
becomes a player too, and their interest is obviously: I want to be found. What does that really mean in the end? What could it mean for IBM, 'I want to be found'? Does it mean, for every web search
whatever, I want to be number one? Maybe not for every search,
but I want to be number one, or at least on the first screen, for what kind of searches?
Something related to my business, probably. Or I could have a different idea too. Some people think, yes: for jobs,
I want to attract good employees, good applicants. Yes, I present myself as the best computer employer or something. Or selling things, for queries that are computer related.
Okay, we will see that in the metrics that are used. So there are different interests now. And does the user have the same interest? If he types in 'computer', does he always want to find IBM? So sometimes
these interests are not identical: the company maybe wants to be found for 'computer', and the user might want to find something else. What would be the best result
for the query 'computer'? The query is 'computer', and what is the result?
Is the user asking for the same website? Yes, let's assume they sell computers, right? Or maybe there are also computer companies in Hildesheim that sell computers.
Which would be the better result, IBM or HTC, which I think is one company here that sells computers? What should I get?
That's now a crucial issue: many people would be interested in appearing in the result list for the query 'computers'. Maybe there are other queries that people are not so interested in, so there are different types of queries with regard to this interest.
Some queries are commercially very interesting because they are very competitive; maybe computers are expensive, like cars. What is the true result for 'cars'? Of course, it doesn't make any sense to ask for the one true result for 'cars'.
And what are other aspects of query terms that could make them interesting? Location-based, that could be one, yes; you can think of very simple things.
Also, query terms are, I can tell you, power law distributed; obviously everything is, right? So which are the interesting terms? The frequent ones, of course, the ones that millions
of people ask, that are posted millions of times. Those are interesting, whereas for queries that appear very rarely, nobody cares. So now we have different interests, and for those queries that millions of people post, maybe
they have different interests in getting a good hit, what could be a very frequent query on the web today maybe apart from pornography related stuff which always attracts a lot of
keywords, maybe Facebook would be one of the top queries probably, right, now that would be commercially interesting if somebody queries Facebook and it ends up, all those millions people end up on my page to buy where I sell something, right, that would be nice
but might not be what the people expect, so we have different interests all of a sudden, and now we have different techniques of course you know the keyword, what are we talking about now so called SEO, search engine
optimization is people, offers, not changing the content of their website in order to be found so what could they change for example what is relevant for being
at the top? We have learned a few aspects in the lecture. What is relevant to get a higher position?
This has to do with the structure of the web page. We haven't talked about different document fields; there is something there, obviously,
that is used, and we can talk about that in a minute. But from the knowledge we have so far from the lecture: what is important, and which of these factors can I change to get a higher ranking, a higher similarity?
What is important, and what can I not change? What is the core of the ranking functions? We talked about that for hours.
Incoming links, yes, in web retrieval, so I want to have incoming links. But before that, even simpler, from the lectures at the beginning of the term, think of 'Wolf' and 'Niedersachsen': what are those two very important things? If you don't know
these, you will fail the exam, there is no chance: TF and IDF. I cannot change IDF, because everybody together influences IDF, and it's power law distributed; I cannot really change a term's
position there. But I can change TF. If I want to be found for whatever, for 'computer' or 'Facebook', I want to make sure it appears more often. So I could simply write 'Facebook' 100 times on my page, and that would
give me a higher TF for Facebook. Does that make sense? Does anybody want to visit a page which just says 'Facebook' 100 times? It has no informational value,
but it would increase my TF. Of course, this is a very simple technique that the search engine will notice, and now the search engine again has its own interests: it doesn't want to return a page that contains 'Facebook' 100 times.
That's not informational; it's bad for the user, bad for the engine's reputation. Take some words and repeat them, why not? Then you increase the TF. That's the
pure knowledge from the IR textbooks that you have learned: increasing TF will increase your rank; that's math, right, there is no other way. Of course there is a logarithmic decay: at some point, if I put the word there 1000 times,
it doesn't help me a lot, as you know, and it doesn't look good for the user anyway. But if I were to put 'Facebook' 1000 times on my page, this would be, some say, something illegal; I think I learned this.
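The logarithmic decay can be made concrete with the common 1 + log(tf) damping from the textbooks; this is one standard variant, not necessarily what any particular search engine uses.

```python
import math

def tf_weight(tf):
    """Logarithmically damped term frequency: 1 + ln(tf) for tf >= 1."""
    return 1.0 + math.log(tf) if tf >= 1 else 0.0

# Writing "Facebook" 1000 times instead of once does not buy a
# 1000-fold weight: each tenfold repetition adds the same small step.
w1, w10, w1000 = tf_weight(1), tf_weight(10), tf_weight(1000)
```

Going from 1 to 10 occurrences gives the same additive gain as going from 100 to 1000, which is why stuffing a page with a keyword quickly stops paying off.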
So there is a law? If you don't have a reason to put 'Facebook' 1000 times in some text box, or in hidden text, maybe white on white or black on black, something like that,
so the user won't see it, it is illegal? That means there is a law against this? When did the Bundestag pass this law? 'I remember I learned it', that we cannot use
keyword stuffing and white text. Yes, you use hidden text, that's not nice anyway, and it's called keyword stuffing: I increase my TF by stupid means; it doesn't mean anything for the user, in fact it's quite annoying for him. And maybe somebody finds my page typing 'facebook',
but he doesn't want my page; most people will want the facebook.de page. So the search engine itself has an interest in not leading this user to my page. But anyway, 'it's illegal': where is this law,
what is it called? Cyber law, search engine law? No, there is no such law. What happens is simply that Google lowers your ranking position, let's say.
So Google decides what is allowed and what is not allowed. Is that okay? Who should decide what is allowed in a society? What do you think?
So if, accidentally, you employ an intern and he does keyword stuffing without knowing about it,
your company will be hidden somewhere, nobody will find you, and you may go bankrupt. What can you do? One answer: in this case it's okay if Google does it, because it perhaps increases the quality of the search results.
Okay, so that's of course the reason Google would give: we want to deliver good results, we don't have a choice, otherwise people write 'Facebook' or 'sex' ten thousand times on their page, and that's not really helpful. But how many times can I write a word?
Okay, so at a certain TF
it would not increase any more; you would get a decay. So we have not a logarithmic function but one that flattens out at some point, probably; that's roughly what Google says, and they tell you: we won't even tell you how much, but too much is too much. So they are not transparent, yet they decide about the success
of your business. You don't even know how much is allowed, and if you feel treated unjustly, you don't have any way to appeal. So, of course, there is no law,
and something that is a very important factor for economic success, like a search engine in this case, is not regulated by the normal means by which power is distributed. Power comes from the people, as the constitution says;
in this case, something is decided by a company. So this is really an information-ethical issue. We won't go into much detail now, but it will come up in one of the master courses, where we can talk about it a bit more.
And of course we now have a mishmash of different things. There is, yes, the question of consumer protection, which has been applied to search engines, interestingly,
but that requires quite some interpretation and extension. If you put 'Facebook' 1000 times, that may mislead; but if I put 'car' 1000 times and I actually sell cars, who will say that I'm misleading the consumer?
Yeah, it depends on whether you mislead and use unallowed strategies, in this case. But if I sell cars and I want to put 'cars' 10,000 times, why not? I'm not misleading anyone; maybe I really have 10,000 cars.
But I'm punished anyway. So we have a real mishmash: how much is too often, and things like that, cannot really be regulated by law, because we don't even know the algorithm. This is also the reason why Google doesn't publish the algorithm: we don't know how PageRank really works in production, or
what modifications they made to it. We know how the original PageRank works, and we teach it here, but it can of course be modified and probably has some additional stuff in the real implementation. So we don't know how much TF is allowed and things like that, because they say: if we publish this, SEO will just
go wild and it will be even worse, and we will have to invest much more to fight SEO, what we call illegal or unallowed, let's call it unethical, SEO methods. And there are some SEO methods everybody would say are unethical and shouldn't happen, because they mislead
the user. But where is the border? There are no democratic regulations that determine where the border is. I cannot force Google to bring me back up into the results, which has happened to BMW for example: they were punished and somehow reached an agreement with Google, but that of course was never published,
and it's all very strange. So, spamming: keyword stuffing, which can be query-log driven, say: what are the most popular queries at this moment, let's put those words on my page. Link-based spam is something you mentioned, of course:
I want to have good in-links, and there are people who sell good in-links, so this is also a very profitable area of SEO, probably declining a little, but still there. Then there are other things like cloaking. This is a method
where you have two different web pages and you check who is coming in. You check the user agent in the HTTP request: is it really a user with a browser, a Firefox browser with some version, or is it the crawler of a search engine?
And if the crawler comes, you show it one page, and if the user comes, you show him the real page. So you can again serve different words to attract traffic to your page. And there are many other forms, of course. TF-IDF spam was really the first generation of SEO,
where people just, as we said, put a keyword a thousand times, which of course would be unethical; everybody would probably agree that this is strange. I just repeat 'Maui resort', something popular that I want to be found with,
and I get a higher TF-IDF, and because humans would really not like to see that, I can make it invisible in the browser. Okay, and now of course,
in web IR, pure keyword density has become a problem; TF-IDF alone has become a problem. Other things are meta tags. Meta tags are also things you don't see unless you look into the HTML code, and people can put a lot of stuff in there: 'MP3', 'Britney Spears', things that have
maybe nothing to do with their content. And keyword stuffing, we have seen that. Here we have cloaking: is the visitor really the search engine's spider or not? And again, Google would say: aha, if I detect cloaking,
because sometimes I don't tell the server that I'm the Google crawler, I just claim I'm a real user, and if I then detect a difference between the pages, I can punish this. Then again we have the problem: how can I appeal, and what can I do if maybe my intern did something wrong? Many, many
topics, many, many techniques in SEO. But the core problem is an ethical one that remains with the search engines: they tell you what they tolerate, they don't tell you exactly what they do, but they give you some hints so you won't make a bad mistake and drop completely out of the ranking.
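The cloaking check just described can be sketched as a comparison of the two served pages. This is only an illustrative sketch: the actual fetching under different user agents is omitted, and the word-overlap measure and the 0.5 threshold are my own assumptions, not the method any real engine is known to use.

```python
def jaccard(a, b):
    """Word-set overlap between two versions of a page (0 to 1)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def looks_cloaked(page_for_crawler, page_for_browser, threshold=0.5):
    # If the text served to the crawler barely overlaps with the text
    # served to a normal browser, flag the site for review.
    return jaccard(page_for_crawler, page_for_browser) < threshold

# Hypothetical example: keyword-stuffed page for the crawler,
# harmless page for the human visitor.
spam = "maui resort britney spears mp3 cheap flights facebook"
real = "welcome to our small maui hotel book a room today"
```

In practice the engine would sometimes crawl with a browser-like user agent, exactly as described in the lecture, so that the cloaker cannot tell which requests are the comparison fetches.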
Google does that, and Yahoo, probably Bing also; I don't have the Bing page here. But of course there is always a battle between SEO and the web search engines, and this field is called adversarial IR; the research is published under this
name. So, IR from the perspective of the provider who wants to make money: increase visibility in the search engines, get higher positions for heavily contested terms that are commercially interesting, and to that end change
the content. So far we never talked about changing the content: the content was simply there, and we indexed it, we counted words. Now somebody says: ah, somebody is counting the words, so I add some more, right? Somebody is counting the links, ah, let's get links, right? So this is the other side, and the search engines
are kind of ambivalent: they keep their methods secret and fight spam, adversarial IR as we call it, but they give you some tips. This is of course an unresolved ethical problem; maybe we cannot resolve it, but it's there. That finishes the web retrieval part.
What is the time, please? So we have another 30 minutes. Is it really so late? I think it's not so late, I don't believe you; we probably have 40 minutes left, I guess.
Now let's move on: the second topic will be user behavior. Once we are on the web, we can observe a kind of micro user behavior: we know about clicks, about people entering keywords, and those are first hints towards user behavior. That is research done on the web,
for example on the popular question of user needs. We have millions of queries; now how can I find structure in them? One of the popular distinctions is again by Broder, the third contribution of Andrei Broder that we hear about today.
He first introduced these terms, and they're still used today. He said: well, there are three broad categories that we can separate within these many queries. One is informational: I want to find information. Saying 'I always want to find information' is a bit misleading,
but that's what we've talked about so far: I want to find documents about wolves in Niedersachsen, that would be informational. Something I want to learn about, he calls it. It is also sometimes called ad hoc; the ad hoc retrieval scenario is what he calls informational. Everything we talked about so far
is ad hoc and informational. But he says on the web there are other types of queries. Some he calls navigational. It's easier to understand those first and then say: aha, informational is the rest. Navigational is easy to understand. What could that be?
That's something like the query 'Facebook'. I'm not looking to learn anything; I don't want to learn anything about Facebook, I don't want to know what it is. Why do I post this query? I want to find the homepage. I'm too lazy; I find it more convenient to type 'Facebook' into the search engine than to enter
the real address in the address bar of the browser. Or I type 'Google', or I type 'Microsoft': I want to go to the homepage of that institution, of that company. So actually, my interest might be something completely different.
I'm typing 'Facebook', but in reality I want to send a message to a friend. I don't type 'send a message to my friend', right? That would be informational. But part of that information need requires that I go to the homepage of Facebook, or to the homepage of eBay, maybe.
eBay is also popular, or used to be. I want to buy something on eBay, and part of that is first going there, first reaching the eBay page. So a navigational query is typically very brief, and I can easily click on the first link.
Typically, navigational queries are well solved. Then I'm there, and the search engine says: okay, you're fine, I have served you, the service was okay, you found your eBay page. But for the user, really, this is only the beginning: he will probably spend more time on his actual information need than on this small beginning.
So much for navigational queries. The third category, again not so easy to understand, he calls transactional: people want to do something, they want to go through a transaction. Maybe go through a buying process,
or through a download. Or, what I find probably a little easier to understand: I want to make a transaction on my bank account.
Email, yes, that would be a good example. Email: I want to go to the homepage of the web email service, and then I do my real work there. Yes, but there is a large overlap,
and I find it hard to distinguish between navigational and transactional. I could also say: yes, I want to find the homepage of the email service; then it would be navigational. But afterwards I write an email, I make a transaction. Is it then still navigational?
I haven't found a really good definition that distinguishes them well. If I want to make a download, I type in 'freeware PDF creator' or something; basically I want to find a software that creates a PDF. Can you imagine? We never know what users want, right?
We just see a little query, two or three words. It's very little, and they may be thinking about a lot of things; they have a lot in their mind. I don't know what their real intent is; even they may not know, and I cannot know it from a few words. So again, a bit hard to distinguish,
but these are good examples, right? If I type, for example, 'PDF creator download': yes, I want to do something. The transactional aspect is in the word, right, 'download', or 'write email' or something. But I could also say this is informational: I first want to find a software that does that, right?
Anyway, here are his examples. Let's see what he put: 'United Airlines', yes, I want to find the homepage. 'Seattle weather', okay, this is an interesting, good example: I want to make a transaction,
I enter a keyword, and I get a service back. In a way it's a brief piece of information, and then I have an informational case. 'Mars surface images': I want to access images.
Could also be informational, I'm not so sure. I don't want to go to a Mars homepage, that's for sure, but whatever. 'Canon S410': he assumes that the customer wants to buy the product or find its price, and it doesn't matter whether that's on eBay
or in whatever shop, so the customer just types in the product name, and the transaction comes afterwards: buying it, or entering something to get the price. But yes, maybe I also want to inform myself about the camera and find out whether it's really good or not, and then it gets a bit fuzzy.
But that's no wonder, because we have so little information: 'Canon S410'. Who knows what happens in the brain of the user? But this is really one of the first and most often used category systems for web searches.
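A crude heuristic in the spirit of Broder's three categories can be written down directly. The site list and the transaction words below are my own illustrative assumptions, not anything from Broder's paper, and as just discussed, real intents are much fuzzier than such rules.

```python
# Toy rule-based classifier for Broder's query taxonomy.
KNOWN_SITES = {"facebook", "google", "ebay", "yahoo", "myspace"}
TRANSACTION_WORDS = {"download", "buy", "order", "booking", "login"}

def classify_query(query):
    words = query.lower().split()
    # Single word naming a well-known site, or something URL-like:
    # treat it as navigational.
    if len(words) == 1 and (words[0] in KNOWN_SITES
                            or words[0].endswith((".com", ".de"))):
        return "navigational"
    # An explicit action word hints at a transaction.
    if any(w in TRANSACTION_WORDS for w in words):
        return "transactional"
    # Everything else: the informational catch-all, as in the lecture.
    return "informational"
```

For the lecture's examples, such rules would label 'facebook' navigational and 'pdf creator download' transactional, while 'seattle weather' falls back to informational, showing both the appeal and the fuzziness of the scheme.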
And I think at least the navigational category is quite easy to distinguish, and it is something we didn't have in classic retrieval. The next slide is also very nice: queries classified as informational, not informational, or
ambiguous. Here we have a manual analysis of 6,000 Yahoo queries; sorry, it's not Broder, it's Ricardo Baeza-Yates, a research lead at Yahoo, who published this,
and he says: depending on which topic area we are in, for example for news, most queries are informational, whereas for computers, most are not informational, over 50%, and a lot are ambiguous. We have seen ambiguity before:
we don't know what the user really wants. There is also very high ambiguity for 'others', where we don't even know the topic and cannot say what kind of query it is; that's very typical. But interestingly, depending on the topic, the fraction of informational queries is quite different. And how much is ambiguous overall?
About 1,000 of the 6,000 queries they labeled as ambiguous, roughly one in six. Wow, very good: no idea what the user wants. Quite a good question. The category 'various others', yeah, that doesn't make a lot of sense;
we would have a hard time distinguishing there. And what is 'home'? 'Home' is probably some shopping stuff, right? I don't know, good question; I would have to go back to the original paper, but I'm afraid it might also be quite vague. But that's not really the core thing,
just to give you an idea of the frequencies. Unfortunately, they didn't label navigational queries; they probably found it too difficult. Most people say between 10 and 30% of web queries are navigational, but in this case we have a lot of informational queries, and they didn't use the transactional category,
probably because they found it as difficult as we did. Other statistics: how many words do people type anyway? And here we see that, actually in 2000, I have to correct this figure,
there was a magic number of 1.3 as the average query length, so most people typed one word. And what does the search engine do if you type in one word? You won't have a lot of success anyway. People noticed that, so over time the queries got longer; the last number
that I saw, also a few years ago already, was 2.3, so more than one word longer. 80% of queries are without operators, probably more, probably 90%; operators meaning Boolean operators, phrase operators, or plus and minus
to increase or decrease the importance of a word. And 85% of users check only the first result page, and 78% of queries are not modified afterwards. This is in contrast to the user behavior studies that we'll see in the second part, hopefully still today,
where we said most information needs are solved in an iterative manner: I type something, I see the query wasn't good, actually I meant something different, I need to specify that; or I see a term, oh, that's a good term, I'll use that in my query, I had forgotten it's called like that,
and so you rework the query, improve it, and arrive at your optimized query. Also interesting: there is a strong ranking bias in user behavior, which means users very likely click
on the first hits. In this case, Kian did something very simple: he gave people the top 10 results from a search engine of
that time in the US, the market leader at the time, I think. And he gave another set of test users the reverse order, so they got the number 10 hit as the first one. From our theory, the ranking principle, we would say:
aha, the probability is very high that this hit is not as good a result as the real first one, and vice versa. And what happens? Here we see the normal click distribution: most people click the first link,
they trust the search engine very much, and then this decreases to 10% and a few percent for the remaining hits. For the reverse order we see that,
now also reversed: aha, maybe it was not such a good hit really, but people still clicked on it 40% of the time, less than for the real first hit, and then we have a similar decrease. Only at the end do we get a few more clicks; maybe those were really good results, but they still received only 10% of the clicks.
A few people noticed: aha, at the end there is a really nice result, and clicked on it. But most people say: well, I just click on the first result no matter what. So if people click on the first result,
the search engine cannot conclude: aha, I found a good result for them, because that's simply what people typically do; we cannot assume it's a really good result. Another interesting result is by Jim Jansen and colleagues. He said:
we see that there are a few search engines, and they return different results. Now I want to take the results from Google and present them in the MSN interface, and vice versa. He worked with four search engines,
and when we read the table, what do the results mean?
He observed: aha, if the links were presented in the interface of ARS, some very unknown search engine, people said: what, ARS, never heard of this,
probably not such a good result, and on average they downgraded it 10%, whatever that means exactly; I don't remember the details. Whereas when they were presented with whatever results, maybe the ARS or the MSN results, in the Yahoo interface, which at the time was the market leader in the US,
not Google, it was Yahoo, then they said: oh, this is a very nice search engine, it's always good, I know it, I use it all the time. Because it was the market leader, most people were familiar with it, and they judged these to be better results, in most cases.
So people tend to trust the companies that they already know, that are well known. Again a network effect: everybody uses Google in our market today, so everybody talks about it, and everybody else will use it and say, I always use it;
and if I always use it, I must be very smart, so it's probably the best search engine, I made a good decision. People justify their own behavior, and even if you present results from other search engines in the Google layout, people will say: well, that looks good. And Jansen said: what I found is basically that the rating of the search results
mirrored almost exactly the market distribution of the search engines at that time. Query log analysis: here, again, we can observe power law distributions.
For example, we once worked with a Microsoft Asia query log with 800,000 queries from 2006; a few were labeled for some classification task. We plotted the term frequencies very simply first,
and interestingly, the most frequent terms occurred up to around 40,000 times, and at the top we see a lot of stop words: 'of', 'to', 'in'. Three stop words.
And 'and'; well, 'and' can be a Boolean operator, we don't know. But why do people type 'of'?
How can it be that 'of' is such a frequent word in queries? Oh yes: 'Prime Minister of Germany', or something like that.
Yeah, or 'Duke of Wales', I don't know if that exists. Or, at that time there was a popular movie, 'Pirates of the Caribbean'. So again, the stop word appears in a named entity, right? Basically, as we've learned in theory, it doesn't make sense to query stop words,
because they're not in the index. But as search engines observed, like here, people query stop words all the time, probably in phrases, so it makes sense to index them. We talked about that a lot in the second or third class. So user behavior beats everything,
beats theory: people type them, search engines keep this stuff in the index, and people query it. Now, what were the most frequent full queries? Among the 800,000 queries, the most frequent one occurred only 25 times, not so often,
and interestingly, and probably disappointingly for the Microsoft search service, the most frequent query was 'Google', the second most frequent was 'Yahoo', and then we have 'MySpace', which is probably not among the top queries anymore, and 'Yahoo' and 'eBay'. So we see that a lot of the most frequent queries
are probably of what kind? Navigational, right? I want to go to the Yahoo, the Google, the eBay page. 'Pirates of the Caribbean' is probably informational, maybe, right?
'Jesse McCartney', who is that? Who was that? I don't remember; probably also informational. But among the top queries there are a lot of navigational queries, and that's probably still the case today. All right, we also measured the length of the queries;
we see it corresponds with the observation that 2.3 is the average: there is a peak at three, the median is three, and then some queries are very long, but those are very rare.
Again, if you put it on a log-log scale, we end up with a roughly linear shape: again probably a power law. We didn't have enough data to really test for it, but it looks like a power law. And in this case here, is it query length
or query frequency, let's see, does it say? I think this is query frequency. Again we see a power law: there are some queries that are posted all the time, that are popular at the moment, and there are a lot of queries that maybe only one person ever posts.
And the distribution again spans a tremendous power law scale, really big differences: the most frequent query here occurs a million times or even more, and then there are tens of thousands of queries that are posted less than 10 times.
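The straight line on the log-log plot can be checked with a simple least-squares fit of log frequency against log rank. A rough sketch on synthetic Zipf data: the exponent of 1 and the counts are assumptions for illustration, not values from any of the logs mentioned.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) vs log(rank); frequencies
    following a power law give a roughly straight line, i.e. a stable
    negative slope."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic Zipf-like query frequencies: the r-th most frequent query
# occurs 1,000,000 / r times (assumed exponent 1).
freqs = [1_000_000 / rank for rank in range(1, 1001)]
slope = loglog_slope(freqs)  # close to -1 for this perfect power law
```

On a real query log the fit is only approximate, but a slope around -1 on the log-log plot is exactly the "linear shape" mentioned above.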
Also interesting: clicks per query. I type a query and click; do I click once, twice, three, four, five times? How often? Again power law distributed, in this case from a Yahoo log,
and in this case with really big numbers, 50 million queries, which they didn't have to label by hand like the 6,000 queries in the other study, and we get a power law. So, what's the time now?
We cannot even start, much less finish, the next interesting topic, information behavior, so we will do that next time. We will move from these small observations,
which I call micro behavior, typing something, clicking on something, to a more general understanding: how can we describe user behavior, and what does that have to do with information? So, see you next week. Thank you.