Web crawling (29.6.2011)
Formal metadata
Part: 11
Number of parts: 13
License: CC Attribution - NonCommercial 3.0 Germany: You may use and modify the work or its content for any legal, non-commercial purpose and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you name the author/rights holder in the manner specified by them.
Identifiers: 10.5446/355 (DOI)
Production year: 2011
Production place: Braunschweig
Transcript: English (automatically generated)
00:00
Yes, so also hello from my side and as a summary for last week's lecture, we were talking about a rough structure of the web and said, well, basically the web is a set of pages that are connected by hyperlinks and if we want to get anything out of this, we have
00:28
to query the web. There has to be a possibility for accessing the information that we are interested in and accessing the information always means you have to have something like
00:41
a directory, you have to have something like an index that shows you where you have to go and this is exactly the main component of all web search engines, the indexer on which the retrieval algorithms work that can then be used to power Google and to power
01:01
Yahoo and to power all the web search engines and all the ideas. Of course, to build up an index, you need to know what is inside the web. This is the typical topic of so-called web crawlers and this is what we're going in today. So let's start with a little
01:22
look at what the web actually does, how it is built up, how does a web page look and then we will dive into the topic of web crawling and I'll show you some techniques on how to detect duplicates and how to manage the content of the web. So the World
01:42
Wide Web basically is a number of resources. These are typically web pages, for example the Technische Universität Braunschweig page or the IfIS page, and they have been created by people
02:03
that want to transport some information. So, for example, who is currently employed at the IfIS or what does the technical university offer in terms of, I don't know,
02:21
study programs. And this information, of course, is somehow linked to each other. So when the IfIS says, well, we are part of the Technische Universität Braunschweig, they will add a link to the page of the Technische Universität. And this is a way to navigate
02:45
through the pages. So it's not true that the only way of accessing pages, though we have seen a lot of navigational queries in Google's statistics recently, it's not true that the only way to go through the web is basically by asking queries, by using
03:05
Google. But one of the types of navigating the web is basically by following links and going from topic to topic and making a stroll down the information highway and just see
03:25
where it leads to. And at some point we will end up, hopefully, at some page that covers the information need that I'm expressing. This is the basic idea. Resources plus hyperlinks. Hyperlinks are exactly these navigational patterns. Of course, the resources have to be identified
03:47
somehow. They have to be unique. Otherwise, I wouldn't know where to go. If there are several pages having the same ID or the same address, that's kind of useless because where should I go? It has to be unique in some way. And this is where the idea
04:04
of uniform resource identifiers, or it used to be called uniform resource locators, URLs. Now it may not be a page, it may be something else. So it's been generalized to identifiers,
04:20
URIs, and they have to be really uniquely built to identify a certain resource. And most common is in terms of the protocol that you have to use, basically the hypertext transfer protocol, HTTP. But it could also be file transfer protocol or whatever it may be.
04:46
So you could have different protocols. Then we have the so-called authority that shows us where to go, where the actual page is located. It may have a path, so the authority
05:00
may have a file structure leading from some directories to subdirectories. This is basically where the path leads us. It may have query elements, introduced by a question mark and followed by name=value pairs or whatever, and there may be fragments shown by this little hash sign (#)
05:21
here. And that leads us to a special part of the page. So that is for navigating inside one page, okay? And if we restrict ourselves to HTTP, the hypertext transfer protocol, then we start with a host name, which is the authority, where should we go.
05:46
We have a path that points somewhere in the directory structure. So for example this is the Wikipedia page about New South Wales, part of Australia, and then there may be
06:01
a query or some fragment. So this will jump directly in the Wikipedia entry of New South Wales into the section that has been headed by history. So if I want to know about the history of New South Wales, I go to Wikipedia. This is what I indicate
06:26
here. I go to the English Wikipedia. This is the authority starting with the en. Then I sneak my way through the file structure of Wikipedia, bringing me to the correct
06:41
page, and then I may navigate inside the page to the paragraph that I'm interested in, and that will be the history of New South Wales. Okay? Well, in HTTP, of course, if you have a protocol, you have to normalize the entry somehow, because there are different
07:04
ways of entering the URI, and you don't want to rely too much on different specs telling ideas of uppercase and lowercase and stuff like that. So what you would do
07:20
is you would have special characters that could be represented in a different way. For example, in typical German fashion, there are umlauts that are not available on most keyboards elsewhere, so you need some way to describe them. For example, this here is the percent-encoded version of the tilde, and it will just be replaced in the URI, and then it
07:51
is unique. You will have case normalization, so everything will be lowercase. It doesn't matter if you type it in with case sensitive. You will also have very often the port on
08:05
which some site is addressed, and HTTP's default port is 80, so this is removed if you say colon 80 at the end. Colon 80 just says that the server should
08:23
listen on port 80. It's immediately removed because that is the standard for HTTP, and also path segments that say stay in the current directory ('.') or go one directory up ('..') are resolved and removed.
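To make these normalization steps concrete, here is a minimal sketch in Python. It only covers the rules just mentioned (case normalization, unquoting, dropping the default port, resolving dot segments) and is a simplification, not the full RFC 3986 procedure; the example URI at the end is made up.

```python
from urllib.parse import urlsplit, urlunsplit, unquote, quote

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(uri: str) -> str:
    """Rough URI normalization along the lines described above (simplified)."""
    scheme, netloc, path, query, fragment = urlsplit(uri)
    scheme = scheme.lower()                  # case normalization of the scheme
    host = netloc.lower()                    # ... and of the host name
    if ":" in host:                          # drop the default port, e.g. ":80" for HTTP
        name, _, port = host.rpartition(":")
        if port.isdigit() and int(port) == DEFAULT_PORTS.get(scheme):
            host = name
    # unquote percent-encodings and re-encode only what really needs quoting
    path = quote(unquote(path), safe="/%")
    segments = []                            # resolve "." and ".." path segments
    for seg in path.split("/"):
        if seg == ".":
            continue
        if seg == "..":
            if segments:
                segments.pop()
        else:
            segments.append(seg)
    path = "/".join(segments) or "/"
    return urlunsplit((scheme, host, path, query, fragment))

print(normalize("HTTP://www.Example.com:80/a/./b/../c/%7Ejim"))
# -> http://www.example.com/a/c/~jim
```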
08:41
So all these addresses are basically the same. They're all normalized into the one down here. Lowercase, unquoting of special characters, query where there is no query,
09:00
so empty parts of the address, the port that has been replaced by the standard port, it's all taken down to that one. Now how does it actually work? Of course you can't address the pages by themselves because you have a logical name. You have the name
09:24
of the authority, you have the name of the page structure, but you need to know where it actually is, and that means you have to ask a client, so you're filling out a request by your client and sending the request to a server, and the server responds
09:43
somehow. And the protocol that is used here is a TCP IP, so the servers are always uniquely identified by IP addresses. This is typical IP address, as you may know it, if
10:03
you have configured your laptop or something and have chosen a special IP address or part of a network, and if you type in ipconfig in your execute environment, you will get to know what your IP address currently is. There are some dynamic approaches of handing
10:24
out IP addresses for networks, or you can have a fixed IP address. Basically, the registered IP address is how your device as a server is approachable. The host names
10:44
like www.google.com or whatever it may be, Wikipedia, have to be mapped somehow into IP addresses so the TCP IP protocol can properly work. And for this you have so-called
11:01
domain name system or the DNS, and you have a DNS server where you basically put the query, okay, what is the correct IP address for Google, and the server will respond and say, well, it's basically some of these addresses. Of course, different host names
11:21
can cover different IP addresses. I mean, the Google server would be immediately out of work if it was just a single IP address, so it has to be different IP addresses. And basically for the service that you're taking, it doesn't matter which of them you approach.
11:43
Just important that either one of them is available to answer your request. And also an IP address may have many host names, so there may be a lot of content available to send on a single server, and the server may serve different sites. Well, since we
12:10
need the IP address from the name, we first have to perform a lookup in the DNS server.
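Just to make this lookup step concrete, here is a tiny sketch using Python's standard resolver. The host name is only an example; as described above, a popular host will typically resolve to several IP addresses, and any of them can be used.

```python
import socket

# Resolve a host name into its IP addresses via DNS (the host name is just an example).
name, aliases, addresses = socket.gethostbyname_ex("www.google.com")
print(name, addresses)   # one host name may map to several IP addresses; pick any of them
```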
12:23
Then we get the correct IP address, and then we send the HTTP request to this server IP address, usually over the standard port. And then we get, basically get the website
12:40
back, so the web server will tell us something about the content that is available or not available on this server. This is the idea behind it. The typical HTTP request looks like this, so if we have, for example, a search query on Google, so we basically use the
13:04
authority google.com, we use the search interface, and we pose the query for the IfIS institute, then the request type is GET. I want to know something from a server, so I post a GET command. I hand over the query and the path that I want to know.
13:27
I hand over the host name, so that everybody knows where this is to be directed to, and then there are some connection details that I need, so for example, what character set
13:45
I will accept, or what possible encodings of the pages are acceptable, what languages do I need, and stuff like that. So this is kind of like a little bit of compatibility
14:01
information that shows me what a sensible response to my request would be. So if the server just speaks Chinese, you know, and can only give me a Chinese character set, and I ask for something in English, it just doesn't help me, you know, so I can put it into the request. After sending the request, there's a response of the server, which basically
14:33
tells me what the exact version of the protocol is, and this is a very important code over
14:40
here. This is the status code, so for example, if it's 200, it means everything is okay, the resource has been located, and is still intact, then there are some caching information that shouldn't concern us too much here. There's the content type telling
15:00
us what it actually is, so this is a normal HTML page that then can be interpreted by the browser, so sometimes I need different programs to interpret some content, but this should be doable with every browser. We have the character encoding over here, so
15:22
also the browser can see what character encoding is needed. And then, after the header has told us all this information about how the information actually looks and what is used to transport it, the actual body of the site is just the contents with some markup information. As I said, this is obviously HTML.
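Here is a small sketch of this request/response cycle using Python's standard HTTP client. The host, path and header values are only examples of the kind of compatibility information just described, not the exact request and response from the slide.

```python
import http.client

# Send a GET request with some compatibility headers (host and path are just examples).
conn = http.client.HTTPConnection("www.example.com", 80)
conn.request("GET", "/index.html", headers={
    "Host": "www.example.com",
    "Accept": "text/html",
    "Accept-Charset": "utf-8",
    "Accept-Language": "en",
})
resp = conn.getresponse()
print(resp.status, resp.reason)          # e.g. 200 OK if everything went fine
print(resp.getheader("Content-Type"))    # e.g. text/html; charset=UTF-8
body = resp.read()                       # the actual page body with its markup
conn.close()
```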
15:50
Okay, good. There are basically two types of requests of HTTP. One is the GET request, the other one is the POST request. With the GET request, I ask for a response. I
16:03
want to have some resource. With the POST request, I want to do something like this. To transport information to the server, so for example, if I fill in HTML form, I can post the information to the server. The server can evaluate the information, for example,
16:22
a database query, and then generate the answer page according to what I was actually looking at. There's a smaller version of the GET, which is called HEAD, which basically does the same as GET, but just asks for the head of the message, not for the content
16:42
of the page. So I get all the information about the encoding. When was it last updated? Is the resource still available, or are there some error codes or whatever? And I don't transport. It's basically just for brevity. I don't transport the full body of the page
17:00
that may be lengthy, that may consist of several packets, but just get the header. Of course, this HEAD message is very important for crawlers, because they want to know, has something changed? Do I have to really crawl this page again, or is everything still like it used to be? Or is it a dangling link? So I will follow the link with a
17:24
request, and then just say, okay, the resource is there, or it's a 404 error code, and the resource is not there. I can't crawl it. I can't index it. Important status codes are basically what we all like, 200, which means everything's
17:42
okay. The resource is there and is available, and I can just download it. 301 is basically a forwarding address, like you do with the mail service when you move houses. You just say, okay, this has been moved. Update the link, whatever it may be.
18:08
302 means, yes, the site is there, but it has been moved temporarily for, I don't know, server maintenance or whatever it may be. Ask for it at a different address, but
18:21
next time you come around, try again under the same address. 304 means it has not been modified since the last request. Very important for crawlers. Very bad 404, not found, which just means anything could happen. The website could be
18:41
no longer in existence. The server is down. There's too much traffic, whatever. Something happened that doesn't allow me to access the website. Then there's the 410, which basically means the website used to be there, but it's not there
19:03
anymore. It has been permanently removed, and it will not be available again, so this is not a server fault. This is not maintenance work or something. The resource is just dead. Good.
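As a sketch of how a crawler might combine HEAD requests with these status codes, here is a simplified re-crawl check in Python. The host, path and stored date are placeholders, and a real crawler would also follow the Location header of 301/302 redirects and handle many more cases.

```python
import http.client

def needs_recrawl(host: str, path: str, last_crawl: str) -> bool:
    """Decide via a HEAD request whether a page has to be fetched again (simplified)."""
    conn = http.client.HTTPConnection(host, 80)
    # Only ask for the header; If-Modified-Since lets the server answer with 304.
    conn.request("HEAD", path, headers={"If-Modified-Since": last_crawl})
    status = conn.getresponse().status
    conn.close()
    if status == 200:           # resource is there and has changed: fetch it again
        return True
    if status in (301, 302):    # moved: re-fetch under the address in the Location header
        return True
    if status == 304:           # not modified since the last crawl: nothing to do
        return False
    return False                # 404, 410, ...: drop the page from queue and index

# Example call (host, path and date are made up):
# needs_recrawl("www.example.com", "/index.html", "Wed, 29 Jun 2011 10:00:00 GMT")
```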
19:23
What we see is that we have URIs to identify web resources, and that we use protocols to retrieve the web resources. That means, basically, GET and POST commands.
19:44
In HTTP, which is the most renowned protocol of that time. Of course, a resource does not only contain some content, but it also has some layouting information. This is a different idea, so there is a structural difference between
20:06
the layout of information and the information itself, the text or whatever is said on a web page. This is the same for web resources.
20:21
What we do have is very often HTML resources, the Hypertext Markup Language. The Hypertext Markup Language is one of the early inventions of the web and was basically done by Tim Berners-Lee in 1991, and he developed this protocol just for, well, in the beginning,
20:45
just for looking up phone numbers, and then it started to snowball, and he was adding more and more, and was building, well, basically, it's always get and post. You transport information to the server. You get information from the server, but you
21:03
also have to transport some more. It's not only the information, but it's also how the information is to be displayed, and this is the idea of the browser, that you have some piece of software, some client software that allows you to view the information
21:21
as the author of the information intends you to view it, so the author of the information cannot just give you the basic facts, but can also give you some style information about how the facts should be shown, how the facts should be displayed, and this is what is
21:40
called a Markup Language. Markup basically means that you describe the structure, the layouting of text-based information in a document, and there are some typical things that you can do in writing documents. You can have headings in which
22:02
you will basically put the part that is the heading into these brackets here with H1, H2, or if you have a subheading, H2 or H3, whatever it is. You can have paragraphs, so if you want different paragraphs, you just put it into
22:25
the brackets of this P here for paragraph, and you always have an opening bracket and an ending bracket, so you would put the heading between the two brackets, and this
22:40
slash here shows you that this is the end tag, the other one is the opening tag. You could, of course, have lists starting in this environment over here, and for every list item, you will have the list item bracket, and then you will get a list of first
23:02
item, second item, and so on, and there's one very important thing, that is the link. You might have a hyperlink to a different resource. Again, you start with a certain bracket, this is the A tag, but you add something to the text that is within the
23:28
bracket, and that is where the link should point to, shown here. Basically, this is the string of the URI that the link should point to, and if you click on the link,
23:46
if you follow the link, then the HTTP protocol has to do exactly the same thing. It has to copy the string, it has to send the string to a DNS server, get the correct IP address,
24:04
and then get the resource under the IP address. Okay? Good. Let's look at the formatting a little bit. This is a typical HTML document. It just says
24:22
here doc type is HTML, and it also tells us what kind of HTML it actually is, so there are different versions of HTML, and it tells us what document type it uses. Then it's
24:44
the HTML body, and then we get some information like here, the main heading in the H1 brackets, or a paragraph marked up. It might have a link inside the paragraph for the word link.
25:05
See that here? Most browsers display it that way, that they just say this is a different color, and it's underlined, and it shows you that it's clickable, that you can click
25:20
it. Again, an H structure where you can have a different heading, and the number in the H tag, H1, H2, H3, shows you what is the respective size of the heading. H1 is the main heading, and H2 is the subheading. You may have list items and so on. This is basically what an HTML page looks like. It does not only
25:47
contain the information here, for example, and here, but also how to display the information here, for example, and here. What you can do with the information here, for example,
26:02
is linkage structure. That's basically what you can write in HTML. Good. As I said, HTML comes in different versions, and it started off with HTML 1.0 that was designed directly by Tim Berners-Lee in the beginning, and as the web took off,
26:24
there were always more ways of dealing with the structure. The structure of web pages became more complex, and so you had to invent new HTML commands. You had to invent new structures on HTML, and that went on through the 90s. Basically, in 95, the web was available for
26:49
a large number of people, so that was kind of the actual birth of the web, where it left the confines of the academic institutions. In 2000, we had what is often referred to
27:04
as ISO HTML, so HTML was standardized by the International Standards Organization. After 2000, XML became very popular, and work started on different standards,
27:23
like XHTML, that were based on XML. Currently, we are in the working draft of HTML5, which is about to come. That starts me off on a detour to see a little bit about the history of the web and how it actually developed.
27:44
Well, we've already seen that Tim Berners-Lee has invented the web. What most people do not know is that Tim Berners-Lee did not invent hyperlinks. In fact, before the magic year of 1989, there have been two research communities, one working on hypertext
28:03
and one working on the internet. The internet usually means, therefore, there's a network of data connections, as we know it, between different universities and research institutions at this time, and they basically shared files. There already has been email. They sent emails, and then they could use
28:23
protocols like FTP to share files, for example, their research data. Someone could upload it to its own FTP server, and then someone could download it in the U.S., for example. This has been the internet community. It was about exchanging data in a very technical and not very intuitive way.
28:41
On the other hand, there has been the hypertext community. Basically, in the 70s and 80s, they tried to build some kind of intuitive information systems that are easy to use for users, where information can easily be accessed and structured in a nice way. It is something like, maybe, these online
29:07
help systems that some applications offered in the 90s, where you press the F1 button and then some kind of web-like interface started, where you could click links, of course, and came to different pages. This is what hypertext is about, basically,
29:22
text pages with links connecting these pages. However, the central idea of hypertext was that there is a central instance managing all documents inside the hypertext system and managing all links. The idea is that you could easily move a page and all links are changed accordingly, so you don't miss any information, and
29:45
everything is well-connected and high quality of the whole thing. Usually, these have been some kind of specialized software systems and didn't have any connection with the internet. What Tim Berners-Lee did, he was at the
30:00
time working in the physics department at the European Organization for Nuclear Research in Geneva, abbreviated as CERN, and this is a large nuclear research project funded by its European member states. Because it's so expensive, all European
30:20
countries wanted to join forces and create a central agency for nuclear research, and of course, because many European countries are involved, they want to share their results, they want to share their data. So there is an urgent need to distribute the data and the experimental results gathered in Geneva to distribute them across Europe
30:44
to other researchers for analysis. Of course, this could have been done by FTP downloads, but it would have been much better if there had been methods for true collaboration so that researchers really can share their information and data and results and experimental designs in an intuitive way.
31:04
And he recognized this problem that there was no way to share data in a nice way, and no common presentation software that people could use, so it was always about exchanging files which isn't too nice usually. And then he had the idea to write a proposal,
31:21
yeah, a large hypertext database with typed links, and his idea was basically to define something like web equals hypertext plus internet. And his application case was, for starters, the phone book at CERN.
31:43
So CERN is a pretty large agency with thousands of researchers, and it would have been great to have a kind of online phone book where people could look up where they can reach their friends and researchers they are going to work with in an easy way.
32:01
So he had this dream of having such a collaboration platform in some way and just started implementing it on his NeXT workstation, some kind of, yeah, state-of-the-art server at the time, so not very impressive hardware, but this basically was the first web server. He started building his system, and who knows what's written on this sticker here?
32:29
Yeah, don't switch it off. This is the server. Because there are always stupid guys at CERN thinking someone left his workstation on over the night and switched it off. So it was basically what he was doing,
32:42
implementing the web or the roots of the web on his small, old workstation and leaving it on for days and nights and hoping that other people are interested in the information he had to offer. So of course, as it always is with good ideas, his proposal that he initially wrote has been declined, so nobody recognized his brilliant
33:08
idea. Some of you know that. It always happens if you think you have a big idea, but no other people are able to recognize this. Yeah, well, that's life.
33:22
But Tim Berners-Lee had found some colleagues who wanted to support him. So Robert Cailliau, a computer scientist at CERN, decided to join forces with Tim Berners-Lee and they started a new proposal and presented their idea at some European conference on hypertext
33:40
technology where the hypertext vendors at the time came together every year and discussed their new software systems. But yeah, nobody was interested in their wonderful idea. So it would have been a big chance for vendors of these systems to say, well, we will support you, and we will sell your idea,
34:02
and then they would have made a lot of money, but nobody decided to do so. And I think mainly because of this, the Web now is as free as it is because there is no big company behind it steering the Web standardization process and how things are working. So because nobody wanted the Web or at least
34:22
no commercial vendors wanted his technology, Tim Berners-Lee and his colleagues just started to, or rather just continued to, implement their idea into a bigger and bigger framework that can be distributed across many servers all over the world. So by Christmas 1990, they had gathered and created all the tools they need for a working
34:46
Web, and most of these tools are still in action today. So HTML, we have HTTP for transferring the files. We've seen that. Then a Web server software. Of course, you need some kind of server that is responding to your requests.
35:03
Hardware the server could run on. So the first Web host was info.cern.ch. And of course, you need a Web browser, and this was called WorldWideWeb and ran only on his NeXT workstation at the time. So not very common hardware at the time, but at CERN, they had some of these machines,
35:23
and he has been able to distribute this software to his colleagues. And what's also important, from the beginning, Tim Berners-Lee thought of the Web as some kind of interactive medium where people could easily edit the pages. So in HTTP, there are some requests that are intended to modify pages directly.
35:46
And so his Web browser was a browser slash editor. So this feature was dropped after some time because people didn't want to edit pages by themselves at the time. But yeah, you see projects like Wikipedia or Wiki software in general. Today, people are going to interact much
36:04
more and want to share the information, edit pages, and it's interesting to see that Tim Berners-Lee already anticipated this need that would come up, yeah, ten years later. So this is how the first Web browser looked like. Yeah, it's a Unix-type interface, not very comfortable.
36:25
But it looked like the hypertext systems at the time looked, but links have been designed in a way that you could refer to other servers all over the world. And this was his main contribution that he built a distributed hypertext system, which was
36:44
very novel at the time. So then things started to get going. The Web became more and more popular, and yeah, some people created simple text browsers for the first home computer that usually had text-based interface, so some kind of
37:02
Unix or DOS or whatever. And then he extended his idea and made his large telephone directory at CERN public, which previously was located on a large mainframe
37:20
computer where you have to log in on some text-based interface. And with his phone directory that you can easily search via Web interface, now people recognize that there was something like the Web, and they also saw, well, if you could build a telephone book with Web technology, then you could also present your own research,
37:41
you could collaborate on research projects and share information in an intuitive way. And this is maybe the point where the Web really started off, and finally they made an announcement in some hypertext news group, some kind of Internet discussion forum where they simply told, yeah, we have this Web project, and we are looking for people
38:04
who are interested in this. Please join us. Please try it out. We are very happy to find people who are going to work with us. And some people did it, and the Web, as you know, spread around the whole world, and new Web browsers have been created, so the graphical Web browser Mosaic, for example, one of the first mainstream
38:26
browsers that have been able to run on most Unix computers. And in fact, this browser has been programmed by the team led by the founder of Netscape, Marc Andreessen, one of the great entrepreneurs at the time. I have no idea what he's currently doing, maybe
38:44
spending his money in the Bahamas, but this was one of the big names at the beginning of the Web. So 1994, the company Netscape was founded, and Mosaic became the Netscape navigator. Some of you might remember one of the most famous browsers before Microsoft
39:06
launched the Internet Explorer. And at the same time, also Tim Berners-Lee founded the World Wide Web Consortium at MIT in the U.S., and the goal of the W3C is to
39:25
standardize Web technologies. So they develop all these HTML standards, they say how HTTP should look like, they develop technologies like XML, and are currently developing HTML5, the next big thing in hypertext technology. And World Wide Web Consortium is designed
39:48
as a kind of meeting platform for different companies and different people all over the world, where they really discuss how standards should look like. So like any other standardization
40:01
organization in industry, this is now the World Wide Web Consortium for Web things. So you know how the story continued, Web is really famous, W3C is inventing new things, some more popular, some less popular, but yeah, that's the way the Web has been invented.
40:23
And now we're going to talk about web crawling. Well, as I already said, web crawling is the first step in building an index, in knowing what is actually out there, what is part of the Web. And this is actually a very interesting problem, because if you think naively, what do you mean by naively?
40:45
What do you do? Well, basically, a crawler or some say a robot or spider would just queue all the URIs there are, retrieve the Web sources, process whatever is given
41:07
or returned by the request, so basically all the HTTP data, then you have to use a page parser that extracts links from the retrieved resources, add them to your queue,
41:27
add all the information on the page to your indexer, and then work on with your queue. And basically starting with any couple of seed pages where you say, okay, this is important
41:45
and just let's go from there. We know this bow tie structure of the Web. If we pick some seed pages from here, they will lead us into the core at some point. And from there, we start kind of exploring all the different pages in the core, and of course also exploring
42:05
the out links in this other part of the Web, and that would give us a very good impression of how big the Web is and what has to be indexed and how we can reach it. So from the seed pages, we would just work our queue, retrieve and parse the pages, extract
42:27
all the URIs, add new URIs to the queues, take the next URI, and so on. So this would basically be the naive approach of Web search. This is what a basic crawler does.
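To make this naive procedure concrete, here is a very small sketch of such a queue-based crawler in Python. The seed URI is a placeholder, the indexer is only a stub, and there is no politeness, no robots.txt handling and no duplicate-content detection yet; it just illustrates the loop described above: take a URI from the queue, fetch the page, hand it to the indexer, extract the links and add the new URIs to the queue. The seen set is what keeps it from fetching the same URI over and over again.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def index_page(uri, html):
    # Stub for the indexer: a real crawler would extract terms and feed them to the index.
    print("indexing", uri, "-", len(html), "characters")

def crawl(seed_uris, max_pages=100):
    queue = deque(seed_uris)              # URIs still to be fetched
    seen = set(seed_uris)                 # URIs already queued, so we never queue them twice
    fetched = 0
    while queue and fetched < max_pages:
        uri = queue.popleft()
        try:
            html = urlopen(uri).read().decode("utf-8", errors="replace")
        except Exception:
            continue                      # dangling link, server error, ...: just skip it
        fetched += 1
        index_page(uri, html)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            target = urljoin(uri, link)   # resolve relative links against the page's URI
            if target.startswith("http") and target not in seen:
                seen.add(target)
                queue.append(target)

# crawl(["http://www.example.com/"])     # the seed page here is just a placeholder
```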
42:41
But if we look into it a little bit deeper and just take a very conservative estimation of how big the Web is, 60 billion pages. This is one of the really conservative things. And let's assume we want to crawl each page once a year. We had some statistics recently
43:05
last week on how often Web pages change and stuff like that. So let's assume for a moment that most of the Web pages are kind of static and the information does not change too much. And let's leave out dynamically generated content and stuff. Just focus on
43:27
static, but we want to revisit them once a year to get some updates on our index. Then how many pages would we have to crawl per second? Let's break it down. That would
43:40
be 60 billion per year, divided by 12 makes 5 billion per month. Divided by 30 days, because we will also take Christmas and Easter and everything into account, so we're working all year round, makes it a whopping 166 million that we have to process per day, which means
44:04
about 7 million per hour, 100,000 per minute divided by 60 gives us about 2,000 pages a second that we have to retrieve, that we have to parse, that we have to extract
44:24
the relevant terms for our indexer plus extract the links, add them to our queue. 2,000 per second. Quite a bit. It's not something you do with your smartphone, is it? So crawling
44:45
really is about scalability. It's really a problem of how do you get to the pages, what is the important part, how can I quickly get the information out of a page, how often do I have to revisit pages, and of course it comes with further complications because
45:08
it's not only the scalability, it's also what do you want to have in your index. Do you really want to index everything or do you want to check if it's sensible information?
45:22
Or do you want to break down the information somehow? Do you want to avoid duplicate content? Could it even be that your crawler is trapped somehow by pages? So consider pages pointing to each other and you always add the next URI to your queue, go there, add the backlink
45:48
to your queue, go there, add the URI to your queue again, you know, like you will run in circles again and again. This is obviously not what you want to have. I mean the web is obviously not a tree shape. So it's not depth first search or breadth first search
46:07
or something like that that you can do, but it is a cyclic graph. How do you avoid cycles? Two thousand per second. That doesn't sound like one machine doing it. That sounds
46:23
like many machines doing it. If many machines are doing it, how do you synchronize between the different servers? How do you synchronize between the parts of the web that are crawled? Because the web doesn't break down. It's not one branch here and one branch there.
46:41
It's this all-connected graph and as soon as two machines crawl the same locations, you're doing double work. It doesn't make sense. What about latency and bandwidth? Typical networking problems, connection problems. I mean, if you crawl the pages all the time,
47:05
then this consumes a lot of bandwidth. If I have a lot of web search engines that crawl, my website will not be open to customers anymore because all the connections, all the bandwidth that I have available is given to the crawlers. That's not what I would
47:24
want. Also with the sites, sites are becoming bigger and more complicated by the minute. You have a lot of dynamically generated content. It's not just your normal file structure anymore like it used to be in the 90s where you had, okay, you have the home page in
47:44
the top directory and then you have three or four subdirectories, one for the photos and one for the blah, you name it. But you have a lot of dynamically generated addresses that are filled in time. So how deep should you crawl a site? Is it sufficient
48:03
to know what is actually on the top level? Well, this is a site about sofas that you could buy, furniture or something like that. Is that sufficient, leading you to the top page of a portal? Or do you want to know about that special table that is
48:22
in the wonderful seminar room of the Technische Universität Braunschweig, the table that you want for home because you miss it so? Do you want to index that? Well, some people might want to know. And sometimes it's also a little bit complicated what the owner of a website
48:48
wants. Does the owner of the website want the website to be publicly announced? Is it more or less private information that should be accessible to some people but not
49:01
to the world at large? Or are there certain parts of the information that are kind of maybe not confidential because then I wouldn't put it on the web. But maybe I would be more comfortable if not everybody looks at it or if Google does not display it as a first
49:20
result set when querying for some information. So how do we do that? There are a lot of questions that have to be discussed. And one of the must-have features is definitely robustness because the web is very diverse. It's very heterogeneous. In the beginning
49:42
there were a couple of universities that were having web servers and basically what you experienced in terms of content was rather homogeneous because they did it the same way or copied sites from other locations. But nowadays, for all the problems that you
50:01
might have, cyclic links or dangling links or whatever your crawler might experience, there will be web pages experiencing these problems. I mean it's gotten so big and so diverse that you will definitely find all kinds of problems that you may run into. The web pages and all the information that is transported by the protocol can be assumed
50:30
to be correct, since it is a protocol, but in practice it often is not. Of course, all the browsers allow for some robustness in terms of displaying web pages that are not well formed. For example, what happens
50:46
very often is if you have some kind of paragraph tag, you know, like the end paragraph tag. Very often this is just missing. And somehow, once a browser sees another opening tag of this type,
51:06
it will just add the end paragraph to the page and just display it in the right form. And this is happening very often actually. So it's very often malformed and if you read it your crawler may crash and may just have some loop or whatever.
51:26
So robustness is always a very good idea, but it should also cover not only the mistakes that are inadvertently made, but also the mistakes where some websites want to trap
51:45
you. Somebody knows how a crawler works and exploits that for whatever reason, be it just pure maliciousness, or because he wants to kind of tease people, or because there is
52:04
a sense in keeping crawlers busy or creating traffic for some large amount of time. This is what's often called spider traps, where the crawler thinks it's crawling new sites, but basically is running around in circles. We will discuss some of these
52:26
techniques when we go into spam detection and similar techniques and discuss them a little bit in more detail. For the must-have features very often politeness
52:44
is argued about because, I mean, Google can do whatever Google wants to do and Google's slogan always was, don't be evil. And if you believe that, well, that's okay and
53:03
that's polite, but is it the reality? So what is more important: to have a truly 100% complete index, or to accept the wishes of the owner that this page should not be crawled, or that
53:23
you shouldn't send too many requests to this page, to create a little bit less traffic? Or: yes, you may crawl it, but just at a slow pace so that it's always
53:42
open for or available for customer traffic. So this is kind of what is perceived under the heading of politeness because the website owner usually has to pay for the website traffic. If I just hammer it with requests from my crawler, that doesn't create anything
54:06
for the website, so it doesn't transport information, it does not do anything good for the website owner, creates a lot of traffic, the owner pays for that traffic. Bad idea. Also if I hammer it, nobody else may access it and it has amazingly long loading times.
54:27
My customers will be miffed, you know. Usually the policy for crawling some websites
54:40
covers which areas you should avoid when crawling, how often you may crawl, and whether some engines may crawl it at all. This is given by the site owner in a file called robots.txt
55:02
and we will discuss that in a little bit. Like now; there's no time like now, I see. So a little bit is now. Yes, the Robots Exclusion Standard is some way to tell crawlers what
55:21
they should do with your website. So this is not only about politeness; if you are some kind of web search engine and you have your own crawler, then you could simply ignore what the people are telling you, but if you hurt them too much, then they will simply lock you out of their pages. So they will simply use an IP filter
55:45
or something like this and so you get no content at all. So usually it is a good idea to adhere to this standard and to listen to what people really want crawlers to do and not do. So the basic idea is that if you are having a website, you put a file named robots.txt
56:03
into your root directory of your domain. So for Wikipedia, for example, it's this URL and in this file you can specify what resources crawlers are allowed to access, how often they should access them and what they should not access. Well, and a crawler then,
56:24
the first URL they are retrieving from the site is this robots.txt. They parse it, read it and then load all the other pages, but obeying all the things that have been written in the robots.txt. So a bit of caution: this is not a standard in the W3C sense. So there never has been some kind of standardization process for robots.txt. It's just, yeah, it's
56:45
just there and if some web crawler is visiting your website every day and downloading gigabytes of content, then yeah, you could use robots.txt, but you won't be able to sue this crawler
57:03
company. Yeah, because, yeah, there is no standard at all, so be aware of that. Okay, some small examples. For example, this is a very easy robots.txt that simply allows robots to view all files. So the idea is you can distinguish between different
57:21
crawlers and here you say the following rule applies to all crawlers and the disallowed pages are none. So all crawlers can access everything. Some more examples: to keep all robots out, for all user agents the root directory and everything
57:43
below it is disallowed. So when you just say slash, it also applies to all pages beginning with slash, so it's your whole domain. You can also exclude certain resources, for example some dynamic content, or your secret private directory nobody should know about; if someone links to it by accident, search engines shouldn't index it.
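As a small illustrative sketch of such rules (the file content and the crawler name below are made up), this is roughly what they look like and how a polite crawler could check them with Python's standard urllib.robotparser module:

    import urllib.robotparser

    # A made-up robots.txt: everything is allowed for every crawler,
    # except one dynamic area and one private directory.
    rules = """
    User-agent: *
    Disallow: /dynamic/
    Disallow: /private/
    """

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules.splitlines())

    # A polite crawler asks before every fetch; True means fetching is allowed.
    print(parser.can_fetch("SomeCrawler", "http://www.example.org/index.html"))      # True
    print(parser.can_fetch("SomeCrawler", "http://www.example.org/private/a.html"))  # False

A real crawler would of course first download the robots.txt from the root of the domain, as described above, instead of using a hard-coded string.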
58:06
You could exclude a specific bot, so usually crawlers submit their own name in the HTTP request and then you could tell him, yeah, dear bad bot, do not read my private directory.
58:21
You could also say that only one request every five seconds should be made or every ten seconds or whatever you like. You could also say that crawlers should only come in a certain time interval, for example, when you think that there will be no traffic on your site and so crawlers won't hurt your performance. And there are a lot of
58:43
things you can adjust here. So a very nice example is the Wikipedia robots.txt. This is pretty long because they encountered a lot of problems in the past years and they have nicely commented all the things they have been doing in the robots.txt.
59:02
So for example, they disallow every bot from Google that is solely focused on AdWords or something like this, or some advertising-centered crawling type. So they allow Wikipedia's own bot explicitly and some other bots, and essentially they
59:25
created a large list of bots that behave badly and disallowed them. Yeah. This goes on and on. This is also quite nice: wget is a Unix tool that you can use
59:42
for downloading web pages, and Wikipedia has seen that some people use it in a bad way. Some people just try to use this tool to download the whole Wikipedia and don't leave any pauses between the loading of different pages, and so Wikipedia gets instantly hammered,
01:00:00
especially if people do this from some kind of university connection, which is pretty fast and large, so there's a lot of traffic, and because of that they decided to lock out this tool. So if you now try to download Wikipedia using this tool you will get the message that the site doesn't allow it. So some more clients that have behaved very poorly
01:00:22
according to Wikipedia, so blah blah, some more, and again you can see in the comments what has happened here and why this user agent has been excluded. So if you have your own web page and you want to start a new robots.txt, then maybe the Wikipedia page could be a good starting point
01:00:41
so that you can immediately lock out all bots you don't want to have there, because Wikipedia has probably encountered all bots that are on the internet and documented how they dealt with them here.
01:01:02
Okay, then some special pages are excluded that shouldn't be indexed, so these are some internal pages. So here's a page called trap where they are simply enforcing web crawlers to obey the robots.txt,
01:01:21
so I'm guessing trap is just a page linking to itself or linking to randomly generated URLs, and so they can simply check which crawlers actually obey the robots.txt and which do not. So if they identify a crawler which is permanently loading the pages linked by this trap page, then they can identify it and lock it out using
01:01:43
IP filters or something like that. Alright, this is robots.txt, and I think we make a five-minute break now. Okay, apart from the must-have features that the spider
01:02:00
or the web crawler should feature there are also some features that are nice to have that are not strictly necessary but that will speed up the process that will keep the index fresh and stuff. So one of the points is a distributed mode of operation where the crawler resides on multiple machines
01:02:24
and makes you robust against hardware failure obviously and also distributes the work so you can do it more efficiently. It should be scalable because I mean a stale index is annoying your customers as a web search engine.
01:02:44
All the customers get as the top 10 results stale pages that either no longer contain the information that you have been looking for, or you get 404 errors because the page has been moved; you know, it doesn't help you either.
01:03:01
So to satisfy your customers you should have fresh index and that means scalability given the size of the web. Performance efficiency is always a good idea because you can do clever crawling and you can do stupid crawling and depending on what you want in your index
01:03:22
and how fresh you want it to be, how many resources you want to spend in building your crawler you should consider efficient techniques on the top of your agenda and of course you should consider the quality of the index.
01:03:41
So if you index all the spam pages and for every query return some spam page among the first few results then your customers will definitely be dissatisfied. So you have to find the useful pages, you have to find the non-spam pages,
01:04:01
the pages delivering better information than others or more complete information than others or the information in a nice mode of displaying it like you can also, well it is often claimed that information and display should better not be mixed
01:04:23
but it's actually the way we humans think and how we humans perceive things: a good display of information helps you to understand the information. So also those pages having a good display or having a clever display should be preferred.
01:04:41
Freshness is kind of one of the basic ideas of getting good information to your customers so your crawler has to operate in a continuous mode and should crawl the pages once in a while
01:05:03
that reflects the frequency of changes on this page. So there are some pages where you might afford to well visit it once a year and see if it's still there and it's still kind of having the same information but a lot of pages are changing continuously
01:05:20
or changing very, very quickly. So for example if you think about a newspaper portal having some page giving the newest information you can crawl that once a day because there will be new messages every day. It doesn't help you to have stale messages from a newspaper page. They will be offline and same goes for pages
01:05:46
like I don't know like governmental information census information or something that will change once a year. There's no problem if that is a little stale
01:06:00
since the last crawl. So this is kind of the idea: you should consider the rate of change of a page within your crawler. Also, if you see that for some page a large demand is currently developing, then crawling this page more often
01:06:22
to get fresher information would be a good idea. So one of the typical examples is always the World Cup. So as soon as the World Cup comes, as we saw with the Google site last week, there's a spike in queries, and of course the people are customers
01:06:43
that are kind of supporting your advertising model your business model. So which engine, which search engine gets the most attention by the people interested in the World Cup is definitely a business issue.
01:07:01
It's a financial issue. So better provide fresh information, better provide good information that people are immediately finding what they want because if they have to click through 10 pages that are spam pages or that are last year's World Cup or four years ago or something like that
01:07:21
that won't help them and they will change the search engine. So this is a good idea. Another feature that is very helpful is extensibility. So you have new data formats, new protocols developing; as we said, the World Wide Web Consortium is kind of currently in the process
01:07:42
of refining some of the web languages be it HTML5 but also other kinds of languages, XML, some of the semantic web languages, OWL and RDF and you name it. So they have working groups for a lot of areas
01:08:02
and once they develop something new and once they standardize something new it will be out there. There will be some people using it and relying on these issues. So your crawler should reflect that and basically what you need is a modular architecture where you can just plug in modules that are interesting for crawling new types of content.
01:08:25
If you look at it from a large-scale view then the basic part of your crawler is doing that naive process that we were describing. So you have a queue of URIs,
01:08:42
you fetch the resources, first in the list, second in the list and so on and then you parse the resource, extract the new URIs and then you put it into your queue and this is basically the mode of operation. But with all the requirements that we have
01:09:03
we have to do some more. The resource fetcher has to handle the DNS somehow. So it can't hammer the DNS server for all the different resources but should be a little bit more clever. The resource parser should first
01:09:20
kind of try to find out whether this is good content, whether this content has already been indexed, whether it's kind of spam or whatever. So duplicate checking, text indexing, other analysis, spam detection and stuff like that is very important here.
01:09:41
For all the links, not only for the content but also for the links, the same has to apply. I can't crawl the same links time and again, but I should also arrive at some point at new links. So whenever I have already crawled some URI in the current batch, so not a year ago or something
01:10:04
but in the current crawling run that I'm doing, I have to check whether I already did that URI. I might also have to check what's said in the policies, so in the robots.txt: what can I do, what may I do?
01:10:21
And I have to have some kind of distribution component that manages different parts of the crawler that are working more or less independently that also manages how often a page is retrieved
01:10:40
so also this, to avoid this hammering. And if I'm doing that and if I have all these components working nicely together, then at the end I will have a sensible index, an index that contains useful information, that contains fresh information and that doesn't annoy the owners of the websites. That's
01:11:04
the basic idea of how it works. So for the DNS handler: the fetching of DNS records is usually quite slow due to the network latency, because for every request everybody has to go to some DNS server and look the name up.
01:11:21
So your crawler is just one among the crowd and if you start, what did we say, 2,000 pages a second, you would hammer the DNS server so there has to be a lot of DNS servers that are queried in parallel. And the handler is actually a local DNS component
01:11:45
that prefetches some information that will be needed in the near future. So if I already have the URIs in the queue, depending on how fast my parsing is
01:12:01
and how fast my processing duplicate checking and stuff like that is, I can decide to already query the DNS servers for the correct addresses, for the correct IPs in advance and keep them in my queue.
01:12:20
So I can immediately request the page then. And it could also use a relaxed policy with respect to the DNS updates. So avoiding unnecessary DNS queries is a good idea for the DNS handler that you have local.
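A minimal sketch of such a local DNS handler in Python (the host names are placeholders; a production crawler would rather use an asynchronous resolver, this only illustrates the caching and prefetching idea):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    dns_cache = {}   # hostname -> IP address, filled before the fetcher needs it

    def resolve(hostname):
        # Ask the DNS only once per host and keep the answer locally.
        if hostname not in dns_cache:
            try:
                dns_cache[hostname] = socket.gethostbyname(hostname)
            except socket.gaierror:
                dns_cache[hostname] = None   # resolution failed, remember that too
        return dns_cache[hostname]

    def prefetch(hostnames):
        # Resolve the hosts of URIs that are already waiting in the queue,
        # so the fetcher can connect immediately when their turn comes.
        with ThreadPoolExecutor(max_workers=10) as pool:
            pool.map(resolve, hostnames)

    prefetch(["www.example.org", "en.wikipedia.org"])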
01:12:41
A little bit more complex is the duplicate checking, both on URIs and on content. So let's start with the URI checker. The basic idea is: if I have crawled a URI within the last week or month, or whatever
01:13:03
my turnaround time for the crawl is, I should definitely not crawl it again immediately but delay it at some future point. But of course, checking the URIs against everything that I've crawled recently,
01:13:22
remember 2,000 a second, maybe a little bit difficult because I mean, string matching over the last, I don't know, week or something like that, 2,000 a second, that's quite a task. So what you basically do, string comparisons,
01:13:44
that's just out of the question, it doesn't work. Even if you index your strings, doing some kind of inverted file index or something like that, it doesn't help you too much, because you will just end up with building
01:14:03
and maintaining the index being more complex than doing the actual matching. It doesn't make it fast enough. So what is usually done is you use not the URIs as such but fingerprints of the URIs.
01:14:22
So you abstract the URI into some hash value. So you use the URIs in normalized form. That's one point that already reduces the search space.
01:14:43
And then for any of the URIs that you get, you compute hash value or the fingerprint and use some hash function for that. One of the most prominent is MD5
01:15:01
which basically delivers 128-bit fingerprints. So for example, if you use the IfIS webpage, this is what is delivered by MD5; it can be computed very quickly. And now, indexing this number becomes far easier
01:15:25
than indexing the strings. So for example, you can build a B-tree or a hash table or whatever, and you may also use those if you're using a lot of servers and working in a distributed fashion.
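As a minimal sketch of this fingerprint check in Python, with hashlib's MD5 and a plain in-memory dictionary standing in for the B-tree or hash table (the URL and the normalization are simplified placeholders):

    import hashlib

    # fingerprint -> set of URI strings that produced it (kept for the rare collision case)
    seen = {}

    def fingerprint(uri):
        # A real normalizer would also lowercase scheme and host, resolve
        # relative paths, strip default ports, and so on.
        normalized = uri.strip().rstrip("/")
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()   # 128-bit value

    def is_new(uri):
        fp = fingerprint(uri)
        if fp not in seen:            # fingerprint unknown: the URI is definitely new
            seen[fp] = {uri}
            return True
        if uri in seen[fp]:           # same string seen before: a known URI
            return False
        seen[fp].add(uri)             # collision: different string, same fingerprint
        return True

    print(is_new("http://www.example.org/lectures/"))   # True
    print(is_new("http://www.example.org/lectures/"))   # False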
01:15:44
Something like distributed hash tables might be a very good choice for actually doing that. For illustration, we will do it with a B-tree here. So what you basically do is you take the numbers and for every block of the B-tree, you will decide which part of the B-tree,
01:16:03
which sub-tree of the B-tree you have to address. So for every number that is smaller or equal to three, you will be pointed to the first block. For every number that is larger than three but smaller or equal to five, you will be pointed to the second and so on and so on.
01:16:21
So the block size over here usually reflects the size of the disk blocks that you have to load to read it, but usually the index structure for detecting duplicate URIs should be done in main memory
01:16:41
and you should just traverse the B-tree until you arrive at the different fingerprints and then you can efficiently look up whether a fingerprint has already been used or not. Since it's a hash function, collisions are very improbable
01:17:02
depending on the size of the hash function that you used and the numerical comparisons can be done quite quickly. So if you find that some fingerprint is already in your B-tree, it will with a very high probability have been produced by the same URI.
01:17:23
Might be that there's a collision, but usually there's not. Good, so what you basically do is what I'm saying. You ask whether a URI is contained in the B-tree. If it's not contained in the B-tree, then it's definitely new, okay?
01:17:42
So there's no URI that has created this hash. If it is contained in the B-tree, then it could still be a collision. So what we basically do is we ask if the same URI strings originate or basically latch to the fingerprint
01:18:01
and this results in only a couple of string matches. So here you need string matches, but it's a very limited number, all the strings that kind of result when being hashed in this collision bucket, and that's kind of okay. So if that is true,
01:18:20
if they were actually hashed from the same string, then it's known, so then we will not take it into our list. If they were not created by the same string, then again it's new and this is basically put into the list or into the queue of URIs
01:18:41
that still have to be crawled, okay? Quite efficient for the fingerprinting of URIs, yes. Oh, so the question is,
01:19:03
if we have good duplicate checking, how can traps work so that you would not look up the same URI? And you're definitely right that the simple traps, like we will just create a loop of pages,
01:19:22
do not work anymore if you have a good duplicate checking. However, the traps become cleverer as the crawlers become cleverer. And one of the ideas of creating a simple spider trap is dynamically generating new URIs that are however mapped to the same site,
01:19:42
which then again creates as a link a dynamically generated URI. So you get new URIs and there are no duplicates of these URIs, they're in the same site usually, but they have never occurred before. However, they link to the same page basically
01:20:02
and this results in a typical spider trap, so it can still happen. Good, the thing becomes a little bit more problematic if we consider the size. So let's say we have one billion URIs
01:20:21
and we fingerprint it and the fingerprint is at least 16 bytes. This means for one billion URIs, we have 15 gigabytes of storage, plus more space to store the actual strings that we might need for string comparison
01:20:41
in the case of collisions. And as always, you have two options of storage. You can have it in main memory, which would be very advisable at that point, or you put it on disks. If you put it on disks, loading from the disk is a very time-consuming issue.
01:21:03
And time-consuming is a word that a crawler doesn't like because the crawling itself is quite time-consuming and quite a task, as we've seen. So in either case, enforcing locality is kind of a good idea.
01:21:24
So if you're considering the structure of the web, you usually have multiple pages within a single site. So you do have the IfIS website and then there's a sub-page for every researcher at IfIS,
01:21:42
or for every lecture that IfIS offers, or for every, I don't know what's on our website, for every paper that IfIS published or whatever, for every project. So keeping them close together and saying, well, basically, the idea is we crawl the site
01:22:02
and then do the individual pages, and if I already crawled the site, I won't look into it again. That is a good idea. So locality can usually be enforced
01:22:21
if you consider the hostname as being the main feature of the site and then all the pages are just variations in the path that is given to some pages. So one idea to make it a little bit more localized
01:22:41
is you take two fingerprints, one for the hostname and one for the rest of the URI. So if I have, for example, here the Wikipedia, I take en.wikipedia.org, which is the English Wikipedia, as the site name, the hostname, and then I have all the different parts
01:23:02
that are different pages within the English Wikipedia. For example, here the New South Wales page and then there will be the Northern Territories and the rest of Australia pages. So you just concatenate both hashes
01:23:21
to form a fingerprint, which means that the URI of the same hostname are located in the same subtree of the index because they start with the same prefix. And our B-tree always looks up the prefix and guides you to the correct part of the tree, okay?
01:23:45
And the longer the common prefix, the deeper you are in the tree, okay? This is basically how you can do it. Otherwise, if you would just hash the whole URI,
01:24:01
also if I just remove this S, the hashing function would hash it probably to very remote places in the index, even if it's almost the same. You cannot predict, that's the basic idea of the hashing function, you cannot predict where it will end up.
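A small sketch of such a two-part fingerprint in Python (the hashes are truncated here just for readability; a real index would keep more bits of each part):

    import hashlib
    from urllib.parse import urlsplit

    def two_part_fingerprint(uri):
        parts = urlsplit(uri)
        # Hash host and path separately and concatenate them: URIs of the same
        # host then share a prefix and land in the same subtree of the index.
        host_part = hashlib.md5(parts.netloc.encode("utf-8")).hexdigest()[:8]
        path_part = hashlib.md5(parts.path.encode("utf-8")).hexdigest()[:8]
        return host_part + path_part

    a = two_part_fingerprint("https://en.wikipedia.org/wiki/New_South_Wales")
    b = two_part_fingerprint("https://en.wikipedia.org/wiki/Northern_Territory")
    print(a[:8] == b[:8])   # True: same host, same prefix, same subtree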
01:24:24
Good, a little bit more complicated than checking the actual URIs is checking duplicate content. This is a bit of a problem. So of course we can do something like, okay, let's see the web page and then we will just have a fingerprint of the web page
01:24:41
and do the same trick as we did with the URIs. But that's kind of strange. So for example, if we take the, some of the dynamically generated pages here where basically the time is running and it's always different information,
01:25:01
so that would result in different hashes, which would not be a good idea, then the same kind of information can be transported in different layouts and different wordings. You can build different sentences that basically have the same meaning.
01:25:21
Is that duplicate content or isn't it? If you just switch sentences around, you just exchange them somehow. You just move them around. You still have the same content on the page. But of course if you compute the hash, since the order has been reversed, it's different.
01:25:42
Same goes for web pages that offer advertising and just exchange the advertisements at every visit. Has something changed on the web page? Well, a very negligible part of it, because where it said "buy Viagra", it now says "buy Cialis" or I don't know,
01:26:01
whatever these advertisements are. And this of course is a problem. So what do we do about that? And this was often called near duplicate detection. So you don't want to detect whether it's an exact copy, but you want to detect whether it's almost the same.
01:26:25
Just the ordering has been switched around, or it's no longer "buy Viagra" but "buy Cialis". So just a little of the information on the page has been exchanged. And the first step is always to focus on the content only,
01:26:45
because whether the layout changes is not interesting to see if you have the content already in your database. So you just remove all the styling. You also want to focus on text because with the images,
01:27:03
is it the same image over there or is it not? It's not? Yeah, that's right. So here we can see it's not the same image, though it looks amazingly similar.
01:27:20
So if that is hard for us to see, it's hard to see for the crawler. And you can't just go pixel by pixel through the image and then say, well, one pixel is different, so this must be a totally different image. No, it's not. Just skip it, drop the images, drop all dynamic content,
01:27:42
drop everything that is not the focus of the page. Consider the text only and in terms of text, only the text that is written in paragraphs and headlines and stuff like that, not navigational elements where we go to,
01:28:02
okay, this is the IfIS part: researchers or teaching or projects or whatever. I don't want to know about that. I want to know what is on the page, what is written on the page. And this is quite a problem, actually, how to segment a webpage to see what is content,
01:28:21
what is navigational, what is dynamically created. And sometimes you see it on the web sources, sometimes you don't. It's kind of difficult to extract. There are even some visual techniques where you kind of have the perception of the webpage
01:28:42
to see, well, the navigation bar is usually on the left-hand side and the ads are usually in terms of banners and if you see something blocky, that might be an ad or stuff like that. You know, so there's a lot of techniques and it's usually ongoing work.
01:29:02
And so what stays of a website is not that much anymore. I skip all the navigational stuff. I skip all the navigational stuff over here. I skip all the pictures. I skip all the pictures over here and just focus on what is interesting, okay?
01:29:21
And breaking it down will lead to some very small text fragment. This is what I want to extract. This is what I want to index because now I know this is a webpage about some Institute for Information Systems and actually it's at the Technical University of Braunschweig and so on, huh?
01:29:42
This is the interesting part. Well, still, if I change around the order of words, does that help me or not? If I just exchange sentences, same content, not the same?
01:30:01
Well, at least a near duplicate. So I can't do the comparison on a text basis, you know, like if I exchange one word, it's not the same text anymore. So again, it does not work in terms of hashing it and then considering different hashes
01:30:21
that were created by different content. If I just exchange one word, the hash value of the whole page will be totally different in terms of content, okay? So again, I cannot do it on a word level. I cannot do it on the hash level.
01:30:41
What do I do? And there's one clever technique that has been called shingling. Shingles are what we use in building houses for covering up the roofs. So they're kind of overlaying so the water can drain easily and the idea is that we should do exactly that
01:31:04
with a text. For every text that we have, we have a shingle size, a number which we will call K, and we have the terms in succession that make up the text, okay?
01:31:23
And then we say the K-shingles of the sequence of terms are all the consecutive sequences of K terms over the document. So if I have some document over here and decide for some K, for example four,
01:31:41
what I will do is I will create the first shingle that goes like that. It's four successive words. Then I will create the second shingle that goes like that. Then I will create the third shingle that goes like that, okay? So they're overlapping shingles and they're kind of like showing snippets
01:32:03
of size K of my text. Good, so these are typical four shingles of my text, as I showed you just. And now what is the trick in near-duplicate checking?
01:32:21
What does it have to do with the shingles? Yes, but these are shingles of the same text.
01:32:48
So we're still talking about one document here and we get multiple shingles of that document, exactly.
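A tiny sketch of k-shingling in Python (the example sentence is made up):

    def shingles(text, k=4):
        # All consecutive sequences of k terms of the text, collected as a set.
        terms = text.split()
        return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

    print(shingles("a rose is a rose is a rose"))
    # three distinct 4-shingles (set order may vary)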
01:33:01
So the trick is that we now can say, well, if I have two documents and I do not focus on the sequence of words but just on the set of the shingles, if two documents have a large overlap in the set of shingles,
01:33:23
this is near-duplicate content. Because it doesn't matter whether in the second document there is not rows here but some other term. It just affects the blue and the black shingle.
01:33:42
The red shingle is not affected by this exchange. Also, I refer to the set of shingles, not the sequence of shingles. This means if I take the red shingle and put it at the end of the text, it doesn't matter, it's still the same shingle.
01:34:01
Though the order of the words has been reversed. It means the same. So we can say that two documents are near-duplicates if the two sets of shingles generated from them are nearly the same.
01:34:20
The larger the overlap in terms of shingles, the higher the similarity in terms of the documents. Well, take two documents, shingle them, and then what we can do is compute the Jaccard coefficient.
01:34:43
The Jaccard coefficient we very briefly had in the fuzzy retrieval part. So what we do is we measure the overlap between the texts. We take the shingles that are in both documents and divide by the number of shingles that occur in either document, okay?
01:35:02
And if the two documents perfectly overlap, this will be one. If there are a lot of shingles and the overlap is very small or zero, the Jaccard coefficient will be zero or close to it. And the higher the Jaccard coefficient, the more it tends towards one,
01:35:21
the more similar the objects are because the shingles are the same, okay? So we say, kind of like, I don't know, 0.9 or something like that, 90% overlap, that is a near duplicate. And that accounts for the different words taken from the advertisement, so it's not Viagra anymore, it's Cialis,
01:35:41
or whatever, you know, can be done. Good. If you have all the shingles, computing this Jaccard coefficient is kind of easy. So you could just sort the sets of shingles,
01:36:06
then you find the intersection, and then you merge the sorted lists so you get the total number of shingles that is there, and then in n log n you could compute the Jaccard coefficient.
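Computing the Jaccard coefficient over two shingle sets, as a small sketch in Python (the shingles function is repeated from the sketch above; the two example phrases are made up):

    def shingles(text, k=4):
        terms = text.split()
        return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

    def jaccard(a, b):
        # |intersection| / |union|: 1.0 for identical sets, 0.0 for disjoint ones.
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    x = shingles("buy viagra cheap today only here")
    y = shingles("buy cialis cheap today only here")
    print(jaccard(x, y))   # 0.2: only one of the five distinct shingles survives the word swap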
01:36:23
This is a little bit tricky because what we don't do is just compare two sets of shingles in isolation; what we have is a set of already shingled documents that we've somehow crawled,
01:36:43
and now we have to find out whether, for a newly shingled document, the Jaccard coefficient with respect to any of these documents that we had before is higher than 90%, 0.9, okay?
01:37:01
This is kind of, hmm, we can't do that because, I mean, keeping the shingle sets of all the documents that we had before doesn't work, you know, like that's too expensive. And then computing Jaccard coefficients for every newly shingled document against all the documents that we have crawled before?
01:37:22
Definitely prohibitive, we can't do that. So again, we need clever indexing technique to deal with that problem. That's really a difficult problem, and it sounds kind of unintuitive what is done now, so bear with me for a minute and I will show you
01:37:41
because a very clever way of dealing with the problem is a randomized approximation. So I will not do the exact thing. I will use a randomized algorithm to approximate the result for my Jaccard coefficient.
01:38:03
So a shingle is, as comes as no surprise, hashed into some value, for example a 64-bit integer. Then the set of shingles is basically transformed into a set of hash values, hmm, simple.
01:38:26
A hash value could be derived from several shingles, collisions may occur, but if a hash value is not there, there is no shingle that created it, okay? So we may have false positives,
01:38:43
but we never have false negatives. Then we could say, well, the Jaccard coefficient between the sets of shingles is basically the same as the Jaccard coefficient between the sets of hash values because, I mean, collisions do occur, yes,
01:39:02
but the chance that in one document the collisions are exactly the same as in the other document, so that different parts in each document created exactly the same hash values, that is rather improbable.
01:39:22
Collisions are improbable as such, but synchronized collisions are very improbable indeed. So with that argument, I just assume that the Jaccard coefficient of the hash values will be sufficient to approximate
01:39:41
the Jaccard coefficient of the shingles. Now I use another little trick and use a random permutation on the set of all 64-bit integers. That means I just take a 64-bit integer
01:40:03
and shift the numbers around within the 64-bit range. In some way, it's a deterministic way, you know, but it can be any permutation, just randomly picked, yeah? Good, the simplest permutation obviously is the identity.
01:40:23
I map them to themselves. That doesn't help me very much. But it could be kind of like just adding one, so every value is increased by one, and if I have an overflow, I go back to zero and that's it.
01:40:41
Could be a possibility. I take one such permutation, so I have to decide for one, picked at random, okay? So I do not randomly permute every hash value
01:41:02
and then everything's gone, but I just decide for one permutation, whatever it is. So I have a set of shingles. I hash it. I have a set of hash values. I permute it by this chosen permutation,
01:41:24
randomly chosen permutation, and then I look for the smallest value that was derived by this permutation.
01:41:41
Good. What I do is I shingle it, flop, flop. I hash it, flop. I randomly permute it, flop, flop, flop,
01:42:03
and then I choose the minimum, which is the one over here, okay? And now my claim is that the Jaccard coefficient of the two hash sets of documents
01:42:27
is basically the probability that their minimum numbers are the same.
01:42:42
So the overlap is higher if they create the same minimum number by a random permutation. So it's no longer the issue whether the smallest shingle is there or not in terms of the hash value,
01:43:01
but the random permutation creates the same minimum, okay? So that is the idea. Do you believe it or not?
01:43:22
Ha ha ha ha ha! No, it's a probability. It's a probability. It's not a possibility. So, I mean, a probability is always one or zero after you've seen the fact.
01:43:42
So that's the same with, I don't know, Schrödinger's cat. Open the box and the probability is gone. It has become certainty, you know? But the probability is the interesting part, and that is somewhere between one and zero, as is the Jaccard coefficient. So the domain is the same, exactly.
01:44:17
No, no, no, no, no, no, no, no, no. Either the minimum is the same or it's not,
01:44:22
but what is the probability that a random permutation of your hash values will create the same minimum or not? That's a different question. That's not the question whether the minimum is exactly the same. I mean, that was what we were doing before,
01:44:40
you know, like we, that is the basic idea, you know? I have a lot of random permutations. I have a lot of possibilities to apply permutations to our numbers, okay? And if the probability is very high
01:45:02
that either of these permutations will create the same minimum, then the overlap must be big. Because then, I mean, I create the minimum. I create the minimum with one fixed permutation.
01:45:20
In each random experiment. So the random experiment is as follows: I randomly choose a permutation and then I apply this permutation to all the hash values. The same permutation, okay?
01:45:42
This permutation will make one specific hash value in both documents the smallest, because it is applied to both in the same way. If I draw a different permutation, it will do the same.
01:46:03
If no matter what permutation I draw, this is where the probability comes in, I will end up with the same minimum. Then it must have come from the same shingle in the original set or from a shingle
01:46:22
that had the collision, which is rather improbable, okay? This is the idea. So the higher the probability that random permutations really result for two documents in the same minimum,
01:46:45
the higher the overlap must be because the minimum is directly permuted from a shingle. And the more possibilities I have to permute the shingle, the more shingles must overlap, okay?
01:47:12
Well, getting the probability is a different thing from knowing that the probability is proportional to the Jaccard index, huh?
01:47:23
We will talk about how we get the probabilities, but for now, the point that I want to drive across is that this probability is really proportional, or equal, to the Jaccard index of the hash values, okay?
01:47:41
This is what I want to do. This is exactly what we do, you know? We have two sets of shingles and we have the corresponding hash values. Then we take a random permutation, so we choose one, apply it. We calculate the minima, compare it with respect to each other.
01:48:03
We choose another permutation, apply it. Calculate the minimum and so on, huh? So as a random experiment, this goes on like that and we can prove that,
01:48:21
well, if we have the sets given as bit strings, huh? What basically happens is that we have the positions in the bit strings and the hashes for the documents, huh? So a permutation is basically
01:48:42
a random swapping of the columns because we cannot say what it will be transferred into. And a random swapping in the columns means that we swap, I don't know,
01:49:00
zero here with the one here and then we have the one in front, huh? And so on. What is the minimum number of this hash values now? Well, minimum means obviously that it has leading zeros because the more leading zeros it has,
01:49:22
the smaller the thing is that remains, okay? This is a bit representation of the number, huh? If I have a one here, it will be quite a large number. If I have a lot of leading zeros,
01:49:40
it will be a small number, okay? And this is the basic ideas. If I take the minimums as positions of the first zero columns, huh? So this would be the first zero columns in this game. Then the probability that the minimum of one
01:50:04
is the minimum of the other is proportional to the probability that the first zero column occurs in the same position. Because, I mean, the rest is noise.
01:50:23
But if the zero column is not in the first position, it's definitely not the same number, okay? So this has to be proportional. What is the probability that both have the first zero color of the position?
01:50:41
Well, since the matching zero-zero columns over here can be ignored, it's the probability that the first non-zero-zero column is a column of the form two ones, huh? So we don't want a one-zero or a zero-one column.
01:51:01
We want a one-one column. That means that the probability that the minimum hash values agree after the same random permutation has been applied to the two hash sets of the shingles
01:51:21
is the probability that the first non-zero-zero column indeed is a one-one column, huh? I've basically crossed out all the leading zero-zero columns, I look at the first remaining column, and if that is one-one, then chances are very good that it's the same minimum, modulo noise, huh?
01:51:44
We're talking always about, you know, like, it's proportionate. And this has to do with efficiency, so we can well lose something here. So, if we continue our proof,
01:52:02
what is the probability that the first non-zero zero column is actually a one one column? So this case happens, I cross out the leading zeros, and then look at this column, and this is indeed a one one column. Well, the probability is the number of one one columns in the whole vector divided by the number
01:52:25
of non-zero zero columns, huh? Because if I have a one one column, or if I cross out the zero zero columns,
01:52:41
the column that is remaining could be of three types. It could be a one one column, it could be a one zero column, or it could be a zero one column. The probability that it's the first one, the one one column, is the probability of a one one column existing, upper part, divided by all the possibilities there are.
01:53:07
The number of non-zero-zero columns, that is, the number of one-one columns plus the number of one-zero columns plus the number of zero-one columns, okay?
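Written out compactly in LaTeX notation, with A and B the two sets of shingle hash values, pi the random permutation of the columns, and n11, n10, n01 the numbers of columns of the respective types:

    P\bigl(\min \pi(A) = \min \pi(B)\bigr)
      = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}
      = \frac{|A \cap B|}{|A \cup B|}
      = J(A, B)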
01:53:20
Good. And this is exactly the definition of the Jaccard coefficient. Cool, isn't it? Well, I grant you that this is not, I mean, you wouldn't probably see it in the beginning
01:53:42
when somebody says, well, it's easy, you know, like you shingle it and you permute it and then if the minimum is kind of the same, then it's kind of the same thing, near duplicate content. But it makes sense, doesn't it? I always like that, it's fun, shingling is really fun.
01:54:02
And the father of shingling is Andrei Broder, actually, who worked for Yahoo for quite some time. And this was actually a problem that search engines really had, so this is not some theoretical issue and some nice hobby of mine, but this is really what Yahoo did
01:54:23
to detect near-duplicate content. So, the idea is we can estimate the overlap by applying random permutations and comparing the minima. And now your problem comes: how do we compute this?
01:54:42
Do we really have to do it? And the answer is yes, we have to do it. And what we will do is we will just take 200 random samples of functions, apply them, and that is a good measure, sufficiently good measure for what we do.
01:55:03
We take 200 randomly chosen but fixed permutations. We take the minima of these permutations applied to the shingle sets, well, to the hash sets of the shingles, actually, okay?
01:55:23
And this is what is called a sketch of the document. Okay, so the Jaccard coefficients of two documents can now be estimated by counting the number of places
01:55:41
where the two sketches of the documents agree. That is where the random permutation creates the same minimum. If that happens in a lot of places, it's near duplicates. If that happens only rarely, they share a shingle or two,
01:56:04
but it's not the same content, okay? That's the basic idea. And the Jaccard coefficients for the hashes and for the actual shingles are basically the same, so just forget about the collisions.
01:56:24
We don't want them, and it's reasonably improbable that a collision happened. So since the two coefficients are essentially the same, the sketch is a good way of estimating the Jaccard coefficient. And the sketch is actually a very efficient way
01:56:44
of representing a document, because it's 200 64-bit numbers
01:57:02
that can be properly indexed and where I can easily compute the permutation. So, again, to recapitulate, I take a document, I shingle the document, I hash the shingles, I permute the hashed shingles,
01:57:24
and basically break it down to the minimum. This is where the actual reduction is. So far, I really have taken the document and blown it up into shingles, which is kind of very much overlapping.
01:57:42
Shingles to hashes makes it numerical, but still, nothing saved. Hashes to permutations gives me 200-fold the thing, and then I kind of do the cut and say, okay, I just want the minima of them.
01:58:01
I can argue for the minima being all I need. Good. This leads me to the sketch of the document. And I do the same for all the other documents, end up with a number of sketches, and then I will just put the sketches
01:58:20
on top of each other, see where they coincide, see where they are different, and then I get the number of places where the minima are equal, which can be between zero and 200; divide it by 200, and that gives me a number between zero and one,
01:58:41
which can then be compared against the threshold. Everything above 0.9 is a near-duplicate. Clever technique, isn't it? Well, maybe.
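A compact sketch of this whole pipeline in Python. Instead of true random permutations it uses 200 salted hash functions, which is a common way to implement the idea; all concrete names and texts here are made up:

    import hashlib
    import random

    NUM_FUNCTIONS = 200
    random.seed(42)
    SALTS = [random.getrandbits(64) for _ in range(NUM_FUNCTIONS)]

    def shingles(text, k=4):
        terms = text.split()
        return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

    def hashed(shingle, salt):
        # 64-bit hash value of a shingle under one "permutation" (salted hash).
        data = (" ".join(shingle) + str(salt)).encode("utf-8")
        return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

    def sketch(text):
        # For each of the 200 functions, keep only the minimum over all shingles.
        shingle_set = shingles(text)
        return [min(hashed(s, salt) for s in shingle_set) for salt in SALTS]

    def estimated_jaccard(sketch_a, sketch_b):
        # Fraction of the 200 places where the two sketches agree.
        matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
        return matches / NUM_FUNCTIONS

    s1 = sketch("the quick brown fox jumps over the lazy dog near the river bank")
    s2 = sketch("the quick brown fox jumps over the lazy dog near the river shore")
    print(estimated_jaccard(s1, s2))   # close to the true Jaccard coefficient of the shingle sets

Anyway, so now back to the initial problem.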
01:59:04
What we have is a large collection of documents and the sketches, and now we have a new document. The near-duplicates of this new document can be found by computing the sketch of the new document
01:59:21
compared to the sketches of all the documents that I've seen before. And of course, this is much faster than looking at the individual shingles or trying to fiddle around with words and then how many words and what is the overlap in terms of the text and blah, blah, blah. This is much superior to all the string-based techniques
01:59:44
and gives quite good results. Still, if you have crawled just a billion documents and you have their sketches and in comes a new document, then you have to compare it to the billion sketches.
02:00:00
and compute the Jaccard coefficient for all of them. That seems rather bad. This doesn't seem to be too impressive. Again, we can use a trick, because for every indexed document and every entry of the sketch of that document,
02:00:21
we can create a pair of the minimum and the document ID. So if you have n documents,
02:00:40
you basically get 200 such pairs per document, because you have 200 minima in your sketch, and that's for every document. But now we can order it by the minima
02:01:01
and just for every new document, look up whether this minimum has been reached by some document. So this basically could also be done by an inverted index that records, for every minimum, in which documents it occurs.
02:01:24
And then I just ask for all the different minima that occur in the new document's sketch, and for every document ID that is there, I just increment a counter. And if I find one document that has a 90% overlap,
02:01:43
that is 180, so if one document occurs more than 180 times, that's it. Quite efficient. Good. Again, new document, sketch it.
02:02:03
That means shingle it, hash it, permute it, blah, blah, blah. Sketch it. Now we use the B-tree to find the indexed documents whose sketches contain at least one of the minima, so they have a minimum in common, and then we look at the set of documents
02:02:26
such that the first minimum is in their sketch, and so on. Only those document sketches can have a non-zero overlap with the sketch of D. If they have at least one minimum in common,
02:02:43
then their Jaccard coefficient is bigger than zero, and I just need to check among them. Makes it more efficient. Okay? Well, there's also an extension:
02:03:03
if you say that two documents are near duplicates only if their sketches match in at least m places, then we can restrict the search to all the documents that have at least m numbers in common with the new sketch.
02:03:24
Again, I don't know whether the B-tree here is the best choice, or whether an inverted file index could be used, or a hash table. There are definitely several possibilities of what you could use for performing the check. But the idea is to really organize the thing, inverted, by the minimum,
02:03:45
and then, for a new document, look up for all of its minima which document IDs have the same minimum, and then count the documents with respect to the places where they overlap. And then if you are above, say, 180, that's it.
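A sketch of that lookup structure in Python, with a plain dictionary standing in for the B-tree or inverted file (one design note: the key here also records the sketch position, since minima coming from different permutations should not be compared with each other):

    from collections import defaultdict

    # (sketch position, minimum hash value) -> list of document IDs
    index = defaultdict(list)

    def add_document(doc_id, doc_sketch):
        for position, minimum in enumerate(doc_sketch):
            index[(position, minimum)].append(doc_id)

    def near_duplicates(new_sketch, threshold=180):
        # Count, for every indexed document, in how many of the 200 places
        # its sketch agrees with the sketch of the new document.
        counts = defaultdict(int)
        for position, minimum in enumerate(new_sketch):
            for doc_id in index.get((position, minimum), []):
                counts[doc_id] += 1
        return [doc_id for doc_id, hits in counts.items() if hits >= threshold]

Filling it with add_document for every crawled document and calling near_duplicates on the sketch of every incoming document gives exactly the "at least 180 out of 200 places" check described here.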
02:04:02
Okay? Well, the last thing I want to do today is spend a little bit of time on focus crawling. So let's assume we own a web search engine that focuses on a specific topic. You know, like, I just want news items,
02:04:21
or I just want sport events, or I just want tropical fishes or whatever. I'm the tropical fish search engine, and every diver wants that or loves that. Do I need to crawl the entire web?
02:04:42
Probably not. And the idea is that some focus crawling would be enough. So if I know, okay, there are a couple of pages that deal with tropical fish or with sport events or whatever it may be, why don't we crawl in their surroundings?
02:05:00
Why don't we only take their out links? That then may at some point lead to, I mean, also the tropical fish web page may transfer me to Google, and from there I can go everywhere. But chances are that if I hit one of the pages about tropical fishes,
02:05:22
I will enter a small world, as they call it very often. So there are other fish lovers, and they interlink each other. And crawling this part of the web seems far more promising for having good results on my search engine than basically crawling the entire web
02:05:41
and throwing out everything that is not fishy. Tropical fishy, that is. So what do we do? We train a classifier that is able to detect whether some web page is about the relevant topic that I'm interested in. It can be an Eve Bayesian classifier,
02:06:00
whatever, a support vector machine, you name it. Then we take a number of pages, seed pages for our crawler, that are totally on topic, and follow the out-links of these on-topic pages. And then for each page that we land on,
02:06:20
we can decide whether this is still concerned with our tropical fishes or sports events or whatever it may be. And if so, we again follow the links on this page, and if not, we just cut it out. That's the basic idea.
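A schematic focused-crawling loop in Python; the classifier, the fetcher and the link extractor are only stubs here, so their names and behavior are assumptions, not a real library API:

    from collections import deque

    def is_on_topic(page_text):
        # Stub for the trained classifier (naive Bayes, support vector machine, ...).
        return "tropical fish" in page_text.lower()

    def fetch(url):
        # Stub: a real crawler would issue a polite HTTP request here.
        return ""

    def extract_links(page_text):
        # Stub: a real crawler would parse the HTML and collect the out-links.
        return []

    def focused_crawl(seed_urls, max_pages=1000):
        frontier = deque(seed_urls)          # start from pages known to be on topic
        visited = set()
        on_topic = []
        while frontier and len(on_topic) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            text = fetch(url)
            if is_on_topic(text):            # only expand pages the classifier accepts
                on_topic.append(url)
                frontier.extend(extract_links(text))
        return on_topic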
02:06:40
You could also extend that and use clever probability models like we've just seen for finding out whether some links probably point to something that is interesting or not, and then crawl in some ranking of the pages that are more fishy or less fishy or whatever.
02:07:02
You can do a lot of things with that. But the basic idea is: just take a couple of pages that are definitely on the topic and use them as seed pages, explore their surroundings, and you will have a lot of information about your topic and avoid a lot of unnecessary crawls of irrelevant sites;
02:07:22
that's basically the idea of it. If you do it and you compare it to unfocused crawling and look at the URLs that you fetch and the relevance of the topics in terms of the harvest rate,
02:07:41
so you kind of do not start from interesting sites but just follow all the links, you will find that the more URLs you fetch, the more the average relevance goes down.
02:08:04
So you start very, very focused, and you have a good average relevance, and then you follow the first link, and it leads you somewhere else, and that leads you somewhere else. And at some point, you're at Google, and you're basically everywhere. So the way that you crawl
02:08:23
will not lead to high relevance harvest rates. If you use a focused crawling approach and crawl the URL, we can see that the relevance is kind of,
02:08:43
well, it's jumping a little bit up and down, you know, like more or less relevant, but staying basically on a higher level for the first couple of URLs, which would basically be here in the other thing. So focused crawling really helps you
02:09:01
getting a good harvest rate and getting the most relevant pages first, which in turn helps you in keeping up your good result sets, which is very important for focused crawling engines because usually they are specialized.
02:09:21
They have, in general, a smaller community. So who's interested in tropical fishes? I mean, a couple of divers or whatever. People having aquariums or whatever, you know, a very small amount of people. But still, you have to maintain your search engines.
02:09:41
You have to pay for it. You have to pay for the traffic. You have to pay for the crawling. Getting a high-quality harvest rate is much more sensible than crawling the whole web and then saying, well, yeah, out of order because I can't afford it anymore. You know, that's basically the idea.
02:10:01
And our last detour for today will be how to make websites crawler-friendly, so you can help Google and Yahoo and Ask and whomever to do what they have to do. Yeah, we've already talked way too much, so I will do this here very briefly.
02:10:21
So to help crawlers, what can you do? First of all, use a robots.txt to exclude everything you don't want to have indexed. Use a static site map containing all pages in your site, listed in a good way such that a crawler can easily see it. Services such as Google already offer a standard for this.
02:10:43
Use good HTML, which makes it much easier for the web crawler's site parser. Avoid dynamic and scripted content wherever you can, so use it only when it's really necessary. Provide freshness and caching information
02:11:01
in your HTTP headers, for example when the page was updated the last time, so that crawlers don't have to fetch the whole page but only its header, which states when the last update has been made to the page. Send correct HTTP status codes.
02:11:21
This is particularly relevant for redirects, so some people still make redirects in the browser, and this could be difficult for crawlers to detect. So there is a status code in HTTP for redirects. Use it. Use the correct MIME types and content encodings for your documents. Again, this will help crawlers to process your content in the right way
02:11:42
so that all umlauts and all the other stuff are indexed correctly. Use canonical host names, which basically means do not provide the same content using different URLs. So there are some websites that are available with exactly the same content using different host names.
02:12:02
Don't do this; the crawler will have to detect these duplicates and it will cause trouble. Avoid spider traps, such as the session IDs some people generate within their URLs. These are almost always a problem for crawlers. Try to avoid them. And if you have some time,
02:12:20
try to annotate your images on your page with some textual description. This will help image search tasks and also might help users who cannot view images easily, but it will definitely also help crawlers index your content.
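To illustrate the point about freshness and caching headers: a polite crawler can ask the server whether anything changed since its last visit and receive a cheap 304 answer instead of the full page. A sketch using Python's standard library (URL and date are placeholders):

    import urllib.error
    import urllib.request

    request = urllib.request.Request(
        "http://www.example.org/index.html",
        headers={"If-Modified-Since": "Wed, 29 Jun 2011 10:00:00 GMT"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            body = response.read()        # 200: the page changed, re-parse and re-index it
    except urllib.error.HTTPError as err:
        if err.code == 304:
            pass                          # 304 Not Modified: keep the cached copy
        else:
            raise

All right, that's it.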
02:12:40
Next week, we will talk about how we can exploit the link structure for better results in web search. These are the HITS and the PageRank algorithms. The latter one is used by Google, very famous. Well, that's it for today. Thank you very much for listening, and have a good lunch.