
Support vector machines (8.6.2011)


Formal Metadata

Title
Support vector machines (8.6.2011)
Series Title
Part
9
Number of Parts
13
Author
Contributors
License
CC Attribution - NonCommercial 3.0 Germany:
You may use, modify and copy, distribute and make the work or its contents publicly accessible for any legal, non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language
Producer
Production Year: 2011
Production Place: Braunschweig

Content Metadata

Subject Area
Genre
Abstract
This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course.
Transcript: English (automatically generated)
All right, so then let's begin our ninth lecture on information retrieval and web search engines. Professor Balke is not here today because he has been invited to give a talk at the National Librarian Conference, which is currently held in Berlin.
So it's all up to us today, but I think there won't be any problem. All right, so let's start with the discussion of homework, also known as exam preparation. So I will go through the last exercise and say some words about it, and then this should be enough.
Okay, first question, what's the F measure used for and what's the intuition underlying it? So usually you need the F measure when you want to evaluate the retrieval quality of a given information retrieval system. This is usually done by comparing the actual result set returned by the system for some query to some manually prepared reference result set.
And it's done by measures like precision and recall. Precision and recall usually are in some kind of conflict, so you can get a very high recall at a low precision and vice versa.
Some people would like to have a joint measure that combines precision and recall into a single figure. And that's what the F measure is used for: it's simply a harmonic mean or a weighted harmonic mean of precision and recall. And the idea is that if both precision and recall are high, then also the F measure is quite high.
And if either precision or recall is low, then the F measure also will be low. So a good F measure means that precision and recall both are at acceptable levels.
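As a small aside (not part of the lecture), a minimal sketch of this weighted harmonic mean in Python; the function name and the example values are made up for illustration:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 measure; beta > 1 weights recall higher,
    beta < 1 weights precision higher.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# If both values are high, F is high; if one is low, F drops sharply.
print(f_measure(0.9, 0.8))   # ~0.847
print(f_measure(0.9, 0.1))   # ~0.18
```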
Yeah, we're now doing homework number five because last week we didn't do any homework, because I went there. So we're just going through the exercises of the last two weeks and then going on with the real stuff here.
OK, second exercise from homework five. So when drawing precision and recall curves for ranked lists, we have seen some sawtooth shape in the diagram. So here was this precision-recall diagram, precision and recall.
No, I think it was the other way around, but actually it doesn't matter. And then the idea was, when you have a ranked result list, for every result or every step in this result list, we evaluate the precision and recall at the documents lying above this step and then draw points in the diagram.
For example, after the first step, let's say this document here is relevant, the first one. Then the precision would be 100% and the recall would be some value below one usually because there are many relevant documents.
And then it might happen that this document isn't relevant, the second one. So the recall stays the same, but the precision is lowered. We got this one. If the third document is relevant, then the recall increases and also the precision increases, and this gives some nice sawtooth shape.
And this is all about it. Usually you use some kind of smoothing technique to get a better diagram and just take the average over many, many queries. And that's how things are done in precision-recall diagrams. So they look a bit strange, but this is really from the discrete evaluation of steps in the ranked result list.
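Just as an illustration of that sawtooth behaviour, a minimal sketch that computes one precision/recall point per rank; the relevance judgements in the example are hypothetical:

```python
def precision_recall_points(relevance, total_relevant):
    """relevance: one boolean per rank, True if that document is relevant.
    total_relevant: number of relevant documents in the whole collection."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
        precision = hits / rank
        recall = hits / total_relevant
        points.append((recall, precision))
    return points

# Relevant, non-relevant, relevant: precision dips at rank 2, then rises again.
print(precision_recall_points([True, False, True], total_relevant=10))
# [(0.1, 1.0), (0.1, 0.5), (0.2, 0.666...)]
```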
All right, next one. What does the cluster hypothesis state and how can it be exploited for information retrieval tasks? So the cluster hypothesis basically says that if two documents are relevant or if two documents both are not relevant, then they tend to...
No, other way around. So if two documents are relevant to some query, then they tend to be close together in the space or the other way around. If two documents are close in the document space or in the LSI space or in whatever kind of space.
So if two documents are similar, then they are either both relevant or both irrelevant. So this is basically the idea that we are able to model our document space in a way that there are groups of relevant or non-relevant documents, and relevant and non-relevant documents are not mixed within the clusters occurring in the document space.
So how can this property be exploited for our task? So we have seen some examples. For example, we could use this to cluster our result sets into different groups. We could use this if we get returned one relevant document from some cluster.
We could use this to extend our results to all documents lying in this cluster to get a better recall. We could use this for scatter-gather navigation and all kinds of browsing. So there's a lot of things one can do with clustering in IR and this usually relies on the cluster hypothesis to be true.
So usually it is true, but it all depends on the document space used and the similarity measures applied there. So there's no guarantee that it will hold for a given collection, but usually it does, more or less. All right, next one.
How can one determine whether a clustering is good? So we have seen two ways to determine this. The first is to do a comparison with some kind of external reference clustering. It's very similar to evaluating precision and recall.
So given a reference clustering that has been designed by humans and then we can compare it to a clustering returned by some algorithm and then check whether all pairs of items or documents in the clustering are both classified, are both seen in the same way by the reference set and by the algorithm.
So if the algorithm says that two documents should belong to the same cluster, but the manual reference clustering says that they should not, then this would be an error. Then we can simply count how many errors we make and how many correct pairs we have identified. So the second way of determining whether clustering is good doesn't require an external reference clustering.
So we could simply try to measure how similar documents are that are in the same cluster. So this is the intra-cluster similarity. This should be high. And on the other hand, we can compare how similar documents from different clusters are.
So we would expect them to be very dissimilar, otherwise they should have gone to the same cluster. So this is called inter-cluster similarity and it should be as low as possible. So these are some ideas and measurements you can also use to check whether some clustering is good
or whether some algorithm you designed or tuned gives reasonable results. Because usually you don't have the time or don't want to take the effort for creating a manual reference clustering. So this intra- and inter-cluster comparison could be a hint of how good a clustering is.
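As an aside, a rough sketch of these two measurements, assuming cosine similarity, a NumPy document matrix, and that the clustering is given as one label per document (the function names are made up; it also assumes there is at least one pair within a cluster and one pair across clusters):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def intra_inter_similarity(docs, labels):
    """docs: (n, d) array of document vectors; labels: cluster id per document.
    Returns (average intra-cluster similarity, average inter-cluster similarity)."""
    intra, inter = [], []
    n = len(docs)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(docs[i], docs[j])
            (intra if labels[i] == labels[j] else inter).append(sim)
    return np.mean(intra), np.mean(inter)

# A good clustering should give high intra- and low inter-cluster similarity.
```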
Of course, this always depends on the similarity measure chosen. So it should in some way relate to human understanding of document similarity. So usually some kind of cosine similarity in the vector space model works quite well here.
But it's no guarantee, just a hint to get an idea whether clustering is good or rather bad. Okay, next one is the k-means algorithm for clustering. Well, the idea simply is to identify a fixed number of clusters in some kind of document space.
And it's a rather heuristic approach. The idea is to start with k initial seeds, some documents in the space, where we assume that each of these documents is the center of a new cluster, and then we simply assign all other documents to the cluster whose center is closest to the document.
And then we recompute the center point of each cluster and reiterate this idea until we have some kind of stable solution or we have done enough iterations or the changes between iterations are small enough so there are many possible stopping conditions here.
And the idea is here to minimize inter-cluster similarity and maximize intra-cluster similarity.
So we have some kind of mathematical idea presented in the lecture. So this is basically how it's implemented in the heuristic k-means algorithms. And under some conditions one can prove that the algorithm actually gives the best solution in terms of our optimization problem we formalized in the lecture.
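Not from the slides, but as a rough illustration, a plain-NumPy sketch of this heuristic k-means loop might look like this (smarter seeding and empty clusters are not handled here):

```python
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen documents as initial cluster centers (seeds).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign every document to the closest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        # Recompute each center as the mean of its assigned documents.
        new_centers = np.array([points[assignment == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):   # stable solution -> stop
            break
        centers = new_centers
    return assignment, centers
```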
Alright, next one was the dendrogram and how it can be used. So dendrogram basically is some kind of tree which depicts how similar different clusters are. So you start at the bottom level where you assume that every document in the collection forms a single cluster.
And then you have some measure of cluster similarity and then you're looking for the most similar pair of clusters. So here's the bottom level and here are all the initial clusters, one document each. And then for example you see that these two clusters are very, very similar
according to some measure of cluster similarity we had different kinds of them, those with advantages and disadvantages. And assuming that this is the most similar pair then we can join them here and assume they have a similarity of say 0.91 or something like this.
And then we can continue now comparing this cluster to all the others and again all clusters to each other one. And then we find the next most similar pair of clusters, this could be for example these two here with a similarity of say 0.85 like this.
And then we can go on until our whole collection got clustered in some way into a single cluster. And then we can decide where we want to break down our clustering into a reasonable compromise by drawing a line for example. And then we have one, two, three, four, five clusters, whatever number of clusters may lie on this line.
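If you want to play with this yourself, a minimal sketch using SciPy's hierarchical clustering (toy random vectors, average-link clustering on cosine distances; all parameter choices here are just for illustration) produces exactly this kind of dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

docs = np.random.rand(12, 5)                            # 12 toy "documents" in a 5-d space
Z = linkage(docs, method='average', metric='cosine')    # agglomerative clustering
dendrogram(Z)                                           # the tree visualization
plt.show()

# "Drawing a line" at some level corresponds to cutting the tree at a distance threshold:
labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)
```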
So it's simply a visualization of the hierarchical cluster structure in some kind of document collection, depending on the document similarity measure and on the similarity measure between clusters. So, okay, homework for this week.
So what's the underlying idea of Rocchio's algorithm for pseudo-relevance feedback? So the general idea here is that the user gets returned the list of results from some kind of retrieval algorithm, which is usually the vector space retrieval model.
And then the user is able to mark some of these results as being relevant to him or her or being not relevant. And this feedback is used to modify the initial query point or query vector used by the vector space model. So simply, the initial query vector is shifted away from the region of documents
marked as being non-relevant into the direction of the documents marked as being relevant. So this is the initial query vector and the user said that these documents are not relevant and these documents are all relevant.
Then the idea would be to create this vector and add it to the initial query vector in some way and then we, if we are lucky, get a better result.
So this is how it's done in Rocchio. Yes? Ah, no. Yeah, it's a mistake.
Rocchio is not pseudo, it's real relevance feedback, sorry.
Pseudo, yeah, it's not bound to probabilistic or vector space, it's not bound to any model. So relevance feedback just means the user gives some feedback about the results and then the algorithm incorporates this feedback in some meaningful way.
This could be by moving the query vector as in Rocchio's algorithm, it could be done by estimating some probabilities, probability terms in the probabilistic model. It could be done in any other way that fits the model used, so there are no general rules here how to do this.
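For the vector space case, a minimal sketch of Rocchio's query update might look like this; the weights alpha, beta and gamma are tuning parameters, and the default values below are just commonly used choices, not prescribed by the lecture:

```python
import numpy as np

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Shift the query vector towards the relevant documents and away from
    the non-relevant ones (all arguments are NumPy vectors / lists of vectors)."""
    q_new = alpha * query
    if len(relevant_docs):
        q_new += beta * np.mean(relevant_docs, axis=0)
    if len(nonrelevant_docs):
        q_new -= gamma * np.mean(nonrelevant_docs, axis=0)
    return q_new
```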
Pseudo relevance feedback of course is different because there is no user interaction. So now we really do have pseudo relevance feedback, and the idea here is to assume that the documents estimated as being most relevant by some algorithm really are relevant,
and then, assuming that the user would have just given this kind of feedback, we can try to give a better result. So it is quite reasonable because usually, if our query is general enough and we have a large document collection, we can assume that there are some kind of safe bets for the algorithm.
For example, the first three results are definitely relevant because they are so similar to the query that there's no point of discussing whether they could be non-relevant. So somehow it works. Usually pseudo relevance feedback has worked better in times where we had closed document collections like in libraries
but today with the spam problem, it's quite easy to get results at the top of the result list that are spam and therefore are not relevant. So pseudo relevance feedback is a bit harder in web retrieval scenarios than it was in library scenarios.
So what are main advantages and disadvantages? So disadvantages, I just told this and another disadvantage might be so-called topic drift. We have the apple example. So if the whole internet talks about the apple company and nobody talks about the apple tree, then you're most likely to shift your results into the company direction
and lose all information about the tree because it's so predominant. So the main advantage is of course that you are able to focus your results into some kind of main
direction and able to filter out results that are quite strange or not the most popular way of seeing things. So if you're looking for mainstream topics in your search, then usually you get good results with pseudo relevance feedback given that you have some way to detect spam and disable it in some way.
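A pseudo relevance feedback sketch on top of the Rocchio function above could then look like this; the search function is a hypothetical stand-in for any ranked retrieval run, and treating the top k results as relevant is exactly the "safe bet" assumption described above:

```python
def pseudo_relevance_feedback(query, search, k=3):
    """search(query) is assumed to return a ranked list of document vectors."""
    initial_results = search(query)
    assumed_relevant = initial_results[:k]     # treat the top k results as relevant
    new_query = rocchio(query, assumed_relevant, nonrelevant_docs=[])
    return search(new_query)                   # re-run with the shifted query
```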
So that's basically the idea of pseudo relevance feedback. So next one: we talked last week about classification methods in information retrieval, and now, what are typical applications? So we have seen some examples. We will see some more today.
So typical examples really might be spam detection. So simply assuming document collection can be split into spam and not spam and we just label some documents as training examples and use some algorithm to detect
for some unknown documents whether they are more likely to be spam or more likely to be real documents. This is usually done by comparing similarities. So if a new document is very similar to known spam documents, then chances are good that this new document is also spam.
So another application is for example classifying documents according to topics to support a more focused kind of search. For example if you have a search engine with 10 predefined big topics, for example sports and politics and computer stuff and
something else, then you are able to get a better understanding as a search engine what are documents about and try to estimate whether the user's query is about one of these topics and this could be used to focus or even sort the results. So these are typical applications of classification information retrieval.
Of course there are many, many more but usually that's what's done here. Okay, now, naive Bayes was a classification method we've seen in detail last week, so the idea here was to estimate the probability that some document is contained in some class.
So given that the document has some kind of known representation, for example containing some terms or having a certain length or whatever it might be, so we have a characterization of a new document which
is given and we want to know what the probability is that this document is contained in some predefined class. So we can then use Bayes' Theorem to swap around these probabilities and if we have it the
other way around, if we know the class membership and we have to make some probability statements about the characteristics of documents belonging to that class, then we can estimate these probabilities from some training collection.
Because in this training collection we know what documents belong to the class and what document doesn't belong to the class and then we can simply see what are the properties of these documents in the class and not in the class. And then by Bayes' Theorem we can use this information in a direct way to estimate this probability for new documents.
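As a rough sketch of this estimation and classification step (a simple multinomial-style variant with add-one smoothing; documents are assumed to be given as lists of terms, and the function names are made up):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (terms, class_label) pairs; returns priors and term counts."""
    priors = Counter()
    term_counts = defaultdict(Counter)
    for terms, label in labeled_docs:
        priors[label] += 1
        term_counts[label].update(terms)
    return priors, term_counts

def classify(terms, priors, term_counts, vocab_size):
    best_label, best_score = None, float('-inf')
    total_docs = sum(priors.values())
    for label in priors:
        # log P(class) + sum of log P(term | class), with add-one smoothing
        score = math.log(priors[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for t in terms:
            score += math.log((term_counts[label][t] + 1) / (total_terms + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [(["cheap", "pills", "buy"], "spam"), (["meeting", "tomorrow"], "ham")]
priors, counts = train_naive_bayes(docs)
print(classify(["buy", "pills"], priors, counts, vocab_size=5))   # -> 'spam'
```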
So that's basically an easy idea. So why it is called naive? It's basically called naive because it assumes that occurrence of different terms or different document properties is independent. So we've discussed this before in the vector space model and the probabilistic retrieval.
It's all the same problem. It makes your computation and your model really, really easy but of course synonyms are definitely not independent or antonyms or just semantically related words. For example cat and dog are very, very likely to occur together and are not really independent.
Or all terms from the area of animals or just any terms that are related in some topical sense have a higher probability of occurring together than they would have by just randomness or just by chance if we assume independence.
It is a problem. However, Bayes' classification and probabilistic retrieval all work quite well. So either this independence assumption is not a big problem at all or it just works as some kind of heuristic and the theory behind it doesn't really matter at all.
So nobody knows but it doesn't matter. There is some nice idea underlying these approaches and they work. So that's usually all that's interesting for information retrieval people. So the last one was adaptive boosting. So what is it? What is it used for and how does it work?
So adaptive boosting is some kind of meta technique for making a classification method better, and it's usually done by classifying in some kind of iterative process. So if we have some training set here, a training set with labeled examples, then we can train a classifier using this training set.
For example, if you take a Bayes classifier here, learn some probabilistic model, and then we take a look at what examples in the training set or what examples in some holdout test set are classified wrong by this model.
And then we weight the training examples to give wrongly classified examples a higher importance and correctly classified examples a lower importance. And then we can reclassify it all again, using Bayes for example, and now we have a strong importance on the mistakes of the first model.
And so we get a whole series of models that are all used to classify our data, and by reweighting correct
classifications and misclassified items, we are now able, by combining all these different models, to get a better classification overall. So the weaknesses of this classifier can be fixed by this one and by this one, and all together they make a better classifier, that's the idea.
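A rough sketch of this reweighting loop in the style of AdaBoost might look like this; train_classifier is a hypothetical stand-in for, say, the naive Bayes learner, and it has to accept per-example weights (X and y are NumPy arrays, labels are -1/+1):

```python
import numpy as np

def adaboost(X, y, train_classifier, rounds=10):
    """train_classifier(X, y, weights) returns a predict(X) function.
    Returns a combined classifier that takes a weighted vote of all models."""
    n = len(X)
    weights = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(rounds):
        model = train_classifier(X, y, weights)
        predictions = model(X)
        error = weights[predictions != y].sum()
        if error >= 0.5:                       # no better than chance -> stop
            break
        alpha = 0.5 * np.log((1 - error) / max(error, 1e-10))
        # Increase the weight of misclassified examples, decrease it for correct ones.
        weights *= np.exp(-alpha * y * predictions)
        weights /= weights.sum()
        models.append(model)
        alphas.append(alpha)

    def combined(X_new):
        votes = sum(a * m(X_new) for a, m in zip(alphas, models))
        return np.sign(votes)
    return combined
```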
So if you ever have a problem with some classification algorithm that doesn't seem to fit your data at hand, try adaptive boosting. Usually you can get a performance increase of 5 to 10 percent I would say.
Of course it depends on your document collection but usually adaptive boosting gives some advantage. Alright, now to the topic of today's lecture. It's again about supervised classification. Today we will see probably the most successful or the most advanced
or the best classification algorithm known to mankind, so-called support vector machines. And they are simply another approach for supervised classification. So support vector machines will be our topic today and to give a brief summary of last week.
Supervised classification simply means we are given some examples where some human has labeled these examples according to which classes or topics they belong to. And then we use this information to train some kind of model which can be used to classify new documents according to the human's manual classification.
So for example if some user said this is spam and this is not spam, you all know this from your email program, then we learn some kind of classifier from this information and are now able using this classifier to find out whether a new email is spam or not.
And with continuous feedback from the user we are able to make our classifier better and better over time. And we've seen these three algorithms and now we will talk about support vector machines.
So there are different kinds of support vector machines. We'll start with the most simple one, so-called linear support vector machines and then we will go one step further and show more advanced techniques and how it works.
Okay, let's start with a brief problem definition. We will make some assumptions which will make our life much easier in the next slides. We will do a little bit of math in the next 10 or 15 slides, but it's all not too difficult. And this is also because we made some assumptions that make the math a bit easier to understand.
So we assume that we have a binary classification task, so documents either belong to the class at hand or don't belong to the class at hand. And there's only one single dimension for classification, so spam or not spam or relevant or not relevant or belongs to the animals topic or doesn't belong to the animals topic, whatever.
And the second assumption, which is also quite common for all types of classification algorithms, is that any item, any document to be classified, can be represented as some kind of d-dimensional real vector. So we have a space consisting of d dimensions, usually many, many more than I can draw.
And each document is just a point in this space and each coordinate is some real number. So that's the usual approach to classification. And the task now is, find a linear classifier, so-called hyperplane, we will see an example soon, that divides the whole item space into two parts.
So on the left side, all documents or items belonging to the class and on the other side of this divider, all documents belonging not to the class. So we simply try to cut the space into two halves.
Okay, here's an example, a two-dimensional training set, so here could be our two axes and each document is a point in this space. So this could be the vector space, for example, whatever you like it to be. So the task then is, for linear classifiers, to separate the space by a straight line, for example, this one or this one or this one.
So there are many possible ways to do this and all these are linear classifiers.
So simply drawing a line or if you have a three-dimensional space using a plane or if you have a higher dimensional space then it's called a hyperplane. So it's basically some kind of linear geometric figure that divides the space into two halves.
So here again are some ways, so let's remove my painting to see the better one, different ways to do this. Next one, next one, next one, yeah. And now the question arises, so there are so many ways to do it, which one would be best?
So what's your intuition about this? Which line would be the best one to use? So I will go back a bit so we have a clean picture. So which line would be best when you want to build a linear classifier?
Okay, what about this one? Good idea? Okay, too close to the points, okay.
Yeah, what about this one? Isn't that great? Okay. This one? What about this one?
Yeah, I see we have the same intuition. So basically the idea is to find a line dividing the space that leaves room on both sides to make a safe classification. So for example it could easily be that there are very similar points in this class which we haven't observed in our training set.
And if we would use a classifier that is very, very close to our training examples, for example this one here, then it could easily happen that there is some kind of data point which is very, very close to the one class
which gets misclassified if we use the line which is too close to the class. So the general idea is to find a classifier which is somewhere in the middle, leave some space just to be on the safe side given our training data.
So what we will do on the next slides is to derive some mathematical approach, how to find a good line and what properties it should have. So usually when you work with support vector machines and you read something about it, you get a large formula with some kind of variables in it you won't be able to understand.
And the goal of this lecture is to give you an understanding of where this formal model that is currently used for support vector machines really comes from. So it's basically this idea here, it's totally easy but nobody tells you about it
because usually when you do support vector machines you use a formal definition of the problem that is easy to solve in terms of mathematical optimization but which is not easy to understand. So we will go the other way around, start with understanding and then arrive at the complex representation.
Okay, so as we have seen, somehow the idea of margin seems to be important for defining a good linear classifier or defining the quality of a linear classifier. So margin simply means the space between the linear classifier and the points it is close to or closest to.
So here using this linear classifier, the margin on this side would be this width here or this one because they are at this point and on the other side it might be this
and all taken together is the so-called margin. So the margin is the width by which the boundary could be increased without hitting a data point. So the more margin, the better the classifier is. Easy. So different examples, different margins, different ways to do it.
So and then we arrive at the notion of a so-called maximum margin classifier and a maximum margin classifier is the linear classifier that gives the maximum margin here. So and as we have seen, maximum margin seems to be a good idea
and so our task is to find a maximum margin classifier given some data set. Okay, so actually the maximum margin classifier is the simplest kind of support vector machine that one can use and it's also called linear support vector machine.
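In practice you would rarely implement this yourself; a minimal sketch with scikit-learn (assuming it is available) trains exactly such a linear maximum margin classifier on a small toy training set, with labels -1/+1 as in the formalization later on:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: 2-d points with labels -1 / +1 (made up for illustration).
X = np.array([[-1, 1], [-2, 2], [0, 3], [5, -1], [4, -2], [6, 0]])
z = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6)   # a very large C approximates hard-margin behaviour
clf.fit(X, z)

print(clf.coef_, clf.intercept_)    # the row vector w and the shift b
print(clf.support_vectors_)         # the training points touching the margin
print(clf.predict([[1, 1]]))        # classify a new point
```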
So support vector machines may sound complicated when you read about them or use libraries using them but essentially they are just about the idea of maximum margin classification. Okay, let's assume for now that there always is such a maximum margin classifier and that our data space always can be divided by a line or a plane or hyperplane into two areas.
So for example it could be that there is another data point or it could be that this data point is located here because we have made some errors in measurement or it's just some human who misclassified this point,
then we won't be able to find a maximum margin classifier because there is no way to draw a line between the blue and the green data points. So let's assume for the moment that there is a way to do this and later I will explain how we can cope with situations where we have some noise or some errors in the data.
Okay, mathematically one speaks of linear separability. This is just the assumption we're going to make for the moment. So another important concept are the so-called support vectors and that's where support vector machines get their name from.
Support vectors are exactly those data points that are pushing against the margin or touching the margin here. Yeah, maybe this one and this one and this one and this one. These are support points or if you want to see points as vectors,
it's also possible, then these are the so-called support vectors. And a support vector machine simply tries to find the maximum margin classifier, which is directly related to the support vectors.
Okay, we have already seen that maximum margin classification is some kind of very intuitive idea. So large margin, two classes, sounds good. But there are some more reasons why this should be a good idea. So we've also seen that the largest margin is intuitive
because it seems to guard best against small errors. So as I explained before, if we have small deviations in our initial data and if we have a line very close to the original data, it simply might happen that very, very similar data points
get misclassified by this line. So leaving space on both sides always gives some margin against errors, which is good. So another one which is really important is this here.
So this approach is quite robust against changes in the data. So let's go one slide back and I can see it. If I'm adding new training data points here which don't really seem to change much, then the maximum margin classifier or the linear support vector machine
doesn't change a bit, because it only depends on the support vectors. So the linear support vector machine is only defined by those points being most critical in terms of classification. So it doesn't look at points lying very close outside,
so it doesn't matter what points or how points look. It could be classified in a very safe way so that there's no doubt about the classification, but maximum margin classifiers concentrate only on those points that seem to be hardest to classify.
So this is also a good idea to do this. So there are also some theoretical arguments I won't go into now why maximum margin classification is a good idea. So actually there are people who write very, very thick books about theory of classification
and they're assuming that the data has some underlying probability distribution. And one can prove that maximum margin classification minimizes the error made under some assumptions. So even from a theoretical perspective, support vector machines are a good idea.
And probably what's most important for IR people, it just works well. So support vector machines usually are among the classifiers that work best for any given kind of data. Of course there are exceptions, but usually for real-world data support vector machines perform very close to the best available classifier
or just are the best available classifier. All right, so now we have to do some math and we want to formalize our problem how to find the maximum margin classifier because it's not a good approach to simply draw a picture and draw a line somewhere
because we need an algorithm doing that for us and an algorithm needs clear instructions. Okay, how to formalize our approach? So let's define our training data at first. So we assume that there are n training examples so n data points in our space that have been labeled by some human
into belonging to the class or not belonging to the class. And each training example now could be represented as a pair consisting of a component Yi and Zi. This is the ith training example
and Yi is the real vector representing the item to be classified or the document to be classified and Zi is just the label of the class. So minus one means this item doesn't belong to the class
and plus one means this item belongs to the class. So this is our training representation of first class and second class. It doesn't matter, so it's a binary kind of classification. So here's an example. For example, this point has coordinates minus one, one
and doesn't belong to the class and I've drawn a minus here in the picture and this one has coordinates five, minus one and belongs to the class and that's why it gets a plus here. So when doing maximum margin classification
we would expect the dividing line to be located somewhere here or maybe here. I have no idea. We have to calculate it which is what we are doing next. Okay, first of all, what's a valid linear separator?
So how can we find out which lines in space actually separate our data set into two pieces and for that we need to recall from linear algebra. Some of you may have learned this in school. Some of you may remember this from some elementary lectures
here at the university. So in general any hyperplane, so line, plane, or yeah, in more dimensional spaces called hyperplane can be defined by some kind of row vector, a d-dimensional row vector w and some scalar b, so some number b
and the idea simply is that the row vector determines the direction the hyperplane is perpendicular to. So this would be this vector w showing the direction which is perpendicular
to the hyperplane and the number, the scalar b is related to the shift of the hyperplane from the origin of the coordinate space. So in formal terms we could represent a hyperplane like this.
The hyperplane consists of all those points in the d-dimensional space for which this equation here x multiplied with our row vector, the scalar product here and plus b is exactly zero
and those are exactly those points lying on the hyperplane. So some of you might remember this and if we multiply out the scalar product then it's simply the product of all components of w and vector x multiplied and summed up plus b
and they should give zero. So this is the definition of a hyperplane and as I said w usually is a normal vector of the hyperplane that means it's perpendicular to it, so orthogonal in German.
So b is a shift from the origin of the coordinate system. And in our example, the hyperplane shown is just a line, and this line follows this equation, so all points x, y
satisfying this equation lie on the line here and all points giving a value larger than zero are located on this side of the hyperplane and all points giving a value smaller than zero
are located on this side of the hyperplane. So dividing space into two and have one equation to determine on which side the point is located. Quite easy. So we will talk about this only in two dimensional space in the following because for hyperplanes
and high dimensional space it's exactly the same but you get a larger scalar product. All right. Okay, so we can start by defining two constraints that every separating hyperplane in a maximum margin classifier must satisfy. So for any i, so for any training example we have
if the training example does not belong to the class so its label is minus one then the definition of the hyperplane so w and b in this equation must be smaller than zero, must be negative
so simply we say negative label means negative value of the location equation if you would say so. So on the other hand all positive examples have a positive value.
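As a tiny sketch of this check (the values of w and b below are hypothetical, just chosen so that they separate the two example points from the slide):

```python
import numpy as np

w = np.array([1.0, -1.0])   # hypothetical normal vector of a candidate hyperplane
b = -1.0                    # hypothetical shift from the origin

training = [(np.array([-1.0, 1.0]), -1),   # example point that does not belong to the class
            (np.array([5.0, -1.0]), +1)]   # example point that belongs to the class

for y_i, z_i in training:
    value = w @ y_i + b                    # < 0: one side, > 0: other side, = 0: on the hyperplane
    print(value, "separates correctly:", z_i * value > 0)
```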
So negative examples are here, positive examples are here. So of course still there are many, many ways to do this but no matter which plane we want to use they all have to satisfy these two constraints. So next we will define some more constraints
so that we finally end up with only one hyperplane namely the maximum margin hyperplane. Okay, so we've seen if some hyperplane or some line is a valid separating line of our data set then there must be some two scalars
r plus and r minus, which are both larger than zero, such that if we add r minus to or subtract r plus from our equation,
then we would exactly touch our support vectors. So no matter where we put our separating hyperplane there's some room to the left and some room to the right and if we extend our hyperplane in both directions
at some point we will touch one data point at each side and r minus simply specifies how much we need to extend our plane to the one side to meet the first negative example. This should be r minus in some way
and this is r plus. It's some kind of distance to the next positive example. And the margin would be just r minus plus r plus in some way.
So I should note that r plus and r minus are not just the margin width here. So if we would take this point and this point maybe a different color this point and this point and simply would calculate the Euclidean distance
between these points the distance would be different from r minus because r minus is simply some kind of shift and x and y also depending on their lengths contribute differently. So you might remember this from linear algebra. So this is basically some kind of shift constraint
using the equations but doesn't have to do something with actual length in the space. So we will see this later why this is true. Okay, now I'll go on. We now assume that we have some separating hyperplane for our data set
and we know the scalars r plus and r minus that give us the space to the left and to the right. And now we just want to get some normalized hyperplane that lies simply in the middle between our left support vectors and our right support vectors, because as we have seen
it doesn't make any sense to draw our line closer to the positive examples than to the negative ones or the other way around. We simply want to have our separating line in the middle of both classes or right between both classes. So we can do this by just taking r minus and r plus
as shift constraints to both directions, divide by two, and then use just this average shift as the shift for the two directions. So we use equal shift constraints here,
and we can simply define b prime to be b plus this average shift, and we see that this one is also a valid separating hyperplane, because it lies just in the middle between the two hyperplanes touching the positive and negative examples.
So this was our initial hyperplane which has been too close to the negative examples and then we look here and here and move this one right to the middle by redefining our shift constraint. And now we have a way to get a normalized hyperplane
that lies directly in the middle given any valid separating hyperplane. So now we can safely assume that we already have given a hyperplane that lies directly in the middle
and now we want to calculate how large the margin really is into both directions because in the end we want to find the hyperplane having the maximum margin. So we simply, the idea is to try all possible separating hyperplanes calculate the margin for each of them
and then simply take the hyperplane having the maximum margin. And for this we need a formula to calculate the margin and to do this easier we can again normalize our equation.
We simply divide w and our shift constant b prime by r prime, so that we get some kind of unit margin. Dividing the whole equation by a given positive number doesn't change the hyperplane at all, because on the right-hand side there is a zero, and zero divided by any number is still zero, and dividing the left-hand side of an equation by a positive number does not change its solutions. So let's do this kind of normalization.
We define w prime prime as w divided by r prime and b prime prime is just defined by b prime divided by r prime and then this hyperplane still is a valid separating hyperplane.
And the shift constant then is one, because here we have our new hyperplane, and the hyperplanes touching our support vectors are this hyperplane with one added and this hyperplane with one subtracted. So it's again a kind of normalization, and now, given any valid separating hyperplane, we are able to find a normalized one which has a unit shift to the left and to the right. Okay, let's put it all together.
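Since the slides themselves are not visible in the transcript, here is my reconstruction of the chain of steps just described, in the lecture's notation (points y_i, labels z_i); the exact form of the centering term is my own assumption about what was on the slide:

\[
\min_{i:\,z_i=+1}\,(w\cdot y_i + b) = r^+,
\qquad
\max_{i:\,z_i=-1}\,(w\cdot y_i + b) = -r^-
\]
\[
b' = b + \tfrac{r^- - r^+}{2}
\;\Rightarrow\;
\min_{i:\,z_i=+1}\,(w\cdot y_i + b') = \tfrac{r^+ + r^-}{2} = r',
\quad
\max_{i:\,z_i=-1}\,(w\cdot y_i + b') = -r'
\]
\[
w'' = w/r',\quad b'' = b'/r'
\;\Rightarrow\;
\min_{i:\,z_i=+1}\,(w''\cdot y_i + b'') = +1,
\quad
\max_{i:\,z_i=-1}\,(w''\cdot y_i + b'') = -1.
\]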
So if there is a valid separating hyperplane, then there is always a hyperplane that is still valid, whose margins to both sides, to the positive and the negative examples, have the same width, and whose bounding hyperplanes to the left and to the right are shifted away by exactly one. What's the benefit of doing this? As you have now seen, for any valid separating hyperplane there is such a special valid hyperplane, so we are now able to limit our search
for the best hyperplane only to those planes having or satisfying these constraints which will make our optimization problem a bit easier. So we will now forget about the class of all possible hyperplanes and we will now concentrate our search
on the hyperplanes having unit margin to both directions because there always is such a hyperplane. So and again when doing this it's a good idea to use a linear classifier that uses equal space to both sides. So this is some kind of formal step
before doing this. Now, since we want to search for the hyperplane having the maximum margin, we want to go through all possible hyperplanes and find the one with the maximum margin. Our search space is the set of all pairs of a normal vector w and a shift b, that is, all possible separating hyperplanes that split the space into a positive and a negative part. The constraint is that, exactly because we have done this normalization, at some point our hyperplane touches one negative example; this is exactly the point where the equation evaluates to minus one, because we introduced this shift of one. And we have the corresponding constraint that there is a positive example just one shift away on the other side. Now that we have specified these constraints, the question is what the margin of such a hyperplane really is.
This is what we do next, using some results from linear algebra. Let's assume we are given such a hyperplane satisfying this equation, where the hyperplanes touching the negative and the positive examples are shifted by minus one and plus one in each direction. Then we know from linear algebra that the Euclidean distance of the hyperplane to the origin of the coordinate space is exactly the absolute value of b divided by the length of the vector w; we won't prove this here, just take it as it is. Because of this, the distance from the one bounding hyperplane to the origin is the absolute value of b plus one divided by the length of w, and the distance of the other bounding hyperplane to the origin is the absolute value of b minus one divided by the length of w. From that we can see that the margin width is exactly two divided by the length of w, because b is the same in both distances and the two values only differ by two. Therefore the margin width is two divided by the length of the vector defining the hyperplane, and if we want to maximize the margin width subject to the constraints we have seen on the slide before, our goal must be to minimize the length of this vector: the margin is maximal exactly when the length of w is minimal among all admissible hyperplanes. A small numerical check of these distance formulas is sketched below.
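As a quick numerical check of these formulas (my own sketch, not part of the lecture; the values of w and b are arbitrary):

```python
# Distances of the bounding hyperplanes w.x + b = +1 and w.x + b = -1 to the
# origin, and the resulting margin width 2/||w||. With b = 2 the origin lies
# outside the margin, so the two distances differ by exactly the margin width.
import numpy as np

w = np.array([3.0, 4.0])   # length ||w|| = 5
b = 2.0

dist_plus  = abs(b - 1) / np.linalg.norm(w)   # hyperplane w.x + b = +1
dist_minus = abs(b + 1) / np.linalg.norm(w)   # hyperplane w.x + b = -1
margin     = 2.0 / np.linalg.norm(w)

print(dist_plus, dist_minus, margin)          # 0.2 0.6 0.4
```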
Then, finally, we get the following optimization problem, and its solution is the one hyperplane having the maximum margin in our space. We want to find parameters w and b, that is, a definition of a hyperplane dividing our space, that maximize the margin width and at the same time satisfy the constraints that the nearest positive example is exactly one shift away, the nearest negative example is exactly one shift away on the negative side, and there is one point touching each of these shifted hyperplanes, that is, one support vector on each side. As our goal is to maximize the margin, the margin will always be taken as big as possible, and this means that we don't need these last two constraints: if we maximize the margin, we will always make it so large that we have equality at this place anyway, because it never makes sense to look at a hyperplane whose distance to the nearest support vector is larger than one, so this can become an equality.
We don't need these two anymore. And then the problems get a bit simpler. We want to maximize the margin over these parameters in our search space, and the constraints simply are these two here. So for each training example, we want to have that the training examples,
the negative ones lie on the left side of our plane, and the distance should be at least one. And on the right side, all points should be, all positive examples should lie on the right side, and the distance should be at least one.
So there is enough space between our training examples and our dividing hyperplane. So next step would be to make this term a bit simpler. So if we want to maximize two divided by the length of W,
we could also minimize just this number, because that's essentially the same. If we have two solutions, and the first solution has a larger value here, then it will always have a smaller value when we just take this one.
So because it's a monotone function. So we could also, instead of minimizing just the length of W, of the vector W, we could minimize 0.5 times the length of W squared. So doesn't seem to make any sense here,
but the idea simply is, because this is a length, write it down, length of W as W is a vector, you remember, is always something like this, W1 squared plus W2 squared, so all the coordinates squared, and then the square root taken out of it.
So if we want to make calculations in algorithms a bit easier, then it's always a good idea to square this one to get rid of the square root, because taking the square root is always a quite expensive operation. And the optimization problem still stays the same, because squaring a positive number,
again, is a monotone operation, and this won't change our optimization problem. So then we could also add some factor 0.5, which doesn't change anything in our optimization problem, and actually there's no good reason why we should do this,
but in mathematical optimization theory, there is some kind of standard form how to define these kind of problems, and they usually put a 0.5 in front of their term to be minimized or maximized, and that's the only reason why we're doing this here. So to get some standard form,
which can be solved by existing algorithms. So next slide. This is now the problem we want to solve, so we want to look through all hyperplanes satisfying these constraints, and from these hyperplanes that satisfy these constraints,
we want to take the one that have the minimum value here, and this is a maximum margin hyperplane then. So to get a simpler representation, we could also combine these two constraints into a single one. We simply could put the z i here as a factor,
and don't need to distinguish between the two cases of negative and positive examples. Let's check this for the minus one case. For negative examples, the combined constraint reads minus one times, in parentheses, w times y i, the coordinates of the i-th training example, plus b, and then minus one, greater than or equal to zero. Multiplying out gives minus w y i minus b minus one greater than or equal to zero, and if we multiply the whole inequality by minus one and flip its direction, we get w times y i plus b less than or equal to minus one, which is exactly the constraint we had before for the negative examples. The same works for the positive examples,
so this is just a fancy way of expressing this distinction of two cases. So this is now our final optimization problem we need to solve when we want to find the maximum margin classifier given some training set, and it just consists of a single constraint
for each training point. So it looks pretty simple, but as you have seen in the last 13 slides, the derivation is a bit involved, even though every step is really, really simple, and this is just a compact way of expressing it,
but on the other hand, if I would have started with this expression, you would have never known what we are doing here, so I hope you now know, or at least have an intuition what we are doing here. Okay, this is our optimization problem. How can it be solved? So now we need some kind of numerical optimization algorithm,
so fortunately there is some discipline in mathematics which deals with so-called quadratic programming problems, QP for short, and there are many, many algorithms to find the optimal solution of problems that look like this, and this is also the reason why we added the 0.5 here,
that's because all the people in the QP area just define their problems this way. Okay, so now we are able to find the maximum margin hyperplane using standard algorithms. There are big books on how this can be done; it won't be the topic of this lecture. Just use some library for solving this kind of problem (a minimal sketch with a generic solver is shown below), or, if you have attended a lecture on numerical computation, it should have been covered there. Is anyone of you here who has some knowledge of QP problems?
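As a rough illustration (my own sketch, not the lecture's material, and not one of the specialized QP packages it refers to), the hard-margin problem, minimize 0.5 times the squared length of w subject to z_i (w . y_i + b) >= 1, can be handed to a generic constrained solver; the tiny toy data set here is made up.

```python
# Hard-margin SVM primal on a toy 2D data set, solved with SciPy's generic
# constrained minimizer (a real system would use a dedicated QP solver).
import numpy as np
from scipy.optimize import minimize

Y = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])  # points y_i
z = np.array([1.0, 1.0, -1.0, -1.0])                                # labels z_i

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)                 # 0.5 * ||w||^2

constraints = [
    # one inequality  z_i * (w . y_i + b) - 1 >= 0  per training example
    {"type": "ineq",
     "fun": (lambda p, i=i: z[i] * (np.dot(p[:2], Y[i]) + p[2]) - 1.0)}
    for i in range(len(z))
]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin width =", 2.0 / np.linalg.norm(w))
```

The point of the sketch is only to show the shape of the problem; dedicated QP algorithms exploit its special structure and scale much better.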
So then just trust me or look it up if you like to. It's quite complicated stuff, but these problems can be solved in a standard way. Okay, let's go on and then have a break. Nope, I think it's better to have a break.
We will meet again in five minutes. Okay, then let's go on. So we have finally arrived at an optimization problem which we can solve, and when we solve it, we find the maximum margin classifier,
which is the linear support vector machine we would use to classify a data set. So unfortunately, that's not the end of the story because mathematicians are quite clever, and they found out that if we want to solve this problem we've just seen,
it might be easier to solve a different problem which is equivalent to the one just shown. Yeah, we won't go into details here. It's just that you've seen it and heard the terms,
and don't get confused if you read something about it because this is usually the representation of the problem you will find in books. So they simply tell you support vector machines are a great idea, and you'll just have to solve this maximization problem and you have no idea where all this comes from. So the trick is called duality,
and each optimization problem, or each quadratic optimization problem, has a so-called dual representation, which can be derived by introducing Lagrange multipliers (we've already used them for Rocchio relevance feedback) and by applying some further transformations.
It's all quite complicated. I've done this myself some time ago, and I'm not a mathematician and found this really confusing, but actually I found out how it works, but you don't need to, if you want to look it up in a book, so the thing really is you should remember that there is some kind of dual form of the problem
to be solved which looks like this, and in this representation, we are going to maximize n variables. So alpha here is a vector consisting of n numbers,
n real numbers. In our initial formulation we wanted to find the d-dimensional vector representing our hyperplane and the shift, and now we want to find alphas. You should also remember that n was the number of training examples, so now we have one number per training example, and n is usually much smaller than the dimensionality d of the vector we would otherwise have to find, so the problem is usually simpler when we define it in this form. So the term to be maximized is this one here,
so we want to maximize the sum of all the alphas minus this product term here (it doesn't matter for now where it comes from), and the constraints to be satisfied are that all our alphas are greater than or equal to zero, so for every i, alpha i must be greater than or equal to zero, and that this sum of products is zero. This second constraint says something about how the training examples' labels and alphas are linearly combined. You arrive at this directly from the initial problem if you use the duality mechanism,
but it doesn't matter here how this is done. So it looks a bit more complicated, but usually it is easier to solve when we formulate the problem in this way. So one very important property of this optimization problem
is that you've seen each alpha variable corresponds to one training example, and any solution or any optimal solution to this optimization problem has the property that if in this solution, one of these variables is larger than zero,
then the corresponding training example is a support vector. So another important thing is that because for all non-support vectors, so all points in the training space lying far outside and they are not really important for defining the separating hyperplane,
for all these points, these variables are zero, and so we can just ignore all training examples that are not support vectors. So because all non-support vectors have zero values here, we can simply ignore all data points
that we can rule out as being support vectors, which also makes solving this optimization problem much easier. And as you've seen in our examples, usually most of the alphas are zero, so we usually don't have to deal with all n training examples, but just with a small fraction of them,
which also makes solving our problem a lot easier. Okay, when using this dual representation, we could also give a different representation of our training, of our classification function. So do you remember the classification function simply is the equation of the hyperplane,
and it returns a negative value for all data points lying on the left, the negative side of the space, and a positive value for all data points lying on the right, the positive side of the data space. When we use the dual representation, one can derive a similar classification function,
which is just this one here, and we take the sign of this value. So this directly corresponds to the distance in some way from the separating hyperplane, and if for some data point this distance is positive, then our classification function returns plus one,
and for all the other points, the classification function returns minus one. So this classification function is used to classify new data points. X is a new data point, I put it into this function, and then calculate this value here,
so the alpha i's are known from the solution of our optimization problem, which we've just seen, the y i's are the coordinates of our support vectors, the z i's are their labels, positive or negative, x is, as I said, the vector space representation of our new data point, and b is a constant that can also be computed directly from the solution of the optimization problem. So once we have found a solution to the dual optimization problem and obtained all these alphas, we are able to directly derive
a classification function from these alphas, and again, the benefit from using the dual representation is that we only need to look at the support vectors, because all summands that are zero, all non-support vectors, do not contribute to our classification function, and also they do not contribute
for calculating our shift constant b. So using the dual representation makes classification a lot easier. So this is what I said. And you should also note that our classification function, and also the definition of b,
also only depends on scalar products between data points and support vector points. We don't need to do any other computations with the data point to be classified than comparing it to a series of support vector points by taking scalar products. This is the case here, and in this term we don't even need the data point itself. So this is an important property, a key feature, which we will use very, very soon; a small sketch of this classification function follows below.
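As a small illustration (my own sketch, not code shown in the lecture), the classification function just described, the sign of the sum over support vectors of alpha_i times z_i times the scalar product of y_i with x, plus b, looks like this; the alphas, support vectors, labels and b are assumed to come from a solved dual problem.

```python
# Dual-form classification: f(x) = sign( sum_i alpha_i * z_i * <y_i, x> + b ).
# Only support vectors need to be passed in, since all other alphas are zero.
import numpy as np

def classify(x, alphas, support_vectors, labels, b):
    score = sum(a * z * np.dot(y, x)
                for a, z, y in zip(alphas, labels, support_vectors))
    return 1 if score + b >= 0 else -1
```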
So we only need scalar products and we only need support vector points. Okay, yeah, now we know how we solve classification problems which are linearly separable. This was the assumption at the beginning, but you know, real life isn't that beautiful
as you would like to have it usually. But it might easily happen that some data point might be located here, right in the middle of our positive examples, or the negative examples, and we have no chance to draw a separating hyperplane between the two classes.
So this won't be a valid hyperplane because it misclassifies this example. So now we need some way to cope with these exceptions in a reasonable way. Okay, what we can do now are so-called soft margins. So we do not assume that the classification is crisp,
that all positive examples are on the right side and all negative examples are on the left side, but we assume that the classifier may make some mistakes when drawing the hyperplane. So you don't need to divide the space into two crisp halves, but there might be training examples
being on the wrong side. Of course, when a training example is on the wrong side of the plane, this should introduce some kind of classification cost or classification error which should be used to find the best hyperplane used for separating the data.
So the best classification, so the best hyperplane separating our data space then is a hyperplane having a margin that is as big as possible and at the same time makes as little errors as possible.
Of course, these are two factors you have to weigh against each other in some way; the goal is to find a good compromise between the width of the margin and the errors in classification. The error for a misclassified training example is simply the distance from the hyperplane by which it would have to be shifted to become correctly classified, so it introduces a cost that behaves very much like a distance. So now we can extend our optimization problem a little bit
by so-called slack variables. These are these beta one to beta n. So for each training example, we assume that the training example can be shifted a little bit to the left or to the right
and each misclassified training example is then shifted into the right direction of the space such that we really get a correct classification. Okay, the formal definition is as follows. We now have these beta variables as parameters to be optimized. So again, this is our initial optimization problem and not the dual problem. Now we want to minimize the margin term, 0.5 times the squared length of w, plus a constant C multiplied with the sum of our slack variables.
C is a constant that is chosen in advance and that is used to weigh the margin width against the errors we make in classification. If C is large, then avoiding classification errors is much more important than having a large margin, and if C is small, then having a large margin is more important than avoiding errors. Okay, then we have to adjust our constraints. The beta i should all be positive or zero,
that is, they are positive shifts. This was the constraint saying that all examples should be at least one shift away from our separating hyperplane, on their correct side, and now we are loosening this constraint a little bit: the distance no longer has to be one, it can also be smaller or even negative, and the difference is then compensated by beta i.
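Written out in one place (my reconstruction of the slide, using the lecture's notation with labels z_i, points y_i and slack variables beta_i), the relaxed problem is:

\[
\min_{w,\,b,\,\beta}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\beta_i
\qquad\text{subject to}\qquad
z_i\,(w\cdot y_i + b)\ \ge\ 1-\beta_i,
\qquad \beta_i \ge 0\quad\text{for all } i.
\]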
Let's go back to an example. If this is our hyperplane, again with a shift of one to this side and a shift of one to that side, we won't be able to get a shift of one for this data point, because the distance here is already, say, five. Then, if this is our i-th training example, we need to set beta i equal to five to allow a margin correction of size five, and we now treat this data point as if it were located at this position, introducing a penalty for shifting it. Now we are able to use our hyperplane, because it makes a correct classification: this point is now located here, we shifted it, but for the shifting we introduced a cost of five, and that is all that is happening here. So if the i-th data point gets misclassified with a shift of beta i, then the price we pay for having to use this shift
is the penalty constant C multiplied by the strength of this shift. So as I said, C is some positive constant which regulates how expensive errors should be treated in our optimization problem. Again, if C is large, then making classification errors
is a significant factor in our value to be minimized. So if C is large, then we try to avoid errors at any cost possible. So in the extreme case, we could choose C equal to infinity and then we would have our initial optimization problem
and then our problem would only be solvable if there is a hyperplane separating the data as it is. So by introducing a C which is a number different from infinity, we are able to allow for errors
and rate them accordingly. Yes, exactly, for every training example, we have now this constraint. This has been zero before in our initial formulation and all training points that can be classified correctly
exactly in the way we did it before, will have this beta equal to zero, because it doesn't make sense to choose a value larger than zero; that would only introduce a penalty. And for all those data points that cannot be classified correctly, we are forced to take a beta value larger than zero, and thus we introduce some penalty here, and then we have to weigh margin width against penalty and arrive at a solution that compromises on both. Okay, with soft margins,
we don't need to assume linear separability anymore and again, one can formulate a corresponding dual problem and which looks simply like this. So you might remember our original dual problem simply doesn't have this constraint here and now we simply assume
our alpha values lie between zero and C, so C is just an upper bound on our alphas. It's not very intuitive why there is an upper bound in the dual problem now, but it simply is the way it is,
and this is also a positive thing for using the dual problem formulation. It really stays easy and doesn't introduce any more parameters. It's just a slightly changed constraint when we use soft margin. So and again, still for these kind of problems,
there are algorithms available that can find solutions efficiently. Okay, now we're able to classify any data set we have using a linear classifier and we also can account for noise in the data or just situations that are not so clear
that you could use a straight line. Okay, next question is, we assumed that we only have a binary classification problem. So spam versus not spam or relevant versus not relevant. What happens if there are more than two classes?
So in naive Bayes, for example, or in k-nearest neighbor, it's really easy to use different classes. For example, use a space where you have sports and politics and something else and you just use three classes.
So support vector machines are in some way bound to a binary classification but there are methods to extend them to multiple classes. So one idea is to, for each class, train a classifier.
Yes. For example, we have three classes: this is sports, this is politics, and this is science, and we assume that each document belongs to exactly one of these classes. This is a problem we cannot solve with a single support vector machine, because a support vector machine can only make a distinction between two classes. What we can still do is train a classifier for each of these three classes: we build a classifier for sport versus non-sport.
We build a support vector machine for politics versus non-politics and we build a support vector machine for science versus non-science. And then for each new document we want to classify, we use each of the three classifiers
and check the margin of the new training point. So if this is our space and this is our sport classifier, so here's sport and here's non-sport and our new document we want to classify is located here, then this would be the margin for this new document
with respect to sport, for example, margin of six. And then let's say we have a different classifier, this is our politics classifier, here's politics and here is non-politics.
So now let's take it a bit differently to give a better example. This is our politics classifier, here's politics and here's non-politics. Then again, we can use this classifier to determine whether our new document is politics or not politics and we will find out, for example, this document is politics
but with a much smaller margin than it is sports. Two, for example, we do the same for a science classifier and then we assign the document to the class where the margin is largest. So in this case, we would assign the new document to the sports class because it's sports
with a margin of six, and it is only politics with a margin of two. So it's more likely to be sports than it is politics, all assuming that each document can only belong to a single class. A small sketch of this one-versus-rest decision rule follows below.
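A minimal sketch (my own, with made-up linear scoring functions) of the one-versus-rest decision rule just described: one binary classifier per class, and a new document is assigned to the class whose classifier reports the largest margin.

```python
# One-versus-rest: each class has its own scoring function w.x + b; a new
# document x is assigned to the class with the largest (signed) score.
import numpy as np

def one_vs_rest_predict(x, scorers):
    return max(scorers, key=lambda label: scorers[label](x))

# hypothetical scorers standing in for three trained SVMs
scorers = {
    "sports":   lambda x: np.dot(np.array([1.0, 0.2]), x) - 0.5,
    "politics": lambda x: np.dot(np.array([-0.3, 1.0]), x) + 0.1,
    "science":  lambda x: np.dot(np.array([0.1, -0.8]), x) + 0.3,
}
print(one_vs_rest_predict(np.array([4.0, 1.0]), scorers))   # "sports"
```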
Okay, then what we can also do is build one-versus-one classifiers. Instead of training sport versus non-sport, we will train a classifier for sport versus politics, one for sport versus science, and one for politics versus science.
And then we again use all these classifiers to classify a new document, count how often it gets assigned to each class, and simply take the class chosen by most of these one-versus-one classifiers. Of course, for k classes you need k times (k minus 1) divided by two such classifiers, so 45 for 10 classes and already 4950 for 100 classes. This is the number of support vector machines you would need, and as you can imagine, if you have a large number of classes, then this is not a good approach, and then
you would stick to this one here. What's also possible are so-called multi-class support vector machines which are quite complicated extension of this binary support vector machine concept where you use multiple hyperplanes at once and combine them in some ways.
We will not talk about this in this course. So there are ways to deal with support vector machines in a multi-class setting. There are different ones you have to decide for yourself which one you want to use. Okay, the next part of this lecture, so-called non-linear support vector machines.
Yes. Yeah, you could also, you could also retrain the support vector machine with each new document you have and that has been classified by the user but usually you take a training set that is large enough,
use this training set to create your classifier, create your hyperplane and then stick to this hyperplane for classifying all new documents. So of course it makes sense after some time to reevaluate your support vector machine
given the new information you have but this is how it's usually done. Then you can't use this information
because you don't have any. There is a concept called transductive support vector machine which tries to use the data to be classified. So then the problem becomes you have a training set with labeled examples and you have a set of documents
from which you only know the coordinates, and the idea is to look at patterns in the space, for example with some clustering technique, to estimate what the correct labels of these unlabeled examples could be.
So it's rather complicated and it doesn't work too well I would say. Sometimes it is a little bit better than a normal support vector machine but usually you can only use the training data set that has been manually labeled as data to build your classifier. And that's the same for all types of classification
because in supervised classification you need a training set that consists of a document representation and for each document a correct label or a label known to be correct.
It's a different situation. In clustering you usually don't have any hints what might be correct information. So one could also think of a supervised clustering approach
where someone gives you a reference clustering that has been created manually and from this clustering a more general clustering of the space has to be derived that also could be a problem. So actually these are different problems. I think two weeks ago we had this distinction into different classification
or different machine learning tasks. One was supervised where you provide correct information to the system which can be used to learn something. Then you have semi-supervised setting where you give some correct information and some additional information.
For example, the data to be classified which also could be used if you analyze how this data is located in space to learn a better classifier or a better clustering or a better model. And you have these unsupervised classification where you don't have any information which usually is true in clustering
because you don't know what should be correct clusters. You only have some coordinate representation of your document that you have no idea what to do. You can just use some heuristics and assume that some kind of similarity measure is good in some way or resembles human similarity judgment.
This is all you can do. Sometimes if you work with documents and you want to make classification, you can use the document content. For example, if you want to classify all politic documents, politics versus non-politics, you could try to, the user could give you
some politic terms. For example, politician, minister and something else and then you can try to analyze using co-occurrence of terms but also could be other politic terms and use this information to do classification. So there's a huge range of how classification or clustering can be done
and the extreme points are just you learn from data known to be correct and the other extreme is you don't have any information, you just have the data and do whatever you like and how good you can do it but there's no guarantee. So the most clear setting is supervised learning
but there's a large space in between. So there are also approaches where the classifier is able to ask the user for additional information. For example, if the classifier identifies a set of especially vague examples which seem to be critical and it doesn't seem to be clear
whether this should be classified into one class or the other, then the system could ask the user for more information. This is also possible. So again, that's a huge range of ways to do it. In this lecture, we are only dealing with the most simple one and in my opinion, these are complicated enough usually. So if you're interested in classification,
get a book about machine learning. That's usually the major term spanning over all these approaches. There's a lot of theory, there's a lot of different approaches how to do this. So machine learning is the way you want to go then. All right, okay. Nonlinear support vector machines.
So we have assumed that data always is linearly separable or in cases where it is not linearly separable, then this is because noise in the data or some classification errors but sometimes it might be true
that the data itself isn't linearly separable. For example here, it might be perfectly reasonable to assume that one topic or one class is located in the middle of the space and all negative examples are just outside. So there's no way to describe this situation using a line.
So it doesn't make sense to divide here, because there would be misclassifications here, and it doesn't make sense to do it here or here either. The only way to do it would be to have two separating lines, separating the middle from the outside.
So of course, this is not linear and the question is how to do it. So how to do nonlinear classification. Idea is quite simple how to do it. If we have this situation, then we transform our data into a higher dimensional space.
So in this example, we have a one dimensional data space. Each training example is described by a single value on a single line and if we transform this data into a two dimensional space, so for example, by giving the values that lie on the outside
a high second coordinate and the ones in the middle a low coordinate, then we could arrive at this representation and here we can easily draw a line to separate the positive from the negative examples.
So and this is the whole idea underlying nonlinear support vector machines. There's always a transformation step bringing the data into a higher dimensional space in which we can use a linear classification. So the question is how to do this transformation,
what is a good transformation and what is not and how to deal with data in very high dimensional spaces. So in some cases, these spaces might have 10,000 dimensions or even more. Doesn't sound too handy. So here's some kind of visualization again of this nonlinear classification.
I hope it works but obviously it doesn't. Therefore, we will open the video.
All right, this is how it works. So here in the middle of the circle, we have the positive examples and on the outside, we have all the negative examples. What we want to do is have a classification circle inside versus outside. And now comes the transformation. We are now mapping our two dimensional space
in a three dimensional space by introducing a third dimension. And now we can use a hyperplane or plane in this case in this three dimensional space to separate the positive from the negative examples. And if we again break down this linear cut
into our original two dimensional space, then we finally arrive at a circle which divides outside from the inside. So this is how this transformation step usually works. Quite simple idea. And the good thing is that it stays quite simple
because of the good properties of the dual problem we just discussed. So of course, when we work in high dimensional spaces, when we do this transformation step, then computing the transformation
and then doing the linear classification in a very, very high dimensional space might be very difficult. So it would be very good if we don't have to do this transformation explicitly. So we just, yeah, so the step is as follows. Here's our original data, original training data,
which is not linearly separable. This is our transformed data in the new space which is linearly separable and the support vector machine can only work directly on this data. So what we usually would do is first transform the data
and then work in the transformed space with the support vector machine. But because the transformation might be expensive for large data sets and high-dimensional spaces, what we are actually going to do is work directly in the original space while doing all computations as if we were working in the transformed space.
So we are looking for some kind of implicit transformation that is computationally efficient. And fortunately there is a way to do this. And this again depends on the fact
that we only need to deal with scalar products when we define our support vector problem as a dual problem. So we only need to deal with scalar products of our original coordinates of our support vectors and we only need to compute scalar products of new data points with support vectors.
So we don't have to make any other computations. And if we are now able to compute the scalar product of the transformed space directly then we just have to change this part
of the optimization problem and the rest still stays the same. This is called the kernel trick. So a kernel is some kind of mathematical function with some special properties. Have you ever heard of kernels?
Very good, so then you know all of it. We don't go into details here how this really works. But the idea simply is if you use kernels then you are able to compute scalar products in a transformed space in a very, very simple way. So let's assume that H is some function
that maps our original data from some D dimensional space in some D prime dimensional space. So typically D prime is much larger than D. And then if we want to solve our optimization problem we would transform all our data points
and all our new points to be classified, and then just formulate our optimization problem in terms of the new space. So every y i becomes H of y i, and every x becomes H of x.
And again you can see that we still need to only compute the scalar product of these transformed vectors. Scalar product here, scalar product here, scalar product here. And if we are able to compute the scalar product efficiently in the new space
without having to perform this transformation then this can be done really, really quick when solving the optimization problem. So if our transformation, and that's a good thing about it, has some, is of some special type,
so a so-called kernel function. So there are many, many possible kernel functions and all have the property that you can compute the scalar product of two transformed vectors using only the original coordinates
in the original space. We will see an example now: a polynomial kernel transformation of the second degree. Our original space is two-dimensional, and we can define a mapping into a six-dimensional space by saying that the first coordinate is always one, the second coordinate is the square root of two times the original first coordinate, and so on. The interesting property is the following: if we are given two vectors x and x prime in our original space and need the scalar product of their images in the transformed space, then this is simply the scalar product in the original space, plus one, and the whole thing squared. So we don't need to compute the transformation into the new space itself; for computing the scalar product in the transformed space, we can rely on simple computations involving only the original coordinates. This is all the kernel trick is about. There's a whole mathematical theory behind it, but that's essentially the idea: compute scalar products in the transformed space efficiently, because that is all we need to do when dealing with support vector machines. A tiny numerical check of this identity is sketched below.
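Here is that check as a short sketch of my own (the specific vectors are arbitrary): the explicit six-dimensional map H and the kernel (x . x' + 1) squared give the same scalar product.

```python
# Degree-2 polynomial kernel: H(x).H(x') computed explicitly in the
# transformed space equals (x.x' + 1)**2 computed in the original space.
import numpy as np

def H(x):
    # explicit feature map for a 2D input, degree-2 polynomial kernel
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([1.5, -0.5]), np.array([2.0, 3.0])
lhs = np.dot(H(x), H(xp))          # scalar product in the transformed space
rhs = (np.dot(x, xp) + 1.0) ** 2   # kernel evaluated in the original space
print(lhs, rhs)                    # both print the same value
```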
Okay, I have a demonstration of support vector machines. There are a lot of Java applets available on the web illustrating how support vector machines work; this is one of them, and I encourage you to play around with it at home to get a better understanding.
So here we can define different kernels. We want to use, for example, a polynomial kernel as in our example. For example, a degree of five. And here we can just place our example data. So these are our positive examples.
And unfortunately, our negative examples lie just in the middle of all the positive examples. So there won't be any way to derive a linear, or a line separating the space and we need a higher dimensional or nonlinear classification
in this example, a polynomial kernel, which is now computed. And this is what a polynomial kernel would do. So these are the support vectors. Three support vectors on this side. Here one, another one, another one, another one.
Five support vectors on the other side. This is how it is done. So again, think of this as a transformation in some high dimensional space where we can do a linear separation and projecting it down to the two dimensional space and so the separating hyperplane becomes some kind of curve.
So this is a polynomial curve. Another popular choice are so-called radial basis function. Many, many training examples. And again, looks quite differently.
Radial basis functions usually draw some kind of closed curves, like the one here. So this is closed and goes along this way. In the polynomial setting, you usually have large areas,
for example here, starting here, going around, and being open towards the other side, whereas radial basis functions usually only draw closed curves. There's a difference between these two. Hmm?
Good question. This is also a property of these radial basis functions. They are, it's hard to imagine. In some way, this area is continued somewhere on this side here.
So when you have the transformed space, then this is connected to an area lying on this side, which you can see in the picture now because it's only a very small view of the whole space, but there is some area being located here,
which is the same as this one here and therefore, we have here a number of support vectors. Isn't very intuitive, so there's always a problem when drawing these things in the original space, but for purposes of classification, it works pretty well.
So of course, the problem is, if you now have data points lying here, they are likely to be misclassified or lying here because this could be an area already belonging again to this class. So it isn't perfect, but usually, if you have a representative sample of your data,
if you have good training data, then using support vector machines with kernels usually brings very, very good classification performance. Okay, this is how it works, how the kernel trick works.
So now to the question, how support vector machines work in information retrieval. So I will now show some small examples to prove this because classification is a quite general problem and it's just applied in information retrieval.
So of course, one important area of application is text classification. As I said, politics versus sports versus something else. This could be, for example, important for libraries, which have to index or classify new books they get and if they have the books in a digital form,
they could use an automatic classification system to derive some classes the book might belong to, given some training data, and then the librarian can decide whether this classification is the one they want to use or just make small changes, but doesn't have to do all the classification by himself.
So this could be an application. Spam detection, as I said, is also very, very popular. As the document representation for support vector machines, you can use the standard vector space model as we have seen it. You could also add some additional features, some additional dimensions, like for example the document length. If you have some reason to think that long documents tend to belong to a different class than short ones, for example if you observe that in your collection spam documents tend to be very long or very short,
then adding the document length to the document representation might be a good idea. So the same is true for other derived features. You could also use an LSI representation for classification purposes.
Anything is possible. The only thing you need is some kind of feature representation of your documents, so that each document is a point in space and documents belonging to the same class tend to have similar properties. The support vector machine will then figure out what is really meant by similar
and then derive a classification model from it. So the dimensionality, of course, when using the vector space model is quite high because every term then becomes an axis in your space. This usually is not a big problem
because for most documents only very few axes are different from zero; no document contains all terms of your collection. So usually each document is represented by a vector with a few non-zero entries and many, many zeros. This is a situation support vector machines can handle: there are algorithms that simply skip the zero entries and still find the optimal solution; a small sketch of such a sparse representation follows below.
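As a small illustration of my own (not from the lecture), a sparse bag-of-words representation can be kept as a mapping from term to weight, with all missing terms implicitly zero; extra dimensions such as document length could be added as additional keys in the same way.

```python
# Sparse bag-of-words representation: each document is a dict from term to
# weight, and the scalar product only touches terms both documents contain.
def document_vector(text):
    vector = {}
    for term in text.lower().split():
        vector[term] = vector.get(term, 0) + 1   # raw term frequency
    return vector

def sparse_dot(u, v):
    if len(u) > len(v):
        u, v = v, u                              # iterate over the smaller dict
    return sum(weight * v[term] for term, weight in u.items() if term in v)

d1 = document_vector("cheap viagra cheap pills")
d2 = document_vector("viagra prices and pills")
print(sparse_dot(d1, d2))                        # 2
```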
So support vector machines can deal with sparse data very well. Here's an example of classification performance, from 1998, where experiments on the Reuters collection were performed. Remember, the Reuters collection is a set of news flashes from the Reuters news agency, and for some categories in the data set, positive and negative examples have been collected. The classification algorithm has been trained, and then a test set of new documents
has been classified using this algorithm and then it has been evaluated how good the classification was in terms of precision and recall and each value you see in the table below is the F measure of performance. So 100% means perfect classification, zero means very bad classification.
So these are the categories which have been tried, and these are the algorithms: this is naive Bayes, this is Rocchio classification, these are decision trees, another way of performing classification, this is k-nearest neighbor, and these are different forms of support vector machines.
Two are linear support vector machines using two different penalty constants for misclassification, and one is a support vector machine using a radial basis function kernel with a parameter here, which doesn't have to bother you. And they found out, if you look at the average over all these classes, that the classification performance of the support vector machines was on average somewhere around 68 to 78 percent, and that the other algorithms performed significantly worse. Rocchio has been quite good, of course,
and so has k-nearest neighbor, but support vector machines performed very, very well here, and this was quite surprising at the time, because around 1998 support vector machines were a rather new technique. So the community was very happy to have a technique
that really, really performs well compared to other techniques and can be applied also to document collections. Of course if you look at different categories there might be situations where support vector machines perform worse than other techniques. So for example here for the interest category
support vector machines are around 70% or 75% at best and k-nearest neighbor is better in this example, but on the whole, when taking the average, support vector machines proved to be better.
This doesn't have to be always the case of course it depends on the collection you have. It depends on the representation of your documents what kinds of space you're using but usually support vector machines perform very, very well. Okay and another application of support vector machines
is called learning to rank. This is also a quite new topic in information retrieval, and here a special type of support vector machine is used, so-called ranking support vector machines. Here again the training set consists of documents, but now you have n pairs of documents, and the idea is that a training pair of documents y i and y i prime expresses that document y i is preferred to, is better than, y i prime with respect to some query. So for example, if you have a ranked list of results for some query, you might derive some pairs from it, saying for example that this result is better than this one. For example, if you're looking for Viagra, then the Wikipedia Viagra entry might be a better result
than some spam page and the Viagra page might be better than the manufacturer's official page about the product. And of course the manufacturer's official page is a better result than some spam page.
So here you would have three pairs. Wikipedia is better than spam. Wikipedia is better than the manufacturer's page. And the manufacturer's page is better than the spam page. Three pairs, three training examples. And again, every example is represented
as a point in space, and this information is then used to learn a ranking function. With this training data, the idea is that we become able to decide for every new document whether it is better or worse than any other document. So the task is to find a ranking function
that assigns a numerical score to each document. So the idea is that if the score of one document is higher than the score of another document, then the first document is preferred to the second one, is a better result, fits the query better in some way.
A straightforward approach is to limit yourself to a special kind of ranking function: linear ranking functions, which simply take the scalar product of the document's vector space representation with some fixed vector.
And scalar products always have to do something with support vector machines. And here again, one can formulate this problem as a support vector machine task. Again, we want to find a hyperplane with maximum margin such that the score of the better training example,
that is, the scalar product of the hyperplane's normal vector with the representation of the document, is larger than the score of the second document. And here we enforce a standard margin, so that if we say that one document
is better than another, then the constraint is that the difference in scores must be at least one. Again, we can define a list of constraints from this; each constraint is equivalent to this formulation here: some vector to be determined, multiplied in a scalar fashion with the difference of two documents, which again is a vector, a point in space. And this is exactly the form we used in the original support vector machine formulation, only that now each training example is a pair, and each pair gets treated as a single point by subtracting the two documents from each other. So we simply have the same support vector machine formulation as before and can solve it the way we did before, with soft margins or with nonlinear scoring functions; a small sketch of this pair construction follows below.
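A minimal sketch of my own (the feature vectors are made up): every preference "a is better than b" becomes one difference vector a - b with a positive label, and these difference vectors are fed to the standard (soft-margin) SVM machinery.

```python
# Pair construction for ranking SVMs: preferences become difference vectors.
import numpy as np

# hypothetical feature vectors of three results for one query
wikipedia    = np.array([0.9, 0.1, 0.8])
manufacturer = np.array([0.6, 0.2, 0.5])
spam_page    = np.array([0.1, 0.9, 0.0])

preferences = [(wikipedia, manufacturer),
               (wikipedia, spam_page),
               (manufacturer, spam_page)]      # (better, worse) pairs

# training points for the constraints  w.(better - worse) >= 1 - slack
X = np.array([better - worse for better, worse in preferences])
z = np.ones(len(X))                            # every difference gets label +1
print(X)
```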
And again, we can use this to increase result quality. So again, Joachims, who is one of the support vector machine people in information retrieval and has done a lot of work there, also discussed the question where
these preference pairs come from. We need to learn a ranking function. And his idea was that if users get returned from a search engine a list of results, then users tend to linearly scan through these results from top to bottom.
And the users click on those results, they think they are relevant. So for example, if a user gets this result here and his first click is on the Wikipedia page, then we can assume that the Wikipedia page is better than this page, better than this page and better than this page.
It might be true that all these three pages still are relevant in a binary sense, but the user feels that this result is the best one. And the search engine should, in future retrieval task,
in future queries, should put this result at the top because it seems to be more relevant than the other three. This is the whole idea. Just observe what other users are doing, learn a ranking function from this feedback, and if some other users ask the same query,
in this case Viagra, we could use this ranking function to optimize the ranking in the future, all with the power of support vector machines. So again: compute an initial result, let the user click on something, learn a ranking function,
and then use the ranking function to create a better ranking in the future. Of course, one could also use the ranking function
to compute the initial result. If we already have feedback from other users, then of course we would present a result that already includes user feedback. Yeah, this is a link. I think you should take a look at yourself.
This is just a list of applications of support vector machines. Actually, support vector machines are used in many, many different areas of science: in biology, chemistry, geographic problems. Support vector machines are a kind of universal tool for many, many different classification tasks.
This is a page of someone who collected, who collected a large list of applications. Take a look at it, might be interesting. Okay, finally, we have a detour about recognizing handwritten digits. This is very important for mail companies.
And does anyone know the correct term? So in Germany, it's Deutsche Post, mail carriers. Is this the correct term? Briefzusteller, DHL, something like this.
So any kind of logistics company, I don't know how to call it in English. But they usually have the problem. You have zip codes on your letters. And these zip codes are the most important information for distributing the letters. Because depending on the zip code,
the letter either goes into this bin or this bin or into this truck or this truck and gets distributed into different directions of Germany or the world. So the first information one needs to know is the zip code written on the letter. And of course, when dealing with zip codes, you need to read the zip codes.
And since they are often handwritten, you need to find out whether this is a three or a nine or something else or this is a seven or a two or a zero. And it would be a very good thing if some kind of machine would be able to read the zip codes and translate it
into a machine readable form. So if you get a letter, usually there is this bar code on the bottom of the letter, where zip code and other information already has been transformed into a machine readable way. And then the distribution machines
in the big logistics centers of the mail companies just reading these bar codes when dealing with the letters and distribute the mail where it belongs to. Okay, this is a problem. And this data set is a popular data set
in pattern recognition. You just, they have just taken a large collection of real zip code numbers or zip code digits which have been handwritten. And then some people manually annotated all these numbers. And the task is to learn a classifier from this data
that is able to decide for new digits which digits it is. So again, we can directly derive a space representation from this data because every digit is a two dimensional image.
So each pixel is either one or zero and has a coordinate, and we simply treat each pixel as one dimension of the space, so each digit becomes a binary vector. Say the image is 100 pixels wide and 100 pixels high; then we have a 10,000-dimensional feature space for representing our data. And we have different classes, for example the class six versus the class non-six, or the class six versus the class five. This can also be done using support vector machines; a tiny sketch of this pixel representation follows below.
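A tiny sketch of my own (not the lecture's code) of the pixel representation just described: a small binary image is flattened into one feature vector with one dimension per pixel.

```python
# Turning a small binary digit image into a flat feature vector.
import numpy as np

image = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]])        # a toy 4x4 "digit"

feature_vector = image.flatten()        # 16-dimensional point in feature space
print(feature_vector.shape, feature_vector)
```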
All right, so they used special kinds of support vector machines and a large training set (I have no idea how large it is), and they used 10,000 test examples to evaluate how good the classification was. They found that for the 10-class classification into all ten possible digits, the error rate using support vector machines was in the best case, using a special kind of support vector machine, only about 0.5%, which is very, very good.
So of course one could ask whether it is impossible to make it even better. But if we look at the 65 misclassification this machine made on these 10,000 test examples, then you would see that it's even difficult for humans to determine what number this might be.
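As a rough sketch of such an experiment, assuming scikit-learn is installed: the small 8x8 digits set that ships with scikit-learn stands in for the large handwritten zip code collection, so the numbers will not match the 0.5% quoted here, but the procedure is the same. The kernel and parameter values are illustrative choices, not the ones used in the original study.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data set: each row of X is one flattened 8 x 8 digit image,
# y holds the manually annotated digit (0-9).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# One multiclass soft-margin SVM; kernel and parameters are illustrative.
clf = SVC(kernel="rbf", gamma=0.001, C=10)
clf.fit(X_train, y_train)

# Fraction of misclassified test digits, i.e. the error rate on unseen data.
error_rate = 1.0 - clf.score(X_test, y_test)
print(f"test error rate: {error_rate:.2%}")
```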
I've seen this picture, and for each image, for each digit, two numbers are provided: the correct number and the number assigned by the algorithm. Unfortunately, I don't know which number is which.
I have tried to find out, but I'm not able to tell whether this is the human judgment and that is the machine judgment, or vice versa. For example, if you look at this one, it could be a five or a nine. There might be people who write a five like this,
but there might also be people who write a nine like this. These are all situations where you don't have any chance to determine whether it is a five or a nine. In most of the other cases here, the machine doesn't have a chance either, because even humans would not be able
to do this classification correctly. So in other words, an error rate of 0.5% might be the best you can do, because of these strange examples. The reference shouldn't be an error rate of zero,
but the performance of human classifiers. The next topic, and the last topic for today, is the overfitting problem. When we use support vector machines, especially non-linear support vector machines, there is always the opportunity to use a transformation into some higher-dimensional space
that is really, really complicated, so that we can achieve a perfect classification even of the weirdest data. For example, if we have this data set here, we could use a linear classification and just assume that these two points here are outliers
that shouldn't be taken too seriously. But we could also say, well, this data set definitely is correct, so we use a very, very complicated kernel and then arrive at this classification boundary. So the question is, which one is better?
Is a simple solution better, or do we always want a complicated solution that fits the data perfectly? Obviously, we want something in between: a solution which isn't too complicated, but is complicated enough to represent
the systematic properties of our data set. So what is the problem here? If we have a perfect classification with a very complicated boundary, then it could be the case that it fits the training set very well,
but doesn't really work on new data. Because if you have just three training examples, they could look like this, and then our classification boundary might, for some reason, be something like this. But the truth could be that this here would be the perfect fit,
because if we had had more data, it would have looked like this. So we always want simple solutions that fit the data, but that are complicated enough to capture its systematic properties.
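A small sketch of that effect, again assuming scikit-learn and using illustrative parameter values: a very flexible RBF kernel can reproduce the training set almost perfectly and still do much worse on held-out data than a plain linear SVM.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Compare a simple model with a deliberately over-flexible one
# (gamma=1.0 is far too large for this data and essentially memorizes
# the training set).
for name, clf in [("linear SVM", SVC(kernel="linear")),
                  ("RBF SVM, gamma=1.0", SVC(kernel="rbf", gamma=1.0))]:
    clf.fit(X_train, y_train)
    print(name,
          "train accuracy:", round(clf.score(X_train, y_train), 3),
          "test accuracy:", round(clf.score(X_test, y_test), 3))
```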
So what can we do to avoid overfitting? There's a technique called cross-validation. We randomly split our training data into two parts, the so-called training set and the so-called test set, and then we use the training set to learn a classifier,
to learn a support vector machine, to learn a classification boundary. Then we use the test set, which is unknown to the classifier, to check how good the classifier really is. In the ideal case, the classification performance on the training set is very similar
to the classification performance on the test set, because this would mean that the classifier has found all the relevant information in the training set, information that is also available in the test set, and hasn't used any properties that are just by random chance
available in the training set but not in the test set. You can then try different kernels and different parameters for weighting the errors in the soft-margin approach to find the support vector machine that is equally good on the training and the test set, or simply best on the test set; that is also an option.
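A sketch of that selection procedure, assuming scikit-learn; instead of one fixed split, cross_val_score repeats the split several times and averages the held-out accuracy. The candidate kernels and C values below are just illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Candidate kernels and soft-margin parameters; pick whatever scores best
# on the held-out folds.
candidates = {
    "linear, C=1": SVC(kernel="linear", C=1),
    "RBF, C=1":    SVC(kernel="rbf", C=1),
    "RBF, C=100":  SVC(kernel="rbf", C=100),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(name, "mean held-out accuracy:", round(scores.mean(), 3))
```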
There is also a different method, apart from cross-validation. In many cases you don't have much training data,
so it won't be a good idea to split the training data into two halves and use only one half for training purposes and the other half just for testing your classifier. Usually you want to use all the data you have for training the classifier.
So you need a way to avoid overly complicated classification solutions, and this is where regularization comes into play. A simple form of regularization is introducing penalties for complicated solutions. For example, if you have found different classifiers
classifying the same data set, say one linear classifier which makes some errors and some non-linear classifier which fits the data perfectly, then you would want to assign scores for the complexity of each solution.
The linear solution would get a penalty of zero and the complex one, for example, a penalty of 100. Then you combine the classification performance and the penalty into an overall score, and at the end you use the classifier
with the best overall score, the one that combines good classification performance with simplicity of the model. This is the basic idea, and essentially the soft-margin technique we used for support vector machines is a good example of regularization,
because here classifiers having large margins and a few errors are preferred over those having small margins and making no errors. And by using the parameter C, you can determine what's more important to you.
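A tiny sketch of the role of C, assuming scikit-learn; the 2D points and the "outlier" are made up for illustration. A small C makes training errors cheap, so the SVM tends to keep a wide, simple margin and tolerate the outlier; a very large C makes errors expensive, so the boundary tends to bend around it.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2D data: two clusters plus one point of class +1 sitting
# inside the class -1 cluster (the "outlier").
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [2.5, 2.5]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# Same kernel, different error penalties: C controls the trade-off between
# a large margin and the number of tolerated training errors.
for C in (0.1, 1000.0):
    clf = SVC(kernel="rbf", gamma=2.0, C=C).fit(X, y)
    errors = int((clf.predict(X) != y).sum())
    print(f"C = {C:>6}: training errors = {errors}, "
          f"support vectors = {len(clf.support_)}")
```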
All right, the last aspect of overfitting is the so-called bias-variance trade-off. As I already said, there is some kind of trade-off when choosing the right type of classifier.
On the one hand, if you ignore some specific characteristics of your training set, you might lose information that is important; you would introduce some kind of bias into the classification. For example, if you simply ignore that there is some kind of curve here,
then this might be a mistake and would result in misclassifications in the future. On the other hand, you might try to account for everything that could be special in your training set by using a very complicated classification function;
for example, even if there are some negative examples here, you might be tempted to use such a classifier. Then you will perfectly fit the training data, but you get a large variance over classifiers when you randomly sample your training set. For example, in one random sample
there might be a negative example here, and in another sample there isn't one. Then you would get very, very different classifiers just depending on how your training sample looked. What you really want is that, quite independently of which training sample you used,
the classifiers all look nearly the same. So you want a small systematic bias, which means not being too simple, and you also want a small variance, which means being simple enough. Typically you cannot have both.
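As a rough way to see the variance part in practice, assuming scikit-learn and NumPy, one can retrain each model on several random subsamples and measure how often the retrained classifiers disagree with each other on a fixed set of probe points; a more flexible model will typically fluctuate more from sample to sample than a simple one. The helper below is only an illustrative measurement, not a standard library routine.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Fixed probe points on which we compare the retrained classifiers.
X_pool, X_probe, y_pool, _ = train_test_split(X, y, test_size=0.2, random_state=0)

def prediction_variability(make_clf, n_rounds=5, sample_size=300, seed=0):
    """Fraction of probe points on which classifiers trained on different
    random subsamples do not all agree (a crude variance measure)."""
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X_pool), size=sample_size, replace=False)
        clf = make_clf().fit(X_pool[idx], y_pool[idx])
        preds.append(clf.predict(X_probe))
    preds = np.array(preds)
    return float(np.mean(preds.min(axis=0) != preds.max(axis=0)))

print("linear SVM:", prediction_variability(lambda: SVC(kernel="linear")))
print("RBF SVM:   ", prediction_variability(lambda: SVC(kernel="rbf")))
```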
You have to decide; you have to use some kind of weighting, and this always depends on your application, as always in information retrieval. Try it out on your collections and see what works best. That's it for today. Next week we will talk about web retrieval.
Thank you very much for your attention.