Language models, Retrieval evaluation (18.5.2011)

Video in TIB AV-Portal: Language models, Retrieval evaluation (18.5.2011)

Formal Metadata

CC Attribution - NonCommercial 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
10.5446/359 (DOI)
Technische Universität Braunschweig
Institut für Informationssysteme
Balke, Wolf-Tilo

Content Metadata

This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course.
So, welcome everybody to lecture number 6 of the lecture Information Retrieval and Web Search Engines. Unfortunately, we had some problems with the audio and video recording of the original lecture, so this is a re-recording of that lecture, and we are doing it a bit more quickly than usual. In particular, we are going to skip the homework questions, since there is nobody here whom I could ask about them, and directly begin with the lecture.
There are two topics today, the first of which is language models. Language models are another approach to modeling retrieval in an information retrieval system; essentially, it is a different approach compared to the vector space model and the other models we have seen. Language models start from a different assumption about the properties of the document collection. The basic observation is that if you talk about different topics, you will use different styles of writing. For example, if you write a more formal document, you don't use colloquial terms, and if you write, say, a comment in a blog, you would use colloquial terms. Of course, it also depends on the topic you are talking about. When you talk about information retrieval, as I am doing today, you will probably use words such as document, term, matrix and all this stuff. Somebody talking about politics will definitely use a different vocabulary. If it is about German politics, you will probably talk about the Chancellor, some ministers, Angela Merkel and words related to that. So obviously your language depends on the topic you are talking about, and that is essentially the key idea underlying language models: these models try to describe the language used to talk about a certain topic. The general idea is that each language model belongs to some fine-grained topic, and fine-grained here means on the document level: each document is about a topic, and this topic is associated with a certain language used to describe it. Of course, documents that deal with the same topic will use very similar language. So that is essentially the idea, and we are going to create a formal model that captures the statistical properties of each of these languages; basically, we are going to model how often each term occurs in each topic. As I said, in the topic of politics, or foreign politics, you will probably read a lot about the USA and some heads of state and all this vocabulary. So we have a model that describes each document, and when we want to answer some kind of search request, a query, the question is essentially: given the language model that corresponds to the query, which documents in our collection fit the query best? Both the query and the documents are described by statistical language models, and we are going to compare each document model to the query model. Those documents whose models fit the query model best will be returned.
Now the question is how to describe the language of a topic. You have probably already heard about formal grammars in the basic courses of your computer science studies; basically, formal grammars are used to describe syntactical properties of a language. For example, to describe natural language you could use the following grammar. A sentence is built by combining a noun phrase and a verb phrase. A noun phrase could, for example, be "the man" or "the book". A verb phrase, on the other hand, is built from a verb followed by a noun phrase. The only verb in this very small grammar could be "took", and the noun phrases are "the man" and "the book", as we have already seen. A possible way to build a sentence would then be: for a sentence we need a noun phrase and a verb phrase; the noun phrase could be "the man"; the verb phrase is a verb and a noun phrase; and so on, essentially substituting each variable until we reach the atomic parts: "the man took the book". Or "the book took the man", which definitely would be less reasonable, but it is still correct in a grammatical sense, though maybe not semantically. From a formal point of view, that is a valid way of building a sentence.
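The substitution process described above can be tried out in a few lines. This is only a minimal sketch of the toy grammar from the lecture: the rule names `S`, `NP`, `VP` and `V` and the helper `expand` are labels chosen here for illustration, not something given in the lecture.

```python
import itertools

# Toy grammar from the lecture example. The symbol names S, NP, VP and V
# are hypothetical labels introduced for this sketch.
GRAMMAR = {
    "S": [["NP", "VP"]],                     # sentence = noun phrase + verb phrase
    "NP": [["the", "man"], ["the", "book"]],  # the two noun phrases
    "VP": [["V", "NP"]],                     # verb phrase = verb + noun phrase
    "V": [["took"]],                         # the only verb
}

def expand(symbol):
    """Yield every word sequence that can be derived from `symbol`."""
    if symbol not in GRAMMAR:
        # Anything without a rule is an atomic word.
        yield [symbol]
        return
    for rule in GRAMMAR[symbol]:
        # Expand each symbol on the right-hand side, then combine the results.
        parts = [list(expand(s)) for s in rule]
        for combo in itertools.product(*parts):
            yield [word for part in combo for word in part]

sentences = [" ".join(words) for words in expand("S")]
for sentence in sentences:
    print(sentence)
```

The grammar accepts all four combinations of the two noun phrases, including "the book took the man". It says nothing about how often any of them would actually be written.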
So this is what formal grammar models are all about. Do formal grammars help us in solving our problem of modeling topics? Well, formal grammars definitely capture the syntactical correctness of language, but not the writing style. In natural language everybody has his own style, and you do not always follow grammatical rules in a strict fashion. Natural language is some kind of living thing, and it doesn't really make sense to try to describe it syntactically in a meaningful way. What is important to us is the writing style, and we use the writing style to describe what a topic really is. The words used in a phrase or sentence also depend on the topic. Take a sentence about foreign politics: if you are talking about the relationship between Germany and some other country, you would definitely use the names of politicians from that country and from Germany. On the other hand, if the topic is the German-Japanese relationship, you would use different names. So it is not the grammar of the sentence that matters, but the words and the writing style, and these are determined by the topic. Formal grammars completely fail to capture such statistical properties. They just say whether a sentence or a text is correct or not. But what we want to achieve is to find out which words occur often and which occur seldom, because those words seem to be most important to the document or topic.
To this end, we are going to use so-called statistical language models, which completely ignore syntactic rules and grammar. In a way, they are similar to the bag-of-words model we have used already: the bag-of-words model completely ignores any kind of syntax or which words belong to the same sentence, and only concentrates on the frequency of the words. Language models work in quite a similar way. We are focusing on statistical regularities when it comes to generating documents; basically, which terms occur more often than others is what we are going to model here. The statistical language models we are using are so-called generative models. The basic assumption is that every document in your collection has been generated by some random process. This basically means that we have some probability of generating a certain sequence of words. Each document is a sequence of words, and to each sequence we can assign the probability of it being produced by the random process, among all possible documents that could be generated. We are just drawing documents randomly, and of course documents that are very typical for this generative model have a higher probability of being produced than those that occur more rarely.
A statistical language model consists of probability distributions, which are the basic building blocks of the language model. Let n be the number of words, or terms, in the document we are going to generate. Then we assume a probability distribution such that every word sequence of length n is assigned a probability of generation, the joint probability of the sequence of words. As an example, let us make the very simple assumption that only the words cat and dog appear in our collection. For n = 0, things are quite easy: there is exactly one possibility, the empty document, and its probability is 1. In every case where we want to randomly generate a document of length 0, we get the empty document. Next we model the case where we are generating a random document containing just a single term. There are two possible documents: the document containing only the word cat and the document containing only the word dog. In this model we assume that the probability of generating cat is 30 percent and the probability of generating dog is 70 percent; of course, the sum must be 1. For n = 2, there are four different documents that could be generated, cat cat, cat dog, dog cat and dog dog, and again we assign a probability to each of them, and again these must sum to 1. Of course, we could create such tables for all different n, and if we have a list, an infinite list, of these tables, then this would define a language model. So in its most general form, a language model is a rule that assigns, for each document length, a generation probability to each possible document of this length.
That is a bit clumsy, and usually one assumes some more structured model. It makes no sense to assume that a document of length n is completely unrelated, regarding its statistical properties, to a document of length n + 1. So usually one makes some more restrictive assumptions about how documents are generated. The most popular model is the so-called unigram model; unigram because every single word is treated as being independent from all other words, so we are ignoring the context in which a word appears. The probability of generating the document consisting of the words w1 to wn is the product of the probabilities of generating its individual terms. We assume that the occurrence of the first word is completely independent of the occurrence of the second term and of the rest of the document. As we know from the probabilistic models, this assumption is usually quite unrealistic, because terms that are related to each other tend to occur close to each other. But in the unigram model, to keep things simple, we just make this assumption. If you want it a bit more complicated, you could also use a bigram model. Here you assume that each word's probability depends on the word appearing just before it. So essentially you assume that the probability of generating this document is the probability of generating the first word, multiplied by the probability of generating the second word given that the first word has been w1, multiplied by the probability of observing the third word given that the word before it was w2, and so on. In the end, to completely characterize the generative model, in the unigram model we just need to assign a probability to each word. In contrast, in the bigram model we need to assign a probability to each pair of words, because we have these conditional probabilities: we have a probability for each single word, and each conditional probability value depends on two words. If you like it even more complicated, you could use a trigram model, which assumes dependence on the previous two words. The probability of generating the word sequence w1 to wn is then equal to the probability of generating the first word, times the probability of generating the second word given that the first word has been w1, multiplied by the probability of observing w3 given that the two words appearing just before w3 have been w1 and w2, and so on. In this model, to define the probability of generating a document, we need to define the probability of every single word, the probability of every two-word combination, and the probability of every three-word combination. So essentially, in the unigram model we need to define as many probability values as there are words. In the bigram model we need the number of words for the single-word probabilities plus the number of words squared for the pairs. In the trigram model it gets really complicated: on top of this we need even more values, on the order of the number of words cubed. The models become complicated, with so many parameters to deal with. Bigram models are still quite popular because they can be handled, but the most popular one is the unigram model, the simplest one, which is also similar to the model we have seen in probabilistic retrieval.
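Written as formulas, the three factorizations described above look as follows. The notation (P for probabilities, w_i for the i-th word, V for the vocabulary) is chosen here for illustration and may differ from the lecture slides.

```latex
\begin{align*}
\text{unigram:}\quad & P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i) \\
\text{bigram:}\quad  & P(w_1 \dots w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \\
\text{trigram:}\quad & P(w_1 \dots w_n) = P(w_1)\, P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-2}, w_{i-1})
\end{align*}
```

With a vocabulary of |V| words, this means |V| probability values for the unigram model, |V| + |V|^2 for the bigram model and |V| + |V|^2 + |V|^3 for the trigram model, which matches the counts given in the spoken description.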
All right, let's look at an example of a bigram model for a vocabulary consisting of three words: cat, dog and mouse. As we have seen, we need to define the occurrence probability for each single word. Let us assume that cat has a probability of 40 percent, dog a probability of 50 percent and mouse a probability of 10 percent. And then, remember, we also need to define the probability for each word combination: the probability that this word occurs directly after that word. The idea of such a model is, for example, that if the current word has been cat, the next word will never be cat again, so cat cat will never occur in any document in direct sequence. The same is true for the combination of dog and mouse, where the one will never directly follow the other, and for mouse and mouse: there won't be any mouse directly after a mouse. On the other hand, if we have observed cat in a document, it is highly likely that the next term will be mouse and quite likely that the next term will be dog. Please note that each column must sum to one, because the columns denote the conditions. Now let's look at six randomly generated sequences that have been generated from this model. To generate the first word, we need to take a look at the single-word probabilities, and we can see that it is likely to get dog as the first word, a bit less likely to get cat, and mouse is quite rare. When we look at our first document, we see that we have observed dog as the first word, and then the question is the probability of the next word being generated directly after dog; here it is 50 percent, so this is likely to happen. And so on; this way we can randomly generate documents. This is what you get when you run this model six times, just six random sequences. As we can see, and as already indicated, there is no sequence cat cat, no forbidden combination of dog and mouse, and no sequence mouse mouse, because these combinations cannot occur in our model. On the other hand, if we observe cat, it is very likely that the next term is mouse: cat mouse, cat mouse, cat mouse. Other combinations happen only sometimes. So, as you can see, the combination cat mouse is very likely, and this is directly reflected in what we see when applying our model.
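The random generation just described can be sketched as follows. Please treat the numbers with care: the start probabilities follow the example (with cat's 40 percent inferred from the values summing to one), and the zeros for cat cat and mouse mouse as well as "cat is most often followed by mouse" follow the description, but the transcript does not preserve the full table, so every other conditional probability, including which order of dog and mouse is the forbidden one, is an assumption made for illustration. The function names are hypothetical as well.

```python
import random

# Probabilities of the first word (cat's 0.4 is inferred: the values sum to 1).
START = {"cat": 0.4, "dog": 0.5, "mouse": 0.1}

# NEXT[previous][current] = P(current | previous).
# From the lecture: cat->cat and mouse->mouse are impossible, cat->mouse is
# highly likely. ASSUMPTION: all remaining values, and the choice of
# mouse->dog as the forbidden dog/mouse combination, are illustrative.
NEXT = {
    "cat":   {"cat": 0.0, "dog": 0.3, "mouse": 0.7},
    "dog":   {"cat": 0.2, "dog": 0.3, "mouse": 0.5},
    "mouse": {"cat": 1.0, "dog": 0.0, "mouse": 0.0},
}

def draw(distribution, rng):
    """Draw one word according to the given probability table."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate(length, rng):
    """Generate a random document of `length` >= 1 words from the bigram model."""
    document = [draw(START, rng)]
    while len(document) < length:
        document.append(draw(NEXT[document[-1]], rng))
    return document

def probability(document):
    """Probability that the bigram model generates exactly this non-empty document."""
    p = START[document[0]]
    for previous, current in zip(document, document[1:]):
        p *= NEXT[previous][current]
    return p

rng = random.Random(0)
for _ in range(6):
    print(" ".join(generate(8, rng)))
```

The same tables serve both directions: `generate` draws documents, while `probability` scores a given document, which is the use that matters for retrieval.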
So what we have observed is that generative models can easily be used to randomly generate documents. Well, quite nice, but not exactly what we want. However, they cannot only be used to generate documents, they can also be used to recognize documents, and this is much more interesting to us, because usually we have defined some probabilistic models and we want to find out which model the query fits best. So we observe the query, we have a lot of documents and language models in our collection, and we want to retrieve those documents having a language model that corresponds best to the query. So: which document fits the given model best? Usually you base this on the probability of generation. Similar ideas are used in other applications, for example OCR, optical character recognition, or speech recognition, where you have the problem that you could not recognize some words. Here is a German example: you are not sure what this last letter could be; it could be this or that, nobody knows. But as soon as you know the first three words, you can deduce which word is most likely to follow this sequence, and so resolve the unclear letter. That is one way to use it. So another question is
how to apply language models to information retrieval tasks. Again, we make some assumptions. We assume that for each document in our collection there is some statistical document model that has been used to generate this particular document. Of course, we don't know what this model looks like, what the probabilities are, and therefore we need to estimate what the model looks like from the document we observe. We assume that each document has been generated from its model by a random generation process. That means that the document we are seeing is just a single random sample from the language model that corresponds to the document. The query is also a sample from some model, one which describes the user's information need. So the user has some query topic, and his or her query is directly generated from this query topic model. On the other side, we assume that the first document has an underlying model, the second document has an underlying model, the third document has an underlying model, and by using the generative process, as in the cat and dog example, we have generated the first document, the second document and the third document. Now, since we observe only the documents and the query, we want to find out what these models look like. We estimate the document models from the documents we see, we estimate the query model, and then we simply compare the query model to all the document models and find those most similar to the query model. Those models are highly likely to be associated with the most relevant documents.
So what we do in a typical application is that we first estimate a language model for each document; this is the estimation process. Then for each model we compute the probability of generating the query we have seen: for document 1 we can directly compute the probability of generating the query, and the same can be done for documents 2 and 3. Then we rank the documents by this probability. For example, if model 3 is highly likely to generate a document that looks like the query, then we assume document 3 is highly relevant to the query.
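The procedure just described (one model per document, score each model by the probability of generating the query, rank by that score) can be sketched as follows. This is only an illustrative sketch: the three document models and their probabilities are made-up values, the function names are hypothetical, and the query probability is assumed to be a product over independent unigram terms. Logarithms are used so that long queries do not underflow.

```python
import math

def query_log_likelihood(query_terms, model):
    """Log-probability that a unigram model generates all query terms."""
    total = 0.0
    for term in query_terms:
        p = model.get(term, 0.0)
        if p == 0.0:
            # A term the model never generates makes the whole product zero.
            return float("-inf")
        total += math.log(p)
    return total

def rank_documents(query_terms, models):
    """Return document ids ordered by decreasing query likelihood."""
    scores = {doc_id: query_log_likelihood(query_terms, model)
              for doc_id, model in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical, already estimated document models (term -> probability).
models = {
    "doc1": {"german": 0.10, "politics": 0.05, "football": 0.30},
    "doc2": {"german": 0.20, "politics": 0.20},
    "doc3": {"german": 0.05, "football": 0.50},
}

ranking = rank_documents(["german", "politics"], models)
print(ranking)
```

Note that "doc3" ends up last only because it assigns zero probability to one query term, which is exactly the estimation problem that smoothing is meant to address.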
It is not that complicated, but some problems remain. How do we estimate the true language model from the documents we have? We have only seen one single sample, and we have to estimate all these probabilities from it, which seems to be quite complicated. How do we do that in a consistent and meaningful way? And of course: which kind of language model should be used, unigram, bigram, trigram, whatever? That depends on your application, but usually unigram is a good choice. In practice, unigram models are usually used, sometimes bigram models, but as we have seen, the complexity of estimating the parameters of a bigram model is much higher; it is quadratically higher. Usually one sticks with the unigram model, also for reasons of computational complexity. On top of that, we have to deal with the fact that documents are usually very short. Let's say a document contains two thousand words and we have a vocabulary over our collection of, say, a million words. It is quite complicated to estimate a million probability values from just two thousand observations. So estimation is tricky enough for unigram models, and for bigram models it often seems to be hopeless. The basic problem is that if you use a richer model that captures more semantic and positional relations between terms, it is definitely better able to express the statistical properties of the document, but because you need to estimate more parameters, your estimates are usually not very good. So the gain from using a richer model is outweighed by the losses you get from the data sparseness problem. So unigram models are what we will deal with
mostly in this lecture; how to use bigram models you can infer from that, it is not too difficult. So how do we estimate a unigram model from a document? The straightforward approach would be: given some document consisting of n terms, we just estimate the probability of each term by the frequency of the term in the current document divided by the number of words. So the probability in our model is just the relative frequency of each term. This is called the maximum likelihood estimator in statistics, and it seems to be completely reasonable. For example, we have seen this document and need to estimate the probability of "dog", of "cat", and of some other word in our vocabulary. We estimate the probability of the term "dog" in our model by the number of times "dog" occurs, which is 2; in total we have 1, 2, 3, 4, 5, 6, 7, 8 words, so the relative frequency of the term "dog" is 2/8. The same can be done with "cat": as the term "cat" does not occur in our document, we estimate its probability to be zero. And the probability of a term that occurs only once in our document is estimated to be 1/8, of course.
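A minimal sketch of this maximum likelihood estimate is given below. The eight-word example document is an assumption chosen to match the counts mentioned in the lecture ("dog" twice, eight words in total, no "cat"); the exact wording of the slide is not known. Exact fractions are used so the values can be compared with the ones above.

```python
from collections import Counter
from fractions import Fraction

def mle_unigram(tokens):
    """Maximum likelihood unigram model: relative frequency of each term."""
    counts = Counter(tokens)
    n = len(tokens)
    return {term: Fraction(count, n) for term, count in counts.items()}

# Assumed example document with 8 words ("dog" occurs twice, "cat" never).
doc = "the big dog jumped over the small dog".split()
model = mle_unigram(doc)

print(model["dog"])                    # 2/8, printed in lowest terms
print(model["small"])                  # a term seen once
print(model.get("cat", Fraction(0)))   # unseen terms get probability zero
```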
The problem with the maximum likelihood estimation approach, as I have already indicated, is that the document size is often too small to make reliable estimates for these parameters. As we have seen on the previous slide, we have a document of size 8, and we possibly have a vocabulary containing thousands of words; "cat", for example, could be such a word. So we have a very small document with a large vocabulary, and we cannot really estimate the probabilities in a good and reliable way, because for all terms that are missing in the document we estimate zero probability, which usually does not fit the true model. It could be just by chance that a document about the topic politics doesn't mention the term "politician", because of the writing style of the author, for example, but we would expect the term "politician" to have a probability larger than zero, because it is highly related to the topic. We also have the problem that terms occurring only once in the document are normally overestimated, because we might just have been lucky generating them by chance. What we want to do is to lower the probability of such terms and raise the probability of terms that do not occur in the document, in a meaningful way. A way to do this is called smoothing. The idea is to assign some probability to all missing terms and to push all estimated probabilities in the direction of the collection mean. That means: if we know that "politician" is a very frequent term in our document collection, then we would assume that a document that does not contain "politician" nevertheless has some probability of generating this term, because it is so frequent in the whole collection. Smoothing can be done in many ways.
We will look at two or three of them. The simplest one is shown here. The basic idea is to just add some small number, here 1, to all term counts. So here in our example document about the big dog, we just add 1 to each count: we counted the word "the" 2 times, and we just say that we have seen it 3 times. The same is done with "big", seen only once; we add 1, so we have 2. So our initial probability estimates are a bit higher than I would expect, and terms that do not occur in our document are estimated with a count of 0 plus 1. This is our initial estimate. As we need all these probabilities to sum up to 1, we divide by the sum of these values. The sum here would be 15/8, so we divide each term by this value and arrive, for the term "the", at 3/8 divided by 15/8, which is equal to 1/5. So the final estimates sum up to 1, as we would expect. As we have seen, terms not occurring in the document get some probability mass, and at the same time the terms occurring in the document get a lower probability than I would expect without smoothing. So this is quite nice.
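The add-one scheme can be sketched in the same two steps used above: raise every count by 1, then divide by the sum so the estimates add up to 1. The example document and the seven-word vocabulary (the six distinct document words plus "cat") are assumptions; they were chosen because they reproduce the sum 15/8 and the final value 1/5 mentioned in the lecture.

```python
from collections import Counter
from fractions import Fraction

def add_one_unigram(tokens, vocabulary):
    """Add-one smoothing: add 1 to every count, then renormalise."""
    counts = Counter(tokens)
    n = len(tokens)
    # Step 1: initial estimates (count + 1) / n; they sum to more than 1.
    raw = {term: Fraction(counts[term] + 1, n) for term in vocabulary}
    # Step 2: divide by the sum so that the final estimates sum to 1.
    total = sum(raw.values())
    smoothed = {term: p / total for term, p in raw.items()}
    return smoothed, total

# Assumed example: an 8-word document and a vocabulary with one unseen word.
doc = "the big dog jumped over the small dog".split()
vocabulary = sorted(set(doc) | {"cat"})

smoothed, total = add_one_unigram(doc, vocabulary)
print(total)             # sum of the initial estimates
print(smoothed["the"])   # seen twice in the document
print(smoothed["cat"])   # never seen, but no longer zero
```

The two steps collapse into the familiar closed form (count + 1) / (n + V), where V is the vocabulary size.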
Another way to do smoothing is so-called linear smoothing. We estimate the probability of each term by a weighted mean of the term frequency in the respective document and the frequency of the word over the whole collection. For example, say "politician" occurs only once in our document of size 8, but we know that in our collection every second word is "politician", because it is a very frequent term. Let's say the collection size, the number of words in the collection, is 1000, and the collection frequency of the word is 500, so every second word is "politician". But in our current document we have seen "politician" only once. It would be a good idea to raise the probability of "politician" in our document, because it seems to be just bad luck that we have generated "politician" only once. If we take a weighting of one half, then we would have 1/8 times 1/2 plus 1/2 times 1/2, which would be 1/16 plus 4/16, which is 5/16. This would be our estimate for "politician" in our current document.
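The arithmetic of this example can be checked with a short sketch. The function and parameter names are hypothetical; `lam` is the weight given to the document estimate, and the numbers are the ones from the example above (1 occurrence in 8 words, 500 occurrences in a collection of 1000 words, weight one half).

```python
from fractions import Fraction

def linear_smoothing(tf, doc_len, cf, coll_size, lam):
    """Weighted mean of the document estimate and the collection estimate."""
    p_doc = Fraction(tf, doc_len)     # relative frequency in the document
    p_coll = Fraction(cf, coll_size)  # relative frequency in the collection
    return lam * p_doc + (1 - lam) * p_coll

p = linear_smoothing(tf=1, doc_len=8, cf=500, coll_size=1000, lam=Fraction(1, 2))
print(p)
```

With a weight closer to 1 the estimate stays near the document frequency; with a weight closer to 0 it moves towards the collection frequency.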
The third method to do smoothing is due to Ponte and Croft. Their idea was to stabilize the document model. If the term is missing in the document, we just take the overall probability of the term in the collection, this one here. But if it is in the document, we smooth the estimate we would have made, again using the term's probability, but this time over all the documents containing the term. In this way we would again raise the probability of terms that we did not see in the document, and possibly lower or raise the estimates of the terms we have seen in the document, depending on the case. So there are many different ways to do smoothing, but the general idea is to come up with estimates that are not so dependent on the single document and that also use some global knowledge about the collection.
So what's next? We have created a document model for each document, so we have estimated all the unigram or bigram probabilities, and now we want to compute the probability that the query has been generated with respect to some document's model. Formally: given a query consisting of k terms in this sequence, the ranking score of a document is set to our estimate of the probability that the query has been generated from the document's language model. As we are using a unigram model, we just assume that this probability is the product of the k probabilities corresponding to the individual query terms, and
then we can directly assign a score to each document and return all documents in the collection ordered by the probability that the query is generated by their model. We hope that this corresponds to relevance as perceived by humans. So what are the pros and cons of language models? The pros: there is definitely a quite clear statistical idea behind it, and no ad hoc weighting or anything like that; even smoothing can be motivated quite easily. In this respect it is similar to the probabilistic model, where the motivation is also quite clear, although there you sometimes have these unrealistic assumptions. The collection statistics are directly used in the model, instead of being used for some kind of normalization purposes as in the vector space model with all its weighting mechanisms. And of course it works: its performance is comparable to the other models, so this is not so bad; it is a nice idea and it works quite well. So that is the essence of it. Now the disadvantages of language models. Of course, when we use the unigram model, which is the most typical thing, we again have the independence assumption, which is unrealistic: the probability of the term "politician" occurring is treated as independent of everything else, but usually, if we know that somewhere to the left the document contains a related term, you can be quite sure that "politician" becomes more likely to appear. The second con is that there is no notion of relevance integrated into the model; it is just about scoring and matching the query to the document collection. Therefore the integration of user feedback is quite difficult. In the vector space model we could easily incorporate relevance feedback by moving the query point in a different direction; this is much more complicated in language models.
I am going to skip the demo of this. It is some kind of search engine which tries to use ideas similar to language models. Just have a look at it at home; it is quite interesting. Now we go to the second half of the lecture, which is about the evaluation of information retrieval systems. In the past weeks we have used different approaches to the retrieval problem, and we have always said that this one or that one works quite well in practice, but we did not explain in detail what this really means. So now we are going to take a look at how the evaluation of information retrieval systems really works.
When evaluating information retrieval systems we need to evaluate two things, as in computer science for any algorithm you are going to design. The two things are efficiency and effectiveness. Efficiency means how well you use the available resources, in terms of scalability, space and time. Basically, the question of efficiency is: are we doing things right? We have some defined outcome, for example some language model, and efficiency is only about the system that implements it: it should be fast, it should be scalable, distributed, and so on, and maybe easy to maintain. Usually efficiency is not the most difficult thing, because you can evaluate efficiency just by measuring the time it takes to answer a query. Capturing efficiency is quite easy. What is more problematic in information retrieval is effectiveness. The question of effectiveness is: are we doing the right things? Doing the right thing means: are we using the correct algorithm, are we making correct assumptions about what the user wants? The main focus of the effectiveness question is result quality and also usability: what does the system do for me, are the results relevant, do they correspond to what I would expect from the system? This is a question that is generally related to human cognition and human understanding, and it is very hard to capture, but we will try.
As I said, efficiency is about using as little storage space as possible, using only very little CPU time, doing only a small number of input/output operations, guaranteeing quick response times, and so on. All these things you can easily measure, depending on your hardware and software, and it is basically a problem of software engineering to come up with good tricks: optimizing source code to speed up your operations, using the right hard disks, distributing your index over different hard disks, or thinking about how to design your index and whether it is sorted. In information retrieval, efficiency should be good enough. The goal is not to have the most efficient algorithm, the optimal solution, but it should be efficient enough, which really means: if you press the search button after entering your query, you would expect results after about a second. That is efficient enough. As I said, efficiency is usually very easy to evaluate: just measure the time and the space needed to return the results. Easy to do, no problem here.
More problematic is effectiveness: how to measure the quality of your results. The key idea is relevance. I talked about relevance some weeks ago, and we have already said that a document is relevant to a query if it satisfies the user's information need in some way. But I was quite vague about that, assuming that you have some intuitive understanding of what relevance really means. The fact is that there is no strict definition of relevance. It is similar to problematic terms like information or intelligence: they are really hard to define, but usually people agree about what they mean. To everybody it is quite clear what information really is, and the same is true for relevance. On the next slides we take a look at different ideas, different approaches to define relevance, and I think you will get the spirit of the idea. So first we look at some aspects of relevance, at what is important in defining relevance, and then I will show you the approach usually used in information retrieval research. The main thing you need to know when you want to build a retrieval system is that we don't need a precise definition, because we just want
systems that really work. Saracevic, some years ago, tried to take a systematic approach to relevance, and he came up with five different definitions. The first one is system or algorithmic relevance, which basically means that you define relevance in terms of your algorithm: you assume that whatever the algorithm returns is deemed relevant, and what it does not return is not relevant. This is relevance defined in mathematical terms, and this here is relevance as we defined it before, by means of whether a result satisfies the user's information need. So this is the algorithmic part, and on this side we have the cognitive, maybe the human side of relevance, which is harder to capture in a formal model, of course. So there are different forms of relevance, and we are
taking a look at them now. System or algorithmic relevance is an objective concept: we just define relevance in terms of some algorithm, and whatever the algorithm returns is deemed relevant, and all other documents are not. For example, take the vector space model: given the query terms, we would compute the similarity between the query and the documents in the collection, and we would say, for example, that all documents with a similarity above 0.5 are deemed relevant and those below are not. This is a nice definition of relevance, but of course it does not necessarily match how a human would understand relevance, because it is only influenced by the ideas behind the algorithm.
The next one is topical or subject relevance. The idea is that relevance is understood as aboutness. If you have a query, you assume the query points to some larger topic, for example German politics. If you are looking for German politics, then the Wikipedia article on German politics would definitely be relevant, and maybe the pages of the politicians themselves would be relevant, because they are about German politics in general. So the idea is that you use some kind of intellectual assessment of whether a document corresponds to the topical area required and described by the query. This is the idea of some kind of semantic fit: you show some people the query and the document, and they say whether these are the same topics or not, and if the topics are related in some way, the document is judged to be relevant with respect to the query. Of course, to assess the aboutness of a document you don't even need a query: you could just take each document and categorize it according to some classification scheme, as discussed in the first lecture. This is what is done in libraries: they try to classify documents into different subject areas. The same could be done for the query, and then you just return those documents matching the same subject area. This is, in general terms, what is meant by topical or subject relevance. The next
one, the next notion of relevance, is cognitive relevance, also known as pertinence. This is a subjective type of relevance. Relevance here is understood as the relation between the document and the cognitive state of knowledge and the information need of the user. This means that if I already know all about Angela Merkel, then her page is definitely not a relevant result for me if it is shown to me when I am looking for German politics, because I already know this information. This is the main difference between the former type of relevance, topical relevance, and cognitive relevance: cognitive relevance also takes into account the prior knowledge of the user in a particular situation. Cognitive relevance really means what a person himself judges as being relevant and what not. Topical relevance would be the question of what most people would regard as being relevant on a purely topical basis, and Angela Merkel definitely is related to the query "German politics".
But if I already know everything about Angela Merkel, her page is not relevant; it would not be a relevant result to the query in my individual case. The underlying question is: what is the user's judgement about the applicability of the retrieved documents to the matter at hand? So it is not about topical relatedness, but about the question whether this information helps me in some way, what is new to me in this document. Cognitive relevance can be dynamic: the more I know about the topic, the more I read about it, the fewer documents will be deemed relevant by me, because I already know what is written in them. So it changes over time, and it changes within the same query context: when I look at the first results in my list, the relevance of the remaining results in my list changes. Let me put it differently. Assume I am asking a query to a system and the system returns a list of five documents. Let's assume that at the beginning each of these documents is relevant. Now I look at the first document and the second document and read them, and let's assume these documents contain content similar to the remaining ones. Then after reading the second document, the third and the fourth document are no longer relevant to me, because I already know the information they contain. So relevance changes, and this seems to be quite hard to handle for a search engine.
A step further is situational relevance, or utility. Again it is subjective, from the perspective of the person who poses the query, and again it is dynamic and changes during the search process: the more I know, the fewer documents are relevant to me. Now it is not only about prior knowledge, but about whether the document helps me in solving my problem at hand. That is basically what relevance means here. I have some kind of information need, I don't know something, I want to solve a problem, and the question is whether the documents help me in solving this problem. Exactly those documents are defined as being relevant here that help me in solving my problem. This is quite nice, and usually this is what you want. Of course, this also includes what is called serendipity: some information might be useful although I did not expect it and did not search for it. For example, I want to buy some stuff on the internet, and I want to find sites where I can buy it online, and the system returns some brick-and-mortar store with a branch just around the corner. This is not relevant to my query: if I ask for the name of some stuff and "online shopping", a brick-and-mortar store with a branch nearby is definitely not relevant to my query in the topical sense, but it is relevant when it comes to solving my problem. So under the definition of situational relevance or utility, as it helps me solve my problem, it becomes relevant although it is not topically relevant: in the topical sense it is not relevant, but helpful.
The last and broadest definition of relevance is so-called affective relevance. Again it is subjective and dynamic, and now relevance is seen as the relation between documents and the intents, goals, emotions and motivations of the user. The idea is the human drive for information. Relevant here is everything that makes me happy, whatever this might be; it doesn't need to be related to the query in any way. If information makes me happy in some way, then it is relevant. Of course it is impossible, at least currently, for a machine to know enough about me to give results unrelated to my query that make me happy in some way. Maybe this is a goal we will be achieving in the future.
Fine, but usually we are going for some more restricted form of relevance. As a recap, we have seen five different notions of relevance. The first one is algorithmic relevance, just using a mathematical model: vector space relevance, Boolean retrieval relevance, language model relevance. Relevant are those documents that are returned by the algorithm. The second one is topical relevance, and this seems to be the most reasonable definition of relevance, because it can be judged quite objectively by different people; it is not difficult to find out whether a document deals with the same topic as a query does. So this seems to be the most reasonable definition of relevance at the moment. Cognitive relevance also incorporates my state of knowledge, situational relevance also incorporates the problem I want to solve, and affective relevance is about my overall goals, what makes me happy, what I am striving for: everything that satisfies me, that makes me successful, that makes me feel I have accomplished something, is relevant.
That said, the current goal in information retrieval is to build algorithms whose algorithmic relevance is very similar to topical relevance as judged by most users. If most users would say that the page of Angela Merkel would be a good and relevant result for the query "German politics", then we want to have an algorithm that returns exactly those pages that are found relevant by most users. A future goal, a research topic, is to find ways to address the other definitions of relevance, for example to build a machine that makes its users happy.
Finally, I would like to stress that there is a difference between relevance and pertinence. Pertinence really refers to the user's personal information need, the problem you are going to solve when asking your query, and relevance is just about the topic of the query. So we have an information need and a query that expresses it. For example, I want to buy something on the internet, and I type as a query the name of a singer whose album I would like to buy, and maybe a second term, into my IR system, and the system returns results for this query. Often one says "relevant to a query", which usually means "relevant to a typical information need fitting the query". The query itself is just a bunch of terms. When talking about relevance we mean the topic of the query, not the terms of the query. If only the query terms themselves defined relevance, then the Boolean model would be the best model, because it returns exactly those documents containing the query terms, which is not what we want: topical relevance is more than just matching terms.
Fine. Now we know, maybe not exactly, but we have quite a good idea of what relevance really means, and we are back at the question of how to evaluate the result quality delivered by some system. The traditional approach to doing this is creating a benchmark. A benchmark means taking some reference document collection, for example the collection of all articles published in Time magazine over the last 50 years. On top of that we define a collection of information needs that can be expressed in terms of queries. One information need could be: what was the relationship between the West and Cuba in the 1970s? This is the information need, and the query could be a few terms such as "Cuba" and "1970s". For each of these information needs and queries there is a manual assessment of the relevance of each document to the query; this is called the gold standard. You take some people, you pay them to go through the whole document collection and to say for each document whether it is relevant with respect to the query or not, and then you have a gold standard. You can compare your algorithm with this gold standard and check whether it retrieves exactly those documents that human judges would find relevant. Another example information need could be: what are the prospects of Quebec achieving independence from the rest of Canada? Usually users' information needs are such questions, and the query could be "Quebec independence Canada". As I said, we
basically would like to build a gold standard by finding out for each query which documents in the collection are relevant and which are not. This means that you need to go through the whole collection and say for each document whether it is relevant or not. Of course, if you have a collection of a million documents, this cannot be done in a reasonable way, because you just don't have enough time to go through all these documents. So the question is: what can I do to assess relevance for large collections? What you do is use the so-called pooling method. You run each query for which you want to assess relevance on a collection of very different retrieval systems, say the vector space model, the Boolean model, a probabilistic model and many more, and for each system you collect those documents that are deemed relevant by the system and pool all these results together. So the pool contains the documents that are found to be relevant by at least one algorithm; it is a collection of documents that are potentially relevant in some way. For example, take the top 100 results from each system; with 5 systems you end up with at most 500 documents, usually a much smaller number, because there is overlap between the different systems. Then you take the union of these retrieved documents and present them to human judges for manual relevance assessment. So the idea is: you have a large document collection, you send your query to different algorithms, these algorithms return different, possibly overlapping result sets, and all the documents contained in these result sets are then judged by human referees as to whether the document is relevant or not. The good thing, of course, is that this set is usually much, much smaller than the collection: the collection could contain millions of documents, and these are a few hundred at most. Of course, the underlying assumption of this method
is that those documents that have not been returned by any search engine are definitely irrelevant. Of course it may happen that there is some highly relevant document which has not been returned by any of your search engines and which simply gets lost in the process. That can happen, but if we assume that the search engines are good enough, we will find most relevant documents, and at least we can avoid
looking at all the documents in the collection. Here are some document collections which have been used in the past. Now just take a look at the slides yourself. It is also quite nice to take a look at the TREC website, where they run retrieval contests every year. See it for yourself. All right.
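The pooling method described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure of any particular evaluation campaign: the function name `build_pool`, the system names, and the document IDs are all made up for the example, and the pool depth of 3 stands in for the "top 100" mentioned in the lecture.

```python
def build_pool(runs, depth=100):
    """Return the union of the top-`depth` documents of every system.

    `runs` maps a (hypothetical) system name to its ranked list of
    document IDs for one query, best result first.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Toy data, invented for illustration: three systems, one query.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
    "system_c": ["d7", "d2", "d3", "d8"],
}

pool = build_pool(runs, depth=3)
upper_bound = 3 * len(runs)  # depth times number of systems

print(sorted(pool))
print(len(pool), "documents to judge, at most", upper_bound)
```

Because the systems overlap in what they return, the pool is smaller than the upper bound of depth times number of systems. Only the documents in the pool are handed to the human judges; everything outside the pool is treated as not relevant.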
So, as a summary of how test collections are used: the basic idea is to evaluate the algorithmic relevance returned by the algorithm that you designed and that you want to compare, and you measure it against the topical relevance assessed by human judges. The human judges create a list of relevant documents with regard to topical relevance, your algorithm also returns what it thinks is relevant, and then you compare these two notions to each other. The underlying assumption is, of course, that when the human judges make their assessments of topical relevance, the relevance judgments resemble what you would expect from real-world situations. Of course, the human judges are making their judgments in a toy simulation, where they are confronted with different documents and have to say whether each is relevant to the query or not, and their perception of relevance could be different in a real-world situation, but usually it is quite consistent. Of course, there is the problem of inter-subject reliability: different people might have different notions of topical relevance. But usually the difference is not too large, so you could reasonably expect different experts, different judges, to come to the same conclusion regarding the query. And of course you are assuming that the relevance of individual documents can be assessed independently of other documents, which is basically a precondition of topical relevance as opposed to the other forms of relevance we already discussed. So we assume a document is either relevant or not, and this judgment does not change over time and does not change depending on the documents that the judge has seen before this one. Okay.
How exactly do we now do the evaluation of result sets? First of all, we take a look at the evaluation of systems that return result sets. Result sets means that they do not provide any ranking; for example, the Boolean model returns result sets. We will come to rankings a bit later, as this is a bit more complicated, so we are starting with result sets. The idea, again, is to compare the result set with the ground truth set. So this is the set of all documents judged as being relevant by the human judges, and this is what our algorithm thinks is relevant, which leads to a diagram of different sets. On the outside, this is the set of all documents contained in the collection. Then you have the set of documents judged as relevant by the human experts, and you have the set of all documents returned by your algorithm. Of course there is some overlap: these are the documents that are both relevant and returned, and of course you want this overlap to be as large as possible. In the ideal case, your algorithm returns exactly the set of relevant documents.
So, apart from the set in the middle, there are some other sets involved. Here is the set of all documents that have been returned but are not relevant. This set is called the false positives, that is, the set of irrelevant documents returned by the system. Of course, these are documents you do not want in your result set. The good news is that false positives can usually be sorted out by the user quite easily: you just skim through the list of results, and often it is enough to see at first glance whether a document is relevant or not. So in everyday use, false positives are often not a problem, as long as the list of results is not dominated by false positives. False negatives
are those documents that are relevant but have not been returned by the system. This is a really problematic thing, because these are documents you will never know about as a user. You are just confronted with the result set and have no idea whether there are any other documents that are relevant to your query, and you cannot do anything about it. So they are definitely worse than false positives.
The remaining sets are the true positives, that is, those documents that are relevant and have been returned by the system, and the true negatives, the area on the outside here: these are the non-relevant documents which have not been returned by the system. So you want many true positives and many true negatives in your system, and you don't want any false negatives or any false positives, where false positives are acceptable up to a certain amount, and false negatives are usually very critical, because as a user you have no chance to know anything about them.
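The four sets discussed above are plain set operations on the collection, the ground truth, and the result set. The following Python sketch shows this under the lecture's assumption of binary relevance; the function name `confusion_sets` and the toy document IDs are invented for illustration.

```python
def confusion_sets(collection, relevant, returned):
    """Split a collection into the four evaluation sets.

    collection: all document IDs in the collection
    relevant:   IDs judged relevant by the human judges (ground truth)
    returned:   IDs in the result set of the system under evaluation
    """
    collection = set(collection)
    relevant = set(relevant)
    returned = set(returned)
    return {
        "true_positives": relevant & returned,    # relevant and returned
        "false_positives": returned - relevant,   # returned, not relevant
        "false_negatives": relevant - returned,   # relevant, not returned
        "true_negatives": collection - (relevant | returned),
    }


# Toy data, invented for illustration: ten documents d1 .. d10.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d2", "d3", "d4"}
returned = {"d3", "d4", "d5"}

sets = confusion_sets(collection, relevant, returned)
for name, docs in sets.items():
    print(name, sorted(docs))
```

In this toy example the user sees d5 as a false positive and can discard it at a glance, while the false negatives d1 and d2 never appear in the result set at all, which is why the lecture calls them the more critical error.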
The next thing would be the definition of precision and recall, but we are now at the end of today's lecture, so we will continue with the precision and recall definitions next week. Thank you for your attention, and goodbye.