Probabilistic retrieval models (20.4.2011)
Formal Metadata
Title  Probabilistic retrieval models (20.4.2011) 
Title of Series  Information Retrieval and Web Search Engines (SS 2011) 
Part Number  3 
Number of Parts  13 
Author 
Balke, Wolf-Tilo

Contributors 
Selke, Joachim

License 
CC Attribution - NonCommercial 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and noncommercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
DOI  10.5446/350 
Publisher  Technische Universität Braunschweig, Institut für Informationssysteme 
Release Date  2011 
Language  English 
Producer 
Technische Universität Braunschweig

Production Year  2011 
Production Place  Braunschweig 
Content Metadata
Subject Area  Computer Science 
Abstract  This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course. 
Transcript
00:00
OK, nevertheless, let's just start and see how far we get, and whenever something is unclear, just step right in and tell me about it. We will basically be doing a lot of maths today, because we are doing probabilistic retrieval models. This is lecture number 3, and after the last lectures we are now moving on from the retrieval models we have seen so far into the wonderful world of probability theory. But first we should talk a little bit about the exercises that were handed out last week.
01:21
It was just a couple of questions for you, to repeat some of the concepts and to find out what was unclear. So, what is the relation between the Boolean retrieval model and the fuzzy retrieval model? How are they different? Exactly. And how are they similar? Right: both of them are basically set-based retrieval models, and both of them build on a bag-of-words model. OK, good.
02:57
What is the purpose of using the Jaccard index for measuring term similarity, and how is it computed? Basically you look at how often two words co-occur in the collection of documents, divided by the number of documents where either one or both of the words occur. So if you have term 1 and term 2 and a set of documents, you count the cases where term 1 and term 2 co-occur and divide by the cases where term 1 or term 2 occurs. It is basically the intersection divided by the union, if you want. A Jaccard index of 1 tells you that the terms always co-occur, and a Jaccard index of 0 that they never co-occur with each other. So what does that mean for the words? It tells you something about term-term similarity: words that occur together, especially if you consider a broad document collection, operate in the same context, and if two words are used in the same context, their meaning should be related somehow. That is the idea of term-term similarity. Of course this can go wrong. Can you give an example of terms that show a high Jaccard index but are not really related? Exactly: if you have a document base of computer science papers, they will all have an introduction, carrying the title "Introduction", and they will all be about computers. So these words always co-occur, but they are not related in meaning. Perfect, good.
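The intersection-over-union computation described here can be sketched in a few lines. This is a minimal illustration, not code from the lecture; the function name `jaccard` and the toy collection are assumptions made up for the example.

```python
def jaccard(term_a, term_b, docs):
    """Jaccard index of two terms over a collection of documents.

    Each document is modelled as a set of words (bag-of-words without counts).
    """
    with_a = {i for i, doc in enumerate(docs) if term_a in doc}
    with_b = {i for i, doc in enumerate(docs) if term_b in doc}
    union = with_a | with_b
    if not union:  # neither term occurs anywhere
        return 0.0
    return len(with_a & with_b) / len(union)


# Hypothetical toy collection of "computer science papers".
docs = [
    {"introduction", "computer", "retrieval"},
    {"introduction", "computer", "database"},
    {"introduction", "computer", "index"},
    {"retrieval", "index"},
]

print(jaccard("introduction", "computer", docs))  # always co-occur, yet unrelated in meaning
print(jaccard("database", "index", docs))         # never co-occur
```

The first call reproduces the pitfall from the discussion: "introduction" and "computer" get the maximum score only because both appear in every paper.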
06:42
What is the basic idea underlying Ogawa's approach to deriving the term weights for fuzzy retrieval? Well, basically the idea is that you weight terms higher that have something in common with the other terms of the document, and you weight terms lower that have nothing in common with the other terms of the document. It is kind of the amount of information that each term contributes to the document, and the Ogawa measure is one of the possibilities of deriving that degree of information.
08:15
Assume a document collection stored on disk using an inverted index. What is the computational complexity of calculating the cosine similarity between two documents, and what would you have to do? The cosine similarity is basically the scalar product between the term vectors of the two documents. So what does the inverted index look like? Term t1 points to some documents, each with a weight; term t2 points to other documents with their weights; term t3 again to some documents, and so on. What you do not have are the document vectors for those documents. For example, document d1 contains t1, does not contain t2, so there is a 0, and contains t3 and t4; the same goes for document d2, which does not contain some terms and does contain others. What you want to know is basically the scalar product. So the game is: you reconstruct the document representations from the index and find out which terms contribute to the result of the cosine measure. The vectors may be very long, but most entries are 0, and a 0 times anything contributes nothing. For the scalar product you only have to look at those entries where both documents have a non-zero value. How can I find out which these are? From the posting lists: I look into each posting list to find document d1 or d2. Because the posting lists are sorted, if I come across a later document without having seen the one I am looking for, I can stop scanning the list. So I basically only need the beginnings of the lists, which is efficient. OK.
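The procedure just described can be sketched as follows. This is a minimal sketch under stated assumptions: the index layout (term mapped to a list of `(doc_id, weight)` pairs sorted by document id), the weights, and the names `weight_in` and `cosine` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical inverted index: term -> posting list of (doc_id, weight),
# sorted by doc_id. The weights are made up for illustration.
index = {
    "t1": [(1, 2.0), (5, 1.0)],
    "t2": [(2, 1.0), (3, 1.0)],
    "t3": [(1, 1.0), (2, 1.0)],
    "t4": [(1, 1.0), (2, 3.0)],
}


def weight_in(postings, doc_id):
    """Scan a sorted posting list and stop as soon as doc_id cannot appear any more."""
    for current_id, weight in postings:
        if current_id == doc_id:
            return weight
        if current_id > doc_id:  # list is sorted, so doc_id is not in it
            break
    return 0.0


def cosine(index, doc_a, doc_b):
    """Cosine similarity of two documents using only the inverted index."""
    dot = norm_a = norm_b = 0.0
    for postings in index.values():
        w_a = weight_in(postings, doc_a)
        w_b = weight_in(postings, doc_b)
        dot += w_a * w_b          # non-zero only if both documents contain the term
        norm_a += w_a * w_a
        norm_b += w_b * w_b
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / math.sqrt(norm_a * norm_b)


print(cosine(index, 1, 2))
```

Every posting list is touched once, and each scan ends early thanks to the sort order, which is the efficiency argument made above.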
15:28
Now, just to be sure about the vector space model: what is the purpose of normalizing document vectors? Why do you normalize? That is basically correct. If I have a short document and a longer document that both contain some term, the term will tend to occur more often in the longer one. The term weight reflects how often the term occurs in the document. But if I am looking at cosine similarity, the angle between the documents does not depend on the length of the vectors. If instead you use Euclidean distance, the result does depend on the length of the vectors, and that is the reason why you normalize. So in the vector space model with cosine similarity you don't have to, but if you use Euclidean distance you must normalize, because I am not going to rate a document higher just because it is long. A term occurring more often simply because the document is longer doesn't make the document more relevant.
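A small numeric check makes the point concrete. This is an illustrative sketch, with made-up vectors: `long_doc` is assumed to be the same text as `short_doc` repeated three times, so every term count triples.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalize(u):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(a * a for a in u))
    return [a / length for a in u]


query = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 1.0]
long_doc = [3.0, 3.0, 3.0]  # hypothetical: the short document repeated three times

# The angle ignores vector length ...
print(cosine(query, short_doc), cosine(query, long_doc))
# ... but Euclidean distance does not, unless the vectors are normalized first.
print(euclidean(query, short_doc), euclidean(query, long_doc))
print(euclidean(normalize(query), normalize(short_doc)),
      euclidean(normalize(query), normalize(long_doc)))
```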
18:38
What is the basic idea underlying TF-IDF weighting? Yes: the IDF part basically measures discriminative power. With an example it becomes clear. Consider a collection of computer science documents: which words are rather discriminative between documents, and which are not? A word for a very rare concept in computer science makes finding a document about this topic in your collection a very easy task, because not many documents will contain this word. On the other hand, a word like "computer" will occur in every single one of the documents, so there is no point in using it for discriminating between documents. And what was the second part? TF basically measures the focus that the document has with respect to some term. If the term occurs often, normalized by the length of the document, the document seems to be more about this term, more focused on this topic, and this is expressed by the term frequency. OK, good.
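As a rough sketch of the two factors, here is one common TF-IDF variant (length-normalized term frequency times the logarithm of the inverse document frequency). The exact formula used in the course is not given in the transcript, so this variant, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math


def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in one document (lists of tokens).

    tf  = occurrences of the term, normalized by document length (focus)
    idf = log(N / df), where df is the number of documents containing the term
          (discriminative power)
    """
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


# Hypothetical computer science collection.
docs = [
    ["computer", "science", "index", "index", "retrieval"],
    ["computer", "network", "protocol"],
    ["computer", "graphics", "shader", "shader"],
]

print(tf_idf("computer", docs[0], docs))  # occurs everywhere, so it cannot discriminate
print(tf_idf("index", docs[0], docs))     # rare and frequent within the document
```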
21:42
What is the difference between prior and posterior probabilities, and what is Bayes' theorem? A good last question, because we will need that heavily today. A prior probability is unconditional, and a posterior probability is conditional on some other event. That brings us to Bayes' theorem, which states that the probability of some event A, given that some event B has occurred, can be computed from the probability of B given A. Exactly, so it is kind of an exchange: it relates the probability that something occurs under the condition that some other thing has already occurred to the reverse conditional probability. And the individual probabilities matter as well: whether the probability that A occurs is very small, or the probability that B occurs is small, also influences the posterior probability.
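Written out, the theorem recalled in this answer reads as follows. The event names A and B follow the spoken explanation; P(A) is the prior and P(A | B) the posterior.

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
```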
24:24
This is basically the theorem proposed by Thomas Bayes, and we will see a little bit more of it today. So on we go with probabilistic retrieval models. The idea of a probabilistic retrieval model is basically: how useful is a document for some user's query? To begin with, I have a large number of documents in my database, I pick one out and say there is a mechanism by which I can assign a probability to the usefulness of this document. If I can do that for all the documents in the collection and order them by that number, then this is a good ranking. This is the idea behind probabilistic retrieval.
25:52
So what is the idea? The idea is that if we were able to assign to each document in the collection a probability that this document is relevant with respect to a certain query, then we would have a criterion to order the documents, and we would have a ranking mechanism that would be perfect: deliver the most probable document first, then the second most probable document, and so on. The user would be happy. And the good thing about that, the reason why we want it to be a probability, is that we know a lot about how to handle probabilities mathematically. We can calculate things, we can estimate probabilities. This is basically how people set out in the late Sixties and early Seventies to define the usefulness of documents.
27:07
The basic idea behind this is the so-called probabilistic ranking principle: you get a good result if you order your documents by probability.
27:27
Actually, one of the best formulations is Robertson's statement of the probabilistic ranking principle from 1977: if a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request (this is what we want to look for today, the probability of usefulness of a document with respect to the request), where the probabilities are estimated as accurately as possible (so we did it right and didn't make any mistakes) on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users (so how well the system performs) will be the best that is obtainable on the basis of that data. OK, so far so good, everybody says that sounds fine. But there is one catch: the probability of usefulness. How do you know how useful a document is?
28:52
Let me rephrase: instead of usefulness, let's talk about relevance. Basically, I have the document representation, a set of words or whatever it may be in the model, and from that I can determine relevance: if the query terms occur often in the document, that is good; if they don't occur in the document, that is bad. So what we have to do is find out whether something is relevant or not, and we will calculate that using probability theory. What we have to assume is that if we have two documents that have the same representation, these documents should be either both relevant or both irrelevant. It would be kind of strange if one document were relevant and the other irrelevant, because I can't distinguish between them; they look exactly the same. And this is actually kind of a stretch. Think about timetables: say I want to take a train to Hanover. An old timetable looks very much like the current timetable. They will share the same words, probably mostly the same numbers, with a few different minutes. But one might be extremely relevant if you want to go to Hanover now, and the other might not be relevant at all, although they share almost the same representation. So sometimes it is not easy, when your representation abstracts too much from the document. On the other hand, if you look at simple text, the words might tell you what the document is about. If the word "computer" occurs often in some document, it might actually be a document about computer science, and if your query was "computer science", this document might be relevant. In many cases that works quite well. Still, it is an assumption that we are making.
32:12
So basically, take the notion of relevance. We have a collection of documents in some representation, whatever it may be, a bag of words or whatever you choose. And we have a query q that is fixed: "computer science", or "sunshine", or whatever it may be; it is just some query, but fixed. Then the set of relevant documents, call it R_q, is the set of all documents that are relevant with respect to that query. As simple as that, it is just notation: we have a set of relevant documents and a set of irrelevant documents. What we want is to determine this relevance for every document: which documents are relevant with respect to the query and which documents are not relevant with respect to the query.
33:43
So the task is: starting with a query q and a document d, what is the probability that the document is a member of the relevant set R_q? OK. And now: what does probability mean? Defining probability is always very hard. Probability is a strange term, and it may seem strange that we base our retrieval system on such a vague concept. In statistics, too, people often don't talk about it, because they don't like to talk about these things, but it is true that the meaning is contested. In probability theory there are two basic factions: the frequentists and the Bayesians.
35:19
The frequentists say that, basically, probability means the frequency with which you will get a certain outcome if you perform a sufficiently large number of random experiments. There are some very famous names here: Neyman, Pearson and Wald were frequentists. What you are doing is rolling dice. What is the probability of having six eyes on top? One sixth. And why is it one sixth? Because you throw the die 5 million times, or 20 million times, sufficiently often, and the number of experiments where the outcome was six will be about one sixth of the total number of tries. That is the frequentist view. The other faction are the Bayesians. They say probability has nothing to do with actually rolling the dice, with the experiment; it is rather the degree of belief that I have. Ideally, the die has six sides and there is no reason why one side should be preferred over another, so the probability of the different outcomes must be the same, which immediately brings me to one sixth for each side, without ever rolling a single die. It is not that I tried it out; it is my belief. And if something changes, so for example it is not a fair die but a loaded one, with lead or metal or something that makes one side heavier so that it tends to turn to the bottom, then the frequentist will notice only after a few million throws that something is odd, whereas the Bayesian can take the die, look at it, and change his belief because he has learned that it is not a real die. Prominent Bayesians, some of the most prominent protagonists of that group, were Bernoulli, Laplace and de Finetti.
39:23
So these are the basic ideas. If you have a repeatable random experiment that you can just try 100 times, 5 million times, 50 million times and so on, the probability of an event is just how often it occurs in the long run: its relative frequency. That is the frequentist interpretation: the event's probability is the limit of its relative frequency if you were to do an infinite number of experiments. One example is the dice: you just stop rolling the die at some point when the frequency has converged. Another is the probability that it will rain tomorrow. How is that? Well, you just take tables of whether it was raining or not raining on every day before; every day is basically a new random experiment. You look at it and say it rained or there was sunshine, and record that, and after a million days or 15 million days the figure will probably be something like 20 per cent, because in 20 per cent of the cases it rained. And tomorrow will be a new random experiment.
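The frequentist reading, probability as long-run relative frequency, can be illustrated with a small simulation. This is only a sketch: the function name `relative_frequency`, the fixed seed, and the numbers of rolls are arbitrary choices, and a pseudo-random generator stands in for a physical die.

```python
import random


def relative_frequency(face, n_rolls, seed=0):
    """Relative frequency of one face in n_rolls simulated rolls of a fair die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return hits / n_rolls


# The relative frequency of a six settles near 1/6 as the number of rolls grows.
for n in (100, 10_000, 200_000):
    print(n, relative_frequency(6, n))
```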
41:06
The idea of the Bayesian interpretation is the degree of belief. That can be subjective: I happen to know something about the die, that it is faked, so I would arrange the probabilities slightly differently. Or it can be objective: if I don't know anything about the die, I must assume that all six sides are equally probable, and equally probable means one sixth each. The thing is that you can play with updates: as I get to know something, it might change my belief in what will happen. The difference is that the frequentists do not have an update function. They say: OK, I run the random experiment and I get the limit, and that is the probability. Even if somebody tells me this die is faked, I cannot say how that will influence the probability, because I haven't tried it out. This is the basic idea. And why is it important to have both interpretations? Because very often in practice there are problems that do not come with a random experiment. For example, the question: is there life on other planets? Can you give a probability? If you were a frequentist, you would basically have to take a large number of planets, fly to one planet, see whether there is life, go on to the next one, and after a million of them you could probably say what the probability is. The Bayesian interpretation would be: we see so many planets, we know one that has life on it, and it looks like this, like Earth. So probably, if there are other planets that look like Earth, there might be life on them. I estimate the frequency of Earth-like planets in some reasonable and sensible way and then come up with a probability. If there are many Earth-like planets, and we know that life developed on Earth, then it is probable that there is life on some of them; if only a small fraction of planets is Earth-like, there would be a very small probability. But the reasoning is driven by belief: basically it is built on assumptions, not on looking at things but just on assuming.
44:39
Let's go into an example to make it a little clearer. A professor has a book lying on his desk, and what we know is that it is about one of the following two topics: information retrieval or animal health. What is the probability that the book is about information retrieval? The frequentist will say: we cannot say that, because this is not a random experiment. It is an object, a book, something of the real world; it lies on the desk, and either it is on information retrieval or it is not. There is no randomness, so probability does not apply. The Bayesian would say: well, there are many books on information retrieval and many books on animal health, so it is like 50-50. Then I learn that he is a professor of databases, so it might be more probable that the book is on information retrieval than on animal health; let's say 80 per cent for information retrieval and only 20 per cent for animal health, because it is possible that he has a sick dog or a cat that needs attention. It might be, but it is not very probable. So these are the different views.
46:24
And you can always argue that there is a correspondence between them, although it is not easy to find. You could say: OK, I can build a random experiment out of everything. Let's say it is not a book lying on my table; instead I run a random experiment. I go to the university library, which has books about animal health and information retrieval, pick up a book at random, put it on my desk, look at it, and bring it back to the library. Then I take a new one, put it on my desk, look at it, and bring it back to the library. That is a random experiment, and we can ask: what is the probability that this random book is about information retrieval? But for the frequentist that does not work if I refer to a specific book.
47:36
That is basically the problem: as soon as you try to nail it down to that particular book lying on my desk, the frequentist will immediately say no, you cannot refer to this book; you have to refer to a random book, even if it is instantiated by the one on your desk. It is kind of interesting that this debate has been raging for the last 200 years in statistics.
48:22
Another example: I roll a die and keep it covered. What is the probability of some number? Again, the frequentist would say: the die has been rolled, there is no randomness here. It is a 6 or not a 6, it is a 3 or not a 3; it is not a question of probability. The Bayesian would say: I believe it to be a 6 with probability one sixth, because all sides are equally probable.
49:02
The Bayesian view can also change: I uncover the die and it becomes clear whether it is a 6, a 5, a 4 or whatever. The frequentist would say nothing has changed: before, it was just that you did not know it, and this is just one instance in a long chain of random experiments, so the single instance is of no interest. The Bayesian would say: no, this is a real probability; I gained knowledge that I did not have before, and now I can recalculate the probability that it is a 5, a 4 or a 6, because I have seen it, and that changes my belief. So on the Bayesian side we have the idea of updating the probability: you see evidence, and that can change it. On the frequentist side we have the idea that the experiment is fixed, it has to be random, it is done over a long, long time, and the convergence point, the limit, is what the probability is.
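The Bayesian update on the covered die can be written down directly. This is a sketch under stated assumptions: the intermediate hint "the number is even" is not in the lecture and is added only to show a partial update, and the name `bayes_update` is made up for the example.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Posterior P(face | evidence) from prior P(face) and likelihood P(evidence | face)."""
    unnormalised = {face: prior[face] * likelihood[face] for face in prior}
    evidence = sum(unnormalised.values())  # P(evidence), assumed to be non-zero
    return {face: p / evidence for face, p in unnormalised.items()}


# Covered die, nothing known: every face is believed equally probable.
prior = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical hint (not from the lecture): someone tells us the number is even.
even_hint = {face: Fraction(1 if face % 2 == 0 else 0) for face in prior}
after_hint = bayes_update(prior, even_hint)

# Uncovering the die and seeing a five removes all uncertainty.
seen_five = {face: Fraction(1 if face == 5 else 0) for face in prior}
after_uncovering = bayes_update(prior, seen_five)

print(after_hint[6], after_uncovering[5])
```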
50:34
Now, how does relevance relate to all that? I think we get the feeling that the probability that some document is relevant with respect to a certain query should be Bayesian. It is rather that we believe this document matches the query, than that we draw random documents, check whether they match, and count at the end. Bayesian probabilities seem to be the basis here: everything I get to know about the document will change my belief that it is relevant, and everything I get to know about the user will change my belief that the document is relevant. Still, the problem is what this relevance actually is and what it looks like. The Bayesian approach is to express the uncertainty, the fact that we don't know what makes a document relevant or irrelevant, in terms of probability. If we don't know, we just have to assign reasonable probabilities to events; that is subjective. But it makes things easier, because we can just make some reasonable assumptions and derive the probabilities from them. That is what we did in the book example: we assigned some probabilities by making assumptions, and you could estimate 0.5 or whatever because we know that there are books in the library about information retrieval and about animal health. What we finally have to do is just estimate the probability that some document is in the relevant set or not.
53:02
Basically, today we will talk about two probabilistic retrieval schemes, both relying on this notion. One considers the query to be a random variable, the other considers the document to be a random variable: in probabilistic indexing the query is the random variable, and in binary independence retrieval it is the document. This is going to be a little bit mathematical, and it may be new to you, but don't get scared: it is actually very easy, it is basically just using Bayes' theorem. So let's start with the first one.
53:57
Probabilistic indexing was presented by Maron and Kuhns in 1960, so it was a very early approach. The idea is: you have a number of index terms, a vocabulary of size k. The documents are term vectors with weights, so a term more or less describes a document. A query is just binary, as in the Boolean model; there are no weights in it. What we have to find is the probability that some document d is in the set of relevant documents, that is, the relevance of the document d with respect to the fixed query q.
55:07
To do this, we define a random variable Q whose range is the set containing all possible queries, and we draw one query from this big set of all possible queries that could ever be asked. What can we say about the distribution of Q? It is just how often each query has been asked in the past. Assume we have only two terms; then the possible queries are those asking for the first term only, for the second term only, or for both. And I can record the probabilities as the frequency with which something happened: maybe two people asked for the first term only, seven people asked for the second term only, and one person asked for both. This reflects the frequentist idea of probability: the probability that a query asks only for the first term is 0.2, the probability that it asks only for the second term is 0.7, because that was asked more often, and the probability that it asks for both terms is 0.1. It is the fraction of events in which the query was previously used.
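Estimating the distribution of the random query Q from past usage amounts to counting. The sketch below encodes each binary query over two terms as a tuple; that encoding, the log itself, and the name `query_distribution` are assumptions for illustration, with the counts 2, 7 and 1 taken from the spoken example.

```python
from collections import Counter


def query_distribution(query_log):
    """Estimate P(Q = q) as the relative frequency of q among past queries."""
    counts = Counter(query_log)
    total = sum(counts.values())
    return {query: n / total for query, n in counts.items()}


# Binary queries over two terms: (1, 0) asks for the first term only, and so on.
# Counts follow the spoken example: 2, 7 and 1 past users.
log = [(1, 0)] * 2 + [(0, 1)] * 7 + [(1, 1)] * 1
p_query = query_distribution(log)

print(p_query)
```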
56:57
If we now take a random query Q, so we fix it but we don't say what it is, then the set of documents relevant with respect to it, R_Q, is a random set of documents, depending on the query at hand. And the probability that some document is relevant with respect to q is nothing else but the probability that this document is in the set of documents relevant to the random query Q, given that the query is q. I have just translated it into a conditional probability: first I choose some query at random; after I have found out what this query is, I ask how probable the relevance of the document is.
58:16
And this is a conditional probability, so we can apply Bayes' theorem. Technically, what we get is basically the conditional probability with the two events switched, together with the two individual probabilities. These are the new things that we have to look at. And the good thing about doing this is that we can discuss every factor on its own, individually, and see what it actually means.
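Spelled out, the step described here looks as follows. The slides are not part of the transcript, so the notation is an assumption: d is the document, q the fixed query, Q the random query, and R_Q the (random) set of documents relevant to Q.

```latex
P(d \in R_q) \;=\; P(d \in R_Q \mid Q = q)
\;=\; \frac{P(Q = q \mid d \in R_Q)\; P(d \in R_Q)}{P(Q = q)}
```

The right-hand side has three factors, which can be examined one at a time: the probability of drawing the query, the query-independent probability that the document is relevant, and the switched conditional probability.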
59:03
that mean? This is the probability that we draw some query. Does that have anything to do with the document? No, it is the same for different documents. And since it is the same for all documents, the probability for each document does not depend on it, so we can argue it away, and that is basically what we will be doing now. So basically we just put it into some constant at this point. OK with this? Good. Nothing new so far.
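Written out, the step above looks as follows. The notation is a reconstruction from the spoken description, not a copy of the slide: R_Q is the set of documents relevant to the random query Q, and c is an assumed name for the document-independent constant.

```latex
\Pr(d \in R_Q \mid Q = q)
  = \frac{\Pr(Q = q \mid d \in R_Q)\,\Pr(d \in R_Q)}{\Pr(Q = q)}
  = c \cdot \Pr(d \in R_Q) \cdot \Pr(Q = q \mid d \in R_Q),
\qquad c = \frac{1}{\Pr(Q = q)}
```

Since c is the same for every document, it has no effect on the ranking.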
59:54
Oops. Let's look at this probability. It is the probability that a specific document is in the set of relevant documents for any query, big Q. OK? This does not depend on the query, the small q, any more. Basically it means the number of times the document is relevant with respect to any query, whatever it may be. It does depend on the document, so we can estimate it for every single document, and we may use a feedback mechanism to record whether some document has been relevant with respect to a query. No problem. [inaudible] I can count how often the users rated the document as relevant, and so on. So basically it is the relevance frequency, the popularity of this specific document as opposed to other documents. OK? Well, discussed that away too; we can somehow estimate it. Now we still have this one. What does that mean? It means we first pick some document that is relevant with respect to any query, OK, and then we ask how often the query that it was relevant to is actually the query we are interested in. OK? Well, if we
1:02:14
assume that the query consists of multiple terms, and if we assume that these terms are independent from each other, then we know from probability theory that we can just multiply the probabilities. If two events occur and they are independent, then both probabilities are multiplied to give the joint probability. So I can do that: I just pull the different query terms apart and multiply up the individual probabilities. OK? Now, is that OK? Not truly. What happened? Well, I made a very strong assumption: that the terms in the query are independent from each other. Is that true? There might be query terms that are not independent. For example, I want to find a book about computer science, but I also might think, well, let's include all the books about informatics. Those will probably not be independent of each other. Still, I just multiply the probabilities. It is an assumption, and it may or may not hold, but it is an assumption that we have to keep in mind. OK?
1:04:27
Good. So we have this part, which is kind of the interesting term that we have to look at. What we can do now is split it up into those factors where the query term is set and those where the query term is not set. Basically, those are the ones where a specific query term is set to 0 and the other ones where a specific query term is set to 1. Remember what a query looks like: I want term 1, I don't want term 2, I don't want term 3, I want term 4, I don't want term 5, and so on. It is a binary vector, right? And then the probabilities for this one and this one both go here, and the probabilities for this one and this one go there. OK? I just sorted the product: here q_i is 0, and here q_i is 1. OK? Because q_i is either 0 or 1, I just split it up into those terms that are 0 in the query and those that are 1. Now I can look at the complementary event, because the probability that something is 0 is 1 minus the probability that something is 1. So I put it in. And why do I do that? Because then this and this are the same probability, only in a different product. It is still the same distribution; it contains the same things. Good. So it looks
1:06:55
nice, somewhat more manageable, in a way. But still I have no idea how to calculate that, and I cannot argue it away. At the moment, what it says is: first I pick a document that is relevant with respect to any query, some unknown query, and then I determine whether some query term is actually there. It is the probability that the query term was in the original query for which the document was relevant. OK? So, given that the document is relevant for some query, what is the probability that this query contained term i? [Student question, inaudible] Well, yes... no, that is not the precision measure. The precision measure is basically defined on result sets, so that is not with respect to random queries but with respect to some specific query. But basically you are right, in a way. It is not the precision; it is basically how often a document is relevant for any query: is it a popular document, or is it a document nobody likes? And then you take the second part, which is the probability that the query contained term i. So, given that the document was liked by many people, what is the probability that the query for which they liked it contained the word "computer science", or "information retrieval", or whatever. OK?
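The factorization and the split into set and unset query terms can be written down compactly. This is a reconstruction under the term-independence assumption discussed above; Q_i denotes the i-th component of the random query and q_i the corresponding bit of the concrete query (both symbols are notation assumed here, not taken from the slide).

```latex
\Pr(Q = q \mid d \in R_Q)
  = \prod_{i} \Pr(Q_i = q_i \mid d \in R_Q)
  = \prod_{i:\, q_i = 1} \Pr(Q_i = 1 \mid d \in R_Q)
    \cdot \prod_{i:\, q_i = 0} \bigl(1 - \Pr(Q_i = 1 \mid d \in R_Q)\bigr)
```

After the split, only one kind of probability is left to estimate, namely Pr(Q_i = 1 | d ∈ R_Q).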
1:09:33
Good. At the time, Maron and Kuhns had to estimate that somehow, and they said: basically, we should be able to estimate this probability by the weight of the term as assigned by the human indexer. So say a document contains the term "computer science", and somebody said: well, this is a very popular computer science document, so the term "computer science" is very relevant for this document; I give it a weight of 0.8. Or somebody said: well, this document contains the word "computer science", but it is not really about computer science. It just says "now we will not talk about computer science, we will talk about the humanities", and then goes on rambling about the humanities. So the value of the word "computer science" for this document is about 0.04, very small. And this can be used to estimate the probability that we are looking for. OK? This is kind of a very reasonable assumption, because the indexers are the ones who actually assign the weighting for each term and consider what something is relevant with respect to. You can also do it automatically, of course: you can say it is term frequency times inverse document frequency. Or you can do it manually, by just looking through the documents. But whatever way this number is arrived at, it gives you an interpretation of how relevant the document is with respect to a query term. So basically, the index term weights reflect degrees of belief: either I believe the term "computer science" is relevant with respect to the document, because this is a wonderful book about databases; or I think the term "computer science" is not really interesting for this document, because the document is some economics book that just mentions computer scientists at some point. OK? I think you get the sense of it.
1:13:00
Then we basically put in the document weights for these probabilities, and we see: we have a constant, which is not dependent on the document as such; we have the probability that the document is relevant with respect to any query, basically a measure of the popularity of the document; and we have this term here, which we can actually calculate, knowing the query and knowing the document's index weights. And actually, some people argue that you should cut out this one [the product over the terms not in the query] altogether and just concentrate on what the user really asked for, and not consider what the user did not ask for. It is kind of positive feedback versus negative feedback. Many people say it is actually quite random which terms are left out. You could ask for a document about computer science and could also have said it should be about information retrieval, it should be about databases, but you thought "computer science" is enough to describe my information need, and so you left out the word "databases". That does not show that I do not want documents about databases; it just shows that I had said enough, given that queries have few terms in most cases. I mean, how many keywords do people type into a web search engine, for example? Any guesses? On average? [Student answers] More or less. It is something like 2.3 to 2.6, I think; the statistics vary. So really, mostly two to three words, not more, not less. Three words usually are sufficient to transport the information need. So you cannot count what the user did not write, because obviously he did not write all the synonyms of computer science, "informatics" and the like, and still was interested in documents where it is just called differently. So the pragmatic approach is: I don't care about this one. OK? This I can estimate, this I can calculate, and this is what I was interested in. The result goes
1:16:20
so: the probability in this case is influenced by the general relevance of the document, and I can also give you an idea why this is really sensible, that it positively influences the probability that a certain document is relevant with respect to the query. Consider two documents, two books, and there was a query about computer science. OK? And both books contain the word "computer science", so they would be on a par there. But one book is a bestseller, and the other book is liked by nobody. Which one would you rather have? Well, the bestseller, of course, because other people believe it to be good. We are all like that: people are all reading Harry Potter or something like that, not because everybody likes it, but it was one of the bestsellers, so, well, I would rather start with the bestseller. So it should probably be on the list rather than some obscure volume that
1:18:02
nobody really likes. So again, it is kind of reasonable that this will positively influence the probability of a document being relevant. OK? [inaudible]
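The resulting ranking formula of probabilistic indexing can be sketched as a small function. This is a minimal illustration, not the original system: the function name, the dictionary layout and all numbers are assumptions, and the product over the terms missing from the query is dropped, following the pragmatic argument made above.

```python
from math import prod


def probabilistic_indexing_score(popularity, index_weights, query_terms):
    """Rank score proportional to Pr(d in R_Q | Q = q).

    popularity    -- estimate of Pr(d in R_Q), e.g. from relevance feedback
    index_weights -- dict term -> weight in [0, 1] assigned by the indexer,
                     used as the estimate of Pr(Q_i = 1 | d in R_Q)
    query_terms   -- terms the user actually typed (q_i = 1)

    The constant 1 / Pr(Q = q) and the product over the terms that are
    not in the query are left out.
    """
    return popularity * prod(index_weights.get(t, 0.0) for t in query_terms)


# Two hypothetical books with the same indexer weight for the query term;
# only their popularity differs (all numbers are made up).
query = ["computer science"]
bestseller = probabilistic_indexing_score(0.30, {"computer science": 0.8}, query)
shelf_warmer = probabilistic_indexing_score(0.01, {"computer science": 0.8}, query)
print(bestseller > shelf_warmer)
```

With equal term weights the popularity factor alone decides the order, which is the bestseller argument in code form.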
1:18:24
Yes? [Question from the audience, inaudible]
1:18:28
[inaudible]
1:18:47
Actually a very good question, because making the query the random experiment seems to be a very bad idea. And actually, this was done during the sixties, OK, this was the first try, and they said: well, we have the queries, we can look them up, we can see what the distribution of queries in the query log is, and it is a small system, so we can handle that. The idea was that the number of terms was somewhat smaller than the number of documents, so that you would be able to just combinatorially express all the queries: basically every term with every other term, three terms together, four terms together, and so forth. OK? And this is actually where the next step was taken, which was in the late 1970s: an approach which says that actually
1:19:59
the query is not really the random variable; the random variable is the document, and
1:20:09
so on. This was this guy, [name inaudible], one of the founders of information retrieval. And this is what it looks like: we look at the probabilities again, and the documents are basically vectors in the set-of-words model, binary vectors; nothing new happening here. A word occurs, or it does not occur, in some document. And the query really is the same: a word occurs or it does not. And now I want to know this probability with respect to q, the query. Contrast that with what we did before: now it is a set-of-words model for the documents and the query, both binary.
1:21:06
[inaudible]
1:21:13
[inaudible]
1:21:17
[inaudible]
1:21:28
[inaudible]
1:21:35
[inaudible]
1:21:37
[inaudible]
1:21:40
You know, before, it was an indexer: every word had a weight with respect to the document, like in fuzzy retrieval. And now it is binary. Everything else is the same as before; the change is really here. OK? We don't have weighted index terms any more; we just have: a word occurs or does not occur. We don't have "a word is important with respect to the document" or "less important with respect to it". Still, we want
1:22:25
to have the same
1:22:27
probability: we want to know
1:22:29
the probability that,
1:22:31
given some specific
1:22:32
query q, a specific
1:22:36
document is in the relevant set or not. So
1:22:43
this is what we want. What do we do? We do exactly the same thing as before. I mean, I would really have loved doing research in the early information retrieval days, because they did the same things over and over again. Well, what
1:23:00
do we do? Well, we basically decide on a random variable. This time we take the documents: the range of the random variable D is the set of documents. We can use it to express the probability that some specific document is relevant with respect to some specific query q. So I pick some specific document from the set of all possible documents and then compute how probable it is that the document is relevant. That is to say, OK, I instantiate the random variable D with the small d and then compute how probable it is that D is in the relevant set of query small q. OK? I just changed the random variable, nothing else. Good. How was the probabilistic indexing, just to refresh it? Well, there we took Q as the random variable: given that the query is some specific q, what is the probability that the specific document small d is in the relevant set of this query? OK? They are similar, and how you deal with it is very similar, because what you do is
1:24:51
you apply Bayes' theorem. But you see, I still have a random variable, and that is the D. And you
1:25:07
find out that this probability here is not dependent on the small d. It is basically the same for all documents small d. Again, we put it into some constant; we don't care about what it is. It is the same for every single document, for all individual documents. OK? I have just argued it away, and we are left with two terms.
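For reference, the Bayes step for the document-as-random-variable view can be written as below. The notation (D for the random document, R_q for the set of documents relevant to the fixed query q) is reconstructed from the spoken description and is an assumption about what the slide shows.

```latex
\Pr(D \in R_q \mid D = d)
  = \frac{\Pr(D = d \mid D \in R_q)\,\Pr(D \in R_q)}{\Pr(D = d)}
```

Here Pr(D ∈ R_q) is the document-independent constant, so the two terms that remain are Pr(D = d) and Pr(D = d | D ∈ R_q).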
1:25:45
The first is the probability that a document from the collection is a specific document. But what does "a specific document" mean? We are considering the vector representation of the document. All documents in which the same words occur, the same set of words, cannot be distinguished in the set-of-words model; they are just identical. If every document in the collection consists of different words, maybe overlapping, but always different, then the probability that you pick some specific document is just 1 divided by the size of the collection. If some documents share the same representation, then this type of document becomes more probable. OK? So basically it is not very interesting. Actually, this factor could be calculated: we know the distribution of the representations. But what would it do? It would work like this: the more often a certain document representation occurs, the smaller the probability of each of those documents being relevant would be. That seems kind of strange, doesn't it? A certain style of document is popular in a collection, and therefore it is less relevant? Why? Only because it is more popular in the collection? That is strange. So what
1:28:11
you would actually do is to say: well, I want to get rid of this. So I am not looking at this probability any more; I am looking at the odds ratio of the probability, I am looking at the probability divided by its counter-probability. The odds that some document is in the relevant set of some query is basically the probability that it is in the set divided by the probability that it is not in the set. And the beauty of the design is... [student answers] Yes, exactly, it cancels out. So basically, in the question whether something is in the relevant set or not, Pr(D = d) is not part of this fraction any more. OK? So what
1:29:31
happens now is
1:29:34
basically that I can look at these things differently, and actually, for ranking, looking at the odds ratio instead of at the original probability is exactly the same, because you can
1:29:54
compute it: the probability p goes to p divided by 1 minus p. So the odds ratio, basically, that is the plot of it here, on a logarithmic scale; I just scaled it so you can see it. But what you can see is that the function here is monotonic: the higher the probability is, the higher is also the odds ratio. It can never happen that the probability is higher, so you would have to rank one document higher, but its odds ratio is somehow less than some worse document's odds ratio. And since this is monotonic, we can directly replace the probability p by the odds ratio p divided by 1 minus p. OK? So we can do that. We could
1:31:06
say: well, I open up this term and I open up this term. I can do exactly the same with Bayes' theorem, just as before, and discuss away the constant term. And now the
1:31:27
thing that happens is exactly what he was saying: I have those probabilities, that D is d, here and here, and if I build the ratio they cancel out. OK? And this thing here does not depend on the small d but only on the query q. Right? OK, so again it is a constant somehow that we don't care about. OK? Basically, if you put that together, we arrive at the odds ratio that some document is relevant with respect to some query. It is given by the probability that D is instantiated with d, given that it was relevant with respect to the query, divided by the same probability given that it was not relevant with respect to the query, multiplied with a constant. OK? But still,
1:32:43
we cannot compute these beasts here, and we have no idea what they should be. And the assumption that we now make is called linked dependence. We say again: well, basically I can break documents apart into the words that they contain or do not contain, and I don't consider entire documents, I only consider the terms the document is made up of. OK? And if a term is relevant with respect to the query, it positively influences the relevance of the document it is contained in with respect to the query. And assuming that the terms are independent from each other, we can again multiply these, as I was doing here. So for all the terms that may or may not occur in the document, I just multiply the probabilities that they make the document more relevant or less relevant. Good. The argument whether this assumption is true is exactly the same as before: are they really independent? Certainly not: synonyms are not independent. Still, let's just assume it for a minute, because otherwise we cannot calculate it. Let's just say this would be a good assumption. And it kind of shows that this is an
1:34:46
assumption that you can probably make. Now we again split it up, by sorting the terms: OK, this one is in the document and this one is not in the document, just as we did before with the query terms. And again, what we have here: I wrote here D_i is 0, but the probability that D_i is 0 is 1 minus the probability that D_i is 1. The same thing as before; we just replace it. OK? Nothing new. Now, all
1:35:35
these terms look alike: OK, documents containing the term and documents not containing the term. Now, we also have the terms in the query, so we split up again, into the query terms and the non-query terms. What happens here is that each product is split into two parts: one part where the query term is given, and one part where the query term is missing. The same happens down here, and we end up with these four factors. What do they tell us? This one considers the terms that are in the document but do not occur in the query. This one considers the terms that are not in the document but do occur in the query: this is bad. This one considers the terms that are not in the document and also not in the query: this is irrelevant, because neither the document nor the query contains the word. And this one is the terms that do occur in the document and do occur in the query: this is good. OK? These are the four factors that we have to look at. And this happens
1:37:50
again [inaudible]. We have to make some assumptions. So, for any term that does not occur in the query, OK, we assume that the probability of the term occurring in the document does not reflect on whether this document is relevant or not. If I have not asked for a term, does it matter whether it occurs in the document? Probably not. For example, if I ask for "computer science", do I care whether the document contains the term "chair" or "table" or not? I did not ask for it, so the fact that it occurs in the document will not make it more or less relevant with respect to the query. That must be the case, of course. The idea, put a little bit more technically, is that relevant and non-relevant documents have identical term distributions for non-query terms. OK? What happens then? Well, assuming that, the probability for something not in the query is the same in the numerator and in the denominator. If that is true, then two of the four products cancel out, because they are always 1; they are not of interest any more. I am only interested in those terms that occur in the query. And now you see the beauty of the design: for the query terms, I look at the probability that the document in question contains the term or does not contain the term, abstracted from everything else that the document might contain. I can cancel all the rest out; I am not interested in it any more. I only care, if the user mentions a term in the query, whether this term is in the document. That is all I care about. So, good.
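The chain of steps just described (odds, Bayes, linked dependence, dropping the non-query terms) can be summarized in one derivation. This is a reconstruction from the spoken explanation; D_i and d_i denote the i-th component of the random document and of the concrete document, q_i the i-th query bit, and c is an assumed name for the document-independent factor.

```latex
\begin{align*}
\frac{\Pr(D \in R_q \mid D = d)}{\Pr(D \notin R_q \mid D = d)}
  &= \frac{\Pr(D = d \mid D \in R_q)}{\Pr(D = d \mid D \notin R_q)}
     \cdot \underbrace{\frac{\Pr(D \in R_q)}{\Pr(D \notin R_q)}}_{c} \\
  &= c \cdot \prod_{i}
     \frac{\Pr(D_i = d_i \mid D \in R_q)}{\Pr(D_i = d_i \mid D \notin R_q)}
     && \text{(linked dependence)} \\
  &= c \cdot \prod_{i:\, q_i = 1,\, d_i = 1}
     \frac{\Pr(D_i = 1 \mid D \in R_q)}{\Pr(D_i = 1 \mid D \notin R_q)}
     \cdot \prod_{i:\, q_i = 1,\, d_i = 0}
     \frac{\Pr(D_i = 0 \mid D \in R_q)}{\Pr(D_i = 0 \mid D \notin R_q)}
     && \text{(factors with } q_i = 0 \text{ equal 1)}
\end{align*}
```

Note that Pr(D = d) has cancelled in the first line, which is exactly why the odds were introduced.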
1:41:07
This is what we are left with. Now I do a wonderful trick: I multiply by 1. Of course, multiplying by 1 doesn't change anything, and of course that is not a very sensible thing to do, unless I write the 1 in a very clever way. And this here is a 1, because this is the same as that, and this is the same as that; one is the inverse of the other. But writing it in this way is going to do me a good favour, because it has the same structure, the product over d_i = 1 and q_i = 1. I just regroup. What happens then is: I take this part here and put it together with this part, and this part I put together with that one. What you can see now is: for those factors where the query term occurs, we have these blocks that are totally identical. OK? I have it for d_i = 0 and for d_i = 1, so it does not depend on the document, on whether the term is in the document, any more. OK? And the other part: I multiply this thing with the 1 that I just wrote in that way, multiplying numerator with numerator and denominator with denominator. So I am going to multiply this part with this part, and this part with this part. OK? If I multiply this part with this part, what I get is exactly the same factor; the only difference is that this one has d_i = 1 and q_i = 1, and that one has d_i = 0 and q_i = 1. This is the same expression. So basically, d_i may be 0 or 1, whatever it is; q_i can only be 1 here. So it does not depend on the term being in the document any more. OK? I collect d_i = 1, q_i = 1 and d_i = 0, q_i = 1; I put them together, and they make up the same factor. So after multiplying, OK, it is just shifting and regrouping. OK? Good.
1:45:09
This is what I get. As I just argued, this is not depending on the document any more: it really does not matter whether the term occurs in the document or not. So it is the same for every document, and I put it into a constant, call it c prime, OK, some constant. And then I have to look at this part. What does this part actually say? It says I have to look at the terms that occur in the query and at the documents where this term occurs. Good. And then I have to calculate the probability that D_i is 1 or 0, given relevance or non-relevance. And this thing is what we have now: this is D_i = 1, and this is the counter-probability, and the same here. Just regrouping this gives me these factors. OK? I just changed things around, for the operation. OK?
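The multiply-by-one regrouping can be checked term by term in the following reconstruction. The symbol c' for the document-independent product is an assumed name; the factors themselves follow from the two products that were left after dropping the non-query terms.

```latex
\begin{align*}
&\prod_{i:\, q_i = 1,\, d_i = 1}
   \frac{\Pr(D_i = 1 \mid D \in R_q)}{\Pr(D_i = 1 \mid D \notin R_q)}
 \cdot \prod_{i:\, q_i = 1,\, d_i = 0}
   \frac{\Pr(D_i = 0 \mid D \in R_q)}{\Pr(D_i = 0 \mid D \notin R_q)} \\
&\quad= \prod_{i:\, q_i = 1,\, d_i = 1}
   \frac{\Pr(D_i = 1 \mid D \in R_q)\,\Pr(D_i = 0 \mid D \notin R_q)}
        {\Pr(D_i = 1 \mid D \notin R_q)\,\Pr(D_i = 0 \mid D \in R_q)}
 \cdot \underbrace{\prod_{i:\, q_i = 1}
   \frac{\Pr(D_i = 0 \mid D \in R_q)}{\Pr(D_i = 0 \mid D \notin R_q)}}_{c'\ \text{(same for every document)}}
\end{align*}
```

The inserted 1 is the factor Pr(D_i = 0 | D ∉ R_q) / Pr(D_i = 0 | D ∈ R_q) times its inverse, taken over the terms with q_i = 1 and d_i = 1; one half joins the first product, the other half completes the second product to run over all query terms.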
1:46:57
Still, we need to compute the factors. What can we do? This is the probability that, if some document is not relevant to some specific query, it contains a certain term, D_i = 1. Now, obviously, almost all documents will not be relevant to a specific query; the probability that any given document is relevant to a specific query is very small. So we could say: I can estimate this probability just by "term contained" or "term not contained", and skip the relevance part, looking at documents in general. OK? Is this a reasonable way?
1:48:18
Yes? [Student comment, inaudible] Yes, nice. Basically, if we look at a document and see whether it is relevant with respect to a query, this should not depend on this document being relevant to any other query, or on how many other documents are relevant with respect to this query. It should only depend on whether some terms occur in this specific document. OK? And basically, the story I am telling here is this: we have a number of documents, d1, d2, d3 and so on, that are relevant or not relevant with respect to some query. OK? Looking at document d3: should whether other documents are interesting for the query or not influence how relevant d3 is? Probably not. On the other hand, looking at the terms contained in d3: should that influence how relevant document d3 is? Probably yes. And this is basically the first part here: some term occurs in document D_i. OK? So we just say: let's skip the information how many relevant or irrelevant documents there are with respect to some specific query; let's focus on the information what terms occur in each document or not. This is the basic idea behind this assumption. So if we do this, we just cut this part out and consider only the first part. Again, this is an assumption, but it works quite well, as we will see. So the question is now: we have the probability that the term is contained in a document, or the probability that it is not contained in the document. Basically, this is what we all know: this is the document frequency. The probability that in any document some term is contained means: how often is this term contained in documents, divided by the size of the collection. The probability that some term occurs in a randomly drawn document is just how often this term generally occurs in the collection, divided by how big the collection is. OK? But if we put 
it all together and
1:52:42
we just put in the estimates for the probabilities, we end up with the following: the odds for some specific document being relevant with respect to some specific query are proportional to the size of the collection, the larger the collection, the smaller the probability that exactly this one document is relevant, makes sense, doesn't it, divided by the document frequency of the term. The more often the query term occurs in the documents of the collection, the less probable it is that a document containing exactly this term is relevant. As an example, think about the term "computer science": it is not a very descriptive term; it will occur in every document about something that has to do with computer science. Now take exactly that and consider the query "hyperbaric chamber". The bigger the collection, the less probable it is for each document to be relevant. And on the other hand, documents containing only the term "computer science" are less probable to be relevant with respect to their query than those containing the word "hyperbaric chamber", because the document frequency of that term is lower. OK? This is the idea behind the whole derivation. We made a lot of assumptions along the way, to find out that it is actually quite simple: it depends on the collection size and on the document frequency of the terms in the query. And we only have to look at the terms that occur both in the query and in the document; the terms not occurring in the document or in the query are not interesting.
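The final ranking function can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, Pr(D_i = 1 | D ∉ R_q) is estimated as df_i / N as described above, and `p_rel` stands for the remaining factor Pr(D_i = 1 | D ∈ R_q), which is simply passed in as one fixed value for all terms. The value used in the example is a placeholder.

```python
def bir_odds_score(doc_terms, query_terms, doc_freq, n_docs, p_rel):
    """Odds-based rank score of the binary independence retrieval model,
    up to a document-independent constant factor.

    doc_terms   -- terms occurring in the document (d_i = 1)
    query_terms -- terms occurring in the query (q_i = 1)
    doc_freq    -- dict term -> number of documents containing the term (>= 1)
    n_docs      -- size of the collection N
    p_rel       -- assumed Pr(D_i = 1 | D in R_q), with 0 < p_rel < 1
    """
    score = 1.0
    # Only terms that occur in both the query and the document contribute.
    for term in set(doc_terms) & set(query_terms):
        p_nonrel = doc_freq[term] / n_docs  # estimate of Pr(D_i = 1 | D not in R_q)
        score *= (p_rel * (1.0 - p_nonrel)) / (p_nonrel * (1.0 - p_rel))
    return score


# Hypothetical collection statistics: one frequent and one rare term.
N = 1000
df = {"computer science": 500, "hyperbaric chamber": 5}
query = {"computer science", "hyperbaric chamber"}

score_common = bir_odds_score({"computer science"}, query, df, N, p_rel=0.9)
score_rare = bir_odds_score({"hyperbaric chamber"}, query, df, N, p_rel=0.9)
print(score_common, score_rare)
```

A document matching the rare term ends up far ahead of one matching the frequent term, which is the document-frequency effect described above. A product of odds is used instead of a sum of logarithms only to keep the sketch short.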
1:55:43
So this is the part here. Now, there is still this part that we have not dealt with yet, and actually this part is kind of a problem, because nobody really knows what it is. And here comes the immediate pragmatism of information retrieval people. Could you just imagine what they said? Yes: why don't we just say it is 0.9? [inaudible] And this is actually what they did: they tried it out, you know, and it worked quite well with a value of 0.9 in all experiments. Other values also work, but so: no need to estimate, no need to think about it, just set this to 0.9. There are some other ways to deal with that: you could use relevance feedback, or you could estimate it, that was kind of the latest installment, from 1998, where you estimate it by [inaudible]. But usually you just ignore it completely and set it to 0.9. And if this looks unreasonable: welcome to the world of information science. You just try things, and if they work, you don't question them; you just assume things like linked dependence or binary independence, and if it works, fine. Which brings me on
1:57:59
to the end of the probabilistic models. What I left out: these two, probabilistic indexing and the binary independence retrieval model, are the best known and the most often used, but there are loads of variants of them. I just want to make sure you understand the basic parts of both of them: one seeing the query as a random variable, the other seeing the document as a random variable. You could extend them: you could use learning algorithms, learning to estimate some of the probabilities, and you can differentiate between different kinds of queries that are more complex, or you can model the dependencies between terms, the same meaning for synonyms and so on. A lot of work has been done on that, but it basically all comes out of these two models, with the same drawbacks and basically the same strengths. Going through a lot of them one by one would be boring, and on the other hand it would fill an entire series of lectures, so we stick with these two.
1:59:33
Just to recount some of the pros and cons of probabilistic retrieval models. The best argument for them is that they are successful; they work well, which is good. The probability also seems to be very intuitive for users: this idea of "it is more probable" or "less probable", "the more terms are contained, the more probable it gets", is easy to explain to users, because people can find the reason why some object should be more or less relevant. Having Bayes' theorem behind it, a sound mathematical calculus in which you can transform all kinds of things, is a good foundation. And along the way we have to account for the assumptions: you don't assume anything without saying it, because you need it in the mathematical transformations. So whatever you assume, you basically make explicit. On the other hand, the estimation of parameters is something that is kind of off sometimes. This use of 0.9 was really not a nice piece of work; it is hard to justify: well, let's say it is 0.9 and let's not talk about it any further. And some of the assumptions, like linked dependence or binary independence, are doubtful, to say the least. But it works. It is also less flexible than the vector space model: if you want a model to tinker with, the vector space model will be better than the probabilistic retrieval model; if you are not interested in the model as such and just want to work with the result set, the probabilistic retrieval model will do. If you need to change algorithms in the model, that is a long way to go: changing anything in the probabilistic retrieval model is a very difficult operation, and as you have seen during the lecture, it is complicated. Just try to recount all the steps that we did, what somebody who invented it had to do: transforming things and multiplying by 1, arguing this factor away and making that factor a constant. That has been a lot of work. So, which brings 
Good question: which model is good? Well, it just worked well in practice, and success makes it right. People tried it on different collections and it turned out to be a good choice. What is interesting is that, in some sense, if you don't have to fiddle with the model, it is probably the best model you will get. Nobody really knows why it works, because you make many assumptions and some of them are wrong, but it works, and why should you change something that works? Well, because you might have special needs. For example, you might want to work with negative feedback, where the user explicitly states what is not wanted. You cannot easily get that into the probabilistic model because of all the assumptions, but you can do it in the vector space model: we say, OK, for some terms that the user does not want I will apply a penalty. So the vector space model is easier to work with if you want to adapt your retrieval algorithms. Personalisation is another example. When you type a query and then type the same query a second time, the results change slightly, or you type a related query and the results change, because the system takes some information about you from your previous queries and adds it into the result. This is not very easy to do with a probabilistic model, because where in the model would you put it? You can do it with the vector space model, because you just shift the query vector in a different direction, which means adding query terms that were preferred in some of the previous queries. But the two models have different assumptions. The first one can only be used if
you have an indexing with weights that tell you how well a term describes a document. You have to get these fuzzy weights from somewhere, whether that is a librarian putting keywords on books or whether that is tagging. Take Flickr, for example, the photo sharing site, where you can just write free tags: "this is a tree", "this is yellow", "this is my friend", something like that. The tags then describe the photos. So you can derive from many things a measure of the relevance of a term with respect to some document. If you have that, then probabilistic indexing is a very good way to go. If you don't have that, you simply have to rely on binary independence. So it is not a question of which model is better, but of what you have in your database: if you have the information how relevant a term is for a document, it would be stupid not to use it; if you don't have it, the other model is the way to go. Exactly. So basically what the first model needs is this: for every term in a document you need some number, say 0.8, that tells how important this term is for the document. I don't care where you get it from: some librarian tells you, you derive it from tags that other users in a social network put on the document, or you invent it yourself. The model just wants to have that for every term in the document; that is probabilistic indexing. If you don't have it, you only have binary numbers instead of real numbers. OK? But after that kind
2:09:21
of heavy piece of maths, we want to do something nice today, and
2:09:28
I said in the last lecture that I want to show you some statistics and two laws that are really important when you deal with document collections. I chose two collections. One is a news collection, which basically contains news articles. It is about a gigabyte in size and contains about 400 thousand documents; if you consider the postings, there are about 180 million, so the index is pretty big already, and even if you compress it, it is still kind of big. And if you consider a different collection, which was crawled at some point during the early 2000s from the Web, taking web pages as documents, where each page is a document, with a size of 12 million documents, it gets big quickly and the index size gets much bigger: the index is actually 20 gigabytes, which is really impressive, and you really have to think about how to handle that efficiently. And consider that 12 million pages in a document collection are not even a fraction of a fraction of the big one, the Web as it is today. How search engines cope with that will come later during the lecture, when we do Web search, but the real question is how we deal with such a big collection. The assumption that you might have is basically that the inverted index should somehow be limited, that there should be an upper bound. Think about English words: you have the Oxford dictionary for the terms, for every entry in the dictionary you have one posting list, and that should be it. Interestingly, it is not. If you look at the vocabulary size of the 400 thousand documents, it is about 400 thousand terms. For the 12 million documents the vocabulary size is about 16 million terms, much more than the 400 thousand before, and
2:13:25
the interesting question is really: how big is the term vocabulary? Is there a way to estimate it? Given that I add 50 or 100 documents, what will that do to my term vocabulary? The common assumption was that it is kind of like the number of words in the language, everything that was put into the Oxford dictionary, and that should be it. And
2:13:53
the interesting thing is: no, that is simply wrong. What Heaps found is that the number of terms in a collection depends directly on the number of tokens. A term is basically an entry in the index dictionary, so the number of terms is the number of entries needed, and a token is each word occurrence that you get from a document. The more documents you have, the more tokens you have. They may be the same words, but still the number of terms is proportional to a power of the number of tokens, and the interesting part is the exponent: one can say that the number of terms grows roughly with the square root of the number of tokens. Everybody knows what the square root function looks like, and it diverges: it does not asymptotically approach some fixed number, it keeps growing, slowly, but it keeps growing. Of course people were kind of surprised by Heaps' law, because the conjecture was that there must be an upper limit: there must be a set of sensible words, and that will be it, and if I add five more documents, no more sensible words will come in. But obviously for practical collections that is not true. He measured it empirically and then found the
2:16:18
specific law. If I look at a collection of web pages, for example, in the first 10 thousand tokens there are about 3 thousand different terms. If I take the first million tokens, there are already 30 thousand different terms. If I take 20 billion web pages, which is a very conservative estimate of the Web, and assume that each web page has about 200 words (there are longer pages and shorter ones, but on average two hundred), then the size of the vocabulary that a search engine has to maintain is already about 60 million different terms. And they do exist, not only in German, where we can build new words by just putting them together in some random fashion, but also in English. This was received with some surprise, because it seems counterintuitive. And then there was someone called Zipf.
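The figures quoted for web pages can be checked with a few lines of Python. This is only a sketch: it assumes Heaps' law in the form M = k * T^b, and the helper names `fit_heaps` and `heaps_vocabulary_size` are made up for illustration. The constants k and b are not given in the lecture; they are fitted here from the two data points that are mentioned (10 thousand tokens with 3 thousand terms, one million tokens with 30 thousand terms).

```python
import math


def fit_heaps(t1, m1, t2, m2):
    """Fit k and b in M = k * T**b from two (tokens, terms) observations."""
    b = math.log(m2 / m1) / math.log(t2 / t1)
    k = m1 / t1 ** b
    return k, b


def heaps_vocabulary_size(num_tokens, k, b):
    """Estimated number of distinct terms for a given number of tokens."""
    return k * num_tokens ** b


# Two observations taken from the lecture's web page example.
k, b = fit_heaps(10_000, 3_000, 1_000_000, 30_000)

# Assumed extrapolation: 20 billion pages with 200 words each.
web_tokens = 20e9 * 200
estimate = heaps_vocabulary_size(web_tokens, k, b)

print(f"k = {k:.1f}, b = {b:.2f}")
print(f"estimated vocabulary: {estimate / 1e6:.0f} million terms")
```

The fit gives an exponent of about 0.5, which matches the square-root growth described in the lecture, and the estimate keeps growing with the number of tokens instead of levelling off.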
2:17:49
He said: well, actually it has something to do with the number of concepts that you need to describe and to distinguish from each other. If I am talking about animals, and you have a cat and a dog, and I want to distinguish between them, I have to use different names for them, because if I called them both "pet", you would not know whether I am talking about cats or some different animal. On the other hand, if I were communicating with you using totally different new words every five minutes, you would be kind of stressed, because the words might exist, but you would have to look them up in a dictionary since you don't encounter them very often. So for the point of making things clear and distinguishing between things, I need different words, but for communication I usually stick to few words. Think of learning a new language: what is the amount of words that everybody uses on a day-to-day basis? You go to the post office, you go to the bank, you buy some bread, you talk to the baker. That is the amount of words you use on a daily basis. And then, what is the amount of words of which you actually know what they mean? One often distinguishes between the active use of words and the passive knowledge of words: you have heard the word, you know what it means, you have just never in your life used it. You have an idea of what it is, but it is not part of your active vocabulary. Educated people actually know about 20 to 30 thousand words, which is quite a lot. So what you use in everyday speech is about 3 thousand words, what you know is about 20 to 30 thousand, and the vocabulary that
exists is unlimited. Why unlimited? Because people need to distinguish concepts from each other as they go into more detail, and the only way to distinguish things from each other is describing them differently, naming them differently. You cannot use the same word, otherwise you would introduce entropy, confusion. So either you have a new word for the thing, or you have to describe it by putting together a number of other words that were not put together before in that way. So what Zipf actually said is: yes, the amount of words is growing, but not in the sense that you need more and more active words. What he actually found is that the i-th most frequent term has a frequency in the corpus proportional to 1 divided by i. It looks like this: here are the active words, and this here is what is called the long tail. These are the words that you use very rarely. Basically, what you add is more of the long tail, while the mass of the words stays at the front. And what he did
2:24:07
is this: he looked at letter frequencies and term frequencies in natural language. He just took a book in English and counted how often every letter occurred, and found out that the letter frequencies look like this: E occurs most often, followed by letters such as T and A, and at the far end there are rare letters such as Q. And you can do
2:25:11
the same thing for words. For example, take Moby Dick, everybody knows Moby Dick, and just count how often every word occurs and order the words by frequency. The first part is words like "the", "of" and "and", which are very frequent. Then come words like "have" or "even", and as you go down you will find words like "ship", "whale" and "captain", and maybe a name that somebody mentions just once in the novel. The mass of the distribution is here at the front, and at some point it drops off. This front part is what is called the head, and what comes after it is the long tail: usually a lot of terms, but they don't add mass. Which part is interesting? You need a trade-off. The words in the head don't tell you anything, so you could cut them out and work with the rest of the words. But if you cut this part out, you basically have the same structure again: some words are more probable than the ones further down. So you need to decide for each setting. You need the words in the long tail because they discriminate, but they carry little mass.
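The counting experiment described above can be sketched in Python. This is an illustration under assumptions, not the procedure used in the lecture: the tokenisation rule (lowercase runs of letters and apostrophes) and the function names `rank_frequencies` and `zipf_prediction` are made up here, and the tiny sample sentence only stands in for a whole novel.

```python
import re
from collections import Counter


def rank_frequencies(text):
    """Return (word, count) pairs ordered from most to least frequent."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_prediction(top_frequency, rank):
    """Frequency predicted for the word at the given rank (1-based),
    assuming frequency is proportional to 1 / rank."""
    return top_frequency / rank


# A toy sample; a real experiment would read the full text of a book.
sample = "the whale and the ship and the captain saw the whale"
ranked = rank_frequencies(sample)
top = ranked[0][1]

for rank, (word, count) in enumerate(ranked, start=1):
    print(f"{rank:2d} {word:8s} observed={count} predicted={zipf_prediction(top, rank):.2f}")
```

On a text as short as this the fit is rough. The point is only the procedure: count, sort by frequency, and compare each rank against the 1/i prediction. On a full novel the head (a few very frequent words) and the long tail (many words that occur once) become clearly visible.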
2:27:44
And you can observe this law in many languages, it is not only applicable to English. It has been used, for example, in codebreaking: ciphers that simply assign numbers to letters are very easy to crack when you know the language distribution. You see which symbols occur more often and which less often, match them with a certain probability, and you can reverse the code. Behind this is Zipf's idea that communication follows the principle of least effort: you need to use words that many people know and everybody is comfortable with, and on the other hand they need to be specific enough to describe what you want to describe. If you go into a bakery and say "I want something that has been baked", everybody would go "yes, and?", and you would have to be more specific. On the other hand, if you go into the bakery and say "I want this blueberry and double cream cake with such and such ingredients in it", they would go "yes, the blueberry cake". So there is a balance between efficiency and precision at some point.
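The codebreaking remark can be illustrated with a small frequency analysis in Python. This is a deliberately simplified sketch: instead of a general substitution cipher it assumes a plain shift (Caesar) cipher over lowercase letters, and it assumes that the most frequent plaintext letter is "e". The function names `shift_text` and `guess_shift` and the sample sentence are hypothetical.

```python
import string
from collections import Counter


def shift_text(text, shift):
    """Shift every lowercase letter by `shift` positions; leave other characters alone."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            shifted.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            shifted.append(ch)
    return "".join(shifted)


def guess_shift(ciphertext, most_common_plain="e"):
    """Guess the shift by assuming the most frequent cipher letter stands for 'e'."""
    letters = [ch for ch in ciphertext if ch in string.ascii_lowercase]
    top_cipher_letter, _ = Counter(letters).most_common(1)[0]
    return (ord(top_cipher_letter) - ord(most_common_plain)) % 26


plaintext = "meet me here where the green trees seem endless"
ciphertext = shift_text(plaintext, 3)

recovered_shift = guess_shift(ciphertext)
print(recovered_shift)
print(shift_text(ciphertext, -recovered_shift))
```

A general substitution cipher needs more than the single most frequent letter, but the idea is the same: the skewed letter distribution of the language leaks through the encoding.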
2:29:23
And similar relationships can also be discovered in many different contexts. For example, look at the accesses of web pages, that is, how many people access a web page: it is the same thing. If you order the pages by the number of accesses, at the front are the few sites that everybody wants to see, and in the long tail are the pages that nobody wants to see. It is just a power law. The distribution of wealth is the same: at the head are people like the Queen, and professors are in the long tail. The same holds for the distribution of the population of cities. So if you take
2:30:23
the US cities: New York comes first, followed by the other big cities, and somewhere far down are the small towns. And if you look at it on a logarithmic scale, it is almost a line, which basically means that it is a power law ...
2:31:05
In the next lecture, on indexing, we will talk about how to prepare documents with that in mind: some of the words in the head of the distribution are simply not useful for retrieval, so retrieving all documents that contain the word "the" is not very helpful. So we will talk about some statistical properties of document collections and about how to prepare documents for indexing. Are there any questions? No? Then that was enough maths for today. Thank you for your attention.
00:00
Boolean algebra
Theory of relativity
Ferry Corsten
Scientific modelling
Model theory
Scientific modelling
Finitary relation
Similarity (geometry)
Bit
Number
Similarity (geometry)
Information retrieval
Escape character
Mathematics
Computer animation
Natural number
Information retrieval
Fuzzy logic
Right angle
Key (cryptography)
Figurate number
Fuzzy logic
Family
World Wide Web Consortium
02:56
Context awareness
State of matter
Sheaf (mathematics)
Price index
Computer
Number
Video game
Term (mathematics)
Ideal (ethics)
Moving average
Addressing mode
Fuzzy logic
World Wide Web Consortium
Area
Information
Weight
Perturbation theory
Term (mathematics)
Measurement
Similarity (geometry)
Degree (graph theory)
Subject indexing
Word
Arithmetic mean
Process (computing)
Computer animation
Personal digital assistant
Order (biology)
Normal (geometry)
Quicksort
Film editing
08:13
Complex (psychology)
Length
Multiplication sign
Scientific modelling
Price index
Insertion loss
Weight
Mereology
Mathematics
Video game
Bit rate
Hypermedia
Vector space
Product (category theory)
Spacetime
Electronic mailing list
Term (mathematics)
Trigonometric functions
Measurement
Arithmetic mean
Wave
Computer simulation
Angle
MiniDisc
Representation (politics)
Resultant
Point (geometry)
Connectivity (graph theory)
Distance
Number
Goodness of fit
Crash (computing)
Inverse problem
Causality
Term (mathematics)
Representation (politics)
MiniDisc
Metropolitan area network
World Wide Web Consortium
Scaling (geometry)
Polygon mesh
Key (cryptography)
Computer
Cellular automaton
Length
Weight
Similarity (geometry)
Subject indexing
Stochastic differential equation
Computer animation
Personal digital assistant
Kolmogorov complexity
Game theory
18:37
Point (geometry)
Group action
Beat (acoustics)
Length
Transport Layer Security
Multiplication sign
Online help
Mereology
Computer
Number
Frequency
Thomas Bayes
Fluid
Causality
Term (mathematics)
Moving average
Subtraction
Posterior probability
Task (computing)
Condition number
World Wide Web Consortium
Focus (optics)
Numbering scheme
Physical law
Term (mathematics)
Measurement
Stochastic differential equation
Word
Process (computing)
Computer animation
Computer science
Theorem
Posterior probability
Quicksort
Permian
Reading (process)
Resultant
Thomas Bayes
24:23
Query language
Model theory
Scientific modelling
Theory
Auto mechanic
Bit
Online help
Mereology
Computer
Computer icon
Number
Similarity (geometry)
Measurement
Information retrieval
Computer animation
Bit rate
Network topology
Database
Moving average
Maize
Routing
World Wide Web Consortium
27:06
Query language
Multiplication sign
Scientific modelling
Binary code
Insertion loss
Basis (linear algebra)
Mereology
Computer
Computer icon
Number
2 (number)
Wave packet
Independence (probability theory)
Information retrieval
Programmer (hardware)
Order (biology)
Subject indexing
Dependent and independent variables
Representation (politics)
Ranking
Subtraction
Physical system
World Wide Web Consortium
Dialect
Cellular automaton
Scientific modelling
Basis (linear algebra)
Set (mathematics)
Word
Computer animation
Personal digital assistant
Information retrieval
Order (biology)
System programming
Statement (computer science)
Computer science
Dependent and independent variables
Ranking
Quicksort
Representation (politics)
Units of measurement
Routing
Identical particles
Physical system
Reading (process)
32:10
Query language
Statistics
Set (mathematics)
Real number
Multiplication sign
Distance
Event horizon
Sign (mathematics)
Positional notation
Arithmetic mean
Moving average
Representation (politics)
output
Physical system
World Wide Web Consortium
Electronic mailing list
Skewness
Set (mathematics)
Measurement
Uniform boundedness principle
Arithmetic mean
Computer animation
Network topology
Function (mathematics)
Information retrieval
Computer science
Task (computing)
Matching (graph theory)
35:18
Point (geometry)
Greatest element
Musical ensemble
Random number generation
Multiplication sign
Range (statistics)
Limit (category theory)
Bit rate
Mereology
Event horizon
Number
Frequency
Thomas Bayes
Degree (graph theory)
Mathematics
Causality
Hypermedia
Pearson productmoment correlation coefficient
Operator (mathematics)
Bus (computing)
Moving average
Software testing
Information security
Form (programming)
World Wide Web Consortium
Total S.A.
Extreme programming
Storage area network
Affine space
Limit (category theory)
Local Group
Table (information)
Degree (graph theory)
Number
Frequency
Event horizon
Computer animation
Personal digital assistant
Mixed reality
Interpreter (computing)
Website
Bayesian network
Data type
Writing
41:04
Rational number
Random number generation
Multiplication sign
Sheaf (mathematics)
Computer icon
Number
Information retrieval
Frequency
Goodness of fit
Degree (graph theory)
Video game
Spherical cap
Moving average
Subtraction
World Wide Web Consortium
Information
Key (cryptography)
Bayes, Thomas
Basis (linear algebra)
Propositional formula
Volume (thermodynamics)
Functional (mathematics)
Video game
Demoscene
Computer animation
Information retrieval
Order (biology)
Interpreter (computing)
Object (grammar)
Bayesian network
46:23
Random number
Statistics
Random number generation
Information
Open set
Computer icon
Table (information)
Video game
Computer animation
Universe (mathematics)
Maize
Bayesian network
Mathematician
Library (computing)
World Wide Web Consortium
48:21
Randomization
State of matter
Multiplication sign
Real number
Instance (computer science)
Limit (category theory)
Table (information)
Number
Number
Summation
Degree (graph theory)
Mathematics
Computer animation
Ideal (ethics)
Moving average
Bayesian network
50:31
Point (geometry)
Expression
Numbering scheme
State of matter
Set (mathematics)
Model theory
Binary code
Mereology
Event horizon
Theory
Independence (probability theory)
Information retrieval
Mathematics
Mathematics
Term (mathematics)
Subject indexing
Moving average
Ranking
World Wide Web Consortium
Window
Area
Binary code
Scientific modelling
Basis (linear algebra)
Independence (probability theory)
Term (mathematics)
Variable (mathematics)
Estimator
Tablet computer
Subject indexing
Computer animation
Network topology
Estimation
Information retrieval
Matching (graph theory)
Reading (process)
Local ring
Library (computing)
53:55
Point (geometry)
Query language
Random number
Set (mathematics)
Multiplication sign
Distribution (mathematics)
Price index
Binary code
Event horizon
Thomas Kuhn
Variable (mathematics)
Number
Sequence
Frequency
Term (mathematics)
Subject indexing
Maize
Metropolitan area network
World Wide Web Consortium
Distribution (mathematics)
Term (mathematics)
Vector graphics
Subject indexing
Summation
Computer animation
Estimation
Task (computing)
Physical system
56:53
Query language
Randomization
View (database)
Sound effect
Set (mathematics)
Number
Frequency
Thomas Bayes
Pointer (computer programming)
Event horizon
Computer animation
Causality
Subject indexing
Moving average
Theorem
Condition number
World Wide Web Consortium
58:58
Point (geometry)
Query language
Feedback
Multiplication sign
Auto mechanic
Bit rate
Event horizon
Computer icon
Number
Summation
Frequency
Subject indexing
Queue (abstract data type)
Ranking
Position operator
World Wide Web Consortium
Window
Logical constant
Set (mathematics)
Local Group
Mechanism design
Sign (mathematics)
Frequency
Computer animation
Lattice (order)
Network topology
Quicksort
Spectrum (functional analysis)
1:02:09
Query language
Ferry Corsten
Multiplication sign
Distribution (mathematics)
1 (number)
Independence (probability theory)
Term (mathematics)
Mereology
Independence (probability theory)
Frequency
Uniform boundedness principle
Latent heat
Event horizon
Computer animation
Estimation
Term (mathematics)
Personal digital assistant
Subject indexing
Computer science
Moving average
Maize
Ecoinformatics
World Wide Web Consortium
Wide area network
1:06:52
Point (geometry)
Slide rule
Query language
Multiplication sign
1 (number)
Hermite polynomials
Mereology
Weight
Information Technology Infrastructure Library
Computer
Thomas Kuhn
Number
Summation
Frequency
Sign (mathematics)
Degree (graph theory)
Causality
Term (mathematics)
Subject indexing
Queue (abstract data type)
Moving average
World Wide Web Consortium
Repetition
Moment (mathematics)
Basis (linear algebra)
Weight
Term (mathematics)
Measurement
Local Group
Maxima and minima
Degree (graph theory)
Subject indexing
Calculation
Word
Computer animation
Network topology
Order (biology)
Computer science
Interpreter (computing)
Right angle
Resultant
Directed graph
1:12:58
Point (geometry)
Complex (psychology)
Query language
Statistics
Structural load
Model theory
Network operating system
Common Intermediate Language
Ordinary differential equation
Mereology
Number
Frequency
Infinite conjugacy class property
Blog
Term (mathematics)
Average
Database
Subject indexing
Arc (geometry)
Metropolitan area network
World Wide Web Consortium
Torus
Information
Element (mathematics)
Basis (linear algebra)
Electronic mailing list
Term (mathematics)
Measurement
Word
Computer animation
Oval
Personal digital assistant
Lie group
Computer science
Data type
Resultant
1:18:00
Series (mathematics)
Random number generation
Model theory
Distribution (mathematics)
Scientific modelling
Binary code
Number
Independence (probability theory)
Information retrieval
Summation
Frequency
Stochastic differential equation
Computer animation
Term (mathematics)
Lie group
Subject indexing
Ranking
Web Ontology Language
Units of measurement
Physical system
World Wide Web Consortium
1:19:58
Query language
Model theory
Scientific modelling
Binary code
Price index
Bit rate
Bound state
Icosahedron
Thomas Kuhn
Independence (probability theory)
Information retrieval
Word
Thomas Bayes
Mathematics
Blog
Uniform space
Area
Curve
Logical constant
View (database)
Scientific modelling
Propositional formula
Term (mathematics)
Vector graphics
Maxima and minima
Mechanism design
Frequency
Network topology
Theorem
Right angle
Task (computing)
Physical system
Expression
Random number
Feedback
Set (mathematics)
Limit (category theory)
Discrete element method
Code
Emulation
Sequence
Frequency
Degree (graph theory)
Arithmetic mean
Term (mathematics)
Pearson productmoment correlation coefficient
Subject indexing
Ranking
Maize
output
World Wide Web Consortium
Distribution (mathematics)
Bayes, Thomas
Weight
Set (mathematics)
Video game
Sign (mathematics)
Subject indexing
Number
Word
Event horizon
Computer animation
Estimation
Function (mathematics)
Information retrieval
Bayesian network
1:22:24
Query language
Product (category theory)
State of matter
Model theory
Set (mathematics)
Multiplication sign
Range (statistics)
Binary code
Price index
Infinity
Disk readandwrite head
Event horizon
Computer
Thomas Kuhn
Independence (probability theory)
Information retrieval
Word
Frequency
Thomas Bayes
Latent heat
Subject indexing
Queue (abstract data type)
Moving average
Ranking
Subtraction
Random variable
World Wide Web Consortium
Window
Logical constant
Key (cryptography)
View (database)
Cellular automaton
Scientific modelling
Weight
Term (mathematics)
Vector graphics
Symbol table
Similarity (geometry)
Subject indexing
Event horizon
Computer animation
Estimation
Lie group
Information retrieval
Theorem
Task (computing)
1:24:49
Point (geometry)
Latin square
Scientific modelling
Distribution (mathematics)
Binary code
Number
Independence (probability theory)
Information retrieval
Thomas Bayes
Latent heat
Singleprecision floatingpoint format
Saddle point
Moving average
Representation (politics)
Ranking
Software testing
World Wide Web Consortium
Logical constant
Structural load
Maxima and minima
Arithmetic mean
Computer animation
Theorem
Representation (politics)
Data type
1:28:11
Abstract state machines
Greatest element
Graph (mathematics)
Ultraviolet photoelectron spectroscopy
Design by contract
Binary code
Mereology
Independence (probability theory)
Information retrieval
Frequency
Optical disc drive
Thomas Bayes
Moving average
Maize
Office suite
World Wide Web Consortium
Slide rule
Computer
Set (mathematics)
Functional (mathematics)
Computer animation
Network topology
Theorem
Odds ratio
Reading (process)
1:31:01
Area
Transport Layer Security
Multiplication sign
Binary code
Term (mathematics)
Computer
Event horizon
Independence (probability theory)
Information retrieval
Frequency
Thomas Bayes
Video game
Computer animation
Network topology
Term (mathematics)
Linker (computing)
Queue (abstract data type)
Theorem
Metropolitan area network
World Wide Web Consortium
1:34:46
Raw image format
Physical law
Division (mathematics)
Binary code
Term (mathematics)
Mereology
Theory
Independence (probability theory)
Information retrieval
Heegaard splitting
Computer animation
Term (mathematics)
Personal digital assistant
Queue (abstract data type)
Moving average
Quicksort
Optical disc drive
Product requirements document
Uniform space
World Wide Web Consortium
1:37:47
Point (geometry)
Query language
Product (category theory)
Numbering scheme
Distribution (mathematics)
Multiplication sign
Distribution (mathematics)
Binary code
Mereology
Computer icon
Independence (probability theory)
Information retrieval
Goodness of fit
Mathematics
Causality
Term (mathematics)
Operator (mathematics)
Queue (abstract data type)
Data structure
Subtraction
Identical particles
World Wide Web Consortium
Area
MUD
Bit
Term (mathematics)
Table (information)
Stochastic differential equation
Computer animation
Network topology
Personal digital assistant
IRIST
Right angle
Block (periodic table)
Identical particles
Matching (graph theory)
1:45:08
Point (geometry)
Product (category theory)
Evelyn Pinching
Binary code
Mereology
Independence (probability theory)
Prime ideal
Information retrieval
Goodness of fit
Latent heat
Computer animation
Causality
Estimation
Term (mathematics)
Operator (mathematics)
Quotient
Quicksort
Block (periodic table)
Chisquared distribution
World Wide Web Consortium
1:48:18
Point (geometry)
Randomization
Multiplication sign
Binary code
Mereology
Event horizon
Computer
Theory
Emulation
Number
Independence (probability theory)
Information retrieval
Frequency
Latent heat
Term (mathematics)
Large eddy simulation
World Wide Web Consortium
Focus (optics)
Information
Physical law
Term (mathematics)
Word
Summation
Voting
Frequency
Computer animation
Network topology
Estimation
Personal digital assistant
Computer science
1:55:39
Point (geometry)
Feedback
Model theory
Scientific modelling
Binary code
Electronic mailing list
Mereology
Independence (probability theory)
Information retrieval
Machine learning
Term (mathematics)
Linker (computing)
Computer network
Moving average
Subtraction
Poisson process
Units of measurement
World Wide Web Consortium
Logical constant
Information
Pseudonymization
Computer
Structural load
Feedback
Binary code
Scientific modelling
Independence (probability theory)
Local Group
Subject indexing
Computer configuration
Computer animation
Estimation
Information retrieval
Thermal conductivity
Extension (kinesiology)
1:59:33
Axiom of choice
Transformation (function)
State of matter
Scientific modelling
Multiplication sign
Binary code
Parameter (computer programming)
Weight
Mereology
Independence (probability theory)
Information retrieval
Mathematics
Videoconferencing
Electronic visual display
Algorithm
Product (category theory)
Spacetime
Binary code
Scientific modelling
Shared memory
Measurement
Degree (graph theory)
Digital photography
Process (computing)
Network topology
Data type
Resultant
Point (geometry)
Asynchronous Transfer Mode
Real number
Mass
Vector space model
Open set
Number
Measurement
Frequency
Goodness of fit
Term (mathematics)
Subject indexing
Ranking
Booting
Subtraction
Metropolitan area network
World Wide Web Consortium
Information
Physical law
Independence (probability theory)
Estimator
Subject indexing
Calculation
Computer animation
Estimation
Information retrieval
Fuzzy logic
Object (grammar)
Library (computing)
Computer worm
2:09:26
Web page
Point (geometry)
World Wide Web Consortium
Open source
Content (media)
Price index
Term (mathematics)
Mereology
Data dictionary
Number
Fraction (mathematics)
Word
Subject indexing
Category of being
Number
Computer animation
Term (mathematics)
Computer configuration
Website
Statistics
Software testing
Physical law
World Wide Web Consortium
2:13:52
Web page
Pulse (signal processing)
Token ring
Insertion loss
Average
Number
Word
Latent heat
Video game
Term (mathematics)
Subject indexing
Moving average
Physical law
Search engine (computing)
Subtraction
World Wide Web Consortium
Logical constant
Cellular automaton
Web page
Physical law
Token ring
Sound effect
Term (mathematics)
Functional (mathematics)
Estimator
Maxima and minima
Sign (mathematics)
Radical (chemistry)
Number
Word
Computer animation
Order (biology)
Website
Routing
2:17:47
Point (geometry)
Entropy
Multiplication sign
1 (number)
Design by contract
Electronic mailing list
Mass
Mereology
Disk read-and-write head
Total S.A.
Number
Formal language
Database normalization
Frequency
Video game
Term (mathematics)
Natural number
Queue (abstract data type)
Physical law
Computer-assisted translation
World Wide Web Consortium
Area
Hidden surface determination
Logical constant
Distribution (mathematics)
Physical law
Physicalism
Sound effect
Mass
Skewness
Term (mathematics)
Power (physics)
Formal language
Table (information)
Word
Message passing
Frequency
Sample (statistics)
Computer animation
Personal digital assistant
Telecommunication
Order (biology)
Data compression
Speech synthesis
Website
Key (cryptography)
Quicksort
2:25:10
Point (geometry)
Reading (process)
Pressure
Cognition
Chemical equation
Distribution (mathematics)
Mass
Mereology
Disk read-and-write head
Discrete element method
Formal language
Number
Summation
Writing
Word
Frequency
Term (mathematics)
Physical law
Process (computing)
Data structure
World Wide Web Consortium
Area
Term (mathematics)
Formal language
Discounts and allowances
Word
Frequency
Computer animation
Empennage
2:29:21
Web page
Context awareness
Statistics
Logarithm
Distribution (mathematics)
Design by contract
Online help
Electronic design automation
Mass
Disk read-and-write head
Number
Word
Database normalization
Mixture model
Subject indexing
Physical law
Ranking
Statistics
World Wide Web Consortium
Link (knot theory)
Scaling (geometry)
Distribution (mathematics)
Web page
Ext functor
Line (geometry)
Term (mathematics)
Subject indexing
Category of being
Word
Frequency
Computer animation
Network topology
Empennage