Remembering the Probabilistic Analysis of Latent Semantic Indexing
Formal Metadata
Title 
Remembering the Probabilistic Analysis of Latent Semantic Indexing

Title of Series  
Author 

License 
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2019

Language 
English

Content Metadata
Subject Area  
Abstract 
In the late 1990s, the possibility of algorithmic extraction of insight from soulless data loomed potentially important and very intriguing. I will look back at our attempt at understanding and advancing this research program in the light of two decades of blistering progress in spectral methods, machine learning, data harvesting and deep nets.

00:00
Data management
Doubling the cube
Magnetooptical drive
Subject indexing
Mathematical analysis
Quicksort
Variable (mathematics)
00:47
Context awareness
Internetworking
Internetworking
Multiplication sign
Subject indexing
Sampling (statistics)
Mathematical analysis
Whiteboard
01:49
Web 2.0
Fraction (mathematics)
Focus (optics)
Internetworking
Multiplication sign
Right angle
Water vapor
Database
Computer
Flow separation
03:06
Group action
Server (computing)
Serial port
Code
Multiplication sign
Student's ttest
Mereology
Semantics (computer science)
Theory
Local Group
Web 2.0
Causality
Googol
Authorization
Analytic continuation
Task (computing)
Area
Domain name
Email
Block (periodic table)
Binary code
Theory
Bulletin board system
Category of being
Subject indexing
Word
Process (computing)
Voting
Internetworking
Search engine (computing)
Video game
06:36
Set (mathematics)
Singular value decomposition
Mereology
Semantics (computer science)
Dimensional analysis
Formal language
Medical imaging
Singular value decomposition
Subject indexing
Square number
Authorization
Curve
Default (computer science)
Information
Projective plane
Moment (mathematics)
Surface of revolution
Subject indexing
Word
Vector space
Information retrieval
Website
Landau theory
Matrix (mathematics)
Spacetime
Distortion (mathematics)
08:55
Addition
Group action
Information
Multiplication sign
Line (geometry)
Limit (category theory)
Computer
Tracing (software)
Power (physics)
Word
Mathematics
Subject indexing
Computer science
Website
Quicksort
Landau theory
Matrix (mathematics)
10:12
Computer virus
Principal ideal
Complex (psychology)
Context awareness
Distribution (mathematics)
View (database)
Multiplication sign
Combinational logic
Survival analysis
Singular value decomposition
Computer
Data model
Virtual reality
Endliche Modelltheorie
Electric generator
File format
Maxima and minima
Unsupervised learning
Radical (chemistry)
Wave
Mixture model
Computer science
Selforganization
Whiteboard
Quicksort
Representation (politics)
Metric system
Resultant
Service (economics)
Eigenvalues and eigenvectors
Electronic program guide
Axonometric projection
Computer
Theory
Twitter
Mixture model
Linear subspace
Representation (politics)
Theorem
Distribution (mathematics)
Focus (optics)
Eigenvalues and eigenvectors
Direction (geometry)
Surface
Electronic program guide
Principal ideal
Projective plane
Approximation
Mathematics
Word
Commitment scheme
Personal digital assistant
Mixed reality
Universe (mathematics)
Linear subspace
15:14
Randomization
Distribution (mathematics)
Correspondence (mathematics)
Multiplication sign
Source code
Set (mathematics)
Student's ttest
Parameter (computer programming)
Axonometric projection
Semantics (computer science)
Dimensional analysis
Verylargescale integration
Revision control
Data model
Mixture model
Estimator
Flow separation
Different (Kate Ryan album)
Bridging (networking)
Nichtkommutative JordanAlgebra
Authorization
Theorem
Endliche Modelltheorie
Resource allocation
Machine learning
Context awareness
Default (computer science)
Distribution (mathematics)
Electric generator
Correspondence (mathematics)
Artificial neural network
Projective plane
Moment (mathematics)
Volume (thermodynamics)
Term (mathematics)
Flow separation
Diameter
Data mining
Subject indexing
Uniform resource locator
Word
Theorem
Website
Right angle
Quicksort
Volume
20:11
Scaling (geometry)
Information
Artificial neural network
Multiplication sign
Weight
Source code
Exponentiation
Parameter (computer programming)
Virtual machine
Surface of revolution
Power (physics)
Web 2.0
Explosion
Mathematics
Word
Mathematics
Machine learning
Search engine (computing)
Information retrieval
Logic gate
22:34
Web 2.0
Mathematics
Vector space
Resultant
Virtual machine
Surface of revolution
00:02
so this would be at the local
00:06
there they include most stodgier OK so you know we don't sit on top on double
00:13
that is being that is the most being back that bullets after after many years of absence the and the and the sergeant user accounts with the with the painful realization of how think how much things have changed this be prepared for a lot of that sort and there those of you without great variability audience so you got see the race as a bunch of AI through archeology with your work here but so this about the paper
00:48
that I wrote to 20 something years ago we public I go when he saw the marking and samples and Bonner and was were appeared in boards 99 so all before anything else I will have to said that historical context
01:18
OK you know these the summer of 1998 when this happens the and it was so many of you remember had that imagined it was a very these extremely exciting times but of course leading the excitement was the realization that the Internet was on how the intervals was here and the values of the should many opportunities many challenges and there are allowed to vote intellectual
01:52
the soulsearching that has to be done OK so this was a or many of us have those realized that this is something we use and betterment and we have to see each Our focus is carried out focuses on other computer of course that way and the this was so I already went very well defined in the defense that the moral of the fence years was over of and and the more of the most important thing you know the most important the bottled water so the where brew was fine as it was the web search problem where middle and the it was an extremely challenging thing right no you really wanted to set a huge database would at this time we considered huge
02:54
OK but it wasn't work I think it was probably about several millions times smaller than what it is now the end you have to do it in a fraction of a 2nd
03:08
and the and z of a given day way in a pleasing way in a way that the voters and dies and users and so on and so forth so that was a major problem and that there were a lot of solutions out on now the Michaels of that is that beings domain of binary Brewer and combined and and his students that was actually a there so we know search engine explicitly based on the the basic concepts of the college and of course you who will overture and so on so the world of players Google was the new kid in the block and the and the then there was the Theory Group at IBM Almaden out many of you are aware of this but I don't know it's so great was you know again nostalgia and that was a time when other editions of could the world that could tasks that there are a few places in the world an idea Muhlemann was 1 was that out of the jobs for them in the gun the gun the gun the gun of a really the fury life and the and the and the published you'd because the were the the the the part of the company's of all worlds of a great company so that here to go along with them by that time led by by by was plunging until they didn't and the web search problem I'm not sure you realize how many of all of those who are already code papers that dealt to read the web endangerment came from that of from other groups are joint kind but observed hubs and authorities or your paper was so wide so so John was at the area and also can his later and in fact it was explicit continuation of this of this of this work using spectral techniques to understand how to the web but the cause of of communities in the word and the word unexpected prevalence of communities it came out that that from from that group the Bulletin of the boat where it you're familiar with that would serve something he's still a poor in the seriality use also Our from that group and so on
05:49
so there was a lot of it was a center of thinking of serious theoretical thinking about the way of and the no I remember of or public 1 send us once a center the whole group once as an email and the the you know essentially what you're saying is start thinking about the 1 2nd splitsecond of concerning suppose that you had to our lots of computing how do says that where OK and using about it that's that's the right that's that's a very insightful way over all of us sitting around the target of and that was latent semantic indexing and how many of you are familiar with
06:38
it the general on part of that Cornell was a true pioneer I mean if you have a set of the united but so basically you know last year was a nice information that the moment the following sense information retrieval had long been thought that you have a corpus of a set of books and you want to be able to search its intelligently and 2 and to work get inside to get to get the information out of collective and the Jerry Sutherland came up with this amazing proposal of the corpus is a matter it's OK in other words forget syntax forget semantics forget the what we'll forget this is that the the book is about cold war so in other words every book is a vector in on the dimension where you in the vector space where the dimensions of the words in the language and then for every corpus of his other books his ematics but so this was and the canadian wheat dairy in our default in the seventies with the and the because for example nobody knew what for the bullets automatic square distorted and felt manipulated but then in nineteen 90 the did was that the demand curve for ransom we took this idea to the next inescapable conclusion of image semantics than you blow whether you do the singular value decomposition OK and that is out of the single about the room was his was not is a picnic which take semantics and project a small amount of its so projects it's all on the police about the elections of the space defined by the matter with their so and and that they called the this technique to latent semantic indexing LSI and epidemics that they
08:43
years many people including the authors of this paper the tied the site and found it to be extremely successful in practice the in other words it was a revolution to information retrieval OK and the word was winning up
09:00
the informers' the welcome addition of the year after year know so than 1 hour or of there's something missing the site of the last line is a is a the huge hole why OK so the last really in
09:21
other words the main thing we want to do is to understand mathematical and why was it was sort of a conundrum and I'll explain why Hideo group you know for those of you who remember it's up in in the in good night in the 20th century but the traces the fifties throughout the fathers of the century but the few entering a computer science essentially had the missions the main mission is to understand mathematics the power and limitations of computers OK so as I told you this you know the the strength of this mission was waning because many of us had realized of intimate is is what is what setting it from now on and the computer was just like that of the like gadget above us them but but I mean know until that time we
10:16
were obsessed by computer terminal so it's probably cut from the youngest among you to believe that but computer science was obsessed with computers values such as that made the world a terrible mistake coevolve adopting the name of the computer focus a lot of computer science but but so the 2nd you know what what so this is of course so it goes without saying and there was a lot of success there was this is an empty but it is of the view that the that's simply problem and complexity and of what and so on so it was it was extremely successful theory but the other side we have a 2nd emissions and the civil mission is to guide computing practices by discovering from of again that I'd way to do things OK and and that in many ways boards and other old conferences over all the embodiment of the so this commitment will traditions in the rear of the 20 century through to basic solve practical problems and so and so the show the way to the world to petition and there was a fat admission of added more controversial it and that was 1 of the to prove the that what practitioners were doing was correct surveillance and practitioners is 0 like that right surface and you know what I knew this a long wire you and I have can Thompson my mind so we know that so this is something that you know he had a lot of inside the Unix contains a lot of insight from solutions to difficult problems and some of the theorems came 20 years after Twitter but so and the question is why so basically who as 3rd compelled to committed to annoy and this was as I told you this was the tail end of the use of the use of the set up of these so all these 3 permission thing but a wave basically what we wanted to answer mathematical Country prove a theorem that says that the other side is successful organ so because of singular value the board composition as you know now I teach it to undergraduates of and under the under the the chapter of unsupervised learning which saw no this was completely that this was not the of the focus of the time but they see the projects the metrics to a subspace of the principal the elections in other words the ventral awards and the these are the highest eigen values the singular values of the gold in this context and there are more all such projections is provably the minimality distorting projection Mikhail which is the best possible approximation given the dimensionality all the all this semantics and and so it's a good approximation so as a result of that the virus should not be theory but it it anybody improved case on hold that is of the representation of the corpus was works better enable Riemannian nearestneighbor served since you that the results this was the commander wide wide why does this happen on suspicion was that the VLSI projection identifies what can be called the topics of this document I suppose that topics boarded the Explore Science about comets and so on have each its own survival war distribution probabilistic vocabulary that different is that I might get between topics and then the document is a mixture of topics of and the that afford make sure the distribution of words which is a mix mix of distribution and he suggested the generative model you have about the use of the document you could see that it's a out of a combination of corporate formation of a few topics but you you compute the distribution awards degenerate and words in the service in this universe and the and the you get you get you get a document and you repeat K times if get the corpus but so that's a generative
15:00
model and the question is does it aside it is generated model identify the topics of the document and they have sold let's go back to the sort context at that time it was
15:19
so fast uprising astonishing that there you could extract insights from that location it's it was you know even that the mining was not was not a very common value of developed the learner so so to extract insights from that was something that was that was a lot of very very very and the and the need of explanation know I see learning to me was something that Soliz Valiant does the and the where I look that needs it had fewer than 200 papers and these much less and then what about neural networks very no source so it is room and I bet you that the but sponsor lives were in the 5 and 6 shall 6 hundreds of about OK so it was a different world right so let me let me read you moment over so honesty and the permissibility of honesty from from of introduction of the paper we would like to prove a theorem stating essentially that if the corpus is a reasonably focused collection of meaningfully correlated document then I performs well the problem is to divide this that so that the 1 that is it is of close correspondence with what they mean intuitively in black this and to the furor income improved OK so what you see here so what so what was the what was the of bearing inside us something or what was the the the the kind of the kind of thing that Europe was no by default but the our goal of authorities and see that the so what the approval under strong assumptions of separation of distributions and of mixtures of topics the site does identify the main topics of the document without to we also pointed out that another thing that was more than at the time but under the projection which now everybody knows about OK but at that time it was a significant editions that combines well that aside and since much work with cantilevered in other words you don't need to go out to work for side to the whole corpus you can't fares that and rejects was smaller dimension is going to work fine and the experiments we did some experiments to revision of the generative model LSI blasts random projection works much better than gumpel OK so our again there is a gap between what prove and and what happens so and and the good things happen in the experiments what
18:14
happens is that but the day but it was presented in Philadelphia that supports 1999 and was from the special volume so people were boats people sent overstated up the the interesting question is why this at the bottom OK so I guess I was the conduit of that I was with the the intermediary who were who worked there but I know I the it was a very reasonable and and this and that and then I know the appropriate decision and and the and the bridge was the is the well obviously from the Polish community it and 20 years later about but essentially what happened next is the following that the same model the same generative model was formulated and treated as a machinelearning problem in AI OK so our 1st of all by Thomas Hoffman something that is called the probabilistic Latent Semantic Indexing but was essentially it is a lot of the sort of generative model and then 4 years later by the Friends OK Jordan is the diameter of my bound Blyth's my colleague and former student the underlying reasons a former student the and proposes model where these generative model barely mentioning us and the but they are because they also go with the basic any they say this is the generative model so I wanted to it as a set this problem of parameter estimation OK so 1 and that that that define the parameters of linear including the institution the words that the verb called this the Latent Dirichlet Allocation and so this is
20:13
a runaway success in AI information retrieval of you there this paper or paper has done very well in in citations the these houses of 30 times more official so there automatic what else happened since
20:34
1999 but let's say you know there is no way of there's no reason for me but the greatest thing that happens is 1989 is explosive growth and success although machine learning and neural networks artificial neural networks and in flight web search engine companies of which is 1 of the of the fossils of these rooms in other words they have they they're basically for a the Royal leading how much to the new neural nets this has come up with a a complete lack as far as this understand the expost mathematical explanations from you know there is very little mathematical exponent mathematical argument in the way of explaining why do these things work OK and of course there is the reserve your the explain abilities so this is a is continuing a problem with with the with much and the the 2 end with a sour note of bucket but in 1999 and 1988 so we had the you know we had the inside as the fire that we're doing something go on something gate for humanity OK that's the web is going to change the way things are it's going to be a new source of global power information democracy and that of that of that so that the that the that we are working on a new world that is going to be but 1st of all a lot of scale and the of course decidedly non money OK and these so there hasn't panned
22:37
out well the war and the and the the results there a lot of disappointment apprehension with the way that vector on the Web and the web of the 1st adult so let me
22:54
finish years think very much