Introduction to Nonparametric Bayesian Models
Formal Metadata
Title | Introduction to Nonparametric Bayesian Models
Number of Parts | 160
Author | Omar Gutierrez
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/33776 (DOI)
Language | English
Transcript: English (auto-generated)
00:06
Hi, I am Omar Gutierrez. I want to talk about some things about nonparametric Bayes. Most of the things I will say are based on a tutorial from the researcher Samuel
00:26
Gershman. Actually, I won't tell you too many things. We will briefly recap the definition of a model, in a nutshell.
00:45
I will show the simplest example of nonparametric Bayes, which is clustering. I will show you how we usually do clustering and the alternative approach, which is the nonparametric
01:07
model. I will try to tell some jokes, I hope you understand them.
01:24
What we do in machine learning, how can I say it, not only but mainly, is to modify the values of some parameters in a training process, and this is almost
01:49
the default approach in machine learning. It is not common to hear the term parametric models, but almost all the models we are using right now are parametric. There is another
02:09
group of machine learning models that are nonparametric and for me they are very interesting and I think we need to discuss the idea.
02:23
So, just remember that in linear regression our parameters are the beta values, so we can move a line in the plane, or a hyperplane, and in neural networks it is almost the same:
02:47
we modify the weights, et cetera. Also, in hidden Markov models,
03:01
we have some kind of weights, the hidden values, and we modify them. So, I think you already know a lot about machine learning, so I want to ask you something. For you, which one is the best model? The blue line is a kind of time
03:30
series and the red line is the model. So, how many of you think that model
03:40
A is the best one, and how many of you think that model B is the best one? Okay, well, I set a trap, because I didn't show you the complete data, and in the last step
04:07
we can see that the blue line drops down dramatically, so in this case that changes which is the best
04:26
one. So basically we just modify parameters in a training process. Data science is sometimes more of an art than a science, so let's think, I don't know, there is someone from
04:43
GIS here, those people work with the stock market, so imagine that we are buying stocks, but at some moment you need to stop buying because it could be risky, so the model on side
05:06
B is a good model in that kind of scenario. But maybe you are asking about this name, Bertrand Russell's inductivist turkey. Do
05:20
you know the story? Okay, it's very funny. Imagine a turkey that is smart; this turkey has an inductivist philosophy, so it draws conclusions, and each morning
05:41
it receives food from its owner, and then it starts to think: oh, okay, I'm receiving food every morning, on rainy days, sunny days, weekends, so it concludes that it will keep receiving food for a long time. But that is not the case on Christmas
06:09
day, because the next day the owner will cut this turkey's throat, so this model
06:20
is very good in that scenario, and I like that story from Bertrand Russell. So I think you know this formula, of course, Bayes' rule; well, I will repeat what it means.
06:41
Well, let us start with the blue term, which is our prior knowledge; imagine that as the conclusions the turkey is drawing every day. But this belief will change, it will change because of the observations:
07:11
let's think of D as the data set, or yeah, the observations, given our
07:21
hypothesis or our parameters, and at the end we will have an updated belief that is called the posterior, the one in red.
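The formula on the slide is not reproduced in the transcript; a standard statement of Bayes' rule in the terms the speaker describes (hypothesis h, observations D) is:

    P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}

where P(h) is the prior (the blue term), P(D \mid h) is the likelihood of the observations given the hypothesis, and P(h \mid D) is the posterior (the red term).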
07:49
So, we have already defined what a model is, so now we are talking about Bayesian reasoning, which is old, I think, very old, but still very useful and very popular. The idea is that all of this was before the machine learning revolution and the data
08:09
science revolution, so the idea is that by manipulating probabilities we can make inferences. So, let's see this other formula: we want to get the maximizing argument in the Bayes formula,
08:31
which is called the maximum a posteriori (MAP) estimate. For example, from step two to step three,
08:40
do you know why we drop this term? You see, it is present here but not here; can you tell me? Because there are like 20 formulas like this... no, just kidding. Well, we can see that
09:02
this term does not depend on the hypothesis, so it does not affect the maximization and we can drop it. And also, if we have no prior knowledge at all, we can say that the probability of every hypothesis is the
09:24
same, so we can drop it too. And we finish with maximum likelihood estimation, which is a very important formula, and it's not so hard.
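The slide's derivation is not shown in the transcript; a reconstruction of the steps the speaker describes, in the same notation, would be:

    h_{\mathrm{MAP}} = \arg\max_h P(h \mid D)
                     = \arg\max_h \frac{P(D \mid h)\, P(h)}{P(D)}
                     = \arg\max_h P(D \mid h)\, P(h)

since P(D) does not depend on h, and with a uniform prior P(h) this reduces to maximum likelihood estimation:

    h_{\mathrm{MLE}} = \arg\max_h P(D \mid h)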
09:47
We will see maximum likelihood estimation in almost every algorithm; if you want to prove that your algorithm is correct, just try to match it with this maximum likelihood estimation, and I will come back to this later. I just want to mention that there is a very nice paper that studies the history
10:12
of maximum likelihood estimation, so it's nice to see the progress; I don't remember
10:22
the title, but you can google it if you are interested. So now we know more about Bayesian reasoning, and we know it is present in many machine learning algorithms. The next thing is some problems with data: the data is always evolving. For
10:47
example, imagine Wikipedia in its first years; I don't know how many articles there were or what their topics were, but let's say they were just biology and chemistry, and then
11:02
in the next years Wikipedia started to have articles on sports or biographies of artists, so the data was evolving. The same with the species on the planet: every week or every few days biologists
11:26
are discovering new species, so sometimes they need to modify the taxonomy somewhat, and so the data is evolving. And let's think of social networks: they are evolving
11:47
every second, for example the hashtags on Twitter. So how do we usually address the problem,
12:02
for example clustering? There is a kind of common way to do it, one classic approach. Let's say that we want to use Gaussian mixture models, and let's think of the Gaussian
12:23
mixture as something that is fitted by maximum likelihood; the Gaussian mixture model is not equivalent to it, but it is fitted with maximum likelihood. This is a claim without
12:52
proof, I won't prove it, but yeah, then in the Gaussian mixture model there is Bayesian reasoning.
13:05
So yeah, we can do some clustering, maybe with a Gaussian mixture, but then some questions arise, like: how many clusters do we need if my data is evolving? Then we usually
13:29
create many components and then do a comparison, I don't know, maybe with the Bayesian information
13:40
criterion or the silhouette score; there are a lot of metrics. In this case the best clustering was five components, I don't know if you can see, so this one, yeah, it seems like a good clustering,
14:03
but actually number three also looks reasonable. So that is what we usually do.
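The code behind the slide is not part of the talk; a minimal sketch of this usual parametric workflow with scikit-learn, on synthetic 2-D data standing in for the slide's data set, could look like this:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic stand-in for the data set on the slide: three blobs in 2-D.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 3, 6)])

    # Fit a Gaussian mixture for several numbers of components and compare with BIC;
    # the silhouette score from sklearn.metrics could be used the same way.
    bics = {}
    for k in range(1, 11):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bics[k] = gmm.bic(X)

    best_k = min(bics, key=bics.get)  # the lowest BIC wins
    print(best_k, bics[best_k])

The point of the talk is that this selection loop has to be rerun every time the data grows, which is what the nonparametric approach avoids.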
14:24
But let's think of another approach. We have seen the parametric approach, so let's think of the non-parametric approach. Non-parametric can be confusing, as if there were no parameters at all, but actually the number of parameters is infinite, although that is also a somewhat misleading way to put it.
14:41
The idea is that we have empty clusters, infinitely many empty clusters, and we start to fill them with our data, so in this way we can solve that problem, because our models can
15:02
adapt to the complexity of the data itself. Well, let's see this diagram: we have the Bayesian models, some of them are non-parametric, and those names are funny,
15:24
like, you know, the Chinese restaurant and the Indian buffet; I don't know, sometimes science is funny. And you know, I am from Mexico, and I am thinking that if I study, you know, some of the Dirichlet process theory, I can come up with another similar model, I don't know, the Mexican
15:48
taqueria process, maybe. So these models are known as Dirichlet processes, and I will explain the Chinese restaurant process, because for me it is very intuitive. Imagine
16:08
that we are in a Chinese restaurant; usually the Chinese restaurants in California, I think, are huge, so one scientist noticed that and said, well, the Chinese
16:26
restaurants are huge, so you can go there, and if the restaurant is empty you can choose any table, and then the second customer will come in and choose another table, or the same
16:45
one, with some probability. In the end we realize that what is happening, when a lot of customers are going to the restaurant, is a kind of clustering, and each table is
17:04
one cluster, and well, that's pretty cool. So that is the idea of the Chinese restaurant process, a model that evolves with the data.
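The talk only gives the verbal picture; a minimal simulation of this seating process, assuming a concentration parameter alpha (my name for it, it is not mentioned in the talk), might look like this:

    import random

    def chinese_restaurant_process(n_customers, alpha=1.0, seed=0):
        """Return a table (cluster) assignment for each customer."""
        rng = random.Random(seed)
        tables = []        # tables[i] = number of customers sitting at table i
        assignments = []
        for _ in range(n_customers):
            # Sit at an occupied table with probability proportional to its size,
            # or open a new table with probability proportional to alpha.
            weights = tables + [alpha]
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(tables):
                tables.append(1)      # open a new table
            else:
                tables[choice] += 1   # join an existing table
            assignments.append(choice)
        return assignments

    print(chinese_restaurant_process(20))

The number of occupied tables grows slowly as more customers arrive, which is exactly the "model that evolves with the data" behaviour described here.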
17:24
So this is the clustering for the previous data we saw, but with the infinite Gaussian mixture model, which is a Dirichlet process model,
17:44
so we can do everything you can do with a parametric model, for example digit recognition or topic modeling. Well, actually I am at the conclusion part, so let's recap. In
18:06
the traditional approach (sorry, I mixed up some letters on the slide in the probability notation), the number of parameters is fixed and we have some distribution over those
18:33
parameters, and in the other case, the nonparametric models, we assume that we have an infinite number
18:42
of clusters, and our data can be adapted to... or no, our model can be adapted to the data. There are some libraries in Python, of course scikit-learn, but it only has like
19:09
one or two algorithms, and the best one in my opinion, because it's
19:21
closer to the research, is this one, datamicroscopes, but I don't like it so much because it is only available on conda and not in the official Python package index, but actually, yeah, it is the best library.
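The scikit-learn option he is most likely alluding to is BayesianGaussianMixture, which can put a (truncated) Dirichlet process prior on the mixture weights; a minimal sketch, reusing the X array from the earlier snippet:

    from sklearn.mixture import BayesianGaussianMixture

    # Truncated Dirichlet process mixture: give it more components than we expect
    # to need and let the prior switch off the unnecessary ones.
    dpgmm = BayesianGaussianMixture(
        n_components=20,
        weight_concentration_prior_type="dirichlet_process",
        random_state=0,
    )
    labels = dpgmm.fit_predict(X)   # X as in the GaussianMixture sketch above
    print(len(set(labels)))         # effective number of clusters actually used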
19:51
So if we want to know more about this, we can study what the beta distribution is. In a nutshell, the
20:01
beta distribution is just a probability of probabilities, and then we have the Dirichlet distribution, which is a generalization of the beta distribution, which means it is a distribution over distributions, and finally the Dirichlet process.
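As a tiny illustration of "a probability of probabilities", assuming NumPy: a beta sample is a single probability, while a Dirichlet sample is a whole probability vector, that is, a discrete distribution:

    import numpy as np

    rng = np.random.default_rng(0)

    p = rng.beta(2.0, 5.0)               # one probability in [0, 1]
    w = rng.dirichlet([2.0, 5.0, 3.0])   # a probability vector summing to 1

    print(p)
    print(w, w.sum())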
20:33
If you also want to read more about it, well, my favourite book on machine learning is the one written
20:43
by Tom Mitchell, and this tutorial is really nice, and the library I mentioned also has very nice tutorials you can check. Well, that's all, thank you.
21:13
So, do we have a few questions? Okay, can you use the microphone and speak directly to him?
21:21
So, you mentioned the Gaussian mixture model; is that considered a Bayesian model then, as well? Maybe... the Gaussian mixture model is not strictly a Bayesian
21:43
model, but the algorithm that we use to do the clustering is expectation-maximization, and that one uses maximum likelihood, and maximum likelihood is derived from
22:03
Bayes' rule, so what we can say is that it is part of Bayesian reasoning. And actually there is some discussion that most of what we know as Bayesian models
22:24
are not exactly Bayesian models but just statistical learning. Then let's think: if we use calculus and we say that everything we are doing comes from
22:40
Newton, it's not exactly like that; well, Newton was important in calculus, but there are other mathematicians who also contributed on top of that. It's almost the same with Bayesian things: sometimes it's not exactly Bayesian, but behind it there is
23:10
Bayes' rule. More questions?
23:23
So we have plenty of time, I have a few questions... oh, there is someone again, oh great. Yeah, I can ask a second question maybe. So I'm just trying to understand: these Bayesian
23:46
models, are they a category of models or are they like a way of using different other models? Because, I mean, I know the name Gaussian mixture model, and I'm trying to understand:
24:00
Bayesian models, is that also a category of models, or is it a way of using a model? Yeah, maybe, yeah; there are some models that are strictly Bayesian, for example let's
24:24
say Bayesian networks: we have probabilities that are not independent, they depend on each other, so we can say that Bayesian networks
24:44
are strictly a Bayesian model, and there are other ones; I think we can also think of hidden Markov models as Bayesian models, because, yeah, we have some nodes and they have a
25:06
dependency, and this dependency is a probability. But strictly, we could then discard the Gaussian
25:22
mixture model as a Bayesian model, but still, as I said, it has some Bayesian things deep inside. I don't know if that was a good answer. So, well, maybe I can finish like this: you
25:44
know, I am more of a software developer, but I am not doing software development anymore because it's kind of boring, so I decided to do a master's in computer science. I'm doing a lot of
26:03
statistics now, but I am also not a statistician, so, as Bertrand Russell said, I am not a philosopher, I am not a mathematician either, so I am a centaur; it's the same for me. Last question?
26:21
I don't see any hands up, so thanks again to Omar.