Probabilistic Programming in Python
Formal Metadata
Title: Probabilistic Programming in Python
Title of Series: EuroPython 2014 (Part 51 of 119)
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/20040 (DOI)
Production Place: Berlin
Transcript: English (auto-generated)
00:15
Thanks so much for coming out to hear me talk about my favorite topic, probabilistic programming.
00:22
I thought I'd introduce myself real quick. I recently relocated back to Germany after studying at Brown, where I did my PhD on Bayesian modeling and where we studied decision-making. For a couple of years I've also been working with Quantopian, which is a Boston-based startup, as a quantitative researcher, and there we are building the world's first algorithmic trading platform in
00:45
the web browser. The talk will only be tangentially related, so I'm just gonna show you that screenshot of what it looks like. This is essentially what you see when you go on the website: it's a web-based IDE where you can code Python and code up your trading strategy, and then we provide historical data so that you can test how
01:06
you would have done if it was, I don't know, 2012. Then on the right here you see how well it did, and that's what I'll refer back to often: you're interested in whether you beat the market or lose against the market. And also I should add it's completely free and everyone can use it.
01:24
Okay, so I think at every talk the main question should be: well, why should you care, in this case about probabilistic programming? And it's not really an easy talk, just because to talk about probabilistic programming you need to have at least a basic understanding of
01:42
some concepts of probability theory and statistics. So for the first 20 minutes I will just give a very quick primer focusing on an intuitive level of understanding. Can I get a quick show of hands: who sorta understands on an intuitive level how Bayes' formula works?
02:01
Okay, so most of you. So maybe you won't even need that primer, but it might still be interesting, and towards the end we then have a simple example and a more advanced example that should be interesting even if you already know quite a bit about Bayesian statistics. To motivate this further, I really like this
02:23
contrast that Olivier gave in his talk about machine learning, and that is: chances are you are a data scientist, maybe, and use scikit-learn to train your machine learning classifiers. So what this looks like is: on the left you have the data
02:41
that you use to train an algorithm, and then that algorithm will make predictions. And if those predictions are all you care about, then that might be fine, right? But one central problem that most of these algorithms have is that they're very bad at conveying what they have learned, so it's very difficult to inquire what goes on in this black box right here.
03:02
So on the other hand, probabilistic programming is inherently open-box, and I think the best way to think about it is that it is a statistical toolbox for you to create these very rich models that are really tailored to the specific data that you're working with. Then you can inquire that model and really see what's going on
03:22
and what was learned, so that you can learn something about your data rather than just making predictions, right? And the other big benefit, I think, and we'll see that later, is that these types of models work with so-called black-box inference engines, which are sampling algorithms that work across a huge variety of models.
03:47
So you don't really have to worry about the inference step; all you have to do is basically build that model and then hit the inference button, and in most cases you'll just get the answers that you're looking for. So there's not really much in terms of solving equations, which is always nice.
04:05
So throughout this talk I want to use a very simple example that most of you will be familiar with, and that is A/B testing. As you know, that's when you have two websites and you want to know which one works better on some measure that you're interested in,
04:22
maybe the conversion rate or how many users click on an ad. What do you do to test that? You split your users into two groups, give group one website A and group two website B, and then you look at which one had the higher measure, right? That problem is of course much more general, and since I'm coming more from a finance background,
04:45
I'm gonna sort of switch back and forth with the statistically speaking identical problem where you have two trading algorithms and you want to know which one has a higher chance of beating the market on each day. So
05:03
here I'm just gonna generate some data, to basically see what the trivial answers that you might come up with yield, and how we can improve upon that. You might be surprised that I'm not using real data, but I think that is actually a critical step: before you apply your model to real data
05:25
you should always use simulated data, where you really know what's going on and the parameters that you want to recover, so that you know your model works correctly. Only then can you be sure that you'll get correct answers by applying it to real data, right? So the data that we're going to work with will be binary, just boolean events, and
05:46
that type of statistical process is called a Bernoulli process; that is essentially just a coin flip, right, the probabilities of coin flips. And I can use that from
06:00
scipy.stats, where it's called bernoulli, and here I pass in the probability of that coin coming up heads, or that algorithm beating the market on a particular day, or that website converting a user. Here I'm sampling 10 trials, so this will be the result, right, just a bunch of binary
06:23
zeros and ones. So I'm generating two algorithms, one with 50% and one with 60%. You want to know which one is better, and the easiest thing that you might come up with is just, well, let's take the mean, right?
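A minimal sketch of that simulation and the naive mean estimate (not the exact notebook code from the talk; the variable names are mine):

    import scipy.stats as stats

    # Two "algorithms": one beats the market on 50% of days, the other on 60%
    data_a = stats.bernoulli.rvs(p=0.5, size=10)
    data_b = stats.bernoulli.rvs(p=0.6, size=10)

    # Naive maximum likelihood estimate: just the mean of the binary outcomes
    print(data_a.mean(), data_b.mean())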
06:40
And actually, statistically speaking, that's not a terrible idea; it's called the maximum likelihood estimate. If you ask an applied mathematician what you should do, then that might be the answer. I took a course in applied math, and the proofs always work in a very similar way: you basically have this problem and then you say, well, okay, let's have
07:01
our data go to infinity, and then you solve it and you get that the estimator works correctly in that case. That's great, but what do you do if you don't have an infinite amount of data, right? That's the much more likely case that you'll be in, and that I think is where Bayesian statistics really works
07:21
well. So what happens in our case now, where I just take the mean of the data I just generated? As you can see, in this case we estimate that the chance of this algorithm beating the market is 10%, and 40% for the other one. Obviously that's completely wrong; I used 50 and 60 percent to generate it, and the obvious answer for why this goes wrong
07:44
is just that I was unlucky. The observant members of the audience will have noticed that I used a particular random seed here: I picked that random seed to produce this very weird sequence of events that basically produced this pattern. But certainly that can happen with real data, right? You can just be unlucky, and
08:05
the first 10 visitors of your website just don't click. The central thing that I think is missing here is the uncertainty in that estimate, right? 10%, 40%, that's just a number, but we're missing how confident we are in that number.
08:25
So for the remainder of the talk that will be a recurring topic: really trying to quantify that uncertainty. Then you might say, well, there's this huge branch of statistics, frequentist statistics, which designed these statistical tests to
08:43
decide which one of those two is better, or whether there is a significant difference. Then you might run a t-test, and that returns a probability value that indicates how likely you are to observe that data if it was generated by chance. That's certainly a correct thing to do, but one of the central problems with frequentist statistics is that it's incredibly easy to misuse it.
09:06
For example, you might collect some data and the test doesn't turn up anything, and then on the next day you get more data. So what do you do? Well, you just run another test with all the data you have now, right? You have more data, so the test should be more accurate.
09:25
Unfortunately, that's not the case, and you can see that here. I just create a very simple example where I do that procedure: I generate 50 random binaries, with 50% probability for both, so there is no difference between them, and then I
09:41
start with just two events and run a t-test; if that is not significant then I take three and run the t-test again, and so on. It's just that process of continuously adding data and testing whether there's a difference, and if the p-value is smaller than 0.05 then I return true; if it isn't, then I return false.
10:00
And then I repeat that a thousand times, and I look at what the probability is that, even though there is no difference at all between those two, they're both 0.5, this test yields an answer that there is a significant difference. It's 36.6 percent in that case, which is absurdly high, right?
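A rough sketch of that peeking simulation (my own reconstruction, not the code on the slide; the function name and defaults are made up):

    import numpy as np
    import scipy.stats as stats

    def peeking_test(n=50, p=0.5, alpha=0.05):
        """Keep adding data and re-running a t-test until it comes out 'significant'."""
        a = stats.bernoulli.rvs(p, size=n)
        b = stats.bernoulli.rvs(p, size=n)
        for i in range(2, n + 1):
            # Peek at the data after every new observation
            if stats.ttest_ind(a[:i], b[:i]).pvalue < alpha:
                return True
        return False

    # Fraction of runs that claim a "significant" difference although none exists
    print(np.mean([peeking_test() for _ in range(1000)]))  # roughly a third, not the nominal 5%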
10:23
So this procedure really fails if you use it in that way. Granted, I misused the test, right? It's not designed to work in that specific scenario, but it's extremely common that people do that. For me, one of the central problems is that frequentist statistics really
10:43
depends on your intentions when collecting the data. So if you use a different procedure for collecting the data (for example, let's say what I just did, I just add data every day), then you need a different statistical test. And if you think about this more, it's actually pretty crazy, right?
11:02
If you're a data scientist and you just get data from a database, you have no idea what the intentions were in gathering that data, right? And you want to be very free in exploring that data set and running all kinds of statistical tests to see what's going on. So I think while frequentist statistics is certainly not wrong, it's often
11:22
very constricting in what it allows you to do, and if you don't do things correctly, you might shoot yourself in the foot. I think that's really a good setup for Bayesian statistics, and I'm just going to introduce that very quickly. So at the core we have Bayes' formula, and
11:42
if you don't know what that is, essentially it's just a formula that tells us how to update our beliefs when we observe data. That implies that we have prior beliefs about the world that we have to formalize, and then we see data and we apply Bayes' formula to update our beliefs in light of that new data, to give us our posterior.
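For reference, the formula the speaker keeps pointing at, with $\theta$ the parameter and $D$ the data, is

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)},$$

where $P(\theta)$ is the prior, $P(D \mid \theta)$ the likelihood, and $P(\theta \mid D)$ the posterior.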
12:06
In general these beliefs are represented as random variables, and I'm also gonna very quickly talk about what those are and what an intuitive way of thinking about them is. Statisticians like to call their parameters, their random variables, theta, so that's what I'm going to use here.
12:25
Let's define a prior for our random variable theta, and theta will be the random variable about the algorithm beating the market, a single algorithm beating the market, or the website converting the user, right? So what is the chance that that happens?
12:41
Oops, I didn't want to show that, I just wanted to show this. So the best way to think about that random variable is this:
13:01
we don't know the value, right? We want to reason about that value; we have some rough idea about it. So rather than just having one value, we allow for multiple values and assign each possible value a probability. This is what that plot shows here: on the x-axis we have the possible states that the system can be in. For example,
13:22
the algorithm can have a chance of 50% of beating the market, and I'm gonna assume that that is the most likely case. That's my personal prior belief: without having seen anything, I'm gonna assume that on average 50% is probably a good estimate, but I wouldn't be terribly surprised to see something with 60%, even though it's less likely.
13:42
80% is considerably less likely but still possible; 100%, that it beats the market on every day, that I think would be next to impossible, right? So I'm gonna assign a very low probability to that. I think that's a very intuitive way of thinking about it. So now let's see what happens if I observe data, and
14:02
for that I created this widget here where I can add data when I use this slider, and then it will update that probability distribution down here, and that will be our posterior, right? Currently there's no data available, so our posterior will just be our prior.
14:23
That is just the belief we have without having seen anything. Now I'm gonna add a single data point, a single success: we just ran the algorithm for a single day and it beat the market. So now, as you might have seen, the distribution here shifted a little bit to the right side, and that represents our updated belief that it's a little bit more likely now that the algorithm is
14:46
generating positive returns. So now let's reproduce that example from before where we had one success and nine failures, right? That was algorithm A, and there we estimated it has a 10% chance of
15:04
beating the market. That was ridiculous, right? With that amount of data there's no way we could say that, and also with our prior knowledge there's no way we would assume that 10% is actually the probability. So now I created that, and it's updating this probability distribution down here, which is now our updated belief. Certainly with nine failures we're going to assume that there is a
15:27
lower chance of success of that algorithm, which is represented by this distribution moving to the left. But still note that 10% is still extremely unlikely under this condition.
15:41
Right, and that is the influence of the prior: we said 10% is unlikely, so that will pull our estimates away from these very low values. The other thing to note is that the distribution is still pretty wide. So here now we have our uncertainty measure in the width of the distribution, right? The wider it is, the less certain I am about that particular value.
16:03
So now I want you to just imagine in your head what the distribution would look like if I move this up to 90 and the successes up to 10, right? So basically now we're observing data that is in line with the hypothesis that it has 90% failure
16:22
probability. As you can see, the main thing that happens is it moves to the left, but it also gets much narrower, and that represents our increased confidence: having seen more data, we have more confidence in that estimate. That's exactly what we want.
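The widget itself isn't reproduced here, but the update it animates is the conjugate Beta-Bernoulli case, which you can sketch yourself; the Beta(5, 5) prior is my assumption for illustration:

    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats

    theta = np.linspace(0, 1, 200)
    prior_a, prior_b = 5, 5          # prior roughly centered on 50%

    # (successes, failures): prior only, one success, 1/9, and 10/90
    for successes, failures in [(0, 0), (1, 0), (1, 9), (10, 90)]:
        # Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures)
        post = stats.beta(prior_a + successes, prior_b + failures)
        plt.plot(theta, post.pdf(theta), label="%d successes / %d failures" % (successes, failures))

    plt.xlabel("theta (chance of beating the market)")
    plt.legend()
    plt.show()

With more data the curve both shifts and narrows, which is exactly the growing-confidence effect described above.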
16:44
By the way, how cool is it that I can use these widgets in a live notebook? Okay, so where's the catch with all of that, right? This sounds a little bit too good to be true: you just create that model and you update your beliefs and you're done, right? Unfortunately, it's not always that easy, and one of the main difficulties is that this formula in the middle here
17:05
in most cases cannot be solved. The case that I just showed you is extremely simple: you just apply Bayes' formula and you can solve it, and then you can compute the posterior analytically. But even with just a tiny bit more complex models you get
17:24
multidimensional integrals over infinity that will make your eyes bleed and that no sane human would be able to solve. And I think historically that's one of the main reasons why Bayes, which has been around since the 18th century, has not been used up until recently, now that it's kind of having a renaissance. It's just that people weren't able
17:44
to solve for it. And the central idea of probabilistic programming is: well, if we can't solve something, then we approximate it. Luckily for us there's this class of algorithms, most commonly used, called Markov chain Monte Carlo, and instead of computing the posterior analytically, that curve that we've seen, it draws samples from it.
18:05
That's about the next best thing we can do. Just due to time constraints I won't go into the details of MCMC, so we're just gonna assume that it's pure black magic and it works. And it sort of is: it's kind of a very simple
18:22
algorithm, but the fact that it works in such general cases is still mind-blowing to me. The big benefit is that it can be applied very widely, so often you just define your model, you say go, and then it'll give you your posterior. So what does MCMC sampling look like? As we've seen before, this is the posterior that we want, right?
18:44
This neat closed-form solution, which we can't get in reality. So instead we're gonna draw samples from that distribution, and if we have enough samples, we can do a histogram and then it'll start resembling what it is.
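To make "drawing samples" concrete, here is a tiny Metropolis sampler written from scratch (my own illustration, not part of the talk or of PyMC3), targeting the Beta posterior from the coin-flip example:

    import numpy as np
    import scipy.stats as stats

    # Unnormalized log posterior: Beta(5, 5) prior, 1 success out of 10 trials
    def log_post(theta):
        if not 0 < theta < 1:
            return -np.inf
        return stats.beta.logpdf(theta, 5, 5) + stats.binom.logpmf(1, 10, theta)

    np.random.seed(0)
    samples, current = [], 0.5
    for _ in range(10000):
        proposal = current + np.random.normal(scale=0.1)   # random-walk proposal
        # Accept with probability min(1, post(proposal) / post(current))
        if np.log(np.random.rand()) < log_post(proposal) - log_post(current):
            current = proposal
        samples.append(current)

    # A histogram of `samples` will resemble the analytic Beta(6, 14) posterior
    print(np.mean(samples))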
19:03
Okay, so let's get to PyMC3. PyMC3 is a probabilistic programming framework written in Python and for Python, and it allows for the construction of probabilistic models using intuitive syntax. And one of the reasons for doing PyMC3 rather than, as maybe some of you do, using PyMC2:
19:24
PyMC3 is actually a complete rewrite; it uses no code from PyMC2. There were a couple of reasons. One is just technical debt: the code base of PyMC2 is pretty complex, and it requires you to compile Fortran code, which always causes huge headaches for users to get working.
19:41
So PyMC3 is actually very simple code, and one of the reasons is that we're using Theano for the whole compute engine. So basically we're just creating the compute graph and then shifting everything off to Theano. And the other benefit we get from Theano is that it
20:02
can give us the gradient information of the model, and there's this new class of algorithms called Hamiltonian Monte Carlo, which are advanced samplers, and those work even better in very complex models. So they're much more powerful, but they require that extra step, and that's not easy to get. Luckily for us Theano provides that out of the box,
20:24
so we don't really have to do anything. The other point I want to stress is that PyMC3 is very extensible and also allows you to interact with the model much more freely. So maybe you've used JAGS or WinBUGS or Stan, which is another very interesting recent
20:45
probabilistic programming framework, and while those are really cool, one problem I personally have with them is that they require you to write your probabilistic program in this specific language. Then you compile that, you have some wrapper code to get the data into
21:03
Stan, and then you have some wrapper code to get the results out of Stan, and for me that's always very cumbersome. You can't really see what's going on in the model; you can't debug it. With PyMC3 you can write your model in Python code and then really interact with it freely, so you essentially never have to leave Python, and that for me
21:25
is very, very powerful. So you can think of it much more as a library, and we'll see that in a second. Just the authors: John Salvatier here is the main guy who came up with it, and Chris Fonnesbeck has also programmed
21:41
quite a bit. Currently we're still in alpha. It still works, it works fairly well already; the main reason why it's alpha is mainly that we're missing good documentation. We're currently writing that, but if you are up for it and would like to help out with it, that would certainly be more than appreciated.
22:01
Okay, so let's look at that model from our earlier example and see how we can solve it now in PyMC3. For that I'm just gonna write down that model the way you would write it in statistical terms. So we have these two random variables, right, that we want to reason about, theta A and theta B, and they will represent the chance of the algorithm
22:21
beating the market. Here we say this tilde means "is distributed as", so we're not working with numbers but with distributions. And this is a beta distribution, right, which goes just from 0 to 1; if you're working with probabilities, the beta distribution is the one to use.
22:46
So this is the thing that we want to reason about, that we want to learn about given data. And how do we learn about it? Well, we observe data, and the data that I simulated was binary, so it came from a Bernoulli distribution. So we have to assume
23:05
that the data is distributed according to Bernoulli distributions over zeros and ones. The probability of that Bernoulli distribution is what I just fixed at 0.5 before; now we actually want to infer it, so since we don't know that value we replace it with a random variable.
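In notation, the model being described is roughly (my transcription, with the prior's Beta parameters left generic):

$$\theta_A \sim \mathrm{Beta}(\alpha, \beta), \qquad x_i \sim \mathrm{Bernoulli}(\theta_A)$$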
23:23
That random variable is the theta A that we had above here, so that is how these models commonly look. The other point I want to make here is that here you really see how you're basically creating a generative model, right? So you
23:43
might wonder, like, how can I construct my own model, and I think a good path for that is to just think of how the data would have been generated. Right here I know, well, there's this probability and it generated Bernoulli data, so that's the model I'm going to create. But you can get arbitrarily complex and then say, well,
24:00
I have all these hidden causes that somehow relate in complex ways to the data, and then you're gonna invert that model using Bayes' formula to infer these hidden causes. So here I'm just gonna generate data again, a little bit more now. So again, 50 and 60 percent probability of beating the market, or conversion rate, and
24:24
300 values. And this is what the model looks like in PyMC3. First we just import pymc as pm and we instantiate the model object, which will hold all the random variables and whatnot.
24:42
The other improvement over PyMC2 is that everything you specify about your model you do under this with-context, and basically what that does is that everything you do underneath here will be included in that model object, so that you don't have to pass it in all the time.
25:02
So underneath here now, this should look pretty familiar from before, where I just had these random variables, right, theta A distributed as a beta distribution. Here I now write the same thing but in Python code, where I say, well, theta A is a beta distribution; we give it a name, and we give it the two parameters, alpha and beta, which are the two parameters this distribution takes, the number of successes and failures.
25:25
So this is the prior that I showed you before that was centered around 50%, and I do the same thing for theta B. Now I'm gonna relate those random variables to my data, and as I said before, that's a Bernoulli,
25:41
which I'm gonna instantiate with another name, and instead of the fixed p-value, now I give it the random variable, right, that we want to link together. And since this is an observed node, we give it that array of 300 binary numbers that we generated a slide before, right?
26:03
So this links it to the data and links it up to the random variable, and the same for B. Up until here, nothing has happened; we just basically plugged together a few probability distributions that make up how I think my data is structured.
26:24
Now, it's often a good idea to start the sampler from a good position, and for that we're gonna just optimize the log probability of the model using find_MAP, for finding the maximum a posteriori value. Then I'm gonna instantiate the sampler I want to use; there are various you can choose from, and here I'm using a slice sampler,
26:46
which works quite well for these simple models. Now I actually want to draw the samples from the posterior, right? For that I call the sample function and I tell it how many samples I want, 10,000. I
27:02
provide it with the step method and I give it the starting value. And when I do this call, it'll take a couple of seconds to run the sampling algorithm, and then it will return the structure which I call trace here.
27:22
And that is essentially a dictionary: for each random variable that I have assigned, I will get the samples that were drawn. And now that I ran that, I can inquire about my posterior, right?
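Reconstructed from that walkthrough, the model and sampling step look roughly like this (a sketch, not the exact notebook; the alpha-era API the talk used may differ slightly in argument names):

    import scipy.stats as stats
    import pymc3 as pm   # the alpha shown in the talk was imported as `import pymc as pm`

    # Simulated data: 300 days each, true rates 50% and 60%
    data_a = stats.bernoulli.rvs(0.5, size=300)
    data_b = stats.bernoulli.rvs(0.6, size=300)

    with pm.Model() as model:
        # Priors centered around 50%, as in the talk
        theta_a = pm.Beta('theta_a', alpha=5, beta=5)
        theta_b = pm.Beta('theta_b', alpha=5, beta=5)
        # Observed nodes link the random variables to the data
        pm.Bernoulli('obs_a', p=theta_a, observed=data_a)
        pm.Bernoulli('obs_b', p=theta_b, observed=data_b)

        start = pm.find_MAP()            # optimize the log probability for a starting point
        step = pm.Slice()                # slice sampler, works well for simple models
        trace = pm.sample(10000, step=step, start=start)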
27:41
So here I'm using seaborn, which, just as an aside, is an awesome plotting library on top of matplotlib. You should definitely check it out; it creates very nice statistical plots. For example, it has that nice distplot function that basically just creates a histogram, but one that looks much nicer and has, for example, this nice shaped line.
28:03
I give it the samples that my MCMC sampling algorithm drew of theta A and theta B, and then it will plot the posterior that I created. And that is, again, the combination of my prior belief updated by the data that I've seen, and now I can reason about that.
28:22
The first thing to see is, well, theta B, the chance of that algorithm beating the market, is 60%, and that's what I used to generate the data. So it's good that we get that back, and again, that's why we use simulated data: to know that we're actually doing the right thing.
28:43
The other one is around 50% or 49%. The other thing to note is that here now, instead of just having that single number that seemingly fell from the sky, which we would get if we just take the mean, we have our confidence estimate, right? We know how wide that distribution is. We can answer many questions about that,
29:04
like how likely is it that the chance of success for that algorithm is 65%, and then we get a specific number out that represents our level of certainty. And we can do other interesting things like hypothesis testing, to answer our initial question: which of the two actually does better?
29:22
In fact, for this we can just compare the samples that were drawn for theta A to the samples of theta B. We're just going to ask, well, how many of those are larger than the other one, and that will tell us that with a probability of 99.11 percent
29:40
algorithm B is better than A. And that is exactly what we want, right? By consistently having our confidence estimate carried through from the beginning to the end, we get the benefit that everything we say has a confidence and probability estimate associated with it.
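And the plotting and comparison step, continuing from the `trace` in the sketch above (`distplot` is the seaborn function mentioned; newer seaborn versions have replaced it with `histplot`/`kdeplot`):

    import numpy as np
    import seaborn as sns

    # Posterior histograms for both thetas
    sns.distplot(trace['theta_a'], label='theta_a')
    sns.distplot(trace['theta_b'], label='theta_b')

    # Probability that B beats A: the fraction of posterior samples where theta_b > theta_a
    print(np.mean(trace['theta_b'] > trace['theta_a']))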
30:04
Okay, so that was super boring up until now; hopefully it gets a little bit more interesting now. So consider the case where instead of just two algorithms we might have 20, and that is what we have at Quantopian: many users have these algorithms, and
30:23
maybe we want to know not only each individual algorithm's chance of success, but also how the algorithms are doing overall, the group average: are they also consistently beating the market or not? So
30:40
the easiest model you can probably build is just the one we did before, but instead of two thetas, A and B, we have 20 thetas, right? And while that's fair, and this is called an unpooled model, it's somehow unsatisfying, right? Because we probably assume
31:01
that these are not completely separate, right? The algorithms work in the same market environment, some of them will have similar properties, some use similar algorithms, so they will be related somehow, right? They will have differences, but they will also have similarities, and this model does not incorporate that, right?
31:23
There's no way that what I learned about theta one I would apply to theta two. The other extreme alternative would be to have a fully pooled model, where instead of assuming each one has its own random variable, I just assume one random variable for all of them. And that's also
31:41
unsatisfying, because we know that there is that structure in our data which we're not exploiting, and also, even though we might get group estimates, we could not say anything about a particular algorithm, how well that one did, right? So the solution, which I think is really elegant, is called a partially pooled or hierarchical model, and for that
32:07
we add another layer on top of the individual random variables. Right up until here we only have the model we had before, with all of these independent. But what we can do is, instead of placing a fixed prior on them, we can actually learn that
32:23
prior for each of them and have a group distribution that will apply to all of them. Those models are really powerful and have very many nice properties. One of them is, well, what I learn about theta one from the data will shape my group distribution, and that in turn will shape the estimate of theta two.
32:44
So everything I learn about the individuals I learn about the group, and what I learn about the group I can apply to constrain the individuals. Another example where we do this quite frequently is from my research in, say, psychology, where we have a
33:02
behavioral task that we test 20 subjects on, and often we don't have enough time to collect lots of data. So for each subject by itself, the estimates we would get if we fit a model to that guy will be very noisy, and the way out is to build a hierarchical model to basically
33:21
learn from the group and apply that back to the individuals, so we will get much more accurate estimates for each individual. That's a very nice property of these hierarchical models. So here I'm just gonna generate some data again, and essentially the data will be just an array, 20 by 300:
33:43
20 subjects, 300 trials, and each row will just be the binaries of each individual, right? Then for convenience I also create this indexing mask that I will use in a second. That might not make sense right now, but just keep it at the back of your mind.
34:03
Basically, the first row of the index will just be the index for the first subject, indexing into that random variable. But this is the data that we're going to work with.
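A sketch of what that simulated data and index might look like (my reconstruction; the per-algorithm rates here are made up):

    import numpy as np
    import scipy.stats as stats

    n_algos, n_days = 20, 300
    true_rates = np.random.uniform(0.4, 0.6, size=n_algos)      # hidden per-algorithm rates

    # One row of 300 binary outcomes per algorithm
    data = np.vstack([stats.bernoulli.rvs(p, size=n_days) for p in true_rates])

    # Index mask: for every observation, which algorithm (row) it belongs to
    idx = np.repeat(np.arange(n_algos), n_days).reshape(n_algos, n_days)
    print(data.shape, idx.shape)   # (20, 300) (20, 300)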
34:21
Okay, so how does that model look in PyMC3? Here I'm gonna first create my group variables, the group mean and group scale. So what's the average chance of beating the market across all algorithms, and how variable are they? That's gonna be the scale parameter. And this is a
34:41
choice you make in modeling, which priors you want to use. Here I used a beta distribution for the group mean, and I used a gamma distribution for the variance, because variances can only be positive, with certain parameters. But the details of that are not that critical.
35:01
Then, unfortunately, the beta distribution is parameterized in terms of an alpha and a beta parameter, and not in terms of a mean and variance. Fortunately, there's this very simple transformation we can do to these mean and variance parameters to convert them to alpha and beta values, which is what I'm doing here. And while the
35:23
specifics of that are not important, I just wanted to show how easy it is. If you use some other languages, it's not a given that you can just freely combine these random variables and transform them and still have it work out. The reason it works is that these are just Theano expression graphs, so once I multiply them it will actually take the
35:44
probability distributions, or the formulas, and combine them and do the math of combining them in the background. So then I need to hook that up with the thetas, my random variables for each algorithm, and instead of having a for loop and just generating 20 of them, I
36:06
can pass in the shape argument, and that will generate a vector of 20 random variables. That will be theta, so this is not a single one but actually 20. And before, you will note that I had just my hard-coded prior of five and five here, right, in the previous model,
36:27
but now I'm replacing that with the group estimates that I'm also gonna learn about. Now, again, my data is going to be Bernoulli distributed, and
36:40
for the probability, now I'm gonna use that index that I showed you before, and essentially that will index into this vector in a way that turns it into a two-dimensional array of the same shape as my data. Then it matches one-to-one and it just does the right thing,
37:04
and then I pass in the 2D array of rows of binary variables for each algorithm. Again, I'm finding a good starting point, and note here that I'm now using this so-called NUTS sampler, which is
37:21
this state-of-the-art sampler that uses the gradient information and works much better in complex models. Specifically, these hierarchical models are often very difficult to estimate, but this type of sampler does a much better job, and that was actually one of the reasons to develop PyMC3. Okay. Oops. And then with the traceplot command we can just create a plot.
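Putting that walkthrough together, a hedged sketch of the hierarchical model, continuing from the `data` and `idx` arrays above (my reconstruction: the talk converts a group mean and variance into alpha and beta, here I use the shorter mean/concentration transform instead, and the prior shapes are assumptions):

    import pymc3 as pm

    with pm.Model() as hierarchical_model:
        # Group level: where the algorithms sit on average, and how concentrated they are
        group_mean = pm.Beta('group_mean', alpha=5, beta=5)
        group_conc = pm.Gamma('group_conc', alpha=1, beta=0.1)

        # Because these are Theano expressions, we can freely transform and combine them
        alpha = group_mean * group_conc
        beta = (1 - group_mean) * group_conc

        # One theta per algorithm in a single statement, via the shape argument
        theta = pm.Beta('theta', alpha=alpha, beta=beta, shape=n_algos)

        # The index array picks the matching theta for every entry of the 2-D data array
        pm.Bernoulli('obs', p=theta[idx], observed=data)

        start = pm.find_MAP()
        step = pm.NUTS(scaling=start)    # gradient-based sampler suited to hierarchical models
        trace = pm.sample(2000, step=step, start=start)

    pm.traceplot(trace)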
37:46
Don't mind the right side, but here now we get our estimates of the group mean, and again, we have not a single value but rather the confidence. So on average we think it's about 46 percent.
38:01
We have the scale parameter, and we have 20 individual algorithms, right? That would be theta 1 to theta 20, all of them constraining each other in that model. So that's pretty cool. So, to wrap up, hopefully I convinced you that probabilistic programming is pretty cool, and
38:22
that it allows you to tell a generative story about your data, right? And if you listen to any tutorial on how to be a good data scientist, it is all about telling stories about your data, right? But how can you tell stories if all you have is that black-box inference algorithm? So I think that's where probabilistic programming is
38:40
really quite an improvement. You don't have to worry about inference; these black-box algorithms work pretty well. You have to know what it looks like when they fail, and it can be tricky then to get it going, so it's not super trivial. But still, they often work out of the box.
39:03
Lastly, PyMC3 gives you these advanced samplers. I'm gonna skip that and go to further reading. So check out Quantopian if you want to design algorithms that hopefully have a higher chance than 50% of beating the market.
39:22
For some content on PyMC3, I have actually written a couple of blog posts on that, and currently that's probably the best resource for getting started, mainly just because there is not that much else written about PyMC3 in terms of documentation. And down here are
39:42
also some really good resources that I recommend for learning about that. So thanks a lot. Yes, please.
40:03
Yeah, so, so the question is:
40:25
Stan provides a lot of tools for assessing convergence and many diagnostics, but also the very nice feature of transforming the variables and placing bounds on them.
40:40
So, PyMC3 has the most common statistics that you want to look at, like the Gelman-Rubin R-hat statistic and all of that, and you can sample in parallel and then compare. And we do have support for transformed variables; it's not as polished as Stan, just because it's still an alpha, but yeah, it's there, and you can bound your parameters.
41:06
So yeah, that works, but it's not quite as streamlined yet. More questions?
41:27
So the question was: what if I can't use this sampler that we provide here, Hamiltonian Monte Carlo, because it's too expensive to use;
41:41
how difficult would it be to use my own sampler? That, I think, is a big benefit of PyMC3: you just basically inherit from the sampler base class and then you override the step method, and then you can do your own proposals and acceptances and rejections. So that's very easy. And if you look at Stan, for example, I
42:04
haven't done it, but I imagine that it's quite difficult; just when I look at the code, it's really hardcore C++ and all the templates make my head hurt. The other question, if I understood you correctly, was like, if you can't evaluate the likelihood... oh,
42:24
yeah. So the question is how it compares speed-wise, I guess, or if you write your own sampler in Python, won't that be slow?
42:51
So I think most of the time is actually not spent in the sampler but rather in evaluating the log likelihood of the model, and also the gradient computations, which are very difficult.
43:04
It's true that Stan is fast, but one, it's fast once it gets started; it actually takes quite a while to compile the model. So in that sense I haven't really done the speed comparison, and we recently have noticed some areas where PyMC3 is not fast,
43:26
and we need to fix those and speed it up. Certainly the Stan guys have done a lot to really make it shine, and that's the benefit of having C++. But on the other hand, one benefit I think of Theano is that it does all these simplifications
43:43
to the compute graph and does clever caching, and you can even run it on the GPU. We haven't really explored that to the fullest extent yet, but I think there are lots of potential speedups that just Theano could give us.
44:01
Another answer to your question as well: if, for example, you really spend that much time in your sampler just proposing draws, you could also use Cython, for example, and code it yourself in that. Yes?
44:21
The question is about parallel sampling, and that is possible. So there is just a psample function instead of the sample function, and that will distribute the model. It doesn't quite work in every instance yet, but yeah, it uses multiprocessing, so you get true parallelization.
44:43
Just as an aside, there's this really cool project that someone on the mailing list just wrote about; that is with PyMC2, but the same trick could be applied to PyMC3. He uses Spark to basically do the sampling in parallel on big data, like if you have data that doesn't fit on a single machine:
45:02
you can run individual samplers on subsets of the data in parallel and then aggregate them, and Spark lets you do that very nicely. He basically hooked up PyMC and Spark, so that's really, really exciting.