Probabilistic Programming in Python
Formal Metadata
Title 
Probabilistic Programming in Python

Title of Series  
Part Number 
51

Number of Parts 
120

Author 
Thomas Wiecki

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2014

Language 
English

Production Place 
Berlin

Content Metadata
Subject Area  
Abstract 
Thomas Wiecki: Probabilistic Programming in Python. Probabilistic programming allows flexible specification of statistical models to gain insight from data. Estimation of best-fitting parameter values, as well as uncertainty in these estimations, can be automated by sampling algorithms like Markov chain Monte Carlo (MCMC). The high interpretability and flexibility of this approach has led to a huge paradigm shift in scientific fields ranging from cognitive science to data science and quantitative finance. PyMC3 is a new Python module that features next-generation sampling algorithms and an intuitive model specification syntax. The whole code base is written in pure Python and just-in-time compiled via Theano for speed. In this talk I will provide an intuitive introduction to Bayesian statistics and show how probabilistic models can be specified and estimated using PyMC3.

Keywords  EuroPython Conference, EP 2014, EuroPython 2014 
00:15
Thanks so much for coming out. This talk
00:17
is about my favorite topic, probabilistic programming. Let me introduce myself real quick: I recently relocated back to
00:26
Germany after my time at Brown, where I did my PhD on Bayesian modeling of decision making. For the last couple of years I have also been working with Quantopian, a Boston-based startup, as a quantitative researcher, and there we are building the world's first algorithmic trading platform in the web browser. The talk will be tangentially related to that, so let me show you a screenshot of what it looks like. This is essentially what you see when you go to the website: a web-based IDE where you can write Python code for a trading strategy, and then we provide historical data so that you can backtest how your algorithm would have done in, say, 2012; on the right you see how well it did. I will refer back to this: we are interested in whether an algorithm beats the market or the market beats it. I should also say that it is completely free and everyone can use it. OK, so I think the first and main question
01:27
is: well, why should you care about probabilistic programming?
01:33
It is not an easy question to answer, because probabilistic programming requires at least a basic understanding of some concepts
01:44
of probability theory and statistics, so for the first 20 minutes I will just give a very quick primer. To gauge the level of understanding, can I get a quick show of hands: who understands, on an intuitive level, how Bayes' formula works? OK, so most of you, so maybe you don't really need that primer, but hopefully it's still interesting. And then we have a simple example and a more advanced example that should be interesting even if you already know quite a bit about Bayesian statistics. To motivate this further, I really
02:20
like this contrast with machine learning that was drawn in an earlier talk, and that is:
02:29
chances are you are a data scientist, and maybe you use scikit-learn to train your machine learning classifiers. What this looks like is: on the left you have data that you use to train your SVM, and then the SVM will make predictions, and if those predictions are all you care about, that might be fine. But one central problem that most of these algorithms have is that they are very bad at conveying what they have learned; it is very difficult to inquire what goes on inside this black box right here. On the other hand, probabilistic programming is inherently open box, and I think the best way to think about it is as a statistical toolbox for you to create very rich models that are really tailored to the specific data you are working with. Then you can query that model and see what is going on and what was learned, so that you learn something about the data rather than just making predictions. The other big benefit, I think, and we will see this, is that you can feed any of these types of models to a black-box inference engine, the sampling algorithms, which work across a huge variety of models. So you don't really have to worry about the inference at all; all you have to do is basically build the model, and the inference part, in most cases, just gives you the answers you are looking for. There is not really much in terms of solving equations. Throughout this talk I will use a very simple example that most of you will be familiar with, and that is A/B testing: when you have two websites and you want to know which one works better, you pick some measure of usage, maybe conversion rate, or how many users click on an ad, and you want to test that. So you split the users into two groups, give group one website A and group two website B, and then you look at which one had the higher measure. The problem is of course quite general, and since I am coming from a finance background, I will switch back 
and forth: statistically speaking it is the same problem if we have two trading
04:52
algorithms and we want to know which one has a higher chance of beating the market on each day. So to understand this, let's
05:05
generate some data and really see what the simple answers you might come up with yield,
05:13
and how we can improve upon them. You might be surprised that I am not using real data, but I think that is actually a critical step: before you apply a model to real data, you should always use simulated data, where you really know what is going on and what the parameters are that you want to recover, so that you know the model works correctly; only then can you be sure that you get correct answers when applying it to real data. So the data I will work with is binary. The statistical process that generates binary events is called Bernoulli, and it is essentially just a coin flip: a probability of coming up heads. I can use the Bernoulli distribution from scipy.stats, and the probability of the coin coming up heads stands for the algorithm beating the market on a particular day, or the website converting the user. Here I am sampling 10 trials, so the result will be just a bunch of binary zeros and ones. I am generating data for two algorithms, one with a 50% chance and one with 60%, and we want to know which one is better. The easiest thing you might come up with is: well, let's just take the mean. Actually, statistically speaking, that is not a terrible idea; it is called the maximum likelihood estimate. If you ask an applied mathematician what to do, that might be the answer, because proofs in applied math often work in a very similar way: you have this problem, and then you say, well, OK.
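The data-generation step just described might look like this; a small sketch using scipy.stats as mentioned in the talk (the seed and variable names here are assumptions, not the talk's exact code):

```python
import numpy as np
from scipy import stats

np.random.seed(123)  # fixed seed; the talk deliberately picks an unlucky one

# True chances of beating the market on a given day (50% and 60%, as in the talk)
p_a, p_b = 0.5, 0.6

# 10 binary trials per algorithm: 1 = beat the market that day, 0 = did not
data_a = stats.bernoulli.rvs(p_a, size=10)
data_b = stats.bernoulli.rvs(p_b, size=10)

# The maximum likelihood estimate of a Bernoulli probability is just the sample mean
mle_a, mle_b = data_a.mean(), data_b.mean()
print(mle_a, mle_b)  # with only 10 trials these can land far from 0.5 and 0.6
```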
07:00
Let the amount of data go to infinity, then solve it, and you can show the estimator works correctly in that case. That is great, but what do you do if you don't have an infinite amount of data? That is the much more likely case you will be in, and that, I think, is where Bayesian statistics works well. So what happens in our case if I just take the mean of the data I generated? As you can see, in this case we
07:31
estimate that the chance of algorithm A beating the market is 10 percent, and 40 percent for the other one. Obviously that is completely off: 50 and 60 percent is what I generated from. The obvious answer for why this goes wrong is that I was unlucky, and the observant members of the audience will have noticed that I used a particular random seed here: I fiddled with that seed to produce exactly this weird sequence of events. But certainly that can happen with real data: you can be unlucky, and the first 10 visitors of the website just don't convert. The central thing that I think is missing here is the uncertainty: we say 10 percent for that algorithm, but that is just a number; we are missing the whole confidence behind the number. So for the remainder of the talk, the recurring topic will be trying to quantify that uncertainty. Now you
08:32
might say: well, there is this huge body of work, let's just use
08:37
frequentist statistics, which has designed statistical tests to decide which one of those two is better, or whether there is a significant difference. You might run a t-test, and that returns a p-value indicating how likely it is to observe this data if it was generated by chance alone. That is certainly a practical thing to do, but one of the central problems of frequentist statistics is that it is incredibly easy to misuse. For example, you might collect some data, and the test doesn't show anything; then on the next day you get more data, so what do you do? You just run another test with all the data you have now. You have more data, so the test should be more accurate, right? Unfortunately, that is not the case, and you can see that if you just create a very simple experiment,
09:30
for example this procedure: generate 50 random binaries with 50 percent probability for both groups, so there is no difference between them; then start with just two events and run a t-test; if that is not significant, add another event and test again, and repeat that process of continuously adding data and testing whether there is a difference. If at any point the p-value is smaller than 0.05, declare a difference and return true, otherwise return false. I ran that a thousand times and looked at the probability of the test claiming a significant difference when there is no difference at all between the two; both are 0.5. It is 36.6 percent in this case, which is absurdly high, so this procedure really fails if you use it that way. Granted, the t-test is not designed to be used in that specific way, but this usage is extremely common. For me, one of the central problems is that frequentist tests really depend on the intentions you had while collecting the data.
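The repeated-testing procedure just described can be simulated in a few lines; this is a sketch, with an assumed seed, so the exact rate will differ from the 36.6% reported in the talk, but it lands far above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # assumed seed; the talk's 36.6% figure depends on its own setup

def peeks_falsely(n_days=50):
    """Simulate 'peeking': re-run a t-test every time a new observation
    arrives, and declare a difference the first time p < 0.05.
    Both groups share the same true rate (0.5), so True is a false positive."""
    a = rng.binomial(1, 0.5, size=n_days)
    b = rng.binomial(1, 0.5, size=n_days)
    for i in range(2, n_days + 1):
        _, p = stats.ttest_ind(a[:i], b[:i])
        if p < 0.05:
            return True
    return False

rate = np.mean([peeks_falsely() for _ in range(1000)])
print(rate)  # far above the nominal 5% false positive rate
```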
10:46
So if you use a different procedure for collecting the data, for example, as I just did, adding data every day, then you need a different statistical test. If you think about this, that is actually pretty crazy: as a data scientist, often
11:04
you just get data from a database and have no idea what the intentions were of whoever collected it, and you want to be free to explore the dataset and run all kinds of statistical tests to see what is going on. So while the frequentist approach is certainly not wrong, it is often very constricting, and if you don't do things exactly correctly, you can shoot yourself in the foot. I think that is really not a good state for statistics to be in. So let's move on quickly. At the core
11:40
we have Bayes' formula, and if you don't know what that is: essentially it is just the formula that tells us how to update our beliefs when we observe data. That implies that we have prior beliefs about the world, and when we see data, we apply Bayes' formula to update those beliefs in light of the new data, which gives us the posterior. In general, these beliefs are represented as random variables, so let me very quickly talk about the Bayesian way of thinking about models. Statisticians like to call their parameters random variables, so let's define a prior for a random variable theta, and theta will be the random variable for our algorithm beating the market, or the website converting a user: what is the chance that that happens? The best way to think about a random variable, as opposed to the variables you might know from Python programming, which have a single value, say i = 5, is that here we don't know the value. We want to reason about the value; we have some rough idea what it is. So rather than allowing just one value, we allow multiple values and assign each possible value a probability. On the x-axis we have the possible states the system can be in; for example, the algorithm can have a 50% chance of beating the market, and I assume that is the most likely case; that is my personal prior belief without having seen anything. I wouldn't be terribly surprised to see something like 60%, but it is less likely; 80% is considerably less likely, but still possible; 100%, an algorithm that beats the market every single day, I think would be next to impossible, so I assign it a very low probability. That is a very intuitive way of thinking about it. So
now let's see what happens when I observe data. For that I created this interactive widget where I can add data with this slider, and it will update the probability distribution. This will be our posterior; currently there is no data available, so our posterior is just our prior, the belief we have without having seen anything. Now I add a single data point, a success: we ran the algorithm for a single day and it beat the market. As you can see, the distribution shifted a little bit to the right, and that represents our updated belief that it is a little bit more likely now that the algorithm generates positive returns. Now let's reproduce the example from before, where we had 1 success and 9 failures, and the maximum likelihood estimate said the algorithm has a 10 percent chance of beating the market. That was ridiculous: with that amount of data there is no way we could say that, and with our prior knowledge we would never assume that 10% is actually the probability. So as I update the probability distribution here, you see our updated belief: certainly, with 9 failures, we assume there is a lower chance of success for that algorithm, which is represented by the distribution moving to the left, but we still acknowledge that 10% is extremely unlikely under this posterior. That is the influence of the prior: we said 10% is unlikely, so it pulls our estimate away from these very low values. The other thing to note is that the distribution is still pretty wide, and that is our uncertainty measure: the width of the distribution. The wider it is, the less certain I am about that particular value. Now just imagine what the distribution would look like if I moved the failures up to 90 and the successes to 10; basically, we are now observing data that is in line with the hypothesis of a 90% failure probability.
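For this Beta/Bernoulli setup, the update the widget animates has a simple closed form: observing s successes and f failures turns a Beta(a, b) prior into a Beta(a + s, b + f) posterior. A small sketch, where Beta(15, 15) is an assumed stand-in for the "centered around 50%" prior:

```python
from scipy import stats

# Assumed prior, roughly centered on 50%: Beta(15, 15)
prior = stats.beta(15, 15)

# Conjugate update: Beta(a, b) + (s successes, f failures) -> Beta(a + s, b + f)
after_1_success = stats.beta(15 + 1, 15)        # one winning day
after_1s_9f     = stats.beta(15 + 1, 15 + 9)    # 1 success, 9 failures
after_10s_90f   = stats.beta(15 + 10, 15 + 90)  # 10x the data, same 10% success rate

print(after_1_success.mean())  # nudged just above 0.5 by a single success

# The prior pulls the estimate well above the raw 10% MLE...
print(after_1s_9f.mean())      # 0.4, not 0.1

# ...and more data narrows the posterior, i.e. increases our confidence
print(after_1s_9f.std(), after_10s_90f.std())
```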
As you can see, the main thing that happens is that the distribution gets much narrower, and that represents our increased confidence: having seen more data, we have more confidence in the estimate. By the way, that is exactly what IPython is great for: I can use these interactive widgets in a live notebook. But what is the catch with all of this? It sounds too good to be true: you just create a model, you update your beliefs, and you're done. Well, it is not always that easy, and one of the main difficulties is that this formula, in most cases, cannot be solved analytically. The case I just showed you is extremely simple; you can apply a few rules, cancel terms, and compute the posterior analytically. But with even just slightly more complex models you get multidimensional integrals to infinity that will make your eyes bleed; no sane human would be able to solve them. I think historically that is one of the main reasons why Bayes' formula, which has been around since the 18th century, has not been widely used until recently. Now it is having a renaissance, simply because people are now able to solve for it: the central idea of probabilistic programming is that if we cannot solve something, we approximate it. Luckily for us, there is a class of algorithms for this; the most commonly used is called Markov chain Monte Carlo, and instead of computing the posterior analytically, we draw samples from it, which is about the next best thing we can do. Just due to time constraints I won't go into the details of MCMC, so we'll just assume that it is pure black magic and it works. It is a very
18:22
simple idea, but the fact that it works in such cases is still mind-blowing to me, and the big benefit is that it can be applied very widely: often you just specify the model, say go, and it will give you the answer. So what does MCMC sampling look like? As we have seen before, this is the posterior that we want; ideally we would have a closed-form solution,
18:46
which we cannot get in reality. So instead we are going to draw samples from the distribution, and once we have enough samples we can make a histogram, and it will start resembling the posterior. OK, so let's get to PyMC3. It is a probabilistic programming framework written in Python, for Python, that allows you to construct probabilistic models using an intuitive syntax.
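To make the "black magic" a little less magical, the draw-samples-and-histogram idea can be illustrated with a toy random-walk Metropolis sampler for the coin-flip model. This is a minimal sketch for intuition, not the sampler PyMC3 actually uses, and all names and tuning values here are assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.6, size=100)  # simulated win/lose days, true rate 60%

def log_posterior(theta):
    # Unnormalized log posterior: Beta(2, 2) prior + Bernoulli likelihood.
    # MCMC only ever needs this; the intractable normalizing integral cancels
    # out in the acceptance ratio.
    if not 0.0 < theta < 1.0:
        return -np.inf
    return (stats.beta.logpdf(theta, 2, 2)
            + stats.bernoulli.logpmf(data, theta).sum())

# Metropolis: random-walk proposals, accepted with probability min(1, ratio)
theta, draws = 0.5, []
for _ in range(5000):
    proposal = theta + rng.normal(0, 0.1)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    draws.append(theta)

posterior = np.array(draws[1000:])  # discard burn-in
print(posterior.mean())  # a histogram of these draws approximates the posterior
```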
19:12
One of the reasons for doing PyMC3, rather than, as might seem obvious, extending PyMC2,
19:23
is that it is actually a complete rewrite: it uses no code from PyMC2. There were a couple of reasons. One is just technical debt: the PyMC2 codebase is pretty complex and requires compiled Fortran code, which always causes huge headaches for users to get working. PyMC3 is actually very simple code, and one of the reasons for that is that it relies on Theano for all the heavy lifting, for the whole compute engine: we are basically just creating the compute graph and then shipping everything off to Theano. The other benefit we get from Theano is that, because it compiles the model, it can give us gradient information about the model, and there is a new class of algorithms, called Hamiltonian Monte Carlo, that uses that gradient; these advanced samplers work much better on very complex models. They are much more powerful, but they require that extra gradient step, which is not easy to get; luckily for us, Theano provides it out of the box, so we don't really have to do anything. The other point I want to stress is that PyMC3 is very extensible and also allows you to interact with the model much more freely. Maybe you have used JAGS or WinBUGS, or Stan, which is
20:43
part of the very interesting recent progress in probabilistic programming frameworks, and those are really cool. One problem I personally have with them is that they require you to write the probabilistic program in a domain-specific language; that gets compiled, you have some wrapper code to get the data into Stan, and then some more code to get the results out of Stan, and for me that is always very cumbersome; you can't really see what is going on inside. In PyMC3 you can write the model in Python code and then interact with it freely, so you essentially never have to leave Python, and that, for me, is very powerful. You can think of it much more as a library, and we will see that in a second. To credit the authors: John Salvatier is the main guy who came up with the
21:36
design, and Chris Fonnesbeck also programs on it quite a bit. It is still beta, but it already works quite well; the main reason it is still beta is mainly that we are missing good documentation. We are currently writing that, but if you are interested and would like to help out, that would be more than welcome as a PR. OK, so let's look at the model from the earlier example and see how we can solve it in PyMC3. For that, let's first write down the model in statistical terms. We have these two random variables that we want to reason about, say theta_a and theta_b, and they represent the chance of each algorithm beating the market. This tilde means "is distributed as": we are not working with numbers but with distributions. This is the Beta distribution, the distribution we were looking at at the beginning, defined from 0 to 1; if your parameter is a probability, the Beta distribution is the one to use. So far this is the thing we want to reason about, to learn about, given data. And how do we learn about it? We observe data, and the data I simulated was binary; it came from the Bernoulli distribution, so we assume the data is distributed according to the Bernoulli distribution over zeros and ones. Before, I just fixed the probability of the Bernoulli distribution at 0.5; now we actually want to infer that value, so since we don't know it, we replace it with the random variable theta from above. That is how these models commonly look. The other point I want to make here is that you can really see how you are creating a generative model. You might wonder how to construct a model, and I think a good path is to just think of how the data would have been generated: here I know that this probability generated random binary data, so the model mimics that. You can
get arbitrarily complex with this: you have all these hidden causes that are somehow related in
24:03
complex ways to the data, and then you can invert that model using Bayes' formula to infer these hidden causes. So to make this concrete, I
24:17
again generated data, a bit more now: again 50 and 60 percent probability of beating the market, or conversion rate,
24:24
and 300 trials each. This is what the model looks like in PyMC3. First we just import pymc3 as pm, and we
24:34
instantiate the model object, which will hold all the random variables. One of the improvements over PyMC2 is that everything you specify about the model happens under this "with" context: everything you do underneath here will be included in that model object, so you don't have to pass it around all the time. What is underneath should look pretty familiar from before. We had these random variables, theta_a distributed as a Beta distribution, and here I write the same thing in Python code: theta_a is a Beta; we give it a name and the two parameters, alpha and beta, which for this distribution you can read as prior numbers of successes and failures. This is the prior from before that was centered around 50%. I do the same thing for theta_b, and then I relate those random variables to my data. As I said before, the data is Bernoulli, which I instantiate with a name, and instead of a fixed p I now give it the random variable theta_a, which links them together. And since this is an observed node, I give it the data: the 300 binary numbers generated just before. So this links the data to the random variable, and the same for B. Up until here nothing has happened; we just basically plugged together a few probability distributions that make up the whole story I want to tell about how my data was structured. Now, it is often a good idea to start sampling from a good position, and for that we can just optimize the log probability of the model using find_MAP, which finds the maximum a posteriori value. Then we instantiate the sampler; there are various ones to choose from, and here I am using the Slice sampler, which works quite well for these simple models. Now I actually want to draw samples from the posterior, and for that there is the sample function: I tell it how many samples I want, 10,000; I provide it with the step method; and I give it the starting point. When I make
27:12
it will take a couple of seconds to run the sampling algorithm, and then it
27:18
will return a data structure that I call trace here, which is essentially a dictionary: for each random variable that I have assigned, I get the samples that were drawn. And now that I have that, I can ask questions about my posterior. Here I'm using seaborn, which is a statistical plotting
27:45
library on top of matplotlib, and you should definitely check it out; it creates very nice statistical plots. For example, it has this nice distplot function, which is basically a histogram, but one that looks much nicer and has, for example, this smooth shape line. I give it the samples that I drew with my MCMC sampling, for theta_a and then for theta_b, and it plots the posterior that I created, which again is the combination of my prior belief updated by the data that I have seen.

Now I can reason about that. The first thing to see is that the center of mass for theta_b, the probability that that algorithm beats the market, is at about 60%, and that is what we used to generate the data, so it's good that we get that back; again, that's why we simulated the data, so that we know we are actually doing the right thing. The other one is around 50%, or 49%.

The other thing to note is that now, instead of just having a single number that simply fell from the sky, as we would get if we just took the mean, we have our confidence expressed as well: we know how wide the distribution is. We can answer many questions about it, like: how likely is it that the chance of success of algorithm B is at least 65%? And we get a specific number out that represents a level of certainty.

We can also do other interesting things, like hypothesis testing, to answer our initial question of which of the two algorithms actually does better. For this we can just compare the posterior samples for theta_a to the posterior samples for theta_b: we simply count how many of one are larger than the other, and that tells us that with a probability of 99.1%, algorithm B is better than A, which is exactly what we wanted. So by consistently having the confidence estimate carried through from the beginning to the end, everything we can say has a confidence, a probability, associated with it.

OK, so that was maybe a little boring up until now; hopefully it gets more interesting. Consider the case where, instead of just 2 algorithms, we might have 20, and we haven't tested them on many users yet. Maybe we want to know not only each individual algorithm's chance of success, but also how the algorithms do overall, the group average: are they, taken together, consistently beating the market or not? The easiest model we can build is just the one we did before, but instead of two thetas, A and B, we have 20. That's fair, and it's called an unpooled model, but it's somewhat unsatisfying, because we can probably assume that these algorithms are not completely separate: they all work in the same market environment, some of them will have similar properties, some use similar methods, so they will be related somehow. There will be differences, but there will also be similarities.
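For this particular model, the comparison described above can even be checked by hand, because the Beta prior is conjugate to the Bernoulli likelihood: the posterior is just Beta(prior successes + observed successes, prior failures + observed failures). The sketch below uses stand-in simulated data and a Beta(5, 5)-style prior (not the talk's exact numbers), draws posterior samples directly, and counts how often theta_b exceeds theta_a, which is exactly the hypothesis test the talk describes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in data: 300 simulated trades per algorithm
data_a = rng.binomial(1, 0.50, 300)
data_b = rng.binomial(1, 0.60, 300)

# Conjugate update: Beta(5 + successes, 5 + failures) is the exact
# posterior that MCMC would converge to for this model.
post_a = rng.beta(5 + data_a.sum(), 5 + (1 - data_a).sum(), size=50000)
post_b = rng.beta(5 + data_b.sum(), 5 + (1 - data_b).sum(), size=50000)

# "With what probability is B better than A?"
p_b_better = (post_b > post_a).mean()

# "How likely is it that B's success rate is at least 65%?"
p_b_at_least_65 = (post_b >= 0.65).mean()
```

With the talk's real data, the first quantity is the roughly 99% figure quoted; with other simulated data it will of course vary.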
31:21
And this simplest model does not incorporate that: there is no way that anything I learn about theta 1 would apply to theta 2. The other extreme alternative would be to have a fully pooled model, where instead of assuming each algorithm has
31:34
its own random variable, I just assume one random variable for
31:37
all of them. That is also unsatisfying, because we know there is structure in our data which we are then not exploiting, and also, even though we might get group estimates, we could not say anything about a particular algorithm and how well it was doing. The solution, which I think is really elegant, is called a partially pooled, or hierarchical, model, and for that
32:07
we add another layer on top of the individual random variables. Up until here we only have the model we had before, with all the thetas independent, but what we can do is, instead of placing a fixed prior on them, actually learn that prior, and have a group distribution that applies to all of them. These models are really powerful and have many nice properties. One of them is: whatever I learn about theta 1 from the data will shape my group distribution, and that in turn will shape the estimate of theta 2. So everything I learn about individuals informs the group, and what I learn about the group I can apply to constrain the individuals.

Another example, where we do this quite frequently in my research in psychology: we have a behavioral task that we test maybe 20 subjects on, and we don't have enough time to collect a lot of data, so if we fit a model to one subject by itself, the estimates we get will be very noisy. The hierarchical model is a way to basically learn from the group and apply that back to the individuals, so we get much more accurate estimates for each individual. That's a very nice property of these hierarchical models.

So here I generate data again, and the data will essentially just be an array of 20 times
33:43
100: 20 algorithms, with 100 binary outcomes each, so each row holds the zeros and ones for one individual. Then, for convenience, I also create this indexing mask that I will use in a second; it might not make sense right now, but just keep it in the back of your mind: the first row of it is just an index for the first individual, used to index into the random variables. So this is the data we are going to work with.

OK, so now for that model in PyMC3. Here I first create my group variables, a mean and a scale: what is the average rate, the average chance of beating the market, and how variable are the individual thetas around it. Which distributions to place on those is a modeling choice: here I use a Beta distribution for the group mean, and a Gamma distribution for the group variance, because a variance can only be positive; but the details of that are not that critical.

Now, unfortunately the Beta distribution is parameterized in terms of alpha and beta parameters, and not in terms of mean and variance. Fortunately, there is a very simple transformation we can apply to the mean and variance parameters to convert them to alpha and beta values, which is what I'm doing here. The specifics of it are not that important; I just wanted to show how easy this is, because coming from some other languages it is not a given that you can freely combine these random variables and transform them and still have everything work out. The reason it does work is that these are just Theano expression graphs: when you add and multiply them, it will actually take the probabilistic pieces of the formula, combine them, and do the math of combining them in the background.

Then I set up the random variables for each algorithm, and instead of writing down all 20 of them, I can pass in the shape argument, and that will generate a vector of 20 random variables; so this is not a single theta, it's actually 20 of them. Before, you will remember, I had just my hard-coded prior of 5 and 5 here, the prior belief; now I'm replacing that with the group estimates that I am also going to learn. And again my data is going to be Bernoulli distributed, and for the probability I'm now going to use that index I showed you before: essentially it indexes into the theta vector in a way that turns it into a two-dimensional array of the same shape as my data, so it matches one-to-one and just does the right thing. Then I pass in, as the observed data, the rows of binary variables for each algorithm.

Again I'm finding a good starting point, and note that here I'm using the NUTS sampler, which is a state-of-the-art sampler that uses gradient information. It works much better on complex models; these hierarchical models specifically are difficult to estimate, but this type of sampler does a much better job on them, and that was actually one of the reasons to develop PyMC3. Then, with the traceplot command, we can just create a plot of all of this.
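The mean/variance-to-alpha/beta conversion mentioned above is standard moment matching for the Beta distribution; the talk doesn't show the formula, so this is just the textbook version, sketched as plain Python:

```python
def beta_params_from_moments(mean, var):
    """Convert a (mean, variance) parameterization of the Beta
    distribution to the (alpha, beta) parameterization.
    Valid for 0 < mean < 1 and 0 < var < mean * (1 - mean)."""
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

# Round trip: a Beta(alpha, beta) with these parameters has exactly
# the requested mean and variance.
alpha, beta = beta_params_from_moments(0.46, 0.01)
mean_back = alpha / (alpha + beta)
var_back = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
```

In PyMC3 the same expressions can be applied directly to the group-level random variables, since they are Theano expression graphs rather than plain numbers.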
37:49
Don't mind the right side of the plot, but you can see we get our estimate of the group mean, and again we have not a single value but a whole distribution with its confidence: on average we think it's about 46%. We also have the scale parameter, and we have the 20 individual thetas, so that
38:08
would be theta 1 to theta 20, and they are all constraining each other in the model. So that's pretty cool. To come to my conclusions: probabilistic programming is
38:20
pretty cool, and it allows you to tell a generative story about your data. If you listened to the tutorial on how to be a good data scientist, a large part of it is telling stories about your data; so how can you tell stories if all you have is a black box? That's where probabilistic programming
38:39
is really quite an improvement. You don't have to worry too much about inference: it is handled by black-box algorithms that work pretty well. You do have to know what it looks like when they fail, and they can be tricky to get going, so it's not entirely trivial, but still, they often work out of the box, and lastly, PyMC3 gives you these advanced samplers.

As for further reading: on quantitative finance, on designing trading algorithms that hopefully have a higher chance than 50% of beating the market, and on PyMC3, I have actually written a couple of blog posts, and those are probably the best resource for getting started; mainly that's just because there is not that much else written about PyMC3 in terms of documentation. And there are also some really good general resources on Bayesian statistics that I recommend.

[Audience question]
40:11
Yes, so the question is: Stan provides a lot of tools for assessing convergence and many diagnostics, and also this very nice feature of transforming variables and placing bounds on them; how about PyMC3? So, PyMC3 has the
40:43
most common convergence statistics that you want to look at, like the Gelman-Rubin R-hat statistic, and it can sample chains in parallel and then compare them. We do have support for transformed variables; it's not as polished as in Stan, just because PyMC3 is still alpha, but you can bound parameters, so that works; it's just not quite as streamlined yet. More questions?

[Question] In a real-world problem, say I can't use the samplers that you provide because they are too expensive; how difficult would it be to use my own sampler? That, I think, is a big benefit of PyMC3: you just inherit from the step-method class and then you override the step method, and then you can do your own proposals and your own acceptance and rejection, so that's very easy. If you look at Stan, for example, I haven't done it, but I imagine it's quite difficult; when I look at the code, it's really hardcore C++, and all the templates make my head spin.

[Question] How does this compare, performance-wise, to Stan, or to writing your own sampler in Python? I think most of the time is actually not spent in the sampler, but rather in evaluating the log likelihood of the model, and the gradient computations are also difficult. It's true that Stan is compiled, so it's fast once it gets started, but it takes quite a while to compile the model in the first place, so I haven't really done that comparison. We have noticed some areas where PyMC3 is not fast, and we need to fix those and speed it up; the Stan developers have done a lot to really make it fast, and that's the benefit of compiled C++ code. On the other hand, one benefit of Theano, I think, is that it does all these simplifications to the compute graph, there's caching, and you can run it on the GPU. We haven't really explored that to the fullest extent yet, but I think there are lots of potential speedups that Theano alone could give us. And to also answer the question: if you really do spend that much time in the sampler just proposing jumps, you could also use Cython, for example, and code it up that way.

[Question] What about parallel sampling? That is possible: there is a psample function alongside the sample function, and that will distribute the model across processes. It doesn't quite work in every instance yet, but it uses multiprocessing, so you get parallelization that way. And just as an aside, there is a really cool project by someone on the mailing list who looked at what could be applied to PyMC3: he uses Spark to basically do the sampling in parallel on big data. If you have data that doesn't fit on a single machine, you can run individual samplers on subsets of the data in parallel and then aggregate them, and Spark lets you do that very nicely; he basically hooked up PyMC3 and Spark, so that's really exciting.
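The exact base class to inherit from isn't named in the talk, so as a sketch, here is the kind of propose/accept/reject logic an overridden step method would contain, written as a self-contained Metropolis step and smoke-tested on a standard normal target:

```python
import numpy as np

def metropolis_step(x, logp, rng, scale=1.0):
    """One Metropolis step: propose a random-walk jump, then accept
    or reject it based on the log-probability ratio."""
    proposal = x + rng.normal(0.0, scale)
    # Accept with probability min(1, p(proposal) / p(x))
    if np.log(rng.uniform()) < logp(proposal) - logp(x):
        return proposal
    return x

# Target: standard normal (log density up to an additive constant)
logp = lambda x: -0.5 * x * x

rng = np.random.default_rng(1)
samples = np.empty(20000)
x = 0.0
for i in range(len(samples)):
    x = metropolis_step(x, logp, rng)
    samples[i] = x
```

After discarding some burn-in, the samples should have mean near 0 and standard deviation near 1, which is the kind of sanity check you would also want to run against a custom step method before trusting it on a real model.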