Introduction to Probability and Statistics 131B Lecture 6
Formal Metadata
Title: Introduction to Probability and Statistics 131B Lecture 6
Part Number: 6
Number of Parts: 7
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/13606 (DOI)
Transcript: English (auto-generated)
00:05
Alright, last time we talked about method of moments estimators and the
00:47
idea was that the underlying distribution relies on some parameters that you want to estimate.
02:09
For example, you might be sampling from a normal distribution with some unknown parameters mu and sigma squared. So theta one would be mu, theta two would be sigma squared.
02:30
In a gamma distribution you would have two parameters you want to estimate.
03:22
Remember the multinomial distribution? You distribute things in cells or boxes. You have to distribute little n items in k boxes. The probability that an item lands in box one is p1, in box two is p2, etc.
03:42
So maybe I'll write down the density here.
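For reference, the standard multinomial pmf for little n items in k boxes, presumably what is being written on the board here:

```latex
P(X_1 = x_1, \dots, X_k = x_k \mid p_1, \dots, p_k)
  = \frac{n!}{x_1!\, x_2! \cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k},
\qquad x_1 + x_2 + \cdots + x_k = n .
```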
04:35
Does this look familiar? Do you know what this symbol means?
04:42
Product, yeah, product of these numbers. Instead of sum it's product. So here there are k parameters that might have to be estimated. And I guess finally maybe theta one is lambda.
05:04
If the underlying distribution is Poisson with parameter lambda. So these are examples where the parameter might be just one parameter, there might be two, there might be many.
05:27
And you want to estimate them and typically, or sometimes, I don't know what the proper word is here,
05:45
the parameters are functions of the moments.
06:32
So i.e. theta sub j would be some function f sub j of some number of the moments.
06:42
And we can approximate the moments by the sample moments by this law of large numbers.
07:47
Which says that 1 over n sum i equal 1 to n, the kth power of xi converges to the expected value of x1 to the k,
08:18
which is mu k. In what sense?
08:53
This is not my vertical dyslexia, I'm not writing an a upside down. This means what?
09:00
For all epsilon bigger than zero. Oops, I'm sorry.
09:23
Okay. That's the version of the law of large numbers that we have. The probability that the sample kth moment differs from the real kth moment by more than epsilon goes to zero as n goes to infinity.
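Written out, that statement of the law of large numbers for the kth sample moment is:

```latex
\forall\, \epsilon > 0:\quad
P\!\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i^{k} - \mu_k \right| > \epsilon \right)
\;\longrightarrow\; 0 \quad \text{as } n \to \infty,
\qquad \mu_k = E\!\left[ X_1^{k} \right].
```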
09:41
Now, you might recall, you might not, or you might not have been in this class.
10:11
So what's SS1 or SSI? Summer session one. In 138 we had an exercise that said if g is continuous, remember this exercise? I heard a groan, so somebody must remember it.
10:33
If g is continuous, and we'll call this star, star holds, then, do you remember that exercise?
11:29
There's probably a theta hat in there, or a theta n instead of mu k hat, and a theta instead of mu k, but it's the same. And it just uses the definition of continuity, more or less.
11:45
Similarly, in multidimensional case, if, for example, that fj over there, if fj is continuous, as we have here,
13:04
each of these sample moments is converging to these moments, these fixed numbers, in this sense. In fact, you could look at the distance between this vector and this vector, the probability that
13:23
the distance between the vector inside here and this vector is bigger than epsilon goes to zero. And because fj is continuous, fj of that vector minus fj of this vector will go to zero in this sense.
13:45
Okay, but what is that, what is this thing giving you? This is giving you the method of moments estimator for the parameter theta j.
14:25
So, did I write it down somewhere? I should write it down over here, where theta j hat.
14:58
So, you get the parameter in the distribution, theta j, from this function of some of the moments.
15:07
We want to estimate the theta j's. How do we do that? Well, we estimate the mu's. How do we estimate the mu's? We use the sample moments, mu 1 hat, mu 2 hat, mu l hat. We use the same function that delivers us theta j when we put the moments in here.
15:25
What we've just shown is that if the function that gives you the parameter from the moments is continuous, then the method of moments estimator will converge to the true value of the parameter.
15:41
This is called consistency. If this happens, theta j hat is called consistent. I didn't provide all the details, but more or less we've just shown that if the function fj is continuous,
17:12
where fj is a function that gives you the true value of the parameter from the true values of the moments,
17:23
then theta j hat equal fj of mu 1 hat through mu l hat is a consistent estimator of theta j.
17:47
And that's because the mu hats are consistent estimators of the mu's, and this is preserved under continuity of mappings.
18:05
That would go to 1 because theta j hat is getting close to theta j. I'm looking at the probability that theta j hat is far from theta j there,
18:21
and the probability that it's far away is going to 0, so that means the probability that it's close is going to 1, which is what you want. You want it to be close to the thing you're estimating with high probability. Does that answer your question? OK, so what were the examples we had?
18:44
Maybe just review one of them. The normal, so we want to express mu as,
19:24
so if x is normal with mean mu, variance sigma squared, we call mu 1 the first moment of x, mu 2 the second moment of x,
19:50
and we want to express mu and sigma squared as functions of mu 1 and mu 2.
20:35
We know that this will be enough. We don't have to go to higher moments. And what are these functions?
20:48
What's the function in this case? It's just mu 1, right? And sigma squared?
21:06
Well, it's mu 2 minus mu 1 squared, correct? It's the second moment minus the first moment squared. And f1 and f2 are, they're both polynomials in the variables,
21:27
so they're both continuous. What would mu 1 hat be?
22:55
These are consistent estimators.
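A quick numerical sketch of these two method of moments estimators, assuming NumPy is available; the sample size and the true values mu = 2, sigma squared = 9 are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # pretend sample, true mu = 2, sigma^2 = 9

mu1_hat = np.mean(x)       # first sample moment
mu2_hat = np.mean(x**2)    # second sample moment

mu_hat = mu1_hat                    # f1(mu1, mu2) = mu1
sigma2_hat = mu2_hat - mu1_hat**2   # f2(mu1, mu2) = mu2 - mu1^2

print(mu_hat, sigma2_hat)  # close to 2 and 9 for a large sample
```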
23:02
OK, and if you go through and look at the Poisson or the gamma, you'll find that the functions that give you the parameters from the moments are continuous, so we get consistent estimators always. That's reassuring. It means that as the number of samples gets large, the estimator gets close to the true value of the parameter.
23:27
But the question always comes up, how close? And the question of how close involves something called the standard error, the distribution of the estimator.
24:24
Over here I should put variance of the estimator as a measure of the accuracy of the estimate. So you need to know something about the variance of the estimators.
24:42
That means you need to know the distribution of the estimators.
25:07
So in the normal example, the distribution of mu hat is what?
25:44
Well, let's look at mu hat. There it is. What are the xi's? The xi's are iid normal mean mu variance sigma squared random variables.
26:16
And what's mu hat? It's the sample average of those, 1 over n sum.
26:22
It's x bar. What's the distribution of x bar if the x's are this? What's the distribution of sum of independent normal random variables? Normal. What's the mean?
26:41
What's the expected value of this thing? Well, we get 1 over n, expected value of the sum. That'd be 1 over n sum of the expected values. Each expected value is mu. We get n of them, so we get n mu. Divide by n, we get mu. So this is normal with mean mu.
27:03
And what about the variance? Sigma squared over n. Because the variance of a sum of independent random variables is the sum of the variances. First of all, we get 1 over n squared times variance of the sum. That'd be 1 over n squared sum of the variances.
27:20
Each time we'd get a sigma squared here. Sum up n of those, we get n sigma squared. Divide by n squared, you get sigma squared over n. So the distribution of this estimator is normal with mean mu and variance sigma squared over n. So the standard error here, okay?
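In symbols, what was just derived:

```latex
\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
\;\sim\; N\!\left(\mu,\; \frac{\sigma^2}{n}\right),
\qquad
s_{\hat{\mu}} = \frac{\sigma}{\sqrt{n}} .
```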
28:04
But it's a little problematic. What is the standard error? Well, we're sampling to find out what mu and sigma are. So can we say what the standard error is here? No.
28:20
So what should we do? We estimate the sigma. What are we using? Our estimate, yeah. Since we don't know sigma squared, we resort to sigma hat over root n as the standard error.
28:52
Okay. And what's the distribution of sigma hat squared?
29:09
I believe it's something like this. n times sigma hat squared over sigma squared is chi squared with n minus 1 degrees of freedom. That's the result from chapter 6 or 5.
29:28
Or is it 6? 6, okay. I think I mentioned this last time. Did I mention this last time in class?
29:48
In the first period last time? The first day of class. Oh, okay. Of the second session. All right.
30:03
So what's the standard error here? We need the variance of a chi squared.
30:24
What's the variance of a chi squared with n degrees of freedom? Is that a hard problem or is that an easy problem? Yes, gamma comes in.
30:41
So let's see how. Let's do it with n degrees of freedom and then we'll just translate back to n minus 1.
31:02
Okay, so how do chi squared with n degrees of freedom arise? They are the sums of squares of independent standard normals.
31:40
Okay?
31:41
So let's call that capital X. We want the variance of X which is this.
32:10
What's the expected value of X squared? Well, what's the expected value of X? It's the expected value of this sum. So it'd be the sum of the expected values.
32:20
What's the expected value of Z1 squared? Well, Z1 is normal mean zero variance 1. So expected value of Z1 squared is the variance of Z1 which will be 1. So expected value of this is 1. Expected value of the next one is 1. Expected value of all of these is 1. There are n of them. So expected value of this sum is n.
32:42
So the expected value here is n. When we square it, we get n squared. So we'll get n squared here. Now let's do that.
33:33
Okay, now I have a square of a sum and so I'll get diagonal terms and off diagonal terms.
33:45
The diagonal terms look like that and then the non-diagonal terms will look like that.
34:15
We've gone through this kind of computation 73 or 74 times by now.
34:26
So I won't repeat it. Since they're identically distributed, we get n terms here that are the same.
34:41
How many terms are in this sum? n times n minus 1. Because there were originally n squared terms, we split off n of them there so there must be this many here. And here we get the expected value of zi squared times zj squared.
35:02
And zi squared and zj squared are independent random variables. So the expected value of zi squared times zj squared is the expected value of zi squared times the expected value of zj squared. And expected value of zi squared and expected value of zj squared are the variances of those random variables, which will be 1.
35:23
So expected value of zi squared and zj squared is 1. Is that okay for everybody? No, it's not? Okay. Let's do that over here.
35:51
This is because zi squared and zj squared are independent. Okay. Now what is zi? It's normal mean zero variance 1.
36:02
So expected value of zi squared is the variance of zi. And the variance of zi is 1. This is 1, this is 1, so this is 1. So this part here is n times n minus 1.
36:24
You look like you're still puzzling over that. No, behind you. Which part? This part? Oh, okay. Because that's how many terms there are in the sum.
36:43
Let's do this in two steps. Do you agree with this step from here to here? Okay. Now how many terms are in this sum?
37:01
How many choices are there for i? n. How many choices are there for j? n. So therefore there are n squared terms in the sum. Okay. I'm still going to do two parts. How many terms are in here? n. Right? And there are n squared altogether. So there must be this many in this sum.
37:26
Okay? Because if I add n to this, what do I get? n squared. Okay? So it's n... These are all the terms not on the diagonal. There are n diagonal terms. And since there were n squared before I took the diagonal out,
37:41
there are n squared minus n, or n times n minus 1, remaining. Okay? And is this okay with everybody? Alright, good. So this becomes n times n minus 1. And that reduces us to computing the fourth moment of a normal. Which isn't so bad, I don't believe.
38:22
It would be that.
38:41
And that would be the same as this. Because the function I'm integrating is an even function.
39:01
So the integral over the whole line would be twice the integral to the right of zero. And if I multiply 2 times 1 over root 2 pi, I get a root 2 in the numerator. And now I'm going to make a substitution: y equal to x squared over 2.
39:23
And then 2 dy is... No, I'm sorry. dy is x dx. Okay? And I'm going to try to substitute in these values.
39:46
The limits of integration, do they change? No.
40:03
Now we've got to figure out what goes here. For the dy, I had to borrow an x. I can take that x away from here. How many powers of x will be left?
40:20
Yeah, so I'll have an x cubed. And I have to express that in terms of y. So let's solve this equation for x in terms of y. And I think it just becomes that.
40:41
Is that right? So what's x cubed? It would be this cubed. Okay? So I'm going to put the 2 to the 3 halves out here.
41:04
And y to the 3 halves here. And can you do that integral?
41:26
Okay, what's the power of 2 out here? That's 2 to the 1 half, 2 to the 3 halves, I think that's 2 squared or 4.
42:08
And the power of y that I write, 3 halves, is 5 halves minus 1. Yeah, right, we recognize that's part of the gamma density with lambda equal to 1 and alpha equal to 5 halves.
42:23
So this is 4 over root pi times gamma of 5 halves.
42:47
Or it's just the gamma function, I guess, right, at 5 halves. And the gamma function at 5 halves is 3 halves times the gamma function at 1 half.
43:15
Gamma at 1 half is root pi. So this is 6.
43:24
I forgot that I'm doing, yeah, 6. 6. Did I say 6? I think that's right, isn't it, 6? Okay, where were we? Where does the 6 go?
43:45
Up there. This is 6.
44:06
Okay, that's expected value of x squared. So let's go back over here.
44:31
Looks like 5n.
44:50
Shouldn't it be 3 halves gamma 3 halves? Oh yes, I skipped, yeah, you're right, 3 halves gamma 3 halves. And then half, yeah, so this is 3, sorry.
45:03
I'm sorry, I jumped too far. You subtract 1 here, put it out and put that in here. So we get 3 halves and then 1 half gamma of 1 half. And that would be 3.
45:21
So this becomes 3. Here's an n squared and that subtracts away, it cancels it. And then we have a minus n, 3n, so it's 2n. That's the variance of a chi squared with n degrees of freedom.
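Collecting the steps of that computation:

```latex
X = Z_1^2 + \cdots + Z_n^2, \qquad E[X] = n, \qquad E\!\left[Z_i^4\right] = 3,
\qquad
E[X^2] = \sum_{i=1}^{n} E\!\left[Z_i^4\right]
       + \sum_{i \neq j} E\!\left[Z_i^2\right] E\!\left[Z_j^2\right]
       = 3n + n(n-1) = n^2 + 2n,
\qquad
\operatorname{Var}(X) = E[X^2] - (E[X])^2 = 2n .
```

So a chi squared with n minus 1 degrees of freedom has variance 2(n − 1), which is what gets used next.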
45:47
What about sigma hat squared then? What's the variance of sigma hat squared? Well, the variance of this thing would be,
46:39
does everybody agree with the two steps I made so far?
46:45
Here I'm just multiplying by 1 in a peculiar form and then I pull this part out. When I pull it out it has to be squared. And now this is a chi squared with n minus 1 degrees of freedom.
47:03
So its variance is 2 times (n minus 1).
47:28
OK, so what would the standard error be?
47:57
Then we use sigma hat squared here.
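Written out, the standard error computed here:

```latex
\operatorname{Var}\!\left(\hat{\sigma}^2\right)
  = \operatorname{Var}\!\left( \frac{\sigma^2}{n} \cdot \frac{n\hat{\sigma}^2}{\sigma^2} \right)
  = \frac{\sigma^4}{n^2} \cdot 2(n-1),
\qquad
s_{\hat{\sigma}^2} = \frac{\sigma^2\sqrt{2(n-1)}}{n}
  \;\approx\; \frac{\hat{\sigma}^2\sqrt{2(n-1)}}{n} .
```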
48:24
OK, so you can, or should, look at similar estimates for Poisson or gamma to get some idea about how accurate the method of moments estimators are.
48:43
You have to look at the distribution of the estimator and use an estimator for the standard error because typically the standard error involves one of the unknown parameters so you have to use an estimator of that parameter.
49:02
OK, so let's take a break and we'll come back and do another method for estimating called the maximum likelihood. OK, well I made a mistake in the first period. I forgot to put a root n here.
49:22
It says n, it's supposed to be the square root of n. I guess this should be the square root of n here too. OK, it's root n. Sigma squared over n is the variance.
49:40
The square root of that would be sigma over root n. Square root of n. OK, now for a different type of estimator.
52:11
OK, so examples.
52:30
Suppose these are independent Poisson with parameter lambda. Then the joint density we think of as being a density given the parameter lambda
52:46
can be expressed that way where x1 through xn are non-negative integers.
53:15
These are the possible values. This would be the probability that x1 is little x1,
53:23
x2 is little x2, xn is little xn given lambda. So there may be some argument that tells us that these will be Poisson random variables.
53:40
For example, they might satisfy the conditions of what we call the law of rare events that led us to model the number of alpha particles emitted by a substance in an hour to be a Poisson random variable or the number of successful seeds from a maple tree or an elm tree.
54:09
That would be a Poisson random variable. So there are theoretical arguments that would lead us to think that the thing we're sampling is an example of a Poisson random variable
54:22
and that would mean we'd want to know what's the lambda in the underlying distribution. So we have to come up with some way of estimating lambda from observed data. Here would be the joint density evaluated at x1 through xn.
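Written out, that joint pmf given lambda is:

```latex
f(x_1, \dots, x_n \mid \lambda)
  = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}
  = \frac{\lambda^{\sum_i x_i}\, e^{-n\lambda}}{\prod_i x_i!},
\qquad x_i \in \{0, 1, 2, \dots\} .
```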
55:18
We might know that we're observing instances of normal random variables.
55:23
Here the joint density of these random variables could be expressed this way. If we're given mu and sigma squared, it's this.
56:03
Or we could express it as a product.
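That product form of the normal joint density is:

```latex
f(x_1, \dots, x_n \mid \mu, \sigma^2)
  = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right).
```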
56:25
So these two examples had independent random variables. If I go back here, I could also express this one as a product like that.
56:46
But we don't always look at independent ones. Another example might be x1 through x sub l, or m, let's say m,
57:03
are multinomial with parameters p1 through pm and n.
57:31
In this case the joint density would depend on m plus 1 parameters, the p's, the probabilities,
57:53
and the number of objects that are placed in these things.
58:22
And in this case there are a couple of restrictions.
58:42
This is equal to that when x1 plus x2 plus dot dot dot plus xm is equal to n. The number of objects distributed in these m cells must be n.
59:04
And of course, a condition of this being multinomial,
59:22
the sum of these probabilities must be 1. So those are examples of these possible joint densities where you need to figure out what these parameters are from sampling.
59:41
So, we'll define...
01:00:14
L of the parameters to be log of the density. So these observations are fixed in this definition.
01:00:39
So you're given some data set of values, observed values, little x1 through little xn.
01:00:46
And somebody asks you, well, what are the thetas? And this expression F here is more or less, I mean, if you're given the thetas, it's a probability that this happens.
01:01:02
But this happened. Now you might have come up with some estimate of the thetas. Well, you take the point of view that likely things happen more often than unlikely things.
01:01:20
So this thing you observed is probably a likely thing. That means it has high probability under this set of thetas. In other words, you should pick the thetas that make this observation the most probable.
01:01:40
That means you maximize this expression in the thetas, and that'll give you the maximum likelihood estimator.
01:02:37
Does everybody know what argmax means? It's the argument that maximizes the expression.
01:02:42
So it's the vector where the function L achieves its maximum.
01:03:17
That's called the MLE.
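In symbols, the definition being used:

```latex
\hat{\theta}
  = \operatorname*{arg\,max}_{\theta}\; L(\theta_1, \dots, \theta_k \mid x_1, \dots, x_n),
\qquad
L(\theta \mid x_1, \dots, x_n) = \log f(x_1, \dots, x_n \mid \theta).
```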
01:03:26
So let's compute a couple examples.
01:03:48
Well, let's go to the ones we mentioned earlier. Let's try Poisson first. What are the thetas here? Well, there's only one theta, that's lambda.
01:04:03
So our function L, the log likely, it's called the log, did I say it before? Log likelihood function.
01:04:35
So you're given observed values, and for us our only parameter is lambda.
01:04:58
So we take the log of this function at the observed values.
01:05:05
So I take the log of that thing, and that gives me n little x bar log lambda. Where does that come from?
01:05:22
This is n x bar, is it not? X bar is this sum divided by n, so this is n times x bar. Then I have minus the log of the denominator there, and don't worry about that.
01:05:41
I'll write it down, but don't worry about it. You'll see in a minute. It's a constant. Yeah, we're maximizing what variable? Lambda. So what happens to this when we, how do you maximize? You differentiate, right?
01:06:01
When we differentiate, it's gone, so it looks terrible but causes no trouble whatsoever. Then we take the log of this, and that would be minus n lambda.
01:06:23
Is this good?
01:06:45
So we differentiate this, and what do we get?
01:07:01
With respect to lambda, that is. Set that equal to zero, and the value of lambda we get is lambda must be equal to x bar.
01:07:27
So if x1 through xn are the random variables, lambda hat equal to x bar is the MLE of the true value of lambda.
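A quick numerical check of this, assuming NumPy and SciPy; the sample and the true value lambda = 4 are made up, and the bounded optimizer is just one convenient way to maximize the log likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=500)  # pretend data, true lambda = 4

def neg_log_likelihood(lam):
    # L(lambda) = (sum x_i) log(lambda) - n*lambda - sum log(x_i!);
    # the last term does not involve lambda, so it is dropped.
    return -(np.sum(x) * np.log(lam) - len(x) * lam)

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print(result.x, x.mean())  # the numerical maximizer agrees with the sample average
```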
01:08:00
So what's the distribution of lambda hat?
01:08:06
We're adding up n independent random variables, and then dividing by n. What's the mean of the resulting random variable? What's the mean of the individual x's?
01:08:22
What's the expected value of a Poisson random variable? Lambda. So the expected value of this is lambda. What does that say? That means this is an unbiased estimator of lambda. What's the variance of x bar? Well, it'd be lambda or lambda squared?
01:08:44
Lambda. The mean and the variance for Poisson are both the parameter lambda. But then remember there's this 1 over n out in front, so you get lambda over n. So this is mean lambda, variance lambda over n.
01:09:04
If we do what? x bar minus lambda over lambda over root n... the square root of lambda over n, sorry, is approximately normal.
01:09:26
Then mean 0, variance 1, which tells us about the sampling distribution of lambda hat. So let me put this as lambda hat here.
01:09:42
So the maximum likelihood estimator of lambda for independent Poissons is lambda hat equal to the sample average. Let's do the normal.
01:10:24
Here our function depends on two parameters. The log likelihood function depends on two parameters given the data. Let's use that form there because when we take the log over product, what does that become?
01:10:42
Sum of the logs, right?
01:11:30
First we get a minus log sigma, the log 1 over sigma, but that's minus log sigma. Then we get a sum i equal 1 to n minus a half log 2 pi.
01:11:51
Then we get a minus sum i equal 1 to n xi minus mu squared over 2 sigma squared.
01:12:04
That's the log likelihood function with observed data x sub i. So let's find derivatives.
01:12:34
When we're differentiating with respect to mu, this goes away, this goes away, and here we get what?
01:12:44
2 comes down, I get a 1 over sigma squared out in front, and then xi minus mu. Is that right?
01:13:07
When is this 0?
01:13:21
When I do this sum, I get the sum of the xi's minus n mu, right? I get this term n times, so that would be 0 if mu is x bar. Therefore, MLE for mu is mu hat equal to capital X bar.
01:13:55
The sampling distribution for mu hat is normal, mean mu, and variance sigma squared over n.
01:14:07
Let's find the MLE for sigma squared, differentiate with respect to sigma squared.
01:14:21
Now here we have what? This depends on sigma squared. Maybe I could rewrite this as sigma squared here and a half there since I've got to differentiate with respect to sigma squared. Or I could differentiate with respect to sigma. Which do you prefer? Sigma squared. Sigma squared, okay.
01:14:40
When I sum up this, I get actually minus n over 2 log sigma squared, right? Because they're n terms. So this is really minus n over 2 log sigma squared. If I differentiate log sigma squared with respect to sigma squared, I get 1 over sigma squared. So the first part would be minus n over 2 sigma squared.
01:15:03
The derivative of this is gone. Then I have to differentiate this with respect to sigma squared. That's just a constant over sigma squared. The derivative of that with respect to sigma squared is minus that constant over sigma to the fourth, right?
01:15:31
The derivative with respect to X of 1 over X is minus 1 over X squared. So here I get plus summation i equal 1 to n Xi minus mu squared over, well there's a 2 there.
01:15:53
I keep it. Sigma to the fourth, okay?
01:16:00
Now I set this equal to 0 and solving for sigma squared, we get that.
01:16:34
But mu is supposed to be replaced by mu hat so our maximum likelihood estimator for sigma squared is 1 over n
01:16:42
sum i equal 1 to n little xi minus little x bar squared. And so in terms of the random variables, given, putting back in the randomness, we get that.
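So, in symbols, the two maximum likelihood estimators for the normal sample are:

```latex
\hat{\mu} = \bar{X},
\qquad
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 .
```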
01:17:16
Let's do the non-independent case of the multinomial.
01:18:15
Okay, and now we're going to maximize with respect to the values pi given the observed values little x1 through little xm.
01:18:24
And so will this or this cause us any trouble when we're maximizing in the values p sub i? No. These look like terrible expressions but differentiation with respect to a variable pi
01:18:40
makes them go away because they're constant as far as the p's are concerned. So we don't have to worry about those. So we're going to maximize this function but remember there's a constraint and I erased it. We're going to maximize in the values p but what do the p's have to do? They have to add up to 1.
01:19:02
The values p sub i have to add up to 1. So we're going to do maximization with a constraint and that involves Lagrange multipliers. So what's the constraining function?
01:19:28
This must be 1. G must be 1. So we're going to maximize L given G is 1. So the maximum occurs where the gradients line up.
01:19:53
So two vectors line up if one is a multiple of the other.
01:20:03
Now what's the gradient of G? The i-th coordinate of the gradient of G is dGdpi. dGdpi is just 1. So the gradient of G is the vector composed of all 1's. If I multiply that by lambda it's the vector composed of lambda everywhere.
01:20:24
Now what about the gradient of L? No p's here. No p's there. Here the derivative of this with respect to p sub j would be xj times the derivative of this with respect to p sub j which would be 1 over p sub j.
01:20:45
So the j-th component of the gradient would be xj over pj.
01:21:02
So this must be lambda. This must be lambda. This must be lambda. So they all must be equal to the same thing so they're all equal to each other.
01:21:49
So xi is equal to some value lambda times pi. What is lambda? How can we find lambda?
01:22:09
Well what happens if we add them all up?
01:22:20
What's the sum of the p sub i's? 1. So this is lambda. What's the sum of the xi's? Well xi is the number of things in cell i and how many things are distributed in the multinomial things?
01:22:50
So the sum of the xi's is little n. So the sum of the xi's is...
01:23:03
So what's lambda? Lambda is n. So what's our estimator of p? So our maximum occurs at...
01:23:41
In other words, what are the estimates for the p's? Just the proportion of things in the boxes. Observe the proportion of things in the boxes. That's the maximum likelihood estimator for the p's. So if we're given a random sample
01:24:01
or we treat x1 through xm as random variables with a multinomial distribution,
01:24:29
this we consider to be known. But these are unknown.
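Collecting the Lagrange multiplier computation:

```latex
\frac{\partial L}{\partial p_j} = \frac{x_j}{p_j} = \lambda
\;\Longrightarrow\; x_j = \lambda p_j
\;\Longrightarrow\; n = \sum_{j=1}^{m} x_j = \lambda \sum_{j=1}^{m} p_j = \lambda,
\qquad
\hat{p}_j = \frac{x_j}{n} .
```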
01:24:54
So what's the sampling distribution of pi hat?
01:25:00
Well, what's the distribution of xi? x1 through xm are multinomial. What's the distribution of x1?
01:25:24
xi is binomial with parameters n and pi. Think of it this way. You only care about whether something is in box i or not. So you have xi and then you have the other ones, I guess.
01:25:48
So what's the probability that xi is number j?
01:26:20
I'll leave this computation to you. This is n choose j, pi to the j, and then 1 minus pi to the n minus j. But you can think of it this way.
01:26:40
You want to put j balls in box i and the n minus j others elsewhere. Or you're performing an experiment where getting a ball in box i is success, and not getting it in there is failure.
01:27:02
What's the probability of success? It's pi. What's the probability of having j successes? That would be pi to the j. But then the others have to be failures; that would be 1 minus pi to the n minus j. And then you have to choose the j trials from among the n that you're going to succeed with. That's n choose j.
01:27:23
So that's binomial. So the sampling distribution of the MLE is binomial divided by n. So that helps you decide what the standard error is. You need to know the distribution of the estimators.
01:27:46
Here's your estimator xi over n. That's binomial with parameter n and unknown value pi. So divide that by n and that's your thing. So what's the variance of a binomial?
01:28:03
So what's the standard error of this estimate? Variance of xi is pi times 1 minus pi. So the standard error of pi hat is,
01:28:21
well, what's the variance of this guy? It would be this over n squared. So the standard error would be the square root of that, pi, 1 minus pi over n. Now what's the problem of using that as the standard error?
01:28:43
Well, you don't know what it is. There's usually difficulty using something you don't know. So what do we do? Replace the pi by the pi hat.
01:29:04
Well, oh, uh, yeah, I'm sorry. Yeah, yeah, yeah, yeah. n, n, right. I was doing a Bernoulli, yes. Thank you. Okay, so now we divide by n. So we divide this by n squared.
01:29:22
That would give us this over n and I take square root. Thank you. That's great. pi, 1 minus pi over n. Yeah, the variance of a binomial is n times p times 1 minus p. But now there's over n here.
01:29:41
So when we take variance of this, what happens with 1 over n? It becomes 1 over n squared. So I can multiply this by 1 over n squared to get the variance of xi over n. But now we don't know p, so we use p hat in there.
01:30:00
For a standard error. And that's something we can compute: if you have the data, you can compute this. So this gives you an idea of how accurate this estimate is for the true value p sub i.
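A quick numerical sketch of this estimate and its standard error, assuming NumPy; the counts and the true probabilities are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
p_true = np.array([0.2, 0.3, 0.5])          # made-up cell probabilities
counts = rng.multinomial(n, p_true)         # observed x_1, ..., x_m

p_hat = counts / n                          # MLE: observed proportions in the boxes
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of each p_hat_i

print(p_hat)
print(se_hat)
```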
01:30:27
Okay, so let me remark that if the density functions, like these, are sufficiently smooth, then maximum likelihood estimators are consistent. Remember what consistent meant?
01:30:41
It means they converge to the things they're estimating. As the number of samples gets larger, they'll converge to the things they're estimating.
01:31:02
And I think that's where I'll stop today. You get 7 minutes that you didn't think you'd have. And time is money, so I've just given you some money.