Variance of Bernoulli Estimator

tweek · 06-15-09 03:47 PM

Say I'm trying to estimate the population mean of a Bernoulli variable, say a baseball batter's "hits." Given n samples, I want to know the standard error of my sample mean.

I *think* that it's simply SE_x = s/sqrt(n), where s is the sample standard deviation.

My rationale is that, according to the CLT, my sample mean will converge in distribution to a normal distribution whose variance is sigma²/n.

So, for my estimator on hits, a player who has 3 hits in 10 PA will have a hit average of 0.300 +/- 0.145 (knowing the variance of a Bernoulli variable is p(1-p)). A batter with 30 hits in 100 PA will have a hit average of 0.300 +/- 0.046. A batter with 300 hits in 1000 PA will have a hit average of 0.300 +/- 0.014.
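As a quick check on those figures, here is a minimal Python sketch of the standard error of a sample proportion, sqrt(p'(1-p')/n). The function name bernoulli_standard_error is made up for illustration, and the sketch assumes each PA is an independent Bernoulli trial with the same hit probability.

```python
import math


def bernoulli_standard_error(hits, trials):
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = hits / trials
    return math.sqrt(p_hat * (1.0 - p_hat) / trials)


# The three batters from the post: same 0.300 average, different sample sizes.
for hits, trials in [(3, 10), (30, 100), (300, 1000)]:
    se = bernoulli_standard_error(hits, trials)
    print(f"{hits}/{trials}: {hits / trials:.3f} +/- {se:.3f}")
```

Rounded to three decimals, the standard errors come out as 0.145, 0.046 and 0.014, so the error shrinks by a factor of sqrt(10) for every tenfold increase in PA.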

What's driving my question here is the idea of regression towards the mean. According to "The Book," one way to regress towards the mean is to weight a player's average (of any stat) with the league mean of similar players (of the same stat, of course), by the inverse of the variance of each stat.

So, what confuses me a bit is that in the case of a Bernoulli variable, the league average is going to have a variance of p(1-p). However, a player is going to have a variance of p'(1-p')/n (where p' is an estimate of the player's true rate, p). So, what this means is that for as few as 10 at bats, I'll be weighting a player's estimator approximately 10x as much as I will be weighting the league estimator, which seems to be an absurd ratio for such a small number of at bats.

For example, in the above example of 3 hits in 10 PA, the batter will have a hit rate of 0.300 +/- 0.145 (variance = 0.021). Say the league hit rate is 0.250, which would put the variance of the league's hit rate at 0.1875 (p*q = 0.25*0.75). So, even for 10 PA, according to "The Book's" inverse variance regression, I'll weight the player's 10 PA at 0.9 and the league's hit rate at 0.1. How can this make any sense? The authors of "The Book" use the inverse variance to regress a multinomial. Is there some reason it doesn't work for a binomial?
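To make the arithmetic behind that 0.9 / 0.1 split explicit, here is a small Python sketch of inverse-variance weighting using exactly the numbers above. The helper name inverse_variance_weights is hypothetical, and the league variance plugged in is the p*q figure being questioned in this post, not a value taken from "The Book."

```python
def inverse_variance_weights(var_player, var_league):
    """Return (player_weight, league_weight), each proportional to 1 / variance."""
    inv_player = 1.0 / var_player
    inv_league = 1.0 / var_league
    total = inv_player + inv_league
    return inv_player / total, inv_league / total


p_hat, n = 0.300, 10          # 3 hits in 10 PA
league_rate = 0.250

var_player = p_hat * (1.0 - p_hat) / n            # 0.021
var_league = league_rate * (1.0 - league_rate)    # 0.1875, the questionable p*q figure

w_player, w_league = inverse_variance_weights(var_player, var_league)
regressed = w_player * p_hat + w_league * league_rate

print(f"player weight {w_player:.3f}, league weight {w_league:.3f}")
print(f"regressed hit rate {regressed:.3f}")
```

The weights land at roughly 0.90 and 0.10, so the suspicious result follows directly from the choice of league variance rather than from a slip in the weighting formula.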

I guess this became more of a question about regression toward the mean, and my title doesn't reflect that... sorry.

wintermute · 06-15-09 06:49 PM

You can't use p*q as the variance for a population.

You need to read the section in the Appendix on Measuring Population Variations very carefully (pages 373 to 376). I've read it myself about a dozen times.

I'd go into more detail but I'd probably get the explanation wrong. The explanation in The Book is right, although the coin tossing example is not very good and there appear to be a few typos scattered here and there. Try checking out http://www.tangotiger.net/wiki/index.php?title=Mailbags where the authors of The Book attempt to answer readers' questions.

Good luck.
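For anyone following along, one common way to get a population variance that is not p*q is to take the observed spread of player rates and subtract the part expected from binomial luck alone. The Python sketch below illustrates that general idea only and may not match the appendix's treatment exactly. The function name and the hit totals are made up for illustration, and it assumes every player's PA are independent Bernoulli trials.

```python
import statistics


def estimate_true_rate_variance(hits, trials):
    """Observed variance of player rates minus the average binomial noise variance."""
    rates = [h / n for h, n in zip(hits, trials)]
    league_rate = sum(hits) / sum(trials)
    observed_var = statistics.pvariance(rates)
    # Sampling noise each player would show if everyone had the league rate.
    noise_var = statistics.mean(
        [league_rate * (1.0 - league_rate) / n for n in trials]
    )
    # Clip at zero: noise alone can exceed the observed spread in small samples.
    return max(observed_var - noise_var, 0.0)


# Made-up league of six players with 500 PA each (illustrative data only).
hits = [120, 150, 135, 160, 110, 145]
trials = [500] * 6

true_var = estimate_true_rate_variance(hits, trials)
player_var = 0.3 * 0.7 / 10   # the 3-for-10 batter from the first post
player_weight = (1.0 / player_var) / (1.0 / player_var + 1.0 / true_var)

print(f"estimated variance of true rates: {true_var:.6f}")
print(f"weight on the 3-for-10 batter: {player_weight:.3f}")
```

With a spread of true rates this narrow, the 10 PA sample gets only a small weight, which is the opposite of what the p*q figure produced.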

tweek · 06-15-09 07:09 PM

PERFECT! Thanks wintermute. After I posted, I realized that yes, p*q is not a good estimate for "variation in true skill among the league." I saw the post on their wiki here but didn't recall seeing how they treated it in the book. So, I emailed Andy (who has been really good at responding to e-mails) about where it was in the Appendix as he referenced, and you pointed me right to it. I will give it a good read tonight and maybe check back in here if I have some questions, since you're up 12 reads on it from me.

tweek · 06-16-09 10:43 AM

Man, I find something new and great in this book every day. Thanks for pointing me in the right direction. The treatment in the book makes a lot more sense than what I have been doing... hopefully it will help.
