Say I'm trying to estimate the population mean of a Bernoulli variable, say a baseball batter's "hits." Given n samples, I want to know the standard error of my sample mean.
I *think* that it's simply SE_x = s/sqrt(n), where s is the sample standard deviation.
My rationale is that, according to the CLT, my sample mean will converge in distribution to a normal distribution whose variance is sigma^2/n.
So, for my estimator on hits, a player who has 3 hits in 10 plate appearances (PA) will have a hit average of 0.300 +/- 0.145 (using the fact that the variance of a Bernoulli variable is p(1-p), so the standard error is sqrt(p(1-p)/n)). A batter with 30 hits in 100 PA will have a hit average of 0.300 +/- 0.046. A batter with 300 hits in 1000 PA will have a hit average of 0.300 +/- 0.014.
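To make the arithmetic checkable, here is a minimal Python sketch of the calculation as I understand it. It assumes the standard error is sqrt(p'(1-p')/n) with p' = hits/PA; the function name `bernoulli_standard_error` is just something I made up for illustration.

```python
import math


def bernoulli_standard_error(hits, trials):
    """Standard error of a sample proportion, assuming SE = sqrt(p'(1-p')/n)."""
    p_hat = hits / trials  # sample hit rate p'
    return math.sqrt(p_hat * (1.0 - p_hat) / trials)


# The three batters from the example above: (hits, plate appearances)
for hits, trials in [(3, 10), (30, 100), (300, 1000)]:
    se = bernoulli_standard_error(hits, trials)
    print(f"{hits}/{trials}: {hits / trials:.3f} +/- {se:.3f}")
```

Each tenfold increase in PA shrinks the standard error by a factor of sqrt(10), which is what the 1/sqrt(n) term implies.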
What's driving my question here is the idea of regression towards the mean. According to "The Book," one way to regress towards the mean is to combine a player's average (of any stat) with the league mean of similar players (for the same stat, of course), weighting each by the inverse of its variance.
What confuses me a bit is that, in the case of a Bernoulli variable, the league average is going to have a variance of p(1-p). However, a player is going to have a variance of p'(1-p')/n (where p' is an estimate of the player's true rate, p). This means that for as few as 10 at bats, I'll be weighting a player's estimator approximately 10x as much as I will be weighting the league estimator, which seems to be an absurd ratio for such a small number of at bats.
For example, with the 3 hits in 10 PA above, the batter has a hit rate of 0.300 +/- 0.145 (variance = 0.021). Say the league hit rate is 0.250, which would put the variance of the league's hit rate at 0.1875 (p*q = 0.25*0.75). So, even for 10 PA, according to "The Book's" inverse-variance regression, I'll weight the player's 10 PA at 0.9 and the league's hit rate at 0.1. How can this make any sense? The authors of "The Book" use the inverse variance to regress a multinomial. Is there some reason it doesn't work for a binomial?
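Here is a small Python sketch of the weighting exactly as I've described it, so the numbers can be reproduced. It assumes the league variance is p*q = 0.1875 and the player variance is p'(1-p')/n, which may well be where my reasoning goes wrong. The helper name `inverse_variance_weights` is mine, not something from "The Book."

```python
def inverse_variance_weights(var_player, var_league):
    """Normalized inverse-variance weights for the player and league estimates."""
    w_player = 1.0 / var_player
    w_league = 1.0 / var_league
    total = w_player + w_league
    return w_player / total, w_league / total


# Numbers from the example: 3 hits in 10 PA, league hit rate 0.250
p_hat, n = 0.300, 10
league_rate = 0.250

var_player = p_hat * (1.0 - p_hat) / n           # p'(1-p')/n = 0.021
var_league = league_rate * (1.0 - league_rate)   # p*q = 0.1875 (my assumption)

w_player, w_league = inverse_variance_weights(var_player, var_league)
regressed_rate = w_player * p_hat + w_league * league_rate

print(f"player weight: {w_player:.3f}, league weight: {w_league:.3f}")
print(f"regressed hit rate: {regressed_rate:.3f}")
```

With these inputs the player's 10 PA get roughly nine times the weight of the league rate, which is the ratio that looks wrong to me.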
I guess this became more of a question about regression toward the mean, and my title doesn't reflect that... sorry.