Originally Posted by samserif
The way it works is that you start with the standard error: take the standard deviation of your individual results (your wins and losses) and divide it by the square root of the number of samples. That number is the standard deviation of the sampled mean, which in this case is your current winning percentage -- it tells you how much that percentage is likely to wobble around the true value.
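As a rough sketch of that first step (the 55-out-of-100 record below is purely hypothetical, and for a win/loss record scored as 1s and 0s the standard deviation works out to sqrt(p*(1-p))):

```python
import math

def standard_error(wins, total_sessions):
    """Standard error of a winning percentage from a win/loss record."""
    p = wins / total_sessions                 # sampled winning percentage
    sd = math.sqrt(p * (1 - p))               # standard deviation of the 0/1 outcomes
    return sd / math.sqrt(total_sessions)     # standard error of the sampled mean

# hypothetical numbers: 55 winning sessions out of 100
print(standard_error(55, 100))                # about 0.0497, i.e. roughly 5 percentage points
```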
Then you decide what confidence level you want to use; i.e., how certain do you want to be that your method really works and you haven't just been lucky or unlucky? The confidence level can be expressed either in standard deviations or as a percentage -- some people prefer s.d.'s, others percentages (e.g., 95%).
With the standard error and the confidence level, you can compute your margin of error: just multiply the standard error by the confidence level expressed in standard deviations. For example, if you've selected a confidence level of 2 standard deviations (which is a tad greater than 95%), then your margin of error is 2 times your standard error.
Finally, you get the confidence interval, which is:
(your sampled winning percentage) plus/minus (your margin of error)
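Putting those three steps together in one place (same hypothetical 55-out-of-100 record as above, with a 2-standard-error confidence level -- a sketch, not anything from Wong):

```python
import math

wins, sessions = 55, 100                            # hypothetical record
p = wins / sessions                                 # sampled winning percentage
se = math.sqrt(p * (1 - p)) / math.sqrt(sessions)   # standard error

z = 2.0                                             # confidence level, in standard errors (~95%)
margin = z * se                                     # margin of error

low, high = p - margin, p + margin                  # the confidence interval
print(f"{p:.1%} +/- {margin:.1%}  ->  ({low:.1%}, {high:.1%})")
# roughly: 55.0% +/- 9.9%  ->  (45.1%, 64.9%)
```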
Here's how to visualize it. Imagine drawing a normal distribution (bell curve) around your current winning percentage. The "width" of the distribution shrinks as the number of samples grows (more samples = better estimate = narrower distribution). It's possible that the real performance of your algorithm -- in other words, the true winning percentage over time -- is much different from your current winning percentage. That would happen if the true percentage were way out in one of the tails of the distribution. But luckily, we can calculate the probability of that happening and say something like "I know that my winning percentage has a [blah blah blah] chance of being within [blah blah blah] percentage points of my current winning record."
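To put hypothetical numbers behind that last sentence (again using the made-up 55-out-of-100 record, and asking how much of the bell curve sits within 5 percentage points):

```python
import math
from statistics import NormalDist

wins, sessions = 55, 100                             # hypothetical record
p = wins / sessions
se = math.sqrt(p * (1 - p)) / math.sqrt(sessions)    # standard error

within = 0.05                                        # "within 5 percentage points"
z = within / se                                      # that distance, measured in standard errors
prob = NormalDist().cdf(z) - NormalDist().cdf(-z)    # probability mass inside +/- z
print(f"about a {prob:.0%} chance of being within {within:.0%} points of {p:.0%}")
# roughly: about a 69% chance of being within 5% points of 55%
```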
It happens that a 95% confidence level corresponds to about 1.96 standard errors. So when Stanford Wong writes that 2 standard errors aren't enough, he's saying that the traditional 95% confidence level isn't good enough to quit your day job.
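For reference, both of those figures can be pulled straight from the normal distribution (Python's standard library here, purely as an illustration):

```python
from statistics import NormalDist

# two-sided 95% confidence leaves 2.5% in each tail, so look up the 97.5th percentile
z_95 = NormalDist().inv_cdf(0.975)
print(z_95)        # about 1.96 -- the "1.96 standard errors" figure

# conversely, a full 2 standard errors cover a bit more than 95%
coverage = NormalDist().cdf(2) - NormalDist().cdf(-2)
print(coverage)    # about 0.9545
```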