Calculating Confidence Interval

BuddyBear · 08-03-08, 10:09 PM

Originally posted by radbet

Does anyone have a program or method of calculating confidence intervals with small sample sizes? I am wary of putting down $$$ without confirming the significance of these smaller nodes.

RadBet

Yeah, that is a pretty small sample size. With a sample that small, the normality assumptions behind the standard interval are likely violated, so you'll probably have to use some sort of nonparametric method to figure this out. I am not very good with nonparametric statistics, but I am pretty sure there is a nonparametric equivalent for confidence intervals... I think it is something along the lines of bootstrapping.
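For anyone who wants to see what bootstrapping would look like here, below is a minimal sketch of a percentile bootstrap for a win rate. The counts (18 correct out of 20) are illustrative numbers only, and the function name `bootstrap_ci` and its parameters are made up for this example. It is a sketch under those assumptions, not a recommendation.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(outcomes)
    # Resample the observed outcomes with replacement and record each win rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = rates[int((alpha / 2) * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Illustrative data only: 18 correct picks and 2 misses.
outcomes = [1] * 18 + [0] * 2
lo, hi = bootstrap_ci(outcomes)
print(f"bootstrap 95% interval: {lo:.2f} to {hi:.2f}")
```

One caveat: with a sample this small the resampled win rates can only take a handful of values, and if every observed outcome is a win the interval collapses to exactly 100%, so the bootstrap may understate the uncertainty in precisely the cases being asked about.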

I am sure Ganch knows.....

Ganchrow · 08-04-08, 11:19 AM

Originally posted by radbet

For NCAA football, I have a classification tree algorithm using the past 10 years of team data and my own ranking system. I am using the following equation to set up confidence intervals for the measured accuracy of specific leaf nodes in my system:

p = (2 * N * (measured accuracy) + Z^2 +/- Z * SQRT(Z^2 + 4 * N * (measured accuracy) - 4 * N * (measured accuracy)^2)) / (2 * (N + Z^2))

Z=1.96 for 95% confidence.
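
The interval described above appears to be the Wilson score interval, whose standard form is (2*N*acc + Z^2 +/- Z*SQRT(Z^2 + 4*N*acc - 4*N*acc^2)) / (2*(N + Z^2)). A minimal Python sketch of that standard form follows. The function name `wilson_interval` and the sample counts are assumptions for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives ~95%)."""
    acc = successes / n                      # measured accuracy
    centre = 2 * n * acc + z ** 2
    half = z * math.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
    denom = 2 * (n + z ** 2)
    return (centre - half) / denom, (centre + half) / denom

# Illustrative counts only: 18 correct out of 20.
lo, hi = wilson_interval(18, 20)
print(f"{lo:.3f} to {hi:.3f}")   # roughly 0.699 to 0.972
```

Unlike the simple acc +/- Z*SQRT(acc*(1-acc)/N) interval, this form never runs past 0 or 1, and it still gives a non-trivial lower bound when the measured accuracy is 100%.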

My issue is when the N (# of events) gets small. There are several situations where the accuracy is high (90-100%) but the N is between 10 and 20. I know that standard confidence interval equations are accurate when the N is larger (>30 minimum).

Does anyone have a program or method of calculating confidence intervals with small sample sizes? I am wary of putting down $$$ without confirming the significance of these smaller nodes.

RadBet

Why don't you give a specific example, including an explanation of what exactly you mean by "measured accuracy"?

radbet · 08-04-08, 05:38 PM

example

What I mean by measured accuracy is the percentage of correctly predicted outcomes of a node within my decision tree output.

Here is an example of a node I have:

Over the past 4 years (1860 games), a situation has occurred 20 times, with 18 of them resulting in a positive outcome (i.e. a correctly predicted victory). This gives a measured accuracy of 90%.

From my understanding (please correct me if I am wrong), if N>30 in a classification tree node where the class variable has 2 values (i.e. win/loss), then the probability of the class variable outcome can be safely assumed to follow a normal distribution (if the known distribution of the variables is also normal). But if N<30, then normality cannot be assumed and predicting confidence intervals is more complicated.
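
One way to sidestep the normality question entirely is an exact binomial interval (often called Clopper-Pearson), which works directly from the binomial distribution of the win count. This technique is not mentioned in the thread; the sketch below is just one standard option, applied to the 18-of-20 node described above. The function names are made up for the example, and it uses only the standard library.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_decreasing(f, target, iters=60):
    """Bisection for f(p) = target on [0, 1], where f decreases in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid   # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided interval for a binomial proportion."""
    # Lower bound: the p at which seeing k or more wins has probability alpha/2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which seeing k or fewer wins has probability alpha/2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(18, 20)
print(f"{lo:.3f} to {hi:.3f}")   # roughly 0.683 to 0.988
```

The exact interval is conservative (it tends to be wider than needed), so the lower bound here is a cautious reading of how good the node might really be.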

By the way, the tree is built with 15 continuous variables which are based on team rating, offensive/defensive scoring, and SOS. I have evaluated the variables extensively and they are all normally distributed.

BuddyBear · 08-04-08, 10:05 PM

Well, if your main dependent variable of interest is binary (i.e. win/loss) then you could do what is called a logistic regression. Logistic regression is similar to standard OLS regression, except that the DV is binary; your continuous variables go into the model as predictors. This, to me, seems more rigorous than confidence intervals b/c logistic regression would allow you to control for all 15 of those variables in one model.

It's really not all that clear to me what you are trying to do, except that it seems like you are trying to predict win/loss based on a certain set of variables you've collected data on. If that is the case, regression would be able to do that for you.
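
To make the logistic regression suggestion concrete, here is a minimal sketch that fits one by plain gradient ascent on the log-likelihood, using only the standard library. Everything about the data is synthetic and made up for illustration: `rating_diff` is a hypothetical predictor, the true coefficient of 1.5 is arbitrary, and none of it comes from the poster's dataset. In practice a statistics package would normally be used instead.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def predict_proba(w, x):
    """Win probability for predictors x; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient ascent on the average log-likelihood."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)   # observed minus predicted
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Synthetic games (assumption): win chance depends on rating_diff only;
# the second predictor is pure noise.
random.seed(0)
X, y = [], []
for _ in range(500):
    rating_diff = random.gauss(0, 1)
    noise = random.gauss(0, 1)
    X.append([rating_diff, noise])
    y.append(1 if random.random() < sigmoid(1.5 * rating_diff) else 0)

w = fit_logistic(X, y)
print("intercept and coefficients:", [round(v, 2) for v in w])
print("P(win | rating_diff=+1):", round(predict_proba(w, [1.0, 0.0]), 2))
```

The fitted model gives a win probability for any combination of predictor values, rather than one accuracy figure per leaf node, which is the sense in which it controls for all the variables at once.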