I have started using a new function to double-check hypotheses suggested by data. I have run numerous tests on it using pseudo-random coin flips with different edges, including no edge, and have found that it works as expected (that is, it flags results at the expected probability) in uncovering angles that may be due to randomness. To determine the p-value of an angle that I have found by data mining (i.e. looking at the data for profitable patterns) I use the following equation in Excel:
data mined p-value = (population size - angle population size + 1) * (1 - NORMSDIST((total units won - total units bet) / SQRT(total units bet))) / 2
population size = total number of bets considered
angle population size = total number of bets in the angle
total units won = total number of units won assuming all bets were to win 1 unit
total units bet = total number of units bet assuming all bets were to win 1 unit
An example:
Let's say that I went through a database of 4968 games and saw an angle (home team, between certain odds, and one other criterion) that had an ROI of about 14%. This seems good to me and I would like to know the p-value of this angle. Here are my numbers:
population size = 4968
angle population size = 840
total units won = 990.961
total units bet = 869.222
Now, if I had not mined for this angle but had instead thought of it logically without looking at the data, or had mined it from a completely different set of games, my normal p-value would be calculated as:
p-value = 1 - NORMSDIST((990.961-869.222)/SQRT(869.222)) = .0000182128
But because I looked at ALL of the data, and because there are 4968 - 840 + 1 = 4129 clusters exactly the same size in the sample, I need to multiply my normal p-value by 4129 and, as in the equation above, divide by 2 to compensate for the fact that I looked at the data first. Therefore:
data mined p-value = .0000182128 * 4129 / 2 = .037600423
This would indicate that my data mined angle is not due to randomness (at the 96.24% confidence level).
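For anyone who would rather check the arithmetic outside Excel, here is a minimal Python sketch of the same calculation. The function names are made up for illustration, and `upper_tail` is only assumed to match `1 - NORMSDIST(z)`; it is a sketch of the equation above, not a verified replacement for the spreadsheet.

```python
import math


def upper_tail(z):
    """Upper-tail standard normal probability, i.e. 1 - NORMSDIST(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def data_mined_p_value(population_size, angle_size, units_won, units_bet):
    """Return (normal p-value, data mined p-value) per the equation in the post."""
    z = (units_won - units_bet) / math.sqrt(units_bet)
    normal_p = upper_tail(z)
    # Number of same-size clusters in the population, as described in the post.
    clusters = population_size - angle_size + 1
    # The post's equation multiplies by the cluster count and divides by 2.
    return normal_p, clusters * normal_p / 2


# Numbers from the example above.
normal_p, mined_p = data_mined_p_value(4968, 840, 990.961, 869.222)
print(f"normal p-value     = {normal_p:.10f}")
print(f"data mined p-value = {mined_p:.6f}")
```

The two printed values should land close to the figures in the example, with small differences possible from rounding.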
The reason I am writing this post is that I was under the assumption that I needed to keep track of ALL the combinations I had ever looked at in a population and multiply the p-value I got for any angle by that number of combinations. In my tests using millions of pseudo-random coin flips, this does not seem to be the case. I only need to consider how many other groups of data of the exact same size there could be in the population I am considering. Just because I had considered other angles with different attributes does not mean that I am penalized for looking at those when I am looking at a new angle. I would appreciate comments from those in the know on whether these assumptions are true. If they are true, then data mining just got fun again!