Here is a real-life example. For the purposes of this post, I consider 5% to be the minimum significance level I would want (95% confidence).
The following represents 3171 independent games from a single sport for a single bet type. The first column encodes 4 different variables (0000 to 1113), one per digit: the 1st digit (with a value of either 0 or 1) represents one variable, the 2nd digit a different variable, and the same for the 3rd. A value of 0 is assigned if the variable goes one way and 1 if it goes the other; for example, a 0 in a given digit might represent a home team and a 1 an away team. The 4th digit is unique in that its variable has been divided into quartiles (or as near to quartiles as possible), using a rounding rule that was decided without looking at the results. That gives a total of 2x2x2x4 = 32 combinations of these 4 variables. The second column is the number of events for that combination. The third column is the ROI, assuming each bet is sized to win the same 1 unit regardless of odds. The fourth and fifth columns are the total amounts won and bet, respectively, for each combination under that same staking scheme.
Vars # ROI Win Bet
0000 275 +2.92% 340.51 330.84
0001 260 +5.85% 356.16 327.04
0002 271 -6.00% 327.26 348.14
0003 286 -8.19% 349.86 381.05
1000 346 -7.30% 609.57 657.57
1001 359 +4.04% 761.47 731.87
1002 321 -6.57% 636.65 681.44
1003 347 -4.14% 787.85 821.88
0010 110 -7.31% 87.78 94.70
0011 122 -0.21% 108.73 108.95
0012 127 -17.27% 95.15 115.01
0013 118 -22.09% 85.49 109.73
1010 57 +13.19% 38.73 34.21
1011 58 -22.94% 27.73 35.99
1012 54 -6.63% 27.58 29.18
1013 60 -16.84% 8.28 12.99
0100 118 +16.01% 162.64 140.20
0101 127 +10.38% 170.64 154.60
0102 122 -2.82% 146.33 150.58
0103 110 +2.42% 143.67 140.27
1100 60 +7.57% 113.09 105.13
1101 54 +2.54% 97.60 95.18
1102 58 +11.50% 113.42 101.72
1103 57 -9.06% 94.89 104.34
0110 286 +5.60% 247.28 234.16
0111 271 +4.19% 231.35 231.64
0112 260 -13.09% 197.46 227.20
0113 275 -7.57% 232.76 251.82
1110 347 +5.29% 178.60 169.64
1111 321 +6.66% 184.25 172.74
1112 359 -13.16% 174.52 200.97
1113 346 +6.01% 217.25 204.93
I have run 1,000,000 Monte Carlo trials (because these are moneyline odds, not straight 50/50 bets) for each of the 15 variable combinations that have a positive ROI. From this I determined that only one, the "0100" combination, is significant at the 5% level (95% confident that the actual results were not random). However, perhaps I was overzealous in breaking the data up into such small groups, and that is the point of this post.
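In case it helps, here is roughly what each of those Monte Carlo tests looks like in code. This is a simplified sketch, not my exact procedure: the per-game decimal odds are placeholder inputs (the table above only shows aggregate Win/Bet totals), and the null hypothesis is taken to be that each bet wins with its break-even probability, so the expected profit is zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_pvalue(dec_odds, observed_profit, n_trials=100_000):
    """One-sided Monte Carlo p-value for one variable combination.

    dec_odds: decimal odds for each game in the group (placeholder input;
    the real table only shows aggregate totals, not per-game odds).
    Null hypothesis: each bet wins with its break-even probability
    1/odds, so the expected profit is zero. Betting "to win 1 unit"
    means the stake on a game with decimal odds d is 1/(d - 1).
    """
    dec_odds = np.asarray(dec_odds, dtype=float)
    p_win = 1.0 / dec_odds            # break-even win probability under the null
    stakes = 1.0 / (dec_odds - 1.0)   # stake sized to win exactly 1 unit
    # Simulate n_trials random "seasons": each game independently wins
    # with probability p_win; profit is +1 on a win and -stake on a loss.
    wins = rng.random((n_trials, len(dec_odds))) < p_win
    profits = np.where(wins, 1.0, -stakes).sum(axis=1)
    # p-value: the fraction of random seasons at least as profitable
    # as the observed result.
    return float((profits >= observed_profit).mean())
```

As a sanity check, a group of even-money bets (decimal odds 2.0) with exactly zero observed profit comes out around p = 0.5, as it should.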
Using the above results, I can tell you that if I combine the combinations into the following combined group:
x1x0 together with x1x1 (where x denotes that the variable can be either 0 or 1), I get a positive ROI of 1385.45 / 1303.29 - 1 = +6.30% on a 1584-event sample, and this is significant at the 5% level according to a 10,000,000-trial Monte Carlo run. However, it was only because I looked at the original ROI results in the first place that I would even have known to combine them this way (actually, I had an idea beforehand, but could not prove it).
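To double-check the arithmetic, the combined group can be pulled straight out of the table (the rows below are copied verbatim from above):

```python
# (vars, events, win, bet) rows copied verbatim from the table above.
rows = [
    ("0000", 275, 340.51, 330.84), ("0001", 260, 356.16, 327.04),
    ("0002", 271, 327.26, 348.14), ("0003", 286, 349.86, 381.05),
    ("1000", 346, 609.57, 657.57), ("1001", 359, 761.47, 731.87),
    ("1002", 321, 636.65, 681.44), ("1003", 347, 787.85, 821.88),
    ("0010", 110, 87.78, 94.70),   ("0011", 122, 108.73, 108.95),
    ("0012", 127, 95.15, 115.01),  ("0013", 118, 85.49, 109.73),
    ("1010", 57, 38.73, 34.21),    ("1011", 58, 27.73, 35.99),
    ("1012", 54, 27.58, 29.18),    ("1013", 60, 8.28, 12.99),
    ("0100", 118, 162.64, 140.20), ("0101", 127, 170.64, 154.60),
    ("0102", 122, 146.33, 150.58), ("0103", 110, 143.67, 140.27),
    ("1100", 60, 113.09, 105.13),  ("1101", 54, 97.60, 95.18),
    ("1102", 58, 113.42, 101.72),  ("1103", 57, 94.89, 104.34),
    ("0110", 286, 247.28, 234.16), ("0111", 271, 231.35, 231.64),
    ("0112", 260, 197.46, 227.20), ("0113", 275, 232.76, 251.82),
    ("1110", 347, 178.60, 169.64), ("1111", 321, 184.25, 172.74),
    ("1112", 359, 174.52, 200.97), ("1113", 346, 217.25, 204.93),
]

# x1x0 together with x1x1: 2nd digit is 1, 4th digit is 0 or 1.
group = [r for r in rows if r[0][1] == "1" and r[0][3] in "01"]
events = sum(r[1] for r in group)
win = sum(r[2] for r in group)
bet = sum(r[3] for r in group)
roi = win / bet - 1
print(events, round(win, 2), round(bet, 2), f"{roi:+.2%}")
# -> 1584 1385.45 1303.29 +6.30%
```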
And this brings me to the real problem I am having now. How do I (and can I even) correctly combine the combinations into new combinations without making a Type I error? On the one hand, I am now biased because I have seen the ROI results for each combination. On the other hand, just because I have seen them does not automatically make them insignificant. If I had a sample of 10,000 independent trials that were a 50/50 proposition but had a guaranteed built-in ROI of +5% for me, and I broke the sample down using 10 different variables (that, unknown to me, didn't matter anyway), I could get a sample size so small for each combination that a significance test would say the positive ROI was not statistically significant. I have read papers on sports betting in scientific journals that say things like "the highest 3 combinations on the list are jointly significant". Well, how do they justify combining just the top 3 without creating a Type I error? The only way they knew to combine those 3 combinations in the first place was by looking at the results.
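The dilution effect in that 10,000-trial example is easy to see with a back-of-the-envelope normal approximation (assuming even-money bets of 1 unit each, so each bet has mean 0 and variance roughly 1 under the fair null):

```python
import math

def z_score(n, edge=0.05):
    """Approximate z-score for total profit after n even-money 1-unit
    bets with a built-in `edge` ROI, tested against the fair null
    (mean 0, variance ~1 per bet): z = edge * n / sqrt(n)."""
    return edge * n / math.sqrt(n)

# The whole 10,000-bet sample: z = 5.0, overwhelmingly significant.
# Split by 10 irrelevant binary variables into 2**10 = 1024 groups of
# roughly 10 bets each: z is about 0.16 per group, nowhere near the
# 1.645 needed for one-sided significance at the 5% level.
```

So the edge is real and provable on the full sample, yet invisible in every one of the 1024 slices.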
After reading about a dozen of these papers, I have come up with 3 different hypotheses as to why they (the authors) can do this and still not create a Type I error:
1) You can combine combinations linearly, starting from the top or bottom of the variable list. For example, if I were looking at NBA totals, I could start from the highest total on record (about 270 or so) and, even though there are not enough events to declare the ROI statistically significant for the individual total of 270, I could then add the results for 269, then 268, then 267, and so on until I reached a point where I had a desirable ROI on a sample large enough to be statistically significant. Even if this first occurred at 230, I could keep going down the list to 229, etc., to see if I got a higher statistically significant ROI. I could also start from the lowest total (I am guessing it is around 150) and work my way up. What I could not do, though, is start somewhere other than either end and select a cluster that had a positive ROI and a large enough sample size and declare it statistically significant. For example, I could not notice that the range between 195 and 215 was the most profitable and declare that as my unbiased group to test for significance.
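If hypothesis #1 is right, the allowed procedure is nothing more than a cumulative scan from one end of the ordered list. Here is that idea as code, with made-up (total, win, bet) rows; the significance test on each cumulative group is omitted:

```python
def cumulative_groups(rows):
    """Hypothesis #1 as code: rows are (total, win, bet) tuples sorted
    from the highest total downward (or the lowest upward). Yields every
    group hypothesis #1 permits -- each formed by extending the window
    one total at a time from the chosen end, never by picking a cluster
    out of the middle of the list.
    """
    win = bet = 0.0
    for total, w, b in rows:
        win += w
        bet += b
        yield total, win / bet - 1  # cutoff total and cumulative ROI so far

# Made-up illustration; each cumulative group would then get its own
# significance test.
for cutoff, roi in cumulative_groups([(270, 3.1, 2.9), (269, 5.0, 4.8),
                                      (268, 9.7, 9.9)]):
    print(cutoff, f"{roi:+.2%}")
```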
OR
2) You can combine ANY variable combination from anywhere in the list with another, so long as it shares a linear relationship with all of the other combinations it is being combined with. Here is a completely made-up example for a fictitious sport. Suppose we have the following 4 variables:
Home or Away
Favourite or Dog
Line Moved With or Against Team
Won or Lost Their Previous Game
Let's say that, of the 16 possible variable combinations (2x2x2x2 = 16), the following 5 had a positive ROI, but that none had a sample size big enough to declare the result significant by itself.
a) Line Moved Against Home Dog That Won Previous Game
b) Line Moved With Home Dog That Won Previous Game
c) Line Moved Against Home Fav That Won Previous Game
d) Line Moved With Away Dog That Won Previous Game
e) Line Moved Against Away Dog That Lost Previous Game
I now notice that "Won Previous Game" connects 4 of the combinations, "Home" connects 3 of them, "Dog" connects 4 of them, "Moved Against" connects 3 of them, and so on. I am now free to reduce the variables from 4 down to 3, 2, or 1 in order to get the highest positive ROI I can that is also statistically significant. I can make new groups like:
a) Dogs That Won Previous Game
b) Home and Won Previous Game
c) Line Moved Against and Won Previous Game
etc.
So long as I am only removing variables to create a new group, I can do this without creating a Type I error. However, I cannot say things like:
a) (Dogs That Won Previous Game) or (Home That Won Previous Game)
b) (Line Moved Against) and (Dogs That Won Previous Game)
c) (Line Moved Against and Won Previous Game) but not (Dogs That Won Previous Game)
Aside: if this method #2 is acceptable, what do I do with combinations that are already part of another group? I am puzzled by this as well.
OR
3) They cannot do what they are doing and are creating a Type I error by doing so.
The thing that worries me about hypotheses #1 and #2 is this: if I am going to check all these different combinations of variables that are linearly connected, do I have to adopt some penalty scheme like the Bonferroni method and penalize myself at a rate of 0.05/n, where n is the number of combinations I have looked at, in order to be sure I still have significance at the 5% level? This is the crux of my problem, since I do not see the mathematicians even alluding to it in their sports betting papers.
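For what it's worth, the penalty I am describing is mechanical to apply once there is a p-value for every grouping examined. Here is a sketch of plain Bonferroni alongside Holm's step-down method, which controls the same familywise Type I error rate but is a little less punishing:

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni: reject H0_i only if p_i <= alpha / n, where n counts
    every grouping actually examined, not just the ones worth reporting."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down method: sort the p-values, compare the k-th
    smallest (k = 0, 1, ...) to alpha / (n - k), and stop at the first
    failure; everything after it is not rejected either."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (n - rank):
            reject[i] = True
        else:
            break  # every larger p-value fails as well
    return reject
```

With n = 32 groupings examined, a single test would need p <= 0.05/32 (about 0.0016) under Bonferroni, which is exactly why peeking at the results first and testing afterwards is so costly.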