I have recently come to the conclusion that anything less than strong protection against back fitting would leave me with useless models (i.e. I cannot rely on methods that require me to be judgmental in any way, or assume I have taken the right precautions, simply because I do not really understand when I am crossing the line).
For example, I recently gathered up 27 variables, which encompass the whole universe of these types of variables available to me (without unrealistic and considerable extra effort, anyway). My hypothesis is that these types of variables would be good predictors of positive EV. Also, these variables are all raw numbers, in the sense that I have not convoluted them to represent obscure things (like winning on a Sunday when the temperature is exactly 10 degrees at the start of the game and they lost their last game by exactly 12 points, etc.). These are raw, hard, untampered-with numbers. In fact, I had only ever looked at 7 of these variables before, specifically so I would not taint the final process when I looked at the rest. So, I ran a regression analysis on all 27 and came up with the output below (there is a rough sketch of this kind of fit after the output):
Observations 608.000
Sum of weights 608.000
DF 580.000
R² 0.081
Adjusted R² 0.039
MSE 1.158
RMSE 1.076
MAPE 94.240
DW 1.934
Cp 28.000
AIC 116.315
SBC 239.800
PC 1.007
Analysis of variance (computed against the model Y = Mean(Y)):

Source            DF    Sum of squares    Mean squares    F        Pr > F
Model             27    59.481            2.203           1.903    0.004
Error             580   671.408           1.158
Corrected Total   607   730.889
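For anyone curious about the mechanics, the fit itself is nothing exotic. Here is a rough sketch in Python with statsmodels of the kind of regression I ran; the file name and column names are made-up stand-ins for my actual data, not the real ones:

    # Sketch of the full 27-variable fit (hypothetical file and column names).
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("bets.csv")                   # one row per bet (hypothetical file)
    y = df["return_to_win_1"]                      # +1 for a win, minus the amount risked for a loss
    X = sm.add_constant(df.drop(columns=["return_to_win_1"]))   # the 27 raw predictors plus an intercept

    model = sm.OLS(y, X).fit()
    print(model.rsquared, model.rsquared_adj)      # the R-squared and adjusted R-squared reported above
    print(model.f_pvalue)                          # overall F-test p-value (the 0.004 above)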
With a p-value of 0.004 and an R² of 0.081 (which I took to mean the model explained 28.44% of the variability of the actual ROI), I felt that I had a profitable model. Keep in mind that the dependent variable was the return of betting to win 1 unit (so either +1, or the negative of the amount lost trying to win 1). Because the adjusted R² was so much lower than the R², this indicated to me that a significant number of variables contributed little to the 0.081. So, rather than try (what would be impossible anyway) to check every possible combination of 27 or fewer of the variables, I decided to check the adjusted R² of each model obtained by dropping one variable from the 27 (26-variable models) and see which gave the highest adjusted R². If one beat the original 0.039, I would keep it and then try dropping each of the remaining variables from my new 26-variable model to find the best adjusted R², and so on until no further improvement could be made (a rough sketch of this loop follows the output below). The best I came up with had 17 variables. None of the 16-variable models produced a better result, so I accepted the 17-variable model:
Observations 716.000
Sum of weights 716.000
DF 698.000
R² 0.082
Adjusted R² 0.060
MSE 1.121
RMSE 1.059
MAPE 94.404
DW 1.910
Cp 18.000
AIC 99.752
SBC 182.079
PC 0.965
Analysis of variance:

Source            DF    Sum of squares    Mean squares    F        Pr > F
Model             17    70.032            4.120           3.674    < 0.0001
Error             698   782.677           1.121
Corrected Total   715   852.709
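As mentioned above, here is a rough sketch of that elimination loop, again in Python with statsmodels; y and X_27 stand in for my dependent variable and a DataFrame of the 27 predictors, and the loop simply keeps dropping whichever single variable raises adjusted R² the most until no drop helps:

    # Backward elimination on adjusted R-squared, a sketch of the loop described above.
    import statsmodels.api as sm

    def adj_r2(y, X):
        """Fit OLS with an intercept and return the adjusted R-squared."""
        return sm.OLS(y, sm.add_constant(X)).fit().rsquared_adj

    def backward_eliminate(y, X):
        """Drop one variable at a time as long as adjusted R-squared keeps improving."""
        kept = list(X.columns)
        best = adj_r2(y, X[kept])
        while len(kept) > 1:
            # score every model that drops exactly one of the remaining variables
            scores = {col: adj_r2(y, X[[c for c in kept if c != col]]) for col in kept}
            drop, score = max(scores.items(), key=lambda kv: kv[1])
            if score <= best:
                break              # no single drop improves adjusted R-squared, so stop
            best = score
            kept.remove(drop)
        return kept, best

    # kept, best = backward_eliminate(y, X_27)   # X_27 = DataFrame of the 27 predictors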
Now, before everyone reading this starts to respond with "back fitting" etc., I employed (at least to my understanding) one of the statistically strongest controls against back fitting that there is, the Bonferroni method, which Wikipedia describes as follows:
"In order to retain the same overall rate of false positives (rather than a higher rate) in a test involving more than one comparison, the standards for each comparison must be more stringent. Intuitively, reducing the size of the allowable error (alpha) for each comparison by the number of comparisons will result in an overall alpha which does not exceed the desired limit, and this can be mathematically proved to be true using Bonferroni's inequality, regardless of independence or dependence among test statistics.
However, it can be demonstrated that this technique (called the Bonferroni method) is overly conservative, i.e., it will actually result in a true alpha that is substantially smaller than 0.05 when the test statistics are highly dependent and/or when many of the nulls are false; thereby failing to identify an unnecessarily high percentage of the true differences. For example, in fMRI analysis, tests are done over 100000 voxels in the brain. The Bonferroni method would require p-values to be smaller than .05/100000 to declare significance; this threshold might be considered too stringent for practical use."
So, in total I actually only looked at the p-value and adjusted R² of 27+26+25+24+23+22+21+20+19+18+17 = 242 different combinations. That is 242 out of the 2^27 − 1 (roughly 134 million) possible combinations of 27 or fewer variables. Furthermore, according to the Bonferroni method, as long as the p-value of whatever I find and want to call statistically significant (I'll use α = 0.05 as the default) is less than 0.05/242 = 0.0002066, I have met the requirements of one of the most conservative tests there is against back fitting. Well, my p-value is so low it is reported only as < 0.0001, and for what it is worth, most of the models I rejected were also < 0.0001. Since < 0.0001 is less than 0.0002066, I have not crossed the line to where my process could be biased (at least to my understanding).
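To spell out the arithmetic I am relying on (using 0.0001 as a stand-in for the software's "< 0.0001" readout), in Python:

    # Bonferroni check on the 242 model comparisons I actually looked at.
    alpha = 0.05
    m = 27 + 26 + 25 + 24 + 23 + 22 + 21 + 20 + 19 + 18 + 17    # 242 comparisons
    threshold = alpha / m                      # 0.05 / 242 = 0.0002066...
    observed_p = 0.0001                        # upper bound; the software only reports "< 0.0001"
    print(threshold, observed_p < threshold)   # prints roughly 0.0002066 True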
Now, this is where I think I am seeing a contradiction.
When I look at the Holm-Bonferroni method (as opposed to the plain Bonferroni method), Wikipedia states:
"Suppose there are k hypotheses to be tested and the overall type 1 error rate is α. Start by ordering the p-values and comparing the smallest p-value to α/k. If that p-value is less than α/k, then reject that hypothesis and start all over with the same α and test the remaining k - 1 hypothesis, i.e. order the k - 1 remaining p-values and compare the smallest one to α/(k - 1). Continue doing this until the hypothesis with the smallest p-value cannot be rejected. At that point, stop and accept all hypotheses that have not been rejected at previous steps.
Here is an example. Four hypotheses are tested with α = 0.05. The four unadjusted p-values are 0.01, 0.03, 0.04, and 0.005. The smallest of these is 0.005. Since this is less than 0.05/4, hypothesis four is rejected. The next smallest p-value is 0.01, which is smaller than 0.05/3. So, hypothesis one is also rejected. The next smallest p-value is 0.03. This is not smaller than 0.05/2. Therefore, hypotheses one and four are rejected while hypotheses two and three are not rejected."
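To make sure I am reading that correctly, here is the procedure coded straight from that description and run on the example's four p-values; it rejects hypotheses four and one, exactly as the quote says:

    # Holm-Bonferroni, written directly from the quoted description.
    def holm(p_values, alpha=0.05):
        """Return the 1-based indices of the rejected hypotheses."""
        order = sorted(range(len(p_values)), key=lambda i: p_values[i])
        rejected = []
        k = len(p_values)
        for step, i in enumerate(order):
            if p_values[i] < alpha / (k - step):   # smallest remaining p vs alpha/(k - step)
                rejected.append(i + 1)
            else:
                break                              # stop at the first hypothesis that survives
        return rejected

    print(holm([0.01, 0.03, 0.04, 0.005]))   # [4, 1]: hypotheses four and one rejected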
That seems to say that I am supposed to throw out the combinations with the lowest p-values. I am completely lost. I thought I understood this for a moment, but now I am completely baffled. I thought I was supposed to be searching for low p-value models, not high p-value models.
Please try and explain this to me.
Thanks Ganchrow. I will be refreshing constantly while I wait on this one.