Bonferroni Math Question For Ganchrow

  • VideoReview
    SBR High Roller
    • 12-14-07
    • 107

    #1
Bonferroni Math Question For Ganchrow
I have recently come to the conclusion that anything less than strong protection against back-fitting would result in me coming up with useless models (i.e., I cannot employ methods that require me to be judgmental in any way, or assume I took certain precautions, simply because I do not really understand when I am crossing the line).

For example, I recently gathered up 27 variables which encompass the whole universe of these types of variables that are available to me (without unrealistic and considerable extra effort, anyway). I believe (my hypothesis) that these types of variables would be a good predictor of positive EV. Also, these variables are all raw numbers, in the sense that I have not somehow convoluted them to represent obscure things (like winning on a Sunday when the temperature is exactly 10 degrees at the start of the game and they lost their last game by exactly 12 points, etc.). These are raw, hard, untampered-with numbers. In fact, I had only ever looked at 7 of these variables before, specifically so I would not taint the final process when I looked at the rest. So, I ran a regression analysis on them (the 27) and came up with:

    Observations 608.000
    Sum of weights 608.000
    DF 580.000
    R² 0.081
    Adjusted R² 0.039
    MSE 1.158
    RMSE 1.076
    MAPE 94.240
    DW 1.934
    Cp 28.000
    AIC 116.315
    SBC 239.800
    PC 1.007


    Analysis of variance:

    Source DF Sum of squares Mean squares F Pr > F
    Model 27 59.481 2.203 1.903 0.004
    Error 580 671.408 1.158
    Corrected Total 607 730.889
    Computed against model Y=Mean(Y)

With a p of 0.004 and an R^2 of 0.081 (a multiple correlation R of 28.44%), I felt that I had a profitable model. Keep in mind that the dependent variable was actually the return of betting to win 1 unit (so either 1, or the negative of the amount risked to win 1). Because the adjusted R^2 was so much lower than R^2, this indicated to me that there were a significant number of variables that had little to do with the 0.081. So, rather than try (what would be impossible anyway) to check all of the 2^27 - 1 (roughly 134 million) possible combinations of 27 or fewer of the variables, I thought what I would do is check the adjusted R^2 of the 27 different 26-variable models, each with one of the variables removed, to see what the highest adjusted R^2 would be. If I came across one that was better than the original 0.039, I would keep it and then try taking each of the 26 remaining variables away from my new 26-variable model to find the best adjusted R^2. I would continue this until no improvement could be made. The best I came up with was 17 variables. No 16-variable sub-model produced a better result, so I accepted the 17-variable results of:
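For reference, the greedy backward-elimination search described above can be sketched in a few lines. This is only an illustration of the procedure (Python, statsmodels, and the placeholder column names are my assumptions, not the software actually used):

# Greedy backward elimination: repeatedly drop the single variable whose
# removal most improves adjusted R^2, stopping when no removal helps.
import statsmodels.api as sm

def backward_eliminate(df, predictors, target):
    def adj_r2(cols):
        X = sm.add_constant(df[list(cols)])
        return sm.OLS(df[target], X).fit().rsquared_adj

    current = list(predictors)
    best = adj_r2(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        # Score every model with exactly one variable removed...
        scored = [(adj_r2([v for v in current if v != drop]), drop) for drop in current]
        cand_score, cand_drop = max(scored)
        # ...and keep the best removal only if it beats the current adjusted R^2.
        if cand_score > best:
            best, improved = cand_score, True
            current.remove(cand_drop)
    return current, best

# Hypothetical usage with columns x1..x27 and a 'roi' outcome:
# kept, score = backward_eliminate(data, [f"x{i}" for i in range(1, 28)], "roi")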

    Observations 716.000
    Sum of weights 716.000
    DF 698.000
    R² 0.082
    Adjusted R² 0.060
    MSE 1.121
    RMSE 1.059
    MAPE 94.404
    DW 1.910
    Cp 18.000
    AIC 99.752
    SBC 182.079
    PC 0.965


    Analysis of variance:

    Source DF Sum of squares Mean squares F Pr > F
    Model 17 70.032 4.120 3.674 < 0.0001
    Error 698 782.677 1.121
    Corrected Total 715 852.709


Now, before everyone reading this starts to respond with "back-fitting" etc., I employed (at least to my understanding) one of the statistically strongest controls against back-fitting that there is, the Bonferroni Method, which states according to Wikipedia:

    "In order to retain the same overall rate of false positives (rather than a higher rate) in a test involving more than one comparison, the standards for each comparison must be more stringent. Intuitively, reducing the size of the allowable error (alpha) for each comparison by the number of comparisons will result in an overall alpha which does not exceed the desired limit, and this can be mathematically proved to be true using Bonferroni's inequality, regardless of independence or dependence among test statistics.

    However, it can be demonstrated that this technique (called the Bonferroni method) is overly conservative, i.e., it will actually result in a true alpha that is substantially smaller than 0.05 when the test statistics are highly dependent and/or when many of the nulls are false; thereby failing to identify an unnecessarily high percentage of the true differences. For example, in fMRI analysis, tests are done over 100000 voxels in the brain. The Bonferroni method would require p-values to be smaller than .05/100000 to declare significance; this threshold might be considered too stringent for practical use."

So, in total I actually only looked at the p-value and adjusted R^2 value of 27+26+25+24+23+22+21+20+19+18+17 = 242 different combinations. That is 242 out of the roughly 134 million (2^27 - 1) possible combinations. Furthermore, according to the Bonferroni Method, as long as the p-value of whatever I find and want to call statistically significant (I'll use p = .05 as a default) is less than .05/242 = .0002066, I have met the requirements of one of the most conservative tests there is against back-fitting. Well, my p is so low it comes up as < .0001, and for what it is worth, most of the models that I rejected were also < .0001. Since < .0001 is less than .0002066, I have not crossed the line to where my process could be biased (at least to my understanding).
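The arithmetic above, spelled out in a few lines of Python (these are just the numbers already quoted):

# Bonferroni arithmetic for the search described above.
alpha = 0.05
k = sum(range(17, 28))            # 27 + 26 + ... + 17 = 242 comparisons looked at
bonferroni_cutoff = alpha / k     # each p-value must come in below this
all_subsets = 2 ** 27 - 1         # every non-empty subset of the 27 variables
print(k, bonferroni_cutoff, all_subsets)   # 242  0.0002066...  134217727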

Now, this is where I am seeing a contradiction.

When I look at the Holm-Bonferroni Method (as opposed to the plain Bonferroni Method), it states according to Wikipedia:

    "Suppose there are k hypotheses to be tested and the overall type 1 error rate is α. Start by ordering the p-values and comparing the smallest p-value to α/k. If that p-value is less than α/k, then reject that hypothesis and start all over with the same α and test the remaining k - 1 hypothesis, i.e. order the k - 1 remaining p-values and compare the smallest one to α/(k - 1). Continue doing this until the hypothesis with the smallest p-value cannot be rejected. At that point, stop and accept all hypotheses that have not been rejected at previous steps.

    Here is an example. Four hypotheses are tested with α = 0.05. The four unadjusted p-values are 0.01, 0.03, 0.04, and 0.005. The smallest of these is 0.005. Since this is less than 0.05/4, hypothesis four is rejected. The next smallest p-value is 0.01, which is smaller than 0.05/3. So, hypothesis one is also rejected. The next smallest p-value is 0.03. This is not smaller than 0.05/2. Therefore, hypotheses one and four are rejected while hypotheses two and three are not rejected."
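As a sanity check on that Wikipedia example, here is a minimal sketch of the Holm step-down procedure applied to those four p-values (plain Python; my illustration, not from the article):

# Holm-Bonferroni step-down procedure applied to the example above.
def holm(pvalues, alpha=0.05):
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    rejected = [False] * len(pvalues)
    for step, i in enumerate(order):
        # Compare the (step+1)-th smallest p-value against alpha / (k - step).
        if pvalues[i] < alpha / (len(pvalues) - step):
            rejected[i] = True
        else:
            break   # stop at the first failure; everything after is retained
    return rejected

print(holm([0.01, 0.03, 0.04, 0.005]))
# -> [True, False, False, True]: hypotheses 1 and 4 rejected, 2 and 3 retained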

    This seems to say that I am supposed to throw out the combinations with the lowest p-values. I am completely lost. I thought I knew something for a moment but I am now completely baffled. I thought I was supposed to be searching for low p-value models, not high p-value models.

    Please try and explain this to me.

    Thanks Ganchrow. I will be refreshing constantly on this one waiting.
    Last edited by VideoReview; 03-05-08, 04:09 PM. Reason: grammar
  • Ganchrow
    SBR Hall of Famer
    • 08-28-05
    • 5011

    #2
    Originally posted by VideoReview
    This seems to say that I am supposed to throw out the combinations with the lowest p-values. I am completely lost. I thought I knew something for a moment but I am now completely baffled. I thought I was supposed to be searching for low p-value models, not high p-value models.
    You are indeed looking for low p-value models. I think you're just getting stuck on the terminology.

    A typical hypothesis test might look something like this:
    H0: The tested model is not profitable
    Ha: The tested model is profitable

    You'd then reject the null in favor of the alternative given a p-value ≤ α.

As a side note, technically speaking the Bonferroni correction is only an approximation. FWIW, the actual corrective factor should be 1 - (1 - α)^(1/k), which will always be > α/k (for α > 0, k > 1). In practice, however, for traditional values of alpha (≤ ~5%) the difference is negligible.
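A quick numerical comparison of the two cutoffs, using the k = 242 comparisons mentioned earlier in the thread (the 1 - (1 - α)^(1/k) factor is commonly known as the Šidák correction):

# Per-test cutoffs: Bonferroni's alpha/k vs. the exact 1 - (1 - alpha)^(1/k)
# factor mentioned above (the Sidak correction).
alpha, k = 0.05, 242
bonferroni = alpha / k
sidak = 1 - (1 - alpha) ** (1 / k)
print(bonferroni, sidak)   # ~0.0002066 vs ~0.0002119 -- Sidak is slightly looser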
    • VideoReview
      SBR High Roller
      • 12-14-07
      • 107

      #3
      Originally posted by Ganchrow
      You are indeed looking for low p-value models. I think you're just getting stuck on the terminology.

      A typical hypothesis test might look something like this:
      H0: The tested model is not profitable
      Ha: The tested model is profitable

      You'd then reject the null in favor of the alternative given a p-value ≤ α.

As a side note, technically speaking the Bonferroni correction is only an approximation. FWIW, the actual corrective factor should be 1 - (1 - α)^(1/k), which will always be > α/k (for α > 0, k > 1). In practice, however, for traditional values of alpha (≤ ~5%) the difference is negligible.
I see. So in the Holm-Bonferroni example I gave, I would be accepting the null hypothesis for the p-values of .03 and .04 and rejecting it for .005 and .01. Is this correct?

Also, please confirm or reject my understanding that either of these two Bonferroni methods does, in fact, strongly prevent back-fitting errors if I honestly account for every single p-value that I look at, regardless of how I came to investigate the variable combination in the first place. I am sure you have already drawn your conclusions about my very rudimentary statistics knowledge, so what I am really asking is your subjective opinion: do you think the Bonferroni and/or Holm-Bonferroni methods are enough to keep "my" models HSD from the null hypothesis, given my limited knowledge?

Also, assuming you think either or both of these methods would be suitable for me, what is your opinion on the adjusted R^2 value for the model I posted above, the way in which I developed the model, and my ultimate conclusion that the model is significant at .05 or below based on the number of p-values I looked at in order to create my final model? The dependent variable was the actual ROI on a win-one-unit bet.

      Finally, I see what you mean about the corrective factor being pretty close for small alphas but it gets way off for relatively large alphas. What is the name of the corrective factor you quoted?
      • Ganchrow
        SBR Hall of Famer
        • 08-28-05
        • 5011

        #4
        Originally posted by VideoReview
I see. So in the Holm-Bonferroni example I gave, I would be accepting the null hypothesis for the p-values of .03 and .04 and rejecting it for .005 and .01. Is this correct?
        Correct.

        Originally posted by VideoReview
Also, please confirm or reject my understanding that either of these two Bonferroni methods does, in fact, strongly prevent back-fitting errors if I honestly account for every single p-value that I look at, regardless of how I came to investigate the variable combination in the first place.
        Actually, no. For the Bonferroni correction to be theoretically valid the hypotheses you're testing need to be independent of one another. Also see below.

        Originally posted by VideoReview
I am sure you have already drawn your conclusions about my very rudimentary statistics knowledge, so what I am really asking is your subjective opinion: do you think the Bonferroni and/or Holm-Bonferroni methods are enough to keep "my" models HSD from the null hypothesis, given my limited knowledge?

Also, assuming you think either or both of these methods would be suitable for me, what is your opinion on the adjusted R^2 value for the model I posted above, the way in which I developed the model, and my ultimate conclusion that the model is significant at .05 or below based on the number of p-values I looked at in order to create my final model? The dependent variable was the actual ROI on a win-one-unit bet.

        Finally, I see what you mean about the corrective factor being pretty close for small alphas but it gets way off for relatively large alphas. What is the name of the corrective factor you quoted?
I think you might be getting a bit ahead of yourself here. I certainly respect the care with which you're investigating your hypotheses and the obvious interest you have in the underlying statistics. That said, I think your first step really should be to partition your data set (I generally try to create 3 segments based on date divisibility by 3 -- this helps ensure that season-specific effects won't bias a particular partition) and play around with this within a single segment. Beat up the data as much as possible and use the methods you've outlined to try to come up with the model you expect to be most profitable out of sample.

        Then test the same model on another partition and see what happens.
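In code, that partitioning might look something like this (pandas, the file name, and the day-of-year rule are assumptions for illustration, not a prescription):

# Sketch: split a hypothetical games table into 3 date-based segments so each
# segment samples the whole period rather than one contiguous block.
import pandas as pd

games = pd.read_csv("games.csv", parse_dates=["game_date"])   # hypothetical file/columns
segment = games["game_date"].dt.dayofyear % 3                  # one reading of "divisibility by 3"
train  = games[segment == 0]    # build and beat up the model here
test_a = games[segment == 1]    # then test the frozen model here...
test_b = games[segment == 2]    # ...and here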
        • VideoReview
          SBR High Roller
          • 12-14-07
          • 107

          #5
          Originally posted by Ganchrow
          Correct.

          Actually, no. For the Bonferroni correction to be theoretically valid the hypotheses you're testing need to be independent of one another. Also see below.

I think you might be getting a bit ahead of yourself here. I certainly respect the care with which you're investigating your hypotheses and the obvious interest you have in the underlying statistics. That said, I think your first step really should be to partition your data set (I generally try to create 3 segments based on date divisibility by 3 -- this helps ensure that season-specific effects won't bias a particular partition) and play around with this within a single segment. Beat up the data as much as possible and use the methods you've outlined to try to come up with the model you expect to be most profitable out of sample.

          Then test the same model on another partition and see what happens.
Regarding your notes on the Bonferroni correction, I understood that the test statistics could be dependent or independent:
          "Intuitively, reducing the size of the allowable error (alpha) for each comparison by the number of comparisons will result in an overall alpha which does not exceed the desired limit, and this can be mathematically proved to be true using Bonferroni's inequality, regardless of independence or dependence among test statistics."

I am sure I am missing something here, because I can't see what use this correction would have at reducing p-values of different hypotheses, since I understood that this was to prevent Type I errors for similar hypotheses when extensive testing was being performed. Please explain a bit more.

Regarding out-of-sample testing, I think I understand what you mean, and I COMPLETELY forgot to tell you an important part of my model development. The 728 data points represent the home and away teams for 364 NHL games. These 364 games were part of a total population of 1372 possible bets (686 games) from April 2007 until January 2008. The ONLY reason the 728 data points were selected was because they represented all of the data points where all of the variables were present. In other words, I actually did the regression analysis on all 1372 possible bets, and the program was instructed to remove any observation where the variables being tested were not all present. By "present" I do not mean positive or negative; I mean that the variable was actually missing and I had no way of knowing what its value was.

Once I settled on my model of 17 variables, in order to get a statistically significant result, I actually ran 364 linear regression analyses using the leave-one-out cross-validation method. From what I had read, this is one of the strongest ways of developing test results. So in essence, I was using results from over half a season to predict the results of a single game, which (please correct me if I am wrong) would remove as much seasonal bias in my data as possible, notwithstanding that I haven't yet added February and March. I had 728 teams (possible bets) in 364 games, and one game at a time I removed both the home and away team's data from the population. So the values of the variables (the training data) were made up from the remaining 726. This training data obviously had no way of knowing the values of the 2 teams (1 game) that I had pulled out, or the result of the game. I copied the 2 predicted results and the actual result from the 2 test teams.

Because the training data figured out that by betting both sides of a game I would lose about 3.7% on any game, the two predicted amounts for a game would always average out to "exactly" the average loss per bet of the total population. My results are based on betting at Pinnacle, and although it is an NHL nickel book, the negative 3.7% simply shows that the dogs did better than average during this sample. A typical result of using 726 data points to predict 2 data points would look like this:

Observation | Weight | Result To Win 1 Unit | Pred(Result To Win 1 Unit)
Obs1        | 1      |  1.000               |  0.069
Obs687      | 1      | -1.410               | -0.143

Observations 1 and 687 represent the away and home teams from a single game. In this case, it estimated the away team to have a predicted positive EV of 6.9% and the home team to have a predicted negative EV of -14.3%. Therefore, the total average EV is predicted to be -3.7% = (-14.3 + 6.9)/2. I did not copy the individual p-values from each of the 364 models in the leave-one-out process, but did anecdotally notice that all of them were < .0001. Also, the adjusted R^2 value for all that I looked at was in the .05 to .06 range, which is what was expected based on the entire population.
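For clarity, the leave-one-game-out loop described above could be sketched roughly like this (Python/statsmodels and the column names are assumptions, not what was actually used):

# Leave-one-game-out cross-validation: for each game, fit on the other games'
# rows (both teams of the held-out game removed) and predict the two held-out rows.
import pandas as pd
import statsmodels.api as sm

def loo_by_game(df, predictors, target, game_col="game_id"):
    preds = []
    for game in df[game_col].unique():
        train = df[df[game_col] != game]        # the 726 remaining rows in the example above
        test = df[df[game_col] == game]         # the 2 held-out rows (home and away)
        X_train = sm.add_constant(train[predictors])
        model = sm.OLS(train[target], X_train).fit()
        X_test = sm.add_constant(test[predictors], has_constant="add")
        out = test[[game_col, target]].copy()
        out["prediction"] = model.predict(X_test)
        preds.append(out)
    return pd.concat(preds)

# Hypothetical usage:
# results = loo_by_game(data, selected_17_vars, "result_to_win_1", "game_id")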

          I gathered the following from Wikipedia:

          "K-fold cross-validation
          In K-fold cross-validation, the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds then can be averaged (or otherwise combined) to produce a single estimation."

I think what you were suggesting was a sort of K-fold cross-validation where I would take 1/3 of the data as training data, and this data would be a random or pseudo-random cross-section of the entire set (say, every third game), so that I would have an equal number of games from various dates and would not end up with a seasonally biased sample (season, day of week, week of month, etc.). My assumption (again, please correct or confirm for me) is that I have done this with the method I have employed. However, I am very keen to hear if you still feel that your proposal is better than what I am doing, and why.

Since the leave-one-out method is simply k-fold cross-validation taken to its furthest limit, I concluded that I was able to take averages or do regressions on the new test population, which was generated completely out of sample. Please confirm or correct this assumption as well.

          I did some basic average tests on the new test data since I was confident that it was out of sample. It showed the following assuming bet to win 1 unit:

1) If I were to make 728 bets (the entire test population), which would mean betting both sides of every game, my ROI would be -3.7627528%. I thought I would check this number against the entire population of 1372 bets (which includes games where all variables were not present and which hence were not included in the regression), and the result was -3.623192%. This showed me that my subset was a fair reflection of the whole population.
2) If I were to make an equally weighted bet on every single positive predicted EV (Pred EV > 0), my ROI for 324 bets would be +12.433314%. Because I have not figured out a way to do a regression with the total amount bet and total amount won combined, I decided I would simply say the result was 1 if the team won (I win 1 unit) and -n for the number of units lost if the team lost (i.e. -1.5 for a -150 team loss or -.5 for a +200 team loss). Just to double-check my average ROI, I calculated what my ROI would have been using total won/total bet for the entire test sample. It came back as 12.4025404%, which was very close to my number, so I accepted it. Finally, I took all of the odds that were bet (324 of them in total) in the test sample and did 2 Monte Carlo runs with a bogey of .124025404 (I used the more accurate number); a sketch of this kind of Monte Carlo check appears after this list. The 1 million trial run came back with p = .012323 and the 3 million trial run came back with p = .0123716. The means were both very close to zero, so I accepted the results.
3) I wanted to see what difference betting with different weights would have, so I weighted each of the predicted positive EV bets by exactly the amount they were predicted to be (e.g. if the bet on a +100 team was predicted to have a positive EV of .10, I would put the weight at .1, which effectively means I was betting .1 units to win .1 units in this case). The result was that my Total Won/Total Bet = 20.548618%. This was substantially better than the equal-weight bets and reassured me I was on the right track. I then requested and received your code for weighted Monte Carlo runs. I sent the odds to the program along with the predicted EV weights, and the results for 1 million trials were p = .002784 and for 3 million trials p = .002805. The means were, again, very close to zero, so I accepted the results.
4) Finally, I wanted to test how good these EVs (both positive and negative) were at predicting the ACTUAL outcome of betting to win one unit (the actual ROI). I ran a regression analysis on the Predicted EV column and made the dependent variable the actual win-1-unit outcome. I also forced the intercept to zero, because I simply wanted a percentage of predicted EV that I could multiply by to come up with a statistically significant predicted EV. I ran it at 95% confidence and the results were:
          Value=.596
          Lower Bound at 95%=.324
          Upper Bound at 95%=.869
          p <.0001

          I also did this for 99% and 99.99% confidence level (the maximum my program lets me). The 99.99% results were:
          Value=.596
          Lower Bound at 99.99%=.054
          Upper Bound at 99.99%=1.139
          p <.0001

I had pre-decided that I was going to take the 95% lower bound and bet to win a quarter of this amount (approximately quarter Kelly -- thanks for the previous equation).

          Therefore, my new bet amount would be to win the following number of units per bet:

Predicted EV (if > 0) * .324 / 4, i.e. Predicted EV * .081073875 (using the unrounded lower bound)

          I did a check on the 8.1% number and found that it was the lower bound at a little over the 99.97% confidence level. Therefore, I accepted this number as very reliable.
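The Monte Carlo checks mentioned in items 2 and 3 above are not spelled out in the thread; a minimal sketch of that kind of check, assuming decimal odds, win-1-unit stakes, and odds-implied win probabilities for the null, might look like this:

# Monte Carlo check of an observed ROI: re-simulate win-1-unit bets on the same
# odds under the null that each bet wins with its odds-implied probability.
# (With p = 1/decimal_odds, each bet's expected profit is exactly zero, which is
# consistent with the near-zero simulated means reported above.)
import numpy as np

def mc_pvalue(decimal_odds, observed_roi, trials=1_000_000, chunk=10_000, seed=0):
    odds = np.asarray(decimal_odds, dtype=float)
    p_win = 1.0 / odds                 # implied win probability under the null
    risk = 1.0 / (odds - 1.0)          # units risked to win 1 unit
    total_risk = risk.sum()
    rng = np.random.default_rng(seed)
    hits, roi_sum, done = 0, 0.0, 0
    while done < trials:
        n = min(chunk, trials - done)
        wins = rng.random((n, odds.size)) < p_win
        roi = np.where(wins, 1.0, -risk).sum(axis=1) / total_risk
        hits += int((roi >= observed_roi).sum())
        roi_sum += roi.sum()
        done += n
    return hits / trials, roi_sum / trials   # one-tailed p-value, mean simulated ROI

# Hypothetical usage with the 324 bets from item 2:
# p, mean_roi = mc_pvalue(decimal_odds_list, 0.124025404)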

I have tried to do every step of the process as thoroughly as I am capable of with the knowledge I have, and have tried to employ the most rigid requirements at every step. If you can see any error, either small ones or ones that require large flashing green corrections, please let me know, or point me in the correct direction if it is proprietary.

I would like you to know that although I won't make my variables known, and even though everyone is aware of them, I have only seen them discussed very infrequently, and never as an entire group of variables. The discussions seem to come from posters who are on the whole quiet on the subject and who I sincerely believe are making money.
          Last edited by VideoReview; 03-05-08, 10:18 PM. Reason: Clarification
          • Ganchrow
            SBR Hall of Famer
            • 08-28-05
            • 5011

            #6
            Originally posted by VideoReview
Regarding your notes on the Bonferroni correction, I understood that the test statistics could be dependent or independent:
            "Intuitively, reducing the size of the allowable error (alpha) for each comparison by the number of comparisons will result in an overall alpha which does not exceed the desired limit, and this can be mathematically proved to be true using Bonferroni's inequality, regardless of independence or dependence among test statistics."
This refers to the Bonferroni inequality. The inequality will be an equality only if the events are independent. If the events are not independent, then Bonferroni will be too conservative, resulting in lower statistical power for the test. I apologize if I implied otherwise in my original response.

            Originally posted by VideoReview
I am sure I am missing something here, because I can't see what use this correction would have at reducing p-values of different hypotheses, since I understood that this was to prevent Type I errors for similar hypotheses when extensive testing was being performed. Please explain a bit more.
            I don't understand your question.

Originally posted by VideoReview
[... the remainder of post #5 quoted in full ...]
            This is quite a lot of writing and would take me a long time to get through. Could you possibly condense this?
            • 20Four7
              SBR Hall of Famer
              • 04-08-07
              • 6703

              #7
OMG, someone has learned, Ganch. I'm having difficulty wading through this, but I think once I do it could be valuable.
              • VideoReview
                SBR High Roller
                • 12-14-07
                • 107

                #8
                Originally posted by Ganchrow
I think you might be getting a bit ahead of yourself here. I certainly respect the care with which you're investigating your hypotheses and the obvious interest you have in the underlying statistics. That said, I think your first step really should be to partition your data set (I generally try to create 3 segments based on date divisibility by 3 -- this helps ensure that season-specific effects won't bias a particular partition) and play around with this within a single segment. Beat up the data as much as possible and use the methods you've outlined to try to come up with the model you expect to be most profitable out of sample.

                Then test the same model on another partition and see what happens.
I have done what you suggested and divided the games, which were in chronological order, into thirds by numbering them 1, 2, 3, 1, 2, 3, 1, etc. I even had a random number selected to start the sequence.

To eliminate bias, I now use a model-creation algorithm. Basically, every single time I run a regression analysis on any group of data, the data is evaluated in order to create the best "new" model, with different variables as well as different values. I never consider what models previously worked well, and I start from scratch every time. This eliminates the bias of me determining the best model for the entire population and then having only the variable values change when I perform a regression on subsets of the data.
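A sketch of what "build on one segment, score on the others" might look like (statsmodels, the column names, and the bet-to-win-1-unit reading of the betting rule are my assumptions):

# Sketch: freeze a model fitted on one segment, then score it on the others by
# simulating "bet to win 1 unit whenever the predicted EV is positive".
import statsmodels.api as sm

def fit_on_segment(df, predictors, target):
    X = sm.add_constant(df[predictors])
    return sm.OLS(df[target], X).fit()

def out_of_sample_roi(model, df, predictors, result_col, risk_col):
    X = sm.add_constant(df[predictors], has_constant="add")
    picks = df[model.predict(X) > 0]                          # bets with predicted EV > 0
    return picks[result_col].sum() / picks[risk_col].sum()    # profit / amount risked

# Hypothetical usage:
# model = fit_on_segment(train, chosen_vars, "result_to_win_1")
# print(out_of_sample_roi(model, test_a, chosen_vars, "result_to_win_1", "risk"))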

With 200+ observations in my new NHL sample set, and 27 variables to start with, I was able to get an adjusted R^2 value of .316 and p < .00001. When I tested what my results would have been if I had bet to win 1 unit every time the predicted EV was positive, my ROI was around +24%. I then tested the variables on the other 2 sets. The results were completely unexpected and I would like some help interpreting them. I expected to see one of two possibilities:

                a) The ROI on the other 2 sets at around -3.6% which would be the ROI if I bet both sides of the game for all teams trying to win 1 unit. Basically, the model has no predictive power.
                OR
                b) An ROI in the range of +2% to +5% indicating that the model was profitable.

                Instead, I got -9% and -13% if the bets were weighted based on predicted EV. Statistically significantly far worse than random. Does anyone have any insight as to what these results mean?

I then tried testing each of the other 2 sets against the remaining sets and had similar results. I then divided the data up into thirds in chronological order this time and had each third predict the remaining sets, with even more startling results (i.e. +40% for my sample set and -15% for each of the other 2 sets). In fact, the higher the adjusted R^2 value, the higher the ROI in my sample set and the more negative the ROI of the test sets.

                All 3 sets have a combined unweighted ROI of about +3% when the variables are used.

The only explanation I can come up with for this almost perfectly symmetrical divergence of results between the sample set and the test sets is that I do have a small edge for my NHL games population and the ROIs of the games are not independent (i.e. market or book pressure continuously keeps this edge from getting too large).

                Any other ideas anyone?
                • Ganchrow
                  SBR Hall of Famer
                  • 08-28-05
                  • 5011

                  #9
                  If you're randomly segmenting your sample I don't really see how we could expect that this is the result of a book continually readjusting its odds.

                  When you say "far worse than random" do you mean they were to a statistically significant extent much worse (two-tailed) than your a) case? If not, it seems most likely that your results were simply the product of data mining.

                  If so, and especially if you're finding this across partitions, well I hate to be dismissive, but have you considered the possibility of a programming error?
                  • VideoReview
                    SBR High Roller
                    • 12-14-07
                    • 107

                    #10
                    Originally posted by Ganchrow
If you're randomly segmenting your sample I don't really see how we could expect that this is the result of a book continually readjusting its odds.
After spending 50%+ of my waking hours since you replied trying to come up with a plausible explanation for the results I was seeing, I have FINALLY succumbed to the overwhelming conclusion that it is very risky, regardless of extremely low p-values, to assume a model is valid when it cannot be verified by data that is "totally" out of sample. Although this conclusion may seem academically trivial for those who already abide by this principle, it is a major leap for me to truly take data away before doing the serious number crunching. I always feared that I would not see the trends by limiting my data. Now I know that any trend worth spotting will show up in a smaller data sample, and if it is not verifiable due to the small amount of available data, then I must obtain or wait for more out-of-sample data before attempting to draw any conclusions.

                    In retrospect I can see now that I was grasping at straws trying to explain why my model wasn't useless at predicting something.

Although this is the case, I am still at a loss to explain why the larger my adjusted R^2 value, and consequently my ROI, became in my 1/3 sample, the proportionally worse my ROI became in the remaining two-thirds of the data. I mean, if I were flipping a coin 300 times, cut the data by taking every third flip as my original sample, and discovered some way to explain 80% of the flips within those 100 flips, I would expect that my algorithm would produce at least a 50% result (50% if the formula is useless, and higher if I found a bias) on the remaining two-thirds of the data, and not a 35% result, which would equal .80*100 + .35*200 = 150, or 50%. In my NHL data, the higher my ROI got in my 1/3 (every third game in chronological order) sample, the lower the ROI got in the 2/3 test sample. This is what I was trying to explain with the bookmaker-readjusting-its-odds idea. Is there a simple explanation for the phenomenon I was observing?
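As a sanity check on the coin-flip intuition, here is a small simulation (entirely my construction, not from the thread). It fits 27 random predictors to random outcomes on one-third of the rows and then scores the frozen fit on the rest; it does not explain results that are systematically worse than chance, but it shows how much of an in-sample fit can be pure noise that collapses out of sample:

# Fit 27 random predictors to random +/-1 outcomes on 1/3 of the rows, then
# score the frozen coefficients on the other 2/3. In-sample R^2 is inflated by
# the 27 free parameters; out of sample it falls back toward zero (chance).
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 27
X = rng.normal(size=(n, k))
y = rng.choice([-1.0, 1.0], size=n)          # pure-noise outcomes

idx = np.arange(n)
train, test = idx[idx % 3 == 0], idx[idx % 3 != 0]

Xt = np.column_stack([np.ones(train.size), X[train]])
beta, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)

def r2(X_, y_, beta):
    pred = np.column_stack([np.ones(len(y_)), X_]) @ beta
    return 1 - ((y_ - pred) ** 2).sum() / ((y_ - y_.mean()) ** 2).sum()

print("in-sample R^2:", r2(X[train], y[train], beta))    # typically around 0.25-0.35
print("out-of-sample R^2:", r2(X[test], y[test], beta))  # typically near or below 0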

                    Originally posted by Ganchrow
                    When you say "far worse than random" do you mean they were to a statistically significant extent much worse (two-tailed) than your a) case? If not, it seems most likely that your results were simply the product of data mining.
                    I overstated the situation and will be more careful in the future not to make unsubstantiated general claims like this without doing the hard math which is what I had always done before.

                    Originally posted by Ganchrow
                    If so, and especially if you're finding this across partitions, well I hate to be dismissive, but have you considered the possibility of a programming error?
I use XLStat, an Excel statistics add-on, so unless they have an error in their programming, there shouldn't be a programming error on my end. I am also usually pretty thorough about making sure my data is intact (i.e. I use check-sums to make sure I haven't inadvertently changed the data from the original, etc.).




I am currently at the stage of coming up with a logical model that can be developed, tested, and verified on numerous combinations of in-sample data (i.e. every 2nd game, every 3rd game, every 4th game, etc.), developing new variable combinations and values from scratch each time, and that still explains the remaining out-of-sample data beyond what is random, and by enough to generate positive EV beyond the vig. This will be done without data mining and without looking for combinations that can only be explained that way. Although this is obviously a big task, I am at least grateful that I am now asking the correct question. Stay tuned...