Math/Statistics question

  • maxpower79
    SBR Rookie
    • 02-01-07
    • 9

    #1
    Math/Statistics question
    This is partly intended for Ganchrow, but anyone can respond:

    Let's say I want to use statistics to evaluate a certain angle, or capper, or tout, or whatever.

    I know if I have some observations, I can use the binomial distribution, or normal approx. for large samples, to calculate a p-value, given a null hypothesis. So if I observe an ATS record of 9-2, and I make the null hypothesis that each of this system/capper/etc.'s picks will hit with probability q,

    p = (11 choose 2) * q^9 * (1-q)^2
    + (11 choose 1) * q^10 * (1-q)
    + (11 choose 0) * q^11

    And I can choose some significance level p*, and reject the null if p < p*. I hope that's right, anyway.
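
    Just to make that calculation concrete, here's a minimal sketch of it in Python (assuming scipy is available; q is whatever null probability you pick):

        from scipy.stats import binom

        q = 0.5                    # null hypothesis: each pick hits with probability q
        wins, games = 9, 11        # the observed 9-2 ATS record
        # p-value: probability of 9 or more wins out of 11 under the null
        p = binom.sf(wins - 1, games, q)   # sf(k) = P(X > k), so sf(8) = P(X >= 9)
        print(p)                   # roughly 0.0327 when q = 0.5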

    What I'd like to know is how things change if I have lots of sets of observations - say I'm data-mining, and I look at 100 different angles at once. Or I'm evaluating lots of handicappers - maybe the BTP contest - and I want to know, quantitatively, "Can any of these guys really cap?" Obviously, I can calculate p-values for each individual set of picks. But intuitively, they don't appear to have the same meaning. If the smallest p-value out of 700 is less than some p*, well, you would probably expect somebody to do that well just from sheer luck and the large sample size.

    I did get that if you take the null to be "all of these guys' picks hit with the same probability q" then you can add all the results together and get one p (for each q) for the entire sample. But if you rejected that null all you would be able to say is "at least one of these cappers wins with probability greater than q". But that's not that useful. What I'm wondering is if there is a way to come up with an "adjusted" p for each observation, that takes into account that it was one of many.

    I hope that's clear. Thanks in advance for any input.

    ~ Max
  • SBR Lou
    BARRELED IN @ SBR!
    • 08-02-07
    • 37863

    #2
    Paging Ganchrow...
    Comment
    • calm
      SBR Hustler
      • 01-04-08
      • 82

      #3
      I'm pretty sure your intuition is wrong. I'm too lazy/tired to go through the math, but whether you're looking at one handicapper or a thousand, I don't think it would change anything.
      Comment
      • pokernut9999
        SBR Posting Legend
        • 07-25-07
        • 12757

        #4
        lost me after the 3rd paragraph
        Comment
        • pico
          BARRELED IN @ SBR!
          • 04-05-07
          • 27321

          #5
          Originally posted by pokernut9999
          lost me after the 3rd paragraph
          you're better than me. i stop reading after the first sentence.
          Comment
          • mofome
            SBR Posting Legend
            • 12-19-07
            • 13003

            #6
            ganch, don, remp? whos here?
            Comment
            • pokernut9999
              SBR Posting Legend
              • 07-25-07
              • 12757

              #7
              excuse me, I got lost in the 3rd paragraph
              Comment
              • CHALKbreaker
                SBR High Roller
                • 12-27-07
                • 115

                #8
                Seems like paralysis by analysis to me.
                Comment
                • DrunkenLullaby
                  SBR MVP
                  • 03-30-07
                  • 1631

                  #9
                  I can't give an answer, but my gut tells me that when Ganch arrives there may be a Chi-square distribution in our future.
                  Comment
                  • Ganchrow
                    SBR Hall of Famer
                    • 08-28-05
                    • 5011

                    #10
                    Originally posted by maxpower79
                    What I'd like to know is how things change if I have lots of sets of observations - say I'm data-mining, and I look at 100 different angles at once. Or I'm evaluating lots of handicappers - maybe the BTP contest - and I want to know, quantitatively, "Can any of these guys really cap?" Obviously, I can calculate p-values for each individual set of picks. But intuitively, they don't appear to have the same meaning. If the smallest p-value out of 700 is less than some p*, well, you would probably expect somebody to do that well just from sheer luck and the large sample size.

                    I did get that if you take the null to be "all of these guys' picks hit with the same probability q" then you can add all the results together and get one p (for each q) for the entire sample. But if you rejected that null all you would be able to say is "at least one of these cappers wins with probability greater than q". But that's not that useful. What I'm wondering is if there is a way to come up with an "adjusted" p for each observation, that takes into account that it was one of many.
                    Data mining is very dangerous business. Unless you really know what you're doing, do yourself a favor and stay far, far away.

                    With that caveat firmly in place, here are a couple of quick and dirty mathematical approaches. In my opinion, a clear understanding of the following would be a necessary, although by no means sufficient, precondition for embarking on any form of profit-oriented data-mining sports betting project. The first approach considers the likelihood of the single best observed outcome, while the second considers the likelihood of a complete data set as extreme as or more extreme than the one observed. Both tests are inherently one-tailed.
                    Let's say you're looking at a single handicapper making straight-up picks at unbiased lines. If he were a 50% handicapper, his probability of picking N or more out of 100 games correctly would be given by =1-BINOMDIST(N-1,100,50%,1), where BINOMDIST() is the Excel binomial distribution function. Hence, a 50% picker would only have a 4.4313% probability of picking 59 or more games correctly. (This is referred to as either the "p-value" or the "significance level".)

                    If you were looking at M handicappers, each making sets of 100 independent picks, then assuming they were all 50% pickers:
                    1. The probability of at least one handicapper picking N or more correctly would be =1-(BINOMDIST(N-1,100,50%,1)^M).
                    2. The joint probability of results as extreme as or more extreme than the entire observed outcome, where the i-th handicapper has made N_i correct picks, could be approximated by =CHIDIST(-2*Σ_{i≤M} ln(1-BINOMDIST(N_i-1,100,50%,1)), 2*M), where CHIDIST() is the Excel chi-squared distribution function, which in this case is called with 2*M degrees of freedom. This is known as Fisher's method. A more accurate way of putting it would be that the chi-square result is the probability that the product of the individual significance levels would take on the observed value or lower, assuming all pickers were 50/50.


                    So let's say you're looking at 5 pickers, each making 100 picks on unrelated games, with 60, 57, 52, 48, and 47 correct picks respectively. If all pickers were 50% pickers:
                    1. We'd expect to see at least one of the five picking 60 or more correctly with probability =1-(BINOMDIST(60-1,100,50%,1)^5) ≈ 13.44%.
                    2. The joint probability of the entire outcome or better would be =CHIDIST(-2*ln(2.84%*9.67%*38.22%*69.14%*75.79%),10) ≈ 13.17%. (It might pay to give an example of what would be considered "the same outcome or better". This phrase refers to the product of the significance levels over independent trials, which in this case works out to be about 0.05507%, the fifth root of which is 22.29%. That is approximately the significance of picking 512 out of 1,000 correctly (1-BINOMDIST(512-1,1000,50%,1) ≈ 23.352%). Hence, the above outcome would be, by the standards of Fisher's method, about as "extreme" as 5 people all picking 512 out of 1,000 correctly.)
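
                    For anyone who'd rather check those numbers outside of Excel, here's a minimal Python sketch of both calculations (assuming scipy is available; binom.sf and chi2.sf stand in for BINOMDIST and CHIDIST):

                        from math import log
                        from scipy.stats import binom, chi2

                        wins = [60, 57, 52, 48, 47]        # correct picks out of 100 for each of the 5 pickers
                        p_vals = [binom.sf(w - 1, 100, 0.5) for w in wins]   # individual significance levels

                        # Method 1: probability that at least one 50% picker goes 60+ out of 100
                        print(1 - (1 - p_vals[0]) ** 5)    # roughly 0.1344

                        # Method 2: Fisher's method over all five pickers
                        stat = -2 * sum(log(p) for p in p_vals)
                        print(chi2.sf(stat, 2 * len(wins)))   # roughly 0.1317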

                    I'll note that when used to analyze contemporaneous contest results data, the above methods would need to be adjusted to take into account the underlying contest structure, possibly including correlation between contestants' picks and the impact of stale lines on winning percentages.

                    Another issue with the first method is that, holding the desired significance level constant, as you increase the number of sample sets considered (i.e., the number of contest participants or the number of alternative betting strategies considered), the incidence of Type-II errors ("false negatives") would also increase, decreasing the statistical power of these tests and rendering this form of analysis effectively useless.

                    This can also be an issue with the second method, especially to the extent that only a relatively small number of truly talented pickers (or successful strategies) exist within a large population. If this becomes an issue there are certainly other (more complicated) testing methodologies featured in commercial statistical packages that you might consider.
                    Comment
                    • maxpower79
                      SBR Rookie
                      • 02-01-07
                      • 9

                      #11
                      Thanks Ganch.

                      So with the first method, you are essentially creating an adjusted cdf for the max of N observations, and testing using that. With the second method, you may have gone a bit deeper than I can follow, but I got the gist of it.

                      FWIW, I'm not planning on doing any data-mining, as you are right, I would be in over my head. It was really more just curiosity.

                      ~ Max
                      Comment
                      • DrunkenLullaby
                        SBR MVP
                        • 03-30-07
                        • 1631

                        #12
                        Originally posted by Ganchrow
                        where CHIDIST() is the Excel chi-squared distribution function,
                        Comment
                        • Data
                          SBR MVP
                          • 11-27-07
                          • 2236

                          #13
                          Ganchrow, excellent article, as always.

                          Originally posted by Ganchrow
                          This can also be an issue with the second method, especially to the extent that only a relatively small number of truly talented pickers (or successful strategies) exist within a large population.
                          Could you explain why and how that small number becomes an issue?

                          If this becomes an issue there are certainly other (more complicated) testing methodologies featured in commercial statistical packages that you might consider.
                          I am thinking about getting SPSS or maybe even SAS. Can you elaborate on how those packages can help here and on their overall value for an analytical sports bettor?
                          Comment
                          • Ganchrow
                            SBR Hall of Famer
                            • 08-28-05
                            • 5011

                            #14
                            Originally posted by Data
                            Ganchrow, excellent article, as always.
                            Thanks.

                            Originally posted by Data
                            Could you explain why and how that small number becomes an issue?
                            It's apparent just by examining the test: Χ²[-2*Σ_{i≤M} ln(α_i); 2*M] (where α_i refers to the significance level of the i-th capper, and the integer M is the number of cappers being tested, corresponding to half the degrees of freedom). If we have a large number of talented cappers within the population, then their α's will be low and the Χ² will in turn show significance. As we increase the number of "average" cappers within the population we'd start seeing many more α's of around 50%. Even if these extra cappers were better than average, with α's of e^-1 ≈ 36.79%, each would contribute exactly 2 to the test statistic, so its value would approach the degrees of freedom (twice the number of cappers). As the d.o.f. approach infinity the Χ² approaches normality with a mean equal to the d.o.f., and so the significance of the test would approach 50%. Mind you, this occurs even when filling in with cappers better than average.
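
                            To see the dilution numerically, here's a small illustrative sketch (the alpha values are hypothetical and scipy is assumed): each filler capper with an alpha of e^-1 contributes exactly 2 to the statistic along with 2 degrees of freedom, so the joint significance drifts toward 50% no matter how strong the handful of genuine results is.

                                from math import log, e
                                from scipy.stats import chi2

                                strong = [0.001, 0.002, 0.005]             # hypothetical alphas of a few truly talented cappers
                                for n_filler in (0, 10, 100, 1000):
                                    alphas = strong + [1 / e] * n_filler   # filler cappers, each slightly better than average
                                    stat = -2 * sum(log(a) for a in alphas)
                                    print(n_filler, chi2.sf(stat, 2 * len(alphas)))
                                # printed significance climbs from essentially 0 toward 50% as fillers are added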

                            Originally posted by Data
                            I am thinking about getting SPSS or maybe even SAS. Can you elaborate on how those packages can help here and on their overall value for an analytical sports bettor?
                            In the past I've used both Mathematica and S+ professionally. I haven't used either SAS or SPSS since grad school. Nowadays, for no particularly good reason, I primarily use self-programmed, purpose-written libraries.

                            Really what needs to be done here is some form of categorical analysis. We aren't looking to determine how good the single best capper is, or how good the population is as a whole, but rather how good a particular unspecified category of capper is. To this end I believe a test known as Mantel-Haenszel may be applicable. To be perfectly honest I don't quite remember the details other than that it's a chi-squared test.
                            Comment
                            • Data
                              SBR MVP
                              • 11-27-07
                              • 2236

                              #15
                              Originally posted by Ganchrow
                              As we increase the number of "average" cappers within the population we'd start seeing many more α's of around 50%.
                              Why would we include the average cappers in our test? The Fisher method allows us to test "a basket" of cappers/strategies in the same manner as we test a single capper using the first (null-hypothesis) method, and we want only the best cappers/strategies in that basket. What I do not immediately see is how to account for the low likelihood (Bayesian-wise) of a successful strategy in the general population when translated into our cherry-picked sample.
                              Comment
                              • Ganchrow
                                SBR Hall of Famer
                                • 08-28-05
                                • 5011

                                #16
                                Originally posted by Data
                                Why would we include the average cappers in our test? The Fisher method allows us to test "a basket" of cappers/strategies in the same manner as we test a single capper using the first (null-hypothesis) method, and we want only the best cappers/strategies in that basket. What I do not immediately see is how to account for the low likelihood (Bayesian-wise) of a successful strategy in the general population when translated into our cherry-picked sample.
                                Were we just to select a subset of tested cappers (let's say just the best X%) then the Fisher method wouldn't be directly applicable. It needs to be applied to the entire test population.

                                If you were to limit the test population to just a superior subset, then of course the chi-square would be significant because you'd only be including the most significant results. Fisher tests the joint significance of all the results -- testing only the most significant for significance makes no sense. Applying Fisher in this manner would be akin to applying the first described method to only a portion of the results. You can't pick out the best results within a contest and pretend the other results never happened.

                                Now of course if you had prior knowledge that a certain group was likely to outperform then you could certainly use Fisher on just that group -- but you can't use the data set itself to come to that conclusion (in other words you'd need separate in-sample and out-of-sample data sets). But as I understood the OP's initial question the whole point is to identify the talented cappers in the first place.

                                Now I'm not saying Fisher is useless in this regard, but rather that it couldn't be used in its raw form on a portion of the data set that's determined by the data set itself -- that would be very bad practice. And if the portion of the data set which is selected has "only a relatively small number of truly talented pickers existing within it" ... then the problem I identified in my first post might crop up. In other words, even if you properly select your data set in-sample, there's no way to guarantee that this won't be an issue with Fisher out-of-sample.
                                Comment
                                • Data
                                  SBR MVP
                                  • 11-27-07
                                  • 2236

                                  #17
                                  Originally posted by Ganchrow
                                  Were we just to select a subset of tested cappers (let's say just the best X%) then the Fisher method wouldn't be directly applicable. It needs to be applied to the entire test population.
                                  Why is that? Not only does this method allow us to cherry-pick cappers from one contest, but we can pick any capper from any contest. Then, we can calculate a p-value for that "basket".

                                  I do not know what practical result can be achieved by applying this test to all the cappers in a given contest but I would guess none.

                                  If you were to limit the test population to just a superior subset, then of course the chi-square would be significant because you'd only be including the most significant results.
                                  What is wrong with that? After all, the significant results are what we are looking for; who needs insignificant ones? All we want is to calculate how significant the combined results are. So, how do we do that? Can we just add all the records and treat the sum as one record? No, of course not, but we can use Fisher's method instead.
                                  Comment
                                  • Ganchrow
                                    SBR Hall of Famer
                                    • 08-28-05
                                    • 5011

                                    #18
                                    Originally posted by Data
                                    Why is that? Not only does this method allow us to cherry-pick cappers from one contest, but we can pick any capper from any contest. Then, we can calculate a p-value for that "basket".

                                    I do not know what practical result can be achieved by applying this test to all the cappers in a given contest but I would guess none.
                                    It's rather unfortunate indeed that the gods of statistics have yet to decree that statistical validity need portend practicality. If you know of any virgins to sacrifice, however, I think I can get us a knife and some long flowing robes.

                                    You can choose to use Fisher in any manner you like. However, if your selection of cappers is determined by their individual significance levels, and then you use those same individual significance levels as inputs for Fisher, you're engaging in data dredging at its most blatant.

                                    The point of testing the Fisher statistic against the chi-square is to determine the likelihood of attaining that product of significance levels or lower. If, however, you've not properly determined your significance levels because you've not conditioned them on their likelihood of appearing within your chosen subset in the first place -- then the Fisher method will routinely deliver spurious results.

                                    Try this experiment in Excel. Generate the results in column A from 500 samples of 100 randomized binomial trials assuming a 50% success rate. So cells A1:A500 would each look something like: =CRITBINOM(100,50%,RAND()). (These would represent the results of 500 talentless handicappers each picking 100 games.)

                                    In column B, display each capper's significance level (so cell B1 would read =1-BINOMDIST(A1-1,100,50%,1), B2 would read =1-BINOMDIST(A2-1,100,50%,1), etc.)

                                    In column C fill in the natural logarithms of the values in column B (so cell C1 would read =LN(B1)).

                                    Then determine the test-wide Fisher statistic by setting cell D1 to =-2*SUM(C1:C500).

                                    To determine the significance, run a chi-square with 1,000 degrees of freedom by setting cell D2 =CHIDIST(D1,1000).

                                    Press F9 a few times to recalculate, checking the p-value in cell D2 each time. You should be seeing numbers fairly close to 100%, implying a lack of statistical significance. (The fact that it's generally SO close to 1 is implicative of the issue with Type II errors I had earlier mentioned).

                                    But now ... let's look at Fisher results if we cherry-pick a sample. Let's say we only look at the top 50% or better of pickers. What kind of Fisher method results will we see?

                                    Set cell D3 to the Fisher statistic of the cherry-picked subset: =-2*SUMIF(A1:A500,">="&PERCENTILE(A1:A500,50%),C1:C500). The number of handicappers in the top half would naturally be given by =COUNTIF(A1:A500,">="&PERCENTILE(A1:A500,50%)). Set cell D4 to the chi-squared p-value for the cherry-picked sample: =CHIDIST(D3,2*COUNTIF(A1:A500,">="&PERCENTILE(A1:A500,50%))).

                                    See a difference? Hit F9 a few times to make sure you aren't looking at some crazy aberration. You should be seeing results almost indistinguishable from zero, implying extreme significance. And we're not looking at some absurd subset either -- we're just looking at the top 50% of pickers drawn from a population that flips coins to determine picks.
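
                                    For those without Excel handy, here's a rough Python translation of the same experiment (assuming numpy and scipy are available; the Excel column/cell labels are noted in the comments):

                                        import numpy as np
                                        from scipy.stats import binom, chi2

                                        rng = np.random.default_rng()
                                        wins = rng.binomial(100, 0.5, size=500)      # column A: 500 talentless cappers, 100 picks each
                                        p_vals = binom.sf(wins - 1, 100, 0.5)        # column B: individual significance levels
                                        logs = np.log(p_vals)                        # column C: natural logs

                                        full_stat = -2 * logs.sum()                  # D1: Fisher statistic over the whole population
                                        print(chi2.sf(full_stat, 2 * 500))           # D2: typically very close to 1 (no significance)

                                        top = wins >= np.percentile(wins, 50)        # cherry-pick the top half of the field
                                        cherry_stat = -2 * logs[top].sum()           # D3
                                        print(chi2.sf(cherry_stat, 2 * top.sum()))   # D4: typically indistinguishable from 0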

                                    Now please don't get me wrong -- the problem here isn't inherent with Fisher itself, but rather with using incorrect p-values within the natural logarithm. Garbage-in, garbage-out, after all.

                                    The way in which one would need to properly apply Fisher in this particular instance would be by using p-values in the logs in column C that were conditioned on having been found in the top 50% of results. Without that conditioning you're going to get spurious results every time.

                                    The easiest way to handle the conditioning would be by appealing to the Central Limit Theorem as much as possible, while the correct way, probably involving Clopper-Pearson binomial intervals, would certainly be much trickier. That said, as long as you're not overly proud, aren't trying to earn some sort of degree in statistics, and aren't serving time in prison, you're probably best off just appealing to that great equalizer among statisticians -- the Monte Carlo.
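
                                    In that spirit, here's a rough sketch of what the Monte Carlo route might look like (illustrative only, assuming numpy and scipy; observed_wins below is just a stand-in for whatever real 500-capper results you're testing): simulate the cherry-picked Fisher statistic many times under the 50/50 null and read the significance off that empirical distribution rather than off the raw chi-square table.

                                        import numpy as np
                                        from scipy.stats import binom

                                        def cherry_fisher_stat(wins):
                                            # Fisher statistic computed over the top half of the field only
                                            p_vals = binom.sf(wins - 1, 100, 0.5)
                                            top = wins >= np.percentile(wins, 50)
                                            return -2 * np.log(p_vals[top]).sum()

                                        rng = np.random.default_rng()
                                        null_stats = np.array([cherry_fisher_stat(rng.binomial(100, 0.5, 500))
                                                               for _ in range(2000)])   # null distribution by simulation

                                        observed_wins = rng.binomial(100, 0.5, 500)     # stand-in: replace with the real records
                                        observed_stat = cherry_fisher_stat(observed_wins)
                                        p_value = (null_stats >= observed_stat).mean()  # significance calibrated to the cherry-picking
                                        print(p_value)
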
                                    Attached Files
                                    Comment
                                    • Data
                                      SBR MVP
                                      • 11-27-07
                                      • 2236

                                      #19
                                      Originally posted by Ganchrow
                                      It's rather unfortunate indeed that the gods of statistics have yet to decree that statistical validity need portend practicality.
                                      I was just implying that my approach is purely practical. If I can use a method then I am all ears, but if there is no way of applying it to make a profit I could not care less. I very much appreciate your theoretical insight, not for the theory itself but for what it can do for me.

                                      The point of testing the Fisher statistic against the chi-square is to determine the likelihood of attaining that product of significance levels or lower. If, however, you've not properly determined your significance levels because you've not conditioned them on their likelihood of appearing within your chosen subset in the first place -- then the Fisher method will routinely deliver spurious results.
                                      Precisely, I am thinking the same, determining significance levels is the key.

                                      You should be seeing results almost indistinguishable from zero, implying extreme significance.
                                      This is where we think differently, or, more likely, where I do not see something. If we are looking at 100 50% cappers we should expect one of them to show results with a significance of about 1%, but if we are looking at 1,000 of them then we expect to see results with a lower significance level of about 0.1%. None of these should make us excited. So, when we observe extremely low numbers in your example, no matter how extreme the significance is, it has no value for us for obvious reasons. However, why can we not use all these expected numbers (1%, 0.1%, close-to-zero) as baselines, so that if we get numbers lower than those it would imply that we may be seeing something non-random?
                                      Comment