data mined p-value

  • VideoReview
    SBR High Roller
    • 12-14-07
    • 107

    #1
    data mined p-value
    I have started using a new function to double-check hypotheses suggested by data. I have run numerous tests on this using pseudo-random coin flips with different edges, including no edge, and have found that it works perfectly (i.e. at the expected probability) in uncovering angles that may be due to randomness. To determine the p-value of an angle that I have found by data mining (i.e. looking at the data for profitable patterns), I use the following equation in Excel:

    data mined p-value = (population size - angle population size + 1) * (1 - NORMSDIST((total units won - total units bet) / SQRT(total units bet))) / 2

    population size = total number of bets considered
    angle population size = total number of bets in the angle
    total units won = total number of units won assuming all bets were to win 1 unit
    total units bet = total number of units bet assuming all bets were to win 1 unit
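
    For anyone who wants to sanity-check this outside of Excel, here is a minimal Python sketch of the same formula (the function and argument names are my own, and statistics.NormalDist stands in for NORMSDIST):

        from math import sqrt
        from statistics import NormalDist   # NormalDist().cdf plays the role of NORMSDIST

        def data_mined_p_value(population, angle_population, units_won, units_bet):
            # direct translation of the Excel formula above
            z = (units_won - units_bet) / sqrt(units_bet)    # profit measured in standard deviations
            single_p = 1 - NormalDist().cdf(z)               # 1 - NORMSDIST(z)
            clusters = population - angle_population + 1     # number of same-size clusters in the sample
            return clusters * single_p / 2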


    An example:

    Let's say that I went through a database of 4968 games and saw an angle (home team, between certain odds, and one other criterion) that had an ROI of about 14%. This seems good to me and I would like to know the p-value of this angle. Here are my numbers:

    population size = 4968
    angle population size = 840
    total units won = 990.961
    total units bet = 869.222

    Now, if I had not mined for this angle but had instead thought of it logically without looking at the data, or had data mined it by looking at a completely different set of games, my normal p-value would be calculated as:

    p-value = 1 - NORMSDIST((990.961-869.222)/SQRT(869.222)) = .0000182128

    But because I looked at ALL of the data, and because there are 4968 - 840 + 1 = 4129 clusters of exactly the same size in the sample, I need to multiply my normal p-value by 4129 to compensate for the fact that I looked at the data first. Therefore:

    data mined p-value = .0000182128 * 4129 / 2 = .037600423

    This would indicate that my data mined angle is not due to randomness (at the 96.24% confidence level).
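
    Plugging these numbers into the Python sketch above reproduces the same figure:

        p = data_mined_p_value(population=4968, angle_population=840,
                               units_won=990.961, units_bet=869.222)
        print(p)   # ≈ 0.0376, i.e. the 96.24% confidence level quoted above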

    The reason I am writing this post is that I was under the assumption that I needed to keep track of ALL the combinations I had ever looked at in a population and multiply the p-value I got for any angle by that combination number. In my tests using millions of pseudo-random coin flips, this does not seem to be the case. I only need to consider how many other groups of data of the exact same size there could be in the population I am considering. Just because I had considered other angles with different attributes does not mean that I am penalized for looking at those when I am looking at a new angle. I would appreciate comments from those in the know on whether these assumptions are true. If they are true, then data mining just got fun again!
  • Ganchrow
    SBR Hall of Famer
    • 08-28-05
    • 5011

    #2
    It kind of seems like you're somehow preapplying the Bonferroni correction in anticipation of n=4,129. My initial reaction would be that there are a lot more than just 4,129 clusters of size 840 within your sample. Specifically, there'd be =COMBIN(4968, 840) clusters of size 840 which is officially a Very Large Number™.

    Obviously, there's a lot more at play than just this, and I'd have to think more about it, but perhaps there's an article or paper on this topic you can cite?
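
    To put a rough number on "Very Large" (a quick Python check, where math.comb is the equivalent of Excel's COMBIN):

        from math import comb

        n_clusters = comb(4968, 840)   # =COMBIN(4968, 840)
        print(len(str(n_clusters)))    # roughly 979 digits, i.e. on the order of 10^978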
    • pico
      BARRELED IN @ SBR!
      • 04-05-07
      • 27321

      #3
      try to make money using economagic, eh? good luck
      • VideoReview
        SBR High Roller
        • 12-14-07
        • 107

        #4
        Originally posted by Ganchrow
        It kind of seems like you're somehow preapplying the Bonferroni correction in anticipation of n=4,129. My initial reaction would be that there are a lot more than just 4,129 clusters of size 840 within your sample. Specifically, there'd be =COMBIN(4968, 840) clusters of size 840 which is officially a Very Large Number™.
        This is exactly what I am doing, and FWIW it seems to be working in practice, which is why I am finding it hard to accept a larger correction number than what I am coming up with in my "data mined p-value" equation.

        Originally posted by Ganchrow
        Obviously, there's a lot more at play than just this, and I'd have to think more about it, but perhaps there's an article or paper on this topic you can cite?
        I'll see if I can find something "official" but I doubt I would recognize it even if I came across it. In the meantime, I find my own empirical results pretty conclusive. Maybe the best way to confirm/reject what I am doing is to look at a couple of examples assuming a fair coin.

        I flip a coin once and heads comes up. I assume I have a profitable angle that the coin is biased towards heads. I now will calculate both the "p-value" and the "data mined p-value" according to my initial equation.

        population size=1
        angle population size=1
        total won=2
        total bet=1

        p-value = 1 - NORMSDIST((total won - total bet) / SQRT(total bet))

        p-value = 1 - NORMSDIST((2-1)/SQRT(1)) = .158655

        data mined p-value = (population size - angle population size + 1) * (1 - NORMSDIST((total won - total bet) / SQRT(total bet)))

        data mined p-value = (1-1+1) * (1 - NORMSDIST((2-1)/SQRT(1))) = .158655

        So the two p-values are the same, and I would assert a priori that they should be, because no data mining advantage could possibly come from looking at only 1 flip.

        Here is a slightly larger example. Assume that I flip a fair coin 1,033 times and mark down each flip consecutively. After looking at the entire sample, I notice that the first 10 flips were heads. I conclude that I have found that 10 heads in a row is more profitable than it should be, since the fair 10-flip parlay would have paid 2^10 = 1024 for a win.

        If I were not concerned at all with the fact that I data mined this conclusion, I would surmise (incorrectly, of course) that the probability of this occurring randomly is:

        p-value = 1 - NORMSDIST((1024-1)/SQRT(1)) = (near zero)

        Considering that I did data mine it, suppose that starting before the first flip and after every single flip I had bet that the next 10 flips were going to be 10 heads at a payout of 1023 to 1 (1024 units back for a 1 unit bet). Using my initial equation I get:

        data mined p-value = (1033-10+1) * (1 - NORMSDIST((1024-1024)/SQRT(1024))) = 512

        Of course, 512 is exactly 1024 times greater than p = .5 (i.e. random), and that factor of 1024 is not coincidentally the reciprocal of the probability of flipping a coin and having heads come up 10 times in a row (1 in 2^10) in the first place.

        512 is a very large p-value and tells me that this data mined angle is useless unless I can find something about that pattern of 10 flips that occurs less than 1 in 1024 times, because 512 * (1 / 1024) = .5. Even if I say something like "the first 10 flips of a sequence of 1033 are more likely to start with 10 heads", if I run a simulation of 1033 flips an infinite number of times, I will see that the first 10 flips being heads occurs in exactly 1 in 1024 sets, which would now indicate that the data mined p-value is exactly .5, or random (512 * 1 / 1024). This data mined p-value of .5 is now, after out-of-sample simulation, exactly where it should be, because the initial results I found were in fact random.

        My point is that if I come up with a data mined p-value of 512 and after simulation (i.e. out-of-sample testing) find that the pattern of 10 heads in a row in the first 10 flips of the 1033-flip set occurs exactly 1 time out of 1024 samples (as it should), why should my data mined p-value of 512 be invalid when clearly:

        512 * (1 / 1024) = .5

        Why should it matter that I also looked at groups of 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13 ... 1024 to see if there were groups of heads in a row there, when the .5 is, in all reality, true? To say that the .5 is too low because I looked at different sequences (2 heads in a row, 3 heads, etc.) implies to me that I can somehow affect the true probability of the coin simply by looking at more combinations than I should have. The fact is that regardless of the millions of sequence combinations I may look at:

        1) The probability of the coin coming up heads on any flip is 1/2
        2) The probability of the coin coming up heads for the first 10 consecutive flips is 1/1024

        Conclusion: (1/2) / (1/1024) = 512, which is my initial data mined p-value

        Does this coin flip argument hold water?
        • Ganchrow
          SBR Hall of Famer
          • 08-28-05
          • 5011

          #5
          Originally posted by VideoReview
          Here is a slightly larger example. Assume that I flip a fair coin 1,033 times and mark down each flip consecutively. After looking at the entire sample, I notice that the first 10 flips were heads. I conclude that I have found that 10 heads in a row is more profitable than it should be, since the fair 10-flip parlay would have paid 2^10 = 1024 for a win.

          If I were not concerned at all with the fact that I data mined this conclusion, I would surmise (incorrectly, of course) that the probability of this occurring randomly is:

          p-value = 1 - NORMSDIST((1024-1)/SQRT(1,023)) = (near zero)
          (Typo corrected above -- it should be SQRT(1,023), not SQRT(1): you're betting 1 unit to win 1,023 at decimal odds of 1024. Conversely, were you betting 1/1,023 of a unit to win 1, you'd have a z-score of (1,024/1,023 - 1/1,023)/SQRT(1/1,023), which would obviously yield the same result.)

          Ignoring the notion of data mining for a moment, the central limit theorem is woefully inadequate in this regard given your sample size of 1. The proper p-value would come from the binomial distribution and would equal =BINOMDIST(10,10,0.5,0) = 2^-10 ≈ 0.09766%.
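
          For reference, the same number can be reproduced outside Excel with a short sketch (assuming SciPy is available; binom.pmf mirrors BINOMDIST):

              from scipy.stats import binom

              # probability of exactly 10 heads in 10 fair flips, i.e. =BINOMDIST(10, 10, 0.5, 0)
              print(binom.pmf(10, 10, 0.5))   # 0.0009765625 = 2^-10 ≈ 0.09766%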

          I'll also point out that this is only a 1-tailed test, when in reality we should probably be looking at a 2-tailed test. Whatever ... we can just ignore that for the duration ... that's the least of our worries.

          Originally posted by VideoReview
          Considering that I did data mine it, suppose that starting before the first flip and after every single flip I had bet that the next 10 flips were going to be 10 heads at a payout of 1023 to 1 (1024 units back for a 1 unit bet). Using my initial equation I get:

          data mined p-value = (1033-10+1) * (1 - NORMSDIST((1024-1024)/SQRT(1024))) = 512
          Ok. Remember that a p-value refers to a probability-value and a probability of 512 makes no more sense than a probability of "Barney".

          If you recall, in a previous post I referred to Bonferroni as only an approximation (and what you're doing isn't really even Bonferroni), and noted that given a single-test p-value of p, determined by considering n independent samples, the corrected p-value would in fact be 1 - (1-p)^n.

          When we apply this to the problem of looking at 1,024 independent samples of 10 flips each (although in your example the 1,024 10-flip samples are not independent), we get a p-value of 1 - (1 - 2^-10)^1,024 ≈ 63.23%.
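
          In code that corrected figure is just (a sketch; as noted, the independence assumption behind it does not actually hold for overlapping samples):

              single_test_p = 2 ** -10    # p-value of one 10-flip sample, = 0.0009765625
              n_samples = 1024            # treating the 1,024 samples as if independent
              print(1 - (1 - single_test_p) ** n_samples)   # ≈ 0.6323, i.e. 63.23%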

          Now, because of the lack of independence between the 1,024 10-flip samples, this p-value is actually a bit too high. If you flip a coin 1,033 times, what's the probability that at least 10 flips in a row will land heads (again, we're ignoring the 2-tailed component) at at least one point during the sample?

          Well, the answer to that is 39.517% (which can be obtained from the Streak Calculator -- enter values of 1033, 10, and 50%). I explained the algorithm for calculating this in this post.

          This figure of 39.517% is the correct p-value for the experiment that involves looking at 1,033 flips and declaring a "success" when 10 flips in a row land heads.
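
          For anyone without the Streak Calculator handy, one standard way to compute that number is a "first tail" recursion over run-free sequences. This is my own sketch, not necessarily the exact algorithm from the linked post:

              def prob_of_streak(n_flips, run_len, p_head=0.5):
                  """Probability that n_flips contain at least one run of run_len heads."""
                  # no_run[n] = probability that n flips contain NO run of run_len heads
                  no_run = [1.0] * (n_flips + 1)   # with fewer than run_len flips a run is impossible
                  for n in range(run_len, n_flips + 1):
                      # condition on the position k of the first tail; if the first
                      # run_len flips are all heads, the sequence already contains a run
                      no_run[n] = sum(p_head ** (k - 1) * (1 - p_head) * no_run[n - k]
                                      for k in range(1, run_len + 1))
                  return 1.0 - no_run[n_flips]

              print(prob_of_streak(1033, 10))   # ≈ 0.39517, matching the figure above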

          But there's a huge problem with this, too. What's this fixation with 10 flips in a row? What if you had looked at the dataset and noticed that on the first 10 even-numbered flips every coin landed heads? Or what if you had looked at the first 10 odd-numbered flips and noted that every coin landed heads? Or what if you had looked at flips #17, 28, 95, 433, 434, 436, 928, 982, 1,024, and 1,030 and noted that every coin landed heads? The point is that there are many, many more than just 1,024 different possible combinations of 10-flip sequences.

          In fact there are =COMBIN(1033, 10) ≈ 3.64975×10^23 possible combinations of 10-flip sequences, and of these nearly one septillion combinations, 2^-10 ≈ 0.09766% of them would contain exactly 10 heads.

          So I'm just not seeing the probabilistic logic to this angle.
          • VideoReview
            SBR High Roller
            • 12-14-07
            • 107

            #6
            I have lost my mind and apparently I was seeing things on this one. I don't know what I was thinking, as I had already been through this before and had it resolved. I know I cannot do what I was stating.

            Please accept my apologies for starting this thread and wasting everyone's valuable time. If there is any way to delete this thread, I would like to. If not, it will serve as a reminder to me.

            Thanks.
            • Ganchrow
              SBR Hall of Famer
              • 08-28-05
              • 5011

              #7
              Originally posted by VideoReview
              I have lost my mind and apparently I was seeing things on this one. I don't know what I was thinking, as I had already been through this before and had it resolved. I know I cannot do what I was stating.

              Please accept my apologies for starting this thread and wasting everyone's valuable time. If there is any way to delete this thread, I would like to. If not, it will serve as a reminder to me.

              Thanks.
              I wouldn't worry about it.

              Sometimes the best way to convince yourself an idea's wrong is to attempt to take it to its logical conclusion.
              • Justin7
                SBR Hall of Famer
                • 07-31-06
                • 8577

                #8
                Originally posted by VideoReview
                I have lost my mind and apparently I was seeing things on this one. I don't know what I was thinking, as I had already been through this before and had it resolved. I know I cannot do what I was stating.

                Please accept my apologies for starting this thread and wasting everyone's valuable time. If there is any way to delete this thread, I would like to. If not, it will serve as a reminder to me.

                Thanks.
                The time is not wasted. Your approach will teach others, regardless of whether it works.