data mined p-value

  • VideoReview
    SBR High Roller
    • 12-14-07
    • 107

    #1
    data mined p-value
    I have started using a new function to double-check hypotheses suggested by data. I have run numerous tests on this using pseudo-random coin flips with different edges, including no edge, and have found that it works perfectly (i.e. at the expected probability) in uncovering angles that may be due to randomness. To determine the p-value of an angle that I have found by data mining (i.e. looking at the data for profitable patterns), I use the following equation in Excel:

    data mined p-value = (population size - angle population size + 1) * (1 - NORMSDIST((total units won - total units bet) / SQRT(total units bet))) / 2

    population size = total number of bets considered
    angle population size = total number of bets in the angle
    total units won = total number of units won assuming all bets were to win 1 unit
    total units bet = total number of units bet assuming all bets were to win 1 unit
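
    For anyone who wants to sanity-check this outside of Excel, here is a minimal Python sketch of the same formula (the function and argument names are my own, and statistics.NormalDist stands in for NORMSDIST):

        from math import sqrt
        from statistics import NormalDist   # NormalDist().cdf plays the role of NORMSDIST

        def data_mined_p_value(population, angle_population, units_won, units_bet):
            # direct translation of the Excel formula above
            z = (units_won - units_bet) / sqrt(units_bet)    # profit measured in standard deviations
            single_p = 1 - NormalDist().cdf(z)               # 1 - NORMSDIST(z)
            clusters = population - angle_population + 1     # number of same-size clusters in the sample
            return clusters * single_p / 2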


    An example:

    Let's say that I went through a database of 4968 games and saw an angle (home team, between certain odds, and one other criterion) that had an ROI of about 14%. This seems good to me and I would like to know the p-value of this angle. Here are my numbers:

    population size = 4968
    angle population size = 840
    total units won = 990.961
    total units bet = 869.222

    Now, if I had not mined for this angle but had instead thought of it logically without looking at the data, or had data mined it by looking at a completely different set of games, my normal p-value would be calculated as:

    p-value = 1 - NORMSDIST((990.961-869.222)/SQRT(869.222)) = .0000182128

    But because I looked at ALL of the data, and because there are 4968 - 840 + 1 = 4129 clusters of exactly the same size in the sample, I need to multiply my normal p-value by 4129 to compensate for the fact that I looked at the data first. Therefore:

    data mined p-value = .0000182128 * 4129 / 2 = .037600423

    This would indicate that my data mined angle is not due to randomness (at the 96.24% confidence level).
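
    Plugging these numbers into the Python sketch above reproduces the same figure:

        p = data_mined_p_value(population=4968, angle_population=840,
                               units_won=990.961, units_bet=869.222)
        print(p)   # ≈ 0.0376, i.e. the 96.24% confidence level quoted above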

    The reason I am writing this post is that I was under the assumption that I needed to keep track of ALL the combinations I had ever looked at in a population and multiply the p-value I got for any angle by that combination number. In my tests using millions of pseudo-random coin flips, this does not seem to be the case. I only need to consider how many other groups of data of the exact same size there could be in the population I am considering. Just because I had considered other angles with different attributes does not mean that I am penalized for looking at those when I am looking at a new angle. I would appreciate comments from those in the know on whether these assumptions are true. If they are true, then data mining just got fun again!
  • Ganchrow
    SBR Hall of Famer
    • 08-28-05
    • 5011

    #2
    It kind of seems like you're somehow preapplying the Bonferroni correction in anticipation of n=4,129. My initial reaction would be that there are a lot more than just 4,129 clusters of size 840 within your sample. Specifically, there'd be =COMBIN(4968, 840) clusters of size 840 which is officially a Very Large Number™.

    Obviously, there's a lot more at play than just this, and I'd have to think more about it, but perhaps there's an article or paper on this topic you can cite?
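
    To put a rough number on "Very Large" (a quick Python check, where math.comb is the equivalent of Excel's COMBIN):

        from math import comb

        n_clusters = comb(4968, 840)   # =COMBIN(4968, 840)
        print(len(str(n_clusters)))    # roughly 979 digits, i.e. on the order of 10^978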
    • pico
      BARRELED IN @ SBR!
      • 04-05-07
      • 27321

      #3
      try to make money using economagic, eh? good luck
      • VideoReview
        SBR High Roller
        • 12-14-07
        • 107

        #4
        Originally posted by Ganchrow
        It kind of seems like you're somehow preapplying the Bonferroni correction in anticipation of n=4,129. My initial reaction would be that there are a lot more than just 4,129 clusters of size 840 within your sample. Specifically, there'd be =COMBIN(4968, 840) clusters of size 840 which is officially a Very Large Number™.
        This is exactly what I am doing, and FWIW it seems to be working in practice, which is why I am finding it hard to accept a larger correction number than what I am coming up with in my "data mined p-value" equation.

        Originally posted by Ganchrow
        Obviously, there's a lot more at play than just this, and I'd have to think more about it, but perhaps there's an article or paper on this topic you can cite?
        I'll see if I can find something "official" but I doubt I would recognize it even if I came across it. In the meantime, I find my own empirical results pretty conclusive. Maybe the best way to confirm/reject what I am doing is to look at a couple of examples assuming a fair coin.

        I flip a coin once and heads comes up. I assume I have a profitable angle that the coin is biased towards heads. I now will calculate both the "p-value" and the "data mined p-value" according to my initial equation.

        population size=1
        angle population size=1
        total won=2
        total bet=1

        p-value = 1 - NORMSDIST((total won - total bet) / SQRT(total bet))

        p-value = 1 - NORMSDIST((2-1)/SQRT(1)) = .158655

        data mined p-value = (population size - angle population size + 1) * (1 - NORMSDIST((total won - total bet) / SQRT(total bet)))

        data mined p-value = (1-1+1) * (1 - NORMSDIST((2-1)/SQRT(1))) = .158655

        So the two p-values are the same, and I would assert a priori that they should be, because no data mining advantage could possibly come from looking at only 1 flip.

        Here is a slightly larger example. Assume that I flip a fair coin 1,033 times and mark down each flip consecutively. After looking at the entire sample, I notice that the first 10 flips were heads. I conclude that I have found that 10 heads in a row is more profitable than it should be, since the fair 10-flip parlay would have paid 2^10 = 1024 for a win.

        If I were not concerned at all with the fact that I data mined this conclusion, I would surmise (incorrectly, of course) that the probability of this occurring randomly is:

        p-value = 1 - NORMSDIST((1024-1)/SQRT(1)) = (near zero)

        Considering that I did data mine it, suppose that starting before the first flip and after every single flip I had bet that the next 10 flips were going to be 10 heads at a payout of 1023 to 1 (1024 units back for a 1 unit bet). Using my initial equation I get:

        data mined p-value = (1033-10+1) * (1 - NORMSDIST((1024-1024)/SQRT(1024))) = 512

        Of course, 512 is exactly 1024 times greater than p = .5 (i.e. random), and that factor of 1024 is not coincidentally the reciprocal of the probability of flipping a coin and having heads come up 10 times in a row (1 in 2^10) in the first place.

        512 is a very large p-value and tells me that this data mined angle is useless unless I can find something about that pattern of 10 flips that occurs less than 1 in 1024 times, because 512 * (1 / 1024) = .5. Even if I say something like "the first 10 flips of a sequence of 1033 are more likely to start with 10 heads", if I run a simulation of 1033 flips an infinite number of times, I will see that the first 10 flips being heads occurs in exactly 1 in 1024 sets, which would now indicate that the data mined p-value is exactly .5, or random (512 * 1 / 1024). This data mined p-value of .5 is now, after out-of-sample simulation, exactly where it should be, because the initial results I found were in fact random.

        My point is that if I come up with a data mined p-value of 512 and after simulation (i.e. out-of-sample testing) find that the pattern of 10 heads in a row in the first 10 flips of the 1033-flip set occurs exactly 1 time out of 1024 samples (as it should), why should my data mined p-value of 512 be invalid when clearly:

        512 * (1 / 1024) = .5

        Why should it matter that I also looked at groups of 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13 ... 1024 to see if there were groups of heads in a row there, when the .5 is, in all reality, true? To say that the .5 is too low because I looked at different sequences (2 heads in a row, 3 heads, etc.) implies to me that I can somehow affect the true probability of the coin simply by looking at more combinations than I should have. The fact is that regardless of the millions of sequence combinations I may look at:

        1) The probability of the coin coming up heads on any flip is 1/2
        2) The probability of the coin coming up heads for the first 10 consecutive flips is 1/1024

        Conclusion: (1/2) / (1/1024) = 512, which is my initial data mined p-value

        Does this coin flip argument hold water?
        • Ganchrow
          SBR Hall of Famer
          • 08-28-05
          • 5011

          #5
          Originally posted by VideoReview
          Here is a slightly larger example. Assume that I flip a fair coin 1,033 times and mark down each flip consecutively. After looking at the entire sample, I notice that the first 10 flips were heads. I conclude that I have found that 10 heads in a row is more profitable than it should be, since the fair 10-flip parlay would have paid 2^10 = 1024 for a win.

          If I were not concerned at all with the fact that I data mined this conclusion, I would surmise (incorrectly, of course) that the probability of this occurring randomly is:

          p-value = 1 - NORMSDIST((1024-1)/SQRT(1,023)) = (near zero)
          (Typo corrected above -- it should be SQRT(1,023), not SQRT(1): you're betting 1 unit to win 1,023 at decimal odds of 1024. Conversely, were you betting 1/1,023 of a unit to win 1, you'd have a z-score of (1,024/1,023 - 1/1,023)/SQRT(1/1,023), which would obviously yield the same result.)

          Ignoring the notion of data mining for a moment, the central limit theorem is woefully inadequate in this regard given your sample size of 1. The proper p-value would come from the binomial distribution and would equal =BINOMDIST(10,10,0.5,0) = 2^-10 ≈ 0.09766%.
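
          For reference, the same number can be reproduced outside Excel with a short sketch (assuming SciPy is available; binom.pmf mirrors BINOMDIST):

              from scipy.stats import binom

              # probability of exactly 10 heads in 10 fair flips, i.e. =BINOMDIST(10, 10, 0.5, 0)
              print(binom.pmf(10, 10, 0.5))   # 0.0009765625 = 2^-10 ≈ 0.09766%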

          I'll also point out that this is only a 1-tailed test, when in reality we should probably be looking at a 2-tailed test. Whatever ... we can just ignore that for the duration ... that's the least of our worries.

          Originally posted by VideoReview
          Considering that I did data mine it, suppose that starting before the first flip and after every single flip I had bet that the next 10 flips were going to be 10 heads at a payout of 1023 to 1 (1024 units back for a 1 unit bet). Using my initial equation I get:

          data mined p-value = (1033-10+1) * (1 - NORMSDIST((1024-1024)/SQRT(1024))) = 512
          Ok. Remember that a p-value refers to a probability-value and a probability of 512 makes no more sense than a probability of "Barney".

          If you recall, in a previous post I referred to Bonferroni as only an approximation (and what you're doing isn't really even Bonferroni), and noted that given a single-test p-value of p, determined by considering n independent samples, the corrected p-value would in fact be 1 - (1-p)^n.

          When we apply this to the problem of looking at 1,024 independent samples of 10 flips each (although in your example the 1,024 10-flip samples are not independent), we get a p-value of 1 - (1 - 2^-10)^1,024 ≈ 63.23%.
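
          In code that corrected figure is just (a sketch; as noted, the independence assumption behind it does not actually hold for overlapping samples):

              single_test_p = 2 ** -10    # p-value of one 10-flip sample, = 0.0009765625
              n_samples = 1024            # treating the 1,024 samples as if independent
              print(1 - (1 - single_test_p) ** n_samples)   # ≈ 0.6323, i.e. 63.23%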

          Now, because of the lack of independence between the 1,024 10-flip samples, this p-value is actually a bit too high. If you flip a coin 1,033 times, what's the probability that at least 10 flips in a row will land heads (again, we're ignoring the 2-tailed component) at at least one point during the sample?

          Well, the answer to that is 39.517% (which can be obtained from the Streak Calculator -- enter values of 1033, 10, and 50%). I explained the algorithm for calculating this in this post.

          This figure of 39.517% is the correct p-value for the experiment that involves looking at 1,033 flips and declaring a "success" when 10 flips in a row land heads.
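
          For anyone without the Streak Calculator handy, one standard way to compute that number is a "first tail" recursion over run-free sequences. This is my own sketch, not necessarily the exact algorithm from the linked post:

              def prob_of_streak(n_flips, run_len, p_head=0.5):
                  """Probability that n_flips contain at least one run of run_len heads."""
                  # no_run[n] = probability that n flips contain NO run of run_len heads
                  no_run = [1.0] * (n_flips + 1)   # with fewer than run_len flips a run is impossible
                  for n in range(run_len, n_flips + 1):
                      # condition on the position k of the first tail; if the first
                      # run_len flips are all heads, the sequence already contains a run
                      no_run[n] = sum(p_head ** (k - 1) * (1 - p_head) * no_run[n - k]
                                      for k in range(1, run_len + 1))
                  return 1.0 - no_run[n_flips]

              print(prob_of_streak(1033, 10))   # ≈ 0.39517, matching the figure above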

          But there's a huge problem with this, too. What's this fixation with 10 flips in a row? What if you had looked at the dataset and noticed that on the first 10 even-numbered flips every coin landed heads? Or what if you had looked at the first 10 odd-numbered flips and noted that every coin landed heads? Or what if you had looked at flips #17, 28, 95, 433, 434, 436, 928, 982, 1,024, and 1,030 and noted that every coin landed heads? The point is that there are many, many more than just 1,024 different possible combinations of 10-flip sequences.

          In fact there are =COMBIN(1033, 10) ≈ 3.64975×10^23 possible combinations of 10-flip sequences, and of these nearly one septillion combinations, 2^-10 ≈ 0.09766% of them would contain exactly 10 heads.

          So I'm just not seeing the probabilistic logic to this angle.
          • VideoReview
            SBR High Roller
            • 12-14-07
            • 107

            #6
            I have lost my mind and apparently I was seeing things on this one. I don't know what I was thinking, as I had already been through this before and had it resolved. I know I cannot do what I was stating.

            Please accept my apologies for starting this thread and wasting everyone's valuable time. If there is any way to delete this thread, I would like to. If not, it will serve as a reminder to me.

            Thanks.
            • Ganchrow
              SBR Hall of Famer
              • 08-28-05
              • 5011

              #7
              Originally posted by VideoReview
              I have lost my mind and apparently I was seeing things on this one. I don't know what I was thinking, as I had already been through this before and had it resolved. I know I cannot do what I was stating.

              Please accept my apologies for starting this thread and wasting everyone's valuable time. If there is any way to delete this thread, I would like to. If not, it will serve as a reminder to me.

              Thanks.
              I wouldn't worry about it.

              Sometimes the best way to convince yourself an idea's wrong is to attempt to take it to its logical conclusion.
              • Justin7
                SBR Hall of Famer
                • 07-31-06
                • 8577

                #8
                Originally posted by VideoReview
                I have lost my mind and apparently I was seeing things on this one. I don't know what I was thinking, as I had already been through this before and had it resolved. I know I cannot do what I was stating.

                Please accept my apologies for starting this thread and wasting everyone's valuable time. If there is any way to delete this thread, I would like to. If not, it will serve as a reminder to me.

                Thanks.
                The time is not wasted. Your approach will teach others, regardless of whether it works.