1. #1
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Data Mining And P-Values

    The following example was taken from Wikipedia:

    Testing hypotheses suggested by the data
    How to do it wrong
    For example, suppose fifty different researchers, unaware of each other's work, run clinical trials to test whether Vitamin X is efficacious in preventing cancer. Forty-nine of them find no significant differences between measurements done on patients who have taken Vitamin X and those who have taken a placebo. The fiftieth study finds a difference so extreme that if Vitamin X has no effect then such an extreme difference would be observed in only one study out of fifty. When all fifty studies are pooled, one would say no effect of Vitamin X was found. But it would be reasonable for the investigators running the fiftieth study to consider it likely that they have found an effect, until they learn of the other forty-nine studies. Now suppose that the one anomalous study was in Denmark. The data suggest a hypothesis that Vitamin X is more efficacious in Denmark than elsewhere. But Denmark was fortuitously the one-in-fifty in which an extreme value of a test statistic happened; one expects such extreme cases one time in fifty on average if no effect is present. It would therefore be fallacious to cite the data as serious evidence for this particular hypothesis suggested by the data.
    --------End Of Article--------

    Through trial and error in my sports betting data, I have come up with a way to detect the "Denmarks" in my data in a statistically significant way without having any original hypothesis. The problem is I believe my logic to be flawed but cannot figure out how. Please shed some light on the situation if you see my mistake.

    Let's say that I was aware that the previously mentioned study was done in 65,536 (2^16) different locations in the world. I have not looked at the results of any individual location, and, unknown to me, there exists only one location with a positive study. According to the previous example, the significance level is a function of the number of studies I look at. If I was looking randomly (or with a flawed hypothesis that was invalid and therefore random), on average I would look at 32,768 locations before coming to the one with the positive study. This would mean there would be a 50/50 chance that the results were random (significant only at the 50% level). However, by measuring various subsets of the data instead, I could quickly arrive at the single positive location (assuming there is at least one; if there isn't, I haven't lost anything anyway) in no more than 16 tests of the data. This would imply to me a p-value of 16/65,536, or 1 in 4,096, or .0002441, which would normally indicate strong significance. Basically, if you pick a number in your head between 1 and 65,536, it is possible for me to guess the number correctly 100% of the time within 16 attempts, asking only whether the number is within a certain range.
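    The guessing game described here is just binary search: each range question halves the candidate set, so 2^16 possibilities are resolved in at most 16 questions. A minimal sketch in Python (function and variable names are illustrative, not from the post):

```python
def guess_number(secret, low=1, high=65536):
    """Locate `secret` in [low, high] by halving the range each question."""
    attempts = 0
    while low < high:
        mid = (low + high) // 2
        attempts += 1
        if secret <= mid:        # "is the number in the lower half?"
            high = mid           # yes: discard the upper half
        else:
            low = mid + 1        # no: discard the lower half
    return low, attempts

# 65,536 = 2^16 values, so 16 range questions always suffice.
```

    Note that the 16 questions are not independent tests: each one is chosen based on the answers to the previous ones, which is exactly where the p-value reasoning in the post gets slippery.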

    This would be the same as a single profitable angle being present out of 65,536 combinations and me finding its exact location in only 16 attempts. Related to sports betting, many of us are familiar with hearing about obscure angles like "home team that finished a 5-day road trip, has had 2 days off, and is playing on a Tuesday in March, etc., is 68% ATS". Often these angles are a result of data mining, and because the investigator looked at too many combinations, the results become insignificant. However, if I only look at 16 combinations and the ROI of the angle has a significance of p=.003125, then according to the Holm-Bonferroni method my results would now be considered significant at .003125 * 16 = .05, or the 5% level.
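    For reference, Holm's step-down procedure sorts the m p-values in ascending order and compares the k-th smallest (0-indexed) against alpha / (m - k); for the smallest p-value that threshold is alpha / m, which is why .003125 * 16 = .05 in the figure above. A minimal sketch (function name is my own):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down: sort p-values ascending and compare the k-th
    smallest (0-indexed) against alpha / (m - k); stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            rejected[i] = True
        else:
            break  # once one p-value fails, all larger ones fail too
    return rejected
```

    With 16 tests, the smallest p-value must beat 0.05 / 16 = .003125, matching the figure in the post; the correction assumes you honestly count all 16 tests you ran, not just the one that won.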

    Is this correct?

  2. #2
    BuddyBear
    Update your status
    BuddyBear's Avatar Become A Pro!
    Join Date: 08-10-05
    Posts: 7,233
    Betpoints: 4805

    Quote Originally Posted by VideoReview View Post
    The following example was taken from Wikipedia:

    Testing hypotheses suggested by the data
    How to do it wrong
    For example, suppose fifty different researchers, unaware of each other's work, run clinical trials to test whether Vitamin X is efficacious in preventing cancer. Forty-nine of them find no significant differences between measurements done on patients who have taken Vitamin X and those who have taken a placebo. The fiftieth study finds a difference so extreme that if Vitamin X has no effect then such an extreme difference would be observed in only one study out of fifty. When all fifty studies are pooled, one would say no effect of Vitamin X was found. But it would be reasonable for the investigators running the fiftieth study to consider it likely that they have found an effect, until they learn of the other forty-nine studies. Now suppose that the one anomalous study was in Denmark. The data suggest a hypothesis that Vitamin X is more efficacious in Denmark than elsewhere. But Denmark was fortuitously the one-in-fifty in which an extreme value of a test statistic happened; one expects such extreme cases one time in fifty on average if no effect is present. It would therefore be fallacious to cite the data as serious evidence for this particular hypothesis suggested by the data.
    --------End Of Article--------
    No, it would not be reasonable to conclude that we have a significant effect. A couple of things to note here:

    (1) If I obtained a statistically significant effect in 1 out of 50 studies under the same conditions, I would not be doing cartwheels. In fact, by chance alone we would expect to obtain a statistically significant relationship 5 times out of 100 (or 2.5 times out of 50 in this case) when no such relationship exists, under the conventional .05 p-value.
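    The expected counts in point (1) can be checked directly; under the null hypothesis, each independent test clears the threshold with probability alpha:

```python
# Point (1) as arithmetic: with 50 null studies at the conventional .05 level,
alpha, n_studies = 0.05, 50

expected_false_positives = alpha * n_studies      # 2.5 of the 50 studies
prob_at_least_one = 1 - (1 - alpha) ** n_studies  # chance of >= 1 "Denmark"

print(expected_false_positives)        # 2.5
print(round(prob_at_least_one, 2))     # 0.92
```

    So a lone significant result among 50 null studies is not merely possible, it is close to guaranteed.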

    (2) In social science research, there is a process that is referred to as replication. Replication just means that different researchers (working independently of one another) perform the same studies multiple times under the same conditions (or sometimes slightly different conditions). If researchers routinely arrive at the same results, we can build our confidence that the effect we are observing is a real effect. However, we can never be 100% sure that what we have found is a "true" effect. There is always a possibility that we are committing a Type I error, but again, confidence grows through replication and triangulation (combining different methodologies to test the same thing). This is why we can never really say that a study "proves" something. Instead, a significant effect in a study only increases our confidence in the effect.

    (3) In this particular case, there may be something unique about Denmark that would require further investigation, or Denmark could simply be an anomaly or an outlier in this data set. More research would be needed to explore why the effect was exclusive to Denmark and not other countries. This is why it is always a good idea to have theory guide you. Remember, statistics are largely irrelevant without theory. Theory is what should ALWAYS guide you in your research. There is never an instance where statistics is superior to theory. It may be interesting to data mine and find some significant results, but without theory there is not much to go on. In this example, we would need some type of explanation to understand why Denmark was significant (i.e., air quality, lifestyle habits, etc.).

    Quote Originally Posted by VideoReview View Post
    Through trial and error in my sports betting data, I have come up with a way to detect the "Denmarks" in my data in a statistically significant way without having any original hypothesis. The problem is I believe my logic to be flawed but cannot figure out how. Please shed some light on the situation if you see my mistake.

    Let's say that I was aware that the previously mentioned study was done in 65,536 (2^16) different locations in the world. I have not looked at the results of any individual location, and, unknown to me, there exists only one location with a positive study. According to the previous example, the significance level is a function of the number of studies I look at. If I was looking randomly (or with a flawed hypothesis that was invalid and therefore random), on average I would look at 32,768 locations before coming to the one with the positive study. This would mean there would be a 50/50 chance that the results were random (significant only at the 50% level).
    I am not sure what you are getting at here, and I am not sure your understanding of the term "random" is accurate. I do know for a fact that a lot more people would get things published at the p < .50 level, though.

    Quote Originally Posted by VideoReview View Post
    However, by measuring various subsets of the data instead, I could quickly arrive at the single positive location (assuming there is at least one; if there isn't, I haven't lost anything anyway) in no more than 16 tests of the data. This would imply to me a p-value of 16/65,536, or 1 in 4,096, or .0002441, which would normally indicate strong significance. Basically, if you pick a number in your head between 1 and 65,536, it is possible for me to guess the number correctly 100% of the time within 16 attempts, asking only whether the number is within a certain range.

    This would be the same as a single profitable angle being present out of 65,536 combinations and me finding its exact location in only 16 attempts. Related to sports betting, many of us are familiar with hearing about obscure angles like "home team that finished a 5-day road trip, has had 2 days off, and is playing on a Tuesday in March, etc., is 68% ATS". Often these angles are a result of data mining, and because the investigator looked at too many combinations, the results become insignificant. However, if I only look at 16 combinations and the ROI of the angle has a significance of p=.003125, then according to the Holm-Bonferroni method my results would now be considered significant at .003125 * 16 = .05, or the 5% level.

    Is this correct?
    I think Ganch is going to have to answer this. There are a lot of things going on here, but I am not quite sure what you are getting at. I think you have generally the right idea about data mining and trying to find significant results, but again, without theory you don't have much going for you. I know different people may disagree with that and point to the fact that sports betting does not have a well-developed theoretical paradigm, but to me, even if you take a set of data with thousands of observations to work from, unless there is something guiding you in your analysis of that data, I am not sure how reliable the findings you obtain are.

    Good luck.....
