The following example was taken from Wikipedia:
Testing hypotheses suggested by the data
How to do it wrong
For example, suppose fifty different researchers, unaware of each other's work, run clinical trials to test whether Vitamin X is efficacious in preventing cancer. Forty-nine of them find no significant differences between measurements done on patients who have taken Vitamin X and those who have taken a placebo. The fiftieth study finds a difference so extreme that if Vitamin X has no effect then such an extreme difference would be observed in only one study out of fifty. When all fifty studies are pooled, one would say no effect of Vitamin X was found. But it would be reasonable for the investigators running the fiftieth study to consider it likely that they have found an effect, until they learn of the other forty-nine studies. Now suppose that the one anomalous study was in Denmark. The data suggest a hypothesis that Vitamin X is more efficacious in Denmark than elsewhere. But Denmark was fortuitously the one-in-fifty in which an extreme value of a test statistic happened; one expects such extreme cases one time in fifty on average if no effect is present. It would therefore be fallacious to cite the data as serious evidence for this particular hypothesis suggested by the data.
--------End Of Article--------
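To put a number on the article's "one time in fifty": under the null hypothesis, the chance that at least one of fifty independent studies clears the 1-in-50 (p = 0.02) bar is roughly 64%, so one anomalous Denmark is entirely unsurprising. A quick sanity check (a sketch of the arithmetic, not from the article):

```python
# Chance that at least one of 50 independent null studies shows a
# result "extreme" at the 1-in-50 (p = 0.02) level.
p_single = 1 / 50
n_studies = 50
p_at_least_one = 1 - (1 - p_single) ** n_studies
print(p_at_least_one)  # roughly 0.64
```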
Through trial and error with my sports betting data, I have come up with a way to detect the "Denmarks" in my data in a statistically significant way without having any original hypothesis. The problem is that I believe my logic to be flawed, but I cannot figure out how. Please shed some light on the situation if you see my mistake.
Let's say that I was aware that the previously mentioned study was done in 65,536 (2^16) different locations in the world. I have not looked at the results of the individual locations, and, unknown to me, exactly one location produced a positive study. According to the previous example, the significance level is a function of the number of studies I look at. If I was looking randomly (or with a flawed hypothesis that was invalid and therefore effectively random), on average I would look at 32,768 locations before coming to the one with the positive study. This would mean there would be a 50/50 chance that the results were random (significant only at the 50% level). However, by measuring various subsets of the data instead, I could quickly arrive at the single positive location (assuming there is at least one; if there isn't, I haven't lost anything anyway) in no more than 16 tests of the data. This would imply to me a p-value of 16/65,536, or 1 in 4,096, or .0002441, which would normally indicate strong significance. Basically, if you pick a number in your head between 1 and 65,536, it is possible for me to guess the number correctly 100% of the time within 16 attempts, asking only whether the number falls within a certain range.
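The 16-question guessing game is just binary search: each question halves the remaining range, and 2^16 = 65,536, so 16 questions always suffice. A minimal sketch (the function name `guess_number` is mine, for illustration):

```python
def guess_number(secret, low=1, high=65536):
    """Find `secret` in [low, high] by repeatedly asking
    "is the number <= mid?" and halving the range."""
    attempts = 0
    while low < high:
        mid = (low + high) // 2
        attempts += 1          # one yes/no question
        if secret <= mid:
            high = mid         # answer: yes, it's in the lower half
        else:
            low = mid + 1      # answer: no, it's in the upper half
    return low, attempts

print(guess_number(40000))  # finds 40000 in exactly 16 questions
```

Because 65,536 is an exact power of two, every secret takes exactly 16 questions, never fewer.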
This would be the same as a single profitable angle being present out of 65,536 combinations and me finding its exact location in only 16 attempts. Related to sports betting, many of us are familiar with hearing about obscure angles like "home team that finished a 5 day road trip and has had 2 days off and is playing on a Tuesday in March etc. etc. is 68% ATS". Often these angles are a result of data mining, and because the investigator looked at too many combinations the results become insignificant. However, if I only look at 16 combinations and the ROI of the angle has a significance of p = .003125, then according to the Holm-Bonferroni method, my results would now be considered significant at .003125 * 16 = .05, or the 5% level.
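For the smallest of m p-values, the Holm-Bonferroni step-down correction does multiply by m (the next smallest is multiplied by m - 1, and so on, enforcing monotonicity), so the .003125 * 16 = .05 arithmetic checks out for the smallest p-value. A sketch of the adjustment (`holm_adjusted` is my own helper name, not a library function):

```python
def holm_adjusted(pvals):
    """Holm-Bonferroni step-down adjustment: sort p-values ascending,
    multiply the i-th smallest (0-indexed) by (m - i), cap at 1.0,
    and force the adjusted values to be non-decreasing in rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# 16 tests: one "angle" at p = .003125, the rest unremarkable.
pvals = [0.003125] + [0.5] * 15
print(holm_adjusted(pvals)[0])  # smallest p-value scaled by m = 16
```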
Is this correct?