Here is a real-life example. For the purposes of this post, I consider 5% to be the minimum significance level I would want (95% confidence).
The following represents 3171 independent games from a single sport for a single bet type. The first column encodes 4 different variables (0000 to 1113), one per digit: the 1st digit (with a value of either 0 or 1) represents one variable, the 2nd digit a different variable, and the same for the 3rd. A value of 0 is assigned if the variable goes one way and 1 if it goes the other; for example, a 0 in a given digit might represent a home team and a 1 an away team. The 4th digit is unique in that its variable has been divided into quartiles (or as near to quartiles as possible), using a rounding rule that was decided without looking at the results. That gives a total of 2x2x2x4 = 32 combinations of these 4 variables. The second column is the number of events for that combination. The third column is the ROI, assuming each bet is sized to win the same 1 unit regardless of odds. The fourth and fifth columns are the total amounts won and bet, respectively, for each combination under that same staking scheme.
Vars # ROI Win Bet
0000 275 +2.92% 340.51 330.84
0001 260 +5.85% 356.16 327.04
0002 271 -6.00% 327.26 348.14
0003 286 -8.19% 349.86 381.05
1000 346 -7.30% 609.57 657.57
1001 359 +4.04% 761.47 731.87
1002 321 -6.57% 636.65 681.44
1003 347 -4.14% 787.85 821.88
0010 110 -7.31% 87.78 94.70
0011 122 -0.21% 108.73 108.95
0012 127 -17.27% 95.15 115.01
0013 118 -22.09% 85.49 109.73
1010 57 +13.19% 38.73 34.21
1011 58 -22.94% 27.73 35.99
1012 54 -6.63% 27.58 29.18
1013 60 -16.84% 8.28 12.99
0100 118 +16.01% 162.64 140.20
0101 127 +10.38% 170.64 154.60
0102 122 -2.82% 146.33 150.58
0103 110 +2.42% 143.67 140.27
1100 60 +7.57% 113.09 105.13
1101 54 +2.54% 97.60 95.18
1102 58 +11.50% 113.42 101.72
1103 57 -9.06% 94.89 104.34
0110 286 +5.60% 247.28 234.16
0111 271 +4.19% 231.35 231.64
0112 260 -13.09% 197.46 227.20
0113 275 -7.57% 232.76 251.82
1110 347 +5.29% 178.60 169.64
1111 321 +6.66% 184.25 172.74
1112 359 -13.16% 174.52 200.97
1113 346 +6.01% 217.25 204.93
I have run 1,000,000 Monte Carlo trials (because these are moneyline odds, not straight 50/50 bets) for each of the 15 variable combinations that have a positive ROI. From this I determined that only one, the "0100" combination, is significant at the 5% level (95% confident that the actual results were not random). However, perhaps I was overzealous in breaking the data up into such small groups, and that is the point of this post.
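In case it helps, here is roughly what each of those Monte Carlo tests looks like in code. This is a simplified sketch, not my exact procedure: the per-game decimal odds are placeholder inputs (the table above only shows aggregate Win/Bet totals), and the null hypothesis is taken to be that each bet wins with its break-even probability, so the expected profit is zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_pvalue(dec_odds, observed_profit, n_trials=100_000):
    """One-sided Monte Carlo p-value for one variable combination.

    dec_odds: decimal odds for each game in the group (placeholder input;
    the real table only shows aggregate totals, not per-game odds).
    Null hypothesis: each bet wins with its break-even probability
    1/odds, so the expected profit is zero. Betting "to win 1 unit"
    means the stake on a game with decimal odds d is 1/(d - 1).
    """
    dec_odds = np.asarray(dec_odds, dtype=float)
    p_win = 1.0 / dec_odds            # break-even win probability under the null
    stakes = 1.0 / (dec_odds - 1.0)   # stake sized to win exactly 1 unit
    # Simulate n_trials random "seasons": each game independently wins
    # with probability p_win; profit is +1 on a win and -stake on a loss.
    wins = rng.random((n_trials, len(dec_odds))) < p_win
    profits = np.where(wins, 1.0, -stakes).sum(axis=1)
    # p-value: the fraction of random seasons at least as profitable
    # as the observed result.
    return float((profits >= observed_profit).mean())
```

As a sanity check, a group of even-money bets (decimal odds 2.0) with exactly zero observed profit comes out around p = 0.5, as it should.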
Using the above results, I can tell you that if I combine the combinations into the following combined group:
x1x0 together with x1x1 (where x denotes that the variable can be either 0 or 1), I get a positive ROI of 1385.45 / 1303.29 - 1 = +6.30% on a 1584-event sample, and this is significant at the 5% level according to a 10,000,000-trial Monte Carlo run. However, it was only because I looked at the original ROI results in the first place that I would even have known to combine them this way (actually, I had an idea beforehand, but could not prove it).
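To double-check the arithmetic, the combined group can be pulled straight out of the table (the rows below are copied verbatim from above):

```python
# (vars, events, win, bet) rows copied verbatim from the table above.
rows = [
    ("0000", 275, 340.51, 330.84), ("0001", 260, 356.16, 327.04),
    ("0002", 271, 327.26, 348.14), ("0003", 286, 349.86, 381.05),
    ("1000", 346, 609.57, 657.57), ("1001", 359, 761.47, 731.87),
    ("1002", 321, 636.65, 681.44), ("1003", 347, 787.85, 821.88),
    ("0010", 110, 87.78, 94.70),   ("0011", 122, 108.73, 108.95),
    ("0012", 127, 95.15, 115.01),  ("0013", 118, 85.49, 109.73),
    ("1010", 57, 38.73, 34.21),    ("1011", 58, 27.73, 35.99),
    ("1012", 54, 27.58, 29.18),    ("1013", 60, 8.28, 12.99),
    ("0100", 118, 162.64, 140.20), ("0101", 127, 170.64, 154.60),
    ("0102", 122, 146.33, 150.58), ("0103", 110, 143.67, 140.27),
    ("1100", 60, 113.09, 105.13),  ("1101", 54, 97.60, 95.18),
    ("1102", 58, 113.42, 101.72),  ("1103", 57, 94.89, 104.34),
    ("0110", 286, 247.28, 234.16), ("0111", 271, 231.35, 231.64),
    ("0112", 260, 197.46, 227.20), ("0113", 275, 232.76, 251.82),
    ("1110", 347, 178.60, 169.64), ("1111", 321, 184.25, 172.74),
    ("1112", 359, 174.52, 200.97), ("1113", 346, 217.25, 204.93),
]

# x1x0 together with x1x1: 2nd digit is 1, 4th digit is 0 or 1.
group = [r for r in rows if r[0][1] == "1" and r[0][3] in "01"]
events = sum(r[1] for r in group)
win = sum(r[2] for r in group)
bet = sum(r[3] for r in group)
roi = win / bet - 1
print(events, round(win, 2), round(bet, 2), f"{roi:+.2%}")
# -> 1584 1385.45 1303.29 +6.30%
```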
And this brings me to the real problem I am having now. How do I (and can I even) correctly combine the combinations into new combinations without making a Type I error? On the one hand, I am now biased because I have seen the ROI results for each combination. On the other hand, just because I have seen them does not automatically make them insignificant. If I had a sample of 10,000 independent trials that were a 50/50 proposition but had a guaranteed built-in ROI of +5% for me, and I broke the sample down using 10 different variables (that, unknown to me, didn't matter anyway), I could get a sample size so small for each combination that a significance test would say the positive ROI was not statistically significant. I have read papers on sports betting in scientific journals that say things like "the highest 3 combinations on the list are jointly significant". Well, how do they justify combining just the top 3 without creating a Type I error? The only way they knew to combine those 3 combinations in the first place was by looking at the results.
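The dilution effect in that 10,000-trial example is easy to see with a back-of-the-envelope normal approximation (assuming even-money bets of 1 unit each, so each bet has mean 0 and variance roughly 1 under the fair null):

```python
import math

def z_score(n, edge=0.05):
    """Approximate z-score for total profit after n even-money 1-unit
    bets with a built-in `edge` ROI, tested against the fair null
    (mean 0, variance ~1 per bet): z = edge * n / sqrt(n)."""
    return edge * n / math.sqrt(n)

# The whole 10,000-bet sample: z = 5.0, overwhelmingly significant.
# Split by 10 irrelevant binary variables into 2**10 = 1024 groups of
# roughly 10 bets each: z is about 0.16 per group, nowhere near the
# 1.645 needed for one-sided significance at the 5% level.
```

So the edge is real and provable on the full sample, yet invisible in every one of the 1024 slices.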
After reading about a dozen of these papers, I have come up with 3 different hypotheses as to why they (the authors) can do this and still not create a Type I error:
1) You can combine combinations linearly, starting from the top or bottom of the variable list. For example, if I were looking at NBA totals, I could start from the highest total on record (about 270 or so) and, even though there are not enough events to declare the ROI statistically significant for the individual total of 270, I could then add the results for 269, then 268, then 267, and so on until I reached a point where I had a desirable ROI on a sample large enough to be statistically significant. Even if this first occurred at 230, I could keep going down the list to 229, etc., to see if I got a higher statistically significant ROI. I could also start from the lowest total (I am guessing it is around 150) and work my way up. What I could not do, though, is start somewhere other than either end and select a cluster that had a positive ROI and a large enough sample size and declare it statistically significant. For example, I could not notice that the range between 195 and 215 was the most profitable and declare that as my unbiased group to test for significance.
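If hypothesis #1 is right, the allowed procedure is nothing more than a cumulative scan from one end of the ordered list. Here is that idea as code, with made-up (total, win, bet) rows; the significance test on each cumulative group is omitted:

```python
def cumulative_groups(rows):
    """Hypothesis #1 as code: rows are (total, win, bet) tuples sorted
    from the highest total downward (or the lowest upward). Yields every
    group hypothesis #1 permits -- each formed by extending the window
    one total at a time from the chosen end, never by picking a cluster
    out of the middle of the list.
    """
    win = bet = 0.0
    for total, w, b in rows:
        win += w
        bet += b
        yield total, win / bet - 1  # cutoff total and cumulative ROI so far

# Made-up illustration; each cumulative group would then get its own
# significance test.
for cutoff, roi in cumulative_groups([(270, 3.1, 2.9), (269, 5.0, 4.8),
                                      (268, 9.7, 9.9)]):
    print(cutoff, f"{roi:+.2%}")
```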
OR
2) You can combine ANY variable combination from anywhere in the list with another, so long as it shares a linear relationship with all of the other combinations it is being combined with. Here is a completely made-up example for a fictitious sport. Suppose we have the following 4 variables:
Home or Away
Favourite or Dog
Line Moved With or Against Team
Won or Lost Their Previous Game
Let's say that, of the 16 possible variable combinations (2x2x2x2 = 16), the following 5 had a positive ROI, but that none had a sample size big enough to declare the result significant by itself.
a) Line Moved Against Home Dog That Won Previous Game
b) Line Moved With Home Dog That Won Previous Game
c) Line Moved Against Home Fav That Won Previous Game
d) Line Moved With Away Dog That Won Previous Game
e) Line Moved Against Away Dog That Lost Previous Game
I now notice that "Won Previous Game" connects 4 of the combinations, "Home" connects 3 of them, "Dog" connects 4 of them, "Moved Against" connects 3 of them, and so on. I am now free to reduce the variables from 4 down to 3, 2, or 1 in order to get the highest positive ROI I can that is also statistically significant. I can make new groups like:
a) Dogs That Won Previous Game
b) Home and Won Previous Game
c) Line Moved Against and Won Previous Game
etc.
So long as I am only removing variables to create a new group, I can do this without creating a Type I error. However, I cannot say things like:
a) (Dogs That Won Previous Game) or (Home That Won Previous Game)
b) (Line Moved Against) and (Dogs That Won Previous Game)
c) (Line Moved Against and Won Previous Game) but not (Dogs That Won Previous Game)
Aside: if this method #2 is acceptable, what do I do with combinations that are already part of another group? I am puzzled by this as well.
OR
3) They cannot do what they are doing and are creating a Type I error by doing so.
The thing that worries me about hypotheses #1 and #2 is this: if I am going to check all these different combinations of variables that are linearly connected, do I have to adopt some penalty scheme like the Bonferroni method and penalize myself at a rate of 0.05/n, where n is the number of combinations I have looked at, in order to be sure I still have significance at the 5% level? This is the crux of my problem, since I do not see the mathematicians even alluding to it in their sports betting papers.
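For what it's worth, the penalty I am describing is mechanical to apply once there is a p-value for every grouping examined. Here is a sketch of plain Bonferroni alongside Holm's step-down method, which controls the same familywise Type I error rate but is a little less punishing:

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni: reject H0_i only if p_i <= alpha / n, where n counts
    every grouping actually examined, not just the ones worth reporting."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down method: sort the p-values, compare the k-th
    smallest (k = 0, 1, ...) to alpha / (n - k), and stop at the first
    failure; everything after it is not rejected either."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (n - rank):
            reject[i] = True
        else:
            break  # every larger p-value fails as well
    return reject
```

With n = 32 groupings examined, a single test would need p <= 0.05/32 (about 0.0016) under Bonferroni, which is exactly why peeking at the results first and testing afterwards is so costly.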