1. #1
    VideoReview
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Data Sampling Methods

    A couple of months ago I was advised that a good sampling method to try was to divide all of the games in my sample into thirds across all dates. All the games would be sorted chronologically and then numbered 1, 2, 3, 1, 2, 3, etc., and I would then select one of the sets as my test sample. It was proposed that by doing this I would eliminate any seasonal bias. I could then do as much number crunching as I wanted on one of the sets and then test these assumptions on the other 2 sets, which were completely out of sample. Although this seemed logical to me, I had been getting some strange results from the verification sets. For example, if I had a "system" that had an ROI of +20% in my initial set, I might get an ROI of -6% and -8% on each of the other 2 sets. So although the ROI of the entire sample was +2.66%, I would get negative results in my out-of-sample sets.
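
    For concreteness, here is a minimal sketch of that split (the game records below are placeholders; the only point is the 1, 2, 3, 1, 2, 3 numbering across all dates):

    [CODE]
    # Python - a minimal sketch of the interleaved "thirds" split.
    # The game records are placeholders; real data would be (date, result) rows.
    all_games = [("2004-10-%02d" % (i % 9 + 1), "game %d" % i) for i in range(12)]
    games = sorted(all_games)                     # sorted chronologically

    sets = {1: [], 2: [], 3: []}
    for i, game in enumerate(games):
        sets[i % 3 + 1].append(game)              # labels cycle 1, 2, 3, 1, 2, 3, ...

    in_sample = sets[1]                           # crunch numbers on this set only
    out_of_sample = sets[2] + sets[3]             # verify here, never during development
    print(len(in_sample), len(out_of_sample))     # 4 8
    [/CODE]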

    If I do the same thing but divide the 3 sets chronologically (i.e. Set 1 = 2004 season, Set 2 = 2005 season, Set 3 = 2006 season), then when my initial set was positive, I would more often than not get positive results in the 2 out-of-sample test sets.

    Empirically, and for reasons I still do not understand, this peculiar effect has led me to believe that the prediction of sports results is closer to a card game (say blackjack, for instance) than it is to dice.

    Here is my point:
    Let's say you gave me 10,000 game betting results in date order from any major sport and I selected every second game, for a total of 5,000. Purely by chance, the favoured team in this test sample won 100% of the time, for an ROI of +50% betting to win 1 unit. Now, if sports betting were like dice, which have no memory, I would expect that the ROI of favourites in the 5,000 games that were out of sample would be close to -vig, or about -2% at a 5-cent book. However, I would bet the farm that the ROI of underdogs would be VERY high in the out-of-sample set.

    My conclusion:
    Although I do not have the math skills to prove this, based on my recent experience I would suggest at least one of the following is true:

    a) If data is divided across dates in an attempt to remove seasonal bias, the results of this original sample and the out of sample data need to be totaled together to produce a final result.

    OR

    b) Data should not be divided across dates but rather along logical date boundaries (i.e. by whole seasons, and not something like using pre-season game data to predict playoff game results, etc.)

    I highly respect the person who gave me the original suggestion, but I am simply not able to produce consistent results with this sampling method. I would appreciate any thoughts, especially those that involve the math of what I am saying.

  2. #2
    Ganchrow
    Nolite te bastardes carborundorum.
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    Although this seemed logical to me, I had been getting some strange results from the verification sets. For example, if I had a "system" that had an ROI of +20% in my initial set, I might get an ROI of -6% and -8% on each of the other 2 sets. So although the ROI of the entire sample was +2.66%, I would get negative results in my out-of-sample sets.
    This does not sound to me like a "strange" result but rather evidence that the conclusions drawn based on the first data segment weren't predictive of the population (i.e., were "data mined"). This is a common result.

    Another possibility is that there exists autocorrelation within your time series (so results on days t-1 ... t-n need to be used in formulating day-t forecasts). If this were the case, then segmenting your data set so it cut across relevant time horizons would be a bad idea.
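
    A quick first look at this is the sample autocorrelation of the daily result series at a few lags. A minimal sketch, with random placeholders standing in for actual daily results:

    [CODE]
    # Python - sample autocorrelation of a daily result series at lags 1-3.
    # The daily results are random placeholders, so these should all be near 0.
    import numpy as np

    rng = np.random.default_rng(0)
    daily_results = rng.normal(size=365)          # stand-in for actual daily P/L

    def autocorr(x, lag):
        x = x - x.mean()
        return (x[lag:] * x[:-lag]).sum() / (x * x).sum()

    for lag in (1, 2, 3):
        print(lag, round(autocorr(daily_results, lag), 3))
    [/CODE]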

    Quote Originally Posted by VideoReview View Post
    If I do the same thing but divide the 3 sets chronologically (i.e. Set 1 = 2004 season, Set 2 = 2005 season, Set 3 = 2006 season), then when my initial set was positive, I would more often than not get positive results in the 2 out-of-sample test sets.
    So this is either a "good" result or (when taken in conjunction with the above) is suggestive of a programming error. Question: if you split up your successful seasonal data using the other method, do you find that one of your thirds drastically outperforms the other two?

    Quote Originally Posted by VideoReview View Post
    Let's say you gave me 10,000 game betting results in date order from any major sport and I selected every second game, for a total of 5,000. Purely by chance, the favoured team in this test sample won 100% of the time, for an ROI of +50% betting to win 1 unit. Now, if sports betting were like dice, which have no memory, I would expect that the ROI of favourites in the 5,000 games that were out of sample would be close to -vig, or about -2% at a 5-cent book. However, I would bet the farm that the ROI of underdogs would be VERY high in the out-of-sample set.
    If you make this claim with prior knowledge that the entire 10,000 game population had "average" results, then this would follow directly from conditional probability.

    Without this precondition, then you're either falling prey to the Gambler's Fallacy or have uncovered what promises to be a very lucrative sports-based autoregressive moving average model.
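
    The conditional-probability case is just arithmetic on a fixed total. With an invented 52% population-wide favourite win rate:

    [CODE]
    # Python - conditional probability on a fixed population (52% rate is invented).
    # If the full 10,000 games are known to be "average" and the sampled half
    # went a perfect 5,000-0, the other half is forced to be terrible.
    pop_games, pop_rate = 10_000, 0.52
    in_games, in_rate = 5_000, 1.00
    out_rate = (pop_games * pop_rate - in_games * in_rate) / (pop_games - in_games)
    print(out_rate)                               # 0.04 -> favourites win only 4%
    [/CODE]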

    Quote Originally Posted by VideoReview View Post
    a) If data is divided across dates in an attempt to remove seasonal bias, the results of this original sample and the out of sample data need to be totaled together to produce a final result.
    No. Don't do this. This defeats the whole purpose of separately maintained in and out-of-sample data segments. At least if you do do it, make sure to properly condition your results on the likelihood of finding such results in-sample.


    Quote Originally Posted by VideoReview View Post
    b) Data should not be divided across dates but rather along logical date boundaries (i.e. by whole seasons, and not something like using pre-season game data to predict playoff game results, etc.)
    If this works for you then that's certainly great news. If you've done everything correctly, then I'd suggest you start moving ahead.

    That said, the fact that it only works when you divide your data in such a manner does not inspire a whole lot of confidence within me. I'd strongly suggest double-, triple-, quadruple-, quintuple-, and sextuple-checking your work (you can stop there; septuple-checking is just plain silly) to make absolutely certain that no out-of-sample data at all somehow crept into your in-sample modeling.

    But let me make this very clear ... there's nothing necessarily "wrong" with segmenting data by season, and if you get good results then by all means go for it. But based upon what you've written above I'm just going to warn you again to make sure your programming and modeling are sound.

    Quote Originally Posted by VideoReview View Post
    I highly respect the person who gave me the original suggestion, but I am simply not able to produce consistent results with this sampling method. I would appreciate any thoughts, especially those that involve the math of what I am saying.
    Finding meaningful results particularly difficult to come by when using proper sampling methodology is generally to be expected.

  3. #3
    VideoReview
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    So this is either a "good" result or (when taken in conjunction with the above) is suggestive of a programming error. Question: if you split up your successful seasonal data using the other method, do you find that one of your thirds drastically outperforms the other two?
    When I split up the data I actually put the sets on different spreadsheets to avoid any possible contamination. All of my assumptions, including my initial hypotheses, are then drawn from the initial set only. To answer your question: very often one of the segments drastically outperforms the other two, and it is not always the initial sample.

    Quote Originally Posted by Ganchrow View Post
    Another possibility is that there exists autocorrelation within your time series (so results on days t-1 ... t-n need to be used in formulating day-t forecasts). If this were the case, then segmenting your data set so it cut across relevant time horizons would be a bad idea.
    That is the first time I have ever heard mention of this possibility. Is there a quick and easy test to determine if this is the case? When I first started seeing the results, I started fooling around with moving averages, trying to develop some stochastic triggers. I really haven't gotten too deep into that yet, though.

    Quote Originally Posted by Ganchrow View Post
    If you make this claim with prior knowledge that the entire 10,000 game population had "average" results, then this would follow directly from conditional probability.

    Without this precondition, then you're either falling prey to the Gambler's Fallacy or have uncovered what promises to be a very lucrative sports-based autoregressive moving average model.
    In a way I do make this claim with prior knowledge, but that knowledge is more a priori, since I know that the books would not leave something as blatant as favourites winning at an ROI of +50% in place for any length of time, and would correct the situation even if the sharps and the public don't. So, even though I do not know what sport it is or where the sample is from, I strongly believe that if you gave me any 10,000 consecutive games from any sport, divided up as I mentioned, the other half would have to have a significantly positive ROI for the dogs.

    I know what you mean about the Gambler's Fallacy, and I am not suggesting that things just have to even out. They don't. However, from what I have seen in sports betting, the distribution of results is often so skewed, or has such a high level of kurtosis, that the probability of it being random is very low (p <= .0001, etc.). So if the distribution of results is being "controlled," then I do not think it is a big leap to suggest that the entire population's results would be average. What do you think of my assumptions above?

    Quote Originally Posted by Ganchrow View Post
    If this works for you then that's certainly great news. If you've done everything correctly, then I'd suggest you start moving ahead.

    That said, the fact that it only works when you divide your data in such a manner does not inspire a whole lot of confidence within me. I'd strongly suggest double-, triple-, quadruple-, quintuple-, and sextuple-checking your work (you can stop there; septuple-checking is just plain silly) to make absolutely certain that no out-of-sample data at all somehow crept into your in-sample modeling.

    But let me make this very clear ... there's nothing necessarily "wrong" with segmenting data by season, and if you get good results then by all means go for it. But based upon what you've written above I'm just going to warn you again to make sure your programming and modeling are sound.

    Finding meaningful results particularly difficult to come by when using proper sampling methodology is generally to be expected.
    Thanks again for the fair warning. I have done everything I can to avoid programming error. I am working in Excel so I can see the results line by line and it all appears good and the sample and test data are separated.

  4. #4
    Ganchrow
    Nolite te bastardes carborundorum.
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Forgive me, but I'm not sure I see where you're heading with this.

    You have a model that you believe performs well out-of-sample. Why not just run with it? What's really your concern?

    Quote Originally Posted by VideoReview View Post
    When I split up the data I actually put the sets on different spreadsheets to avoid any possible contamination. All of my assumptions, including my initial hypotheses, are then drawn from the initial set only. To answer your question: very often one of the segments drastically outperforms the other two, and it is not always the initial sample.
    But you're saying that you've tried segmenting your data set in two different ways.

    If you segment your data set based on seasons, do you still find that three-day pattern?

    Quote Originally Posted by VideoReview View Post
    In a way I do make this claim with prior knowledge, but that knowledge is more a priori, since I know that the books would not leave something as blatant as favourites winning at an ROI of +50% in place for any length of time, and would correct the situation even if the sharps and the public don't. So, even though I do not know what sport it is or where the sample is from, I strongly believe that if you gave me any 10,000 consecutive games from any sport, divided up as I mentioned, the other half would have to have a significantly positive ROI for the dogs. I know what you mean about the Gambler's Fallacy, and I am not suggesting that things just have to even out. They don't. However, from what I have seen in sports betting, the distribution of results is often so skewed, or has such a high level of kurtosis, that the probability of it being random is very low (p <= .0001, etc.). So if the distribution of results is being "controlled," then I do not think it is a big leap to suggest that the entire population's results would be average. What do you think of my assumptions above?
    I'm not sure I see a testable hypothesis here.

    ...

    If you want to test for autocorrelation your very first step would be looking at the Durbin-Watson statistic. Try regressing your residuals on your lagged residuals (so if it's a 3-day cycle you're seeing, regress e_t on e_{t-1} and e_{t-2}). I suspect this will prove a waste of time.
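
    A minimal sketch of both checks, with random placeholder residuals standing in for actual model errors:

    [CODE]
    # Python - Durbin-Watson plus a regression of e_t on e_{t-1} and e_{t-2}.
    # The residuals are random placeholders; only the mechanics are shown.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    residuals = np.random.default_rng(0).normal(size=500)

    # Values near 2 suggest no first-order autocorrelation;
    # values well below 2 suggest positive autocorrelation.
    print("DW:", durbin_watson(residuals))

    e_t = residuals[2:]
    lags = sm.add_constant(np.column_stack([residuals[1:-1], residuals[:-2]]))
    print(sm.OLS(e_t, lags).fit().summary())      # significant lags -> autocorrelation
    [/CODE]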

    It's very difficult for me to guess what's going on here, but to be perfectly blunt, my estimation is that you're either on to something really, really big or there's some sort of systematic error in your model.

    I know you're doing this for NHL. Why not give it a try with MLB, a sport for which considerably more data exists?

  5. #5
    VideoReview
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Forgive me, but I'm not sure I see where you're heading with this.
    I apologize for the delay in getting back. This thread is important and I would like to clear up some misconceptions.

    Quote Originally Posted by Ganchrow View Post
    You have a model that you believe performs well out-of-sample. Why not just run with it?
    I am actually running with it now and have "officially" started on 02/14/2008. That date marks the first time I have ever deposited more than my nominal system-test amount (always the same amount over the last year, which I always lose, and know I will lose, because I am trying so many different things at the same time that cumulatively I would certainly be over 2x Kelly even if by chance I have +EV on some systems). The nominal amount worked well for me because Pinnacle's minimum is only $1 US. I am no longer in test mode. Since I started running with it, I have placed 594 bets for a total of 41.32956 times my initial deposit, and my net profit is currently 3.30138 times my initial deposit. The end of today will mark my first quarter (3 months), and I will be running a weighted Monte Carlo simulation (thanks again, Ganch, for this invaluable tool) on all of my past 594+ bets to determine my overall p-value.
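
    A bare-bones, flat-stakes sketch of that kind of Monte Carlo p-value (the odds and profit below are placeholders, and the weighting is omitted): how often would a zero-edge bettor do at least this well by luck alone?

    [CODE]
    # Python - flat-stakes Monte Carlo p-value sketch (placeholder inputs).
    # Null hypothesis: every bet is placed at zero-edge fair prices.
    import numpy as np

    rng = np.random.default_rng(0)
    odds = np.tile([1.91, 2.50, 1.80, 2.10], 150)[:594]   # stand-ins for 594 bets
    p_win = 1.0 / odds                                    # no-vig win probabilities
    actual_profit = 3.30                                  # units won, placeholder

    wins = rng.random((10_000, odds.size)) < p_win
    profits = np.where(wins, odds - 1.0, -1.0).sum(axis=1)
    print((profits >= actual_profit).mean())              # the Monte Carlo p-value
    [/CODE]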

    Quote Originally Posted by Ganchrow View Post
    What's really your concern?
    My real concern is determining Kelly at a true, unbiased 90% confidence level. To do that, though, I need to be able to test my hypotheses properly. I have lowered my confidence level to 90% from 95% after considering your comment to me about the fact that we're not dealing with SETI data here.

    Quote Originally Posted by Ganchrow View Post
    But you're saying that you've tried segmenting your data set in two different ways.

    If you segment your data set based on seasons, do you still find that three-day pattern?
    It's not really a 3-day pattern at all. All I am doing is sorting the games by date and then numbering them 1, 2, 3, 1, 2, 3, etc. So, if there are 8 games on the first day in the data, they will be numbered 1, 2, 3, 1, 2, 3, 1, 2, and the first game of the next day will be numbered 3, with the count carrying on from there.

    I could go on for pages describing what is happening, but I will try to sum it up. When I divide the data into thirds across date lines as shown above, I find a recurring pattern: clusters with the highest p-values (i.e. negative ROI) in my test third produce very high positive ROIs in the other two thirds. I thought I was seeing things, so I ran a non-parametric regression on it, which totally removed me from the model process, and I was truly able to predict the other 2 thirds. My independent data was a single column of p-values of various clusters (i.e. systems), and my dependent data was a single column of p-values for those same systems in the other 2 thirds. The systems themselves were all completely random, though built from valid data. I would create 50,000 random systems, test them on one third of the data, and have the NP regression find a model that predicts the p-values in the other 2 thirds. I kept taking a random 10% of the data out in order to test the model on. Although it amounts to only a positive ROI of about 2-3%, the p-value was essentially 0. The very strange thing was, the model found the same pattern I did: if a system performed poorly in the test sample, it was more likely to perform well in the remaining 2 thirds. When I did the EXACT same process but divided the sample up by seasons, it could not make accurate predictions based on p-values alone. This leads me to believe that there is strong autocorrelation going on, but I cannot figure out how to take advantage of it - yet. I am not asking for help with figuring it out, as I am satisfied I am on the right track using more traditional ideas anyway. This is something I will be thinking over for quite some time, I am sure.
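
    Roughly that experiment, replicated on synthetic memoryless data (everything here is assumed, and flat-bet ROI stands in for the p-values): on independent games the two scores should not correlate, which is exactly what makes the pattern described above so odd.

    [CODE]
    # Python - random "systems" scored on one third vs the other two thirds.
    # With independent synthetic games the correlation should be near 0.
    import numpy as np

    rng = np.random.default_rng(1)
    n_games = 3000
    results = rng.random(n_games) < 0.5           # placeholder win/loss per game
    thirds = np.arange(n_games) % 3               # the 1, 2, 3 interleaved labels

    def roi(mask):
        picks = results[mask]
        return (picks * 0.91 - ~picks).mean()     # flat bets at -110 (assumed)

    scores = []
    for _ in range(5000):                         # 50,000 in the post above
        system = rng.random(n_games) < 0.1        # one random subset of games
        scores.append((roi(system & (thirds == 0)),
                       roi(system & (thirds != 0))))
    print(np.corrcoef(np.array(scores).T)[0, 1])  # ~ 0 on memoryless data
    [/CODE]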

    Quote Originally Posted by Ganchrow View Post
    I'm not sure I see a testable hypothesis here.
    Why not? What I am saying is that if you give me any large sample of sports results divided into thirds across date lines, as I demonstrated earlier, then with one third of the data I can tell you, with p < .05, what is going to happen in the other 2 thirds. If, by chance, I see that the ROI of home teams in the third I get is -30% (which would obviously be due to chance alone, as books would never allow this long term), you can be sure I will be saying that home teams in the other 2 thirds will be profitable. It could be favourites or dogs, etc.; as long as I find a common variable that has performed very badly in my third, I am saying it will likely perform positively in the remaining data. I can see why someone would think this is a typical Gambler's Fallacy case.

    My very best guess as to why this works, and in fact why it may never be practical to make a system out of it (which is why it is not my focus at the moment), is that I am using games from before an event and AFTER an event to predict games that occurred in the middle. I think THIS is the crux of the problem, and it is a form of data snooping. In other words, there seems to be an ebb and flow to what bettors are pounding, and the books seem to be moving with this. Having the ebb and flow calculated on one third of a cross-section seems to have some predictive value in determining the ebb and flow of the remaining data.

    Quote Originally Posted by Ganchrow View Post
    I know you're doing this for NHL. Why not give it a try with MLB, a sport for which considerably more data exists?
    I am doing it with MLB now. I was trying to develop my ideas and overall model-construction methods on one sport so that I could apply them in an unbiased way to other sports. Actually, MLB has been doing very well for me, which is good, as I have been getting killed on NHL since the second round started.

    My final point/question/idea for you still is along the following lines:

    1) If you flip a coin 300,000 times and give me the results of every third flip (100,000 results), and I find that, incredibly, heads came up 70,000 times in my 100,000 flips, I would still bet that the remaining 200,000 flips would have approximately 50% heads.

    However,

    2) If you assemble the results of 300,000 games of a sport, sort them by date, and give me every third game's results, and I find that, incredibly, in my third the home team won so many times that my ROI was +50%, I would bet heavily on the visiting team in the remaining 2 thirds of the games.

    But if you gave me the FIRST 100,000 games in date order and the results were similarly incredible, I would NOT necessarily bet on the away team for that reason alone. I think this is what separates what I am seeing from the Gambler's Fallacy.
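
    A small simulation of case 1, plus the fixed-population arithmetic behind case 2 (numbers scaled down; a 70% sampled rate is too rare to simulate directly, so 60% is used as the "hot" threshold):

    [CODE]
    # Python - case 1: independent flips have no memory, even after conditioning
    # on a hot sampled third. Case 2 is forced by a pinned population total.
    import numpy as np

    rng = np.random.default_rng(2)
    flips = rng.random((100_000, 300)) < 0.5      # 100,000 scaled-down experiments
    sample = flips[:, ::3]                        # "every third flip"
    rest = np.delete(flips, np.s_[::3], axis=1)
    hot = sample.mean(axis=1) > 0.60              # keep only unusually hot samples
    print(rest[hot].mean())                       # still ~ 0.50

    # Case 2: population rate pinned at 0.50 and the sampled third ran 0.70,
    # so the remaining two-thirds must run (3 * 0.50 - 0.70) / 2:
    print((3 * 0.50 - 0.70) / 2)                  # 0.40
    [/CODE]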

    Do you think there is any logic to this at all?

    PS I am prepared to end the discussion on this topic if you do not see any merit to this hypothesis.
    Last edited by VideoReview; 05-13-08 at 03:20 PM.
