Proper way to backtest

flsaders85 · 10-22-15, 11:33 AM

Originally posted by mikmak

Hi all. I have developed a model for college football that works fairly well when back tested using 2014 data and results. I base my game choices on contests where my capped line/total is off by x amount from the line I'm getting from the book. When back testing using the 2014-2015 season, it hits at about a 63% rate for around 300 games (out of 869) for football. I'm going to collect a couple more years of data to do more back testing but I want to make sure I'm doing this properly.

My concern is that I'm using the end stats for the season to cap all games. Is that, essentially, "cheating" because I'm using data from games to cap those actual games? I'm assuming this is one of the inherent hurdles with back testing. Back testing on a weekly basis is going to be much more involved as I'll need to scrape box scores for individual game stats instead of using season stats.

So, ultimately, my question is ... is there a standard percentage I should knock off my results due to the fact that I'm using the entire season's stats to cap a game whose stats are included in the calculations, or is the advantage minuscule enough to simply disregard?

Unfortunately, that is most likely skewing the backtest. It's known as back-fitting. The proper way to backtest would be to have one data sample for training purposes and a separate data sample to test your model.
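
A minimal sketch of that kind of split, assuming each game record carries a date. The function name, field names, cutoff date, and sample games are made up for illustration and are not from mikmak's model. Splitting by date, rather than at random, keeps every test game later than every training game.

```python
from datetime import date

def chronological_split(games, cutoff):
    """Return (train, test): train is strictly before cutoff, test is on or after it."""
    train = [g for g in games if g["date"] < cutoff]
    test = [g for g in games if g["date"] >= cutoff]
    return train, test

# Hypothetical game records, for illustration only.
games = [
    {"date": date(2013, 9, 7), "home": "A", "away": "B"},
    {"date": date(2013, 11, 2), "home": "C", "away": "A"},
    {"date": date(2014, 9, 6), "home": "B", "away": "C"},
    {"date": date(2014, 10, 18), "home": "A", "away": "C"},
]

# Fit on the 2013 season, test on the 2014 season.
train, test = chronological_split(games, date(2014, 8, 1))
```

The model is tuned on `train` only, and the hit rate that counts is the one measured on `test`.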

HeeeHAWWWW · 10-22-15, 01:52 PM

Originally posted by mikmak

My concern is that I'm using the end stats for the season to cap all games. Is that, essentially, "cheating" because I'm using data from games to cap those actual games?

Yep. Pretty much anytime you do this you'll come up with spectacularly good results :-) You can't use any future knowledge at all, whether data or even structural assumptions.

If you don't have enough older data to make a proper test/train split, try using cross-validation, or something like a random forest, which has a built-in error estimate from its out-of-bag samples.
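
A rough sketch of what k-fold cross-validation does, in plain Python with no libraries. The helper name `k_fold_indices` is hypothetical. The data is cut into k blocks, and each block takes a turn as the test set while the model is fit on the rest. The features for each game still have to be built from earlier games only, or the same leak comes back.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 10 games, 3 folds: every game lands in exactly one test fold.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train on", train_idx, "test on", test_idx)
```

You would fit the model on each `train_idx`, score it on the matching `test_idx`, and average the scores across the folds.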

mikmak · 10-22-15, 03:21 PM

Thanks for the replies. I had a feeling this was going to be the case. I didn't mention that when I jacked up the differential I'm using to select games, my totals were hitting at over 80%. Not a ton of games but enough to make some serious coin.

OK ... so it looks like I'll have to break my stats down by week and run my model with only stats that have been accumulated prior to the games being capped. Or use an estimator or cross-validation like heehaw mentioned. Looks like I've got some reading to do because I, honestly, don't know what cross-validation is or how to implement it.
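
Something like this is what I have in mind for the "only stats accumulated prior to the game" part. It's a sketch that assumes one row per team per game with a week number. The function name, field names, and sample numbers are invented for illustration.

```python
from collections import defaultdict

def stats_before_week(results, week):
    """Average points per team using only games played before `week`."""
    totals = defaultdict(lambda: [0, 0])  # team -> [points, games played]
    for r in results:
        if r["week"] < week:  # strictly earlier weeks only, no look-ahead
            totals[r["team"]][0] += r["points"]
            totals[r["team"]][1] += 1
    return {team: pts / n for team, (pts, n) in totals.items()}

# Hypothetical per-game results, for illustration only.
results = [
    {"week": 1, "team": "A", "points": 28},
    {"week": 2, "team": "A", "points": 14},
    {"week": 3, "team": "A", "points": 42},
    {"week": 1, "team": "B", "points": 10},
    {"week": 3, "team": "B", "points": 20},
]

week3 = stats_before_week(results, 3)  # {"A": 21.0, "B": 10.0}
```

The week 3 scores (42 and 20) stay out of the numbers used to cap week 3, which is the whole point.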

flsaders85 · 10-22-15, 03:56 PM

www.repole.com has what you're looking for in terms of weekly data

Waterstpub87 · 10-22-15, 08:31 PM

ESPN has monthly stats. So in that case, you could use only January data for February games, and then average the two for March games, etc.

mikmak · 10-26-15, 12:09 PM

I found that teamrankings.com has what I need on a weekly basis. I use Excel for running my model, so I'll simply create separate stat sheets for each week and just use the previous week's stat sheet for capping the current week's games. If I repeat that from week 5 or so on through the end of 2014, that should be an accurate snapshot of how my model is doing. Thanks for all the help guys.
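
For anyone who would rather script it than keep a sheet per week, here is a rough Python sketch of the same loop for totals. Everything in it is an assumption made for illustration: the `snapshots` layout (week number mapped to stats through that week), the toy `predict` function, the threshold, and the sample games. It is not my actual model.

```python
def walk_forward(snapshots, games, predict, threshold, start_week=5):
    """Return (wins, bets), capping each game with the prior week's stats only."""
    wins = bets = 0
    for g in games:
        if g["week"] < start_week or g["actual_total"] == g["book_total"]:
            continue  # too early in the season, or a push
        stats = snapshots[g["week"] - 1]  # stats through the previous week
        edge = predict(stats, g["home"], g["away"]) - g["book_total"]
        if abs(edge) < threshold:
            continue  # my number is too close to the book's: no play
        bets += 1
        went_over = g["actual_total"] > g["book_total"]
        if (edge > 0) == went_over:  # played the over and it went over, or vice versa
            wins += 1
    return wins, bets

def predict(stats, home, away):
    """Toy stand-in for a real model: add the two teams' points per game."""
    return stats[home] + stats[away]

# Hypothetical snapshots and games, for illustration only.
snapshots = {4: {"A": 30.0, "B": 24.0}, 5: {"A": 35.0, "B": 20.0}}
games = [
    {"week": 3, "home": "A", "away": "B", "book_total": 50, "actual_total": 45},
    {"week": 5, "home": "A", "away": "B", "book_total": 48, "actual_total": 59},
    {"week": 6, "home": "A", "away": "B", "book_total": 56, "actual_total": 60},
    {"week": 6, "home": "B", "away": "A", "book_total": 60, "actual_total": 62},
]

record = walk_forward(snapshots, games, predict, threshold=3)  # (1, 2)
```

The week 3 game is skipped as too early, the first week 6 game is a pass because the edge is only 1 point, and the other two are one win and one loss.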

SquareBetNoMore · 10-27-15, 03:04 PM

Mikmak, I do this kind of thing and have had good results, but I agree with everyone above: gotta use stats from before the game is played. I run a refresh on Mondays before games to get any important stats. I then don't update it until I copy that week's results into a separate Excel spreadsheet that I use for back testing. Once I've copied that week of stats, I refresh the data so I have updated stats from the previous week. It's not too bad once you know how to use the macros and import functions in Excel. I do it for NCAAF, NCAAB, NBA, and MLB. NFL doesn't have enough games to give me a sample I can trust the system results on.