1. #1
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Regression queries

    How important is it to ensure that the assumptions of regression are not violated when sports modelling? It seems to me that every variable I'm investigating violates one or more of the assumptions.

    Should I be transforming these variables so that the regression assumptions are met? I've toyed with this already, but the predictive value (R squared) of the transformed equations is lower than it was when I didn't worry about assumption violations. I'm not sure what to make of this.

    Maybe I should be identifying and removing outliers to meet the regression assumptions?

    For the record, I'm using a population of 447 cases.

    Anyone have any thoughts?

  2. #2
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    Nearly every method I use violates some assumption. This is fine as long as you have a clean testing phase.

    What exactly are you trying to analyze?

  3. #3
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Clean testing phase? Does that mean out of sample testing?

    I'm trying to analyze whether a bunch of variables have any predictive capacity for my nominated dependent variable: the "MarginDifferential". In other words, the winning margin.

    Cheers for replying

  4. #4
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    I would use 3 steps.
    1. Pick a few years. Regress "margin differential" as a function of whatever you are testing, on a game-by-game basis. If you have 2 seasons of NFL, you have over 500 data points.
    2. Once you have a relationship "of interest", do a forward-test in those same years. For example, pretend your regression says "margin = (yards per offensive play - league average) * 6 + (yards allowed per defensive play - league average) * (-6)". I would start around week 5. In week 5, look at a team's average off and def ypp, and test the prediction. How valid is it? There are two problems with this test: it's in-sample, and each successive week is dependent on earlier weeks, so you don't really have 250 data points in a year. It's still a good starting point. Step two is where you tweak it to make your best guess of what the relationship should be.
    3. Test on an out-of-sample year (or five). You still have the week-to-week dependency issue, but I haven't found this to be a real problem. (A rough code sketch of steps 1 and 2 follows below.)
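
    A minimal sketch of what steps 1 and 2 might look like in code, assuming a per-game table with hypothetical columns (week, team, margin, off_ypp, def_ypp); the column and file names are illustrative, not from Justin7's post:

    # Sketch of steps 1-2; column/file names are assumptions.
    import pandas as pd
    import statsmodels.api as sm

    games = pd.read_csv("games.csv")  # hypothetical per-game data

    # Centre each stat on the league average, as in the example equation.
    games["off_edge"] = games["off_ypp"] - games["off_ypp"].mean()
    games["def_edge"] = games["def_ypp"] - games["def_ypp"].mean()

    # Step 1: regress margin on the centred stats, game by game.
    X = sm.add_constant(games[["off_edge", "def_edge"]])
    fit = sm.OLS(games["margin"], X).fit()
    print(fit.params)  # e.g. something like +6 on offence, -6 on defence

    # Step 2: forward-test from week 5 on, using only stats from earlier weeks.
    def predict_week(df, week):
        """Predict margins for `week` from each team's averages in weeks < week."""
        form = df[df["week"] < week].groupby("team")[["off_edge", "def_edge"]].mean()
        slate = df[df["week"] == week]
        feats = form.reindex(slate["team"]).fillna(0.0)
        coefs = fit.params[["off_edge", "def_edge"]].to_numpy()
        return fit.params["const"] + feats.to_numpy() @ coefs

    # e.g. predictions = predict_week(games, week=5)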

  5. #5
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    I've already been long-winded, and I've still greatly abbreviated this and left a lot out. One other obvious point though -- your test should consider both teams, not just one. And a less obvious one -- accounting for strength of schedule will give you a lot more precision, especially in weeks 5-8.
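
    One standard way to bake strength of schedule into ratings (not necessarily the book's approach) is a Massey-style least-squares fit: solve for the set of team ratings that best explains every game's margin simultaneously, so each team's rating is automatically adjusted for who it played. A minimal sketch, with assumed columns home, away, margin:

    # Massey-style ratings: margin ~ rating[home] - rating[away].
    import numpy as np
    import pandas as pd

    def massey_ratings(games: pd.DataFrame) -> pd.Series:
        teams = sorted(set(games["home"]) | set(games["away"]))
        idx = {t: i for i, t in enumerate(teams)}
        X = np.zeros((len(games), len(teams)))
        for row, g in enumerate(games.itertuples(index=False)):
            X[row, idx[g.home]] = 1.0   # home team contributes +rating
            X[row, idx[g.away]] = -1.0  # away team contributes -rating
        # Ratings are only identified up to a constant, so pin their sum to 0.
        X = np.vstack([X, np.ones(len(teams))])
        y = np.append(games["margin"].to_numpy(), 0.0)
        ratings, *_ = np.linalg.lstsq(X, y, rcond=None)
        return pd.Series(ratings, index=teams).sort_values(ascending=False)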

  6. #6
    Miz
    Join Date: 08-30-09
    Posts: 695
    Betpoints: 3162

    Quote Originally Posted by Justin7 View Post
    accounting for strength of schedule

    One of the best topics in your book ... your approach and the concise description of it are what I liked best. Nice read overall. Nice job.

  7. #7
    xbalto
    Join Date: 10-14-10
    Posts: 106
    Betpoints: 847

    In ordinary least squares (linear) regression, a nice fact is that your parameter estimates will converge to the right answer even if your data are non-normal (don't follow a Gaussian/bell curve). Whether you have enough data for this effect to kick in is a separate question.
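
    A quick simulation of that point; the numbers here are made up, but it shows the OLS slope estimate tightening around the true value even with heavily skewed (exponential) noise:

    # OLS stays consistent under non-Gaussian noise; watch the estimate converge.
    import numpy as np

    rng = np.random.default_rng(0)
    true_slope = 6.0
    for n in (50, 500, 5_000, 50_000):
        x = rng.normal(size=n)
        noise = rng.exponential(scale=10.0, size=n) - 10.0  # skewed, mean zero
        y = true_slope * x + noise
        slope = np.polyfit(x, y, 1)[0]  # ordinary least-squares line fit
        print(n, round(slope, 3))       # drifts toward 6.0 as n grows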

  8. #8
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Hi guys, thanks for the advice. I've got your book, Justin, and have read it a couple of times. It's a great read. I'm not trying to model the NFL though; it's actually quite a niche area: second-tier Australian Rules football. I think Australian sportsbooks are plump, lazy and not as receptive to market dynamics as their offshore counterparts. I don't think it would take much to beat the line here in my nominated area.

    There's one particular second-tier league that I've managed to get 5 years' worth of data for. However, only nine teams play a total of 92 games in the regular season, and by round 5 only 16 games have been played in the first 4 rounds. Will that satisfy your contention, Justin, that four weeks is enough to start forward testing?

  9. #9
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    Quote Originally Posted by brettd View Post
    Hi guys, thanks for the advice. I've got your book, Justin, and have read it a couple of times. It's a great read. I'm not trying to model the NFL though; it's actually quite a niche area: second-tier Australian Rules football. I think Australian sportsbooks are plump, lazy and not as receptive to market dynamics as their offshore counterparts. I don't think it would take much to beat the line here in my nominated area.

    There's one particular second-tier league that I've managed to get 5 years' worth of data for. However, only nine teams play a total of 92 games in the regular season, and by round 5 only 16 games have been played in the first 4 rounds. Will that satisfy your contention, Justin, that four weeks is enough to start forward testing?
    The weaker the league, the less data you need. Even with just 16 games, you probably have enough. Maybe even with 12, since it's such a small market. It's more important what you think, though. How predictive is the data with just 3 weeks? Or 4?

  10. #10
    yak merchant
    Join Date: 11-04-10
    Posts: 109
    Betpoints: 6170

    Quote Originally Posted by Justin7 View Post
    I would use 3 steps.
    1. Pick a few years. Regress "margin differential" as a function of whatever you are testing, on a game-by-game basis. If you have 2 seasons of NFL, you have over 500 data points.
    2. Once you have a relationship "of interest", do a forward-test in those same years. For example, pretend your regression says "margin = (yards per offensive play - league average) * 6 + (yards allowed per defensive play - league average) * (-6)". I would start around week 5. In week 5, look at a team's average off and def ypp, and test the prediction. How valid is it? There are two problems with this test: it's in-sample, and each successive week is dependent on earlier weeks, so you don't really have 250 data points in a year. It's still a good starting point. Step two is where you tweak it to make your best guess of what the relationship should be.
    3. Test on an out-of-sample year (or five). You still have the week-to-week dependency issue, but I haven't found this to be a real problem.
    Threadjack warning! But the weeks 5 to 8 comment got me thinking. I'm new around here, but I've been doing this whole computer handicapping thing longer than most. I'm far from a "modeling" expert (actually, based on recent results, I've probably just built an overly complex suck machine), but I've recently realized that if I had waited until week 6 to start using my "model", I would have a lot more money than I have now. So lately I've been toying with a new genetic algorithm toolbox, and I built a model using the "wrong data", and the results were surprising.

    Historically, for week 6 I'd take the teams, roll up the data for the first 5 weeks, and use whatever function fits the week I'm on to try to predict the score. I'd take that data and the results and archive them for use in the model. Rinse, repeat. So in the end I always had a data set whose input variables contained information only up to the previous week; i.e., the input variables are averages from weeks 1 to 7 and the output would be the score in week 8. So I was trying this thing out last weekend, let's say to predict week 8. But instead of my normal process, I screwed up and built a complete data set using the matchups from week to week but the averages from the most recent week. I seem to remember trying it years ago and thinking I was having over-"smoothing" issues and poor results. But maybe I just never gave it enough of a chance. So I've been mulling the pros and cons of this all week, and I can't come up with a really good argument for or against. Just wondering if you or anyone else has any thoughts before I spend my already endangered free time building models.
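
    For what it's worth, here's a sketch of the two dataset constructions being compared, with hypothetical columns team, week, ypp. The point-in-time version uses only games before each matchup; the accidental version stamps every past matchup with the season-to-date averages, which leaks later games into earlier training rows:

    # Contrast: leak-free rolling averages vs. latest-snapshot averages.
    import pandas as pd

    def point_in_time(df: pd.DataFrame) -> pd.Series:
        """Average ypp over all weeks strictly before each game (no leakage)."""
        df = df.sort_values(["team", "week"])
        return df.groupby("team")["ypp"].transform(
            lambda s: s.expanding().mean().shift(1)
        )

    def latest_snapshot(df: pd.DataFrame, through_week: int) -> pd.Series:
        """Season-to-date average through `through_week`, applied to every row."""
        snap = df[df["week"] <= through_week].groupby("team")["ypp"].mean()
        return df["team"].map(snap)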

    Thanks in advance. On another note, I haven't read a book on gambling in years and years but look forward to reading yours over the holidays.

  11. #11
    pedro803
    Join Date: 01-02-10
    Posts: 309
    Betpoints: 5708

    yak merchant, I like to think of it in terms of what they call "moving averages" in stock trading terminology -- and I don't think there is any question that you can add some power by weighting more recent data more heavily, because the point is to predict what will happen next, not to predict what the cumulative end-of-year average will be.

    But of course, how heavily to weight the recent data, where to draw the line on what counts as recent, and how to draw it -- is it black and white (recent and not recent), or do you gradate the older data out little by little? These are all questions that must be answered over and over again with each new piece of data that you use in your model. If you really want to get into it, look into the term "data transformation", which encompasses much more than just weighting recent data more heavily, but is a very important part of modeling for eggheads.
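
    One concrete version of gradating older data out little by little is an exponentially weighted moving average, where a single span parameter controls how fast old games fade instead of a hard recent/not-recent cutoff. A sketch with assumed columns team, week, margin:

    # Exponentially weighted recent form; span controls how fast old games fade.
    import pandas as pd

    def recent_form(df: pd.DataFrame, span: float = 4.0) -> pd.Series:
        df = df.sort_values(["team", "week"])
        return df.groupby("team")["margin"].transform(
            lambda s: s.ewm(span=span, adjust=False).mean().shift(1)  # past games only
        )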

    Sounds to me like what you really need to do is slow down, keep really good records of your results, and then use those results to hone your model towards being more accurate. FWIW, you are ahead of me, because I am still trying to learn to build a database and scrape the internet!

    Good Luck!!

  12. #12
    yak merchant
    Join Date: 11-04-10
    Posts: 109
    Betpoints: 6170

    Quote Originally Posted by pedro803 View Post
    yak merchant, I like to think of it in terms of what they call "moving averages" in stock trading terminology -- and I don't think there is any question that you can add some power by weighting more recent data more heavily, because the point is to predict what will happen next, not to predict what the cumulative end-of-year average will be.

    But of course, how heavily to weight the recent data, where to draw the line on what counts as recent, and how to draw it -- is it black and white (recent and not recent), or do you gradate the older data out little by little? These are all questions that must be answered over and over again with each new piece of data that you use in your model. If you really want to get into it, look into the term "data transformation", which encompasses much more than just weighting recent data more heavily, but is a very important part of modeling for eggheads.

    Sounds to me like what you really need to do is slow down, keep really good records of your results, and then use those results to hone your model towards being more accurate. FWIW, you are ahead of me, because I am still trying to learn to build a database and scrape the internet!

    Good Luck!!

    Thanks for the reply. You are definitely right that I need to slow down and get a better handle on what the system can and can't do. I think my next project really needs to be a rewrite of parts of the code, to provide a better (more efficient, parallel) backtesting system and to move all my data from multiple databases (one per year) into one gigantic database. Right now, changing a model and getting the results is like going to the dentist.

    Yes, I've played with building the model on more recent data over the years. I've also used moving averages and a version of a stochastic oscillator in baseball in the past. In football I've played with using the last 3, 4 or 5 data points. I still struggle with it, because the limited sample size may help capture the "improving/declining" aspect of a team, but in my experience that is usually outweighed by what I lose in the lack of data for my crazy strength of schedule calculations. What I haven't really tried is taking the most recent averages/medians, applying those to matchups in the past, and rebuilding the model.

    After thinking about it a while, I need to revisit how the whole thing is structured and which components can be parallelized, so that I can get to the bottom of what I do and don't have in an efficient manner. It also probably wouldn't hurt for me to dust off a few of the statistics books I have and relearn what was long lost in the cloud of malted hops and barley.
