1. #1
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Question relating to single variable regression vs multi-variable regression output

    Hey everyone,
    I'm trying to follow the logic of Justin's NFL points-based model (although applying it to Australian football). I have a particular statistic called the kick differential (the difference in the number of kicks between the home and away sides).

    If I run a single regression of the HomeKickDifferential (as the IV) with HomeMarginOfVictory (as the DV), I see that the equation is:

    HomeMarginOfVictory = 1.711 + 1.048 * HomeKickDifferential

    A HomeKickDifferential of 1 corresponds to a 1.048-point increase in the HomeMarginOfVictory.


    Now if I add more IVs to test against the HomeMarginOfVictory, this 1.048 figure gets lower with each new variable I add. For example, if I add the IV HomeHandballDifferential (the difference in the number of handballs between the home and away sides) to this equation, I now get:

    HomeMarginOfVictory = 1.104 + (0.892 * HomeKickDifferential) + (0.345 * HomeHandballDifferential)

    The impact of the HomeKickDifferential has now dropped. It keeps changing (and getting lower) every time I add more variables.
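    A small synthetic sketch (hypothetical numbers, not anyone's real data) reproduces exactly this behaviour: when the two differentials are positively correlated, the simple-regression slope on kicks absorbs part of the handball effect, and adding the second variable pulls it back down.

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gaussian elimination. Each row of X includes a leading 1
    for the intercept."""
    n, k = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    A = [XtX[i] + [Xty[i]] for i in range(k)]        # augmented matrix
    for col in range(k):                             # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * k                                    # back substitution
    for i in reversed(range(k)):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

random.seed(1)
n = 2000
kicks = [random.gauss(0, 10) for _ in range(n)]
# handball differential positively correlated with kick differential
handballs = [0.6 * k + random.gauss(0, 8) for k in kicks]
# "true" model: margin depends on both differentials
margin = [1.7 + 0.9 * k + 0.35 * h + random.gauss(0, 12)
          for k, h in zip(kicks, handballs)]

b_simple = ols([[1.0, k] for k in kicks], margin)    # kicks only
b_multi = ols([[1.0, k, h] for k, h in zip(kicks, handballs)], margin)

# The simple slope absorbs the handball effect it cannot see;
# the multiple-regression slope recovers roughly the true 0.9.
print("kicks alone:         ", round(b_simple[1], 3))
print("kicks with handballs:", round(b_multi[1], 3))
```

    The coefficient was never "wrong" in either run; the two regressions answer different questions (the total association vs. the association holding handballs fixed).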

    What is going on? Which HomeKickDifferential coefficient do I actually incorporate into my final formula? Am I going to have to change my figures every time I add a new variable?

    Any help would be really appreciated. Cheers.

  2. #2
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Then I need to figure out how to factor in the 'luck' inherent in these differentials, which I have no idea how to do. Justin (or anyone else): is there any quick-and-dirty way to get a ballpark figure for the 'luck' in any given type of differential?

  3. #3
    bztips
    Join Date: 06-03-10
    Posts: 283

    Sounds like you need to read up on the basics of linear regression.

    Yes, every time you add a new variable the other coefficients are very likely to change. How much they change depends partly on the correlation among the explanatory variables themselves. You should use the coefficients from whatever final equation you end up with. But you need to look not only at the coefficients but also at whether they are statistically significant (look at their t-statistics).

    As an aside, your method of adding one variable at a time and seeing whether it's significant is basically a data-mining approach, which calls into question whether the overall reported statistical significance of your results is reliable.

  4. #4
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Hey thanks for the advice bztips.

    Quote Originally Posted by bztips View Post
    Sounds like you need to read up on the basics of linear regression.

    As an aside, your method of adding one variable at a time and seeing whether it's significant is basically a data-mining approach, which calls into question whether the overall reported statistical significance of your results is reliable.

    I'm only doing this variable by variable because I'm following the logic behind the points-based model in Justin's 'Conquering Risk' book. He explores concepts individually (turnovers, strength of schedule, third-down conversion attempts) and then tacks their individual coefficients onto the P(off) and P(def) equations (after adjusting for luck). I gathered I had to do the same thing with my chosen statistics.


    I've just now had a go at combining my four separate IVs and examining their effect on the HomeMarginOfVictory. The t-stats are all significant down to the p<0.01 level. However, the coefficient of one particular statistic (let's call it x) is completely counter-intuitive. A one-unit increase in x decreases the HomeMarginOfVictory by 0.407 points (I expected the opposite). I've corroborated this by checking the 'Away' multiple regression formula, which yields the same results.

    However, x on its own has a positive coefficient when compared against the HomeMarginOfVictory. This completely confuses me.

    I understand that the variables are interacting with each other, but can they interact to the extent that the coefficient on x flips sign (and x is still a legitimate addition to the formula)?

  5. #5
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    The more noise you add, the less everything else will seem to matter. Figuring out the optimal coefficients is very complex. If you regress long enough, you'll eventually remove all information from your equations.

  6. #6
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Quote Originally Posted by Justin7 View Post
    The more noise you add, the less everything else will seem to matter.
    Are you saying 'noise' as in garbage variables making legitimate variables 'seem' to matter less?

    Quote Originally Posted by Justin7 View Post
    If you regress long enough, you'll eventually remove all information from your equations. Figuring out the optimal coefficients is very complex.
    Well, in saying that, what does this mean for your turnover differential coefficient in the NFL points-based model explained in 'Conquering Risk'? This is confusing, as I understood the turnover coefficient to be created solely by comparing the average margin of victory against the turnover differential. You said the reader could verify this coefficient by conducting the same analysis.

    If this is not the case, is it a matter of just picking one of a number of coefficient alternatives generated by my regressions, and then examining its effect on the average results error? Although I think in that case it may become a data-mining exercise.

    I'm still at a loss as to how to move forward.

  7. #7
    bztips
    Join Date: 06-03-10
    Posts: 283

    I don't want to sound rude, but you're mistaken if you think you're going to be able to really understand what is going on with your regression modeling by asking these types of questions in this forum -- because I guarantee that every time someone here gives you a reasonably legit answer, it's only going to open up a whole new set of questions as you go along.

    Seriously, your best next step would probably be to get an intro-to-regression-modeling book (or do a Google search on linear regression) and spend lots of time learning the basics.

  8. #8
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    I finished first semester of an Applied Stats Post-Grad Diploma in December. I actually did pretty well in my "Introduction to Correlation/Regression" subject. It's just that we did not touch on multiple regression at all.

    I'm doing multiple & logistic regression this semester starting next week though.

    I just didn't think finding the right coefficient figure was this complicated. 'Conquering Risk' made it seem a lot easier than it actually is.

  9. #9
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    brettd,

    The more things you look at, the more everything gets diluted. Garbage or not, every variable you add will water down the good stuff. A good regression model is a bit like cooking: you're trying to find the ideal ratios of different components. I have never found a perfect way to do this, but I don't have a math Ph.D. either.

    I have found that running linear regressions gives me a good baseline. Using those values in a multi-variable model works pretty well, even though a multi-variable regression gives you very different results.

  10. #10
    bztips
    Join Date: 06-03-10
    Posts: 283

    Quote Originally Posted by brettd View Post
    I finished first semester of an Applied Stats Post-Grad Diploma in December. I actually did pretty well in my "Introduction to Correlation/Regression" subject. It's just that we did not touch on multiple regression at all.

    I'm doing multiple & logistic regression this semester starting next week though.

    I just didn't think finding the right coefficient figure was this complicated. 'Conquering Risk' made it seem a lot easier than it actually is.
    That's even better -- you will definitely have a much greater understanding of what's going on after finishing that course. Good luck.

  11. #11
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Quote Originally Posted by Justin7 View Post
    brettd,

    The more things you look at, the more everything gets diluted. Garbage or not, every variable you add will water down the good stuff. A good regression model is a bit like cooking: you're trying to find the ideal ratios of different components. I have never found a perfect way to do this, but I don't have a math Ph.D. either.
    Ah, that makes a lot more sense to me now.

    Quote Originally Posted by Justin7 View Post
    I have found that running linear regressions gives me a good baseline.
    Just to be crystal clear in my mind: by linear regressions, do you mean single-variable linear regression on each explanatory (independent) variable?
    Last edited by brettd; 03-04-11 at 10:15 PM.

  12. #12
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Quote Originally Posted by bztips View Post
    That's even better -- you will definitely have a much greater understanding of what's going on after finishing that course. Good luck.
    Thanks bztips. Last year I chose to spend one year (at least:- I may go for the Master's) full time back at university, just so I can make my own sport models.

    Am I taking sport gambling degeneration to the next level? Or was it an astute decision? Anyway, time will tell.

    1 semester down, and at least 1 to go.

  13. #13
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    Quote Originally Posted by brettd View Post
    Just to be crystal clear in my mind: by linear regressions, do you mean single-variable linear regression on each explanatory (independent) variable?
    Yes.

  14. #14
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Cool thanks Justin.

  15. #15
    bztips
    Join Date: 06-03-10
    Posts: 283

    Given what I just said, I'm contradicting my own advice now, but I can't let this pass. Testing separate regressions on single variables one at a time is NOT good statistical practice -- it's akin to what's known as "stepwise regression", an extreme form of data mining. To be avoided if at all possible.

  16. #16
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Hhhmm.... Well bztips, I might leave that one for Justin to answer.

    In saying this, there's one thing I'm confident of (I think). Some of my coefficients (a couple of the main ones) seem to be resistant to change no matter how many variables I include or what combination they're in. Sure, there is minor variation in the value of these coefficients from regression to regression, but on the whole they have fairly consistent values.

    Could I consider these consistent coefficients as 'locks' that can be added to my model?

  17. #17
    jscol
    Join Date: 11-20-09
    Posts: 403
    Betpoints: 2293

    I've actually been taught quite the opposite in university classes: that the more variables the better. Even if you are adding insignificant variables, this will help the overall regression and make it more meaningful.

  18. #18
    bztips
    Join Date: 06-03-10
    Posts: 283

    Quote Originally Posted by brettd View Post
    Hhhmm.... Well bztips, I might leave that one for Justin to answer.

    In saying this, there's one thing I'm confident of (I think). Some of my coefficients (a couple of the main ones) seem to be resistant to change no matter how many variables I include or what combination they're in. Sure, there is minor variation in the value of these coefficients from regression to regression, but on the whole they have fairly consistent values.

    Could I consider these consistent coefficients as 'locks' that can be added to my model?
    Yes, those variables are "locks". The ones that jump around are likely the result of collinearity with the other variables you're testing.

  19. #19
    specialronnie29
    Join Date: 09-19-10
    Posts: 140

    Quote Originally Posted by Justin7 View Post
    The more noise you add, the less everything else will seem to matter. Figuring out the optimal coefficients is very complex. If you regress long enough, you'll eventually remove all information from your equations.
    so wrong. so so wrong.

    as for thread starter - bztips is the smartest in thread so far. listen to him. you need to understand how linear regression works to understand why these things are happening and what they mean and how you should choose a final model. justin obviously does not understand this.

    to answer your question there is no definitive right model. i have always had a few competing models and then bet when i have a consensus among them. if the predictions are wildly different, there is a problem. if they are off by only a point or two that is a good sign.

    now here is an example to help you understand what is going on with your model as you add more explanatory variables.

    suppose you have game results for nba for the past few years and you think that the MOV of a team tonight depends upon its MOV in its last game. so you regress MOV on MOV last game.

    you will get a statistically significant coefficient because in general teams that win their last game by a lot are good teams and are more likely to win the next game by a lot too. the true variable is team strength but MOV yesterday is just correlated with that variable. we know that in reality yesterday's mov is almost irrelevant today.

    so lets say i now give you data on every team's players and how many points they score per game etc and you include all these variables in your regression. then the coefficient on MOV last game will become insignificant and close to 0. this is not because i added noise. it's because when measures of team strength are considered, mov yesterday is found to be an irrelevant factor.

    in principle including more variables can increase the magnitude of an existing coefficient, in contrast to what has been said in this thread. it depends on whether the correlation between certain explanatory variables is positive or negative and in general most explanatory variables will be correlated in one way or another.

    so what do you take from this? what if you don't realize that MOV yesterday should be irrelevant? it is possible you can end up using a wrong model. the answer is you need a theory. it is dangerous to just throw a bunch of stuff in a regression and go with it.

    ultimately you will always have the problem of whether to use the model with x variables or the one with x+1 variables. the main way to answer that is to see whether the extra variable is statistically significant or not. if the t-stat is close to 0, just forget it. the one thing justin said that was ok is that you need to play with your data and run a lot of regressions just to see what is driving what. but then you need to include everything you think could possibly be relevant and that you have data for, and then trim it down from there (if at all).

    now go find a book on linear regression and look up omitted variable bias -- this is the key part you need to know.

  20. #20
    specialronnie29
    Join Date: 09-19-10
    Posts: 140

    Quote Originally Posted by bztips View Post
    The ones that jump around are likely the result of collinearity with the other variables you're testing.
    this is a fact

  21. #21
    specialronnie29
    Join Date: 09-19-10
    Posts: 140

    Quote Originally Posted by jscol View Post
    I've actually been taught quite the opposite in university classes: that the more variables the better. Even if you are adding insignificant variables, this will help the overall regression and make it more meaningful.
    adding irrelevant variables will not bias the coefficients on the relevant variables but it will increase the coefficients' standard errors. this is a problem if you care about whether a coefficient is significant but for prediction it doesnt really matter so you are right. always err on side of too many. what you will often find is that these extra variables dont change the predictions much so its no big deal
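    The standard-error inflation described here has a textbook measure: the variance inflation factor, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing variable j on all the other IVs. A minimal sketch:

```python
def vif(r2_j):
    """Variance inflation factor for predictor j, where r2_j is the R-squared
    from regressing predictor j on all the other predictors. The variance of
    coefficient j is inflated by this factor relative to the uncorrelated
    case, so its standard error grows by sqrt(vif)."""
    return 1.0 / (1.0 - r2_j)

print(vif(0.00))  # uncorrelated with the other IVs: no inflation
print(vif(0.75))  # 4x the variance, i.e. double the standard error
print(vif(0.90))  # a common rule of thumb flags VIF at or above 10
```

    This is why adding correlated-but-irrelevant variables widens the error bars on the coefficients you actually care about, even when it barely moves the predictions.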

  22. #22
    bztips
    Join Date: 06-03-10
    Posts: 283

    Quote Originally Posted by specialronnie29 View Post
    adding irrelevant variables will not bias the coefficients on the relevant variables but it will increase the coefficients' standard errors. this is a problem if you care about whether a coefficient is significant but for prediction it doesnt really matter so you are right. always err on side of too many. what you will often find is that these extra variables dont change the predictions much so its no big deal
    Be careful here. A potentially important caveat: with the "kitchen sink" approach to adding variables, you may find a "significant" relationship just by chance -- again, data mining.

  23. #23
    subs
    Join Date: 04-30-10
    Posts: 1,412
    Betpoints: 969

    great thread, gentlemen -- thanks.

    the tank IS still alive.

    (I only post answers to the simple stuff because I think it'd be good if the more advanced members of the community had more time to address the more complex issues, instead of having to answer the same questions again and again)

    THANK U

  24. #24
    specialronnie29
    Join Date: 09-19-10
    Posts: 140

    Quote Originally Posted by bztips View Post
    Be careful here. A potentially important caveat: with the "kitchen sink" approach to adding variables, you may find a "significant" relationship just by chance -- again, data mining.
    if your theory is reasonable, then you have to have a reason for adding each variable

    the term data mining is used lots on this forum because of all the guys who propose systems w/ no theory and ask if they have found the holy grail

    ignoring these fools --- the only real data-mining concern for people who have half an idea of what they're doing is backtesting the model on the same data from which it was derived.

  25. #25
    brettd
    Join Date: 01-25-10
    Posts: 229
    Betpoints: 3869

    Hey guys, awesome posts! I think there's a few gems for everyone to read there.

    As for where I'm at with my regressions currently: I think I've found the '500-pound gorilla' variable that has a massive impact on my regressions. My seeming coefficient 'locks' were blown out of the water. My HomeScoringShotsDifferential (the difference in scoring shots between the home and away sides) has an r-squared value of 0.866 when regressed against the margin of victory. And when combined with a second variable, HomeAccuracy, the r-squared goes up to 0.988.

    As Justin mentioned in his Yards-Per-Play chapter in 'Conquering Risk', scoring is merely a shadow of a team's offensive potential, and there are better indicators of future scorelines.

    Given the magnitude of the r-squared value for this single variable, should I be modelling a team's ability to generate (and prevent) scoring shots, rather than points scored? Have I found my Yards-Per-Play equivalent for Australian Rules Football?

  26. #26
    mjespoz
    Join Date: 02-15-11
    Posts: 42
    Betpoints: 543

    Hey brettd,

    It's easy to get excited about results when first building models, but you need to step back and think logically about what you're doing. From your original post, your DV is HomeMarginOfVictory and your IVs are HomeScoringShotsDifferential and HomeAccuracy? So if the home team has more shots at goal, and its ratio of goals to behinds is better than the away team's, then its winning margin is greater... This is correct, but I'm not sure how useful it is: scoring shots and accuracy together essentially re-describe the final score. Always be cautious when model-fit statistics (such as R-squared) come out that high. And I hope that's adjusted R-squared you're looking at -- always use adjusted.
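    For reference, the adjustment mjespoz means penalizes R-squared for the number of predictors; a minimal sketch (the sample sizes below are made up for illustration):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fit on n observations.
    Unlike plain R-squared, it can fall when a useless variable is added."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# with a hypothetical sample of 200 games and 2 IVs, the penalty is tiny...
print(round(adjusted_r2(0.988, 200, 2), 4))
# ...but with many predictors and few games it bites much harder
print(round(adjusted_r2(0.988, 30, 20), 4))
```

    The gap between plain and adjusted R-squared is a quick sanity check on whether a high fit is earned or just bought with extra variables.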

    Good luck mate.

    Cheers,
    mjespoz
