Building a predictive model - what to look for?

**Red_Sux** · 03-25-08, 01:20 AM

i don't trust the stats other than MLB. could you work on that model instead of NBA?

**Arnold** · 03-25-08, 03:01 AM

Originally posted by Red_Sux

i don't trust the stats other than MLB. could you work on that model instead of NBA?

Lol. I don't have any betting experience with MLB yet, so I thought NBA would be more appropriate. If I can find anything in the NBA, I'm sure I could use the same techniques in other sports as well. I just need to get the fundamentals straight.

**BuddyBear** · 03-25-08, 11:23 AM

well first things first, you need to identify all variables you believe will help you predict your dependent variable. There could be literally 100s of variables here to sort through. You'll need these not just to figure out which predictors are strongest, but also for statistical control in the model.

Typically, when you construct a multivariate model, you should have some sort of theory that is guiding you in how you choose your predictor variables. If there is no theory, then there is a really good chance that the model you are constructing won't work unless you get lucky and find a perfect combination of variables. Otherwise, you might want to try something called stepwise regression since that is a form of regression that does not really rely on theoretical considerations.

Good luck, I've tried this in the past and it is really hard to do....

**Arnold** · 03-25-08, 11:54 AM

Originally posted by BuddyBear

well first things first, you need to identify all variables you believe will help you predict your dependent variable. There could be literally 100s of variables here to sort through. You'll need these not just to figure out which predictors are strongest, but also for statistical control in the model.

For now I'm playing with the variables I have and seeing where it gets me. Then it is just a matter of collecting/organizing more data for the model.

Typically, when you construct a multivariate model, you should have some sort of theory that is guiding you in how you choose your predictor variables. If there is no theory, then there is a really good chance that the model you are constructing won't work unless you get lucky and find a perfect combination of variables. Otherwise, you might want to try something called stepwise regression since that is a form of regression that does not really rely on theoretical considerations.

Yeah, I was gonna look into the stepwise regression as well. There is so much to learn yet.

Good luck, I've tried this in the past and it is really hard to do....

Did you succeed? How much time did it take you?

**BuddyBear** · 03-25-08, 12:19 PM

Originally posted by Arnold

Did you succeed? How much time did it take you?

I tried collecting data but unfortunately I fell very far behind. Literally to construct a model like this you need at least 60 hours a week and you'll probably need someone to help you out. I think the best bet is to see if you can find some existing data out there and use that. Collecting your own data is very very time consuming.....

**Arnold** · 03-25-08, 12:32 PM

Well, I wouldn't type it in myself. It is all automated. All I need is to write the code and the rest is done by computer.

Assuming you can get all the possible variables and plug them into the regression analysis, you would certainly find a predictive model?

**butters** · 03-25-08, 09:16 PM

Arnold,

I have a couple of thoughts that might help:

1) There are others on here that know much, much more about statistics than I do, but from what I remember, the 'R-Sq' tells you how much of the variation in the dependent variable (total points scored) is explained by the variation in your independent variables. So it's not really saying that you have 23.8% of the factors, but rather that the factors you have account for 23.8% of the variation in total points.

2) Having a model with a high R-Sq is certainly nice, but it's more important to have independent variables that are significant predictors. You can determine whether a predictor is significant by looking at the p-value, which is given by the last value in the rows for pvscore and phscore, with lower values corresponding to greater significance. Fortunately, the two factors you have right now are both significant for any reasonable confidence level, so that's good. Try to make sure that most of the variables you use are significant.

3) Going forward, though, I would caution against trying to get 'all possible variables', dump them into a regression model, and have Minitab sort it out. First, it's probably not possible to get 'all possible variables', since there are tons of ways to construct variables from raw data, and you could look at any of those metrics for the entire season, the last month, the last week, previous games against opponent, etc. Which statistics and splits are best? I don't know. But dumping them all into a model would result in severe multicollinearity, which is bad. Like BuddyBear said, it would be better to try to develop and test specific theories as opposed to trying to find tons of different variables and throwing them all together.

Hopefully this helps a little bit. Again, there are others here who know a lot more than I do about stats, so follow their advice over mine. I do think this is a good approach to take, so good luck and keep us posted.

**Arnold** · 03-25-08, 10:18 PM

About the p-values. I know this is supposed to tell how significant a variable is. The thing is the value for the same variable changes depending on your other independent variables in the equation. Sometimes the value becomes too high to be significant. That's why I don't know how much I can trust these values, although they do serve me as a guide.
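One way to see why the p-value for the same variable moves around: when two predictors carry nearly the same information, the standard error of each coefficient gets inflated by the square root of the variance inflation factor (VIF), which for two predictors is 1 / (1 - r^2). Below is a rough sketch in Python. The scoring averages and variable names are made up for illustration and are not from Arnold's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when x1 and x2 are the only two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical data: season scoring average vs. last-month scoring average.
# The two columns track each other closely, as overlapping splits tend to do.
season_avg = [98, 101, 104, 99, 107, 110]
last_month_avg = [97, 102, 105, 98, 108, 111]

vif = vif_two_predictors(season_avg, last_month_avg)
print(f"VIF = {vif:.1f}, standard errors inflated about {sqrt(vif):.1f}x")
```

A VIF near 1 means the other predictor is not distorting the estimate. A large VIF means the two variables are fighting over the same signal, so either one can look insignificant depending on what else is in the equation. A common rule of thumb treats values above roughly 5 to 10 as a warning sign.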

**BuddyBear** · 03-25-08, 10:36 PM

Originally posted by Arnold

Well, I wouldn't type it in myself. It is all automated. All I need is to write the code and the rest is done by computer.

Assuming you can get all the possible variables and plug them into the regression analysis, you would certainly find a predictive model?

Not necessarily and even if you were able to find strong predictors, without theory there is not much value to it.

Remember, theory helps to explain, describe, and predict. The lack of theory makes it difficult to construct a strong multivariate model.

**BuddyBear** · 03-25-08, 10:38 PM

Also, if you can get a copy of SPSS or Stata, it is much better than Minitab. Minitab is certainly serviceable, but SPSS will enable you to do more things, especially in terms of graphing. But everyone has different opinions on statistical software packages....

**Arnold** · 03-25-08, 10:57 PM

Originally posted by BuddyBear

Also, if you can get a copy of SPSS or Stata, it is much better than Minitab. Minitab is certainly serviceable, but SPSS will enable you to do more things, especially in terms of graphing. But everyone has different opinions on statistical software packages....

Maybe, I don't know. For now Minitab is just fine. I'm a very basic user

**Cyclone** · 03-25-08, 11:43 PM

I can tell you a little about my experience. I have learned (again, in MY experience) the hard way that:

- NBA games are very difficult to predict. Maybe with such high scores, the outcomes are more random?
- Over/unders in any sport seem to be unpredictable.
- For any model to work, it must be as simple as possible. I have found that only two variables at the most works the best.

**Arnold** · 03-26-08, 01:35 AM

Originally posted by Cyclone

I can tell you a little about my experience. I have learned (again, in MY experience) the hard way that:

- NBA games are very difficult to predict. Maybe with such high scores, the outcomes are more random?
- Over/unders in any sport seem to be unpredictable.

But if you look at the outcomes in terms of over/under, would you say they are more random? I know it is hard to predict the exact final score, but predicting merely the over or under should be easier the way I see it.

- For any model to work, it must be as simple as possible. I have found that only two variables at the most works the best.

I wish it was that simple. But something tells me that real life scores have more than 2 factors.

What I think a logical approach would be is a bit more complex than just a few variables. I think you need to break it down into the smallest bits and pieces you can. For example, I first need to figure out what makes up the score? That's easy, you shoot and you score. Just to verify it, I ran a regression analysis on fgm, ftm, tpm (3-pt fgm) variables:

Regression Analysis: pts versus fgm, ftm, tpm, rebdef, reboff, ast, fga

The regression equation is
pts = - 0.000000 + 2.00 fgm + 1.00 ftm + 1.00 tpm + 0.000000 rebdef
- 0.000000 reboff - 0.000000 ast + 0.000000 fga

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | -0.00000000 | 0.00000000 | * | * |
| fgm | 2.00000 | 0.00000 | * | * |
| ftm | 1.00000 | 0.00000 | * | * |
| tpm | 1.00000 | 0.00000 | * | * |
| rebdef | 0.00000000 | 0.00000000 | * | * |
| reboff | -0.00000000 | 0.00000000 | * | * |
| ast | -0.00000000 | 0.00000000 | * | * |
| fga | 0.00000000 | 0.00000000 | * | * |

S = 0 R-Sq = 100.0% R-Sq(adj) = 100.0%

I threw in some more variables just to make sure I understand the analysis correctly. So, fgm, ftm, and tpm make up the final score 100%.
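The perfect fit here is an accounting identity rather than a statistical finding. Assuming fgm counts all made field goals including threes (which is what the coefficient of 1 on tpm implies), every made field goal is worth 2, every three adds 1 more, and every free throw is worth 1. A small Python sketch with a made-up box score:

```python
def points(fgm, ftm, tpm):
    """Team points from the box score.

    Assumes fgm includes three-pointers, so each three contributes
    2 (as a field goal) plus 1 (the tpm bonus).
    """
    return 2 * fgm + ftm + tpm


# Hypothetical box score: 38 made field goals (7 of them threes), 18 free throws.
total = points(fgm=38, ftm=18, tpm=7)

# Cross-check by counting twos and threes separately.
twos = 38 - 7
assert total == 2 * twos + 3 * 7 + 18
print(total)
```

Because the relationship is exact, the standard errors are zero and Minitab has no T or P values to report, which is why those columns show asterisks.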

Now I need to break it down further. What makes up a fgm? A fga. Then I need to figure out what makes up a fga:

Regression Analysis: fga versus rebdef, reboff, st, to, bs

The regression equation is
fga = 66.2 + 0.199 rebdef + 1.01 reboff + 0.446 st - 0.558 to + 0.0850 bs

| Predictor | Coef | SE Coef | T | P |
|---|---|---|---|---|
| Constant | 66.159 | 1.033 | 64.07 | 0.000 |
| rebdef | 0.19869 | 0.02295 | 8.66 | 0.000 |
| reboff | 1.01089 | 0.03295 | 30.68 | 0.000 |
| st | 0.44561 | 0.04282 | 10.41 | 0.000 |
| to | -0.55769 | 0.03280 | -17.00 | 0.000 |
| bs | 0.08499 | 0.05107 | 1.66 | 0.096 |

S = 5.62090 R-Sq = 40.5% R-Sq(adj) = 40.3%

Only 40.5%, so I need to do more work on it. But that's just to demonstrate my logic. The solution would have multiple steps and multiple variables. If all the significant variables can be generated in the real world, then I think it is possible to build a predictive model like this. But if fga or anything relevant to us largely depends on what type of shoes a player wears, then I think my whole project is doomed
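For anyone reading the Minitab output above, here is a rough Python sketch of how the columns relate. T is just Coef divided by SE Coef, and P is the two-sided tail probability for that T. The coefficients and standard errors are copied from the output. The game inputs are made up, and the p-value uses a normal approximation because the thread does not give the sample size (so treat it as approximate, not as what Minitab computes internally).

```python
from statistics import NormalDist

# Coefficients and standard errors copied from the fga regression output.
COEF = {"const": 66.159, "rebdef": 0.19869, "reboff": 1.01089,
        "st": 0.44561, "to": -0.55769, "bs": 0.08499}
SE = {"const": 1.033, "rebdef": 0.02295, "reboff": 0.03295,
      "st": 0.04282, "to": 0.03280, "bs": 0.05107}


def predict_fga(rebdef, reboff, st, to, bs):
    """Plug one game's box-score numbers into the fitted equation."""
    return (COEF["const"] + COEF["rebdef"] * rebdef + COEF["reboff"] * reboff
            + COEF["st"] * st + COEF["to"] * to + COEF["bs"] * bs)


def t_and_p(name):
    """T statistic and approximate two-sided p-value (normal approximation)."""
    t = COEF[name] / SE[name]
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical game: 30 defensive rebounds, 10 offensive, 7 steals,
# 14 turnovers, 5 blocks.
print(round(predict_fga(30, 10, 7, 14, 5), 1))

t_bs, p_bs = t_and_p("bs")
print(round(t_bs, 2), round(p_bs, 3))
```

Blocked shots (bs) is the only predictor whose p-value is above the usual 0.05 cutoff, which matches the 0.096 in the table.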

**Justin7** · 03-26-08, 10:13 AM

Before you go crazy with numbers, make sure you are comparing apples to apples.

Did the lineup change? It's no longer the same apple.

Is it a competitive game, or a blowout? Competitive games tend to be higher scoring.

What is the pace rating for each team? does the pace rating change, based on if they are a favorite or underdog?

Before doing regressions on the total, look at smaller pieces to put your puzzle together.

**Arnold** · 03-26-08, 11:43 AM

Originally posted by Justin7

Before you go crazy with numbers, make sure you are comparing apples to apples.

Did the lineup change? It's no longer the same apple.

In the end the starting point of the model will be the line up. Everything will be adjusted based on it.

Is it a competitive game, or a blowout? Competitive games tend to be higher scoring.

This is true in some cases. It will depend on if both teams are ready to play D.

What is the pace rating for each team? does the pace rating change, based on if they are a favorite or underdog?

Pace is definitely something very important. It will be accounted for too.

The hardest thing will be of course as usual, to predict future numbers.

**The HG** · 03-26-08, 12:37 PM

Originally posted by Cyclone

I can tell you a little about my experience. I have learned (again, in MY experience) the hard way that:

- NBA games are very difficult to predict. Maybe with such high scores, the outcomes are more random?
- Over/unders in any sport seem to be unpredictable.

NBA totals have been by far the best type of bet for me lifetime.

**Cyclone** · 03-26-08, 09:31 PM

If anyone can make money betting totals, especially in the NBA, go ahead. I couldn't figure out a way. My theory is that teams want to win, and maybe they would like to beat the point spread if they could, but they don't seem to care how many points are scored in the game.

I'm very suspicious of complicated models, whether it is sports betting or the stock market or anything else. It's relatively easy to find a model that worked in the past, but they seem to unravel when applied to future events. Simpler is better, I have found, but again this is all my experience.

I would agree there are probably more than two variables involved in anything, but I would recommend trying to find the two most important variables.

**Rufus** · 03-28-08, 07:07 PM

Originally posted by Arnold

Standard deviation of 17.8. What is my goal here? Is it to bring down the standard deviation to a minimum? Would the model be more predictive if the standard deviation was, lets say, 4.6?

The goal is to build a model that systematically outperforms the line. A higher standard deviation for a coefficient estimate in a regression means that you are not able to estimate the effect confidently.

Originally posted by Arnold

Lines are way off too. Besides standard deviation, is there anything else I should look at? I have a very basic understanding of statistics, so I'm not sure what some of the numbers represent or what is of interest to me.

A low p-value (which indicates that the standard error--an estimate of the standard deviation--is small) means you can more specifically and accurately pinpoint the effect of an independent variable on a dependent variable. In any good model, you only want to use variables that are statistically significant predictors (i.e. they have low p-values).

Originally posted by Arnold

I have a question about the variable "R-Sq". The 23.8%, does that mean I only have 23.8% of all factors that make up an accurate predictive model? Do I need to search for the other 76.2%?

R-squared represents the fraction of the variation in your dependent variable that is explained by changes in your independent variable. This does NOT mean that you only have 23.8% of all factors needed for a good model. I have a very profitable baseball betting model (which uses more complex statistical analysis) and it only explains 3% of changes in who wins and loses. R-squared measures the "fit" of the model, but be careful--if you try to include too many variables to get the R-squared value higher, you will overfit the model and sacrifice predictive value.

**Arnold** · 03-28-08, 11:14 PM

Originally posted by modelman

R-squared represents the fraction of the variation in your dependent variable that is explained by changes in your independent variable. This does NOT mean that you only have 23.8% of all factors needed for a good model. I have a very profitable baseball betting model (which uses more complex statistical analysis) and it only explains 3% of changes in who wins and loses. R-squared measures the "fit" of the model, but be careful--if you try to include too many variables to get the R-squared value higher, you will overfit the model and sacrifice predictive value.

Are you saying your R-squared is only 3%, yet you have a very profitable model? That doesn't make any sense to me.

**Rufus** · 03-29-08, 03:10 AM

That is exactly what I'm saying. Although it's not exactly R-squared I'm using, since I'm not doing a simple linear regression (there is a stat called pseudo R-squared which essentially estimates R-squared in more complex models).

Know what you can and can't predict. And in baseball, the worst teams win nearly 40% of the time. I don't need a model that predicts everything. I just need to do better than the betting line.
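To put a number on "better than the betting line", here is a small Python sketch of the break-even win rate for a given price. The -110 price is an assumption (a common price on spreads and totals), not something stated in the thread, and the 55% win rate is purely illustrative.

```python
def breakeven_probability(american_odds):
    """Win probability at which a bet at these odds has zero expected profit."""
    if american_odds < 0:
        risk, win = -american_odds, 100
    else:
        risk, win = 100, american_odds
    return risk / (risk + win)


def expected_profit(win_probability, american_odds):
    """Expected profit per 1 unit risked at the given win probability."""
    if american_odds < 0:
        payout = 100 / -american_odds
    else:
        payout = american_odds / 100
    return win_probability * payout - (1 - win_probability)


print(round(breakeven_probability(-110), 4))  # about 0.5238
print(round(expected_profit(0.55, -110), 4))  # about 0.05 units per unit risked
```

At that price, a bettor who is right only a few points more often than a coin flip is already profitable. That is how a model can explain very little of the variation in outcomes and still make money.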

**BuddyBear** · 03-29-08, 04:06 PM

R-squared is a very tricky variable. The conventional wisdom is that the higher the R-squared the better. But that isn't always the case. In reality, the more predictors you put into a model, your R-squared is going to increase regardless of how stupid they are.

For example, I could have a model trying to predict income and my variables could be income and seniority and have an R-squared of let's say .312.

I could have another model with the same DV and have income, seniority, and dick size as my 3 predictor variables, and my R-squared will go up regardless, to say .324. You have to be very careful in your interpretation.

Just b/c you have a low R-squared, a model shouldn't be dismissed in favor of a model with a higher R-squared.....

**tacomax** · 03-29-08, 04:47 PM

Originally posted by BuddyBear

R-squared is a very tricky variable. The conventional wisdom is that the higher the R-squared the better. But that isn't always the case. In reality, the more predictors you put into a model, your R-squared is going to increase regardless of how stupid they are.

And that's why people tend to use the adjusted R-sq in multi-variable regressions, which adjusts the R-sq to penalise for the addition of regressors. And, as an aside, an addition of a regressor totally uncorrelated with your regressand will not increase the R-sq.

**BuddyBear** · 03-29-08, 04:53 PM

Originally posted by tacomax

And, as an aside, an addition of a regressor totally uncorrelated with your regressand will not increase the R-sq.

Yes it will....time for you to go back to stats class

**Arnold** · 03-29-08, 05:33 PM

So R-squared is a useless variable? Must be if your model can be good even with the crappiest R-squared. Although, to tell the truth, I still don't see any logic here.

**tacomax** · 04-01-08, 04:17 PM

Originally posted by BuddyBear

Yes it will....time for you to go back to stats class

The R-sq does not automatically go up when an additional explanatory variable is added. It probably will (due to some correlation) but it isn't guaranteed. (Hint - note the definition of R-sq. If an additional variable is added which does not increase the % of the variation in Y explained by the model then what happened to the R-sq?)
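The hint can be checked directly. Below is a minimal pure-Python sketch (no stats package, tiny made-up data) that fits ordinary least squares via the normal equations and compares R-sq before and after adding a regressor. The regressor `x2` is constructed to be exactly orthogonal to the intercept, to `x1`, and to `y`. `x3` is an arbitrary extra column. All names and numbers are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, *predictors):
    """R-sq of an OLS fit of y on an intercept plus the given predictors."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


y = [1, 3, 2, 4]
x1 = [1, 2, 3, 4]
x2 = [-1, 1, 1, -1]  # exactly uncorrelated with both y and x1
x3 = [0, 1, 0, 0]    # arbitrary extra column

print(r_squared(y, x1))      # baseline
print(r_squared(y, x1, x2))  # same as baseline, up to rounding
print(r_squared(y, x1, x3))  # higher than baseline
```

So R-sq never goes down when a term is added, and it stays flat only in the knife-edge case where the new variable explains none of what is left over. With real data the sample correlation is almost never exactly zero, so in practice R-sq creeps up with nearly every added variable.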

**Arnold** · 04-01-08, 04:27 PM

This is out of Minitab help:

Adjusted R2
Percentage of response variable variation that is explained by its relationship with one or more predictor variables, adjusted for the number of predictors in the model. This adjustment is important because the R2 for any model will always increase when a new term is added. A model with more terms may appear to have a better fit simply because it has more terms. However, some increases in R2 may be due to chance alone.

The adjusted R2 is a useful tool for comparing the explanatory power of models with different numbers of predictors. The adjusted R2 will increase only if the new term improves the model more than would be expected by chance. It will decrease when a predictor improves the model less than expected by chance.
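The usual formula behind that adjustment is 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A quick Python sketch using the .312 and .324 figures from BuddyBear's example above. The sample size of 50 is an assumption for illustration only, since the example gives none.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 50  # hypothetical sample size, not from the thread

two_predictors = adjusted_r2(0.312, N_OBS, 2)
three_predictors = adjusted_r2(0.324, N_OBS, 3)

print(round(two_predictors, 4), round(three_predictors, 4))
```

Under that assumed sample size, the raw R2 rises from .312 to .324 but the adjusted value falls slightly, which is the "less than expected by chance" case the help text describes.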

**tacomax** · 04-01-08, 05:36 PM

I'll hit you with a quote from p223 from "Basic Econometrics" by Gujarati (4th Edition). Note that almost invariably doesn't mean always.

"An important property of R2 is that it is a nondecreasing function of the number of explanatory variables or regressors present in the model; as the number of regressors increases, R2 almost invariably increases and never decreases"

And, since I'm a nice guy, I'll give you a link to the book in its entirety. A good read, if you're into that kind of thing; that is specifically the maths behind the regression output.

http://polisophy.files.wordpress.com/2008/01/basic-econometrics.pdf

**BuddyBear** · 04-01-08, 07:09 PM

That link does not even open up. Nice try though....

Trust me....ANY variable you include will increase the R2. Go try it and tell me if it does not increase. That is a fact....

**durito** · 04-01-08, 07:16 PM

It opens up just fine for me. I'd been looking for a good online basic econometrics book. Thanks.

**Arnold** · 04-01-08, 07:18 PM

Opens for me too. 1000 pages though...that's hardcore.

**turnip** · 04-02-08, 10:09 AM

thanks tacomax

**curious** · 04-02-08, 10:48 AM

Originally posted by Arnold

Lol. I don't have any betting experience with MLB yet, so I thought NBA would be more appropriate. If I can find anything in the NBA, I'm sure I could use the same techniques in other sports as well. I just need to get the fundamentals straight.

Yes, but MLB has stats which are much more predictive in nature. One key thing for MLB, you have to always always always be watching trendlines in addition to "raw" stats. Underrated teams often get their act together but aren't taken seriously by the better teams so they have winning streaks, then the better teams adjust. Or, "better" teams get overconfident while everyone is gunning for them and they go on losing streaks. I watch runs per plate appearance trend lines very carefully to judge how a team is performing LATELY, I don't care what they did last month.

Curious
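A trend line like the one described can be sketched as a rolling rate over the last few games. This is only one way to do it, and the game log, the window length, and the function name below are all made up for illustration.

```python
def rolling_rate(runs, plate_appearances, window):
    """Runs per plate appearance over each trailing window of games."""
    rates = []
    for end in range(window, len(runs) + 1):
        r = sum(runs[end - window:end])
        pa = sum(plate_appearances[end - window:end])
        rates.append(r / pa)
    return rates


# Hypothetical five-game log for one team.
runs = [2, 5, 3, 8, 6]
plate_appearances = [36, 40, 38, 42, 40]

trend = rolling_rate(runs, plate_appearances, window=3)
print([round(rate, 3) for rate in trend])
```

Summing runs and plate appearances inside the window (rather than averaging per-game rates) keeps a short game from counting as much as a long one. A shorter window reacts faster to "LATELY" but is noisier.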

**curious** · 04-02-08, 10:49 AM

Originally posted by tacomax

I'll hit you with a quote from p223 from "Basic Econometrics" by Gujarati (4th Edition). Note that almost invariably doesn't mean always.

"An important property of R2 is that it is a nondecreasing function of the number of explanatory variables or regressors present in the model; as the number of regressors increases, R2 almost invariably increases and never decreases"

And, since I'm a nice guy, I'll give you a link to the book in its entirety. A good read, if you're into that kind of thing; that is specifically the maths behind the regression output.

http://polisophy.files.wordpress.com...onometrics.pdf

Nice guy? When did you have a lobotomy?

**Arnold** · 04-02-08, 12:27 PM

Originally posted by curious

Yes, but MLB has stats which are much more predictive in nature.

I want to stick to one sport right now. I won't get anywhere if I do everything at once. Baseball already started so there is no point to do the research. I'd rather get ready for the next NBA season and hopefully have something ready for the NFL too. After that I can move to MLB in its off-season.

One key thing for MLB, you have to always always always be watching trendlines in addition to "raw" stats. Underrated teams often get their act together but aren't taken seriously by the better teams so they have winning streaks, then the better teams adjust. Or, "better" teams get overconfident while everyone is gunning for them and they go on losing streaks. I watch runs per plate appearance trend lines very carefully to judge how a team is performing LATELY, I don't care what they did last month.

I agree with you. This is true in NBA as well. New York is the team to play right now. They were +8 underdogs against the crappy Bucks yesterday. They covered 4 in a row now. I think they go unnoticed. The Spurs have been on fire too. So I pay attention to these things.