NBA modeling

bolekblues · 01-19-11 01:44 PM

Hi everyone.

For quite some time I have been working on NBA data and trying to model the NBA point spread market. First I tried regression analysis, but while it showed satisfactory results over earlier seasons (2006-07 to 2008-09), it was disappointing last season, regardless of the parameter estimates I used. Even when I predicted margin of victory with the parameter estimates obtained from modeling the 2009-10 data itself (which one should not do, because fitting and testing on the same games invites overfitting), it was still mediocre.

Then I tried starting from projected pace and points per possession to come up with the projected points scored by each team in a game. I wrote about it in the forum but no one confirmed it was the right way to go, so I basically gave up. (Honestly, I used a lot from Justin's book, so I hoped he and others were using this, of course as a base / starting point, and would at least say whether it was right or wrong. That did not happen.)
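For readers unfamiliar with the idea, here is a minimal sketch of a pace and points-per-possession projection. The function name and all numbers are illustrative assumptions; the post does not say how pace or efficiency were projected, and this is not taken from Justin's book.

```python
def project_points(pace: float, pts_per_poss: float) -> float:
    """Projected points = projected possessions * projected points per possession."""
    return pace * pts_per_poss

# Illustrative inputs (assumed): both teams get the same number of possessions.
pace = 95.0
home_pts = project_points(pace, 1.08)
away_pts = project_points(pace, 1.04)
projected_margin = home_pts - away_pts  # projected home margin of victory

print(round(home_pts, 1), round(away_pts, 1), round(projected_margin, 1))
```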

Afterwards, I tried different approaches. Special thanks to Data, who referred me to log5 and the corr.gaussian method. My first model is based on these measures / formulas. The second model I tested is based on Power Rankings (which take into account margin of victory instead of win%, recent form and strength of schedule). Those are the bases for my two models.
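The post does not spell out the formulas, so as a hedged illustration only, this is the standard log5 estimate of the chance that team A beats team B given each team's win percentage. How the first model turns that probability into a point spread is not described here and is not attempted.

```python
def log5(p_a: float, p_b: float) -> float:
    """Standard log5: probability that A beats B, given win% p_a and p_b."""
    a_wins = p_a * (1.0 - p_b)
    b_wins = p_b * (1.0 - p_a)
    return a_wins / (a_wins + b_wins)

# Illustrative win percentages (assumed, not from the thread).
print(round(log5(0.6, 0.4), 4))
```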

It seems that, given these very basic assumptions, both have shown some nice results.
But I felt there was room for improvement, so further versions of the models tried to incorporate:

v2: home court advantage (HCA) - different for different teams (up to 5pts for great home teams, and down to 1 for weak home teams - relative to overall record, with 3 being average)

v3: rest - subtracting 1 (when at home) to 2 (when away) points from teams playing on a B2B against those with 1+ days rest. Those estimates might be somewhat wrong, as they are derived from very basic calculations and personal feeling. I did not account for teams on a 4(games)-in-5(days) or 5-in-7 because the sample was too small to draw any conclusions.

v4: injuries - this is far from perfect as well, because I do not possess exact injury data, only the starting 5 for each game. So injuries to key bench players like Crawford, Ginobili (previous years) or Terry are not accounted for in my models. Plus, I have my own estimates of how many points each player is worth, and those might be even further from optimal (because you have to predict how replaceable he is and whether his absence will affect the team against a particular opponent), but this is better than not incorporating injuries at all, I guess.
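The v2-v4 adjustments can be sketched roughly as below. This is only one reading of the description: the HCA and B2B point values come from the post, while the function name, the neutral-court input and the per-team injury point totals are hypothetical placeholders.

```python
def projected_home_margin(neutral_margin, hca=3.0,
                          home_b2b=False, away_b2b=False,
                          home_injury_pts=0.0, away_injury_pts=0.0):
    """Adjust a neutral-court home margin for HCA, rest and injuries."""
    margin = neutral_margin + hca      # v2: 1 to 5 points, 3 being average
    if home_b2b and not away_b2b:
        margin -= 1.0                  # v3: home team on a B2B vs rested opponent
    elif away_b2b and not home_b2b:
        margin += 2.0                  # v3: away team on a B2B vs rested opponent
    margin -= home_injury_pts          # v4: points lost to missing home starters
    margin += away_injury_pts          # v4: points lost to missing away starters
    return margin

# Example: home team 2 points better on a neutral court, visitor on a B2B.
print(projected_home_margin(2.0, away_b2b=True))
```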

That's all about methodology. Below are the results I obtained for the 2009-10 season (let's just say previous years were better or at least not worse, as I have mentioned, so please don't tell me the sample is too small to draw any conclusions. Also, these models do not face the problem of overfitting).

In the tables at the top you can see the average absolute errors between each model's predicted lines and the opening line, closing line and final result, respectively. It seems that every 'later' version of both models improved (reduced) all three errors. The exception is v2, which I applied to the first model but decided not to apply to the second, since its results seem very poor. I heard somewhere that a team having a particularly good home record relative to its overall record (let's say 10-5 at home and 4-11 on the road) is more of a random effect that will regress to the mean at some point, rather than something worth noting. The results do not convince me otherwise, so I just stuck to a standard 3-point HCA. Besides this, all the errors seem to get smaller with B2B and injuries being accounted for.
Model 2 seems to get better results as far as predicting the lines (on average).
Also, it is worth noting that market avg abs error is 9,32 and 9,29 for opening and closing lines, respectively.
So while models get closer to the market as far as lines prediction, they are still inferior to the market.
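As a reference for the error measure used here, a minimal sketch of the average absolute error. All lines are written as expected home margin of victory (an assumed convention), and the four games are made up for illustration, not taken from the 2009-10 data.

```python
def mean_abs_error(predicted, actual):
    """Average of |predicted - actual| over all games."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Made-up games, each value an expected or actual home margin.
closers = [-4.5, 3.0, 7.5, -1.0]
model_lines = [-6.0, 1.0, 9.0, 2.0]
margins = [-10, 8, 3, -1]

market_mae = mean_abs_error(closers, margins)          # market vs final result
model_mae = mean_abs_error(model_lines, margins)       # model vs final result
model_vs_close = mean_abs_error(model_lines, closers)  # model vs closing line
print(market_mae, model_mae, model_vs_close)
```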

The bottom two tables show wins and losses (along with win %) against the closing lines for both models. The results are quite satisfactory (at least for me), and this time model 1 generally achieves the better win%. It seems that while model 2 is better at predicting lines (the overlap numbers producing any reasonable number of plays are around 4,5-5, while for model 1 they are around 7-8), model 1 may be able to find more mispriced lines.

I would like to ask all modelers (especially those who deal with NBA sides) for suggestions, advice and the results your models produce, if this is not a secret. Mainly: the model line's average absolute error (from opening, closing and final result) and win% on a particular overlap. Also, what do you think about the approaches I described above (differentiating HCA across teams and incorporating injuries and rest in this manner)? Would you add any other variables?

Thank You
Bartek

Justin7 · 01-19-11 02:13 PM

NBA HFA is an odd duck. If a team has been on the road for a week, and just arrived home against a team that was home for a week and just went on the road, the home team might get a negative adjustment. A team at the end of a long home stay playing against a team at the end of a long road trip might have a much larger HFA.

The error introduced by this variable HFA is enough to badly skew your results.

I haven't attacked NBA yet, but I think using pace and player-specific OE and DE has a good chance of working. Player changes make some pretty big differences, and knowing that a guy might only play 15 instead of 30 minutes can impact your numbers heavily.

Data · 01-20-11 12:35 AM

Originally Posted by bolekblues

Also, it is worth noting that market avg abs error is 9,32 and 9,29 for opening and closing lines, respectively.

What if you run a regression with Zero Constant against MOV using your lines and the closers? What k's would they get? What is the avg abs error for this combined line?

bolekblues · 01-20-11 03:46 PM

Originally Posted by Justin7

NBA HFA is an odd duck. If a team has been on the road for a week, and just arrived home against a team that was home for a week and just went on the road, the home team might get a negative adjustment. A team at the end of a long home stay playing against a team at the end of a long road trip might have a much larger HFA. The error introduced by this variable HFA is enough to badly skew your results. I haven't attacked NBA yet, but I think using pace and player-specific OE and DE has a good chance of working. Player changes make some pretty big differences, and knowing that a guy might only play 15 instead of 30 minutes can impact your numbers heavily.

You talk more of a situational HCA (like coming from a road trip) rather than differentiating across teams, but thanks, I might be able to research this as well.

Originally Posted by Data

What if you run a regression with Zero Constant against MOV using your lines and the closers? What k's would they get? What is the avg abs error for this combined line?

Not sure what you're asking about. Please specify if possible. As for those 9,32 and 9,29, those are the averages of the absolute errors of market openers and closers (my models' errors are in the tables). I believe anyone who has a database with 2009-10 spreads and results can check this and come up with similar results.

What do you guys think about the models in general and about the approaches I have taken? Does really no one else model the NBA? That seems strange, since time after time I see some posters chiming in about their models and approaches.

Data · 01-20-11 04:52 PM

Originally Posted by bolekblues

Not sure what you're asking about. Please specify if possible.

Combine the closers and your lines into a third line and see how good that line is. To combine the lines use regression. If you use Excel, check "Constant is Zero".
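Outside Excel, the same zero-constant regression can be sketched in a few lines. This is a minimal illustration that solves the two normal equations directly; the function name and the four games are made up, and the margins are built to follow an exact 0.6/0.4 blend so the result is easy to check.

```python
def fit_through_origin(closers, model_lines, margins):
    """Least squares for margin = x*closer + y*model_line, with no intercept."""
    s_cc = sum(c * c for c in closers)
    s_mm = sum(m * m for m in model_lines)
    s_cm = sum(c * m for c, m in zip(closers, model_lines))
    s_cy = sum(c * v for c, v in zip(closers, margins))
    s_my = sum(m * v for m, v in zip(model_lines, margins))
    det = s_cc * s_mm - s_cm ** 2
    if det == 0:
        raise ValueError("closers and model lines are perfectly collinear")
    x = (s_cy * s_mm - s_my * s_cm) / det
    y = (s_my * s_cc - s_cy * s_cm) / det
    return x, y

# Made-up lines (expected home margin); margins follow 0.6*closer + 0.4*model.
closers = [-5.0, 3.0, 7.5, -2.0]
model_lines = [-7.0, 1.0, 9.0, 0.0]
margins = [0.6 * c + 0.4 * m for c, m in zip(closers, model_lines)]

x, y = fit_through_origin(closers, model_lines, margins)
print(round(x, 3), round(y, 3))
```

With real results the margins are noisy, so the two coefficients will not be recovered this cleanly.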

Originally Posted by bolekblues

As for those 9,32 and 9,29 those are averages from absolute errors of market openers and closers (my models errors are in the tables).

Why are you looking at the average absolute errors? The means make more sense to me.

Originally Posted by bolekblues

What do you guys think about the models in general and about the approaches I have taken?

I approve of your approaches, but that is a given.

In general, once you mastered the basics (and you almost did) that is the time to get creative. For instance, the injuries. Everyone knows the injuries are important but nobody really knows how to properly account for them. One more thing, if you feel you need some data then get up and get that data. Data is important!

bolekblues · 01-21-11 05:01 PM

Originally Posted by Data

Combine the closers and your lines into a third line and see how good that line is. To combine the lines use regression. If you use Excel, check "Constant is Zero". Why are you looking at the average absolute errors? The means make more sense to me. I approve of your approaches, but that is a given.

In general, once you mastered the basics (and you almost did) that is the time to get creative. For instance, the injuries. Everyone knows the injuries are important but nobody really knows how to properly account for them. One more thing, if you feel you need some data then get up and get that data. Data is important!

Well, isn't the mean the same as the average? You sum up all the absolute values and divide by the number of occurrences, right?

As for the regression, I am not sure if I understood correctly, but estimating a linear regression with no constant using OLS, I got something like this:
MOV = 0,524*closer + 0,361*model1LINE

If I derive MOV^ (the prediction) for each game in the 2009-10 season, my avg abs error from actual MOV drops to 9,25, and the W/L records against closers were 219-175 (55,6%) on a 1pt overlap, 121-85 (58,7%) on a 1,5pt overlap, and 62-48 (56,4%) on a 2pt overlap.
Those are impressive numbers, but I am pretty sure this has a lot to do with sampling error and overfitting, as one uses parameter estimates obtained from one season to predict the same games.
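A rough sketch of how such a W/L record against closers can be counted for a given overlap. The default coefficients are the ones quoted above; the function names, the sign convention (all numbers are expected home margin) and the three sample games are assumptions for illustration.

```python
def combined_line(closer, model_line, x=0.524, y=0.361):
    """Blend of closer and model line using zero-constant coefficients."""
    return x * closer + y * model_line

def ats_record(predictions, closers, margins, overlap):
    """Count (wins, losses, pushes) on games where |prediction - closer| >= overlap."""
    wins = losses = pushes = 0
    for pred, close, margin in zip(predictions, closers, margins):
        edge = pred - close            # > 0: back the home side, < 0: back the away side
        if abs(edge) < overlap:
            continue                   # not enough disagreement with the market
        if margin == close:
            pushes += 1
        elif (margin > close) == (edge > 0):
            wins += 1
        else:
            losses += 1
    return wins, losses, pushes

# Made-up games: predicted line, closing line, actual home margin.
record = ats_record([6.0, 1.0, 3.0], [4.0, 4.0, 3.0], [10, 5, 7], overlap=1.5)
print(record)
```

To avoid the in-sample problem mentioned above, the coefficients would be estimated on one season and the record counted on a different one.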

Data · 01-21-11 09:25 PM

Originally Posted by bolekblues

Well, isn't the mean the same as the average? You sum up all the absolute values and divide by the number of occurrences, right?

Of course it is. My bad. I meant median. This may sound strange and I do NOT want to get deep into this but, to say the least, if a team that you favored over the line did not show up then it does not matter that much how much it lost by, and vice versa.

Originally Posted by bolekblues

As for the regression, I am not sure if I understood correctly, but estimating a linear regression with no constant using OLS, I got something like this:
MOV = 0,524*closer + 0,361*model1LINE

These numbers look good. They confirm that the lines you make will produce ATS winners.

specialronnie29 · 01-21-11 09:51 PM

may i ask what the std errors were on MOV = closer + model1line?

data, why take the constant out of that regression?

Data · 01-21-11 10:17 PM

Because we run that regression not for the sake of making a better number out of given two numbers but to assess the importance of the info our line contributes to produce the better line which is the result of the regression.

specialronnie29 · 01-21-11 10:38 PM

hmm

suppose in two dimensions you have a true relationship which is y = 10 and your data reflects this with some noise and you run regression y on x without a constant, you're going to get some sort of 45 degree line when there is actually no relationship
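A small numeric version of this example, with the noise left out so the numbers are exact. The x values 1 to 10 are an arbitrary choice; the size of the spurious slope depends on them.

```python
xs = list(range(1, 11))
ys = [10.0] * len(xs)   # true relationship: y = 10, no dependence on x

# Regression through the origin: slope = sum(x*y) / sum(x*x)
slope_origin = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Ordinary regression with a constant
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope_const = sxy / sxx
intercept = mean_y - slope_const * mean_x

print(round(slope_origin, 4))   # clearly non-zero although y ignores x
print(slope_const, intercept)   # slope 0, constant 10
```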

doesnt this problem carry forward to 3 dimensions

just like if you have y on x1 x2 you might get a significant relationship on both x1 and x2, but with the introduction of x3 this can all change, even though you are just interested in how x1 and x2 are contributing to y

i am definitely not an expert on this but find it intuitively strange to explicitly ask to remove the constant

Data · 01-21-11 11:39 PM

The simple to understand reason why the regression should be used with the constant (aka intercept) set to zero (aka go through origin) is this. If your model and the market line produce the line of 0 then the line should be 0. If the intercept is not zero then the line in this case will equal that intercept which does not make sense while analyzing the relationship between the model and the market lines.

roasthawg · 01-21-11 11:46 PM

Originally Posted by Data

Because we run that regression not for the sake of making a better number out of given two numbers but to assess the importance of the info our line contributes to produce the better line which is the result of the regression.

Why not just look at t value?

specialronnie29 · 01-22-11 12:01 AM

ok but if the model and the market are sharp then the constant should be 0 or close to 0 anyways. you're placing a restriction on the slope that gives it a feature youd like it to have rather than letting it produce it itself

im also concerned about colinearity. if the closer is an unbiased predictor and the model is too, then the two should be very similar and this regression should have a colinearity problem, which can lead to 'wild' coefficients because b1=1 and b2=0 works as well as b1=0 and b2=1 and any linear combination.

yes in general if you run a regression with the closer in it then you should be happy if another variable is significant, but typically that variable isn't another line

just my opinion

Data · 01-22-11 12:02 AM

Originally Posted by roasthawg

Why not just look at t value?

Because the null hypothesis that is being tested by the T-test is of no interest to us.

Data · 01-22-11 12:06 AM

Originally Posted by specialronnie29

ok but if the model and the market are sharp then the constant should be 0 or close to 0 anyways.

Right, and that is precisely why we enforce this view and avoid the noise amplified by using a non-zero intercept.

MadTiger · 01-22-11 12:56 AM

Originally Posted by Justin7

NBA HFA is an odd duck. If a team has been on the road for a week, and just arrived home against a team that was home for a week and just went on the road, the home team might get a negative adjustment. A team at the end of a long home stay playing against a team at the end of a long road trip might have a much larger HFA. The error introduced by this variable HFA is enough to badly skew your results.

Agreed. Home field advantage has to actually be an ADVANTAGE.

roasthawg · 01-22-11 03:28 PM

Originally Posted by Data

Because the null hypothesis that is being tested by the T-test is of no interest to us.

Isn't the null hypothesis simply that the slope is zero and that the two lines are not predictive of the absolute error? Don't the t values give us an indication of whether or not the independent variables are predictive or not? If so aren't those values what we're interested in here? I ask not to argue but to better understand what's being discussed... I get a lot out of a little when it comes to my knowledge of statistics!

bolekblues · 01-22-11 04:33 PM

Originally Posted by Data

Of course it is. My bad. I meant median. This may sound strange and I do NOT want to get deep into this but, to say the least, if a team that you favored over the line did not show up then it does not matter that much how much it lost by, and vice versa. These numbers look good. They confirm that the lines you make will produce ATS winners.

Yes, I know what is up with the median. When evaluating market efficiency, I came across some academic papers in which they showed (and proved) why it is more important to investigate medians, and thus why looking at the 3rd (skewness) and 4th (kurtosis) moments was also important, not only the mean. As far as I remember they also proposed an LR test to properly account for this and not worry about the distribution of errors (in the case of a regression).

For market openers and closers, the median absolute error (from MOV) is 8,0 and 7,5, respectively.

My models generate: model1 - 8,0; model2 - 7,9 (since it has any real number as a line, not only integer and integer+0,5), so they are right around openers.
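To illustrate why the median is less sensitive to blowouts than the mean, a tiny sketch with made-up absolute errors. The two lists differ only in how badly the last game missed.

```python
from statistics import mean, median

# Made-up absolute errors (|line - MOV|) for five games.
errors_mild_miss = [2.0, 5.0, 8.0, 8.0, 12.0]
errors_blowout = [2.0, 5.0, 8.0, 8.0, 30.0]

mean_mild, mean_blowout = mean(errors_mild_miss), mean(errors_blowout)
median_mild, median_blowout = median(errors_mild_miss), median(errors_blowout)

print(mean_mild, mean_blowout)       # the blowout drags the mean up
print(median_mild, median_blowout)   # the median is unchanged
```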

As for the numbers, why do you see these estimates as possibly indicative of a model's strength? Are the estimates themselves enough to draw conclusions (in this case), or do you see some similarity between these estimates and perhaps other models you came across?

Originally Posted by specialronnie29

may i ask what the std errors were on MOV = closer + model1line? data, why take the constant out of that regression?

The estimates with st.errors in parentheses: model1: 0,361 (0,108), closers: 0,524 (0,136), so they are both statistically significantly different from zero (two tailed T-test).
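The significance claim can be checked by hand from the figures above: each t ratio is the estimate divided by its standard error. The 1.96 cutoff is the usual large-sample two-tailed 5% value, used here on the assumption that a full season of games gives plenty of degrees of freedom.

```python
# (estimate, standard error) as quoted in the post
estimates = {"model1": (0.361, 0.108), "closer": (0.524, 0.136)}

CRITICAL_5PCT = 1.96  # large-sample approximation (assumption)

t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
for name, t in t_stats.items():
    print(name, round(t, 2), abs(t) > CRITICAL_5PCT)
```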

specialronnie29 · 01-22-11 04:49 PM

hate to go head to head with data here but the restriction of noconstant does not make any sense to me. theres a reason virtually every regression has one, even if it comes with an interpretation problem. this is true for guys just interested in coefficients and not in making predictions like any social scientist

if your trying to predict income you might run a regression like
income = constant + age + education + parents income + gender + whatever

it is very possible such a model gives you a negative constant - how can you have negative income? but you would still put it in despite this interpretation problem. In fact when you force the line of best fit through the origin the residuals may not even average out to 0.

also like i said the two variables of this gentlemans model and the closer are nearly identical. its like running a regression where two of your predictors are annual income and monthly income. Well for many people they are related by annual = 12*monthly. if this is true for all your observations your regression wont even run because the matrix of covariates is not invertible. if its not perfectly true OLS may still give you estimates but think about the problem... what coefficient should you put on the two variables of annual and monthly. any linear combo will work nearly as well. thats what im concerned about here.
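The annual/monthly example in numbers, as a minimal sketch with made-up incomes: the normal-equation determinant is exactly zero, and two very different coefficient pairs give identical fitted values.

```python
monthly = [2000, 3500, 5000, 4200]
annual = [12 * m for m in monthly]   # perfectly collinear with monthly

# Determinant of X'X for the two predictors (no constant)
s_aa = sum(a * a for a in annual)
s_mm = sum(m * m for m in monthly)
s_am = sum(a * m for a, m in zip(annual, monthly))
det = s_aa * s_mm - s_am ** 2
print(det)   # 0: the matrix cannot be inverted, so OLS has no unique solution

# Two different coefficient pairs, same fitted values
fit_a = [1 * a + 0 * m for a, m in zip(annual, monthly)]
fit_b = [0 * a + 12 * m for a, m in zip(annual, monthly)]
print(fit_a == fit_b)
```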

humor me - are both coefficients significant without forcing a zero constant? i see no reason why a constant amplifies rather than reduces noise.

specialronnie29 · 01-22-11 04:50 PM

Originally Posted by Data

Because the null hypothesis that is being tested by the T-test is of no interest to us.

what were you looking at when the modeler posted his coefficient estimates then? that the coefficient on model1line wasnt zero? of course its a necessary requirement for the model to be of any use but if it isnt also significantly different from zero then you cant be confident.

he has since posted it was significant

Data · 01-22-11 05:12 PM

Originally Posted by roasthawg

Isn't the null hypothesis simply that the slope is zero and that the two lines are not predictive of the absolute error? Don't the t values give us an indication of whether or not the independent variables are predictive or not? If so aren't those values what we're interested in here? I ask not to argue but to better understand what's being discussed... I get a lot out of a little when it comes to my knowledge of statistics!

Can you clarify exactly which T value you are looking at? What are you testing with the T test?

specialronnie29 · 01-22-11 05:16 PM

the null hypothesis that the coefficient on model1line is 0...

bztips · 01-22-11 05:20 PM

Have to agree here with ronnie.

The two variables are likely highly collinear -- what is the correlation coefficient between the two variables?

Normally one of the results of multicollinearity is large std errors for the affected variables -- I suspect in this case you're not seeing that precisely because you've excluded the constant term which, as ronnie said, works in the opposite direction and erroneously deflates standard errors.

IMO, the correct procedure to follow would be:

1) Initially estimate the model with a constant term and the closer variable ONLY; keep the constant whether it appears to be statistically significant or not. The closer should be significant.

2) Now add in the model's prediction of the line (again, keeping the constant).
Check the std errors on the 2 independent variables -- are they now each insignificant? If so, that's likely due to collinearity.
Did the coefficient on the closer change a lot from the initial estimate? If so, that's another indication of collinearity.
Check the model's F-statistic -- if it's real significant even though the two individual variables are insignificant, that's yet another indication of collinearity.

If none of the above in 2) apply, then congratulations -- ronnie and I are wrong, and you can have some confidence that your model's line is providing some independent explanatory power to predict MOV.
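A rough sketch of the coefficient comparison in steps 1 and 2, using a small hand-rolled OLS so it runs without extra packages. The data are made up, and the standard errors and F-statistic mentioned above are not computed here; a statistics package would report those.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, y):
    """OLS coefficients of y on the given columns (include a column of 1s for the constant)."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * v for p, v in zip(columns[i], y)) for i in range(k)]
    return solve_linear(xtx, xty)

# Made-up data: margins built as 1.0 + 0.5*closer + 0.4*model_line, no noise.
closers = [-6.0, -3.0, 0.0, 2.0, 5.0, 8.0]
model_lines = [-8.0, -2.0, 1.0, 1.0, 7.0, 9.0]
margins = [1.0 + 0.5 * c + 0.4 * m for c, m in zip(closers, model_lines)]
ones = [1.0] * len(margins)

step1 = ols([ones, closers], margins)               # constant + closer only
step2 = ols([ones, closers, model_lines], margins)  # add the model's line
closer_shift = step2[1] - step1[1]                  # a large shift hints at collinearity
print([round(v, 3) for v in step1], [round(v, 3) for v in step2], round(closer_shift, 3))
```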

Data · 01-22-11 05:30 PM

Originally Posted by specialronnie29

the restriction of noconstant does not make any sense to me. theres a reason virtually every regression has one

We run this regression not to find the best fit for the set of variables we have. If we did, yes, this restriction would not make sense. That way we would be explaining the past data in the best way possible. However, we do not care about the past; we want to have something that makes sense and use it to predict the future.

Thus, if the market line L, L=0 and model line M, M=0, then we must restrict the constant C to equal zero, because to have the resulting combined "final line" equaling anything but zero in this case does not make sense.

"final line"=C+x*L+y*M

specialronnie29 · 01-22-11 05:39 PM

theres a lot of stuff that doesnt make intuitive sense in regression results
you have to deal with it. just like the model can predict a margin of victory of 4.5393 which doesnt make sense, but the modeler has to use common sense to interpret this.

you want to include the constant to best explain past data as best as possible to see if that explanation tells you if your model (variable model1line) provides any valuable input. i thought thats why you liked the result of the modeler's regression output. it may make sense for some reason to use the noconstant option to make predictions, but if youre testing to see if your model is useful i think the constant should be in there.

anyhow it seems its time to use the model in practice!!

bztips · 01-22-11 06:06 PM

Originally Posted by Data

We run this regression not to find the best fit for the set of variables we have. If we did, yes, this restriction would not make sense. That way we would be explaining the past data in the best way possible. However, we do not care about the past; we want to have something that makes sense and use it to predict the future.

Thus, if the market line L, L=0 and model line M, M=0, then we must restrict the constant C to equal zero, because to have the resulting combined "final line" equaling anything but zero in this case does not make sense.

"final line"=C+x*L+y*M

I thought the purpose of this regression was to determine if the OP's model has some explanatory power to explain MOV that's not already captured in the closing line. If that's the case, there's no reason not to include the constant term. By excluding it a priori, you have no way of determining whether the reasonable concerns that ronnie and I expressed have any validity.

Data · 01-22-11 06:28 PM

Originally Posted by bolekblues

As for the numbers, why do you see these estimates possibly indicative of a model strength?

Those numbers being close tell me that they have a similar level of "importance". The key to understand here is that the model actually does NOT need to be "strong" (accurate).

In fact, it can be very inaccurate and still produce a winning record. (Create a test model using this formula: new_line=2*your_model_line, see how it fares, I predict it will be a winning model despite being awfully inaccurate).

What we are looking for the model to do is to "know" something that the market does not account for. Realize, if your model is perfect and the market is perfect too, your model is useless. So, the model is only good if it can pinpoint the market inefficiencies. Say the true line is -4, the market line is -2 and the model line is -7. The game lands on 4. The market line was more accurate but the model gave you a winner because it pointed out that -2 was too low. Of course, the more accurate your model the better but the accuracy is not a prerequisite for winning. As long as the model knows something that the market misprices, that something can be (significantly) mispriced by the model too, as long as that corrects the market price in the right direction.
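The example in this paragraph, worked through in a few lines. Lines are rewritten as the favorite's expected margin (so -2 becomes 2), which is an assumed convention, and pushes are ignored for simplicity.

```python
market_line = 2.0    # market: favorite by 2 (quoted as -2)
model_line = 7.0     # model: favorite by 7 (quoted as -7)
final_margin = 4.0   # the game lands on 4

market_error = abs(final_margin - market_line)   # 2 points off
model_error = abs(final_margin - model_line)     # 3 points off

bet_favorite = model_line > market_line          # model says -2 is too low
favorite_covered = final_margin > market_line
bet_wins = bet_favorite == favorite_covered

print(market_error, model_error, bet_wins)
```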

specialronnie29 · 01-22-11 06:42 PM

Originally Posted by Data

Those numbers being close tell me that they have a similar level of "importance". The key to understand here is that the model actually does NOT need to be "strong" (accurate).

In fact, it can be very inaccurate and still produce a winning record. (Create a test model using this formula: new_line=2*your_model_line, see how it fares, I predict it will be a winning model despite being awfully inaccurate).

What we are looking for the model to do is to "know" something that the market does not account for. Realize, if your model is perfect and the market is perfect too, your model is useless. So, the model is only good if it can pinpoint the market inefficiencies. Say the true line is -4, the market line is -2 and the model line is -7. The game lands on 4. The market line was more accurate but the model gave you a winner because it pointed out that -2 was too low. Of course, the more accurate your model the better but the accuracy is not a prerequisite for winning. As long as the model knows something that the market misprices, that something can be (significantly) mispriced by the model too, as long as that corrects the market price in the right direction.

translation: a non-zero coefficient on model1line that is statistically significant and that does not suffer from strong colinearity.

thats why we are questioning whether that regression tells you what youre saying in bold letters

Data · 01-22-11 07:13 PM

Originally Posted by specialronnie29

translation: a non-zero coefficient on model1line that is statistically significant and that does not suffer from strong colinearity.

thats why we are questioning whether that regression tells you what youre saying in bold letters

Because I already know that one of the variables is a good predictor by itself (the line). I also know that it is appropriate for the regression to go through the origin. Your "counter" example with MOV of 4.5393 does not cut it. 4.5393 makes sense while "my line is 0 which means my line is -1.567" does not. So, the results tell me that model's significance is going to be close enough.

roasthawg · 01-22-11 07:45 PM

Originally Posted by Data

"final line"=C+x*L+y*M

This is actually a very good explanation... after mulling it over I see your point here. I tend to agree with ronnie's post which stated that much in regression doesn't make intuitive sense as the point is to find the best all-encompassing equation, not simply the equation that best fits when the line is at or near zero.

specialronnie29 · 01-22-11 08:29 PM

Originally Posted by Data

Because I already know that one of the variables is a good predictor by itself (the line). I also know that it is appropriate for the regression to go through the origin. Your "counter" example with MOV of 4.5393 does not cut it. 4.5393 makes sense while "my line is 0 which means my line is -1.567" does not. So, the results tell me that model's significance is going to be close enough.

if this is appropriate the regression will yield this by itself. it wont be 0 exactly but it will be very close to 0. the point is probably not a big deal then but this sort of intuitive restriction is bad. it seems youre saying it must go through the origin because the closer plus the model line explain everything so if the closer and the model line are 0 it absolutely must be that the constant is 0. but if closer was so great to begin with the model line wouldnt be significant. now youre assuming the model accounts for everything the closer misses so it must go through the origin so we might as well impose the restriction noconstant. i dont buy it.

Justin7 · 01-22-11 08:44 PM

Originally Posted by Data

Because I already know that one of the variables is a good predictor by itself (the line). I also know that it is appropriate for the regression to go through the origin. Your "counter" example with MOV of 4.5393 does not cut it. 4.5393 makes sense while "my line is 0 which means my line is -1.567" does not. So, the results tell me that model's significance is going to be close enough.

While the market line is usually a good indicator of the right price, I think it's a terrible idea to use it as part of your regression. When looking forward, you want to identify the weak lines before the market does. If you use the regression going forward, and incorporate the market line, you are diluting the results of your own model.

I think a better approach completely ignores the market line during regression analysis, but uses market lines as a way to "grade" your approach based on market moves.

Data · 01-22-11 09:08 PM

Originally Posted by specialronnie29

i see no reason why a constant amplifies rather than reduces noise.

Bad and somewhat misleading wording on my part. By "noise" I meant the inaccuracy of the coefficient estimates. If you take a few subset samples you will see the coefficients jumping noisily all over the place.

Data · 01-22-11 09:41 PM

Originally Posted by Justin7

While the market line is usually a good indicator of the right price, I think it's a terrible idea to use it as part of your regression. When looking forward, you want to identify the weak lines before the market does. If you use the regression going forward, and incorporate the market line, you are diluting the results of your own model.

I think a better approach completely ignores the market line during regression analysis, but uses market lines as a way to "grade" your approach based on market moves.

You are totally arguing with something you made up. Yes, I would agree with your criticism of the process if we had taken the following steps:
1) take the lines that the model produced
2) take the market lines
3) run the OLS regression
4) use the coefficients and the intercept to calculate our final line

However, here is what we actually did:
1) take the lines that the model produced
2) take the market lines
3) run the RTO regression
4) look at how useful our model is, judging by the coefficients (which are between 0 and 1)
We stopped at this point. As you suggested, our model can "identify the weak lines before the market does".

Having said that, there is a next meaningful step:
5) use the coefficients (this time no intercept!) to calculate our final line
What we do at step 5 is just paying respect to the market and assuming that the true line is not our model's line but a number in between. And we do that intentionally, despite your objections of "diluting the results of your own model". Now, if the market line changes, our final line will change too and if this was not happening that would be just silly.

specialronnie29 · 01-22-11 09:53 PM

Originally Posted by Data

Bad and somewhat misleading wording on my part. By "noise" I meant the inaccuracy of the coefficient estimates. If you take a few subset samples you will see the coefficients jumping noisily all over the place.

ah i see what you meant, ok
