Hi everyone.
For quite some time I have been working on NBA data, trying to model the NBA point spread market. First I tried regression analysis. It gave satisfactory results over the earlier seasons (2006-07 to 2008-09) but was disappointing last season, regardless of which parameter estimates I used. Even when I predicted the margin of victory with parameter estimates fitted on the 2009-10 data itself (which one should not do, because it is in-sample and invites overfitting), the results were still mediocre.
Then I tried starting from projected pace and points per possession and then coming up with the projected points scored by each team in a game. I wrote about it on the forum, but no one confirmed it was the right way to go, so I basically gave up. (Honestly, I used a lot from Justin's book, so I hoped he and others were using this approach, at least as a base / starting point, and would say whether it was right or wrong. That did not happen.)
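To make the pace-based idea concrete, here is a minimal sketch. It assumes pace is measured in possessions per game and efficiency in points per possession. The way the two teams' numbers are combined (multiplying them and dividing by the league average) is one common convention and only an assumption here, not necessarily what the book or my spreadsheet did. All function names and inputs are hypothetical.

```python
def projected_game_pace(team_pace: float, opp_pace: float, league_pace: float) -> float:
    """Assumed convention: scale one team's pace by the opponent's pace relative to league average."""
    return team_pace * opp_pace / league_pace


def projected_ppp(off_ppp: float, opp_def_ppp: float, league_ppp: float) -> float:
    """Assumed convention: offense's points per possession scaled by the opposing defense relative to league average."""
    return off_ppp * opp_def_ppp / league_ppp


def projected_points(game_pace: float, ppp: float) -> float:
    """Projected points = possessions in the game * points per possession."""
    return game_pace * ppp


# Illustrative (made-up) inputs, not real team data.
pace = projected_game_pace(team_pace=95.0, opp_pace=90.0, league_pace=92.0)
ppp = projected_ppp(off_ppp=1.10, opp_def_ppp=1.05, league_ppp=1.07)
print(round(projected_points(pace, ppp), 1))
```

Running the same calculation for the other team and taking the difference gives a projected margin before any home court, rest or injury adjustment.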
Afterwards I tried different approaches. Special thanks to Data, who referred me to log5 and the correlated Gaussian method. My first model is based on these measures / formulas. The second model I tested is based on power rankings (which take into account margin of victory instead of win%, recent form and strength of schedule). Those are the bases for my two models.
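For anyone not familiar with log5, here is a minimal sketch of the standard formula. It takes each team's overall win% and returns the probability that team A beats team B on a neutral court. How that probability is then turned into a point spread is not shown, and the function name is just an illustration.

```python
def log5(p_a: float, p_b: float) -> float:
    """Log5 estimate of P(A beats B) from each team's win% against the league."""
    a_wins = p_a * (1.0 - p_b)   # A plays to its win%, B plays to its loss%
    b_wins = p_b * (1.0 - p_a)   # the reverse case
    total = a_wins + b_wins
    if total == 0.0:
        # Both teams at 0% or both at 100%: the formula is undefined.
        raise ValueError("log5 is undefined when both win percentages are 0 or both are 1")
    return a_wins / total


# Illustrative (made-up) win percentages.
print(round(log5(0.60, 0.40), 3))
```

Two sanity checks on the formula: two equal teams come out at 50%, and a team facing a .500 opponent comes out at its own win%.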
Given these very basic assumptions (as above), both models have shown some nice results.
But I felt there was room for improvement, so later versions of the models tried to incorporate:
v2: home court advantage (HCA) - different for each team (up to 5 pts for great home teams and down to 1 pt for weak home teams, judged relative to overall record, with 3 pts being average)
v3: rest - subtracting 1 (when at home) to 2 (when away) points from teams playing on a back-to-back (B2B) against opponents with 1+ days of rest. These estimates might be somewhat off, since they are derived from very basic calculations and personal feeling. I did not account for teams on a 4 (games)-in-5 (days) or 5-in-7 stretch because the sample was too small to draw any conclusions.
v4: injuries - this is far from perfect as well, because I do not have exact injury data, only the starting five for each game. So injuries to key bench players like Crawford, Ginobili (previous years) and Terry are not accounted for in my models. Plus, I use my own estimates of how many points each player is worth, and those might be even further from optimal (because you have to predict how replaceable he is and whether his absence will affect the team against a particular opponent). Still, I guess this is better than not incorporating injuries at all.
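Put together, the v2-v4 adjustments can be sketched roughly as below. This assumes the base model outputs a neutral-court margin from the home team's point of view; that assumption, the function name and all parameter names are hypothetical. The point values are the ones listed above (3 pts standard HCA, 1 pt for a home team on a B2B, 2 pts for an away team on a B2B), and the injury values are whatever personal estimate you plug in.

```python
def adjusted_home_margin(
    base_margin: float,           # neutral-court projection, home minus away (assumed input)
    home_hca: float = 3.0,        # v2: 1 to 5 pts by team, 3 pts as the standard value
    home_b2b: bool = False,       # v3: home team on the second night of a back-to-back
    away_b2b: bool = False,       # v3: away team on the second night of a back-to-back
    home_injury_pts: float = 0.0, # v4: estimated value of missing home starters
    away_injury_pts: float = 0.0, # v4: estimated value of missing away starters
) -> float:
    """Projected home margin of victory after HCA, rest and injury adjustments."""
    margin = base_margin + home_hca
    # The rest penalty applies only against an opponent with 1+ days of rest.
    if home_b2b and not away_b2b:
        margin -= 1.0
    if away_b2b and not home_b2b:
        margin += 2.0
    # Missing players hurt their own team's side of the margin.
    margin -= home_injury_pts
    margin += away_injury_pts
    return margin


# Illustrative (made-up) game: strong home team on a B2B, missing a starter worth 1.5 pts.
print(adjusted_home_margin(1.0, home_hca=5.0, home_b2b=True, home_injury_pts=1.5))
```

When both teams are on a B2B, no rest adjustment is applied, which matches the rule of only penalizing a tired team against a rested one.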
That's all about methodology. Below are the results I obtained for the 2009-10 season. (Let's just say previous years were better or at least not worse, as I mentioned, so please don't tell me this is not enough sample size to draw any conclusions. Also, these models do not face the problem of overfitting.)
In the tables at the top you can see the average absolute errors between each model's predicted lines and the opening line, closing line and final result, respectively. Every later version of both models seems to reduce all three errors. The exception is v2, which I applied to the first model but decided not to apply to the second, because the results were very poor. I have heard that a team having a particularly good home record relative to its overall record (say 10-5 at home and 4-11 on the road) is more of a random effect that will regress to the mean at some point than something worth modeling. The results do not convince me otherwise, so I stuck with a standard 3-point HCA. Apart from this, all the errors get smaller once B2B and injuries are accounted for.
Model 2 seems to do better at predicting the lines (on average).
Also, it is worth noting that the market's avg abs error against the final result is 9.32 and 9.29 for opening and closing lines, respectively.
So while the models get closer to the market in predicting lines, they are still inferior to the market.
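For clarity, this is how the average absolute error figures can be computed. It is a minimal sketch in which every line and result is expressed as a home margin in points; the function name and the three-game sample are made up purely for illustration.

```python
def mean_abs_error(predicted: list[float], actual: list[float]) -> float:
    """Average of |predicted - actual| over all games."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up home margins for three games (not real data).
model_lines = [4.5, -2.0, 7.0]
closing_lines = [5.0, -3.5, 6.0]
final_margins = [10.0, -1.0, -3.0]

print(mean_abs_error(model_lines, closing_lines))    # model vs closing line
print(mean_abs_error(model_lines, final_margins))    # model vs final result
print(mean_abs_error(closing_lines, final_margins))  # market vs final result
```

The last call is the market benchmark: a model has to be compared against that number, not against zero.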
The bottom two tables show wins and losses (along with win%) against the closing lines for both models. The results are quite satisfactory (at least for me), and this time model 1 generally achieves the better win%. It seems that while model 2 is better at predicting lines (the overlaps that still produce a reasonable number of plays are around 4.5-5 for model 2, versus around 7-8 for model 1), model 1 may be able to find more mispriced lines.
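The win/loss counts at a given overlap can be tallied as in the sketch below. It assumes "overlap" means the absolute difference between the model's line and the closing line, that a play is made on the side the model favors whenever the overlap reaches a chosen threshold, and that all numbers are home margins. The function name and sample games are hypothetical.

```python
def ats_record(
    model: list[float],
    closing: list[float],
    final: list[float],
    min_overlap: float,
) -> tuple[int, int, int]:
    """Return (wins, losses, pushes) against the closing line for plays with overlap >= min_overlap."""
    wins = losses = pushes = 0
    for m, c, f in zip(model, closing, final):
        edge = m - c                      # > 0: model likes the home side, < 0: the away side
        if edge == 0 or abs(edge) < min_overlap:
            continue                      # no play on this game
        cover = f - c                     # > 0: home covered, < 0: away covered
        if cover == 0:
            pushes += 1
        elif (edge > 0) == (cover > 0):
            wins += 1
        else:
            losses += 1
    return wins, losses, pushes


# Made-up home margins for four games (not real data).
model = [8.0, -1.0, 3.0, 6.0]
closing = [4.0, 2.0, 3.5, 6.0]
final = [10.0, 5.0, 1.0, 6.0]

wins, losses, pushes = ats_record(model, closing, final, min_overlap=2.0)
print(wins, losses, pushes)
```

Raising `min_overlap` trades the number of plays for (hopefully) a higher win%, which is the trade-off between the two models described above.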
I would like to ask all modelers (especially those who deal with NBA sides) for suggestions, advice and the results your models produce, if this is not a secret. Mainly: your model's average absolute error (from the opening line, closing line and final result) and win% at a particular overlap. Also, what do you think about the approaches I described above (differentiating HCA by team and incorporating injuries and rest in this manner)? Would you add any other variables?
Thank you
Bartek