I wanted to gauge other people's experience with this. I found that although my MSE or absolute error goes down marginally when I include more data (older seasons) in the training set, performance actually improves when I include only the two most recent seasons. I hypothesize that this helps adjust for league changes, etc., but as always, a small dataset makes it hard to establish statistical confidence. I am looking at the NFL right now, but have seen this in other sports too. Does anyone have any thoughts or experience they'd care to share on this topic? Thanks.
Maybe use exponential smoothing or some other time-series smoothing technique. Optimize the weighting/smoothing parameters on a training set against MSE/MAE (I have used evolutionary algorithms quite a lot for exactly this), and then test on an out-of-sample set.
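To make the idea concrete, here is a minimal sketch of tuning a single smoothing parameter on a training split and then scoring it out of sample. It uses a plain grid search rather than an evolutionary algorithm, purely to keep it short. The function names, the grid, and the `margins` numbers are all made up for illustration and are not real results.

```python
def smoothing_mse(values, alpha, start=1):
    """MSE of one-step-ahead exponential-smoothing forecasts.

    The forecast for values[t] is the smoothed level built from values[:t].
    Only forecasts for t >= start are scored, so the level can warm up on
    earlier data without those errors counting.
    """
    level = values[0]
    sq_errors = []
    for t in range(1, len(values)):
        if t >= start:
            sq_errors.append((values[t] - level) ** 2)
        # Update the level only after the forecast for t has been scored.
        level = alpha * values[t] + (1 - alpha) * level
    return sum(sq_errors) / len(sq_errors)


def best_alpha(train, grid=None):
    """Pick the alpha with the lowest training MSE (simple grid search)."""
    if grid is None:
        grid = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return min(grid, key=lambda a: smoothing_mse(train, a))


# Hypothetical per-game point margins for one team, oldest first.
margins = [3, -7, 10, 14, -3, 7, 21, -10, 6, 3, 17, -4, 10, 7, -14, 3]
split = 10
alpha = best_alpha(margins[:split])

# Out-of-sample check: smooth through the whole series, score only the tail.
test_mse = smoothing_mse(margins, alpha, start=split)
print("chosen alpha:", alpha, "out-of-sample MSE:", round(test_mse, 2))
```

Swapping the squared error for an absolute error gives the MAE version, and the grid search can be replaced by whatever optimizer you prefer once there are several parameters to tune.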
I found that exponential smoothing worked well for me. I spent a few hours on my test data determining the optimal rate to "smooth" and then committed to it.
Is smoothing always a good idea? What happens to your predictions when teams regress to the mean after being unusually lucky or unlucky? Another issue could be player injuries and/or suspensions. With a key player out temporarily, that team will look worse than it actually is, because the games without the key player are weighted more heavily.
This past offseason I was running binary logistic regression. The test data was 2012 NFL over/unders. The learning set varied from four seasons (2008-2011) to two seasons (2010-2011). I found that the predictive power was stronger using the model derived from the four-season data.
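For anyone who wants to run the same kind of comparison, here is a rough sketch of training on the last N seasons and scoring on a held-out season. Everything in it is an assumption for illustration: the single feature `x`, the hand-rolled gradient-descent logistic regression, and especially the data, which is randomly generated placeholder data and says nothing about real NFL over/unders.

```python
import math
import random


def fit_logistic(rows, lr=0.1, epochs=500):
    """Fit p(y=1) = sigmoid(w*x + b) by batch gradient descent.

    rows is a list of (x, y) pairs with y in {0, 1}.
    """
    w = b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(model, rows):
    """Share of rows where the predicted class matches the label."""
    w, b = model
    hits = sum((w * x + b > 0) == (y == 1) for x, y in rows)
    return hits / len(rows)


def compare_windows(seasons, test_year, window_lengths):
    """Train on the N seasons before test_year, for each N; score on test_year."""
    results = {}
    for n in window_lengths:
        train = [row
                 for year in range(test_year - n, test_year)
                 for row in seasons[year]]
        results[n] = accuracy(fit_logistic(train), seasons[test_year])
    return results


def fake_season(rng, n_games=256):
    """Synthetic placeholder games: one feature, label 1 = 'over'."""
    rows = []
    for _ in range(n_games):
        x = rng.gauss(0, 1)
        p_over = 1.0 / (1.0 + math.exp(-0.5 * x))
        rows.append((x, 1 if rng.random() < p_over else 0))
    return rows


rng = random.Random(0)
seasons = {year: fake_season(rng) for year in range(2008, 2013)}
print(compare_windows(seasons, 2012, [2, 4]))
```

A single held-out season is a small test set, so a difference between the two-season and four-season models would probably need to be checked across several test seasons before reading much into it.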