For anyone with Sabermetrics knowledge...

metaldome · 03-13-09 02:34 AM

I can't wait for baseball to start. This is the first season I am trying to come up with my own system of projecting baseball scores so I can bet games where I see value in the lines. I want to explain to you some of what I am doing so you can tell me where I may be going wrong.

1) I tested a bunch of pitching and batting statistics to see how well they correlated with wins and runs scored, and found OPS and OPSA (OPS against) to be the best. I know a few stats are supposed to be a little better (wOBA, 1.8OPS, etc.), but OPS and OPSA were easier to find and use than any of the others, and the difference between them was fairly small (less than one percent).

2) Next, I plotted every major league team's season OPS against its average RPG (and OPSA against opponents' average RPG) for the last three years on one graph and found the best-fit line. R squared came out to about .87 (so the line explains about 87% of the variation in RPG), and the formula was off by an average of 2.5%, or about 0.12 RPG, according to my calculations.
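For anyone who wants to reproduce this kind of fit, here is a minimal sketch in Python of a least-squares line of RPG on OPS, along with R squared and the average percentage error. The numbers in SAMPLE are made up for illustration and are not the data described above, and the function names are hypothetical.

```python
# Made-up (OPS, runs per game) pairs for illustration only; not real team data.
SAMPLE = [
    (0.700, 4.10), (0.720, 4.35), (0.745, 4.62),
    (0.760, 4.85), (0.790, 5.20), (0.815, 5.45),
]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(points, slope, intercept):
    """Share of the variance in y that the fitted line accounts for."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return 1.0 - ss_res / ss_tot


def mean_abs_pct_error(points, slope, intercept):
    """Average of |actual - predicted| / actual across all points."""
    errors = [abs(y - (slope * x + intercept)) / y for x, y in points]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    slope, intercept = fit_line(SAMPLE)
    print(f"RPG = {slope:.2f} * OPS + {intercept:.2f}")
    print(f"R squared: {r_squared(SAMPLE, slope, intercept):.3f}")
    print(f"Avg error: {mean_abs_pct_error(SAMPLE, slope, intercept):.1%}")
```

In a one-variable fit like this, R squared is the square of the correlation coefficient, so an R squared of .87 corresponds to a correlation of about .93.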

I did not make any league adjustments because when I compared each league separately to MLB as a whole, I found that the difference was pretty small (again, less than one percent), and I wanted to keep this as simple as possible. I also figured that any difference from having a designated hitter (or not) would already be reflected in each team's OPS (and therefore in the projected score).

Although I did not find much of a correlation between OPSA and unearned runs (only about 17%; obviously they depend more on fielding than pitching), including them does not seem to lower the accuracy of the formula. For this reason, I decided to include both earned and unearned runs, as this should bring the predicted score and total closer to the actual game results. Would you agree?

I also wondered whether I should put the formula in terms of runs per game (R/G), runs per nine innings, or runs per inning. I decided on runs per game, thinking it would be easier (total offensive innings are hard to get; they are close to the team's innings pitched, but not exact) and closer to actual scores (you can't know beforehand whether a game will go into extra innings) than the others. Do you think this is best?

I have not yet added park adjustments because so far I am unsure whether they can be calculated with any degree of accuracy, how I should do it, or whether it will make much of a difference (except for a few teams like Colorado and San Diego). It would also be almost impossible to come up with anything for teams with new stadiums. As for home field advantage, again I don’t know how accurate you could really get, and was thinking of just going with the major league average of about four percent. Any ideas?
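One simple and widely used way to estimate a park factor is to compare total runs per game (both teams combined) in a team's home games with the same rate in its road games. The sketch below assumes that approach; the function name and the sample totals are hypothetical, and a single season of data can be noisy, so treat the result as a rough guide only.

```python
def park_factor(home_rs, home_ra, home_games, road_rs, road_ra, road_games):
    """Runs per game (both teams) at home divided by the same rate on the road.

    A value above 1.0 suggests the park inflates scoring; below 1.0, that it
    suppresses scoring.
    """
    if home_games <= 0 or road_games <= 0:
        raise ValueError("game counts must be positive")
    home_rate = (home_rs + home_ra) / home_games
    road_rate = (road_rs + road_ra) / road_games
    return home_rate / road_rate


if __name__ == "__main__":
    # Hypothetical season totals, for illustration only.
    pf = park_factor(home_rs=400, home_ra=410, home_games=81,
                     road_rs=350, road_ra=360, road_games=81)
    print(f"Park factor: {pf:.3f}")
```

Averaging the factor over several seasons should steady it, which is also why a brand-new stadium gives you little to work with.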

I think that last year my decisions were too heavily influenced by the last three games for a pitcher and the last ten games for a team. This year I would like to use at least a year's worth of data in my calculations. For pitchers this is a piece of cake, but I am not sure how I could do this easily for team OPS (lineups change, and players get injured or traded throughout the season and from season to season). Any ideas? (Remember, I want to keep this somewhat simple and don't need to be totally exact. I can't spend eight hours a day collecting information and doing calculations.)
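One low-effort way to cope with changing lineups is to build a lineup OPS from each expected starter's own OPS over the past year, weighted by the plate appearances he is expected to get. This is only an approximation (true OPS is OBP plus SLG, which have different denominators), and the function name and player numbers below are invented for illustration.

```python
def lineup_ops(hitters):
    """Weighted average OPS for a lineup.

    hitters: list of (ops, expected_pa) pairs, one per batter, where
    expected_pa is the weight given to that batter (an assumption you supply).
    """
    total_pa = sum(pa for _, pa in hitters)
    if total_pa <= 0:
        raise ValueError("total expected plate appearances must be positive")
    return sum(ops * pa for ops, pa in hitters) / total_pa


if __name__ == "__main__":
    # Hypothetical one-year OPS figures and expected plate appearances.
    tonight = [(0.850, 4.6), (0.790, 4.5), (0.910, 4.4),
               (0.820, 4.3), (0.760, 4.2), (0.730, 4.1),
               (0.700, 4.0), (0.680, 3.9), (0.640, 3.8)]
    print(f"Projected lineup OPS: {lineup_ops(tonight):.3f}")
```

When a regular is injured or traded, you would swap in his replacement's numbers and leave the rest of the list alone.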

Lastly, it seems from looking at other predictive models that I should adjust the numbers to see how teams would do against a league-average pitching staff (or for pitchers, against a league-average offense) before calculating scores (based on my numbers for offense, starters, and bullpens, and the number of innings I think they will pitch). I think I know how I could do this, but am not exactly sure why this is important. Can anyone explain it to me?
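One common way to express this idea is to turn each side's number into an index relative to league average and then multiply the indexes back onto the league scoring level. Without that step, combining a team's runs scored with an opponent's runs allowed counts the league run environment twice. The sketch below assumes this multiplicative form, which is only one of several possible adjustments; the function name and inputs are hypothetical.

```python
def expected_runs(offense_rpg, opp_allowed_rpg, league_rpg):
    """Projected runs for an offense facing a given pitching staff.

    offense_rpg:     runs per game the offense scores against average pitching
    opp_allowed_rpg: runs per game the pitching allows to an average offense
    league_rpg:      league-average runs per game per team
    """
    if league_rpg <= 0:
        raise ValueError("league_rpg must be positive")
    offense_index = offense_rpg / league_rpg       # 1.0 means league average
    pitching_index = opp_allowed_rpg / league_rpg  # below 1.0 means stingy
    return league_rpg * offense_index * pitching_index


if __name__ == "__main__":
    # Hypothetical inputs: a good offense against a good pitching staff.
    print(f"{expected_runs(5.0, 4.0, 4.5):.2f} projected runs")
```

An average offense against average pitching comes out at exactly the league rate, which is a quick sanity check on the inputs.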

Sorry if this was long. I hope it wasn't too confusing and that some people got something out of it. Any help with the questions above will be greatly appreciated.

Data · 03-13-09 10:29 AM

The first question you should ask yourself is what your edge is. If you plan to get an edge by creating a model based on stats, then you should dig deeper and not take any shortcuts. Any past or future success of a system based on stats and methods that are common knowledge is due to pure chance.

MrX · 03-13-09 12:46 PM

Data is spot on with his advice, but your post is a well thought-out request for advice and I'll try to give some.