Suppose we are trying to create a model to predict winners of baseball games; we select a binary logit specification. The probability of a team winning is specified as a function of, say, the published odds (most important!), plus any other variables that we think may help explain the outcomes.
Of course there are a variety of measures that could be employed to assess goodness-of-fit for such a model. However, the main thing we're interested in is not how well it fits per se, but whether the probability estimates it generates provide a measurable edge against the available odds. If so, we bet. And of course we test out-of-sample to see whether the edges hold up.
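Here's a minimal sketch of that setup, assuming decimal odds and made-up variable names (the synthetic data is just a stand-in for real game records, and "extra" could be any covariate you like):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for real game data (hypothetical setup): published
# decimal odds on the team of interest, one extra covariate, outcomes.
dec_odds = rng.uniform(1.45, 3.2, n)          # published decimal odds
implied_p = 1.0 / dec_odds                    # market-implied prob. (vig ignored)
extra = rng.normal(0.0, 1.0, n)               # e.g. a rest-days differential
logit = np.log(implied_p / (1 - implied_p)) + 0.3 * extra
win = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Binary logit: outcome on the market's implied odds plus the extra variable.
X = sm.add_constant(pd.DataFrame({
    "logit_implied": np.log(implied_p / (1 - implied_p)),
    "extra": extra,
}))
fit = sm.Logit(win, X).fit(disp=0)

# The quantity we actually care about: model probability minus market probability.
model_p = fit.predict(X)
edge = model_p - implied_p                    # positive edge -> candidate bet
print(fit.params)
print("share of games with positive edge:", (edge > 0).mean())
```

In practice you'd fit on one sample and compute the edges on a holdout, but the mechanics are the same.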
A problem arises, however, if the model doesn't actually fit very well. A poor model with little explanatory power will generate probability estimates near 50% for each of the two alternatives; in fact, 50/50 is the exact prediction if the covariates have no explanatory value at all. If we insist on using such a model anyway, it will tend to show a supposed "edge" almost exclusively on underdogs: the market prices an underdog below 50%, so a prediction shrunk toward 50% sits above the implied probability and looks like value. By the same mechanism it will almost never show an edge on favorites, since the market prices them above 50% and the shrunken prediction sits below the implied probability. The "edges" are an artifact of weak fit, not information.
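A toy example makes the mechanism obvious. Take the limiting case of a model with no information at all, which predicts 50% for every game:

```python
# Toy illustration of the failure mode: a no-information model predicts
# ~50% for every game, so its "edge" over the market is mechanically
# positive on every underdog and negative on every favorite.
import numpy as np

rng = np.random.default_rng(1)
implied_p = rng.uniform(0.30, 0.70, 10)        # market-implied win probabilities
weak_model_p = np.full_like(implied_p, 0.50)   # uninformative model

edge = weak_model_p - implied_p
for p, e in zip(implied_p, edge):
    side = "underdog" if p < 0.5 else "favorite"
    print(f"implied={p:.2f} ({side}): edge={e:+.2f}")
# Every underdog shows a positive "edge", every favorite a negative one.
```

A merely weak (rather than useless) model does the same thing in diluted form: its predictions are pulled partway toward 50%, so the spurious edges still pile up on the underdog side.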
In practical terms, I use this to screen my logit models (instead of trying to assess the various goodness-of-fit measures directly): if a model is projecting large edges primarily on underdogs and rarely on favorites, then it is probably not a very good model.
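That screen is easy to mechanize. Here's a hedged sketch of one way to do it, using `model_p` and `implied_p` as in the earlier snippet; the correlation and share cutoffs are hypothetical, not calibrated values:

```python
import numpy as np

def looks_like_shrinkage(model_p, implied_p, corr_cutoff=-0.5, dog_cutoff=0.8):
    """Flag a model whose 'edges' look like an artifact of predictions
    shrinking toward 50% rather than genuine information.

    The cutoffs are illustrative placeholders, not tuned values.
    """
    model_p = np.asarray(model_p)
    implied_p = np.asarray(implied_p)
    edge = model_p - implied_p
    # Shrinkage signature 1: edge falls as the implied probability rises.
    corr = np.corrcoef(implied_p, edge)[0, 1]
    # Shrinkage signature 2: the biggest edges land almost entirely on underdogs.
    big = edge >= np.quantile(edge, 0.9)
    dog_share = np.mean(implied_p[big] < 0.5)
    return (corr < corr_cutoff) and (dog_share > dog_cutoff)

# e.g. looks_like_shrinkage(model_p, implied_p) on the fitted model above
```

A model that survives this screen can still be bad, of course; the point is just that one failing it almost certainly is.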
Thoughts?