What to do with 'explainable' outliers in a model

**HeeeHAWWWW** · 01-22-19, 03:40 AM

Rather depends on how robust your algorithm is to outliers, and whether you're talking about regression or classification.

Also on your internal model construction - you say you can explain, but is that explanation external to the model? Or are the explanatory factors features already?

**gui_m_p** · 01-23-19, 01:31 PM

Usually, removing outliers serves to strip some noise out of the training data, so the model can make better predictions in the future if you train it without them.

However, like HeeHAWW said, you should consider whether the variables that explain the outliers are really exogenous to the model.

E.g. if you are predicting the total points of a game and weather is a variable you use, you cannot remove an outlier due to bad weather.
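To illustrate the point with a rough sketch (the wind and scoring numbers below are made up, and `fit_line` is just a hand-rolled one-variable least-squares helper, not anything from this thread): a windy game looks like a huge outlier against the plain average, but once wind is a feature the model explains most of it, so dropping the game would throw away exactly the signal that feature needs.

```python
# Hypothetical data: wind speed (mph) and total points for six games.
wind = [0, 5, 10, 5, 0, 30]
totals = [48, 45, 42, 46, 47, 30]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


mean_total = sum(totals) / len(totals)
# Deviation from the plain average: what you see if weather is NOT in the model.
dev_from_mean = [t - mean_total for t in totals]

intercept, slope = fit_line(wind, totals)
# Residuals once wind is a feature: what the model still cannot explain.
resid_with_wind = [t - (intercept + slope * w) for w, t in zip(wind, totals)]

print(round(dev_from_mean[-1], 1))    # windy game vs. the average
print(round(resid_with_wind[-1], 1))  # windy game vs. the wind-aware fit
```

With this toy data the windy game sits far below the average but close to the fitted line, which is the sense in which it is "explained" rather than noise.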

**QuantumLeap** · 01-23-19, 03:43 PM

If you can quantify the explainable outlier you can adjust your model to that amount. Not all outliers are equal. Some will downright wreck your model.

For example, if a "key" player is out for an NBA team that may be worth 'x' amount of points. If an even more important NBA player is out then that might be worth 10 points or even more.

I've found that the books rarely move the line enough for those players being out, which allows you to fade that team.
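A minimal sketch of that adjustment, assuming you already have a baseline projected margin and your own estimate of what each absent player is worth. The player names, point values, and the function names `adjusted_margin` and `line_gap` are all hypothetical, chosen only to show the arithmetic.

```python
def adjusted_margin(base_margin, absences, player_values):
    """Subtract the assumed point value of each absent player from the model margin."""
    return base_margin - sum(player_values[name] for name in absences)


def line_gap(model_margin, market_margin):
    """Positive: the model rates the team higher than the market. Negative: lower."""
    return model_margin - market_margin


# Hypothetical point values per player; these are not estimates from real data.
player_values = {"star_guard": 6.0, "sixth_man": 1.5}

base = 5.0  # model has the team winning by 5 at full strength
adjusted = adjusted_margin(base, ["star_guard"], player_values)

# Suppose the book only moved the line from 5 to 2 after the injury news.
gap = line_gap(adjusted, 2.0)
print(adjusted, gap)
```

A negative gap here means the market still rates the short-handed team higher than the adjusted model does, which is the "fade that team" situation.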

**gojetsgomoxies** · 03-13-19, 06:07 PM

there is no easy answer to it...... you need to adjust your models, but how much and in what cases, i don't think there'd be that much agreement and it's pretty tiresome and labour-intensive

for all the ubiquitous power ratings out there, someone should track this, i.e. a central power rating source that would estimate how much you should adjust your basic model ...... sort of like a "completion portfolio" in equity portfolio management, i.e. you focus on the top 100 stocks of the S&P 500 and buy/sell one ETF or customized basket for the other 400 stocks (i.e. stocks 101 through 500)

**gojetsgomoxies** · 03-13-19, 06:09 PM

it's like when someone comes up with a model that picks off crazy lines each season but only plays a few games. it's like that crazy line is just as likely to be based on some major event (injury) or even a bad line in a database.

**gojetsgomoxies** · 03-13-19, 06:12 PM

Originally posted by QuantumLeap

I've found that the books rarely move the line enough for those players being out, which allows you to fade that team.

that would be my thought too....... but there was a good poster on here that had the theory that teams play well in 1 game without their star player (everyone is energized by more minutes). it could be one-off or first game of star being out an extended period. maybe more the latter.......

same with the opposite. "oh great, durant's back from injury". but this poster thought it takes some time to re-gel with the star coming back.

i like all the ideas, incl. yours, and i don't necessarily see them as mutually exclusive and this poster was more focussed on first or second game of star absence/return.

**peacebyinches** · 03-13-19, 10:55 PM

Originally posted by gui_m_p

Usually, removing outliers serves to strip some noise out of the training data, so the model can make better predictions in the future if you train it without them.

However, like HeeHAWW said, you should consider whether the variables that explain the outliers are really exogenous to the model.

E.g. if you are predicting the total points of a game and weather is a variable you use, you cannot remove an outlier due to bad weather.

Yes, this!

Oftentimes there is some very useful information in the residuals (aka noise) of your data. If you can extract the residuals from what your model is outputting you can have all sorts of fun, such as using them as a metric of the difference between your fitted values (aka what your model is ultimately trying to predict) and the actual outcomes. With enough noise estimates you can generalize some nifty parameters to include in the model a priori (basically quantifying how influential a certain outlier circumstance is when it comes to upping your residuals). Other cool stuff like principal components analysis (PCA) comes to mind, but it really matters how your model is set up to start with.
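A minimal sketch of the "quantify the circumstance from the residuals" idea, assuming you already have predictions and outcomes side by side. The game records, the `star_out` flag, and the helper name `mean_residual_by_flag` are invented for illustration; a real version would need far more games and some check on sample size.

```python
from statistics import mean

# Hypothetical records: model prediction, actual result, and a circumstance flag.
games = [
    {"predicted": 210.0, "actual": 214.0, "star_out": False},
    {"predicted": 205.0, "actual": 202.0, "star_out": False},
    {"predicted": 198.0, "actual": 199.0, "star_out": False},
    {"predicted": 220.0, "actual": 218.0, "star_out": False},
    {"predicted": 212.0, "actual": 204.0, "star_out": True},
    {"predicted": 208.0, "actual": 202.0, "star_out": True},
    {"predicted": 215.0, "actual": 208.0, "star_out": True},
]


def mean_residual_by_flag(records, flag):
    """Average residual (actual - predicted) for flagged and unflagged games."""
    groups = {True: [], False: []}
    for record in records:
        residual = record["actual"] - record["predicted"]
        groups[bool(record[flag])].append(residual)
    return {key: mean(values) for key, values in groups.items() if values}


effect = mean_residual_by_flag(games, "star_out")
print(effect)
```

The gap between the two averages is a crude estimate of how much that circumstance moves the outcome relative to the model, and could be tried as a feature or a manual adjustment.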

**eaglesfan371** · 03-13-19, 11:45 PM

I find this forum section quite intriguing. It's like a whole new forum with completely different members. No one in the think tank comments in players talk or other sections.

All the quants, PhD stat and math guys must live here. I must continue observing and listening.

**gojetsgomoxies** · 03-14-19, 08:13 PM

if you start making subjective adjustments (obvious reasonable ones) to your objective power ratings systems then is it still an objective system? how can you backtest something with subjective adjustments? maybe you have no interest in that, or just back-test the objective part of it.

**tsty** · 03-15-19, 02:19 AM

Originally posted by gojetsgomoxies

it's like when someone comes up with a model that picks off crazy lines each season but only plays a few games. it's like that crazy line is just as likely to be based on some major event (injury) or even a bad line in a database.

What would be the point?

putting in so much effort for a few bets a year lol

**semibluff** · 03-15-19, 11:17 AM

Throwing out results that don't fit is one possibility. My preferred solution is to place differing emphasis on differing situations, although it's not easy to formulate how much emphasis should be put on any given scenario.

For example, I run a moneyline Pick'em competition on NFL games. If I used the same stake unit for every game, the competition would likely be won by whoever did best at picking successful longshots. Thus the stakes are weighted by how close together the moneylines are: close lines are full stakes and +/-825 lines are at 38%, with many differing lines and %s in between. NFL games with handicaps of over 10 points were 22-4 for favourites in non-Thursday games.

The point is to avoid the danger of ignoring unlikely events whilst avoiding being overwhelmed by them. Kim Si‑woo won the 2017 Players Championship at +25000 with Louis Oosthuizen 2nd at +8000. That probably wrecks a lot of golf betting models. There was also a +50000 player who finished in the top 4 of one of the majors last year.

Key players being injured and very bad weather will happen. If the model excludes those results, the model can't be used whenever there's a possibility of either occurring. That's also ok if you're only betting close to the scheduled start. If you bet several days in advance then it might not be very helpful.
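The stake weighting described above for the Pick'em competition could be sketched roughly as follows. Only two points of the schedule are given in the post (close lines at full stake, +/-825 at 38%), so the straight-line interpolation between them, the use of the favourite's price as the measure of closeness, and the choice of 100 as the "close" end are all assumptions made for illustration, not the actual table.

```python
FULL_STAKE_LINE = 100    # assumed: a pick'em price counts as a "close" line
LONG_LINE = 825          # from the post: +/-825 lines ...
LONG_LINE_WEIGHT = 0.38  # ... are staked at 38%


def stake_weight(favourite_moneyline):
    """Fraction of a full stake for a game, given the favourite's moneyline.

    Linear interpolation between the two stated points (an assumption);
    prices outside that range are clamped to the nearest end.
    """
    size = min(max(abs(favourite_moneyline), FULL_STAKE_LINE), LONG_LINE)
    fraction = (size - FULL_STAKE_LINE) / (LONG_LINE - FULL_STAKE_LINE)
    return 1.0 - fraction * (1.0 - LONG_LINE_WEIGHT)


for line in (-110, -300, -825):
    print(line, round(stake_weight(line), 2))
```

The same shape works for down-weighting rather than deleting extreme observations when fitting a model: the unlikely results still count, just not at full weight.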