Optimum Sample Size

Justin7 · 05-14-10, 12:15 AM

You need to be *very* careful making any assumptions using data from a different season. This is especially difficult when analyzing players, because the sample for one season is rarely enough to make decent projections.

On team stats, I typically assume each team starts from scratch each season, and base projections going forward after 4 games. If you have a lot of seasons, you can increase the number of preliminary games before making projections.

You raise a good issue though. There is always this tough decision in deciding which data is relevant, whether you are handicapping or just studying conversions.

Wrecktangle · 05-14-10, 08:02 AM

IMO, there is no 'optimal' answer to this question. It will vary by sport, part of season, and year.

Dark Horse · 05-14-10, 08:13 AM

Originally posted by Leverage

Too small and there's not enough data; too big and it doesn't take player deterioration into account. What's the optimum sample size, in seasons or parts of seasons, for regression?

Discuss.

I've adjusted my view on the sample size issue. I divide bets into two basic categories. The approaches that are good for a season, and the approaches (far more valuable; and typically far more work to distill) that are good season after season.

For the first category the 'old' way of using sample size is useful (Z-score). But for the second category that threshold is much too high. So instead of relying on quantity, it's become a matter of recognizing quality for me. The ability to do so is based on experience.

Wrecktangle · 05-15-10, 10:44 AM

In the mid-80s, I graded over 10,000 NFL angles by z-score; some had scores above 5 (5 standard deviations). After two years all had degraded and had proved only marginally useful (home dogs are good). After this I moved on to statistical modeling.
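For readers unfamiliar with grading an angle by z-score, here is a minimal sketch. It assumes a simple binomial model (each bet is an independent win or loss, pushes excluded) and a null win rate of 50%; the function name and the example records are hypothetical, not anyone's actual grading method from this thread.

```python
import math

def record_z_score(wins, losses, null_win_rate=0.5):
    """Z-score of a win/loss record against a null win rate.

    Uses the normal approximation to the binomial distribution.
    null_win_rate=0.5 is an assumption; a bettor laying -110 might
    prefer the break-even rate instead.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one graded bet")
    expected_wins = n * null_win_rate
    std_dev = math.sqrt(n * null_win_rate * (1 - null_win_rate))
    return (wins - expected_wins) / std_dev

# Hypothetical records: same 70% win rate, different sample sizes.
z_small = record_z_score(14, 6)     # 20 bets
z_large = record_z_score(140, 60)   # 200 bets
print(f"14-6:   z = {z_small:.2f}")
print(f"140-60: z = {z_large:.2f}")
```

The same win rate scores very differently at 20 bets and at 200 bets. A high z-score only describes the past record; it says nothing about whether the angle will hold up in later seasons.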

mathdotcom · 05-15-10, 02:56 PM

Originally posted by Justin7

You need to be *very* careful making any assumptions using data from a different season. This is especially difficult when analyzing players, because the sample for one season is rarely enough to make decent projections.

On team stats, I typically assume each team starts from scratch each season, and base projections going forward after 4 games. If you have a lot of seasons, you can increase the number of preliminary games before making projections.

You raise a good issue though. There is always this tough decision in deciding which data is relevant, whether you are handicapping or just studying conversions.

Really? I realize sometimes this is the best one can do, but sometimes the best still isn't good enough.

I at least hope this is for football and not baseball.

Leverage · 05-18-10, 12:40 AM

Originally posted by mathdotcom

I at least hope this is for football and not baseball.

What about 4 starts for a pitcher?

Justin7 · 05-18-10, 10:04 AM

4 games is for football and baskets, not MLB.

mathdotcom · 05-18-10, 10:14 AM

The tradeoff of course is between making inferences based on a small sample (and thus inviting a lot more randomness than you'd like), and waiting so long that you only make a bet on the last game of the season.

So waiting only 4 MLB games is ridiculous given you have 162 in the season. But waiting for 4 pitcher starts is a lot more admissible. Likewise for football.

4 games for baskets is ludicrous, especially at the start of the season.
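To put rough numbers on the small-sample side of that tradeoff, here is a sketch that assumes each game is an independent coin flip with a fixed win probability (a simplification, and not a model anyone in this thread proposed). It prints the standard error of an observed win rate after a given number of games; the function name is hypothetical.

```python
import math

def win_rate_standard_error(true_rate, games):
    """Standard error of an observed win rate after `games` games,
    assuming independent games with a constant win probability."""
    if games <= 0:
        raise ValueError("games must be positive")
    return math.sqrt(true_rate * (1 - true_rate) / games)

# 4 and 16 games (short samples), 40 games, and a full 162-game MLB season.
for games in (4, 16, 40, 162):
    se = win_rate_standard_error(0.5, games)
    print(f"{games:>3} games: observed win rate is off by about +/- {se:.3f}")
```

Under these assumptions the noise shrinks with the square root of the sample, so four times as many games only halves the error.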

Justin7 · 05-18-10, 01:51 PM

Originally posted by mathdotcom

4 games for baskets is ludicrous, especially at the start of the season.

I think it depends on the approach. NCAAB is a different beast, but 4 games seems to work in WNBA.

Grind-It-Out · 05-18-10, 10:53 PM

Originally posted by Leverage

Too small and there's not enough data; too big and it doesn't take player deterioration into account. What's the optimum sample size, in seasons or parts of seasons, for regression?

Discuss.

What kind of data are you looking at?

I have an MLB system that I kick off around July 1st every year (because I'm not so hot on using player stats from a previous year) that looks at very basic stats like ERA, WHIP, R, H, RBI, etc. I just weight the stats such that the most recent performance has the most weight, the first performance of the year has the least weight, etc. The best algorithm to use for weight distribution is debatable. Of course you still need to account for things like coming off of injuries, or being moved up from AAA, but you'd have to account for that anyway.
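A minimal sketch of that kind of recency weighting, assuming linear weights (the oldest outing gets weight 1, the newest gets weight n). Linear weights are only one arbitrary choice among many, and the function name and sample numbers are hypothetical.

```python
def recency_weighted_average(values):
    """Weighted average of per-game values ordered oldest first.

    The i-th oldest value gets weight i (1, 2, ..., n), so the most
    recent performance counts the most and the first one the least.
    Linear weights are an assumption, not a recommendation.
    """
    if not values:
        raise ValueError("need at least one value")
    weights = range(1, len(values) + 1)
    weighted_sum = sum(w * v for w, v in zip(weights, values))
    return weighted_sum / sum(weights)

# Hypothetical runs allowed in a pitcher's four starts, oldest first.
runs_allowed = [5, 4, 3, 1]
plain = sum(runs_allowed) / len(runs_allowed)
weighted = recency_weighted_average(runs_allowed)
print(f"plain average:    {plain:.2f}")
print(f"recency-weighted: {weighted:.2f}")
```

Because the pitcher in this made-up example is improving, the weighted figure comes out lower than the plain average. An exponential decay would be a drop-in alternative if the weights should fall off faster.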

Leverage · 05-20-10, 10:09 PM

What about a dynamic approach that kicks in once a third of the season is under way and evolves over the other two-thirds? That's what I'm working on now, and it's showing some promise. The only drawback I see is things like March Madness and the World Cup, which might as well be their own seasons.

suicidekings · 05-21-10, 01:08 AM

Originally posted by Leverage

What about a dynamic approach that kicks in once a third of the season is under way and evolves over the other two-thirds? That's what I'm working on now, and it's showing some promise. The only drawback I see is things like March Madness and the World Cup, which might as well be their own seasons.

There's a big difference between capping playoffs/tournaments and regular season games.

For both the NBA and NFL I typically generate two sets of numbers for each matchup based on full season data and current form, and then use both to establish value of available lines from books.

NBA: past season data until about 2-3 weeks into the season, then using the previous ~4 weeks of data (15-18 games) as the season progresses combined with season-long shooting data.
NFL: past season data through Week 3, then the past 3-4 weeks of rolling data throughout the season.

Whatever you decide on, having numbers that reflect the current form of the teams is critical.
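The NFL rule above could be sketched roughly as follows. This assumes "through Week 3" means the previous season's data is used until more than three current-season games have been played, and it uses a 4-game rolling window; the function name, parameters, and sample numbers are all hypothetical.

```python
def form_sample(prev_season, current_season, switch_after=3, window=4):
    """Pick the per-game values to rate a team on.

    Both inputs are lists of per-game values, oldest first.
    Up to and including `switch_after` current-season games, fall back
    to the previous season; after that, use only the most recent
    `window` current-season games (the rolling "current form" sample).
    """
    if len(current_season) <= switch_after:
        return list(prev_season)
    return list(current_season[-window:])

# Hypothetical points-scored data.
last_year = [21, 17, 24, 30]
early = form_sample(last_year, [27, 13])                 # Week 2: still last year's data
later = form_sample(last_year, [27, 13, 20, 31, 24, 10])  # Week 6: last 4 games only
print(early)
print(later)
```

The NBA version described above would need a longer window (15-18 games) and would also mix in season-long shooting data, which this sketch does not attempt.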