1. #1
    Leverage
    HURR DERP DEEP DURP
    Join Date: 07-30-09
    Posts: 253

    Optimum Sample Size

    Too small and there's not enough data; too big and it doesn't take player deterioration into account. What's the optimum sample size, in seasons or parts of seasons, for regression?

    Discuss.

  2. #2
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    You need to be *very* careful making any assumptions using data from a different season. This is especially difficult analyzing players, because the sample for one season is rarely enough to make decent projections.

    On team stats, I typically assume each team starts from scratch each season, and base projections going forward after 4 games. If you have a lot of seasons, you can increase the number of preliminary games before making projections.

    You raise a good issue though. There is always this tough decision in deciding which data is relevant, whether you are handicapping or just studying conversions.

  3. #3
    Wrecktangle
    Join Date: 03-01-09
    Posts: 1,524
    Betpoints: 3209

    IMO, there is no 'optimal' answer to this question. It will vary by sport, part of season, and year.

  4. #4
    Dark Horse
    Deus Ex Machina
    Join Date: 12-14-05
    Posts: 13,764

    Quote Originally Posted by Leverage
    Too small and there's not enough data; too big and it doesn't take player deterioration into account. What's the optimum sample size, in seasons or parts of seasons, for regression?

    Discuss.

    I've adjusted my view on the sample size issue. I divide bets into two basic categories: approaches that are good for a season, and approaches (far more valuable, and typically far more work to distill) that are good season after season.

    For the first category the 'old' way of using sample size is useful (Z-score). But for the second category that threshold is much too high. So instead of relying on quantity, it's become a matter of recognizing quality for me. The ability to do so is based on experience.
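
For anyone unfamiliar with the z-score check mentioned above, here is a minimal sketch of how a win-loss record can be scored against a null win rate. The function name `record_z_score` and the default `p_null=0.5` are assumptions for illustration, not anything the poster specified, and the calculation uses the normal approximation to the binomial.

```python
import math

def record_z_score(wins: int, losses: int, p_null: float = 0.5) -> float:
    """Z-score of a win-loss record against a null win probability.

    Uses the normal approximation to the binomial: pushes are ignored and
    each bet is treated as an independent trial (an assumption).
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one graded bet")
    if not 0.0 < p_null < 1.0:
        raise ValueError("p_null must be strictly between 0 and 1")
    expected_wins = n * p_null
    std_dev = math.sqrt(n * p_null * (1.0 - p_null))
    return (wins - expected_wins) / std_dev

# Same 60% win rate, very different evidence:
print(round(record_z_score(60, 40), 2))  # 2.0
print(round(record_z_score(6, 4), 2))    # 0.63
```

Against standard -110 pricing the break-even rate is 11/21 (about 52.4%), so passing `p_null=11/21` gives a stricter test than the coin-flip default.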

  5. #5
    Wrecktangle
    Join Date: 03-01-09
    Posts: 1,524
    Betpoints: 3209

    In the mid-80s, I graded over 10,000 NFL angles by z-score; some had scores above 5 (5 standard deviations). After two years all had degraded and had proved only marginally useful (home dogs are good). After this I moved on to statistical modeling.

  6. #6
    mathdotcom
    Join Date: 03-24-08
    Posts: 11,689
    Betpoints: 1943

    Quote Originally Posted by Justin7
    You need to be *very* careful making any assumptions using data from a different season. This is especially difficult analyzing players, because the sample for one season is rarely enough to make decent projections.

    On team stats, I typically assume each team starts from scratch each season, and base projections going forward after 4 games. If you have a lot of seasons, you can increase the number of preliminary games before making projections.

    You raise a good issue though. There is always this tough decision in deciding which data is relevant, whether you are handicapping or just studying conversions.

    Really? I realize sometimes this is the best one can do, but sometimes the best still isn't good enough.

    I at least hope this is for football and not baseball.

  7. #7
    Leverage
    HURR DERP DEEP DURP
    Join Date: 07-30-09
    Posts: 253

    Quote Originally Posted by mathdotcom
    I at least hope this is for football and not baseball.

    What about 4 starts for a pitcher?

  8. #8
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    4 games is for football and baskets, not MLB.

  9. #9
    mathdotcom
    Join Date: 03-24-08
    Posts: 11,689
    Betpoints: 1943

    The tradeoff of course is between making inferences based on a small sample (and thus inviting a lot more randomness than you'd like), and waiting so long that you only make a bet on the last game of the season.

    So waiting only 4 MLB games is ridiculous given you have 162 in the season. But waiting for 4 pitcher starts is a lot more admissible. Likewise for football.

    4 games for baskets is ludicrous, especially at the start of the season.
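
To put rough numbers on the small-sample side of that tradeoff, here is a sketch of how the uncertainty in an observed win rate shrinks with the number of games. It assumes independent games and a fixed true win rate, which real teams only approximate; `win_rate_standard_error` is a hypothetical helper name.

```python
import math

def win_rate_standard_error(true_rate: float, games: int) -> float:
    """Standard error of an observed win rate over `games` independent games."""
    if games < 1:
        raise ValueError("games must be at least 1")
    if not 0.0 <= true_rate <= 1.0:
        raise ValueError("true_rate must be between 0 and 1")
    return math.sqrt(true_rate * (1.0 - true_rate) / games)

# A true .500 team: how noisy is its record after n games?
for games in (4, 16, 82, 162):
    se = win_rate_standard_error(0.5, games)
    print(f"{games:>3} games: win rate known to about +/- {se:.3f}")
```

The error falls with the square root of the sample, so going from 4 games to 16 halves it (0.25 to 0.125 for a .500 team). The cost of waiting is every bet passed up in the meantime.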

  10. #10
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    Quote Originally Posted by mathdotcom
    4 games for baskets is ludicrous, especially at the start of the season.

    I think it depends on the approach. NCAAB is a different beast, but 4 games seems to work in WNBA.

  11. #11
    Grind-It-Out
    Join Date: 05-04-10
    Posts: 537
    Betpoints: 942

    Quote Originally Posted by Leverage
    Too small and there's not enough data; too big and it doesn't take player deterioration into account. What's the optimum sample size, in seasons or parts of seasons, for regression?

    Discuss.

    What kind of data are you looking at?

    I have an MLB system that I kick off around July 1st every year (because I'm not so hot on using player stats from a previous year) that looks at very basic stats like ERA, WHIP, R, H, RBI, etc. I just weight the stats such that the most recent performance has the most weight, the first performance of the year has the least weight, etc. The best algorithm to use for weight distribution is debatable. Of course you still need to account for things like coming off of injuries, or being moved up from AAA, but you'd have to account for that anyway.
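
One possible way to do the recency weighting described above is exponential decay. This is a sketch only, since the post leaves the weighting scheme open; the function name, the `decay` parameter, and the sample numbers are all made up for illustration.

```python
def recency_weighted_average(values, decay=0.9):
    """Weighted average of per-game values, ordered oldest to newest.

    The newest value gets weight 1, the one before it `decay`, then
    `decay**2`, and so on, so the first game of the year counts least.
    """
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted_sum = sum(w * v for w, v in zip(weights, values))
    return weighted_sum / sum(weights)

# Made-up runs allowed in four starts, oldest first (an improving pitcher).
runs_allowed = [5.0, 4.0, 3.0, 2.0]
print(recency_weighted_average(runs_allowed, decay=1.0))            # 3.5 (plain mean)
print(round(recency_weighted_average(runs_allowed, decay=0.5), 2))  # 2.73
```

A smaller `decay` forgets old games faster. For rate stats like ERA or WHIP you would apply the weights to the underlying counts (earned runs, innings) rather than to per-game rates.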

  12. #12
    Leverage
    HURR DERP DEEP DURP
    Join Date: 07-30-09
    Posts: 253

    What about a dynamic approach that kicks in once 1/3 of the season is under way and evolves over the other 2/3? That's what I'm working on now, and it's showing some promise. The only drawback I see is things like March Madness and the World Cup, which might as well be their own seasons.

  13. #13
    suicidekings
    Join Date: 03-23-09
    Posts: 9,962

    Quote Originally Posted by Leverage
    What about a dynamic approach that kicks in once 1/3 of the season is under way and evolves over the other 2/3? That's what I'm working on now, and it's showing some promise. The only drawback I see is things like March Madness and the World Cup, which might as well be their own seasons.

    There's a big difference between capping playoffs/tournaments and regular season games.

    For both the NBA and NFL I typically generate two sets of numbers for each matchup, based on full-season data and current form, and then use both to establish the value of available lines from books.

    NBA: past season data until about 2-3 weeks into the season, then using the previous ~4 weeks of data (15-18 games) as the season progresses combined with season-long shooting data.
    NFL: past season data through Week 3, then the past 3-4 weeks of rolling data throughout the season.

    Whatever you decide on, having numbers that reflect the current form of the teams is critical.
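
As a rough illustration of keeping two sets of numbers, the sketch below computes a full-season average and a rolling recent-form average from the same game log. The function name, the `window` default, and the sample margins are assumptions for the example; how the two numbers are then combined with the posted line is left open, as in the post.

```python
def season_and_recent_form(game_values, window=4):
    """Return (season_average, recent_average) for values ordered oldest to newest.

    `recent_average` covers only the last `window` games, or every game
    played so far if fewer than `window` are available.
    """
    if not game_values:
        raise ValueError("need at least one game")
    if window < 1:
        raise ValueError("window must be at least 1")
    season_average = sum(game_values) / len(game_values)
    recent_games = game_values[-window:]
    recent_average = sum(recent_games) / len(recent_games)
    return season_average, recent_average

# Made-up point margins for one team, oldest game first.
margins = [3, -7, 10, 14, 7, -3]
season_avg, recent_avg = season_and_recent_form(margins, window=3)
print(season_avg, recent_avg)  # 4.0 6.0
```

A gap between the two numbers flags a team whose current form has drifted from its season-long baseline.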
