1. #1
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    How To Know If I Am Data Mining?

    I am working on a model for MLB (yeah, I know it's a bit late)...

    And I am worried about data mining... but I don't know if that is what I am doing...

    Are there some tests, or just some common knowledge, to tell whether my results are overfitted or data mined and not just "good"?

    Thanks,
    Mav

  2. #2
    Data
    Join Date: 11-27-07
    Posts: 2,236

    Here's an easy way to find out if you're data mining. If you find yourself asking the question, "am I data mining?" the answer is yes.
    Points Awarded:

    Pokerjoe gave Data 2 SBR Point(s) for this post.

    Wheell gave Data 2 SBR Point(s) for this post.


  3. #3
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    That would possibly work for people much more naive than me... I like to pride myself on weighing all possibilities...

    I created the model against thousands of games. Tested it against many more thousands of games...

    Just want to know if there is a "checklist" that says "you data mined" or "over fit the data [model]"..

    For example... if my model uses only a handful of variables... or uses too many... or the model isn't mathematically complex... or the model is mathematically too simple...

    If any of this makes sense...

  4. #4
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    How many approaches did you test?

    In MLB, no model has a prayer of winning unless it is player specific. Does your model account for who is actually in the batting lineup in a game? And who the starting pitcher is? And (not quite as important, but still relevant) which relievers are available?

  5. #5
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    Starting pitchers...yes...
    Batting Lineup and relief pitchers? Indirectly... but not enough to say 'YES'...

    And as far as "how many approaches did I test"... you will have to clarify/explain what you mean.

  6. #6
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    Quote (originally posted by Maverick22):
        Starting pitchers...yes...
        Batting Lineup and relief pitchers? Indirectly... but not enough to say 'YES'...

        And as far as "how many approaches did I test"... you will have to clarify/explain what you mean.

    If your analysis does not consider starting batting lineups, any analysis you do suggesting a profitable approach is probably due to luck or data mining. I'd lay odds that this approach will lose money over the next 5k plays.

  7. #7
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    Surely a model tested against thousands of games can't be purely luck, no?

    Possibly unconventional... but luck? That means I could put two teams in a hat and draw... that's luck...

    And the law of large numbers says my model would approach 50% accuracy if that were the case, right?

    Granted, thousands of games isn't exactly "large numbers".
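
    On the law-of-large-numbers question, here is a minimal sketch, assuming a pure-luck picker that wins each game independently with probability 0.5 (an idealization, not a claim about any real model). It shows how wide the approximate two-standard-deviation band around 50% still is at different sample sizes. The function name `two_sigma_band` is made up for illustration.

```python
import math

def two_sigma_band(n_games, p=0.5):
    """Half-width of an approximate 95% band around p for a pure-luck picker.

    Uses the normal approximation to the binomial: sd = sqrt(p * (1 - p) / n).
    """
    return 2 * math.sqrt(p * (1 - p) / n_games)

for n in (100, 1000, 10000):
    # Print the sample size and the band half-width as a fraction of games.
    print(n, round(two_sigma_band(n), 4))
```

    At 1,000 games the band is still roughly plus or minus 3 percentage points, so a 53% record is inside what luck alone allows. The band also only applies to a single model fixed in advance, not to one picked as the best of many tries.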

  8. #8
    Justin7
    Join Date: 07-31-06
    Posts: 8,577
    Betpoints: 1506

    If you test enough ideas, you will find something that looks good that will lose. If you are ignoring key components (such as who is batting), you are introducing a ton of error into your model. Some games *might* work well (e.g. those involving teams with very few injuries). But the more factors you ignore, the more noise you introduce.

    There's a delicate balance in deciding what to ignore, and what to model. The lowest threshold for a winning model includes starting lineups. Are you also ignoring park factors? This has an impact on moneylines as well as totals.

    Thousands of games? It may or may not be relevant. I've thrown away (hundreds of?) thousands of dollars on my mistakes. Assuming that a sample size of 10k games makes your conclusions valid without a reasonable cause/effect analysis could be a gigantic mistake.
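
    The point that testing enough ideas will eventually turn up a loser that looks good can be sketched with a small simulation. Everything here is an assumption for illustration: each "system" is a coin flip with no edge at all, and `best_of_random_systems` is a hypothetical helper, not anything from the thread.

```python
import random

def best_of_random_systems(n_systems, n_games, seed=0):
    """Return the best win rate among n_systems edge-free coin-flip systems."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best = 0.0
    for _ in range(n_systems):
        # Each game is an independent 50/50 pick; count the wins.
        wins = sum(rng.random() < 0.5 for _ in range(n_games))
        best = max(best, wins / n_games)
    return best

single = best_of_random_systems(1, 1000)
many = best_of_random_systems(200, 1000)
print(f"one system: {single:.3f}, best of 200 systems: {many:.3f}")
```

    None of these systems has any edge, yet the best of 200 will typically show a record well above 50% over 1,000 games. That is why "how many approaches did you test?" matters as much as the sample size.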

  9. #9
    Dark Horse
    Deus Ex Machina
    Join Date: 12-14-05
    Posts: 13,764

    If you start with a logical idea, and then test it against the numbers, you are not data mining.
    If you start with the numbers, and let them shape the idea, you are most likely data mining.
    However, you can still test this idea, going forward, but you have to throw out the positive results you uncovered in the past. If it produces at the same level as before, you correctly interpreted the data. Otherwise, the data took you for a spin.

    So not all data mining is necessarily negative. The use of data without the correct underlying logic is the problem.
    Last edited by Dark Horse; 08-10-10 at 04:31 AM.

  10. #10
    Indecent
    Join Date: 09-08-09
    Posts: 758
    Betpoints: 1156

    It won't apply to a lot of modeling approaches, but if you use a machine-learning based approach, try graphing the accuracy on the training and validation sets to see when they diverge.

  11. #11
    Wrecktangle
    Join Date: 03-01-09
    Posts: 1,524
    Betpoints: 3209

    Unfortunately, the term "data mining" (like the word hacker) is becoming a misused term, at least in the sports forecasting biz. I believe the term is data snooping when you veer off into (self) abuse.

    The accepted method of building a model is to divide your data set into two parts, one used for training and one for validation. The training set should be large enough to avoid seasonality, which means you need at least two seasons of data in the training set. Yes, this means at least two MLB seasons, but you probably want more. I'd have at least 5 seasons or cycles if you really want to put the issue to rest. Your biggest problem is over-fitting with a small training set.
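
    A minimal sketch of that split, assuming each game is a record with a `season` field and that the most recent season(s) are held out for validation. The chronological hold-out, the field names and the helper `split_by_season` are illustrative assumptions, and the sample records are placeholders rather than real data.

```python
def split_by_season(games, n_validation_seasons=1):
    """Hold out the most recent seasons for validation; train on the rest."""
    if n_validation_seasons < 1:
        raise ValueError("need at least one validation season")
    seasons = sorted({g["season"] for g in games})
    held_out = set(seasons[-n_validation_seasons:])
    train = [g for g in games if g["season"] not in held_out]
    valid = [g for g in games if g["season"] in held_out]
    return train, valid

# Placeholder records: 5 seasons with 3 games each (hypothetical data).
games = [{"season": s, "game_id": i} for s in range(2005, 2010) for i in range(3)]
train, valid = split_by_season(games)
print(len(train), "training games,", len(valid), "validation games")
```

    Splitting on whole seasons keeps every validation game later than every training game, so the model is never tuned on games from the period it is judged on.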

    Perhaps the best book on the market for a medium level of math is: Data Mining by Witten & Frank. They also have a nice set of PowerPoint briefs covering a lot of their topics on the net somewhere. A book exists for simpler math and one for the higher math levels. Let me know if you want those titles and I'll "data mine" those out.

  12. #12
    Dark Horse
    Deus Ex Machina
    Join Date: 12-14-05
    Posts: 13,764

    Good points as usual, Wreck.

    In preparing for the NFL season, just now, I came across a forgotten file with some old numbers. They are from nine completely numerical angles that were undoubtedly data mined. The combined record was 219-92 or 70.4% ATS.

    100% data mined and utterly unrealistic (as forward expectation). Is that a problem? Well... I have continued to update these angles until the present time. There is no doubt that their winning percentage came down by a spectacular margin. However, the 61-38 ATS or 61.6% record, going forward, is still rather profitable.
    Last edited by Dark Horse; 08-10-10 at 06:10 PM.
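
    The two records quoted in this post can be checked with a few lines. This is a sketch under a simplifying assumption: each ATS bet is treated as an independent coin flip at 50%, and the break-even rate assumes standard -110 pricing (risk 110 to win 100), which the post does not state. `binom_tail` is a hypothetical helper name.

```python
from math import comb

def binom_tail(wins, games, p):
    """Exact P(X >= wins) for X ~ Binomial(games, p)."""
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )

in_sample = 219 / (219 + 92)   # the data-mined record
forward = 61 / (61 + 38)       # the record after the angles were fixed
break_even = 110 / (110 + 100) # assumed -110 pricing

# Chance of 61 or more wins in 99 bets if the angles had no edge at all.
p_luck = binom_tail(61, 99, 0.5)

print(f"in-sample {in_sample:.1%}, forward {forward:.1%}, break-even {break_even:.1%}")
print(f"P(61+ wins in 99 by luck) = {p_luck:.4f}")
```

    Only the forward record is a fair test, because those bets came after the angles were chosen. The in-sample 70.4% was selected for looking good, so no tail probability computed on it means much.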

  13. #13
    CrimsonQueen
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 1660

    Wreck: What are the books for lower levels of math? Let's say my problem isn't math though, I'm great with that...but my problem is I've never made a model before and don't know exactly what I'm supposed to do... Is there a good book for that? I've read everything in the intro to research thread here, but I definitely need some guidance.
    Thanks!

  14. #14
    CrimsonQueen
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 1660

    I did order Justin7's book...but I have a feeling that is going to be WAY over my head

  15. #15
    Wrecktangle
    Join Date: 03-01-09
    Posts: 1,524
    Betpoints: 3209

    A decent introductory Data Mining book is: "Data Mining Techniques", 2nd ed., by Berry & Linoff. They also have a companion book on SQL and Excel applications.

    The book that is used in many college courses and is considered the "Gold Standard" for higher math is: "The Elements of Statistical Learning", 2nd ed., by Hastie, Tibshirani, Friedman. This book is expensive (around $75) and you should be a serious mathematician to get any utility out of it for that kind of money.

    For what it's worth, Data Mining is a rapidly growing field with a lot of job opportunities even in this crappy market. There are literally thousands of unfilled jobs across the nation in this field today.

  16. #16
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    Quote (originally posted by Indecent):
        It won't apply to a lot of modeling approaches, but if you use a machine-learning based approach, try graphing the accuracy on the training and validation sets to see when they diverge.

    What do you mean by this? The machine learning approach I loosely understand.

    But the accuracy and divergence of the training and validation sets, not so much.

  17. #17
    Indecent
    Join Date: 09-08-09
    Posts: 758
    Betpoints: 1156

    Quote (originally posted by Maverick22):
        What do you mean by this? The machine learning approach I loosely understand.

        But the accuracy and divergence of the training and validation sets, not so much.

    Just graph the accuracy over the duration of training. When the accuracy on the training set keeps going up while the accuracy on the validation set goes down (they diverge), the model is over-trained.
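
    The same check can be done without a graph. This is a minimal sketch, assuming you have recorded one accuracy value per training pass for each set. The function name `divergence_point` and the numbers below are made up for illustration and do not come from any real model.

```python
def divergence_point(train_acc, valid_acc):
    """Return the index of the best validation pass if the curves diverge after it.

    Divergence here means that, after validation accuracy peaks, training
    accuracy ends higher while validation accuracy ends lower. Returns None
    if that pattern is not present.
    """
    best = max(range(len(valid_acc)), key=valid_acc.__getitem__)
    diverged = (
        best < len(valid_acc) - 1
        and train_acc[-1] > train_acc[best]
        and valid_acc[-1] < valid_acc[best]
    )
    return best if diverged else None

# Illustrative accuracy per training pass (hypothetical values).
train_acc = [0.52, 0.55, 0.58, 0.61, 0.64, 0.67]
valid_acc = [0.51, 0.53, 0.54, 0.535, 0.525, 0.515]

print("stop training after pass index:", divergence_point(train_acc, valid_acc))
```

    Note that this only catches over-training of one model during fitting. If the validation set is reused to choose among many models, it stops being an independent check.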

  18. #18
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    Ahhh... I already do that then. Sounds like I am good to go.
