1. #1
    HUY
    Join Date: 04-29-09
    Posts: 253
    Betpoints: 3257

    Avoiding the datamining error

    About two months ago I developed a rating system for a sport which derives win probabilities from the ratings. The system has certain parameters, which were determined by looking at past data and picking the values that minimized error and maximized betting ROI.

    Now, while I was developing the system I wasn't aware of the theoretical work on the datamining error and on overfitted models. But I did think of the problem, and decided to first bet according to the system using only small stakes. So far the system seems to be doing well, in line with my expectations.

    Later, I found this forum and read up on the datamining error, along with the relevant Wikipedia articles. I know that the bets I'm placing right now are a way to test the predictive strength of the system, but it will take some time before I reach an indicative number of bets. Thus, I want to re-determine the parameters of my system in a way that minimizes overfitting. This will also help with my next project, which is to adapt my model to another sport.

    My question is, how exactly should I go about doing so? The system works by looking at past results to calculate a rating.

    I thought of using the data up to a certain point (training data) to determine the parameters of the system (the values that give the least error on the matches within the training data), and then using the rest of the data (validation data) to check whether the error remains acceptable.*
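
    To make that concrete, here is a rough Python sketch of the split I have in mind. Note that `rating_error(params, matches)` is a hypothetical stand-in for my actual system: it would run the ratings with the given parameters and return the prediction error over those matches.

    ```python
    from itertools import product

    def chronological_split(matches, train_fraction=0.8):
        """Split date-ordered match data into training and validation sets."""
        cut = int(len(matches) * train_fraction)
        return matches[:cut], matches[cut:]

    def tune_parameters(matches, param_grid, rating_error):
        train, validation = chronological_split(matches)
        # Pick the parameter combination that minimizes error on the
        # training data only.
        best_params = min(
            (dict(zip(param_grid, values))
             for values in product(*param_grid.values())),
            key=lambda p: rating_error(p, train),
        )
        # The validation error is the honest estimate of out-of-sample
        # performance; expect it to be worse than the training error.
        return best_params, rating_error(best_params, validation)
    ```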

    Is that a viable way, or should my analysis be less time-linear? Should I instead sample some matches randomly from the data to serve as validation data, use the rest as training data to determine the parameters, and then repeat with different samples? Remember that the system uses past results to calculate the rating, so if I do that the training data will lose its continuity.
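
    If I want to repeat the test without destroying that continuity, I suppose something like rolling-origin (walk-forward) evaluation would work: each fold trains on a contiguous prefix of the ordered matches and validates on the block immediately after it. In the sketch below, `validation_error(params, history, block)` is again hypothetical; it would build the ratings from `history` and score the predictions made for the matches in `block`.

    ```python
    def walk_forward_error(matches, params, validation_error,
                           n_folds=5, min_history=200):
        """Average validation error across forward-chaining folds."""
        fold_size = (len(matches) - min_history) // n_folds
        errors = []
        for k in range(n_folds):
            cut = min_history + k * fold_size
            history = matches[:cut]               # contiguous, gap-free past
            block = matches[cut:cut + fold_size]  # matches to predict next
            errors.append(validation_error(params, history, block))
        return sum(errors) / len(errors)
    ```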

    *I have a question on this technique. Suppose we determine the parameters that minimize the error on the training data. The model will then be overfitted to the training data and will typically perform worse on the validation data. Is the goal to maintain acceptable (albeit inferior) performance on the validation data, then partition the data in a different way, continue in the same fashion, and finish by averaging the parameters determined during each partitioning?
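
    In code, the procedure I'm describing would look roughly like this, again with a hypothetical `rating_error(params, matches)` standing in for the real system:

    ```python
    def average_tuned_parameters(partitions, candidate_params, rating_error):
        """Tune on each partition, then average the chosen parameters.

        `partitions` is a list of (train, validation) pairs and
        `candidate_params` is a list of parameter dicts to try.
        """
        tuned = []
        for train, validation in partitions:
            best = min(candidate_params, key=lambda p: rating_error(p, train))
            # Expect this to be worse than the training error; the check is
            # only that it remains acceptable.
            print("validation error:", rating_error(best, validation))
            tuned.append(best)
        # Average each numeric parameter across the partitions.
        return {key: sum(p[key] for p in tuned) / len(tuned)
                for key in tuned[0]}
    ```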

  2. #2
    Dark Horse
    Deus Ex Machina
    Join Date: 12-14-05
    Posts: 13,764

    When going over past results you want to leave one or two seasons untouched, so you can later use them to test your hypothesis on. Seasons that helped develop and fine-tune your approach are useless for determining the value of what you came up with.
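
    A minimal sketch of that holdout idea, assuming the matches are already grouped into chronologically ordered seasons:

    ```python
    def season_holdout(seasons, n_holdout=2):
        """Reserve the most recent seasons as an untouched test set.

        `seasons` is a chronologically ordered list with one entry per
        season. Develop and fine-tune only on the first part; score the
        finished system on the held-out seasons exactly once.
        """
        return seasons[:-n_holdout], seasons[-n_holdout:]
    ```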

  3. #3
    reno cool
    the meaning of harm
    Join Date: 07-02-08
    Posts: 3,567

    I like your idea of picking random parts of seasons and then testing against other random parts. That would seemingly avoid the problem of changes that can occur over the years.
