For all those that are interested in a simulator.

  • ljump12
    SBR High Roller
    • 12-08-09
    • 113

    #1
    I'll preface this by saying that running this simulator WILL NOT yield profitable results. That said, I don't think it's impossible to get close with some hard work.

    I've been working on this for a while, and have run out of steam, and time -- so I'm opening it up to the community here. It's not the best documented code, but if you spend some time you should be able to figure it out.

    This started as a project for a class, and I've attached a presentation that I gave on the simulator -- including some pretty awful results it had.

    In a nutshell, players.py models the players, and gets all the information from the database -- sim.py runs the simulation once, and game_backtester.py will run the simulations on every game in the season.

    Code can be found here: http://github.com/ljump12/Baseball-Simulator/
  • arwar
    SBR High Roller
    • 07-09-09
    • 208

    #2
    well i downloaded it - i can't find an attached presentation, either here or on the link. Just out of curiosity, why would you build a simulator that only runs one simulation? isn't the whole point to run 10000 or so? Does it have to access the retrosheet database on each simulation? or does it build a team lineup file and then use that file each time? i saw a file called phillies09 is that like a lineup file based on the 09 season. I don't use retrosheet since they don't have line data, but do they have current (2010) data?
    Comment
    • ljump12
      SBR High Roller
      • 12-08-09
      • 113

      #3
      Forgot to attach the presentation, I'll do it tomorrow --

      1) the simulator runs each game 1000 times. (you can make it more if you want)
      2) you need to have a local copy of the retrosheet db. Refer to the README
      3) I pickle the player object for each player, so once you ask for them once, you don't have to ask the database again...ever.
      4) the phillies09 was just an example of the output for the phillies 09 season.
      5) no line data, and no current season data -- though with some work you could probably get the current season's data into the database. I think the best approach is to backtest with the games you have now, and if you can make it profitable, spend the time to get current season data.
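The caching idea in point 3 can be sketched roughly as follows. This is only an illustration, not the code from the repository: `load_player`, `fetch_from_db` and the cache directory are hypothetical names, and the real project may lay out its pickles differently.

```python
import os
import pickle
import tempfile

# Hypothetical cache location; the real project may store its pickles elsewhere.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "player_cache_example")


def load_player(player_id, fetch_from_db, cache_dir=CACHE_DIR):
    """Return a player object, querying the database only on a cache miss.

    fetch_from_db is a stand-in for whatever function builds a player
    from the local retrosheet database.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "%s.pkl" % player_id)

    # Fast path: the player was pickled on an earlier request.
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)

    # Slow path: build the player once, then pickle it for next time.
    player = fetch_from_db(player_id)
    with open(path, "wb") as fh:
        pickle.dump(player, fh)
    return player
```

The trade-off is that a pickled player never picks up new data unless the cache file is deleted, which is fine for backtesting a finished season.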
      Comment
      • Wrecktangle
        SBR MVP
        • 03-01-09
        • 1524

        #4
        Ijump, any estimates on how long it takes to make 10000 runs with the code as it now stands?

        I have to point out that every extra line you insert into an interpreted language loop can really extend the run time, or maybe I just need a faster machine.
        Comment
        • ljump12
          SBR High Roller
          • 12-08-09
          • 113

          #5
          Yeah, 10000 games takes about a minute and a half -- long, but reasonable. When I ran the whole season, I only did 1000 sims/game for speed's sake.
          Comment
          • Flight
            Restricted User
            • 01-28-09
            • 1979

            #6
            Thanks ljump, looks interesting! I may even start my own branch, would you mind if I contributed?
            Comment
            • ljump12
              SBR High Roller
              • 12-08-09
              • 113

              #7
              Originally posted by Flight
              Thanks ljump, looks interesting! I may even start my own branch, would you mind if I contributed?
              This is why I contributed, if you have questions about the code, post here or pm me... Some of the code has been updated since -- i need to update it. Will do it in a minute.

              EDIT:: Just updated it.
              Last edited by ljump12; 05-02-10, 07:27 PM.
              Comment
              • MonkeyF0cker
                SBR Posting Legend
                • 06-12-07
                • 12144

                #8
                Originally posted by Wrecktangle
                Ijump, any estimates on how long it takes to make 10000 runs with the code as it now stands?

                I have to point out that every extra line you insert into an interpreted language loop can really extend the run time, or maybe I just need a faster machine.
                I wouldn't run fewer than 100,000 and I'd recommend 1,000,000+. Unless you don't care to factor in standard deviation.
                Comment
                • ljump12
                  SBR High Roller
                  • 12-08-09
                  • 113

                  #9
                  Seems like overkill to me, I agree 1000 may be too small, but 1,000,000?
                  Comment
                  • MonkeyF0cker
                    SBR Posting Legend
                    • 06-12-07
                    • 12144

                    #10
                    Originally posted by ljump12
                    Seems like overkill to me, I agree 1000 may be too small, but 1,000,000?
                    What you're creating is a binomial distribution: a sample of simulated games. Just like flipping a fair coin, there will be variance in your results. However, with more games, that variance (as a percentage) will decrease and your results will more closely resemble the true probability of success. Three standard deviations from the mean (+/-) will include the results that you would expect to see in 99.7% of trials. For example, if you had a 1,000 simulated game sample with Team A winning 500 games (p = 50%), your standard deviation would be SQRT(1000 × 0.5 × 0.5) = ±15.81139 games (1.58114%), so three standard deviations is ±4.74342%. So to be 99.7% confident in the results of your simulation, pWin(Team A) would actually be between the values of 45.257% and 54.743%. You can see how that can quickly be an issue. Even at 10,000 simulations (assuming everything else remains unchanged), your error is ±1.5%. At 1,000,000 simulations, you can have 99.7% confidence within 0.15%.
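A minimal sketch of that arithmetic, assuming each simulated game is an independent trial with the same win probability p. The function name is only illustrative.

```python
import math


def three_sigma_margin(n_sims, p=0.5):
    """Half-width of the 3-standard-deviation band on the simulated win rate.

    Returned as a fraction of games (0.015 means +/- 1.5%).
    """
    sd_games = math.sqrt(n_sims * p * (1 - p))  # binomial SD, in games
    return 3 * sd_games / n_sims                # convert to a share of all sims


# Print the band for a few sample sizes at p = 0.5.
for n in (1000, 10000, 100000, 1000000):
    print("%9d sims: +/- %.3f%%" % (n, 100 * three_sigma_margin(n)))
```

Because the margin shrinks with the square root of the sample size, cutting it by a factor of 10 costs 100 times as many simulations.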
                    Comment
                    • MonkeyF0cker
                      SBR Posting Legend
                      • 06-12-07
                      • 12144

                      #11
                      This is also why I stress efficiency in programming by the way...
                      Comment
                      • Wrecktangle
                        SBR MVP
                        • 03-01-09
                        • 1524

                        #12
                        Unfortunately, this does get to be an issue. With 10 games in a day, even at 1.5 minutes a game you are at 15 minutes, and if you uncover a few mistakes along the way (I usually run into 2-3 in error checking) you can get into the better part of an hour. Possible batting orders and other variables will add to that. Taking Monkey's point, you can then get into hours for a run set.
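A back-of-the-envelope version of that estimate, assuming run time scales linearly with the number of simulations and starting from the roughly 1.5 minutes per 10,000 simulations quoted earlier in the thread. The function and parameter names are made up for illustration.

```python
# From the thread: about a minute and a half per 10,000 simulations of one game.
SECONDS_PER_10K = 90.0


def slate_hours(n_sims, n_games=10, reruns=1):
    """Estimated wall-clock hours for a day's slate.

    Assumes linear scaling in n_sims; reruns counts how many times the
    whole slate is repeated after fixing mistakes.
    """
    seconds = SECONDS_PER_10K * (n_sims / 10000.0) * n_games * reruns
    return seconds / 3600.0


print(slate_hours(10000))             # one pass over 10 games
print(slate_hours(10000, reruns=3))   # three passes after error checking
print(slate_hours(1000000))           # 1,000,000 sims per game
```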
                        Comment
                        • ljump12
                          SBR High Roller
                          • 12-08-09
                          • 113

                          #13
                          Point taken -- I agree python wasn't the best language for this... But it's there if anyone would like to work on it
                          Comment
                          • uva3021
                            SBR Wise Guy
                            • 03-01-07
                            • 537

                            #14
                            what is your one game standard deviation factor

                            I think it's different if you are simulating a game score vs just simulating win or lose
                            Comment
                            • MonkeyF0cker
                              SBR Posting Legend
                              • 06-12-07
                              • 12144

                              #15
                              Originally posted by uva3021
                              what is your one game standard deviation factor

                              I think it's different if you are simulating a game score vs just simulating win or lose
                              SQRT( np(1-p) )

                              Google "binomial standard deviation" for more information.
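Turning the formula around gives a rough way to pick a simulation count for a target error. This is a sketch under the same binomial assumption (win/lose outcomes only, independent simulations); `sims_needed` is an illustrative name, and z = 3 matches the 99.7% band discussed earlier in the thread.

```python
import math


def sims_needed(margin, p=0.5, z=3.0):
    """Smallest n for which z * sqrt(p * (1 - p) / n) is at most margin.

    margin is a fraction of games, e.g. 0.01 for +/- 1%.
    """
    return int(math.ceil(p * (1 - p) * (z / margin) ** 2))


# How many simulations for a few target margins at p = 0.5.
for target in (0.05, 0.01, 0.005):
    print("+/- %.1f%% needs about %d sims" % (100 * target, sims_needed(target)))
```

Using p = 0.5 is the conservative choice, since p(1-p) is largest there.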
                              Comment
                              • arwar
                                SBR High Roller
                                • 07-09-09
                                • 208

                                #16
                                Originally posted by ljump12
                                Point taken -- I agree python wasn't the best language for this... But it's there if anyone would like to work on it
                                i haven't had time to wade through the code - i am not python anyway - do you have pseudo code or outline of the logic - i still didn't find the presentation. i do RAD stuff so it may be possible to create a binary executable without too much trouble.
                                Comment
                                • Wrecktangle
                                  SBR MVP
                                  • 03-01-09
                                  • 1524

                                  #17
                                  Ijump, I build hybrid expected value - Monte Carlo (**) models so Python can be very useful for me as the run time can be much shorter. So, I see your code as being very useful. While ** is thought to be the most reliable statistically, you typically must have distributions to draw from and this is where you can go very wrong if you build them incorrectly. Most folks only really know a few (normal, binomial, poisson) and do not even realize that some sports have special distributions not even in the books.
                                  Comment
                                  • MonkeyF0cker
                                    SBR Posting Legend
                                    • 06-12-07
                                    • 12144

                                    #18
                                    Originally posted by Wrecktangle
                                    Ijump, I build hybrid expected value - Monte Carlo (**) models so Python can be very useful for me as the run time can be much shorter. So, I see your code as being very useful. While ** is thought to be the most reliable statistically, you typically must have distributions to draw from and this is where you can go very wrong if you build them incorrectly. Most folks only really know a few (normal, binomial, poisson) and do not even realize that some sports have special distributions not even in the books.
                                    Umm. Normal = binomial(CLT) = poisson.

                                    LOL.
                                    Last edited by MonkeyF0cker; 05-04-10, 12:25 PM.
                                    Comment
                                    • arwar
                                      SBR High Roller
                                      • 07-09-09
                                      • 208

                                      #19
                                      obfuscate - i still don't see the presentation and i still would appreciate the logic of the simulator without having to wade through the python code. i have coded enough PHP stuff to make me want to go to COBOL (maybe even Visual COBOL [kidding of course]) and i suspect python is a red headed stepchild of PHP. I thought this was about baseball simulations, not some sport that has a special distribution that has yet to be quantified or Bayesian predicates. I run simulations every day and maybe if we can get past the spelling book contest we might be able to build a better model.
                                      Comment
                                      • jbrent95
                                        SBR MVP
                                        • 12-07-09
                                        • 1221

                                        #20
                                        In an effort to broaden my horizons, I've spent some time learning some of the basics for Python and run all of ljump12's modules. However, I've hit a snag that I haven't been able to work through. When I try to run the retrosheet.py, the following error occurs:

                                        Traceback (most recent call last):
                                          File "C:\Python27\retrosheet.py", line 75, in <module>
                                            for match in re.finditer(pattern, urllib.urlopen(RETROSHEET_URL).read(), re.S):
                                          File "C:\Python27\lib\re.py", line 186, in finditer
                                            return _compile(pattern, flags).finditer(string)
                                          File "C:\Python27\lib\re.py", line 245, in _compile
                                            raise error, v # invalid expression
                                        error: unknown specifier: ?Ph

                                        I haven't figured out if there is a bug, if I need to rename the inputs, or take another course of action.
                                        Comment
                                        • ScoreProphet
                                          SBR Rookie
                                          • 09-01-10
                                          • 11

                                          #21
                                          retrosheet.py fails because the layout of the page it scrapes the game files from has changed:

                                          There was a major reorganization of downloadable event files on November 21, 2009. As of that date, event files
                                          are packaged into ZIP archives and available on a year by year basis directly from this page. Previously it was
                                          necessary to go through an intermediate page to get to the downloadable archives.

                                          Maybe change this:

                                          Code:
                                          pattern = r'href="(?Phttp://www.retrosheet.org/(?P\d{4})/\d{4}(?P\w{2}).htm)"'
                                          for match in re.finditer(pattern, urllib.urlopen(RETROSHEET_URL).read(), re.S):
                                                  url = "http://www.retrosheet.org/%s/%s%s.zip" % (match.group("year"), match.group("year"), match.group("league"))
                                                  queue.put(url)
                                          to this:

                                          Code:
                                          pattern = r'\d+eve\.zip'
                                          # finditer yields match objects, so match.group() works (findall returns plain strings)
                                          for match in re.finditer(pattern, urllib.urlopen(RETROSHEET_URL).read()):
                                                  url = "http://www.retrosheet.org/events/%s" % match.group()
                                                  queue.put(url)
                                          Does anybody better at this than I am have any suggestions?
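One way to check the new pattern without touching the network is to run it over a saved or hand-written piece of HTML. The sketch below does that. The base URL is the one from the post above, while `event_zip_urls` and the sample markup are made up for illustration and may not match the real Retrosheet page.

```python
import re

# Base URL as given in the post above; not verified against the live site.
EVENTS_BASE = "http://www.retrosheet.org/events/"
ZIP_PATTERN = re.compile(r"\d+eve\.zip")


def event_zip_urls(html):
    """Return event-archive URLs found in html, de-duplicated, in first-seen order."""
    names = []
    for name in ZIP_PATTERN.findall(html):  # findall returns plain strings here
        if name not in names:
            names.append(name)
    return [EVENTS_BASE + name for name in names]


# Hypothetical sample markup standing in for the downloaded page.
sample = (
    '<a href="events/2008eve.zip">2008</a> '
    '<a href="events/2009eve.zip">2009</a> '
    '<a href="events/2009eve.zip">2009 (again)</a>'
)
print(event_zip_urls(sample))
```

Each returned URL could then be handed to `queue.put(url)` as in the original loop.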
                                          Comment