Clean Data Downloads

  • Bsims
    SBR Wise Guy
    • 02-03-09
    • 827

    #1
    Clean Data Downloads
    As I finally begin preparing my annual baseball analysis, I'm faced with the issue of where to get good, clean, downloadable data. I've started this season by downloading the 2017 MLB data from the SBR archives. Unfortunately, the format has again changed a little (not a big deal). But I've also encountered some garbage data.

    The next step will be to pick up the data from Covers. Past experience tells me it will also include some garbage data. Furthermore, the data it does include will likely differ from the SBR data (primarily in the odds).

    Does anyone know of other sources with cleaner data?
  • ClippersSux
    SBR Hustler
    • 12-10-10
    • 95

    #2
    Check out sportsoptions. They seem reliable.
    • vampire assassin
      SBR Sharp
      • 03-09-18
      • 296

      #3
      Cleaning data is always a huge headache. If you have two sources, you might accept as "good" any line where both sources are within 10 cents of each other. Other options could include paying someone who has already done the work, or getting it directly from a sportsbook (and even that might have some problems).
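The two-source check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: odds are American moneylines stored as integers, "cents" are measured on a scale where -100 and +100 are the same price (so -105 and +105 are 10 cents apart), and each source is a plain dict keyed by a shared game identifier. The function names are hypothetical, not anything from SBR or Covers.

```python
def to_cents_scale(american_odds: int) -> int:
    """Map American odds onto a continuous scale where -100 and +100 coincide."""
    if american_odds >= 100:
        return american_odds - 100
    if american_odds <= -100:
        return american_odds + 100
    # Anything strictly between -100 and +100 is not a valid American price.
    raise ValueError(f"not a valid American price: {american_odds}")


def lines_agree(odds_a: int, odds_b: int, tolerance_cents: int = 10) -> bool:
    """True when two prices for the same side are within the tolerance."""
    gap = abs(to_cents_scale(odds_a) - to_cents_scale(odds_b))
    return gap <= tolerance_cents


def split_clean(source_a: dict, source_b: dict, tolerance_cents: int = 10):
    """Return (clean, suspect) lists of game IDs present in both sources."""
    clean, suspect = [], []
    for game_id in sorted(source_a.keys() & source_b.keys()):
        if lines_agree(source_a[game_id], source_b[game_id], tolerance_cents):
            clean.append(game_id)
        else:
            suspect.append(game_id)
    return clean, suspect
```

Games that appear in only one source are ignored here; in practice they would probably deserve their own "unverified" bucket rather than being silently dropped.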
      • vampire assassin
        SBR Sharp
        • 03-09-18
        • 296

        #4
        Originally posted by ClippersSux
        Check out sportsoptions. They seem reliable.
        I like Sportsoptions. Their line data is pretty clean.
        • Bsims
          SBR Wise Guy
          • 02-03-09
          • 827

          #5
          Originally posted by ClippersSux
          Check out sportsoptions. They seem reliable.
          I looked at their site but it is not free. With their acquisition by Don Best they indicated their prices would go up. I'm only interested in free data. I supply my own programming and analysis.
          • On the come
            SBR High Roller
            • 09-03-11
            • 125

            #6
            Originally posted by vampire assassin
            I like Sportsoptions. Their line data is pretty clean.
            Is there a simple way of getting historical data from SO? Right now the only way I can find is to go Scores > Archive > select date > repeat.
            • Miz
              SBR Wise Guy
              • 08-30-09
              • 695

              #7
              I get MLB data from FanGraphs. Odds data needs to be scraped, or you can just pay someone to scrape it.
              • Miz
                SBR Wise Guy
                • 08-30-09
                • 695

                #8
                I can probably send you open/close lines for a few years, or I can scrape up to the current day in exchange for something you have. Just message me if you're interested.
                • vegasreaper
                  SBR Hall of Famer
                  • 04-27-11
                  • 5656

                  #9
                  I use baseball-reference.com, my friend, to get any and all data you may be searching for. GL
                  • hubie69
                    SBR Hall of Famer
                    • 09-16-10
                    • 7329

                    #10
                    I currently scrape baseball-reference for baseball data, and I use their sister sites for NCAAF/NCAAB data. Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, because linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be reconciled. If you're not proficient in Python (for scraping), there are a handful of git pages out there where people have already built scrapers for <sport>-reference.com. Some modification may be needed, though.
                    • vegasreaper
                      SBR Hall of Famer
                      • 04-27-11
                      • 5656

                      #11
                      Originally posted by hubie69
                      I currently scrape baseball-reference for baseball data, and I use their sister sites for NCAAF/NCAAB data. Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, because linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be reconciled. If you're not proficient in Python (for scraping), there are a handful of git pages out there where people have already built scrapers for <sport>-reference.com. Some modification may be needed, though.
                      Very true, hubie69. I couldn't agree with you more.
                      • Bsims
                        SBR Wise Guy
                        • 02-03-09
                        • 827

                        #12
                        Originally posted by hubie69
                        Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, because linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be reconciled.
                        I faced this problem many years ago when I first started writing programs to process data from multiple web sites. Following is how I solved the problem.

                        • Created a real team name file for each sport. This contained the standardized name that I would use for each team. It also contained an abbreviation for that team (e.g. CLE INDIANS,cle).
                        • Created an alias team name file that contained any variations of the team name I encountered, along with the real name (e.g. Cleveland,CLE INDIANS).
                        • Wrote a find-team subroutine. When I encountered a team name, I passed it to this routine, and it would search the team name files and return the standardized name and abbreviation.
                        • Created a GameID that has the game date and the two abbreviations (e.g. 20180728detcle0). (The 0 becomes 1 or 2 for doubleheaders.)


                        I can then sort, merge, and compare on the GameIDs. Obviously there are some utility functions needed to maintain these files. Yes, it was a lot of work initially, but it is very effective now.
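As a rough illustration of the scheme described in the post above, the lookup and GameID steps might look something like the sketch below. It makes several assumptions that are not in the post: the two "files" are shown as in-memory dicts, matching is case-insensitive, the table entries beyond the Cleveland example are made up for illustration, and the order of the two abbreviations in the GameID is simply the order the caller passes them in. The names `find_team` and `make_game_id` are hypothetical.

```python
from datetime import date

# "Real team name file": standardized name -> abbreviation (illustrative entries).
REAL_TEAMS = {
    "CLE INDIANS": "cle",
    "DET TIGERS": "det",
}

# "Alias file": any variation seen in a source -> standardized name.
ALIASES = {
    "Cleveland": "CLE INDIANS",
    "Indians": "CLE INDIANS",
    "Detroit": "DET TIGERS",
}

# Upper-cased alias keys so lookups are case-insensitive (an assumption).
_ALIAS_LOOKUP = {alias.upper(): real for alias, real in ALIASES.items()}


def find_team(raw_name: str) -> tuple:
    """Return (standardized name, abbreviation) for a name seen in a source."""
    key = raw_name.strip().upper()
    real = key if key in REAL_TEAMS else _ALIAS_LOOKUP.get(key)
    if real is None:
        # Unknown spellings are surfaced so the alias table can be extended.
        raise KeyError(f"unknown team name: {raw_name!r}")
    return real, REAL_TEAMS[real]


def make_game_id(game_date: date, first_team: str, second_team: str,
                 game_number: int = 0) -> str:
    """Build date + two abbreviations + doubleheader digit (0, or 1/2)."""
    _, first_abbr = find_team(first_team)
    _, second_abbr = find_team(second_team)
    return f"{game_date:%Y%m%d}{first_abbr}{second_abbr}{game_number}"
```

With these tables, `make_game_id(date(2018, 7, 28), "Detroit", "Cleveland")` yields the same ID as the example in the post, and records from different sources can then be joined on that string.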
                        • ChuckyTheGoat
                          BARRELED IN @ SBR!
                          • 04-04-11
                          • 37198

                          #13
                          Originally posted by Bsims
                          I faced this problem many years ago when I first started writing programs to process data from multiple web sites. Following is how I solved the problem.

                          • Created a real team name file for each sport. This contained the standardized name that I would use for each team. It also contained an abbreviation for that team (e.g. CLE INDIANS,cle).
                          • Created an alias team name file that contained any variations of the team name I encountered, along with the real name (e.g. Cleveland,CLE INDIANS).
                          • Wrote a find-team subroutine. When I encountered a team name, I passed it to this routine, and it would search the team name files and return the standardized name and abbreviation.
                          • Created a GameID that has the game date and the two abbreviations (e.g. 20180728detcle0). (The 0 becomes 1 or 2 for doubleheaders.)


                          I can then sort, merge, and compare on the GameIDs. Obviously there are some utility functions needed to maintain these files. Yes, it was a lot of work initially, but it is very effective now.
                          I like this. Team and date should be standard in a lookup table. Like you say, a final number can indicate the 1st/2nd game of a doubleheader.

                          There are some other sports where a "doubleheader" can show up. Sometimes in tennis, you see a player with two games on one date if they are playing a make-up game after a rainout, for example.