1. #1
    Bsims
    Join Date: 02-03-09
    Posts: 827
    Betpoints: 13

    Clean Data Downloads

    As I finally begin preparing for my annual baseball analysis, I'm faced with the issue of where to get good, clean, downloadable data. I've started this season by downloading the 2017 MLB data from the SBR archives. Unfortunately, the format has changed a little again (not a big deal), but I've also encountered some garbage data.

    The next step will be to pick up the data from Covers. Past experience tells me it will also include some garbage data. Furthermore, the data it does include will likely differ from the SBR data (primarily in the odds).

    Does anyone know of other sources with cleaner data?

  2. #2
    ClippersSux
    Join Date: 12-10-10
    Posts: 95
    Betpoints: 1047

    Check out sportsoptions. They seem reliable.

  3. #3
    vampire assassin
    Join Date: 03-09-18
    Posts: 279
    Betpoints: 9908

    Cleaning data is always a huge headache. If you have two sources, you might accept as "good" any line where the two sources are within 10 cents of each other. Other options include paying someone who has already done the work, or getting the data directly from a sportsbook (and even that might have some problems).
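
    A minimal sketch of that two-source check, assuming the odds are American moneylines stored as integers. The function names are made up for illustration, and the "cents" scale treats -100 and +100 as the same price, which is an assumption about how the gap should be measured across even money.

```python
def cents_scale(american_odds: int) -> int:
    """Map American odds onto a continuous 'cents' scale.

    -100 and +100 are the same price, so both map to 0.
    Examples: +105 -> 5, -105 -> -5, -150 -> -50.
    """
    if american_odds >= 100:
        return american_odds - 100
    if american_odds <= -100:
        return american_odds + 100
    raise ValueError(f"not a valid American price: {american_odds}")


def cents_apart(odds_a: int, odds_b: int) -> int:
    """Distance in cents between two American prices."""
    return abs(cents_scale(odds_a) - cents_scale(odds_b))


def lines_agree(odds_a: int, odds_b: int, tolerance: int = 10) -> bool:
    """True when the two sources are within `tolerance` cents of each other."""
    return cents_apart(odds_a, odds_b) <= tolerance
```

    With this, lines_agree(-150, -142) is True (8 cents apart), while lines_agree(-120, 105) is False (25 cents apart), so the second game would be flagged for a manual look.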

  4. #4
    vampire assassin
    Join Date: 03-09-18
    Posts: 279
    Betpoints: 9908

    Quote Originally Posted by ClippersSux View Post
    Check out sportsoptions. They seem reliable.
    I like Sportsoptions. Their line data is pretty clean.

  5. #5
    Bsims
    Join Date: 02-03-09
    Posts: 827
    Betpoints: 13

    Quote Originally Posted by ClippersSux View Post
    Check out sportsoptions. They seem reliable.
    I looked at their site but it is not free. With their acquisition by Don Best they indicated their prices would go up. I'm only interested in free data. I supply my own programming and analysis.

  6. #6
    On the come
    Join Date: 09-03-11
    Posts: 125
    Betpoints: 6499

    Quote Originally Posted by vampire assassin View Post
    I like Sportsoptions. Their line data is pretty clean.
    Is there a simple way of getting historical data from SO? Right now the only way I can find is to go Scores>Archive>>select date>>>repeat.

  7. #7
    Miz
    Join Date: 08-30-09
    Posts: 695
    Betpoints: 3162

    I get MLB data from FanGraphs. Odds data needs to be scraped, or you can just pay someone to scrape it.

  8. #8
    Miz
    Join Date: 08-30-09
    Posts: 695
    Betpoints: 3162

    I can probably send you open/close lines for a few years... or I can scrape up to the current day in exchange for something you have. Just message me if you're interested.

  9. #9
    vegasreaper
    Join Date: 04-27-11
    Posts: 5,656
    Betpoints: 11044

    I use baseball-reference.com, my friend, to get any and all data you may be searching for. GL

  10. #10
    hubie69
    I am JJs bookie
    Join Date: 09-16-10
    Posts: 7,329
    Betpoints: 617

    I currently scrape baseball-reference for baseball data, and I also use their sister sites for ncaaf/ncaab data. Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, as linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be correlated. If you're not proficient in Python (for scraping), there's a handful of git pages out there where people have already built scrapers for <sport>-reference.com. Some modification may be needed though.

  11. #11
    vegasreaper
    Join Date: 04-27-11
    Posts: 5,656
    Betpoints: 11044

    Quote Originally Posted by hubie69 View Post
    I currently scrape baseball-reference for baseball data, and I also use their sister sites for ncaaf/ncaab data. Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, as linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be correlated. If you're not proficient in Python (for scraping), there's a handful of git pages out there where people have already built scrapers for <sport>-reference.com. Some modification may be needed though.
    Very true, Hubie69. I couldn't agree with you more.

  12. #12
    Bsims
    Join Date: 02-03-09
    Posts: 827
    Betpoints: 13

    Quote Originally Posted by hubie69 View Post
    Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, as linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be correlated.
    I faced this problem many years ago when I first started writing programs to process data from multiple web sites. Here is how I solved it.

    • Created a real team name file for each sport. It contains the standardized name that I use for each team, along with an abbreviation for that team (e.g. CLE INDIANS,cle).
    • Created an alias team name file that contains every variation of a team name I have encountered, paired with the real name (e.g. Cleveland,CLE INDIANS).
    • Wrote a find-team subroutine. When I encounter a team name, I pass it to this routine, which searches the team name files and returns the standardized name and abbreviation.
    • Created a GameID made up of the game date and the two team abbreviations (e.g. 20180728detcle0). (The 0 becomes 1 or 2 for doubleheaders.)

    I can then sort, merge, and compare on the GameIDs. Obviously, some utility functions are needed to maintain these files. Yes, it was a lot of work initially, but it is very effective now.
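
    The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the original programs: the function names, the in-memory CSV strings standing in for the two files, the "DET TIGERS,det" row, and the order of the two abbreviations in the GameID are all hypothetical.

```python
import csv
import io
from datetime import date

# Hypothetical file contents; the real files would live on disk, one pair per sport.
REAL_NAMES_CSV = "CLE INDIANS,cle\nDET TIGERS,det\n"
ALIAS_CSV = "Cleveland,CLE INDIANS\nIndians,CLE INDIANS\nDetroit,DET TIGERS\n"


def load_real_names(text):
    """Real team name file: standardized name -> abbreviation."""
    return {name.strip().upper(): abbr.strip().lower()
            for name, abbr in csv.reader(io.StringIO(text))}


def load_aliases(text):
    """Alias file: any variation seen in a source -> standardized name."""
    return {alias.strip().upper(): real.strip().upper()
            for alias, real in csv.reader(io.StringIO(text))}


def find_team(name, real_names, aliases):
    """Return (standardized name, abbreviation) for any known spelling."""
    key = name.strip().upper()
    real = key if key in real_names else aliases.get(key)
    if real is None or real not in real_names:
        # Unknown spelling: this is the cue to add a row to the alias file.
        raise KeyError(f"unknown team name: {name!r}")
    return real, real_names[real]


def make_game_id(game_date, first_team, second_team,
                 real_names, aliases, game_number=0):
    """Build a GameID such as 20180728detcle0.

    Assumption: every source lists the two teams in the same order.
    game_number is 0 for a single game, 1 or 2 for a doubleheader.
    """
    _, first_abbr = find_team(first_team, real_names, aliases)
    _, second_abbr = find_team(second_team, real_names, aliases)
    return f"{game_date:%Y%m%d}{first_abbr}{second_abbr}{game_number}"


REAL = load_real_names(REAL_NAMES_CSV)
ALIASES = load_aliases(ALIAS_CSV)
```

    For example, make_game_id(date(2018, 7, 28), "Detroit", "Cleveland", REAL, ALIASES) gives 20180728detcle0, so rows from different sites end up with the same key no matter how each site spells the team.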

  13. #13
    ChuckyTheGoat
    SBR PRO
    Join Date: 04-04-11
    Posts: 31,511
    Betpoints: 24869

    Quote Originally Posted by Bsims View Post
    I faced this problem many years ago when I first started writing programs to process data from multiple web sites. Here is how I solved it.

    • Created a real team name file for each sport. It contains the standardized name that I use for each team, along with an abbreviation for that team (e.g. CLE INDIANS,cle).
    • Created an alias team name file that contains every variation of a team name I have encountered, paired with the real name (e.g. Cleveland,CLE INDIANS).
    • Wrote a find-team subroutine. When I encounter a team name, I pass it to this routine, which searches the team name files and returns the standardized name and abbreviation.
    • Created a GameID made up of the game date and the two team abbreviations (e.g. 20180728detcle0). (The 0 becomes 1 or 2 for doubleheaders.)

    I can then sort, merge, and compare on the GameIDs. Obviously, some utility functions are needed to maintain these files. Yes, it was a lot of work initially, but it is very effective now.
    I like this. Team & Date should be standard in a lookup table. Like u say, a final number to indicate the 1st/2nd game of a double-header.

    There are some other sports where a "double-header" can show up. Sometimes in tennis, u see a player w/ two matches on one date if they are playing a make-up match after a rainout (for example).
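
    One way the final number could be assigned automatically, sketched in Python. It assumes each game already has a key of date plus the two abbreviations (no final digit yet) and that the list is in the order the games were played; the function name and the "nyabos" key are hypothetical.

```python
from collections import Counter


def assign_game_ids(base_ids):
    """Append the final digit: 0 for a lone game, 1, 2, ... for repeats.

    base_ids: keys like '20180728detcle' (no final digit), in playing order.
    """
    totals = Counter(base_ids)   # how many times each matchup occurs that day
    seen = Counter()             # how many of each we have numbered so far
    game_ids = []
    for base in base_ids:
        if totals[base] == 1:
            suffix = 0           # single game
        else:
            seen[base] += 1
            suffix = seen[base]  # 1st, 2nd, ... game of the same matchup
        game_ids.append(f"{base}{suffix}")
    return game_ids
```

    Calling assign_game_ids(["20180728detcle", "20180728nyabos", "20180728detcle"]) returns ["20180728detcle1", "20180728nyabos0", "20180728detcle2"]. The same counting would work for a tennis player with two matches on one date, as long as the key identifies the matchup and the day.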
