Clean Data Downloads

Bsims · 03-21-18 08:37 AM

As I finally begin my annual baseball analysis in preparation for the season, I'm faced with the issue of where to get good, clean, downloadable data. I've started this season by downloading the 2017 MLB data from the SBR archives. Unfortunately, the format has changed a little again (not a big deal), but I've also encountered some garbage data.

The next step will be to pick up the data from Covers. Past experience tells me it will also include some garbage data. Furthermore, the data it does include will likely differ from the SBR data (primarily in the odds).

Does anyone know of other sources with cleaner data?

ClippersSux · 03-21-18 10:54 AM

Check out sportsoptions. They seem reliable.

vampire assassin · 03-21-18 08:20 PM

Cleaning data is always a huge headache. If you have two sources, you might accept as "good" any line where both sources are within 10 cents of each other. Other options include paying someone who has already done the work, or getting the data directly from a sportsbook (and even that might have some problems).
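A minimal sketch of that two-source check, assuming both sources give American moneyline prices keyed by some shared game identifier. The function names and the sample dictionaries are hypothetical, and "cents" is interpreted here so that -105 and +105 count as 10 cents apart rather than 210.

```python
def cents_scale(american_odds: int) -> int:
    """Map American odds onto a continuous scale.

    Assumption: prices of +100 and above and -100 and below are shifted
    toward zero by 100, so that -105 and +105 end up 10 cents apart.
    """
    if american_odds >= 100:
        return american_odds - 100
    if american_odds <= -100:
        return american_odds + 100
    raise ValueError(f"not a valid American price: {american_odds}")


def lines_agree(odds_a: int, odds_b: int, tolerance: int = 10) -> bool:
    """True when the two prices are within `tolerance` cents of each other."""
    return abs(cents_scale(odds_a) - cents_scale(odds_b)) <= tolerance


def accept_good_lines(source_a: dict, source_b: dict, tolerance: int = 10) -> dict:
    """Keep only games present in both sources whose prices agree."""
    good = {}
    for game_id, odds_a in source_a.items():
        odds_b = source_b.get(game_id)
        if odds_b is not None and lines_agree(odds_a, odds_b, tolerance):
            good[game_id] = (odds_a, odds_b)
    return good
```

Games that appear in only one source, or whose prices disagree by more than the tolerance, are simply left out, so they can be reviewed by hand instead of silently trusted.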

vampire assassin · 03-21-18 08:21 PM

Originally Posted by ClippersSux

Check out sportsoptions. They seem reliable.

I like Sportsoptions. Their line data is pretty clean.

Bsims · 03-22-18 05:32 AM

Originally Posted by ClippersSux

Check out sportsoptions. They seem reliable.

I looked at their site but it is not free. With their acquisition by Don Best they indicated their prices would go up. I'm only interested in free data. I supply my own programming and analysis.

On the come · 06-15-18 09:47 AM

Originally Posted by vampire assassin

I like Sportsoptions. Their line data is pretty clean.

Is there a simple way of getting historical data from SO? Right now the only way I can find is to go Scores>Archive>>select date>>>repeat.

Miz · 06-23-18 01:10 PM

I get MLB data from FanGraphs. Odds data needs to be scraped, or you can just pay someone to scrape it.

Miz · 06-23-18 01:11 PM

I can probably send you open/close lines for a few years... or I can scrape up to the current day in exchange for something you have. Just message me if you're interested.

vegasreaper · 07-27-18 04:33 AM

I use baseball-reference.com, my friend, to get any and all data you may be searching for. GL

hubie69 · 07-27-18 09:57 AM

I currently scrape baseball-reference for baseball data, and I also use their sister sites for NCAAF/NCAAB data. Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, as linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be correlated. If you're not proficient in Python (for scraping), there's a handful of Git pages out there where people have already built scrapers for <sport>-reference.com. Some modification may be needed though.

vegasreaper · 07-27-18 07:41 PM

Originally Posted by hubie69

I currently scrape baseball-reference for baseball data, and I also use their sister sites for NCAAF/NCAAB data. Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, as linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be correlated. If you're not proficient in Python (for scraping), there's a handful of Git pages out there where people have already built scrapers for <sport>-reference.com. Some modification may be needed though.

Very true, hubie69. I couldn't agree with you more.

Bsims · 07-28-18 06:59 AM

Originally Posted by hubie69

Things REALLY get messy if you need to scrape multiple sites for NCAA stuff, as linking the teams together is a pain: different names, spellings, team IDs, etc. all have to be correlated.

I faced this problem many years ago when I first started writing programs to process data from multiple web sites. Following is how I solved the problem.

Created a real team name file for each sport. It contains the standardized name that I use for each team, along with an abbreviation for that team (e.g. CLE INDIANS,cle).
Created an alias team name file that contains every variation of the team name I have encountered, paired with the real name (e.g. Cleveland,CLE INDIANS).
Wrote a find-team subroutine. When I encounter a team name, I pass it to this routine, which searches the team name files and returns the standardized name and abbreviation.
Created a GameID made up of the game date and the two team abbreviations (e.g. 20180728detcle0). The trailing 0 becomes 1 or 2 for doubleheaders.

I can then sort, merge, and compare on the GameIDs. Obviously there are some utility functions needed to maintain these files. Yes, it was a lot of work initially, but it is very effective now.
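For anyone who wants to try the same approach in Python, here is a rough sketch of the lookup and GameID steps. It is not my actual code: the in-memory dictionaries stand in for the two files, the DET TIGERS and Detroit entries are made-up sample rows, and the order of the two abbreviations (first team passed in goes first) is an assumption.

```python
from datetime import date

# Stand-in for the real team name file: standardized name -> abbreviation.
# (Sample rows only; the DET TIGERS entry is hypothetical.)
REAL_TEAMS = {"CLE INDIANS": "cle", "DET TIGERS": "det"}

# Stand-in for the alias file: variation seen on some site -> standardized name.
ALIASES = {"Cleveland": "CLE INDIANS", "Detroit": "DET TIGERS"}


def _normalize(name: str) -> str:
    """Collapse whitespace and ignore case so small variations still match."""
    return " ".join(name.split()).casefold()


# One combined index, so real names and aliases are both accepted.
_LOOKUP = {_normalize(real): real for real in REAL_TEAMS}
_LOOKUP.update({_normalize(alias): real for alias, real in ALIASES.items()})


def find_team(raw_name: str) -> tuple[str, str]:
    """Return (standardized name, abbreviation) for any known variation."""
    try:
        real = _LOOKUP[_normalize(raw_name)]
    except KeyError:
        # Failing loudly flags names that still need an alias entry.
        raise KeyError(f"unknown team name {raw_name!r}; add it to the alias file") from None
    return real, REAL_TEAMS[real]


def make_game_id(game_date: date, first_team: str, second_team: str, game_number: int = 0) -> str:
    """Build date + two abbreviations + game number (0, or 1/2 for doubleheaders)."""
    _, first = find_team(first_team)
    _, second = find_team(second_team)
    return f"{game_date:%Y%m%d}{first}{second}{game_number}"


print(make_game_id(date(2018, 7, 28), "Detroit", "Cleveland"))  # 20180728detcle0
```

Because the ID starts with the date in YYYYMMDD form, a plain string sort puts games in date order, and the same ID can be used as the key when merging records from two sites.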

ChuckyTheGoat · 07-28-18 08:41 PM

Originally Posted by Bsims

I faced this problem many years ago when I first started writing programs to process data from multiple web sites. Following is how I solved the problem.

Created a real team name file for each sport. It contains the standardized name that I use for each team, along with an abbreviation for that team (e.g. CLE INDIANS,cle).
Created an alias team name file that contains every variation of the team name I have encountered, paired with the real name (e.g. Cleveland,CLE INDIANS).
Wrote a find-team subroutine. When I encounter a team name, I pass it to this routine, which searches the team name files and returns the standardized name and abbreviation.
Created a GameID made up of the game date and the two team abbreviations (e.g. 20180728detcle0). The trailing 0 becomes 1 or 2 for doubleheaders.

I can then sort, merge, and compare on the GameIDs. Obviously there are some utility functions needed to maintain these files. Yes, it was a lot of work initially, but it is very effective now.

I like this. Team and date should be standard in a lookup table. Like you say, a final number can indicate the 1st/2nd game of a doubleheader.

There are some other sports where a "doubleheader" can show up. Sometimes in tennis, you see a player with two matches on one date if they are playing a make-up match after a rainout (for example).
