Originally posted by arwar
The only downside is that you can't scrape some of the sites (like Matchbook). You really need a site whose pages are structured with identifying information in the URL. For instance, I am generalizing my code for accessing Covers data. You load the page for a sport, which links to the teams. You then parse that page, extracting each team name and covers number. From that you can download the previous-results page for each team and parse out the data you need.
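Roughly, in Python the two-step scrape might look like the sketch below. The URL patterns and the link regex are made up for illustration; the real Covers pages will differ, so treat it as an outline rather than a working scraper.

```python
import re
import requests

# Hypothetical URL patterns -- substitute the real ones for your sport.
SPORT_INDEX_URL = "https://www.covers.com/sport/hockey/nhl/teams"
TEAM_RESULTS_URL = "https://www.covers.com/teams/{team_id}/results"

def scrape_sport(index_url: str) -> dict[str, str]:
    """Load the sport's index page and pull out (team name, covers number) pairs."""
    html = requests.get(index_url, timeout=30).text
    # Assumed link format: <a href="/teams/12345">LA Kings</a> -- adjust
    # the regex to whatever the actual page markup looks like.
    return {name: team_id
            for team_id, name in re.findall(r'href="/teams/(\d+)">([^<]+)</a>', html)}

def scrape_results(team_id: str) -> str:
    """Download one team's previous-results page; parsing is left to the caller."""
    return requests.get(TEAM_RESULTS_URL.format(team_id=team_id), timeout=30).text

if __name__ == "__main__":
    for name, team_id in scrape_sport(SPORT_INDEX_URL).items():
        results_html = scrape_results(team_id)
        # ...parse out the data you need from results_html...
```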
You mentioned the problem of team names not being standard. I solved that years ago by building a file for each sport called "real team names". In this file, I standardize each team name (e.g. LA Kings in hockey). I also keep a three-character abbreviation for each team.
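The file layout here is a guess; a minimal sketch, assuming a simple pipe-delimited text file with one team per line:

```python
# Hypothetical "real team names" file: standardized name | abbreviation.
REAL_NAMES_SAMPLE = """\
LA Kings|LAK
Anaheim Ducks|ANA
San Jose Sharks|SJS
"""

def load_real_names(text: str) -> dict[str, str]:
    """Map standardized team name -> three-character abbreviation."""
    teams = {}
    for line in text.strip().splitlines():
        name, abbrev = line.split("|")
        teams[name] = abbrev
    return teams

print(load_real_names(REAL_NAMES_SAMPLE))
# {'LA Kings': 'LAK', 'Anaheim Ducks': 'ANA', 'San Jose Sharks': 'SJS'}
```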
I then have a second file with alias names. This contains the aliases for each team along with the standardized name. Every routine that picks up a team name calls a subroutine which does a table lookup for a match against a real name or an alias, then uses the standardized name. If no match is found, it adds the unidentified name to an error file.

I also have a utility function that reads the error file, listing each unknown name alongside the real names that are similar (those with the same starting letters). I can then manually indicate which team the new name belongs to and add it to my alias file. The process is a bit slow when you encounter a new web page that uses unusual names (like nicknames), but once it is done, there are rarely many new entries. (When I did soccer last year, figuring out which team was which was a real challenge.)
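A minimal sketch of that lookup-and-alias workflow, assuming the files have already been loaded into in-memory sets and dicts; the file name and helper names here are placeholders, not my actual routines:

```python
def standardize(raw_name: str, real_names: set[str], aliases: dict[str, str],
                error_path: str = "unknown_names.txt") -> str | None:
    """Return the standardized name for raw_name, or log it as unknown."""
    if raw_name in real_names:
        return raw_name
    if raw_name in aliases:
        return aliases[raw_name]          # alias -> standardized name
    with open(error_path, "a") as f:      # queue for manual resolution later
        f.write(raw_name + "\n")
    return None

def suggest_matches(unknown: str, real_names: set[str], prefix_len: int = 3) -> list[str]:
    """List real names sharing the same starting letters, to speed up manual matching."""
    prefix = unknown[:prefix_len].lower()
    return [n for n in real_names if n.lower().startswith(prefix)]
```

Once a suggestion is confirmed by hand, the pair (unknown name, standardized name) just gets appended to the alias file so the lookup succeeds next time.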
I've considered picking up a new language to simplify my life; right now, I'd probably try C# first. But I've got a mile-long backlog of analytical projects before I get any free time.
