I do scrape from a variety of sites. I use the PowerBasic Console Compiler. I'm an old man and have programmed for years in procedural languages (Fortran, Basic, PL/I, 360 assembler). I realize life would be easier if I moved into the modern era and used some object-oriented stuff, but I'm more interested in playing with the data than learning new things. On top of that, PowerBasic is the language I use for all my mathematical analysis.
The only downside is that you can't scrape some sites (like Matchbook). You really need a site with a structure of pages and identifying information in the URL. For instance, I am generalizing access to Covers data. You load the page for a sport, which links to the teams. Then you parse that page, extracting each team name and its Covers number. From those you can download the previous-results page for each team and parse out the data you need.
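To make that concrete, here's a bare-bones PBCC sketch of the parsing step. I've left the HTTP download out (assume the page is already saved to disk), and the "/team/12345" link format is only a stand-in; you'd check the real Covers page source first:

```
' Pull team names and ID numbers out of a saved sport page.
' The link format <a href="/team/12345">Team Name</a> is assumed.
#COMPILE EXE
#DIM ALL

FUNCTION PBMAIN () AS LONG
    LOCAL sPage AS STRING, sName AS STRING, sNum AS STRING
    LOCAL iPos AS LONG, iEnd AS LONG, hFile AS LONG

    ' Page already downloaded to disk; slurp the whole thing
    hFile = FREEFILE
    OPEN "sportpage.htm" FOR BINARY AS #hFile
    GET$ #hFile, LOF(hFile), sPage
    CLOSE #hFile

    ' Walk the page looking for team links
    iPos = INSTR(sPage, "<a href=""/team/")
    DO WHILE iPos > 0
        iPos = iPos + LEN("<a href=""/team/")
        iEnd = INSTR(iPos, sPage, CHR$(34))      ' closing quote ends the number
        sNum = MID$(sPage, iPos, iEnd - iPos)
        iPos = INSTR(iEnd, sPage, ">") + 1       ' start of the team name
        iEnd = INSTR(iPos, sPage, "<")
        sName = TRIM$(MID$(sPage, iPos, iEnd - iPos))
        PRINT sNum; "  "; sName
        iPos = INSTR(iEnd, sPage, "<a href=""/team/")
    LOOP
END FUNCTION
```

Once you have each Covers number, building the URL for that team's previous-results page is just string concatenation.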
You mentioned the problem of team names not being standard. I solved that years ago by building a file for each sport of "real" team names. In this file, I standardize each team name (e.g., LA Kings in hockey). I also have a three-character abbreviation for each team.
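Both files are just plain delimited text. Something along these lines (the exact layout here is illustrative, not my actual format):

```
' realteams-nhl.txt -- one line per team: abbreviation, standardized name
LAK,LA Kings
MTL,Montreal Canadiens

' aliases-nhl.txt -- one line per alias: alias, standardized name
Los Angeles Kings,LA Kings
L.A. Kings,LA Kings
Habs,Montreal Canadiens
```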
I then have a second file of alias names. It contains the aliases for each team along with the standardized name. Every routine that picks up a team name calls a subroutine that does a table lookup for a match against a real name or an alias and then uses the standardized name. If no match is found, it adds the unidentified name to an error file. I then have a utility that reads the error file and lists each unknown name alongside the real names that look similar (same starting letters). I can then manually indicate which team the new name belongs to and add it to my alias file. The process is a bit slow when you encounter a new web page that uses unusual names (like nicknames), but once it's done, there are rarely many new entries. (When I did soccer last year, figuring out which team was which was a real challenge.)
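The lookup itself is nothing fancy. A stripped-down version looks something like this; the array names, file name, and empty-string return are my illustration here, and the tables are assumed to be loaded already (PARSE$ on each line of the files above does the job):

```
' Return the standardized name for a scraped team name, or "" if unknown.
' RealName() holds standardized names; AliasName()/AliasReal() hold
' alias -> standardized-name pairs. All assumed loaded elsewhere.
FUNCTION StdTeamName(BYVAL sRaw AS STRING, RealName() AS STRING, AliasName() AS STRING, AliasReal() AS STRING) AS STRING
    LOCAL i AS LONG, sKey AS STRING, hErr AS LONG
    sKey = UCASE$(TRIM$(sRaw))

    ' Exact match against the standardized names first
    FOR i = 1 TO UBOUND(RealName)
        IF UCASE$(RealName(i)) = sKey THEN FUNCTION = RealName(i) : EXIT FUNCTION
    NEXT

    ' Then the alias table, returning the standardized name
    FOR i = 1 TO UBOUND(AliasName)
        IF UCASE$(AliasName(i)) = sKey THEN FUNCTION = AliasReal(i) : EXIT FUNCTION
    NEXT

    ' No match: log it for the manual-review utility
    hErr = FREEFILE
    OPEN "unknown.txt" FOR APPEND AS #hErr
    PRINT #hErr, sRaw
    CLOSE #hErr
    FUNCTION = ""
END FUNCTION
```

The "similar names" part of the review utility is just a LEFT$ comparison of the first few letters of each unknown name against the real-names table.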
I've considered picking up a new language to simplify my life. Right now, I'd probably try C# first. But I've got a list of analytical projects a mile long before I get any free time.