Originally posted by arwar
The only downside is that you can't scrape some of the sites (like Matchbook). You really need a site whose pages are structured with identifying information in the URL. For instance, I am generalizing my code for accessing Covers data. You load the page for a sport, which links to the teams. You then parse that page, extracting each team name and covers number. From that you can download the previous-results page for each team and parse out the data you need.
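Roughly, in Python the two-step scrape might look like the sketch below. The URL patterns and the link regex are made up for illustration; the real Covers pages will differ, so treat it as an outline rather than a working scraper.

```python
import re
import requests

# Hypothetical URL patterns -- substitute the real ones for your sport.
SPORT_INDEX_URL = "https://www.covers.com/sport/hockey/nhl/teams"
TEAM_RESULTS_URL = "https://www.covers.com/teams/{team_id}/results"

def scrape_sport(index_url: str) -> dict[str, str]:
    """Load the sport's index page and pull out (team name, covers number) pairs."""
    html = requests.get(index_url, timeout=30).text
    # Assumed link format: <a href="/teams/12345">LA Kings</a> -- adjust
    # the regex to whatever the actual page markup looks like.
    return {name: team_id
            for team_id, name in re.findall(r'href="/teams/(\d+)">([^<]+)</a>', html)}

def scrape_results(team_id: str) -> str:
    """Download one team's previous-results page; parsing is left to the caller."""
    return requests.get(TEAM_RESULTS_URL.format(team_id=team_id), timeout=30).text

if __name__ == "__main__":
    for name, team_id in scrape_sport(SPORT_INDEX_URL).items():
        results_html = scrape_results(team_id)
        # ...parse out the data you need from results_html...
```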
You mentioned the problem of team names not being standard. I solved that years ago by building a file for each sport called "real team names". In this file, I standardize each team name (e.g. LA Kings in hockey). I also keep a three-character abbreviation for each team.
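The file layout here is a guess; a minimal sketch, assuming a simple pipe-delimited text file with one team per line:

```python
# Hypothetical "real team names" file: standardized name | abbreviation.
REAL_NAMES_SAMPLE = """\
LA Kings|LAK
Anaheim Ducks|ANA
San Jose Sharks|SJS
"""

def load_real_names(text: str) -> dict[str, str]:
    """Map standardized team name -> three-character abbreviation."""
    teams = {}
    for line in text.strip().splitlines():
        name, abbrev = line.split("|")
        teams[name] = abbrev
    return teams

print(load_real_names(REAL_NAMES_SAMPLE))
# {'LA Kings': 'LAK', 'Anaheim Ducks': 'ANA', 'San Jose Sharks': 'SJS'}
```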
I then have a second file with alias names. This contains the aliases for each team along with the standardized name. Every routine that picks up a team name calls a subroutine which does a table lookup for a match against a real name or an alias, then uses the standardized name. If no match is found, it adds the unidentified name to an error file.

I also have a utility function that reads the error file, listing each unknown name alongside the real names that are similar (those with the same starting letters). I can then manually indicate which team the new name belongs to and add it to my alias file. The process is a bit slow when you encounter a new web page that uses unusual names (like nicknames), but once it is done, there are rarely many new entries. (When I did soccer last year, figuring out which team was which was a real challenge.)
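A minimal sketch of that lookup-and-alias workflow, assuming the files have already been loaded into in-memory sets and dicts; the file name and helper names here are placeholders, not my actual routines:

```python
def standardize(raw_name: str, real_names: set[str], aliases: dict[str, str],
                error_path: str = "unknown_names.txt") -> str | None:
    """Return the standardized name for raw_name, or log it as unknown."""
    if raw_name in real_names:
        return raw_name
    if raw_name in aliases:
        return aliases[raw_name]          # alias -> standardized name
    with open(error_path, "a") as f:      # queue for manual resolution later
        f.write(raw_name + "\n")
    return None

def suggest_matches(unknown: str, real_names: set[str], prefix_len: int = 3) -> list[str]:
    """List real names sharing the same starting letters, to speed up manual matching."""
    prefix = unknown[:prefix_len].lower()
    return [n for n in real_names if n.lower().startswith(prefix)]
```

Once a suggestion is confirmed by hand, the pair (unknown name, standardized name) just gets appended to the alias file so the lookup succeeds next time.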
I've considered picking up a new language to simplify my life; right now, I'd probably try C# first. But I've got a mile-long backlog of analytical projects before I get any free time.
