1. #1
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    yahoo nhl box scrape

    OK this pulls stats from yesterday's box scores:

    Code:
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    import urllib2, re, time
    from datetime import date, timedelta
    
    dateYest=date.today() - timedelta(1)
    nhlDateYest=dateYest.strftime("%Y-%m-%d")
    nhlScoresWeb="http://sports.yahoo.com/nhl/scoreboard?d="+nhlDateYest
    page = urllib2.urlopen(nhlScoresWeb).read()
    soup = BeautifulSoup(page)
    
    #iterate through gameIds and Scrape box at same time
    for b in soup.findAll('a', href=re.compile('/nhl/boxscore')):
      url = b['href']
      gid= url[-10:]
      g = open(gid+".csv", "w")
      g.write(nhlDateYest+','+gid+',')
      g.write("\n")
      fullUrl = "http://sports.yahoo.com" + str(url)
      boxurl = urllib2.urlopen(fullUrl).read()
      boxsoup = BeautifulSoup(boxurl)
      
      #FindAwayTeamName
      re1='(awayTeamName)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      awayteam=m.group(2)
        
      #FindAwayScore
      re4='(awayTeamScore)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      awayscore=m.group(2)  
      g.write(awayteam+", "+awayscore+",") 
      g.write("\n") 
      
      #FindHomeTeamName (Yahoo's source swaps the homeTeamName/homeTeamScore values, so the name lives under homeTeamScore)
      re1='(homeTeamScore)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      hometeam=m.group(2)
        
      #FindHomeScore
      re4='(homeTeamName)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      homescore=m.group(2)  
      g.write(hometeam+", "+homescore+",") 
      
        
      #Scrape Team Stats
      t = boxsoup.findAll('div', id = "ysp-reg-box-team_stats")
      for table in t:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                g.write(td.find(text=True))
                g.write(",")
            g.write("\n")
    and output looks like this:

    Code:
    2011-10-11,2011101114,
    'Minnesota', '3',
    'Ottawa', '4',
    27,44,
    12,16,
    2,13,
    10,13,
    3,2,
    2,3,
    1,1,
    50%,33%,
    50%,33%,
    15,21,
    34,42,
    45%,55%,
    45%,55%,
    31,28,
    21,21,
    Can someone recommend a good method for bringing the CSV files into a database?
    Is there a clear favorite between relational and NoSQL? I was thinking I might want to learn CouchDB, but I'd most likely just end up using MySQL.

    Also, that's my first program ever, so if anyone has suggestions to make the code
    a. better
    b. easier to read
    I'd really appreciate it.

    thanks
    Last edited by laxbrah420; 10-12-11 at 12:35 AM.

  2. #2
    uva3021
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 381

    you can just use re and urllib, no need for beautiful soup
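    A minimal sketch of that approach. This runs against a canned fragment shaped like the page source quoted later in the thread rather than a live fetch, and it folds the hand-built re1+re2+re3 patterns into one expression (the helper name grab is made up):

```python
import re

# Canned fragment of the Yahoo box score source (shape taken from this thread,
# not verified against the live page)
src = """
awayTeamName : 'Minnesota',
awayTeamScore : '3',
homeTeamName : 'Ottawa',
homeTeamScore : '4',
"""

def grab(key, text):
    # one pattern instead of concatenating re1+re2+re3 by hand;
    # captures whatever sits between the single quotes after the key
    m = re.search(key + r"\s*:\s*'([^']*)'", text)
    return m.group(1) if m else None

away = grab('awayTeamName', src)        # 'Minnesota'
awayscore = grab('awayTeamScore', src)  # '3'
print(away, awayscore)
```

    Note that on the real page the home name/score values are reportedly swapped (see post #3), so the keys you search for may not mean what their names say.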

  3. #3
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Thanks man. Got it going perfectly.
    Couldn't for the life of me figure out why my hometeam and homescore were getting swapped...played around with variables, order, and syntax forever...

    yahoo source:
    awayTeamName : 'Florida',
    awayTeamScore : '2',
    homeTeamName : '4',
    homeTeamScore : 'Pittsburgh'


    as long as it stays that way I don't care, but it won't.
    Last edited by laxbrah420; 10-12-11 at 12:39 AM.

  4. #4
    Dink87522
    Join Date: 10-26-09
    Posts: 38

    Interesting.

  5. #5
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    Is that Python code?

  6. #6
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    ya can you help make it better?

  7. #7
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    Thanks for sharing the Python code.

    For Microsoft SQL Server, I use the following code to bring CSV into a table:

    Code:
    BULK INSERT football.dbo.nflgamedata
    FROM 'C:/Users/User1/Documents/ScrapeData/NFL_2011.csv'
    WITH
    (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
    )
    GO
    If there is a header row in the CSV, you can skip the header by adding this to the WITH block:
    Code:
    FIRSTROW=2
    I think MySQL uses slightly different code, but I've seen it work.

    Your Python code is fine. Without totally rewriting it (taking hours) I couldn't offer any specific guidance. My overall suggestion would be to avoid regular expressions, because they're a pain to build and they break easily. I recommend a solution using xpath. I took a look at the NHL box score HTML and you can use xpath to find ysp-reg-box-line_score and then get the child anchors and innertext. If I get time, I'll write up some xpath statements in Python to try and help.

    I've built lots of scrapers with regex as well, so I know where you're coming from, and you may be fine for years doing this. I changed my ways the past year because of the more robust parsing offered by xpath and other HTML parsers. You actually pull in a powerful HTML parser, Soup, but don't really harness its power (other than findAll). You should be able to do all your parse work without a single regex.

    GL.
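    To illustrate the xpath idea without the regexes, here is a toy stand-in using the stdlib ElementTree, which supports a small XPath subset (the markup below is invented for illustration; the real Yahoo page will differ and full XPath needs lxml):

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a box score stats table -- well-formed on purpose,
# since ElementTree is an XML (not tag-soup) parser
html = """<div id="ysp-reg-box-team_stats">
  <table>
    <tr><td>27</td><td>44</td></tr>
    <tr><td>12</td><td>16</td></tr>
  </table>
</div>"""

root = ET.fromstring(html)
# XPath-style query: every td anywhere under the root, in document order
cells = [td.text for td in root.findall('.//td')]
print(cells)  # ['27', '44', '12', '16']
```

    With lxml you would get the full XPath language (e.g. `//div[@id="ysp-reg-box-team_stats"]//td/text()`) plus tolerance for real-world broken HTML, which is what makes it preferable to regex for scraping.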

  8. #8
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    For importing CSV to SQL, you need to define your table schema. Each row in your CSV will correspond to a single record in your table. The table schema must then match the columns in the CSV.

    I see in your sample output that a single game spans multiple lines. I recommend getting each game on a single line. Before your "foreach boxscore" loop, I recommend printing a header row to the CSV file to make sure you don't forget what column is what. You can use Excel to preview your CSV format and verify the correctness before tossing it to SQL.

    In other words, get rid of your g.write("\n")
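    A sketch of what that looks like with the csv module: one header row written once, then one row per game inside the scraping loop (the column names and values here are invented to match the sample output above):

```python
import csv

# Header row: name every column once so you don't forget what is what
header = ['date', 'gid', 'away', 'away_score', 'home', 'home_score']

with open('games.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(header)  # written once, before the per-game loop
    # inside the "for each boxscore" loop, everything for one game
    # goes on a single row -- no mid-game g.write("\n") calls
    w.writerow(['2011-10-11', '2011101114', 'Minnesota', 3, 'Ottawa', 4])
```

    The csv module also handles quoting and separators for you, so fields containing commas won't silently break the column alignment.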

  9. #9
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Thanks a lot man. Yeah, I thought Soup would be better but couldn't figure out anything beyond findAll, and had asked on here for some insight -- UVA told me to just use re, so that's what I figured out.

    Also, I decided to get rid of the returns tonight but wasn't sure it was necessary. Thanks for confirming that.

    The one major ****** up part of my code, I now realize, is that for an OT game there's actually an extra row in that table for goals scored in OT.

    I'm having a tough time dealing with that.

    Here's where I'm at:

    Code:
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    import urllib2, re, time
    from datetime import date, timedelta
    
    dateYest=date.today() - timedelta(1)
    nhlDateYest=dateYest.strftime("%Y-%m-%d")
    nhlScoresWeb="http://sports.yahoo.com/nhl/scoreboard?d="+nhlDateYest
    page = urllib2.urlopen(nhlScoresWeb).read()
    soup = BeautifulSoup(page)
    
    #iterate through gameIds and Scrape box at same time
    for b in soup.findAll('a', href=re.compile('/nhl/boxscore')):
      i=0
      url = b['href']
      gid= url[-10:]
      g = open(gid+".csv", "w")
      g.write(nhlDateYest+','+gid+',')
      #g.write("\n")
      fullUrl = "http://sports.yahoo.com" + str(url)
      boxurl = urllib2.urlopen(fullUrl).read()
      boxsoup = BeautifulSoup(boxurl)
      
      #FindAwayTeamName
      re1='(awayTeamName)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      awayteam=m.group(2)
        
      #FindAwayScore
      re4='(awayTeamScore)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      awayscore=m.group(2)  
      g.write(awayteam+", "+awayscore+",") 
      #g.write("\n") 
      
      #FindHomeTeamName (Yahoo's source swaps the homeTeamName/homeTeamScore values, so the name lives under homeTeamScore)
      re1='(homeTeamScore)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      hometeam=m.group(2)
        
      #FindHomeScore
      re4='(homeTeamName)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      homescore=m.group(2)  
      g.write(hometeam+", "+homescore+",") 
      
        
      #Scrape Team Stats
      t = boxsoup.findAll('div', id = "ysp-reg-box-team_stats")
      for table in t:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                g.write(td.find(text=True))
                g.write(",")
            #g.write("\n")
            i=i+1
      if i==16:
        #check if line=="0,0"
        g.close()
        g = open(gid+".csv", "r")
        line=g.readline()
        otS=line[85:88]
        if otS == '0,0':
          g.close()
          g=open(gid+".csv","a")
          g.write("SOW")
        else:
          g.close()
          g=open(gid+".csv","a")
          g.write("OTW")
      if i==15:
        g.write("REG")
    If you look at the last part of my code, I messily determine if the game went to OT or shootout by iterating through it. The worst part of the code is this:
    line=g.readline()
    otS=line[85:88]
    Simply because I couldn't figure out how to truly read the file.
    In order for this to work though, I actually need to delete those records so everything lines up --(or incorporate some extra bit of logic into the import script which I'd like to avoid).

    The only documentation I could find on deleting stuff was the translate method which doesn't seem to work in my case. Can anybody help me to delete the OT score entry?

    Thanks a lot
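    One way around the readline/slicing trouble is to not write as you scrape: collect the stat rows in a list first, decide OT/shootout from the list, drop the extra row, and only then write the file once. A sketch (the function name is made up; the 15-vs-16 row counts are the ones from the post, and the position of the OT row plus the '0,0' shootout test mirror the post's logic, which post #12 notes is actually keying off OT shots, so treat that check as a placeholder):

```python
def finish_game(rows):
    """rows: list of [away, home] stat pairs scraped from the table.
    16 rows means an extra OT line; 15 means regulation (counts per the thread)."""
    result = 'REG'
    if len(rows) == 16:
        # assumed position of the OT line -- adjust to where it sits in the real table
        ot_row = rows.pop(0)
        result = 'SOW' if ot_row == ['0', '0'] else 'OTW'
    # rows now always has 15 entries, so every CSV lines up
    return rows, result

rows, res = finish_game([['27', '44']] * 15)  # regulation game: nothing removed
print(res)  # REG
```

    Since the extra row never reaches the file, there is nothing to delete afterwards, and the import script needs no special-case logic.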

  10. #10
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    I'm a Delphi guy, though I'm seriously considering taking up Python and wxPython.

    Yeah, my scraping has problems with OTs, no matter the sport. The REs get a lot more hairy. Thinking about it, it seems like a good idea to test a box first to see if it has OT then apply the appropriate RE. But then you have two different REs to maintain going forward.

  11. #11
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Anybody have a good idea on how to deal with OT?
    My best idea is to delete the 3 characters that represent the score, but I'm not sure of the best way to do that.

  12. #12
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    For the record, I'm a total dope. My OTW vs SOW "logic" was actually testing shots on goal in the OT period (and declaring a SOW only if there were no shots).

  13. #13
    mathdotcom
    Join Date: 03-24-08
    Posts: 11,689
    Betpoints: 1943

    ---
    Last edited by mathdotcom; 10-25-11 at 07:25 PM.

  14. #14
    rsigley
    Join Date: 02-23-08
    Posts: 304
    Betpoints: 186

    count the number of children for the div

    if it's a certain size, parse OT, SO, or FG. If it's like ESPN, where they put in blank td's so the size is the same regardless of game length, just look at the value where you think OT should be.

    Then the way I do it is:

    if SO, OT = 0 for both, 1,2,3 normal, get SO score
    if OT, SO = "X" for both, 1,2,3 normal, get OT score
    if neither, OT & SO = "X", get score normally

    And why export to CSV then import into a database? Why not just write directly to the DB?
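    The fixed-column scheme described above can be sketched like this: pad every game out to the same columns, filling OT/SO slots with "X" (or 0) so the CSV schema never changes width (function name invented):

```python
def period_columns(periods):
    """periods: goals per period for one team, e.g. [1, 2, 0] (regulation),
    [1, 2, 0, 1] (OT), or [1, 2, 0, 0, 1] (OT + SO).
    Returns fixed-width [p1, p2, p3, ot, so] per the scheme in this post."""
    cols = [str(p) for p in periods[:3]]
    cols.append(str(periods[3]) if len(periods) > 3 else 'X')  # OT column
    cols.append(str(periods[4]) if len(periods) > 4 else 'X')  # SO column
    return cols

print(period_columns([1, 2, 0]))        # ['1', '2', '0', 'X', 'X']
print(period_columns([1, 2, 0, 1]))     # ['1', '2', '0', '1', 'X']
print(period_columns([1, 2, 0, 0, 1]))  # ['1', '2', '0', '0', '1']
```

    A shootout game carries OT = 0 and a SO score, an OT game carries an OT score and SO = "X", and a regulation game gets "X" for both, which matches the three cases listed above.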

  15. #15
    subs
    Join Date: 04-30-10
    Posts: 1,412
    Betpoints: 969

    Free sigley

  16. #16
    donkson
    Join Date: 03-12-11
    Posts: 411
    Betpoints: 1797

    haha

  17. #17
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Quote Originally Posted by rsigley View Post
    count the number of children for the div

    if it's a certain size, parse OT, SO, or FG. If it's like ESPN, where they put in blank td's so the size is the same regardless of game length, just look at the value where you think OT should be.

    Then the way I do it is:

    if SO, OT = 0 for both, 1,2,3 normal, get SO score
    if OT, SO = "X" for both, 1,2,3 normal, get OT score
    if neither, OT & SO = "X", get score normally

    And why export to CSV then import into a database? Why not just write directly to the DB?
    OK thanks, I'll work on that.
    And because I don't know how to write to the DB yet, I figured I'd just start by collecting data haha.
    For the time being I can pretty quickly move the files into Excel.

  18. #18
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    You don't know how to write to the database? What does that mean?

  19. #19
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Quote Originally Posted by Maverick22 View Post
    You don't know how to write to the database? What does that mean?
    It means I figured out the write-to-file function but not the write-to-MySQL stuff.

  20. #20
    babar1000
    Join Date: 08-08-11
    Posts: 174
    Betpoints: 12

    For MySQL you can use XAMPP and HeidiSQL to browse the data.
    You must create a table with the classic CREATE TABLE command.
    Then you load the file with the LOAD DATA command (http://dev.mysql.com/doc/refman/5.1/en/load-data.html).

    Quote Originally Posted by laxbrah420 View Post
    It means I figured out the write to file function but not the write to mysql stuff

  21. #21
    rsigley
    Join Date: 02-23-08
    Posts: 304
    Betpoints: 186

    just use the SQL command

    INSERT INTO

    along with some Python function that deals with MySQL
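    A sketch of that pattern using the stdlib sqlite3 module so it runs anywhere; a MySQL driver (MySQLdb/pymysql) looks nearly identical, just with a different connect call and %s placeholders instead of ?. The table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # swap for MySQLdb.connect(...) against MySQL
conn.execute("""CREATE TABLE games
                (gamedate TEXT, gid TEXT, away TEXT, away_score INTEGER,
                 home TEXT, home_score INTEGER)""")

# Parameterized INSERT INTO -- the driver handles quoting and escaping
game = ('2011-10-11', '2011101114', 'Minnesota', 3, 'Ottawa', 4)
conn.execute("INSERT INTO games VALUES (?, ?, ?, ?, ?, ?)", game)
conn.commit()

print(conn.execute("SELECT home, home_score FROM games").fetchone())  # ('Ottawa', 4)
```

    This is the same loop structure as the CSV writer: one execute per scraped game, then commit, with no intermediate file to import.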

  22. #22
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    You might want to invest some time in learning about 'data access objects'

    If you can write to a file, then most of the work is done for you to write to a database.

    Do you have your database created/modelled already?

  23. #23
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    You talking 'bout DAO specifically?

  24. #24
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    I do not understand your question?

  25. #25
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    Then the answer is no. DAO is a data access protocol from Microsoft. Was just a little surprised to see a reference to it in a Python thread and wanted to see if that was what you were referring to.

    Quote Originally Posted by Maverick22 View Post
    I do not understand your question?

  26. #26
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    My understanding of Data Access Objects is that it's a design pattern, not a vendor-specific concept.

    In my own programs there exists some Object/Datatype. The Data Access Objects are 'aware' of this object. They exist only to read/write this object to/from some medium. A database, a file, an input stream, anything else.

    So if this guy can already manipulate a file with his data, then 'refactoring' it to use Data Access Objects and to write/read a database is not a big leap.
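    A minimal Python rendition of that pattern (the class and field names are made up for illustration): the Game object knows nothing about storage, and only the DAO does I/O.

```python
import csv
import io

class Game(object):
    """Plain datatype -- no knowledge of files or databases."""
    def __init__(self, gid, home, home_score):
        self.gid, self.home, self.home_score = gid, home, int(home_score)

class CsvGameDao(object):
    """Knows how to persist Game objects to a CSV stream; nothing else does I/O."""
    def __init__(self, stream):
        self.stream = stream
    def save(self, game):
        csv.writer(self.stream).writerow([game.gid, game.home, game.home_score])
    def load_all(self):
        self.stream.seek(0)
        return [Game(*row) for row in csv.reader(self.stream)]

buf = io.StringIO()
dao = CsvGameDao(buf)
dao.save(Game('2011101114', 'Ottawa', 4))
print(dao.load_all()[0].home)  # Ottawa
```

    A hypothetical SqliteGameDao exposing the same save/load_all interface could then drop in without touching the scraping code at all, which is the point of the pattern.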

  27. #27
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    It's that too. As a Windows guy I took notice.

    Quote Originally Posted by Maverick22 View Post
    My understanding of Data Access Objects is that it's a design pattern, not a vendor-specific concept. In my own programs there exists some Object/Datatype. The Data Access Objects are 'aware' of this object. They exist only to read/write this object to/from some medium. A database, a file, an input stream, anything else. So if this guy can already manipulate a file with his data, then 'refactoring' it to use Data Access Objects and to write/read a database is not a big leap.
