1. #1
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    yahoo nhl box scrape

    OK this pulls stats from yesterday's box scores:

    Code:
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    import urllib2, re, time
    from datetime import date, timedelta
    
    dateYest=date.today() - timedelta(1)
    nhlDateYest=dateYest.strftime("%Y-%m-%d")
    nhlScoresWeb="http://sports.yahoo.com/nhl/scoreboard?d="+nhlDateYest
    page = urllib2.urlopen(nhlScoresWeb).read()
    soup = BeautifulSoup(page)
    
    #iterate through gameIds and Scrape box at same time
    for b in soup.findAll('a', href=re.compile('/nhl/boxscore')):
      url = b['href']
      gid= url[-10:]
      g = open(gid+".csv", "w")
      g.write(nhlDateYest+','+gid+',')
      g.write("\n")
      fullUrl = "http://sports.yahoo.com" + str(url)
      boxurl = urllib2.urlopen(fullUrl).read()
      boxsoup = BeautifulSoup(boxurl)
      
      #FindAwayTeamName
      re1='(awayTeamName)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      awayteam=m.group(2)
        
      #FindAwayScore
      re4='(awayTeamScore)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      awayscore=m.group(2)  
      g.write(awayteam+", "+awayscore+",") 
      g.write("\n") 
      
      #FindHomeTeamName (Yahoo's source swaps the homeTeamName/homeTeamScore values, so the name lives under homeTeamScore)
      re1='(homeTeamScore)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      hometeam=m.group(2)
        
      #FindHomeScore
      re4='(homeTeamName)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      homescore=m.group(2)  
      g.write(hometeam+", "+homescore+",") 
      
        
      #Scrape Team Stats
      t = boxsoup.findAll('div', id = "ysp-reg-box-team_stats")
      for table in t:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                g.write(td.find(text=True))
                g.write(",")
            g.write("\n")
    and output looks like this:

    Code:
    2011-10-11,2011101114,
    'Minnesota', '3',
    'Ottawa', '4',
    27,44,
    12,16,
    2,13,
    10,13,
    3,2,
    2,3,
    1,1,
    50%,33%,
    50%,33%,
    15,21,
    34,42,
    45%,55%,
    45%,55%,
    31,28,
    21,21,
    Can someone recommend a good method for bringing the CSV files into a database?
    Is there a clear favorite between relational and NoSQL? I was thinking I might want to learn CouchDB, but I'd most likely just end up using MySQL.

    Also, that's my first program ever, so if anyone has suggestions to make the code
    a. better
    b. easier to read
    I'd really appreciate it.

    thanks
    Last edited by laxbrah420; 10-12-11 at 12:35 AM.

  2. #2
    uva3021
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 381

    you can just use re and urllib, no need for beautiful soup
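    A minimal sketch of that approach. This runs against a canned fragment shaped like the page source quoted later in the thread rather than a live fetch, and it folds the hand-built re1+re2+re3 patterns into one expression (the helper name grab is made up):

```python
import re

# Canned fragment of the Yahoo box score source (shape taken from this thread,
# not verified against the live page)
src = """
awayTeamName : 'Minnesota',
awayTeamScore : '3',
homeTeamName : 'Ottawa',
homeTeamScore : '4',
"""

def grab(key, text):
    # one pattern instead of concatenating re1+re2+re3 by hand;
    # captures whatever sits between the single quotes after the key
    m = re.search(key + r"\s*:\s*'([^']*)'", text)
    return m.group(1) if m else None

away = grab('awayTeamName', src)        # 'Minnesota'
awayscore = grab('awayTeamScore', src)  # '3'
print(away, awayscore)
```

    Note that on the real page the home name/score values are reportedly swapped (see post #3), so the keys you search for may not mean what their names say.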

  3. #3
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Thanks man. Got it going perfectly.
    Couldn't for the life of me figure out why my hometeam and homescore were getting swapped...played around with variables, order, and syntax forever...

    yahoo source:
    awayTeamName : 'Florida',
    awayTeamScore : '2',
    homeTeamName : '4',
    homeTeamScore : 'Pittsburgh'


    as long as it stays that way I don't care, but it won't.
    Last edited by laxbrah420; 10-12-11 at 12:39 AM.

  4. #4
    Dink87522
    Join Date: 10-26-09
    Posts: 38

    Interesting.

  5. #5
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    Is that Python code?

  6. #6
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    ya can you help make it better?

  7. #7
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    Thanks for sharing the Python code.

    For Microsoft SQL Server, I use the following code to bring CSV into a table:

    Code:
    BULK INSERT football.dbo.nflgamedata
    FROM 'C:/Users/User1/Documents/ScrapeData/NFL_2011.csv'
    WITH
    (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
    )
    GO
    If there is a header row in the CSV, you can skip the header by adding this to the WITH block:
    Code:
    FIRSTROW=2
    I think MySQL uses slightly different code, but I've seen it work.

    Your Python code is fine. Without totally rewriting it (taking hours) I couldn't offer any specific guidance. My overall suggestion would be to avoid regular expressions, because they're a pain to build and they break easily. I recommend a solution using xpath. I took a look at the NHL box score HTML and you can use xpath to find ysp-reg-box-line_score and then get the child anchors and innertext. If I get time, I'll write up some xpath statements in Python to try and help.

    I've built lots of scrapers with regex as well, so I know where you're coming from, and you may be fine for years doing this. I changed my ways the past year because of the more robust parsing offered by xpath and other HTML parsers. You actually pull in a powerful HTML parser, Soup, but don't really harness its power (other than findAll). You should be able to do all your parse work without a single regex.

    GL.
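    To illustrate the xpath idea without the regexes, here is a toy stand-in using the stdlib ElementTree, which supports a small XPath subset (the markup below is invented for illustration; the real Yahoo page will differ and full XPath needs lxml):

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a box score stats table -- well-formed on purpose,
# since ElementTree is an XML (not tag-soup) parser
html = """<div id="ysp-reg-box-team_stats">
  <table>
    <tr><td>27</td><td>44</td></tr>
    <tr><td>12</td><td>16</td></tr>
  </table>
</div>"""

root = ET.fromstring(html)
# XPath-style query: every td anywhere under the root, in document order
cells = [td.text for td in root.findall('.//td')]
print(cells)  # ['27', '44', '12', '16']
```

    With lxml you would get the full XPath language (e.g. `//div[@id="ysp-reg-box-team_stats"]//td/text()`) plus tolerance for real-world broken HTML, which is what makes it preferable to regex for scraping.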

  8. #8
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    For importing CSV to SQL, you need to define your table schema. Each row in your CSV will correspond to a single record in your table. The table schema must then match the columns in the CSV.

    I see in your sample output that a single game spans multiple lines. I recommend getting each game on a single line. Before your "foreach boxscore" loop, I recommend printing a header row to the CSV file to make sure you don't forget what column is what. You can use Excel to preview your CSV format and verify the correctness before tossing it to SQL.

    In other words, get rid of your g.write("\n")
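    A sketch of what that looks like with the csv module: one header row written once, then one row per game inside the scraping loop (the column names and values here are invented to match the sample output above):

```python
import csv

# Header row: name every column once so you don't forget what is what
header = ['date', 'gid', 'away', 'away_score', 'home', 'home_score']

with open('games.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(header)  # written once, before the per-game loop
    # inside the "for each boxscore" loop, everything for one game
    # goes on a single row -- no mid-game g.write("\n") calls
    w.writerow(['2011-10-11', '2011101114', 'Minnesota', 3, 'Ottawa', 4])
```

    The csv module also handles quoting and separators for you, so fields containing commas won't silently break the column alignment.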

  9. #9
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Thanks a lot man. Yeah, I thought Soup would be better but couldn't figure out anything beyond findAll, and had asked on here for some insight -- UVA told me to just use re, so that's what I figured out.

    Also, I decided to get rid of the returns tonight but wasn't sure it was necessary. Thanks for confirming that.

    The one major ****** up part of my code, I now realize, is that for an OT game there's actually an extra row in that table for goals scored in OT.

    I'm having a tough time dealing with that.

    Here's where I'm at:

    Code:
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    import urllib2, re, time
    from datetime import date, timedelta
    
    dateYest=date.today() - timedelta(1)
    nhlDateYest=dateYest.strftime("%Y-%m-%d")
    nhlScoresWeb="http://sports.yahoo.com/nhl/scoreboard?d="+nhlDateYest
    page = urllib2.urlopen(nhlScoresWeb).read()
    soup = BeautifulSoup(page)
    
    #iterate through gameIds and Scrape box at same time
    for b in soup.findAll('a', href=re.compile('/nhl/boxscore')):
      i=0
      url = b['href']
      gid= url[-10:]
      g = open(gid+".csv", "w")
      g.write(nhlDateYest+','+gid+',')
      #g.write("\n")
      fullUrl = "http://sports.yahoo.com" + str(url)
      boxurl = urllib2.urlopen(fullUrl).read()
      boxsoup = BeautifulSoup(boxurl)
      
      #FindAwayTeamName
      re1='(awayTeamName)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      awayteam=m.group(2)
        
      #FindAwayScore
      re4='(awayTeamScore)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      awayscore=m.group(2)  
      g.write(awayteam+", "+awayscore+",") 
      #g.write("\n") 
      
      #FindHomeTeamName (Yahoo's source swaps the homeTeamName/homeTeamScore values, so the name lives under homeTeamScore)
      re1='(homeTeamScore)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      hometeam=m.group(2)
        
      #FindHomeScore
      re4='(homeTeamName)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      homescore=m.group(2)  
      g.write(hometeam+", "+homescore+",") 
      
        
      #Scrape Team Stats
      t = boxsoup.findAll('div', id = "ysp-reg-box-team_stats")
      for table in t:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                g.write(td.find(text=True))
                g.write(",")
            #g.write("\n")
            i=i+1
      if i==16:
        #check if line=="0,0"
        g.close()
        g = open(gid+".csv", "r")
        line=g.readline()
        otS=line[85:88]
        if otS == '0,0':
          g.close()
          g=open(gid+".csv","a")
          g.write("SOW")
        else:
          g.close()
          g=open(gid+".csv","a")
          g.write("OTW")
      if i==15:
        g.write("REG")
    If you look at the last part of my code, I messily determine if the game went to OT or shootout by iterating through it. The worst part of the code is this:
    line=g.readline()
    otS=line[85:88]
    Simply because I couldn't figure out how to truly read the file.
    In order for this to work though, I actually need to delete those records so everything lines up --(or incorporate some extra bit of logic into the import script which I'd like to avoid).

    The only documentation I could find on deleting stuff was the translate method which doesn't seem to work in my case. Can anybody help me to delete the OT score entry?

    Thanks a lot
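    One way around the readline/slicing trouble is to not write as you scrape: collect the stat rows in a list first, decide OT/shootout from the list, drop the extra row, and only then write the file once. A sketch (the function name is made up; the 15-vs-16 row counts are the ones from the post, and the position of the OT row plus the '0,0' shootout test mirror the post's logic, which post #12 notes is actually keying off OT shots, so treat that check as a placeholder):

```python
def finish_game(rows):
    """rows: list of [away, home] stat pairs scraped from the table.
    16 rows means an extra OT line; 15 means regulation (counts per the thread)."""
    result = 'REG'
    if len(rows) == 16:
        # assumed position of the OT line -- adjust to where it sits in the real table
        ot_row = rows.pop(0)
        result = 'SOW' if ot_row == ['0', '0'] else 'OTW'
    # rows now always has 15 entries, so every CSV lines up
    return rows, result

rows, res = finish_game([['27', '44']] * 15)  # regulation game: nothing removed
print(res)  # REG
```

    Since the extra row never reaches the file, there is nothing to delete afterwards, and the import script needs no special-case logic.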

  10. #10
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    I'm a Delphi guy, though I'm seriously considering taking up Python and wxPython.

    Yeah, my scraping has problems with OTs, no matter the sport. The REs get a lot more hairy. Thinking about it, it seems like a good idea to test a box first to see if it has OT then apply the appropriate RE. But then you have two different REs to maintain going forward.

  11. #11
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Anybody have a good idea on how to deal with OT?
    My best idea is to delete the 3 characters that represent the score, but I'm not sure of the best way to do that.

  12. #12
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    For the record, I'm a total dope. My OTW vs SOW "logic" was actually testing shots on goal in the OT period (and declaring a SOW only if there were no shots).

  13. #13
    mathdotcom
    Join Date: 03-24-08
    Posts: 11,689
    Betpoints: 1943

    ---
    Last edited by mathdotcom; 10-25-11 at 07:25 PM.

  14. #14
    rsigley
    Join Date: 02-23-08
    Posts: 304
    Betpoints: 186

    count the number of children for the div

    if it's a certain size, parse OT, SO, or FG. If it's like ESPN, where they put in blank td's so the size is the same regardless of game length, just look at the value where you think OT should be.

    Then the way I do it is:

    if SO, OT = 0 for both, 1,2,3 normal, get SO score
    if OT, SO = "X" for both, 1,2,3 normal, get OT score
    if neither, OT & SO = "X", get score normally

    And why export to CSV then import into a database? Why not just write directly to the DB?
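    The fixed-column scheme described above can be sketched like this: pad every game out to the same columns, filling OT/SO slots with "X" (or 0) so the CSV schema never changes width (function name invented):

```python
def period_columns(periods):
    """periods: goals per period for one team, e.g. [1, 2, 0] (regulation),
    [1, 2, 0, 1] (OT), or [1, 2, 0, 0, 1] (OT + SO).
    Returns fixed-width [p1, p2, p3, ot, so] per the scheme in this post."""
    cols = [str(p) for p in periods[:3]]
    cols.append(str(periods[3]) if len(periods) > 3 else 'X')  # OT column
    cols.append(str(periods[4]) if len(periods) > 4 else 'X')  # SO column
    return cols

print(period_columns([1, 2, 0]))        # ['1', '2', '0', 'X', 'X']
print(period_columns([1, 2, 0, 1]))     # ['1', '2', '0', '1', 'X']
print(period_columns([1, 2, 0, 0, 1]))  # ['1', '2', '0', '0', '1']
```

    A shootout game carries OT = 0 and a SO score, an OT game carries an OT score and SO = "X", and a regulation game gets "X" for both, which matches the three cases listed above.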

  15. #15
    subs
    Join Date: 04-30-10
    Posts: 1,412
    Betpoints: 969

    Free sigley

  16. #16
    donkson
    Join Date: 03-12-11
    Posts: 411
    Betpoints: 1797

    haha

  17. #17
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Quote Originally Posted by rsigley View Post
    count the number of children for the div

    if it's a certain size, parse OT, SO, or FG. If it's like ESPN, where they put in blank td's so the size is the same regardless of game length, just look at the value where you think OT should be.

    Then the way I do it is:

    if SO, OT = 0 for both, 1,2,3 normal, get SO score
    if OT, SO = "X" for both, 1,2,3 normal, get OT score
    if neither, OT & SO = "X", get score normally

    And why export to CSV then import into a database? Why not just write directly to the DB?
    OK thanks, I'll work on that.
    And because I don't know how to write to the DB yet, I figured I'd just start by collecting data haha.
    For the time being I can pretty quickly move the files into Excel.

  18. #18
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    You don't know how to write to the database? What does that mean?

  19. #19
    laxbrah420
    Join Date: 10-29-10
    Posts: 210
    Betpoints: 505

    Quote Originally Posted by Maverick22 View Post
    You don't know how to write to the database? What does that mean?
    It means I figured out the write-to-file function but not the write-to-MySQL stuff.

  20. #20
    babar1000
    Join Date: 08-08-11
    Posts: 174
    Betpoints: 12

    For MySQL you can use XAMPP and HeidiSQL to browse the data.
    You must create a table with the classic CREATE TABLE command.
    Then you load the file with the LOAD DATA command (http://dev.mysql.com/doc/refman/5.1/en/load-data.html).

    Quote Originally Posted by laxbrah420 View Post
    It means I figured out the write to file function but not the write to mysql stuff

  21. #21
    rsigley
    Join Date: 02-23-08
    Posts: 304
    Betpoints: 186

    just use the SQL command

    INSERT INTO

    along with some Python function that deals with MySQL
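    A sketch of that pattern using the stdlib sqlite3 module so it runs anywhere; a MySQL driver (MySQLdb/pymysql) looks nearly identical, just with a different connect call and %s placeholders instead of ?. The table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # swap for MySQLdb.connect(...) against MySQL
conn.execute("""CREATE TABLE games
                (gamedate TEXT, gid TEXT, away TEXT, away_score INTEGER,
                 home TEXT, home_score INTEGER)""")

# Parameterized INSERT INTO -- the driver handles quoting and escaping
game = ('2011-10-11', '2011101114', 'Minnesota', 3, 'Ottawa', 4)
conn.execute("INSERT INTO games VALUES (?, ?, ?, ?, ?, ?)", game)
conn.commit()

print(conn.execute("SELECT home, home_score FROM games").fetchone())  # ('Ottawa', 4)
```

    This is the same loop structure as the CSV writer: one execute per scraped game, then commit, with no intermediate file to import.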

  22. #22
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    You might want to invest some time in learning about 'data access objects'

    If you can write to a file, then most of the work is done for you to write to a database.

    Do you have your database created/modelled already?

  23. #23
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    You talking 'bout DAO specifically?

  24. #24
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    I do not understand your question?

  25. #25
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    Then the answer is no. DAO is a data access protocol from Microsoft. Was just a little surprised to see a reference to it in a Python thread and wanted to see if that was what you were referring to.

    Quote Originally Posted by Maverick22 View Post
    I do not understand your question?

  26. #26
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    My understanding of Data Access Objects is that it's a design pattern, not a vendor-specific concept.

    In my own programs there exists some Object/Datatype. The Data Access Objects are 'aware' of this object. They exist only to read/write this object to/from some medium. A database, a file, an input stream, anything else.

    So if this guy can already manipulate a file with his data, then 'refactoring' it to use Data Access Objects and to write/read a database is not a big leap.
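    A minimal Python rendition of that pattern (the class and field names are made up for illustration): the Game object knows nothing about storage, and only the DAO does I/O.

```python
import csv
import io

class Game(object):
    """Plain datatype -- no knowledge of files or databases."""
    def __init__(self, gid, home, home_score):
        self.gid, self.home, self.home_score = gid, home, int(home_score)

class CsvGameDao(object):
    """Knows how to persist Game objects to a CSV stream; nothing else does I/O."""
    def __init__(self, stream):
        self.stream = stream
    def save(self, game):
        csv.writer(self.stream).writerow([game.gid, game.home, game.home_score])
    def load_all(self):
        self.stream.seek(0)
        return [Game(*row) for row in csv.reader(self.stream)]

buf = io.StringIO()
dao = CsvGameDao(buf)
dao.save(Game('2011101114', 'Ottawa', 4))
print(dao.load_all()[0].home)  # Ottawa
```

    A hypothetical SqliteGameDao exposing the same save/load_all interface could then drop in without touching the scraping code at all, which is the point of the pattern.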

  27. #27
    TheEditor
    Join Date: 09-29-11
    Posts: 95
    Betpoints: 696

    It's that too. As a Windows guy I took notice.

    Quote Originally Posted by Maverick22 View Post
    My understanding of Data Access Objects is that it's a design pattern, not a vendor-specific concept. In my own programs there exists some Object/Datatype. The Data Access Objects are 'aware' of this object. They exist only to read/write this object to/from some medium. A database, a file, an input stream, anything else. So if this guy can already manipulate a file with his data, then 'refactoring' it to use Data Access Objects and to write/read a database is not a big leap.
