yahoo nhl box scrape

  • laxbrah420
    SBR High Roller
    • 10-29-10
    • 210

    #1
    yahoo nhl box scrape
    OK this pulls stats from yesterday's box scores:

    Code:
    from BeautifulSoup import BeautifulSoup, SoupStrainer
    import urllib2, re, time
    from datetime import date, timedelta
    
    dateYest=date.today() - timedelta(1)
    nhlDateYest=dateYest.strftime("%Y-%m-%d")
    nhlScoresWeb="http://sports.yahoo.com/nhl/scoreboard?d="+nhlDateYest
    page = urllib2.urlopen(nhlScoresWeb).read()
    soup = BeautifulSoup(page)
    
    #iterate through gameIds and Scrape box at same time
    for b in soup.findAll('a', href=re.compile('/nhl/boxscore')):
      url = b['href']
      gid= url[-10:]
      g = open(gid+".csv", "w")
      g.write(nhlDateYest+','+gid+',')
      g.write("\n")
      fullUrl = "http://sports.yahoo.com" + str(url)
      boxurl = urllib2.urlopen(fullUrl).read()
      boxsoup = BeautifulSoup(boxurl)
      
      #FindAwayTeamName
      re1='(awayTeamName)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      awayteam=m.group(2)
        
      #FindAwayScore
      re4='(awayTeamScore)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      awayscore=m.group(2)  
      g.write(awayteam+", "+awayscore+",") 
      g.write("\n") 
      
      #FindHomeTeamName (note: Yahoo's source swaps homeTeamName/homeTeamScore, so the patterns below are intentionally crossed)
      re1='(homeTeamScore)'
      re2='.*?'
      re3='(\\\'.*?\\\')'
      rg = re.compile(re1+re2+re3)
      m = rg.search(boxurl)
      hometeam=m.group(2)
        
      #FindHomeScore
      re4='(homeTeamName)'
      re5='.*?'
      re6='(\\\'.*?\\\')'
      rg = re.compile(re4+re5+re6)
      m = rg.search(boxurl)
      homescore=m.group(2)  
      g.write(hometeam+", "+homescore+",") 
      
        
      #Scrape Team Stats
      t = boxsoup.findAll('div', id = "ysp-reg-box-team_stats")
      for table in t:
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                g.write(td.find(text=True))
                g.write(",")
            g.write("\n")
      g.close()  #close the csv before moving to the next game
    and output looks like this:

    Code:
    2011-10-11,2011101114,
    'Minnesota', '3',
    'Ottawa', '4',
    27,44,
    12,16,
    2,13,
    10,13,
    3,2,
    2,3,
    1,1,
    50%,33%,
    50%,33%,
    15,21,
    34,42,
    45%,55%,
    45%,55%,
    31,28,
    21,21,
    Can someone recommend a good method for bringing the CSV files into a database?
    Is there a clear favorite between relational and NoSQL? I was thinking I might want to learn CouchDB, but I'd most likely just end up using MySQL.

    Also, that's my first program ever, so if anyone has suggestions to make the code
    a. better
    b. easier to read
    I'd really appreciate it

    thanks
    Last edited by laxbrah420; 10-12-11, 12:35 AM.
  • uva3021
    SBR Wise Guy
    • 03-01-07
    • 537

    #2
    you can just use re and urllib, no need for BeautifulSoup
    Comment
    • laxbrah420
      SBR High Roller
      • 10-29-10
      • 210

      #3
      Thanks man. Got it going perfectly.
      Couldn't for the life of me figure out why my hometeam and homescore were getting swapped... played around with variables, order, and syntax forever...

      yahoo source:
      awayTeamName : 'Florida',
      awayTeamScore : '2',
      homeTeamName : '4',
      homeTeamScore : 'Pittsburgh'


      as long as it stays that way I don't care, but it won't.
      Last edited by laxbrah420; 10-12-11, 12:39 AM.
      Comment
      • Dink87522
        SBR Rookie
        • 10-26-09
        • 38

        #4
        Interesting.
        Comment
        • TheEditor
          SBR Hustler
          • 09-29-11
          • 95

          #5
          Is that Python code?
          Comment
          • laxbrah420
            SBR High Roller
            • 10-29-10
            • 210

            #6
            ya can you help make it better?
            Comment
            • Flight
              Restricted User
              • 01-28-09
              • 1979

              #7
              Thanks for sharing the Python code.

              For Microsoft SQL Server, I use the following code to bring CSV into a table:

              Code:
              BULK INSERT football.dbo.nflgamedata
              FROM 'C:/Users/User1/Documents/ScrapeData/NFL_2011.csv'
              WITH
              (
              FIELDTERMINATOR = ',',
              ROWTERMINATOR = '\n'
              )
              GO
              If there is a header row in the CSV, you can skip the header by adding this to the WITH block:
              Code:
              FIRSTROW=2
              I think MySQL uses slightly different code, but I've seen it work.

               Your Python code is fine. Without totally rewriting it (which would take hours) I couldn't offer any specific guidance. My overall suggestion would be to avoid regular expressions, because they're a pain to build and they break easily. I recommend a solution using xpath. I took a look at the NHL box score HTML and you can use xpath to find ysp-reg-box-line_score and then get the child anchors and inner text. If I get time, I'll write up some xpath statements in Python to try and help.

               I've built lots of scrapers with regex as well, so I know where you're coming from, and you may be fine for years doing this. I changed my ways in the past year because of the more robust parsing offered by xpath and other HTML parsers. You actually pull in a powerful HTML parser, Soup, but don't really harness its power (other than findAll). You should be able to do all your parse work without a single regex.
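               The xpath idea above can be sketched like this. The markup fragment is made up to mimic the box-score page's line-score div; a real (messy) page would want a tolerant parser such as lxml.html, but the query style is the same:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking Yahoo's line-score markup (not the real page).
fragment = """<html><body>
<div id="ysp-reg-box-line_score">
  <table>
    <tr><td><a href="/nhl/teams/min">Minnesota</a></td><td>3</td></tr>
    <tr><td><a href="/nhl/teams/ott">Ottawa</a></td><td>4</td></tr>
  </table>
</div>
</body></html>"""

root = ET.fromstring(fragment)
# XPath-style lookup: find the line-score div, then its child anchors
box = root.find(".//div[@id='ysp-reg-box-line_score']")
teams = [a.text for a in box.findall('.//a')]
scores = [td.text for tr in box.findall('.//tr') for td in tr.findall('td')[1:]]
```

               No regex anywhere: the team names and scores come straight out of the element tree.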

              GL.
              Comment
              • Flight
                Restricted User
                • 01-28-09
                • 1979

                #8
                 For importing CSV to SQL, you need to define your table schema. Each row in your CSV will correspond to a single record in your table. The table schema must then match the columns in the CSV.

                I see in your sample output that a single game spans multiple lines. I recommend getting each game on a single line. Before your "foreach boxscore" loop, I recommend printing a header row to the CSV file to make sure you don't forget what column is what. You can use Excel to preview your CSV format and verify the correctness before tossing it to SQL.

                In other words, get rid of your g.write("\n")
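                 A minimal sketch of that layout using Python's csv module: one header row, then one game per row. The column names here are assumptions; match them to whatever the scraper actually collects:

```python
import csv

# One header row, then one game per line (column names are hypothetical).
with open("nhl_games.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["date", "gid", "awayteam", "awayscore", "hometeam", "homescore"])
    w.writerow(["2011-10-11", "2011101114", "Minnesota", "3", "Ottawa", "4"])
```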
                Comment
                • laxbrah420
                  SBR High Roller
                  • 10-29-10
                  • 210

                  #9
                   thanks a lot man. Yeah, I thought Soup would be better but couldn't figure out anything other than findAll, and had asked on here for some insight -- UVA told me to just use re, so that's what I figured out.

                  Also, I decided to get rid of the returns tonight but wasn't sure it was necessary. thanks for confirming that.

                   The one major ****** up part of my code, I now realize, is that for an OT game there's actually an extra row in that table for goals scored in OT.

                  I'm having a tough time dealing with that.

                  Here's where I'm at:

                  Code:
                  from BeautifulSoup import BeautifulSoup, SoupStrainer
                  import urllib2, re, time
                  from datetime import date, timedelta
                  
                  dateYest=date.today() - timedelta(1)
                  nhlDateYest=dateYest.strftime("%Y-%m-%d")
                  nhlScoresWeb="http://sports.yahoo.com/nhl/scoreboard?d="+nhlDateYest
                  page = urllib2.urlopen(nhlScoresWeb).read()
                  soup = BeautifulSoup(page)
                  
                  #iterate through gameIds and Scrape box at same time
                  for b in soup.findAll('a', href=re.compile('/nhl/boxscore')):
                    i=0
                    url = b['href']
                    gid= url[-10:]
                    g = open(gid+".csv", "w")
                    g.write(nhlDateYest+','+gid+',')
                    #g.write("\n")
                    fullUrl = "http://sports.yahoo.com" + str(url)
                    boxurl = urllib2.urlopen(fullUrl).read()
                    boxsoup = BeautifulSoup(boxurl)
                    
                    #FindAwayTeamName
                    re1='(awayTeamName)'
                    re2='.*?'
                    re3='(\\\'.*?\\\')'
                    rg = re.compile(re1+re2+re3)
                    m = rg.search(boxurl)
                    awayteam=m.group(2)
                      
                    #FindAwayScore
                    re4='(awayTeamScore)'
                    re5='.*?'
                    re6='(\\\'.*?\\\')'
                    rg = re.compile(re4+re5+re6)
                    m = rg.search(boxurl)
                    awayscore=m.group(2)  
                    g.write(awayteam+", "+awayscore+",") 
                    #g.write("\n") 
                    
                    #FindHomeTeamName (note: Yahoo's source swaps homeTeamName/homeTeamScore, so the patterns below are intentionally crossed)
                    re1='(homeTeamScore)'
                    re2='.*?'
                    re3='(\\\'.*?\\\')'
                    rg = re.compile(re1+re2+re3)
                    m = rg.search(boxurl)
                    hometeam=m.group(2)
                      
                    #FindHomeScore
                    re4='(homeTeamName)'
                    re5='.*?'
                    re6='(\\\'.*?\\\')'
                    rg = re.compile(re4+re5+re6)
                    m = rg.search(boxurl)
                    homescore=m.group(2)  
                    g.write(hometeam+", "+homescore+",") 
                    
                      
                    #Scrape Team Stats
                    t = boxsoup.findAll('div', id = "ysp-reg-box-team_stats")
                    for table in t:
                      rows = table.findAll('tr')
                      for tr in rows:
                          cols = tr.findAll('td')
                          for td in cols:
                              g.write(td.find(text=True))
                              g.write(",")
                          #g.write("\n")
                          i=i+1
                    if i==16:
                      #check if line=="0,0"
                      g.close()
                      g = open(gid+".csv", "r")
                      line=g.readline()
                      otS=line[85:88]
                      if otS == '0,0':
                        g.close()
                        g=open(gid+".csv","a")
                        g.write("SOW")
                      else:
                        g.close()
                        g=open(gid+".csv","a")
                        g.write("OTW")
                    if i==15:
                      g.write("REG")
                  If you look at the last part of my code, I messily determine if the game went to OT or shootout by iterating through it. The worst part of the code is this:
                  line=g.readline()
                  otS=line[85:88]
                  Simply because I couldn't figure out how to truly read the file.
                  In order for this to work though, I actually need to delete those records so everything lines up --(or incorporate some extra bit of logic into the import script which I'd like to avoid).

                  The only documentation I could find on deleting stuff was the translate method which doesn't seem to work in my case. Can anybody help me to delete the OT score entry?

                  Thanks a lot
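                   One way around slicing raw characters out of the file is to buffer the stat rows in a list first, then drop the extra record before writing anything. The rows below are made-up stand-ins, and where the OT row lands (index 2 here) is an assumption to adjust:

```python
# Stand-in for the rows scraped from the stats table; a regulation game here
# would have 3 rows, an OT/SO game 4 (the extra row's position is assumed).
rows = [["27", "44"], ["12", "16"], ["2", "3"], ["10", "13"]]

REGULATION_ROWS = 3  # assumed row count for a game with no extra period
if len(rows) > REGULATION_ROWS:
    ot_row = rows.pop(2)                  # delete the record so columns line up
    result = "SOW" if ot_row == ["0", "0"] else "OTW"
else:
    result = "REG"

flat = ",".join(cell for row in rows for cell in row)
```

                   Because nothing touches the file until the end, there's no close/reopen/readline dance, and the dropped row is available for the OTW/SOW check.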
                  Comment
                  • TheEditor
                    SBR Hustler
                    • 09-29-11
                    • 95

                    #10
                    I'm a Delphi guy, though I'm seriously considering taking up Python and wxPython.

                    Yeah, my scraping has problems with OTs, no matter the sport. The REs get a lot more hairy. Thinking about it, it seems like a good idea to test a box first to see if it has OT then apply the appropriate RE. But then you have two different REs to maintain going forward.
                    Comment
                    • laxbrah420
                      SBR High Roller
                      • 10-29-10
                      • 210

                      #11
                      anybody have a good idea on how to deal with OT?
                      my best idea is to delete the 3 characters that represent the score but I'm not sure of the best way to do that
                      Comment
                      • laxbrah420
                        SBR High Roller
                        • 10-29-10
                        • 210

                        #12
                        for the record, I'm a total dope. My OTW vs SOW "logic" was actually testing shots on goal in the OT period (and declaring a SOW only if there were no shots)
                        Comment
                        • mathdotcom
                          SBR Posting Legend
                          • 03-24-08
                          • 11689

                          #13
                          ---
                          Last edited by mathdotcom; 10-25-11, 07:25 PM.
                          Comment
                          • rsigley
                            SBR Sharp
                            • 02-23-08
                            • 304

                            #14
                             count the number of children for the div

                             if it's a certain size, parse OT, SO, or FG. If it's like ESPN, where they put in blank td's so the size is the same regardless of game length, just look at the value where you think OT should be

                             then the way I do it is

                             if SO: OT = 0 for both, 1,2,3 normal, get the SO score
                             if OT: SO = "X" for both, 1,2,3 normal, get the OT score
                             if neither: OT & SO = "X", get the score normally

                             and why export to CSV then import into a database? Why not just write directly to the db?
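                             Those three cases can be sketched as a tiny dispatcher; the "X" convention and the cell values are stand-ins based on the description above, not a real feed:

```python
# ot_cell / so_cell are the line-score cells for the OT and SO columns,
# with "X" meaning the column is absent (stand-in convention from the post).
def game_result(ot_cell, so_cell):
    if so_cell != "X":     # shootout: OT column shows 0 for both teams
        return "SO"
    if ot_cell != "X":     # overtime decided it, no shootout column
        return "OT"
    return "REG"           # neither extra column present
```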
                            Comment
                            • subs
                              SBR MVP
                              • 04-30-10
                              • 1412

                              #15
                              Free sigley
                              Comment
                              • donkson
                                SBR Sharp
                                • 03-12-11
                                • 411

                                #16
                                haha
                                Comment
                                • laxbrah420
                                  SBR High Roller
                                  • 10-29-10
                                  • 210

                                  #17
                                  Originally posted by rsigley
                                   count the number of children for the div

                                   if it's a certain size, parse OT, SO, or FG. If it's like ESPN, where they put in blank td's so the size is the same regardless of game length, just look at the value where you think OT should be

                                   then the way I do it is

                                   if SO: OT = 0 for both, 1,2,3 normal, get the SO score
                                   if OT: SO = "X" for both, 1,2,3 normal, get the OT score
                                   if neither: OT & SO = "X", get the score normally

                                   and why export to CSV then import into a database? Why not just write directly to the db?
                                   OK thanks, I'll work on that.
                                   And b/c I don't know how to write to the db yet, I figured I'd just start by collecting data haha.
                                   For the time being I can pretty quickly move the files into Excel.
                                  Comment
                                  • Maverick22
                                    SBR Wise Guy
                                    • 04-10-10
                                    • 807

                                    #18
                                     you don't know how to write to the database? What does that mean?
                                    Comment
                                    • laxbrah420
                                      SBR High Roller
                                      • 10-29-10
                                      • 210

                                      #19
                                      Originally posted by Maverick22
                                       you don't know how to write to the database? What does that mean?
                                      It means I figured out the write to file function but not the write to mysql stuff
                                      Comment
                                      • babar1000
                                        SBR High Roller
                                        • 08-08-11
                                        • 174

                                        #20
                                         For MySQL you can use XAMPP and HeidiSQL to browse the data.
                                         You must create a table with the classic CREATE TABLE command.
                                         And after that you load the file with the LOAD DATA command (http://dev.mysql.com/doc/refman/5.1/en/load-data.html)
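                                         A minimal sketch of that LOAD DATA statement, matching the CSV the scraper writes (the table name and file path are made up; add IGNORE 1 LINES only if the CSV has a header row):

```sql
LOAD DATA LOCAL INFILE '/path/to/2011101114.csv'
INTO TABLE nhlgames
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```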

                                        Originally posted by laxbrah420
                                        It means I figured out the write to file function but not the write to mysql stuff
                                        Comment
                                        • rsigley
                                          SBR Sharp
                                          • 02-23-08
                                          • 304

                                          #21
                                           just use the SQL command, like

                                           INSERT INTO

                                           along with some Python function that deals with MySQL
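                                           A sketch of that approach: build one parameterized INSERT and execute it per CSV row. The table and column names are hypothetical, and the MySQLdb part (one such Python library) is commented out since it needs a live server:

```python
import csv

# Build a parameterized INSERT for a DB-API cursor
# (table/column names here are hypothetical, not from the thread).
def build_insert(table, columns):
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), placeholders)

sql = build_insert("nhlgames", ["gamedate", "gid", "awayteam", "awayscore"])

# With a live server:
# import MySQLdb
# db = MySQLdb.connect(user="user", passwd="pass", db="nhl")
# cur = db.cursor()
# for row in csv.reader(open("2011101114.csv")):
#     cur.execute(sql, row[:4])
# db.commit()
```

                                           Parameterized execution (the %s placeholders) lets the driver handle quoting, instead of pasting scraped strings straight into SQL.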
                                          Comment
                                          • Maverick22
                                            SBR Wise Guy
                                            • 04-10-10
                                            • 807

                                            #22
                                            You might want to invest some time in learning about 'data access objects'

                                            If you can write to a file, then most of the work is done for you to write to a database.

                                            Do you have your database created/modelled already?
                                            Comment
                                            • TheEditor
                                              SBR Hustler
                                              • 09-29-11
                                              • 95

                                              #23
                                              You talking 'bout DAO specifically?
                                              Comment
                                              • Maverick22
                                                SBR Wise Guy
                                                • 04-10-10
                                                • 807

                                                #24
                                                I do not understand your question?
                                                Comment
                                                • TheEditor
                                                  SBR Hustler
                                                  • 09-29-11
                                                  • 95

                                                  #25
                                                  Then the answer is no. DAO is a data access protocol from Microsoft. Was just a little surprised to see a reference to it in a Python thread and wanted to see if that was what you were referring to.

                                                  Originally posted by Maverick22
                                                  I do not understand your question?
                                                  Comment
                                                  • Maverick22
                                                    SBR Wise Guy
                                                    • 04-10-10
                                                    • 807

                                                    #26
                                                     My understanding of Data Access Objects is that they're a design pattern, not a vendor-specific concept.

                                                    In my own programs there exists some Object/Datatype. The Data Access Objects are 'aware' of this object. They exist only to read/write this object to/from some medium. A database, a file, an input stream, anything else.

                                                    So if this guy can already manipulate a file with his data, then 'refactoring' it to use Data Access Objects... and to write/read to a database
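                                                     A minimal Python sketch of that pattern (all names here are hypothetical): the Game object knows nothing about storage, each DAO knows exactly one medium, and swapping CSV for MySQL means swapping the DAO:

```python
import io

# The domain object: knows nothing about files or databases.
class Game(object):
    def __init__(self, gid, away, awayscore, home, homescore):
        self.gid, self.away, self.awayscore = gid, away, awayscore
        self.home, self.homescore = home, homescore

# One DAO per medium: this one writes Games to any file-like object.
class CsvGameDao(object):
    def __init__(self, f):
        self.f = f
    def save(self, game):
        self.f.write("%s,%s,%s,%s,%s\n" % (
            game.gid, game.away, game.awayscore, game.home, game.homescore))

# A MysqlGameDao with the same save(game) signature could be swapped in
# without touching the scraper itself.
buf = io.StringIO()
CsvGameDao(buf).save(Game("2011101114", "Minnesota", "3", "Ottawa", "4"))
```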
                                                    Comment
                                                    • TheEditor
                                                      SBR Hustler
                                                      • 09-29-11
                                                      • 95

                                                      #27
                                                       It's that too. As a Windows guy I took notice.

                                                      Originally posted by Maverick22
                                                       My understanding of Data Access Objects is that they're a design pattern, not a vendor-specific concept. In my own programs there exists some Object/Datatype. The Data Access Objects are 'aware' of this object. They exist only to read/write this object to/from some medium. A database, a file, an input stream, anything else. So if this guy can already manipulate a file with his data, then 'refactoring' it to use Data Access Objects... and to write/read to a database
                                                      Comment