Programmers Wanted For Group NFL Web Scraper Project

  • Maverick22
    SBR Wise Guy
    • 04-10-10
    • 807

    #1
    Programmers Wanted For Group NFL Web Scraper Project
    Hi all...

    Someone suggested building a community/group-created NFL scraper... So I figured I would spearhead this effort...

    Was looking to scrape webpages similar to this: http://sports.espn.go.com/nfl/boxscore?gameId=290914017

    I am pretty crispy on Java (did my MLB scraper in Java). But I am open to whatever "the group" wants to use...

    Does there exist any interest for this?
  • Grind-It-Out
    SBR Wise Guy
    • 05-04-10
    • 537

    #2
    I'm game if we do it in PHP. I have zero interest in ever using Java again.
    Comment
    • fightingwarrior
      Restricted User
      • 05-06-10
      • 7818

      #3
      I can do it. Seems easy enough.
      Comment
      • FreeFall
        SBR MVP
        • 02-20-08
        • 3365

        #4
        I don't see why it matters. Just something that is platform independent or hits the big three: Linux, Mac, Win. Either way, aren't we all just looking for the end goal, and that's the DB?
        Comment
        • Grind-It-Out
          SBR Wise Guy
          • 05-04-10
          • 537

          #5
          Originally posted by FreeFall
          I don't see why it matters. Just something that is platform independent or hits the big three: Linux, Mac, Win. Either way, aren't we all just looking for the end goal, and that's the DB?
          Yes, but by "group" project I took it to mean we would all be working on the same script. If that's the case, it would have to be the same language. Although it definitely seems easy enough for one person to do.
          Comment
          • TomG
            SBR Wise Guy
            • 10-29-07
            • 500

            #6
            I've recently completed a project to compile publicly-available NFL play-by-play data. It took a while, but now it's ready. The resulting ...
            Comment
            • Maverick22
              SBR Wise Guy
              • 04-10-10
              • 807

              #7
              Originally posted by Grind-It-Out

              Yes, but by "group" project I took it to mean we would all be working on the same script. If that's the case, it would have to be the same language. Although it definitely seems easy enough for one person to do.
              Yes, it is easy enough for a single person. It is not hard... but quite tedious... I could write it myself... But then I would hoard the code and it would never see the light of day... Why not pool talents and roll out a "product" quicker than a single man can... I'm awesome... but I'm not Superman...

              So... According to this website, the guy did a single mass upload of the stats, and WELL after the season was over. For what most of us have in mind, this needs to be done every day... End of the season does little good. Unless I missed the link where it is noted the stats will be updated/uploaded nightly... This link is only good in the offseason...
              Comment
              • gimpy
                SBR Wise Guy
                • 10-17-10
                • 510

                #8
                Is there still interest for this? I'd like to see it happen.
                Comment
                • Maverick22
                  SBR Wise Guy
                  • 04-10-10
                  • 807

                  #9
                  I have now fully created this scraper. So I guess there is no more need for a collaborative effort.

                  I did it the 'long' way, but it is the 'right' and most 'robust' way. I am an expert programmer after all
                  Comment
                  • uva3021
                    SBR Wise Guy
                    • 03-01-07
                    • 537

                    #10
                    Originally posted by Maverick22
                    I have now fully created this scraper. So I guess there is no more need for a collaborative effort.

                    I did it the 'long' way, but it is the 'right' and most 'robust' way. I am an expert programmer after all
                    share the source so we can begin collaborating

                    hopefully you did it in python or PHP
                    Comment
                    • Maverick22
                      SBR Wise Guy
                      • 04-10-10
                      • 807

                      #11
                      ...
                      Comment
                      • Maverick22
                        SBR Wise Guy
                        • 04-10-10
                        • 807

                        #12
                        ...
                        Comment
                        • chilidog
                          SBR Posting Legend
                          • 04-05-09
                          • 10305

                          #13
                          Originally posted by Maverick22
                          Once upon a time, a little red hen lived in a small cottage. She worked hard to keep her family fed. One day, when the little red hen was out walking with her friends, the goose, the cat, and the pig, she found a few grains of wheat....
                          Such a moral doesn't work anymore. In modern times, the cat and pig would've just killed the hen and taken her bread.
                          Comment
                          • pedro803
                            SBR Sharp
                            • 01-02-10
                            • 309

                            #14
                            Originally posted by chilidog
                            Such a moral doesn't work anymore. In modern times, the cat and pig would've just killed the hen and taken her bread.

                            probably this is the reason Maverick ain't sharing his address!
                            Comment
                            • Maverick22
                              SBR Wise Guy
                              • 04-10-10
                              • 807

                              #15
                              ...
                              Comment
                              • podonne
                                SBR High Roller
                                • 07-01-11
                                • 104

                                #16
                                This is part of my frustration with the current state of affairs in the sports-information marketplace. Everyone writes this kind of code for themselves and hoards it, or is forced to pay hundreds of dollars a month to websites. I really think we could all benefit from a simple service that charged a nominal fee ($10/month?) to download box scores and odds lines in XML format. After all, the information is free (scraped), so they really shouldn't charge that much to receive it.
                                Comment
                                • Maverick22
                                  SBR Wise Guy
                                  • 04-10-10
                                  • 807

                                  #17
                                  ...
                                  Comment
                                  • uva3021
                                    SBR Wise Guy
                                    • 03-01-07
                                    • 537

                                    #18
                                    Get over yourself Maverick22
                                    Comment
                                    • Maverick22
                                      SBR Wise Guy
                                      • 04-10-10
                                      • 807

                                      #19
                                      ...
                                      Comment
                                      • rsigley
                                        SBR Sharp
                                        • 02-23-08
                                        • 304

                                        #20
                                        fwiw nfl.com boxscore has much more info than espn that could be useful. that is the one i scrape

                                        also

                                        w1 lines:

                                        w2 lines:


                                        w1 box:

                                        w2 box:
                                        Comment
                                        • podonne
                                          SBR High Roller
                                          • 07-01-11
                                          • 104

                                          #21
                                          Originally posted by rsigley
                                          fwiw nfl.com boxscore has much more info than espn that could be useful. that is the one i scrape
                                          What nfl.com box scores are you looking at? I can't find a proper box score on the site, except for a "Gamebook" but it's in PDF. Are you parsing the PDF file?

                                          A game page: http://www.nfl.com/gamecenter/200909...ghts&tab=recap
                                          Comment
                                          • podonne
                                            SBR High Roller
                                            • 07-01-11
                                            • 104

                                            #22
                                            Also, what's square sports betting?
                                            Comment
                                            • subs
                                              SBR MVP
                                              • 04-30-10
                                              • 1412

                                              #23
                                              the exact opposite of what it's called
                                              Comment
                                              • rsigley
                                                SBR Sharp
                                                • 02-23-08
                                                • 304

                                                #24
                                                Originally posted by podonne
                                                What nfl.com box scores are you looking at? I can't find a proper box score on the site, except for a "Gamebook" but it's in PDF. Are you parsing the PDF file?

                                                A game page: http://www.nfl.com/gamecenter/200909...ghts&tab=recap

                                                http://www.nfl.com/gamecenter/200909...ts&tab=analyze

                                                but using top tier methods to bypass the ajax which makes it pretty impossible to scrape
                                                Comment
                                                • Maverick22
                                                  SBR Wise Guy
                                                  • 04-10-10
                                                  • 807

                                                  #25
                                                  top tier methods? What does that mean?
                                                  Comment
                                                  • podonne
                                                    SBR High Roller
                                                    • 07-01-11
                                                    • 104

                                                    #26
                                                    Originally posted by rsigley
                                                    http://www.nfl.com/gamecenter/200909...ts&tab=analyze

                                                    but using top tier methods to bypass the ajax which makes it pretty impossible to scrape
                                                    That's pretty good! It's making a call for a formatted JSON file with all the box score information. Love when it's in a non-HTML format!

                                                    Try this from the game above: http://www.nfl.com/liveupdate/game-c...=1316930020000

                                                    If the random number is meaningful, you could always just load up an Internet Explorer WebBrowser control and use WebBrowser.Navigate(). Wait a few seconds and then parse the DOM. It'd be a lot slower, but it can't be defeated by trickery.

                                                    If you don't mind parsing the HTML, try a link like this one:

                                                    The official source for NFL news, video highlights, fantasy football, game-day coverage, schedules, stats, scores and more.


                                                    Returns an HTML version of the box score. Just set the date in YYYYMMDD format and append a "01" to "16" and you should get what you need. No random numbers...
                                                    Comment
                                                    • podonne
                                                      SBR High Roller
                                                      • 07-01-11
                                                      • 104

                                                      #27
                                                      Here's a URL for play-by-play using the same method as for the box score:

                                                      The official source for NFL news, video highlights, fantasy football, game-day coverage, schedules, stats, scores and more.
                                                      Comment
                                                      • rsigley
                                                        SBR Sharp
                                                        • 02-23-08
                                                        • 304

                                                        #28
                                                        oh yep i use those links. if anyone wants to know how to find them, on Firefox you can install a plugin called Firebug (the one on Chrome is better but it's harder to find these URLs unless you know where to look)

                                                        when you click on the boxscore then play by play tab you get (you have to enable console for this)



                                                        So you can see that when you click the boxscore button it's sending a GET request to get the data from that URL, and it seems there's a variable involved called "gameID".
                                                        But how do you figure out the gameID? If you click scores at the top and go to a week, you can see that it's embedded in every gamecenter URL.

                                                        so the logic behind a parser would be

                                                        - Go to current week in scores (they use the convenient REGX where X = week of the season)
                                                        - Grab all the game center URLs (they're of the form /gamecenter/XXXX not the full www.nfl.com/gamecenter/etc)
                                                        - Find a tricky way to get the gameID out of the URL (one way I use is to string replace /gamecenter/ and then explode on the "/" so it will be the first element in the new array. You could also just explode on / and look at the second element, but I ran into some errors on the older boxscores that way. I mean you could do checks to see if it's numeric or whatever, but why bother when something like this would work):

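                                                        // Assumes $dom2 is the scores page parsed with a simple_html_dom-style parser and $c is the index of the current game box.
                                                        // Strip the "/gamecenter/" prefix from the game's link so the gameId becomes the first path segment.
                                                        $link2 = str_replace("/gamecenter/", "", $dom2->find('div[id="score-boxes"]', 0)->find('div[class="game-center-area"]', $c)->find('a', 0)->href);
                                                        $bsid = explode("/", $link2);
                                                        // Build the boxscore and play-by-play widget URLs from the gameId.
                                                        $BSLink = "http://www.nfl.com/widget/gc/2011/tabs/cat-post-boxscore?gameId=" . $bsid[0];
                                                        $PBPLink = "http://www.nfl.com/widget/gc/2011/tabs/cat-post-playbyplay?gameId=" . $bsid[0];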

                                                        Then you can just get those files and parse accordingly.
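
For anyone not using PHP, the same gameId extraction might look roughly like this in Python. This is only a sketch: the two URL templates are copied from the PHP snippet above, while the function names and the sample href are hypothetical. Instead of string-replacing and exploding, it looks for the path segment that follows "gamecenter" and only accepts it when it is all digits.

```python
from urllib.parse import urlparse

# URL templates copied from the PHP snippet in the thread (2011 season paths).
BOXSCORE_URL = "http://www.nfl.com/widget/gc/2011/tabs/cat-post-boxscore?gameId="
PBP_URL = "http://www.nfl.com/widget/gc/2011/tabs/cat-post-playbyplay?gameId="


def game_id_from_href(href):
    """Return the path segment that follows 'gamecenter', or None if absent."""
    segments = [s for s in urlparse(href).path.split("/") if s]
    if "gamecenter" not in segments:
        return None
    idx = segments.index("gamecenter")
    if idx + 1 >= len(segments):
        return None
    candidate = segments[idx + 1]
    # Guard against malformed or older-style links: accept only an all-digit ID.
    return candidate if candidate.isdigit() else None


def widget_links(href):
    """Build (boxscore, play-by-play) URLs for one gamecenter link."""
    game_id = game_id_from_href(href)
    if game_id is None:
        return None
    return BOXSCORE_URL + game_id, PBP_URL + game_id


if __name__ == "__main__":
    # Hypothetical relative link in the /gamecenter/XXXX form described above.
    print(widget_links("/gamecenter/2011092501/2011/REG3/example-matchup"))
```

Because it works on path segments, the same function handles both the relative /gamecenter/... form and a full www.nfl.com URL.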

                                                        This method is useful to parse any website that uses ajax to load the data. Some (like Covers) require you to send POST requests to the server to get the data back as JSON, but that's not that big of a deal. Just use Firebug to see what variables the requests are sending and what is coming back, and use your script to send/receive those.
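
As a rough illustration of replaying such a POST request from a script, here is a minimal Python sketch using only the standard library. The endpoint URL and field names in the example are placeholders, not real Covers parameters; the actual values are whatever Firebug shows the page sending. The X-Requested-With header is an assumption that some AJAX endpoints check for, not something confirmed for any particular site.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_post_request(url, fields):
    """Create a form-encoded POST request mirroring what the browser sent."""
    body = urlencode(fields).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Assumption: some AJAX endpoints expect this header; drop it if not needed.
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, data=body, headers=headers, method="POST")


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder endpoint and fields; copy the real ones from the network console.
    req = build_post_request("http://example.com/ajax/endpoint", {"gameId": "123"})
    print(req.get_method(), req.data)
```

Keeping the request construction separate from the network call makes it easy to compare the outgoing body against what the browser sent before hitting the server.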
                                                        Comment
                                                        • chemicalbrother
                                                          Restricted User
                                                          • 01-26-11
                                                          • 4086

                                                          #29
                                                          rsigley: not just for trolling anymore.
                                                          Comment
                                                          • podonne
                                                            SBR High Roller
                                                            • 07-01-11
                                                            • 104

                                                            #30
                                                            rsigley: Can't you just start from July 1, 2000 and increment day by day, putting each date in YYYYMMDD format and pulling every value between "01" and "16"?

                                                            You'll have to figure out which game you've pulled, but it should be easier than parsing the gamecenter urls.
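
The day-by-day idea could be sketched like this in Python. It assumes, as described in the thread, that a game ID is the date in YYYYMMDD format followed by a two-digit number from "01" to "16"; the function name and the sample dates are made up for illustration. Most generated IDs will not correspond to a real game, so whatever fetches them has to tolerate empty or error responses.

```python
from datetime import date, timedelta


def candidate_game_ids(start, end, max_games=16):
    """Yield YYYYMMDDNN strings for every day in [start, end] and NN in 01..max_games."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        for n in range(1, max_games + 1):
            # Zero-pad the per-day counter to two digits ("01" .. "16").
            yield f"{stamp}{n:02d}"
        day += timedelta(days=1)


if __name__ == "__main__":
    # Sample two-day window; a full backfill would start from an earlier date.
    for game_id in candidate_game_ids(date(2011, 9, 25), date(2011, 9, 26)):
        print(game_id)
```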
                                                            Comment
                                                            • rsigley
                                                              SBR Sharp
                                                              • 02-23-08
                                                              • 304

                                                              #31
                                                              dunno if you want to, but it's easier to just get the link than not knowing what you're getting

                                                              w3 if anyone's interested:

                                                              w3 lines


                                                              w3 boxscore
                                                              Comment