Excel based tutorial for web scraping

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • a4u2fear
    SBR Hall of Famer
    • 01-29-10
    • 8147

    #36
    Public Sub MainCode()
    'This is a comment. This sub will have all of our
    'main code and will call other subs

    'Below are my variables
    Dim totalgames As Integer 'i.e. 1230
    Dim NHLyear As Integer 'i.e. 12, two digit because the year in NHL's website is only two digits
    Dim gamenumber As Integer

    Application.ScreenUpdating = False

    For NHLyear = 11 To 11

    If NHLyear = 12 Then
    totalgames = 720 'shortened season!
    Else
    totalgames = 1230 '2011 and 2013 both were full seasons
    End If

    For gamenumber = 1 To 10
    Call ImportWebpage(gamenumber, NHLyear)
    Call PullData
    Next

    Next

    Application.ScreenUpdating = True


    End Sub


    Public Sub ImportWebpage(game, gameyear)
    'this sub only imports the webpage
    Dim gamestring As String

    Application.DisplayAlerts = False
    Sheets("Sheet2").Delete
    Application.DisplayAlerts = True
    Sheets.Add After:=Sheets(Sheets.Count)
    Sheets(Sheets.Count).Name = "Sheet2"

    If game < 10 Then
    gamestring = "000" & game
    ElseIf game < 100 Then
    gamestring = "00" & game
    ElseIf game < 1000 Then
    gamestring = "0" & game
    Else
    gamestring = game
    End If

    With Sheets("Sheet2").QueryTables.Add(Connect ion:= _
    "URL;http://www.nhl.com/ice/boxscore.htm?id=20" & gameyear & "02" & gamestring, Destination:= _
    Range("$A$1"))
    .Name = ""
    .FieldNames = True
    .RowNumbers = False
    .FillAdjacentFormulas = False
    .PreserveFormatting = True
    .RefreshOnFileOpen = False
    .BackgroundQuery = True
    .RefreshStyle = xlInsertDeleteCells
    .SavePassword = False
    .SaveData = True
    .AdjustColumnWidth = True
    .RefreshPeriod = 0
    .WebSelectionType = xlEntirePage
    .WebFormatting = xlWebFormattingNone
    .WebPreFormattedTextToColumns = True
    .WebConsecutiveDelimitersAsOne = True
    .WebSingleBlockTextImport = False
    .WebDisableDateRecognition = False
    .WebDisableRedirections = False
    .Refresh BackgroundQuery:=False
    End With

    End Sub

    Public Sub PullData()
    'this sub will pull data from the imported webpage and
    'put into a main sheet which will contain all game data
    Dim lastrowsheetone As Integer
    Dim findmatch As Integer

    lastrowsheetone = Sheets("Sheet1").UsedRange.Rows.Count
    findmatch = Application.Match("1st", Sheets("Sheet2").Range("B1:B100"), 0)

    'away team stats
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 2) = Sheets("Sheet2").Cells(findmatch + 1, 1)
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 3) = Sheets("Sheet2").Cells(findmatch + 1, 2)
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 4) = Sheets("Sheet2").Cells(findmatch + 1, 3)
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 5) = Sheets("Sheet2").Cells(findmatch + 1, 4)
    If Sheets("Sheet2").Cells(findmatch + 1, 5) = "" Then
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 6) = 0
    Else
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 6) = Sheets("Sheet2").Cells(findmatch + 1, 5)
    End If
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 7) = Sheets("Sheet2").Cells(findmatch + 1, 7)

    'home team stats
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 8) = Sheets("Sheet2").Cells(findmatch + 2, 1)
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 9) = Sheets("Sheet2").Cells(findmatch + 2, 2)
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 10) = Sheets("Sheet2").Cells(findmatch + 2, 3)
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 11) = Sheets("Sheet2").Cells(findmatch + 2, 4)
    If Sheets("Sheet2").Cells(findmatch + 2, 5) = "" Then
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 12) = 0
    Else
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 12) = Sheets("Sheet2").Cells(findmatch + 2, 5)
    End If
    Sheets("Sheet1").Cells(lastrowsheetone + 1, 13) = Sheets("Sheet2").Cells(findmatch + 2, 7)



    End Sub
    Last edited by a4u2fear; 03-17-14, 04:21 PM.
    Comment
    • a4u2fear
      SBR Hall of Famer
      • 01-29-10
      • 8147

      #37
      The code posted above will pull the data I just mentioned for the first 10 games of the season. You can change it to do the whole season, but you should see what you are up against first by just doing the first 10 games.

      lastrowsheetone finds where the last row of data is in our main sheet (sheet1) and places all data beneath it. If you have data already in your main sheet, you should select all rows and delete the row, not the data before importing the data.

      Every webpage is different; so it's important to find where the data you need is. In our imported webpage, I noticed the word "1st" is easily findable in column B of the imported webpage and right below it, the teams and goals per period is found. That is why i used:

      findmatch = Application.Match("1st", Sheets("Sheet2").Range("B1:B100"), 0)

      This will return the row position of "1st" and I know that match +1 will be the row of the away team and match+2 be the row of the home team.

      This code is almost the same as the one I posted earlier, but PullData is now filled in and I changed the number of games to import from 1 to 10
      Comment
      • a4u2fear
        SBR Hall of Famer
        • 01-29-10
        • 8147

        #38
        if you are having trouble following the placement of the data in Sheet1, Cells(row,column), Column A=1, B=2 etc. Row 1=1, in this case lastrowsheetone +1 since there are already lastrowsheetone of rows occupied, so we place the data in the row beneath it

        There is plenty of other data in here to pull if desired. If you want to do this, for example pull Giveaways, just add another match variable to search for "Giveaways". Be sure to set the range of lookup. In our previous example I set it to B1:B100. It will return the first found match of it. If you were to set the range of lookup to B20:B100 and the match is in row 25, match will return 5, so starting at the first row is helpful. Also, be sure to set the match find word (i.e. Giveaways) to something that does not exist more than once. Match function looks for whole words or phrases only
        Comment
        • a4u2fear
          SBR Hall of Famer
          • 01-29-10
          • 8147

          #39
          just so there's no confusion, if you are looking for Giveaways, be sure to set your range of lookup in column A since that is where it will exist. In a game I have pulled up, Giveaways is in row 136, but that may not always be the case. So set your range to something like

          findgiveaways = Application.Match("Giveaways", Sheets("Sheet2").Range("A1:A300"), 0)
          Comment
          • b_rad_1983
            SBR High Roller
            • 01-07-13
            • 127

            #40
            Looks good. I ran the code once. It gave me the first game. I had to manually enter the headings into sheet1. From there I ran the code again and it gave me 11 games were added to the list. The every time I run the code more games get added. I will try again in a bit to see if this repeats.
            Comment
            • b_rad_1983
              SBR High Roller
              • 01-07-13
              • 127

              #41
              I ran the code with no headings in sheet1 and it only gives one game.
              If the headings are in before the code is ran it puts 11 games, then each time the code is ran it duplicates.
              That's where I went wrong.
              You need headings and it can only be ran once.
              Comment
              • a4u2fear
                SBR Hall of Famer
                • 01-29-10
                • 8147

                #42
                Yes I put headers in there and assumed others would. It duplicates because I have it set for games 1:10, change 10 to totalgames
                Comment
                • b_rad_1983
                  SBR High Roller
                  • 01-07-13
                  • 127

                  #43
                  In your example, if NHL changed its format then the pull data code will be no good and will need to be adjusted correct?
                  Comment
                  • a4u2fear
                    SBR Hall of Famer
                    • 01-29-10
                    • 8147

                    #44
                    Yes it's possible, but this works all way back to 2007 to today and I see no reason for them to change it since it will always be 1st period
                    Comment
                    • b_rad_1983
                      SBR High Roller
                      • 01-07-13
                      • 127

                      #45
                      Okay. What's the next part of the lesson?
                      Comment
                      • a4u2fear
                        SBR Hall of Famer
                        • 01-29-10
                        • 8147

                        #46
                        Could be anything. I could show how to run code with a macro/button.

                        a lot more of the coding would be after you have your database and I could show how to go thru and estimate teams avg stats before a game etc.

                        If I were you, I'd start thinking about what data I want to save and go thru a whole season and try to gather all of it. Once you have a nice database, it's easier to start going thru it and analyzing it.
                        Comment
                        • b_rad_1983
                          SBR High Roller
                          • 01-07-13
                          • 127

                          #47
                          What if we pull the data from last season, every game. From there let's add some math to a couple columns over to total goals of each period throughout the season.
                          Comment
                          • b_rad_1983
                            SBR High Roller
                            • 01-07-13
                            • 127

                            #48
                            Starting at O1 heading will be ''total 1st period goals''
                            O2 = the summed amount from C2:C1000....
                            And so on.. For all the periods.
                            Comment
                            • b_rad_1983
                              SBR High Roller
                              • 01-07-13
                              • 127

                              #49
                              i'm not sure what data we would need to get as per past data to predict a current game.

                              My current spreadsheet in Google docs is working with Goals For and Goals Against for each team then i'm using the numbers into a percent to pick the better team.

                              To do this I will grab in the NHL Current stats page.
                              Pull in data to get each team down column A, get the Goals For in B, Goals Against in C, use Pythagorean Theorem to get a percent for each team down column C.

                              Then import the schedule into another sheet, Reference the teams to the percent, then compare the two teams in the matchup for the night.
                              Comment
                              • b_rad_1983
                                SBR High Roller
                                • 01-07-13
                                • 127

                                #50
                                When i do the work on excel 2007 then try the same thing on excel2013 I run into errors.
                                Last edited by b_rad_1983; 03-18-14, 02:41 AM.
                                Comment
                                • b_rad_1983
                                  SBR High Roller
                                  • 01-07-13
                                  • 127

                                  #51
                                  What about calculating the win % before every game by using the teams home/loss/tie record, comparing it to the ml price for that game then see if anything stands out that could help us currently.
                                  Then use that information on current games.
                                  Comment
                                  • a4u2fear
                                    SBR Hall of Famer
                                    • 01-29-10
                                    • 8147

                                    #52
                                    b_rad, you are free to run on your own. A lot of these calculations and estimating a team's performance going into a game I have done much of, and leave that for each of you to decide. I don't claim to have a system to eventually show everyone that guarantees winners. You seem to be a little more advance probably than others on here. I was planning on helping others out through tutorials and how to use VBA rather than try out systems on the code.
                                    Comment
                                    • a4u2fear
                                      SBR Hall of Famer
                                      • 01-29-10
                                      • 8147

                                      #53
                                      so if you have used the code posted above, you have a column filled with home teams and away teams. If you wanted to create another column to combine the teams playing so you filter for a specific team:

                                      for example, our first game is in row 2, PHI vs BOS, and we have data in columns up to M. So in column N2 (column N, row 2), we could type

                                      =B2 & H2, this will provide the result PHIBOS. You can then drag this formula through all of your games. Filtering on column N you can then see all of one team's games, instead of just filtering on just their away or home games.
                                      Comment
                                      • b_rad_1983
                                        SBR High Roller
                                        • 01-07-13
                                        • 127

                                        #54
                                        I'm still in the beginning stages of understanding excel.

                                        We can keep it directed towards your example.

                                        Under the URL of the site we grab the data from, has lines of true and false. What are they called? Switches? Also, how much more of them is there to choose from?




                                        The area below is where some people may be stumped.

                                        'away team stats
                                        Sheets("Sheet1").Cells(lastrowsheetone + 1, 2) = Sheets("Sheet2").Cells(findmatch + 1, 1)
                                        Sheets("Sheet1").Cells(lastrowsheetone + 1, 3) = Sheets("Sheet2").Cells(findmatch + 1, 2)
                                        Sheets("Sheet1").Cells(lastrowsheetone + 1, 4) = Sheets("Sheet2").Cells(findmatch + 1, 3)
                                        Sheets("Sheet1").Cells(lastrowsheetone + 1, 5) = Sheets("Sheet2").Cells(findmatch + 1, 4)
                                        If Sheets("Sheet2").Cells(findmatch + 1, 5) = "" Then
                                        Sheets("Sheet1").Cells(lastrowsheetone + 1, 6) = 0
                                        Else
                                        Sheets("Sheet1").Cells(lastrowsheetone + 1, 6) = Sheets("Sheet2").Cells(findmatch + 1, 5)
                                        End If
                                        Sheets("Sheet1").Cells(lastrowsheetone + 1, 7) = Sheets("Sheet2").Cells(findmatch + 1, 7)
                                        Comment
                                        • b_rad_1983
                                          SBR High Roller
                                          • 01-07-13
                                          • 127

                                          #55
                                          Post #53

                                          =B2 & H2 will give you PHIBOS

                                          =B2 & " @ " & H2 will give you PHI @ BOS BUT,

                                          Can you still drag it down throughout all the games?






                                          Suggestion,
                                          Also how about linking the full team names to the abbreviation?
                                          -Have the Team list on a new sheet and have a new macro to run to clean up all the names?

                                          ANAHEIM
                                          BOSTON
                                          BUFFALO
                                          CALGARY
                                          CAROLINA
                                          CHICAGO
                                          COLORADO
                                          COLUMBUS
                                          DALLAS
                                          DETROIT
                                          EDMONTON
                                          FLORIDA
                                          LOS ANGELES
                                          MINNESOTA
                                          MONTRÉAL
                                          NASHVILLE
                                          NEW JERSEY
                                          NY ISLANDERS
                                          NY RANGERS
                                          OTTAWA
                                          PHILADELPHIA
                                          PHOENIX
                                          PITTSBURGH
                                          SAN JOSE
                                          ST. LOUIS
                                          TAMPA BAY
                                          TORONTO
                                          VANCOUVER
                                          WASHINGTON
                                          WINNIPEG
                                          Comment
                                          • burrissa
                                            SBR High Roller
                                            • 12-22-09
                                            • 113

                                            #56
                                            I've been keeping an eye on this thread but you two are past my coding abilities. I'm more in the area of VB for dummies. I managed to take the original script and modified it to scrape six pages on ESPN (Batting and Pitching stats) but now I'm lost. How do I get the info to update in the same cells after each update? I guess what I'm saying is for the batting stats ESPN ranks teams by runs scored. Boston may lead the league one day and Detroit the next which would change where the teams stats are on the table. Boston was on row one but would drop to row two according to rankings.
                                            Comment
                                            • b_rad_1983
                                              SBR High Roller
                                              • 01-07-13
                                              • 127

                                              #57
                                              I don't think I can help but I can try to trouble shoot with you until a4u2fear gets back on.

                                              If you run the code with the website it will get the newest stats.
                                              You may need to delete the data in your "sheet2" , then run the main code.

                                              Every time you run the code it should put the data in the same places. Make sure you do not have and formulas in the areas that data will be placed.

                                              I've noticed with excel you need to be very specific with every thing as one wrong move could mess your entire work space up. What I was doing before I actually deleted a line was comment it out with " ' " (line will turn green) then run the code to see what changed. From there you will know if you need it or not.
                                              Comment
                                              • a4u2fear
                                                SBR Hall of Famer
                                                • 01-29-10
                                                • 8147

                                                #58
                                                Hey guys I'll check it out tonight and let you know
                                                Comment
                                                • burrissa
                                                  SBR High Roller
                                                  • 12-22-09
                                                  • 113

                                                  #59
                                                  The data does import to the same cells on each update. I guess what I'm asking is if the data changes in the table on the site your scraping does it change in your spreadsheet? The data I'm scraping now is for the 2013 baseball season so the team rankings don't change.Once the 2014 season starts teams will move up or down in the rankings on the site I'm scraping. Will this cause the data imported to my spreadsheet to also move up or down? Say Boston is first in runs scored so the data will import to row A3. The next day Boston drops to second in the rankings so after I import the updated data will Bostons data show up in row A4 or stay in row A3? Or how do I make sure the correct teams data is being pulled from my stats spreadsheet to my calculations spreadsheet? Sorry if this is redundant. I work a 12 hr rotating shift so my responses may be sporadic. Thanks to all that help.
                                                  Comment
                                                  • b_rad_1983
                                                    SBR High Roller
                                                    • 01-07-13
                                                    • 127

                                                    #60
                                                    I believe as long as you run the code to update from the site, then your sheet should too.
                                                    I'm curious as to if it's possible to have it auto update. Maybe check once an hour and if it doesn't match then update it.
                                                    As the hockey example of the box scores, it would be nice to have them update once the game finishes. So all the data from past seasons and current seasons are all there to work with.
                                                    Comment
                                                    • b_rad_1983
                                                      SBR High Roller
                                                      • 01-07-13
                                                      • 127

                                                      #61
                                                      The way he explained it was we match a key word that doesn't repeat. From there we tell it the location of the data. As long as the site doesn't change format and the teams just move up and down everything should be linked and move. As long as you re run the main code.
                                                      Comment
                                                      • a4u2fear
                                                        SBR Hall of Famer
                                                        • 01-29-10
                                                        • 8147

                                                        #62
                                                        Burr, don't worry about being a newbie or "sounding dumb". No questions are. Most may have the same question or issue.

                                                        The code I've posted is to append games to complete a full list for a season, or every time the importer is run, add to an existing table.

                                                        It sounds like what you are trying to do is import a table every day and replace an existing one. That is not currently what the code I've posted does. If that is what you are trying to do, you can post the website you are trying to do this for and explain what you would like to do and we can work through it.
                                                        Comment
                                                        • a4u2fear
                                                          SBR Hall of Famer
                                                          • 01-29-10
                                                          • 8147

                                                          #63
                                                          b_rad, have you imported a full season successfully? Are you stuck anywhere?
                                                          Comment
                                                          • a4u2fear
                                                            SBR Hall of Famer
                                                            • 01-29-10
                                                            • 8147

                                                            #64
                                                            Originally posted by b_rad_1983
                                                            Post #53

                                                            =B2 & H2 will give you PHIBOS

                                                            =B2 & " @ " & H2 will give you PHI @ BOS BUT,

                                                            Can you still drag it down throughout all the games?






                                                            Suggestion,
                                                            Also how about linking the full team names to the abbreviation?
                                                            -Have the Team list on a new sheet and have a new macro to run to clean up all the names?

                                                            ANAHEIM
                                                            BOSTON
                                                            BUFFALO
                                                            CALGARY
                                                            CAROLINA
                                                            CHICAGO
                                                            COLORADO
                                                            COLUMBUS
                                                            DALLAS
                                                            DETROIT
                                                            EDMONTON
                                                            FLORIDA
                                                            LOS ANGELES
                                                            MINNESOTA
                                                            MONTRÉAL
                                                            NASHVILLE
                                                            NEW JERSEY
                                                            NY ISLANDERS
                                                            NY RANGERS
                                                            OTTAWA
                                                            PHILADELPHIA
                                                            PHOENIX
                                                            PITTSBURGH
                                                            SAN JOSE
                                                            ST. LOUIS
                                                            TAMPA BAY
                                                            TORONTO
                                                            VANCOUVER
                                                            WASHINGTON
                                                            WINNIPEG
                                                            My experience, do not add "@" or spaces if you do not have to. You know that this column PHIBOS, the first team displayed will always be the away team, making it more difficult is not necessary.

                                                            In regards to turning "PHI" into "Philadelphia" or "Philadelphia Flyers". Again, my experience has been keep it simple and do not make things more difficult than they need to be.

                                                            Let's say you import 2011 season, changed all short hand names (PHI) to long hand names (Philadelphia).

                                                            Then in a month or two you find time to import another season, and the names are now PHI creating a discrepancy, etc

                                                            What I've found, is I have my own list of shorthand names I use for teams/cities that I like. Once you go through a bunch of games you will have your own abbreviations you like to use.
                                                            Comment
                                                            • a4u2fear
                                                              SBR Hall of Famer
                                                              • 01-29-10
                                                              • 8147

                                                              #65
                                                              Let's say in my example, column N row 2 (1st NHL game of season), in cell N2 we typed =B2 & H2, this gave us PHIBOS. And you want to do the same for the next 100 or 1000 cells as well.

                                                              Click or double click on cell N2 with your mouse (either or doesn't matter), on the bottom right of the cell you will see a fat rectangle. If you double clikc this, the formula will populate through the rest of the rows you have populated. If it doesn't, you can click and hold it and scroll down to fill the rows.
                                                              Comment
                                                              • b_rad_1983
                                                                SBR High Roller
                                                                • 01-07-13
                                                                • 127

                                                                #66
                                                                I have successfully imported a full season. Thanks.
                                                                I'm not working on understanding the placement of where to grab the info from sheet2.
                                                                which I found to be sorta simple. **'away team stats Sheets("Sheet1").Cells(lastrowsheetone + 1, 2) = Sheets("Sheet2").Cells(findmatch + 1, 1)**

                                                                So, I have used that on my NHL standing sheet and was able to copy 4 cells over with what you taught us. I am however stuck with looping it on the same table thirty or so times.


                                                                Comment
                                                                • a4u2fear
                                                                  SBR Hall of Famer
                                                                  • 01-29-10
                                                                  • 8147

                                                                  #67
                                                                  b_rad, I don't get what you said. "Im not working on.." and what you are trying to do. Explain in more detail
                                                                  Comment
                                                                  • b_rad_1983
                                                                    SBR High Roller
                                                                    • 01-07-13
                                                                    • 127

                                                                    #68
                                                                    Sorry typo.
                                                                    "Now working on.."
                                                                    Comment
                                                                    • a4u2fear
                                                                      SBR Hall of Famer
                                                                      • 01-29-10
                                                                      • 8147

                                                                      #69
                                                                      post your code and explain what you are trying to do
                                                                      Comment
                                                                      • a4u2fear
                                                                        SBR Hall of Famer
                                                                        • 01-29-10
                                                                        • 8147

                                                                        #70
                                                                        I will try to work up an example to show vlookup tomorrow. It's difficult to teach this stuff, maybe there's a way I can start uploading a sample excel file and it will be easier to show
                                                                        Comment
                                                                        SBR Contests
                                                                        Collapse
                                                                        Top-Rated US Sportsbooks
                                                                        Collapse
                                                                        Working...