1. #1
    dcat2
    dcat2's Avatar Become A Pro!
    Join Date: 01-24-12
    Posts: 9
    Betpoints: 61

    A question regarding data scraping

    Here is an example sheet: http://www.baseball-reference.com/te...-batting.shtml

    When scraping my data is coming out all vertically (all in one column). I'm using ruby. Basically my idea was that when it runs through the html and identifies a player link (since all player names include an href) to then tell the program to start a new row/column in the excel spreadsheet. I suppose my question regards the best way to alter my coding to format these numbers into something that resembles the display on the website.

    Current Script:

    require 'nokogiri'
    require 'open-uri'
    require 'csv'


    doc = Nokogiri::HTML(open('http://www.baseball-reference.com/teams/ARI/2011-batting.shtml'))


    stats = Array.new


    doc.css('tbody > tr > td').each do |stat|
    stats << stat.content
    end

    (0..stats.length - 1).each do |index|
    puts "stat: #{stats[index]**"
    puts ""
    end

    CSV.open("losol.csv", "wb") do |row|
    row << ["stat"]
    (0..stats.length - 1).each do |index|
    row << [stats[index]]
    end
    end

  2. #2
    MonkeyF0cker
    Update your status
    MonkeyF0cker's Avatar Become A Pro!
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Well, yeah...

    You're telling it to put each stat in a new row.

    Code:
    CSV.open("losol.csv", "wb") do |row|
     row << ["stat"]
     (0..stats.length - 1).each do |index|
     row << [stats[index]]
    You either need to create an array of arrays with each array containing the stats of each individual player (or line) or create a loop that iterates through the stats by the number of stats per player. In other words, something like this...

    Code:
    NUMBER_OF_STATS_PER_PLAYER = 18
    NUMBER_OF_PLAYERS = stats.length / NUMBER_OF_STATS_PER_PLAYER
    
    for i in (0..NUMBER_OF_PLAYERS - 1)
    a = i * NUMBER_OF_STATS_PER_PLAYER
    row << [stats[a]]  + [stats[a + 1]] + [stats[a + 2]]....  + [stats[a + NUMBER_OF_STATS_PER_PLAYER - 1]]
    end
    Why are you using Ruby if you're just parsing for Excel anyway?

  3. #3
    dcat2
    dcat2's Avatar Become A Pro!
    Join Date: 01-24-12
    Posts: 9
    Betpoints: 61

    thanks for the input, what would you suggest using if just parsing for excel?

  4. #4
    MonkeyF0cker
    Update your status
    MonkeyF0cker's Avatar Become A Pro!
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Why not just use the Web data import in Excel? It imports that page without needing to code a thing...

  5. #5
    thom321
    thom321's Avatar Become A Pro!
    Join Date: 06-17-11
    Posts: 112
    Betpoints: 4983

    web query with Excel

    Agree with Monkey. Look into Excel web queries and how the work. I also suggest recording a macro when adding a web query and the look at the code. That way you can also easily see the names of the table(s) used in the query.

    I have included a sample workbook where a web query is used to import a data table to a specified cell (named range). The query is used to get the data but only the values are stored.

    Have a look at the code and reverse engineer it to your liking.

    If you are wondering why local names are deleted after the query is run, it is because Excel stores the query range as a local name (so it knows where to add refreshed data if you are having live queries in the workbook). The way my code is written, the query is only used to import the data so we have no used for the named range. However, if you run the code multiple times, a local name will be added each time and after a while Excel starts to struggle if there are too many named ranges. So if is better to just delete them after each run. Obviously, if you are using local named ranges on purpose in your workbook, you would have to adjust the code to only remove the local names specifically created when using the query and not remove all local names in the sheet the way my code works in its current form.
    Attached Files

Top