1. #1
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    Holy Grail of HTML Scrapers in .NET

    Just started using this little gem this week

    http://htmlagilitypack.codeplex.com/

    Why the hell have I been using regular expressions to parse HTML? The Html Agility Pack does 90% of the work I have to do when building a web scraper. And it supports LINQ!
    Points Awarded:

    TomG gave Flight 10 SBR Point(s) for this post.


  2. #2
    Pokerjoe
    Join Date: 04-17-09
    Posts: 704
    Betpoints: 307

    Very cool, Flight, thanks for this.

  3. #3
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    I'm too set in my ways. I've been using Regex for far too long to switch now.

    Does it work with AJAX-generated pages? It doesn't really say on the front page.

    That's really the biggest pain IMO.

  4. #4
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    Seriously, this is hotness. In 4 lines I get all the links on a page:

    Code:
    // Requires: using System.Linq; using HtmlAgilityPack;
    HtmlWeb htmlWeb = new HtmlWeb();
    string url = @"http://www.savebigbucks.ca/";
    HtmlDocument doc = htmlWeb.Load(url); // fetches the page and parses it
    var links = doc.DocumentNode.Descendants("a") // every <a> element in the document
                   .Select(x => x.OuterHtml)
                   .ToList();

  5. #5
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    Monkey - It is for HTML, so if the Ajax returns HTML, yes! But if it's JSON then no. XML... maybe.
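
    If the response you get back from the AJAX call is HTML, you can hand the string straight to the parser instead of a URL. A minimal sketch; the endpoint URL here is made up for illustration:
    Code:
    // Parse an HTML fragment fetched from a (hypothetical) AJAX endpoint.
    using System;
    using System.Net;
    using HtmlAgilityPack;

    class AjaxFragmentDemo
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                // Fetch the raw fragment ourselves...
                string fragment = client.DownloadString("http://www.savebigbucks.ca/scores/ajax");

                // ...then parse the string directly; LoadHtml never touches the network.
                var doc = new HtmlDocument();
                doc.LoadHtml(fragment);

                foreach (var row in doc.DocumentNode.Descendants("tr"))
                    Console.WriteLine(row.InnerText.Trim());
            }
        }
    }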

    Monkey, I know you are a .NET guy as well. Like you, I have written tons of scrapers using RegEx on HTML strings. This is a game changer for me.

    Embrace it, you're not that old.

  6. #6
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    LOL. I'll give it a look.

    Thanks, buddy.

  7. #7
    Salmon Steak
    Join Date: 03-05-10
    Posts: 2,110
    Betpoints: 613

    I usually just import data into Excel and do my work there. Is this stuff really much better? I don't know anything about LINQ, AJAX, or JSON. I just read a blog about it and feel lost. If it really is much better, where did you learn it?

    Oh, and go Irish. Can't believe they are not starting Rees this year.

  8. #8
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    This stuff is better when you need to do more with your data. For example:
    1) Enter results into a SQL database
    2) Download tons of stuff (for example, I once built a porn site crawler that downloaded 500 videos totaling over 100 GB; that's called putting your skills to work!)
    3) Automate web site interaction, i.e. HTTP GETs and POSTs, forms, URLs, cookies, etc.

    You don't need to know anything about LINQ, AJAX, or JSON. Those are advanced usages. In a simple web scraper, you just grab HTML from a site (say a boxscore for a baseball game from ESPN), parse the document into the things of interest (batter names, hits, runs, etc.), and do something with your resulting dataset, usually enter it into a DB or dump it to a file for later analysis.
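
    To make that concrete, here is a rough sketch of that grab-parse-dump loop. The URL and the row layout are hypothetical; a real boxscore page needs its actual markup:
    Code:
    // Rough sketch of the grab -> parse -> dump pipeline described above.
    using System.IO;
    using System.Linq;
    using HtmlAgilityPack;

    class BoxscoreScraper
    {
        static void Main()
        {
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://www.savebigbucks.ca/boxscore"); // hypothetical page

            // Pull each batter row into a simple (name, hits) pair.
            var batters = doc.DocumentNode
                .Descendants("tr")
                .Where(tr => tr.Descendants("td").Count() >= 2)
                .Select(tr => new
                {
                    Name = tr.Descendants("td").First().InnerText.Trim(),
                    Hits = tr.Descendants("td").Skip(1).First().InnerText.Trim()
                });

            // Dump to a flat file for later analysis (or INSERT into a DB instead).
            File.WriteAllLines("batters.csv", batters.Select(b => b.Name + "," + b.Hits));
        }
    }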

    The hardest part of building a scraper is always groveling through the HTML file and figuring out how to pull sets of strings into data objects. Usually this involves building many, many regular expressions that each look something awful like:
    Code:
    @"\w\s+.*?\<\a href=\([\s\w\d]+?]\).*?>(\w)</a>"
    Seriously, that is not an exaggeration - something like that would get one player's name and number of hits, for example.
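
    For comparison, roughly the same extraction with the Agility Pack. Just a sketch: the XPath assumes a hypothetical <table id="batting"> where each row holds a player link and a hits column:
    Code:
    // Same extraction via XPath instead of regex (markup is hypothetical).
    using System;
    using HtmlAgilityPack;

    class XPathDemo
    {
        static void Main()
        {
            var doc = new HtmlWeb().Load("http://www.savebigbucks.ca/boxscore"); // hypothetical page

            // SelectNodes returns null when nothing matches, so guard it.
            var rows = doc.DocumentNode.SelectNodes("//table[@id='batting']//tr[td]");
            if (rows == null) return;

            foreach (HtmlNode row in rows)
            {
                string name = row.SelectSingleNode(".//a").InnerText;  // player link
                string hits = row.SelectSingleNode("td[2]").InnerText; // hits column
                Console.WriteLine(name + ": " + hits + " hits");
            }
        }
    }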

    This little plugin makes it so much easier. It's been around for years; I can't believe I've never used it.

    In terms of learning, you have to start somewhere and just keep doing it. In the other thread I recommended Python for beginners; it gets you up and going quickly (but if you've never programmed, I caution you that it is not easy).
    Last edited by Flight; 08-25-11 at 08:53 PM.
    Points Awarded:

    Salmon Steak gave Flight 10 SBR Point(s) for this post.


  9. #9
    Salmon Steak
    Join Date: 03-05-10
    Posts: 2,110
    Betpoints: 613

    Cool, and thanks for the warning. I might look into Python if I get more free time.

  10. #10
    oddsfellow
    Join Date: 02-20-11
    Posts: 18
    Betpoints: 11752

    Thanks, looks useful. What are the legal details with regard to scraping? For instance, can I scrape golf results from Golf Observer over a period of time, or will they block my IP once I trigger a data limit?

  11. #11
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    In terms of limiting, it's up to the web site operator. If you build your tool so that it plays nice and doesn't hammer their bandwidth, you will not have any problems, i.e. put delays between page requests, only request what you need, etc.
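
    Something like this is what I mean by nice; the 2-second delay and the page list are arbitrary examples:
    Code:
    // Minimal sketch of a polite crawl loop: a fixed pause between requests,
    // and only the pages actually needed.
    using System.Net;
    using System.Threading;

    class PoliteCrawler
    {
        static void Main()
        {
            string[] urls =
            {
                "http://www.savebigbucks.ca/page1", // hypothetical pages
                "http://www.savebigbucks.ca/page2",
            };

            using (var client = new WebClient())
            {
                foreach (string url in urls)
                {
                    string html = client.DownloadString(url);
                    // ... hand html off to the Html Agility Pack here ...
                    Thread.Sleep(2000); // be nice: pause between page requests
                }
            }
        }
    }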

  12. #12
    JoeVig
    Join Date: 01-11-08
    Posts: 772
    Betpoints: 37

    I assume you can fill forms, log in, etc. with this fancy thing-a-majig? Some of my scraping happens after login.

  13. #13
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    JoeVig: filling forms and logging in is an HTTP thing. The tool I am recommending is for HTML and parsing.

    If you are using .NET, the class System.Net.HttpWebRequest is what you would use for HTTP operations.
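
    A bare-bones login POST with HttpWebRequest looks roughly like this. The URL and form field names are made up; check the real form's markup:
    Code:
    // Sketch of a form login over raw HTTP (hypothetical URL and field names).
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class LoginDemo
    {
        static void Main()
        {
            byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");

            var request = (HttpWebRequest)WebRequest.Create("http://www.savebigbucks.ca/login");
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.ContentLength = body.Length;
            request.CookieContainer = new CookieContainer(); // keeps the session cookie for later requests

            using (Stream s = request.GetRequestStream())
                s.Write(body, 0, body.Length);

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd(); // the logged-in page, ready for parsing
                Console.WriteLine(html.Length + " bytes");
            }
        }
    }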

  14. #14
    JoeVig
    Join Date: 01-11-08
    Posts: 772
    Betpoints: 37

    Flight - The code snippet you posted shows this tool getting its HTML by passing it a URL, so I assume the tool is doing its own HTTP GET against the target?

    I'm using a WebBrowser object out of convenience right now, and either pulling HTML or parsing DOM depending on need.

  15. #15
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    Oops, I forgot about my own snippet! Yeah, you're right, the HtmlWeb class can do its own GET! But to answer your original question, I doubt it can be used for posting.

    WebBrowser does seem very convenient, and I saw a simple example using it for posting data as well. The only drawback I saw is that it belongs to System.Windows.Forms, which I assume excludes it from WPF Apps and Console Apps (the two application types I use).

    For simple web crawler scraping that only involves GETs, I use the WebClient class. It is similar to WebBrowser in its convenience. But when I have to do POSTs and cookie management, I always switch over to HttpWebRequest/HttpWebResponse.
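
    For the GET-only case it really is just a couple of lines (same site as my earlier snippet):
    Code:
    // WebClient for a simple GET-only scrape: one call, then hand off to the parser.
    using System.Net;
    using HtmlAgilityPack;

    class WebClientDemo
    {
        static void Main()
        {
            string html;
            using (var client = new WebClient())
                html = client.DownloadString("http://www.savebigbucks.ca/");

            var doc = new HtmlDocument();
            doc.LoadHtml(html); // from here it's normal Agility Pack parsing

            // For POSTs and cookie management, switch to HttpWebRequest/HttpWebResponse.
        }
    }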

    Yay .NET developers on SBR forum!

  16. #16
    Flight
    Join Date: 01-27-09
    Posts: 1,979

    JoeVig, the HtmlAgilityPack also gives you a new HtmlDocument class. There is a class of the same name in the .NET Framework, but the one that comes with HtmlAgilityPack has a lot more features and is more powerful. It supports XPath and LINQ for selecting HtmlNode elements.

    Just wanted to point out its extended capabilities if you have already used HtmlDocument before.
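
    For example, the same selection expressed both ways; the page and the class name are hypothetical:
    Code:
    // One query, two styles, on the Agility Pack's HtmlDocument.
    using System.Linq;
    using HtmlAgilityPack;

    class SelectDemo
    {
        static void Main()
        {
            HtmlDocument doc = new HtmlWeb().Load("http://www.savebigbucks.ca/"); // hypothetical page

            // XPath style (returns null if nothing matches):
            var byXPath = doc.DocumentNode.SelectNodes("//div[@class='score']");

            // LINQ style:
            var byLinq = doc.DocumentNode
                .Descendants("div")
                .Where(d => d.GetAttributeValue("class", "") == "score")
                .ToList();
        }
    }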

  17. #17
    MonkeyF0cker
    Join Date: 06-12-07
    Posts: 12,144
    Betpoints: 1127

    Quote Originally Posted by Flight View Post
    WebBrowser does seem very convenient, and I saw a simple example using it for posting data as well. The only drawback I saw is that it belongs to System.Windows.Forms, which I assume excludes it from WPF Apps and Console Apps (the two application types I use).
    Nothing for console apps, but there is a wrapper for the WebBrowser control in WPF (System.Windows.Controls.WebBrowser). There's also a Chromium wrapper (Awesomium) out there that supposedly plays better with WPF. I haven't used it, though.
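
    A minimal sketch of the WPF wrapper in use, assuming a plain WPF project with the usual references:
    Code:
    // Host the WPF WebBrowser wrapper in a bare window and navigate somewhere.
    using System;
    using System.Windows;
    using System.Windows.Controls;

    class BrowserDemo
    {
        [STAThread]
        static void Main()
        {
            var browser = new WebBrowser();
            browser.LoadCompleted += (s, e) =>
            {
                // Fires when the page finishes loading; e.Uri is the page address.
                Console.WriteLine("Loaded: " + e.Uri);
            };
            browser.Navigate(new Uri("http://www.savebigbucks.ca/"));

            var window = new Window { Content = browser, Title = "WPF WebBrowser" };
            new Application().Run(window);
        }
    }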

  18. #18
    uva3021
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 381

    Beautiful Soup in Python

  19. #19
    podonne
    Join Date: 07-01-11
    Posts: 104

    Agreed, HTMLAgilityPack was a huge leap forward for me. It handles poorly formatted HTML pages and supports LINQ and especially XPath. In the past I was taking raw HTML, "fixing" it into something that fit the XML specification, then reading it with System.Xml.XmlDocument. Huge pain... eliminated.

    For interactivity (refreshing/AJAX-y pages or filling forms), I use a web browser control to navigate to the page, periodically pull the HTML out into HTMLAgilityPack, figure out what needs to happen, then use the web browser to navigate to the next page, where necessary. Works like a charm 99% of the time and (best part) looks just like a normal web user to the website admins...

    If you pull web pages directly using System.Net.HttpWebRequest, be sure to change the user agent to something harmless-sounding, like a known crawler. Pick something reasonably popular so the web admin recognizes it, but not so popular that it can easily be checked against your IP. Look here: http://www.useragentstring.com/pages/Crawlerlist/
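
    Something along these lines; the UA string below is just one example pulled from that list:
    Code:
    // Sketch: present a crawler-style user agent on a direct HttpWebRequest.
    using System;
    using System.IO;
    using System.Net;

    class UserAgentDemo
    {
        static void Main()
        {
            var request = (HttpWebRequest)WebRequest.Create("http://www.savebigbucks.ca/"); // hypothetical target
            request.UserAgent = "Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)";

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                string html = reader.ReadToEnd(); // same page, friendlier-looking request
                Console.WriteLine(html.Length + " bytes");
            }
        }
    }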

  20. #20
    Salmon Steak
    Join Date: 03-05-10
    Posts: 2,110
    Betpoints: 613

    Quote Originally Posted by Salmon Steak View Post
    Can't believe they are not starting Rees this year.
    This, and...

    I got a book called Baseball Hacks. It mostly uses SQL. I am finding it dated and difficult (mostly b/c it is dated). I think I will put this to the side so I can focus on the upcoming basketball season. Not enough hours in the day.
