Holy Grail of HTML Scrapers in .NET

  • Flight
    Restricted User
    • 01-28-09
    • 1979

    #1
    Holy Grail of HTML Scrapers in .NET
    Just started using this little gem this week



    Why the hell have I been using regular expressions to parse HTML? The Html Agility Pack solves 90% of the work I have to do when building a web scraper. And it supports LINQ!
  • Pokerjoe
    SBR Wise Guy
    • 04-17-09
    • 704

    #2
    Very cool, Flight, thanks for this.
    • MonkeyF0cker
      SBR Posting Legend
      • 06-12-07
      • 12144

      #3
      I'm too set in my ways. I've been using Regex for far too long to switch now.

      Does it work with AJAX generated pages? Doesn't really say on the front page.

      That's really the biggest pain IMO.
      • Flight
        Restricted User
        • 01-28-09
        • 1979

        #4
        Seriously, this is hotness: in 4 lines I get all the links on a page:

        Code:
        // Html Agility Pack types (HtmlWeb, HtmlDocument); Select/ToList need System.Linq.
        using System.Linq;
        using HtmlAgilityPack;

        HtmlWeb htmlWeb = new HtmlWeb();
        string url = @"http://www.savebigbucks.ca/";
        HtmlDocument doc = htmlWeb.Load(url);  // fetches the page and parses it
        var links = doc.DocumentNode.Descendants("a").Select(x => x.OuterHtml).ToList();
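A small variant of the snippet above, sketched under the same HtmlAgilityPack API (the markup is inlined so it runs without a network call): GetAttributeValue pulls out just the href targets instead of the full tags.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinkDemo
{
    static void Main()
    {
        // Inline sample markup (invented for illustration).
        var doc = new HtmlDocument();
        doc.LoadHtml("<p><a href='/a.html'>A</a> <a href='/b.html'>B</a></p>");

        // GetAttributeValue returns the fallback ("") if the attribute is missing.
        var hrefs = doc.DocumentNode.Descendants("a")
                       .Select(a => a.GetAttributeValue("href", ""))
                       .ToList();

        foreach (var h in hrefs)
            Console.WriteLine(h);
    }
}
```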
        • Flight
          Restricted User
          • 01-28-09
          • 1979

          #5
          Monkey - It is for HTML, so if the Ajax returns HTML, yes! But if it's JSON then no. XML... maybe.

          Monkey I know you are a .NET guy as well. I have written tons of scrapers just like you using RegEx on HTML strings. This is a game changer for me.

          Embrace it, you're not that old.
          • MonkeyF0cker
            SBR Posting Legend
            • 06-12-07
            • 12144

            #6
            LOL. I'll give it a look.

            Thanks, buddy.
            • Salmon Steak
              SBR MVP
              • 03-05-10
              • 2110

              #7
               I usually just import data into Excel and do my work there. Is this stuff really much better? I don't know anything about LINQ, AJAX, or JSON. I just read a blog about it and feel lost. If it really is much better, where did you learn it?

              Oh, and go Irish. Can't believe they are not starting Rees this year.
              • Flight
                Restricted User
                • 01-28-09
                • 1979

                #8
                This stuff is better when you need to do more with your data. For example:
                1) Enter results into a SQL database
                2) Download tons of stuff (for example, I once built a porn site crawler that downloaded 500 videos, over 100 GB in total size; that's called putting your skills to work!)
                3) Automate web site interaction, i.e. HTTP GETs and POSTs, forms, URLs, cookies, etc.

                You don't need to know anything about LINQ, AJAX, or JSON. Those are advanced usages. In a simple web scraper, you just grab HTML from a site (say a boxscore for a baseball game from ESPN), parse up the document into things of interest (batter names, hits, runs, etc) and do something with your resulting dataset, usually enter it into a DB or dump it to a file for later analysis.
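That flow can be sketched end to end. The boxscore markup below is invented for illustration (real ESPN HTML would differ), but the shape is the same: load, select, turn rows into data, then hand the result to a DB or a file.

```csharp
using System;
using HtmlAgilityPack;

class BoxscoreDemo
{
    static void Main()
    {
        // Hypothetical boxscore fragment; a real scraper would fetch this with HtmlWeb.
        string html =
            "<table id='batting'>" +
            "<tr><td class='name'>Jeter</td><td class='hits'>2</td></tr>" +
            "<tr><td class='name'>Cano</td><td class='hits'>3</td></tr>" +
            "</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Parse each row into a (name, hits) pair.
        var rows = doc.DocumentNode.SelectNodes("//table[@id='batting']/tr");
        foreach (var row in rows)
        {
            string name = row.SelectSingleNode("td[@class='name']").InnerText;
            int hits = int.Parse(row.SelectSingleNode("td[@class='hits']").InnerText);
            Console.WriteLine(name + ": " + hits);
            // ...from here, insert into a DB or dump to a file for later analysis.
        }
    }
}
```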

                The hardest part of building a scraper is always groveling through the HTML file and figuring out how to pull sets of strings into data objects. Usually this involves building many, many regular expressions that each look something awful like:
                Code:
                @"\w\s+.*?\<\a href=\([\s\w\d]+?]\).*?>(\w)</a>"
                Seriously, that is not an exaggeration: something like that would get one player's name and number of hits, for example.
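For contrast, here is roughly that extraction done with the Agility Pack instead of a regex (markup invented for illustration); one readable XPath replaces the line noise:

```csharp
using System;
using HtmlAgilityPack;

class XPathVsRegex
{
    static void Main()
    {
        // Invented markup of the sort the regex above might target.
        var doc = new HtmlDocument();
        doc.LoadHtml("<td><a href='/players/42'>Smith</a> 3 hits</td>");

        // One readable XPath instead of an unreadable regex.
        var a = doc.DocumentNode.SelectSingleNode("//td/a");
        Console.WriteLine(a.InnerText + " -> " + a.GetAttributeValue("href", ""));
    }
}
```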

                This little plugin makes it so much easier. It's been around for years, I can't believe I've never used it.

                In terms of learning, you have to start somewhere and just keep doing it. In the other thread I recommended Python for beginners; it gets you up and going quickly (but if you've never programmed, I caution you that it is not easy).
                • Salmon Steak
                  SBR MVP
                  • 03-05-10
                  • 2110

                  #9
                  Cool, and thanks for the warning. I might look into python if I get more free time.
                  • oddsfellow
                    SBR Rookie
                    • 02-20-11
                    • 18

                    #10
                    Thanks, looks useful. What are the legal details with regard to scraping? For instance, can I scrape golf results from Golf Observer over a period of time, or will they block my IP once I trigger a data limit?
                    • Flight
                      Restricted User
                      • 01-28-09
                      • 1979

                      #11
                      In terms of limiting, it's up to the web site operator. If you build your tool so that it is polite and doesn't hammer their bandwidth, you will not have any problems, i.e. put delays between page requests, only request what you need, etc.
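The pacing idea looks something like this sketch (URLs hypothetical; the fetch is stubbed out so the example stands alone, where a real scraper would call WebClient.DownloadString or use HttpWebRequest):

```csharp
using System;
using System.Threading;

class PoliteCrawler
{
    const int DelayMs = 2000;  // pause between page requests

    static string Fetch(string url)
    {
        // Stub standing in for WebClient.DownloadString(url).
        return "<html>...</html>";
    }

    static void Main()
    {
        // Hypothetical list of pages to fetch.
        string[] urls = { "http://example.com/page1", "http://example.com/page2" };

        foreach (string url in urls)
        {
            string html = Fetch(url);   // only request what you need
            // ... parse html here ...
            Thread.Sleep(DelayMs);      // be nice: wait before the next request
        }
        Console.WriteLine("done");
    }
}
```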
                      • JoeVig
                        SBR Wise Guy
                        • 01-11-08
                        • 772

                        #12
                        I assume you can fill forms, login, etc with this fancy thing-a-majig? Some of my scraping is after login.
                        • Flight
                          Restricted User
                          • 01-28-09
                          • 1979

                          #13
                          JoeVig: filling forms and logging in is an HTTP thing. The tool that I am recommending is for HTML parsing.

                          If you are using .NET, the class System.Net.HttpWebRequest is what you would use for HTTP operations.
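A sketch of a form login with HttpWebRequest; the URL and field names are hypothetical, and the request is only built here, not sent (GetRequestStream/GetResponse would do the actual I/O):

```csharp
using System;
using System.Net;
using System.Text;

class LoginSketch
{
    static void Main()
    {
        // Cookies captured at login must be shared with later requests.
        var cookies = new CookieContainer();

        var request = (HttpWebRequest)WebRequest.Create("http://example.com/login");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        // Hypothetical form fields, encoded as a POST body.
        byte[] body = Encoding.UTF8.GetBytes("user=me&pass=secret");
        request.ContentLength = body.Length;

        // Writing the body and calling GetResponse would actually send it:
        //   using (var s = request.GetRequestStream()) s.Write(body, 0, body.Length);
        //   var response = (HttpWebResponse)request.GetResponse();
        // Later requests that reuse the same CookieContainer stay logged in.

        Console.WriteLine(request.Method + " " + request.RequestUri);
    }
}
```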
                          • JoeVig
                            SBR Wise Guy
                            • 01-11-08
                            • 772

                            #14
                            Flight - The code snippet you posted shows this tool getting its HTML by passing it a URL, so I assume the tool is doing its own HTTP GET from the target?

                            I'm using a WebBrowser object out of convenience right now, and either pulling HTML or parsing DOM depending on need.
                            • Flight
                              Restricted User
                              • 01-28-09
                              • 1979

                              #15
                              Oops, I forgot about my own snippet! Yeah, you're right, the HtmlWeb class can do its own GET! But to answer your original question, I doubt it can be used for posting.

                              WebBrowser does seem very convenient, and I saw a simple example using it for posting data as well. The only drawback I saw is that it belongs to System.Windows.Forms, which I assume excludes it from WPF Apps and Console Apps (the two application types I use).

                              For simple web crawler scraping only involving gets, I use the WebClient class. It is similar to WebBrowser in its convenience. But when I have to do posts and cookie management I always switch over to HttpWebRequest/HttpWebResponse.

                              Yay .NET developers on SBR forum!
                              • Flight
                                Restricted User
                                • 01-28-09
                                • 1979

                                #16
                                JoeVig, the HtmlAgilityPack also gives you a new HtmlDocument class. There is a class of the same name in the .NET framework, but the class that comes with HtmlAgilityPack has a lot more features and is more powerful. It supports XPath and LINQ for selecting HtmlNode elements.

                                Just wanted to point out its extended capabilities if you have already used HtmlDocument before.
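Both selection styles side by side, as a sketch against invented markup:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class SelectDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='scores'><span>5</span><span>7</span></div>");

        // XPath style:
        var byXpath = doc.DocumentNode.SelectNodes("//div[@id='scores']/span");

        // LINQ style, selecting the same nodes:
        var byLinq = doc.DocumentNode.Descendants("span")
                        .Where(n => n.Ancestors("div").Any(d => d.Id == "scores"));

        Console.WriteLine(byXpath.Count + " " + byLinq.Count());
    }
}
```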
                                • MonkeyF0cker
                                  SBR Posting Legend
                                  • 06-12-07
                                  • 12144

                                  #17
                                  Originally posted by Flight
                                  WebBrowser does seem very convenient, and I saw a simple example using it for posting data as well. The only drawback I saw is that it belongs to System.Windows.Forms, which I assume excludes it from WPF Apps and Console Apps (the two application types I use).
                                  Nothing for console apps, but there is a wrapper for the WebBrowser control in WPF (System.Windows.Controls.WebBrowser). There's also a Chromium wrapper (Awesomium) out there that supposedly plays better with WPF. I haven't used it, though.
                                  • uva3021
                                    SBR Wise Guy
                                    • 03-01-07
                                    • 537

                                    #18
                                    Beautiful Soup in Python
                                    • podonne
                                      SBR High Roller
                                      • 07-01-11
                                      • 104

                                      #19
                                      Agreed, HtmlAgilityPack was a huge leap forward for me. It handles poorly formatted HTML pages, LINQ, and especially XPath. In the past I was taking raw HTML, "fixing" it into something that fit the XML specification, then reading it with System.Xml.XmlDocument. Huge pain... eliminated.

                                      For interactivity (refreshing/AJAXy pages or filling forms), I use a web browser control to navigate to the page, periodically pull the HTML out into HtmlAgilityPack, figure out what needs to happen, then use the web browser to navigate to the next page where necessary. Works like a charm 99% of the time and (best part) looks just like a normal web user to the website admins...

                                      If you pull web pages directly using System.Net.HttpWebRequest, be sure to change the user agent to something harmless-sounding, like a known crawler. Pick something reasonably popular so the web admin recognizes it, but not so popular that it can be easily checked against your IP. Look here: (http://www.useragentstring.com/pages/Crawlerlist/)
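Setting the user agent is a single property on HttpWebRequest; the string below is illustrative only, not one taken from that list:

```csharp
using System;
using System.Net;

class UserAgentSketch
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");

        // Present yourself as a crawler (hypothetical UA string for illustration).
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleBot/1.0)";

        // request.GetResponse() would perform the GET with that header attached.
        Console.WriteLine(request.UserAgent);
    }
}
```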
                                      • Salmon Steak
                                        SBR MVP
                                        • 03-05-10
                                        • 2110

                                        #20
                                        Originally posted by Salmon Steak
                                        Can't believe they are not starting Rees this year.
                                        This, and...

                                        I got a book called Baseball Hacks. It mostly uses SQL. I am finding it dated and difficult (mostly because it is dated). I think I will put this to the side so I can focus on the upcoming basketball season. Not enough hours in the day.