Holy Grail of HTML Scrapers in .NET

  • Flight
    Restricted User
    • 01-28-09
    • 1979

    #1
    Holy Grail of HTML Scrapers in .NET
    Just started using this little gem this week



    Why the hell have I been using regular expressions to parse HTML? The Html Agility Pack solves 90% of the work I have to do when building a web scraper. And it supports LINQ!
  • Pokerjoe
    SBR Wise Guy
    • 04-17-09
    • 704

    #2
    Very cool, Flight, thanks for this.
    • MonkeyF0cker
      SBR Posting Legend
      • 06-12-07
      • 12144

      #3
      I'm too set in my ways. I've been using Regex for far too long to switch now.

      Does it work with AJAX generated pages? Doesn't really say on the front page.

      That's really the biggest pain IMO.
      • Flight
        Restricted User
        • 01-28-09
        • 1979

        #4
        Seriously, this is hotness: in 4 lines I get all the links on a page:

        Code:
        // Html Agility Pack types (HtmlWeb, HtmlDocument); Select/ToList need System.Linq.
        using System.Linq;
        using HtmlAgilityPack;

        HtmlWeb htmlWeb = new HtmlWeb();
        string url = @"http://www.savebigbucks.ca/";
        HtmlDocument doc = htmlWeb.Load(url);  // fetches the page and parses it
        var links = doc.DocumentNode.Descendants("a").Select(x => x.OuterHtml).ToList();
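A small variant of the snippet above, sketched under the same HtmlAgilityPack API (the markup is inlined so it runs without a network call): GetAttributeValue pulls out just the href targets instead of the full tags.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinkDemo
{
    static void Main()
    {
        // Inline sample markup (invented for illustration).
        var doc = new HtmlDocument();
        doc.LoadHtml("<p><a href='/a.html'>A</a> <a href='/b.html'>B</a></p>");

        // GetAttributeValue returns the fallback ("") if the attribute is missing.
        var hrefs = doc.DocumentNode.Descendants("a")
                       .Select(a => a.GetAttributeValue("href", ""))
                       .ToList();

        foreach (var h in hrefs)
            Console.WriteLine(h);
    }
}
```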
        • Flight
          Restricted User
          • 01-28-09
          • 1979

          #5
          Monkey - It is for HTML, so if the Ajax returns HTML, yes! But if it's JSON then no. XML... maybe.

          Monkey I know you are a .NET guy as well. I have written tons of scrapers just like you using RegEx on HTML strings. This is a game changer for me.

          Embrace it, you're not that old.
          • MonkeyF0cker
            SBR Posting Legend
            • 06-12-07
            • 12144

            #6
            LOL. I'll give it a look.

            Thanks, buddy.
            • Salmon Steak
              SBR MVP
              • 03-05-10
              • 2110

              #7
               I usually just import data into Excel and do my work there. Is this stuff really much better? I don't know anything about LINQ, AJAX, or JSON. I just read a blog about it and feel lost. If it really is much better, where did you learn it?

              Oh, and go Irish. Can't believe they are not starting Rees this year.
              • Flight
                Restricted User
                • 01-28-09
                • 1979

                #8
                This stuff is better when you need to do more with your data. For example:
                1) Enter results into a SQL database
                2) Download tons of stuff (for example, I once built a porn site crawler that downloaded 500 videos, over 100 GB in total size; that's called putting your skills to work!)
                3) Automate web site interaction, i.e. HTTP GETs and POSTs, forms, URLs, cookies, etc.

                You don't need to know anything about LINQ, AJAX, or JSON. Those are advanced usages. In a simple web scraper, you just grab HTML from a site (say a boxscore for a baseball game from ESPN), parse up the document into things of interest (batter names, hits, runs, etc) and do something with your resulting dataset, usually enter it into a DB or dump it to a file for later analysis.
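That flow can be sketched end to end. The boxscore markup below is invented for illustration (real ESPN HTML would differ), but the shape is the same: load, select, turn rows into data, then hand the result to a DB or a file.

```csharp
using System;
using HtmlAgilityPack;

class BoxscoreDemo
{
    static void Main()
    {
        // Hypothetical boxscore fragment; a real scraper would fetch this with HtmlWeb.
        string html =
            "<table id='batting'>" +
            "<tr><td class='name'>Jeter</td><td class='hits'>2</td></tr>" +
            "<tr><td class='name'>Cano</td><td class='hits'>3</td></tr>" +
            "</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Parse each row into a (name, hits) pair.
        var rows = doc.DocumentNode.SelectNodes("//table[@id='batting']/tr");
        foreach (var row in rows)
        {
            string name = row.SelectSingleNode("td[@class='name']").InnerText;
            int hits = int.Parse(row.SelectSingleNode("td[@class='hits']").InnerText);
            Console.WriteLine(name + ": " + hits);
            // ...from here, insert into a DB or dump to a file for later analysis.
        }
    }
}
```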

                The hardest part of building a scraper is always groveling through the HTML file and figuring out how to pull sets of strings into data objects. Usually this involves building many, many regular expressions that each look something awful like:
                Code:
                @"\w\s+.*?\<\a href=\([\s\w\d]+?]\).*?>(\w)</a>"
                Seriously, that is not an exaggeration: something like that would get one player's name and number of hits, for example.
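For contrast, here is roughly that extraction done with the Agility Pack instead of a regex (markup invented for illustration); one readable XPath replaces the line noise:

```csharp
using System;
using HtmlAgilityPack;

class XPathVsRegex
{
    static void Main()
    {
        // Invented markup of the sort the regex above might target.
        var doc = new HtmlDocument();
        doc.LoadHtml("<td><a href='/players/42'>Smith</a> 3 hits</td>");

        // One readable XPath instead of an unreadable regex.
        var a = doc.DocumentNode.SelectSingleNode("//td/a");
        Console.WriteLine(a.InnerText + " -> " + a.GetAttributeValue("href", ""));
    }
}
```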

                This little plugin makes it so much easier. It's been around for years, I can't believe I've never used it.

                In terms of learning, you have to start somewhere and just keep doing it. In the other thread I recommended Python for beginners; it gets you up and going quickly (but if you've never programmed, I caution you that it is not easy).
                • Salmon Steak
                  SBR MVP
                  • 03-05-10
                  • 2110

                  #9
                  Cool, and thanks for the warning. I might look into python if I get more free time.
                  • oddsfellow
                    SBR Rookie
                    • 02-20-11
                    • 18

                    #10
                    Thanks, looks useful. What are the legal details with regard to scraping? For instance, can I scrape golf results from Golf Observer over a period of time, or will they block my IP once I trigger a data limit?
                    • Flight
                      Restricted User
                      • 01-28-09
                      • 1979

                      #11
                      In terms of limiting, it's up to the web site operator. If you build your tool so that it is polite and doesn't hammer their bandwidth, you will not have any problems, i.e. put delays between page requests, only request what you need, etc.
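The pacing idea looks something like this sketch (URLs hypothetical; the fetch is stubbed out so the example stands alone, where a real scraper would call WebClient.DownloadString or use HttpWebRequest):

```csharp
using System;
using System.Threading;

class PoliteCrawler
{
    const int DelayMs = 2000;  // pause between page requests

    static string Fetch(string url)
    {
        // Stub standing in for WebClient.DownloadString(url).
        return "<html>...</html>";
    }

    static void Main()
    {
        // Hypothetical list of pages to fetch.
        string[] urls = { "http://example.com/page1", "http://example.com/page2" };

        foreach (string url in urls)
        {
            string html = Fetch(url);   // only request what you need
            // ... parse html here ...
            Thread.Sleep(DelayMs);      // be nice: wait before the next request
        }
        Console.WriteLine("done");
    }
}
```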
                      • JoeVig
                        SBR Wise Guy
                        • 01-11-08
                        • 772

                        #12
                        I assume you can fill forms, login, etc with this fancy thing-a-majig? Some of my scraping is after login.
                        • Flight
                          Restricted User
                          • 01-28-09
                          • 1979

                          #13
                          JoeVig: filling forms and logging in is an HTTP thing. The tool that I am recommending is for HTML parsing.

                          If you are using .NET, the class System.Net.HttpWebRequest is what you would use for HTTP operations.
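A sketch of a form login with HttpWebRequest; the URL and field names are hypothetical, and the request is only built here, not sent (GetRequestStream/GetResponse would do the actual I/O):

```csharp
using System;
using System.Net;
using System.Text;

class LoginSketch
{
    static void Main()
    {
        // Cookies captured at login must be shared with later requests.
        var cookies = new CookieContainer();

        var request = (HttpWebRequest)WebRequest.Create("http://example.com/login");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;

        // Hypothetical form fields, encoded as a POST body.
        byte[] body = Encoding.UTF8.GetBytes("user=me&pass=secret");
        request.ContentLength = body.Length;

        // Writing the body and calling GetResponse would actually send it:
        //   using (var s = request.GetRequestStream()) s.Write(body, 0, body.Length);
        //   var response = (HttpWebResponse)request.GetResponse();
        // Later requests that reuse the same CookieContainer stay logged in.

        Console.WriteLine(request.Method + " " + request.RequestUri);
    }
}
```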
                          • JoeVig
                            SBR Wise Guy
                            • 01-11-08
                            • 772

                            #14
                            Flight - The code snippet you posted shows this tool getting its HTML by passing it a URL, so I assume the tool is doing its own HTTP GET from the target?

                            I'm using a WebBrowser object out of convenience right now, and either pulling HTML or parsing DOM depending on need.
                            • Flight
                              Restricted User
                              • 01-28-09
                              • 1979

                              #15
                              Oops, I forgot about my own snippet! Yeah, you're right, the HtmlWeb class can do its own GET! But to answer your original question, I doubt it can be used for posting.

                              WebBrowser does seem very convenient, and I saw a simple example using it for posting data as well. The only drawback I saw is that it belongs to System.Windows.Forms, which I assume excludes it from WPF Apps and Console Apps (the two application types I use).

                              For simple web crawler scraping only involving gets, I use the WebClient class. It is similar to WebBrowser in its convenience. But when I have to do posts and cookie management I always switch over to HttpWebRequest/HttpWebResponse.

                              Yay .NET developers on SBR forum!
                              • Flight
                                Restricted User
                                • 01-28-09
                                • 1979

                                #16
                                JoeVig, the HtmlAgilityPack also gives you a new HtmlDocument class. There is a class of the same name in the .NET framework, but the class that comes with HtmlAgilityPack has a lot more features and is more powerful. It supports XPath and LINQ for selecting HtmlNode elements.

                                Just wanted to point out its extended capabilities if you have already used HtmlDocument before.
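Both selection styles side by side, as a sketch against invented markup:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class SelectDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='scores'><span>5</span><span>7</span></div>");

        // XPath style:
        var byXpath = doc.DocumentNode.SelectNodes("//div[@id='scores']/span");

        // LINQ style, selecting the same nodes:
        var byLinq = doc.DocumentNode.Descendants("span")
                        .Where(n => n.Ancestors("div").Any(d => d.Id == "scores"));

        Console.WriteLine(byXpath.Count + " " + byLinq.Count());
    }
}
```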
                                • MonkeyF0cker
                                  SBR Posting Legend
                                  • 06-12-07
                                  • 12144

                                  #17
                                  Originally posted by Flight
                                  WebBrowser does seem very convenient, and I saw a simple example using it for posting data as well. The only drawback I saw is that it belongs to System.Windows.Forms, which I assume excludes it from WPF Apps and Console Apps (the two application types I use).
                                  Nothing for console apps, but there is a wrapper for the WebBrowser control in WPF (System.Windows.Controls.WebBrowser). There's also a Chromium wrapper (Awesomium) out there that supposedly plays better with WPF. I haven't used it, though.
                                  • uva3021
                                    SBR Wise Guy
                                    • 03-01-07
                                    • 537

                                    #18
                                    Beautiful Soup in Python
                                    • podonne
                                      SBR High Roller
                                      • 07-01-11
                                      • 104

                                      #19
                                      Agreed, HtmlAgilityPack was a huge leap forward for me. It handles poorly formatted HTML pages, LINQ, and especially XPath. In the past I was taking raw HTML, "fixing" it into something that fit the XML specification, then reading it with System.Xml.XmlDocument. Huge pain... eliminated.

                                      For interactivity (refreshing/AJAXy pages or filling forms), I use a web browser control to navigate to the page, periodically pull the HTML out into HtmlAgilityPack, figure out what needs to happen, then use the web browser to navigate to the next page where necessary. Works like a charm 99% of the time and (best part) looks just like a normal web user to the website admins...

                                      If you pull web pages directly using System.Net.HttpWebRequest, be sure to change the user agent to something harmless-sounding, like a known crawler. Pick something reasonably popular so the web admin recognizes it, but not so popular that it can be easily checked against your IP. Look here: (http://www.useragentstring.com/pages/Crawlerlist/)
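Setting the user agent is a single property on HttpWebRequest; the string below is illustrative only, not one taken from that list:

```csharp
using System;
using System.Net;

class UserAgentSketch
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");

        // Present yourself as a crawler (hypothetical UA string for illustration).
        request.UserAgent = "Mozilla/5.0 (compatible; ExampleBot/1.0)";

        // request.GetResponse() would perform the GET with that header attached.
        Console.WriteLine(request.UserAgent);
    }
}
```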
                                      • Salmon Steak
                                        SBR MVP
                                        • 03-05-10
                                        • 2110

                                        #20
                                        Originally posted by Salmon Steak
                                        Can't believe they are not starting Rees this year.
                                        This, and...

                                        I got a book called Baseball Hacks. It mostly uses SQL. I am finding it dated and difficult (mostly because it is dated). I think I will put this to the side so I can focus on the upcoming basketball season. Not enough hours in the day.