Scraping using Windmill?

**Indecent** · 08-07-10, 11:10 AM

Originally posted by bztips

Does anyone in the tank know anything about using Windmill for scraping?

Normally I use the mechanize library (for Python) to get the HTML for most of my scraping, which works great as long as the content is coded directly in the page. But some sites (like mlb.com) use a lot of JavaScript that generates the content only after you open the page, and mechanize won't work in such cases.

Which page specifically are you trying to scrape? It looks like most of the JavaScript code on mlb.com is not obfuscated (the code is easy to read and understand), so it could potentially be easy to do this with the tools you are already using.

**bztips** · 08-07-10, 12:15 PM

For example, the injury page: http://mlb.mlb.com/mlb/fantasy/wsfb/news/injuries.jsp

If you call this up with your favorite browser and look at the html, it all seems to be there.

But if I try to grab it with mechanize, none of it's there because of the javascript.
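To illustrate what is going on here, below is a minimal sketch using only the Python standard library. The page source, the `injuries` id, and the helper names are all made up for the example (this is not mlb.com's actual markup). It shows that a parser which only reads the raw HTML sees an empty container, because the text would only appear after a browser runs the script.

```python
from html.parser import HTMLParser

# Hypothetical page source: the container is empty until a script fills it in.
RAW_HTML = """
<html><body>
<div id="injuries"></div>
<script>
  document.getElementById("injuries").innerHTML = "<p>Player A: day-to-day</p>";
</script>
</body></html>
"""


class DivText(HTMLParser):
    """Collect the text found inside <div id="injuries">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "injuries":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script bodies are never executed; they are just text outside the div.
        if self.inside:
            self.text.append(data.strip())


def static_text(html):
    """Return the text a non-JavaScript client would find in the container."""
    parser = DivText()
    parser.feed(html)
    return "".join(parser.text)


if __name__ == "__main__":
    print(repr(static_text(RAW_HTML)))
```

A browser (or a browser-driving tool such as Windmill) would run the script and show the injury text, which is why the two views of the same page differ.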

**Indecent** · 08-07-10, 01:48 PM

Gotcha.

Try this tutorial; it seems to cover a lot of what you might need.

**bztips** · 08-07-10, 03:06 PM

Yes, I've seen that already; that's what got me interested in Windmill in the first place. As I mentioned in my OP, the problem is that I don't know how to get Windmill instantiated inside of Python. I also found this in Google Groups, which claims to directly answer my question, but I can't get it to work.

(I suppose I could post a follow-up to that in Google Groups, but I'm really not a programming whiz, so it's likely I wouldn't understand the answer anyway.) Just looking for someone who might happen to know how to start up Windmill inside of Python.

**Indecent** · 08-08-10, 09:40 AM

I got it up and running. Which part did you have problems with?

**bztips** · 08-08-10, 11:34 AM

The following command:

setup_module(sys.modules[__name__])

throws a traceback: NameError: name 'sys' is not defined

FYI, I've gone ahead and posted my problem to the Google Group I mentioned earlier; maybe I'll get a response there.
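For anyone hitting the same NameError: in Python, that message means the name `sys` was used before the module was imported, so the likely fix is an `import sys` line near the top of the script. The sketch below shows only that part. The `setup_module` here is a hypothetical stand-in so the snippet runs by itself; it is not Windmill's function, and wherever the real one gets imported from is left out.

```python
import sys  # without this line, sys.modules[...] raises NameError


def setup_module(module):
    # Hypothetical stand-in for the real setup function: it just reports
    # which module object it was handed.
    return module.__name__


# sys.modules maps module names to loaded module objects, so this looks up
# the module that is currently running.
current = sys.modules[__name__]
name = setup_module(current)

if __name__ == "__main__":
    print(name)
```

The only change that matters for the original command is the first line; the rest is there to show that `sys.modules[__name__]` resolves to a module object once `sys` is imported.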