1. #1
    bztips
    Join Date: 06-03-10
    Posts: 283

    Scraping using Windmill?

    Does anyone in the tank know anything about using Windmill for scraping?

    Normally I use the mechanize library (for Python) to fetch the HTML for most of my scraping, which works great as long as the content is actually in the page source. But some sites (like mlb.com) use a lot of JavaScript that generates the content only after the page loads, and mechanize won't work in such cases.

    I've read several external discussions about Windmill, which is a testing library that allows you to interact with a real web browser. I'm using Windows, and have installed it successfully as a Python site-package following the directions here.

    I've also come across a couple of good write-ups on using it for scraping, but I can't figure out how to actually fire it up, either from the Python command line or from the Python GUI (IDLE).

    [For those in the know, I want to be able to get to the point where I can issue a command such as: client = WindmillTestClient()]

    Any help appreciated.

  2. #2
    Indecent
    Join Date: 09-08-09
    Posts: 758
    Betpoints: 1156

    Originally posted by bztips:
    Does anyone in the tank know anything about using Windmill for scraping?

    Normally I use the mechanize library (for Python) to fetch the HTML for most of my scraping, which works great as long as the content is actually in the page source. But some sites (like mlb.com) use a lot of JavaScript that generates the content only after the page loads, and mechanize won't work in such cases.

    Which page specifically are you trying to scrape? It looks like most of the JavaScript code on mlb.com is not obfuscated (the code is easy to read and understand), so it could potentially be easy to do this with the tools you are already using.

  3. #3
    bztips
    Join Date: 06-03-10
    Posts: 283

    Originally posted by Indecent:
    Which page specifically are you trying to scrape? It looks like most of the JavaScript code on mlb.com is not obfuscated (the code is easy to read and understand), so it could potentially be easy to do this with the tools you are already using.

    For example, the injury page: http://mlb.mlb.com/mlb/fantasy/wsfb/news/injuries.jsp

    If you call this up with your favorite browser and look at the html, it all seems to be there.

    But if I try to grab it with mechanize, none of it's there because of the javascript.
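
    To illustrate the symptom described above, here is a minimal sketch using only the standard library. The page source, the "injuries" id, and the "Player A" text are made-up placeholders, not the real mlb.com markup. It shows why a client with no JavaScript engine sees an empty container: the text only exists inside a script that never runs.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a client WITHOUT a JavaScript engine receives it.
# The id "injuries" and the "Player A" text are placeholders for illustration.
STATIC_HTML = """
<html><body>
<div id="injuries"></div>
<script>
document.getElementById("injuries").innerHTML = "<p>Player A: day-to-day</p>";
</script>
</body></html>
"""


class TextCollector(HTMLParser):
    """Collects visible text, skipping anything inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text; they are code, not page content.
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the list of non-empty text chunks found outside <script> tags."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


print(visible_text(STATIC_HTML))  # nothing visible until the script actually runs
```

    A real browser (which is what Windmill drives) executes the script and fills in the container, which would explain why the HTML looks complete there.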

  4. #4
    Indecent
    Join Date: 09-08-09
    Posts: 758
    Betpoints: 1156

    Originally posted by bztips:
    For example, the injury page: http://mlb.mlb.com/mlb/fantasy/wsfb/news/injuries.jsp

    If you call this up with your favorite browser and look at the html, it all seems to be there.

    But if I try to grab it with mechanize, none of it's there because of the javascript.

    Gotcha.

    Try this tutorial; it seems to cover a lot of what you might need.

  5. #5
    bztips
    Join Date: 06-03-10
    Posts: 283

    Originally posted by Indecent:
    Gotcha.

    Try this tutorial; it seems to cover a lot of what you might need.

    Yes, I've seen that already; it's what got me interested in Windmill in the first place. As I mentioned in my OP, the problem is that I don't know how to get Windmill instantiated inside of Python. I also found this in Google Groups, which claims to directly answer my question, but I can't get it to work.

    (I suppose I could post a follow-up to that in GG, but I'm really not a programming whiz, so it's likely I wouldn't understand the answer anyway.) I'm just looking for someone who might happen to know how to start up Windmill inside of Python.

  6. #6
    Indecent
    Join Date: 09-08-09
    Posts: 758
    Betpoints: 1156

    Originally posted by bztips:
    Yes, I've seen that already; it's what got me interested in Windmill in the first place. As I mentioned in my OP, the problem is that I don't know how to get Windmill instantiated inside of Python. I also found this in Google Groups, which claims to directly answer my question, but I can't get it to work.

    (I suppose I could post a follow-up to that in GG, but I'm really not a programming whiz, so it's likely I wouldn't understand the answer anyway.) I'm just looking for someone who might happen to know how to start up Windmill inside of Python.

    I got it up and running. Which part did you have problems with?

  7. #7
    bztips
    Join Date: 06-03-10
    Posts: 283

    Originally posted by Indecent:
    I got it up and running. Which part did you have problems with?

    The following command:

    setup_module(sys.modules[__name__])

    throws a traceback ending in NameError: name 'sys' is not defined

    FYI, I've gone ahead and posted my problem to the Google Group I mentioned earlier, so maybe I'll get a response there.
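
    On the NameError itself: in Python that message generally means the name was never imported in the current session, so the likely fix is to run `import sys` before the setup_module line. The sketch below reproduces the error without needing Windmill installed. The setup_module here is a hypothetical stand-in that only echoes its argument, and run_snippet is a helper made up for this illustration; neither is Windmill's API.

```python
# The failing line from the post, preceded by a stand-in for setup_module.
# NOTE: this setup_module is a hypothetical placeholder, NOT Windmill's function.
WITHOUT_IMPORT = """
def setup_module(module):
    return module

setup_module(sys.modules[__name__])
"""

# The same code with the missing import added at the top.
WITH_IMPORT = "import sys\n" + WITHOUT_IMPORT


def run_snippet(source):
    """Run source in a fresh namespace (like a new interpreter session)
    and report whether it raised NameError."""
    namespace = {"__name__": "__main__"}
    try:
        exec(source, namespace)
    except NameError as exc:
        return "NameError: %s" % exc
    return "ok"


print(run_snippet(WITHOUT_IMPORT))  # reports the NameError about 'sys'
print(run_snippet(WITH_IMPORT))     # ok
```

    In a real session the stand-in would be replaced by whatever import the tutorial uses for setup_module; the only change suggested here is adding `import sys` first.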
