Scraping using Windmill?

  • bztips
    SBR Sharp
    • 06-03-10
    • 283

    #1
    Scraping using Windmill?
    Does anyone in the tank know anything about using Windmill for scraping?

    Normally I use the mechanize library (for Python) to get the HTML for most of my scraping, which works great as long as the content is coded directly in the page. But some sites (like mlb.com) use a lot of JavaScript that generates the content only after the page is opened - mechanize won't work in such cases, since it doesn't execute JavaScript.

    I've read several external discussions about Windmill, which is a testing library that allows you to interact with a real web browser. I'm using Windows, and have installed it successfully as a Python site-package following the directions here.

    I've also come across a couple of good coding descriptions for it for scraping, but I can't figure out how to actually fire it up, either from the Python command line or the Python GUI (Idle).

    [For those in the know, I want to be able to get to the point where I can issue a command such as: client = WindmillTestClient()]

    Any help appreciated.
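
A minimal sketch of the problem described above, using only the Python 3 standard library rather than mechanize. The page markup, the `injuries` id, and the `DivTextParser` class are made up for illustration and are not taken from mlb.com. The point is that a tool which only downloads and parses the HTML sees an empty container, because the script that would fill it never runs.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the raw HTML and would only be
# filled in by the script once a real browser executes it.
STATIC_HTML = """
<html><body>
<div id="injuries"></div>
<script>
document.getElementById("injuries").innerHTML = "<p>Player X: day-to-day</p>";
</script>
</body></html>
"""


class DivTextParser(HTMLParser):
    """Collects text found inside <div id="injuries"> in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "injuries":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.text.append(data.strip())


parser = DivTextParser()
parser.feed(STATIC_HTML)
print(parser.text)  # nothing was collected: the script was never executed
```

A browser-driving tool such as Windmill sidesteps this by letting a real browser run the script before the page content is read.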
  • Indecent
    SBR Wise Guy
    • 09-08-09
    • 758

    #2
    Originally posted by bztips
    Does anyone in the tank know anything about using Windmill for scraping?

    Normally I use the mechanize library (for Python) to get the HTML for most of my scraping, which works great as long as the content is coded directly in the page. But some sites (like mlb.com) use a lot of JavaScript that generates the content only after the page is opened - mechanize won't work in such cases, since it doesn't execute JavaScript.
    Which page specifically are you trying to scrape? It looks like most of the JavaScript code on mlb.com is not obfuscated (the code is easy to read and understand), so it could potentially be easy to do this with the tools you are already using.
    • bztips
      SBR Sharp
      • 06-03-10
      • 283

      #3
      Originally posted by Indecent
      Which page specifically are you trying to scrape? It looks like most of the JavaScript code on mlb.com is not obfuscated (the code is easy to read and understand), so it could potentially be easy to do this with the tools you are already using.
      For example, the injury page: http://mlb.mlb.com/mlb/fantasy/wsfb/news/injuries.jsp

      If you call this up with your favorite browser and look at the html, it all seems to be there.

      But if I try to grab it with mechanize, none of it's there, because the content is generated by JavaScript.
      • Indecent
        SBR Wise Guy
        • 09-08-09
        • 758

        #4
        Originally posted by bztips
        For example, the injury page: http://mlb.mlb.com/mlb/fantasy/wsfb/news/injuries.jsp

        If you call this up with your favorite browser and look at the html, it all seems to be there.

        But if I try to grab it with mechanize, none of it's there, because the content is generated by JavaScript.
        Gotcha.

        Try using this tutorial, it seems like it covers a lot of what you might need.
        • bztips
          SBR Sharp
          • 06-03-10
          • 283

          #5
          Originally posted by Indecent
          Gotcha.

          Try using this tutorial, it seems like it covers a lot of what you might need.

          Yes, I've seen that already; that's what got me interested in Windmill in the first place. As I mentioned in my OP, the problem is that I don't know how to get Windmill instantiated inside of Python. I also found this in Google Groups, which claims to directly answer my question, but I can't get it to work.

          (I suppose I could post a follow-up to that in GG, but I'm really not a programming whiz, so it's likely I wouldn't understand the answer anyway.) Just looking for someone who might happen to know how to start up Windmill inside of Python.
          • Indecent
            SBR Wise Guy
            • 09-08-09
            • 758

            #6
            Originally posted by bztips
            Yes, I've seen that already; that's what got me interested in Windmill in the first place. As I mentioned in my OP, the problem is that I don't know how to get Windmill instantiated inside of Python. I also found this in Google Groups, which claims to directly answer my question, but I can't get it to work.

            (I suppose I could post a follow-up to that in GG, but I'm really not a programming whiz, so it's likely I wouldn't understand the answer anyway.) Just looking for someone who might happen to know how to start up Windmill inside of Python.
            I got it up and running. Which part did you have problems with?
            • bztips
              SBR Sharp
              • 06-03-10
              • 283

              #7
              Originally posted by Indecent
              I got it up and running. Which part did you have problems with?
              The following command:

              setup_module(sys.modules[__name__])

              raises a traceback -- NameError: name 'sys' is not defined

              FYI, I've gone ahead and posted my problem to the Google Group I mentioned earlier; maybe I'll get a response there.
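
For reference, that NameError usually means the standard `sys` module was never imported in the session before `sys.modules[__name__]` was evaluated. The sketch below reproduces just that part in Python 3 without Windmill installed; `run_snippet` and the two snippet strings are hypothetical helpers written for this illustration, and the Windmill call itself (`setup_module`) is deliberately left out because its behavior is not verified here.

```python
def run_snippet(source):
    """Run a code string in a fresh namespace and report what happened."""
    namespace = {"__name__": "__main__"}
    try:
        exec(source, namespace)
    except NameError as exc:
        return "NameError: %s" % exc
    return "ok"


# Same expression as in the post, minus the Windmill call around it.
WITHOUT_IMPORT = "module = sys.modules[__name__]"
WITH_IMPORT = "import sys\nmodule = sys.modules[__name__]"

print(run_snippet(WITHOUT_IMPORT))  # reports a NameError for 'sys'
print(run_snippet(WITH_IMPORT))     # runs cleanly once sys is imported
```

If this is the cause, adding `import sys` at the top of the script or interactive session, before the `setup_module(...)` line, should clear this particular error; whether Windmill then starts correctly is a separate question.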