  1. #1
    RollDamnTide
    Join Date: 09-23-13
    Posts: 3
    Betpoints: 73

    Scraping in Java

    I'm by far more proficient with Java. Could someone point me in the right direction for scraping with it, or is Java just nearly impossible to use for this?

  2. #2
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    I recommend using the Selenium library for scraping: http://www.seleniumhq.org/

    It's actually built for testing web pages, but it can be repurposed to scrape data from static HTML as well as DOM content that's modified by JavaScript.

  3. #3
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    If you want to start web scraping, the first thing to do is make sure your database is designed exactly how you want it.

    You don't want to be coding the scraper while you're still designing the database.

    After you build your database, set up source code management, something like Subversion or Git.

    You might not see the value of this yet; just trust me and do it. Integrate Subversion or Git with your IDE and commit your code regularly.

    You'd be well served to have a server outside your home network do all the scraping, in case you get IP banned or something. Look into a server from a provider like digitalocean dot com; I run one there. Just get a $5-a-month 64-bit Linux server and keep it simple.

    This server can host your application, your source control, and your database. For 5 bucks a month, you are winning.

    As for the scraping itself, the first thing to do is figure out where your data will come from. Most of the data you want will be inside HTML tables on some pages, so you'll need to figure out how to convert that table data into a data structure you can manipulate and then store in the database.
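
    To make the table-to-data-structure step concrete, here is a rough Java sketch using only the JDK's built-in XML parser. It assumes the table fragment is well-formed XHTML, which real pages often are not (a lenient HTML parser would be needed there). The class name TableRows, the method parseTable, and the sample data are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableRows {
    /**
     * Turns a well-formed table fragment into a list of rows,
     * where each row is a list of trimmed cell strings.
     * Assumption: the input parses as XML (no unclosed tags).
     */
    public static List<List<String>> parseTable(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        List<List<String>> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes between cells
                }
                String name = child.getNodeName();
                if (name.equals("td") || name.equals("th")) {
                    cells.add(child.getTextContent().trim());
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample table, not taken from any real site.
        String sample = "<table>"
                + "<tr><th>Team</th><th>Score</th></tr>"
                + "<tr><td>Home</td><td>21</td></tr>"
                + "<tr><td>Away</td><td>17</td></tr>"
                + "</table>";
        for (List<String> row : parseTable(sample)) {
            System.out.println(row);
        }
    }
}
```

    Once the rows are plain lists of strings, mapping them onto database columns is ordinary code that no longer depends on the page's markup.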

    Also, you'll need to learn regular expressions. And by learn, I mean you should be able to get certified in regular expressions by the time you're done with your web scraper.

  4. #4
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    Hopefully this is good advice; I'm just passing along things I wish I'd known back when I started.

  5. #5
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    I definitely agree with Maverick's point about SVN; honestly, I think everyone should use versioning software for everything (not just code).

    That said, I believe Selenium is a cleaner solution than regexes. Either way you'll have to re-code whenever the webpage changes, so for static data it's really a matter of preference.

  6. #6
    creditcardclown
    Join Date: 11-28-10
    Posts: 242
    Betpoints: 322

    Maverick, have you been IP banned before for scraping? Can you please explain the importance of source code management?

    Regex sucks for HTML. I use Python and lxml, plus a Firefox "page inspector" tool, and I can get any info from a page within a few minutes.

  7. #7
    HUY
    Join Date: 04-29-09
    Posts: 253
    Betpoints: 3257

    Originally posted by creditcardclown:
        "Regex sucks for HTML. I use Python and lxml, plus a Firefox "page inspector" tool, and I can get any info from a page within a few minutes."

    This is what I'm doing as well. Whoever is parsing HTML with regex needs to get his head checked. Still, every programmer needs to know regex anyway, if only to handle the data once you've extracted it.
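
    For what "handling the data" with regex might look like in Java, here is a small sketch. The cell format (a signed number followed by a price in parentheses, such as "+3.5 (-110)") is an assumption made up for illustration, as are the names CellCleanup, Line, and parse. The point is that the regex runs on text already pulled out of the page, not on the HTML itself.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CellCleanup {
    /** Simple holder for the two numbers found in a cell. */
    public static final class Line {
        public final double spread;
        public final int price;

        Line(double spread, int price) {
            this.spread = spread;
            this.price = price;
        }
    }

    // Hypothetical format: optional sign, number, then a signed integer in parentheses.
    private static final Pattern CELL =
            Pattern.compile("\\s*([+-]?\\d+(?:\\.\\d+)?)\\s*\\(([+-]?\\d+)\\)\\s*");

    /** Returns the parsed values, or empty if the cell does not fit the assumed format. */
    public static Optional<Line> parse(String cell) {
        Matcher m = CELL.matcher(cell);
        if (!m.matches()) {
            return Optional.empty();
        }
        double spread = Double.parseDouble(m.group(1));
        int price = Integer.parseInt(m.group(2));
        return Optional.of(new Line(spread, price));
    }

    public static void main(String[] args) {
        Optional<Line> parsed = parse(" +3.5 (-110) ");
        parsed.ifPresent(l -> System.out.println(l.spread + " at " + l.price));
        System.out.println(parse("n/a").isPresent());
    }
}
```

    Returning an empty Optional for cells that don't match makes it easy to log or skip unexpected values instead of storing garbage in the database.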
