An introduction to research

#124

This thread is really great, thanks a lot. I have some questions about using
br.set_handle_robots(False) in mechanize
when a site has a robots.txt file. I know there are legal or ethical issues around respecting it.
I want to try scraping, but from what I've read you need to set timeouts in your scripts so your IP doesn't get banned, plus other measures.

Are there any sites that are "OK" with being scraped for stats (SBR?)? Or should you be really careful with your scraping, since I would guess most don't like it? What other things should we consider?
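For anyone following along, here is a minimal sketch of the two measures mentioned above: checking robots.txt before fetching, and pausing between requests. It uses only the Python standard library rather than mechanize (where br.set_handle_robots(False) switches the robots.txt check off). The robots.txt text, the "my-stats-bot" user agent, the example.com URLs, and the helper names are all made up for illustration, and the 2-second default delay is an assumption, not a recommendation from any site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, default=2.0):
    """Use the site's Crawl-delay if it declares one, else an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


rules = build_robot_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("my-stats-bot", "http://example.com/stats/2012")
blocked = rules.can_fetch("my-stats-bot", "http://example.com/private/odds")
delay = polite_delay(rules, "my-stats-bot")
```

In a real script you would call Throttle(delay).wait() before every request and skip any URL for which can_fetch returns False. Honoring robots.txt does not settle the copyright or terms-of-service questions; it only keeps the scraper within what the site has published for crawlers.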
#128

Quote Originally Posted by Wrecktangle View Post
dmolition, most sites are NOT OK with scraping due to copyright, and not a few will actively block you. And it seems that even those who tolerate it change formats so often that you are always tweaking code to get around the changes.
Yeah, I figured as much. So to scrape 10 seasons of any sport, I imagine I need multiple IPs, timeouts in scripts, constant checks for changes in the DOM structure of the HTML, etc. Now I know the cost of data.
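One rough way to check for changes in DOM structure is to fingerprint only the tags and class names of a page and compare that against a fingerprint saved from an earlier run. The sketch below is just one possible approach using the standard library; the function name structure_fingerprint and the sample HTML snippets are invented for illustration.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record each opening tag with its class attribute, ignoring all text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{css_class}")


def structure_fingerprint(html):
    """Hash the sequence of tags so the result changes only when the layout does."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    joined = "|".join(collector.tags)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical box-score pages: same layout with new numbers, then a redesign.
last_week = "<table class='box'><tr><td>10</td></tr></table>"
this_week = "<table class='box'><tr><td>22</td></tr></table>"
redesign = "<div class='box'><span>10</span></div>"

same_layout = structure_fingerprint(last_week) == structure_fingerprint(this_week)
layout_changed = structure_fingerprint(last_week) != structure_fingerprint(redesign)
```

If the fingerprint differs from the stored one, the script can stop and flag the page for a manual look instead of silently parsing garbage. Pages whose row counts vary from game to game would need a coarser fingerprint, for example the set of distinct tags rather than the full sequence.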

I'm going to do some research, and maybe if I gather enough data I'll be willing to trade it (after I validate it, of course).
It would be nice to have a list of sites that enforce anti-scraping policies more strictly, or where NOT to try it, so we can have a little peace of mind.

Also, I'm taking the hard road and learning R and Python (checking out SciPy too) for data analysis. I'm savvy with software development, so once I can actually start doing some serious data analysis, I'd be happy to exchange technical tips with anyone who wants to. Maybe we can open a "hacking/data analysis stuff" thread to ask general questions, share tips, and contribute in general.
#130

Quote Originally Posted by Wrecktangle View Post
dmolition, most sites are NOT OK with scraping due to copyright, and not a few will actively block you.
Hm, most sites? No way. The largest projects traffic-wise are collecting several years' worth of box scores and play-by-plays. Everything else is peanuts. The only site that ever temporarily blocked my scraping was Yahoo!