An introduction to research

**MrX** · 04-02-10, 10:48 AM

Potentially one of the best threads on here. Good job.

I would suggest throwing some mysql into the mix.

**IrishTim** · 04-02-10, 11:09 AM

Originally posted by MrX

Potentially one of the best threads on here. Good job.

I would suggest throwing some mysql into the mix.

No doubt. Looking forward to see this one progress.

**Jule** · 04-02-10, 11:26 PM

Great info. :-)

**jessetopolski** · 04-02-10, 11:27 PM

interesting

**uva3021** · 04-03-10, 02:52 AM

great thread, looking forward to continue reading

**statnerds** · 04-03-10, 07:11 PM

excellent.

thanks for the huge contribution. i will try my hand at this

**Wrecktangle** · 04-03-10, 07:49 PM

Nicely written article. Are there any subroutine libraries we can obtain so we don't have to write all our typically used routines from scratch?

**ljump12** · 04-03-10, 09:02 PM

As far as Python and web scraping go, you're going to want to obtain BeautifulSoup and something called Mechanize. I intend to write about these in the next section. I may try to write up a library of commonly used betting functions; I have a bunch of functions lying around that I could probably scrape together.

**benjaminj78** · 04-05-10, 12:56 AM

Python is a lifesaver when learning to code. This is the best reference anywhere with regard to building the ultimate spreadsheet set-up. Thanks!

**Wrecktangle** · 04-05-10, 04:14 AM

OK, is there something about Python that makes it easier or more forgiving to program in than other languages?

Why not use some variant of C? or VBA? or Java?

**ljump12** · 04-05-10, 12:27 PM

Originally posted by Wrecktangle

OK, is there something about Python that makes it easier or more forgiving to program in than other languages?

Why not use some variant of C? or VBA? or Java?

Python is infinitely easier than C, Visual Basic, and Java. It will become more apparent the more you do with it, but take a look at the code to open and parse a CSV file: 3 readable lines. Show me a short, elegant solution like that in C or Java. Don't get me wrong, compiled languages like C and Java have their place -- and are much, much more efficient. But for this type of project, you will want to use a scripting language (Python, Ruby, Perl, etc.). I've just chosen Python because it's what I'm most familiar with.

Bottom line: I can write something in Python in 1/3 the time it takes me to write something in C or Java, with increased readability/maintainability, and without losing any functionality.
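For anyone who missed the earlier section, here is a minimal sketch of the kind of CSV parsing mentioned above. It is not necessarily the exact code from the earlier post: the column names and rows are made up for illustration, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical sample data standing in for a real CSV file (made-up rows).
sample = io.StringIO(
    "date,home,away,home_score,away_score\n"
    "2010-04-05,AAA,BBB,9,7\n"
    "2010-04-06,AAA,BBB,4,6\n"
)

# Open, parse, and loop: each row becomes a dict keyed by the header line.
reader = csv.DictReader(sample)
games = list(reader)
home_wins = [g for g in games if int(g["home_score"]) > int(g["away_score"])]

print(len(games), "games,", len(home_wins), "home wins")
```

With a real file you would replace the `io.StringIO(...)` object with `open("yourfile.csv", newline="")`.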

**MonkeyF0cker** · 04-06-10, 03:35 AM

It may be slightly faster to code parsing a CSV, but why would you be working much with CSV's anyway? Personally, I hate them. Generally, you'd be scraping directly into a DB of some sort anyway. At least, you probably should.
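A minimal sketch of what "scraping directly into a DB" could look like, assuming SQLite through Python's built-in `sqlite3` module (other databases would follow the same pattern with a different driver). The table layout and the rows are hypothetical, not taken from anyone's actual setup.

```python
import sqlite3

# Hypothetical rows, shaped the way a scraper might produce them (made-up values).
scraped_rows = [
    ("2010-04-05", "AAA", "BBB", 9, 7),
    ("2010-04-06", "AAA", "BBB", 4, 6),
]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
conn.execute(
    """CREATE TABLE games (
           game_date TEXT, home TEXT, away TEXT,
           home_score INTEGER, away_score INTEGER,
           UNIQUE (game_date, home, away)
       )"""
)

# INSERT OR IGNORE skips rows that violate the UNIQUE constraint,
# so running the scraper twice does not create duplicates.
insert_sql = "INSERT OR IGNORE INTO games VALUES (?, ?, ?, ?, ?)"
conn.executemany(insert_sql, scraped_rows)
conn.executemany(insert_sql, scraped_rows)  # simulated second run
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]
avg_total = conn.execute(
    "SELECT AVG(home_score + away_score) FROM games"
).fetchone()[0]
print(row_count, avg_total)
```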

**MonkeyF0cker** · 04-06-10, 03:38 AM

Also, I prefer my efficiency to be on the executable side and not the coding side of my projects. However, for the purposes of this thread, I would agree that Python is far better as a tutorial on a more efficient language would be much more involved.

**Daniel** · 04-06-10, 06:48 AM

Not that it really matters, but parsing a CSV file with C# is just a few lines of code too.

Code:

// requires: using System.IO;
TextReader tr = new StreamReader(fileName);
string line;
while ((line = tr.ReadLine()) != null)
{
   string[] brokenDown = line.Split(',');
   // Do what you want with the array of strings split on the comma
}

What it boils down to is obviously language preference. Everyone has their own language that they feel the most comfortable with. The end result is still based more on the logic than on which language you used.

**Wrecktangle** · 04-06-10, 07:35 AM

I guess this sort of goes to my point. Granted Python is good for scraping, what does it have for me as far as modeling sports? If I'm a newbie, not wanting to learn two languages to do my programming (one to scrape, one to organize my modeling efforts, perhaps R), which one should I use? In my world, checked out scientific routines are important. Reinventing statistical wheels might be fun for some, but I'd rather directly model sports techniques.

**Daniel** · 04-06-10, 07:43 AM

Like I said, language is secondary. Choose the one you're comfortable using, be it Python or .net or whatever. For this project, it's still going to be the output that matters. As far as I know, no language is more suited to statistical analysis than another. At least not for a project such as this.

I'm a .net developer, I would choose C# over Python every day of the week and twice on Sundays, but for someone else the preference is probably completely opposite. That doesn't mean that I can create something that more accurately analyzes data than the other guy.

**MonkeyF0cker** · 04-06-10, 07:24 PM

It also depends on your environment. .NET won't do you much good in Linux. That said, I use .NET to code the majority of my handicapping apps.

**MonkeyF0cker** · 04-06-10, 07:34 PM

If you'd like to concentrate on only one language, I'd honestly go with C#. There are a lot of things that I do with arrays and structs in my models that would be extremely cumbersome and inefficient in Python.

**ljump12** · 04-06-10, 07:43 PM

Originally posted by Wrecktangle

I guess this sort of goes to my point. Granted Python is good for scraping, what does it have for me as far as modeling sports? If I'm a newbie, not wanting to learn two languages to do my programming (one to scrape, one to organize my modeling efforts, perhaps R), which one should I use? In my world, checked out scientific routines are important. Reinventing statistical wheels might be fun for some, but I'd rather directly model sports techniques.

R is completely different from Python; they do very different things, and both are very valuable to know. Python is more versatile than R in my opinion, and if you were only going to learn one language, Python would be it imo. However, I don't want to waste more time arguing for Python. It's mostly a personal choice -- you can use any language you'd like, and the concepts taught here should still generally apply.

**sharpcat** · 04-06-10, 07:51 PM

ljump12, I think many are interested to see your write-up. Please continue.

As far as this arguing about what language everybody prefers to code in, please start your own threads and let the man continue with his thread. Let's allow this to be the educational thread it was meant to be. I am sure everybody is capable of starting their own thread if they feel the need to prove that they are more intelligent and that their technique is better.

**ljump12** · 04-06-10, 07:51 PM

Originally posted by MonkeyF0cker

If you'd like to concentrate on only one language, I'd honestly go with C#. There are a lot of things that I do with arrays and structs in my models that would be extremely cumbersome and inefficient in Python.

I think efficiency is kind of a moot point at this stage. We're not dealing with things in handicapping where efficiency would matter. I've written a Python baseball simulator that processes millions of rows of play-by-play data, and it has no trouble.

**OMGRandyJackson** · 04-08-10, 10:20 AM

Once I get some free time, I'm going to check out the Python tutorial and get started on this. Cannot wait for you to continue with the topics!

Thanks so much man!

**ljump12** · 04-10-10, 02:58 PM

Section D) How to scrape the internet for data

One of the most important aspects of research is the data that you have. Without data, there can't be any model. Fortunately, most data is free -- Unfortunately, most data isn't immediately in the best computer parsable formats [like .csv, or .xml]. To get the data into formats we can use we will need to "scrape" websites for it.

A couple of "packages" have been created that will greatly improve our ability to scrape webpages. It can certainly be done in Python without them -- but they will make your life a whole lot easier:

Mechanize - This will allow us to open webpages easily (http://wwwsearch.sourceforge.net/mechanize/)
Beautiful Soup - This will allow us to parse apart the webpages (http://www.crummy.com/software/BeautifulSoup/)

Installing Beautiful Soup is pretty easy: you can just put the Beautiful Soup Python file (http://www.crummy.com/software/Beaut...lSoup-3.0.0.py) in the same directory you are running your code from.

Installing Mechanize is a little tougher. On a *nix machine, cd to the directory where you downloaded it and extract it (tar -xzvf [filename]). Then cd into the extracted directory and install it by typing "sudo python setup.py install". It should install; you can post here if you have any problems. As far as Windows goes, you may be on your own -- I can't imagine it's very tough, and there's probably a tutorial somewhere online.

Now that the installation is out of the way, it's time to get down to business. I'll give you the basics here, and you should be able to refer to the documentation for more complicated examples. I'm going to assume you have a basic familiarity with HTML -- if you don't, you may want to search for a quick tutorial. Let's make our first example getting a list of today's injuries from statfox for MLB baseball:

Code:

from BeautifulSoup import BeautifulSoup, SoupStrainer ## This tells python to use Beautiful Soup
from mechanize import Browser   ## This tells python we want to use a browser (which is defined in mechanize)
import re   ## This tells python that we will be using some regular expressions.
            ## .. Regular expression allow us to search for a sequence of characters
            ## .. within a larger string
import time
import datetime

## The first step is to create our browser..
br = Browser()

## Now let's open the injuries page on statfox. This one line will open and retrieve the html.
response = br.open("http://www.sbrodds.com/StoryArchivesForm.aspx?ShortNameLeague=mlb&ArticleType=injury&l=3").read()

## Now we need to tell Beautiful Soup that we would like to search through the response.
## .. This next line will tell Beautiful Soup to only return links to the individual injuries.
## .. We know that all the links to the injuries have "ShortNameLeague=mlb&ArticleType=injury" 
## .. in their url, so we search for these links. Each of these links has a title that describes
## .. the injury which we will use in the next line.
linksToInjuries = SoupStrainer('a', href=re.compile('ShortNameLeague=mlb&ArticleType=injury'))

## This will put the title of all links in the "linksToInjuries" into a list.
## We then call set() on our list to change it to a "set", which by definition has no duplicates.
injuryTitles = set([injuryPage['title'] for injuryPage in BeautifulSoup(response, parseOnlyThese=linksToInjuries)])


## Finally let's print all the injuries out that are for today's date.
today = datetime.date.today()
# the function strftime() (string-format time) produces nice formatting
# All codes are detailed at http://www.python.org/doc/current/lib/module-time.html
date =  today.strftime("%m/%d") 

## Now let's print out the injuries that we have.
for title in injuryTitles:
    ## See if the date is in the title, if it is: print it.
    if re.search(date, title):
        print title

It might seem like a lot at first, but it's not much code. Take it slow and use Google when you don't know what a function does. Googling "python [some piece of code you don't understand]" will work magic. Ask here and I can further break down any slice of code.
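If you want to experiment with the same idea without installing anything or hitting a live site, here is a rough sketch using only the Python 3 standard library. It is a stand-in for the BeautifulSoup/Mechanize version above, not a replacement for it: the `LinkTitleCollector` class, the sample HTML, and the fixed date string are all made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkTitleCollector(HTMLParser):
    """Collect the title attribute of every <a> whose href matches a pattern."""

    def __init__(self, href_pattern):
        super().__init__()
        self.href_pattern = re.compile(href_pattern)
        self.titles = set()  # a set silently drops duplicate titles

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href") or ""
        title = attr_map.get("title")
        if title and self.href_pattern.search(href):
            self.titles.add(title)


# Hypothetical HTML standing in for a downloaded page (made-up entries).
sample_html = """
<a href="/story/1?ArticleType=injury" title="04/10 Player A (hamstring) is out">A</a>
<a href="/story/1?ArticleType=injury" title="04/10 Player A (hamstring) is out">A again</a>
<a href="/story/2?ArticleType=injury" title="04/09 Player B (elbow) is questionable">B</a>
<a href="/about" title="About us">About</a>
"""

collector = LinkTitleCollector(r"ArticleType=injury")
collector.feed(sample_html)

# Fixed here so the example is repeatable; in real use:
# date = datetime.date.today().strftime("%m/%d")
date = "04/10"
todays = sorted(t for t in collector.titles if date in t)
print(todays)
```

The structure mirrors the original: filter links by their href, keep the titles in a set to drop duplicates, then keep only the titles that contain today's date.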

Sorry I haven't had much time -- If anyone can post an example of what kind of data they would like to be scraped, I will create one more example using both BeautifulSoup and Mechanize.

**pats3peat** · 04-10-10, 03:40 PM

Got to love research, one of the best things in sports

**MadTiger** · 04-10-10, 05:56 PM

Originally posted by Wrecktangle

... although I suppose I could call them through Python, if it is allowed. ...

It is very much allowed and possible: http://www.omegahat.org/RSPython/

(ex-developer here. Ancient. These languages are new to me, but mixed languages have been my thing for a while.)

**sycoogtit** · 04-14-10, 11:30 AM

Very nice thread ljump12. Your elegant python examples have convinced a perl programmer to spend a bit more time with python.

However, I'm conflicted. This is selfish of me, but as sports bettors we have to be selfish when it comes to this. If everyone knows about an edge, then it isn't an edge anymore. Do we really want to be giving everyone these step-by-step instructions on how to research betting trends? The information on how to program web scrapers is widely available, but putting it all down right here has made it significantly easier to learn how to apply it directly to our field.

I'm sure you thought of this before you started this thread -- I guess I'm curious what your thoughts are.

**ljump12** · 04-14-10, 09:57 PM

Originally posted by sycoogtit

Very nice thread ljump12. Your elegant python examples have convinced a perl programmer to spend a bit more time with python.

However, I'm conflicted. This is selfish of me, but as sports bettors we have to be selfish when it comes to this. If everyone knows about an edge, then it isn't an edge anymore. Do we really want to be giving everyone these step-by-step instructions on how to research betting trends? The information on how to program web scrapers is widely available, but putting it all down right here has made it significantly easier to learn how to apply it directly to our field.

I'm sure you thought of this before you started this thread -- I guess I'm curious what your thoughts are.

This is a very valid concern. Here's the thing, and it's kind of selfish on my part too. I'm not, and probably won't be, a huge sports bettor. It's not that I can't be... It's something that, if I put 100% effort into it, I believe I could do well at, but I don't really want to. Since I'm not doing it, I figure I may as well help other people. You may feel differently about what I'm doing, and I can totally respect that. I guess the bottom line is that, even given these tools and this "tutorial" (if you could call it that), not many are going to follow through with it, so I wouldn't be too worried.

Finally, one of my biggest hopes for this thread is that it sparks discussion. Please feel free to post on anything related.

**IrishTim** · 04-14-10, 10:07 PM

I see where both of you guys are coming from, but I tend to agree with ljump here. I don't think we're going to have 100 clowns from Players Talk see this thread and all of a sudden go from looking for the 100 unit lock of the century to setting up web scrapers, churning out dbs with 20k samples, and firing away +EV plays into soft spots in the market by Friday. My guess would be that most of the people who have the patience (and intelligence quotient) to read, understand, and apply the lessons in this thread already know how to do this type of programming or have contacts they share with and get help from.

As long as you aren't attaching databases with +EV models to each post, I think everyone is going to be okay.

**romanowski** · 04-14-10, 10:38 PM

most people are too lazy to do any of this, I wouldn't worry about losing any edge

**frankzig** · 04-16-10, 08:11 AM

this is nice

**MonkeyF0cker** · 04-16-10, 10:57 PM

Relatively similar? LOL. I hope you're joking.

Some people have only box scores, some have play by play data, some have pitch by pitch data, some people have linked line history tables, some people have closing number columns, some have linked player tables with keys, some have individual tables for each game, some people perform simple system or prop queries, some perform a series of queries to populate and process a model, etc., etc., etc., etc., etc. Not to mention, there are probably more than 3 billion ways that one can go about doing the exact same thing.

AGAIN, IT ALL DEPENDS ON YOUR DATASET AND HOW YOU PLAN ON USING YOUR DATA!!!!!!!

Anyone whose profession is in data warehousing should be able to grasp this simple concept the first time they are told. However, they certainly shouldn't need to be told this in the first place.

**durito** · 04-17-10, 12:05 AM

Yea, just ignore the odds, that should work out perfectly.

**Wrecktangle** · 04-17-10, 09:04 AM

I'm always struck by how hard it can be to express yourself in print, and the fact that we all use differing terms to label the same items. I'm not a data base guy but data dictionary "thingies" are important even in my simplistic world. I would like to see us stay away from the Players Talk way of solving differences of opinion here in the Tank, however.

I keep saying this to no observable progress: I'd like to see a group form where the interest is sharing checked out data sets. I spend way too much time cross checking data and way too little time on model building and analysis; especially the analysis.

**MonkeyF0cker** · 04-17-10, 10:01 AM

The reason his statement was confusing is because he wasn't using the proper terminology, Wrecktangle. If someone attacks my integrity in here, I'm certainly going to prove my point. If you don't design your data tables to coincide with your end product, you'll likely create a ton of unnecessary work for yourself and inefficiencies in the modelling phase. It can make your queries a nightmare to code and process.
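To make the point about table design concrete, here is a minimal sketch of one of the layouts mentioned earlier: a games table with a linked line history table, keyed by a game id. This is only an illustration using Python's built-in `sqlite3` module; the table names, columns, and values are hypothetical, and a different end product would call for a different design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE games (
        game_id    INTEGER PRIMARY KEY,
        game_date  TEXT,
        home       TEXT,
        away       TEXT
    );
    CREATE TABLE line_history (
        game_id    INTEGER REFERENCES games(game_id),
        taken_at   TEXT,     -- ISO timestamp, so text ordering is time ordering
        home_line  REAL
    );
    """
)

# Made-up data: one game with three recorded line moves.
conn.execute("INSERT INTO games VALUES (1, '2010-04-10', 'AAA', 'BBB')")
conn.executemany(
    "INSERT INTO line_history VALUES (?, ?, ?)",
    [
        (1, "2010-04-10T09:00", -120.0),
        (1, "2010-04-10T12:00", -130.0),
        (1, "2010-04-10T18:55", -125.0),
    ],
)

# Closing line = the last line recorded for each game.
closing = conn.execute(
    """
    SELECT g.home, g.away, lh.home_line
    FROM games AS g
    JOIN line_history AS lh ON lh.game_id = g.game_id
    WHERE lh.taken_at = (
        SELECT MAX(taken_at) FROM line_history WHERE game_id = g.game_id
    )
    """
).fetchall()
print(closing)
```

If the model only ever needs the closing number, a single closing column on the games table would make this query trivial, which is the kind of trade-off that depends on how the data will be used.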

As far as sharing datasets, I have no interest in that. I do everything programmatically and I think I have far more reliable data than the vast majority of posters here. I really doubt I'd get any desirable reciprocation for my work. Not to mention, I'm not one to trust other people's work when it comes to these things. If someone gave me a set of data, the first thing I'd do is verify its integrity. So it would be a completely unproductive process for me.