How should I structure my data?

Carl-Haakon · 02-08-13 06:23 PM

Hello everyone!

I've worked on a scraper to download horse-racing data and I now have data for approximately one thousand races.

My main question:
The data that I have scraped is "starting list information" (I don't know if that's the correct English term) combined with results for the race. The problem is that the data doesn't fit into neat lines. What I mean is: in one race there are about ten horses, and for each horse there are reported results for its last five races (together with date and track code). I want to use the data from those last five races to somehow estimate the "fitness" of the horse, but I don't know how to structure the data. The ML library that I usually use (Orange for Python 2.7) only takes tab-delimited data printed out on single lines, whereas my data would be better written out in a tree-like structure (or something). Do you guys have any ideas on what I could do for this to work? Currently I have list objects with data from previous races as data points, like so:

horse_name driver_name [data for prev. race 1] [data for prev. race 2] [etc.]

(I realize this question is somewhat unclear, but it's probably because I don't even know if it's the right question to be asking to begin with)

Less important second question:
Do you find the odds on pari-mutuel betting markets to be better or worse than those given by betting firms?

Thank you all for your time; this is my first post but I want to thank you all for the great discussions you've had on this board so far! It's been a great joy for me to read!

EDIT: I should probably add that I'm working with Python, and I'm writing the data to a .txt file. I really don't know anything about databases, so please excuse me if this is a dumb question.

Maverick22 · 02-11-13 10:29 AM

Are you writing the program that will read the data? If so... it probably doesn't matter how you lay it out in your file.

If you want the absolute prettiest data file possible, write it to XML and/or write a conversion to a character-delimited file (think CSV or Excel).

I would of course recommend folk to put their data in a relational database, but I've stopped suggesting such a thing

Carl-Haakon · 02-12-13 11:59 AM

Thank you for your reply!

I actually am writing the data as an xml-file at the moment, but I am not sure how I should go from there; as I wrote, my favourite ML-library for Python only takes data sorted into neat lines (which I assume is what you were referring to when you said "character delimited file"). What would a conversion to such a format look like?
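One possible shape for such a conversion, as a minimal Python 3 sketch: each horse becomes one tab-delimited line, and the previous races are spread over a fixed number of columns. The XML element and attribute names (`race`, `horse`, `prev`, `finish`, and so on), the sample values, and the `?` placeholder for missing races are all assumptions made for illustration; the real file will look different, and any extra header rows the ML library expects are not covered here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout; real element and attribute names will differ.
SAMPLE_XML = """
<race id="R1">
  <horse name="Thunder" driver="A. Olsen" finish="1">
    <prev date="2013-01-05" track="BJ" place="2"/>
    <prev date="2013-01-19" track="MO" place="1"/>
  </horse>
  <horse name="Blitz" driver="B. Berg" finish="2">
    <prev date="2013-01-12" track="BJ" place="4"/>
  </horse>
</race>
"""

N_PREV = 5                                # up to five previous races per horse
PREV_FIELDS = ("date", "track", "place")  # assumed fields per previous race
MISSING = "?"                             # assumed placeholder for missing values


def header():
    """Column names: fixed race/horse fields, then N_PREV groups of fields."""
    cols = ["race_id", "horse", "driver", "finish"]
    for i in range(1, N_PREV + 1):
        cols.extend("prev%d_%s" % (i, f) for f in PREV_FIELDS)
    return cols


def flatten_race(xml_text):
    """Return one flat row (list of strings) per horse in the race."""
    race = ET.fromstring(xml_text.strip())
    rows = []
    for horse in race.findall("horse"):
        row = [race.get("id"), horse.get("name"),
               horse.get("driver"), horse.get("finish")]
        prevs = horse.findall("prev")[:N_PREV]
        for i in range(N_PREV):
            if i < len(prevs):
                row.extend(prevs[i].get(f, MISSING) for f in PREV_FIELDS)
            else:
                # Pad so every row has the same number of columns.
                row.extend(MISSING for _ in PREV_FIELDS)
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize the header and rows as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header())
    writer.writerows(rows)
    return buf.getvalue()


rows = flatten_race(SAMPLE_XML)
tsv_text = to_tsv(rows)
print(tsv_text)
```

The padding step is what makes the tree-like data fit "neat lines": every horse gets the same columns, and horses with fewer than five recorded races carry placeholders.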

Actually, I wouldn't mind learning about relational databases if they could give me an edge. What are they and do you have any literature on the subject that you could recommend?

339955 · 02-12-13 03:43 PM

Carl, just put it in a sqlite3 DB. If you can use Python to scrape it, it will just take you a couple of days to learn how to store it and access it from sqlite3.

Maverick, how do you store your data? How much data have you saved?

Topo · 02-17-13 05:05 PM

I maintain a few small databases and have found the results disappointing. The data is nice to have, but it really just reveals what I wished were not true: I need to find another edge other than statistical history in order to turn a betting profit. That said, if you look at smaller betting markets then you can acquire an advantage through data analysis. But any reasonably sized sports betting market will probably not be beatable through data analysis alone.

SQL Server Express and MySQL are fairly simple to learn and free to acquire. You can learn how to structure and edit your databases using these programs in a few days. Querying them is easy too once you read a tutorial or two on Structured Query Language. There are many free tutorials offered on the web.

littlezola · 02-17-13 10:14 PM

Originally Posted by Topo

But any reasonably sized sports betting market will probably not be beatable through data analysis alone.

Why not?

Carl-Haakon · 02-18-13 09:29 AM

Originally Posted by littlezola

Originally Posted by Topo

But any reasonably sized sports betting market will probably not be beatable through data analysis alone.

Why not?

This caught my interest as well; would you care to explain what you mean, Topo?

Jontheman · 02-18-13 03:51 PM

Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.

littlezola · 02-18-13 05:46 PM

Originally Posted by Jontheman

Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.

Which presumes that:
A) All possible avenues of interpreting data have been exhausted
B) All markets are efficient.

Prove A or B.

sbrhedge · 02-18-13 11:07 PM

Originally Posted by Jontheman

Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.

trading and sportsbetting have one thing in common: the more data that is available, the worse the index / line / point spread gets out of line. the problem is not data itself, the problem is the number of boneheads that use the data. the advent of electronic data/trading has pretty much guaranteed a fund manager's career. sportsbetting seems to have substantially more horrible lines than say 5-10 years ago.

Carl-Haakon · 02-20-13 12:07 PM

Originally Posted by sbrhedge

trading and sportsbetting have one thing in common: the more data that is available, the worse the index / line / point spread gets out of line. the problem is not data itself, the problem is the number of boneheads that use the data. the advent of electronic data/trading has pretty much guaranteed a fund manager's career. sportsbetting seems to have substantially more horrible lines than say 5-10 years ago.

Everywhere else I've read (and I'm mostly talking old HTT posts) everybody seems to be of the opposite opinion, but I'll bite. Do you have data to support your assertion, and in which sports/markets?

Carl-Haakon · 02-20-13 12:15 PM

Originally Posted by Jontheman

Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.

But in order for this to work out in practice, you'd require limitless (or high limits) betting and big volumes. I didn't mention this in my first post, but the volume bet on horse racing in my home country is relatively small, and the fact that the market is pari-mutuel means it's useless to bet large amounts since you'd mostly be betting against yourself (thus creating a sort of self-limiting mechanism).
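The "betting against yourself" effect can be put into rough numbers. The sketch below assumes a simple win pool where the payout per unit staked is the pool after the operator's cut, divided by the money on the winning horse. The pool sizes, the 15% takeout, and the 15% win probability are made-up values, and real pools add rounding rules and other details not modelled here.

```python
def parimutuel_payout(total_pool, horse_pool, my_bet, takeout=0.15):
    """Payout per unit staked on a horse once my_bet has joined the pool.

    total_pool and horse_pool are the amounts in the win pool before my bet.
    """
    return (1.0 - takeout) * (total_pool + my_bet) / (horse_pool + my_bet)


def expected_profit(win_prob, total_pool, horse_pool, my_bet, takeout=0.15):
    """Expected profit of my_bet, given my own estimate of the win probability."""
    payout = parimutuel_payout(total_pool, horse_pool, my_bet, takeout)
    return my_bet * (win_prob * payout - 1.0)


# Hypothetical small pool: 10,000 in total, 1,000 of it on our horse.
for stake in (10, 100, 500, 1000):
    print(stake,
          round(parimutuel_payout(10000, 1000, stake), 3),
          round(expected_profit(0.15, 10000, 1000, stake), 2))
```

With these assumed numbers the payout shrinks as the stake grows, so a bet that looks profitable at a small stake turns unprofitable at a large one, which is the self-limiting mechanism described above.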

This means (based on my probably flawed understanding) the market is unlikely to be efficient in any strong sense, doesn't it?

Blax0r · 02-20-13 12:24 PM

In regards to your first question, I would actually recommend using a database (I made a similar change myself over the past few months). I imagine python has libraries to make queries as painless as possible, and having data easily query-able is such a huge gain. Another side-gain is that as your data grows, under your current format, you'll be putting a lot of stuff in memory and coding probably-complicated lines to order or organize that data. With a DB, you can just query for data you need at whatever moment with powerful SQL syntax.
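To make the "query for the data you need" point concrete for the original question, here is a minimal sketch using Python's built-in `sqlite3` module (the option suggested earlier in the thread). The `starts` table, its columns, and the sample rows are hypothetical, and averaging the finishing place is only a stand-in for a real "fitness" measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to keep the data on disk
conn.execute("""
    CREATE TABLE starts (
        horse     TEXT NOT NULL,
        race_date TEXT NOT NULL,   -- ISO dates (YYYY-MM-DD) sort correctly as text
        track     TEXT,
        place     INTEGER
    )
""")

# Made-up sample data: one horse with six earlier starts.
sample = [
    ("Thunder", "2012-11-01", "BJ", 5),
    ("Thunder", "2012-11-15", "MO", 3),
    ("Thunder", "2012-12-01", "BJ", 4),
    ("Thunder", "2012-12-15", "BJ", 2),
    ("Thunder", "2013-01-05", "MO", 2),
    ("Thunder", "2013-01-19", "BJ", 1),
]
conn.executemany("INSERT INTO starts VALUES (?, ?, ?, ?)", sample)
conn.commit()


def last_starts(conn, horse, before_date, n=5):
    """Return the horse's n most recent starts strictly before before_date."""
    cur = conn.execute(
        "SELECT race_date, track, place FROM starts "
        "WHERE horse = ? AND race_date < ? "
        "ORDER BY race_date DESC LIMIT ?",
        (horse, before_date, n),
    )
    return cur.fetchall()


def average_place(conn, horse, before_date, n=5):
    """Mean finishing place over the last n starts, or None if there are none."""
    rows = last_starts(conn, horse, before_date, n)
    if not rows:
        return None
    return sum(r[2] for r in rows) / float(len(rows))


print(last_starts(conn, "Thunder", "2013-02-01"))
print(average_place(conn, "Thunder", "2013-02-01"))
```

Filtering on `race_date < ?` keeps the feature limited to races that happened before the one being predicted, which is easy to get wrong when the history is stored inline with each starter.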

Just my 2 cents.

And I really hope to prove Topo wrong, but his statement definitely resonates loudly with me.

Carl-Haakon · 02-20-13 12:41 PM

Originally Posted by Blax0r

In regards to your first question, I would actually recommend using a database (I made a similar change myself over the past few months).

Thank you! I've been looking into databases ever since I first ran into this problem. I guess I just have to look up more internet tutorials on the subject and tinker more.

Originally Posted by Blax0r

And I really hope to prove Topo wrong, but his statement definitely resonates loudly with me.

If you don't mind my asking, how far have you come in your modelling efforts?

Blax0r · 02-20-13 12:50 PM

Yea it's surprisingly not too difficult (I hooked up Matlab and Postgresql), and the technical gains are astounding. But you will undergo some time-cost in order to develop a solid DB schema, tune tables for better performance, and develop a way to create your INSERT scripts from your current data.

Last year was my first "live" year, finished negatively (~10%) though after being up a decent amount (~25-30%) just before the Summer Olympics (I bet tennis). At this point, Topo's point has been true for me, but there are quite a few things I want to try out before declaring defeat.

Miz · 02-20-13 04:27 PM

It may be true for TOPO, but it isn't true for everyone.

strixee · 02-20-13 07:02 PM

My guess is you might need like 6 tables: Horses, Owners, Jockeys, Tracks, Races, Results.
Start the process with ER modeling, that'll give you the exact answer.
MySQL works well with hundreds of thousands of records on shared hosting; for 10M+ you should use a VPS.
If you're running it on your own PC, you won't have any performance troubles for a long time.
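As a rough sketch of what those six tables might look like, here they are created in SQLite from Python. Only the table names come from the list above; every column, key, and relationship is a guess, and proper ER modeling against the real data would likely change them.

```python
import sqlite3

# Hypothetical columns; only the six table names come from the suggestion above.
SCHEMA = """
CREATE TABLE owners  (owner_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE horses  (horse_id  INTEGER PRIMARY KEY, name TEXT NOT NULL,
                      owner_id  INTEGER REFERENCES owners(owner_id));
CREATE TABLE jockeys (jockey_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tracks  (track_id  INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
CREATE TABLE races   (race_id   INTEGER PRIMARY KEY,
                      track_id  INTEGER NOT NULL REFERENCES tracks(track_id),
                      race_date TEXT NOT NULL);
CREATE TABLE results (race_id   INTEGER NOT NULL REFERENCES races(race_id),
                      horse_id  INTEGER NOT NULL REFERENCES horses(horse_id),
                      jockey_id INTEGER REFERENCES jockeys(jockey_id),
                      place     INTEGER,
                      PRIMARY KEY (race_id, horse_id));
"""


def create_db(path=":memory:"):
    """Create the six tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


conn = create_db()
table_names = sorted(
    row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
print(table_names)
```

In this layout the `results` table holds one row per horse per race, so a horse's previous races no longer need to be stored with each starter; they can be looked up by joining `results` to `races` on `race_id`.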

TravisVOX · 02-22-13 05:39 AM

FWIW -- I have maintained a "horse racing" database for a handful of years now but spent the last few months redesigning it because it wasn't scaling well (it was basically a dump of data, no real structure). I use MySQL because I'm a hack PHP programmer and they play together well. Like with any sport, the database is big. For example, six months of data represents about 25,000 races, 215,000 starters, 65,000 horses etc. So, a database is the route you want to go.
