1. #1
    Carl-Haakon
    Join Date: 02-08-13
    Posts: 35

    How should I structure my data?

    Hello everyone!

    I've worked on a scraper to download horse-racing data and I now have data for approximately one thousand races.

    My main question:
    The data that I have scraped is "starting list information" (don't know if it's the correct English term) combined with results for the race. Problem is, the data doesn't fit into neat lines. What I mean is: in one race there are about ten or so horses, and for each horse there are reported results for its last five races (together with date and track code). My problem is that I wish to use the data from the last five races to somehow estimate the "fitness" of the horse, but I don't know how to structure the data. The ML library that I usually use (Orange for Python 2.7) only takes tab-delimited data printed out on single lines, whereas my data would be better written out in a tree-like structure (or something). Do you guys have any ideas on what I could do for this to work? Currently I have list objects with data from previous races as data points, like so:

    horse_name driver_name [data for prev. race 1] [data for prev. race 2] [etc.]

    (I realize this question is somewhat unclear, but it's probably because I don't even know if it's the right question to be asking to begin with)

    Less important second question:
    Do you find the odds on pari-mutuel betting markets to be better or worse than those given by betting firms?

    Thank you all for your time; this is my first post but I want to thank you all for the great discussions you've had on this board so far! It's been a great joy for me to read!

    EDIT: I should probably add that I'm working with Python, and I'm writing the data to a .txt file. I really don't know anything about databases, so please excuse me if this is a dumb question.
    Last edited by Carl-Haakon; 02-08-13 at 08:01 PM.

  2. #2
    Maverick22
    Join Date: 04-10-10
    Posts: 807
    Betpoints: 58

    Are you writing the program that will read the data? If so... it probably doesn't matter how you lay it out in your file.

    If you want the absolute prettiest data file possible, write it to XML and/or write a conversion to a character-delimited file (think CSV or Excel).

    I would of course recommend folk to put their data in a relational database, but I've stopped suggesting such a thing


  3. #3
    Carl-Haakon
    Join Date: 02-08-13
    Posts: 35

    Thank you for your reply!

    I actually am writing the data as an xml-file at the moment, but I am not sure how I should go from there; as I wrote, my favourite ML-library for Python only takes data sorted into neat lines (which I assume is what you were referring to when you said "character delimited file"). What would a conversion to such a format look like?
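
    One way such a conversion might look, as a rough sketch only: each horse becomes one line, and its previous races are spread over a fixed number of column groups, padded with blanks when a horse has fewer than five. The XML element and attribute names here (`race`, `horse`, `prev`, `place`, and so on) are made up for illustration, the code is written for Python 3, and the output is a generic tab-delimited layout rather than Orange's exact header format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real element and attribute names will differ.
SAMPLE_XML = """
<race id="r1">
  <horse name="Thunder" driver="A. Olsen" finish="1">
    <prev date="2013-01-05" track="BJ" place="2"/>
    <prev date="2013-01-19" track="SO" place="4"/>
  </horse>
  <horse name="Comet" driver="B. Berg" finish="3">
    <prev date="2013-01-12" track="BJ" place="1"/>
  </horse>
</race>
"""

N_PREV = 5  # fixed number of previous-race slots per line


def flatten(xml_text):
    """Yield one flat row per horse, padding missing previous races with ''."""
    race = ET.fromstring(xml_text.strip())
    for horse in race.findall("horse"):
        row = [race.get("id"), horse.get("name"), horse.get("driver")]
        prevs = horse.findall("prev")[:N_PREV]
        for i in range(N_PREV):
            if i < len(prevs):
                p = prevs[i]
                row += [p.get("date"), p.get("track"), p.get("place")]
            else:
                row += ["", "", ""]
        row.append(horse.get("finish"))
        yield row


def to_tsv(xml_text):
    """Return the flattened rows as tab-delimited text with a header line."""
    header = ["race_id", "horse", "driver"]
    for i in range(1, N_PREV + 1):
        header += ["prev%d_date" % i, "prev%d_track" % i, "prev%d_place" % i]
    header.append("finish")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(flatten(xml_text))
    return out.getvalue()


tsv = to_tsv(SAMPLE_XML)
print(tsv)
```

    The fixed-width padding is the main design choice: every line ends up with the same number of columns, which is what a line-oriented learner expects, at the cost of blank cells for horses with a short history.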

    Actually, I wouldn't mind learning about relational databases if they could give me an edge. What are they and do you have any literature on the subject that you could recommend?

  4. #4
    339955
    Join Date: 07-20-12
    Posts: 198

    carl, just put it in sqlite3 DB. If you can use python to scrape it, it will just take you a couple days to learn how to store it and access it from sqlite3.
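
    For a sense of scale, here is a minimal sketch of storing and reading rows with Python's built-in `sqlite3` module. The table name `starts`, its columns, and the sample rows are invented for illustration and are not taken from the scraped data.

```python
import sqlite3

# ":memory:" keeps everything in RAM; pass a filename such as "races.db" to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE starts (
           race_date TEXT, track TEXT, horse TEXT, driver TEXT, place INTEGER)"""
)

# Made-up rows standing in for scraped data.
scraped = [
    ("2013-01-05", "BJ", "Thunder", "A. Olsen", 2),
    ("2013-01-19", "SO", "Thunder", "A. Olsen", 4),
    ("2013-01-12", "BJ", "Comet", "B. Berg", 1),
]
# The ? placeholders let sqlite3 handle quoting of the values.
conn.executemany("INSERT INTO starts VALUES (?, ?, ?, ?, ?)", scraped)
conn.commit()

# Read back one horse's history, oldest race first.
thunder = conn.execute(
    "SELECT race_date, place FROM starts WHERE horse = ? ORDER BY race_date",
    ("Thunder",),
).fetchall()
print(thunder)
```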

    Maverick, how do you store your data? How much data have you saved?


  5. #5
    Topo
    Join Date: 02-17-13
    Posts: 27
    Betpoints: 120

    I maintain a few small databases and have found the results disappointing. The data is nice to have, but it really just reveals what I wished were not true: I need to find another edge other than statistical history in order to turn a betting profit. That said, if you look at smaller betting markets then you can acquire an advantage through data analysis. But any reasonably sized sports betting market will probably not be beatable through data analysis alone.

    SQL Server Express and MySQL are fairly simple to learn and free to acquire. You can learn how to structure and edit your databases using these programs in a few days. Querying them is easy too once you read a tutorial or two on Structured Query Language. There are many free tutorials offered on the web.

  6. #6
    littlezola
    Join Date: 01-29-12
    Posts: 98
    Betpoints: 448

    Quote Originally Posted by Topo View Post
    But any reasonably sized sports betting market will probably not be beatable through data analysis alone.

    Why not?
    Last edited by littlezola; 02-17-13 at 10:15 PM. Reason: bad formatting

  7. #7
    Carl-Haakon
    Join Date: 02-08-13
    Posts: 35

    Quote Originally Posted by littlezola View Post
    Quote Originally Posted by Topo View Post
    But any reasonably sized sports betting market will probably not be beatable through data analysis alone.
    Why not?
    This caught my interest as well; would you care to explain what you mean, Topo?

  8. #8
    Jontheman
    Join Date: 09-09-08
    Posts: 139
    Betpoints: 4073

    Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.

  9. #9
    littlezola
    Join Date: 01-29-12
    Posts: 98
    Betpoints: 448

    Quote Originally Posted by Jontheman View Post
    Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.
    Which presumes that -
    A) All possible avenues of interpreting data have been exhausted
    B) All markets are efficient.

    Prove A or B.

  10. #10
    sbrhedge
    Join Date: 01-18-11
    Posts: 1,354
    Betpoints: 87

    Quote Originally Posted by Jontheman View Post
    Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.
    trading and sportsbetting have one thing in common: the more data that is available, the worse the index / line / point spread gets out of line. the problem is not data itself, the problem is the number of boneheads that use the data. the advent of electronic data/trading has pretty much guaranteed a fund manager's career. sportsbetting seems to have substantially more horrible lines than say 5-10 years ago.

  11. #11
    Carl-Haakon
    Join Date: 02-08-13
    Posts: 35

    Quote Originally Posted by sbrhedge View Post
    trading and sportsbetting have one thing in common: the more data that is available, the worse the index / line / point spread gets out of line. the problem is not data itself, the problem is the number of boneheads that use the data. the advent of electronic data/trading has pretty much guaranteed a fund manager's career. sportsbetting seems to have substantially more horrible lines than say 5-10 years ago.
    Everywhere else I've read (and I'm mostly talking old HTT posts) everybody seems to be of the opposite opinion, but I'll bite. Do you have data to support your assertion, and in which sports/markets?

  12. #12
    Carl-Haakon
    Join Date: 02-08-13
    Posts: 35

    Quote Originally Posted by Jontheman View Post
    Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.
    But in order for this to work out in practice, you'd require limitless (or high limits) betting and big volumes. I didn't mention this in my first post, but the volume bet on horse racing in my home country is relatively small, and the fact that the market is pari-mutuel means it's useless to bet large amounts since you'd mostly be betting against yourself (thus creating a sort of self-limiting mechanism).

    This means (based on my probably flawed understanding) the market is unlikely to be efficient in any strong sense, doesn't it?
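
    The "betting against yourself" effect can be put into numbers with the usual win-pool arithmetic: the payout per unit staked is the pool after the track's cut, divided by the money on the winning horse, and your own stake is added to both. The 15% takeout and the pool sizes below are assumptions chosen for illustration, and rounding rules (breakage) are ignored.

```python
def payout_per_unit(pool, on_horse, my_bet=0.0, takeout=0.15):
    """Payout per unit staked in a win pool once my_bet has been added.

    pool     -- total money in the win pool before my bet
    on_horse -- money already on the horse before my bet
    takeout  -- fraction kept by the operator (assumed value)
    """
    return (1.0 - takeout) * (pool + my_bet) / (on_horse + my_bet)


pool, on_horse = 10000.0, 1000.0
# Larger stakes push the payout on the same horse down.
for bet in (0, 100, 1000, 5000):
    print(bet, payout_per_unit(pool, on_horse, bet))
```

    In a small pool even a modest stake moves the payout noticeably, which is the self-limiting mechanism described above.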

  13. #13
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    In regards to your first question, I would actually recommend using a database (I made a similar change myself over the past few months). I imagine python has libraries to make queries as painless as possible, and having data easily query-able is such a huge gain. Another side-gain is that as your data grows, under your current format, you'll be putting a lot of stuff in memory and coding probably-complicated lines to order or organize that data. With a DB, you can just query for data you need at whatever moment with powerful SQL syntax.
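
    As an illustration of querying for just the data you need, here is a sketch that pulls a horse's last five starts before a given date and averages the finishing places. It uses Python's `sqlite3` purely as an example engine; the `starts` table, its columns, and the sample rows are hypothetical, and average place is only a stand-in for whatever "fitness" measure ends up being useful.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE starts (horse TEXT, race_date TEXT, place INTEGER)")

# Made-up history for one horse: (day of January 2013, finishing place).
history = [(1, 6), (5, 3), (9, 2), (13, 4), (17, 1), (21, 2), (25, 5)]
rows = [("Thunder", "2013-01-%02d" % day, place) for day, place in history]
conn.executemany("INSERT INTO starts VALUES (?, ?, ?)", rows)


def recent_form(conn, horse, before_date, n=5):
    """Average finishing place over the horse's last n starts before a date.

    Dates are stored as ISO strings (YYYY-MM-DD), so text ordering matches
    chronological ordering. Returns None when the horse has no earlier starts.
    """
    cur = conn.execute(
        "SELECT place FROM starts WHERE horse = ? AND race_date < ? "
        "ORDER BY race_date DESC LIMIT ?",
        (horse, before_date, n),
    )
    places = [r[0] for r in cur]
    return sum(places) / float(len(places)) if places else None


form = recent_form(conn, "Thunder", "2013-01-26")
print(form)
```

    Filtering on `race_date < ?` keeps the feature restricted to information that was available before the race being predicted.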

    Just my 2 cents.

    And I really hope to prove Topo wrong, but his statement definitely resonates loudly with me.


  14. #14
    Carl-Haakon
    Join Date: 02-08-13
    Posts: 35

    Quote Originally Posted by Blax0r View Post
    In regards to your first question, I would actually recommend using a database (I made a similar change myself over the past few months).
    Thank you! I've been looking into databases ever since I first ran into this problem. I guess I just have to look up more internet tutorials on the subject and tinker more.

    Quote Originally Posted by Blax0r View Post
    And I really hope to prove Topo wrong, but his statement definitely resonates loudly with me.
    If you don't mind my asking, how far have you come in your modelling efforts?

  15. #15
    Blax0r
    Join Date: 10-13-10
    Posts: 688
    Betpoints: 1512

    Yea it's surprisingly not too difficult (I hooked up Matlab and PostgreSQL), and the technical gains are astounding. But you will undergo some time-cost in order to develop a solid DB schema, tune tables for better performance, and develop a way to create your INSERT scripts from your current data.
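
    The "INSERT scripts from your current data" step might look roughly like this. Note the swap: this sketch uses Python's `sqlite3` rather than the Matlab and PostgreSQL setup mentioned above, and the record shape, table names, and columns are all assumptions modelled on the list objects described in the first post.

```python
import sqlite3

# Hypothetical records shaped like the existing list objects:
# (horse_name, driver_name, [(date, track, place), ...])
records = [
    ("Thunder", "A. Olsen", [("2013-01-05", "BJ", 2), ("2013-01-19", "SO", 4)]),
    ("Comet", "B. Berg", [("2013-01-12", "BJ", 1)]),
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE starters (id INTEGER PRIMARY KEY, horse TEXT, driver TEXT);
    CREATE TABLE prev_races (
        starter_id INTEGER REFERENCES starters(id),
        race_date TEXT, track TEXT, place INTEGER);
    """
)

with conn:  # commits on success, rolls back if an exception is raised
    for horse, driver, prevs in records:
        cur = conn.execute(
            "INSERT INTO starters (horse, driver) VALUES (?, ?)", (horse, driver)
        )
        # lastrowid is the id just assigned to this starter; each nested
        # previous race becomes its own row pointing back to it.
        conn.executemany(
            "INSERT INTO prev_races VALUES (?, ?, ?, ?)",
            [(cur.lastrowid,) + p for p in prevs],
        )

n_starters = conn.execute("SELECT COUNT(*) FROM starters").fetchone()[0]
n_prev = conn.execute("SELECT COUNT(*) FROM prev_races").fetchone()[0]
print(n_starters, n_prev)
```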

    Last year was my first "live" year, finished negatively (~10%) though after being up a decent amount (~25-30%) just before the Summer Olympics (I bet tennis). At this point, Topo's point has been true for me, but there are quite a few things I want to try out before declaring defeat.


  16. #16
    Miz
    Join Date: 08-30-09
    Posts: 695
    Betpoints: 3162

    It may be true for TOPO, but it isn't true for everyone.

  17. #17
    strixee
    I think, therefore I win
    Join Date: 05-31-10
    Posts: 432

    My guess is you might need like 6 tables : Horses, Owners, Jockeys, Tracks, Races, Results.
    Start the process with ER modeling, that'll give you the exact answer.
    MySQL works well with hundreds of thousands of records on shared hosting; for 10M+ you should use a VPS.
    If you're running it on your own PC, you won't have any performance troubles for a long time.
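
    A first guess at those six tables could be sketched like this. It is written against SQLite through Python's `sqlite3` rather than MySQL, and every column and key here is an assumption; proper ER modeling of the actual scraped fields would likely change it.

```python
import sqlite3

# One table per entity, plus results as the link between races and horses.
SCHEMA = """
CREATE TABLE owners  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE horses  (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                      owner_id INTEGER REFERENCES owners(id));
CREATE TABLE jockeys (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tracks  (id INTEGER PRIMARY KEY, code TEXT UNIQUE NOT NULL);
CREATE TABLE races   (id INTEGER PRIMARY KEY, race_date TEXT NOT NULL,
                      track_id INTEGER NOT NULL REFERENCES tracks(id));
CREATE TABLE results (race_id INTEGER NOT NULL REFERENCES races(id),
                      horse_id INTEGER NOT NULL REFERENCES horses(id),
                      jockey_id INTEGER REFERENCES jockeys(id),
                      place INTEGER,
                      PRIMARY KEY (race_id, horse_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created.
tables = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

    With results keyed on (race_id, horse_id), a horse's previous races no longer need to be stored alongside each start; they can be looked up from its earlier rows in the same table.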

  18. #18
    TravisVOX
    Join Date: 12-25-12
    Posts: 30
    Betpoints: 861

    FWIW -- I have maintained a "horse racing" database for a handful of years now but spent the last few months redesigning it because it wasn't scaling well (it was basically a dump of data, no real structure). I use MySQL because I'm a hack PHP programmer and they play together well. Like with any sport, the database is big. For example, six months of data represents about 25,000 races, 215,000 starters, 65,000 horses etc. So, a database is the route you want to go.
