How should I structure my data?

  • Carl-Haakon
    SBR Rookie
    • 02-08-13
    • 35

    #1
    How should I structure my data?
    Hello everyone!

    I've worked on a scraper to download horse-racing data and I now have data for approximately one thousand races.

    My main question:
    The data that I have scraped is "starting list information" (I don't know if that's the correct English term) combined with the results for each race. The problem is that the data doesn't fit into neat lines. What I mean is: in one race there are about ten horses, and for each horse the results of its last five races are reported (together with date and track code). I wish to use the data from those last five races to somehow estimate the "fitness" of the horse, but I don't know how to structure the data. The ML library that I usually use (Orange for Python 2.7) only takes tab-delimited data printed out on single lines, whereas my data would be better written out in a tree-like structure (or something similar). Do you guys have any ideas on what I could do for this to work? Currently I have list objects with data from previous races as data points, like so:

    horse_name driver_name [data for prev. race 1] [data for prev. race 2] [etc.]

    (I realize this question is somewhat unclear, but it's probably because I don't even know if it's the right question to be asking to begin with)

    Less important second question:
    Do you find the odds on pari-mutuel betting markets to be better or worse than those given by betting firms?

    Thank you all for your time; this is my first post but I want to thank you all for the great discussions you've had on this board so far! It's been a great joy for me to read!

    EDIT: I should probably add that I'm working with Python, and I'm writing the data to a .txt file. I really don't know anything about databases, so please excuse me if this is a dumb question.
    Last edited by Carl-Haakon; 02-08-13, 09:01 PM.
  • Maverick22
    SBR Wise Guy
    • 04-10-10
    • 807

    #2
    Are you writing the program that will read the data? If so... it probably doesn't matter how you lay it out in your file.

    If you want the absolute prettiest data file possible, write it to XML and/or write a conversion to a character-delimited file (think CSV or Excel).
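As a rough idea of what such a conversion could look like, here is a minimal Python 3 sketch. The XML layout (`race`, `horse` and `prev` elements with their attributes) is invented for illustration and will not match the actual scraper output. Each horse becomes one tab-delimited row, and its previous races are spread over five fixed groups of columns, padded with a placeholder when fewer than five are available.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real scraper output will differ.
SAMPLE_XML = """
<races>
  <race id="r1">
    <horse name="Alpha" driver="Smith" finish="1">
      <prev date="2013-01-05" track="S" place="2"/>
      <prev date="2013-01-19" track="B" place="1"/>
    </horse>
    <horse name="Bravo" driver="Jones" finish="2">
      <prev date="2013-01-12" track="S" place="4"/>
    </horse>
  </race>
</races>
"""

N_PREV = 5      # previous races reported per horse
MISSING = "?"   # placeholder; check what your ML library expects here


def flatten(xml_text):
    """Return (header, rows) with one row per horse per race."""
    header = ["race_id", "horse", "driver", "finish"]
    for i in range(1, N_PREV + 1):
        header += [f"prev{i}_date", f"prev{i}_track", f"prev{i}_place"]
    rows = []
    root = ET.fromstring(xml_text)
    for race in root.iter("race"):
        for horse in race.iter("horse"):
            row = [race.get("id"), horse.get("name"),
                   horse.get("driver"), horse.get("finish")]
            prevs = horse.findall("prev")[:N_PREV]
            for prev in prevs:
                row += [prev.get("date"), prev.get("track"), prev.get("place")]
            # Pad so that every row has the same number of columns.
            row += [MISSING] * (3 * (N_PREV - len(prevs)))
            rows.append(row)
    return header, rows


def to_tsv(header, rows):
    """Serialize header and rows as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


header, rows = flatten(SAMPLE_XML)
print(to_tsv(header, rows))
```

The fixed-width layout is a design choice: a tab-delimited learner needs every row to have the same columns, so the "tree" is unrolled into `prev1_*` through `prev5_*`.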

    I would of course recommend that folks put their data in a relational database, but I've stopped suggesting such a thing.
    • Carl-Haakon
      SBR Rookie
      • 02-08-13
      • 35

      #3
      Thank you for your reply!

      I actually am writing the data as an xml-file at the moment, but I am not sure how I should go from there; as I wrote, my favourite ML-library for Python only takes data sorted into neat lines (which I assume is what you were referring to when you said "character delimited file"). What would a conversion to such a format look like?

      Actually, I wouldn't mind learning about relational databases if they could give me an edge. What are they and do you have any literature on the subject that you could recommend?
      • 339955
        Restricted User
        • 07-20-12
        • 198

        #4
        carl, just put it in sqlite3 DB. If you can use python to scrape it, it will just take you a couple days to learn how to store it and access it from sqlite3.
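For a sense of how little code is involved, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `starts`, its columns and the sample rows are made up for illustration.

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a filename such as
# "races.db" (hypothetical) to keep the data on disk instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE starts (
        race_id   TEXT,
        race_date TEXT,     -- ISO dates (YYYY-MM-DD) sort correctly as text
        track     TEXT,
        horse     TEXT,
        driver    TEXT,
        place     INTEGER
    )
    """
)

# Parameterized inserts: each "?" placeholder is filled from the tuple.
scraped = [
    ("r1", "2013-01-05", "S", "Alpha", "Smith", 2),
    ("r2", "2013-01-19", "B", "Alpha", "Smith", 1),
    ("r2", "2013-01-19", "B", "Bravo", "Jones", 3),
]
conn.executemany("INSERT INTO starts VALUES (?, ?, ?, ?, ?, ?)", scraped)
conn.commit()

# Read back every start for one horse, newest first.
alpha = conn.execute(
    "SELECT race_date, track, place FROM starts "
    "WHERE horse = ? ORDER BY race_date DESC",
    ("Alpha",),
).fetchall()
print(alpha)
```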

        Maverick, how do you store your data? How much data have you saved?
        • Topo
          SBR Rookie
          • 02-17-13
          • 27

          #5
          I maintain a few small databases and have found the results disappointing. The data is nice to have, but it really just reveals what I wished were not true: I need to find another edge other than statistical history in order to turn a betting profit. That said, if you look at smaller betting markets then you can acquire an advantage through data analysis. But any reasonably sized sports betting market will probably not be beatable through data analysis alone.

          SQL Server Express and MySQL are fairly simple to learn and free to acquire. You can learn how to structure and edit your databases using these programs in a few days. Querying them is easy too, once you have read a tutorial or two on Structured Query Language. There are many free tutorials offered on the web.
          • littlezola
            SBR Hustler
            • 01-29-12
            • 98

            #6
            Originally posted by Topo
            But any reasonably sized sports betting market will probably not be beatable through data analysis alone.

            Why not?
            Last edited by littlezola; 02-17-13, 11:15 PM. Reason: bad formatting
            • Carl-Haakon
              SBR Rookie
              • 02-08-13
              • 35

              #7
              Originally posted by littlezola
              Originally posted by Topo
              But any reasonably sized sports betting market will probably not be beatable through data analysis alone.
              Why not?
              This caught my interest as well; would you care to explain what you mean, Topo?
              • Jontheman
                SBR High Roller
                • 09-09-08
                • 139

                #8
                Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.
                • littlezola
                  SBR Hustler
                  • 01-29-12
                  • 98

                  #9
                  Originally posted by Jontheman
                  Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.
                  Which presumes that:
                  A) All possible avenues of interpreting data have been exhausted
                  B) All markets are efficient.

                  Prove A or B.
                  • sbrhedge
                    SBR MVP
                    • 01-18-11
                    • 1354

                    #10
                    Originally posted by Jontheman
                    Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.
                    trading and sportsbetting have one thing in common: the more data that is available, the further the index / line / point spread gets out of line. the problem is not data itself, the problem is the number of boneheads that use the data. the advent of electronic data/trading has pretty much guaranteed a fund manager's career. sportsbetting seems to have substantially more horrible lines than, say, 5-10 years ago.
                    • Carl-Haakon
                      SBR Rookie
                      • 02-08-13
                      • 35

                      #11
                      Originally posted by sbrhedge
                      trading and sportsbetting have one thing in common: the more data that is available, the further the index / line / point spread gets out of line. the problem is not data itself, the problem is the number of boneheads that use the data. the advent of electronic data/trading has pretty much guaranteed a fund manager's career. sportsbetting seems to have substantially more horrible lines than, say, 5-10 years ago.
                      Everywhere else I've read (and I'm mostly talking old HTT posts) everybody seems to be of the opposite opinion, but I'll bite. Do you have data to support your assertion, and in which sports/markets?
                      • Carl-Haakon
                        SBR Rookie
                        • 02-08-13
                        • 35

                        #12
                        Originally posted by Jontheman
                        Presumably because, if this data is freely available, then many people have already done what you have done and their betting patterns have adjusted typical lines; and so prices already accurately reflect the information you are acquiring.
                        But in order for this to work out in practice, you'd require unlimited (or at least high-limit) betting and big volumes. I didn't mention this in my first post, but the volume bet on horse racing in my home country is relatively small, and the fact that the market is pari-mutuel means it's useless to bet large amounts, since your own stake goes into the pool and drags down your own payout, so you'd mostly be betting against yourself (thus creating a sort of self-limiting mechanism).

                        This means (based on my probably flawed understanding) the market is unlikely to be efficient in any strong sense, doesn't it?
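The self-limiting effect can be illustrated with a little arithmetic. The sketch below assumes a simplified win pool in which the pool, minus a takeout, is split among winning tickets. The 15% takeout and the pool sizes are illustrative assumptions, and real pools also apply rounding ("breakage") that is ignored here.

```python
def parimutuel_payout(pool_total, pool_on_horse, my_bet=0.0, takeout=0.15):
    """Gross payout per unit staked on a horse in a simplified win pool.

    The pool (minus the takeout) is shared among the winning tickets, so
    adding my own bet to the pool lowers the payout I receive per unit.
    """
    net_pool = (1.0 - takeout) * (pool_total + my_bet)
    return net_pool / (pool_on_horse + my_bet)


# Illustrative numbers only: a pool of 10,000 with 1,000 on our horse.
for bet in (0, 100, 1000, 5000):
    payout = parimutuel_payout(10_000, 1_000, my_bet=bet)
    print(f"bet={bet:>5}  payout per unit={payout:.3f}")
```

With these assumed numbers the payout per unit falls as the stake grows, which is the sense in which a large bettor ends up betting against himself in a small pool.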
                        • Blax0r
                          SBR Wise Guy
                          • 10-13-10
                          • 688

                          #13
                          In regards to your first question, I would actually recommend using a database (I made a similar change myself over the past few months). I imagine Python has libraries to make queries as painless as possible, and having data easily queryable is such a huge gain. Another side gain is that as your data grows, under your current format, you'll be putting a lot of stuff in memory and writing probably complicated code to order or organize that data. With a DB, you can just query for the data you need at whatever moment with powerful SQL syntax.
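As an illustration of querying only what is needed, here is a sketch that pulls a horse's last five starts before a given date and turns them into a single number. It uses Python's `sqlite3`; the `starts` table, its columns, and the use of average finishing place as a stand-in for "fitness" are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE starts (horse TEXT, race_date TEXT, place INTEGER)")
conn.executemany(
    "INSERT INTO starts VALUES (?, ?, ?)",
    [
        ("Alpha", "2012-11-03", 6),
        ("Alpha", "2012-11-17", 5),
        ("Alpha", "2012-12-01", 4),
        ("Alpha", "2012-12-15", 3),
        ("Alpha", "2013-01-05", 2),
        ("Alpha", "2013-01-19", 1),
        ("Bravo", "2013-01-12", 4),
    ],
)


def recent_form(conn, horse, before_date, n=5):
    """Average finishing place over the horse's last n starts before a date."""
    rows = conn.execute(
        """
        SELECT place FROM starts
        WHERE horse = ? AND race_date < ?
        ORDER BY race_date DESC
        LIMIT ?
        """,
        (horse, before_date, n),
    ).fetchall()
    if not rows:
        return None  # no history available before that date
    return sum(place for (place,) in rows) / len(rows)


print(recent_form(conn, "Alpha", "2013-02-01"))
print(recent_form(conn, "Bravo", "2013-02-01"))
```

Filtering on `race_date < ?` matters for modelling: it keeps results from the race being predicted (and later ones) out of the feature.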

                          Just my 2 cents.

                          And I really hope to prove Topo wrong, but his statement definitely resonates loudly with me.
                          • Carl-Haakon
                            SBR Rookie
                            • 02-08-13
                            • 35

                            #14
                            Originally posted by Blax0r
                            In regards to your first question, I would actually recommend using a database (I made a similar change myself over the past few months).
                            Thank you! I've been looking into databases ever since I first ran into this problem. I guess I just have to look up more internet tutorials on the subject and tinker more.

                            Originally posted by Blax0r
                            And I really hope to prove Topo wrong, but his statement definitely resonates loudly with me.
                            If you don't mind my asking, how far have you come in your modelling efforts?
                            • Blax0r
                              SBR Wise Guy
                              • 10-13-10
                              • 688

                              #15
                              Yeah, it's surprisingly not too difficult (I hooked up Matlab and PostgreSQL), and the technical gains are astounding. But you will incur some time cost in order to develop a solid DB schema, tune tables for better performance, and develop a way to create your INSERT scripts from your current data.
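For the last step, one possible approach is to expand the existing nested records into flat rows and hand them to a parameterized bulk insert. This sketch uses Python with `sqlite3` rather than the Matlab and PostgreSQL setup mentioned above, and the record shape is only a guess at the list format described in post #1.

```python
import sqlite3

# Hypothetical records shaped like the list format from post #1:
# (horse_name, driver_name, [previous races]), where each previous race
# is (date, track_code, place).
records = [
    ("Alpha", "Smith", [("2013-01-05", "S", 2), ("2013-01-19", "B", 1)]),
    ("Bravo", "Jones", [("2013-01-12", "S", 4)]),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prev_races "
    "(horse TEXT, driver TEXT, race_date TEXT, track TEXT, place INTEGER)"
)

# Expand the nested lists into one flat row per (horse, previous race).
flat_rows = [
    (horse, driver, date, track, place)
    for horse, driver, prev in records
    for date, track, place in prev
]

with conn:  # commits on success, rolls back if an insert fails
    conn.executemany("INSERT INTO prev_races VALUES (?, ?, ?, ?, ?)", flat_rows)

count = conn.execute("SELECT COUNT(*) FROM prev_races").fetchone()[0]
print(count)
```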

                              Last year was my first "live" year. I finished down (~10%) after being up a decent amount (~25-30%) just before the Summer Olympics (I bet tennis). At this point, Topo's point has been true for me, but there are quite a few things I want to try out before declaring defeat.
                              • Miz
                                SBR Wise Guy
                                • 08-30-09
                                • 695

                                #16
                                It may be true for TOPO, but it isn't true for everyone.
                                • strixee
                                  SBR Sharp
                                  • 05-31-10
                                  • 432

                                  #17
                                  My guess is you might need about 6 tables: Horses, Owners, Jockeys, Tracks, Races, Results.
                                  Start the process with ER modeling; that'll give you the exact answer.
                                  MySQL works well with hundreds of thousands of records on shared hosting; for 10M+ you should use a VPS.
                                  If you're running it on your own PC, you won't have any performance troubles for a long time.
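A first draft of those six tables might look like the sketch below. It uses SQLite through Python's `sqlite3` instead of MySQL so that it runs without a server, and every column name is an assumption; proper ER modeling of the real data would likely change the details.

```python
import sqlite3

SCHEMA = """
CREATE TABLE owners  (owner_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE jockeys (jockey_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tracks  (track_id  INTEGER PRIMARY KEY, code TEXT NOT NULL UNIQUE);
CREATE TABLE horses (
    horse_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    owner_id INTEGER REFERENCES owners(owner_id)
);
CREATE TABLE races (
    race_id   INTEGER PRIMARY KEY,
    track_id  INTEGER NOT NULL REFERENCES tracks(track_id),
    race_date TEXT NOT NULL
);
-- One row per horse per race: this is where the "tree" gets flattened.
CREATE TABLE results (
    race_id   INTEGER NOT NULL REFERENCES races(race_id),
    horse_id  INTEGER NOT NULL REFERENCES horses(horse_id),
    jockey_id INTEGER REFERENCES jockeys(jockey_id),
    place     INTEGER,
    PRIMARY KEY (race_id, horse_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript(SCHEMA)

# List the tables that were created.
tables = sorted(
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
print(tables)
```

With this layout a horse's previous races no longer need to be stored with each start: they are simply its earlier rows in `results`, joined to `races` for the date and track.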
                                  • TravisVOX
                                    SBR Rookie
                                    • 12-25-12
                                    • 30

                                    #18
                                    FWIW -- I have maintained a "horse racing" database for a handful of years now but spent the last few months redesigning it because it wasn't scaling well (it was basically a dump of data, no real structure). I use MySQL because I'm a hack PHP programmer and they play together well. Like with any sport, the database is big. For example, six months of data represents about 25,000 races, 215,000 starters, 65,000 horses etc. So, a database is the route you want to go.