An introduction to research

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Maverick22
    SBR Wise Guy
    • 04-10-10
    • 807

    #71
    You asked for a book about building sports databases

    But fair enough. Building a college basketball database... I could explain to you how I would do it. I dont know if its THE way. But I'm sure a good way none the less. How good are your programming skills?

    What do you want a database of? Statistics I'm sure... At the game level? At the "play" level? Do you want to use excel or a more "relational" database save for the word "real"?

    Tell me what you are trying to do. And how you think you will get the data and I can try to explain from there.
    Comment
    • CrimsonQueen
      SBR MVP
      • 08-12-09
      • 1068

      #72
      Essentially I used to know how to program 15 years ago, but haven't done it in a LONG time, so if I saw it, I would understand it, but to program it myself, I'm back to the shear basics.

      But basically I want to build a database that can measure stats and back test last years spreads against last years stats so this coming year I'll be ready with something that's tested. But I want to give different stats different weights and say, put 20% on Adj PPP, 15% on Offensive rebounds, 10% on this stat, 5% on that stat... compare two teams that have played.. then match up what spread my thing pops out vs. what the spread was...see if my spreads match closer to what happened vs. what the books spreads were. Then adjust different stats to different levels to find the optimal amount of games that beat the spreads at the different percentage weights of different stats.

      EI. as an easy example I'll use money lines. If I just used 1 stat, Free Throw % (for no reason, just as an example). So basically it would go through all the games played and my system would then spit out (since there's only one stat) 100% of games should be won by teams with better FT% averages.
      Then it would match it up with what really happened... and say maybe 51% of games were won by teams with better FT%'s. Then I would take a different weighing of stats and see if I can find a balance where 60% win or something.

      I haven't totally thought it out, I have no sources, and have no real background in this, but when I challenge myself to do something, I don't sleep until it has been accomplished.

      So any and all help is appreciated! (and I do like Excel and am fluent with programming in there, and I hate Access if by "relational" databases you meant using Access. Also due to this thread I'm reading up on Python)
      Comment
      • Maverick22
        SBR Wise Guy
        • 04-10-10
        • 807

        #73
        Alright... This will take some thinking

        But in the mean time, you should start thinking on and REALLY understanding regular expressions. That is... unless you want to "scrape" all this data by hand...

        If you want to do it programattically (which is always my first choice)... web scraping and regular expressions go hand in hand (at least for me).

        I'll start a new thread when I am able to think up how best do it [or alternately how i would do it]
        Comment
        • CrimsonQueen
          SBR MVP
          • 08-12-09
          • 1068

          #74
          Great!! I look forward to working on this! Thanks so much for the help!
          Comment
          • oldstylecubsfan
            SBR High Roller
            • 07-15-10
            • 184

            #75
            wow thanks man this is exactly the kind of tutorial i was looking for
            Comment
            • bztips
              SBR Sharp
              • 06-03-10
              • 283

              #76
              Help!
              I've used the python/BeautifulSoup approach suggested above, and it has worked great... until now.

              I'm trying to scrape some data from: http://mlb.mlb.com/schedule/
              If I go to that page in my web browser, I see 3 days worth of schedules and probable starting pitchers -- that's the data I want.
              But if I try to scrape and read it into a soup, none of that data is there. In Firefox, I can see the html that has the data I want by running the Firebug add-in, but none of it is there in when pushed through BeautifulSoup.

              Anyone have any ideas what's causing this? (Points to be awarded if you can help me figure this out.)

              TIA.
              Comment
              • bztips
                SBR Sharp
                • 06-03-10
                • 283

                #77
                never mind, just figured it out.
                (I would have just deleted the previous msg, but I can't figure out how to do that!)
                Comment
                • Indecent
                  SBR Wise Guy
                  • 09-08-09
                  • 758

                  #78
                  Originally posted by bztips
                  never mind, just figured it out.
                  (I would have just deleted the previous msg, but I can't figure out how to do that!)
                  What was it? Maybe you will help someone else if it wasn't a small oversight on your part.
                  Comment
                  • bztips
                    SBR Sharp
                    • 06-03-10
                    • 283

                    #79
                    Originally posted by Indecent
                    What was it? Maybe you will help someone else if it wasn't a small oversight on your part.
                    Well, this won't mean much to anyone who's not semi-conversant with python (I'm a beginner myself, just started using it a couple weeks ago, but doing simple scrapes is pretty easy):
                    As it turns out, the url I cited runs some scripts after being accessed that change the page, which are not picked up when read by python. But by using the prettify() method of BeautifulSoup, I was able to see the reference to a related url that contains the same data I was looking for, without the scripting problem.

                    I guess there' a small lesson to be learned: prettify() can be pretty handy if you need to actually look at the html to figure out what's going on.
                    Comment
                    • CrimsonQueen
                      SBR MVP
                      • 08-12-09
                      • 1068

                      #80
                      prettify() is definitely a great tool (this from another beginner)!
                      Comment
                      • Wojo
                        SBR MVP
                        • 03-19-10
                        • 1764

                        #81
                        I haven't read the entire thread, but this is pretty damn cool if it enables me/us to build a database that we can query against. Thanks to all for this educational trip.
                        Comment
                        • uva3021
                          SBR Wise Guy
                          • 03-01-07
                          • 537

                          #82
                          python is very nice, but the scraping power of excel is vastly underrated

                          with just some fundamental understanding of vb, you can scrape years and years worth of data and display it in a nice table all with one macro

                          run it, come back in an hour and everything is set up perfectly, just finished my nfl database and it took one coded sequence and about 30 minutes, now i have lines and totals for every team going back to 2000
                          Comment
                          • Spektre
                            SBR High Roller
                            • 02-28-10
                            • 184

                            #83
                            Originally posted by uva3021
                            python is very nice, but the scraping power of excel is vastly underrated

                            with just some fundamental understanding of vb, you can scrape years and years worth of data and display it in a nice table all with one macro

                            run it, come back in an hour and everything is set up perfectly, just finished my nfl database and it took one coded sequence and about 30 minutes, now i have lines and totals for every team going back to 2000
                            Would you mind sharing? I am looking to make an NFL database and my needs are small...historical scores and lines. Where did you scrape from? Mind sharing your macro/vb code?

                            Thanks in advance,

                            Spektre
                            Comment
                            • knelson
                              Restricted User
                              • 08-01-10
                              • 197

                              #84
                              Thanks. This was very helpful. I'll be sure to try this out.
                              Comment
                              • TakeIt
                                SBR Wise Guy
                                • 04-23-10
                                • 778

                                #85
                                Originally posted by uva3021
                                python is very nice, but the scraping power of excel is vastly underrated

                                with just some fundamental understanding of vb, you can scrape years and years worth of data and display it in a nice table all with one macro

                                run it, come back in an hour and everything is set up perfectly, just finished my nfl database and it took one coded sequence and about 30 minutes, now i have lines and totals for every team going back to 2000
                                does this include 1st and 2nd half lines and totals?
                                Comment
                                • uva3021
                                  SBR Wise Guy
                                  • 03-01-07
                                  • 537

                                  #86
                                  Here is the excel macro VB code to get NFL data from Statfox


                                  '
                                  ' Get NFL Date from Statfox -- Enter specified years in the appropriate variable definitions
                                  '

                                  '
                                  On Error Resume Next
                                  Dim Datenum As Integer
                                  Dim Datestart As Integer
                                  Dim sht As String
                                  Dim i As Integer
                                  Dim n As Integer
                                  Application.ScreenUpdating = False
                                  Application.Calculation = xlCalculationManual

                                  Datenum = 2009 'This corresponds to the desired starting data that fits the StatFox linking structure
                                  Datestart = 2000
                                  For n = 0 To (Datenum - Datestart)

                                  For i = 1 To Range("NFLteams").Rows.Count 'NFLteams refers to the Range of teams labeled in the workbook
                                  sht = Range("NFLteams").Cells(i, 1).Value
                                  Worksheets.Add().Name = sht & Datenum
                                  Sheets(sht).Select
                                  With ActiveSheet.QueryTables.Add(Connection:= _
                                  "URL;~" & Datenum & "~teamid~" & sht & ".htm", _
                                  Destination:=Range("$A$1"))
                                  .Name = sht & Datenum
                                  .FieldNames = True
                                  .RowNumbers = False
                                  .FillAdjacentFormulas = False
                                  .PreserveFormatting = True
                                  .RefreshOnFileOpen = False
                                  .BackgroundQuery = True
                                  .RefreshStyle = xlInsertDeleteCells
                                  .SavePassword = False
                                  .SaveData = True
                                  .AdjustColumnWidth = True
                                  .RefreshPeriod = 0
                                  .WebSelectionType = xlSpecifiedTables
                                  .WebFormatting = xlWebFormattingNone
                                  .WebTables = "4,7,8"
                                  .WebPreFormattedTextToColumns = True
                                  .WebConsecutiveDelimitersAsOne = True
                                  .WebSingleBlockTextImport = False
                                  .WebDisableDateRecognition = True
                                  .WebDisableRedirections = False
                                  .Refresh BackgroundQuery:=False
                                  End With
                                  With Range("A2", Cells(Rows.Count, 1).End(xlUp))
                                  .SpecialCells(xlCellTypeBlanks, xlTextValues).EntireRow.Delete
                                  End With
                                  Next i
                                  Datenum = Datenum - 1 'Count down or Count up from a given date
                                  Next n
                                  Application.ScreenUpdating = True
                                  Application.Calculation = xlCalculationAutomatic

                                  End Sub[/extra]

                                  What you need to do before running is create a table with the names of each team according to the parameters that correspond to the web address, i.e. San Diego = San+Diego, Giants = NY+Giants

                                  You can speed up the process by importing a team report or statistics table from the website and find all '" "' and replace with '+' (Find spaces and replace with +), go here and copy and paste or the more efficient way is to import through excel web service browser


                                  Then select all the teams in the table and define a name to the range, a brief survey of the code and one can see I named the range "NFLTeams"

                                  That's it, run the code. Takes a while but gets the job done. The workbook will contain however many years worth of sheets for each team, all named with the same format, team and year (Arizona2009, Washington2004, NY+Jets2007, etc...)

                                  Each sheet will have the date, opponent, score, line, total, ats results, total results, offense/defense - rushing yds, rypa, passing yards, pypc
                                  Last edited by sbr.rodrigo; 09-08-14, 12:52 PM.
                                  Comment
                                  • tlfs
                                    SBR Rookie
                                    • 08-09-10
                                    • 7

                                    #87
                                    This is all very intriguing but I have a dumb question -- what specific kind of raw data is being acquired? Or is that not the point, just that this is how to acquire any data you want?
                                    Comment
                                    • CrimsonQueen
                                      SBR MVP
                                      • 08-12-09
                                      • 1068

                                      #88
                                      Originally posted by uva3021
                                      Here is the excel macro VB code to get NFL data from Statfox

                                      Sub NFLfromStatfox()
                                      '
                                      ' Get NFL Date from Statfox -- Enter specified years in the appropriate variable definitions
                                      '

                                      '
                                      On Error Resume Next
                                      Dim Datenum As Integer
                                      Dim Datestart As Integer
                                      Dim sht As String
                                      Dim i As Integer
                                      Dim n As Integer
                                      Application.ScreenUpdating = False
                                      Application.Calculation = xlCalculationManual

                                      Datenum = 2009 'This corresponds to the desired starting data that fits the StatFox linking structure
                                      Datestart = 2000
                                      For n = 0 To (Datenum - Datestart)

                                      For i = 1 To Range("NFLteams").Rows.Count 'NFLteams refers to the Range of teams labeled in the workbook
                                      sht = Range("NFLteams").Cells(i, 1).Value
                                      Worksheets.Add().Name = sht & Datenum
                                      Sheets(sht).Select
                                      With ActiveSheet.QueryTables.Add(Connection:= _
                                      "URL;http://www.statfox.com/nfl/gamelog~season~" & Datenum & "~teamid~" & sht & ".htm", _
                                      Destination:=Range("$A$1"))
                                      .Name = sht & Datenum
                                      .FieldNames = True
                                      .RowNumbers = False
                                      .FillAdjacentFormulas = False
                                      .PreserveFormatting = True
                                      .RefreshOnFileOpen = False
                                      .BackgroundQuery = True
                                      .RefreshStyle = xlInsertDeleteCells
                                      .SavePassword = False
                                      .SaveData = True
                                      .AdjustColumnWidth = True
                                      .RefreshPeriod = 0
                                      .WebSelectionType = xlSpecifiedTables
                                      .WebFormatting = xlWebFormattingNone
                                      .WebTables = "4,7,8"
                                      .WebPreFormattedTextToColumns = True
                                      .WebConsecutiveDelimitersAsOne = True
                                      .WebSingleBlockTextImport = False
                                      .WebDisableDateRecognition = True
                                      .WebDisableRedirections = False
                                      .Refresh BackgroundQuery:=False
                                      End With
                                      With Range("A2", Cells(Rows.Count, 1).End(xlUp))
                                      .SpecialCells(xlCellTypeBlanks, xlTextValues).EntireRow.Delete
                                      End With
                                      Next i
                                      Datenum = Datenum - 1 'Count down or Count up from a given date
                                      Next n
                                      Application.ScreenUpdating = True
                                      Application.Calculation = xlCalculationAutomatic

                                      End Sub


                                      What you need to do before running is create a table with the names of each team according to the parameters that correspond to the web address, i.e. San Diego = San+Diego, Giants = NY+Giants

                                      You can speed up the process by importing a team report or statistics table from the website and find all '" "' and replace with '+' (Find spaces and replace with +), go here and copy and paste or the more efficient way is to import through excel web service browser



                                      Then select all the teams in the table and define a name to the range, a brief survey of the code and one can see I named the range "NFLTeams"

                                      That's it, run the code. Takes a while but gets the job done. The workbook will contain however many years worth of sheets for each team, all named with the same format, team and year (Arizona2009, Washington2004, NY+Jets2007, etc...)

                                      Each sheet will have the date, opponent, score, line, total, ats results, total results, offense/defense - rushing yds, rypa, passing yards, pypc
                                      I think I did all this, and ran the macro... and it came up with sheets named 2000-2009, and sheet42-sheet1, all of which say "2009 season game log" "Select Season:" and nothing else.
                                      Why are all my sheets blank ?


                                      EDIT: NEVER MIND... I re-opened it on a separate sheet and tried again and it worked this time... dunno what I did different.

                                      Anyhow, this is AMAZING!!!!!!!!!! Thanks for sharing! Now to attempt to make this work for my NCAA basketball....
                                      Last edited by sbr.rodrigo; 09-08-14, 12:50 PM.
                                      Comment
                                      • CrimsonQueen
                                        SBR MVP
                                        • 08-12-09
                                        • 1068

                                        #89
                                        Ok... now I've tried to do this with Statfox's CBB ... but I can't get anything to show up in my tables. How do you determine "WebTables = "4,7,8""
                                        I feel like this is where the major problem is going to come into play as not all the teams have the same format in data... but even still, I can't get any numbers to show up, but the headers of rows will show up " #G SU W-L ATS W-L O/U/P Over % Under % PF PA Margin "

                                        But no numbers. Any help?
                                        Comment
                                        • Atomicdog
                                          SBR Hustler
                                          • 06-21-10
                                          • 61

                                          #90
                                          Thanks a lot, Good info.
                                          Comment
                                          • uva3021
                                            SBR Wise Guy
                                            • 03-01-07
                                            • 537

                                            #91
                                            Originally posted by CrimsonQueen
                                            Ok... now I've tried to do this with Statfox's CBB ... but I can't get anything to show up in my tables. How do you determine "WebTables = "4,7,8""
                                            I feel like this is where the major problem is going to come into play as not all the teams have the same format in data... but even still, I can't get any numbers to show up, but the headers of rows will show up " #G SU W-L ATS W-L O/U/P Over % Under % PF PA Margin "

                                            But no numbers. Any help?
                                            For each sport you should record a macro importing the data then run a loop around it. To help you out though here is the table extracting code for Statfox NCAAB

                                            ...
                                            Sheets(sht).Select
                                            With ActiveSheet.QueryTables.Add(Connection:= _
                                            "URL;http://www.statfox.com/cbb/cbbteam~teamid~" & sht & "~season~" & Datenum & "~log~1.htm", _
                                            Destination:=Range("$A$1"))

                                            .Name = sht & Datenum
                                            .FieldNames = True
                                            .RowNumbers = False
                                            .FillAdjacentFormulas = False
                                            .PreserveFormatting = True
                                            .RefreshOnFileOpen = False
                                            .BackgroundQuery = True
                                            .RefreshStyle = xlInsertDeleteCells
                                            .SavePassword = False
                                            .SaveData = True
                                            .AdjustColumnWidth = True
                                            .RefreshPeriod = 0
                                            .WebSelectionType = xlSpecifiedTables
                                            .WebFormatting = xlWebFormattingNone
                                            .WebTables = "6,7,9"
                                            .WebPreFormattedTextToColumns = True
                                            .WebConsecutiveDelimitersAsOne = True
                                            .WebSingleBlockTextImport = False
                                            .WebDisableDateRecognition = True
                                            .WebDisableRedirections = False
                                            .Refresh BackgroundQuery:=False
                                            End With
                                            With Range("A6", Cells(Rows.Count, 1).End(xlUp))
                                            .SpecialCells(xlCellTypeBlanks, xlTextValues).EntireRow.Delete
                                            End With
                                            ...


                                            I put in bold the snippets that were changed. The previous NFL macro accidentally stripped the headers from the columns in each sheet (to fix for NFL change 'With Range("A2")'... to 'With Range("A3")') this will keep the headers.
                                            Comment
                                            • CrimsonQueen
                                              SBR MVP
                                              • 08-12-09
                                              • 1068

                                              #92
                                              Sir, you are my hero.
                                              If you have any other wisdom to share on building databases and models, PLEASE share it!
                                              Comment
                                              • CrimsonQueen
                                                SBR MVP
                                                • 08-12-09
                                                • 1068

                                                #93
                                                Can you explain how you determine what webtables to use? Because the 2008-2009 seasons don't work right due to the table being in a different spot.
                                                Thanks again!
                                                EDIT: I guessed a number of times and eventually figured out it was table 13,14 that I needed. But there has to be a way you can figure it out without all the guesswork, no?
                                                Last edited by CrimsonQueen; 08-10-10, 10:21 PM.
                                                Comment
                                                • uva3021
                                                  SBR Wise Guy
                                                  • 03-01-07
                                                  • 537

                                                  #94
                                                  There certainly are ways to isolate the table you want without trial and error, I suggest to google the DOM method of website data extraction, which essentially allows you to look for text inside the websites html code itself and select the respective table for which it lies within. Then you can run a loop around the code like was done previous

                                                  Mine worked fine, but I built the tables a few weeks ago and each sub-site of statfox used the exact same table format for each sport. Maybe it was changed for whatever reason

                                                  Like I said import a table through the Import web table service in excel and find the correct table numbers that way, then run the loop
                                                  Comment
                                                  • craigpb
                                                    SBR Wise Guy
                                                    • 06-19-08
                                                    • 699

                                                    #95
                                                    This is some cool information. Thanks a lot.
                                                    Comment
                                                    • spud58
                                                      SBR High Roller
                                                      • 02-23-10
                                                      • 178

                                                      #96
                                                      id have to say its all about the WNBA
                                                      Comment
                                                      • diecast
                                                        SBR High Roller
                                                        • 11-20-09
                                                        • 142

                                                        #97
                                                        gotta love python! i've been using a script i wrote for odds and live game scores with BeautifulSoup. Well done on posting yours up here.
                                                        Comment
                                                        • neila
                                                          SBR High Roller
                                                          • 07-08-10
                                                          • 106

                                                          #98
                                                          What about that model? Since the OP decided not to pursue this thread lets see if we can move it forward a bit. This is after all the Think Tank so lets think about it. I'm currently building an NBA model. I have built two other successful models in other fields of endeavor so I have a little experience that I can share. I built my first model over 20+ years ago using Lotus 123. If you know what that is your getting old like me. Today I use Perl and Mysql.
                                                          OK, lets assume everyone now knows how to get the data they want and load it into Excel or Mysql or whatever. What do we do now?
                                                          First, a caveat. Information that is readily available and fairly easy to obtain is NOT profitable. If you think that running a few macros or Excel functions in your favorite sport will give you long term profit your going to be disappointed. If it were that easy we would all be wealthy. The program I have been writing for the last 2 months is already over 1500 lines of code and its not finished. It also is currently not profitable returning only about 53% win percentage ATS. Of course, I have just recently brought it to the point where I can backtest it.
                                                          It really is about re-inventing the wheel to a large extent. You have to build the databases and write the programs so you can discover those "gems" that the masses do not have. You will spend as much time thinking about your model as building it. You have to look at the information that everyone else has in a new light. It is doable, but it ain't easy.
                                                          It is commonly understood that only 2 - 3% of gamblers are successful long term. Now you know why. If any of you are interested in pursuing this further, lets do it. If not have a nice day.
                                                          Comment
                                                          • Wrecktangle
                                                            SBR MVP
                                                            • 03-01-09
                                                            • 1524

                                                            #99
                                                            Originally posted by neila
                                                            You have to build the databases and write the programs so you can discover those "gems" that the masses do not have. You will spend as much time thinking about your model as building it. You have to look at the information that everyone else has in a new light. It is doable, but it ain't easy.
                                                            It is commonly understood that only 2 - 3% of gamblers are successful long term. Now you know why. If any of you are interested in pursuing this further, lets do it. If not have a nice day.
                                                            OK, whats keeping the masses from reading this post?

                                                            2-3% is currently too high a number, pros are leaving the biz because it has become much less profitable due to lack of square money to inflate the market.
                                                            Comment
                                                            • neila
                                                              SBR High Roller
                                                              • 07-08-10
                                                              • 106

                                                              #100
                                                              Originally posted by Wrecktangle
                                                              OK, whats keeping the masses from reading this post?.
                                                              Absolutely nothing. But, when you consider the difficulty and time involved to gain a long term edge, I think you can safely eliminate the vast majority.

                                                              2-3% is currently too high a number, pros are leaving the biz because it has become much less profitable due to lack of square money to inflate the market.
                                                              Good. That should increase the ratio of squares to sharps. By the way where are all the pros going? Are they becoming plumbers?
                                                              Comment
                                                              • Flying Dutchman
                                                                SBR MVP
                                                                • 05-17-09
                                                                • 2467

                                                                #101
                                                                Originally posted by neila
                                                                Good. That should increase the ratio of squares to sharps. By the way where are all the pros going? Are they becoming plumbers?
                                                                Those fvkkers are becoming the homeless on the streets of San Diego. Can't go downtown without falling over these worthless shiits hitting you up for money/drugs/women.

                                                                Comment
                                                                • WindMaster
                                                                  SBR Rookie
                                                                  • 08-22-10
                                                                  • 3

                                                                  #102
                                                                  Great info thx a loot
                                                                  Comment
                                                                  • pedro803
                                                                    SBR Sharp
                                                                    • 01-02-10
                                                                    • 309

                                                                    #103
                                                                    I tried running the visual basic/excel macro code above but it didn't work for me, it created all the sheets with the teams and the years but they are all empty. Maybe this weekend I will try again -- I have order the Visual Basic for Dummies book so maybe that will help too! I didn't realize you could scrape with Excel, thanks for the info UVA -- now I just hope I can get it to work
                                                                    Comment
                                                                    • uva3021
                                                                      SBR Wise Guy
                                                                      • 03-01-07
                                                                      • 537

                                                                      #104
                                                                      Originally posted by pedro803
                                                                      I tried running the visual basic/excel macro code above but it didn't work for me, it created all the sheets with the teams and the years but they are all empty. Maybe this weekend I will try again -- I have order the Visual Basic for Dummies book so maybe that will help too! I didn't realize you could scrape with Excel, thanks for the info UVA -- now I just hope I can get it to work
                                                                      stick a comment,', before the "On Error Resume Next" sequence, then post the error message, if any

                                                                      it could be merely statfox being offline, or a bad internet connection
                                                                      Comment
                                                                      • CrimsonQueen
                                                                        SBR MVP
                                                                        • 08-12-09
                                                                        • 1068

                                                                        #105
                                                                        I got blank tabs the first time too, I think cuz I named the column the wrong thing. double check everything, that's what I did and it worked the second time
                                                                        Comment
                                                                        SBR Contests
                                                                        Collapse
                                                                        Top-Rated US Sportsbooks
                                                                        Collapse
                                                                        Working...