1. #71
    Maverick22
    Maverick22's Avatar Become A Pro!
    Join Date: 04-10-10
    Posts: 798
    Betpoints: 42

    You asked for a book about building sports databases

    But fair enough. Building a college basketball database... I could explain to you how I would do it. I don't know if it's THE way, but I'm sure it's a good way nonetheless. How good are your programming skills?

    What do you want a database of? Statistics, I'm sure... At the game level? At the "play" level? Do you want to use Excel, or a proper "relational" database?

    Tell me what you are trying to do, and how you think you will get the data, and I can try to explain from there.

  2. #72
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    Essentially, I used to know how to program 15 years ago, but haven't done it in a LONG time, so if I saw it I would understand it, but to program it myself I'm back to the sheer basics.

    But basically I want to build a database that can measure stats and backtest last year's spreads against last year's stats, so this coming year I'll be ready with something that's tested. But I want to give different stats different weights and say, put 20% on Adj PPP, 15% on offensive rebounds, 10% on this stat, 5% on that stat... compare two teams that have played, then match up the spread my thing pops out vs. what the spread actually was, and see if my spreads come closer to what happened than the books' spreads did. Then adjust the weights on different stats to find the mix that beats the spread in the most games.

    I.e., as an easy example I'll use money lines. Say I just used one stat, free throw % (for no reason, just as an example). Basically it would go through all the games played, and my system would spit out (since there's only one stat) that 100% of games should be won by teams with better FT% averages.
    Then it would match that up with what really happened... and find that maybe 51% of games were won by teams with better FT%. Then I would take a different weighting of stats and see if I can find a balance where 60% win, or something.

    I haven't totally thought it out, I have no sources, and have no real background in this, but when I challenge myself to do something, I don't sleep until it has been accomplished.

    So any and all help is appreciated! (and I do like Excel and am fluent with programming in there, and I hate Access if by "relational" databases you meant using Access. Also due to this thread I'm reading up on Python)
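    A toy Python sketch of the weighting-and-comparison idea described above may help make it concrete; every stat name, number, and weight here is invented purely for illustration:

    ```python
    # Toy sketch of the weighted-stat backtest idea above.
    # All team stats, weights, and lines are made-up illustrations.

    def predict_margin(team_a, team_b, weights):
        """Weighted sum of stat differences -> predicted margin for team_a."""
        return sum(w * (team_a[s] - team_b[s]) for s, w in weights.items())

    # Hypothetical weights on hypothetical stats.
    weights = {"adj_ppp": 20, "oreb_pct": 15, "ft_pct": 5}

    team_a = {"adj_ppp": 1.12, "oreb_pct": 0.36, "ft_pct": 0.74}
    team_b = {"adj_ppp": 1.08, "oreb_pct": 0.33, "ft_pct": 0.70}

    my_line = predict_margin(team_a, team_b, weights)

    book_line = 1.5        # hypothetical closing spread
    actual_margin = 4      # hypothetical final margin

    # Whose number came closer to what actually happened?
    my_error = abs(actual_margin - my_line)
    book_error = abs(actual_margin - book_line)
    print(my_line, my_error, book_error)
    ```

    Run this over every game from last season, tallying how often my_error beats book_error, then adjust the weights and repeat until the tally stops improving.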

    BTP
    Week 9
    3-1-1 285 pts


  3. #73
    Maverick22
    Maverick22's Avatar Become A Pro!
    Join Date: 04-10-10
    Posts: 798
    Betpoints: 42

    Alright... This will take some thinking

    But in the meantime, you should start thinking about and REALLY understanding regular expressions. That is... unless you want to "scrape" all this data by hand...

    If you want to do it programmatically (which is always my first choice)... web scraping and regular expressions go hand in hand (at least for me).

    I'll start a new thread when I am able to think up how best to do it [or alternatively how I would do it]
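    For anyone new to them, here is a minimal Python illustration of regular expressions pulling data out of scraped HTML; the markup below is invented, not from any real site:

    ```python
    import re

    # Invented HTML of the kind a scrape might return.
    html = """
    <tr><td>Duke</td><td>78</td></tr>
    <tr><td>North Carolina</td><td>74</td></tr>
    """

    # One pattern pulls a (team, score) pair out of each table row.
    pattern = re.compile(r"<td>([^<]+)</td><td>(\d+)</td>")
    games = pattern.findall(html)
    print(games)
    ```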

  4. #74
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    Great!! I look forward to working on this! Thanks so much for the help!



  5. #75
    oldstylecubsfan
    oldstylecubsfan's Avatar Become A Pro!
    Join Date: 07-15-10
    Posts: 181
    Betpoints: 42

    Wow, thanks man, this is exactly the kind of tutorial I was looking for

  6. #76
    bztips
    bztips's Avatar Become A Pro!
    Join Date: 06-03-10
    Posts: 283

    Help!
    I've used the python/BeautifulSoup approach suggested above, and it has worked great... until now.

    I'm trying to scrape some data from: http://mlb.mlb.com/schedule/
    If I go to that page in my web browser, I see 3 days worth of schedules and probable starting pitchers -- that's the data I want.
    But if I try to scrape and read it into a soup, none of that data is there. In Firefox I can see the HTML that has the data I want by running the Firebug add-in, but none of it is there when the page is pushed through BeautifulSoup.

    Anyone have any ideas what's causing this? (Points to be awarded if you can help me figure this out.)

    TIA.

  7. #77
    bztips
    bztips's Avatar Become A Pro!
    Join Date: 06-03-10
    Posts: 283

    never mind, just figured it out.
    (I would have just deleted the previous msg, but I can't figure out how to do that!)

  8. #78
    Indecent
    Indecent's Avatar Become A Pro!
    Join Date: 09-08-09
    Posts: 757
    Betpoints: 1143

    Quote Originally Posted by bztips View Post
    never mind, just figured it out.
    (I would have just deleted the previous msg, but I can't figure out how to do that!)
    What was it? Maybe you will help someone else if it wasn't a small oversight on your part.

  9. #79
    bztips
    bztips's Avatar Become A Pro!
    Join Date: 06-03-10
    Posts: 283

    Quote Originally Posted by Indecent View Post
    What was it? Maybe you will help someone else if it wasn't a small oversight on your part.
    Well, this won't mean much to anyone who's not semi-conversant with Python (I'm a beginner myself, just started using it a couple of weeks ago, but doing simple scrapes is pretty easy):
    As it turns out, the URL I cited runs some scripts after being accessed that change the page, and those changes are not picked up when the page is read by Python. But by using the prettify() method of BeautifulSoup, I was able to see a reference to a related URL that contains the same data I was looking for, without the scripting problem.

    I guess there's a small lesson to be learned: prettify() can be pretty handy if you need to actually look at the HTML to figure out what's going on.
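    A standard-library sketch of the same inspection trick: when a page builds its content with scripts, the raw HTML often still references the URL that actually carries the data, and you can scan for it. The markup and path below are invented:

    ```python
    import re

    # Invented raw HTML: the visible div is empty, and a script
    # reference points at where the data really lives.
    raw_html = """
    <div id="schedule"></div>
    <script src="/components/schedule_data.js"></script>
    """

    # Collect src references you could try fetching directly.
    refs = re.findall(r'src="([^"]+)"', raw_html)
    print(refs)
    ```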

  10. #80
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    prettify() is definitely a great tool (this from another beginner)!



  11. #81
    Wojo
    Wojo's Avatar Become A Pro!
    Join Date: 03-19-10
    Posts: 1,765
    Betpoints: 9513

    I haven't read the entire thread, but this is pretty damn cool if it enables me/us to build a database that we can query against. Thanks to all for this educational trip.

  12. #82
    uva3021
    uva3021's Avatar Become A Pro!
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 30

    python is very nice, but the scraping power of excel is vastly underrated

    with just some fundamental understanding of vb, you can scrape years and years worth of data and display it in a nice table all with one macro

    run it, come back in an hour and everything is set up perfectly, just finished my nfl database and it took one coded sequence and about 30 minutes, now i have lines and totals for every team going back to 2000

  13. #83
    Spektre
    Spektre's Avatar Become A Pro!
    Join Date: 02-28-10
    Posts: 122
    Betpoints: 742

    Quote Originally Posted by uva3021 View Post
    python is very nice, but the scraping power of excel is vastly underrated

    with just some fundamental understanding of vb, you can scrape years and years worth of data and display it in a nice table all with one macro

    run it, come back in an hour and everything is set up perfectly, just finished my nfl database and it took one coded sequence and about 30 minutes, now i have lines and totals for every team going back to 2000
    Would you mind sharing? I am looking to make an NFL database and my needs are small...historical scores and lines. Where did you scrape from? Mind sharing your macro/vb code?

    Thanks in advance,

    Spektre

  14. #84
    knelson
    knelson's Avatar Become A Pro!
    Join Date: 08-01-10
    Posts: 197

    Thanks. This was very helpful. I'll be sure to try this out.

  15. #85
    TakeIt
    TakeIt's Avatar Become A Pro!
    Join Date: 04-23-10
    Posts: 774
    Betpoints: 13486

    Quote Originally Posted by uva3021 View Post
    python is very nice, but the scraping power of excel is vastly underrated

    with just some fundamental understanding of vb, you can scrape years and years worth of data and display it in a nice table all with one macro

    run it, come back in an hour and everything is set up perfectly, just finished my nfl database and it took one coded sequence and about 30 minutes, now i have lines and totals for every team going back to 2000
    does this include 1st and 2nd half lines and totals?

  16. #86
    uva3021
    uva3021's Avatar Become A Pro!
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 30

    Here is the excel macro VB code to get NFL data from Statfox


    Sub NFLfromStatfox()
    '
    ' Get NFL Data from Statfox -- Enter specified years in the appropriate variable definitions
    '

    '
    On Error Resume Next
    Dim Datenum As Integer
    Dim Datestart As Integer
    Dim sht As String
    Dim i As Integer
    Dim n As Integer
    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual

    Datenum = 2009 'This corresponds to the desired starting data that fits the StatFox linking structure
    Datestart = 2000
    For n = 0 To (Datenum - Datestart)

    For i = 1 To Range("NFLteams").Rows.Count 'NFLteams refers to the Range of teams labeled in the workbook
    sht = Range("NFLteams").Cells(i, 1).Value
    Worksheets.Add().Name = sht & Datenum
    Sheets(sht).Select
    With ActiveSheet.QueryTables.Add(Connection:= _
    "URL;~" & Datenum & "~teamid~" & sht & ".htm", _
    Destination:=Range("$A$1"))
    .Name = sht & Datenum
    .FieldNames = True
    .RowNumbers = False
    .FillAdjacentFormulas = False
    .PreserveFormatting = True
    .RefreshOnFileOpen = False
    .BackgroundQuery = True
    .RefreshStyle = xlInsertDeleteCells
    .SavePassword = False
    .SaveData = True
    .AdjustColumnWidth = True
    .RefreshPeriod = 0
    .WebSelectionType = xlSpecifiedTables
    .WebFormatting = xlWebFormattingNone
    .WebTables = "4,7,8"
    .WebPreFormattedTextToColumns = True
    .WebConsecutiveDelimitersAsOne = True
    .WebSingleBlockTextImport = False
    .WebDisableDateRecognition = True
    .WebDisableRedirections = False
    .Refresh BackgroundQuery:=False
    End With
    With Range("A2", Cells(Rows.Count, 1).End(xlUp))
    .SpecialCells(xlCellTypeBlanks, xlTextValues).EntireRow.Delete
    End With
    Next i
    Datenum = Datenum - 1 'Count down or Count up from a given date
    Next n
    Application.ScreenUpdating = True
    Application.Calculation = xlCalculationAutomatic

    End Sub

    What you need to do before running is create a table with the names of each team according to the parameters that correspond to the web address, i.e. San Diego = San+Diego, Giants = NY+Giants

    You can speed up the process by importing a team report or statistics table from the website and replacing every space with '+' (Find ' ', Replace with '+'). You can copy and paste from the site, or the more efficient way is to import through Excel's web query browser.


    Then select all the teams in the table and define a name for the range; a brief look at the code shows I named the range "NFLteams".

    That's it -- run the code. It takes a while but gets the job done. The workbook will contain however many years' worth of sheets for each team, all named with the same format, team and year (Arizona2009, Washington2004, NY+Jets2007, etc...)

    Each sheet will have the date, opponent, score, line, total, ATS result, total result, and offense/defense stats -- rushing yds, RYPA, passing yards, PYPC
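    As a hedged Python aside, the space-to-plus substitution described above is exactly what URL form encoding does, so the team parameters can be generated instead of find-and-replaced by hand; the team names are examples only:

    ```python
    from urllib.parse import quote_plus

    # Example team names; quote_plus turns spaces into '+'
    # in the same form the gamelog URLs expect.
    teams = ["San Diego", "NY Giants", "Green Bay"]
    params = [quote_plus(t) for t in teams]
    print(params)
    ```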
    Last edited by sbr.rodrigo; 09-08-14 at 01:52 PM.
    Nomination(s):
    This post was nominated 1 time. People who nominated: vavoulas

  17. #87
    tlfs
    tlfs's Avatar Become A Pro!
    Join Date: 08-09-10
    Posts: 7

    This is all very intriguing but I have a dumb question -- what specific kind of raw data is being acquired? Or is that not the point, just that this is how to acquire any data you want?

  18. #88
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    Quote Originally Posted by uva3021 View Post
    Here is the excel macro VB code to get NFL data from Statfox

    [macro code and setup instructions snipped; see post #86 above]
    I think I did all this and ran the macro... and it came up with sheets named 2000-2009 and Sheet42 through Sheet1, all of which say "2009 season game log" "Select Season:" and nothing else.
    Why are all my sheets blank?


    EDIT: NEVER MIND... I re-opened it in a separate sheet and tried again, and it worked this time... dunno what I did differently.

    Anyhow, this is AMAZING!!!!!!!!!! Thanks for sharing! Now to attempt to make this work for my NCAA basketball....
    Last edited by sbr.rodrigo; 09-08-14 at 01:50 PM.



  19. #89
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    Ok... now I've tried to do this with Statfox's CBB... but I can't get anything to show up in my tables. How do you determine the numbers in .WebTables = "4,7,8"?
    I feel like this is where the major problem is going to come into play, since not all the teams have the same data format... but even still, I can't get any numbers to show up. The row headers do show up: "#G SU W-L ATS W-L O/U/P Over % Under % PF PA Margin"

    But no numbers. Any help?



  20. #90
    Atomicdog
    Atomicdog's Avatar Become A Pro!
    Join Date: 06-21-10
    Posts: 61

    Thanks a lot, Good info.

  21. #91
    uva3021
    uva3021's Avatar Become A Pro!
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 30

    Quote Originally Posted by CrimsonQueen View Post
    Ok... now I've tried to do this with Statfox's CBB ... but I can't get anything to show up in my tables. How do you determine "WebTables = "4,7,8""
    I feel like this is where the major problem is going to come into play as not all the teams have the same format in data... but even still, I can't get any numbers to show up, but the headers of rows will show up " #G SU W-L ATS W-L O/U/P Over % Under % PF PA Margin "

    But no numbers. Any help?
    For each sport you should record a macro importing the data, then run a loop around it. To help you out, though, here is the table-extraction code for Statfox NCAAB

    ...
    Sheets(sht).Select
    With ActiveSheet.QueryTables.Add(Connection:= _
    "URL;http://www.statfox.com/cbb/cbbteam~teamid~" & sht & "~season~" & Datenum & "~log~1.htm", _
    Destination:=Range("$A$1"))

    .Name = sht & Datenum
    .FieldNames = True
    .RowNumbers = False
    .FillAdjacentFormulas = False
    .PreserveFormatting = True
    .RefreshOnFileOpen = False
    .BackgroundQuery = True
    .RefreshStyle = xlInsertDeleteCells
    .SavePassword = False
    .SaveData = True
    .AdjustColumnWidth = True
    .RefreshPeriod = 0
    .WebSelectionType = xlSpecifiedTables
    .WebFormatting = xlWebFormattingNone
    .WebTables = "6,7,9"
    .WebPreFormattedTextToColumns = True
    .WebConsecutiveDelimitersAsOne = True
    .WebSingleBlockTextImport = False
    .WebDisableDateRecognition = True
    .WebDisableRedirections = False
    .Refresh BackgroundQuery:=False
    End With
    With Range("A6", Cells(Rows.Count, 1).End(xlUp))
    .SpecialCells(xlCellTypeBlanks, xlTextValues).EntireRow.Delete
    End With
    ...


    The snippets changed from the NFL version are the URL pattern, the .WebTables numbers ("6,7,9"), and the starting cell of the cleanup range ("A6"). The previous NFL macro accidentally stripped the headers from the columns in each sheet; to fix that for the NFL, change 'With Range("A2")...' to 'With Range("A3")' and the headers will be kept.
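    For readers comfortable with Python, here is a sketch of taking the guesswork out of .WebTables: parse the downloaded page once and note which table number contains a header you recognize. The HTML here is invented, and Excel's table numbering may not match a simple count exactly (nested tables complicate it), so treat the result as a starting point only:

    ```python
    from html.parser import HTMLParser

    class TableFinder(HTMLParser):
        """Record which table numbers contain a given piece of text."""

        def __init__(self, needle):
            super().__init__()
            self.needle = needle
            self.table_no = 0      # running count of <table> tags seen
            self.in_table = False
            self.matches = []      # table numbers whose text contains needle

        def handle_starttag(self, tag, attrs):
            if tag == "table":
                self.table_no += 1
                self.in_table = True

        def handle_endtag(self, tag):
            if tag == "table":
                self.in_table = False

        def handle_data(self, data):
            if self.in_table and self.needle in data:
                self.matches.append(self.table_no)

    # Invented page: a navigation table followed by a stats table.
    html = ('<table><tr><td>nav links</td></tr></table>'
            '<table><tr><td>ATS W-L</td></tr></table>')

    finder = TableFinder("ATS W-L")
    finder.feed(html)
    print(finder.matches)
    ```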
    Points Awarded:

    CrimsonQueen gave uva3021 1 SBR Point(s) for this post.


  22. #92
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    Sir, you are my hero.
    If you have any other wisdom to share on building databases and models, PLEASE share it!



  23. #93
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    Can you explain how you determine which webtables to use? The 2008-2009 seasons don't work right because the table is in a different spot.
    Thanks again!
    EDIT: I guessed a number of times and eventually figured out that it was tables 13 and 14 that I needed. But there has to be a way to figure it out without all the guesswork, no?
    Last edited by CrimsonQueen; 08-10-10 at 11:21 PM.



  24. #94
    uva3021
    uva3021's Avatar Become A Pro!
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 30

    There certainly are ways to isolate the table you want without trial and error. I suggest googling the DOM method of website data extraction, which essentially lets you look for text inside the website's HTML itself and select the table it lies within. Then you can run a loop around the code like before.

    Mine worked fine, but I built the tables a few weeks ago, and each sub-site of Statfox used the exact same table format for each sport. Maybe it was changed for whatever reason.

    Like I said, import a table through the Import Web Table service in Excel and find the correct table numbers that way, then run the loop

  25. #95
    craigpb
    craigpb's Avatar SBR PRO
    Join Date: 06-19-08
    Posts: 581
    Betpoints: 1708

    This is some cool information. Thanks a lot.

    BTP
    Week 7
    3-2-0 169 pts


  26. #96
    spud58
    spud58's Avatar Become A Pro!
    Join Date: 02-23-10
    Posts: 178

    I'd have to say it's all about the WNBA

  27. #97
    diecast
    diecast's Avatar Become A Pro!
    Join Date: 11-20-09
    Posts: 140
    Betpoints: 98

    Gotta love Python! I've been using a script I wrote for odds and live game scores with BeautifulSoup. Well done posting yours up here.

  28. #98
    neila
    neila's Avatar Become A Pro!
    Join Date: 07-09-10
    Posts: 106

    What about that model? Since the OP decided not to pursue this thread, let's see if we can move it forward a bit. This is, after all, the Think Tank, so let's think about it. I'm currently building an NBA model. I have built two other successful models in other fields of endeavor, so I have a little experience that I can share. I built my first model over 20+ years ago using Lotus 123. If you know what that is, you're getting old like me. Today I use Perl and MySQL.
    OK, let's assume everyone now knows how to get the data they want and load it into Excel or MySQL or whatever. What do we do now?
    First, a caveat. Information that is readily available and fairly easy to obtain is NOT profitable. If you think that running a few macros or Excel functions on your favorite sport will give you long-term profit, you're going to be disappointed. If it were that easy we would all be wealthy. The program I have been writing for the last 2 months is already over 1,500 lines of code, and it's not finished. It also is currently not profitable, returning only about a 53% win percentage ATS. Of course, I have only just recently brought it to the point where I can backtest it.
    It really is about re-inventing the wheel to a large extent. You have to build the databases and write the programs so you can discover those "gems" that the masses do not have. You will spend as much time thinking about your model as building it. You have to look at the information that everyone else has in a new light. It is doable, but it ain't easy.
    It is commonly understood that only 2-3% of gamblers are successful long term. Now you know why. If any of you are interested in pursuing this further, let's do it. If not, have a nice day.
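    To put that 53% in context: at standard -110 pricing a bettor needs roughly 52.4% ATS just to break even. A toy Python tally of the kind of backtest described above, with invented records:

    ```python
    # Invented backtest records: (model said cover?, actually covered?)
    results = [(True, True), (True, False), (False, False),
               (True, True), (False, True)]

    hits = sum(1 for predicted, actual in results if predicted == actual)
    win_pct = hits / len(results)

    # Break-even ATS rate at -110: risk 110 to win 100.
    break_even = 110 / 210
    print(win_pct, win_pct > break_even)
    ```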

  29. #99
    Wrecktangle
    Wrecktangle's Avatar Become A Pro!
    Join Date: 03-01-09
    Posts: 1,524
    Betpoints: 1222

    Quote Originally Posted by neila View Post
    You have to build the databases and write the programs so you can discover those "gems" that the masses do not have. You will spend as much time thinking about your model as building it. You have to look at the information that everyone else has in a new light. It is doable, but it ain't easy.
    It is commonly understood that only 2 - 3% of gamblers are successful long term. Now you know why. If any of you are interested in pursuing this further, lets do it. If not have a nice day.
    OK, what's keeping the masses from reading this post?

    2-3% is currently too high a number, pros are leaving the biz because it has become much less profitable due to lack of square money to inflate the market.

  30. #100
    neila
    neila's Avatar Become A Pro!
    Join Date: 07-09-10
    Posts: 106

    Quote Originally Posted by Wrecktangle View Post
    OK, whats keeping the masses from reading this post?.
    Absolutely nothing. But, when you consider the difficulty and time involved to gain a long term edge, I think you can safely eliminate the vast majority.

    2-3% is currently too high a number, pros are leaving the biz because it has become much less profitable due to lack of square money to inflate the market.
    Good. That should increase the ratio of squares to sharps. By the way where are all the pros going? Are they becoming plumbers?

  31. #101
    Flying Dutchman
    Floggings continue until morale improves
    Flying Dutchman's Avatar Become A Pro!
    Join Date: 05-17-09
    Posts: 2,468

    Quote Originally Posted by neila View Post
    Good. That should increase the ratio of squares to sharps. By the way where are all the pros going? Are they becoming plumbers?
    Those fvkkers are becoming the homeless on the streets of San Diego. Can't go downtown without falling over these worthless shiits hitting you up for money/drugs/women.

  32. #102
    WindMaster
    WindMaster's Avatar Become A Pro!
    Join Date: 08-22-10
    Posts: 3

    Great info, thx a lot

  33. #103
    pedro803
    pedro803's Avatar Become A Pro!
    Join Date: 01-02-10
    Posts: 309
    Betpoints: 5708

    I tried running the Visual Basic/Excel macro code above, but it didn't work for me; it created all the sheets with the teams and the years, but they are all empty. Maybe this weekend I will try again -- I have ordered the Visual Basic for Dummies book, so maybe that will help too! I didn't realize you could scrape with Excel. Thanks for the info UVA -- now I just hope I can get it to work

  34. #104
    uva3021
    uva3021's Avatar Become A Pro!
    Join Date: 03-01-07
    Posts: 537
    Betpoints: 30

    Quote Originally Posted by pedro803 View Post
    I tried running the visual basic/excel macro code above but it didn't work for me, it created all the sheets with the teams and the years but they are all empty. Maybe this weekend I will try again -- I have order the Visual Basic for Dummies book so maybe that will help too! I didn't realize you could scrape with Excel, thanks for the info UVA -- now I just hope I can get it to work
    stick a comment mark, ', before the "On Error Resume Next" line, then post the error message, if any

    it could merely be Statfox being offline, or a bad internet connection

  35. #105
    CrimsonQueen
    CrimsonQueen's Avatar SBR PRO
    Join Date: 08-12-09
    Posts: 1,068
    Betpoints: 12

    I got blank tabs the first time too; I think it was cuz I named the column the wrong thing. Double-check everything; that's what I did and it worked the second time


