Logit regression, R, and the NCAA tournament

  • zeros_and_ones
    SBR Rookie
    • 03-05-12
    • 3

    #1
    Logit regression, R, and the NCAA tournament
    Hi all,

    Haven't done full-on stats since college, but I started tinkering with R and wanted to use it to research NCAA tournament underdogs. My theory was this: underdogs that cover point spreads most likely maximize their possessions and give opponents fewer chances to score. In other words, teams that rebound well on both ends and don't turn the ball over should do better against the spread than others.

    To test this, I performed the following:

    1. Pulled data on all NCAA tournament teams for the past 6 years
    2. Pulled all spreads and assigned a "1" to underdogs that covered in the 1st round and a "0" to all others
    3. Uploaded that data into R (I attached the data to this message)
    4. Did not receive the responses I expected, details below:
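    For reference, the fit itself looked roughly like this (a sketch of what's in the attached code; hoops is the data frame name, and the file name below is just a placeholder):

    # load the attached data and fit a logit model on the 0/1 cover flag
    hoops <- read.csv("hoops.csv")  # placeholder name; the real data is in the attached zip
    hoopslogit <- glm(X1stdogwin ~ Reb + opp_reb + Turn + opp_turns + opp_fg_per +
                          fg_diff + to_margin + RPI + SOS + winper + asst_to,
                      family = binomial, data = hoops)
    summary(hoopslogit)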


    Deviance Residuals:
        Min      1Q  Median      3Q     Max
    -1.1604 -0.6199 -0.4857 -0.3028  2.4929

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)  15.332281   7.139957   2.147  0.03176 *
    Reb           0.002679   0.002510   1.067  0.28587
    opp_reb      -0.002790   0.002649  -1.053  0.29223
    Turn         -0.017394   0.044127  -0.394  0.69344
    opp_turns     0.010915   0.044040   0.248  0.80425
    opp_fg_per   -1.657098  12.996452  -0.128  0.89854
    fg_diff       0.708769  10.636560   0.067  0.94687
    to_margin     0.140202   1.413622   0.099  0.92100
    RPI          -0.035033   0.014554  -2.407  0.01608 *
    SOS           0.015673   0.005987   2.618  0.00885 **
    winper      -16.215107   5.006029  -3.239  0.00120 **
    asst_to      -1.827639   1.551475  -1.178  0.23880
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 278.73 on 323 degrees of freedom
    Residual deviance: 257.22 on 312 degrees of freedom
    AIC: 281.22

    Number of Fisher Scoring iterations: 5

    If I'm interpreting the output right, and I'd like to think that I am, the most statistically significant variables are RPI (Rating Percentage Index), SOS (strength of schedule), and winper (team's winning percentage), whereas Reb (team's rebounds), opp_reb (opponent's rebounds), and Turn (team's turnovers) weren't what I had hoped.

    What I'm stuck on: is this a *#$% model? I mean, I get that I should be on the lookout for teams with a better RPI than the favorite, but does this make sense to the group? If there is some value here, how do I take it to the next level? What do I do with it now, and how do I apply it to this year's field? I'm somewhat familiar with R (I did the above just by reading up on the subject) but am by no means an expert. (I added the procedures I performed to the Word document attached to this message; always show your work, right?)
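    To make the "apply it" question concrete, I assume scoring a new field would look something like this (newfield is a hypothetical data frame of this year's matchups with the same columns as my data):

    # predicted probability that each dog covers, per the fitted model
    newfield$cover_prob <- predict(hoopslogit, newdata = newfield, type = "response")
    # sort so the most promising dogs float to the top
    newfield[order(-newfield$cover_prob), ]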

    Some other thoughts: I would like to push this a bit further and add in offensive efficiency (the last piece of what I "believe" makes a good underdog). Or do I "flip" the whole model and instead look for the traits that covering favorites share?

    Any guidance would be appreciated, PM me if you feel the need.

    data and code.zip
  • TomG
    SBR Wise Guy
    • 10-29-07
    • 500

    #2
    the good news is that it's not so hard, right? r does all the work for you.

    the bad news is that your regression equation is a jumbled mess. it's a good example of why you can't just throw a bunch of stuff into the formula and expect to get anything meaningful out of it. read up on the underlying assumptions for a regression first.



    hint: try typing in pairs(RPI, SOS, winper, ft_per, score_margin, ft_per, reb_margin, to_margin, asst_to) and post a pic of the output
    • RickySteve
      Restricted User
      • 01-31-06
      • 3415

      #3
      Tommy, Heritage is originating on MLB derivatives this year.
      • TomG
        SBR Wise Guy
        • 10-29-07
        • 500

        #4
        100 limits on mlb derivatives there for me
        • RickySteve
          Restricted User
          • 01-31-06
          • 3415

          #5
          I'm sure you have 2nd cousins with higher limits.
          • zeros_and_ones
            SBR Rookie
            • 03-05-12
            • 3

            #6
            Originally posted by TomG
            the good news is that it's not so hard, right? r does all the work for you.

            the bad news is that your regression equation is a jumbled mess. it's a good example of why you can't just throw a bunch of stuff into the formula and expect to get anything meaningful out of it. read up on the underlying assumptions for a regression first.



            hint: try typing in pairs(RPI, SOS, winper, ft_per, score_margin, ft_per, reb_margin, to_margin, asst_to) and post a pic of the output
            Appreciate the response. To clarify your hint, are you suggesting running the model for each variable separately? As in the following:

            # assuming the variables live in the hoops data frame
            hoopslogit1 <- glm(X1stdogwin ~ RPI, family = binomial, data = hoops)
            summary(hoopslogit1)

            hoopslogit2 <- glm(X1stdogwin ~ reb_margin, family = binomial, data = hoops)
            summary(hoopslogit2)

            and so forth?
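            Or, just sketching, maybe a loop could run them all in one pass (again assuming everything lives in hoops):

            # one single-predictor logit per candidate variable
            for (v in c("RPI", "reb_margin", "to_margin")) {
                fit <- glm(reformulate(v, response = "X1stdogwin"), family = binomial, data = hoops)
                print(summary(fit))
            }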

            Overall, taking a step back, I agree with your feedback that the model is a bit jumbled. My first iteration focused just on rebound-related information, but none of those variables were statistically significant, hence me throwing some other stuff in. Makes sense that I can't bake a cake by throwing a bunch of crap in.
            • TomG
              SBR Wise Guy
              • 10-29-07
              • 500

              #7
              oops, looks like i didn't do that right. just do pairs(hoops) to check for multicollinearity, and don't include highly correlated variables as predictors. there are lots of ways to build models--prune top down or build bottom up. just follow the regression assumptions and see which model has the best adjusted r-squared, aic, bic, or whatever selection criterion you want to work with. i don't even understand what you are trying to predict, though, so i think you have a ways to go.
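              something like this, assuming your full fit from post #1 is called hoopslogit and your data frame is hoops:

              # eyeball pairwise relationships and correlations among candidate predictors
              # (assumes these columns are numeric)
              vars <- c("RPI", "SOS", "winper", "reb_margin", "to_margin", "asst_to")
              pairs(hoops[, vars])
              round(cor(hoops[, vars]), 2)

              # prune the full model top down by aic
              step(hoopslogit, direction = "backward")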
              • zeros_and_ones
                SBR Rookie
                • 03-05-12
                • 3

                #8
                Originally posted by TomG
                oops, looks like i didn't do that right. just do pairs(hoops) to check for multicollinearity, and don't include highly correlated variables as predictors. there are lots of ways to build models--prune top down or build bottom up. just follow the regression assumptions and see which model has the best adjusted r-squared, aic, bic, or whatever selection criterion you want to work with. i don't even understand what you are trying to predict, though, so i think you have a ways to go.
                Hi Tom,

                Appreciate the feedback; I just need to play around with it a bit more. To answer your last question (re: what I'm trying to predict): I wanted to figure out which attributes (i.e., rebound margin, turnovers, etc.) successful underdogs possess. To get the initial data set, I took all past underdog winners and coded them as 1.
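                Roughly, the coding step looked like this (dog_margin and spread are stand-ins for my actual column names):

                # 1 if the dog's final margin beat the spread, 0 otherwise
                hoops$X1stdogwin <- ifelse(hoops$dog_margin + hoops$spread > 0, 1, 0)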