Want your own college basketball database using python?

Waterstpub87 · 11-30-20 09:24 PM

Step 1: Download anaconda from here: https://www.anaconda.com/products/individual

Step 2: Open Spyder that is installed with it

Step 3: Paste the code below into a blank script, pasting over the lines that get automatically added at the top of the script

Step 4: Click the run button

You might get errors based on installation issues. Post them here and I'll tell you how to fix them.

If you want more years, go to

"years = ['2020']" in the code

add any years you want to it, with single quotes around each one, e.g. '2019'

Use a comma to separate them, e.g. ['2020','2019','2018']

I haven't tested many past years, so there could be formatting errors.
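
For anyone adding a long run of seasons, here is a minimal sketch of building the list without typing every year by hand. The 2018 to 2020 range and the names first_year and last_year are only examples, not part of the script. The URL pattern is the one the script already uses.

```python
# Build the list of season strings instead of typing each one.
# The 2018-2020 range is just an example; change it to the seasons you want.
first_year = 2018
last_year = 2020

# The script glues year into a URL, so each entry must be a string.
years = [str(y) for y in range(last_year, first_year - 1, -1)]
print(years)

# Same URL pattern the script builds for each season.
for year in years:
    schoolsurl = "https://www.sports-reference.com/cbb/seasons/" + year + "-school-stats.html"
    print(schoolsurl)
```

The resulting years list can be pasted in place of years = ['2020'].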

Let me know if any of the data is incorrect. This is a year old and I haven't used it (very tired of losing at college basketball).
It will give you a csv file with the data, wherever you have the script saved.

Code:

# -*- coding: utf-8 -*-
"""
Created on Mon Nov 30 20:10:35 2020

@author: Waterstpub87
"""

import numpy as np
import pandas as pd

years = ['2020']

for year in years:

        schoolsurl = "https://www.sports-reference.com/cbb/seasons/" + year + "-school-stats.html"

        schools = pd.read_html(schoolsurl)


        df = schools[0]

        df = df[list(df)]
        

        scl = df['Overall']

        scl['School'] = scl['School'].str.replace('NCAA','')
        scl['School'] = scl['School'].str.strip()
        scl.index = scl['School']
        scl['URL'] = scl['School']
        scl['URL'] = scl['URL'].str.replace(' ','-')
        scl['URL'] = scl['URL'].str.replace('.','')
        scl['URL'] = scl['URL'].str.replace('&','')
        scl['URL'] = scl['URL'].str.replace('(','')
        scl['URL'] = scl['URL'].str.replace(')','')
        scl['URL'] = scl['URL'].str.replace("'",'')
        scl['URL'] = scl['URL'].str.replace("--",'-')
        scl['URL'] = scl['URL'].str.lower()
        scl['URL'] = scl['URL'].str.replace('little-rock','arkansas-little-rock')
        scl['URL'] = scl['URL'].str.replace('uc-','california-')
        scl['URL'] = scl['URL'].str.replace('university-of-california','california')
        scl['URL'] = scl['URL'].str.replace('purdue-fort-wayne','ipfw')
        scl['URL'] = scl['URL'].str.replace('fort-wayne','ipfw')
        scl['URL'] = scl['URL'].str.replace('omaha','nebraska-omaha')
        scl['URL'] = scl['URL'].str.replace('siu-edwardsville','southern-illinois-edwardsville')
        scl['URL'] = scl['URL'].str.replace('texas-rio-grande-valley','texas-pan-american')
        #scl['URL'] = scl['URL'].str.replace('vmi','virginia-military-institute')
        scl['URL'] = scl['URL'].str.replace('cal-state-long-beach','long-beach-state')
        scl.loc['Louisiana']['URL']='louisiana-lafayette'
        scl.loc['VMI']['URL']='virginia-military-institute'
        scl = scl[scl['School'] != 'Overall']
        scl = scl[scl['School'] != 'School']
        
        scl.index = scl['URL']


        for x in scl['URL']:
            try:
                url = 'https://www.sports-reference.com/cbb/schools/' + x + '/' + year + '-gamelogs.html'
                data = pd.read_html(url)
                data = data[0]
                data = data[list(data)]
                
                data['School1'] = scl.loc[x]['School']
                if x == 'abilene-christian' and years[0]==year:
                    results = data
                else:
                    results = results.append(data)
            except:
                    pass
        results.to_csv('CBBDB.csv')
        cols = ['G','Date','Location','Opp','Results','Points','Points Against','FG','FGA','FG%','3P','3PA','3P%','FT','FTA','FT%','ORB','TRB','AST','STL','BLK','TOV','PF','Blank','OPPFP','OPFPA','OPFG%','OPP3P', 'OPP3PA','OPP3P%','OPPFT','OPPFTA','OPPFT%','OPPORB','OPPTRB','OPPAST','OPPSTL', 'OPPBLK','OPPTOV','OPPPF','School']
        results.columns = cols
        mid = results['School']
        results.drop(labels=['School'], axis=1,inplace = True)
        results.insert(2, 'School', mid)
            
        # Drop the spacer column and the percentage columns
        results.drop(labels=['Blank'], axis=1,inplace = True)
        results.drop(labels=['FG%'], axis=1,inplace = True)
        results.drop(labels=['3P%'], axis=1,inplace = True)
        results.drop(labels=['FT%'], axis=1,inplace = True)
        results.drop(labels=['OPFG%'], axis=1,inplace = True)
        results.drop(labels=['OPP3P%'], axis=1,inplace = True)
        results.drop(labels=['OPPFT%'], axis=1,inplace = True)
    


        results = results[results.Date != 'School']
        results = results[results.Date != 'Date']
        results= results.fillna(0)
        counter = 6
        cols = list(results)
        while counter < 34:
            column = cols[counter]
            results[column] = results[column].astype(int)
            counter = counter +1


        results['Pace'] = (.50*(results['FGA'] + (.49*results['FTA']) + results['TOV'] - results['ORB'])) + (.50 *     (results['OPFPA'] + (.49 * results['OPPFTA'])-results['OPPORB']+results['OPPTOV']))
        results['School'] = results['School'].str.replace('Cal State Long Beach','Long Beach State')
        results['School']= results['School'].str.replace('SIU Edwardsville','Southern Illinois-Edwardsville')
        results['School']= results['School'].str.replace('VMI','Virginia Military Institute')
    
        results['Opp']= results['Opp'].str.replace('UMBC','Maryland-Baltimore County')
        results['Opp']= results['Opp'].str.replace('UNLV','Nevada-Las Vegas')
        results['Opp']= results['Opp'].str.replace('Detroit','Detroit Mercy')
        results['Opp']= results['Opp'].str.replace('BYU','Brigham Young')
        results['Opp']= results['Opp'].str.replace('Southern Miss','Southern Mississippi')
        results['Opp']= results['Opp'].str.replace('UTEP','Texas-El Paso')
        results['Opp']= results['Opp'].str.replace('UTSA','Texas-San Antonio')
        results['Opp']= results['Opp'].str.replace('UCF','Central Florida')
        results['Opp']= results['Opp'].str.replace('LSU','Louisiana State')
        results['Opp']= results['Opp'].str.replace('Ole Miss','Mississippi')
        results['Opp']= results['Opp'].str.replace('LIU-Brooklyn','Long Island University')
        results['Opp']= results['Opp'].str.replace('UMass-Lowell','Massachusetts-Lowell')
        results['Opp']= results['Opp'].str.replace('California','University of California')
        results['Opp']= results['Opp'].str.replace('USC','Southern California')
        results['Opp']= results['Opp'].str.replace('UConn','Connecticut')
        results['Opp']= results['Opp'].str.replace('UMass','Massachusetts')
        results['Opp']= results['Opp'].str.replace('UCSB','UC-Santa Barbara')
        results['Opp']= results['Opp'].str.replace('UNC Wilmington','North Carolina-Wilmington')
        results['Opp']= results['Opp'].str.replace("St. Peter's","Saint Peter's")
        results['Opp']= results['Opp'].str.replace('UNC Asheville','North Carolina-Asheville')
        results['Opp']= results['Opp'].str.replace('NC State','North Carolina State')
        results['Opp']= results['Opp'].str.replace('UNC','North Carolina')
        results['Opp']= results['Opp'].str.replace('Central Connecticut','Central Connecticut State')
        results['Opp']= results['Opp'].str.replace('UT-Martin','Tennessee-Martin')
        results['Opp']= results['Opp'].str.replace('TCU','Texas Christian')
        results['Opp']= results['Opp'].str.replace("Saint Mary's","Saint Mary's (CA)")
        results['Opp']= results['Opp'].str.replace("Pitt","Pittsburgh")
        results['Opp']= results['Opp'].str.replace("VCU","Virginia Commonwealth")
        results['Opp']= results['Opp'].str.replace("UIC","Illinois-Chicago")
        results['Opp']= results['Opp'].str.replace("SMU","Southern Methodist")
        results['Opp']= results['Opp'].str.replace("Penn","Pennsylvania")
        results['Opp']= results['Opp'].str.replace("USC Upstate","South Carolina Upstate")
        results['Opp']= results['Opp'].str.replace("UMKC","Missouri-Kansas City")
        results['Opp']= results['Opp'].str.replace("UNC Greensboro","North Carolina-Greensboro")
        results['Opp']= results['Opp'].str.replace("St. Joseph's","Saint Joseph's")
        results['Opp']= results['Opp'].str.replace("ETSU","East Tennessee State")
        results['Opp']= results['Opp'].str.replace("Pennsylvania State","Penn State")
        results['Opp']= results['Opp'].str.replace("North Carolina Greensboro","North Carolina-Greensboro")
        results['Opp']= results['Opp'].str.replace("Southern California Upstate","South Carolina Upstate")
        results['Opp']= results['Opp'].str.replace("University of California Baptist","California Baptist")
        results['Opp']= results['Opp'].str.replace('SIU-Edwardsville','Southern Illinois-Edwardsville')
        results['Opp']= results['Opp'].str.replace('VMI','Virginia Military Institute')
            
        results.to_csv('CBBD'+year+'.csv')
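
One note for anyone on a newer install: DataFrame.append, which the loop uses to stack each team's game log, was removed in pandas 2.0. Below is a minimal sketch of the same accumulation with pd.concat. The two small frames are made-up stand-ins for the scraped game logs, and frames is a name introduced here, not part of the original script.

```python
import pandas as pd

# Made-up stand-ins for two scraped game logs (not real data).
log_a = pd.DataFrame({'G': [1, 2], 'Opp': ['Team X', 'Team Y']})
log_b = pd.DataFrame({'G': [1], 'Opp': ['Team Z']})

frames = []  # collect each team's table here instead of calling results.append
for school, data in [('School A', log_a), ('School B', log_b)]:
    data = data.copy()
    data['School1'] = school  # same tagging step as the script
    frames.append(data)       # plain list append, works on any pandas version

# Stack everything once at the end.
results = pd.concat(frames)
print(results)
```

Collecting into a list also means the first team no longer needs special handling to create results.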

Waterstpub87 · 11-30-20 09:26 PM

Couple of errors can show up in the paste. If you see

df = df

[list(df)]

on separate lines, the brackets should be next to the df on the same line: df = df[list(df)]

Waterstpub87 · 11-30-20 09:32 PM

This will throw warnings in the console. Don't worry about it. I never cared to fix it.
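
The warnings are most likely pandas' SettingWithCopyWarning. That is an assumption, since the exact message isn't quoted in the thread. Chained assignments such as scl.loc['VMI']['URL'] = ... tend to trigger it. Here is a minimal sketch of the single-step .loc[row, column] form on a made-up two-row table:

```python
import pandas as pd

# Made-up two-row stand-in for the scl table, indexed by school name.
scl = pd.DataFrame(
    {'School': ['VMI', 'Yale'], 'URL': ['vmi', 'yale']},
    index=['VMI', 'Yale'],
)

# One .loc call with both row and column labels writes to scl itself,
# rather than to a temporary row object.
scl.loc['VMI', 'URL'] = 'virginia-military-institute'
print(scl)
```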

Waterstpub87 · 12-01-20 09:19 PM

You get a spreadsheet that looks like this, with all information, roughly 40 columns

G	Date	School	Location	Opp	Results	Points	Points Against	FG	FGA	3P
1	11/5/2019	Abilene Christian	0	Arlington Baptist	W	90	39	36	75	6
2	11/10/2019	Abilene Christian	@	Drexel	L (1 OT)	83	86	30	65	9
3	11/16/2019	Abilene Christian	0	Pepperdine	L	69	73	20	50	5
4	11/18/2019	Abilene Christian	@	Nevada-Las Vegas	L	58	72	22	59	7

jacksonstreet · 12-01-20 10:17 PM

If you find a way to include the closing line and % bet on each side, I'll show you a way to predict point spread winners at a 70%+ clip.

Waterstpub87 · 12-01-20 10:19 PM

Originally Posted by jacksonstreet

If you find a way to include the closing line and % bet on each side, I'll show you a way to predict point spread winners at a 70%+ clip.

Hard to get that data. Also hard to know if it's right. In the past I compared SBR to Vegas Insider (I think), and the numbers were completely different.

Waterstpub87 · 12-02-20 04:52 PM

For those who sent me PMs, my box was full. I have space now

Fullkelly · 12-02-20 06:28 PM

You're in high demand, the PM box is full again.

Waterstpub87 · 12-02-20 06:39 PM

Originally Posted by Fullkelly

You're in high demand, the PM box is full again.

Should be good now.

jacksonstreet · 12-02-20 10:09 PM

Originally Posted by Waterstpub87

Hard to get that data. Also hard to know if it's right. In the past I compared SBR to Vegas Insider (I think), and the numbers were completely different.

Yeah - I think that's intentional. Closing lines are pretty easy to get, but if we had access to accurate %'s bet on each side, we'd be able to build a model that would win at such a high rate that books would be out of business eventually.

jonal · 12-04-20 01:24 PM

Hello,

I get the following error when I tried to run the script. Any suggestions?

Thanks in advance,

---------------------------------------------------------------------------------------------------

runfile('C:/Users/Jonathan/.spyder-py3/temp.py', wdir='C:/Users/Jonathan/.spyder-py3')
  File "C:\Users\Jonathan\.spyder-py3\temp.py", line 27
    scl = df['Overall']
    ^
IndentationError: unexpected indent

KVB · 12-04-20 01:29 PM

Nice work Water St.

KVB · 12-04-20 01:33 PM

Originally Posted by jacksonstreet

...but if we had access to accurate %'s bet on each side, we'd be able to build a model that would win at such a high rate that books would be out of business eventually.

This is absolutely not true. The books can control that action, it's the purpose of the line.

They wouldn't dig their own graves. In fact, what they would do, and they do do, is exploit your thinking and take full advantage of running you, and bettors like you, in circles.

We witness it every single day and I have helped write the programs to exploit the bettors reliably.

One of the issues with the integrity of the information is that what many sources put out as betting percentages are simply a survey of their website members or traffic, and nothing more.

Information that some of us get access to, from books, including back end from Vegas, can be misleading because the individual books are under no obligation to report the truth here.

It's another thing we see every single day.

While I often say we all have access to the same information and it's just how you use it, when it comes to some information it just isn't true. Some of us have access to info others don't, and we respect that privilege.

Waterstpub87 · 12-04-20 01:39 PM

Originally Posted by jonal

Hello,

I get the following error when I tried to run the script. Any suggestions?

Thanks in advance,

---------------------------------------------------------------------------------------------------

runfile('C:/Users/Jonathan/.spyder-py3/temp.py', wdir='C:/Users/Jonathan/.spyder-py3')
  File "C:\Users\Jonathan\.spyder-py3\temp.py", line 27
    scl = df['Overall']
    ^
IndentationError: unexpected indent

Some of the indenting gets messed up on the paste. Hit tab once on line 27 and see if that fixes it.

jonal · 12-04-20 02:18 PM

Originally Posted by Waterstpub87

Some of the indenting gets messed up on the paste. Hit tab once on line 27 and see if that fixes it.

It fixed the error, but now when I run the script nothing appears in the Console. Is there supposed to be a message returned when the script successfully runs?

Waterstpub87 · 12-04-20 02:32 PM

No. Nothing should appear in the console. You should have 2 csv files wherever you have the script saved if it ran successfully. You might get warnings depending on your Python version.

jonal · 12-04-20 03:22 PM

Originally Posted by Waterstpub87

No. Nothing should appear in the console. You should have 2 csv files wherever you have the script saved if it ran successfully. You might get warnings depending on your Python version.

so the last line of your code is this:

results.to_csv('CBBD'+year+'.csv')

Am I supposed to add a file location? Sorry for all the questions.

Waterstpub87 · 12-04-20 03:43 PM

No, should be saved where the script is saved.

If you want to add a directory to put it in a specific place, do this:

results.to_csv('C:\\documents\\' + 'CBBD'+year+'.csv')

Use whatever folder you want. Put it in single quotes and double every backslash.

So C:\documents\
becomes 'C:\\documents\\'
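
Another way to do the same thing, sketched with the standard library's pathlib so the backslashes don't need doubling. The folder here is a temporary directory created just for the demo, and out_dir and out_file are hypothetical names; swap in your own folder.

```python
import csv
import tempfile
from pathlib import Path

year = '2020'

# Demo folder only; replace with your own, e.g. Path(r'C:\documents').
out_dir = Path(tempfile.mkdtemp())

# Path objects join with "/", so no manual slash handling is needed.
out_file = out_dir / ('CBBD' + year + '.csv')

# Write a tiny placeholder file to show the path works; in the script
# this would be results.to_csv(out_file) instead.
with open(out_file, 'w', newline='') as f:
    csv.writer(f).writerow(['G', 'Date', 'School'])

print(out_file)
```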

mikmak · 12-08-20 03:11 PM

First of all, thank you so much for sharing this. I'm an IT professional but not a programmer. I've decided to teach myself Python for scraping and try to build a database so I can back test my models. Right now, I'm using Excel and it does an ok job using power query for scraping but it runs out of memory and is slow as dirt. I also want to automate so much of what I am currently doing and Excel just isn't going to cut the mustard.

I experienced the indent error discussed above and had to play around with the tabs in the [list(df)] section. I'm past that issue, but now I'm getting the following error:

Code:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\mikma\.spyder-py3\temp.py", line 27, in <module>
    scl['School'] = scl['School'].str.replace('NCAA','')

  File "C:\Users\mikma\anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)

  File "C:\Users\mikma\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'School'

Any ideas? And thank you again so much for trying to help others on this board. We need more people like you on the interwebz!

Roscoe_Word · 12-08-20 07:32 PM

One year went an entire NBA season and logged a 55% ATS mark.

That was with a notebook, pen and calculator.

Then got a computer and learned some code to automate things.

Have never repeated that mark since.

Waterspub....thanks for some past help you've given.......

Ahh...sorry, posted in wrong thread.

Meant to post in "How many people use coding" thread......

Waterstpub87 · 12-08-20 09:17 PM

Originally Posted by Roscoe_Word

One year went an entire NBA season and logged a 55% ATS mark.

That was with a notebook, pen and calculator.

Then got a computer and learned some code to automate things.

Have never repeated that mark since.

Waterspub....thanks for some past help you've given.......

Ahh...sorry, posted in wrong thread.

Meant to post in "How many people use coding" thread......

Appreciate the kind words. Always good to help people. Plenty of people have helped me here with stuff.

Waterstpub87 · 12-08-20 09:24 PM

Originally Posted by mikmak

First of all, thank you so much for sharing this. I'm an IT professional but not a programmer. I've decided to teach myself Python for scraping and try to build a database so I can back test my models. Right now, I'm using Excel and it does an ok job using power query for scraping but it runs out of memory and is slow as dirt. I also want to automate so much of what I am currently doing and Excel just isn't going to cut the mustard.

I experienced the indent error discussed above and had to play around with the tabs in the [list(df)] section. I'm past that issue, but now I'm getting the following error:

Code:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "C:\Users\mikma\.spyder-py3\temp.py", line 27, in <module>
    scl['School'] = scl['School'].str.replace('NCAA','')

  File "C:\Users\mikma\anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)

  File "C:\Users\mikma\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'School'

Any ideas? And thank you again so much for trying to help others on this board. We need more people like you on the interwebz!

Not sure, haven't been able to replicate this.

What year are you running it for? Did you change it?

If not, can you do the following:

In the console on the right (where it displayed your error), type:

schools

next to In [2]: and hit enter, then tell me if you get a table with columns and rows.

If not, that's an issue.

If you do, type list(schools) in the same place and hit enter, and tell me if you see the word 'School'.

gauchojake · 12-22-20 05:39 PM

I got the same error.
This is what was returned when I keyed in the list(schools) command:

[ Unnamed: 0_level_0 Unnamed: 1_level_0 Overall ... Totals
Rk School G W ... STL BLK TOV PF
0 1 Abilene Christian 31 20 ... 293 81 436 661
1 2 Air Force 32 12 ... 161 43 395 534
2 3 Akron 31 24 ... 158 91 397 548
3 4 Alabama A&M 30 8 ... 174 63 391 538
4 5 Alabama-Birmingham 32 19 ... 191 93 471 527
.. ... ... ... .. ... ... ... ... ...
382 349 Wright State 32 25 ... 209 113 396 516
383 350 Wyoming 33 9 ... 175 66 418 626
384 351 Xavier 32 19 ... 201 114 446 535
385 352 Yale 30 23 ... 188 101 389 449
386 353 Youngstown State 33 18 ... 188 85 397 579

[387 rows x 38 columns]]

gauchojake · 12-22-20 05:50 PM

BTW thanks for posting this. I have been messing around with a few different iterations of Python and this one is the easiest to work with so far. I am not a programmer by any stretch and I am learning from scratch on the internet how to do this.

Waterstpub87 · 12-22-20 07:05 PM

I'm updating my installation. I can't replicate the error you are getting. It reads the table correctly, but it isn't reading it like a table. You can literally see where the word school is.

Do me a favor? Change
scl = df['Overall']

to
scl = df['Overall'].copy()

gauchojake · 12-22-20 07:14 PM

I still get the same error

  File "C:\Users\jake\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'School'

Waterstpub87 · 12-22-20 07:20 PM

You are running 2020?

gauchojake · 12-22-20 07:25 PM

I looked and I saved the script with 2019 as the year. Edited to 2020 and this was the error returned

Traceback (most recent call last):

  File "C:\Users\jake\.spyder-py3\Basketball Project.py", line 21, in <module>
    scl['School'] = scl['School'].str.replace('NCAA','')

  File "C:\Users\jake\anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)

  File "C:\Users\jake\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'School'

Waterstpub87 · 12-22-20 08:00 PM

I have to update my version. If you have a fresh install, it might be causing issues. They update how things work behind the scenes, and sometimes it causes things to change. Will do later. My python was like 4 or so versions back.

gauchojake · 12-22-20 08:13 PM

Yeah I just installed it. Cool thanks for the help.

Waterstpub87 · 12-22-20 09:42 PM

Ok, now I get the same error:

fix is

Code:

        schools = pd.read_html(schoolsurl,header=[1])


        df = schools[0]

        #df = df[list(df)]
        

        scl = pd.DataFrame(df['School'].copy())

in lines 17-25
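
Why header=[1] helps, as far as can be told from the list(schools) printout earlier in the thread: the page has two header rows, and newer pandas reads them as two-level column names, so 'School' sits under 'Unnamed: 1_level_0' rather than under 'Overall'. Below is a minimal sketch with a hand-built two-level table (the two rows are taken from that printout). Dropping the top level here only stands in for what header=[1] does at read time.

```python
import pandas as pd

# Hand-built table with the same two-level header shape as the printout.
columns = pd.MultiIndex.from_tuples([
    ('Unnamed: 0_level_0', 'Rk'),
    ('Unnamed: 1_level_0', 'School'),
    ('Overall', 'G'),
    ('Overall', 'W'),
])
df = pd.DataFrame(
    [[1, 'Abilene Christian', 31, 20], [2, 'Air Force', 32, 12]],
    columns=columns,
)

# 'School' is not grouped under 'Overall', so df['Overall']['School'] fails.
print(list(df['Overall']))  # only the columns grouped under 'Overall'

# Keeping just the second header row gives flat names.
flat = df.copy()
flat.columns = flat.columns.droplevel(0)
print(flat['School'].tolist())
```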

you also need to make an edit around line 89
currently looks like

Code:

        results = results[results.Date != 'School']
        results = results[results.Date != 'Date']
        results= results.fillna(0)
        counter = 6
        cols = list(results)

You need to add a line after to make it look like:

Code:

        results = results[results.Date != 'School']
        results = results[results.Date != 'Date']
        results= results.fillna(0)
        counter = 6
        cols = list(results)
        results = results[results['FG'] != 'School']

Keep all of these lines indented in line with the rest of the loop; the indentation can get messed up when the code is posted or pasted.
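
Judging by the existing != 'School' and != 'Date' checks, the filters are there because the scraped tables repeat their header rows in the middle of the data, and those text rows break the astype(int) step. Here is a minimal sketch on a made-up table; the values and the exact shape of the stray rows are assumptions.

```python
import pandas as pd

# Made-up game log with two stray header rows mixed into the data.
results = pd.DataFrame({
    'Date': ['11/5/2019', 'Date', '', '11/10/2019'],
    'FG': ['36', 'FG', 'School', '30'],
})

# Drop rows that are really repeated headers, not games.
results = results[results.Date != 'Date']
results = results[results['FG'] != 'School'].copy()

# With the text rows gone, the stat column converts cleanly.
results['FG'] = results['FG'].astype(int)
print(results)
```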

gauchojake · 12-23-20 11:23 AM

I am getting different errors now but it's probably due to my lack of experience coding and not understanding the nuances. I'll play around a little more because I want to see if I can get it.

Nappyx · 12-23-20 01:47 PM

@Waterstpub87 why don't you just scrape all the information and post the file here so that the non-coders of the bunch don't have to monkey with your code to pull the results. Would probably save many folks in this thread a lot of time....

Waterstpub87 · 12-23-20 01:54 PM

Originally Posted by Nappyx

@Waterstpub87 why don't you just scrape all the information and post the file here so that the non-coders of the bunch don't have to monkey with your code to pull the results. Would probably save many folks in this thread a lot of time....

Depending on if basketball reference feels that is proprietary information, it might get taken down. Also, not sure how to do that.

Also, what fun is that? Teach a man to fish and all.

gauchojake · 12-23-20 09:21 PM

Success!! I had to remove two lines of code but got the script to run. It's a little messy but I spot checked the data and it looks good. Thank you sir.
