web scraping/python generic (AI?) question

  • gojetsgomoxies
    SBR MVP
    • 09-04-12
    • 4222

    #1
    i'm curious about this subject.

    say you want to automate a college football betting model.

    but you get input from 7 different sources.

    so for louisiana-monroe, you get many different ways of referring to it. or mississippi (ole miss etc.)

    is it very hard for a computer to get to the point of identifying all 130 teams no matter how they are referred to (within reason)?

    can i as a human always know what team is referred to? i'd say yes, but i'm not sure.

    and then there's the pedestrian problem of a data provider changing its description of teams.

    thanks in advance for any insight on this
  • NSN21
    SBR Sharp
    • 05-13-11
    • 322

    #2
    yeah this is always a challenge of using strings instead of IDs. you need to create a mapping system that takes in these strings and does a lookup to see what it should really be.

    for example, in a lookup table you may have:

    ULM --> UL Monroe
    La-Monroe --> UL Monroe
    UL-Monroe --> UL Monroe

    etc etc

    the way I do it is that whenever my script comes across a string that doesn't exist in my lookup table, it asks me what the string should map to and saves the answer. eventually a massive lookup table gets built and saved, and from then on these references resolve automatically.
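
    A minimal sketch of that prompt-and-save lookup, not the poster's actual code. The file name, the function names `load_aliases` and `resolve_team`, and the lowercase/strip normalisation are all assumptions made for illustration.

```python
import json
from pathlib import Path


def load_aliases(path):
    """Load the alias -> canonical-name table, or start empty if no file exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def resolve_team(raw, aliases, path, ask=input):
    """Return the canonical name for `raw`, asking once for unknown strings."""
    key = raw.strip().lower()  # assumption: case and outer whitespace never matter
    if key not in aliases:
        # Unknown spelling: ask (a human, by default) and remember the answer.
        aliases[key] = ask(f"What team is {raw!r}? ").strip()
        Path(path).write_text(json.dumps(aliases, indent=2, sort_keys=True))
    return aliases[key]
```

    Usage would look like `aliases = load_aliases("aliases.json")` followed by `resolve_team("La-Monroe", aliases, "aliases.json")` for every scraped name. Passing a different `ask` function makes it possible to run unattended and just log the unknown strings.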
    • Waterstpub87
      SBR MVP
      • 09-09-09
      • 4102

      #3
      you could make a matching table. So for example, get your first data set. Set up a column in front, called data set 2 name. Vlookup your data set 1 name into data set 2. You will then probably get 50 or so correct. Then go through and fix the ones that are wrong (Paste what the actual correct answer is). Repeat for each data set.

      One common source of errors is "State" vs. "St." vs. "St". You could use VBA to find and replace, but this causes additional errors with the third form, because "St" is sometimes part of a different name. I think there are a couple like this.
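
      One hedged way around that in Python rather than VBA: expand the abbreviation only when it ends the name. This sketch assumes a trailing "St" or "St." always means "State" and that a leading "St." belongs to a different name and should stay as it is; the function name `expand_state` is made up for illustration.

```python
import re

# Assumption: "St" / "St." at the very end of a name abbreviates "State".
# A leading "St." (as in "St. John's") does not match and is left untouched.
_TRAILING_ST = re.compile(r"\bSt\.?$")


def expand_state(name):
    """Expand a trailing 'St' or 'St.' to 'State'; leave everything else alone."""
    return _TRAILING_ST.sub("State", name.strip())
```

      Names that break the assumption would still need a row in the matching table.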

      If you set up a matching table, you only have to do it once, and it will be good for backtesting as well as going forward, because data sources tend not to change name formats that frequently.

      You could then load this into python.

      You would then have a dataframe of 7 different names x ~130 teams based on the matching table. Then left-merge each of the other tables onto it, merging on that source's name column.
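
      A small pandas sketch of that merge, with two sources instead of seven. Every column name and number here is invented for illustration; only the shape (one row per team, one name column per source) follows the description above.

```python
import pandas as pd

# Hypothetical matching table: one row per team, one column per source's spelling.
matching = pd.DataFrame({
    "team": ["UL Monroe", "Mississippi"],
    "source_a_name": ["ULM", "Ole Miss"],
    "source_b_name": ["La-Monroe", "Mississippi"],
})

# Hypothetical data from two sources, each keyed by its own spelling.
source_a = pd.DataFrame({"name": ["ULM", "Ole Miss"], "rating": [55.0, 81.5]})
source_b = pd.DataFrame({"name": ["Mississippi", "La-Monroe"], "spread": [-7.0, 14.5]})

# Left-merge each source onto the matching table via that source's name column.
merged = (
    matching
    .merge(source_a, how="left", left_on="source_a_name", right_on="name")
    .drop(columns="name")
    .merge(source_b, how="left", left_on="source_b_name", right_on="name")
    .drop(columns="name")
)

# Rows with NaN in a data column point at spellings missing from the table.
unmatched = merged[merged[["rating", "spread"]].isna().any(axis=1)]
```

      Checking `unmatched` after each run is a cheap way to notice when a provider has changed how it names a team.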
      • Waterstpub87
        SBR MVP
        • 09-09-09
        • 4102

        #4
        Alternatively, you could just text replace in python with the appropriate names. I recently ran into an issue with basketball reference. With some of the teams, the name is different when you are on their team page vs. when they are the opponent. So by running it and finding the missing teams, I text-replaced the different schools. It looks something like this:

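        # regex=False treats each pattern as literal text; str.replace still matches substrings
        results['Opp'] = results['Opp'].str.replace('California', 'University of California', regex=False)
        results['Opp'] = results['Opp'].str.replace('USC', 'Southern California', regex=False)
        results['Opp'] = results['Opp'].str.replace('UConn', 'Connecticut', regex=False)
        results['Opp'] = results['Opp'].str.replace('UMass', 'Massachusetts', regex=False)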
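
        The "finding the missing teams" step can be sketched as a set difference. This is an illustration under assumptions, not the poster's code: the `results` frame, the `known` set and the `renames` dict are all made-up sample data. It also uses `Series.replace` with a dict, which swaps whole cell values only, so a name that merely contains "California" is not touched.

```python
import pandas as pd

# Hypothetical sample data: `results` mimics a scraped schedule,
# `known` is the set of canonical team names used everywhere else.
results = pd.DataFrame({"Opp": ["USC", "UConn", "Stanford"]})
known = {"Southern California", "Connecticut", "Stanford"}

# Renames collected so far; Series.replace with a dict matches whole values only.
renames = {"USC": "Southern California"}
results["Opp"] = results["Opp"].replace(renames)

# Anything still outside `known` needs a new entry in `renames`.
missing = sorted(set(results["Opp"]) - known)
```

        Here `missing` comes out as `['UConn']`, which is the cue to add `'UConn': 'Connecticut'` to `renames` and run again.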