Most efficient test programmatically for goodness of fit?

  • MonkeyF0cker

    #1
    Most efficient test programmatically for goodness of fit?
    I'm looking for a recommendation on a normality test for frequency distributions. I am implementing it programmatically. Essentially, the purpose of the test is to remove erroneous data from consideration in my model. I want to test for the normal distribution and remove sample points that do not lie within four standard deviations of the mean. Does anyone recommend an efficient test for this? K-S? Shapiro-Wilk? These seem too cumbersome for what I'm trying to accomplish. There must be an easier test.
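
    The trimming step itself is cheap; something like this (a minimal NumPy sketch, with an illustrative function name, not anything from a library):

        import numpy as np

        def trim_beyond_k_sigma(samples, k=4.0):
            # Keep only points within k standard deviations of the mean;
            # k=4 is the Gaussian cutoff described above.
            samples = np.asarray(samples, dtype=float)
            mean, std = samples.mean(), samples.std()
            return samples[np.abs(samples - mean) <= k * std]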

    BTW, the reason for the test is that some of the distributions are not Gaussian, as the dataset may consist of a mixture of normal distributions. In that case, I'll simply have to utilize Chebyshev's inequality, remove data beyond seven standard deviations, and further dissect the data from there.
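
    For reference, the bound behind that seven-sigma cutoff is easy to sanity-check (a quick Python sketch of the arithmetic):

        # Chebyshev: P(|X - mean| >= k*std) <= 1/k**2 for ANY distribution
        # with finite variance, Gaussian or not.
        for k in (4, 7):
            print(f"k={k}: at most {100.0 / k**2:.2f}% of the mass beyond k std devs")
        # k=4: at most 6.25% of the mass beyond k std devs
        # k=7: at most 2.04% of the mass beyond k std devs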
  • MonkeyF0cker

    #2
    I think I found a good test for this actually. Anderson-Darling.
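
    For anyone searching later: SciPy ships an implementation, so a quick check looks roughly like this (synthetic data as a stand-in; just a sketch, not my actual code):

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        data = rng.normal(size=10_000)  # stand-in for a real frequency sample

        # stats.anderson returns the A^2 statistic plus critical values at the
        # 15%, 10%, 5%, 2.5%, and 1% significance levels for dist='norm'.
        result = stats.anderson(data, dist="norm")
        crit_5 = result.critical_values[2]  # 5% level
        print("A^2 =", result.statistic,
              "| reject normality at 5%:", result.statistic > crit_5)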
    • Data

      #3
      Originally posted by MonkeyF0cker
      Does anyone recommend an efficient test for this? K-S? Shapiro-Wilk? These seem too cumbersome for what I'm trying to accomplish. There must be an easier test.
      AFAIK, the K-S is the simplest, to the point that it is considered to be useless. The D'Agostino-Pearson omnibus test is a good compromise between difficulty and quality. Regardless, why reinvent the wheel when you can test your data in (free) R, which already has almost a dozen normality tests available as functions?
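
      (And if you end up outside R anyway, the same omnibus test exists elsewhere; e.g., SciPy's normaltest is D'Agostino-Pearson. A rough sketch:)

          import numpy as np
          from scipy import stats

          rng = np.random.default_rng(1)
          data = rng.normal(size=5_000)

          # stats.normaltest implements the D'Agostino-Pearson omnibus test:
          # it combines skewness and kurtosis into one chi-squared statistic.
          stat, pvalue = stats.normaltest(data)
          print(f"K^2 = {stat:.2f}, p = {pvalue:.3f}")
          if pvalue < 0.05:
              print("reject normality at the 5% level")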
      • MonkeyF0cker

        #4
        I've considered R, but I'm looking to make the algorithm as efficient as possible, since I'll be handling datasets of approximately 3 million records every time it runs.
        • Data

          #5
          Originally posted by MonkeyF0cker
          I've considered R, but I'm looking to make the algorithm as efficient as possible, since I'll be handling datasets of approximately 3 million records every time it runs.
          I advise you to read up on applicability. In most cases, you should decide on a single way to treat the entire series instead of testing each dataset for normality.
          • MonkeyF0cker

            #6
            I've worked out the particulars of the efficiency issue now. I really only need to test for normality on occasion. I've built several tables that contain the frequency distribution bin counts, and I simply add to those counts when new game data is acquired. This should speed up the process considerably.
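
            Roughly, the bookkeeping looks like this (a simplified sketch; the class and names are illustrative, not the actual implementation):

                from collections import Counter

                class BinnedDistribution:
                    def __init__(self, bin_width):
                        self.bin_width = bin_width
                        self.counts = Counter()  # bin index -> frequency

                    def add(self, value):
                        # New game data only increments a counter -- O(1) per record.
                        self.counts[int(value // self.bin_width)] += 1

                    def moments(self):
                        # Approximate mean/std from bin midpoints only when a
                        # normality check is needed, instead of rescanning raw records.
                        n = sum(self.counts.values())
                        mid = lambda b: (b + 0.5) * self.bin_width
                        mean = sum(mid(b) * c for b, c in self.counts.items()) / n
                        var = sum(c * (mid(b) - mean) ** 2
                                  for b, c in self.counts.items()) / n
                        return mean, var ** 0.5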