Math/Statistics question

  • maxpower79
    SBR Rookie
    • 02-01-07
    • 9

    #1
    Math/Statistics question
    This is partly intended for Ganchrow, but anyone can respond:

    Let's say I want to use statistics to evaluate a certain angle, or capper, or tout, or whatever.

    I know if I have some observations, I can use the binomial distribution, or normal approx. for large samples, to calculate a p-value, given a null hypothesis. So if I observe an ATS record of 9-2, and I make the null hypothesis that each of this system/capper/etc.'s picks will hit with probability q,

    p = (11 choose 2) * q^9 * (1-q)^2
    + (11 choose 1) * q^10 * (1-q)
    + (11 choose 0) * q^11

    And I can choose some significance level p*, and reject the null if p < p*. I hope that's right, anyway.
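
    Just to make that calculation concrete, here's a minimal sketch of it in Python (assuming scipy is available; q is whatever null probability you pick):

        from scipy.stats import binom

        q = 0.5                    # null hypothesis: each pick hits with probability q
        wins, games = 9, 11        # the observed 9-2 ATS record
        # p-value: probability of 9 or more wins out of 11 under the null
        p = binom.sf(wins - 1, games, q)   # sf(k) = P(X > k), so sf(8) = P(X >= 9)
        print(p)                   # roughly 0.0327 when q = 0.5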

    What I'd like to know is how things change if I have lots of sets of observations - say I'm data-mining, and I look at 100 different angles at once. Or I'm evaluating lots of handicappers - maybe the BTP contest - and I want to know, quantitatively, "Can any of these guys really cap?" Obviously, I can calculate p-values for each individual set of picks. But intuitively, they don't appear to have the same meaning. If the smallest p-value out of 700 is less than some p*, well, you would probably expect somebody to do that well just from sheer luck and the large sample size.

    I did get that if you take the null to be "all of these guys' picks hit with the same probability q" then you can add all the results together and get one p (for each q) for the entire sample. But if you rejected that null all you would be able to say is "at least one of these cappers wins with probability greater than q". But that's not that useful. What I'm wondering is if there is a way to come up with an "adjusted" p for each observation, that takes into account that it was one of many.

    I hope that's clear. Thanks in advance for any input.

    ~ Max
  • SBR Lou
    BARRELED IN @ SBR!
    • 08-02-07
    • 37863

    #2
    Paging Ganchrow...
    Comment
    • calm
      SBR Hustler
      • 01-04-08
      • 82

      #3
      I'm pretty sure your intuition is wrong. I'm too lazy/tired to go through the math, but whether you're looking at one handicapper or a thousand, I don't think it would change anything.
      Comment
      • pokernut9999
        SBR Posting Legend
        • 07-25-07
        • 12757

        #4
        lost me after the 3rd paragraph
        Comment
        • pico
          BARRELED IN @ SBR!
          • 04-05-07
          • 27321

          #5
          Originally posted by pokernut9999
          lost me after the 3rd paragraph
          you're better than me. i stop reading after the first sentence.
          Comment
          • mofome
            SBR Posting Legend
            • 12-19-07
            • 13003

            #6
            ganch, don, remp? whos here?
            Comment
            • pokernut9999
              SBR Posting Legend
              • 07-25-07
              • 12757

              #7
              excuse me, I got lost in the 3rd paragraph
              Comment
              • CHALKbreaker
                SBR High Roller
                • 12-27-07
                • 115

                #8
                Seems like paralysis by analysis to me.
                Comment
                • DrunkenLullaby
                  SBR MVP
                  • 03-30-07
                  • 1631

                  #9
                  I can't give an answer, but my gut tells me that when Ganch arrives there may be a Chi-square distribution in our future.
                  Comment
                  • Ganchrow
                    SBR Hall of Famer
                    • 08-28-05
                    • 5011

                    #10
                    Originally posted by maxpower79
                    What I'd like to know is how things change if I have lots of sets of observations - say I'm data-mining, and I look at 100 different angles at once. Or I'm evaluating lots of handicappers - maybe the BTP contest - and I want to know, quantitatively, "Can any of these guys really cap?" Obviously, I can calculate p-values for each individual set of picks. But intuitively, they don't appear to have the same meaning. If the smallest p-value out of 700 is less than some p*, well, you would probably expect somebody to do that well just from sheer luck and the large sample size.

                    I did get that if you take the null to be "all of these guys' picks hit with the same probability q" then you can add all the results together and get one p (for each q) for the entire sample. But if you rejected that null all you would be able to say is "at least one of these cappers wins with probability greater than q". But that's not that useful. What I'm wondering is if there is a way to come up with an "adjusted" p for each observation, that takes into account that it was one of many.
                    Data mining is very dangerous business. Unless you really know what you're doing, do yourself a favor and stay far, far away.

                    With that caveat firmly in place, here are a couple of quick and dirty mathematical approaches. In my opinion, a clear understanding of the following would be a necessary, although by no means sufficient, precondition for embarking on any form of profit-oriented data-mining sports betting project. The first approach considers the likelihood of the single best observed outcome, while the second considers the likelihood of a complete data set as extreme as or more extreme than the one observed. Both tests are inherently one-tailed.
                    Let's say you're looking at a single handicapper making straight-up picks at unbiased lines. If he were a 50% handicapper, his probability of picking N or more out of 100 games correctly would be given by =1-BINOMDIST(N-1,100,50%,1), where BINOMDIST() is the Excel binomial distribution function. Hence, a 50% picker would only have a 4.4313% probability of picking 59 or more games correctly. (This is referred to as either the "p-value" or the "significance level".)

                    If you were looking at M handicappers, each making sets of 100 independent picks, then assuming they were all 50% pickers:
                    1. The probability of at least one handicapper picking N or more correctly would be =1-(BINOMDIST(N-1,100,50%,1)^M).
                    2. The joint probability of results as extreme as or more extreme than the entire observed outcome, where the i-th handicapper has made N_i correct picks, could be approximated by =CHIDIST(-2*Σ_{i≤M} ln(1-BINOMDIST(N_i-1,100,50%,1)), 2*M), where CHIDIST() is the Excel chi-squared distribution function, which in this case is called with 2*M degrees of freedom. This is known as Fisher's method. A more accurate way of putting it would be that the chi-square result is the probability that the product of the individual significance levels would take on the observed value or lower, assuming all pickers were 50/50.


                    So let's say you're looking at 5 pickers, each making 100 picks on unrelated games, with 60, 57, 52, 48, and 47 correct picks respectively. If all pickers were 50% pickers:
                    1. We'd expect to see at least one of the five picking 60 or more correctly with probability =1-(BINOMDIST(60-1,100,50%,1)^5) ≈ 13.44%.
                    2. The joint probability of the entire outcome or better would be =CHIDIST(-2*ln(2.84%*9.67%*38.22%*69.14%*75.79%),10) ≈ 13.17%. (It might pay to give an example of what would be considered "the same outcome or better". This phrase refers to the product of the significance levels over independent trials, which in this case works out to be about 0.05507%, the fifth root of which is 22.29%. That is approximately the significance of picking 512 out of 1,000 correctly (1-BINOMDIST(512-1,1000,50%,1) ≈ 23.352%). Hence, the above outcome would be, by the standards of Fisher's method, about as "extreme" as 5 people all picking 512 out of 1,000 correctly.)
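
                    For anyone who'd rather check those numbers outside of Excel, here's a minimal Python sketch of both calculations (assuming scipy is available; binom.sf and chi2.sf stand in for BINOMDIST and CHIDIST):

                        from math import log
                        from scipy.stats import binom, chi2

                        wins = [60, 57, 52, 48, 47]        # correct picks out of 100 for each of the 5 pickers
                        p_vals = [binom.sf(w - 1, 100, 0.5) for w in wins]   # individual significance levels

                        # Method 1: probability that at least one 50% picker goes 60+ out of 100
                        print(1 - (1 - p_vals[0]) ** 5)    # roughly 0.1344

                        # Method 2: Fisher's method over all five pickers
                        stat = -2 * sum(log(p) for p in p_vals)
                        print(chi2.sf(stat, 2 * len(wins)))   # roughly 0.1317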

                    I'll note that when used to analyze contemporaneous contest results data, the above methods would need to be adjusted to take into account the underlying contest structure, possibly including correlation between contestants' picks and the impact of stale lines on winning percentages.

                    Another issue with the first method is that, holding the desired significance level constant, as you increase the number of sample sets considered (i.e., the number of contest participants or the number of alternative betting strategies considered), the incidence of Type-II errors ("false negatives") would also increase, decreasing the statistical power of these tests and rendering this form of analysis effectively useless.

                    This can also be an issue with the second method, especially to the extent that only a relatively small number of truly talented pickers (or successful strategies) exist within a large population. If this becomes an issue there are certainly other (more complicated) testing methodologies featured in commercial statistical packages that you might consider.
                    Comment
                    • maxpower79
                      SBR Rookie
                      • 02-01-07
                      • 9

                      #11
                      Thanks Ganch.

                      So with the first method, you are essentially creating an adjusted cdf for the max of N observations, and testing using that. With the second method, you may have gone a bit deeper than I can follow, but I got the gist of it.

                      FWIW, I'm not planning on doing any data-mining, as you are right, I would be in over my head. It was really more just curiosity.

                      ~ Max
                      Comment
                      • DrunkenLullaby
                        SBR MVP
                        • 03-30-07
                        • 1631

                        #12
                        Originally posted by Ganchrow
                        where CHIDIST() is the Excel chi-squared distribution function,
                        Comment
                        • Data
                          SBR MVP
                          • 11-27-07
                          • 2236

                          #13
                          Ganchrow, excellent article, as always.

                          Originally posted by Ganchrow
                          This can also be an issue with the second method, especially to the extent that only a relatively small number of truly talented pickers (or successful strategies) exist within a large population.
                          Could you explain why and how that small number becomes an issue?

                          If this becomes an issue there are certainly other (more complicated) testing methodologies featured in commercial statistical packages that you might consider.
                          I am thinking about getting SPSS or maybe even SAS. Can you elaborate on how those packages can help here and on their overall value for an analytical sports bettor?
                          Comment
                          • Ganchrow
                            SBR Hall of Famer
                            • 08-28-05
                            • 5011

                            #14
                            Originally posted by Data
                            Ganchrow, excellent article, as always.
                            Thanks.

                            Originally posted by Data
                            Could you explain why and how that small number becomes an issue?
                            It's apparent just by examining the test: Χ²[-2*Σ_{i≤M} ln(α_i); 2*M] (where α_i refers to the significance level of the i-th capper, and the integer M is the number of cappers being tested, corresponding to half the degrees of freedom). If we have a large number of talented cappers within the population, then their α's will be low and the Χ² will in turn show significance. As we increase the number of "average" cappers within the population we'd start seeing many more α's of around 50%. Even if these extra cappers were better than average, with α's of e^-1 ≈ 36.79%, each would contribute exactly 2 to the test statistic, so its value would approach the degrees of freedom (twice the number of cappers). As the d.o.f. approach infinity the Χ² approaches normality with a mean equal to the d.o.f., and so the significance of the test would approach 50%. Mind you, this occurs even when filling in with cappers better than average.
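
                            To see the dilution numerically, here's a small illustrative sketch (the alpha values are hypothetical and scipy is assumed): each filler capper with an alpha of e^-1 contributes exactly 2 to the statistic along with 2 degrees of freedom, so the joint significance drifts toward 50% no matter how strong the handful of genuine results is.

                                from math import log, e
                                from scipy.stats import chi2

                                strong = [0.001, 0.002, 0.005]             # hypothetical alphas of a few truly talented cappers
                                for n_filler in (0, 10, 100, 1000):
                                    alphas = strong + [1 / e] * n_filler   # filler cappers, each slightly better than average
                                    stat = -2 * sum(log(a) for a in alphas)
                                    print(n_filler, chi2.sf(stat, 2 * len(alphas)))
                                # printed significance climbs from essentially 0 toward 50% as fillers are added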

                            Originally posted by Data
                            I am thinking about getting SPSS or maybe even SAS. Can you elaborate on how those packages can help here and on their overall value for an analytical sports bettor?
                            In the past I've used both Mathematica and S+ professionally. I haven't used either SAS or SPSS since grad school. Nowadays, for no particularly good reason, I primarily use self-programmed, purpose-written libraries.

                            Really what needs to be done here is some form of categorical analysis. We aren't looking to determine how good the single best capper is, or how good the population is as a whole, but rather how good a particular unspecified category of capper is. To this end I believe a test known as Mantel-Haenszel may be applicable. To be perfectly honest I don't quite remember the details other than that it's a chi-squared test.
                            Comment
                            • Data
                              SBR MVP
                              • 11-27-07
                              • 2236

                              #15
                              Originally posted by Ganchrow
                              As we increase the number of "average" cappers within the population we'd start seeing many more α's of around 50%.
                              Why would we include the average cappers in our test? The Fisher method allows us to test "a basket" of cappers/strategies in the same manner as we test a single capper using the first (null-hypothesis) method, and we want only the best cappers/strategies in that basket. What I do not immediately see is how to account for the low likelihood (Bayesian-wise) of a successful strategy in the general population when translated into our cherry-picked sample.
                              Comment
                              • Ganchrow
                                SBR Hall of Famer
                                • 08-28-05
                                • 5011

                                #16
                                Originally posted by Data
                                Why would we include the average cappers in our test? The Fisher method allows us to test "a basket" of cappers/strategies in the same manner as we test a single capper using the first (null-hypothesis) method, and we want only the best cappers/strategies in that basket. What I do not immediately see is how to account for the low likelihood (Bayesian-wise) of a successful strategy in the general population when translated into our cherry-picked sample.
                                Were we just to select a subset of tested cappers (let's say just the best X%) then the Fisher method wouldn't be directly applicable. It needs to be applied to the entire test population.

                                If you were to limit the test population to just a superior subset, then of course the chi-square would be significant because you'd only be including the most significant results. Fisher tests the joint significance of all the results -- testing only the most significant for significance makes no sense. Applying Fisher in this manner would be akin to applying the first described method to only a portion of the results. You can't pick out the best results within a contest and pretend the other results never happened.

                                Now of course if you had prior knowledge that a certain group was likely to outperform then you could certainly use Fisher on just that group -- but you can't use the data set itself to come to that conclusion (in other words you'd need separate in-sample and out-of-sample data sets). But as I understood the OP's initial question the whole point is to identify the talented cappers in the first place.

                                Now I'm not saying Fisher is useless in this regard, but rather that it couldn't be used in its raw form on a portion of the data set that's determined by the data set itself -- that would be very bad practice. And if the portion of the data set which is selected has "only a relatively small number of truly talented pickers existing within it" ... then the problem I identified in my first post might crop up. In other words, even if you properly select your data set in-sample, there's no way to guarantee that this won't be an issue with Fisher out-of-sample.
                                Comment
                                • Data
                                  SBR MVP
                                  • 11-27-07
                                  • 2236

                                  #17
                                  Originally posted by Ganchrow
                                  Were we just to select a subset of tested cappers (let's say just the best X%) then the Fisher method wouldn't be directly applicable. It needs to be applied to the entire test population.
                                  Why is that? Not only does this method allow us to cherry-pick cappers from one contest, but we can pick any capper from any contest. Then, we can calculate a p-value for that "basket".

                                  I do not know what practical result can be achieved by applying this test to all the cappers in a given contest but I would guess none.

                                  If you were to limit the test population to just a superior subset, then of course the chi-square would be significant because you'd only be including the most significant results.
                                  What is wrong with that? After all, the significant results are what we are looking for; who needs insignificant ones? All we want is to calculate how significant the combined results are. So, how do we do that? Can we just add all the records and treat the sum as one record? No, of course not, but we can use Fisher's method instead.
                                  Comment
                                  • Ganchrow
                                    SBR Hall of Famer
                                    • 08-28-05
                                    • 5011

                                    #18
                                    Originally posted by Data
                                    Why is that? Not only does this method allow us to cherry-pick cappers from one contest, but we can pick any capper from any contest. Then, we can calculate a p-value for that "basket".

                                    I do not know what practical result can be achieved by applying this test to all the cappers in a given contest but I would guess none.
                                    It's rather unfortunate indeed that the gods of statistics have yet to decree that statistical validity need portend practicality. If you know of any virgins to sacrifice, however, I think I can get us a knife and some long flowing robes.

                                    You can choose to use Fisher in any manner you like. However, if your selection of cappers is determined by their individual significance levels, and then you use those same individual significance levels as inputs for Fisher, you're engaging in data dredging at its most blatant.

                                    The point of testing the Fisher statistic against the chi-square is to determine the likelihood of attaining that product of significance levels or lower. If, however, you've not properly determined your significance levels because you've not conditioned them on their likelihood of appearing within your chosen subset in the first place -- then the Fisher method will routinely deliver spurious results.

                                    Try this experiment in Excel. Generate the results in column A from 500 samples of 100 randomized binomial trials assuming a 50% success rate. So cells A1:A500 would each look something like: =CRITBINOM(100,50%,RAND()). (These would represent the results of 500 talentless handicappers each picking 100 games.)

                                    In column B, display each capper's significance level (so cell B1 would read =1-BINOMDIST(A1-1,100,50%,1), B2 would read =1-BINOMDIST(A2-1,100,50%,1), etc.)

                                    In column C fill in the natural logarithms of the values in column B (so cell C1 would read =LN(B1)).

                                    Then determine the test-wide Fisher statistic by setting cell D1 to =-2*SUM(C1:C500).

                                    To determine the significance, run a chi-square with 1,000 degrees of freedom by setting cell D2 =CHIDIST(D1,1000).

                                    Press F9 a few times to recalculate, checking the p-value in cell D2 each time. You should be seeing numbers fairly close to 100%, implying a lack of statistical significance. (The fact that it's generally SO close to 1 is implicative of the issue with Type II errors I had earlier mentioned).

                                    But now ... let's look at Fisher results if we cherry-pick a sample. Let's say we only look at the top 50% or better of pickers. What kind of Fisher method results will we see?

                                    Set cell D3 to the Fisher statistic of the cherry-picked subset: =-2*SUMIF(A1:A500,">="&PERCENTILE(A1:A500,50%),C1:C500). The number of handicappers in the top half would naturally be given by =COUNTIF(A1:A500,">="&PERCENTILE(A1:A500,50%)). Set cell D4 to the chi-squared p-value for the cherry-picked sample: =CHIDIST(D3,2*COUNTIF(A1:A500,">="&PERCENTILE(A1:A500,50%))).

                                    See a difference? Hit F9 a few times to make sure you aren't looking at some crazy aberration. You should be seeing results almost indistinguishable from zero, implying extreme significance. And we're not looking at some absurd subset either -- we're just looking at the top 50% of pickers drawn from a population that flips coins to determine picks.
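
                                    For those without Excel handy, here's a rough Python translation of the same experiment (assuming numpy and scipy are available; the Excel column/cell labels are noted in the comments):

                                        import numpy as np
                                        from scipy.stats import binom, chi2

                                        rng = np.random.default_rng()
                                        wins = rng.binomial(100, 0.5, size=500)      # column A: 500 talentless cappers, 100 picks each
                                        p_vals = binom.sf(wins - 1, 100, 0.5)        # column B: individual significance levels
                                        logs = np.log(p_vals)                        # column C: natural logs

                                        full_stat = -2 * logs.sum()                  # D1: Fisher statistic over the whole population
                                        print(chi2.sf(full_stat, 2 * 500))           # D2: typically very close to 1 (no significance)

                                        top = wins >= np.percentile(wins, 50)        # cherry-pick the top half of the field
                                        cherry_stat = -2 * logs[top].sum()           # D3
                                        print(chi2.sf(cherry_stat, 2 * top.sum()))   # D4: typically indistinguishable from 0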

                                    Now please don't get me wrong -- the problem here isn't inherent with Fisher itself, but rather with using incorrect p-values within the natural logarithm. Garbage-in, garbage-out, after all.

                                    The way in which one would need to properly apply Fisher in this particular instance would be by using p-values in the logs in column C that were conditioned on having been found in the top 50% of results. Without that conditioning you're going to get spurious results every time.

                                    The easiest way to handle the conditioning would be by appealing to the Central Limit Theorem as much as possible, while the correct way, probably involving Clopper-Pearson binomial intervals, would certainly be much trickier. That said, as long as you're not overly proud, aren't trying to earn some sort of degree in statistics, and aren't serving time in prison, you're probably best off just appealing to that great equalizer among statisticians -- the Monte Carlo.
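
                                    In that spirit, here's a rough sketch of what the Monte Carlo route might look like (illustrative only, assuming numpy and scipy; observed_wins below is just a stand-in for whatever real 500-capper results you're testing): simulate the cherry-picked Fisher statistic many times under the 50/50 null and read the significance off that empirical distribution rather than off the raw chi-square table.

                                        import numpy as np
                                        from scipy.stats import binom

                                        def cherry_fisher_stat(wins):
                                            # Fisher statistic computed over the top half of the field only
                                            p_vals = binom.sf(wins - 1, 100, 0.5)
                                            top = wins >= np.percentile(wins, 50)
                                            return -2 * np.log(p_vals[top]).sum()

                                        rng = np.random.default_rng()
                                        null_stats = np.array([cherry_fisher_stat(rng.binomial(100, 0.5, 500))
                                                               for _ in range(2000)])   # null distribution by simulation

                                        observed_wins = rng.binomial(100, 0.5, 500)     # stand-in: replace with the real records
                                        observed_stat = cherry_fisher_stat(observed_wins)
                                        p_value = (null_stats >= observed_stat).mean()  # significance calibrated to the cherry-picking
                                        print(p_value)
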
                                    Attached Files
                                    Comment
                                    • Data
                                      SBR MVP
                                      • 11-27-07
                                      • 2236

                                      #19
                                      Originally posted by Ganchrow
                                      It's rather unfortunate indeed that the gods of statistics have yet to decree that statistical validity need portend practicality.
                                      I was just implying that my approach is purely practical. If I can use a method then I am all ears, but if there is no way of applying it to make a profit I could not care less. I very much appreciate your theoretical insight, not for the theory itself but for what it can do for me.

                                      The point of testing the Fisher statistic against the chi-square is to determine the likelihood of attaining that product of significance levels or lower. If, however, you've not properly determined your significance levels because you've not conditioned them on their likelihood of appearing within your chosen subset in the first place -- then the Fisher method will routinely deliver spurious results.
                                      Precisely, I am thinking the same, determining significance levels is the key.

                                      You should be seeing results almost indistinguishable from zero, implying extreme significance.
                                      This is where we think differently, or, more likely, where I do not see something. If we are looking at 100 50% cappers we should expect one of them to show results with a significance of about 1%, but if we are looking at 1,000 of them then we expect to see results with a lower significance level of about 0.1%. None of these should make us excited. So, when we observe extremely low numbers in your example, no matter how extreme the significance is, it has no value for us for obvious reasons. However, why can we not use all these expected numbers (1%, 0.1%, close-to-zero) as baselines, so that if we get numbers lower than those it would imply that we may be seeing something non-random?
                                      Comment