1. #1
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    A Question About Probability

    Is there anyone who can show how to calculate the cumulative probability of something occurring when multiple probabilities are involved?

    I am aware of how to calculate the binomial distribution for probabilities that are constant but not when they vary. For example, if a fair coin was flipped 4 times and I was making a guess each time, the probability of me guessing wrong at least:
    4 times is 6.25% or 1 in 16 experiments
    3 or more times is 31.25% or 5 in 16 experiments
    2 or more times is 68.75% or 11 in 16 experiments
    1 or more times is 93.75% or 15 in 16 experiments
    0 or more times is 100% or 16 in 16 experiments
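    (For reference, these "at least" figures can be reproduced in Excel with a formula along the lines of =1-BINOMDIST(k-1, 4, 0.5, TRUE); e.g. =1-BINOMDIST(2, 4, 0.5, TRUE) returns the 31.25% figure for 3 or more.)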

    So, here are some relevant odds (actually they are real NHL ML numbers) and their outcomes (1=win, 0=lose):

    -111 1
    -143 1
    -125 0
    -135 1
    -112 1
    -109 1
    -107 0
    -110 0
    -111 1
    -116 1
    -106 0
    -120 1
    -110 1
    -123 0
    -124 1
    -118 0
    -135 0
    -109 0
    175 0
    216 0
    184 0
    179 1
    150 0
    265 0
    189 0
    185 0
    144 1
    145 0
    152 0
    188 0
    145 0
    185 0
    175 1
    212 1
    178 1
    167 1
    192 0
    180 0
    153 1
    168 0
    180 1
    307 1
    175 1
    170 0
    200 1
    175 1
    236 0
    157 0
    188 0
    222 0
    270 1
    147 1
    164 1
    172 1
    183 1


    Assuming a bet to win an equal percentage of bankroll for each of the above events, my results would be:

    Win 48.332811 units, Bet 41.766906 units for a net ROI of 15.7%.

    2 questions:

    1) What was the probability that I would win "at least" the 15.7% that was obtained during this experiment and how is that calculated?
    2) How do you calculate the probability of winning or losing various ROI amounts? (e.g. win 2%, 5%, 10%, etc.)

    There are bonus smiley faces etc. if Excel-type formulas are included along with the math-notation answers.

  2. #2
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    I should add that up until now what I do is run simulations in Excel based on the odds. For example, I have my list of bets and their implied win probabilities (e.g. +100 = .50, -150 = .60, etc.) and run at least 10,000 random simulations with these probabilities (e.g. if a random number is less than the probability, it is counted as a win; otherwise it is a loss). What this does for me is determine how likely I would have been to win various ROI amounts purely by chance.
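    (As a rough illustration of a single cell under that scheme, assuming a -150 line bet to win 1 unit: =IF(RAND()<0.6, 1, -1.5), i.e. win 1 unit with probability .60, otherwise lose the 1.5 units risked; each simulated trial sums a column of such cells.)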

    For example, I ran 10,000 random trials on the entire population of 55 odds shown in the first post. Here were the results:

    To win at least 15.7% (which is what I won), there is a 15.54% chance that it occurred totally randomly. I interpret this to mean that my 15.7% ROI is significant at the 15.54% level (notwithstanding my sample size of 10,000).

    Here are other numbers and levels using the odds in the original post:

    Win 1% ROI.....46.9% Chance
    Win 2% ROI.....44.37% Chance
    Win 3% ROI.....41.88% Chance
    Win 4% ROI.....39.39% Chance
    Win 5% ROI.....37.01% Chance
    Win 10% ROI.....25.77% Chance
    Win 15% ROI.....16.67% Chance
    Win 20% ROI.....10.04% Chance
    Win 25% ROI.....5.2% Chance
    Win 30% ROI.....2.62% Chance

    And finally, to have confidence at the 5% level (95% sure that my results are not by chance notwithstanding my sample size of 10,000)
    Win 25.255% ROI.....5% Chance

    Therefore, I could say that the null hypothesis (that I was just lucky) cannot be rejected, because there is a 15.54% chance that results at least this good would occur randomly, and so the alternative hypothesis (that the algorithm that produced the choices is profitable) is not supported.

    What I am looking for is a mathematical calculation that will save me the time and inaccuracy of these simulations.
    Last edited by VideoReview; 02-14-08 at 05:42 PM. Reason: Clarification

  3. #3
    chemist
    chemist's Avatar Become A Pro!
    Join Date: 01-15-08
    Posts: 217
    Betpoints: 366

    Quote Originally Posted by VideoReview View Post
    Is there anyone who can show how to calculate the cumulative probability of something occurring when multiple probabilities are involved?

    -snip-
    I know nothing of Excel. In this case a reasonable null hypothesis is that the no-vig ML is the true probability. Computing the probability of M or more successes in N trials of this sort is straightforward but laborious. You could reasonably use the normal approximation and the fact that the variance of the sum of independent events is the sum of the variances of the events. The variance of a single binomial trial is p*(1-p).

    HTH
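    A minimal Perl sketch of that count-based calculation (assumptions: input lines are "US-odds outcome" pairs as in the first post, and each win probability is taken as 1/decimal odds -- strictly, this null calls for no-vig probabilities, so feed no-vig lines or adjust accordingly):

    Code:
    #!perl
    # sketch of the count-based normal approximation described above:
    # expected wins = sum of p, variance of win count = sum of p*(1-p)
    use strict;
    use warnings;
    
    my ($n, $wins, $mu, $var) = (0, 0, 0, 0);
    while (<>) {
        my ($us, $iswin) = split;
        next unless $us;
        my $dec = $us >= 0 ? 1 + $us/100 : 1 - 100/$us;
        my $p   = 1/$dec;          # implied win probability (zero edge assumed)
        $n++;
        $wins += $iswin ? 1 : 0;
        $mu   += $p;               # expected number of wins
        $var  += $p * (1 - $p);    # variance of the win count
    }
    my $z = ($wins - $mu) / sqrt($var);
    # convert z to a one-tailed p-value with the standard normal CDF
    # (simple version, no continuity correction)
    printf "N=%d  wins=%d  expected=%.2f  z=%.3f\n", $n, $wins, $mu, $z;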

  4. #4
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    There's nothing wrong with estimating p-values using Monte Carlo simulations. There is, however, something wrong with estimating p-values using Monte Carlo simulations in Excel. It's hard on the soul.

    Here's a very simple Monte Carlo script coded in Perl. (You can download a free copy of Perl from http://www.activeperl.com/.) On a decent machine you should easily be able to run 2,000,000 55-bet trials in a minute or less. You should modify the EDGE, TRIALS, and BOGEY constants to suit your needs.

    Code:
    #!perl
    
    # Author: ganchrow@sbrforum.com
    # a very simple implementation of the
    # Monte Carlo method in fixed odds
    # sports betting
    use strict;
    use warnings;
    
    ### edit from here ###
    use constant EDGE    =>    0;
    use constant TRIALS    =>    3_000_000;
    use constant BOGEY    =>    0.1572035;    # as % of risk amount
    ### don't edit below this line unless you know what you're doing ###
    
    my @odds_ra = ();
    my $total_risk = 0;
    
    while(<>) {
        chomp;
        my ($us,) = split;
        next unless $us;
        my $dec = &us2dec($us);
        my $prob = (1+EDGE)/$dec;
        my $risk = 1/($dec-1);
        push @odds_ra, [$prob, $dec, $risk, 1];
        $total_risk += $risk;
    }
    
    my ($sum,$sumsq,$qualifiers,) = (0.0, 0.0, 0,);
    my $pct_bogey = BOGEY * $total_risk ;
    foreach my $i ( 1 .. TRIALS) {
        my $this_trial_result = 0;
        print STDERR "Trial $i\n" if $i%10_000 == 0;
        foreach my $j (0 .. $#odds_ra) {
            my ($prob, $dec, $risk, $win,) = @{$odds_ra[$j]};
            my $r = rand();
            my $this_bet_result;
            if ($r < $prob) {
                # win
                $this_bet_result = $win;
            } else {
                $this_bet_result = -$risk;
            }
            $this_trial_result += $this_bet_result;
        }
        print "$this_trial_result\n";
        $qualifiers++ if $this_trial_result >= $pct_bogey;
        $sum += $this_trial_result;
        $sumsq += $this_trial_result*$this_trial_result;
    }
    my $mean = $sum / TRIALS;
    my $stddev = sqrt($sumsq / TRIALS - $mean*$mean);
    my $frequency = $qualifiers / TRIALS;
    print STDERR "Mean     \t$mean\n";
    print STDERR "Std. Dev.\t$stddev\n";
    print STDERR "Qual     \t$frequency\n";
    
    sub us2dec {
        my $us = shift;
        return (
            $us >= 0 ? 1+$us/100 : 1-100/$us
        );
    }
    The script takes a text file of newline separated US-style odds and outputs to STDOUT the results of each of the trials (so you'll want to redirect STDOUT to a file), and to STDERR the mean, variance, and frequency with which the specified bogey (about 15.7% in your example) is reached.

    I'll just note that the script uses the Perl built-in rand() function, which has a fairly low periodicity. There's a Perl module available from CPAN (Math::Random::MT) that implements the Mersenne Twister pseudorandom number generation algorithm and can be used as a drop-in replacement for rand(). If you're going to be doing any moderately serious Monte Carlo sims and don't feel like coding in C, you should definitely hook that up (although it will slow down your sim). You can even seed it with data from random.org. It's about 10 extra lines of code. Let me know if you want it.

    The only way to calculate an exact solution would be to enumerate each of the 2^55 (roughly 3.6 × 10^16) different outcomes, which would of course be completely impractical.

    Another possibility would be to break up the 55 bets into manageable tranches of (let's say) 11 bets apiece, enumerate the results for each tranche using, and then determine exact p-values using the binomial distribution. You could then use Fisher's chi-square method to determine a joint significance for the entire data set.
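    (For reference, Fisher's method combines k independent p-values p_1, ..., p_k into the statistic X = -2*[ln(p_1) + ... + ln(p_k)], which under the joint null hypothesis follows a chi-square distribution with 2k degrees of freedom.)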

    Assuming all bets are independent of one another, the simplest method would just be to appeal to the Central Limit Theorem. Take the sum of the variances of each bet (betting to win n units at decimal odds d and edge E, the variance would be (1+E)*(d-E-1) * n^2 / (d-1)^2) and then take the square root of that sum to obtain the standard deviation, from which you get a z-score. (So for zero edge and betting to win 1 unit, the variance would just be 1/(d-1), i.e. the units risked.)

    In your example, assuming no edge, we get a standard deviation of about 15.47%. This means your result of ~15.72% is about 1.016 standard devs from breakeven, for a p-value of about 15.48% (=1-NORMSDIST(1.016)).
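    If you want the CLT number without doing the arithmetic by hand, here's a minimal sketch in Perl (same assumptions as the Monte Carlo script above: newline-separated "US-odds outcome" input, betting to win 1 unit per game, zero edge):

    Code:
    #!perl
    # minimal sketch of the CLT / z-score calculation described above
    # (assumes the same "us_odds outcome" input file as the Monte Carlo
    # script, betting to win 1 unit per game at zero edge)
    use strict;
    use warnings;
    
    my ($units, $var_sum) = (0, 0);
    while (<>) {
        my ($us, $iswin) = split;
        next unless $us;
        my $dec  = $us >= 0 ? 1 + $us/100 : 1 - 100/$us;
        my $risk = 1/($dec - 1);
        $units   += $iswin ? 1 : -$risk;  # actual result of this bet in units
        $var_sum += $risk;                # to win 1 unit at zero edge, variance = units risked
    }
    my $stddev = sqrt($var_sum);
    printf "Units %+.4f  StdDev %.4f  z %.4f\n", $units, $stddev, $units/$stddev;
    # p-value in Excel: =1-NORMSDIST(z)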
    Last edited by SBRAdmin3; 07-03-14 at 06:09 PM.

  5. #5
    Quebb Diesel
    Quebb Diesel's Avatar Become A Pro!
    Join Date: 01-26-08
    Posts: 3,045
    Betpoints: 82

    Quote Originally Posted by Ganchrow View Post
    There's nothing wrong with estimating p-values using Monte Carlo simulations. There is, however, something wrong with estimating p-values using Monte Carlo simulations in Excel.

    Here's a very simple Monte Carlo script coded in Perl.
    -snip-
    Ganchrow... what's your educational background?

  6. #6
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    There's nothing wrong with estimating p-values using Monte Carlo simulations. There is, however, something wrong with estimating p-values using Monte Carlo simulations in Excel. It's hard on the soul.

    Here's a very simple Monte Carlo script coded in Perl. (You can download a free copy of Perl from http://www.activeperl.com/.) On a decent machine you should easily be able to run 2,000,000 55-bet trials in a minute or less. You should modify the EDGE, TRIALS, and BOGEY constants to suit your needs.

    Assuming all bets are independent of one another, the simplest method would just be to appeal to the Central Limit Theorem. Take the sum of the variances of each bet (betting to win n units at decimal odds d and edge E, the variance would be (1+E)*(d-E-1) * n^2 / (d-1)^2) and then take the square root of that sum to obtain the standard deviation, from which you get a z-score. (So for zero edge and betting to win 1 unit, the variance would just be 1/(d-1), i.e. the units risked.)

    In your example, assuming no edge, we get a standard deviation of about 15.47%. This means your result of ~15.72% is about 1.016 standard devs from breakeven, for a p-value of about 15.48% (=1-NORMSDIST(1.016)).
    Monte Carlo method eh? Well, I feel pretty good that there is an official name for what I was doing. I thought I had invented it. You do have the correct picture of me sitting for 15 minutes waiting for Excel to chug through 10,000 simulations. Thank you for the link to the program code. I haven't downloaded it yet as I am just reading this post now but I would also be interested in the randomizing code you wrote about. Seems to me if I was running 100,000,000 trials to test a model that I would want to stay away from any repeating pattern.

    Regarding your simplest suggestion, there seems to be one parenthesis missing and I can not make the equation work.

    If I have:
    E=0 (Edge)
    d=.6 (American Odds of +150)
    n=1 (Units to win)

    Then I get:
    (1+0)*(.6-0-1) * 1^2 / (.6-1)^2)

    From my understanding, I assume I would calculate the results of the above equation for each of the 55 independent samples and take the square root which will give me the standard deviation. I believe the above example of +150 would evaluate to 3.75 if I ignore the last parenthesis. The square root of many such large numbers will not come close to 15.48%, so I am lost. Also, do the z-test and standard deviation both evaluate to 15.48% in your example and this is why you are able to get both numbers from one?

    I do have a question for you about adhering to the Central Limit Theorem. Not so much from this post but from other posts you have made. I get the feeling that when you say things like "as long as you're comfortable appealing to the Central Limit Theorem" --> (not a direct quote as I am going from memory on this) that maybe I shouldn't be appealing to it and maybe I should be thinking along Bayesian lines. Here are 2 simple questions that I have often wondered about. Relevant to sports betting, do you personally appeal to the Central Limit Theorem for calculating probabilities and significance? Are there situations where you do not?

    Also, if it were you and you had the choice of running 100,000,000 Monte Carlo simulations (with the better randomizer in place of course) or using the Central Limit Theorem example you proposed, which would you choose?

    Finally, the following equation did not work in Excel as it is missing parameters for the function. I am sure that they are probably assumed numbers like 1's and 0's but it would be helpful if you could fill them in for me.

    15.48% (=1-NORMSDIST(1.016))

    I promised lots of smiley faces for the Excel type formulas included with the answers so here they are. Thanks for coming through, again, for me Ganchrow.

    Last edited by SBRAdmin3; 07-03-14 at 06:09 PM.

  7. #7
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by chemist View Post
    I know nothing of Excel. In this case a reasonable null hypothesis is that the no-vig ML is the true probability. Computing the probability of M or more successes in N trials of this sort is straightforward but laborious. You could reasonably use the normal approximation and the fact that the variance of the sum of independent events is the sum of the variances of the events. The variance of a single binomial trial is p*(1-p).

    HTH
    Thanks Chemist. Is p the no vig ML decimal odds?

    So, if I have +150 odds, the variance would be .6*(1-.6)=.24?

    If so, what do I do with several such numbers (i.e. .24, .25, .1, .6, etc.)?

  8. #8
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    I would also be interested in the randomizing code you wrote about.
    After the line "use warnings;" add the following code:
    Code:
    use Math::Random::MT;
    
    my $rand_gen;
    BEGIN {
    	warn "Seeding random number generator.\n";
    	require LWP::Simple;
    	use constant RAND_URL => 'http://random.org/integers/?num=1248&min=0&max=65535&col=2&base=10&format=plain&rnd=new';
    	my (@seed);
    	foreach (split(/\n/, LWP::Simple::get(RAND_URL))) {
    		m/^([0-9]+)\s+([0-9]+)$/;
    		push @seed, $1 + $2*2**16;
    	}
    	$rand_gen = Math::Random::MT->new(@seed);
    	warn "Random number generator seeded.\n";
    }
    and then replace the line "my $r = rand();" with:
    Code:
    my $r = $rand_gen->rand();
    This utilizes the Mersenne Twister algorithm with a 19,968-bit truly-random seed obtained from random.org. It's not cryptographically secure but it's more than adequate for Monte Carlo purposes.

    Quote Originally Posted by VideoReview View Post
    Regarding your simplest suggestion, there seems to be one parenthesis missing and I can not make the equation work.
    The entire clause "betting to win n unit at decimal odds d and edge E, variance would be (1+E)*(d-E-1) * n^2 / (d-1)^2" is in parentheses, so you can ignore the final paren mathematics-wise.

    Quote Originally Posted by VideoReview View Post
    If I have:
    d=.6 (American Odds of +150)
    +150 in decimal odds would be 2.5. See my odds converter. This yields variance of (1+0)*(2.5-1) * 1^2 / (2.5-0-1)^2 = 2/3.

    Quote Originally Posted by VideoReview View Post
    From my understanding, I assume I would calculate the results of the above equation for each of the 55 independent samples and take the square root which will give me the standard deviation. I believe the above example of +150 would evaluate to 3.75 if I ignore the last parenthesis. The square root of many such large numbers will not come close to 15.48%, so I am lost.
    The sum of the variances evaluates to 41.7669 which, not coincidentally, is also the total amount wagered. (As I noted previously, when betting to win 1 unit at 0 expectation, the variance of a bet equals the units risked.) The square root of the summed variance (6.4627 units) is the standard deviation. Since your final result was +6.5659 units, your z-score is 6.5659 / 6.4627 ≈ 1.0160, implying a p-value of about 15.482%. That the p-value is so close to the standard deviation is mere coincidence.
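    In Excel terms, that works out to roughly =(48.332811-41.766906)/SQRT(41.766906) for the z-score, and =1-NORMSDIST((48.332811-41.766906)/SQRT(41.766906)) for the p-value.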

    Quote Originally Posted by VideoReview View Post
    I do have a question for you about adhering to the Central Limit Theorem. Not so much from this post but from other posts you have made. I get the feeling that when you say things like "as long as you're comfortable appealing to the Central Limit Theorem" --> (not a direct quote as I am going from memory on this) that maybe I shouldn't be appealing to it and maybe I should be thinking along Bayesian lines. Here are 2 simple questions that I have often wondered about. Relevant to sports betting, do you personally appeal to the Central Limit Theorem for calculating probabilities and significance? Are there situations where you do not?
    I'm just covering my ass -- I don't want some smart alec screaming that the distribution of outcomes isn't actually normal. In your example the CLT provides for a very decent approximate answer. If you had many fewer data points or had a number of big dogs or favorites then your skewed distribution of possible outcomes would not be so well served by asserting normality.

    For example, if you were to change the odds on the 1st bet to +1000 (but keep it as a win) then your units won wouldn't change but your CLT p-value would be 11.768%. A 10,000,000-trial Monte Carlo simulation of same yields a p-value of about 15.95%.

    Quote Originally Posted by VideoReview View Post
    Also, if it were you and you had the choice of running 100,000,000 Monte Carlo simulations (with the better randomizer in place of course) or using the Central Limit Theorem example you proposed, which would you choose?
    It really depends upon the data you're analyzing. As long as you have, let's say, 30 or more data points with odds fairly close to even, you'll get acceptably close results using the CLT. If you're concerned you can of course always verify your results with a quick Monte Carlo sim (couple million trials or so). If the results are comparable you can rest easy.

    Quote Originally Posted by VideoReview View Post
    Finally, the following equation did not work in Excel as it is missing parameters for the function. I am sure that they are probably assumed numbers like 1's and 0's but it would be helpful if you could fill them in for me.

    15.48% (=1-NORMSDIST(1.016))
    Are you sure you entered it as =1-NORMSDIST(1.016) and didn't omit the 'S'? That would give you the "too few arguments for this function" error message in Excel.

  9. #9
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    Thanks Chemist. Is p the no vig ML decimal odds?

    So, if I have +150 odds, the variance would be .6*(1-.6)=.24?

    If so, what do I do with several such numbers (i.e. .24, .25, .1, .6, etc.)?
    p represents the expected win probability and could be calculated as (1+edge)/(decimal odds).

    p*(1-p) represents the variance of a single binomial trial with success rate of p. Multiply that by the decimal payout odds squared and you'll get the variance on a 1-unit-risked bet. This is of course equivalent to the σ^2 = (1+edge)*(decimal odds - edge - 1) formulation.
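    (As a quick check of that equivalence: writing d for the decimal odds and E for the edge, p = (1+E)/d, so d^2 * p * (1-p) = d*(1+E) * (1 - (1+E)/d) = (1+E)*(d - 1 - E).)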

    See Calculating Wager Variance.
    Last edited by SBRAdmin3; 07-03-14 at 06:09 PM.

  10. #10
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by Quebb Diesel View Post
    Ganchrow... what's your educational background?
    Bachelor's in math, Master's in econ, both from Brown.

  11. #11
    yahoonino
    yahoonino's Avatar SBR PRO
    Join Date: 08-10-07
    Posts: 2,651
    Betpoints: 1461

    It's all Greek to me.

  12. #12
    LT Profits
    LT Profits's Avatar Become A Pro!
    Join Date: 10-27-06
    Posts: 90,963
    Betpoints: 5179

    I love the guy and have conversed with him often, but I need to take some "Ganchrow as a Second Language" courses!

  13. #13
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    I'm just covering my ass. In your example the CLT provides a very decent approximate answer. If you had many fewer data points or had a number of big dogs or favorites then your skewed distribution of possible outcomes would not be so well served by asserting normality.

    For example, if you were to change the odds on the 1st bet to +1000 (but keep it as a win) then your units won wouldn't change but your CLT p-value would be 11.768%. A 10,000,000-trial Monte Carlo simulation of same yields a p-value of about 15.95%.

    It really depends upon the data you're analyzing. As long as you have, let's say, 30 or more data points with odds fairly close to even, you'll get acceptably close results using the CLT. If you're concerned you can of course always verify your results with a quick Monte Carlo sim (couple million trials or so). If the results are comparable you can rest easy.

    Are you sure you entered it as =1-NORMSDIST(1.016) and didn't omit the 'S'? That would give you the "too few arguments for this function" error message in Excel.
    For some strange reason I could not read about 10% of the right side of your reply. I was able to read the missing parts now as I reply though so no big worries.

    I did omit the "S" out of habit. Good call. I have got the CLT to work fine now in my spreadsheet by going through your comments.

    The difference between the CLT and Monte Carlo methods is disturbing to me since I do often have significantly skewed subsets of odds that I work with and I sure would not want to be near full Kelly with a CLT calculation. The fact that only one stray (the +1000) can throw off a sample that much seems to be much more risky than the chances of a 10,000,000 trial run being off by that same percentage. However, from what I can gather, I think the CLT is good as a double-check to a large Monte Carlo run. Does this make sense to you?

    You had mentioned another scenario:

    "Another possibility would be to break up the 55 bets into manageable tranches of (let's say) 11 bets apiece, enumerate the results for each tranche using, and then determine exact p-values using the binomial distribution. You could then use Fisher's chi-square method to determine a joint significance for the entire data set."

    You say "enumerate the results for each tranche using,".

    I am not sure what you are using to do this. The first question I have about this option is whether I will run into the same sort of thing as the CLT if I have data with a skewed distribution. In other words, is this method still inferior to a large Monte Carlo run of 10,000,000 or more when the distribution of the data is skewed or unknown? If I won't run into the problems I mentioned (or didn't mention because I don't know about them yet) and the p-value can be easily calculated exactly, I would like to know more.

  14. #14
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    The difference between the CLT and Monte Carlo methods is disturbing to me since I do often have significantly skewed subsets of odds that I work with and I sure would not want to be near full Kelly with a CLT calculation. The fact that only one stray (the +1000) can throw off a sample that much seems to be much more risky than the chances of a 10,000,000 trial run being off by that same percentage. However, from what I can gather, I think the CLT is good as a double-check to a large Monte Carlo run. Does this make sense to you?
    Yes.

    Quote Originally Posted by VideoReview View Post
    "Another possibility would be to break up the 55 bets into manageable tranches of (let's say) 11 bets apiece, enumerate the results for each tranche using, and then determine exact p-values using the binomial distribution. You could then use Fisher's chi-square method to determine a joint significance for the entire data set."

    You say "enumerate the results for each tranche using,".
    Either omit the "using" or change it to "using any method you like".

    There's an entirely different and nontrivial set of problems with using the Fisher method as I outlined. Most importantly, the results you obtain will be very sensitive to the specific outcomes you include in each tranche, and as such you'd need to perform numerous trials with randomly configured tranches. This may well take longer to converge than a straight Monte Carlo.

    I have a script that utilizes this method. Let me know if you'd like to see it.

    You're probably best off just sticking with Monte Carlo.

  15. #15
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    This is the script implementing the Fisher method as I outlined. I just hacked it together so I'm sure it could easily be optimized for improved performance. It utilizes the Statistics::Distributions module.

    Code:
    #!perl
    
    # Author: ganchrow@sbrforum.com
    # a rather poor iterative implementation of the 
    # Fisher Method
    use strict;
    use warnings;
    use Statistics::Distributions;
    
    use constant EDGE		=>	0;
    use constant TRANCHE_SIZE	=>	12;
    use constant TRIALS		=>	1_000;
    
    my @lines = <>;
    my $total_log_chi_square_result = 0;
    for (my $trial = 1; $trial <= TRIALS; $trial++) {
    #	print STDERR "Trial $trial\n" if $trial%10 == 0;
    	my $total_risk = 0;
    	my $tranche_results_r;
    	my $bets = 0;
    	&fisher_yates_shuffle(\@lines);
    
    	foreach(@lines) {
    		chomp;
    		my ($us,$iswin) = split;
    		next unless $us;
    		$bets++;
    		my $this_tranche_no = (($bets-1) - ($bets-1) % TRANCHE_SIZE) / TRANCHE_SIZE;
    		my $dec = &us2dec($us);
    		my $prob = (1+EDGE)/$dec;
    		my $risk = 1/($dec-1);
    		my $win = $risk * ($dec-1);
    		$tranche_results_r->[$this_tranche_no]->{total_risk} += $risk;
    		$tranche_results_r->[$this_tranche_no]->{total_win} += ($iswin ? $win : -$risk);
    		push @{$tranche_results_r->[$this_tranche_no]->{bets}}, [$prob, $risk, $win];
    		$total_risk += $risk;
    	}
    	my $number_tranches = scalar @{$tranche_results_r};
    	my $fisher_stat = 0;
    	foreach my $n ( 0 .. $number_tranches-1 ) {
    		my @bets = @{$tranche_results_r->[$n]->{bets}};
    		my $total_risk = $tranche_results_r->[$n]->{total_risk};
    		my $total_win = $tranche_results_r->[$n]->{total_win};
    		my $p_value = &enumerate_coutcomes(\@bets, $total_win);
    	#	warn "Trache $n: $p_value\n";
    		$fisher_stat += -2*log($p_value);
    	}
    	my $dof = 2*$number_tranches;
    	my $chi_square_result = Statistics::Distributions::chisqrprob($dof, $fisher_stat);
    	#warn "Fisher stat: $fisher_stat\n";
    	#warn "D.O.F.: " . 2*$number_tranches . "\n";
    	#warn "chi-square: $chi_square_result\n";
    	print "$trial\t$chi_square_result\n";
    	$total_log_chi_square_result += log($chi_square_result);
    }
    warn "p-value of outcome: " . exp($total_log_chi_square_result/TRIALS);
    
    sub enumerate_coutcomes(\@$) {
    	my @bets = @{+shift};
    	my $win_bogey = shift;
    	my $num_bets_in_tranche = scalar @bets; # will always be equal to TRANCHE_SIZE except possibly for the last tranche
    	my $num_outcomes = 2 ** $num_bets_in_tranche;
    	my $qualifying_prob = 0;
    	for (my $i = 0; $i < $num_outcomes; $i++) {
    		my $this_outcome = sprintf("%0${num_bets_in_tranche}b", $i);	# the outcome number in binary
    		my $this_outcome_prob = 1;
    		my $this_outcome_result = 0;
    		for (my $j = 0; $j < $num_bets_in_tranche; $j++) {
    			my ($prob, $risk, $win) = @{$bets[$j]}[0,1,2];
    			my $is_this_bet_a_win = substr($this_outcome, $j, 1);
    			$this_outcome_prob *= $is_this_bet_a_win ? $prob : (1-$prob);
    			$this_outcome_result += $is_this_bet_a_win ? $win : -$risk;
    		}
    		$qualifying_prob += $this_outcome_prob if $this_outcome_result  >= $win_bogey;
    	}
    	return $qualifying_prob;
    }
    
    sub us2dec {
    	my $us = shift;
    	return (
    		$us >= 0 ? 1+$us/100 : 1-100/$us
    	);
    }
    
    sub fisher_yates_shuffle {
        my $list = shift;  # this is an array reference
        my $i = @{$list};
        return unless $i;
        while ( --$i ) {
            my $j = int rand( $i + 1 );
            @{$list}[$i,$j] = @{$list}[$j,$i];
        }
    }


    Honestly, after playing around with this a bit I think this particular method is pretty shitty. Bad idea, I guess.

  16. #16
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Yes.
    You're probably best off just sticking with Monte Carlo.
    That is the answer I needed to hear.

    I hope everyone following this thread who is not already a long-term profitable sports bettor realizes how important the information and final recommendation Ganchrow has given us are. Imagine: I can now calculate with personal and mathematical confidence (at whatever confidence level I would like <100%) the probability of results for any non-normally distributed set of odds (e.g. a system) that I come up with, assuming my sampling methodology is not biased.

    Ganchrow, please correct me if I am wrong here, but this seems to me like it is a very big piece of the puzzle to calculating a reliable win percentage.

  17. #17
    20Four7
    Timmy T = Failure
    20Four7's Avatar Become A Pro!
    Join Date: 04-08-07
    Posts: 6,703
    Betpoints: 4120

    Quote Originally Posted by LT Profits View Post
    I love the guy and have conversed with him often, but I need to take some "Ganchrow as a Second Language" courses!
    I"ve looked at the course for a while. When you register sign me up please.

  18. #18
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Here's a very simple Monte Carlo script coded in Perl. (You can download a free copy of Perl from http://www.activeperl.com/.) On a decent machine you should easily be able to run 2,000,000 55-bet trials in a minute or less. You should modify the EDGE, TRIALS, and BOGEY constants to suit your needs.
    I have installed perl from the above url and have saved the script as a .pls file but I can not seem to find a way to run the interpreter to execute the script. I have the Perl Package Manager on my screen but do not see an obvious way to run a script. Any ideas anyone?

  19. #19
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    I have installed perl from the above url and have saved the script as a .pls file but I can not seem to find a way to run the interpreter to execute the script. I have the Perl Package Manager on my screen but do not see an obvious way to run a script. Any ideas anyone?
    Save the script as a ".pl" file. Then run from the command prompt:

    perl script_name.pl input_file_name > output_file_name

    The input file is the file containing the newline-separated payout odds in US format. If you don't care about saving the output (it will be a lot of data -- the unit results for each trial in the sim) use NUL as the output file name on Windows and /dev/null as the output file name on Linux/UNIX.

  20. #20
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Save the script as a ".pl" file. Then run from the command prompt:

    perl script_name.pl input_file_name > output_file_name

    The input file is the file containing the newline-separated payout odds in US format. If you don't care about saving the output (it will be a lot of data -- the unit results for each trial in the sim) use NUL as the output file name on Windows and /dev/null as the output file name on Linux/UNIX.
    I am using Windows. Here is the line I run:

    perl c:\Users\David\Desktop\test.pl test.csv > NUL

    Both the test.pl and test.csv files are on the desktop. The test.csv file is a text file with one US odds per new line.

    When I run the line it immediately opens up a window with the main perl directory with folders for bin, eg, etc, html...

    Any ideas?

  21. #21
    Quebb Diesel
    Quebb Diesel's Avatar Become A Pro!
    Join Date: 01-26-08
    Posts: 3,045
    Betpoints: 82

    Ganchrow... I'm actually an MS student going for statistics and right now I HATE the factorization criterion and sufficient/complete statistics... please tell me there's a legitimate reason to know this stuff in the real world!!!

  22. #22
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    I am using Windows. Here is the line I run:

    perl c:\Users\David\Desktop\test.pl test.csv > NUL

    Both the test.pl and test.csv files are on the desktop. The test.csv file is a text file with one US odds per new line.

    When I run the line it immediately opens up a window with the main perl directory with folders for bin, eg, etc, html...

    Any ideas?
    Is that a new command prompt window? If not try closing and reopening.

    Try typing "perl -h" at the prompt and note the result.

    Try calling perl as C:\Perl\bin\perl.exe (or whatever its actual path).

  23. #23
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by Quebb Diesel View Post
    I HATE the factorization criterion and sufficient/complete statistics... please tell me there's a legitimate reason to know this stuff in the real world!!!
    It obviously depends on what you plan to do in the real world. Were your goal to appear on The Real World, then I'd strongly suspect it would be of little if any importance.

    Grad school is really about learning a specific mode of thought and then remembering just enough actual information to be able to figure out what you need to look up that one time you really need it 6 years later.

  24. #24
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Is that a new command prompt window? If not try closing and reopening.

    Try typing "perl -h" at the prompt and note the result.

    Try calling perl as C:\Perl\bin\perl.exe (or whatever its actual path).
    Yeah, that was the problem alright. I was trying to run it right from the run command line. Once I opened the new command prompt window, I changed directories to the desktop and when I ran perl -h, all the options came up. It ran perfectly. Sweet.

    Took 16 minutes and 22 seconds for 3,000,000. Much slower than you said but I am only running an AMD Athlon 64 X2 laptop that I bought pretty cheap. Still, 183,300 trials per minute is much better than the 600 per minute I was doing with Excel and I can now set it up for a 100,000,000 trial run at night and wake up to the results.

    Here are the results I got for the original odds sample:

    MEAN -.00514111167514918
    STD DEV 6.46675394581734
    QUAL 0.15545533333333333

    I noticed my standard deviation is off a bit from yours. Would the difference you see be in the range you would expect from this type of sample?

    I understand how to interpret the standard deviation from your previous post but I would like an explanation of the MEAN in this case. I assume the QUAL can be used to calculate p and means significant at the p=1-NORMSDIST(.1572/.15545533333333333) or 15.5955% level. After I receive confirmation and clarification from you on this, I think I am well on my way and will be busy for some time.

    Thanks again.
    Last edited by VideoReview; 02-17-08 at 08:18 PM. Reason: Correcting My Bad Math

  25. #25
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    Took 16 minutes and 22 seconds for 3,000,000. Much slower than you said but I am only running an AMD Athlon 64 X2 laptop that I bought pretty cheap. Still, 183,300 trials per minute is much better than the 600 per minute I was doing with Excel and I can now set it up for a 100,000,000 trial run at night and wake up to the results.
    Sounds to me like you need a new computer.

    Quote Originally Posted by VideoReview View Post
    Here are the results I got for the original odds sample:

    MEAN -.00514111167514918
    STD DEV 6.46675394581734
    QUAL 0.15545533333333333
    Quote Originally Posted by VideoReview View Post
    I understand how to interpret the standard deviation from your previous post but I would like an explanation of the MEAN in this case.
    The mean is just the average unit result over the course of the simulation and can give an indication of how "good" the simulation was. Since we expect the mean to be zero, given a population std. dev. of about 6.4627 units your result of -0.005141 is off by -0.005141*sqrt(3000000)/6.4627≈ -1.3778 std. devs.
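    (If you want that as an Excel-style check: =ABS(-0.005141)*SQRT(3000000)/6.4627 gives the ≈1.38 standard devs, and =2*(1-NORMSDIST(1.3778)) puts a two-tailed probability of roughly 17% on a deviation at least that large -- nothing to worry about from a single run.)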

    Quote Originally Posted by VideoReview View Post
    I noticed my standard deviation is off a bit from yours. Would the difference you see be in the range you would expect from this type of sample?
    The standard deviations look pretty close to me. The population standard deviation we know to be 6.4627. We know that the test statistic T = (n-1)*(s/σ)^2 should follow a chi-square distribution with n-1=2,999,999 degrees of freedom. Therefore your value of 6.4668 corresponds to a test statistic of 2,999,999*(6.4668/6.4627)^2 ≈ 3,003,734.

    Taking =CHIDIST(3003734, 2999999) yields a critical probability of 6.3698%, meaning we'd be unable to reject the (2-tailed) null hypothesis that the sample and population standard devs are equal even with an alpha as high as 12.740%.

    Quote Originally Posted by VideoReview View Post
    I assume the QUAL can be used to calculate p and means significant at the p=1-NORMSDIST(.1572/.15545533333333333) or 15.5955% level.
    The "QUAL" just corresponds to the p-value. It's is a probability and not a standard deviation.

  26. #26
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Since we expect the mean to be zero, given a population std. dev. of about 6.4627 units your result of -0.005141 is off by -0.005141*sqrt(3000000)/6.4627≈ -1.3778 std. devs.
    How should I, and do I even need to, interpret the -1.3778 standard deviations if I am using the chi-square calculations below? Or is this a case where I can choose the confidence level I would like (let's assume 95%) and if I am off by so many standard deviations at this point, it would be best for me to rerun the simulation. If so, how do I come up with the number of standard deviations for various levels of confidence at this point (i.e. before we get to chi-square)?

    Quote Originally Posted by Ganchrow View Post
    Taking =CHIDIST(3003734, 2999999) yields a critical probability of 6.3698%, meaning we'd be unable to reject the (2-tailed) null hypothesis that the sample and population standard devs are equal even with an alpha as high as 12.740%.
    Where did the alpha number come from and at what level could the null hypothesis be rejected?

    Quote Originally Posted by Ganchrow View Post
    The "QUAL" just corresponds to the p-value. It's is a probability and not a standard deviation.
    I just got confused there for a moment and forgot that I had set the bogey in the program so it already knew the population result and I was trying to make the number into something it was not.

    In general, how important is it for me to do the checks that you have shown as opposed to just running a very large Monte Carlo run and letting the chips fall where they may?

  27. #27
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    How should I, and do I even need to, interpret the -1.3778 standard deviations if I am using the chi-square calculations below? Or is this a case where I can choose the confidence level I would like (let's assume 95%) and if I am off by so many standard deviations at this point, it would be best for me to rerun the simulation. If so, how do I come up with the number of standard deviations for various levels of confidence at this point (i.e. before we get to chi-square)?
    The sample mean should be normally distributed about the population mean with standard deviation equal to the population standard deviation divided by the square root of N. This should give you an idea of the accuracy of your simulation. If you run the sim and routinely see a mean that's off by multiple standard devs then you know something is up.
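    (For this sim that standard error works out to about 6.4627/SQRT(3000000) ≈ 0.0037 units.)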

    Quote Originally Posted by VideoReview View Post
    Quote Originally Posted by Ganchrow View Post
    Taking =CHIDIST(3003734, 2999999) yields a critical probability of 6.3698%, meaning we'd be unable to reject the (2-tailed) null hypothesis that the sample and population standard devs are equal even with an alpha as high as 12.740%.
    Where did the alpha number come from and at what level could the null hypothesis be rejected?
    6.3698% is the right-tail likelihood (i.e., the likelihood of seeing a standard deviation as high as measured or higher). However, since we're looking at a two-tailed test (because we're testing whether the results were either significantly higher or lower than expected), at an alpha of 6.3698% * 2 ≈ 12.740% we couldn't reject the null (or rather, that would be right on the boundary between acceptance and rejection). The point is that we couldn't say that the observed standard deviation is significantly different from what would be expected.

    Quote Originally Posted by VideoReview View Post
    In general, how important is it for me to do the checks that you have shown as opposed to just running a very large Monte Carlo run and letting the chips fall where they may?
    With a straightforward sim like this -- not very important at all. These are just basic sanity checks that are there to provide extra assurance that there weren't any obvious problems with the sim (e.g., problems with the PRNG, the coding, or the input data).
    Last edited by Ganchrow; 02-19-08 at 12:33 AM. Reason: typo -- should be "PRNG"

  28. #28
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    With a straightforward sim like this -- not very important at all. These are just basic sanity checks that are there to provide extra assurance that there weren't any obvious problems with the sim (e.g., problems with the PNRG, the coding, or the input data).
    I am happy to hear that it is a double check only and get the concept now. What does PNRG stand for?

  29. #29
    RickySteve
    SBR is a criminal organization
    RickySteve's Avatar Become A Pro!
    Join Date: 01-31-06
    Posts: 3,415
    Betpoints: 187

    Quote Originally Posted by VideoReview View Post
    I am happy to hear that it is a double check only and get the concept now. What does PNRG stand for?
    I'm assuming he's referring to the pseudo-random number generator.

  30. #30
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    I am happy to hear that it is a double check only and get the concept now. What does PNRG stand for?
    Sorry that was a typo -- it should have been "PRNG".

    RickySteve is of course correct -- "pseudo-random number generator".

  31. #31
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Sounds to me like you need a new computer.
    I did some checking on why my laptop was so slow and I finally got it up to 500,000 per minute. I had been running it in Power Saver mode instead of High Performance for I don't know how long.

    Not near the 2 million you suggested but a bit better.

    Thanks again for everything.

  32. #32
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    Not near the 2 million you suggested but a bit better.
    I have to apologize to you on this ... upon further review I'm only getting a shade more than a million trials a minute (and that's on LINUX). I must have been seeing double.

  33. #33
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    Here's a very simple Monte Carlo script coded in Perl. (You can download a free copy of Perl from http://www.activeperl.com/.) On a decent machine you should easily be able to run 2,000,000 55-bet trials in a minute or less. You should modify the EDGE, TRIALS, and BOGEY constants to suit your needs.
    Does the program work with fractions of American Odds?

    For example, can I have a list like:

    +132.33345445
    +127.13234526
    -209.38983729
    etc.

    These would be fair no vig odds used to calculate p to accept or reject a fair bet for a single variable.

  34. #34
    Ganchrow
    Nolite te bastardes carborundorum.
    Ganchrow's Avatar Become A Pro!
    Join Date: 08-28-05
    Posts: 5,011
    Betpoints: 1088

    Quote Originally Posted by VideoReview View Post
    Does the program work with fractions of American Odds?

    For example, can I have a list like:

    +132.33345445
    +127.13234526
    -209.38983729
    etc.
    Indeed you can.

  35. #35
    VideoReview
    VideoReview's Avatar Become A Pro!
    Join Date: 12-14-07
    Posts: 107
    Betpoints: 270

    Quote Originally Posted by Ganchrow View Post
    The standard deviations look pretty close to me. The population standard deviation we know to be 6.4627. We know that the test statistic T = (n-1)*(s/σ)^2 should follow a chi-square distribution with n-1=2,999,999 degrees of freedom. Therefore your value of 6.4668 corresponds to a test statistic of 2,999,999*(6.4668/6.4627)^2 ≈ 3,003,734.
    I have taken an entire population sample of 6300 games and divided it in half based on one variable.

    Here are my statistics assuming betting to win the same 1 unit:
    # Of Events (n) = 3150
    Win = 2685.561544
    Bet = 2636.999913

    I got a compile error when I ran the PRNG script, so I am still running the original Monte Carlo script. Here are the results of 1,000,000 trials:

    Mean = 0.1047164613929
    STD DEV = 51.368468144382
    QUAL (p) = .172453

    I have run two 1,000,000-trial simulations on two different populations of over 3000 games each and had a mean of .093 on one and .105 on this one. I am starting to get concerned that "something is up".

    This would mean my results are off by 0.1047164613929 * SQRT(1,000,000) / 51.368468144382 = 2.0385346 STD DEV. My other sample of 1,000,000 on a population of 3225 using a different variable was off by 1.8091654 STD DEV. These are the only 2 runs I have done so far. I now see that double check can be important.

    My other concern is that my test statistic on the sample with the Monte Carlo results of:
    Mean = 0.1047164613929
    STD DEV = 51.368468144382
    QUAL (p) = .172453

    Would be: (1,000,000 - 1) * ( 51.368468144382 / SQRT(2636.999913) )^2 = 1,000,651.

    When I put the following into Excel I get CHIDIST(1000651,999999) = #NUM!

    I tried to see what X would give me about 6.36% like the original 55-event sample just for fun and saw I would need a number like 1,002,156. I also wanted to see the lowest possible X that Excel would take and found that X = 1,001,090 (CHIDIST(1001090,999999) = .22016243) is the absolute minimum Excel will accept for X in this formula.

    I am starting to suspect it is my random number seeding. I can give you the decimal odds by email if you want (I don't want to post 3000+ odds on the forum). Also, the 3150 samples are all fair no-vig decimal odds that I am using to see if I can reject a fair-bet null hypothesis. The trail I am on at the moment is to start building a model by finding single variables that reject a fair bet at the 5% level. I am sure that I will find some and am almost equally sure that I will not find any that by themselves can reject a profitable bet at -105. However, it is my assumption that when I combine the variables that reject the null hypothesis of a fair bet, I will be able to isolate combinations that reject the null hypothesis of a profitable bet at -105. Does this seem plausible?

    What do you think is happening with my Mean?
    Last edited by VideoReview; 02-27-08 at 11:18 AM. Reason: Incomplete Post
