Meaningful or random distribution?

mtneer1212 · 07-28-08, 11:00 PM

Originally posted by Dark Horse

Twelve people having nothing better to do than catching bullets with their teeth, shot from a semi-automatic gun placed in front of them.

Some are slower, some are faster, but all stop when a bomb drops right in the back yard.

All die instantly.

Later in the morgue, it turns out that eight people had shown unremarkable capacity to catch bullets with their teeth.

Out of the remaining characters, two showed above average capacity and the other two were below average.

If we might expect an even distribution, or something approaching an even distribution, are the following distributions random or meaningful:

27 bullets caught; 46 not caught
23 bullets caught; 16 not caught
11 bullets caught; 21 not caught
21 bullets caught; 9 not caught.

Did the good guys know what they were doing, or were they just lucky? Did the bad guys deserve to die anyway, or were they unlucky?

Not meaningful data. Too small of a sample size to have any confidence in the distribution.
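For a rough sense of how loosely records of this size pin down a true rate, here is a minimal sketch that computes a 95% Wilson score interval for a win-loss record. The function name `wilson_interval` and the choice of a 95% level (z = 1.96) are assumptions for illustration, not anything specified in the thread.

```python
import math

def wilson_interval(wins, losses, z=1.96):
    """Wilson score interval for a win proportion (z=1.96 is an assumed 95% level)."""
    n = wins + losses
    p_hat = wins / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width

# The four records quoted in the post, as (caught, not caught).
for wins, losses in [(27, 46), (23, 16), (11, 21), (21, 9)]:
    low, high = wilson_interval(wins, losses)
    print(f"{wins}-{losses}: {low:.3f} to {high:.3f}")
```

With only 30 to 75 trials per record, each interval spans a wide range of plausible underlying rates, which is the sample-size concern in concrete form.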

Ganchrow · 07-29-08, 12:22 AM

I can't say I completely understand your question, DH. It might be the way in which it's worded or it might just be me.

Firstly, you don't give statistics for the 8 people that "showed unremarkable capacity to catch bullets with their teeth". These individuals are a part of the sample, and completely ignoring them could be a form of data mining (depending upon potentially exogenous factors not specified in the problem).

Secondly, what exactly do you mean by "an even distribution"? Do you mean to say that we can expect an average individual not designated a member of the first tranche would have a 50% probability of catching a bullet in his teeth? Are we further to assume that a failed attempt never results in death?

Thirdly, stated in specific terms, what exactly is the hypothesis that you're looking to test?

Dark Horse · 07-29-08, 02:28 AM

Points are distributed in random fashion so that they combine as either 4-3, 5-2, or 6-1 (3 possible outcomes). These 3 outcomes, assigned to one of two teams before each game, are placed into a HF-RF-HD-RD grid, creating twelve different situations. Before each game the 'most points' are assigned to one team. In other words, if the HD is given the 4-3, then the RF automatically receives the 3-4, but the outcome of the game is recorded from the perspective of the HD (there is no double recording). There is not necessarily strength in having the higher number.

My first question is: is it within normal distribution parameters that the 4-3 HD produces 23-16 ATS results, while the 4-3 RD goes 11-21 ATS? Is it meaningful or random that the 6-1 HF goes 21-9 ATS and the 4-3 HF goes 27-46 ATS?

My second question: when looking at a grid of twelve situations, is it meaningful or random that 33% of those have numbers that 'pop'? Am I looking at a normal distribution pattern, or is this a little odd? Next to other grids I've done (and usually tossed out), this one seems more extreme. Sample size is 292 games. For the record, I'm not assigning value to it, and have kept the record on the side.

Ganchrow · 07-29-08, 10:27 AM

If I'm understanding you correctly then you're probably going to want to look at Pearson's chi-square test.
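As a minimal sketch of what that test looks like on the four records quoted above, the snippet below computes Pearson's chi-square statistic for each record against a 50/50 null. The 50% ATS expectation (with pushes ignored) is an assumption taken from the "even distribution" wording, and `chi_square_vs_even` is a hypothetical helper name. With one degree of freedom the p-value has a closed form, so only the standard library is needed.

```python
import math

def chi_square_vs_even(wins, losses):
    """Pearson chi-square statistic and p-value against an assumed 50/50 null (1 df)."""
    n = wins + losses
    expected = n / 2.0  # assumption: wins and losses equally likely
    stat = ((wins - expected) ** 2 + (losses - expected) ** 2) / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Records as given in the thread: (ATS wins, ATS losses).
records = {
    "4-3 HF": (27, 46),
    "4-3 HD": (23, 16),
    "4-3 RD": (11, 21),
    "6-1 HF": (21, 9),
}

for label, (wins, losses) in records.items():
    stat, p = chi_square_vs_even(wins, losses)
    print(f"{label}: {wins}-{losses}  chi2={stat:.3f}  p={p:.4f}")
```

Taken one at a time, the 27-46 and 21-9 records fall below the conventional 0.05 level and the other two do not. These are per-cell figures only: they make no allowance for the fact that the four cells were picked out of a grid of twelve, and a test of the whole grid would need the eight records not given here.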

What most concerns me, however, is your statement that, "Next to other grids I've done (and usually tossed out), this one seems more extreme," which to me positively reeks of data mining. I have no idea of the significance of this particular "grid" (not having complete access to your data), but I'll nevertheless point out that if you construct enough of these types of grids then purely by chance some number will exhibit spurious correlations likely lacking in predictive power.
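The point about chance findings can be put in rough numbers. The sketch below assumes a 12-cell grid whose cells are independent and all truly 50/50, and assumes a cell "pops" when its individual test gives p < 0.05. Both are simplifying assumptions for illustration, and `prob_at_least` is a hypothetical helper.

```python
from math import comb

def prob_at_least(k, cells, alpha):
    """Binomial tail: chance that at least k of `cells` independent null cells have p < alpha."""
    return sum(
        comb(cells, j) * alpha ** j * (1.0 - alpha) ** (cells - j)
        for j in range(k, cells + 1)
    )

CELLS = 12    # size of the grid described in the thread
ALPHA = 0.05  # assumed per-cell significance threshold

print(f"P(at least 1 spurious cell)  = {prob_at_least(1, CELLS, ALPHA):.3f}")
print(f"P(at least 2 spurious cells) = {prob_at_least(2, CELLS, ALPHA):.3f}")
print(f"Expected spurious cells      = {CELLS * ALPHA:.2f}")
# Bonferroni correction: a stricter per-cell threshold that holds the
# grid-wide false-positive rate near ALPHA.
print(f"Bonferroni per-cell threshold = {ALPHA / CELLS:.4f}")
```

Under these assumptions a single grid has a bit under an even chance of showing at least one "significant" cell with no real effect behind it, and examining several grids before settling on the most extreme one raises that chance further.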

As I've mentioned before, the very first step is building a coherent theory based (at least in part) on exogenous knowledge of the sport in question, and only once this is done, testing the theory in-sample. Constructing theories for the express purpose of explaining what are quite possibly little more than in-sample data anomalies is simply begging for out-of-sample disaster.

Dark Horse · 07-29-08, 12:05 PM

Originally posted by Ganchrow

If I'm understanding you correctly then you're probably going to want to look at Pearson's chi-square test.

What most concerns me, however, is your statement that, "Next to other grids I've done (and usually tossed out), this one seems more extreme," which to me positively reeks of data mining. I have no idea of the significance of this particular "grid" (not having complete access to your data), but I'll nevertheless point out that if you construct enough of these types of grids then purely by chance some number will exhibit spurious correlations likely lacking in predictive power.

As I've mentioned before, the very first step is building a coherent theory based (at least in part) on exogenous knowledge of the sport in question, and only once this is done, testing the theory in-sample. Constructing theories for the express purpose of explaining what are quite possibly little more than in-sample data anomalies is simply begging for out-of-sample disaster.

Pearson's Chi-Square Test. Wow!

I do appreciate your warning, but it is why I asked the question. I know how to build working models per sport. This is not in that category, but a purely abstract breakdown. So the underlying question, the one behind my question, is: is it possible to have a purely abstract, meaningful breakdown?

A dead end? I'd bet on it. But before tossing it out, I'd like to put a statistical number to what I'm looking at.