I don't understand statistics

This question is probably too elementary for Veeky Forums, but I was hoping someone could enlighten me on how statistics works. I just don't understand how a sample can accurately represent a population.

Let's say they want to see whether Americans prefer apples or oranges. So they poll 1000 Americans with the question: What do you like better, apples or oranges? In the poll, 600 say apples and 400 say oranges.

So the headline is: "60% of Americans prefer apples to oranges". And I just don't understand this. There are 325 million people living in the USA, and yet it only takes 1000 to determine America's fruit preference? How does that make sense?

I understand the concept of random sampling. But it seems like human beings have so many variables that just sampling at random doesn't mean much. Especially in a country like the USA with vastly different cultural regions.

And also with polls, you're only getting the people who choose to participate. Probably many people refused. Is this not something worth considering as well?

It just doesn't make sense to me, to say the opinion of 1000 people can equal the opinion of 325 million. If that's the case, why even bother having elections? May as well pull 1000 random votes from the entire country and determine the president that way. I mean, same thing, right?


YOU STUPID ASS FUCKING PIGGOT, YOU OINKING BASTARD

why are you so narrow minded? it's all about probability: choosing a sample large enough that it represents the larger population so well that, with very high probability, whatever can be said about the small pool can also be said about the larger pool => the entire country.

What's so hard about this, you god damned pig?

In your case it isn't the statistics that's the problem--it's the headline. Statistically, what we are saying is that we ESTIMATE that 60% of Americans prefer apples.

We say this because that is all the information we really have (and from a more technical standpoint, this would be the maximum likelihood estimate, i.e. the value of the binomial success parameter that maximizes the probability of having observed that particular sample).

What's more, we can use this information to test hypotheses about more general statements -- do more Americans prefer apples or oranges, etc.

In short, we can't say for certain that 60% of Americans prefer apples, but this is simply a good guess based on the information available that we can then use to make more general statements about the preferences among the general population.

Typical engineering mathematics guy

You are correct. 1000 people cannot be used to make a generalized statement about 325 million peoples. You need about 50 million people. Anything less is politics, not mathematics.

>You need about 50 million people to make a generalized statement about 325 million peoples.
What the fuck?

Wrong. 1000 people CAN be used to generalize, as long as the sample is randomly selected and therefore representative of the population. A sample size of 1000 estimating a population of 325 million has a margin of error of roughly 0.03 (3%).

Using a confidence interval, we can say with 95% confidence that between 57% and 63% of Americans prefer apples.
That means that if repeated samples were taken, 95% of the resulting intervals would contain the population proportion. So unless we got an unlucky 1-in-20 sample, ours is one of the 19/20 intervals that DOES contain the true proportion, and the population proportion therefore lies between 57% and 63%.
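The interval arithmetic is easy to check yourself. A minimal sketch, using the normal approximation to the binomial (1.96 is the two-sided 95% critical value of the standard normal; the 600/1000 split is the thread's made-up poll):

```python
import math

# Sample: 600 of 1000 respondents prefer apples.
n = 1000
p_hat = 600 / n

# Standard error of a sample proportion, and the 95% margin of error.
se = math.sqrt(p_hat * (1 - p_hat) / n)
moe = 1.96 * se

low, high = p_hat - moe, p_hat + moe
print(f"95% CI: {low:.3f} to {high:.3f}")  # 95% CI: 0.570 to 0.630
```

So the margin of error at n = 1000 really is about 3 points, which is where headline pollsters get "plus or minus three percent".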

This is an introductory statistics topic. If you want to learn more about quantifying uncertainty and making predictions, take statistical inference classes (ANOVA is a useful skill to know too).

>he doesn't know about representative samples and how they should be at least 10% to be significant

this is why statistics is amazing.

As a mathematician, you learn how to prove theorems that are manmade and have no meaning unless you think in a manmade world (which changes depending on the mathematician).

As a statistician, you predict the future essentially and learn how to handle uncertainty and real numbers - essentially a magician.

this isn't actually how things work.
as it turns out, a sample size of 1067 will give you an answer to within 3% with 95% confidence. want to get that to within 1% with 99% confidence? you'll only need 16,640.

pic related, it's you.

>It just doesn't make sense to me, to say the opinion of 1000 people can equal the opinion of 325 million.

The point isn't to blindly say population statistic = sample statistic, the point of statistics is quantifying how good this estimate is.

as a PhD statistician, can confirm
we can already analyze Veeky Forums

I feel like everyone doing a science can benefit from a double major in statistics.

Double major in math helps too, but probably more for theoretical aspects, which may not apply directly to your field (might be good trivia).

My b

>1000 people CAN be used to generalize. As long as the sample is randomly selected, and therefore, representative of the population.
This seems wrong on its face

If within the 325 million people there are (say) 2,000 "types" of people, then you physically can't get a representative sample out of only 1000.

Being representative is not a property of a single sample, but of the algorithm that carries out the sampling.
A single representative sample doesn't have to capture everything, but you need the sampling algorithm to capture everything if done repeatedly.

For a simple toy example, suppose I hand you a coin and tell you that it's loaded so it only turns up heads 10% of the time. To test this claim, you flip the coin 10 times.
This constitutes your sample: 10 observations from the infinite population consisting of all times that the coin is flipped.
Now, how many "ways" are there to flip that coin? Arguably infinitely many, depending on the number of revolutions it makes in the air, the amount of force applied, the height at which you flip it, etc. In this sense, no finite sample will ever "truly represent" the infinite population.

But whether you agree with this perspective or not, it's impractical to say that flipping a coin 10 times, or any finite number of times, will never tell you anything about whether the coin is loaded or not.

Analogously, no sample of the American population (short of a full census) will allow you to generalize with 100% certainty. But if you're willing to compromise on the 100%, statistics can quantify the level of certainty you're allowed to have, based on details of the sampling algorithm (e.g., number of observations, underlying statistical model, presence of sampling biases etc.).
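That "property of the algorithm" point can be made concrete: simulate the whole sampling procedure many times against a population whose true preference we fix ourselves, and count how often the resulting 95% interval captures the truth. A sketch (the 0.6 population parameter is invented for the demo):

```python
import random

random.seed(0)
TRUE_P, N, TRIALS = 0.6, 1000, 2000  # invented population, poll size, repetitions

hits = 0
for _ in range(TRIALS):
    # One run of the sampling algorithm: poll N random people.
    apples = sum(random.random() < TRUE_P for _ in range(N))
    p_hat = apples / N
    moe = 1.96 * (p_hat * (1 - p_hat) / N) ** 0.5
    # Did this run's 95% interval capture the true proportion?
    hits += (p_hat - moe) <= TRUE_P <= (p_hat + moe)

print(f"coverage: {hits / TRIALS:.3f}")  # close to 0.95
```

Any single run can miss; the 95% guarantee is about the procedure, which is exactly the distinction drawn above.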

>random
>representative
here are your two spooks
>>As a statistician, you predict the future essentially and learn how to handle uncertainty and real numbers - essentially a magician.
only rationalists believe that statistics brings any kind of knowledge, and that this knowledge somehow corrects 'what is sensed'

it's like this: if you randomly pick 1000 people, what are the chances that the people you've chosen will mostly come from one select group?

Not sure I ever saw a questionnaire that tried to sort the sample into 2,000 possible categories per question, though.

Survey documents typically offer a few choices, and then may break some of them down into a few sub-choices.

But frankly, nobody cares about something only 1 out of 2,000 people think. If that few people think a certain way about a certain thing, who cares?

Of course, you're right that one sample can never "truly" capture the population. Although astronomically unlikely, it is still a logical possibility that all 324,999,000 people not included in the survey prefer apples to oranges. Same with confidence intervals. Sure, you can say that the true proportion is in the confidence interval 99.9% of the time (say), but you can never actually know whether the confidence interval of your original sample contains the true proportion unless you give the survey to all 325 million people.

The reason statistics "works" is because we don't expect it to be 100% precise 100% of the time. It's not a crystal ball that can read the minds of millions of people or predict the future (remember the presidential election?).

Flip a coin, it may give tails with a 100% rate.
Flip again, it may give tails again, 2/2, 100% rate
Flip again, it gives you heads: a 2/3 rate of tails, or 66%
If you flipped the coin a thousand times, the average would be close to a perfect 50% rate.
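That convergence is the law of large numbers, and you can watch it happen with a simulated fair coin (a sketch, not real data):

```python
import random

random.seed(42)

tails = 0
for flips in range(1, 10_001):
    tails += random.random() < 0.5  # one simulated fair flip; True counts as tails
    if flips in (1, 10, 100, 1_000, 10_000):
        print(f"{flips:>6} flips: {tails / flips:.1%} tails")
```

The early rates swing wildly (100% or 0% after one flip), while the later ones hug 50% ever more tightly.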

That's how it works with polling too.
The first poll may be 95% accurate.
Doing a second poll may boost your accuracy to 98%. And doing a third boosts it to 99%.

>That's how it works with polling too.
>The first poll may be 95% accurate.
>Doing a second poll may boost your accuracy to 98%. And doing a third boosts it to 99%.

this is not how it works. you don't understand it, so why spread false information and pretend you know what you're talking about? at least give a disclaimer (e.g. "this is a poorly educated guess, but here is how I think it works: blah blah")

robertniles.com/stats/margin.shtml

>Just as asking more people in one poll helps reduce your margin of error, looking at multiple polls can help you get a more accurate view of what people really think. Analysts such as Nate Silver and Sam Wang have created models that average multiple polls to help predict which candidates are most likely to win elections.

Boy you must feel really stupid right now.
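The quoted point about averaging is just variance reduction: k independent, unbiased polls of n people behave like one poll of k*n people, shrinking the spread by a factor of sqrt(k). A simulated sketch, assuming independent unbiased polls of a population whose true preference is 0.6 (an invented number):

```python
import random
import statistics

random.seed(1)
TRUE_P, N, REPS = 0.6, 1000, 300  # invented preference, poll size, repetitions

def one_poll() -> float:
    """Simulate one poll of N random respondents; return the sample proportion."""
    return sum(random.random() < TRUE_P for _ in range(N)) / N

single = [one_poll() for _ in range(REPS)]
averaged = [statistics.mean(one_poll() for _ in range(4)) for _ in range(REPS)]

# Averaging four polls should roughly halve the spread (sqrt(4) = 2).
print(statistics.stdev(single), statistics.stdev(averaged))
```

In reality polls are neither independent nor unbiased (shared house effects, overlapping methods), which is why the model-based averaging of Silver and Wang is harder than a plain mean.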

This is why anyone who'd properly report on these fictional stats would start off with something like "According to a question asked of 1000 Americans..." I think the problem here is mostly with journalism rather than statistics.
The sample is a quick way to get a general idea of something. What's practical about it is that it's easier to do than a census, so it can be done multiple times, and the numbers collected from each run can be combined into a rather representative average.

>Piggot
Huh? Can someone explain what this is?

good post

You have to think of statistical results as evidence.

If you poll 1000 americans with a perfectly random sample, and find that 60% of them prefer apples to oranges, it's a small amount of evidence (but not negligible) that americans prefer apples to oranges. If you saw a study like that, you should update your beliefs accordingly (you should now be more willing to expect americans to like apples instead of oranges, if ever that query came up).

>statistics can predict the future

then why aren't you a billionaire yet

It's a portmanteau of the words "pig", "nigger", and "faggot".
Pig-nigger-faggot, or piggot for short.

>If that few people think a certain way about a certain thing, who cares?
Whichever group that 1/2000 arbitrarily got lumped into certainly enjoys the boost
Give them enough of those boosts and they'll develop into a false majority

"Nigger" was never part of it you fucking piggot

I think the point being made is that your sample really does need to correspond to your actual population. Taking 1000 random people in a single city only represents that city, as cultures and tastes differ throughout the country.

To make an extreme example, if there was a city full of people allergic to apples, and only they were used in the sample, you would not then say that 0% of Americans eat apples.

So what is it then that constitutes a representative sampling? If there are 2000 cities or regions with their own apple preferences, would a 1000 person sample still be enough to predict the general tastes across all of the country (within 3%)?

(1/2)
>So what is it then that constitutes a representative sampling?
The technical definition is that every member of the population must have an equal and independent probability of being selected for the sample.

Don't forget that statistics is still ultimately math, and math is an abstraction for modelling the process of sampling. Whether or not a sample is representative, or how to check if a sample is representative, is an empirical rather than theoretical question.

To give an analogy from science, the SI unit for temperature is (currently) defined in terms of an experimental setup involving water at a so-called 'standard pressure'. We can ask "how do you check if a setup is actually at the standard pressure?", and the answer will depend on the laboratory conditions, apparatus used, and other things which I don't really know about.
But none of that actually matters for the theory of thermodynamics which assumes a well-defined notion of temperature; the onus is on the experimentalist to ensure that their implementations match up with the theory, so that they can use the theory to make useful predictions.

Going back to your example, it is the responsibility of your hypothetical surveyor to ensure that the sample is carried out in a fair and unbiased manner, and it is only once the technical condition of a representative sample is assured that they are allowed to justify their generalization on statistical grounds.
(In practice, the technical condition is almost never met, but it is often assumed away as a "simplifying approximation" though the more scrupulous statisticians will issue a caveat when they do so. So long as you don't have extreme biasing examples such as the one you gave, this practical concession is usually acceptable -- think of how physics/engineering calculations generally remain accurate even after assuming away friction and air resistance.)

(cont.)

(2/2)

And your specific scenario with 2000 kinds of preferences can be answered, but it'll require you to be more precise with the fiddly empirical details. If there are 2000 kinds of preferences, does the survey present you with all of them and ask you to pick out yours? Or is apple-consumption measured as a combination of how often you eat apples (a continuous variable) + how tasty you find them (scale of 1-10, so an ordinal variable), whether you like them more than oranges (binary variable), ... and you're trying to predict the population means with a +/- 3% error? Or something else entirely?
And that's not even getting into the various kinds of sampling methods you can use (stratified, randomized block design, cluster sampling, single vs multistage etc.)

Long story short: you'll get a simple answer if the scenario is simple, but not if you're applying statistics to a complex situation.
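To illustrate one of those designs, here is a toy proportional stratified sample: split the population into groups with different preference rates, sample each group in proportion to its size, and recombine the per-group estimates. Every number here is invented for the sketch:

```python
import random

random.seed(7)

# Invented strata: (share of population, true apple-preference rate).
strata = [(0.5, 0.7), (0.3, 0.5), (0.2, 0.3)]
true_overall = sum(w * p for w, p in strata)  # weighted truth: 0.56

TOTAL_N = 1000
estimate = 0.0
for weight, true_p in strata:
    n_h = round(weight * TOTAL_N)  # stratum sample size proportional to its share
    p_hat_h = sum(random.random() < true_p for _ in range(n_h)) / n_h
    estimate += weight * p_hat_h   # reweight the stratum estimate

print(f"true {true_overall:.3f}, stratified estimate {estimate:.3f}")
```

Because each stratum is guaranteed its fair share of the sample, this design removes the (small) chance that simple random sampling under- or over-draws a group, which is one practical answer to the "2000 types of people" worry.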

>To make an extreme example, if there was a city full of people allergic to apples, and only they were used in the sample, you would not then say that 0% of Americans eat apples.
the same thing is valid for polls. they only ask people who are willing to answer polls. they never get answers from all those people who hate polls and who would never participate in one. therefore, every poll is, and always will be, wrong.

>maximum likelihood estimate
Fancy way of saying the sample average

>binomial success parameter
Dumber way of saying the true population proportion

We get it, you took Intro to Probability Theory.

>ctrl+F "random"
>11 results

randomness doesn't exist you brainlets

>Fancy way of saying sample average
Not really. The sample average happens to be MLE in this case, but it's not the same thing.

This topic can indeed be confusing. One way to look at it is that there is a probability distribution, which may be unknown, that all Americans are sampled from (pertaining to their fruit preference at least).

The model for this distribution is essentially:
P(American prefers apples to oranges) = p
P(otherwise) = 1 - p

The goal of the poll is to estimate the value of p.
We can usually never find the exact value of p, but we can guarantee that with enough samples, the probability that the estimated p deviates from the real p by some epsilon is as small as we want it to be.
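That guarantee is the weak law of large numbers, and you can watch it kick in by estimating P(|p_hat - p| > epsilon) at growing sample sizes. A sketch (p = 0.6 and epsilon = 0.03 are arbitrary choices for the demo):

```python
import random

random.seed(3)
P, EPS, TRIALS = 0.6, 0.03, 400  # invented true p, tolerance, simulation runs

freqs = []
for n in (50, 500, 5000):
    # Estimate P(|p_hat - p| > EPS) at this sample size by simulation.
    bad = sum(
        abs(sum(random.random() < P for _ in range(n)) / n - P) > EPS
        for _ in range(TRIALS)
    )
    freqs.append(bad / TRIALS)
    print(f"n={n:>5}: P(|p_hat - p| > {EPS}) ~ {bad / TRIALS:.3f}")
```

The deviation probability drops from "more likely than not" at n = 50 to essentially zero at n = 5000, which is exactly the "as small as we want" claim above.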

Can anyone here explain the difference between these formulae (the standard errors)

i.imgur.com/CJNljub.png