notes/education/statistics/Sampling.md

(Ch 19, stat 1040)

| Term | Definition |
| ---- | ---- |
| Qualitative | A descriptive value (red, blue, high, low) |
| Quantitative | A numerical value (7, 8, 9) |
| Population | The entire set of existing units that investigators wish to study |
| Sample | A portion or subset of the population |
| Parameter | A number that describes a characteristic of an entire *population* (*10%* of US senators voted for something) |
| Statistic | A number that describes a *sample* characteristic (*71%* of Americans feel that ...) |
> A global consumer survey reported that 6% of US taxpayers used or owned cryptocurrency in 2020. The US government is interested in knowing if this percentage has increased. The University of Chicago surveys 1,004 taxpayers and finds that 13% have used or owned crypto in the past year (2021)

In the above example:
- The *population* was *US taxpayers*
- The *parameter* was *6%*
- The *sample* was the *1,004 taxpayers* surveyed
- The *statistic* was *13%*

An ideal sample will represent the whole population.
## Sampling
| Sample Type | Description |
| ---- | ---- |
| Simple random | Every unit in the population has an equal chance of being selected.<br>Advantages:<br>- Procedure is impartial<br>- Law of averages applies<br>Disadvantages:<br>- Not always possible<br>- Can be very expensive |
| Quota Sampling | Attempts to get certain proportions based on key characteristics. Quota sampling doesn't guarantee that the selection is an accurate representation. |
| Cluster Sampling | Divide population into subgroups, randomly select a subgroup, and sample all of the subjects in that group |
| Convenience | Sampling done near to the researcher because it's easier. |
## Simple Random Samples
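
As a minimal sketch of the simple random and cluster rows in the sampling table above: the helper names (`simple_random_sample`, `cluster_sample`) and the classroom data are hypothetical, made up only for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k units without replacement; every unit is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def cluster_sample(clusters, seed=None):
    """Pick one cluster at random and return every unit in it."""
    rng = random.Random(seed)
    name = rng.choice(sorted(clusters))  # sorted so a given seed is repeatable
    return name, list(clusters[name])

# Hypothetical population: 12 students split across three classrooms.
clusters = {
    "room_a": ["a1", "a2", "a3", "a4"],
    "room_b": ["b1", "b2", "b3", "b4"],
    "room_c": ["c1", "c2", "c3", "c4"],
}
population = [student for room in clusters.values() for student in room]

srs = simple_random_sample(population, 4, seed=1)
room, members = cluster_sample(clusters, seed=1)
print("simple random sample:", srs)
print("cluster sample:", room, members)
```

The simple random sample can mix students from any room, while the cluster sample always returns one whole room.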

## Bias
| Bias Type | Description |
| ---- | ---- |
| Selection | When the procedure that selects the sample is biased |
| Non-Response | Those that don't respond to a survey may have different characteristics than those that do respond |
| Response | When the question is worded in a leading way to elicit a certain response. |
| Volunteer response | The sample is self-selecting: individuals volunteer to answer |
| Measurement | Interviewing method influences the response, uses loaded words or ambiguities. |

## Percentages
(Ch 20, stat 1040)

Throughout this chapter, percentages are often represented by referencing a box model of 1s and 0s, where 1s are datapoints that *are* counted, and 0s are datapoints that are not counted.

The expected value for a sample percentage equals the population percentage. The standard error for that percentage = `(SE_sum/sample_size) * 100%`.
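
As a worked example under assumed numbers (a box with 60% 1s and 100 draws made with replacement, chosen here only for illustration): the SD of that 0-1 box is $\sqrt{(0.6)(0.4)} \approx 0.49$, and $SE_{sum} = \sqrt{num_{draws}} \times SD_{box}$, so:

$$ SE_{sum} = \sqrt{100} \times 0.49 \approx 4.9 $$
$$ SE_\% = \frac{4.9}{100} \times 100\% \approx 4.9\% $$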

To see how a change in sample size affects the standard error: if the sample size is multiplied by a factor of $n$, the standard error is multiplied by $\frac{1}{\sqrt{n}}$. For example, if the sample size goes from 20 to 40, it has changed by a factor of 2, so the standard error shrinks to $\frac{1}{\sqrt{2}}$ (about 71%) of its original value.
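
As a worked check, assuming a 0-1 box with a fixed fraction of 1s (written $p$ here purely for illustration), going from 20 draws to 40 draws gives:

$$ \frac{SE_{40}}{SE_{20}} = \frac{\sqrt{p(1-p)/40}}{\sqrt{p(1-p)/20}} = \sqrt{\frac{20}{40}} = \frac{1}{\sqrt{2}} \approx 0.71 $$

So doubling the number of draws cuts the standard error by roughly 29%, not by half.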

Accuracy in statistics refers to how small the standard error is. A smaller standard error means your data is more accurate. As the sample size increases, the percentage standard error decreases.

You can use the equation below to find the percentage standard error of a box model made of ones and zeros. The percentages of ones and zeros should be written as proportions (e.g. `60% = 0.6`).

$$ SE_\% = \sqrt{\frac{(\%\space of\space 1s)(\%\space of\space 0s)}{num_{draws}}} \times 100\% $$
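
If asked whether an observed % is reasonable, calculate its z-score, $z = \frac{observed\space \% - expected\space \%}{SE_\%}$. If the observed % is more than 2 to 3 standard errors away from the expected value, it is unlikely to be explained by chance variation alone.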
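
A minimal sketch of that check, reusing the crypto survey numbers from the example earlier in these notes. The function names are hypothetical, and treating the 2020 figure (6%) as the box percentage is an assumption made for illustration.

```python
import math

def se_percent(p_ones, num_draws):
    """SE of a sample percentage for a 0-1 box, in percentage points."""
    p_zeros = 1 - p_ones
    return math.sqrt(p_ones * p_zeros / num_draws) * 100

def z_score(observed_pct, expected_pct, se_pct):
    """How many SEs the observed percentage sits from the expected one."""
    return (observed_pct - expected_pct) / se_pct

# Assumption: the box holds 6% ones (the 2020 figure), and we ask whether
# seeing 13% in a sample of 1,004 could plausibly be chance variation.
se = se_percent(0.06, 1004)
z = z_score(13, 6, se)
print(f"SE = {se:.2f} percentage points, z = {z:.1f}")
```

Under these assumptions the observed 13% sits more than 9 standard errors above 6%, far beyond the 2 to 3 range, so chance alone is not a reasonable explanation.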

## Sampling Distributions
(Ch 23, stat 1040)

Take a sample, find the average, plot it, and repeat. After many, many samples, the *observed* histogram of sample averages looks like the *predicted* probability histogram.

As with $SE_\%$, as the sample size increases, the standard error decreases.

The central limit theorem still applies here, so the probability histogram for the average of the draws *follows the normal curve* with a large number of draws, even if the contents of the box do not.

To calculate $SE_{ave}$, use the equation below:
$$ SE_{ave} = \frac{SD_{box}}{\sqrt{num_{draws}}} $$
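
The "take a sample, find the average, repeat" procedure can be sketched as a small simulation. The box contents, the number of draws, and the number of repetitions below are all hypothetical choices; the point is only to compare the observed spread of the averages with the predicted $SE_{ave}$.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical, deliberately lopsided box: nine 0s and one 10.
box = [0] * 9 + [10]
sd_box = statistics.pstdev(box)          # SD of the box
num_draws = 100
se_ave = sd_box / math.sqrt(num_draws)   # predicted SE of the average

# Take a sample (with replacement), find the average, repeat.
averages = [
    statistics.mean(rng.choices(box, k=num_draws))
    for _ in range(2_000)
]

print("average of the box:   ", statistics.mean(box))
print("mean of the averages: ", round(statistics.mean(averages), 3))
print("predicted SE_ave:     ", round(se_ave, 3))
print("observed SD of avgs:  ", round(statistics.pstdev(averages), 3))
```

Even though the box itself is far from normal, the averages should cluster around the box average with a spread close to the predicted $SE_{ave}$.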

| Term | Definition |
| ---- | ---- |
| $EV_{ave}$ | The expected value for the average of the draws, which equals the average of the box (the population) |
| $SE_{ave}$ | The standard error for the average of the draws |
## Confidence Interval
Remember that the *parameter* is the *number* that actually describes the population.
95% confidence means that about 95% of the intervals constructed this way will capture the parameter, and about 5% will not.

For any unknown average, the probability histogram of the sample averages will be shaped like the normal curve and centered at the true average with a standard deviation equal to $SE_{ave}$.

$$ sample_{ave} \pm 2 \times SE_{ave} $$
This equation should be a review:
$$ SE_{ave} = \frac{SD}{\sqrt{sample\space size}} $$
Plugging this $SE_{ave}$ into the interval above gives a range that you can be 95% confident contains the true average.

95% does *not* mean that 95% of the data is in the interval; it means we are 95% confident that the true parameter lies within the interval.
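
What "95% confident" means can be checked with a rough simulation. The population below is hypothetical (made-up values with a known average), and using the sample's own SD in place of the unknown SD of the box is an assumption of this sketch.

```python
import math
import random
import statistics

def confidence_interval_95(sample):
    """Return (low, high) = sample average +/- 2 * SE_ave."""
    ave = statistics.mean(sample)
    # Assumption: the sample SD stands in for the unknown SD of the box.
    se_ave = statistics.pstdev(sample) / math.sqrt(len(sample))
    return ave - 2 * se_ave, ave + 2 * se_ave

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population with a known average, so coverage can be checked.
population = [rng.gauss(70, 10) for _ in range(10_000)]
true_ave = statistics.mean(population)

trials = 1_000
hits = 0
for _ in range(trials):
    sample = rng.sample(population, 100)  # simple random sample
    low, high = confidence_interval_95(sample)
    if low <= true_ave <= high:
        hits += 1

coverage = hits / trials
print(f"{coverage:.1%} of intervals captured the true average")
```

Each individual interval either contains the true average or it does not; the 95% describes how often the procedure succeeds across many samples.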

A confidence interval is only valid if the sample is a simple random sample.

If we're using two standard errors, the below statement can be used:
"We can be 95% confident that the interval \[we have constructed] contains the true average \[thing being measured]."