(Ch 19, stat 1040)
| Term | Definition |
| ---- | ---- |
| Qualitative | A descriptive value (red, blue, high, low) |
| Quantitative | A numerical value (7, 8, 9) |
| Population | The entire set of existing units that investigators wish to study |
| Sample | A portion or subset of the population |
| Parameter | A number that describes a characteristic of an entire *population* (*10%* of US senators voted for something) |
| Statistic | A number that describes a *sample* characteristic (*71%* of Americans feel that ...) |

> A global consumer survey reported that 6% of US taxpayers used or owned cryptocurrency in 2020. The US government is interested in knowing if this percentage has increased. The University of Chicago surveys 1,004 taxpayers and finds that 13% have used or owned crypto in the past year (2021).

In the above example:
- The *population* was *US taxpayers*
- The *parameter* was *6%*
- The *sample* was *1,004 taxpayers*
- The *statistic* was *13%*

An ideal sample will represent the whole population.
## Sampling
| Sample Type | Description |
| ---- | ---- |
| Simple random | Every unit in the population has an equal chance of being selected.<br>Advantages:<br>- Procedure is impartial<br>- Law of averages applies<br>Disadvantages:<br>- Not always possible<br>- Can be very expensive |
| Quota Sampling | Attempts to get certain proportions based on key characteristics. Quota sampling doesn't guarantee that the selection is an accurate representation. |
| Cluster Sampling | Divide population into subgroups, randomly select a subgroup, and sample all of the subjects in that group |
| Convenience | Sampling whoever is closest or easiest for the researcher to reach. |
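
As a rough illustration of two rows in the table above, the sketch below draws a simple random sample and a cluster sample from a made-up population. The population of 100 numbered units, the 10 clusters, and the function names are all assumptions for the example, not part of the course material.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k units without replacement; every unit is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def cluster_sample(clusters, seed=None):
    """Randomly select one subgroup and keep every unit in it."""
    rng = random.Random(seed)
    return list(rng.choice(clusters))


# Hypothetical population: 100 numbered units split into 10 clusters of 10.
population = list(range(100))
clusters = [population[i:i + 10] for i in range(0, 100, 10)]

srs = simple_random_sample(population, 10, seed=1)
cluster = cluster_sample(clusters, seed=1)
print("simple random sample:", sorted(srs))
print("cluster sample:", cluster)
```

The simple random sample is spread across the whole population, while the cluster sample contains only neighbouring units from one subgroup.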
## Simple Random Samples
## Bias
| Bias Type | Description |
| ---- | ---- |
| Selection | The procedure that selects the sample systematically favors or excludes part of the population |
| Non-Response | Those that don't respond to a survey may have different characteristics than those that do respond |
| Response | When the question is worded in a leading way to elicit a certain response. |
| Volunteer response | The sample is self-selecting: individuals choose whether to answer |
| Measurement | The interviewing method influences the response, for example through loaded words or ambiguous questions. |
## Percentages
(Ch 20, stat 1040)

Throughout this chapter, percentages are represented with a box model of 1s and 0s, where 1s are data points that *are* counted and 0s are data points that are not counted.

The expected value for a sample percentage equals the population percentage. The standard error for that percentage is `(SE_sum / sample_size) * 100%`.

To determine how much the standard error is affected by a change in sample size: if the sample size is multiplied by a factor of $n$, the standard error is multiplied by $\frac{1}{\sqrt{n}}$. For example, if the sample size changed from 20 to 40, it was multiplied by 2, so the standard error is multiplied by $\frac{1}{\sqrt{2}}$ (about 0.71).
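
A small sketch of that scaling rule, reading the change as a change in sample size (number of draws). The function name `se_scale_factor` is made up for this example.

```python
import math


def se_scale_factor(old_size, new_size):
    """Factor the standard error is multiplied by when the sample size changes."""
    n = new_size / old_size  # how many times larger the new sample is
    return 1 / math.sqrt(n)


factor = se_scale_factor(20, 40)
print(f"20 -> 40 draws multiplies the SE by {factor:.3f}")
print(f"100 -> 400 draws multiplies the SE by {se_scale_factor(100, 400):.3f}")
```

Doubling the sample size does not halve the standard error; the sample has to be four times larger for that.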

Accuracy in statistics refers to how small the standard error is. A smaller standard error means the estimate is more accurate. As the sample size increases, the percentage standard error decreases.

You can use the equation below to find the percentage standard error of a box model that contains ones and zeros. The % of ones and the % of zeros should be written as proportions (e.g. `60% = 0.6`).

$$ SE_\% = \sqrt{\frac{(\text{fraction of 1s})(\text{fraction of 0s})}{num_{draws}}} \times 100\% $$
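
The sketch below computes the percentage standard error two ways: with the shortcut equation, and through `SE_sum`. It assumes draws are made with replacement, that the SD of a 0/1 box is `sqrt(fraction_ones * fraction_zeros)`, and that `SE_sum = sqrt(num_draws) * SD_box`; those two relations come from the standard box model rather than from these notes. The function names are made up for the example.

```python
import math


def se_percent(fraction_ones, num_draws):
    """Shortcut: SE of a sample percentage, returned in percentage points."""
    fraction_zeros = 1 - fraction_ones
    return math.sqrt(fraction_ones * fraction_zeros / num_draws) * 100


def se_percent_via_sum(fraction_ones, num_draws):
    """Same quantity computed as (SE_sum / sample_size) * 100%."""
    sd_box = math.sqrt(fraction_ones * (1 - fraction_ones))  # assumed SD of a 0/1 box
    se_sum = math.sqrt(num_draws) * sd_box                   # assumed SE of the sum
    return se_sum / num_draws * 100


print(se_percent(0.5, 100))
print(se_percent_via_sum(0.5, 100))
```

Both routes give the same number, which is why the shortcut can be used directly for 0/1 boxes.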
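
If asked whether an observed % is reasonable, calculate the z score, $z = \frac{observed_\% - expected_\%}{SE_\%}$. If the observed % is more than 2-3 standard errors away from the expected % (that is, $|z|$ is greater than 2-3), it is not reasonable to attribute the difference to chance.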
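
As a worked example, the check can be applied to the cryptocurrency survey quoted earlier in these notes: an expected 6%, an observed 13%, and 1,004 draws. This is a sketch that assumes the box is built from the expected percentage; the function name is hypothetical.

```python
import math


def z_score_percent(observed_pct, expected_pct, num_draws):
    """How many standard errors the observed % is from the expected %."""
    p = expected_pct / 100  # fraction of 1s in the assumed box
    se_pct = math.sqrt(p * (1 - p) / num_draws) * 100
    return (observed_pct - expected_pct) / se_pct


z = z_score_percent(observed_pct=13, expected_pct=6, num_draws=1004)
print(f"z = {z:.2f}")
```

The z score here is far beyond 2-3, so an observed 13% would not be a reasonable chance result if the true percentage were still 6%.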
## Sampling Distributions
(Ch 23, stat 1040)

Take a sample, find the average, plot it, and repeat. After many samples, the *observed* histogram of sample averages looks like the *predicted* probability histogram.

As with $SE_\%$, as the sample size increases, the standard error decreases.
The central limit theorem still applies here, so the probability histogram for the average of the draws *follows the normal curve* with a large number of draws, even if the contents of the box do not.
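
The repeat-and-plot procedure can be simulated. The sketch below uses a made-up, deliberately skewed box and draws with replacement (an assumption of the box model); it then checks that the sample averages center on the box average and that roughly 95% of them fall within 2 SDs of their mean, as the normal curve predicts.

```python
import random
import statistics


def sample_averages(box, num_draws, num_samples, seed=None):
    """Repeatedly draw num_draws tickets with replacement and record each average."""
    rng = random.Random(seed)
    return [sum(rng.choices(box, k=num_draws)) / num_draws
            for _ in range(num_samples)]


# Hypothetical skewed box: mostly 0s and 1s with one large ticket.
box = [0, 0, 0, 0, 1, 1, 1, 1, 1, 9]
averages = sample_averages(box, num_draws=100, num_samples=2000, seed=0)

center = statistics.mean(averages)
spread = statistics.pstdev(averages)
within_2_sd = sum(abs(a - center) <= 2 * spread for a in averages) / len(averages)

print("box average:", statistics.mean(box))
print("center of sample averages:", round(center, 3))
print("fraction within 2 SDs:", round(within_2_sd, 3))
```

Plotting `averages` as a histogram (not done here) would show the bell shape even though the box itself is far from normal.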

To calculate $SE_{ave}$, use the equation below:

$$ SE_{ave} = \frac{SD_{box}}{\sqrt{num_{draws}}} $$
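
A minimal sketch of that calculation, with made-up numbers (a box SD of 2) chosen only to show how the number of draws affects the result:

```python
import math


def se_ave(sd_box, num_draws):
    """Standard error for the average of the draws."""
    return sd_box / math.sqrt(num_draws)


print(se_ave(sd_box=2, num_draws=100))
print(se_ave(sd_box=2, num_draws=400))
```

Quadrupling the number of draws halves $SE_{ave}$.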

| Term | Definition |
| ---- | ---- |
| $EV_{ave}$ | The expected value for the average of the draws, which equals the average of the box (the population average) |
| $SE_{ave}$ | The standard error for the average of the draws |
## Confidence Interval
Remember that the *parameter* is the *number* that actually describes the population.

95% confidence means that 95% of the time the interval constructed will capture the parameter, and 5% of the time it will not.

For any unknown average, the probability histogram of the sample averages will be shaped like the normal curve (given a large enough number of draws) and centered at the true average, with a standard deviation equal to $SE_{ave}$.

$$ sample_{ave} \pm 2 \times SE_{ave} $$

The above equation gives an interval that you can be 95% confident contains the true average.

95% does *not* mean that 95% of the data is in the interval; it means we are 95% confident that the parameter lies within the interval.
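
A short sketch of the interval calculation. The numbers (a sample average of 70, a box SD of 10, and 100 draws) are made up, and in practice the box SD is usually unknown and estimated from the sample, which is an assumption here.

```python
import math


def confidence_interval_95(sample_ave, sd_box, num_draws):
    """Approximate 95% interval: sample average plus or minus 2 SEs."""
    se_ave = sd_box / math.sqrt(num_draws)
    return sample_ave - 2 * se_ave, sample_ave + 2 * se_ave


low, high = confidence_interval_95(sample_ave=70, sd_box=10, num_draws=100)
print(f"95% confidence interval: {low} to {high}")
```

Here $SE_{ave}$ is 1, so the interval runs from 68 to 72.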