(Chapter 8, STAT 1040)
# Correlation
## Scatter Diagrams
A scatter diagram or scatter plot shows the relationship between two variables. One variable is on the X axis, the other on the Y axis.
If a scatter diagram is football shaped, it can be summarized using five summary statistics:

| Statistic | Description |
| -- | -- |
| $\bar{x}$ | The average of the values graphed along the X axis |
| $\sigma_x$ | The standard deviation of the values graphed along the X axis |
| $\bar{y}$ | The average of the values graphed along the Y axis |
| $\sigma_y$ | The standard deviation of the values graphed along the Y axis |
| $r$ | The correlation coefficient, or how tightly the data points cluster around a line |

The point $(\bar{x}, \bar{y})$, where the averages of $x$ and $y$ intersect, is the center of an oval-shaped scatter diagram. To sketch the general shape of the plot, mark off $2\sigma_x$ to either side of the center along the X axis and $2\sigma_y$ to either side along the Y axis; each of these ranges contains roughly 95% of the values of that variable.
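
As a rough illustration of the statistics in the table above, the following sketch computes all five for a small paired data set, plus the $2\sigma$ range along each axis. The function names and the data are made up for this example, and it assumes $\sigma$ means the population standard deviation (dividing by $n$), which these notes do not state explicitly.

```python
import math


def summary_statistics(xs, ys):
    """Return (mean_x, sd_x, mean_y, sd_y, r) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Assumption: population SD (divide by n), not the sample SD (n - 1).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # r: average product of the deviations, scaled by both SDs.
    avg_product = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    r = avg_product / (sd_x * sd_y)
    return mean_x, sd_x, mean_y, sd_y, r


def two_sd_ranges(mean_x, sd_x, mean_y, sd_y):
    """Ranges reaching 2 SDs to either side of the center along each axis."""
    x_range = (mean_x - 2 * sd_x, mean_x + 2 * sd_x)
    y_range = (mean_y - 2 * sd_y, mean_y + 2 * sd_y)
    return x_range, y_range


# Hypothetical data, chosen only to keep the arithmetic small.
xs = [2, 4, 6, 8, 10]
ys = [1, 3, 2, 5, 4]

mean_x, sd_x, mean_y, sd_y, r = summary_statistics(xs, ys)
x_range, y_range = two_sd_ranges(mean_x, sd_x, mean_y, sd_y)

print(f"center = ({mean_x}, {mean_y}), r = {r:.2f}")
print(f"x range = {x_range}, y range = {y_range}")
```

For this data the center is $(6, 3)$ and $r$ comes out to about $0.8$.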

You can estimate $\bar{x}$ and $\sigma_x$ from the plot itself. Find the lower and upper bounds that contain roughly 95% of the data along the X axis; these sit about $2\sigma_x$ to either side of the mean. The midpoint of the two bounds approximates $\bar{x}$, and because the bounds are $4\sigma_x$ apart, dividing the distance between them by 4 approximates $\sigma_x$. The same approach works for $\bar{y}$ and $\sigma_y$ along the Y axis.
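
The estimate described above is simple arithmetic, sketched here in Python. The function name and the bounds (50 and 90) are hypothetical, standing in for values read off a plot by eye.

```python
def estimate_mean_and_sd(lower, upper):
    """Estimate (mean, sd) from bounds read off a scatter diagram.

    Assumes `lower` and `upper` sit about 2 SDs below and above the mean,
    so that roughly 95% of the data falls between them.
    """
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    mean = (lower + upper) / 2  # midpoint of the two bounds
    sd = (upper - lower) / 4    # the bounds are 4 SDs apart
    return mean, sd


# Hypothetical reading: most x values fall between 50 and 90.
est_mean, est_sd = estimate_mean_and_sd(50, 90)
print(est_mean, est_sd)  # 70.0 10.0
```

Because the bounds are judged by eye, the result is only as accurate as that reading.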
### Association
- Positive association is demonstrated when the dots trend upward as $x$ increases ($r$ is positive).
- Negative association is demonstrated when the dots trend downward as $x$ increases ($r$ is negative).
- Strong association is demonstrated when the dots are clustered tightly together along a line ($|r|$ is closer to 1).
- Weak association is demonstrated when the dots are not clustered tightly ($|r|$ is closer to 0).
## Correlation
The correlation coefficient $r$ is always between `-1` and `1`. When $|r|$ is near 1, the points cluster tightly around a line; when $r$ is near 0, the clustering is loose. $r$ is exactly -1 if all the points lie on a line with negative slope, and exactly 1 if they all lie on a line with positive slope.
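
To see why points lying exactly on a line give $r = \pm 1$, here is a short worked derivation. It assumes the definition of $r$ from the "Calculating $r$ by hand" section of these notes (the average of the products of the standard units) and a population standard deviation. The slope $a$ and intercept $b$ are symbols introduced only for this sketch.

$$ y_i = a x_i + b \quad (a \neq 0) \implies \bar{y} = a\bar{x} + b, \quad \sigma_y = |a|\,\sigma_x $$

$$ z_{y_i} = \frac{y_i - \bar{y}}{\sigma_y} = \frac{a\,(x_i - \bar{x})}{|a|\,\sigma_x} = \pm z_{x_i} $$

$$ r = \frac{1}{n}\sum_{i=1}^{n} z_{x_i} z_{y_i} = \pm\frac{1}{n}\sum_{i=1}^{n} z_{x_i}^2 = \pm 1 $$

The sign is positive when $a > 0$ and negative when $a < 0$. The last step uses the fact that squared values in standard units always average to 1.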
## Calculating $r$ by hand
Put the $x$ values into $L1$, put the $y$ values into $L2$.
1. Convert each $x$ value in the list to standard units ($z$). Convert each $y$ value to standard units the same way, using $\bar{y}$ and $\sigma_y$.
$$ z = \frac{x-\bar{x}}{\sigma_x} $$
2. Multiply the standard units for each ($x$, $y$) pair in the sets, giving you a third list, named $p$ in this example.
$$ z_x \cdot z_y = p $$
3. Find the average of the values from step 2; this is $r$.
$$ r = \bar{p} $$

https://www.thoughtco.com/how-to-calculate-the-correlation-coefficient-3126228
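
The three steps above can be sketched in Python as follows. The function names and the sample data are made up for illustration; the lists are named `L1` and `L2` to mirror the notes. The sketch assumes $\sigma$ is the population standard deviation (dividing by $n$), which is what makes the plain average in step 3 equal to $r$.

```python
import math


def standard_units(values):
    """Convert each value to standard units: z = (value - mean) / SD."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population SD (divide by n), not the sample SD (n - 1).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def correlation_by_hand(xs, ys):
    """Compute r as the average product of paired standard units."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    z_x = standard_units(xs)                    # step 1: x in standard units
    z_y = standard_units(ys)                    # step 1: y in standard units
    p = [zx * zy for zx, zy in zip(z_x, z_y)]   # step 2: pairwise products
    return sum(p) / len(p)                      # step 3: average of products


# Hypothetical data, chosen so the arithmetic is easy to check by hand.
L1 = [1, 3, 4, 5, 7]    # x values
L2 = [5, 9, 7, 1, 13]   # y values

r = correlation_by_hand(L1, L2)
print(round(r, 2))  # 0.4
```

Here $\bar{x} = 4$, $\sigma_x = 2$, $\bar{y} = 7$, and $\sigma_y = 4$, so the products in step 2 are $0.75, -0.25, 0, -0.75, 2.25$, which average to $0.4$. The sketch does not guard against a list whose values are all equal, where the standard deviation is 0 and the division fails.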
# Terminology
| Term | Definition |
| -- | -- |
| $r$ | Correlation Coefficient |
| Linear Correlation | Measures the strength of the linear association between two variables |