(Chapter 8, STAT 1040)
# Correlation
## Scatter Diagrams
A scatter diagram or scatter plot shows the relationship between two variables. One variable is on the X axis, the other on the Y axis.
If a scatter diagram is football shaped, it can be summarized using these five summary statistics:
| Variable | Description |
| -- | -- |
| $\bar{x}$ | The average of the set graphed along the X axis |
| $\sigma_x$ | The standard deviation of the set graphed along the X axis |
| $\bar{y}$ | The average of the set graphed along the Y axis |
| $\sigma_y$ | The standard deviation of the set graphed along the Y axis |
| $r$ | The correlation coefficient, or how closely the data points cluster around a line |
The intersection of the averages of $x$ and $y$ is the center of an oval shaped scatter diagram. To sketch the general shape of a scatter plot, extend $2\sigma$ from the center in both directions along each axis. That range contains roughly 95% of the data.
You can approximate the mean from a plot by estimating the lower and upper bounds that lie $2\sigma$ to either side of the mean (the range holding roughly 95% of the data). The midpoint of those two bounds approximates $\bar{x}$, and the distance between them divided by 4 approximates $\sigma$.
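
As a rough sketch of that estimate, the helper below takes two bounds read off a plot and returns the approximate average and standard deviation. The function name `estimate_mean_sd` and the bounds 55 and 95 are made up for illustration.

```python
def estimate_mean_sd(lower, upper):
    """Approximate the mean and SD from bounds holding roughly 95% of the data."""
    mean = (lower + upper) / 2  # midpoint of the two bounds
    sd = (upper - lower) / 4    # the bounds span about 4 SDs (2 on each side)
    return mean, sd


# Hypothetical bounds read off the X axis of a scatter diagram.
mean, sd = estimate_mean_sd(55, 95)
print(mean, sd)  # 75.0 10.0
```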
### Association
- Positive association is demonstrated when the dots trend upward as $x$ increases ($r$ is positive).
- Negative association is demonstrated when the dots trend downward as $x$ increases ($r$ is negative).
- Strong association is demonstrated when dots are clustered tightly together along a line ($|r|$ is closer to 1).
- Weak association is demonstrated when dots are not clustered tightly ($|r|$ is closer to 0).
## Correlation
The correlation coefficient $r$ is always between `-1` and `1`. $|r|$ near 1 means tight clustering around a line, and $r$ near 0 means loose clustering. $r$ is exactly -1 if all the points lie on a line with negative slope, and exactly 1 if all the points lie on a line with positive slope.
If there is a strong positive correlation and $x$ is above average, we expect $y$ to be above average as well.
## Calculating $r$ by hand
Put the $x$ values into $L1$, put the $y$ values into $L2$.
1. Convert each $x$ value in the list to standard units ($z$). Convert each $y$ value to standard units.
$$ z = \frac{x-\bar{x}}{\sigma_x} $$
2. Multiply the standard units for each ($x$, $y$) pair in the sets, giving you a third list, named $p$ in this example.
$$ p = z_x \cdot z_y $$
3. Find the average of the values from step 2; this is $r$.
$$ r = \bar{p} $$
https://www.thoughtco.com/how-to-calculate-the-correlation-coefficient-3126228
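
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `correlation` and the sample lists are made up, and it assumes $\sigma$ is the population standard deviation (dividing by $n$), which is what makes the plain average in step 3 come out to $r$.

```python
from statistics import mean, pstdev


def correlation(xs, ys):
    """Correlation coefficient r as the average product of standard units."""
    x_bar, y_bar = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)  # population SD (divides by n)
    # Step 1: convert each value to standard units.
    zx = [(x - x_bar) / sd_x for x in xs]
    zy = [(y - y_bar) / sd_y for y in ys]
    # Step 2: multiply the standard units for each (x, y) pair.
    p = [a * b for a, b in zip(zx, zy)]
    # Step 3: the average of the products is r.
    return mean(p)


# Hypothetical data standing in for L1 and L2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(correlation(xs, ys), 3))  # 0.775
```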
# Terminology
| Term | Definition |
| -- | -- |
| $r$ | Correlation Coefficient |
| Linear Correlation | Measures the strength of the linear relationship between two variables |
(Chapter 9, STAT 1040)
# Correlation contd.
## Notes
- $r$ is a pure number, and does not have units
- $r$ does not change if you:
    - switch the $x$ and $y$ values
    - add the same number to every $x$ value or every $y$ value
    - multiply every $x$ value or every $y$ value by a positive number
- The correlation coefficient can be misleading in the presence of outliers or nonlinear association.
- $r$ only shows the strength of a linear relationship.
- High $r$ only indicates high correlation, not causation.
- $r$ is not on a linear scale. An $r$ of 0.8 does not mean twice as much linearity as an $r$ of 0.4.
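
The invariance properties listed above can be checked numerically. The sketch below uses a hypothetical helper `correlation` (average product of standard units, population SD) and made-up data. The last check, multiplying by a negative number, is not in the notes; it is included to show why the multiplier has to be positive.

```python
from math import isclose
from statistics import mean, pstdev


def correlation(xs, ys):
    """Correlation coefficient r as the average product of standard units."""
    x_bar, y_bar = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    return mean([(x - x_bar) / sd_x * (y - y_bar) / sd_y
                 for x, y in zip(xs, ys)])


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
base = correlation(xs, ys)

swapped = correlation(ys, xs)                     # switch x and y
shifted = correlation([x + 10 for x in xs], ys)   # add 10 to every x
scaled = correlation(xs, [3 * y for y in ys])     # multiply every y by 3
flipped = correlation(xs, [-2 * y for y in ys])   # negative multiplier

print(isclose(base, swapped), isclose(base, shifted), isclose(base, scaled))
print(isclose(flipped, -base))  # a negative multiplier flips the sign of r
```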
### Ecological Correlations
- Sometimes each point on the plot represents the average or rate for a whole group of individuals.
- Ecological correlations are artificially strong: averaging hides the variation among individuals within each group, so they overstate the association for individuals.
- The size of a plotted point may indicate the size of the group it represents.
# Regression
(Chapter 10, STAT 1040)
## Notes
### Explanatory and Response Variables
- Regression uses values of one variable to predict values of a related variable.
- The variable you are trying to predict is called the *response variable*. It is graphed along the *y-axis*. This is the thing being predicted/measured.
- The variable you have information about that you are using to make the prediction is called the *explanatory variable*. It is graphed along the *x-axis*. This is the treatment.
- Just because a relationship exists between $x$ and $y$ *does not* mean that changes in $x$ *cause* changes in $y$.
- If the graph is given to you already set up, you already know the response and explanatory variables.
- The SD line will always have a slope of (positive for a positive association, negative for a negative one):
$$\pm \frac{\sigma_y}{\sigma_x}$$
- The SD line always passes through the averages for each axis.
    - It'll go through the middle of the "football"
    - $(ave_x, ave_y)$ is on the line
    - Visually looks like a line of best fit
- The SD line is not used for prediction because it over-predicts.
- Someone who is *exactly on* the SD line is the same number of SDs above or below the average in the y axis as they are in the x axis.
Given a scatter diagram where the point of averages is $(75, 70)$, with a $\sigma_x$ of 10 and a $\sigma_y$ of 12, you can graph the SD line by going right $\sigma_x$ and up $\sigma_y$ from the point of averages, then connecting the resulting point (in this example, $(85, 82)$) with the point of averages.
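
The same example can be worked in code. This is a small sketch: the function name `sd_line` and its `positive` flag are made up for illustration, and the numbers are the ones from the example above.

```python
def sd_line(x_bar, y_bar, sd_x, sd_y, positive=True):
    """Return (slope, intercept) of the SD line through the point of averages."""
    slope = sd_y / sd_x if positive else -sd_y / sd_x
    intercept = y_bar - slope * x_bar  # forces the line through (x_bar, y_bar)
    return slope, intercept


slope, intercept = sd_line(75, 70, 10, 12)
# One SD to the right of average in x lands one SD above average in y.
print(round(slope * 85 + intercept, 6))  # 82.0
```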
### The Regression Line/Least Squares Regression Line (LSRL)
- This line has a more moderate slope than the SD line. It does not go through the peaks of the "football".
- The regression line is *used to predict* the y variable when the x variable is given
- In regression, the $x$ variable is the known variable, and $y$ is the value being solved for.
- The regression line goes through the point of averages, and can be positive or negative
$$ \text{slope} = r \cdot \frac{\sigma_y}{\sigma_x} $$
- You can graph the regression line from the point of averages by going right $\sigma_x$ (the run) and up $r \cdot \sigma_y$ (the rise).
The formula below predicts a $y$ value for a given $x$ from the five summary statistics of a set.
$$ \hat{y} = \frac{x-\bar{x}}{\sigma_x} \cdot r \cdot \sigma_y + \bar{y} $$
1. Find $z_x$
2. Multiply $z_x$ by $r$
3. Multiply that by $\sigma_y$
4. Add the average of $y$
- For a positive association, for every $\sigma_x$ that $x$ is above average, the regression line predicts $y$ to be $r \cdot \sigma_y$ above the average of $y$.
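
The four prediction steps can be sketched as a short function. The name `predict_y` is made up, and the value $r = 0.5$ is an assumed figure for illustration; the averages and SDs are borrowed from the SD line example in these notes.

```python
def predict_y(x, x_bar, sd_x, y_bar, sd_y, r):
    """Predict y from x using the regression method."""
    z_x = (x - x_bar) / sd_x   # 1. find z_x
    z_y = z_x * r              # 2. multiply z_x by r
    rise = z_y * sd_y          # 3. multiply that by the SD of y
    return rise + y_bar        # 4. add the average of y


# x = 85 is one SD above average, so y is predicted r * sd_y = 6 above average.
print(predict_y(85, x_bar=75, sd_x=10, y_bar=70, sd_y=12, r=0.5))  # 76.0
```

With the same inputs the SD line would give 82, which shows how the regression line is the more moderate of the two.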
### The Regression Effect
In a test-retest situation, people with low scores tend to improve, and people with high scores tend to do worse. This means that individuals score closer to the average as they retest.
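
The regression method shows where this effect comes from: whenever $|r| < 1$, the predicted score is fewer SDs from average than the first score was. The sketch below uses made-up numbers and assumes both tests have the same average (70) and SD (10) with $r = 0.6$; the function name `predict_retest` is hypothetical.

```python
def predict_retest(first_score, avg, sd, r):
    """Predicted retest score, assuming both tests share the same avg and SD."""
    z = (first_score - avg) / sd  # first score in standard units
    return avg + r * z * sd       # prediction is only r * z SDs from average


for first in (50, 70, 90):
    predicted = predict_retest(first, avg=70, sd=10, r=0.6)
    # Low first scores are predicted to rise, high ones to fall.
    print(first, round(predicted, 1))
```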
# Terminology
| Term | Definition |
| -- | -- |
| $\hat{y}$ | The predicted value |