The intersection of the averages of $x$ and $y$ will be the center of an oval-shaped scatter diagram. Draw lines extending $2\sigma$ from the center along each axis (which will contain ~95% of all data) to generalize the shape of the scatter plot.
You can approximate $\bar{x}$ from a scatter diagram by finding the upper and lower bounds of the points along an axis (roughly $2\sigma$ to either side of the mean) and taking the midpoint of those two bounds. Dividing the range between the two bounds by 4 gives an approximation of $\sigma$.
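A rough sketch of this eyeballing method, assuming the points along the $x$-axis appear to run from about 40 to 100 (made-up bounds, not from any real data):

```python
# Rough sketch of estimating the mean and SD from the visible spread of a
# scatter diagram along one axis. The bounds are hypothetical eyeballed values.
lower_bound = 40   # eyeballed left edge of the cloud (~2 SDs below the mean)
upper_bound = 100  # eyeballed right edge of the cloud (~2 SDs above the mean)

x_bar_estimate = (lower_bound + upper_bound) / 2   # midpoint -> approximate mean
sigma_estimate = (upper_bound - lower_bound) / 4   # range / 4 -> approximate SD

print(x_bar_estimate, sigma_estimate)  # 70.0 15.0
```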
Correlation is between `-1` and `1`. A correlation with $|r|$ near 1 means tight clustering around a line, and a correlation near 0 means loose clustering. $r$ is $-1$ if the points lie exactly on a line with negative slope, and $r$ is $+1$ if the points lie exactly on a line with positive slope. As $|r|$ gets closer to 1, the points cluster more tightly around a line.
1. Convert each $x$ value in the list to standard units ($z$). Convert each $y$ value to standard units. This creates two new tables containing $z_x$ and $z_y$.
2. Multiply each pair $z_x \cdot z_y$, then take the average of those products; that average is $r$.
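A minimal sketch of this calculation, using made-up lists for $x$ and $y$:

```python
import numpy as np

# Sketch of computing the correlation coefficient r via standard units,
# using small hypothetical example lists for x and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Step 1: convert each list to standard units (z-scores).
z_x = (x - x.mean()) / x.std()
z_y = (y - y.mean()) / y.std()

# Step 2: r is the average of the products of the paired z-scores.
r = np.mean(z_x * z_y)
print(r)  # 0.8; matches np.corrcoef(x, y)[0, 1]
```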
- The variable you are trying to predict is called the *response variable*. It is graphed along the *y-axis*. This is the thing being predicted/measured.
- The variable you have information about that you are using to make the prediction is called the *explanatory variable*. It is graphed along the *x-axis*. This is the treatment.
- Just because a relationship exists between $x$ and $y$ *does not* mean that changes in $x$ *cause* changes in $y$.
- If the graph is given to you already set up, you already know the response and explanatory variables.
- The SD line will always have a slope of $\pm\frac{\sigma_y}{\sigma_x}$ (positive when $r$ is positive, negative when $r$ is negative), and it passes through the point of averages.
Given a scatter diagram where the average of each set lies on the point $(75, 70)$, with a $\sigma_x$ of 10 and a $\sigma_y$ of 12, you can graph the SD line by going up $\sigma_y$ and right $\sigma_x$, then connecting that point (in this example, $(85, 82)$) with the point of averages.
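A small sketch of this example in code, assuming the correlation is positive so the SD line slopes upward:

```python
# Sketch of locating the SD line for the example above: point of averages
# (75, 70), sigma_x = 10, sigma_y = 12, and (assumed) positive correlation.
x_bar, y_bar = 75, 70
sigma_x, sigma_y = 10, 12

slope = sigma_y / sigma_x          # 1.2 here; use -sigma_y/sigma_x if r < 0
intercept = y_bar - slope * x_bar  # the line passes through the point of averages

# Going right one sigma_x and up one sigma_y lands on a second point of the line.
second_point = (x_bar + sigma_x, y_bar + sigma_y)  # (85, 82)
print(slope, intercept, second_point)              # 1.2 -20.0 (85, 82)
```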
### The Regression Line/Least Squares Regression Line (LSRL)
- This line has a more moderate slope than the SD line: its slope is $r \cdot \frac{\sigma_y}{\sigma_x}$, so it is flatter whenever $|r| < 1$, and it does not go through the ends of the "football" (see the sketch at the end of this list).
- In a test-retest situation, people with low scores tend to improve, and people with high scores tend to do worse. This means that individuals score closer to the average as they retest.
- The regression *fallacy* is attributing this improvement or decline to something other than chance error.
- On a least squares regression line, about 68% of the points fall within 1 r.m.s. error of the line and about 95% fall within 2 r.m.s. errors; the residuals should loosely mirror a normal curve.
- To approximate the r.m.s. error from a scatter diagram, take the vertical spread between a high and a low $y$ value at a given $x$ coordinate and divide by 4, because roughly 95% of the points fall within 2 r.m.s. errors on either side of the line.
- The r.m.s. error can help determine which observations are outliers. Typically, if a value is more than *2 r.m.s. errors* away from the prediction estimate, it is considered an outlier.
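A minimal sketch of these r.m.s.-error ideas, using made-up data with one deliberately odd point. It fits the least squares line from $r$, the SDs, and the averages (slope $r \cdot \frac{\sigma_y}{\sigma_x}$, as noted above), takes the r.m.s. error as the square root of the average squared residual, and flags points more than 2 r.m.s. errors from their predicted values:

```python
import numpy as np

# Sketch of the r.m.s. error of the least squares line and a simple
# 2-r.m.s. outlier check, using hypothetical data with one odd point.
x = np.array([50, 55, 60, 65, 70, 75, 80, 85, 90, 95], dtype=float)
y = np.array([45, 50, 55, 60, 65, 70, 75, 80, 115, 90], dtype=float)

# Fit the least squares line from r, the SDs, and the averages.
r = np.corrcoef(x, y)[0, 1]
slope = r * y.std() / x.std()
intercept = y.mean() - slope * x.mean()
residuals = y - (slope * x + intercept)

# r.m.s. error = square root of the average squared residual.
rms_error = np.sqrt(np.mean(residuals ** 2))

# Flag points lying more than 2 r.m.s. errors from their predicted value.
outliers = np.abs(residuals) > 2 * rms_error
print(round(rms_error, 2))                  # ~8.2 for this data
print(list(zip(x[outliers], y[outliers])))  # flags the odd point at x = 90
```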