(Chapter 8, STAT 1040)

# Correlation
## Scatter Diagrams

A scatter diagram (or scatter plot) shows the relationship between two variables. One variable is plotted on the x-axis, the other on the y-axis.

If a scatter diagram is football shaped, it can be summarized using five summary statistics:

| Symbol | Description |
| -- | -- |
| $\bar{x}$ | The average of the values graphed along the x-axis |
| $\sigma_x$ | The standard deviation of the values graphed along the x-axis |
| $\bar{y}$ | The average of the values graphed along the y-axis |
| $\sigma_y$ | The standard deviation of the values graphed along the y-axis |
| $r$ | The correlation coefficient, which measures how tightly the data points cluster around a line |

The point of averages $(\bar{x}, \bar{y})$ is the center of an oval-shaped scatter diagram. To sketch the general shape of the plot, mark off $2\sigma$ on either side of the center along each axis; that range contains about 95% of the data.

You can estimate the average and SD from a plot by finding the lower and upper bounds that contain roughly 95% of the points along an axis (about $2\sigma$ on either side of the average). The midpoint of those two bounds approximates $\bar{x}$, and the distance between them divided by 4 approximates $\sigma_x$.

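The eyeball estimate above can be sketched in a few lines of Python. The function name and the bounds 50 and 90 are made up for illustration; they are not from the course notes.

```python
def estimate_center_and_spread(lower, upper):
    """Estimate the average and SD from bounds that hold about 95% of the points.

    Assumes the bounds sit roughly 2 SDs on either side of the average.
    """
    mean = (lower + upper) / 2   # midpoint of the two bounds
    sd = (upper - lower) / 4     # the bounds span about 4 SDs in total
    return mean, sd


# Hypothetical reading from a plot: most x values fall between 50 and 90.
mean_x, sd_x = estimate_center_and_spread(50, 90)
print(mean_x, sd_x)  # 70.0 10.0
```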
### Association

- Positive association: the dots trend upward as $x$ increases ($r$ is positive).
- Negative association: the dots trend downward as $x$ increases ($r$ is negative).
- Strong association: the dots are clustered tightly along a line ($|r|$ is close to 1).
- Weak association: the dots are not clustered tightly ($|r|$ is close to 0).
## Correlation

The correlation coefficient $r$ is always between `-1` and `1`. The closer $|r|$ is to 1, the more tightly the points cluster around a line; the closer it is to 0, the looser the clustering. $r$ is exactly -1 if all the points lie on a line with negative slope, and exactly 1 if they all lie on a line with positive slope.

When the correlation is strong and positive, a point whose $x$ is above average is expected to have a $y$ that is above average too. When it is strong and negative, an above-average $x$ is expected to go with a below-average $y$.
## Calculating $r$ by hand

Put the $x$ values into list $L1$ and the $y$ values into list $L2$.

1. Convert each $x$ value to standard units ($z$), then convert each $y$ value to standard units. This creates two new lists containing $z_x$ and $z_y$.

$$ z_x = \frac{x-\bar{x}}{\sigma_x} \qquad z_y = \frac{y-\bar{y}}{\sigma_y} $$

2. Multiply the standard units for each ($z_x$, $z_y$) pair, giving you a fifth list, named $p$ in this example.

$$ p = z_x \cdot z_y $$

3. Find the average of the values from step 2. This is $r$.

$$ r = \bar{p} = \frac{1}{n} \sum p $$

https://www.thoughtco.com/how-to-calculate-the-correlation-coefficient-3126228

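The three steps above can be sketched in Python using only the standard library. The data values are made up for illustration, and the sketch assumes the SD divides by $n$ (`statistics.pstdev`), which is what makes the average of the products equal $r$.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """Compute r as the average of the products of standard units."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)  # SDs that divide by n (assumption)

    # Step 1: convert both lists to standard units.
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]

    # Step 2: multiply each (z_x, z_y) pair.
    products = [zx * zy for zx, zy in zip(z_x, z_y)]

    # Step 3: the average of the products is r.
    return fmean(products)


# Illustrative data, not taken from the notes.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]
r = correlation(xs, ys)
print(round(r, 2))  # 0.4
```

Here $\bar{x} = 4$, $\sigma_x = 2$, $\bar{y} = 7$, and $\sigma_y = 4$, so the products of standard units are 0.75, -0.25, 0, -0.75, and 2.25, which average to 0.4.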
# Terminology

| Term | Definition |
| -- | -- |
| $r$ | Correlation coefficient |
| Linear correlation | Measures the strength of the linear association between two variables |

(Chapter 9, STAT 1040)
# Correlation contd.

## Notes

- $r$ is a pure number and does not have units.
- $r$ does not change if you:
    - switch the $x$ and $y$ values
    - add the same number to every $x$ value or every $y$ value
    - multiply every $x$ value or every $y$ value by the same positive number
- The correlation coefficient can be misleading in the presence of outliers or nonlinear association.
- $r$ only measures the strength of a linear relationship.
- A high $r$ only indicates strong correlation, not causation.
- $r$ is not a linear scale. An $r$ of 0.8 does not mean twice as much linearity as an $r$ of 0.4.

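The invariance properties listed above can be checked numerically. This is a minimal sketch with made-up data; `correlation` is a hypothetical helper that averages the products of standard units, with SDs that divide by $n$.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """r as the average product of standard units (SDs divide by n)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)
    products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ]
    return fmean(products)


# Illustrative data, not taken from the notes.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

base = correlation(xs, ys)
swapped = correlation(ys, xs)                     # switch x and y
shifted = correlation([x + 100 for x in xs], ys)  # add a constant to every x
scaled = correlation(xs, [2.5 * y for y in ys])   # multiply every y by a positive number
flipped = correlation(xs, [-y for y in ys])       # multiply every y by a negative number

print(round(base, 2), round(swapped, 2), round(shifted, 2), round(scaled, 2))
print(round(flipped, 2))  # the sign reverses, which is why the multiplier must be positive
```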
### Ecological Correlations

- Sometimes each point on the plot represents the average or rate for a whole group of individuals rather than a single individual.
- Ecological correlations tend to overstate the strength of the association for individuals, because averaging hides the spread within each group.
- The size of each point may indicate the size of the group it represents.
# Regression

(Chapter 10, STAT 1040)

## Notes
### Explanatory and Response Variables

- Regression uses values of one variable to predict values of a related variable.
- The variable you are trying to predict is called the *response variable*. It is graphed along the *y-axis*. This is the thing being predicted/measured.
- The variable you have information about and use to make the prediction is called the *explanatory variable*. It is graphed along the *x-axis*. In an experiment, this is the treatment.
- Just because a relationship exists between $x$ and $y$ *does not* mean that changes in $x$ *cause* changes in $y$.
- If the graph is given to you already set up, the axes tell you which variable is explanatory ($x$) and which is the response ($y$).
- The SD line always has the slope below (positive when $r$ is positive, negative when $r$ is negative):

$$\pm \frac{\sigma_y}{\sigma_x}$$

- The SD line always passes through the point of averages.
    - It goes through the middle of the "football".
    - $(\bar{x}, \bar{y})$ is on the line.
    - Visually, it looks like a line of best fit.
- The SD line is not used for prediction because it over-predicts: its predictions are farther from average than the data supports.
- Someone who is *exactly on* the SD line is the same number of SDs above or below the average in $y$ as they are in $x$.

Given a scatter diagram whose point of averages is $(75, 70)$, with $\sigma_x = 10$ and $\sigma_y = 12$, you can graph the SD line for a positive association by starting at the point of averages, going right $\sigma_x$ and up $\sigma_y$, and connecting the resulting point (in this example, $(85, 82)$) with the point of averages.

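The SD line example above can be checked with a short sketch. The function name and the `positive` flag are hypothetical; the numbers are the ones from the example.

```python
def sd_line(mean_x, mean_y, sd_x, sd_y, positive=True):
    """Return (slope, intercept) of the SD line through the point of averages."""
    slope = sd_y / sd_x
    if not positive:
        slope = -slope  # negative association: the line slopes downward
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Point of averages (75, 70), SD of x = 10, SD of y = 12.
slope, intercept = sd_line(75, 70, 10, 12)
y_at_85 = slope * 85 + intercept

print(round(slope, 2))    # 1.2
print(round(y_at_85, 2))  # 82.0, one SD right and one SD up from (75, 70)
```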
### The Regression Line/Least Squares Regression Line (LSRL)

- This line has a more moderate slope than the SD line. It does not go through the tips of the "football".
- Predictions should only be made if the data displays a linear association (is football shaped).
- The regression line is *used to predict* the $y$ variable when the $x$ variable is given. It should only be relied on when the data come from a controlled experiment; observational studies have too many confounding factors.
- In regression, $x$ is the known variable and $y$ is the value being solved for.
- The regression line goes through the point of averages, and its slope can be positive or negative:

$$ \text{slope} = r \left( \frac{\sigma_y}{\sigma_x} \right) $$

- You can draw the regression line from the point of averages by using $r \cdot \sigma_y$ for the rise and $\sigma_x$ for the run.

The formula below predicts a $y$ value from the five summary statistics of a data set.

$$ \hat{y} = \frac{x-\bar{x}}{\sigma_x} \cdot r \cdot \sigma_y + \bar{y} $$

1. Find $z_x$.
2. Multiply $z_x$ by $r$.
3. Multiply that by $\sigma_y$.
4. Add the average of $y$.

- For a positive association, for every $\sigma_x$ that $x$ is above average, the line predicts $y$ to be $r \cdot \sigma_y$ above the average of $y$.
- There are two separate regression lines: one for predicting $y$ from $x$, and one for predicting $x$ from $y$.
- Do not extrapolate beyond the range of the data shown on the graph.

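The four prediction steps above can be sketched as a small function. The function name is hypothetical, and the value $r = 0.5$ is an assumption chosen for illustration; the averages and SDs reuse the numbers from the SD line example.

```python
def predict_y(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predict y from x using the five summary statistics."""
    z_x = (x - mean_x) / sd_x        # 1. find z_x
    predicted_z_y = z_x * r          # 2. multiply z_x by r
    offset = predicted_z_y * sd_y    # 3. multiply that by the SD of y
    return offset + mean_y           # 4. add the average of y


# Averages (75, 70), SDs 10 and 12, and an assumed r of 0.5.
prediction = predict_y(85, 75, 10, 70, 12, 0.5)
print(prediction)  # 76.0
```

An $x$ that is one SD above average gets a predicted $y$ that is only $r = 0.5$ SDs above average (76), which is less extreme than the 82 given by the SD line.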
(Chapter 12, STAT 1040)

A $y$ value can be predicted for a given $x$ value when the regression equation is known.

$$ \hat{y} = mx + b $$

where $\hat{y}$ is the predicted value, $m$ is the slope, $x$ is the given value, and $b$ is the intercept.

$$ \text{slope} = \frac{r \cdot \sigma_y}{\sigma_x} $$

$$ \text{intercept} = \bar{y} - \text{slope} \cdot \bar{x} $$

- Interpreting the slope: the *y context* will *increase/decrease* by approximately *slope(num)* for each one-unit increase in the *x context*.
- Interpreting the intercept: when the *x context* is 0, the *y context* will be approximately *intercept*.

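As a rough sketch, the slope and intercept formulas above can be turned into a regression equation in Python. The function name is hypothetical, and $r = 0.5$ is an assumed value used only for illustration.

```python
def regression_equation(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the regression line for predicting y from x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Averages (75, 70), SDs 10 and 12, and an assumed r of 0.5.
slope, intercept = regression_equation(75, 10, 70, 12, 0.5)
predicted = slope * 85 + intercept

print(round(slope, 2), round(intercept, 2))  # 0.6 25.0
print(round(predicted, 2))                   # 76.0
```

With these numbers, each one-unit increase in $x$ goes with an increase of about 0.6 in the predicted $y$, and the line passes through the point of averages $(75, 70)$.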
#### Residual plots

A residual plot graphs the residuals (the vertical differences between the observed values and the regression line) against $x$. It makes it easier to see whether the data is football shaped.

If a residual plot has a strong pattern, the regression line may not be suitable for making predictions.

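A minimal sketch of computing the values that a residual plot would show, using observed minus predicted for each point. The data, slope, and intercept are illustrative assumptions (the line $\hat{y} = 0.8x + 3.8$ is the regression line for this made-up data set), and the function name is hypothetical.

```python
def residuals(xs, ys, slope, intercept):
    """Observed y minus predicted y for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Illustrative data and its regression line (assumed values).
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]
res = residuals(xs, ys, slope=0.8, intercept=3.8)

for x, e in zip(xs, res):
    print(x, round(e, 2))  # each (x, residual) pair is one dot on the residual plot
```

The residuals here are roughly 0.4, 2.8, 0, -6.8, and 3.6. They sum to zero, as residuals from a least squares regression line do.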
### The Regression Effect

- In a test-retest situation, people with low scores on the first test tend to improve, and people with high scores tend to do worse. This means that individuals tend to score closer to the average when they retest.
- The regression *fallacy* is attributing this to something other than chance error.
### R.M.S. Error for Regression

The error (residual) for a point is its vertical distance from the regression line. The r.m.s. error is the typical size of these errors. It only applies to a football-shaped scatter diagram.

- If a point is below the line, the error is negative.
- If a point is above the line, the error is positive.
- The r.m.s. error is the "give or take" number attached to a regression prediction.
- `residual = observed - predicted` for a given $x$ value
- The r.m.s. error is the r.m.s. size of the errors:

$$ \text{r.m.s. error} = \sqrt{1-r^2} \cdot \sigma_y $$

- For a football-shaped diagram, about 68% of the points lie within 1 r.m.s. error of the regression line and about 95% lie within 2 r.m.s. errors, loosely mirroring a normal curve.
- To approximate the r.m.s. error from a scatter diagram, take a high value and a low value of $y$ at a given $x$ coordinate and divide their difference by 4, because about 95% of the points fall within 2 r.m.s. errors on either side of the line (a band 4 r.m.s. errors tall).
- The r.m.s. error can help determine which observations are outliers. Typically, if a value is more than *2 r.m.s. errors* away from the predicted value, it is considered an outlier.
- The r.m.s. error is only appropriate for homoscedastic scatter diagrams (football shape).
- Heteroscedastic scatter diagrams should not be used to make a prediction, because they do not follow a football shape.
- Homoscedastic scatter diagrams can be used to make predictions, because they follow a football shape.

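The r.m.s. error can be sketched two ways: from the shortcut formula, and directly as the root-mean-square of the residuals. The data set and function names are illustrative assumptions; for this data $r = 0.4$, $\sigma_y = 4$, and the regression line is $\hat{y} = 0.8x + 3.8$. The two results should agree.

```python
from math import sqrt
from statistics import fmean


def rms_error_from_r(r, sd_y):
    """Shortcut formula: sqrt(1 - r^2) times the SD of y."""
    return sqrt(1 - r ** 2) * sd_y


def rms_error_direct(xs, ys, slope, intercept):
    """Root-mean-square size of the residuals (observed - predicted)."""
    errors = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return sqrt(fmean([e ** 2 for e in errors]))


# Illustrative data, not taken from the notes.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

shortcut = rms_error_from_r(r=0.4, sd_y=4)
direct = rms_error_direct(xs, ys, slope=0.8, intercept=3.8)

print(round(shortcut, 3), round(direct, 3))  # both are about 3.666
```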
---
# Terminology

| Term | Definition |
| ---- | ---- |
| $\hat{y}$ | The predicted value |
| Homoscedastic | The scatter diagram has about the same vertical spread around the LSRL all along the line |
| Heteroscedastic | The vertical spread around the regression line changes along the line, with more variability at some $x$ values than at others |