(Chapter 8, STAT 1040)

# Correlation

## Scatter Diagrams

A scatter diagram, or scatter plot, shows the relationship between two variables. One variable is on the X axis, the other on the Y axis.

If a scatter diagram is football shaped, it can be summarized using the following five summary statistics:

| Variable | Description |
| -- | -- |
| $\bar{x}$ | The average of the set graphed along the X axis |
| $\sigma_x$ | The standard deviation of the set graphed along the X axis |
| $\bar{y}$ | The average of the set graphed along the Y axis |
| $\sigma_y$ | The standard deviation of the set graphed along the Y axis |
| $r$ | The correlation coefficient, or how closely the datapoints cluster around a line |

The intersection of the averages of $x$ and $y$ is the center of an oval-shaped scatter diagram. Draw lines extending $2\sigma$ from the center in each direction along each axis to generalize the shape of a scatter plot; the range within $2\sigma$ of the average contains about 95% of the data.

You can approximate the mean from a plot by finding the upper and lower bounds of that range (the points $2\sigma$ to either side of the mean), then taking the midpoint of those two bounds as $\bar{x}$. Dividing the distance between the two bounds by 4 gives an estimate of $\sigma$.

### Association

- Positive association is demonstrated when the dots trend upward as $x$ increases ($r$ is positive).
- Negative association is demonstrated when the dots trend downward as $x$ increases ($r$ is negative).
- Strong association is demonstrated when the dots are clustered tightly together along a line ($|r|$ is closer to 1).
- Weak association is demonstrated when the dots are not clustered tightly ($|r|$ is closer to 0).

## Correlation

The correlation coefficient $r$ is always between `-1` and `1`. A correlation near 1 or -1 means tight clustering around a line, and a correlation near 0 means loose clustering. $r$ is -1 if the points all lie on a line with negative slope, and $r$ is 1 if the points all lie on a line with positive slope. As $|r|$ gets closer to 1, the points cluster more tightly around a line.
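To make these properties concrete, here is a minimal Python sketch that computes $r$ as the average of the products of standard units. It assumes the population standard deviation (dividing by $n$), and the data lists `xs`, `up`, `down`, and `noisy` are made-up illustrations rather than values from these notes.

```python
from math import sqrt


def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


def sd(values):
    """Population standard deviation (divides by n). This is an assumption
    of the sketch; a sample SD would divide by n - 1 instead."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """r as the average of the products of the standard units."""
    mx, my = mean(xs), mean(ys)
    sx, sy = sd(xs), sd(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return mean(products)


# Hypothetical data, chosen only to illustrate the behavior of r.
xs = [1, 2, 3, 4, 5]
up = [2 * x + 1 for x in xs]     # points exactly on a line with positive slope
down = [10 - 3 * x for x in xs]  # points exactly on a line with negative slope
noisy = [2, 1, 4, 3, 5]          # upward trend, but not exactly on a line

print(round(correlation(xs, up), 3))     # 1.0
print(round(correlation(xs, down), 3))   # -1.0
print(round(correlation(xs, noisy), 3))  # 0.8
```

The rounding only hides floating-point noise; the first two results are 1 and -1 because every point lies exactly on a line.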
If $x$ is above average and the correlation is strong and positive, we expect $y$ to be above average as well.

## Calculating $r$ by hand

Put the $x$ values into $L1$ and the $y$ values into $L2$.

1. Convert each $x$ value in the list to standard units ($z$). Convert each $y$ value to standard units the same way, using $\bar{y}$ and $\sigma_y$.

   $$ z = \frac{x-\bar{x}}{\sigma_x} $$

2. Multiply the standard units for each ($x$, $y$) pair in the sets, giving you a third list, named $p$ in this example.

   $$ z_x * z_y = p $$

3. Find the average of the values from step 2; this is $r$.

   $$ r = \bar{p} $$

https://www.thoughtco.com/how-to-calculate-the-correlation-coefficient-3126228

# Terminology

| Term | Definition |
| -- | -- |
| $r$ | Correlation Coefficient |
| Linear Correlation | Measures the strength of the linear association between two variables |

(Chapter 9, STAT 1040)

# Correlation contd.

## Notes

- $r$ is a pure number and does not have units.
- $r$ does not change if you:
  - switch the $x$ and $y$ values
  - add the same number to every $x$ or every $y$ value
  - multiply every $x$ or every $y$ value by the same positive number
- The correlation coefficient can be misleading in the presence of outliers or nonlinear association.
- $r$ only shows the strength of a linear relationship.
- A high $r$ only indicates high correlation, not causation.
- $r$ is not on a linear scale: an $r$ of 0.8 does not mean twice as much linearity as an $r$ of 0.4.

### Ecological Correlations

- Sometimes each point on the plot represents the average or rate for a whole group of individuals.
- Ecological correlations are artificially strong: they tend to overstate the strength of the association for individuals.
- The size of a point may indicate the size of the group it represents.

# Regression

(Chapter 10, STAT 1040)

## Notes

### Explanatory and Response Variables

- Regression uses values of one variable to predict values of a related variable.
- The variable you are trying to predict is called the *response variable*. It is graphed along the *y-axis*. This is the thing being predicted/measured.
- The variable you have information about and are using to make the prediction is called the *explanatory variable*. It is graphed along the *x-axis*. This is the treatment.
- Just because a relationship exists between $x$ and $y$ *does not* mean that changes in $x$ *cause* changes in $y$.
- If the graph is given to you already set up, you already know the response and explanatory variables.
- The SD line always has a slope of:

  $$\pm \frac{\sigma_y}{\sigma_x}$$

  The sign matches the sign of $r$.
- The SD line always passes through the point of averages.
- It goes through the middle of the "football"
- $(ave_x, ave_y)$ is on the line
- Visually, it looks like a line of best fit
- The SD line is not used for prediction because its predictions are too extreme (too far from the average of $y$)
- Someone who is *exactly on* the SD line is the same number of SDs above or below the average in the y axis as they are in the x axis.

Given a scatter diagram where the point of averages is $(75, 70)$, with a $\sigma_x$ of 10 and a $\sigma_y$ of 12, you can graph the SD line (for a positive association) by going right $\sigma_x$ and up $\sigma_y$ from the point of averages, then connecting that point (in this example, $(85, 82)$) with the point of averages.

### The Regression Line/Least Squares Regression Line (LSRL)

- This line has a more moderate slope than the SD line. It does not go through the peaks of the "football"
- The regression line is *used to predict* the y variable when the x variable is given
- The regression line also goes through the point of averages

$$ slope = r \left(\frac{\sigma_y}{\sigma_x}\right) $$

- You can draw the regression line by multiplying $\sigma_y$ by $r$ for the rise, then using $\sigma_x$ for the run, starting from the point of averages.

The formula below can be used to predict a $y$ value given the five summary statistics of a data set ($\bar{x}$, $\sigma_x$, $\bar{y}$, $\sigma_y$, and $r$).

$$ \hat{y} = \frac{x-\bar{x}}{\sigma_x} * r * \sigma_y + \bar{y} $$

# Terminology

| Term | Definition |
| -- | -- |
| $\hat{y}$ | The predicted value |
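The prediction formula for $\hat{y}$ can be sketched in Python as follows. The sketch reuses the point of averages $(75, 70)$, $\sigma_x = 10$, and $\sigma_y = 12$ from the SD line example; the value $r = 0.5$ and the function names are hypothetical, chosen only for illustration.

```python
def regression_prediction(x, x_bar, sd_x, y_bar, sd_y, r):
    """Predicted y: convert x to standard units, shrink by r,
    then convert back to the units of y."""
    z_x = (x - x_bar) / sd_x
    return z_x * r * sd_y + y_bar


def sd_line_value(x, x_bar, sd_x, y_bar, sd_y, r):
    """Height of the SD line at x. The slope is +sd_y/sd_x when r is
    positive and -sd_y/sd_x when r is negative."""
    sign = 1 if r >= 0 else -1
    return y_bar + sign * (sd_y / sd_x) * (x - x_bar)


# Summary statistics from the SD line example; r = 0.5 is an assumed value.
x_bar, sd_x = 75, 10
y_bar, sd_y = 70, 12
r = 0.5

x = 85  # one SD above the average of x
print(f"{regression_prediction(x, x_bar, sd_x, y_bar, sd_y, r):.1f}")  # 76.0
print(f"{sd_line_value(x, x_bar, sd_x, y_bar, sd_y, r):.1f}")          # 82.0
```

With these assumed numbers, an $x$ value one SD above average gives a predicted $y$ only half an SD above average (76), while the SD line gives a full SD above average (82). This is why the SD line is not used for prediction.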