(Chapter 8, STAT 1040)
Correlation
Scatter Diagrams
A scatter diagram or scatter plot shows the relationship between two variables. One variable is on the X axis, the other on the Y axis.
If a scatter diagram is football shaped, it can be summarized using five summary statistics:
| Variable | Description |
|---|---|
| \bar{x} | The average of the set graphed along the X axis |
| \sigma_x | The standard deviation of the set graphed along the X axis |
| \bar{y} | The average of the set graphed along the Y axis |
| \sigma_y | The standard deviation of the set graphed along the Y axis |
| r | The correlation coefficient, or how closely the data points cluster around a line |
The point where the averages of x and y intersect, (\bar{x}, \bar{y}), is the center of an oval shaped scatter diagram. Mark off 2\sigma on either side of the center along each axis (this range contains ~95% of the data) to sketch the general shape of the scatter plot.
To approximate the mean from a plot, find the lower and upper bounds that lie about 2\sigma on either side of the mean. The midpoint of those two bounds estimates \bar{x}, and the distance between them divided by 4 estimates \sigma.
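As a rough illustration of the estimate described above, the following Python sketch takes two eyeballed bounds and returns the midpoint and one quarter of the range. The function name `estimate_center_and_spread` and the bounds 50 and 90 are made up for this example.

```python
def estimate_center_and_spread(lower, upper):
    """Estimate the average and SD from bounds that hold about 95% of the data.

    lower and upper are eyeballed values roughly 2 SDs below and above the average.
    """
    mean = (lower + upper) / 2  # midpoint of the two bounds
    sd = (upper - lower) / 4    # the bounds span about 4 SDs in total
    return mean, sd


# Hypothetical bounds read off a plot: most points fall between 50 and 90.
mean, sd = estimate_center_and_spread(50, 90)
print(mean, sd)  # 70.0 10.0
```

This is only as accurate as the bounds you read off the plot, so treat the result as an approximation.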
Association
- Positive association is demonstrated when the dots trend upward as `x` increases (`r` is positive).
- Negative association is demonstrated when the dots trend downward as `x` increases (`r` is negative).
- Strong association is demonstrated when dots are clustered tightly together along a line (`|r|` is closer to 1).
- Weak association is demonstrated when dots are not clustered tightly (`|r|` is closer to 0).
Correlation
The correlation coefficient r is always between -1 and 1. An |r| near 1 means tight clustering around a line, and an r near 0 means loose clustering. r is -1 if all the points lie on a line with negative slope, and r is 1 if all the points lie on a line with positive slope. As |r| gets closer to 1, the points cluster more tightly around a line.
If x is above average, we expect y to be above average as well when there is a strong positive correlation.
Calculating r by hand
Put the `x` values into L1 and the `y` values into L2.
- Convert each `x` value in the list to standard units (`z`). Convert each `y` value to standard units.
z = \frac{x-\bar{x}}{\sigma_x}
- Multiply the standard units for each (`x`, `y`) pair in the sets, giving you a third list, named `p` in this example.
z_x * z_y = p
- Find the average of the values in `p`; this is `r`.
r = \bar{p}
https://www.thoughtco.com/how-to-calculate-the-correlation-coefficient-3126228
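The steps above can be sketched in Python using only the standard library. This is a minimal sketch, assuming the standard deviation divides by n (the population SD, `statistics.pstdev`). The helper names `standard_units` and `correlation` and the five data pairs are made up for illustration.

```python
from statistics import mean, pstdev


def standard_units(values):
    """Convert each value to standard units: (value - average) / SD."""
    avg = mean(values)
    sd = pstdev(values)  # population SD (divides by n); an assumption of this sketch
    return [(v - avg) / sd for v in values]


def correlation(xs, ys):
    """Average of the products of the paired standard units."""
    zx = standard_units(xs)
    zy = standard_units(ys)
    p = [a * b for a, b in zip(zx, zy)]  # the third list, p
    return mean(p)


# Illustrative data: average of x is 4 (SD 2), average of y is 7 (SD 4).
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]
r = correlation(xs, ys)
print(round(r, 2))  # 0.4
```

For these values the products of the standard units are 0.75, -0.25, 0, -0.75, and 2.25, which average to 0.4.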
Terminology
| Term | Definition |
|---|---|
| r | Correlation Coefficient |
| Linear Correlation | Measures the strength of the linear association between two variables |
(Chapter 9, STAT 1040)
Correlation contd.
Notes
- `r` is a pure number, and does not have units.
- `r` does not change if you:
  - switch the `x` and `y` values
  - add the same number to every `x` or `y` value
  - multiply each `x` or `y` value by a positive number
- The correlation coefficient can be misleading in the presence of outliers or nonlinear association.
- `r` only shows the strength of a linear relationship.
- High `r` only indicates high correlation, not causation.
- `r` is not a linear, perfect scale. An `r` of 0.8 does not mean twice as much linearity as an `r` of 0.4.
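The invariance properties in the notes above can be checked numerically. The sketch below is illustrative only: the `correlation` helper and the data are made up for this example, and the population SD is assumed.

```python
from math import isclose
from statistics import mean, pstdev


def correlation(xs, ys):
    """Average product of standard units, using the population SD (an assumption)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])


# Illustrative data, not taken from the course.
xs = [1, 3, 4, 5, 7]
ys = [5, 9, 7, 1, 13]

r = correlation(xs, ys)
swapped = correlation(ys, xs)                     # switch x and y
shifted = correlation([x + 10 for x in xs], ys)   # add the same number to every x
scaled = correlation(xs, [3 * y for y in ys])     # multiply every y by a positive number
flipped = correlation(xs, [-2 * y for y in ys])   # a negative multiplier flips the sign

print(isclose(r, swapped), isclose(r, shifted), isclose(r, scaled))  # True True True
print(isclose(flipped, -r))  # True
```

The last check is why the rule specifies a positive number: multiplying one variable by a negative number keeps `|r|` the same but reverses the sign of `r`.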
Ecological Correlations
- Sometimes each point on the plot represents the average or rate for a whole group of individuals.
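- Ecological correlations tend to overstate the strength of the association among individuals, because averaging removes the spread within each group.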
- The size of each point may indicate the size of the group it represents.
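To see why correlations based on group averages look stronger, here is a small Python sketch. The three groups and their values are invented for illustration, and the `correlation` helper assumes the population SD.

```python
from statistics import mean, pstdev


def correlation(xs, ys):
    """Average product of standard units, using the population SD (an assumption)."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])


# Hypothetical groups of individuals, each a list of (x, y) pairs.
groups = [
    [(0, 2), (2, 0), (1, 1)],
    [(2, 4), (4, 2), (3, 3)],
    [(4, 6), (6, 4), (5, 5)],
]

# Correlation across all nine individuals.
individuals = [pair for group in groups for pair in group]
r_individuals = correlation([x for x, _ in individuals], [y for _, y in individuals])

# Ecological correlation: one point per group, placed at the group averages.
avg_x = [mean([x for x, _ in group]) for group in groups]
avg_y = [mean([y for _, y in group]) for group in groups]
r_groups = correlation(avg_x, avg_y)

print(round(r_individuals, 2), round(r_groups, 2))  # 0.6 1.0
```

The group averages fall exactly on a line, so the ecological correlation is 1, while the individuals scatter around that line and give a correlation of only 0.6.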