Examples: one variable might be the number of hunters in a region and the other variable could be the deer population. Perhaps as the number of hunters increases, the deer population decreases. This is an example of a negative correlation: as one variable increases, the other decreases. A positive correlation is where the two variables react in the same way, increasing or decreasing together. Temperature in Celsius and Fahrenheit have a positive correlation.
How can you tell if there is a correlation?
By observing the graphs, a person can tell if there is a correlation by how
closely the data resemble a line. If the points are scattered about then
there may be no correlation. If the points would closely fit a
quadratic or exponential equation, etc.,
then they have a nonlinear correlation.
In this lesson we will restrict ourselves to linear correlation.
How can you tell by inspection the type of correlation?
If the graph of the variables resembles a line with positive slope, then
there is a positive correlation (as x increases, y increases).
If the slope of the line is negative, then there is a negative correlation
(as x increases, y decreases).
An important aspect of correlation is how strong it is. The strength of a correlation is measured by the correlation coefficient r. Another name for r is the Pearson product-moment correlation coefficient, in honor of Karl Pearson, who developed it about 1900.
$$
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\;\cdot\;\sqrt{n\sum y^2 - \left(\sum y\right)^2}}
$$
For samples, the correlation coefficient is represented by r while the correlation coefficient for populations is denoted by the Greek letter rho (which can look like a p).
The closer r is to +1, the stronger the positive correlation is. The closer r is to -1, the stronger the negative correlation is. If |r| = 1 exactly, the two variables are perfectly correlated! Temperature in Celsius and Fahrenheit are perfectly correlated.
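As a rough illustration of the formula for r, the sketch below computes it from the raw sums for a handful of Celsius readings and their Fahrenheit equivalents. The function name `pearson_r` and the sample temperatures are made up for this illustration; the result should come out as 1 up to floating-point rounding.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r, computed from the raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    return numerator / denominator

# Illustrative temperatures (not from the lesson): F = 9C/5 + 32
celsius = [-40, 0, 20, 37, 100]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

r = pearson_r(celsius, fahrenheit)
print(round(r, 6))
```

Because Fahrenheit is an exact linear function of Celsius with positive slope, every point lies on one line, which is what |r| = 1 expresses.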
Formal hypothesis testing can be applied to r to determine how significant a result is. The test statistic t = (r - 0)/s_r follows the Student t distribution with n - 2 degrees of freedom, where 0 represents the expected correlation (rho) and s_r² = (1 - r²)/(n - 2). For n = 8, alpha = 0.05, and a two-tailed test, critical values of r = ±0.707 are obtained.
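The critical value quoted for n = 8 can be reproduced with a short sketch. It assumes the two-tailed critical t value for 6 degrees of freedom at alpha = 0.05 is 2.447, read from a standard t table rather than computed here; the function names are hypothetical. Solving t = r/s_r for r gives r = t / sqrt(t² + n - 2).

```python
from math import sqrt

def t_statistic(r, n):
    """t = (r - 0) / s_r, with s_r^2 = (1 - r^2) / (n - 2)."""
    s_r = sqrt((1 - r ** 2) / (n - 2))
    return (r - 0) / s_r

def critical_r(t_crit, n):
    """Invert the t statistic: the r that corresponds to a given critical t."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

# Assumed table value: two-tailed, alpha = 0.05, 6 degrees of freedom.
T_CRIT_DF6 = 2.447

r_crit = critical_r(T_CRIT_DF6, 8)
print(round(r_crit, 3))  # about 0.707
```

A sample r with |r| larger than this value would be judged significantly different from zero at the 5% level for n = 8.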
Remember, correlation does not imply causation.
A value of zero for r does not mean that there is no correlation; there could be a nonlinear correlation. Confounding variables might also be involved. Suppose you discover that miners have a higher than average rate of lung cancer. You might be tempted to immediately conclude that their occupation is the cause, whereas perhaps the region has an abundance of radioactive radon gas leaking from the subterranean regions and all people in that area are affected. Or, perhaps, they are heavy smokers....
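To see how r can be zero even when y depends entirely on x, consider a small made-up data set lying on the parabola y = x², with x values symmetric about zero. The helper `pearson_r` is a hypothetical name; it uses the raw-sum formula for r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r, computed from the raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

# Illustrative data (not from the lesson): y is completely determined by x.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r_parabola = pearson_r(xs, ys)
print(r_parabola)  # the numerator n*Sxy - Sx*Sy is exactly 0 here
```

The linear measure r misses the relationship because the rising right half of the parabola cancels the falling left half.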
r² is frequently used and is called the coefficient of determination. It is the fraction of the variation in the values of y that is explained by least-squares regression of y on x. This will be discussed further below after least squares is introduced.
If you go outside the original domain (x values) you are extrapolating. An equation of a line is expressed as y = mx + b or y = ax + b or even y = a + bx. As we see, the regression line has a similar equation.
y = β₀ + β₁x, where y, β₀, and β₁ represent population parameters. If a cap (hat) appears above a variable, it represents a sample statistic instead. Remember x is our independent variable for both the line and the data.
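The y-intercept of the regression line is β₀ and the slope is β₁. The following formulas give the y-intercept and the slope of the equation.

$$
\beta_0 = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}
$$

$$
\beta_1 = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}
$$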
Notice that the denominators are the same, so that saves calculations. Also, the calculator will have values for certain portions. Another way to write the equation is in point-slope form where the centroid is the point that is always on the line. The centroid is the ordered pair: (mean of x, mean of y).
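Written out, the point-slope form through the centroid might look like the following, where the bars denote the means of x and y (notation assumed here for illustration):

$$
y - \bar{y} = \beta_1\,(x - \bar{x}), \qquad\text{so that}\qquad \beta_0 = \bar{y} - \beta_1\,\bar{x}.
$$

This gives a convenient check on hand calculations: once the slope is known, the intercept follows from the two means.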
To keep the y-intercept and slope accurate,
all intermediate steps should be kept to twice as many significant digits (six to ten?) as you want in your final answer (three to five?)!
There are certain guidelines for regression lines:
Example: Write the regression line for the following points:
| x | y |
|---|---|
| 1 | 4 |
| 3 | 2 |
| 4 | 1 |
| 5 | 0 |
| 8 | 0 |
Solution 1:
Σx = 21;
Σy = 7;
Σx² = 115;
Σy² = 21;
Σxy = 14
Thus β₀ = [7·115 - 21·14] ÷ [5·115 - 21²] = 511 ÷ 134 = 3.81
and β₁ = [5·14 - 21·7] ÷ [5·115 - 21²] = -77 ÷ 134 = -0.575.
Thus the regression line for this example is y = -0.575x + 3.81.
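The hand calculation in Solution 1 can be checked with a short sketch that applies the same sum formulas. The function name `regression_line` is hypothetical, and exact fractions are used so that the intermediate values 511/134 and -77/134 show up without rounding.

```python
from fractions import Fraction

def regression_line(xs, ys):
    """Return (intercept, slope) of the least-squares line from the raw sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx ** 2            # shared denominator
    intercept = Fraction(sy * sxx - sx * sxy, denom)
    slope = Fraction(n * sxy - sx * sy, denom)
    return intercept, slope

# Data from the example table
xs = [1, 3, 4, 5, 8]
ys = [4, 2, 1, 0, 0]

b0, b1 = regression_line(xs, ys)
print(b0, round(float(b0), 2))   # 511/134, about 3.81
print(b1, round(float(b1), 3))   # -77/134, about -0.575
```

Keeping the values as fractions until the last step is one way to follow the advice about carrying extra digits through intermediate calculations.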
Solution 2:
On your TI-83+ graphing calculator, enter the data into L1 and
L2 and do a LinReg(ax+b) L1, L2 (STAT, CALC, 4)
or LinReg(a+bx) L1, L2 (STAT, CALC, 8).
You should get a screen with
y=ax+b
a=-.5746...
b=3.8134...
r²=.790...
r=-.88888...
If the r information is absent, do CATALOG (2nd 0)
DiagnosticOn. ENTER will bring the command back to the home screen
where another ENTER will execute it.
We thus see that about 79% of the variation in y
is explained by least-squares regression of y on x.
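One way to see where the 79% comes from is to compare the variation left over after fitting the line with the total variation of y about its mean. The sketch below assumes the usual identity r² = 1 - SSres/SStot for a least-squares line and uses the slope and intercept found in Solution 1; the variable names are illustrative.

```python
from fractions import Fraction

# Data from the example table
xs = [1, 3, 4, 5, 8]
ys = [4, 2, 1, 0, 0]
n = len(xs)

# Exact slope and intercept from Solution 1
slope = Fraction(-77, 134)
intercept = Fraction(511, 134)

mean_y = Fraction(sum(ys), n)
ss_tot = sum((y - mean_y) ** 2 for y in ys)                      # total variation in y
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))  # unexplained part

r_squared = 1 - ss_res / ss_tot
print(round(float(r_squared), 3))   # about 0.790
```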
There is no mathematical difference between the two linear regression forms LinReg(ax+b) and LinReg(a+bx); different professional groups simply prefer different notations.
Note the presence on your TI-83+ graphing calculator of several other regression functions as well: quadratic (y = ax² + bx + c), cubic (y = ax³ + bx² + cx + d), quartic (y = ax⁴ + bx³ + cx² + dx + e), exponential (y = a·b^x), and power or variation (y = a·x^b). Thus an easy way to find a quadratic through three points would be to enter the data in a pair of lists and then do a quadratic regression on the lists.
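Away from the calculator, the quadratic through three points can be sketched directly. With exactly three points (distinct x values) the quadratic regression passes through all of them, so the sketch below swaps in Lagrange interpolation rather than a regression routine. The function name `quadratic_through` and the sample points are hypothetical.

```python
from fractions import Fraction

def quadratic_through(p0, p1, p2):
    """Return (a, b, c) with y = a*x^2 + b*x + c passing through three points.

    Uses Lagrange interpolation; the three x values must be distinct.
    """
    pts = [p0, p1, p2]
    a = b = c = Fraction(0)
    for i, (xi, yi) in enumerate(pts):
        # x-coordinates of the other two points
        xj, xk = (pts[j][0] for j in range(3) if j != i)
        w = Fraction(yi) / ((xi - xj) * (xi - xk))
        a += w
        b -= w * (xj + xk)
        c += w * xj * xk
    return a, b, c

# Illustrative points lying on y = x^2 + x + 1
a, b, c = quadratic_through((0, 1), (1, 3), (2, 7))
print(a, b, c)
```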
What is the Least Squares Property?
Form the distance y - y' between each data point (x, y)
and a potential regression line y' = mx + b.
Each of these differences is known as a residual.
Square these residuals and sum them.
The resulting sum is called the residual sum of squares or SSres.
The line that best fits the data has the least possible value of SSres.
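A small sketch can make the property concrete. It computes SSres for the five-point example earlier in the lesson, once for the least-squares line and once for a hand-picked alternative (y = -0.5x + 3.5, chosen arbitrarily for comparison). The function name `ss_res` is hypothetical.

```python
def ss_res(points, m, b):
    """Residual sum of squares for the candidate line y' = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

# Data from the earlier five-point example
points = [(1, 4), (3, 2), (4, 1), (5, 0), (8, 0)]

fitted = ss_res(points, -77 / 134, 511 / 134)   # least-squares line
guess = ss_res(points, -0.5, 3.5)               # arbitrary nearby line

print(round(fitted, 4), round(guess, 4))
```

Any other choice of m and b should likewise give a sum at least as large as the one for the least-squares line.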
This link has a nice colorful example of these residuals, residual squares, and residual sum of squares.
Example:
Find the Linear Regression line through (3,1), (5,6), (7,8) by brute force.
Solution:
| x | y | y' | y - y' |
|---|---|---|---|
| 3 | 1 | 3m + b | 1 - 3m - b |
| 5 | 6 | 5m + b | 6 - 5m - b |
| 7 | 8 | 7m + b | 8 - 7m - b |
Using the fact that (A + B + C)² = A² + B² + C² + 2AB + 2AC + 2BC, we can quickly find SSres = 101 + 83m² + 3b² - 178m - 30b + 30mb. This expression is quadratic in both m and b. We can rewrite it both ways and then find the vertex for each (which is the minimum since we are summing squares). Remember the vertex of y = ax² + bx + c is at x = -b/(2a).
SSres = 3b² + (30m - 30)b + (101 + 83m² - 178m).
SSres = 83m² + (30b - 178)m + (101 + 3b² - 30b).
From the first expression we find b = (-30m + 30)/6.
From the second expression we find m = (-30b + 178)/166.
These expressions give us two equations in two unknowns:
5m + b = 5 and
83m + 15b = 89.
These can be solved to obtain m = 7/4 = 1.75 and b = -15/4 = -3.75.
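The brute-force result can be double-checked with a short sketch. It solves the two equations by Cramer's rule (one of several ways to solve a 2×2 system), confirms that the expanded expression for SSres agrees with the direct sum of squared residuals, and checks that nudging m or b makes SSres larger. The function names are hypothetical.

```python
from fractions import Fraction

points = [(3, 1), (5, 6), (7, 8)]

def ss_res(m, b):
    """Direct sum of squared residuals for y' = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)

def ss_res_expanded(m, b):
    """The expanded quadratic form from the text."""
    return 101 + 83 * m**2 + 3 * b**2 - 178 * m - 30 * b + 30 * m * b

# Solve 5m + b = 5 and 83m + 15b = 89 by Cramer's rule.
det = 5 * 15 - 1 * 83                   # determinant of the coefficients
m = Fraction(5 * 15 - 1 * 89, det)      # 7/4
b = Fraction(5 * 89 - 83 * 5, det)      # -15/4

print(m, b, ss_res(m, b))
```

At the solution the residuals are -1/2, 1, and -1/2, so the minimum SSres is 3/2.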
This is how the equations above for β₀
and β₁ were derived, from the general solution
to two general equations for SSres.
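A sketch of that general derivation, assuming the minimum is located by setting the partial derivatives of SSres to zero (equivalent to the vertex conditions used in the example), with m playing the role of the slope and b the y-intercept:

$$
SS_{res} = \sum (y - mx - b)^2
$$

$$
\frac{\partial SS_{res}}{\partial b} = -2\sum (y - mx - b) = 0
\quad\Longrightarrow\quad
m\sum x + nb = \sum y
$$

$$
\frac{\partial SS_{res}}{\partial m} = -2\sum x\,(y - mx - b) = 0
\quad\Longrightarrow\quad
m\sum x^2 + b\sum x = \sum xy
$$

Solving this pair for m and b gives

$$
m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2},
\qquad
b = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\sum x^2 - \left(\sum x\right)^2}.
$$

For the three points in the example, Σx = 15, Σy = 15, Σx² = 83, Σxy = 89, and n = 3, so the two conditions reduce to 5m + b = 5 and 83m + 15b = 89, matching the equations found by brute force.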
This link brings up a Java applet which allows you to add a point to a graph and see what influence it has on a regression line.
This link brings up a Java applet which encourages you to guess the regression line and correlation coefficient for a data set.