
Statistical Probabilities and Distributions: Lesson 14

Correlation and Regression

Lesson Overview

Correlation

The common usage of the word correlation refers to a relationship between two or more objects (ideas, variables...). In statistics, the word correlation refers to the relationship between two variables.

Examples: one variable might be the number of hunters in a region and the other variable could be the deer population. Perhaps as the number of hunters increases, the deer population decreases. This is an example of a negative correlation: as one variable increases, the other decreases. A positive correlation is where the two variables react in the same way, increasing or decreasing together. Temperature in Celsius and Fahrenheit have a positive correlation.

How can you tell if there is a correlation?
By observing the graphs, a person can tell if there is a correlation by how closely the data resemble a line. If the points are widely scattered, then there may be no correlation. If the points closely fit a quadratic or exponential equation, etc., then they have a nonlinear correlation. In this lesson we will restrict ourselves to linear correlation.

How can you tell by inspection the type of correlation?
If the graph of the variables resembles a line with positive slope, then there is a positive correlation (y increases as x increases). If the slope of the line is negative, then there is a negative correlation (y decreases as x increases).

An important aspect of correlation is how strong it is. The strength of a correlation is measured by the correlation coefficient r. Another name for r is the Pearson product-moment correlation coefficient, in honor of Karl Pearson, who developed it about 1900.

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2} \cdot \sqrt{n(\sum y^2) - (\sum y)^2}}$$

For samples, the correlation coefficient is represented by r while the correlation coefficient for populations is denoted by the Greek letter rho (which can look like a p).

The closer r is to +1, the stronger the positive correlation is. The closer r is to -1, the stronger the negative correlation is. If |r| = 1 exactly, the two variables are perfectly correlated! Temperature in Celsius and Fahrenheit are perfectly correlated.
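
As a rough illustration, the computing formula for r can be written out in a few lines of Python. The function name pearson_r and the sample temperatures are made up for this sketch; they are not part of the lesson.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r from the raw-sum (computing) formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: five Celsius temperatures and their Fahrenheit equivalents.
celsius = [-10, 0, 10, 20, 30]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r = pearson_r(celsius, fahrenheit)
print(round(r, 6))  # essentially 1, up to floating-point rounding
```

Because Fahrenheit is an exact linear function of Celsius with positive slope, r comes out at +1 apart from rounding.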

Formal hypothesis testing can be applied to r to determine how significant a result is. The test statistic t = (r - 0)/s_r follows the Student t distribution with n - 2 degrees of freedom, where 0 is the hypothesized value of rho and s_r² = (1 - r²)/(n - 2). For n = 8, alpha = 0.05, and a two-tailed test, the critical values of r are +/-0.707.
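
The critical value quoted above can be checked with a short sketch. It assumes the two-tailed 5% critical t value for 6 degrees of freedom is 2.447, taken from a standard t table rather than computed here. The function names are illustrative only.

```python
import math

def t_statistic(r, n):
    """t = (r - 0) / s_r, with s_r^2 = (1 - r^2) / (n - 2)."""
    s_r = math.sqrt((1 - r ** 2) / (n - 2))
    return r / s_r

def critical_r(t_crit, n):
    """Invert the t formula: the value of r whose t statistic equals t_crit."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Assumption: 2.447 is the tabled two-tailed critical t for alpha = 0.05, df = 6.
T_CRIT_DF6 = 2.447
r_crit = critical_r(T_CRIT_DF6, n=8)
print(round(r_crit, 3))  # 0.707
```

Any sample of eight pairs with |r| above this value would be judged significant at the 5% level.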

Remember, correlation does not imply causation.

A value of zero for r does not mean that there is no correlation; there could be a nonlinear correlation. Confounding variables might also be involved. Suppose you discover that miners have a higher than average rate of lung cancer. You might be tempted to immediately conclude that their occupation is the cause, whereas perhaps the region has an abundance of radioactive radon gas leaking from subterranean regions and all people in that area are affected. Or, perhaps, they are heavy smokers....
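
To see how r can be zero even when y depends entirely on x, consider this small sketch with invented data lying exactly on the parabola y = x².

```python
import math

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # exact quadratic relationship, hypothetical data

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
)
print(r)  # 0.0
```

The positive and negative x values cancel in the numerator, so the linear correlation vanishes even though the points follow a perfect curve.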

r² is frequently used and is called the coefficient of determination. It is the fraction of the variation in the values of y that is explained by least-squares regression of y on x. This will be discussed further below after least squares is introduced.

Regression

Regression goes one step beyond correlation in identifying the relationship between two variables. It creates an equation so that values can be predicted within the range framed by the data. Since the discussion is on linear correlations and the predicted values need to be as close as possible to the data, the equation is called the best-fitting line or regression line. The name comes from Galton's work on inherited characteristics, which tended to revert (regress) toward a mean value.

If you go outside the original domain (x values) you are extrapolating. An equation of a line is expressed as y = mx + b or y = ax + b or even y = a + bx. As we see, the regression line has a similar equation.

y = β0 + β1x
where y, β0, and β1 represent population parameters. If a cap (hat) appears above a symbol, it represents the corresponding sample statistic. Remember x is our independent variable for both the line and the data.

The y-intercept of the regression line is β0 and the slope is β1. The following formulas give the y-intercept and the slope of the equation.

$$\beta_0 = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

$$\beta_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

Notice that the denominators are the same, so that saves calculations. Also, the calculator will have values for certain portions. Another way to write the equation is in point-slope form where the centroid is the point that is always on the line. The centroid is the ordered pair: (mean of x, mean of y).
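
Written out, the point-slope form mentioned above would look like the following. The bars for the means and hats for the sample estimates are a notation assumed here, in line with the "cap" convention described earlier.

$$\hat{y} - \bar{y} = \hat{\beta}_1 (x - \bar{x}), \qquad \text{so that} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Setting x equal to the mean of x gives a predicted y equal to the mean of y, which is why the centroid always lies on the line.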

To keep the y-intercept and slope accurate, all intermediate steps should be kept to twice
as many significant digits (six to ten?) as you want in your final answer (three to five?)!

There are certain guidelines for regression lines:

  1. Use regression lines when there is a significant correlation to predict values.
  2. Do not use if there is not a significant correlation.
  3. Stay within the range of the data. Do not extrapolate!! For example, if the data is from 10 to 60, do not predict a value for 400.
  4. Do not make predictions for a population based on another population's regression line.
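
One way to respect guideline 3 in practice is to have a prediction routine refuse x values outside the observed range. The sketch below is only an illustration: the function name, the intercept 2.0, and the slope 0.5 are invented, and the range 10 to 60 is borrowed from the guideline's own example.

```python
def predict_within_range(x, intercept, slope, x_min, x_max):
    """Predict y from a regression line, refusing to extrapolate."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"x={x} lies outside the data range [{x_min}, {x_max}]")
    return intercept + slope * x

# Hypothetical line fitted to data whose x values run from 10 to 60.
INTERCEPT, SLOPE = 2.0, 0.5

print(predict_within_range(30, INTERCEPT, SLOPE, 10, 60))  # inside the range

try:
    predict_within_range(400, INTERCEPT, SLOPE, 10, 60)    # far outside the range
except ValueError as err:
    print("refused:", err)
```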

Example: Write the regression line for the following points:

x    y
1    4
3    2
4    1
5    0
8    0

Solution 1: Σx = 21; Σy = 7; Σx² = 115; Σy² = 21; Σxy = 14
Thus β0 = [7·115 - 21·14] ÷ [5·115 - 21²] = 511 ÷ 134 = 3.81 and β1 = [5·14 - 21·7] ÷ [5·115 - 21²] = -77 ÷ 134 = -0.575. Thus the regression line for this example is y = -0.575x + 3.81.
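
The arithmetic in Solution 1 can be double-checked with a short script. This is a sketch, not part of the original lesson; it reads the five points from the table as (1, 4), (3, 2), (4, 1), (5, 0), (8, 0) and uses exact fractions to avoid rounding in the intermediate steps.

```python
from fractions import Fraction

points = [(1, 4), (3, 2), (4, 1), (5, 0), (8, 0)]
n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_x2 = sum(x * x for x, _ in points)
sum_xy = sum(x * y for x, y in points)

# Both formulas share the same denominator.
denominator = n * sum_x2 - sum_x ** 2
intercept = Fraction(sum_y * sum_x2 - sum_x * sum_xy, denominator)
slope = Fraction(n * sum_xy - sum_x * sum_y, denominator)

print(intercept, round(float(intercept), 2))  # 511/134 3.81
print(slope, round(float(slope), 3))          # -77/134 -0.575
```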

Solution 2: On your TI-83+ graphing calculator, enter the data into L1 and L2 and do a LinReg(ax+b) L1, L2 (STAT, CALC, 4) or LinReg(a+bx) L1, L2 (STAT, CALC, 8). You should get a screen with
y=ax+b
a=-.5746...
b=3.8134...
r²=.790...
r=-.88888...
If the r information is absent, do CATALOG (2nd 0) DiagnosticOn. ENTER will bring the command back to the home screen where another ENTER will execute it. We thus see that about 79% of the variation in y is explained by least-squares regression of y on x.
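
The "79% of the variation" statement can be checked directly from its definition, r² = 1 - SSres/SStot, where SStot is the sum of squared deviations of y from its mean. The sketch below assumes the same five data points and the fitted line from the worked example; the variable names are illustrative.

```python
points = [(1, 4), (3, 2), (4, 1), (5, 0), (8, 0)]
intercept, slope = 511 / 134, -77 / 134  # fitted line from the worked example

mean_y = sum(y for _, y in points) / len(points)

# Total variation of y about its mean, and the variation left over after the fit.
ss_total = sum((y - mean_y) ** 2 for _, y in points)
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in points)

r_squared = 1 - ss_residual / ss_total
print(round(r_squared, 3))  # 0.79
```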

There is no mathematical difference between the two linear regression forms LinReg(ax+b) and LinReg(a+bx); different professional groups simply prefer different notations.

Note the presence on your TI-83+ graphing calculator of several other regression functions as well. These include quadratic (y = ax² + bx + c), cubic (y = ax³ + bx² + cx + d), quartic (y = ax⁴ + bx³ + cx² + dx + e), exponential (y = ab^x), and power or variation (y = ax^b). Thus an easy way to find a quadratic through three points would be to enter the data in a pair of lists and then do a quadratic regression on the lists.

Least Squares Procedure

The method of least squares was first published in 1805 by Legendre. However, Gauss "communicated the whole matter to Olbers in 1802."

What is the Least Squares Property?
Form the distance y - y' between each data point (x, y) and a potential regression line y' = mx + b. Each of these differences is known as a residual. Square these residuals and sum them. The resulting sum is called the residual sum of squares or SSres. The line that best fits the data has the least possible value of SSres.
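
The least squares property can be illustrated numerically. In this sketch the function name is made up, the data are the five points from the earlier regression example, and the "guess" line y' = -0.5x + 3.5 is an arbitrary nearby line chosen only for comparison.

```python
def residual_sum_of_squares(points, slope, intercept):
    """Sum of squared residuals y - y' for the line y' = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

points = [(1, 4), (3, 2), (4, 1), (5, 0), (8, 0)]

best = residual_sum_of_squares(points, -77 / 134, 511 / 134)  # least-squares line
guess = residual_sum_of_squares(points, -0.5, 3.5)            # arbitrary nearby line

print(best < guess)  # True
```

Any other choice of slope and intercept would likewise give an SSres at least as large as that of the regression line.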

This link has a nice colorful example of these residuals, residual squares, and residual sum of squares.

Example: Find the Linear Regression line through (3,1), (5,6), (7,8) by brute force.
Solution:
x    y    y'        y - y'
3    1    3m + b    1 - 3m - b
5    6    5m + b    6 - 5m - b
7    8    7m + b    8 - 7m - b

Using the fact that (A + B + C)² = A² + B² + C² + 2AB + 2AC + 2BC, we can quickly find SSres = 101 + 83m² + 3b² - 178m - 30b + 30mb. This expression is quadratic in both m and b. We can rewrite it both ways and then find the vertex for each (which is the minimum since we are summing squares). Remember the vertex of y = ax² + bx + c is at x = -b/(2a).

SSres = 3b² + (30m - 30)b + (101 + 83m² - 178m).
SSres = 83m² + (30b - 178)m + (101 + 3b² - 30b).
From the first expression we find b = (-30m + 30)/6. From the second expression we find m = (-30b + 178)/166. These expressions give us two equations in two unknowns:
5m + b = 5 and
83m + 15b = 89.
These can be solved to obtain m = 7/4 = 1.75 and b = -15/4 = -3.75. This is how the equations above for β0 and β1 were derived, from the general solution to two general equations for SSres.
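
The same two equations can be set up from the data sums and solved mechanically. The sketch below uses Cramer's rule on the 2x2 system, which is one convenient choice rather than anything the lesson prescribes, and keeps the results as exact fractions.

```python
from fractions import Fraction

points = [(3, 1), (5, 6), (7, 8)]
n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_x2 = sum(x * x for x, _ in points)
sum_xy = sum(x * y for x, y in points)

# The two vertex conditions are equivalent to
#   sum_x  * m + n     * b = sum_y    (15m + 3b = 15, i.e. 5m + b = 5)
#   sum_x2 * m + sum_x * b = sum_xy   (83m + 15b = 89)
# Solve this 2x2 system by Cramer's rule.
det = sum_x * sum_x - n * sum_x2
m = Fraction(sum_y * sum_x - n * sum_xy, det)
b = Fraction(sum_x * sum_xy - sum_y * sum_x2, det)

print(m, b)  # 7/4 -15/4
```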

This link brings up a Java applet which allows you to add a point to a graph and see what influence it has on a regression line.

This link brings up a Java applet which encourages you to guess the regression line and correlation coefficient for a data set.
