A typical example might be ACT and SAT scores. ACT scores range from 1 to 36 with a national mean of about 21.0 and a standard deviation of about 4.7. (The four sections each range from 1 to 36, and the composite is their average.) SAT scores range from 200 to 800 for each subtest, with a national mean of about 508 and a standard deviation of about 111. Both ACT and SAT scores appear to be approximately normally distributed. Math and Science Center students often take both tests, perhaps several times, and would represent a sample. This sample would have its own mean and standard deviation, but of course these would be statistics, not parameters. (Specifically, our Math and Science Center students average about 1050 (two-section total) when they take the SAT in eighth grade and over 1300 (two-section total) when they take it as juniors. Our average junior ACT score is about 29. Information about the three-section total (math, English, and essay) is pending. Note that historically the Math and Science Center used SAT scores as 40% of our admission criterion. The math section now contains "higher math" (precalculus), and an alternative is being formulated. Also of note is the recent proposal here in Michigan to replace the high school MEAP with the ACT starting with the class of 2008.)
The formulae used for z-score appear in two virtually identical forms, recognizing the fact that we may be dealing with sample statistics or population parameters. These formulae are as follows.
| z-score formulae |
| Sample: $z = \frac{x - \bar{x}}{s}$ |
| Population: $z = \frac{x - \mu}{\sigma}$ |
| Negative z-scores indicate a data element's position below the mean. |
| Positive z-scores indicate a data element's position above the mean. |
| z-scores should [almost] always be rounded to two decimal places. |
For the IQs of 0 and 210 referred to in lesson 6, z-scores of -6.67 and 7.33 should be obtained respectively, based on a population mean of 100 and a standard deviation of 15.
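The IQ figures above can be checked with a few lines of Python. This is only a minimal sketch: the function name `z_score` is chosen here for illustration, and the SAT subtest score of 650 is a made-up value used to show how scores from two different scales can be compared.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# IQ examples from the text: population mean 100, standard deviation 15.
print(round(z_score(0, 100, 15), 2))    # -6.67
print(round(z_score(210, 100, 15), 2))  # 7.33

# Comparing scores on different scales, using the national figures quoted earlier.
# The ACT score of 29 is the center's junior average; the SAT score of 650 is hypothetical.
act_z = z_score(29, 21.0, 4.7)
sat_z = z_score(650, 508, 111)
print(f"ACT z = {act_z:.2f}, SAT z = {sat_z:.2f}")  # ACT z = 1.70, SAT z = 1.28
```

Because both results are expressed in standard deviations, the larger z-score marks the relatively stronger performance, even though the raw scales differ.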
The population does not have to be normally distributed to calculate z-scores, but normal distributions are one of the primary applications of z-scores.
| In summary, z-scores provide a useful measurement for comparing data elements from different [heap-shaped] data sets. |
| Data elements more than 2 standard deviations away from the mean are termed unusual. |
| Data elements less than 2 standard deviations away from the mean are termed ordinary. |
As you will recall, in a normally distributed population, about 95% of the data
will then be ordinary, so only about 5% can be unusual. For any distribution, Chebyshev's theorem guarantees
at least 75% of the data to be ordinary, so no more than 25% can be unusual.
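The two-standard-deviation rule and the Chebyshev bound can be sketched as follows. The function names are illustrative only, and treating a z-score of exactly 2 as ordinary is an assumption, since the definitions above leave that boundary case open.

```python
def classify(x, mean, sd):
    """Label a value 'unusual' if it lies more than 2 standard deviations from the mean."""
    z = (x - mean) / sd
    # Assumption: a z-score of exactly 2 (or -2) is treated as ordinary.
    return "unusual" if abs(z) > 2 else "ordinary"

def chebyshev_min_fraction(k):
    """Chebyshev's theorem: at least 1 - 1/k^2 of any data set lies within k standard deviations."""
    return 1 - 1 / k**2

# IQ scale: mean 100, standard deviation 15.
print(classify(110, 100, 15))     # ordinary
print(classify(135, 100, 15))     # unusual
print(chebyshev_min_fraction(2))  # 0.75
```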
| Note first how the median divides a population into two halves: a top half and a bottom half. |
The top half consists of those data elements above the median, whereas the bottom half consists of those data elements below the median. If we subdivide each of these halves yet again, we have quartered the population, and the division points are termed quartiles. Although one might occasionally speak of the bottom quartile, top quartile, etc., the term quartile technically refers to the three division points and not to the four divisions of the data.
| Q1 is the term used for the median of the bottom half. |
| Q3 is the term used for the median of the top half. |
| Q2 is another term used for the median. |
The precise definition specifies that at least 25% of the data will be less than or equal to Q1 and at least 75% of the data will be less than or equal to Q3. For this introduction, we will follow the conventions for calculating Q1 and Q3 of the TI-84+ graphing calculator, but note a similar term below under hinges. All these measures of position assume the data is quantitative and can be put in numeric order.
| Data are ranked when arranged in [numeric] order. |
Since range is sensitive to outliers (defined below), sometimes the interquartile range is calculated. This range is the difference between the third and first quartiles: Q3-Q1. It is another measure of dispersion. Other common terms include the semi-interquartile range, (Q3-Q1)/2, which is another measure of dispersion, and the midquartile, (Q1+Q3)/2, which is a measure of central tendency (an average).
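As a rough illustration, the quartile-based measures can be computed as below. The sketch assumes the TI-84+ convention is the common "median of each half" method, with the median itself left out of both halves when the number of elements is odd. The function name `quartiles` is chosen for this example, and the data is the ten-element set used in this lesson's outlier example.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as the medians of the lower and upper halves.

    Assumed convention: when n is odd, the middle element belongs to neither half.
    """
    s = sorted(data)  # rank the data
    n = len(s)
    half = n // 2
    lower = s[:half]       # bottom half
    upper = s[n - half:]   # top half (skips the middle element when n is odd)
    return median(lower), median(s), median(upper)

data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)     # 1 3.5 6
print(q3 - q1)        # interquartile range: 5
print((q3 - q1) / 2)  # semi-interquartile range: 2.5
print((q1 + q3) / 2)  # midquartile: 3.5
```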
| The upper hinge is the median of the upper half of all scores, including the median. |
| The lower hinge is the median of the lower half of all scores, including the median. |
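The hinge definitions can be read literally as "put the median into each half, then take the median of that half." The sketch below follows that reading, which is an assumption about the intended procedure; other texts define hinges somewhat differently. The function name `hinges` is illustrative.

```python
from statistics import median

def hinges(data):
    """Return (lower hinge, upper hinge), including the median in both halves."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the middle element is the median and is shared by both halves.
        lower, upper = s[:half + 1], s[half:]
    else:
        # Even n: the median falls between two elements; add its value to both halves.
        m = median(s)
        lower, upper = s[:half] + [m], [m] + s[half:]
    return median(lower), median(upper)

# Data set from this lesson's outlier example.
print(hinges([0, 2, 4, 5, 6, 3, 6, 1, 1, 50]))  # (1.5, 5.5)
```

For this data set the result agrees with the hinges of 1.5 and 5.5 quoted in the outlier example.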
Outliers are extreme values in a data set. Sometimes the term outlier is applied to unusual values as defined above (Triola, 5th edition). More recently, outliers are defined in terms of the hinges or quartiles. Outliers are often differentiated as mild or extreme as defined below. The interquartile range, or perhaps D = upper hinge - lower hinge, is used. Generally, an outlier should be obvious and not borderline, that is, not right next to another element but lying just outside some arbitrary line of demarcation.
| Mild outliers are 1.5D to 3D beyond the corresponding hinge. |
| Extreme outliers are beyond the corresponding hinge by more than 3D. |
Example: Find any outliers in the data set: {0, 2, 4, 5, 6, 3, 6, 1, 1, 50}.
Solution: Obviously, 50 is a much larger number than any of the other elements.
This outlier will cause the mean and variance to be much higher.
Specifically, without 50, the mean is 3.1 and standard deviation 2.3,
whereas with 50, the mean is 7.8 and standard deviation 15.0.
Note that the quartiles are 1 and 6, whereas the hinges are 1.5 and 5.5
for the unmodified data set.
For any of these definitions, 50 lies far from the other data and is an outlier.
Outliers might be legitimate data values or errors.
This 50 might really have been 5.0 and was miscoded
(historically, punch card input was column sensitive) or poorly
recorded in a lab book, with the decimal point extremely light or missing.
50 may also represent extreme extra credit on a 5 point quiz!
It is not unusual to be tempted to omit such data values.
It is not considered a good practice, but if such are omitted,
be sure to clearly record that fact.
You will have just crossed the line between objective and subjective science.
It doesn't take many such changes to skew results to fit a preconceived notion!
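The figures in this example can be reproduced with a short sketch. It assumes the standard deviations quoted are sample standard deviations (dividing by n - 1), takes the hinges 1.5 and 5.5 and the quartiles 1 and 6 directly from the text, and uses an illustrative function name, `classify_outlier`.

```python
from statistics import mean, stdev

def classify_outlier(x, lower, upper):
    """Classify x as a mild or extreme outlier relative to the hinges (or quartiles)."""
    d = upper - lower                        # D = upper hinge - lower hinge
    distance = max(lower - x, x - upper, 0)  # how far x lies beyond the nearer hinge
    if distance > 3 * d:
        return "extreme outlier"
    if distance >= 1.5 * d:
        return "mild outlier"
    return "not an outlier"

data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
without_50 = [x for x in data if x != 50]

# Effect of the outlier on the mean and (sample) standard deviation.
print(round(mean(without_50), 1), round(stdev(without_50), 1))  # 3.1 2.3
print(round(mean(data), 1), round(stdev(data), 1))              # 7.8 15.0

# Using the hinges (1.5 and 5.5), then the quartiles (1 and 6), from the text.
print(classify_outlier(50, 1.5, 5.5))  # extreme outlier
print(classify_outlier(50, 1, 6))      # extreme outlier
print(classify_outlier(6, 1.5, 5.5))   # not an outlier
```

Either choice of D puts 50 far more than 3D beyond the upper hinge or quartile, so the classification does not hinge on which definition is used.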
| D5 is another name for the median. |
| P50 is yet another term for median. |
Other equivalents, such as P25=Q1,
P75=Q3,
P10=D1, etc.,
should also be obvious.
Once again, the term percentile technically refers to the
99 division points, but is not uncommonly used to refer to the 100 divisions.
For large data sets, one can calculate the locator L to
help find a requested percentile. It is computed as follows.
| Percentile Locator Formula |
| $L = \frac{k}{100} \cdot n$ |
k is the percentile being sought and n, of course,
is the number of elements in our data set.
Usual conventions dictate that once L is obtained,
it must be checked to see if it is a whole number.
If it is a whole number, the value of Pk
is the mean of the Lth data element and the next higher data element.
If it is not a whole number, L must be
rounded up to the next larger whole number.
The value of Pk is then the
Lth data element, counting from the lowest.
There is an essential difference between rounding up and rounding off.
If we round off a value such as 3.2,
we get 3.
Whereas, if we round up the same value,
we get 4.
One last measure of dispersion is the 10-90 percentile range
which is defined to be P90 - P10.
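The locator procedure described above can be sketched in Python as follows. The sketch assumes the locator is L = (k/100)·n with k a whole number from 1 to 99, and it uses integer arithmetic to test whether L is a whole number so that floating-point rounding cannot interfere. The function name `percentile` is illustrative, and the small data set from the outlier example is used only to keep the arithmetic easy to follow.

```python
def percentile(data, k):
    """Return P_k using the locator rule: L = (k/100) * n."""
    if not 1 <= k <= 99:
        raise ValueError("k must be a whole number from 1 to 99")
    s = sorted(data)  # rank the data, lowest first
    n = len(s)
    product = k * n
    if product % 100 == 0:
        # L is a whole number: average the Lth element and the next higher one.
        loc = product // 100
        return (s[loc - 1] + s[loc]) / 2
    # L is not a whole number: round UP, then take the Lth element.
    loc = product // 100 + 1
    return s[loc - 1]

data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
print(percentile(data, 25))  # 1   (L = 2.5 rounds up to 3)
print(percentile(data, 50))  # 3.5 (L = 5, mean of 5th and 6th elements)
print(percentile(data, 75))  # 6   (L = 7.5 rounds up to 8)

# 10-90 percentile range: P90 - P10
print(percentile(data, 90) - percentile(data, 10))  # 27.5
```

With this small set, P90 averages the ninth and tenth elements (6 and 50), so the outlier strongly inflates the 10-90 percentile range; in a large data set the effect of a single extreme value would be far smaller.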
| There is no such thing as P100. |