- The goal of measures of spread is to assess the “variation” in the data in some way.
- Are the values spread out and far from each other, or are they all very close together?
- Are the values of one variable less or more spread out than those of another variable?
- Different measures achieve this in different ways.
-
The two main measures of spread are:
Std. Dev.
~ Standard deviation. Measures how far the data typically lie from the mean.
Often denoted by $s$.
Not resistant to outliers.
IQR
~ Interquartile range. Measures the range occupied by the middle 50% of the data, i.e. $Q_3 - Q_1$. Resistant to outliers.
- NOTE: The standard deviation and the IQR are in general not comparable to each other, because they measure spread in very different ways.
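As a rough illustration, both measures can be computed with Python's standard `statistics` module. The data values here are made up for this sketch. Software packages use slightly different quartile conventions, so an IQR computed by hand or in another program may differ a little.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # made-up values, for illustration only

# Sample standard deviation (divides by n - 1).
s = statistics.stdev(data)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
# The default method is "exclusive"; other conventions give slightly
# different quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("standard deviation:", s)
print("IQR:", iqr)
```

The two numbers answer different questions, so there is no reason to expect them to agree.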
-
Let us review how the standard deviation is computed. Its formula is a bit complicated, and it looks something like this:

$$s = \sqrt{\frac{\sum (x - \bar x)^2}{n-1}}$$

Basically:
- Compute the mean of all the values (the $\bar x$).
- Subtract the mean from each value (the result of this step is the “deviations”, $x-\bar x$).
- Square all the deviations.
- This ensures that no value is negative before averaging, so positive and negative deviations cannot cancel each other out.
- Average these squared deviations: Add them all up (the big “sigma”), then divide by $n-1$. The result at this stage is called the variance.
- Why $n-1$ rather than $n$: The reason is technical, and the choice makes little difference for large $n$.
- One way to think about it: The deviations always add up to $0$, so once you know $n-1$ of them the last one is determined.
- In this context, $n-1$ is called the “degrees of freedom”.
- Take a square root at the end.
- This restores the original units of measurement (the variance is in squared units).
- Outliers have a considerable effect on this formula.
- They have a very large deviation, and because they pull the mean towards themselves they cause larger deviations in the other values as well.
- Because we square the deviations before adding, those large deviations dominate the sum even more.
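The computation steps above can be traced by hand in a few lines of Python. This is only a sketch: the data values and variable names are made up for illustration, and the last part swaps in one artificial outlier to show how much of the sum of squares a single extreme value can account for.

```python
import math

values = [2, 4, 4, 5, 7, 9, 11]   # made-up data for illustration
n = len(values)

mean = sum(values) / n                       # compute the mean
deviations = [x - mean for x in values]      # subtract the mean from each value
squared = [d ** 2 for d in deviations]       # square the deviations
variance = sum(squared) / (n - 1)            # add up, divide by n - 1
s = math.sqrt(variance)                      # square root at the end

# The deviations add up to zero (up to rounding), which is why only
# n - 1 of them are "free".
print("sum of deviations:", sum(deviations))
print("variance:", variance, "std. dev.:", s)

# Replace the largest value with an artificial outlier and measure what
# share of the sum of squared deviations comes from that one value.
with_outlier = [2, 4, 4, 5, 7, 9, 50]
mean_out = sum(with_outlier) / len(with_outlier)
squared_out = [(x - mean_out) ** 2 for x in with_outlier]
share = squared_out[-1] / sum(squared_out)
print("share of the sum due to the outlier:", round(share, 2))
```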
-
What to use depends on the distribution:
Symmetric
~ Mean for center, Standard Deviation for spread
Skewed/Outliers
~ Median for center, IQR for spread
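To see why the choice matters, here is a small sketch comparing both pairs of measures on a made-up data set, with and without one artificial outlier. The helper name `summarize` and the data are assumptions for illustration, and the quartiles use the default convention of Python's `statistics.quantiles`.

```python
import statistics

def summarize(data):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),      # sample standard deviation
        "median": statistics.median(data),
        "iqr": q3 - q1,
    }

clean = [2, 4, 4, 5, 7, 9, 11]          # made-up, roughly symmetric
with_outlier = [2, 4, 4, 5, 7, 9, 50]   # same data, largest value replaced

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name, summarize(data))
```

In this example the mean and standard deviation change a lot once the outlier is introduced, while the median and IQR do not move.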
- Chebyshev’s Rule (holds for any distribution):
- At least 75% of the data is within two standard deviations of the mean.
- At least 89% of the data is within three standard deviations of the mean.
- For bell-shaped data (Empirical Rule):
- Approximately 68% of the data is within one standard deviation of the mean.
- Approximately 95% of the data is within two standard deviations of the mean.
- Approximately 99.7% of the data is within three standard deviations of the mean.
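Both rules can be checked on simulated data. The sketch below is an illustration under stated assumptions: the samples are randomly generated (a normal sample standing in for bell-shaped data, an exponential sample standing in for skewed data), the helper `fraction_within` is made up for this example, and the Chebyshev percentages are taken from the general bound $1 - 1/k^2$ for $k$ standard deviations, which gives 75% for $k=2$ and about 89% for $k=3$.

```python
import random
import statistics

def fraction_within(data, k):
    """Fraction of values within k sample standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(abs(x - mean) <= k * s for x in data) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
bell = [rng.gauss(0, 1) for _ in range(10_000)]          # bell-shaped sample
skewed = [rng.expovariate(1.0) for _ in range(10_000)]   # right-skewed sample

for k in (1, 2, 3):
    bound = 1 - 1 / k ** 2  # Chebyshev lower bound (0 when k = 1)
    print(
        f"k={k}: Chebyshev bound {bound:.3f}, "
        f"bell-shaped {fraction_within(bell, k):.3f}, "
        f"skewed {fraction_within(skewed, k):.3f}"
    )
```

The bell-shaped sample should land close to the Empirical Rule percentages, while the skewed sample only has to respect the weaker Chebyshev bounds.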