As you have already seen, it is difficult to measure price changes when they so often vary from shop to shop and region to region. Taking some average value, such as the median or the mean, helps to simplify the problem. However, it would be a mistake to ignore the notion of spread, as averages on their own can be misleading.
Information about spread can be very important in statistical analysis, where you are often interested in comparing two or more batches. In this section we shall look first at measures of spread, and then at some methods of summarising the shape of a batch of data.
But how can spread be measured? Just as there are several ways of measuring location (mean, median, etc.), there are also several ways of measuring spread. Here, we shall examine two such measures: the range and the interquartile range. (A further, even more important, measure of spread is the standard deviation. It is, however, beyond the scope of this course.)
The range is defined below.
The range is the distance between the lower and the upper extremes. It can be calculated from the formula:
$$\text{range} = E_U - E_L,$$
where $E_U$ is the upper extreme and $E_L$ is the lower extreme.
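As a rough illustration of this formula, the following Python sketch computes the range of a batch. The function name `batch_range` is an illustrative choice and not part of the course; the data are the 26 television prices listed in the exercises at the end of this section.

```python
def batch_range(batch):
    """Range of a batch: upper extreme minus lower extreme."""
    lower_extreme = min(batch)
    upper_extreme = max(batch)
    return upper_extreme - lower_extreme


# The 26 television prices (in pounds) from the exercises for this section.
tv_prices = [
    170, 180, 190, 200, 220, 229, 230, 230, 230, 230, 250, 269, 269,
    270, 279, 299, 300, 300, 315, 320, 349, 350, 400, 429, 649, 699,
]

print(batch_range(tv_prices))  # 529
```

The range here depends only on the two values 170 and 699, which is why it says so little about the main body of the batch.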
Given an ordered batch of data, for example in a stemplot, the range can easily be calculated. However, the range tells us very little about how the values in the main body of the data are spread. It is also very sensitive to changes in the extreme values, like those considered in Subsection 1.4. It would be better to have a measure of spread that conveys more information about the spread of values in the main body of the data. One such measure is based upon the difference between two particular values in the batch, known as the quartiles. As the name suggests, the two quartiles lie one quarter of the way into the batch from either end. The major part of the next subsection describes how to find them.
Finding the quartiles of a batch is very similar to finding the median.
In Subsection 1.2, we represented a batch as a V-shaped formation, with the median at the ‘hinge’ where the two arms of the V meet. The median
splits the batch into two equal parts. Similarly, we can put another hinge in each side of the V and get four roughly equal
parts. For a batch of size 15, it looks like Figure 12.
Figure 12 Median and quartiles
The points at the side hinges, in this case $x_{(4)}$ and $x_{(12)}$ (the 4th and 12th values in the ordered batch), are the quartiles. There are two quartiles which, as with the extremes, we call the lower quartile and the upper quartile. The lower quartile separates off the bottom quarter, or lowest 25%. The upper quartile separates off the top quarter, or highest 25%. They are denoted $Q_1$ and $Q_3$ respectively. (Sometimes they are referred to as the first quartile and the third quartile.)
You might be wondering, if these are $Q_1$ and $Q_3$, what happened to $Q_2$? Well, have a think about that for a moment. $Q_1$ separates the bottom quarter of the data (from the top three quarters), and $Q_3$ separates the bottom three quarters (from the top quarter). So it would make sense to say that $Q_2$ separates the bottom two quarters (from the top two quarters). But two quarters make a half, so $Q_2$ would denote the median, and since there is already a separate word for that, it’s not usual to call it the second quartile.
Usually we cannot divide the batch exactly into quarters. Indeed, this is illustrated in Figure 12 where the two central parts
of the diagram are larger than the outer ones. As with calculating the median for an even-sized batch, some rule is needed to tell us how
many places we need to count along from the smallest value to find the quartiles. However, there are several alternatives
that we could adopt and the particular rule described below is somewhat arbitrary. Different authors and different software
may use slightly different rules. If your calculator can find quartiles, note that it may use a different rule.
As you might have expected, the rule involves dividing by 4, where $n$ is the batch size (as opposed to dividing by 2 to find the median). However, the rule is slightly more complicated for the quartiles and it depends on whether $n + 1$ is exactly divisible by 4.
The lower quartile is at position $(n+1)/4$ in the ordered batch.
The upper quartile is at position $3(n+1)/4$ in the ordered batch.
If $n + 1$ is exactly divisible by 4, these positions correspond to a single value in the batch.
If $n + 1$ is not exactly divisible by 4, then the positions are to be interpreted as follows.
A position which is a whole number followed by $\tfrac{1}{2}$ means ‘halfway between the two positions either side’ (as was the case for finding the median).
A position which is a whole number followed by $\tfrac{1}{4}$ means ‘one quarter of the way from the position below to the position above’. So for instance if a position is $5\tfrac{1}{4}$, the quartile is the number one quarter of the way from $x_{(5)}$ to $x_{(6)}$.
A position which is a whole number followed by $\tfrac{3}{4}$ means ‘three quarters of the way from the position below to the position above’. So for instance if a position is $15\tfrac{3}{4}$, the quartile is the number three quarters of the way from $x_{(15)}$ to $x_{(16)}$.
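The rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names are illustrative, the positions are taken to be $(n+1)/4$ and $3(n+1)/4$ counted from 1 in the ordered batch, and the batch of 20 values is made up purely for demonstration. Exact fractions are used so that the quarter, half and three-quarter parts of a position are not blurred by floating-point rounding.

```python
from fractions import Fraction


def quartile_positions(n):
    """Positions of the lower and upper quartiles in an ordered batch of size n."""
    return Fraction(n + 1, 4), Fraction(3 * (n + 1), 4)


def value_at(ordered, position):
    """Value at a 1-based position that may end in 1/4, 1/2 or 3/4."""
    whole = int(position)        # the position below
    part = position - whole      # 0, 1/4, 1/2 or 3/4
    below = ordered[whole - 1]
    if part == 0:
        return below
    above = ordered[whole]
    # Move the stated fraction of the way from the value below to the value above.
    return below + part * (above - below)


def quartiles(batch):
    """Lower and upper quartiles of a batch, using the rule described above."""
    ordered = sorted(batch)
    if len(ordered) < 3:
        raise ValueError("the rule needs a batch of at least 3 values")
    lower_pos, upper_pos = quartile_positions(len(ordered))
    return float(value_at(ordered, lower_pos)), float(value_at(ordered, upper_pos))


# Made-up batch of 20 values (10, 20, ..., 200), for illustration only.
toy_batch = list(range(10, 201, 10))
q1, q3 = quartiles(toy_batch)
print(q1, q3)  # 52.5 157.5
```

For a batch of 20 the positions are 5¼ and 15¾, so the lower quartile lies one quarter of the way from the 5th value (50) to the 6th (60), and the upper quartile three quarters of the way from the 15th (150) to the 16th (160). Statistical software may use a different rule and so give slightly different answers.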
Before we actually use these rules to find quartiles, let us look at some more examples of hinged diagrams for different batch sizes $n$. The case where $n + 1$ is exactly divisible by 4, so that $(n+1)/4$ is a whole number, was shown in Figure 12. The following three figures show the three other possible scenarios, where $n + 1$ is not exactly divisible by 4.
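For a batch size such as $n = 17$, where $n + 1$ leaves a remainder of 2 on division by 4, we have $(n+1)/4 = 4\tfrac{1}{2}$ and $3(n+1)/4 = 13\tfrac{1}{2}$. So $Q_1$ is halfway between $x_{(4)}$ and $x_{(5)}$, and $Q_3$ is halfway between $x_{(13)}$ and $x_{(14)}$.
Figure 13 Quartiles when both positions end in $\tfrac{1}{2}$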
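For a batch size such as $n = 18$, where $n + 1$ leaves a remainder of 3 on division by 4, we have $(n+1)/4 = 4\tfrac{3}{4}$ and $3(n+1)/4 = 14\tfrac{1}{4}$. So $Q_1$ is three quarters of the way from $x_{(4)}$ to $x_{(5)}$, and $Q_3$ is one quarter of the way from $x_{(14)}$ to $x_{(15)}$.
Figure 14 Quartiles when the positions end in $\tfrac{3}{4}$ and $\tfrac{1}{4}$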
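For $n = 20$, $(n+1)/4 = 5\tfrac{1}{4}$ and $3(n+1)/4 = 15\tfrac{3}{4}$. So $Q_1$ is one quarter of the way from $x_{(5)}$ to $x_{(6)}$, and $Q_3$ is three quarters of the way from $x_{(15)}$ to $x_{(16)}$.
Figure 15 Quartiles for sample size $n = 20$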
Figure 15 showed you where the quartiles are for a batch of size 20. Let us now use the stemplot of the 20 television prices
in Figure 16, which you first met in Figure 5 (Subsection 1.2), to find the lower and upper quartiles, $Q_1$ and $Q_3$, of this batch.
Figure 16 Prices of flat-screen televisions with a screen size of 24 inches or less
To calculate the lower quartile $Q_1$ you need to find the number that is one quarter of the way from $x_{(5)}$ to $x_{(6)}$. These values are both 130, so $Q_1$ is 130. To calculate the upper quartile $Q_3$ you need to find the number three quarters of the way from $x_{(15)}$ to $x_{(16)}$. These values are both 180, so $Q_3$ is 180.
That example was easier than it might have been, because for each quartile the two numbers we had to consider turned out to be equal!
Table 2 (Subsection 1.2) gave ten prices for a particular model of digital camera (in pounds). In order, the prices are as follows.
To find the lower and upper quartiles, $Q_1$ and $Q_3$, of this batch, first find $(n+1)/4 = 2\tfrac{3}{4}$ and $3(n+1)/4 = 8\tfrac{1}{4}$.
The lower quartile is the number three quarters of the way from $x_{(2)}$ to $x_{(3)}$. These values are 60 and 65. The difference between them is $65 - 60 = 5$, and three quarters of that difference is $\tfrac{3}{4} \times 5 = 3.75$. Therefore $Q_1$ is 3.75 larger than 60, so it is 63.75. As with the median, in this course we will generally round the quartiles to the accuracy of the original data, so in this case we round to the nearest whole number, 64. In symbols, $Q_1 = 63.75 \approx 64$.
The upper quartile is the number one quarter of the way from $x_{(8)}$ to $x_{(9)}$. These values are 81 and 85. The difference between them is $85 - 81 = 4$, and one quarter of that difference is $\tfrac{1}{4} \times 4 = 1$. Therefore $Q_3$ is 1 larger than 81, so it is 82. (No rounding necessary this time.) In symbols, $Q_3 = 82$.
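The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper name `part_way` is hypothetical, and the numbers are the four camera prices quoted in the example.

```python
def part_way(lower, upper, fraction):
    """Number lying the given fraction of the way from lower to upper."""
    return lower + fraction * (upper - lower)


# Values taken from the worked example: the 2nd and 3rd prices are 60 and 65,
# and the 8th and 9th prices are 81 and 85.
q1_exact = part_way(60, 65, 0.75)  # lower quartile: three quarters of the way
q3_exact = part_way(81, 85, 0.25)  # upper quartile: one quarter of the way

print(q1_exact, round(q1_exact))  # 63.75 64
print(q3_exact, round(q3_exact))  # 82.0 82
```

Note that Python's built-in `round` sends exact halves to the nearest even whole number, which may differ from the rounding convention used in this course. That makes no difference for these two values.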
Example 15 is the subject of the following screencast. [Note that references to ‘the unit’ should be interpreted as ‘this course’. The original wording refers to the Open University course from which this material is adapted.]
Screencast 3 Calculating quartiles
(a) Find the lower and upper quartiles of the batch of 15 coffee prices in Figure 17. (This batch of coffee prices was first introduced in Table 1 of Subsection 1.1.)
Figure 17 Stemplot of 15 coffee prices
(b) Find the lower and upper quartiles of the batch of 14 gas prices in Figure 18. (This batch of gas prices was first introduced in Table 3 of Subsection 1.2.)
Figure 18 Stemplot of 14 gas prices
Now we can define a new measure of spread based entirely on the lower and upper quartiles.
The interquartile range (sometimes abbreviated to IQR) is the distance between the lower and upper quartiles:
$$\text{IQR} = Q_3 - Q_1.$$
Note that this value is independent of the sizes of the extremes $E_L$ and $E_U$.
For the batch of 20 television prices in Example 14 (Subsection 3.2),
$$\text{IQR} = Q_3 - Q_1 = 180 - 130 = 50.$$
So the interquartile range is £50.
Calculate both the range and the interquartile range of the batch of 15 coffee prices, last seen in Figure 17 (Subsection 3.2).
In Activity 10(b) (Subsection 3.2) you found the quartiles of the 14 gas prices from Activity 2 (Subsection 1.2). Find the interquartile range.
You may be wondering why you are being asked to learn a new measure of spread when you already know the range. As a measure
of spread, the range is not very satisfactory because it is not resistant to the effects of unrepresentative extreme values. (Resistant measures
were explained in Subsection 1.4.) The interquartile range, by contrast, is a highly resistant measure of spread (because it is not sensitive to the effects
of values lying outside the middle 50% of the batch) and it is generally the preferred choice.
Suppose the price of the most expensive jar of coffee is reduced from 369p to 325p. How does this affect the range and the interquartile range of the batch of coffee prices in Figure 17 (Subsection 3.2)?
The new range is $325\text{p} - 268\text{p} = 57\text{p}$,
a lot less than the original value of 101p (found in Activity 11). The interquartile range is unchanged.
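The same effect can be demonstrated in code. The sketch below is illustrative only: it uses a made-up batch of 15 values (not the coffee prices), assumes the quartile positions $(n+1)/4$ and $3(n+1)/4$, and uses hypothetical function names. It reduces the largest value and compares the range and interquartile range before and after.

```python
from fractions import Fraction


def quartiles(batch):
    """Q1 and Q3 at positions (n+1)/4 and 3(n+1)/4 of the ordered batch."""
    ordered = sorted(batch)
    n = len(ordered)
    if n < 3:
        raise ValueError("the rule needs a batch of at least 3 values")

    def value_at(position):
        whole = int(position)    # the position below
        part = position - whole  # 0, 1/4, 1/2 or 3/4
        below = ordered[whole - 1]
        if part == 0:
            return float(below)
        return float(below + part * (ordered[whole] - below))

    return value_at(Fraction(n + 1, 4)), value_at(Fraction(3 * (n + 1), 4))


def spread(batch):
    """Return (range, interquartile range) of a batch."""
    q1, q3 = quartiles(batch)
    return max(batch) - min(batch), q3 - q1


# Made-up batch of 15 values with one unusually large value, for illustration.
original = [12, 15, 17, 20, 22, 25, 27, 30, 32, 35, 38, 40, 45, 50, 90]
reduced = original[:-1] + [55]  # the largest value is cut from 90 to 55

print(spread(original))  # (78, 20.0)
print(spread(reduced))   # (43, 20.0)
```

Changing the single largest value alters the range considerably, while the interquartile range stays the same because the 4th and 12th values are untouched.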
As well as giving us a new measure of spread – the interquartile range – the quartiles are important figures in themselves.
Our hinged diagram, Figure 19, gives five important points which help to summarise the shape of a distribution: the median, the two quartiles and the two extremes.
Figure 19 Values in a five-figure summary
These are conveniently displayed in the following form, called the five-figure summary of the batch.
Figure 20
For the television price data, we have $n = 20$, $M = 150$, $Q_1 = 130$, $Q_3 = 180$, $E_L = 90$ and $E_U = 270$. (You last saw these data in Figure 16, Subsection 3.2.)
Therefore, the five-figure summary of this batch is
Figure 21
This diagram contains the following information about the batch of prices.
The general level of prices, as measured by the median, is £150.
The individual prices vary from £90 to £270.
About 25% of the prices were less than £130.
About 25% of the prices were more than £180.
About 50% of the prices were between £130 and £180.
We hope you agree that the five-figure summary is quite an efficient way of presenting a summary of a batch of data.
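A five-figure summary is straightforward to compute once the batch is ordered. The following Python sketch is one possible way to do it, under stated assumptions: the names are illustrative, the batch is made up, the quartiles are taken at positions $(n+1)/4$ and $3(n+1)/4$, and the median is assumed to sit at position $(n+1)/2$, interpreted with the same ‘halfway’ convention.

```python
from fractions import Fraction
from typing import NamedTuple


class FiveFigureSummary(NamedTuple):
    lower_extreme: float
    lower_quartile: float
    median: float
    upper_quartile: float
    upper_extreme: float


def five_figure_summary(batch):
    """Extremes, quartiles and median of a batch."""
    ordered = sorted(batch)
    n = len(ordered)
    if n < 3:
        raise ValueError("the rule needs a batch of at least 3 values")

    def value_at(position):
        whole = int(position)    # the position below
        part = position - whole  # 0, 1/4, 1/2 or 3/4
        below = ordered[whole - 1]
        if part == 0:
            return float(below)
        return float(below + part * (ordered[whole] - below))

    return FiveFigureSummary(
        lower_extreme=float(ordered[0]),
        lower_quartile=value_at(Fraction(n + 1, 4)),
        median=value_at(Fraction(n + 1, 2)),
        upper_quartile=value_at(Fraction(3 * (n + 1), 4)),
        upper_extreme=float(ordered[-1]),
    )


# Made-up batch of 15 values, for illustration only.
toy_batch = [12, 15, 17, 20, 22, 25, 27, 30, 32, 35, 38, 40, 45, 50, 90]
summary = five_figure_summary(toy_batch)
print(summary.lower_extreme, summary.lower_quartile, summary.median,
      summary.upper_quartile, summary.upper_extreme)  # 12.0 20.0 30.0 40.0 90.0
```

With 15 values the three positions are 4, 8 and 12, so each of the middle three figures is a single value from the batch.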
The five values in a five-figure summary can be very effectively presented in a special diagram called a boxplot. For the 14 gas prices (Figure 18, Subsection 3.2) the diagram looks like Figure 22.
Figure 22 Boxplot of batch of 14 gas prices
The central feature of this diagram is a box – hence the name boxplot. The box extends from the lower quartile (at the left-hand edge of the box) to the upper quartile (the right-hand edge). This part of the diagram contains 50% of the values in the batch. The length of this box is thus the interquartile range.
Outside the box are two whiskers. (Boxplots are sometimes called box-and-whisker diagrams.) In many cases, such as in Figure 22, the whiskers extend all the way out to the extremes. Each whisker then covers the end 25% of the batch and the distance between the two whisker-ends is then the range. (You will see examples later where the whiskers do not go right out to the extremes.)
So far we have dealt with four figures from the five-figure summary: the two quartiles and the two extremes. The remaining figure is perhaps the most important: it is the median, whose position is shown by putting a vertical line through the box.
Thus a boxplot shows clearly the division of the data into four parts: the two whiskers and the two sections of the box; these
are the four parts of the hinged diagram and each contains (approximately) 25% of the values in the batch (see Figure 21).
John Tukey was a prominent and prolific US statistician, based at Princeton University and Bell Laboratories. As well as working in some very technical areas, he was a great promoter of simple ways of picturing and summarising data, and invented both the five-figure summary and the boxplot (except that he called them the ‘five-number summary’ and the ‘box-and-whisker plot’).
He had what has been described as an ‘unusual’ lecturing style. The statistician Peter McCullagh describes a lecture he gave at Imperial College, London in 1977:
Tukey ambled to the podium, a great bear of a man dressed in baggy pants and a black knitted shirt. These might once have been a matching pair, but the vintage was such that it was hard to tell. …The words came …, not many, like overweight parcels, delivered at a slow unfaltering pace. …Tukey turned to face the audience …. ‘Comments, queries, suggestions?’ he asked …. As he waited for a response, he clambered onto the podium and manoeuvred until he was sitting cross-legged facing the audience. …We in the audience sat like spectators at the zoo waiting for the great bear to move or say something. But the great bear appeared to be doing the same thing, and the feeling was not comfortable. …After a long while, …he extracted from his pocket a bag of dried prunes and proceeded to eat them in silence, one by one. The war of nerves continued …four prunes, five prunes. …How many prunes would it take to end the silence?
(Source: McCullagh, P. (2003) ‘John Wilder Tukey’, Biographical Memoirs of Fellows of the Royal Society, vol. 49, pp. 537–55.)
Figure 23 A standard boxplot with annotation
A typical boxplot looks something like Figure 23 because in most batches of data the values are more densely packed in the middle of the batch and are less densely packed in the extremes. This means that each whisker is usually longer than half the length of the box. This is illustrated again in the next example.
The boxplot for the batch of 20 television prices (last worked with in Example 18) is shown in Figure 24.
Figure 24 Boxplot of batch of 20 television prices
You can see that each whisker is longer than half the length of the box.
However, this boxplot has a new feature. The whisker on the left goes right down to the lower extreme. But the whisker on the right does not go right to the upper extreme. The highest extreme data value, 270, which might potentially be regarded as an outlier, is marked separately with a star. Then the whisker extends only to cover the data values that are not extreme enough to be regarded as potential outliers. The highest of these values is 250.
(This course does not describe the rule to decide which data values (if any) can be regarded as potential outliers that are plotted separately on the diagram. This is another issue that may be dealt with differently by different authors and different software.)
Example 19 is the subject of the following screencast. [Note that the reference to ‘Unit 2’ should be ‘this course’ and ‘Figure 18’ should be ‘Figure 23’. Unit 2 and Figure 18 are references to the Open University course from which this material is adapted.]
Screencast 4 Interpreting a boxplot
One important use of boxplots is to picture and describe the overall shape of a batch of data.
The stemplot of small television prices, last seen in Figure 16 (Subsection 3.2), shows a lack of symmetry. Since the higher values are more spread out than the lower values, the data are right-skew.
The boxplot of these data, given in Figure 24, also shows this right-skew fairly clearly. In the box, the right-hand part (corresponding to higher prices) is rather longer than the left-hand part, and the right-hand whisker is longer than the left-hand whisker.
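The lengths being compared here can be worked out directly from the figures quoted for the television prices. The sketch below is an informal aid rather than a formal test of skewness; the function name is hypothetical, and the right-hand whisker is taken to end at 250, since 270 is plotted separately as a potential outlier.

```python
def box_and_whisker_lengths(lower_end, q1, median, q3, upper_end):
    """Lengths of the four parts of a boxplot, from left to right."""
    return {
        "left whisker": q1 - lower_end,
        "left box part": median - q1,
        "right box part": q3 - median,
        "right whisker": upper_end - q3,
    }


# Television prices (pounds): whiskers end at 90 and 250, quartiles are
# 130 and 180, and the median is 150.
parts = box_and_whisker_lengths(90, 130, 150, 180, 250)
print(parts)

# Both right-hand parts are longer than their left-hand counterparts.
right_heavy = (parts["right box part"] > parts["left box part"]
               and parts["right whisker"] > parts["left whisker"])
print(right_heavy)  # True
```

The right-hand box part (30) is longer than the left-hand one (20), and the right-hand whisker (70) is longer than the left-hand whisker (40), which agrees with the description of the batch as right-skew.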
A stemplot of the gas price data from Activity 2 (Subsection 1.2) is shown, yet again, in Figure 25.
Figure 25 Stemplot of 14 gas prices
(a) Prepare a five-figure summary of the batch.
(b) Figure 27 shows the boxplot of these data that you have already seen in Figure 22. What do the stemplot and boxplot tell us about the symmetry and/or skewness of the batch?
Figure 27 Boxplot of batch of 14 gas prices
In Example 20 and Activity 13 you saw how boxplots look for batches of data that are right-skew or left-skew. What happens in a batch that is more symmetrical?
For the small batch of camera prices from Table 2 (Subsection 1.2), a (stretched) stemplot is shown in Figure 28.
Figure 28 Stemplot of ten camera prices
The stemplot looks reasonably symmetric.
A boxplot of the data, Figure 29, confirms the impression of symmetry. The two parts of the box are roughly equal in length, and the two whiskers are also roughly equal in length.
Figure 29 Boxplot of batch of ten camera prices
You have now spent quite a lot of time looking at various ways of investigating prices and, in particular, at methods of measuring the location and spread of the prices of particular commodities.
In order to begin to answer our question, Are people getting better or worse off?, we need to know not just location (and spread) of prices but also how these prices are changing from year to year. That is the subject of the rest of this course.
The following exercises provide extra practice on the topics covered in Section 3.
(a) For the arithmetic scores in Exercise 1 (Section 1), find the quartiles and calculate the interquartile range. The stemplot of the scores is given below.
Figure 30 Stemplot of arithmetic scores
(b) For the television prices in Exercise 1, find the quartiles and calculate the interquartile range. The table of prices is given below.
170 | 180 | 190 | 200 | 220 | 229 | 230
230 | 230 | 230 | 250 | 269 | 269 | 270
279 | 299 | 300 | 300 | 315 | 320 | 349
350 | 400 | 429 | 649 | 699
Prepare a five-figure summary for each of the two batches from Exercise 1.
(a) For the arithmetic scores, the median is 79% (found in Exercise 1), and you found the quartiles and interquartile range in Exercise 6.
(b) For the television prices, the median is £270 (found in Exercise 1), and you found the quartiles and interquartile range in Exercise 6.
Boxplots of the two batches used in Exercises 1, 6 and 7 are shown in Figures 33 and 34. On the basis of these diagrams, comment on the symmetry and/or skewness of these data.
Figure 33 Boxplot of batch of 33 arithmetic scores
Figure 34 Boxplot of batch of 26 television prices