3 Measuring spread

As you have already seen, it is difficult to measure price changes when they so often vary from shop to shop and region to region. Taking some average value, such as the median or the mean, helps to simplify the problem. However, it would be a mistake to ignore the notion of spread, as averages on their own can be misleading.

Information about spread can be very important in statistical analysis, where you are often interested in comparing two or more batches. In this section we shall look first at measures of spread, and then at some methods of summarising the shape of a batch of data.

But how can spread be measured? Just as there are several ways of measuring location (mean, median, etc.), there are also several ways of measuring spread. Here, we shall examine two such measures: the range and the interquartile range. (A further, even more important, measure of spread is the standard deviation. It is, however, beyond the scope of this course.)

3.1 The range

The range is defined below.

The range

The range is the distance between the lower and the upper extremes. It can be calculated from the formula:

$$\text{range} = E_U - E_L,$$

where $E_U$ is the upper extreme and $E_L$ is the lower extreme.

Given an ordered batch of data, for example in a stemplot, the range can easily be calculated. However, the range tells us very little about how the values in the main body of the data are spread. It is also very sensitive to changes in the extreme values, like those considered in Subsection 1.4. It would be better to have a measure of spread that conveys more information about the spread of values in the main body of the data. One such measure is based upon the difference between two particular values in the batch, known as the quartiles. As the name suggests, the two quartiles lie one quarter of the way into the batch from either end. The major part of the next subsection describes how to find them.
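As a small illustration of the definition, the range can be computed from the two extremes alone. The following is only a sketch in Python: the function name `batch_range` and the prices are made up for illustration and are not taken from the course data.

```python
def batch_range(batch):
    """Return the range: upper extreme minus lower extreme."""
    return max(batch) - min(batch)


# Made-up batch of prices in pence (not course data).
prices = [12, 15, 15, 16, 18, 19, 40]
print(batch_range(prices))          # 40 - 12 = 28

# Changing only the largest value changes the range a great deal,
# even though the main body of the batch is untouched.
prices_changed = [12, 15, 15, 16, 18, 19, 20]
print(batch_range(prices_changed))  # 20 - 12 = 8
```

Only two values of the batch enter the calculation, which is why the range says so little about the main body of the data.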

3.2 Quartiles and the interquartile range

Finding the quartiles of a batch is very similar to finding the median.

In Subsection 1.2, we represented a batch as a V-shaped formation, with the median at the ‘hinge’ where the two arms of the V meet. The median splits the batch into two equal parts. Similarly, we can put another hinge in each side of the V and get four roughly equal parts, shaped like this: $\wedge\wedge$. For a batch of size 15, it looks like Figure 12.

Figure 12 Median and quartiles

The points at the side hinges, in this case $x_{(4)}$ and $x_{(12)}$, are the quartiles. There are two quartiles which, as with the extremes, we call the lower quartile and the upper quartile. The lower quartile separates off the bottom quarter, or lowest 25%. The upper quartile separates off the top quarter, or highest 25%. They are denoted $Q_1$ and $Q_3$ respectively. (Sometimes they are referred to as the first quartile and the third quartile.)

You might be wondering, if these are $Q_1$ and $Q_3$, what happened to $Q_2$? Well, have a think about that for a moment.

$Q_1$ separates the bottom quarter of the data (from the top three quarters), and $Q_3$ separates the bottom three quarters (from the top quarter). So it would make sense to say that $Q_2$ separates the bottom two quarters (from the top two quarters). But two quarters make a half, so $Q_2$ would denote the median, and since there is already a separate word for that, it’s not usual to call it the second quartile.

Usually we cannot divide the batch exactly into quarters. Indeed, this is illustrated in Figure 12 where the two central parts of the $\wedge\wedge$ are larger than the outer ones. As with calculating the median for an even-sized batch, some rule is needed to tell us how many places we need to count along from the smallest value to find the quartiles. However, there are several alternatives that we could adopt and the particular rule described below is somewhat arbitrary. Different authors and different software may use slightly different rules. If your calculator can find quartiles, note that it may use a different rule.

As you might have expected, the rule involves dividing $(n+1)$ by 4, where $n$ is the batch size (as opposed to dividing by 2 to find the median). However, the rule is slightly more complicated for the quartiles and it depends on whether $n+1$ is exactly divisible by 4.

The quartiles

The lower quartile $Q_1$ is at position $\frac{n+1}{4}$ in the ordered batch.

The upper quartile $Q_3$ is at position $\frac{3(n+1)}{4}$ in the ordered batch.

If $(n+1)$ is exactly divisible by 4, these positions correspond to a single value in the batch.

If $(n+1)$ is not exactly divisible by 4, then the positions are to be interpreted as follows.

  • A position which is a whole number followed by $\frac{1}{2}$ means ‘halfway between the two positions either side’ (as was the case for finding the median).

  • A position which is a whole number followed by $\frac{1}{4}$ means ‘one quarter of the way from the position below to the position above’. So for instance if a position is $5\frac{1}{4}$, the quartile is the number one quarter of the way from $x_{(5)}$ to $x_{(6)}$.

  • A position which is a whole number followed by $\frac{3}{4}$ means ‘three quarters of the way from the position below to the position above’. So for instance if a position is $4\frac{3}{4}$, the quartile is the number three quarters of the way from $x_{(4)}$ to $x_{(5)}$.

Before we actually use these rules to find quartiles, let us look at some more examples of $\wedge\wedge$-shaped diagrams for different batch sizes $n$. The case where $(n+1)$ is exactly divisible by 4, so that $\frac{1}{4}(n+1)$ is a whole number, was shown in Figure 12. The following three figures show the three other possible scenarios, where $(n+1)$ is not exactly divisible by 4.

For $n = 17$, $\frac{1}{4}(n+1) = 4\frac{1}{2}$ and $\frac{3}{4}(n+1) = 13\frac{1}{2}$. So $Q_1$ is halfway between $x_{(4)}$ and $x_{(5)}$, and $Q_3$ is halfway between $x_{(13)}$ and $x_{(14)}$.

For $n = 18$, $\frac{1}{4}(n+1) = 4\frac{3}{4}$ and $\frac{3}{4}(n+1) = 14\frac{1}{4}$. So $Q_1$ is three quarters of the way from $x_{(4)}$ to $x_{(5)}$, and $Q_3$ is one quarter of the way from $x_{(14)}$ to $x_{(15)}$.

For $n = 20$, $\frac{1}{4}(n+1) = 5\frac{1}{4}$ and $\frac{3}{4}(n+1) = 15\frac{3}{4}$. So $Q_1$ is one quarter of the way from $x_{(5)}$ to $x_{(6)}$, and $Q_3$ is three quarters of the way from $x_{(15)}$ to $x_{(16)}$.
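The rule for the quartile positions can also be written as a short program. The following Python sketch is one possible way of doing so, not a definitive implementation: the function names (`quartile_positions`, `value_at`, `quartiles`) are invented here, the batch is assumed to contain at least three values, and the example batch is made up so that the result is easy to check by hand.

```python
from fractions import Fraction


def quartile_positions(n):
    """Positions of Q1 and Q3 in an ordered batch of size n, as exact fractions."""
    return Fraction(n + 1, 4), Fraction(3 * (n + 1), 4)


def value_at(ordered, position):
    """Value at a 1-based position that may end in 1/4, 1/2 or 3/4."""
    whole, part = divmod(position, 1)
    below = ordered[whole - 1]      # x_(whole); Python lists count from 0
    if part == 0:
        return below
    above = ordered[whole]          # x_(whole + 1)
    return below + part * (above - below)


def quartiles(batch):
    """Lower and upper quartiles by the (n + 1)/4 rule (assumes n >= 3)."""
    ordered = sorted(batch)
    lower_pos, upper_pos = quartile_positions(len(ordered))
    return value_at(ordered, lower_pos), value_at(ordered, upper_pos)


# Positions for the four batch sizes discussed in the text.
for n in (15, 17, 18, 20):
    print(n, *quartile_positions(n))

# Made-up ordered batch in which x_(k) = 10k (not course data).
batch = [10 * k for k in range(1, 21)]
q1, q3 = quartiles(batch)
print(float(q1), float(q3))   # 52.5 157.5
```

The positions are printed as fractions, so 21/4 stands for $5\frac{1}{4}$. Exact fractions are used rather than decimals so that the test for a whole-number position is reliable. No rounding to the accuracy of the data is applied here.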

Example 14 Quartiles for the prices of small televisions

Figure 15 showed you where the quartiles are for a batch of size 20. Let us now use the stemplot of the 20 television prices in Figure 16, which you first met in Figure 5 (Subsection 1.2), to find the lower and upper quartiles, $Q_1$ and $Q_3$, of this batch.

Figure 16 Prices of flat-screen televisions with a screen size of 24 inches or less

To calculate the lower quartile $Q_1$ you need to find the number that is one quarter of the way from $x_{(5)}$ to $x_{(6)}$. These values are both 130, so $Q_1$ is 130. To calculate the upper quartile $Q_3$ you need to find the number three quarters of the way from $x_{(15)}$ to $x_{(16)}$. These values are both 180, so $Q_3$ is 180.

That example was easier than it might have been, because for each quartile the two numbers we had to consider turned out to be equal!

Example 15 Quartiles for the camera prices

Table 2 (Subsection 1.2) gave ten prices for a particular model of digital camera (in pounds). In order, the prices are as follows.

To find the lower and upper quartiles, $Q_1$ and $Q_3$, of this batch, first find $\frac{1}{4}(n+1) = 2\frac{3}{4}$ and $\frac{3}{4}(n+1) = 8\frac{1}{4}$.

The lower quartile $Q_1$ is the number three quarters of the way from $x_{(2)}$ to $x_{(3)}$. These values are 60 and 65. The difference between them is $65 - 60 = 5$, and three quarters of that difference is $\frac{3}{4} \times 5 = 3.75$. Therefore $Q_1$ is 3.75 larger than 60, so it is 63.75. As with the median, in this course we will generally round the quartiles to the accuracy of the original data, so in this case we round to the nearest whole number, 64. In symbols, $Q_1 = 60 + \frac{3}{4}(65 - 60) = 63.75 \simeq 64$.

The upper quartile $Q_3$ is the number one quarter of the way from $x_{(8)}$ to $x_{(9)}$. These values are 81 and 85. The difference between them is $85 - 81 = 4$, and one quarter of that difference is $\frac{1}{4} \times 4 = 1$. Therefore $Q_3$ is 1 larger than 81, so it is 82. (No rounding necessary this time.) In symbols, $Q_3 = 81 + \frac{1}{4}(85 - 81) = 82$.
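The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper name `part_way` is invented here, and the only data used are the four values quoted above (60, 65, 81 and 85).

```python
def part_way(lower, upper, fraction):
    """Return the number that is `fraction` of the way from lower to upper."""
    return lower + fraction * (upper - lower)


q1 = part_way(60, 65, 3 / 4)   # three quarters of the way from x_(2) to x_(3)
q3 = part_way(81, 85, 1 / 4)   # one quarter of the way from x_(8) to x_(9)

print(q1, round(q1))   # 63.75 64
print(q3, round(q3))   # 82.0 82
```

One caution: Python's built-in `round` rounds exact halves to the nearest even whole number, which may differ from the rounding convention you are used to. That does not affect either value here.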

Example 15 is the subject of the following screencast. [Note that references to ‘the unit’ should be interpreted as ‘this course’. The original wording refers to the Open University course from which this material is adapted.]

Screencast 3 Calculating quartiles

Activity 10 Finding more quartiles

(a) Find the lower and upper quartiles of the batch of 15 coffee prices in Figure 17. (This batch of coffee prices was first introduced in Table 1 of Subsection 1.1.)

Figure 17 Stemplot of 15 coffee prices

(b) Find the lower and upper quartiles of the batch of 14 gas prices in Figure 18. (This batch of gas prices was first introduced in Table 3 of Subsection 1.2.)

Figure 18 Stemplot of 14 gas prices

A measure of spread

Now we can define a new measure of spread based entirely on the lower and upper quartiles.

The interquartile range

The interquartile range (sometimes abbreviated to IQR) is the distance between the lower and upper quartiles:

$$\text{IQR} = Q_3 - Q_1.$$

Note that this value is independent of the sizes of $E_U$ and $E_L$.

Example 16 The prices of small televisions, yet again!

For the batch of 20 television prices in Example 14 (Subsection 3.2),

$$\text{IQR} = Q_3 - Q_1 = 180 - 130 = 50.$$

So the interquartile range is £50.

Activity 11 Coffee prices again

Calculate both the range and the interquartile range of the batch of 15 coffee prices, last seen in Figure 17 (Subsection 3.2).

Activity 12 Interquartile range of gas prices

In Activity 10(b) (Subsection 3.2) you found the quartiles of the 14 gas prices from Activity 2 (Subsection 1.2). Find the interquartile range.

You may be wondering why you are being asked to learn a new measure of spread when you already know the range. As a measure of spread, the range $(E_U - E_L)$ is not very satisfactory because it is not resistant to the effects of unrepresentative extreme values. (Resistant measures were explained in Subsection 1.4.) The interquartile range, by contrast, is a highly resistant measure of spread (because it is not sensitive to the effects of values lying outside the middle 50% of the batch) and it is generally the preferred choice.

Example 17 Comparing the resistance of the range and the IQR

Suppose the price of the most expensive jar of coffee is reduced from 369p to 325p. How does this affect the range and the interquartile range of the batch of coffee prices in Figure 17 (Subsection 3.2)?

The new range is

$$325\text{p} - 268\text{p} = 57\text{p},$$

a lot less than the original value of 101p (found in Activity 11). The interquartile range is unchanged.
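The same comparison can be tried on any batch. The Python sketch below uses a made-up batch of 15 values (these are not the coffee prices) and invented function names, and it assumes the $(n+1)/4$ rule for the quartile positions described in Subsection 3.2.

```python
from fractions import Fraction


def value_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in an ordered batch."""
    whole, part = divmod(position, 1)
    below = ordered[whole - 1]
    if part == 0:
        return below
    return below + part * (ordered[whole] - below)


def range_and_iqr(batch):
    """Return (range, interquartile range) of a batch of at least 3 values."""
    ordered = sorted(batch)
    n = len(ordered)
    q1 = value_at(ordered, Fraction(n + 1, 4))
    q3 = value_at(ordered, Fraction(3 * (n + 1), 4))
    return ordered[-1] - ordered[0], q3 - q1


# Made-up batch of 15 values with one unusually high value (not course data).
batch = [21, 23, 24, 26, 27, 29, 30, 31, 33, 34, 36, 38, 40, 43, 60]
print(*range_and_iqr(batch))      # 39 12

# Reduce only the highest value, from 60 to 45.
reduced = batch[:-1] + [45]
print(*range_and_iqr(reduced))    # 24 12
```

The range falls from 39 to 24, while the interquartile range stays at 12, because the value that changed lies outside the middle 50% of the batch.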

3.3 The five-figure summary and boxplots

As well as giving us a new measure of spread – the interquartile range – the quartiles are important figures in themselves. Our $\wedge\wedge$-shaped diagram, Figure 19, gives five important points which help to summarise the shape of a distribution: the median, the two quartiles and the two extremes.

Figure 19 Values in a five-figure summary

These are conveniently displayed in the following form, called the five-figure summary of the batch.

Five-figure summary

Example 18 Five-figure summary for television price data

For the television price data, we have $n = 20$, $M = 150$, $Q_1 = 130$, $Q_3 = 180$, $E_L = 90$ and $E_U = 270$. (You last saw these data in Figure 16, Subsection 3.2.)

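Therefore, the five-figure summary of this batch is

$$\begin{array}{c|ccc|} n = 20 & & M = 150 & \\ & Q_1 = 130 & & Q_3 = 180 \\ & E_L = 90 & & E_U = 270 \end{array}$$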

This diagram contains the following information about the batch of prices.

  • The general level of prices, as measured by the median, is £150.

  • The individual prices vary from £90 to £270.

  • About 25% of the prices were less than £130.

  • About 25% of the prices were more than £180.

  • About 50% of the prices were between £130 and £180.

We hope you agree that the five-figure summary is quite an efficient way of presenting a summary of a batch of data.
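For readers who like to check such summaries by computer, the following Python sketch collects the five figures for a batch. It assumes the position rules used in this course, $(n+1)/2$ for the median and $(n+1)/4$ and $3(n+1)/4$ for the quartiles. The function names and the batch are made up for illustration, and no rounding to the accuracy of the data is applied.

```python
from fractions import Fraction


def value_at(ordered, position):
    """Value at a 1-based, possibly fractional, position in an ordered batch."""
    whole, part = divmod(position, 1)
    below = ordered[whole - 1]
    if part == 0:
        return below
    return below + part * (ordered[whole] - below)


def five_figure_summary(batch):
    """Return n, the extremes, the quartiles and the median (assumes n >= 3)."""
    ordered = sorted(batch)
    n = len(ordered)
    return {
        "n": n,
        "E_L": ordered[0],
        "Q1": float(value_at(ordered, Fraction(n + 1, 4))),
        "M": float(value_at(ordered, Fraction(n + 1, 2))),
        "Q3": float(value_at(ordered, Fraction(3 * (n + 1), 4))),
        "E_U": ordered[-1],
    }


# Made-up batch of ten values (not course data).
batch = [4, 7, 8, 10, 12, 13, 15, 18, 21, 30]
print(five_figure_summary(batch))
# {'n': 10, 'E_L': 4, 'Q1': 7.75, 'M': 12.5, 'Q3': 18.75, 'E_U': 30}
```

Other software may use a different rule for the quartiles and so give slightly different values for the same batch.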

The five values in a five-figure summary can be very effectively presented in a special diagram called a boxplot. For the 14 gas prices (Figure 18, Subsection 3.2) the diagram looks like Figure 22.

Figure 22 Boxplot of batch of 14 gas prices

The central feature of this diagram is a box – hence the name boxplot. The box extends from the lower quartile (at the left-hand edge of the box) to the upper quartile (the right-hand edge). This part of the diagram contains 50% of the values in the batch. The length of this box is thus the interquartile range.

Outside the box are two whiskers. (Boxplots are sometimes called box-and-whisker diagrams.) In many cases, such as in Figure 22, the whiskers extend all the way out to the extremes. Each whisker then covers the end 25% of the batch and the distance between the two whisker-ends is then the range. (You will see examples later where the whiskers do not go right out to the extremes.)

So far we have dealt with four figures from the five-figure summary: the two quartiles and the two extremes. The remaining figure is perhaps the most important: it is the median, whose position is shown by putting a vertical line through the box.

Thus a boxplot shows clearly the division of the data into four parts: the two whiskers and the two sections of the box; these are the four parts of the $\wedge\wedge$-shaped diagram and each contains (approximately) 25% of values in the batch (see Figure 21).

John W. Tukey (1915–2000), inventor of the five-figure summary and boxplot

John Tukey was a prominent and prolific US statistician, based at Princeton University and Bell Laboratories. As well as working in some very technical areas, he was a great promoter of simple ways of picturing and summarising data, and invented both the five-figure summary and the boxplot (except that he called them the ‘five-number summary’ and the ‘box-and-whisker plot’).

He had what has been described as an ‘unusual’ lecturing style. The statistician Peter McCullagh describes a lecture he gave at Imperial College, London in 1977:

Tukey ambled to the podium, a great bear of a man dressed in baggy pants and a black knitted shirt. These might once have been a matching pair, but the vintage was such that it was hard to tell. …The words came …, not many, like overweight parcels, delivered at a slow unfaltering pace. …Tukey turned to face the audience …. ‘Comments, queries, suggestions?’ he asked …. As he waited for a response, he clambered onto the podium and manoeuvred until he was sitting cross-legged facing the audience. …We in the audience sat like spectators at the zoo waiting for the great bear to move or say something. But the great bear appeared to be doing the same thing, and the feeling was not comfortable. …After a long while, …he extracted from his pocket a bag of dried prunes and proceeded to eat them in silence, one by one. The war of nerves continued …four prunes, five prunes. …How many prunes would it take to end the silence?

(Source: McCullagh, P. (2003) ‘John Wilder Tukey’, Biographical Memoirs of Fellows of the Royal Society, vol. 49, pp. 537–55.)

Figure 23 A standard boxplot with annotation

A typical boxplot looks something like Figure 23 because in most batches of data the values are more densely packed in the middle of the batch and are less densely packed in the extremes. This means that each whisker is usually longer than half the length of the box. This is illustrated again in the next example.

Example 19 Boxplot for the prices of small televisions

The boxplot for the batch of 20 television prices (last worked with in Example 18) is shown in Figure 24.

Figure 24 Boxplot of batch of 20 television prices

You can see that each whisker is longer than half the length of the box.

However, this boxplot has a new feature. The whisker on the left goes right down to the lower extreme. But the whisker on the right does not go right to the upper extreme. The highest extreme data value, 270, which might potentially be regarded as an outlier, is marked separately with a star. Then the whisker extends only to cover the data values that are not extreme enough to be regarded as potential outliers. The highest of these values is 250.

(This course does not describe the rule to decide which data values (if any) can be regarded as potential outliers that are plotted separately on the diagram. This is another issue that may be dealt with differently by different authors and different software.)
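To make the parts of such a boxplot concrete, here is a rough text rendering written in Python. It is only a sketch and is not how Figure 24 was produced: the function `text_boxplot` and its character conventions are invented here, and the whisker ends and any separately plotted values must be supplied by hand, since no rule for potential outliers is given in this course. The numbers used are those quoted for the television prices.

```python
def text_boxplot(whisker_lo, q1, m, q3, whisker_hi, outliers=(), width=37):
    """Draw a one-line boxplot: '-' whiskers, '[=|=]' box, '*' separate values.

    Assumes the values are not all equal, so the scale is well defined.
    """
    lo = min([whisker_lo, *outliers])
    hi = max([whisker_hi, *outliers])

    def col(value):
        # Column (0 to width - 1) at which a value is drawn.
        return round((value - lo) * (width - 1) / (hi - lo))

    line = [" "] * width
    for c in range(col(whisker_lo), col(whisker_hi) + 1):
        line[c] = "-"                       # whiskers (box drawn over them next)
    for c in range(col(q1), col(q3) + 1):
        line[c] = "="                       # the box spans Q1 to Q3
    line[col(q1)] = "["
    line[col(q3)] = "]"
    line[col(m)] = "|"                      # the median line inside the box
    for value in outliers:
        line[col(value)] = "*"              # values plotted separately
    return "".join(line)


# Television prices: whiskers from 90 to 250, box from 130 to 180,
# median 150, and the value 270 marked separately.
print(text_boxplot(90, 130, 150, 180, 250, outliers=[270]))
# --------[===|=====]--------------   *
```

With a width of 37 characters, each column represents £5, so the gap between the end of the right-hand whisker (250) and the star (270) is visible.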

Example 19 is the subject of the following screencast. [Note that the reference to ‘Unit 2’ should be ‘this course’ and ‘Figure 18’ should be ‘Figure 23’. Unit 2 and Figure 18 are references to the Open University course from which this material is adapted.]

Screencast 4 Interpreting a boxplot

One important use of boxplots is to picture and describe the overall shape of a batch of data.

Example 20 Skew televisions

The stemplot of small television prices, last seen in Figure 16 (Subsection 3.2), shows a lack of symmetry. Since the higher values are more spread out than the lower values, the data are right-skew.

The boxplot of these data, given in Figure 24, also shows this right-skew fairly clearly. In the box, the right-hand part (corresponding to higher prices) is rather longer than the left-hand part, and the right-hand whisker is longer than the left-hand whisker.
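The lengths being compared here can be worked out from the figures already quoted for the television prices: the five-figure summary in Example 18 and the whisker end of 250 in Example 19. The Python sketch below does this. The function names are invented, and the `skew_hint` rule is a crude rule of thumb assumed for illustration, not a rule stated in this course.

```python
def box_and_whisker_lengths(whisker_lo, q1, m, q3, whisker_hi):
    """Lengths of the four parts of a boxplot, from left to right."""
    return {
        "left whisker": q1 - whisker_lo,
        "left box": m - q1,
        "right box": q3 - m,
        "right whisker": whisker_hi - q3,
    }


def skew_hint(lengths):
    """Crude rule of thumb (an assumption, not a rule from the course)."""
    right_longer = (lengths["right box"] > lengths["left box"]
                    and lengths["right whisker"] > lengths["left whisker"])
    left_longer = (lengths["left box"] > lengths["right box"]
                   and lengths["left whisker"] > lengths["right whisker"])
    if right_longer:
        return "right-skew"
    if left_longer:
        return "left-skew"
    return "no clear skew"


# Television prices: E_L = 90, Q1 = 130, M = 150, Q3 = 180, whisker end 250.
tv = box_and_whisker_lengths(90, 130, 150, 180, 250)
print(tv)
# {'left whisker': 40, 'left box': 20, 'right box': 30, 'right whisker': 70}
print(skew_hint(tv))   # right-skew
```

Both right-hand parts are longer than their left-hand counterparts (30 against 20 for the box, 70 against 40 for the whiskers), which agrees with the description above. A judgement about skewness should still be made by looking at the diagram, particularly for small batches.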

Activity 13 Skew gas prices?

A stemplot of the gas price data from Activity 2 (Subsection 1.2) is shown, yet again, in Figure 25.

Figure 25 Stemplot of 14 gas prices

(a) Prepare a five-figure summary of the batch.

(b) Figure 27 shows the boxplot of these data that you have already seen in Figure 22. What do the stemplot and boxplot tell us about the symmetry and/or skewness of the batch?

Figure 27 Boxplot of batch of 14 gas prices

Example 21 Camera prices: skew or not?

In Example 20 and Activity 13 you saw how boxplots look for batches of data that are right-skew or left-skew. What happens in a batch that is more symmetrical?

For the small batch of camera prices from Table 2 (Subsection 1.2), a (stretched) stemplot is shown in Figure 28.

Figure 28 Stemplot of ten camera prices

The stemplot looks reasonably symmetric.

A boxplot of the data, Figure 29, confirms the impression of symmetry. The two parts of the box are roughly equal in length, and the two whiskers are also roughly equal in length.

Figure 29 Boxplot of batch of ten camera prices

You have now spent quite a lot of time looking at various ways of investigating prices and, in particular, at methods of measuring the location and spread of the prices of particular commodities.

In order to begin to answer our question, Are people getting better or worse off?, we need to know not just location (and spread) of prices but also how these prices are changing from year to year. That is the subject of the rest of this course.

Exercises on Section 3

The following exercises provide extra practice on the topics covered in Section 3.

Exercise 6 Finding quartiles and the interquartile range

(a) For the arithmetic scores in Exercise 1 (Section 1), find the quartiles and calculate the interquartile range. The stemplot of the scores is given below.

Figure 30 Stemplot of arithmetic scores

(b) For the television prices in Exercise 1, find the quartiles and calculate the interquartile range. The table of prices is given below.

170  180  190  200  220  229  230
230  230  230  250  269  269  270
279  299  300  300  315  320  349
350  400  429  649  699

Exercise 7 Some five-figure summaries

Prepare a five-figure summary for each of the two batches from Exercise 1.

(a) For the arithmetic scores, the median is 79% (found in Exercise 1), and you found the quartiles and interquartile range in Exercise 6.

(b) For the television prices, the median is £270 (found in Exercise 1), and you found the quartiles and interquartile range in Exercise 6.

Exercise 8 Boxplots and the shape of distributions

Boxplots of the two batches used in Exercises 1, 6 and 7 are shown in Figures 33 and 34. On the basis of these diagrams, comment on the symmetry and/or skewness of these data.

Figure 33 Boxplot of batch of 33 arithmetic scores

Figure 34 Boxplot of batch of 26 television prices