Interpreting data: Boxplots and tables

# 2.4 Including the results of useful calculations

Can Table 2.4 be simplified further by pooling more rows or columns? Perhaps it might be, but there may well be a risk of losing some important or relevant information. So, before considering any further simplification, we shall look at adding information to the table, in the form of the results of some helpful calculations (guideline 4).

On their own, some of the numbers in the table still do not mean a great deal. There were 61 new cases among males in the 55–59 age group. But how does this compare with males in other age groups, and with females? There were 60 new cases for males aged 70–74. On the face of it this looks very close to the figure for the 55–59 group. But there were far more males in the South Australian population aged 55–59 than there were aged 70–74 (35192 compared with 16613).

It seems likely that the main interest in these data is in the varying chances of developing lung cancer, or dying from it, at different ages and for the two genders. To find out something about this, it is useful to calculate the proportion of each age group that became new cases of lung cancer. For males aged 55–59, the proportion is 61/35192 = 0.0017333, or 0.17333% when expressed as a percentage. For males aged 70–74 the corresponding proportion is 60/16613 = 0.0036116, or 0.36116%. So although the two counts are almost identical, the proportion for the older group is roughly twice that for the younger group. It is very common, and often very useful, to calculate such quantities, which are often known as rates.
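
The arithmetic in this comparison can be checked with a short Python sketch. The function name `rate` and the dictionary layout are illustrative choices, not part of the course material; the counts and population sizes are the ones quoted in the text.

```python
def rate(new_cases: int, population: int) -> float:
    """Return the new cases as a proportion of the population."""
    return new_cases / population


# (new cases, population size) for males, as quoted in the text
males = {"55-59": (61, 35192), "70-74": (60, 16613)}

for age_group, (cases, population) in males.items():
    proportion = rate(cases, population)
    # Show the proportion and the same figure expressed as a percentage
    print(f"{age_group}: proportion {proportion:.7f}, percentage {proportion * 100:.5f}%")
```

Dividing by the population size is what makes the two age groups comparable, even though the raw counts (61 and 60) look almost the same.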

For the time being, we shall just look at the new cases and omit the information on deaths. The rate for new cases in each age group has been calculated for males and for females; these rates are included in Table 2.5. As you can see, these numbers do not look particularly user-friendly!

## Table 2.5 South Australia: incidence for lung cancer, 1981

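| Age group | Population size (male) | Population size (female) | New cases (male) | New cases (female) | New cases as % of population size (male) | New cases as % of population size (female) |
|---|---:|---:|---:|---:|---:|---:|
| 0–39 | 427725 | 414937 | 1 | 2 | 0.00023380 | 0.00048200 |
| 40–44 | 35648 | 35547 | 2 | 5 | 0.0056104 | 0.014066 |
| 45–49 | 32911 | 31799 | 8 | 2 | 0.024308 | 0.0062895 |
| 50–54 | 36485 | 35333 | 38 | 8 | 0.10415 | 0.022642 |
| 55–59 | 35192 | 35555 | 61 | 18 | 0.17333 | 0.050626 |
| 60–64 | 28131 | 30868 | 67 | 16 | 0.23817 | 0.051834 |
| 65–69 | 24419 | 27390 | 88 | 15 | 0.36038 | 0.054765 |
| 70–74 | 16613 | 21402 | 60 | 21 | 0.36116 | 0.098122 |
| 75–79 | 9958 | 14546 | 46 | 10 | 0.46194 | 0.068747 |
| 80–84 | 4852 | 9749 | 24 | 6 | 0.49464 | 0.061545 |
| 85+ | 2790 | 7477 | 7 | 2 | 0.25090 | 0.026749 |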

The table still looks pretty horrible and the information it contains is difficult to assimilate, largely because there is too much clutter from information of dubious relevance, and also because far too many decimal places are included in the last two columns. The latter problem is easily solved, in accord with guideline 3. First, note that (for example) the figure of 0.098122% for females aged 70–74 means that, for every 100 women in this age group (in South Australia in 1981), there were 0.098122 new cases of lung cancer. In this context there is nothing special about calculating the rate per 100 women in the population. Instead, the number of cases per 100 000 women in the population will be calculated. This has the effect of multiplying all the rates by 1000, which gets rid of most of the occurrences of ‘0.0…’ at the start of the numbers, and hence makes the table easier to read. Also, simply to get across the main message of these data does not require five significant figures. Instead, in Table 2.6, the figures are given to one decimal place.

## Table 2.6 South Australia: incidence for lung cancer, 1981

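| Age group | Population size (male) | Population size (female) | New cases (male) | New cases (female) | New cases per 100 000 population (male) | New cases per 100 000 population (female) |
|---|---:|---:|---:|---:|---:|---:|
| 0–39 | 427725 | 414937 | 1 | 2 | 0.2 | 0.5 |
| 40–44 | 35648 | 35547 | 2 | 5 | 5.6 | 14.1 |
| 45–49 | 32911 | 31799 | 8 | 2 | 24.3 | 6.3 |
| 50–54 | 36485 | 35333 | 38 | 8 | 104.2 | 22.6 |
| 55–59 | 35192 | 35555 | 61 | 18 | 173.3 | 50.6 |
| 60–64 | 28131 | 30868 | 67 | 16 | 238.2 | 51.8 |
| 65–69 | 24419 | 27390 | 88 | 15 | 360.4 | 54.8 |
| 70–74 | 16613 | 21402 | 60 | 21 | 361.2 | 98.1 |
| 75–79 | 9958 | 14546 | 46 | 10 | 461.9 | 68.7 |
| 80–84 | 4852 | 9749 | 24 | 6 | 494.6 | 61.5 |
| 85+ | 2790 | 7477 | 7 | 2 | 250.9 | 26.7 |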
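
As a rough check, the rate columns of Table 2.6 can be recomputed from the population sizes and counts in the same table. The sketch below is one way of doing this in Python; the names `PER`, `rate_per_100k` and `table` are illustrative and not taken from the course.

```python
PER = 100_000  # rates are expressed per 100 000 population


def rate_per_100k(cases: int, population: int) -> float:
    """New cases per 100 000 population, rounded to one decimal place."""
    return round(cases / population * PER, 1)


# age group: (male population, female population, male new cases, female new cases)
table = {
    "0-39": (427725, 414937, 1, 2),
    "40-44": (35648, 35547, 2, 5),
    "45-49": (32911, 31799, 8, 2),
    "50-54": (36485, 35333, 38, 8),
    "55-59": (35192, 35555, 61, 18),
    "60-64": (28131, 30868, 67, 16),
    "65-69": (24419, 27390, 88, 15),
    "70-74": (16613, 21402, 60, 21),
    "75-79": (9958, 14546, 46, 10),
    "80-84": (4852, 9749, 24, 6),
    "85+": (2790, 7477, 7, 2),
}

for age_group, (pop_m, pop_f, cases_m, cases_f) in table.items():
    # One line per age group: male rate, then female rate
    print(age_group, rate_per_100k(cases_m, pop_m), rate_per_100k(cases_f, pop_f))
```

Changing `PER` to 100 gives the percentages of Table 2.5 (before rounding), which shows that the two tables differ only in the choice of base and the number of digits kept.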

Now does it make sense to simplify the table any further? If we want to use it to communicate information about the relative chances of being diagnosed as a new case of lung cancer at different ages and for the two genders, the ‘Population size’ and ‘New cases’ columns do not actually give very relevant information. It might therefore be reasonable to omit them. Furthermore, the general pattern of the new case rates at different ages can be communicated with rather fewer age groups than were used in Table 2.6. Table 2.7 uses fewer and coarser age groupings, and the only figures given are the calculated values of the new cases per 100 000 and deaths per 100 000; these have been rounded to one decimal place. (Note that the figures for new cases in Table 2.7 cannot be calculated simply from the rates given in the last two columns of Table 2.6. The appropriate population sizes and counts of cases must be aggregated and the aggregates used to calculate the rates.)

## Table 2.7 South Australia: incidence and mortality for lung cancer, 1981 (rates per 100 000 population)

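| Age group | New cases (male) | New cases (female) | Deaths (male) | Deaths (female) |
|---|---:|---:|---:|---:|
| 0–49 | 2.2 | 1.9 | 3.0 | 1.0 |
| 50–59 | 138.1 | 36.7 | 96.3 | 22.6 |
| 60–69 | 295.0 | 53.2 | 239.8 | 54.9 |
| 70–79 | 398.9 | 86.2 | 402.7 | 83.5 |
| 80+ | 405.7 | 46.4 | 405.7 | 40.6 |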

(Whole numbers in the deaths column would arguably have been quite adequate to get across the message of these data. Using one decimal place has the advantage of making it clear that these are rates, and not counts of individual cases.)

This is a quickly assimilated table that communicates the pattern of incidence and death from lung cancer, in relation to population size. It is easy to compare the figures for males and females, and it is equally easy to compare incidence with mortality in any of the age groups.
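
The earlier note about aggregation can be illustrated with a small Python sketch. It pools the two oldest female age groups from Table 2.6 and compares the result with a simple average of their separate rates. The function name `pooled_rate` is a hypothetical helper introduced here for illustration.

```python
def pooled_rate(groups, per=100_000):
    """Rate for several age groups combined: total cases over total population."""
    total_cases = sum(cases for cases, _ in groups)
    total_population = sum(population for _, population in groups)
    return total_cases / total_population * per


# (new cases, population size) for females aged 80-84 and 85+, from Table 2.6
females_80_plus = [(6, 9749), (2, 7477)]

pooled = pooled_rate(females_80_plus)

# Averaging the two separate rates ignores the different population sizes
naive = sum(c / p * 100_000 for c, p in females_80_plus) / len(females_80_plus)

print(f"pooled rate: {pooled:.1f}, mean of separate rates: {naive:.1f}")
```

The pooled figure agrees with the 80+ entry for female new cases in Table 2.7, whereas the simple mean of the two rates does not, because the two age groups are of different sizes.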

## Activity 4 Describing data in a table

• (a) Describe the main patterns in the data on lung cancer in South Australia, on the basis of Table 2.7.

• (b) Table 2.7 is certainly much simpler than the earlier tables in this section, and you would probably agree that the patterns in the data are easier to see. But can you think of any disadvantages of the presentation in Table 2.7 compared to the other tables?

### Solution

(a) The pattern of incidence of lung cancer for males in South Australia may be described as follows. There are very few new cases in men aged under 50 years, but the rate rises rapidly for men in their 50s and 60s. The increase levels off above age 70. The pattern of mortality for males is very similar to that for incidence. For females, both incidence and mortality are again very low below 50 years of age and increase after that, but the incidence and mortality rates remain much lower than for men (about one quarter or one fifth of the level for men). Also the incidence and mortality rates for women reduce quite considerably in the oldest age groups.

(b) One problem is that the information on how many people were involved has been entirely removed. One pattern that was noted in part (a) is the fall in incidence and mortality rates for women aged over 80. However, we cannot tell from Table 2.7 that there were actually only 8 new cases and 7 deaths in women in these age groups. With numbers of cases this small, a few extra cases in one year, such as we might expect just on the basis of random variability, would show up as a large rise in the incidence rate. Without knowing something about the numbers from which the rates in Table 2.7 were calculated, it is not possible to take this into account. Thus, for example, in writing a report about these matters, it would be good statistical practice to include the counts of cases and deaths somewhere, even if not in the same table as that including the rates.
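
To get a feel for how sensitive a rate based on so few cases is, the following Python sketch recalculates the incidence rate for women aged 80 and over with one, two or three additional cases. The population size and observed count come from Table 2.6; the numbers of extra cases are purely hypothetical and chosen only for illustration.

```python
population = 9749 + 7477  # females aged 80-84 and 85+, from Table 2.6
observed_cases = 6 + 2    # new cases in the same two age groups

# 0 extra cases reproduces the observed rate; 1 to 3 are hypothetical
for extra in range(0, 4):
    cases = observed_cases + extra
    rate = cases / population * 100_000
    print(f"{cases} cases -> {rate:.1f} per 100 000")
```

Each additional case moves the rate by several units per 100 000, which is why the counts behind a rate are worth reporting alongside it.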

Do you agree that Table 2.7 conforms to all of the four guidelines given at the beginning of this section? After you have produced a table for yourself, it is always a good idea to check it carefully against each of the four guidelines.
