# 4 The law of small numbers

This concept is again from the book by Kahneman (2011, p. 109). It can be illustrated by the following example.

Suppose there is a medical syndrome X, which is relatively rare and affects 1 person in every 1000 on average (that is, the rate is 0.1%). In the imaginary country of Ruritania, scientists have done a survey to see if the rate of this condition varies from one part of the country to another.

The map in Figure 18 shows the regions of Ruritania. For each region, the map gives the population of the region, and the percentage of the population with the condition X.

In some regions, the rate of condition X is unusually low, bearing in mind the average rate of 0.1%. These regions are shaded in Figure 19.

Looking more closely, you will see a clear pattern. All of the relatively X-free regions have small populations.

- What is the connection?
- Do people in the low-population regions live in smaller communities, leading to more personalised medical care and a reduced risk of the condition?
- Could the condition be caused by the stress of urban living, which is reduced in smaller communities where people lead more relaxed lives?

At this point you might try to confirm your theories, by looking at the regions where the rate of X is abnormally high. These are shaded in Figure 20.

But these regions are *also* ones with low populations! What is going on? How does a low population lead to both a lower *and* a higher rate of the condition?

The answer is, it doesn’t! The numbers have been generated by a computer simulation. In every region the average rate used for the simulation was exactly the same, 0.1%. However, in small populations random fluctuations are bound to make a much bigger difference to the rate. Imagine a region has only 100 people in it. If a single person had the condition X, the rate would be a massive 1%, which is 10 times the average. If no one has the condition, the rate will plummet to 0%.

So the smaller regions are certain to include many with below-average rates and many with above-average rates, with no cause and effect involved.
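You can reproduce this effect with a short simulation. In the sketch below, the population figures and the random seed are our own invented choices, not the ones from the map of Ruritania; every region is given exactly the same underlying 0.1% rate, yet the smallest regions still show the most extreme observed rates:

```python
import random

random.seed(42)  # arbitrary seed, so the run is reproducible

RATE = 0.001  # the same 0.1% underlying rate in every region

def region_rate(population):
    """Simulate one region: each person independently has
    syndrome X with probability RATE; return the observed rate."""
    cases = sum(1 for _ in range(population) if random.random() < RATE)
    return cases / population

# Invented populations: a few small regions and a few large ones.
populations = [100, 200, 500, 50_000, 100_000, 500_000]
rates = {pop: region_rate(pop) for pop in populations}

for pop, rate in rates.items():
    print(f"population {pop:>7}: observed rate {rate:.3%}")
```

In a region of only 100 people the observed rate can only be 0%, 1%, 2% and so on, so it is bound to be far from 0.1%; in the large regions the observed rate settles close to the true average.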

You might feel that this is an artificial example but it is not. Studies in the USA found that small schools were more common than might be expected among schools with high average test results. As a result, many charities provided small schools with financial support, for example US$1.7 billion from the Bill & Melinda Gates Foundation.

However, in an analysis of the scores of Pennsylvania schools, Howard Wainer and Harris Zwerling (2006) showed that schools with the *worst* average scores were even more likely to be among the smaller ones. So the apparent superiority of small schools is almost surely an illusion. In the same way, our simulation of syndrome X made it seem at first that people in small regions were less inclined to have the condition.

## Activity 5 Try it yourself

You can do similar simulations for yourself in the Python shell.

You need to start by loading a part of the Python system that is only made available on request: the sample function from the random module. You do that by entering

>>> from random import sample

Next, you need to say what options the samples will be taken from, and what frequency each option should occur with.

>>> choices = ['no']*999 + ['yes']

This says you want to choose from a list having 999 occurrences of 'no' and 1 occurrence of 'yes'. On *average*, 'yes' will get picked 1 time in 1000, but it is a random process, so 'yes' might get picked more or less often than that.

Now you are ready to take a random sample. The following expression will sample one item from the choices list 100 times in a row.

>>> samples = [sample(choices,1) for s in range(100)]

This will give a list containing 100 samples, each containing one item. However, what is actually needed is one sample containing 100 items. We can get that by flattening the list, as follows

>>> results = [item for s in samples for item in s]
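As an aside, if your Python is version 3.6 or later, the standard library function random.choices (note the final s) draws with replacement and returns a flat list directly, so the two steps above can be collapsed into one. A minimal sketch, with the options list renamed to avoid clashing with the function name:

```python
import random

# Same list as choices above, renamed so it does not shadow random.choices.
options = ['no'] * 999 + ['yes']

# random.choices() picks k items with replacement and returns a
# flat list, so no separate flattening step is needed.
results = random.choices(options, k=100)

print(len(results))  # 100
```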

If you now enter results, you will get something like this

>>> results

['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

This is what you would expect, given the frequencies you assigned to each option: 999 for 'no' versus 1 for 'yes'. The average rate at which 'yes' should appear is 1 in 1000, which is 0.1%.

But if you repeat the experiment a few times, a 'yes' pops up now and again. This is like the example you saw earlier in which there was only a 0.1% chance of having the imaginary syndrome X, but among 100 people, cases occurred occasionally.

You can get Python to count how many times 'yes' came up, as follows

>>> results.count('yes')

Usually this will be 0, which corresponds to no one having syndrome X, a rate of 0%. But if you keep repeating the steps of the simulation

>>> samples = [sample(choices,1) for s in range(100)]

>>> results = [item for s in samples for item in s]

>>> results.count('yes')

then from time to time you will get 1, which corresponds to a rate of 1%, 10 times the average. So in these small samples of 100, the rate is either 0% or else 10 (or more) times the average rate, purely from random effects, exactly as discussed earlier.
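Rather than re-entering these three lines by hand, you can ask Python to repeat the whole experiment for you and tally the counts. A sketch (run_experiment is our own name, not part of the activity):

```python
from random import sample
from collections import Counter

choices = ['no'] * 999 + ['yes']

def run_experiment(n=100):
    """One run of the simulation: sample n people one at a
    time and count how many are 'yes' cases."""
    samples = [sample(choices, 1) for s in range(n)]
    results = [item for s in samples for item in s]
    return results.count('yes')

# Tally the 'yes' counts over 50 repeats of the experiment.
tally = Counter(run_experiment() for _ in range(50))
print(tally)  # mostly 0, with the occasional 1 or more
```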

If you now try a sample of a million, things will be very different. Try executing the following steps (warning: do *not* try to look at the actual content of results; displaying a million items is too much for the console!)

>>> samples = [sample(choices,1) for s in range(1000000)]

>>> results = [item for s in samples for item in s]

>>> results.count('yes')

Running the calculations will take a few seconds. If you see a message saying a web page is slowing your browser down, simply ignore it.

To get the rate, divide the output from the count by 1000000.

>>> _/1000000

0.000969

(This is the result we got; yours will be slightly different, of course.)

0.000969 is about 1 in a thousand, close to the known average. While the rate fluctuates a lot for a small number of samples (100), with a large number of samples (1000000), the rate is much more predictable.
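You can watch this stabilisation happen by running the same count at several sample sizes in one go. A sketch, using random.choices (available from Python 3.6) to draw each sample in a single call:

```python
import random

random.seed(1)  # arbitrary seed, so the run is reproducible

choices = ['no'] * 999 + ['yes']

rates = {}
for n in (100, 10_000, 1_000_000):
    results = random.choices(choices, k=n)  # n draws with replacement
    rates[n] = results.count('yes') / n
    print(f"{n:>9} samples: rate = {rates[n]:.6f}")
```

With only 100 samples the rate can easily be 0% or several times the average; with a million samples it should sit close to the true 0.1%.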

The next section looks at how the conclusion people draw from a positive test result can often be wrong.