RSS feed for Exploring data: graphs and numerical summaries
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection0
This RSS feed contains all the sections in Exploring data: graphs and numerical summaries
Moodle
Copyright © 2016 The Open University
en-GB; Mon, 22 Feb 2016 12:32:13 +0000; The Open University
Introduction
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection0
Tue, 26 Jul 2011 23:00:00 GMT
<p>This course will introduce you to a number of ways of representing data graphically and of summarising data numerically. You will learn the uses for pie charts, bar charts, histograms and scatterplots. You will also be introduced to various ways of summarising data and methods for assessing location and dispersion.</p><p>This OpenLearn course is an adapted extract from the Open University course <span class="oucontentlinkwithtip"><a class="oucontenthyperlink" href="http://www3.open.ac.uk/study/undergraduate/course/m248.htm?LKCAMPAIGN=ebook_&MEDIA=ou">M248 <i>Analysing data</i>.</a></span></p>

Learning outcomes
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsectionlearningoutcomes
Tue, 26 Jul 2011 23:00:00 GMT
<p>After studying this course, you should be able to:</p><ul><li><p>understand and use standard symbols and notation: for the pth value in a data set when the values are written in order, the sample lower and upper quartiles and the sample median, the sample mean and the standard deviation</p></li><li><p>understand that data can have a pattern which may be represented graphically</p></li><li><p>understand that the standard deviation and the interquartile range are measures of the dispersion in a data set</p></li><li><p>understand that the median and the interquartile range are more resistant measures than are the mean and the standard deviation</p></li><li><p>identify an overall 'feel' for data and the way it is distributed by constructing appropriate graphical displays.</p></li></ul>

1 Introducing data
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection1
Tue, 26 Jul 2011 23:00:00 GMT
<p><i>Chambers English Dictionary</i> defines the word data as follows.</p><p><b>data</b>, <i>dātä, n.pl</i>. facts given, from which others may be inferred:—<i>sing</i>. <b>da'tum</b> (q.v.) …. [L. <i>data</i>, things given, pa.p. neut. pl. of <i>dare</i>, to give.]</p><p>You might prefer the definition given in the <i>Shorter Oxford English Dictionary</i>.</p><p><b>data</b>, things given or granted; something known or assumed as fact, and made the basis of reasoning or calculation.</p><p>Data arise in many spheres of human activity and in all sorts of different contexts in the natural world about us. Statistics may be described as exploring, analysing and summarising data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions. The data themselves may arise in the natural course of things (for example, as meteorological records) or, commonly, they may be collected by survey or experiment.</p><p>In this course we begin by examining several different data sets and describing some of their features.</p><p>Data are frequently expressed as nothing more than a list of numbers or a complicated table. As a result, very large data sets can be difficult to appreciate and interpret without some form of consolidation. This can, perhaps, be achieved via a series of simpler tables or an easily assimilated diagram. The same applies to smaller data sets, whose main message may become evident only after some procedure of sorting and summarising.</p><p>Before computers were widely available, it was often necessary to make quite detailed theoretical assumptions before beginning to investigate the data. 
But nowadays it is relatively easy to use a statistical computer package to explore data and acquire some intuitive ‘feel’ for them, without making such assumptions. This is helpful in that the most important and informative place to start is the logical one, namely with the data themselves. The computer will make your task both possible and relatively quick.</p><p>However, you must take care not to be misled into thinking that computers have made statistical theory redundant: this is far from the truth. You will find the computer can only lead you to see where theory is needed to underpin a commonsense approach or, perhaps, to reach an informed decision. It cannot replace such theory and it is, of course, incapable of informed reasoning: as always, that is up to you. Even so, if you are to gain real understanding and expertise, your first steps are best directed towards learning to use your computer to explore data, and to obtain some tentative inferences from such exploration.</p><p>The technology explosion of recent years has made relatively cheap and powerful computers available to all of us. Furthermore, it has brought about an information explosion which has revolutionised our whole environment. Information pours in from the media, advertisements, government agencies and a host of other sources and, in order to survive, we must learn to make rational choices based on some kind of summary and analysis of it. We need to learn to select the relevant and discard the irrelevant, to sift out what is interesting, to have some kind of appreciation of the accuracy and reliability of both our information and our conclusions, and to produce succinct summaries which can be interpreted clearly and quickly.</p><p>Our methods for summarising data will involve producing graphical displays as well as numerical calculations. 
You will see how a preliminary pictorial analysis of your data can, and indeed should, influence your entire approach to choosing a valid, reliable method.</p><p>But we shall begin, in Section 1 of this course, with the data themselves. In this course, except where it is necessary to make a particular theoretical point, all of the data sets used are genuine; none are artificial, contrived or ‘adjusted’ in any way. In Section 1 you will encounter several sets of real data, and begin to look at some questions on which they can throw light.</p><p>Statistics exists as an academic and intellectual discipline precisely because real investigations need to be carried out. Simple questions, and difficult ones, about matters which affect our lives need to be answered, information needs to be processed and decisions need to be made. ‘Finding things out’ is fun: this is the challenge of real data.</p><p>Some basic graphical methods that can be used to present data and make clearer the patterns in sets of numbers are introduced in Sections 2 and 3: pie charts and bar charts in Section 2, histograms and scatterplots in Section 3.</p><p>Finally, in Section 4 we discuss ways of producing numerical summaries of certain aspects of data sets, including measures of location (which are, in a sense, ‘averages’), measures of the dispersion or variability of a set of data, and measures of symmetry (and lack of symmetry).</p>

1.1.1 Introduction
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.1
Tue, 26 Jul 2011 23:00:00 GMT
<p>The data sets you will meet in this section are very different from each other, both in structure and character. By the time you reach the end of the course, you will have carried out a preliminary investigation of each, identified important questions about them and made a good deal of progress with some of the answers. As you work through the course developing statistical expertise, several of these data sets will be revisited and different questions addressed.</p><p>There are seven data sets here. You do not need to study them in great detail at this early stage. You should spend just long enough to see how they are presented and to think about the questions that arise. We shall be looking at all of them in greater detail as this course proceeds. However, if you think you have identified something interesting or unusual about any one of them, make a note of your idea for later in the course.</p>

1.1.2: Nuclear power stations
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.2
Tue, 26 Jul 2011 23:00:00 GMT
<p>The first data set is a very simple one. <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.2#tbl001001">Table 1</a> shows the number of nuclear power stations in various countries throughout the world before the end of the cold war (that is, prior to 1989). The names of the countries listed are those that pertained at the time the data were collected.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_001"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Table 1 Nuclear power stations</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Country</th><th scope="col">Number</th></tr><tr><td>Canada</td><td>22</td></tr><tr><td>Czechoslovakia</td><td>13</td></tr><tr><td>East Germany</td><td>10</td></tr><tr><td>France</td><td>52</td></tr><tr><td>Japan</td><td>43</td></tr><tr><td>South Korea</td><td>9</td></tr><tr><td>Soviet Union</td><td>73</td></tr><tr><td>Spain</td><td>10</td></tr><tr><td>Sweden</td><td>12</td></tr><tr><td>UK</td><td>41</td></tr><tr><td>USA</td><td>119</td></tr><tr><td>West Germany</td><td>23</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>You could scarcely have a more straightforward table than this, and yet it is by no means clear what is the most meaningful and appealing way to show the information.</p>
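<p>One simple way to begin exploring how Table 1 might be displayed is to order the countries by their counts and draw a rough text bar chart. The following is a minimal Python sketch, not part of the original course materials; the variable names and the scale of one <code>#</code> per two stations are illustrative assumptions.</p>

```python
# Table 1: number of nuclear power stations per country (prior to 1989).
stations = {
    "Canada": 22, "Czechoslovakia": 13, "East Germany": 10, "France": 52,
    "Japan": 43, "South Korea": 9, "Soviet Union": 73, "Spain": 10,
    "Sweden": 12, "UK": 41, "USA": 119, "West Germany": 23,
}

# Order the countries from most to fewest stations.
ordered = sorted(stations.items(), key=lambda item: item[1], reverse=True)
total = sum(stations.values())

# Text bar chart; one '#' per two stations is an arbitrary display scale.
for country, count in ordered:
    bar = "#" * (count // 2)
    print(f"{country:<15}{count:>4}  {bar}")

print(f"{'Total':<15}{total:>4}")
```

<p>Ordering by size rather than alphabetically is only one possible choice; whether it is the most meaningful one depends on the question being asked of the data.</p>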

1.1.3: USA workforce
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.3
Tue, 26 Jul 2011 23:00:00 GMT
<p>The data set in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.3#tbl001002">Table 2</a> comprises the figures published by the US Labor Department for the composition of its workforce in 1986. It shows the average numbers over the year of male and female workers in the various different employment categories and is typical of the kind of data published by government departments.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_002"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Table 2 Average composition of the USA workforce during 1986</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Type of employment</th><th scope="col">Male (millions)</th><th scope="col">Female (millions)</th></tr><tr><td>Professional</td><td>15.00</td><td>11.60</td></tr><tr><td>Industrial</td><td>12.90</td><td>4.45</td></tr><tr><td>Craftsmen</td><td>12.30</td><td>1.25</td></tr><tr><td>Sales</td><td>6.90</td><td>6.45</td></tr><tr><td>Service</td><td>5.80</td><td>9.60</td></tr><tr><td>Clerical</td><td>3.50</td><td>14.30</td></tr><tr><td>Agricultural</td><td>2.90</td><td>0.65</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>In spite of this being a small and fairly straightforward data set, it is not easy to develop an intuitive ‘feel’ for the numbers and their relationships with each other when they are displayed as a table.</p><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act001_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 1: USA workforce</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Given the USA workforce data, what questions might you ask?</p></div>
<div class="oucontentsaqdiscussion"><h3 class="oucontenth4">Discussion</h3><p>One question is: what is the most meaningful and appealing way to show the information? You might want to decide how best you can compare the male and female workforces in each category. It is possible that the most important question involves comparisons between the total number of employees in each of the seven categories.</p></div></div></div></div>
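<p>The two comparisons suggested in the discussion (male against female within each category, and totals across categories) can be tabulated directly from Table 2. The following is a minimal Python sketch under that reading; the dictionary layout and the variable names are illustrative assumptions rather than anything prescribed by the course.</p>

```python
# Table 2: average USA workforce during 1986, in millions (male, female).
workforce = {
    "Professional": (15.00, 11.60),
    "Industrial": (12.90, 4.45),
    "Craftsmen": (12.30, 1.25),
    "Sales": (6.90, 6.45),
    "Service": (5.80, 9.60),
    "Clerical": (3.50, 14.30),
    "Agricultural": (2.90, 0.65),
}

# Total employees per category, and the proportion of each category that is female.
totals = {job: male + female for job, (male, female) in workforce.items()}
female_share = {
    job: female / (male + female) for job, (male, female) in workforce.items()
}

# Print the categories from largest to smallest total.
print(f"{'Category':<13}{'Total':>7}{'Female':>8}")
for job in sorted(totals, key=totals.get, reverse=True):
    print(f"{job:<13}{totals[job]:>7.2f}{female_share[job]:>8.1%}")
```

<p>Expressing the female figures as a share of each category puts categories of very different sizes on a common scale, which the raw table does not do.</p>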

1.1.4: Infants with SIRDS
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.4
Tue, 26 Jul 2011 23:00:00 GMT
<p>The data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.4#tbl001003">Table 3</a> are the recorded birth weights of 50 infants who displayed severe idiopathic respiratory distress syndrome (SIRDS). This is a serious condition which can result in death.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_003"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Table 3 Birth weights (in kg) of infants with severe idiopathic respiratory distress syndrome</h2><div class="oucontenttablewrapper"><table><tr><td>1.050*</td><td>2.500*</td><td>1.890*</td><td>1.760</td><td>2.830</td></tr><tr><td>1.175*</td><td>1.030*</td><td>1.940*</td><td>1.930</td><td>1.410</td></tr><tr><td>1.230*</td><td>1.100*</td><td>2.200*</td><td>2.015</td><td>1.715</td></tr><tr><td>1.310*</td><td>1.185*</td><td>2.270*</td><td>2.090</td><td>1.720</td></tr><tr><td>1.500*</td><td>1.225*</td><td>2.440*</td><td>2.600</td><td>2.040</td></tr><tr><td>1.600*</td><td>1.262*</td><td>2.560*</td><td>2.700</td><td>2.200</td></tr><tr><td>1.720*</td><td>1.295*</td><td>2.730*</td><td>2.950</td><td>2.400</td></tr><tr><td>1.750*</td><td>1.300*</td><td>1.130</td><td>3.160</td><td>2.550</td></tr><tr><td>1.770*</td><td>1.550*</td><td>1.575</td><td>3.400</td><td>2.570</td></tr><tr><td>2.275*</td><td>1.820*</td><td>1.680</td><td>3.640</td><td>3.005</td></tr><tr><td colspan="2">*child died</td><td/><td/><td/></tr></table></div><div class="oucontentsourcereference"></div></div><p>(van Vliet, P.K. and Gupta, J.M. (1973) Sodium bicarbonate in idiopathic respiratory distress syndrome. <i>Arch. Disease in Childhood</i>, <b>48</b>, 249–255.)</p><p>At first glance, there seems little that one can deduce from these data. The babies vary in weight between 1.03 kg and 3.64 kg. Notice, however, that some of the children died. 
Surely the important question concerns early identification of children displaying SIRDS who are at risk of dying. Do the children split into two identifiable groups? Is it possible to relate the chances of survival to birth weight?</p>
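<p>A first step towards the questions above is to separate the birth weights in Table 3 into the infants who died (starred in the table) and those who survived, and to summarise each group. The following is a minimal Python sketch using only the standard library; the two list names are illustrative assumptions, and the summary is exploratory rather than a formal analysis.</p>

```python
from statistics import mean, median

# Table 3: birth weights in kg. Starred entries (child died) go in `died`.
died = [
    1.050, 1.175, 1.230, 1.310, 1.500, 1.600, 1.720, 1.750, 1.770, 2.275,
    2.500, 1.030, 1.100, 1.185, 1.225, 1.262, 1.295, 1.300, 1.550, 1.820,
    1.890, 1.940, 2.200, 2.270, 2.440, 2.560, 2.730,
]
survived = [
    1.130, 1.575, 1.680,
    1.760, 1.930, 2.015, 2.090, 2.600, 2.700, 2.950, 3.160, 3.400, 3.640,
    2.830, 1.410, 1.715, 1.720, 2.040, 2.200, 2.400, 2.550, 2.570, 3.005,
]

# Compare the size, range and typical value of the two groups.
for label, weights in (("died", died), ("survived", survived)):
    print(
        f"{label:<9} n={len(weights):>2} "
        f"min={min(weights):.3f} median={median(weights):.3f} "
        f"mean={mean(weights):.3f} max={max(weights):.3f}"
    )
```

<p>In this sketch the median birth weight is lower in the group that died, but the two ranges overlap considerably, so birth weight alone does not split the infants cleanly into two groups.</p>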

1.1.5 Runners
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.5
Tue, 26 Jul 2011 23:00:00 GMT
<p>The next data set relates to 22 of the competitors in an annual championship run, the Tyneside Great North Run. Blood samples were taken from eleven runners before and after the run, and also from another eleven runners who collapsed near the end of the race. The measurements are plasma β endorphin concentrations in pmol/litre. The letter β is the Greek lowercase letter beta, pronounced ‘beeta’. Unless you have had medical training you are unlikely to know precisely what constitutes a plasma β endorphin concentration, much less what the units of measurement mean. This is a common experience even among expert statisticians working with data from specialist experiments, and can usually be dealt with. What matters is that some physical attribute can be measured, and the measurement value is important to the experimenter. The statistician is prepared to accept that running may have an effect upon the blood, and will ask for clarification of medical questions as and when the need arises. The data are given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.5#tbl001004">Table 4</a>.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_004"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Table 4 Blood plasma β endorphin concentration (pmol/l)</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Normal runner before race</th><th scope="col">Same runner after race</th><th scope="col">Collapsed runner after 
race</th></tr><tr><td>4.3</td><td>29.6</td><td>66</td></tr><tr><td>4.6</td><td>25.1</td><td>72</td></tr><tr><td>5.2</td><td>15.5</td><td>79</td></tr><tr><td>5.2</td><td>29.6</td><td>84</td></tr><tr><td>6.6</td><td>24.1</td><td>102</td></tr><tr><td>7.2</td><td>37.8</td><td>110</td></tr><tr><td>8.4</td><td>20.2</td><td>123</td></tr><tr><td>9.0</td><td>21.9</td><td>144</td></tr><tr><td>10.4</td><td>14.2</td><td>162</td></tr><tr><td>14.0</td><td>34.6</td><td>169</td></tr><tr><td>17.8</td><td>46.2</td><td>414</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>(Dale, G., Fleetwood, J.A., Weddell, A., Ellis, R.D. and Sainsbury, J.R.C. (1987) Beta-endorphin: a factor in ‘fun run’ collapse? <i>British Medical Journal</i>
<b>294</b>, 1004.)</p><p>You can see immediately that there is a difference in β endorphin concentration before and after the race, and you do not need to be a statistician to see that collapsed runners have very high β endorphin concentrations compared with those who finished the race. But what is the relationship between initial and final β endorphin concentrations? What is a typical finishing concentration? What is a typical concentration for a collapsed runner? How do the sets of data values compare in terms of how widely they are dispersed around a typical value?</p><p>The table raises other questions. The eleven normal runners (in the first two columns) have been sorted according to increasing pre-race endorphin levels. This may or may not help make any differences in the post-race levels more immediately evident. Is this kind of initial sorting necessary, or even common, in statistical practice? The data on the collapsed runners have also been sorted. The neat table design relies in part on the fact that there were eleven collapsed runners measured, just as there were eleven finishers, but the two groups are independent of each other. There does not seem to be any particularly obvious reason why the numbers in the two groups should not have been different. Is it necessary to the statistical design of this experiment that the numbers should have been the same?</p>
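<p>As a first, informal look at the questions above, the three columns of Table 4 can be compared directly. The sketch below is only illustrative: it assumes that the median is an acceptable choice of ‘typical value’ and that the range is an acceptable crude measure of dispersion, neither of which has yet been justified at this point in the course. The variable names are invented for the example.</p>

```python
from statistics import median

# Table 4, transcribed column by column (pmol/l).
# Rows of `before` and `after` refer to the same eleven runners.
before = [4.3, 4.6, 5.2, 5.2, 6.6, 7.2, 8.4, 9.0, 10.4, 14.0, 17.8]
after = [29.6, 25.1, 15.5, 29.6, 24.1, 37.8, 20.2, 21.9, 14.2, 34.6, 46.2]
# A separate, independent group of eleven runners.
collapsed = [66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414]

# Paired increase for each normal runner (after minus before).
increase = [round(a - b, 1) for b, a in zip(before, after)]

# One possible 'typical value' and one crude measure of spread per column.
summary = {
    name: {"median": median(values), "range": round(max(values) - min(values), 1)}
    for name, values in [("before", before), ("after", after), ("collapsed", collapsed)]
}

print("paired increases:", increase)
for name, stats in summary.items():
    print(name, stats)
```

<p>Because the first two columns are paired, the differences are computed runner by runner; the collapsed runners form a separate group, so no such pairing is possible for the third column.</p>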

1.1.6 Cirrhosis and alcoholism
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.6
Tue, 26 Jul 2011 23:00:00 GMT
<p>The data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.6#tbl001005">Table 5</a>, which are given for several countries in Europe and elsewhere, are the average annual alcohol consumption in litres per person and the death rate per 100 000 of the population from cirrhosis and alcoholism. It would seem obvious that the two are related to each other, but what is the relationship and is it a strong one? How can the strength of such a relationship be measured? Is it possible to assess the effect on alcohol-related deaths of taxes on alcohol, or of laws that aim to reduce the national alcohol consumption?</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_005"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Table 5 Average alcohol consumption and death rate</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Country</th><th scope="col">Annual alcohol consumption (l/person)</th><th scope="col">Cirrhosis & alcoholism (death rate/100 000)</th></tr><tr><td>France</td><td>24.7</td><td>46.1</td></tr><tr><td>Italy</td><td>15.2</td><td>23.6</td></tr><tr><td>W. Germany</td><td>12.3</td><td>23.7</td></tr><tr><td>Austria</td><td>10.9</td><td>7.0</td></tr><tr><td>Belgium</td><td>10.8</td><td>12.3</td></tr><tr><td>USA</td><td>9.9</td><td>14.2</td></tr><tr><td>Canada</td><td>8.3</td><td>7.4</td></tr><tr><td>England & Wales</td><td>7.2</td><td>3.0</td></tr><tr><td>Sweden</td><td>6.6</td><td>7.2</td></tr><tr><td>Japan</td><td>5.8</td><td>10.6</td></tr><tr><td>Netherlands</td><td>5.7</td><td>3.7</td></tr><tr><td>Ireland</td><td>5.6</td><td>3.4</td></tr><tr><td>Norway</td><td>4.2</td><td>4.3</td></tr><tr><td>Finland</td><td>3.9</td><td>3.6</td></tr><tr><td>Israel</td><td>3.1</td><td>5.4</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>(Osborn, J.F. (1979) <i>Statistical exercises in medical research</i>. Blackwell Scientific Publications, Oxford, p.44.)</p><p>France has a noticeably higher average annual individual alcohol consumption than the others; the figure is more than double that of third-placed West Germany. The French alcohol-related death rate is just under double that of the next highest.</p><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act001_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 2: Alcohol consumption and death rate</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Bearing in mind the comments above, summarise the information you might wish to glean from these data. Have you any suggestions for displaying the data?</p></div>
<div class="oucontentsaqdiscussion"><h3 class="oucontenth4">Discussion</h3><p>You would wish to know whether the death rate is directly related to alcohol consumption and, if so, how. You would also need to know if the figures for France should be regarded as atypical. If so, how should they be handled when the data are analysed?</p><p>One suggestion for displaying the data would be to plot a graph of <i>death rate</i> against <i>alcohol consumption</i>.</p></div></div></div></div>
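<p>The two comparisons made about France can be checked by simple arithmetic on Table 5. The following sketch is one way of doing so; the variable names are invented for the example, and the final list of (consumption, death rate) pairs is merely the input that a plot of death rate against alcohol consumption would need, not the plot itself.</p>

```python
# Table 5 as (country, litres of alcohol per person per year,
# deaths per 100 000 from cirrhosis and alcoholism).
table5 = [
    ("France", 24.7, 46.1), ("Italy", 15.2, 23.6), ("W. Germany", 12.3, 23.7),
    ("Austria", 10.9, 7.0), ("Belgium", 10.8, 12.3), ("USA", 9.9, 14.2),
    ("Canada", 8.3, 7.4), ("England & Wales", 7.2, 3.0), ("Sweden", 6.6, 7.2),
    ("Japan", 5.8, 10.6), ("Netherlands", 5.7, 3.7), ("Ireland", 5.6, 3.4),
    ("Norway", 4.2, 4.3), ("Finland", 3.9, 3.6), ("Israel", 3.1, 5.4),
]

# Rank the countries on each variable, highest first.
by_consumption = sorted(table5, key=lambda row: row[1], reverse=True)
by_death_rate = sorted(table5, key=lambda row: row[2], reverse=True)

# France's consumption relative to the third-placed country.
france = by_consumption[0]
third_placed = by_consumption[2]
consumption_ratio = france[1] / third_placed[1]

# The highest death rate relative to the next highest.
next_highest = by_death_rate[1]
death_rate_ratio = by_death_rate[0][2] / next_highest[2]

print(f"{france[0]} vs {third_placed[0]} (consumption): {consumption_ratio:.2f}")
print(f"{by_death_rate[0][0]} vs {next_highest[0]} (death rate): {death_rate_ratio:.2f}")

# Points for a graph of death rate (vertical) against consumption (horizontal).
points = [(consumption, death_rate) for _, consumption, death_rate in table5]
```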

1.1.7 Body weights and brain weights for animals
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.7
Tue, 26 Jul 2011 23:00:00 GMT
<p>The next data set comprises average body and brain weights for 28 kinds of animal, some of them extinct. The data are given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.7#tbl001006">Table 6</a>.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_006"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Table 6 Average body and brain weights for animals</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Species</th><th scope="col">Body weight (kg)</th><th scope="col">Brain weight (g)</th></tr><tr><td>Mountain Beaver</td><td>1.350</td><td>8.100</td></tr><tr><td>Cow</td><td>465.000</td><td>423.000</td></tr><tr><td>Grey Wolf</td><td>36.330</td><td>119.500</td></tr><tr><td>Goat</td><td>27.660</td><td>115.000</td></tr><tr><td>Guinea Pig</td><td>1.040</td><td>5.500</td></tr><tr><td>
<i>Diplodocus</i>
</td><td>11700.000</td><td>50.000</td></tr><tr><td>Asian Elephant</td><td>2547.000</td><td>4603.000</td></tr><tr><td>Donkey</td><td>187.100</td><td>419.000</td></tr><tr><td>Horse</td><td>521.000</td><td>655.000</td></tr><tr><td>Potar Monkey</td><td>10.000</td><td>115.000</td></tr><tr><td>Cat</td><td>3.300</td><td>25.600</td></tr><tr><td>Giraffe</td><td>529.000</td><td>680.000</td></tr><tr><td>Gorilla</td><td>207.000</td><td>406.000</td></tr><tr><td>Human</td><td>62.000</td><td>1320.000</td></tr><tr><td>African Elephant</td><td>6654.000</td><td>5712.000</td></tr><tr><td>
<i>Triceratops</i>
</td><td>9400.000</td><td>70.000</td></tr><tr><td>Rhesus Monkey</td><td>6.800</td><td>179.000</td></tr><tr><td>Kangaroo</td><td>35.000</td><td>56.000</td></tr><tr><td>Hamster</td><td>0.120</td><td>1.000</td></tr><tr><td>Mouse</td><td>0.023</td><td>0.400</td></tr><tr><td>Rabbit</td><td>2.500</td><td>12.100</td></tr><tr><td>Sheep</td><td>55.500</td><td>175.000</td></tr><tr><td>Jaguar</td><td>100.000</td><td>157.000</td></tr><tr><td>Chimpanzee</td><td>52.160</td><td>440.000</td></tr><tr><td>
<i>Brachiosaurus</i>
</td><td>87000.000</td><td>154.500</td></tr><tr><td>Rat</td><td>0.280</td><td>1.900</td></tr><tr><td>Mole</td><td>0.122</td><td>3.000</td></tr><tr><td>Pig</td><td>192.000</td><td>180.000</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>(Jerison, H.J. (1973) <i>Evolution of the brain and intelligence</i>. Academic Press, New York.)</p><p>These data raise interesting questions about their collection and the use of the word ‘average’. Presumably some estimates may be based on very small samples, while others may be more precise. On what sampling experiment are the figures for <i>Diplodocus, Triceratops</i> and other extinct animals based? The three-decimal-place ‘accuracy’ given throughout the table here is extraordinary (and certainly needs justification).</p><p>Putting these concerns to one side for the moment, it would seem obvious that the two variables, body weight and brain weight, are linked. But what is the relationship between them and how strong is it? Can the strength of the relationship be measured? Is a larger brain really required to govern a larger body? These data give rise to a common problem in data analysis which experienced practical analysts would notice as soon as they look at such data. Can you identify the difficulty? Later, when we plot these data, you will see it immediately.</p>

1.1.8 Surgical removal of tattoos
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.8
Tue, 26 Jul 2011 23:00:00 GMT
<p>The final data set in this section is different from the others in that the data are not numerical. So far you have only seen numerical data in the form of measurements or counts. However, there is no reason why data should not be verbal or textual. <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.8#tbl001007">Table 7</a> contains clinical data from 55 patients who have had forearm tattoos removed. Two different surgical methods were used; these are denoted by A and B in the table. The tattoos were of large, medium or small size, either deep or at moderate depth. The final result is scored from 1 to 4, where 1 represents a poor removal and 4 represents an excellent result. The gender of the patient is also shown.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_007"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Table 7 Surgical removal of tattoos</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Method</th><th scope="col">Gender</th><th scope="col">Size</th><th scope="col">Depth</th><th scope="col">Score</th><th scope="col">  </th><th scope="col">Method</th><th scope="col">Gender</th><th scope="col">Size</th><th scope="col">Depth</th><th scope="col">Score</th></tr><tr><td>A</td><td>M</td><td>Large</td><td>deep</td><td>1</td><td/><td>B</td><td>M</td><td>medium</td><td>moderate</td><td>2</td></tr><tr><td>A</td><td>M</td><td>Large</td><td>moderate</td><td>1</td><td/><td>B</td><td>M</td><td>large</td><td>moderate</td><td>1</td></tr><tr><td>B</td><td>F</td><td>Small</td><td>deep</td><td>1</td><td/><td>A</td><td>M</td><td>medium</td><td>deep</td><td>2</td></tr><tr><td>B</td><td>M</td><td>Small</td><td>moderate</td><td>4</td><td/><td>B</td><td>M</td><td>large</td><td>deep</td><td>3</td></tr><tr><td>B</td><td>F</td><td>Large</td><td>deep</td><td>3</td><td/><td>A</td><td>F</td><td>large</td><td>moderate</td><td>1</td></tr><tr><td>B</td><td>M</td><td>Medium</td><td>moderate</td><td>4</td><td/><td>B</td><td>F</td><td>medium</td><td>deep</td><td>2</td></tr><tr><td>B</td><td>M</td><td>Medium</td><td>deep</td><td>4</td><td/><td>A</td><td>F</td><td>medium</td><td>deep</td><td>1</td></tr><tr><td>A</td><td>M</td><td>Large</td><td>deep</td><td>1</td><td/><td>A</td><td>M</td><td>medium</td><td>moderate</td><td>3</td></tr><tr><td>A</td><td>M</td><td>Large</td><td>moderate</td><td>4</td><td/><td>B</td><td>M</td><td>large</td><td>moderate</td><td>3</td></tr><tr><td>A</td><td>M</td><td>Small</td><td>moderate</td><td>4</td><td/><td>A</td><td>M</td><td>medium</td><td>deep</td><td>1</td></tr><tr><td>A</td><td>M</td><td>Large</td><td>deep</td><td>1</td><td/><td>A</td><td>F</td><td>small</td><td>deep</td><td>2</td></tr><tr><td>A</td><td>M</td><td>Large</td><td>moderate</td><td>4</td><td/><td>A</td><td>M</td><td>large</td><td>moderate</td><td>2</td></tr><tr><td>A</td><td>F</td><td>Small</td><td>moderate</td><td>3</td><td/><td>B</td><td>M</td><td>large</td><td>deep</td><td>2</td></tr><tr><td>B</td><td>M</td><td>Large</td><td>deep</td><td>3</td><td/><td>B</td><td>M</td><td>medium</td><td>moderate</td><td>4</td></tr><tr><td>B</td><td>M</td><td>Large</td><td>deep</td><td>2</td><td/><td>B</td><td>M</td><td>medium</td><td>deep</td><td>1</td></tr><tr><td>B</td><td>F</td><td>Medium</td><td>moderate</td><td>2</td><td/><td>B</td><td>F</td><td>medium</td><td>moderate</td><td>3</td></tr><tr><td>B</td><td>M</td><td>Large</td><td>deep</td><td>1</td><td/><td>B</td><td>M</td><td>large</td><td>moderate</td><td>2</td></tr><tr><td>B</td><td>F</td><td>Medium</td><td>deep</td><td>1</td><td/><td>B</td><td>M</td><td>large</td><td>moderate</td><td>2</td></tr><tr><td>B</td><td>F</td><td>Small</td><td>moderate</td><td>3</td><td/><td>B</td><td>M</td><td>large</td><td>moderate</td><td>4</td></tr><tr><td>A</td><td>F</td><td>Small</td><td>moderate</td><td>4</td><td/><td>B</td><td>M</td><td>small</td><td>deep</td><td>4</td></tr><tr><td>B</td><td>M</td><td>Large</td><td>deep</td><td>2</td><td/><td>B</td><td>M</td><td>large</td><td>moderate</td><td>3</td></tr><tr><td>A</td><td>M</td><td>Medium</td><td>moderate</td><td>4</td><td/><td>B</td><td>M</td><td>large</td><td>deep</td><td>2</td></tr><tr><td>B</td><td>M</td><td>Large</td><td>deep</td><td>4</td><td/><td>B</td><td>M</td><td>large</td><td>deep</td><td>3</td></tr><tr><td>B</td><td>M</td><td>Large</td><td>moderate</td><td>4</td><td/><td>A</td><td>M</td><td>large</td><td>moderate</td><td>4</td></tr><tr><td>A</td><td>M</td><td>Large</td><td>deep</td><td>4</td><td/><td>A</td><td>M</td><td>large</td><td>deep</td><td>2</td></tr><tr><td>B</td><td>M</td><td>Medium</td><td>moderate</td><td>3</td><td/><td>B</td><td>M</td><td>medium</td><td>deep</td><td>1</td></tr><tr><td>A</td><td>M</td><td>Large</td><td>deep</td><td>1</td><td/><td>A</td><td>M</td><td>small</td><td>deep</td><td>2</td></tr><tr><td>B</td><td>M</td><td>Large</td><td>moderate</td><td>4</td><td/><td/><td/><td/><td/><td/></tr></table></div><div class="oucontentsourcereference"></div></div><p>(Lunn, A.D. and McNeil, D.R. (1988) <i>The SPIDA manual</i>. Statistical Computing Laboratory, Sydney.)</p><p>What are the relative merits of the two methods of tattoo removal? Is one method simply better, or does the quality of the result depend upon the size or depth of the tattoo?</p>
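<p>Since these data are categorical, a natural first step is to count how often each combination of categories occurs. The sketch below tabulates score against method for the 55 patients in Table 7. It is an illustration only, not a method prescribed by the course: the layout of the transcription (left-hand block of the table first, then the right-hand block, with sizes written in lower case) and all variable names are choices made for this example.</p>

```python
from collections import Counter

# Table 7, one record per patient: "method gender size depth score".
# The 28 left-hand rows come first, followed by the 27 right-hand rows.
RAW = """
A M large deep 1; A M large moderate 1; B F small deep 1; B M small moderate 4
B F large deep 3; B M medium moderate 4; B M medium deep 4; A M large deep 1
A M large moderate 4; A M small moderate 4; A M large deep 1; A M large moderate 4
A F small moderate 3; B M large deep 3; B M large deep 2; B F medium moderate 2
B M large deep 1; B F medium deep 1; B F small moderate 3; A F small moderate 4
B M large deep 2; A M medium moderate 4; B M large deep 4; B M large moderate 4
A M large deep 4; B M medium moderate 3; A M large deep 1; B M large moderate 4
B M medium moderate 2; B M large moderate 1; A M medium deep 2; B M large deep 3
A F large moderate 1; B F medium deep 2; A F medium deep 1; A M medium moderate 3
B M large moderate 3; A M medium deep 1; A F small deep 2; A M large moderate 2
B M large deep 2; B M medium moderate 4; B M medium deep 1; B F medium moderate 3
B M large moderate 2; B M large moderate 2; B M large moderate 4; B M small deep 4
B M large moderate 3; B M large deep 2; B M large deep 3; A M large moderate 4
A M large deep 2; B M medium deep 1; A M small deep 2
"""

# Split the text into (method, gender, size, depth, score) tuples.
records = [tuple(r.split()) for r in RAW.replace("\n", ";").split(";") if r.strip()]

# Count patients for each (method, score) combination.
counts = Counter((method, int(score)) for method, _, _, _, score in records)

# Print a small two-way table: one row per method, one column per score.
print("method  " + "  ".join(f"score {s}" for s in range(1, 5)))
for method in ("A", "B"):
    row = [counts[(method, s)] for s in range(1, 5)]
    print(f"{method:<8}" + "  ".join(f"{n:>7}" for n in row))
```

<p>The same counting could be repeated with size or depth in place of method, which is one way of starting to ask whether the quality of the result depends on the kind of tattoo.</p>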

1.1.9 Data and questions: summary
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.9
Tue, 26 Jul 2011 23:00:00 GMT
<p>In this section you have met some real data sets and briefly considered some of the questions you might ask of them. These data sets will be referred to and investigated in the remaining sections of this course, and some general principles that govern the efficacy and quality of data summaries and displays will be formulated. As you will discover, the main requirements of any good statistical summary or display are that it is informative, easy to construct, visually appealing and readily assimilated by a non-expert.</p>

1.2.1 Introduction
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.1
Tue, 26 Jul 2011 23:00:00 GMT
<p>The data set in Table 7 (section 1.8) comprised non-numerical, or categorical, data. Such data often appear in newspaper reports and are usually represented by one of two types of graphical display: the <i>pie chart</i> and the <i>bar chart</i>. These are arguably the graphical displays most familiar to the general public, and are certainly ones that you will have seen before. Pie charts are discussed in section 2.2 and bar charts in section 2.4. Some problems that can arise when using graphics of these types are discussed briefly in section 2.6.</p>

1.2.2 Pie charts: surgical removal of tattoos
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.2
Tue, 26 Jul 2011 23:00:00 GMT
<p>Suppose we count the numbers of large, medium and small tattoos from the data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.8#tbl001007">Table 7</a>: there were 30 large tattoos, 16 of medium size and 9 small tattoos. These data are represented in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.2#fig002001">Figure 1</a>. This display is called a <b>pie chart</b>.</p><p>This is an easy display to construct because the size of each ‘slice’ is proportional to the angle it subtends at the centre, which in turn is proportional to the count in each category. So, to construct <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.2#fig002001">Figure 1</a>, you simply draw a circle and draw in radii making angles that represent the counts of large, medium and small tattoos respectively. For example, you can calculate the angle that represents the number of large tattoos as follows.</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_001"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/f67b7b70/m248_1_ue001.jpg" alt=""/></div><div class="oucontentfigure" style="width:511px;" id="fig002_001"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/0a4336e2/m248_1_001i.jpg" alt="Figure 1" width="511" height="440"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 1 Tattoo sizes</span></div></div></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act002_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 3: Tattoo sizes</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Calculate the angles that represent the numbers of medium and small tattoos.</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>The angle for medium tattoos is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0033"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/b9eba436/m248_1_ue033i.jpg" alt=""/></div><p>The angle for small tattoos is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0034"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/fd9ad963/m248_1_ue034i.jpg" alt=""/></div></div></div></div></div><p>Once you have calculated the angles, you draw in the three radii that mark out these angles, and then shade the three sectors to distinguish them from each other.</p><p>At first sight the pie chart seems to fulfil the basic requirements of a good statistical display: it appears to be informative, easy to construct, visually appealing and readily assimilated by a non-expert.</p><p>Pie charts can be useful when all you want the reader to notice is that there were more large than medium-sized tattoos, and more medium-sized than small tattoos. However, they are limited in how well they convey the relative magnitudes of the differences. Also, pie charts are useful only for displaying a limited number of categories, as the next section illustrates.</p>
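<p>The angle calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name <code>pie_angles</code> is an assumption made for this example, and the counts are the tattoo sizes quoted in the text (30 large, 16 medium, 9 small).</p>

```python
def pie_angles(counts):
    """Return the sector angle, in degrees, for each category.

    Each angle is the category's share of the total count,
    multiplied by the 360 degrees of a full circle.
    """
    total = sum(counts.values())
    return {category: 360 * count / total for category, count in counts.items()}


# Counts of tattoo sizes quoted in the text (55 tattoos in all).
tattoo_sizes = {"large": 30, "medium": 16, "small": 9}
angles = pie_angles(tattoo_sizes)

for size, angle in angles.items():
    print(f"{size}: {angle:.1f} degrees")
```

<p>To one decimal place this gives about 196.4 degrees for large tattoos, 104.7 degrees for medium and 58.9 degrees for small, and the three angles sum to 360 degrees.</p>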

1.2.3 Pie charts: nuclear power stations
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.3
Tue, 26 Jul 2011 23:00:00 GMT
<p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.3#fig002002">Figure 2</a> shows a pie chart of the number of nuclear power stations in countries where nuclear power is used, based on the data from <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.2#tbl001001">Table 1</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig002_002"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/7fb3b0a4/m248_1_002i.small.jpg" alt="" width="511" height="278"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">Figure 2 Nuclear power stations (a pie chart)</span></div></div></div><p>It is not so easy to extract meaningful information from this more detailed diagram. You can pick out the main users of nuclear power, and that is about all.</p><p>When trying to construct pie charts for data with many categories, a common ploy of the graphic designer is to produce a pie chart which displays the main contributors and lumps together the smaller ones. However, the process inevitably involves loss of information.</p>
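<p>The lumping ploy mentioned above, and the loss of information it causes, can be sketched as follows. The function name <code>lump_small_categories</code>, the country labels and the counts are all hypothetical; they are not the values from Table 1.</p>

```python
def lump_small_categories(counts, keep):
    """Keep the `keep` largest categories and merge the rest into 'Other'."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    lumped = dict(ranked[:keep])
    remainder = sum(count for _, count in ranked[keep:])
    if remainder:
        lumped["Other"] = remainder
    return lumped


# Hypothetical counts for illustration only (NOT the data from Table 1).
stations = {
    "Country A": 40,
    "Country B": 25,
    "Country C": 6,
    "Country D": 3,
    "Country E": 1,
}
lumped = lump_small_categories(stations, keep=2)
print(lumped)
```

<p>The total is preserved, but the individual counts for the three smallest categories can no longer be recovered from the lumped version.</p>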

1.2.4 Bar charts: nuclear power stations
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.4
Tue, 26 Jul 2011 23:00:00 GMT
<p>A better way of displaying the data on nuclear power stations is by constructing a rectangular bar for each country, the length of which is proportional to the count. Bars are drawn separated from each other. In this context, the order of the categories (countries) in the original data table does not matter, so the bars in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.4#fig002003">Figure 3</a> have been drawn in order of decreasing size from top to bottom. This makes the categories easier to compare with one another.</p><div class="oucontentfigure" style="width:511px;" id="fig002_003"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/cbf35a4c/m248_1_003i.small.jpg" alt="" width="511" height="256"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">Figure 3 Nuclear power stations (a bar chart)</span></div></div></div><p>The display in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.4#fig002003">Figure 3</a> is called a <b>bar chart.</b> The bars may be drawn vertically or horizontally according to preference and convenience. Those in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.4#fig002003">Figure 3</a> have been drawn horizontally because of the lengths of the names of some of the countries. If the bars had been drawn vertically, the names of the countries would not have fitted along the horizontal axis unless the bars had been drawn far apart or the names had been printed vertically. 
The former would have made comparison difficult, while the latter would have made the names difficult to read. However, it is conventional to draw the bars vertically whenever possible.</p>
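<p>As a rough sketch of the construction described above, the following Python code draws horizontal bars as text, with bar length proportional to the count and the bars ordered by decreasing size. The function name <code>text_bar_chart</code> and the counts are assumptions made for illustration; they are not the data from Table 1.</p>

```python
def text_bar_chart(counts, symbol="#"):
    """Return one text line per category, longest bar first.

    Each bar is drawn horizontally, so long category names sit
    comfortably to the left of the bars.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    label_width = max(len(label) for label in counts)
    return [
        f"{label:<{label_width}} | {symbol * count} {count}"
        for label, count in ranked
    ]


# Hypothetical counts for illustration only (NOT the data from Table 1).
counts = {"Country A": 4, "Country B": 9, "Country C": 2}
lines = text_bar_chart(counts)
print("\n".join(lines))
```

<p>Sorting is appropriate here only because the categories have no natural order; for ordered categories the original order should be kept.</p>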

1.2.5 Bar charts: surgical removal of tattoos
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.5
Tue, 26 Jul 2011 23:00:00 GMT
<p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.5#fig002004">Figure 4</a> shows a bar chart for the data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.8#tbl001007">Table 7</a> on the effectiveness of tattoo removal.</p><p>For the data on nuclear power stations, the order of the categories did not matter. However, sometimes order is important. The quality of tattoo removal was given a score from 1 to 4, and this ordering has been preserved along the quality (horizontal) axis. The vertical axis shows the reported frequency for each assessment.</p><div class="oucontentfigure" style="width:511px;" id="fig002_004"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/6eb36083/m248_1_004i.jpg" alt="Figure 4" width="511" height="439"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 4 Quality assessment of surgical removal of 55 tattoos</span></div></div></div><p>The eye is good at assessing lengths, whereas comparison of areas or angles does not come so naturally. Thus an advantage of bar charts over pie charts is that it is much easier to be accurate when comparing frequencies from a bar chart than from a pie chart.</p>
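<p>When the categories are ordered, as the quality scores are, the frequencies should be tabulated in that order rather than by size. A minimal sketch, assuming a hypothetical helper <code>ordered_frequencies</code> and a short made-up list of scores (not the full data from Table 7):</p>

```python
from collections import Counter


def ordered_frequencies(scores, categories):
    """Count each score and report the counts in the given category order.

    Categories that never occur are reported with a frequency of zero,
    so the axis of the bar chart keeps every position.
    """
    counts = Counter(scores)
    return [(category, counts[category]) for category in categories]


# Made-up scores for illustration only (NOT the full data from Table 7).
sample_scores = [3, 2, 4, 1, 2, 3, 4, 4, 2, 4]
table = ordered_frequencies(sample_scores, categories=[1, 2, 3, 4])

for score, frequency in table:
    print(f"score {score}: {'#' * frequency} {frequency}")
```

<p>Passing the categories explicitly keeps the scores in the order 1 to 4 whatever their frequencies happen to be.</p>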

1.2.6 Problems with graphics
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.6
Tue, 26 Jul 2011 23:00:00 GMT
<p>In this subsection we consider, briefly, some problems that can arise with certain ways of drawing bar charts and pie charts.</p><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.6#fig002005">Figure 5</a> shows what is essentially the same bar chart as <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.5#fig002004">Figure 4</a>, for the data on quality of tattoo removal. This time, though, the bar chart has been drawn in such a way as to suggest that the bars are ‘really’ three-dimensional. You can see that, compared to <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.5#fig002004">Figure 4</a>, it is quite difficult to discern the corresponding frequency value for each bar.</p><div class="oucontentfigure" style="width:511px;" id="fig002_005"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/5cffbca0/m248_1_005i.jpg" alt="Figure 5" width="511" height="398"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 5 Quality of tattoo removal: a three-dimensional bar chart</span></div></div></div><p>This kind of three-dimensional bar chart is commonly used as a graphic, in television reports or in the press, for showing data such as the results from an opinion poll on the popularity of the main political parties. Viewers or readers do not necessarily realise exactly how they are supposed to use the vertical scale to determine the heights of the bars. To interpret this kind of graphic properly, you need to be aware of how misleading it can be.</p>

1.2.7 Problems with graphics: USA workforce
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.7
Tue, 26 Jul 2011 23:00:00 GMT
<p>The danger of using three-dimensional effects is really brought home when two data sets are displayed on the same bar chart. <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.3#tbl001002">Table 2</a> may be thought of as consisting of two data sets, one for male workers and one for female workers. On its own, each of these data sets could be portrayed in a bar chart like those you have seen earlier. However, one of the questions raised about these data in Activity 1 was how the data for men and for women could best be compared. Presenting them as two separate bar charts, one for males and one for females, is not the ideal way to support this comparison. We can produce a single bar chart that makes the comparison straightforward by plotting the corresponding bars for the two genders next to one another, and distinguishing the genders by shading, as in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.7#fig002006">Figure 6</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig002_006"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/155c17c0/m248_1_006i.jpg" alt="Figure 6" width="511" height="465"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 6 USA workforce: 1986 averages</span></div></div></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act002_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 4: USA workforce</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>On the basis of <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.7#fig002006">Figure 6</a>, describe how the balance between the genders differs from one ‘employment type’ to another.</p></div>
<div class="oucontentsaqdiscussion"><h3 class="oucontenth4">Discussion</h3><p>This display clearly shows the predominance of men in the <i>Professional, Industrial, Craftsmen</i> and <i>Agricultural</i> categories. In <i>Service</i> and <i>Clerical</i> women outnumber men and, in <i>Clerical</i> in particular, there is a huge imbalance. In <i>Sales</i> the numbers of men and women are very similar.</p></div></div></div></div><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.7#fig002007">Figure 7</a> is an attempt to display the same information using a three-dimensional effect.</p><div class="oucontentfigure" style="width:511px;" id="fig002_007"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/9aaf1bfd/m248_1_007i.jpg" alt="Figure 7" width="511" height="422"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 7 USA workforce data: a three-dimensional bar chart</span></div></div></div><p>It is now much more difficult to identify values. Some blocks are hidden, which makes judgement difficult. The display for <i>Sales</i> is particularly misleading; in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.7#fig002006">Figure 6</a> you can see that the bars are almost the same height, but in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.7#fig002007">Figure 7</a> this is much less obvious.</p><p>Similar, and in some cases even more severe, problems arise with ‘three-dimensional’ pie charts.</p>

1.2.8 Problems with graphics: nuclear power stations
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8
Tue, 26 Jul 2011 23:00:00 GMT
<p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002008">Figure 8</a> shows a pie chart of the data on nuclear power stations from <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.2#tbl001001">Table 1</a>. This diagram is similar to <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.3#fig002002">Figure 2</a>, except that the data for all countries apart from the five with the largest numbers of power stations have been amalgamated into a single ‘Others’ category.</p><div class="oucontentfigure" style="width:511px;" id="fig002_008"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/b7196c6b/m248_1_008i.jpg" alt="Figure 8" width="511" height="284"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 8 Nuclear power stations, smaller groups consolidated</span></div></div></div><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a> shows two attempts to display the same information in a ‘three-dimensional’ form. Can you see how they differ from one another?</p><div class="oucontentfigure" style="width:511px;" id="fig002_009"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/4fe1ad26/m248_1_009i.small.jpg" alt="" width="511" height="171"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">Figure 9 Nuclear power stations: three-dimensional pie charts</span></div></div></div><p>The only difference is the location of the ‘slices’. In <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(b) they have been turned round through an angle of 90 degrees, compared to <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(a). Yet this changes the whole appearance of the diagram. 
For instance, at first glance the UK ‘slice’ looks rather bigger in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(b) than that in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(a), and the Soviet Union ‘slice’ looks bigger in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(a) than that in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(b). In both cases, the differences are apparent rather than real, and are due to the angles at which the ‘slices’ are being viewed. To make comparisons between the sizes of different categories is not as straightforward as it might be even in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002008">Figure 8</a>, because the human perception system finds it harder, on the whole, to compare angles than to compare lengths (as in a bar chart). 
But when you look at <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(a), or <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.8#fig002009">Figure 9</a>(b), you are being asked to make comparisons in a representation, on two-dimensional paper, of angles at an oblique direction to the direction in which you are looking. It is a tribute to the robustness of the human perception system that we can do this at all; but it is far from easy to do it accurately. ‘Three-dimensional’ charts might superficially look nicer, but they can be seriously misleading.</p>
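<p>The effect of the viewing angle can be illustrated with a little geometry. The sketch below is not taken from the course: it assumes a flat pie tilted away from the viewer by a hypothetical angle <code>tilt_deg</code>, so that the circle is drawn as an ellipse, and it ignores the drawn thickness of the slices. The function name <code>apparent_span</code> is an illustrative choice. It computes the angle that a slice appears to subtend on the page.</p>

```python
import math

def apparent_span(start_deg, end_deg, tilt_deg):
    """Angle (degrees) a slice appears to subtend once the pie is tilted.

    Assumes a simple projection in which the vertical direction of the
    circle is squashed by cos(tilt), turning the circle into an ellipse.
    Valid for slices narrower than a full circle.
    """
    squash = math.cos(math.radians(tilt_deg))

    def projected_direction(angle_deg):
        t = math.radians(angle_deg)
        # A point (cos t, sin t) on the circle is drawn at (cos t, squash * sin t).
        return math.atan2(squash * math.sin(t), math.cos(t))

    span = math.degrees(projected_direction(end_deg) - projected_direction(start_deg))
    return span % 360.0

# The same 40-degree slice, first at the side of the pie and then turned
# round through 90 degrees so that it sits at the back, with a 60-degree tilt.
side = apparent_span(-20, 20, tilt_deg=60)
back = apparent_span(70, 110, tilt_deg=60)
print(round(side, 1), round(back, 1))
```

<p>Under these assumptions the two calls return very different apparent angles for a slice whose true angle is 40 degrees in both positions, which is one way to see why rotating the chart changes its appearance without changing the data.</p>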

1.2.9 Pie charts and bar charts: summary
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.9
Tue, 26 Jul 2011 23:00:00 GMT
<p>Two common display methods for data relating to a set of categories have been introduced in this section. In a pie chart, the number in each category is proportional to the angle subtended at the centre of the circular chart by the corresponding ‘slice’. In a bar chart, the number in each category is proportional to the length of the corresponding bar. The bars may be arranged vertically or horizontally, though it is conventional to draw them vertically where the labelling of the chart makes this practicable. You have seen that attempts to make pie charts and bar charts look ‘three-dimensional’ can make them considerably harder to interpret.</p>
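<p>The two proportionality rules in this summary can be written out as a short calculation. The following is a minimal sketch, not part of the course material: the category names and counts are made up for illustration, and the function names <code>pie_angles</code> and <code>bar_lengths</code> are hypothetical.</p>

```python
def pie_angles(counts):
    """Angle in degrees at the centre of the pie for each category."""
    total = sum(counts.values())
    return {name: 360.0 * n / total for name, n in counts.items()}

def bar_lengths(counts, longest=10.0):
    """Bar length for each category, scaled so the largest count has length `longest`."""
    largest = max(counts.values())
    return {name: longest * n / largest for name, n in counts.items()}

# Made-up counts for three categories (not data from the course).
example_counts = {"A": 50, "B": 30, "C": 20}

angles = pie_angles(example_counts)    # angles always add up to 360 degrees
lengths = bar_lengths(example_counts)  # lengths keep the same ratios as the counts
print(angles)
print(lengths)
```

<p>In both cases the displayed quantity is the count multiplied by a constant; only the constant, and whether it is shown as an angle or as a length, differs.</p>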

3.1 Introduction
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.1
Tue, 26 Jul 2011 23:00:00 GMT
<p>In this section, two more kinds of graphical display are introduced – <i>histograms</i> in section 3.2 and <i>scatterplots</i> in section 3.3. Both are most commonly used with data that do not relate to separate categories, unlike pie charts and bar charts. However, as you will see, histograms do have something in common with bar charts. Scatterplots are a very common way of picturing the way in which two different quantities are related to each other.</p>

3.2 Histograms
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2
Tue, 26 Jul 2011 23:00:00 GMT
<p>It is a fundamental principle in modern practical data analysis that all investigations should begin, wherever possible, with one or more suitable diagrams of the data. Such displays should certainly show overall patterns or trends, and should also be capable of isolating unexpected features that might otherwise be missed. The histogram is a commonly used display, which is useful for identifying characteristics of a data set. To illustrate its use, we return to the data set on infants with SIRDS that we looked at briefly in Section 1.4.</p><p>The birth weights of 50 infants with severe idiopathic respiratory distress syndrome were given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.4#tbl001003">Table 3</a>. The list of weights is in itself not very informative, partly because there are so many weights listed. Suppose, however, that the weights are grouped as shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#tbl003001">Table 8</a>.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl003_001"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Table 8 Birth weights (kg)</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Group</th><th scope="col"> Birth weight (kg)</th><th scope="col"> Frequency</th></tr><tr><td>1</td><td>1.0–1.2</td><td>6</td></tr><tr><td>2</td><td>1.2–1.4</td><td>6</td></tr><tr><td>3</td><td>1.4–1.6</td><td>4</td></tr><tr><td>4</td><td>1.6–1.8</td><td>8</td></tr><tr><td>5</td><td>1.8–2.0</td><td>4</td></tr><tr><td>6</td><td>2.0–2.2</td><td>3</td></tr><tr><td>7</td><td>2.2–2.4</td><td>4</td></tr><tr><td>8</td><td>2.4–2.6</td><td>6</td></tr><tr><td>9</td><td>2.6–2.8</td><td>3</td></tr><tr><td>10</td><td>2.8–3.0</td><td>2</td></tr><tr><td>11</td><td>3.0–3.2</td><td>2</td></tr><tr><td>12</td><td>3.2–3.4</td><td>0</td></tr><tr><td>13</td><td>3.4–3.6</td><td>1</td></tr><tr><td>14</td><td>3.6–3.8</td><td>1</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>Such a table is called a <b>grouped frequency table.</b> Each listed frequency gives the number of individuals falling into a particular group: for instance, there were six children with birth weights between 1.0 and 1.2 kilograms. It may occur to you that there is an ambiguity over borderlines, or <b>cutpoints</b>, between the groups. Into which group, for example, should a value of 2.2 go? Should it be included in Group 6 or Group 7? Providing you are consistent with your rule over such borderlines, it really does not matter.</p><p>In fact, among the 50 infants there were two with a recorded birth weight of 2.2 kg and both have been allocated to Group 7. The infant weighing 2.4 kg has been allocated to Group 8. The rule followed here was that borderline cases were allocated to the higher of the two possible groups.</p><p>With the data structured like this, certain characteristics can be seen even though some information has been lost. There seems to be an indication that there are two groupings divided somewhere around 2 kg or, perhaps, three groupings divided somewhere around 1.5 kg and 2 kg. 
But the pattern is far from clear and needs a helpful picture, such as a bar chart. The categories are ordered, and notice also that the groups are contiguous (1.0–1.2, 1.2–1.4, and so on). This reflects the fact that here the variable of interest (birth weight) is not a count but a measurement.</p><p>The distinction between ‘counting’ and ‘measuring’ is quite an important one. In later units we shall be concerned with formulating different models to express the sort of variation that occurs in different sampling contexts, and it matters that the model should be appropriate to the type of data. Data arising from measurements (height, weight, temperature, and so on) are called <b>continuous</b> data. Those arising from counts (family size, hospital admissions, nuclear power stations) are called <b>discrete.</b>
</p><p>In this situation, where we have a grouped frequency table of continuous data, the bars of the bar chart are drawn without gaps between them, as in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003001">Figure 10</a>.</p><p>This kind of bar chart, of continuous data which have been put into a limited number of distinct groups or classes, is called a <b>histogram.</b> In this example, the 50 data items were allocated to groups of width 0.2 kg: there were 14 groups. The classification was quite arbitrary. If the group classifications had been narrower, there would have been more groups each containing fewer observations; if the classifications had been wider, there would have been fewer groups with more observations in each group. The question of an optimal classification is an interesting one, and surprisingly complex.</p><p>How many groups should you choose for a histogram? If you choose too many, the display will be too fragmented to show an overall shape. But if you choose too few, you will not have a picture of the shape: too much of the information in the data will be lost.</p><div class="oucontentfigure" style="width:511px;" id="fig003_001"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/54a68f71/m248_1_010i.jpg" alt="Figure 10" width="511" height="403"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 10 Birth weights (kg) of infants with SIRDS</span></div></div></div><p>When these data were introduced in section 1.4, the questions posed were as follows. Do the children split into two identifiable groups? And is it possible to relate the chances of survival to birth weight? We are not, as yet, in a position to answer these questions, but we can see that the birth weights might split into two or even three ‘clumps’. On the other hand, can we be sure that this is no more than a consequence of the way in which the borderlines for the groups were chosen? Suppose, for example, we had decided to make the intervals of width 0.3 kg instead of 0.2 kg. We would have had fewer groups, with Group 1 containing birth weights from 1.0 to 1.3 kg, Group 2 containing birth weights from 1.3 to 1.6 kg, and so on, producing the histogram in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003002">Figure 11</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig003_002"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/348e5a7a/m248_1_011i.jpg" alt="Figure 11" width="511" height="385"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 11 Birth weights, 0.3 kg interval widths</span></div></div></div><p>The histogram in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003002">Figure 11</a> looks quite different to that in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003001">Figure 10</a>, but then this is not surprising as the whole display has been compressed into fewer bars. The basic shape remains similar, so you might be tempted to conclude that the choice of grouping does not really matter. But suppose we retain groupings of width 0.3 kg and choose a different starting point. Suppose we make Group 1 go from 0.8 to 1.1 kg, Group 2 from 1.1 to 1.4 kg, and so on. The resulting histogram is shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003003">Figure 12</a>(a). 
In <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003003">Figure 12</a>(b), the groups again have width 0.3 kg, but this time the first group starts at 0.9 kg.</p><div class="oucontentfigure" style="width:511px;" id="fig003_003"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/788d6da5/m248_1_012i.small.jpg" alt="" width="511" height="197"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">Figure 12 Birth weights, 0.3 kg interval widths</span></div></div></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act003_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 5: Comparing histograms</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>What information do the histograms in Figures 10, 11 and 12 give about the possibility that the children are split into two (or more) identifiable groups on the basis of birth weight?</p></div>
<div class="oucontentsaqdiscussion"><h3 class="oucontenth4">Discussion</h3><p>You might have felt that only Figures 10 and 12(b) give a really clear indication that the data are split into two ‘clumps’. Figures 10, 11 and 12(a) all give, to varying degrees, the impression that there is perhaps an identifiable group of babies with particularly low birth weights.</p></div></div></div></div><p>What you have seen in Figures 10 to 12 is a series of visual displays of a data set which warn you against trying to reach firm conclusions from histograms. It is important to realise that histograms often produce only a vague impression of the data – nothing more. One of the problems here is that we have only 50 data values, which is not really enough for a clear pattern to be evident. However, the histograms all convey one very important message: the data do not appear in a single, concentrated clump. Clearly it is a good idea to look at the way frequencies of data, such as the birth weights, are distributed and, given that a statistical computer package will quickly produce a histogram for you, comparatively little effort is required. This makes the histogram a valuable analytic tool and, in spite of some disadvantages, you will find that you use it a great deal.</p><p>It is, of course, quite feasible to produce grouped frequency tables and draw histograms by hand. However, the process can be very long-winded, and in practice statisticians almost always use a computer to produce them.</p>
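<p>As a rough illustration of how a computer might build a grouped frequency table, the sketch below applies the rule used in this section, that a borderline value goes into the higher of the two possible groups. It is only a sketch under stated assumptions: the weights listed are made-up values, not the SIRDS data from Table 3, and the function name <code>grouped_frequencies</code> is hypothetical rather than part of any statistical package.</p>

```python
from collections import Counter

def grouped_frequencies(values_kg, start_kg, width_kg):
    """Map group number (1, 2, 3, ...) to the number of values in that group.

    Group 1 covers start_kg up to start_kg + width_kg, and so on.
    A value lying exactly on a cutpoint is allocated to the higher group.
    """
    # Work in whole hundredths of a kilogram so that cutpoints such as 1.2
    # are compared exactly, avoiding floating-point rounding surprises.
    start = round(start_kg * 100)
    width = round(width_kg * 100)
    counts = Counter()
    for value in values_kg:
        offset = round(value * 100) - start
        if offset < 0:
            raise ValueError(f"{value} kg lies below the first group")
        counts[offset // width + 1] += 1
    return dict(sorted(counts.items()))

# Made-up birth weights in kg, for illustration only (not the Table 3 data).
weights = [1.05, 1.2, 1.3, 1.7, 1.75, 2.2, 2.2, 2.4, 2.5, 2.55, 3.0, 3.64]

fine = grouped_frequencies(weights, start_kg=1.0, width_kg=0.2)     # as in Table 8
shifted = grouped_frequencies(weights, start_kg=0.8, width_kg=0.3)  # as in Figure 12(a)
print(fine)
print(shifted)
```

<p>With the first grouping, both values of 2.2 fall in Group 7 and the value of 2.4 falls in Group 8, matching the borderline rule described earlier. Changing only <code>start_kg</code> and <code>width_kg</code> regroups the same values, which is why the resulting histograms can look so different.</p>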
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2
3.2: HistogramsM248_1<p>It is a fundamental principle in modern practical data analysis that all investigations should begin, wherever possible, with one or more suitable diagrams of the data. Such displays should certainly show overall patterns or trends, and should also be capable of isolating unexpected features that might otherwise be missed. The histogram is a commonly used display, which is useful for identifying characteristics of a data set. To illustrate its use, we return to the data set on infants with SIRDS that we looked at briefly in Section 1.4.</p><p>The birth weights of 50 infants with severe idiopathic respiratory distress syndrome were given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.4#tbl001003">Table 3</a>. The list of weights is in itself not very informative, partly because there are so many weights listed. Suppose, however, that the weights are grouped as shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#tbl003001">Table 8</a>.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl003_001"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Table 8 Birth weights (kg)</h2><div class="oucontenttablewrapper"><table><tr><th scope="col">Group</th><th scope="col"> Birth weight (kg)</th><th scope="col"> Frequency</th></tr><tr><td>1</td><td>1.0–1.2</td><td>6</td></tr><tr><td>2</td><td>1.2–1.4</td><td>6</td></tr><tr><td>3</td><td>1.4–1.6</td><td>4</td></tr><tr><td>4</td><td>1.6–1.8</td><td>8</td></tr><tr><td>5</td><td>1.8–2.0</td><td>4</td></tr><tr><td>6</td><td>2.0–2.2</td><td>3</td></tr><tr><td>7</td><td>2.2–2.4</td><td>4</td></tr><tr><td>8</td><td>2.4–2.6</td><td>6</td></tr><tr><td>9</td><td>2.6–2.8</td><td>3</td></tr><tr><td>10</td><td>2.8–3.0</td><td>2</td></tr><tr><td>11</td><td>3.0–3.2</td><td>2</td></tr><tr><td>12</td><td>3.2–3.4</td><td>0</td></tr><tr><td>13</td><td>3.4–3.6</td><td>1</td></tr><tr><td>14</td><td>3.6–3.8</td><td>1</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>Such a table is called a <b>grouped frequency table.</b> Each listed frequency gives the number of individuals falling into a particular group: for instance, there were six children with birth weights between 1.0 and 1.2 kilograms. It may occur to you that there is an ambiguity over borderlines, or <b>cutpoints</b>, between the groups. Into which group, for example, should a value of 2.2 go? Should it be included in Group 6 or Group 7? Providing you are consistent with your rule over such borderlines, it really does not matter.</p><p>In fact, among the 50 infants there were two with a recorded birth weight of 2.2 kg and both have been allocated to Group 7. The infant weighing 2.4 kg has been allocated to Group 8. The rule followed here was that borderline cases were allocated to the higher of the two possible groups.</p><p>With the data structured like this, certain characteristics can be seen even though some information has been lost. There seems to be an indication that there are two groupings divided somewhere around 2 kg or, perhaps, three groupings divided somewhere around 1.5 kg and 2 kg. 
But the pattern is far from clear and needs a helpful picture, such as a bar chart. The categories are ordered, and notice also that the groups are contiguous (1.0–1.2, 1.2–1.4, and so on). This reflects the fact that here the variable of interest (birth weight) is not a count but a measurement.</p><p>The distinction between ‘counting’ and ‘measuring’ is quite an important one. In later units we shall be concerned with formulating different models to express the sort of variation that occurs in different sampling contexts, and it matters that the model should be appropriate to the type of data. Data arising from measurements (height, weight, temperature, and so on) are called <b>continuous</b> data. Those arising from counts (family size, hospital admissions, nuclear power stations) are called <b>discrete.</b>
</p><p>In this situation, where we have a grouped frequency table of continuous data, the bars of the bar chart are drawn without gaps between them, as in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003001">Figure 10</a>.</p><p>This kind of bar chart, of continuous data which have been put into a limited number of distinct groups or classes, is called a <b>histogram.</b> In this example, the 50 data items were allocated to groups of width 0.2 kg: there were 14 groups. The classification was quite arbitrary. If the group classifications had been narrower, there would have been more groups each containing fewer observations; if the classifications had been wider, there would have been fewer groups with more observations in each group. The question of an optimal classification is an interesting one, and surprisingly complex.</p><p>How many groups should you choose for a histogram? If you choose too many, the display will be too fragmented to show an overall shape. But if you choose too few, you will not have a picture of the shape: too much of the information in the data will be lost.</p><div class="oucontentfigure" style="width:511px;" id="fig003_001"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/54a68f71/m248_1_010i.jpg" alt="Figure 10" width="511" height="403"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 10 Birth weights (kg) of infants with SIRDS</span></div></div></div><p>When these data were introduced in section 1.4, the questions posed were as follows. Do the children split into two identifiable groups? And is it possible to relate the chances of survival to birth weight? We are not, as yet, in a position to answer these questions, but we can see that the birth weights might split into two or even three ‘clumps’. On the other hand, can we be sure that this is no more than a consequence of the way in which the borderlines for the groups were chosen? Suppose, for example, we had decided to make the intervals of width 0.3 kg instead of 0.2 kg. We would have had fewer groups, with Group 1 containing birth weights from 1.0 to 1.3 kg, Group 2 containing birth weights from 1.3 to 1.6 kg, and so on, producing the histogram in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003002">Figure 11</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig003_002"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/348e5a7a/m248_1_011i.jpg" alt="Figure 11" width="511" height="385"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 11 Birth weights, 0.3 kg interval widths</span></div></div></div><p>The histogram in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003002">Figure 11</a> looks quite different to that in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003001">Figure 10</a>, but then this is not surprising as the whole display has been compressed into fewer bars. The basic shape remains similar, so you might be tempted to conclude that the choice of grouping does not really matter. But suppose we retain groupings of width 0.3 kg and choose a different starting point. Suppose we make Group 1 go from 0.8 to 1.1 kg, Group 2 from 1.1 to 1.4 kg, and so on. The resulting histogram is shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003003">Figure 12</a>(a). In <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.2#fig003003">Figure 12</a>(b), the groups again have width 0.3 kg, but this time the first group starts at 0.9 kg.</p><div class="oucontentfigure" style="width:511px;" id="fig003_003"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/788d6da5/m248_1_012i.small.jpg" alt="" width="511" height="197"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">Figure 12 Birth weights, 0.3 kg interval widths</span></div></div></div><div class="
oucontentactivity
oucontentsheavybox1 oucontentsbox " id="act003_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 5: Comparing histograms</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>What information do the histograms in Figures 10, 11 and 12 give about the possibility that the children are split into two (or more) identifiable groups on the basis of birth weight?</p></div>
<div class="oucontentsaqdiscussion"><h3 class="oucontenth4">Discussion</h3><p>You might have felt that only Figures 10 and 12(b) give a really clear indication that the data are split into two ‘clumps’. Figures 10, 11 and 12(a) all give, to varying degrees, the impression that there is perhaps an identifiable group of babies with particularly low birth weights.</p></div></div></div></div><p>What you have seen in Figures 10 to 12 is a series of visual displays of a data set which warn you against trying to reach firm conclusions from histograms. It is important to realise that histograms often produce only a vague impression of the data – nothing more. One of the problems here is that we have only 50 data values, which is not really enough for a clear pattern to be evident. However, the histograms all convey one very important message: the data do not appear in a single, concentrated clump. Clearly it is a good idea to look at the way frequencies of data, such as the birth weights, are distributed and, given that a statistical computer package will quickly produce a histogram for you, comparatively little effort is required. This makes the histogram a valuable analytic tool and, in spite of some disadvantages, you will find that you use it a great deal.</p><p>It is, of course, quite feasible to produce grouped frequency tables and draw histograms by hand. However, the process can be very long-winded, and in practice statisticians almost always use a computer to produce them.</p>
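<p>The grouping rule described for Table 8 (borderline values go to the higher of the two possible groups) and the effect of changing the group width or starting point can be sketched in a few lines of Python. This is only an illustration: the function names and the sample weights are hypothetical (the sample is not the SIRDS data), and weights are converted to whole grams on the assumption that they are recorded to the nearest gram, so that a value such as 2.4 kg is not misplaced by floating-point rounding.</p>

```python
from collections import Counter


def group_index(weight_kg, start_kg=1.0, width_kg=0.2):
    """Return the 1-based group number for a weight.

    Borderline values are allocated to the higher group, as in Table 8.
    Everything is converted to whole grams before dividing, so the
    comparison with the cutpoints is done in exact integer arithmetic.
    """
    grams = round(weight_kg * 1000)
    start = round(start_kg * 1000)
    width = round(width_kg * 1000)
    return (grams - start) // width + 1


def grouped_frequencies(weights_kg, start_kg=1.0, width_kg=0.2):
    """Build a grouped frequency table as {group number: frequency}."""
    counts = Counter(group_index(w, start_kg, width_kg) for w in weights_kg)
    return dict(sorted(counts.items()))


# Hypothetical weights in kg, for illustration only (not the SIRDS data).
sample = [1.05, 1.2, 1.3, 1.75, 2.2, 2.2, 2.4, 3.1]

print(grouped_frequencies(sample))            # width 0.2 kg, starting at 1.0 kg
print(grouped_frequencies(sample, 1.0, 0.3))  # width 0.3 kg, starting at 1.0 kg
print(grouped_frequencies(sample, 0.8, 0.3))  # width 0.3 kg, starting at 0.8 kg
```

<p>Running the three calls on the same values shows how the frequencies, and hence the shape of the histogram, depend on both the width and the starting point chosen.</p>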

3.3: Scatterplots
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.3
Tue, 26 Jul 2011 23:00:00 GMT
<p>In recent years, graphical displays have come into prominence because computers have made them quick and easy to produce. Techniques of data exploration have been developed which have revolutionised the subject of statistics, and today no serious data analyst would carry out a formal numerical procedure without first inspecting the data by eye. Nowhere is this demonstrated more forcibly than in the way a scatterplot reveals a relationship between two variables.</p><p>Look at <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.3#fig003004">Figure 13</a>, which displays the data on cirrhosis and alcoholism from <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.6#tbl001005">Table 5</a>. This display is a <b>scatterplot.</b>
</p><div class="oucontentfigure" style="width:511px;" id="fig003_004"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/78f06457/m248_1_013i.jpg" alt="Figure 13" width="511" height="363"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 13 Alcohol-related deaths and consumption</span></div></div></div><p>In a scatterplot, one variable is plotted on the horizontal axis and the other on the vertical axis. Each data item corresponds to a point in two-dimensional space. For example, the average annual consumption of alcohol in France for the time over which the data were collected was 24.7 litres per person, and the death rate per hundred thousand of the population through cirrhosis and alcoholism was 46.1. In this diagram consumption is plotted along the horizontal axis and death rate is plotted up the vertical axis. The data point at the coordinate (24.7, 46.1) corresponds to France.</p><p>Is there a strong relationship between the two variables? In other words, do the points appear to fit fairly ‘tightly’ about a straight line or a curve? It is fairly obvious that there is a relationship, although the overall pattern is not easy to see since most of the points are concentrated in the bottom left-hand corner. There is one point that is a long way from the others and the size of the diagram relative to the page is dictated by the available space into which it must fit. We remarked upon this point, corresponding to France, when we first looked at the data, but seeing it here really does put into perspective the magnitude of the difference between France and the other countries. The best way to look for a general relationship between death rate and consumption of alcohol is to spread out the points representing the more conventional drinking habits of other countries by leaving France, an extreme case, out of the plot. The picture, given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.3#fig003005">Figure 14</a>, is now much clearer. 
It shows up a general (and hardly surprising) rule that the incidence of death through alcohol-related disease is strongly linked to average alcohol consumption, the relationship being plausibly linear. A ‘linear’ relationship means that we could draw a straight line through the points that would fit them quite well, and this has been done in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.3#fig003005">Figure 14</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig003_005"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/67c1452a/m248_1_014i.jpg" alt="Figure 14" width="511" height="316"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 14 Alcohol-related deaths and consumption, excluding France</span></div></div></div><p>Of course, we would not expect the points to sit precisely on the line but to be scattered about it tightly enough for the relationship to show. In this case you could conclude that, given the average annual alcohol consumption in a country not included on the scatterplot, the straight line could be used with fair confidence to provide a reasonable estimate of the national death rate due to cirrhosis and alcoholism.</p><p>It is worth mentioning at this stage that demonstrating the existence of some sort of association is not the same thing as demonstrating causation: that is, in this case, that alcohol use ‘causes’ (or makes more likely) cirrhosis or an early death. For example, if cirrhosis were stress-related, so might be alcohol consumption, and hence the apparent relationship. It should also be noted that these data were averaged over large populations and (whatever may be inferred from them) they say nothing about the consequences of alcohol use for an individual.</p><p>France was left out because that data point was treated as an extreme case. It corresponded to data values so atypical, and so far removed from the others, that we were wary of using them to draw general conclusions.</p><p>‘Extreme’, ‘unrepresentative’, ‘atypical’ or possibly ‘rogue’ observations in sets of data are sometimes called <b>outliers.</b> It is important to recognise that, while we would wish to eliminate from a statistical analysis data points which were erroneous (wrongly recorded, perhaps, or observed when background circumstances had profoundly altered), data points that appear ‘surprising’ are not necessarily ‘wrong’. The identification of outliers, and what to do with them, is a research question of great interest to the statistician. Once a possible outlier has been identified, it should be closely inspected and its apparently aberrant behaviour accounted for. 
If it is to be excluded from the analysis there must be sound reasons for its exclusion. Only then can the data analyst be happy about discarding it. An example will illustrate the point.</p>
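<p>To see how much a single extreme point can influence a fitted straight line, the following Python sketch fits a line with and without France. The text does not say how the line in Figure 14 was obtained; ordinary least squares is assumed here purely for illustration. Only the France values (24.7, 46.1) come from the text: the other countries and their values are invented placeholders, not the data in Table 5, and the function names are hypothetical.</p>

```python
def fit_line(points):
    """Least-squares straight line through (x, y) points.

    Returns (slope, intercept) for the line y = intercept + slope * x.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(line, x):
    """Estimate y at x from a (slope, intercept) pair."""
    slope, intercept = line
    return intercept + slope * x


# (consumption in litres per person, deaths per 100 000).
# Only France is taken from the text; the rest are invented placeholders.
data = {
    "France": (24.7, 46.1),
    "Country A": (3.0, 4.0),
    "Country B": (5.5, 8.5),
    "Country C": (7.0, 10.0),
    "Country D": (9.5, 15.5),
}

all_points = list(data.values())
without_france = [xy for name, xy in data.items() if name != "France"]

print("with France:   ", fit_line(all_points))
print("without France:", fit_line(without_france))
print("estimate at 6 litres:", predict(fit_line(without_france), 6.0))
```

<p>Comparing the two fitted lines gives a numerical counterpart to comparing Figures 13 and 14. As the text stresses, leaving a point out in this way needs sound reasons; the code only shows what difference the omission makes.</p>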

3.4: Scatterplots: body weights and brain weights for animals
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4
Tue, 26 Jul 2011 23:00:00 GMT
<p>In our discussion of the data on body weights and brain weights for animals in section 1.7, we conjectured a strong relationship between these weights on the grounds that a large body might well need a large brain to run it properly. At that stage a ‘difficulty’ with the data was also suggested, but we did not say exactly what it was. It would, you might reasonably have thought, be useful to look at a scatterplot, but you will see the difficulty if you actually try to produce one. Did you spot the problem when it was first mentioned in section 1.7? There are many very small weights such as those for the hamster and the mouse which simply do not show up properly if displayed on the same plot as, say, those for animals like the elephant! <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003006">Figure 15</a> shows the difficulty very clearly.</p><div class="oucontentfigure" style="width:511px;" id="fig003_006"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/3a4a7858/m248_1_015i.jpg" alt="Figure 15" width="511" height="323"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 15 Body weight and brain weight</span></div></div></div><p>You cannot see anything from this scatterplot. The many very small weights are all lumped together in order to allow a sufficient spread on the scale to include the heavy ones on the plot. As it stands, the plot is pretty well useless.</p><p>Now, this sort of thing often happens and the usual way of getting round the problem is to <i>transform</i> the data in such a way as to spread out the points with very small values of either variable, and to pull closer together the points with very large values for either variable. The objective is to reduce the spread in the large values relative to the spread in the small values. In this case it can be done by plotting the logarithm of brain weight against the logarithm of body weight. The log transformation compresses the large values but stretches the small ones. (Notice that simply treating the large values as outliers and removing them would not solve the problem because the tight clumping of points close to the origin would still remain to some extent. Also, there are in this case several possible outliers, and in general it is not good practice simply to throw data out of an analysis without at least considering potential reasons why these points should not be considered along with all the rest.)</p><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003007">Figure 16</a> shows the scatterplot that is obtained after applying a log transformation to both variables.</p><div class="oucontentfigure" style="width:511px;" id="fig003_007"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/6c624a60/m248_1_016i.jpg" alt="Figure 16" width="511" height="348"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 16 Body weights and brain weights after a log transformation</span></div></div></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act003_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 6: Interpreting a scatterplot</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>What information does <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003007">Figure 16</a> give about the relationship between body weight and brain weight? Are there any points that you might consider as outliers?</p></div>
<div class="oucontentsaqdiscussion"><h3 class="oucontenth4">Discussion</h3><p>The plot immediately reveals three apparent outliers to the right of the main band of points. Excluding these three species, there is a convincing linear relationship, although there are two or three points that are slightly above the general pattern of the others and hence appear to have high brain weight to body weight ratios.</p><p>When you discover the animals to which the three ‘obvious’ outlying points correspond you will not be surprised. One way of identifying them is by labelling all the animals with the first letters of the names of their species and plotting the letters in place of the points. The resulting scatterplot is shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003008">Figure 17</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig003_008"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/4c9b184a/m248_1_017i.jpg" alt="Figure 17" width="511" height="350"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 17 Scatterplot with points labelled</span></div></div></div><p>A comparison of the letters with the values in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.7#tbl001006">Table 6</a> shows that the three outliers, labelled B, D and T, correspond to the dinosaurs <i>Brachiosaurus, Diplodocus</i> and <i>Triceratops</i>. The human, mole and Rhesus monkey all appear to have rather high brain weight in relation to body weight, but they are by no means as extreme compared to the general pattern as are the three dinosaur species.</p></div></div></div></div>
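<p>The effect of the log transformation can be checked numerically. The sketch below is an illustration only: the three (body weight, brain weight) pairs are invented values spanning several orders of magnitude, not entries from Table 6, base-10 logarithms are an arbitrary choice (any base gives the same picture up to scale), and the function names are hypothetical.</p>

```python
import math


def log_transform(pairs):
    """Apply a base-10 log to both coordinates of each (body, brain) pair."""
    return [(math.log10(body), math.log10(brain)) for body, brain in pairs]


def gaps(values):
    """Differences between consecutive values after sorting."""
    ordered = sorted(values)
    return [b - a for a, b in zip(ordered, ordered[1:])]


# Invented (body weight, brain weight) pairs; not the values in Table 6.
animals = {
    "small": (0.02, 0.4),
    "medium": (50.0, 400.0),
    "large": (5000.0, 5000.0),
}

raw = list(animals.values())
logged = log_transform(raw)

raw_body_gaps = gaps(body for body, _ in raw)
log_body_gaps = gaps(body for body, _ in logged)

# On the raw scale the small-to-medium gap is tiny next to the
# medium-to-large gap; on the log scale the two gaps are comparable.
print("raw body-weight gaps:", raw_body_gaps)
print("log body-weight gaps:", log_body_gaps)
```

<p>On the raw scale the small and medium animals would be plotted almost on top of each other, which is the difficulty seen in Figure 15; after taking logs the points are spread much more evenly along the axis.</p>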
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4
3.4: Scatterplots: body weights and brain weights for animalsM248_1<p>In our discussion of the data on body weights and brain weights for animals in section 1.7, we conjectured a strong relationship between these weights on the grounds that a large body might well need a large brain to run it properly. At that stage a ‘difficulty’ with the data was also suggested, but we did not say exactly what it was. It would, you might reasonably have thought, be useful to look at a scatterplot, but you will see the difficulty if you actually try to produce one. Did you spot the problem when it was first mentioned in section 1.7? There are many very small weights such as those for the hamster and the mouse which simply do not show up properly if displayed on the same plot as, say, those for animals like the elephant! <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003006">Figure 15</a> shows the difficulty very clearly.</p><div class="oucontentfigure" style="width:511px;" id="fig003_006"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/3a4a7858/m248_1_015i.jpg" alt="Figure 15" width="511" height="323"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 15 Body weight and brain weight</span></div></div></div><p>You cannot see anything from this scatterplot. The many very small weights are all lumped together in order to allow a sufficient spread on the scale to include the heavy ones on the plot. As it stands, the plot is pretty well useless.</p><p>Now, this sort of thing often happens and the usual way of getting round the problem is to <i>transform</i> the data in such a way as to spread out the points with very small values of either variable, and to pull closer together the points with very large values for either variable. The objective is to reduce the spread in the large values relative to the spread in the small values. In this case it can be done by plotting the logarithm of brain weight against the logarithm of body weight. The log transformation compresses the large values but stretches the small ones. (Notice that simply treating the large values as outliers and removing them would not solve the problem because the tight clumping of points close to the origin would still remain to some extent. Also, there are in this case several possible outliers, and in general it is not good practice simply to throw data out of an analysis without at least considering potential reasons why these points should not be considered along with all the rest.)</p><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003007">Figure 16</a> shows the scatterplot that is obtained after applying a log transformation to both variables.</p><div class="oucontentfigure" style="width:511px;" id="fig003_007"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/6c624a60/m248_1_016i.jpg" alt="Figure 16" width="511" height="348"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 16 Body weights and brain weights after a log transformation</span></div></div></div><div class="
oucontentactivity
oucontentsheavybox1 oucontentsbox " id="act003_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 6: Interpreting a scatterplot</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>What information does <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003007">Figure 16</a> give about the relationship between body weight and brain weight? Are there any points that you might consider as outliers?</p></div>
<div class="oucontentsaqdiscussion"><h3 class="oucontenth4">Discussion</h3><p>The plot immediately reveals three apparent outliers to the right of the main band of points. Excluding these three species, there is a convincing linear relationship, although there are two or three points that are slightly above the general pattern of the others and hence appear to have high brain weight to body weight ratios.</p><p>When you discover the animals to which the three ‘obvious’ outlying points correspond you will not be surprised. One way of identifying them is by labelling all the animals with the first letters of the names of their species and plotting the letters in place of the points. The resulting scatterplot is shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.4#fig003008">Figure 17</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig003_008"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/4c9b184a/m248_1_017i.jpg" alt="Figure 17" width="511" height="350"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 17 Scatterplot with points labelled</span></div></div></div><p>A comparison of the letters with the values in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.7#tbl001006">Table 6</a> shows that the three outliers, labelled B, D and T, correspond to the dinosaurs <i>Brachiosaurus, Diplodocus</i> and <i>Triceratops</i>. The human, mole and Rhesus monkey all appear to have rather high brain weight in relation to body weight, but they are by no means as extreme compared to the general pattern as are the three dinosaur species.</p></div></div></div></div>

1.3.5 Histograms and scatterplots: summary
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection4.5
Tue, 26 Jul 2011 23:00:00 GMT
<p>Two common graphical displays, most frequently used for continuous data (arising from measurements), have been introduced in this section. A histogram is in a sense a development of the idea of a bar chart. A set of continuous data is divided up into groups, the frequencies in the groups are found, and a histogram is produced by drawing vertical bars, without gaps between them, whose heights are proportional to the frequencies in the groups. You have seen that the shape of a histogram drawn from a particular data set can depend on the choices made for the grouping of the data.</p><p>Scatterplots represent the relationship between two variables. The variables generally have to be numerical, and are commonly continuous, though they may also be discrete (counted). One variable is plotted on the horizontal axis and the other on the vertical axis. One point is plotted, in the appropriate position, for each individual entity (person, animal, country) in the data set. As well as making it easy to identify any general pattern, such as a straight line, in the relationship between the variables, a scatterplot can help in the identification of outliers. These are data points that lie a long way from the general pattern in the data. In some cases, the patterns shown in a scatterplot can be made clearer by omitting an outlier, though this is very often not an advisable thing to do. In other cases, it may help to transform the data by applying some appropriate function to one or both of the variables involved.</p>
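<p>The dependence of a histogram's shape on the grouping can be sketched in a few lines of Python. The example below is only an illustration: it assumes the eleven β endorphin concentrations of the collapsed runners from Table 4 as data, and the function name <code>group_frequencies</code> and the two groupings are hypothetical choices made for this sketch, not part of the course.</p>

```python
from collections import Counter

# Blood plasma beta endorphin concentrations (pmol/l) of the eleven
# collapsed runners in Table 4 (assumed here as example data).
concentrations = [66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414]


def group_frequencies(values, start, width):
    """Return {(lower, upper): frequency} for groups lower <= x < upper.

    Hypothetical helper: assumes every value is at least `start`.
    """
    # Index of the group each value falls into.
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in range(max(counts) + 1)
    }


def print_histogram(values, start, width):
    """Print one row of '#' marks per group, as a crude text histogram."""
    for (lower, upper), freq in group_frequencies(values, start, width).items():
        print(f"{lower}-{upper}: {'#' * freq}")


# Two different groupings of the same data give different shapes.
print_histogram(concentrations, start=50, width=50)
print()
print_histogram(concentrations, start=0, width=100)
```

<p>With groups of width 50 the data show two equally tall bars followed by a shorter one; with groups of width 100 the second bar is the tallest. In both cases the value 414 stands well apart from the rest.</p>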

1.4.1 Introduction
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.1
Tue, 26 Jul 2011 23:00:00 GMT
<p>Histograms provide a quick way of looking at data sets, but they lose sight of individual observations and they tend to play down any ‘intuitive feel’ for the magnitude of the numbers themselves. We may often want to summarise the data in numerical terms; for example, we could use a number to summarise the general level (or <i>location</i>) of the values and, perhaps, another number to indicate how spread out or dispersed they are. In this section you will learn about some numerical summaries that are used for both of these purposes: measures of location are discussed in sections 4.2 to 4.5 and measures of dispersion in sections 4.6 to 4.9. In section 4.11 you will be introduced to the important concept of <i>skewness</i> (lack of symmetry) in a data set.</p>

1.4.2 Measures of location
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.2
Tue, 26 Jul 2011 23:00:00 GMT
<p>Everyone professes to understand what is meant by the term ‘average’, in that it should be representative of a group of objects. The objects may well be numbers from, say, a batch or sample of measurements, in which case the average should be a number which in some way characterises the batch as a whole. For example, the statement ‘a typical adult female in Britain is 160 cm tall’ would be understood by most people who heard it. Obviously not all adult females in Britain are the same height: there is considerable variation. To state that a ‘typical’ height is 160 cm is to ignore the variation and summarise the distribution of heights with a single number. Even so, it may be all that is needed to answer certain questions. (For example, is a typical adult female shorter than a typical adult male?)</p><p>But how should this representative value be chosen? Should it be a typical member of the group or should it be some representative measure which can be calculated from the collection of individual data values? Believe it or not, there are no straightforward answers to these questions. In fact, two different ways of expressing a representative value are commonly used in statistics, namely the <i>median</i> and the <i>mean</i>. The choice of which of these provides the better representative numerical summary is fairly arbitrary and is based entirely upon the nature of the data themselves, or the particular preference of the data analyst, or the use to which the summary statement is to be put. The median and the mean are both examples of <b>measures of location</b> of a data set; here the word ‘location’ is essentially being used in the sense of the position of a typical data value along some sort of coordinate axis.</p><p>We deal with the median and the mean in turn, as well as considering the concept of the mode of a data set. (In a sense the mode is another measure of location.)</p>

1.4.3 The median
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3
Tue, 26 Jul 2011 23:00:00 GMT
<p>The median describes the central value of a set of data. Here, to be precise, we are discussing the <i>sample</i> median, in contrast to the <i>population</i> median.</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">The sample median</h2><div class="oucontentinnerbox"><p>The <b>median</b> of a sample of data with an odd number of data values is defined to be the middle value of the data set when the values are placed in order of increasing size. If the sample size is even, then the median is defined to be halfway between the two middle values. In this course, the median is denoted by <i>m</i>.</p></div></div></div>
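<p>The definition in the box translates directly into a short procedure: sort the values, then take the middle one (odd sample size) or the point halfway between the two middle ones (even sample size). The following is a minimal sketch in Python; the function name <code>sample_median</code> is a hypothetical choice for illustration.</p>

```python
def sample_median(values):
    """Return the sample median m of a non-empty collection of numbers."""
    ordered = sorted(values)  # place the values in order of increasing size
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd sample size: the single middle value.
        return ordered[middle]
    # Even sample size: halfway between the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(sample_median([3, 1, 2]))      # odd sample size
print(sample_median([4, 1, 3, 2]))   # even sample size
```

<p>For the sample 3, 1, 2 the sorted values are 1, 2, 3 and the median is 2; for 4, 1, 3, 2 the two middle values are 2 and 3, so the median is 2.5.</p>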

1.4.3.1 Beta endorphin concentration (collapsed runners)
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.1
Tue, 26 Jul 2011 23:00:00 GMT
<p>The final column of <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.5#tbl001004">Table 4</a> contains the blood plasma β endorphin concentrations for eleven runners who collapsed towards the end of the Great North Run. The observations are already sorted. They are as follows.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl004_000"><div class="oucontenttablewrapper"><table><tr><td>66</td><td>72</td><td>79</td><td>84</td><td>102</td><td>110</td><td>123</td><td>144</td><td>162</td><td>169</td><td>414</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>Eleven is an odd number, so the middle value of the data set is the sixth value (five either side). So, in this case, the sample median is 110 pmol/l.</p>
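<p>This calculation can be checked with a computer. The sketch below assumes Python and its standard <code>statistics</code> module; the variable names are illustrative only.</p>

```python
import statistics

# Beta endorphin concentrations (pmol/l) of the eleven collapsed runners,
# already sorted as in the text.
concentrations = [66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414]

n = len(concentrations)
# For an odd sample size the median is the value in position (n + 1) / 2.
middle_position = (n + 1) // 2
m = statistics.median(concentrations)

print(f"sample size: {n}")
print(f"middle position: {middle_position}")
print(f"sample median: {m} pmol/l")
```

<p>The middle position is the sixth, and the value found there, 110, agrees with the library result.</p>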

1.4.3.2 Birth weights of infants with SIRDS
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.2
Tue, 26 Jul 2011 23:00:00 GMT
<p>The data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.4#tbl001003">Table 3</a> are the birth weights (in kg) of 50 infants suffering from severe idiopathic respiratory distress syndrome. There are two groups of infants: those who survived the condition (there were 23 of these) and those who, unfortunately, did not. The data have not been sorted, and it is not an entirely trivial exercise to do this by hand (though it is a task that a computer can handle very easily). The sorted data are given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.2#tbl004001">Table 9</a>.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl004_001"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Table 9 Birth weights (in kg) of infants with severe idiopathic respiratory distress syndrome</h2><div class="oucontenttablewrapper"><table><tr><td>1.030*</td><td>1.300*</td><td>1.720</td><td>2.090</td><td>2.570</td></tr><tr><td>1.050*</td><td>1.310*</td><td>1.750*</td><td>2.200*</td><td>2.600</td></tr><tr><td>1.100*</td><td>1.410</td><td>1.760</td><td>2.200</td><td>2.700</td></tr><tr><td>1.130</td><td>1.500*</td><td>1.770*</td><td>2.270*</td><td>2.730*</td></tr><tr><td>1.175*</td><td>1.550*</td><td>1.820*</td><td>2.275*</td><td>2.830</td></tr><tr><td>1.185*</td><td>1.575</td><td>1.890*</td><td>2.400</td><td>2.950</td></tr><tr><td>1.225*</td><td>1.600*</td><td>1.930</td><td>2.440*</td><td>3.005</td></tr><tr><td>1.230*</td><td>1.680</td><td>1.940*</td><td>2.500*</td><td>3.160</td></tr><tr><td>1.262*</td><td>1.715</td><td>2.015</td><td>2.550</td><td>3.400</td></tr><tr><td>1.295*</td><td>1.720*</td><td>2.040</td><td>2.560*</td><td>3.640</td></tr><tr><td colspan="2">* child died</td><td/><td/><td/></tr></table></div><div class="oucontentsourcereference"></div></div><p>The sample size is even, so the sample median is defined to be the value halfway between the 25th and 26th observations. That is to say, it is obtained by splitting the difference between 1.820 (the 25th value) and 1.890 (the 26th value). This gives</p><p>½(1.820 + 1.890) = 1.855 kg</p><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act004_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 7: Birth weights of infants with SIRDS</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Find the median birth weight for the infants who survived, and for those who did not.</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><p>There were 23 children who survived the condition. Their birth weights are 1.130, 1.410, 1.575, 1.680, 1.715, 1.720, 1.760, 1.930, 2.015, 2.040, 2.090, 2.200, 2.400, 2.550, 2.570, 2.600, 2.700, 2.830, 2.950, 3.005, 3.160, 3.400, 3.640. The median birth weight for these children is 2.200 kg (the 12th value in the sorted list).</p><p>There were 27 children who died. The sorted birth weights are 1.030, 1.050, 1.100, 1.175, 1.185, 1.225, 1.230, 1.262, 1.295, 1.300, 1.310, 1.500, 1.550, 1.600, 1.720, 1.750, 1.770, 1.820, 1.890, 1.940, 2.200, 2.270, 2.275, 2.440, 2.500, 2.560, 2.730. The middle value is the 14th (thirteen either side) so the median birth weight for the children who died is 1.600 kg.</p></div></div></div></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act004_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 8: Beta endorphin concentration (successful runners)</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>The first two columns of <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.5#tbl001004">Table 4</a> give the blood plasma β endorphin concentrations of eleven runners before and after completing the Great North Run successfully. There is a marked difference between these concentrations. The data are reproduced in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.2#tbl004002">Table 10</a> below with the ‘After – Before’ differences also shown.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl004_002"><h3 class="oucontenth3 oucontentheading oucontentnonumber">
Table 10 Differences in pre- and post-race β endorphin concentrations</h3><div class="oucontenttablewrapper"><table><tr><td>Before</td><td>4.3</td><td>4.6</td><td>5.2</td><td>5.2</td><td>6.6</td><td>7.2</td><td>8.4</td><td>9.0</td><td>10.4</td><td>14.0</td><td>17.8</td></tr><tr><td>After</td><td>29.6</td><td>25.1</td><td>15.5</td><td>29.6</td><td>24.1</td><td>37.8</td><td>20.2</td><td>21.9</td><td>14.2</td><td>34.6</td><td>46.2</td></tr><tr><td>Difference</td><td>25.3</td><td>20.5</td><td>10.3</td><td>24.4</td><td>17.5</td><td>30.6</td><td>11.8</td><td>12.9</td><td>3.8</td><td>20.6</td><td>28.4</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>Find the median of the ‘After – Before’ differences given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.2#tbl004002">Table 10</a>.</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><p>The ordered differences are 3.8, 10.3, 11.8, 12.9, 17.5, 20.5, 20.6, 24.4, 25.3, 28.4, 30.6. The median difference is 20.5 pmol/l.</p></div></div></div></div>
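<p>The medians found in this subsection can be reproduced by computer, which also takes care of the sorting. The following is a sketch only, assuming Python's standard <code>statistics</code> module; the data are copied from Table 9 and Table 10, and the variable names are illustrative.</p>

```python
import statistics

# Birth weights (kg) from Table 9, split by outcome.
survived = [1.130, 1.410, 1.575, 1.680, 1.715, 1.720, 1.760, 1.930,
            2.015, 2.040, 2.090, 2.200, 2.400, 2.550, 2.570, 2.600,
            2.700, 2.830, 2.950, 3.005, 3.160, 3.400, 3.640]
died = [1.030, 1.050, 1.100, 1.175, 1.185, 1.225, 1.230, 1.262, 1.295,
        1.300, 1.310, 1.500, 1.550, 1.600, 1.720, 1.750, 1.770, 1.820,
        1.890, 1.940, 2.200, 2.270, 2.275, 2.440, 2.500, 2.560, 2.730]

# Beta endorphin concentrations (pmol/l) from Table 10.
before = [4.3, 4.6, 5.2, 5.2, 6.6, 7.2, 8.4, 9.0, 10.4, 14.0, 17.8]
after = [29.6, 25.1, 15.5, 29.6, 24.1, 37.8, 20.2, 21.9, 14.2, 34.6, 46.2]
# Rounding to one decimal place removes floating-point noise.
differences = [round(a - b, 1) for a, b in zip(after, before)]

# statistics.median sorts the data internally.
median_all = statistics.median(survived + died)
median_survived = statistics.median(survived)
median_died = statistics.median(died)
median_difference = statistics.median(differences)

print(f"all 50 infants: {median_all:.3f} kg")
print(f"survived (n = {len(survived)}): {median_survived:.3f} kg")
print(f"died (n = {len(died)}): {median_died:.3f} kg")
print(f"After - Before differences: {median_difference:.1f} pmol/l")
```

<p>The four values agree with those obtained by hand: 1.855 kg, 2.200 kg, 1.600 kg and 20.5 pmol/l.</p>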

1.4.4 The mean
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.3
Tue, 26 Jul 2011 23:00:00 GMT
<p>The second measure of location defined in this course for a collection of data is the <i>mean</i>. Again, to be precise, we are discussing the <i>sample</i> mean, as opposed to the <i>population</i> mean. This is what most individuals would understand by the word ‘average’. All the items in the data set are added together, giving the <i>sample total</i>. This total is divided by the number of items (the sample size).</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">The sample mean</h2><div class="oucontentinnerbox"><p>The <b>mean</b> of a sample is the arithmetic average of the data values. It is obtained by adding together all of the data values and dividing this total by the number of items in the sample.</p><p>If the <i>n</i> items in a data set are denoted <i>x</i><sub>1</sub>, <i>x</i><sub>2</sub>,…, <i>x</i><sub><i>n</i></sub>, then the sample size is <i>n</i>, and the sample mean, which is denoted <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/dc7280af/m248_1_ie001i.jpg" alt="" width="15" height="14"/></span>, is given by</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_003"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/9ce0f5f5/m248_1_ue003i.jpg" alt=""/></div><p>The symbol <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/dc7280af/m248_1_ie001i.jpg" alt="" width="15" height="14"/></span> denoting the sample mean is read ‘<i>x</i>-bar’.</p></div></div></div><p>Recall that the symbol for the Greek uppercase letter sigma Σ is used to mean ‘the sum of’. The expression</p><div class="oucontentfigure oucontentmediamini" id="eqn003"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/42811896/m248_1_ie003i.jpg" alt="" width="31" height="62"/></div><p>which reads ‘sigma <i>i</i> equals 1 to <i>n</i>’, means the sum of the terms <i>x</i><sub>1</sub>, <i>x</i><sub>2</sub>,…, <i>x</i><sub><i>n</i></sub>.</p><p>From the data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.5#tbl001004">Table 4</a> (repeated at the start of section 4.2), the mean β endorphin concentration (in pmol/l) of collapsed runners is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_004"><a href="http://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=4084&extra=thumbnail_idp3388816" title="View larger image"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/9c304567/m248_1_ue004i.small.jpg" alt=""/></a><div class="oucontentfiguretext"><div class="oucontentthumbnaillink"><a href="http://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=4084&extra=thumbnail_idp3388816">View larger image</a></div><div class="oucontentcaption oucontentnonumber oucontentcaptionplaceholder"> </div></div><a id="back_thumbnail_idp3388816"></a></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act004_003"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 9: Beta endorphin concentration (successful runners)</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Find the mean of the ‘After – Before’ differences given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.2#tbl004002">Table 10</a>.</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>The mean ‘After – Before’ difference (in pmol/l) is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0035"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/0e4460f7/m248_1_ue035i.jpg" alt=""/></div></div></div></div></div><p>Two plausible measures of location have been defined for describing a typical or representative value for a sample of data. Which measure should be chosen in a statement of that typical value? In the examples we have looked at in this section, there has been little to choose between the two. Are there principles that should be followed? As you might expect there are no hard and fast rules: it all depends on the data that we are trying to summarise, and our aim in summarising them.</p><p>To a large extent deciding between using the sample mean and the sample median depends on how the data are distributed. If their distribution appears to be regular and concentrated in the middle of their range, the mean is usually used. When a computer is not available, the mean is easier to calculate than the median because no sorting is involved and, as you will see later in the course, it is easier to use for drawing inferences about the population from which the sample has been taken.</p><p>If, however, the data are irregularly distributed with apparent outliers present, then the sample median is often preferred in quoting a typical value, since it is less sensitive to such irregularities. You can see this by looking again at the data on collapsed runners in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.5#tbl001004">Table 4</a>. 
The mean endorphin concentration is 138.6 pmol/l, whereas the median concentration is 110 pmol/l. The large discrepancy is due to the outlier with an endorphin concentration of 414. Excluding this outlier brings the mean down to 111.1 while the median decreases only to 106. From this we see that the median is more stable than the mean in the sense that outliers exert less influence upon it. The word <b>resistant</b> is sometimes used to describe measures which are insensitive to outliers. The median is said to be a resistant measure, whereas the mean is not resistant.</p><p>A general comment on the use of certain familiar words in statistics is appropriate here. Notice the use of the word ‘range’ in the second paragraph after Activity 9. The word is used there in its everyday sense, to describe the extent of the values observed in a sample, as in ‘the observed weights ranged from a minimum of 1.03 kg to a maximum of 3.64 kg’. It need not be an exact statement: ‘the range of observed weights was from 1 kg to about 4 kg’. However, in Subsection 4.6 you will see the word ‘range’ used in a technical sense, as a measure of dispersion in data. This often happens in statistics: a familiar word is given a technical meaning. Terms you will come across later in the course include expectation, likelihood, confidence, estimator and significant. But we would not wish this to preclude normal English usage of such words. It will usually be clear from the context when the technical sense is intended.</p>
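<p>The resistance of the median can be checked numerically. The following is a minimal sketch in Python, not part of the original course. Table 4 is not reproduced in this section, so the list <code>collapsed</code> is an assumption: its values are assumed to be those of Table 4 and are consistent with the summaries quoted above.</p>

```python
from statistics import mean, median

# Assumed Table 4 values (beta endorphin concentration, pmol/l).
# The table itself is not shown in this section; these values are
# consistent with the mean, median, minimum and maximum quoted in the text.
collapsed = [66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414]

# The same sample with the single outlier (414) removed.
without_outlier = [x for x in collapsed if x != 414]

mean_all = round(mean(collapsed), 1)
median_all = median(collapsed)
mean_trimmed = round(mean(without_outlier), 1)
median_trimmed = median(without_outlier)

print("with outlier:    mean", mean_all, "median", median_all)
print("without outlier: mean", mean_trimmed, "median", median_trimmed)
```

<p>Under these assumptions, removing one value moves the mean by about 27 pmol/l but the median by only 4 pmol/l, which is what is meant by calling the median resistant.</p>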

4.5: The mode
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.4
Tue, 26 Jul 2011 23:00:00 GMT
<p>The USA workforce data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.3#tbl001002">Table 2</a> were usefully summarised in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection3.7#fig002006">Figure 6</a>, which is reproduced below as <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.4#fig004001">Figure 18</a>.</p><div class="oucontentfigure" style="width:511px;" id="fig004_001"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/9ea9c9d4/m248_1_018i.jpg" alt="Figure 18" width="511" height="466"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 18 Employment in USA</span></div></div></div><p>The variable recorded here is ‘type of employment’ (professional, industrial, clerical, and so on) so the data are categorical and not amenable to ordering. In this context the notion of ‘mean type of employment’ or ‘median type of employment’ is not a sensible one. For any data set, a third representative measure which is sometimes used is the <b>mode</b>. It describes the most frequently occurring observation. Thus, for males in employment in the USA during 1986, the <i>modal</i> type of employment was ‘professional’ while, for females, the modal type of employment was ‘clerical’.</p><p>The word mode can also reasonably be applied to numerical data, referring again to the most frequently occurring observation. But there is a problem of definition. For the birth weight data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.4#tbl001003">Table 3</a>, there were two duplicates: two of the infants weighed 1.72 kg, and another two weighed 2.20 kg. So there would appear to be two modes, and yet to report either one of them as a representative weight is to make a great deal of an arithmetic accident. If the data are classified into groups, then you can see from Figures 10 to 12 in section 3.1 that even the definition of a ‘modal group’ will depend on the definition of borderlines (and on what to do with borderline cases). The number of histogram peaks as well as their locations can alter.</p><p>Yet it often happens that a collection of data presents a very clear picture of an underlying pattern, and one which would be robust against changes in group definition. In such a case it is common to identify as modes not just the most frequently occurring observation (the highest peak) but every peak. Here are two examples.</p>

4.5.1 Chest measurements of Scottish soldiers
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.5
Tue, 26 Jul 2011 23:00:00 GMT
<p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.5#fig004002">Figure 19</a> shows a histogram of chest measurements (in inches) of a sample of 5732 Scottish soldiers.</p><div class="oucontentfigure" style="width:511px;" id="fig004_002"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/6823ad53/m248_1_019i.jpg" alt="Figure 19" width="511" height="369"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 19 Chest measurements (inches), source: Stigler, S.M. (1986) <i>The History of Statistics: The Measurement of Uncertainty before 1900</i>. Belknap Press of Harvard University Press, p. 208.</span></div></div></div><p>This data set is discussed further later in the course; for the moment, simply observe that there is an evident single mode at around 40 inches. The data are said to be <b>unimodal</b>.</p>

Waiting times between geyser eruptions
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.6
Tue, 26 Jul 2011 23:00:00 GMT
<p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.6#fig004003">Figure 20</a> shows a histogram of waiting times, varying from about 40 minutes to about 110 minutes.</p><div class="oucontentfigure" style="width:511px;" id="fig004_003"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/163aebe6/m248_1_020i.jpg" alt="Figure 20" width="511" height="392"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 20 Waiting times (minutes), source: Azzalini, A. and Bowman, A.W. (1990) A look at some data on the Old Faithful geyser. <i>Applied Statistics</i>, <b>39</b>, 357–366.</span></div></div></div><p>In fact, these are waiting times between the starts of successive eruptions of the Old Faithful geyser in the Yellowstone National Park, Wyoming, USA, during August 1985. Observe the two modes. These data are said to be <b>bimodal</b>.
</p><p>Sometimes data sets may exhibit three modes (<b>trimodal</b>) or many modes (<b>multimodal</b>). You should be wary of too precise a description. Both the data sets in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.5#fig004002">Figures 19</a> and <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.6#fig004003">20</a> were based on large samples, and their message is unambiguous. As you will see later in the course, smaller data sets can give rise to very jagged histograms indeed, and any message about one or more preferred observations is consequently very unclear.</p>
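<p>The caution about over-precise descriptions applies especially to modes computed from exact repeated values. The following minimal Python sketch is not part of the original course. The function name <code>modes</code> and the list <code>weights</code> are hypothetical: only the two duplicated values, 1.72 kg and 2.20 kg, come from the discussion of the birth weight data, and the other values are invented for illustration.</p>

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical birth weights (kg): only the duplicated values 1.72 and 2.20
# are taken from the text; the rest are made up for this sketch.
weights = [1.13, 1.72, 1.72, 2.20, 2.20, 2.57, 3.01]
weight_modes = modes(weights)

# Categorical data: the mode is simply the most common category.
jobs = ["clerical", "professional", "clerical", "industrial"]
job_modes = modes(jobs)

print(weight_modes, job_modes)
```

<p>For the categorical list the single mode is meaningful, whereas the two 'modes' of the numerical list arise only because two measurements happen to coincide. This is why, for numerical data, modes are better read from the peaks of a histogram based on a large sample.</p>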

4.6: Measures of dispersion
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.4
Tue, 26 Jul 2011 23:00:00 GMT
<p>During the above discussion of suitable numerical summaries for a typical value (measures of location), you may have noticed that it was not possible to make any kind of decision about the relative merits of the sample mean and median without introducing the notion of the extent of variation of the data. In practice, this means that the amount of information contained in these measures, when taken in isolation, is not sufficient to describe the appearance of the data. A more informative numerical summary is needed. In other words, if we are to be happy about replacing a full data set by a few summary numbers, we need some measure of the <i>dispersion</i>, sometimes called the <i>spread</i>, of observations.</p><p>The <b>range</b> is the difference between the smallest and largest data values. It is certainly the simplest measure of dispersion, but it can be misleading. The range of β endorphin concentrations for collapsed runners is 414 − 66 = 348 pmol/l, suggesting a fairly wide spread. However, omitting the value 414 reduces the range to 169 − 66 = 103 pmol/l. This sensitivity to a single data value suggests that the range is not a very reliable measure; a much more modest assessment of dispersion may be more appropriate. By its very nature, the range is always going to give prominence to outliers and therefore cannot sensibly be used to describe the spread of the bulk of the data.</p><p>This example indicates the need for an alternative to the range as a measure of dispersion, and one which is not over-influenced by the presence of a few extreme values. In fact, we shall discuss in turn two different measures of dispersion: the <i>interquartile range</i> and the <i>standard deviation</i>.</p>
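<p>The sensitivity of the range to a single value can be reproduced in a few lines. This is a minimal Python sketch, not part of the original course. The helper <code>sample_range</code> is a hypothetical name, and the list <code>collapsed</code> is assumed to hold the Table 4 values, which are not shown in this section; only the values 66, 169 and 414 are quoted in the text.</p>

```python
def sample_range(values):
    """Range of a sample: largest value minus smallest value."""
    return max(values) - min(values)

# Assumed Table 4 values (pmol/l); only 66, 169 and 414 appear in the text.
collapsed = [66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414]

range_all = sample_range(collapsed)
range_trimmed = sample_range([x for x in collapsed if x != 414])

print(range_all, range_trimmed)
```

<p>Because the range depends only on the two most extreme values, the nine values between them play no part in the result.</p>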

4.7: Quartiles and the interquartile range
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.5
Tue, 26 Jul 2011 23:00:00 GMT
<p>The first alternative measure of dispersion we shall discuss is the interquartile range: this is the difference between summary measures known as the lower and upper quartiles. The quartiles are simple in concept: if the median is regarded as the middle data point, so that it splits the data in half, the quartiles similarly split the data into quarters. This is, of course, an oversimplification. With an even number of data points, the median is defined to be the average of the middle two: defining quartiles is a little more complicated.</p><p>It would be convenient to express our wordy definition of the median in a concise symbolic form, and this is easy to do. Any data sample of size <i>n</i> may be written as a list of numbers</p><p>
<i>x</i><sub>1</sub>, <i>x</i><sub>2</sub>, <i>x</i><sub>3</sub>, …, <i>x</i><sub><i>n</i></sub>
</p><p>In order to calculate the sample median it is necessary to sort the data so that they are written in order of increasing size. The sorted list can then be written as</p><p>
<i>x</i><sub>(1)</sub>, <i>x</i><sub>(2)</sub>, <i>x</i><sub>(3)</sub>, …, <i>x</i><sub>(<i>n</i>)</sub>
</p><p>where <i>x</i><sub>(1)</sub> is the smallest value in the original list (the minimum) and <i>x</i><sub>(<i>n</i>)</sub> is the largest (the maximum). In general, the notation <i>x</i><sub>(<i>p</i>)</sub> is used to mean the <i>p</i>th value when the data are arranged in order of increasing size. Each successive item in the ordered list is greater than or equal to the previous item. For instance, the list of six data items</p><p>7, 1, 3, 6, 3, 7</p><p>may be ordered as</p><p>1, 3, 3, 6, 7, 7</p><p>So, for these data, <i>x</i><sub>(1)</sub> = 1, <i>x</i><sub>(2)</sub> = <i>x</i><sub>(3)</sub> = 3, <i>x</i><sub>(4)</sub> = 6, <i>x</i><sub>(5)</sub> = <i>x</i><sub>(6)</sub> = 7.</p><p>In any such ordered list, the sample median <i>m</i> may be defined to be the number</p><p>
<i>m</i> = <i>x</i><sub>(½(<i>n</i> + 1))</sub>
</p><p>as long as the subscript on the right-hand side is appropriately interpreted.</p><p>If the sample size <i>n</i> is odd, then the number ½(<i>n</i> + 1) is an integer, and there is no problem of definition. For instance, if <i>n</i> = 27 then ½(<i>n</i> + 1) = 14, and the sample median is <i>m</i> = <i>x</i><sub>(14)</sub>, the fourteenth value in the ordered list, with thirteen values on either side of it.</p><p>If the sample size <i>n</i> is even, then the number ½(<i>n</i> + 1) is not an integer but has a fractional part equal to ½. For instance, if <i>n</i> = 6 (as in the example above) then the sample median is</p><p>
<i>m</i> = <i>x</i><sub>(½(<i>n</i> + 1))</sub> = <i>x</i><sub>(3½)</sub>.</p><p>Numbers such as 3½ are sometimes called ‘half-integers’.</p><p>If the number <i>x</i><sub>(3½)</sub> is interpreted as ‘the number halfway between <i>x</i><sub>(3)</sub> and <i>x</i><sub>(4)</sub>’ then you can see that the wordy definition survives intact: for the six values above, <i>m</i> = ½(3 + 6) = 4.5. This obvious interpretation of numbers such as <i>x</i><sub>(3½)</sub> can be extended to numbers such as <i>x</i><sub>(2¼)</sub> and <i>x</i><sub>(4¾)</sub>: <i>x</i><sub>(2¼)</sub> is the number one-quarter of the way from <i>x</i><sub>(2)</sub> to <i>x</i><sub>(3)</sub>, and <i>x</i><sub>(4¾)</sub> is the number three-quarters of the way from <i>x</i><sub>(4)</sub> to <i>x</i><sub>(5)</sub>. Interpreting fractional subscripts in this way when they occur, the lower quartile (roughly, one-quarter of the way into the data set) and the upper quartile (approximately three-quarters of the way through the data set) may be defined as follows.</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_003"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Sample quartiles</h2><div class="oucontentinnerbox"><p>If a data set <i>x</i>
<sub>1</sub>, <i>x</i><sub>2</sub>, …, <i>x</i><sub><i>n</i></sub> is reordered as <i>x</i><sub>(1)</sub>, <i>x</i><sub>(2)</sub>, …, <i>x</i><sub>(<i>n</i>)</sub>, where</p><p>
<i>x</i><sub>(1)</sub> ≤ <i>x</i><sub>(2)</sub> ≤ … ≤ <i>x</i><sub>(<i>n</i>)</sub></p><p>then the <b>lower sample quartile</b>
<i>q<sub>L</sub>
</i> is defined by</p><p>
<i>q<sub>L</sub>
</i> = <i>x</i>
<sub>(¼(<i>n</i>+1))</sub>
</p><p>and the <b>upper sample quartile</b>
<i>q<sub>U</sub>
</i> is defined by</p><p>
<i>q<sub>U</sub>
</i> = <i>x</i>
<sub>(¾(<i>n</i>+1))</sub>
</p></div></div></div><p>Unfortunately, there is no universally accepted definition for sample quartiles, nor, indeed, a universally accepted nomenclature. The lower and upper sample quartiles are sometimes called the first and third sample quartiles. The median is the second sample quartile. Other definitions are possible, and you may even be familiar with some of them. For instance, some practitioners use</p><p>
<i>q<sub>L</sub>
</i> = <i>x</i>
<sub>(¼<i>n</i>+½)</sub>, <i>q<sub>U</sub>
</i> = <i>x</i>
<sub>(¾<i>n</i>+½)</sub>
</p><p>Others use</p><p>
<i>q<sub>L</sub>
</i> = <i>x</i>
<sub>(¼<i>n</i>+¾)</sub>, <i>q<sub>U</sub>
</i> = <i>x</i>
<sub>(¾<i>n</i>+¼)</sub>
</p><p>Still others insist that the lower and upper quartiles be defined in such a way that they are identified uniquely with actual sample items. However, almost all definitions of the sample median reduce to the same thing.</p>
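<p>The boxed definition, with fractional subscripts interpreted by interpolation, can be sketched in Python as follows. This is an illustration only, not part of the original course: the function names <code>order_stat</code> and <code>quartile_summary</code> are hypothetical, and the sketch assumes a sample of at least three values so that the positions ¼(<i>n</i> + 1) and ¾(<i>n</i> + 1) fall inside the ordered list.</p>

```python
import math

def order_stat(sorted_values, p):
    """Return x_(p) for a 1-based, possibly fractional, position p.

    A fractional p is interpreted as in the text: x_(2.25) is one-quarter
    of the way from x_(2) to x_(3).
    """
    lower = math.floor(p)
    frac = p - lower
    value = sorted_values[lower - 1]
    if frac > 0:
        value += frac * (sorted_values[lower] - sorted_values[lower - 1])
    return value

def quartile_summary(data):
    """Return (q_L, m, q_U, interquartile range) using the (n + 1) rule."""
    xs = sorted(data)
    n = len(xs)
    if n < 3:
        raise ValueError("this sketch assumes at least three data values")
    q_lower = order_stat(xs, (n + 1) / 4)
    m = order_stat(xs, (n + 1) / 2)
    q_upper = order_stat(xs, 3 * (n + 1) / 4)
    return q_lower, m, q_upper, q_upper - q_lower

# The six-item example from the text: 7, 1, 3, 6, 3, 7.
print(quartile_summary([7, 1, 3, 6, 3, 7]))
```

<p>For the six-item example the positions are 1¾, 3½ and 5¼, giving a lower quartile of 2.5, a median of 4.5 and an upper quartile of 7, so the difference between the quartiles is 4.5. Statistical software may use one of the other conventions mentioned above, so its quartiles need not agree exactly with these.</p>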

1.4.7.1 Quartiles for the SIRDS data
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.5.1
Tue, 26 Jul 2011 23:00:00 GMT
<p>For the 23 infants who survived SIRDS, the ordered birth weights are given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.2#tbl004001">Table 9</a>. The first quartile is</p><p>
<i>q<sub>L</sub>
</i> = <i>x</i>
<sub>(¼(23+1))</sub> = <i>x</i>
<sub>(6)</sub> = 1.720 kg.</p><p>The third quartile is</p><p>
<i>q<sub>U</sub>
</i> = <i>x</i>
<sub>(¾(23+1))</sub> = <i>x</i>
<sub>(18)</sub> = 2.830 kg.</p>
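<p>The subscripts above are whole numbers because 23 + 1 is divisible by 4. The following sketch, which is an illustration rather than part of the course, computes the two subscripts exactly for any sample size and reports whether interpolation between neighbouring values will be needed. The function names are hypothetical.</p>

```python
from fractions import Fraction


def quartile_positions(n):
    """Return the subscripts 1/4(n+1) and 3/4(n+1) as exact fractions."""
    return Fraction(n + 1, 4), Fraction(3 * (n + 1), 4)


def needs_interpolation(n):
    """True when either subscript is not a whole number."""
    lower, upper = quartile_positions(n)
    return lower.denominator != 1 or upper.denominator != 1


# Sample sizes that appear in this course's examples and activities.
for n in (23, 27, 22, 6):
    lower, upper = quartile_positions(n)
    print(n, lower, upper, needs_interpolation(n))
```

<p>Sample sizes 23 and 27 give whole-number subscripts (6 and 18, 7 and 21), whereas 22 and 6 give fractional ones (5¾ and 17¼, 1¾ and 5¼).</p>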

1.4.7.2 Quartiles when the sample size is awkward
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.5.2
Tue, 26 Jul 2011 23:00:00 GMT
<p>For the six ordered data items 1, 3, 3, 6, 7, 7, the lower quartile is given by</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0017"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/a93bc015/m248_1_ue017i.jpg" alt=""/></div><p>In other words, the lower quartile <i>q<sub>L</sub>
</i> is given by the number three-quarters of the way between <i>x</i>
<sub>(1)</sub>=1 and <i>x</i>
<sub>(2)</sub>=3. The difference between <i>x</i>
<sub>(1)</sub> and <i>x</i>
<sub>(2)</sub> is 2, so</p><p>
<i>q<sub>L</sub>
</i> = <i>x</i>
<sub>(1)</sub> + ¾(<i>x</i>
<sub>(2)</sub> – <i>x</i>
<sub>(1)</sub>) = 1 + ¾ × 2 = 2.5.</p><p>The upper quartile is given by</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0019"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/0c1c1068/m248_1_ue019i.jpg" alt=""/></div><p>So the upper quartile <i>q<sub>U</sub>
</i> is the number one-quarter of the way between <i>x</i>
<sub>(5)</sub>=7 and <i>x</i>
<sub>(6)</sub>=7. This is just the number 7 itself.</p><p>Having defined the quartiles, it is straightforward to define the measure of dispersion based on them: the interquartile range is the difference between the quartiles.</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_004"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">The interquartile range</h2><div class="oucontentinnerbox"><p>The <b>interquartile range</b>, which is a measure of the dispersion in a data set, is the difference <i>q<sub>U</sub>−q<sub>L</sub>
</i> between the upper quartile <i>q<sub>U</sub>
</i> and the lower quartile <i>q<sub>L</sub>
</i>.</p></div></div></div>
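<p>As a worked illustration (added here, not taken from the course), the calculation for the six ordered items 1, 3, 3, 6, 7, 7 can be written out in full, using the definitions given above with <i>n</i> = 6:</p>

```latex
\begin{align*}
q_L &= x_{(\frac14(6+1))} = x_{(1\frac34)}
     = x_{(1)} + \tfrac34\bigl(x_{(2)} - x_{(1)}\bigr)
     = 1 + \tfrac34 \times 2 = 2.5,\\
q_U &= x_{(\frac34(6+1))} = x_{(5\frac14)}
     = x_{(5)} + \tfrac14\bigl(x_{(6)} - x_{(5)}\bigr)
     = 7 + \tfrac14 \times 0 = 7,\\
q_U - q_L &= 7 - 2.5 = 4.5.
\end{align*}
```

<p>So the interquartile range of this small data set is 4.5.</p>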

1.4.7.3 Interquartile range for the SIRDS data
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.5.3
Tue, 26 Jul 2011 23:00:00 GMT
<p>For the 23 infants who survived SIRDS, the lower quartile is <i>q<sub>L</sub>
</i>=1.720 kg, and the upper quartile is <i>q<sub>U</sub>
</i>=2.830 kg. Thus the interquartile range (in kg) is</p><p>
<i>q<sub>U</sub>
</i> – <i>q<sub>L</sub>
</i> = 2.830 – 1.720 = 1.110.</p><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act004_004"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 10: More on the SIRDS data</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Find the lower and upper quartiles, and the interquartile range, for the birth weight data on those children with SIRDS who died. The ordered data are in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.3.2#tbl004001">Table 9</a>.</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>The lower quartile birth weight (in kg) for the 27 children who died is given by</p><p>
<i>q<sub>L</sub>
</i> = <i>x</i>
<sub>(¼(<i>n</i>+1))</sub> = <i>x</i>
<sub>(7)</sub> = 1.230.</p><p>The upper quartile birth weight (in kg) is</p><p>
<i>q<sub>U</sub>
</i> = <i>x</i>
<sub>(¾(<i>n</i>+1))</sub> = <i>x</i>
<sub>(21)</sub> = 2.200.</p><p>The interquartile range (in kg) is</p><p>
<i>q<sub>U</sub>
</i> – <i>q<sub>L</sub>
</i> = 2.200 – 1.230 = 0.970.</p></div></div></div></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act004_005"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 11: Chondrite meteors</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Find the median, the lower and upper quartiles, and the interquartile range for the data in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.5.3#tbl004003">Table 11</a>, which give the percentage of silica found in each of 22 chondrite meteors. (The data are ordered.)</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl004_003"><h3 class="oucontenth3 oucontentheading oucontentnonumber">
<b>Table 11</b> Silica content of chondrite meteors</h3><div class="oucontenttablewrapper"><table><tr><td>20.77</td><td>22.56</td><td>22.71</td><td>22.99</td><td>26.39</td><td>27.08</td><td>27.30</td><td>27.33</td></tr><tr><td>27.57</td><td>27.81</td><td>28.69</td><td>29.36</td><td>30.25</td><td>31.89</td><td>32.88</td><td>33.23</td></tr><tr><td>33.28</td><td>33.40</td><td>33.52</td><td>33.83</td><td>33.95</td><td>34.82</td><td/><td/></tr></table></div><div class="oucontentsourcereference"></div></div><p>(Good, I.J. and Gaskins, R.A. (1980) Density estimation and bump-hunting by the penalized likelihood method exemplified by scattering and meteorite data. <i>J. American Statistical Association</i>, <b>75</b>, 42–56.)</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>For the silica data, the sample size <i>n</i> is 22. The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0039"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/c6acd458/m248_1_ue039i.jpg" alt=""/></div><p>So <i>q<sub>L</sub>
</i> is three-quarters of the way between</p><p>
<i>x</i>
<sub>(5)</sub>=26.39 and <i>x</i>
<sub>(6)</sub>=27.08. That is</p><p>
<i>q<sub>L</sub>
</i> = 26.39 + ¾(27.08 – 26.39) = 26.9075,</p><p>or approximately 26.91. The sample median is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0041"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/bd5bbfeb/m248_1_ue041i.jpg" alt=""/></div><p>This is midway between <i>x</i>
<sub>(11)</sub>=28.69 and <i>x</i>
<sub>(12)</sub>=29.36. That is 29.025, or approximately 29.03.</p><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0042"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/99dcf297/m248_1_ue042i.jpg" alt=""/></div><p>So <i>q<sub>U</sub>
</i> is one-quarter of the way between <i>x</i>
<sub>(17)</sub>=33.28 and <i>x</i>
<sub>(18)</sub>=33.40. That is</p><p>
<i>q<sub>U</sub>
</i> = 33.28 + ¼(33.40 – 33.28) = 33.31</p><p>The interquartile range is</p><p>
<i>q<sub>U</sub>
</i> – <i>q<sub>L</sub>
</i> = 33.31 – 26.9075 = 6.4025,</p><p>or approximately 6.40.</p></div></div></div></div>
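<p>The Activity 11 solution can be checked in software. The sketch below is an addition, not part of the course: it uses <code>statistics.quantiles</code> from Python's standard library (Python 3.8 or later) with <code>method="exclusive"</code>, which places the cut points at positions <i>i</i>(<i>n</i>+1)/4 and interpolates linearly, so it agrees with the definition of <i>q<sub>L</sub></i> and <i>q<sub>U</sub></i> used in this course. The variable names are illustrative.</p>

```python
import statistics

# Percentage of silica in 22 chondrite meteors (Table 11, already ordered).
silica = [
    20.77, 22.56, 22.71, 22.99, 26.39, 27.08, 27.30, 27.33,
    27.57, 27.81, 28.69, 29.36, 30.25, 31.89, 32.88, 33.23,
    33.28, 33.40, 33.52, 33.83, 33.95, 34.82,
]

# n=4 asks for the three cut points that split the data into quarters.
lower_q, median, upper_q = statistics.quantiles(silica, n=4, method="exclusive")
iqr = upper_q - lower_q

print(f"lower quartile      = {lower_q:.4f}")
print(f"median              = {median:.4f}")
print(f"upper quartile      = {upper_q:.4f}")
print(f"interquartile range = {iqr:.4f}")
```

<p>Up to floating-point rounding, the printed values match the hand calculation: 26.9075, 29.025, 33.31 and 6.4025.</p>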

1.4.8 The standard deviation
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.6
Tue, 26 Jul 2011 23:00:00 GMT
<p>The interquartile range is a useful measure of dispersion in the data and it has the excellent property of not being too sensitive to outlying data values. (That is, it is a resistant measure.) However, like the median, it does suffer from the disadvantage that its calculation involves sorting the data. This can be very time-consuming for large samples when a computer is not available to do the calculations. A measure that does not require sorting of the data and, as you will find in later units, has good statistical properties is the <i>standard deviation</i>.</p><p>The standard deviation is defined in terms of the differences between the data values (<i>x<sub>i</sub>
</i>) and their mean <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/6b48477d/m248_1_ie009i.jpg" alt="" width="30" height="27"/></span>. These differences <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/49e92afa/m248_1_ie010i.jpg" alt="" width="74" height="27"/></span>, which may be positive or negative, are called <b>residuals</b>.</p><div class="oucontentexample oucontentsheavybox1 oucontentsbox " id="exm004_007"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Example 1 Calculating residuals</h2><div class="oucontentinnerbox"><p>The mean difference in β endorphin concentration for the eleven runners in section 1.4 who completed the Great North Run is 18.74 pmol/l (to two decimal places). The eleven residuals are given in the following table.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbli001"><div class="oucontenttablewrapper"><table><tr><td>Difference, <i>x<sub>i</sub>
</i>
</td><td>25.3</td><td>20.5</td><td>10.3</td><td>24.4</td><td>17.5</td><td>30.6</td><td>11.8</td><td>12.9</td><td>3.8</td><td>20.6</td><td>28.4</td></tr><tr><td>Mean, <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/dc6cbcab/m248_1_ie011i.jpg" alt="" width="18" height="18"/></span>
</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td><td>18.74</td></tr><tr><td>Residual, <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/3e9d5b7e/m248_1_ie012i.jpg" alt="" width="57" height="18"/></span>
</td><td>6.56</td><td>1.76</td><td>−8.44</td><td>5.66</td><td>−1.24</td><td>11.86</td><td>−6.94</td><td>−5.84</td><td>−14.94</td><td>1.86</td><td>9.66</td></tr></table></div><div class="oucontentsourcereference"></div></div></div></div></div><p>For a sample of size <i>n</i> consisting of the data values <i>x</i>
<sub>1</sub>, <i>x</i>
<sub>2</sub>, …, <i>x<sub>n</sub>
</i> and having mean <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/ec5bef14/m248_1_ie013i.jpg" alt="" width="14" height="18"/></span>, the <i>i</i>th residual may be written as</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0021"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/861f29ca/m248_1_ue021i.jpg" alt=""/></div><p>These residuals all contribute to an overall measure of dispersion in the data. Large negative and large positive values both indicate observations far removed from the sample mean. In some way they need to be combined into a single number.</p><p>There is not much point in averaging them: positive residuals will cancel out negative ones. In fact their sum is zero, since</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0022"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/1e03cb5e/m248_1_ue022i.jpg" alt=""/></div><p>Therefore their average is also zero. What is important is the magnitude of each residual, the absolute difference <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/c00fbc39/m248_1_ie014i.jpg" alt="" width="68" height="26"/></span>. The absolute residuals could be added together and averaged, but this measure (known as the <i>mean absolute deviation</i>) does not possess very convenient mathematical properties. Another way of eliminating minus signs is by squaring the residuals. If these squares are averaged and the square root is then taken, the result is a measure of dispersion known as the <i>sample standard deviation</i>.
It is defined as follows.</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_006"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">The sample standard deviation</h2><div class="oucontentinnerbox"><p>The <b>sample standard deviation</b>, which is a measure of the dispersion in a sample <i>x</i>
<sub>1</sub>, <i>x</i>
<sub>2</sub>, …, <i>x<sub>n</sub>
</i> with sample mean <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/29fca506/m248_1_ie015i.jpg" alt="" width="17" height="18"/></span>, is denoted by <i>s</i> and is obtained by averaging the squared residuals, and taking the square root of that average. Thus, if <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/24382449/m248_1_ie016i.jpg" alt="" width="103" height="21"/></span>, then</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0023"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/fface446/m248_1_ue023i.jpg" alt=""/></div></div></div></div><p>There are two important points you should note about this definition. First and foremost, although there are <i>n</i> terms contributing to the sum in the numerator, the divisor used when averaging the squared residuals is not the sample size <i>n</i>, but <i>n</i>−1. The reason for this surprising amendment will become clear later in the course. Whether dividing by <i>n</i> or by <i>n</i>−1, the measure of dispersion obtained has useful statistical properties, but these properties are subtly different. The definition above, with divisor <i>n</i>−1, is used in this course.</p><p>Second, you should remember to take the square root of the average. The reason for taking the square root is so that the measure of dispersion obtained is measured in the same units as the data. Since the residuals are measured in the same units as the data, their squares and the average of their squares are measured in the squares of those units.
So the standard deviation, which is the square root of this average, is measured in the same units as the data.</p><div class="oucontentexample oucontentsheavybox1 oucontentsbox " id="exm004_007b"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Example 2 Calculating the standard deviation</h2><div class="oucontentinnerbox"><p>The sum of the squared residuals for the eleven β endorphin concentration differences is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0024"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/ff922597/m248_1_ue024i.jpg" alt=""/></div><p>Notice that a negative residual contributes a positive value to the calculation of the standard deviation. This is because it is squared.</p><p>So the sample standard deviation of the differences is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0025"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/48fd2aa7/m248_1_ue025i.jpg" alt=""/></div></div></div></div><p>Even for relatively small samples the arithmetic is rather awkward if done by hand. Fortunately, it is now common for calculators to have a ‘standard deviation’ button, and all that is required is to key in the data. The exact details of how to do this differ between different models and makes of calculator. Several types of calculator give you the option of using either <i>n</i> or <i>n</i>−1 as the divisor. You should check that you understand exactly how your own calculator is used to calculate standard deviations. Try using it to calculate the sample standard deviation of the eleven β endorphin concentration differences for runners who completed the race, using <i>n</i>−1 as the divisor. 
Make sure that you get the same answer as given above (that is, 8.33).</p><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act004_006"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 12: Calculating standard deviations</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>Use your calculator to calculate the standard deviation for the β endorphin concentrations of the eleven collapsed runners. The data in pmol/l, originally given in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.5#tbl001004">Table 4</a>, are as follows.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl004_000a"><div class="oucontenttablewrapper"><table><tr><td>66</td><td>72</td><td>79</td><td>84</td><td>102</td><td>110</td><td>123</td><td>144</td><td>162</td><td>169</td><td>414</td></tr></table></div><div class="oucontentsourcereference"></div></div></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>Answering this question might involve delving around for the instruction manual that came with your calculator! The important thing is not to use the formula: let your calculator do all the arithmetic. All you should need to do is key in the original data and then press the correct button. (There might be a choice of two buttons: one uses the divisor <i>n</i> in the ‘standard deviation’ formula, the other uses the divisor <i>n</i>−1. Remember, in this course we use the second formula.) For the collapsed runners’ β endorphin concentrations, <i>s</i> = 98.0.</p></div></div></div></div><p>You will find that the main use of the standard deviation lies in making inferences about the population from which the sample is drawn. Its most serious disadvantage, like that of the mean, is its sensitivity to outliers.</p><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="act004_007"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Activity 13: Calculating standard deviations</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>In Activity 12 you calculated a standard deviation of 98.0 for the data on the collapsed runners. Try doing the calculation again, but this time omit the outlier at 414. Calculate also the interquartile range of this data set, first including the outlier and then omitting it.</p></div>
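<p>If you prefer software to a calculator, the same checks can be sketched in Python. This block is an addition, not part of the course: it relies on the standard library's <code>statistics.stdev</code> (divisor <i>n</i>−1, the definition used here), <code>statistics.pstdev</code> (divisor <i>n</i>) and <code>statistics.quantiles</code> with <code>method="exclusive"</code>, which is assumed here because it uses the same ¼(<i>n</i>+1) and ¾(<i>n</i>+1) positions as this course. The helper name <code>summarise</code> is hypothetical.</p>

```python
import statistics

# Differences in beta endorphin concentration (pmol/l), from Example 1.
differences = [25.3, 20.5, 10.3, 24.4, 17.5, 30.6, 11.8, 12.9, 3.8, 20.6, 28.4]
# Beta endorphin concentrations (pmol/l) of the collapsed runners, Activity 12.
collapsed = [66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414]


def summarise(values):
    """Return the sample standard deviation (divisor n-1) and the interquartile range."""
    lower_q, _, upper_q = statistics.quantiles(values, n=4, method="exclusive")
    return statistics.stdev(values), upper_q - lower_q


# The residuals sum to zero, apart from floating-point rounding.
mean = statistics.mean(differences)
print("sum of residuals:", round(sum(x - mean for x in differences), 10))

# The divisor matters: stdev uses n-1, pstdev uses n.
print("s with divisor n-1:", round(statistics.stdev(differences), 2))
print("s with divisor n:  ", round(statistics.pstdev(differences), 2))

for label, values in (("with outlier", collapsed), ("without outlier", collapsed[:-1])):
    s, iqr = summarise(values)
    print(f"{label}: s = {s:.1f}, interquartile range = {iqr:.2f}")
```

<p>Dropping the value 414 changes the standard deviation far more, proportionally, than it changes the interquartile range, which is the point of the comparison in Activity 13.</p>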
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>When the outlier of 414 is omitted, you will find a drastic reduction in the standard deviation from 98.0 to 37.4, a reduction by a factor of about 2.6.</p><p>The data are given in order in Activity 12. For the full data set, the sample size <i>n</i> is 11. The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0045"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/798820ed/m248_1_ue045i.jpg" alt=""/></div><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0046"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/0619bf93/m248_1_ue046i.jpg" alt=""/></div><p>Thus the interquartile range is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0047"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/7ff7df61/m248_1_ue047i.jpg" alt=""/></div><p>When the outlier 414 is omitted from the data set, the sample size <i>n</i> is 10. The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0048"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/2dbd3968/m248_1_ue048i.jpg" alt=""/></div><p>which is three-quarters of the way between <i>x</i><sub>(2)</sub> = 72 and <i>x</i><sub>(3)</sub> = 79. Thus</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0049"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/a7efe0ae/m248_1_ue049i.jpg" alt=""/></div><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0050"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/02c2c967/m248_1_ue050i.jpg" alt=""/></div><p>which is one-quarter of the way between <i>x</i><sub>(8)</sub> = 144 and <i>x</i><sub>(9)</sub> = 162. So</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0051"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/bec613d6/m248_1_ue051i.jpg" alt=""/></div><p>The interquartile range is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0052"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/1d13a55f/m248_1_ue052i.jpg" alt=""/></div><p>Naturally it has decreased with the removal of the outlier, but the decrease is proportionally far smaller than the decrease in the standard deviation.</p></div></div></div></div><p>You should have found a considerable reduction in the standard deviation when the outlier is omitted: from 98.0 to 37.4, a reduction by a factor of about 2.6. However, the interquartile range, which is 83 for the whole data set, falls only to 71.25 when the outlier is omitted, a reduction by a factor of less than 1.2. This illustrates clearly that the interquartile range is a resistant measure of dispersion, while the standard deviation is not.</p><p>Which, then, should you prefer as a measure of dispersion: range, interquartile range or standard deviation? For exploring and summarising dispersion (spread) in data values, the interquartile range is safer, especially when outliers are present. For inferential calculations, which you will meet later in the course, the standard deviation is used, possibly with extreme values removed. The range should only be used as a check on calculations. Clearly the mean must lie between the smallest and largest data values, somewhere near the middle if the data are reasonably symmetric; and the standard deviation, which can never exceed the range, is usually about one-quarter of it.</p>
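<p>The comparison in Activity 13 can also be checked by computer. The following is a minimal sketch in Python, not part of the original course material. It assumes that the <code>statistics</code> module's <code>method="exclusive"</code> quartile rule, which places the quartiles at positions (<i>n</i>+1)/4 and 3(<i>n</i>+1)/4 with linear interpolation, matches the rule used in the solution above; the helper name <code>spread</code> is purely illustrative.</p>

```python
import statistics

# β endorphin concentrations (pmol/l) of the eleven collapsed runners (Activity 12).
collapsed = [66, 72, 79, 84, 102, 110, 123, 144, 162, 169, 414]
without_outlier = [x for x in collapsed if x != 414]


def spread(data):
    """Return (sample standard deviation, interquartile range) for data."""
    s = statistics.stdev(data)  # sample standard deviation, divisor n - 1
    # Assumption: the "exclusive" method puts the quartiles at positions
    # (n + 1)/4 and 3(n + 1)/4 in the sorted data, interpolating linearly.
    q_lower, _median, q_upper = statistics.quantiles(data, n=4, method="exclusive")
    return s, q_upper - q_lower


s_all, iqr_all = spread(collapsed)
s_trim, iqr_trim = spread(without_outlier)

print(f"with 414:    s = {s_all:.1f}, IQR = {iqr_all}")
print(f"without 414: s = {s_trim:.1f}, IQR = {iqr_trim}")
print(f"s shrinks by a factor of {s_all / s_trim:.2f}, IQR by {iqr_all / iqr_trim:.2f}")
```

<p>Printing the two shrink factors side by side makes the contrast between the resistant measure and the non-resistant one easy to see.</p>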

1.4.9 Sample variance
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.7
Tue, 26 Jul 2011 23:00:00 GMT
<p>It is worth noting that a special term is reserved for the square of the sample standard deviation: it is known as the <i>sample variance</i>.</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_005"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">The sample variance</h2><div class="oucontentinnerbox"><p>The <b>sample variance</b> of a data sample <i>x</i>
<sub>1</sub>, <i>x</i>
<sub>2</sub>, …, <i>x<sub>n</sub>
</i> is given by</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0027"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/57ed22ba/m248_1_ue027i.jpg" alt=""/></div><p>where <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/4a5c2f97/m248_1_ie017i.jpg" alt="" width="17" height="18"/></span> is the sample mean.</p></div></div></div><div class="oucontentexample oucontentsheavybox1 oucontentsbox " id="exm004_007a"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Example 3: Calculating the variance</h2><div class="oucontentinnerbox"><p>The variance of the eleven β endorphin concentration differences is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0028"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/211aa9b0/m248_1_ue028i.jpg" alt=""/></div></div></div></div>
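<p>As an illustration only, the relationship between the sample variance and the sample standard deviation can be sketched in Python. The data are the eleven β endorphin concentration differences from Example 1; the variable names are illustrative, and the divisor <i>n</i>−1 is used, as in the definition above.</p>

```python
import math
import statistics

# The eleven β endorphin concentration differences (pmol/l) from Example 1.
differences = [25.3, 20.5, 10.3, 24.4, 17.5, 30.6, 11.8, 12.9, 3.8, 20.6, 28.4]

n = len(differences)
mean = sum(differences) / n

# Sample variance: sum of squared residuals divided by n - 1.
sample_variance = sum((x - mean) ** 2 for x in differences) / (n - 1)

# The sample standard deviation is the square root of the sample variance.
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library, which also divides by n - 1.
assert math.isclose(sample_variance, statistics.variance(differences))

print(f"variance = {sample_variance:.2f}, standard deviation = {sample_sd:.2f}")
```

<p>Note that the variance is measured in squared units (here, squared pmol/l), whereas the standard deviation is in the units of the data.</p>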

1.4.10 A note on accuracy
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.8
Tue, 26 Jul 2011 23:00:00 GMT
<p>To what accuracy should you give the results of calculations? If you look through the examples in this section, you will find that, in general, results have been given either to the same accuracy as the data or rounded to one decimal place or one significant figure more than is given in the data. There is no hard and fast rule about what you should do: appropriate accuracy depends on a number of factors including the reliability of the data and the size of the data set. However, you should avoid either rounding too much, so that valuable information is lost, or too little, thus suggesting that your results are more accurate than can be justified from the available data.</p><p>As a rough guide, it is usually satisfactory to round a result to one significant figure more than is given in the data. But note that this rough guide applies only to results quoted at the ends of calculations: intermediate results should not be rounded. If you round a result and then use the rounded value in subsequent calculations – for instance, if you use a rounded value for the mean when calculating the standard deviation of a data set – then this sometimes leads to quite serious inaccuracies (known as <i>rounding errors</i>).</p><p>In Example 1, the mean was rounded to two decimal places before calculating the residuals. In Example 2, the squared residuals were also rounded before finding the standard deviation. This was done simply for clarity of presentation.</p>
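<p>The effect of rounding an intermediate result can be sketched in Python. This is an illustration under stated assumptions rather than part of the course: it uses the algebraically equivalent ‘shortcut’ form of the standard deviation, in which the sum of squared residuals is computed as the sum of the squared data values minus <i>n</i> times the squared mean. That form does not appear in this section, and is chosen here because it is particularly sensitive to a rounded mean. The function name <code>sd_from_mean</code> is hypothetical.</p>

```python
import math

# The eleven β endorphin concentration differences (pmol/l) from Example 1.
differences = [25.3, 20.5, 10.3, 24.4, 17.5, 30.6, 11.8, 12.9, 3.8, 20.6, 28.4]

n = len(differences)
sum_of_squares = sum(x * x for x in differences)
exact_mean = sum(differences) / n


def sd_from_mean(mean):
    """Sample standard deviation via the shortcut form (assumed, not from the text):
    sum of squared residuals = sum of x^2 minus n * mean^2, divisor n - 1."""
    return math.sqrt((sum_of_squares - n * mean ** 2) / (n - 1))


sd_exact = sd_from_mean(exact_mean)               # mean kept at full precision
sd_rounded = sd_from_mean(round(exact_mean, 2))   # mean rounded to 18.74 first

print(f"unrounded mean: s = {sd_exact:.2f}")
print(f"rounded mean:   s = {sd_rounded:.2f}")
```

<p>Rounding the mean before it is used changes the second decimal place of the result, which is why only final results should be rounded.</p>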

1.4.11 Symmetry and skewness
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9
Tue, 26 Jul 2011 23:00:00 GMT
<p>For many purposes the location and dispersion of a set of data are the main features of its distribution that we might wish to summarise, numerically or otherwise. But for some purposes it can be important to consider a slightly more subtle aspect: the symmetry, or lack of symmetry, in the data.</p><div class="oucontentexample oucontentsheavybox1 oucontentsbox " id="exm004_008"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Example 4: Family sizes of Protestant mothers in Ontario</h2><div class="oucontentinnerbox"><p>The following data are taken from the 1941 Canadian Census and comprise the sizes of completed families (numbers of children) born to a sample of Protestant mothers in Ontario aged 45–54 and married at age 15–19. The data are split into two groups according to how many years of formal education the mothers had received.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl004_004"><h3 class="oucontenth3 oucontentheading oucontentnonumber">
<b>Table 12</b> Family size: mothers married aged 15–19</h3><div class="oucontenttablewrapper"><table><tr><td>Mother educated for six years or less</td></tr><tr><td>14 13 4 14 10 2 13 5 0 0 13 3 9 2 10 11 13 5 14</td></tr><tr><td>Mother educated for seven years or more</td></tr><tr><td>0 4 0 2 3 3 0 4 7 1 9 4 3 2 3 2 16 6 0 13 6 6 5 9 10 5 4 3 3 5 2 3 5 15 5</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>(Keyfitz, N. (1953) A factorial arrangement of comparisons of family size. <i>American J. Sociology</i>, <b>53</b>, 470–480.)</p><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a> shows a bar chart of some of the data from <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#tbl004004">Table 12</a>: it shows the numbers of children born to the 35 mothers who had at least seven years of education.</p><div class="oucontentfigure" style="width:511px;" id="fig004_004"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/2b1e3375/m248_1_021i.jpg" alt="Figure 21" width="511" height="358"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 21 Family size for mothers with at least seven years of education</span></div></div></div><p>As you can see, the bar chart shows a marked lack of symmetry.</p></div></div></div><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="exe004_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Exercise 1: Family size</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><ol class="oucontentnumbered"><li>
<p>For each of the two data sets in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#tbl004004">Table 12</a>, calculate the range, the median, the upper and lower quartiles and the interquartile range.</p>
</li><li>
<p>Use the statistical functions of your calculator to find the mean, the standard deviation and the variance for each of the two data sets.</p>
</li></ol></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>After sorting into order of increasing size, the two data sets are as follows.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbli002"><div class="oucontenttablewrapper"><table><tr><td>Mother educated for six years or less</td></tr><tr><td>0 0 2 2 3 4 5 5 9 10 10 11 13 13 13 13 14 14 14</td></tr><tr><td>Mother educated for seven years or more</td></tr><tr><td>0 0 0 0 1 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 7 9 9 10 13 15 16</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>For the mothers with at most six years of education, the sample size <i>n</i> is 19. The range is just the difference between the largest and the smallest values in the sample, so it is 14−0=14. The median is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_053"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/71b0952b/m248_1_ue053i.jpg" alt=""/></div><p>The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_054"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/107b26c9/m248_1_ue054i.jpg" alt=""/></div><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0055"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/d4cdb402/m248_1_ue055i.jpg" alt=""/></div><p>Thus the interquartile range is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0056"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/70fc50aa/m248_1_ue056i.jpg" alt=""/></div><p>For the other data set, the sample size <i>n</i> is 35. The range is 16−0=16. 
The median is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0057"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/c6d25c5b/m248_1_ue057i.jpg" alt=""/></div><p>The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0058"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/b15a2a88/m248_1_ue058i.jpg" alt=""/></div><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0059"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/91db58b3/m248_1_ue059i.jpg" alt=""/></div><p>The interquartile range is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0060"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/0abacabc/m248_1_ue060i.jpg" alt=""/></div><p>The mean and the standard deviation for the first sample are respectively 8.158 and 5.188, or approximately 8.2 and 5.2. The variance is the square of the standard deviation, 5.188<sup>2</sup> = 26.9 (approx.). For the second sample, the mean, standard deviation and variance are respectively 4.8, 3.954 = 4.0 (approx.) and 3.954<sup>2</sup> = 15.6 (approx.).</p></div></div></div></div><p>Detection of lack of symmetry is of considerable importance in data analysis and inference. One reason is that the most important summary measure of the data is the typical or central value in the context of which the sample median and the sample mean were introduced. When the data are roughly symmetrically distributed, all ambiguity is removed because the median and the mean will nearly coincide. 
However, when the data are very far from symmetrical, not only will these measures not coincide but we may even be hard pressed to decide whether <i>any</i> summary measure of this kind is appropriate. There are other reasons for the importance of symmetry in data analysis. For instance, most statistical methods involve producing a mathematical (probability) model for data, and the choice of an appropriate model may depend on whether the data are symmetrical.</p><p>Numerical data that are not symmetrical, in the sense that a bar chart or histogram shows clear lack of symmetry, are said to be <b>skew</b> or <b>skewed</b>. In <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a>, the general pattern of lack of symmetry is that the main bulk of the data take relatively low values, towards the left of the bar chart, and to the right of the bar chart there is a relatively large ‘tail’ of relatively high values. Because of this ‘tail’ to the right, data showing this sort of pattern are said to be <b>right-skew</b> or <b>positively skewed</b>.</p><p>These data on family sizes arise from counts, so they are discrete, and a bar chart is an appropriate way to picture them. But the concept of skewness applies also to measured (continuous) data. 
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004005">Figure 22</a> shows a histogram of the time intervals (in seconds) between pulses along a nerve fibre.</p><div class="oucontentfigure" style="width:511px;" id="fig004_005"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/aee2ea48/m248_1_022i.jpg" alt="Figure 22" width="511" height="348"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 22 Time intervals between nerve pulses (seconds)</span></div></div></div><p>Again, the general pattern is one of lack of symmetry. The data have a relatively large ‘tail’ to the right of the diagram for relatively long time intervals, so again they are described as right-skew or positively skewed.</p><p>The mean and the median are shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004005">Figure 22</a>. Notice that the mean is greater than the median; this is the case for right-skew data in general.</p><p>Clearly not all data sets that exhibit lack of symmetry are right-skew. Data sets whose bar charts or histograms look generally like the mirror images of Figures 21 and 22 are said to be <b>left-skew</b> or <b>negatively skewed</b>. In general, the mean is less than the median for left-skew data. (Note that the direction – left or right – used to describe the skewness is the direction in which the long ‘tail’ of the distribution points, not the end of the diagram where the main bulk of the data lie.) In practice, right-skew data are relatively common, and often arise (as in the data sets in Figures 21 and 22) where there is some natural lower limit on the values of a variable, so that it is impossible for there to be a long ‘tail’ to the left. In the nature of things, natural upper limits on the values of variables tend to be less common, so that left-skew data are encountered rather less frequently.</p><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004006">Figure 23</a> is a bar chart of the family sizes of the first group of mothers in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#tbl004004">Table 12</a>, who were educated for six years or less.</p><div class="oucontentfigure" style="width:511px;" id="fig004_006"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/728612bc/m248_1_023i.jpg" alt="Figure 23" width="511" height="356"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 23 Family size of mothers with at most six years of education</span></div></div></div><p>This bar chart does not exhibit such a clear lack of symmetry as does <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a>; but it is not symmetrical. This time, however, the main concentration of the data is, if anything, towards the right of the diagram and the main ‘tail’ is to the left. These data are left-skew, or negatively skewed.</p><p>As well as a general impression of skewness obtained by looking at histograms or bar charts, a numerical measure of symmetry is both meaningful and useful.</p><p>The generally accepted measure is the <i>sample skewness</i>, defined as follows.</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_007"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">The sample skewness</h2><div class="oucontentinnerbox"><p>The <b>sample skewness</b> of a data sample <i>x</i>
<sub>1</sub>, <i>x</i>
<sub>2</sub>, …, <i>x<sub>n</sub>
</i> is given by</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0029"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/f0acec4e/m248_1_ue029i.jpg" alt=""/></div><p>where <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/8804fbac/m248_1_ie018i.jpg" alt="" width="19" height="18"/></span> is the sample mean and <i>s</i> is the sample standard deviation.</p></div></div></div><p>Notice the term <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/6f4ee29e/m248_1_ie019i.jpg" alt="" width="85" height="29"/></span> in this formula. Since <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/f641cc90/m248_1_ie020i.jpg" alt="" width="88" height="31"/></span> is positive when <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/5e4f308a/m248_1_ie021i.jpg" alt="" width="62" height="24"/></span> is positive and negative when <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/f1cb4857/m248_1_ie022i.jpg" alt="" width="61" height="22"/></span> is negative, observations greater than the sample mean contribute positive terms to the sum, while observations less than the sample mean contribute negative terms. Perfectly symmetric data have a skewness of 0, because the contributions from positive and negative terms cancel out. In skewed data, the sign of the sample skewness depends on the direction of the skew. For right-skew data, the bigger ‘tail’ is on the right, so that it consists (largely at any rate) of values greater than the sample mean. 
In other words, in right-skew data there are a lot of values much greater than the sample mean, and fewer values much less than the sample mean. The power of 3 applied to the terms in the sum, in the formula for sample skewness, means that values a long way from the mean contribute a disproportionately large amount to the sum. Thus, in right-skew data, the positive terms in the sum outweigh the negative terms, and the sample skewness comes out to be positive. In left-skew data, it is the other way round and the sample skewness is negative. (This, in a sense, is the reason why right-skew data are also said to be positively skewed, and left-skew data are negatively skewed.) The data of <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a> have a sample skewness of 1.36, and those in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004006">Figure 23</a> have a sample skewness of −0.33. That is, the data for the group of mothers with seven or more years of education have positive skewness, while for the group of mothers with six or fewer years of education, the sample skewness is negative. The asymmetry is rather slight for the mothers with six or fewer years of education, certainly by comparison with the mothers with seven or more years.</p><p>It is, of course, possible to calculate the sample skewness on a calculator, but the computations are rather tedious. In practice a statistician would use a computer, so practice on calculating skewness is left to the computer book.</p><div class=" oucontentactivity oucontentsheavybox1 oucontentsbox " id="exe004_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Exercise 2: Alcohol consumption</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.6#tbl001005">Table 5</a> contains average annual alcohol consumption figures (in litres per person) for 15 countries. The figure for France was observed to be much higher than the other figures (an apparent outlier). In order of increasing size, the other values in the data set are as follows.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_005a"><div class="oucontenttablewrapper"><table><tr><td>3.1</td><td>3.9</td><td>4.2</td><td>5.6</td><td>5.7</td><td>5.8</td><td>6.6</td><td>7.2</td><td>8.3</td><td>9.9</td><td>10.8</td><td>10.9</td><td>12.3</td><td>15.2</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>Calculate the median, the upper and lower quartiles and the interquartile range for these alcohol consumption figures.</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>For these data, the sample size <i>n</i> is 14. The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0061"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/580ed95e/m248_1_ue061i.jpg" alt=""/></div><p>This is three-quarters of the way between <i>x</i>
<sub>(3)</sub> =4.2 and <i>x</i>
<sub>(4)</sub>=5.6. So</p><p>
<i>q<sub>L</sub>
</i> = 4.2 + ¾(5.6 – 4.2) = 5.25, </p><p>or approximately 5.3. The sample median is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0063"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/789427f3/m248_1_ue063i.jpg" alt=""/></div><p>This is midway between <i>x</i>
<sub>(7)</sub>=6.6 and <i>x</i>
<sub>(8)</sub>=7.2, that is 6.9.</p><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0064"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/40048d61/m248_1_ue064i.jpg" alt=""/></div><p>which is one-quarter of the way between <i>x</i>
<sub>(11)</sub>=10.8 and <i>x</i>
<sub>(12)</sub>=10.9. So</p><p>
<i>q<sub>U</sub>
</i> = 10.8 + ¼(10.9 – 10.8) = 10.825, </p><p>or approximately 10.8. The interquartile range is</p><p>
<i>q<sub>U</sub>
</i> – <i>q<sub>L</sub>
</i> = 10.825 – 5.25 = 5.575,</p><p>or approximately 5.6.</p></div></div></div></div>
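<p>The hand calculations in this section can be checked by computer. The following Python sketch is an illustration only and is not part of the course materials. It makes two assumptions. First, it assumes that the quartiles sit at positions (<i>n</i> + 1)/4 and 3(<i>n</i> + 1)/4 in the sorted data, with linear interpolation between neighbouring values; this rule is inferred from the worked solution to Exercise 2. Second, since the skewness formula appears above only as an image, it assumes the normalising factor <i>n</i>/((<i>n</i> − 1)(<i>n</i> − 2)), chosen because it reproduces the values 1.36 and −0.33 quoted for Figures 21 and 23. The function names are hypothetical.</p>

```python
import math
import statistics


def order_statistic(sorted_data, position):
    """Value at a 1-based, possibly fractional, position in sorted data.

    Fractional positions are resolved by linear interpolation between
    the two neighbouring order statistics.
    """
    lower = math.floor(position)
    fraction = position - lower
    value = sorted_data[lower - 1]
    if fraction > 0:
        value += fraction * (sorted_data[lower] - sorted_data[lower - 1])
    return value


def quartile_summary(data):
    """Median, quartiles, interquartile range and range of a sample.

    Assumed position rule: (n + 1)/4, (n + 1)/2 and 3(n + 1)/4.
    Intended for samples with at least three values.
    """
    xs = sorted(data)
    n = len(xs)
    q_lower = order_statistic(xs, (n + 1) / 4)
    median = order_statistic(xs, (n + 1) / 2)
    q_upper = order_statistic(xs, 3 * (n + 1) / 4)
    return {
        "lower_quartile": q_lower,
        "median": median,
        "upper_quartile": q_upper,
        "iqr": q_upper - q_lower,
        "range": xs[-1] - xs[0],
    }


def sample_skewness(data):
    """Sample skewness with the assumed factor n / ((n - 1)(n - 2))."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    total = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * total


# Alcohol consumption figures from Exercise 2 (France excluded).
alcohol = [3.1, 3.9, 4.2, 5.6, 5.7, 5.8, 6.6,
           7.2, 8.3, 9.9, 10.8, 10.9, 12.3, 15.2]

# Family sizes from the sorted lists in the solution to Exercise 1.
six_or_less = [0, 0, 2, 2, 3, 4, 5, 5, 9, 10, 10, 11,
               13, 13, 13, 13, 14, 14, 14]
seven_or_more = [0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4,
                 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 9, 9, 10, 13, 15, 16]

if __name__ == "__main__":
    print(quartile_summary(alcohol))
    print(round(sample_skewness(seven_or_more), 2))
    print(round(sample_skewness(six_or_less), 2))
```

<p>Under these assumptions the sketch should agree with the figures in the text: a lower quartile of 5.25, a median of 6.9 and an upper quartile of 10.825 for the alcohol data, and skewness values that round to 1.36 and −0.33 for the two groups of mothers. Statistical packages often use different quartile conventions, so their output may differ slightly from these values.</p>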
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9
4.11: Symmetry and skewnessM248_1<p>For many purposes the location and dispersion of a set of data are the main features of its distribution that we might wish to summarise, numerically or otherwise. But for some purposes it can be important to consider a slightly more subtle aspect: the symmetry, or lack of symmetry, in the data.</p><div class="oucontentexample oucontentsheavybox1 oucontentsbox " id="exm004_008"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Example 4: Family sizes of Protestant mothers in Ontario</h2><div class="oucontentinnerbox"><p>The following data are taken from the 1941 Canadian Census and comprise the sizes of completed families (numbers of children) born to a sample of Protestant mothers in Ontario aged 45–54 and married at age 15–19. The data are split into two groups according to how many years of formal education the mothers had received.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl004_004"><h3 class="oucontenth3 oucontentheading oucontentnonumber">
<b>Table 12</b> Family size: mothers married aged 15–19</h3><div class="oucontenttablewrapper"><table><tr><td>Mother educated for six years or less</td></tr><tr><td>14 13 4 14 10 2 13 5 0 0 13 3 9 2 10 11 13 5 14</td></tr><tr><td>Mother educated for seven years or more</td></tr><tr><td>0 4 0 2 3 3 0 4 7 1 9 4 3 2 3 2 16 6 0 13 6 6 5 9 10 5 4 3 3 5 2 3 5 15 5</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>(Keyfitz, N. (1953) A factorial arrangement of comparisons of family size. <i>American J. Sociology</i>, <b>53</b>, 470–480.)</p><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a> shows a bar chart of some of the data from <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#tbl004004">Table 12</a>: it shows the numbers of children born to the 35 mothers who had at least seven years of education.</p><div class="oucontentfigure" style="width:511px;" id="fig004_004"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/2b1e3375/m248_1_021i.jpg" alt="Figure 21" width="511" height="358"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 21 Family size for mothers with at least seven years of education</span></div></div></div><p>As you can see, the bar chart shows a marked lack of symmetry.</p></div></div></div><div class="
oucontentactivity
oucontentsheavybox1 oucontentsbox " id="exe004_001"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">Exercise 1: Family size</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><ol class="oucontentnumbered"><li>
<p>For each of the two data sets in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#tbl004004">Table 12</a>, calculate the range, the median, the upper and lower quartiles and the interquartile range.</p>
</li><li>
<p>Use the statistical functions of your calculator to find the mean, the standard deviation and the variance for each of the two data sets.</p>
</li></ol></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>After sorting into order of increasing size, the two data sets are as follows.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbli002"><div class="oucontenttablewrapper"><table><tr><td>Mother educated for six years or less</td></tr><tr><td>0 0 2 2 3 4 5 5 9 10 10 11 13 13 13 13 14 14 14</td></tr><tr><td>Mother educated for seven years or more</td></tr><tr><td>0 0 0 0 1 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 7 9 9 10 13 15 16</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>For the mothers with at most six years of education, the sample size <i>n</i> is 19. The range is just the difference between the largest and the smallest values in the sample, so it is 14−0=14. The median is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_053"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/71b0952b/m248_1_ue053i.jpg" alt=""/></div><p>The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_054"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/107b26c9/m248_1_ue054i.jpg" alt=""/></div><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0055"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/d4cdb402/m248_1_ue055i.jpg" alt=""/></div><p>Thus the interquartile range is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0056"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/70fc50aa/m248_1_ue056i.jpg" alt=""/></div><p>For the other data set, the sample size <i>n</i> is 35. The range is 16−0=16. 
The median is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0057"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/c6d25c5b/m248_1_ue057i.jpg" alt=""/></div><p>The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0058"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/b15a2a88/m248_1_ue058i.jpg" alt=""/></div><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0059"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/91db58b3/m248_1_ue059i.jpg" alt=""/></div><p>The interquartile range is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0060"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/0abacabc/m248_1_ue060i.jpg" alt=""/></div><p>The mean and the standard deviation for the first sample are respectively 8.158 and 5.188, or approximately 8.2 and 5.2. The variance is the square of the standard deviation, 5.188<sup>2</sup> = 26.9 (approx.). For the second sample, the mean, standard deviation and variance are respectively 4.8, 3.954 = 4.0 (approx.) and 3.954<sup>2</sup> = 15.6 (approx.).</p></div></div></div></div><p>Detection of lack of symmetry is of considerable importance in data analysis and inference. One reason is that the most important summary measure of the data is the typical or central value in the context of which the sample median and the sample mean were introduced. When the data are roughly symmetrically distributed, all ambiguity is removed because the median and the mean will nearly coincide. 
However, when the data are very far from symmetrical, not only will these measures not coincide but we may even be hard pressed to decide whether <i>any</i> summary measure of this kind is appropriate. There are other reasons for the importance of symmetry in data analysis. For instance, most statistical methods involve producing a mathematical (probability) model for data, and the choice of an appropriate model may depend on whether the data are symmetrical.</p><p>Numerical data that are not symmetrical, in the sense that a bar chart or histogram shows clear lack of symmetry, are said to be <b>skew</b> or <b>skewed</b>. In <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a>, the general pattern of lack of symmetry is that the main bulk of the data take relatively low values, towards the left of the bar chart, and to the right of the bar chart there is a relatively large ‘tail’ of relatively high values. Because of this ‘tail’ to the right, data showing this sort of pattern are said to be <b>right-skew</b> or <b>positively skewed</b>.</p><p>These data on family sizes arise from counts, so they are discrete, and a bar chart is an appropriate way to picture them. But the concept of skewness applies also to measured (continuous) data. 
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004005">Figure 22</a> shows a histogram of the time intervals (in seconds) between pulses along a nerve fibre.</p><div class="oucontentfigure" style="width:511px;" id="fig004_005"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/aee2ea48/m248_1_022i.jpg" alt="Figure 22" width="511" height="348"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 22 Time intervals between nerve pulses (seconds)</span></div></div></div><p>Again, the general pattern is one of lack of symmetry. The data have a relatively large ‘tail’ to the right of the diagram for relatively long time intervals, so again they are described as right-skew or positively skewed.</p><p>The mean and the median are shown in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004005">Figure 22</a>. Notice that the mean is greater than the median; this is the case for right-skew data in general.</p><p>Clearly not all data sets that exhibit lack of symmetry are right-skew. Data sets whose bar charts or histograms look generally like the mirror images of Figures 21 and 22 are said to be <b>left-skew</b> or <b>negatively skewed</b>. In general, the mean is less than the median for left-skew data. (Note that the direction – left or right – used to describe the skewness is the direction in which the long ‘tail’ of the distribution points, not the end of the diagram where the main bulk of the data lie.) In practice, right-skew data are relatively common, and often arise (as in the data sets in Figures 21 and 22) where there is some natural lower limit on the values of a variable, so that it is impossible for there to be a long ‘tail’ to the left. In the nature of things, natural upper limits on the values of variables tend to be less common, so that left-skew data are encountered rather less frequently.</p><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004006">Figure 23</a> is a bar chart of the family sizes of the first group of mothers in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#tbl004004">Table 12</a>, who were educated for six years or less.</p><div class="oucontentfigure" style="width:511px;" id="fig004_006"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/728612bc/m248_1_023i.jpg" alt="Figure 23" width="511" height="356"/><div class="oucontentfiguretext"><div class="oucontentcaption oucontentnonumber"><span class="oucontentfigurecaption">
Figure 23 Family size of mothers with at most six years of education</span></div></div></div><p>This bar chart does not exhibit such a clear lack of symmetry as does <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a>; but it is not symmetrical. This time, however, the main concentration of the data is, if anything, towards the right of the diagram and the main ‘tail’ is to the left. These data are left-skew, or negatively skewed.</p><p>As well as a general impression of skewness obtained by looking at histograms or bar charts, a numerical measure of symmetry is both meaningful and useful.</p><p>The generally accepted measure is the <i>sample skewness</i>, defined as follows.</p><div class="oucontentbox oucontentsheavybox1 oucontentsbox " id="box001_007"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">The sample skewness</h2><div class="oucontentinnerbox"><p>The <b>sample skewness</b> of a data sample <i>x</i>
<sub>1</sub>, <i>x</i>
<sub>2</sub>, …, <i>x<sub>n</sub>
</i> is given by</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0029"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/f0acec4e/m248_1_ue029i.jpg" alt=""/></div><p>where <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/8804fbac/m248_1_ie018i.jpg" alt="" width="19" height="18"/></span> is the sample mean and <i>s</i> is the sample standard deviation.</p></div></div></div><p>Notice the term <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/6f4ee29e/m248_1_ie019i.jpg" alt="" width="85" height="29"/></span> in this formula. Since <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/f641cc90/m248_1_ie020i.jpg" alt="" width="88" height="31"/></span> is positive when <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/5e4f308a/m248_1_ie021i.jpg" alt="" width="62" height="24"/></span> is positive and negative when <span class="oucontentinlinefigure"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/f1cb4857/m248_1_ie022i.jpg" alt="" width="61" height="22"/></span> is negative, observations greater than the sample mean contribute positive terms to the sum, while observations less than the sample mean contribute negative terms. Perfectly symmetric data have a skewness of 0, because the contributions from positive and negative terms cancel out. In skewed data, the sign of the sample skewness depends on the direction of the skew. For right-skew data, the bigger ‘tail’ is on the right, so that it consists (largely at any rate) of values greater than the sample mean. 
In other words, in right-skew data there are a lot of values much greater than the sample mean, and fewer values much less than the sample mean. The power of 3 applied to the terms in the sum, in the formula for sample skewness, means that values a long way from the mean contribute a disproportionately large amount to the sum. Thus, in right-skew data, the positive terms in the sum outweigh the negative terms, and the sample skewness comes out to be positive. In left-skew data, it is the other way round and the sample skewness is negative. (This, in a sense, is the reason why right-skew data are also said to be positively skewed, and left-skew data are negatively skewed.) The data of <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004004">Figure 21</a> have a sample skewness of 1.36, and those in <a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.9#fig004006">Figure 23</a> have a sample skewness of −0.33. That is, the data for the group of mothers with seven or more years of education have positive skewness, while for the group of mothers with six or fewer years of education, the sample skewness is negative. The asymmetry is rather slight for the mothers with six or fewer years of education, certainly by comparison with the mothers with seven or more years.</p><p>It is, of course, possible to calculate the sample skewness on a calculator, but the computations are rather tedious. In practice a statistician would use a computer, so practice on calculating skewness is left to the computer book.</p><div class="
oucontentactivity
oucontentsheavybox1 oucontentsbox " id="exe004_002"><div class="oucontentouterbox"><h2 class="oucontenth3 oucontentheading oucontentnonumber">
Exercise 2: Alcohol consumption</h2><div class="oucontentinnerbox"><div class="oucontentsaqquestion"><p>
<a class="oucontentcrossref" href="http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection2.6#tbl001005">Table 5</a> contains average annual alcohol consumption figures (in litres per person) for 15 countries. The figure for France was observed to be much higher than the other figures (an apparent outlier). In order of increasing size, the other values in the data set are as follows.</p><div class="oucontenttable oucontentsnormal oucontentsbox" id="tbl001_005a"><div class="oucontenttablewrapper"><table><tr><td>3.1</td><td>3.9</td><td>4.2</td><td>5.6</td><td>5.7</td><td>5.8</td><td>6.6</td><td>7.2</td><td>8.3</td><td>9.9</td><td>10.8</td><td>10.9</td><td>12.3</td><td>15.2</td></tr></table></div><div class="oucontentsourcereference"></div></div><p>Calculate the median, the upper and lower quartiles and the interquartile range for these alcohol consumption figures.</p></div>
<div class="oucontentsaqanswer"><h3 class="oucontenth4">Answer</h3><h3 class="oucontenth4 oucontentbasic">Solution</h3><p>For these data, the sample size <i>n</i> is 14. The lower quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0061"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/580ed95e/m248_1_ue061i.jpg" alt=""/></div><p>This is three-quarters of the way between <i>x</i><sub>(3)</sub> = 4.2 and <i>x</i><sub>(4)</sub> = 5.6. So</p><p>
<i>q<sub>L</sub>
</i> = 4.2 + ¾(5.6 – 4.2) = 5.25, </p><p>or approximately 5.3. The sample median is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0063"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/789427f3/m248_1_ue063i.jpg" alt=""/></div><p>This is midway between <i>x</i>
<sub>(7)</sub> = 6.6 and <i>x</i><sub>(8)</sub> = 7.2, that is 6.9.</p><p>The upper quartile is</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0064"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/40048d61/m248_1_ue064i.jpg" alt=""/></div><p>which is one-quarter of the way between <i>x</i><sub>(11)</sub> = 10.8 and <i>x</i><sub>(12)</sub> = 10.9. So</p><p>
<i>q<sub>U</sub>
</i> = 10.8 + ¼(10.9 – 10.8) = 10.825, </p><p>or approximately 10.8. The interquartile range is</p><p>
<i>q<sub>U</sub>
</i> – <i>q<sub>L</sub>
</i> = 10.825 – 5.25 = 5.575,</p><p>or approximately 5.6.</p></div></div></div></div>
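<p>The interpolation rule used in this solution can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the course materials: the function names <code>order_statistic</code> and <code>quartile_summary</code> are hypothetical, and the code assumes the positions ¼(<i>n</i>+1), ½(<i>n</i>+1) and ¾(<i>n</i>+1) described in the text, with linear interpolation between neighbouring ordered values. Statistical packages often use slightly different quartile conventions, so their results may not agree exactly with these.</p>

```python
import math


def order_statistic(sorted_values, position):
    """Return x_(position) for a 1-based, possibly fractional position.

    A fractional position is interpreted by interpolating linearly
    between the two neighbouring ordered values.
    """
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


def quartile_summary(values):
    """Return (lower quartile, median, upper quartile, interquartile range)."""
    data = sorted(values)
    n = len(data)
    if n < 3:
        # Assumption of this sketch: with fewer than 3 values the
        # quartile positions fall outside the range 1..n.
        raise ValueError("need at least 3 values")
    q_lower = order_statistic(data, (n + 1) / 4)
    median = order_statistic(data, (n + 1) / 2)
    q_upper = order_statistic(data, 3 * (n + 1) / 4)
    return q_lower, median, q_upper, q_upper - q_lower


# The 14 alcohol consumption figures from the exercise (France excluded).
alcohol = [3.1, 3.9, 4.2, 5.6, 5.7, 5.8, 6.6, 7.2,
           8.3, 9.9, 10.8, 10.9, 12.3, 15.2]

q_lower, median, q_upper, iqr = quartile_summary(alcohol)
print(round(q_lower, 3), round(median, 3), round(q_upper, 3), round(iqr, 3))
```

<p>For <i>n</i> = 14 the three positions are 3.75, 7.5 and 11.25, so the sketch interpolates between the same pairs of ordered values as the solution above and should reproduce its figures up to floating-point rounding.</p>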

4.12: Numerical summaries: summary
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection5.10
Tue, 26 Jul 2011 23:00:00 GMT
<p>In this section, various ways of summarising certain aspects of a data set by a single number have been discussed. You have been introduced to two pairs of statistics for assessing location and dispersion. The median and interquartile range provide one pair of statistics, and the mean and standard deviation the other, each pair doing a similar job. As for the choice of which pair to use, there are pros and cons for either. You have seen that the median is a more resistant measure of location than is the mean, in the sense that its value is less affected by the presence of one or two outliers in the data. In the same sense, the interquartile range is a more resistant measure of dispersion than is the standard deviation.</p><p>The (sample) median is the central value in a data set after the data values have been sorted into order of increasing size. The lower and upper (sample) quartiles are the values that divide the data set into quarters. Denoting by <i>x</i>
<sub>(<i>p</i>)</sub> the <i>p</i>th value in the ordered data set of <i>n</i> values, the median <i>m</i>, the lower quartile <i>q<sub>L</sub>
</i> and the upper quartile <i>q<sub>U</sub>
</i> are given by</p><p>
<i>m</i> = <i>x</i>
<sub>(½(<i>n</i>+1))</sub>, <i>q<sub>L</sub>
</i> = <i>x</i>
<sub>(¼(<i>n</i>+1))</sub>, <i>q<sub>U</sub>
</i> = <i>x</i>
<sub>(¾(<i>n</i>+1))</sub>.</p><p>In each case, if the subscript is not a whole number, it is interpreted by interpolating between sample values. The interquartile range is <i>q<sub>U</sub></i> − <i>q<sub>L</sub></i>. A much less commonly used measure of dispersion is the range, which is simply the difference between the largest and smallest values in the sample.</p><p>No sorting of the data is required when calculating the (sample) mean and (sample) standard deviation. The mean <i>x̄</i> and the standard deviation <i>s</i> of a sample <i>x</i><sub>1</sub>, <i>x</i><sub>2</sub>, …, <i>x</i><sub><i>n</i></sub> are given by</p><div class="oucontentequation oucontentequationequation oucontentnocaption" id="ueqn001_0031"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/959931d9/68be9759/m248_1_ue031i.jpg" alt=""/></div><p>The variance is the square of the standard deviation.</p><p>The term ‘mode’ can be used to describe a ‘representative value’ in a data set; it describes the most frequently occurring observation. For numerical data, this definition needs to be modified; a mode is taken to be a clear peak in a histogram of the data. Some data sets have only one such peak and are called unimodal, others have two peaks (bimodal) or more (trimodal, multimodal).</p><p>Finally, you have learned to distinguish between data sets that are symmetrical, right-skew (or positively skewed, with a long tail of high values) and left-skew (or negatively skewed, with a long tail of low values). The sample skewness is a numerical summary of the skewness of a data set.</p>
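<p>The mean, standard deviation, variance and range summarised above, and the idea of a resistant measure, can be illustrated with Python's standard <code>statistics</code> module. This is a hedged sketch rather than the course's own method: <code>statistics.stdev</code> and <code>statistics.variance</code> divide by <i>n</i> − 1, which is assumed here to match the formula in the equation image, and the extra value 40.0 is a made-up outlier used purely for illustration (it is not the figure for France).</p>

```python
import statistics

# The 14 alcohol consumption figures from Exercise 2 (France excluded).
alcohol = [3.1, 3.9, 4.2, 5.6, 5.7, 5.8, 6.6, 7.2,
           8.3, 9.9, 10.8, 10.9, 12.3, 15.2]

mean = statistics.mean(alcohol)
sd = statistics.stdev(alcohol)           # sample standard deviation (divisor n - 1)
variance = statistics.variance(alcohol)  # square of the standard deviation
data_range = max(alcohol) - min(alcohol)

# Hypothetical outlier, added only to show how each measure of location reacts.
with_outlier = alcohol + [40.0]
mean_shift = statistics.mean(with_outlier) - mean
median_shift = statistics.median(with_outlier) - statistics.median(alcohol)

print("mean:", round(mean, 2), "sd:", round(sd, 2), "range:", round(data_range, 1))
print("shift in mean:", round(mean_shift, 2), "shift in median:", round(median_shift, 2))
```

<p>With the made-up outlier included, the median moves only from one central ordered value to the next, whereas the mean is pulled noticeably towards the extreme value. This is the sense in which the median is the more resistant measure of location.</p>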

5: Conclusion
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection6
Tue, 26 Jul 2011 23:00:00 GMT
<p>In this course, you have been introduced to a number of ways of representing data graphically and of summarising data numerically. We began by looking at some data sets and considering informally the kinds of questions they might be used to answer.</p><p>An important first stage in any assessment of a collection of data, preceding any numerical analysis, is to represent the data, if possible, in some informative diagrammatic way. Useful graphical representations that you have met in this course include pie charts, bar charts, histograms and scatterplots. Pie charts and bar charts are generally used with categorical data, or with numerical data that are discrete (counted rather than measured). Histograms are generally used with continuous (measured) data, and scatterplots are used to investigate the relationship between two numerical variables (which are often continuous but may be discrete). You have seen that a transformation may be useful to aid the representation of data.</p><p>However, most diagrammatic representations have some disadvantages. In particular, pie charts are hard to assess unless the data set is simple, with a restricted number of categories. Histograms need a reasonably large data set. They are also sensitive to the choice of cutpoints and the widths of the classes.</p><p>Numerical summaries of data are very important. You have been introduced to two main pairs of statistics for assessing location and dispersion. The principal measures of location that have been discussed are the mean and the median, and the principal measures of dispersion are the interquartile range and the standard deviation (together with a related measure, the variance). Because of the way they are calculated, these measures ‘go together’ in pairs – the median with the interquartile range, the mean with the standard deviation.
The median and interquartile range are more resistant than are the mean and standard deviation; that is, they are less affected by one or two unusual values in a data set.</p><p>The mode has also been introduced. The term ‘mode’ is used for the most frequently occurring value in a set of categorical data, as well as to describe a clear peak in the histogram of a set of continuous data.</p><p>You have learned about the terms used to describe lack of symmetry in a data set. A data set is said to be right-skew or positively skewed if a histogram (or bar chart, for numerical discrete data) has a relatively long tail towards the higher values, on the right of the diagram. The terms left-skew and negatively skewed are used when there is a relatively long tail towards the lower values, on the left of the diagram. Note that the direction of the tail, and not the direction of the main concentration of the data values, is used to describe the skewness. The sample skewness, which is a numerical summary measure of skewness, has also been defined.</p>
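<p>The link between the direction of the tail and the sign of the sample skewness can be checked with a short Python sketch. The formula for sample skewness is not reproduced in this part of the text, so the sketch assumes one common form, the average of the cubed standardised deviations ((<i>x<sub>i</sub></i> − <i>x̄</i>)/<i>s</i>)<sup>3</sup>, with <i>s</i> calculated using the divisor <i>n</i> − 1. Other conventions differ by a constant factor but give the same sign. The function name <code>sample_skewness</code> and the three small data sets are hypothetical.</p>

```python
import math


def sample_skewness(values):
    """Average of cubed standardised deviations (assumed form, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum(((x - mean) / s) ** 3 for x in values) / n


# Made-up data: a long tail of high values, its mirror image, and a symmetric set.
right_skew = [1, 2, 2, 3, 3, 3, 4, 5, 9, 15]
left_skew = [-x for x in right_skew]
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right-skew", right_skew),
                   ("left-skew", left_skew),
                   ("symmetric", symmetric)]:
    print(name, round(sample_skewness(data), 3))
```

<p>Because the deviations are cubed, the few values far out in the long tail dominate the sum, so the first data set gives a positive value, its mirror image gives the same value with a negative sign, and the symmetric set gives a value of essentially zero.</p>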

Keep on learning
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsection7
Tue, 26 Jul 2011 23:00:00 GMT
<div class="oucontentfigure oucontentmediamini"><img src="http://www.open.edu/openlearn/ocw/pluginfile.php/89953/mod_oucontent/oucontent/747/1b9129f0/d3c986e6/ol_skeleton_keeponlearning_image.jpg" alt="" width="300" height="200"/></div><p> </p><div class="oucontentinternalsection"><h2 class="oucontenth2 oucontentinternalsectionhead">Study another free course</h2><p>There are more than <b>800 courses on OpenLearn</b> for you to choose from on a range of subjects. </p><p>Find out more about all our <span class="oucontentlinkwithtip"><a class="oucontenthyperlink" href="http://www.open.edu/openlearn/freecourses?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OL">free courses</a></span>.</p><p> </p></div><div class="oucontentinternalsection"><h2 class="oucontenth2 oucontentinternalsectionhead">Take your studies further</h2><p>Find out more about studying with The Open University by <a class="oucontenthyperlink" href="http://www.open.ac.uk/courses?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OU">visiting our online prospectus</a>. </p><p>If you are new to university study, you may be interested in our <a class="oucontenthyperlink" href="http://www.open.ac.uk/courses/doit/access?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OU">Access Courses</a> or <a class="oucontenthyperlink" href="http://www.open.ac.uk/courses/certificateshe?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OU">Certificates</a>.</p><p> </p></div><div class="oucontentinternalsection"><h2 class="oucontenth2 oucontentinternalsectionhead">What’s new from OpenLearn?</h2><p>
<a class="oucontenthyperlink" href="http://www.open.edu/openlearn/aboutopenlearn/subscribetheopenlearnnewsletter?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OL">Sign up to our newsletter</a> or view a sample.</p><p> </p></div><div class="oucontentbox oucontentshollowbox2 oucontentsbox oucontentsnoheading "><div class="oucontentouterbox"><div class="oucontentinnerbox"><p>For reference, full URLs to pages listed above:</p><p>OpenLearn – <a class="oucontenthyperlink" href="http://www.open.edu/openlearn/freecourses?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OL">www.open.edu/<span class="oucontenthidespace"> </span>openlearn/<span class="oucontenthidespace"> </span>freecourses</a>
</p><p>Visiting our online prospectus – <a class="oucontenthyperlink" href="http://www.open.ac.uk/courses?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OU">www.open.ac.uk/<span class="oucontenthidespace"> </span>courses</a>
</p><p>Access Courses – <a class="oucontenthyperlink" href="http://www.open.ac.uk/courses/doit/access?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OU">www.open.ac.uk/<span class="oucontenthidespace"> </span>courses/<span class="oucontenthidespace"> </span>doit/<span class="oucontenthidespace"> </span>access</a>
</p><p>Certificates – <a class="oucontenthyperlink" href="http://www.open.ac.uk/courses/certificateshe?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OU">www.open.ac.uk/<span class="oucontenthidespace"> </span>courses/<span class="oucontenthidespace"> </span>certificateshe</a>
</p><p>Newsletter – <a class="oucontenthyperlink" href="http://www.open.edu/openlearn/aboutopenlearn/subscribetheopenlearnnewsletter?LKCAMPAIGN=OLSU_KeepLearning&MEDIA=_OL">www.open.edu/<span class="oucontenthidespace"> </span>openlearn/<span class="oucontenthidespace"> </span>aboutopenlearn/<span class="oucontenthidespace"> </span>subscribetheopenlearnnewsletter</a>
</p></div></div></div>

Acknowledgements
http://www.open.edu/openlearn/sciencemathstechnology/mathematicsandstatistics/mathematics/exploringdatagraphsandnumericalsummaries/contentsectionacknowledgements
Tue, 26 Jul 2011 23:00:00 GMT
<p>All materials included in this course are derived from content originated at the Open University.</p><p>Course image: <span class="oucontentlinkwithtip"><a class="oucontenthyperlink" href="https://www.flickr.com/photos/kjetikor/">Kjetil Korslien</a></span> in Flickr made available under <a class="oucontenthyperlink" href="https://creativecommons.org/licenses/bync/2.0/legalcode">Creative Commons Attribution-NonCommercial 2.0 Licence</a>.</p><p>Except for third party materials and where otherwise stated (see <a class="oucontenthyperlink" href="http://www.open.ac.uk/conditions">terms and conditions</a>), this content is made available under a <a class="oucontenthyperlink" href="https://creativecommons.org/licenses/byncsa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Licence</a>.</p><p><b>Don't miss out:</b></p><p>If reading this text has inspired you to learn more, you may be interested in joining the millions of people who discover our free learning resources and qualifications by visiting The Open University – <a class="oucontenthyperlink" href="http://www.open.edu/openlearn/freecourses?LKCAMPAIGN=ebook_&amp;MEDIA=ol">www.open.edu/<span class="oucontenthidespace"> </span>openlearn/<span class="oucontenthidespace"> </span>freecourses</a></p>