1 Introducing data
Chambers English Dictionary defines the word data as follows.
data, dātä, n.pl. facts given, from which others may be inferred:—sing. da'tum (q.v.) …. [L. data, things given, pa.p. neut. pl. of dare, to give.]
You might prefer the definition given in the Shorter Oxford English Dictionary.
data, things given or granted; something known or assumed as fact, and made the basis of reasoning or calculation.
Data arise in many spheres of human activity and in all sorts of different contexts in the natural world about us. Statistics may be described as exploring, analysing and summarising data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions. The data themselves may arise in the natural course of things (for example, as meteorological records) or, commonly, they may be collected by survey or experiment.
In this course we begin by examining several different data sets and describing some of their features.
Data are frequently expressed as nothing more than a list of numbers or a complicated table. As a result, very large data sets can be difficult to appreciate and interpret without some form of consolidation. This can, perhaps, be achieved via a series of simpler tables or an easily assimilated diagram. The same applies to smaller data sets, whose main message may become evident only after some procedure of sorting and summarising.
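As a small illustration of what 'sorting and summarising' can mean in practice, the following Python sketch orders a short list of numbers and tabulates how often each value occurs. The values are invented purely for illustration and are not one of the data sets used in this course; the function name `frequency_table` is likewise hypothetical.

```python
from collections import Counter

# Hypothetical values, invented for illustration only;
# they are not taken from any data set in this course.
values = [3, 1, 2, 3, 3, 2, 1, 3, 2, 3]


def frequency_table(data):
    """Return (value, count) pairs in increasing order of value."""
    counts = Counter(data)
    return sorted(counts.items())


# Sorting alone already makes the repeated values easier to see.
sorted_values = sorted(values)

# The frequency table condenses the list into one row per distinct value.
table = frequency_table(values)

print(sorted_values)
for value, count in table:
    print(f"{value}: {count}")
```

Even for ten numbers, the table makes it plain that one value dominates, which is hard to see in the unsorted list.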
Before computers were widely available, it was often necessary to make quite detailed theoretical assumptions before beginning to investigate the data. But nowadays it is relatively easy to use a statistical computer package to explore data and acquire some intuitive ‘feel’ for them, without making such assumptions. This is helpful in that the most important and informative place to start is the logical one, namely with the data themselves. The computer will make your task both possible and relatively quick.
However, you must take care not to be misled into thinking that computers have made statistical theory redundant: this is far from the truth. You will find that the computer can do no more than lead you to see where theory is needed to underpin a commonsense approach or, perhaps, to reach an informed decision. It cannot replace such theory and it is, of course, incapable of informed reasoning: as always, that is up to you. Even so, if you are to gain real understanding and expertise, your first steps are best directed towards learning to use your computer to explore data, and to obtain some tentative inferences from such exploration.
The technology explosion of recent years has made relatively cheap and powerful computers available to all of us. Furthermore, it has brought about an information explosion which has revolutionised our whole environment. Information pours in from the media, advertisements, government agencies and a host of other sources and, in order to survive, we must learn to make rational choices based on some kind of summary and analysis of it. We need to learn to select the relevant and discard the irrelevant, to sift out what is interesting, to have some kind of appreciation of the accuracy and reliability of both our information and our conclusions, and to produce succinct summaries which can be interpreted clearly and quickly.
Our methods for summarising data will involve producing graphical displays as well as numerical calculations. You will see how a preliminary pictorial analysis of your data can, and indeed should, influence your entire approach to choosing a valid, reliable method.
But we shall begin, in Section 1, with the data themselves. In this course, except where it is necessary to make a particular theoretical point, all of the data sets used are genuine; none is artificial, contrived or ‘adjusted’ in any way. In Section 1 you will encounter several sets of real data, and begin to look at some questions on which they can throw light.
Statistics exists as an academic and intellectual discipline precisely because real investigations need to be carried out. Simple questions, and difficult ones, about matters which affect our lives need to be answered, information needs to be processed and decisions need to be made. ‘Finding things out’ is fun: this is the challenge of real data.
Some basic graphical methods that can be used to present data and make clearer the patterns in sets of numbers are introduced in Sections 2 and 3: pie charts and bar charts in Section 2, histograms and scatterplots in Section 3.
Finally, in Section 4 we discuss ways of producing numerical summaries of certain aspects of data sets, including measures of location (which are, in a sense, ‘averages’), measures of the dispersion or variability of a set of data, and measures of symmetry (and lack of symmetry).
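To give a rough idea of what such numerical summaries look like, here is a minimal Python sketch that computes two measures of location (mean and median), one measure of dispersion (the sample standard deviation) and one measure of symmetry. The values are invented for illustration and are not a data set from this course. The skewness formula used, the average cubed deviation divided by the cube of the population standard deviation, is one common choice assumed here; the measures defined in Section 4 may differ.

```python
import statistics

# Hypothetical values, invented for illustration only;
# they are not taken from any data set in this course.
values = [2.0, 3.0, 3.0, 4.0, 5.0, 7.0, 11.0]


def summarise(data):
    """Return a few simple numerical summaries of a list of numbers."""
    n = len(data)
    mean = statistics.mean(data)
    median = statistics.median(data)
    # Sample standard deviation (divisor n - 1).
    sd = statistics.stdev(data)
    # Moment-based skewness (an assumed definition): positive values
    # indicate a longer tail towards large values.
    pop_sd = statistics.pstdev(data)
    skewness = sum((x - mean) ** 3 for x in data) / (n * pop_sd ** 3)
    return {"mean": mean, "median": median, "sd": sd, "skewness": skewness}


summary = summarise(values)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

For these values the mean exceeds the median and the skewness is positive, both of which reflect the single large value pulling the right-hand tail outwards.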