2.1.3 Datasets

A dataset is a collection of separate pieces of data related to a particular experiment, study, project or programme. Datasets often take the form of a data table, in which columns represent variables, and rows represent data values for each data unit. You may have more than one dataset for a particular project or study, particularly if you are looking at lots of different groups or sites.
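
The table structure described above can be sketched in a few lines of Python. This is only an illustration: the site names, variable names and values are hypothetical and are not taken from any real study. It shows a project with one dataset per site, where each row is a data unit and each key is a variable.

```python
# Hypothetical example: one small dataset per study site.
# Each dataset is a table, stored as a list of rows (data units);
# every row holds one data value for each variable (column).
project = {
    "Site A": [
        {"SampleID": "A1", "Sample": "Blood culture", "Result": "R"},
        {"SampleID": "A2", "Sample": "Nasal swab", "Result": "S"},
    ],
    "Site B": [
        {"SampleID": "B1", "Sample": "Blood culture", "Result": "S"},
    ],
}


def describe(dataset):
    """Return (number of data units, list of variable names) for one table."""
    variables = list(dataset[0].keys())
    return len(dataset), variables


for site, dataset in project.items():
    n_units, variables = describe(dataset)
    print(site, n_units, variables)
```

Storing each row as a dictionary keeps the variable name attached to every data value, which makes it harder to confuse one column with another.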

Activity 6: Working with datasets

Timing: Allow about 20 minutes

Look at Table 4.

Table 4 A simple dataset.
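| PatientID | DOB | Age | Gender | Date sample | Sample | Species | Methicillin |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1234 | 01/01/1960 | 60.8 | 1 | 01/02/2020 | Blood culture | S. aureus | R |
| 1235 | 01/01/1970 | 50.8 | 2 | 02/02/2020 | Nasal swab | Staphylococcus aureus | S |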

Use the space below to answer the following questions:

  1. How many rows, columns, data units and data values are there in total?
  2. How is the dataset structured? (Hint: what do the rows and columns represent?)
  3. Can you describe the variables from the column headings alone? What additional information do you need to understand what the variables are?
  4. Can you classify the variables from the data values?
  5. Take a closer look at the variables. Can you identify which variables represent directly observed data and which variables represent data that has been processed through calculations performed on other variables, some of which might not be included in this dataset?
  6. Are there any possible errors in this data?
  7. Are there any inconsistencies in how the data are recorded in this dataset?
  8. What important variables are absent from this dataset that might make it difficult to use this information for AMR decision-making?

Answer

  1. Eight columns, three rows (including the header row), two data units and sixteen data values. (Don’t count the header row, which contains the variable names, when counting data values.)
  2. Columns represent variables; each row holds the data values for a single patient, i.e. a single data unit.
  3. The column names alone are insufficient to fully define the variables, though you might be able to guess some of the common ones: for example, ‘DOB’ means ‘date of birth’, and the column ‘Methicillin’ records the result of a methicillin susceptibility test. For the variable ‘Date sample’, it is unclear whether the first record represents ‘1 February 2020’ or ‘2 January 2020’, because this depends on which date format has been used. For the variable ‘Gender’, you cannot tell whether 1 refers to male, female or other. And finally, ‘Age’ is also ambiguous: does 50.8 mean 50.8 years as a decimal, or 50 years and 8 months (which would be 50.67 as a decimal)? You would need a data dictionary to accompany this dataset to be sure what each variable represents.
  4. PatientID = nominal, DOB = ordinal, Age = continuous, Gender = nominal, Date sample = ordinal, Sample = nominal, Species = nominal, Methicillin = we don’t know from this dataset. If only the categories of R and S had been used, this would then be a nominal variable. If the categories R, I and S had been used where I = intermediate, this implies a ranking with R>I>S in terms of level of resistance. This would be an ordinal variable.
  5. Age has been calculated from DOB. The methicillin variable is the result of a laboratory test. Depending on what test was used, it is possible that a minimum inhibitory concentration (MIC) value was the original source of data and was categorised into resistant and susceptible categories based on the breakpoint as defined by the method used. Alternatively, if disk diffusion was used, the classification of R and S would have been derived from the diameter cut-off values, again as defined by the method used (usually EUCAST or CLSI).
  6. DOB is recorded as the first of January at the start of a decade for both patients. This might be correct, but it might indicate a problem with recording DOB. It is impossible to tell with only two patient records, but we should be alert to the possibility of data entry errors.
  7. The bacterial species name is recorded inconsistently: ‘S. aureus’ in the first record and ‘Staphylococcus aureus’ in the second.
  8. There are many variables missing! For example, we might like to know the primary diagnosis. Patient ward is also relevant: with more data, knowing the ward would help to identify whether the patient might have been exposed to an outbreak of methicillin-resistant Staphylococcus aureus (MRSA). The origin of the infection, whether acquired in the community or in a healthcare setting, would help us to understand which settings carry a higher risk of acquiring MRSA. Antimicrobial use in the past three months might also be relevant for understanding whether methicillin or related drugs have previously been administered. We might also like information about the outcome of previous antimicrobial therapy, especially treatment failures. This is just a small sample of the additional information we might need. Which examples would you add to this list?
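
If you are comfortable with a little programming, some of the checks in this answer can be reproduced with a short script. The sketch below is one possible approach, not a prescribed method. It assumes Table 4 has been typed out as comma-separated text, and the `SPECIES_NAMES` lookup is a hypothetical mapping made up for this example. It counts the variables, data units and data values (answer 1), shows how one date string gives two different dates under two formats (answer 3), converts 50 years and 8 months to decimal years (answer 3), and standardises the species names (answer 7).

```python
import csv
import io
from datetime import datetime

# Table 4 transcribed as CSV text; the header row holds the variable names.
TABLE_4 = """PatientID,DOB,Age,Gender,Date sample,Sample,Species,Methicillin
1234,01/01/1960,60.8,1,01/02/2020,Blood culture,S. aureus,R
1235,01/01/1970,50.8,2,02/02/2020,Nasal swab,Staphylococcus aureus,S
"""

records = list(csv.DictReader(io.StringIO(TABLE_4)))
variables = list(records[0].keys())

# Answer 1: count the structure of the dataset.
n_units = len(records)            # one row per patient (data unit)
n_variables = len(variables)      # one column per variable
n_values = n_units * n_variables  # the header row is not counted

# Answer 3: the same text is a different date under each format.
raw_date = records[0]["Date sample"]
as_day_first = datetime.strptime(raw_date, "%d/%m/%Y").date()
as_month_first = datetime.strptime(raw_date, "%m/%d/%Y").date()

# Answer 3: 50 years and 8 months expressed as decimal years.
age_decimal = round(50 + 8 / 12, 2)

# Answer 7: map spelling variants to one agreed name.
# SPECIES_NAMES is a hypothetical lookup, not an official list.
SPECIES_NAMES = {"S. aureus": "Staphylococcus aureus"}
species_raw = {r["Species"] for r in records}
species_clean = {SPECIES_NAMES.get(s, s) for s in species_raw}

print(n_variables, n_units, n_values)
print(as_day_first, as_month_first)
print(age_decimal)
print(sorted(species_raw), sorted(species_clean))
```

The script cannot tell you which date format is the correct one. Only a data dictionary, or the person who recorded the data, can settle that.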