3.1 You’ve got data! Now what?
A dataset is a permanent work in progress. It is up to the researcher to decide whether enough work and progress have been made in order to stop wrangling and start analysing the data. Data wrangling is the process of transforming raw, unstructured and/or messy data into a prepared, structured and organised dataset. Our intervention will depend on the type of data we have collected. Completely unstructured data, such as text, that are not organised according to any standard model demand a more extensive intervention, whereas semi-structured data containing some organisational properties and elements, like a letter encoded with Extensible Markup Language (XML), and structured data (e.g. tabular) can be more intuitively wrangled.
Activity 5 First steps in data wrangling
The Darwin Correspondence Project has been collecting Charles Darwin’s letters since 1974. The project showcases the entire data pipeline and its evolution over the last decades. The letters from and to Darwin have been digitised, transcribed, described and encoded in XML using the guidelines of the Text Encoding Initiative. Some of the material is available at the project’s homepage, and this activity draws on this rich archive to create a ‘sample dataset’ and understand the place of wrangling within the digital methods pipeline.
To find a marked-up example, go to the Darwin Correspondence Project, click on ‘Joseph Dalton Hooker’, then, under ‘Key letters’ and ‘Developing a theory’, select ‘Darwin to J.D. Hooker, [11 January 1844]’.
The activity is simple: take the marked-up letter, identify the following variables and enter them in a spreadsheet as columns (a minimal code sketch follows the list).
- Letter no
- Sender
- Receiver
- Date
- Place
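If you prefer to build the sample dataset in code rather than in a spreadsheet, a minimal sketch in Python with the pandas library might look as follows. The column names are lower-case versions of the variables above, and the values shown are illustrative placeholders that should be checked against the letter page itself.

```python
# A minimal sketch of the sample dataset using pandas; the values are
# illustrative and should be verified against the letter on the project site.
import pandas as pd

letters = pd.DataFrame([
    {
        "letter_no": "DCP-LETT-729",   # the project's identifier for the letter
        "sender": "Charles Darwin",
        "receiver": "J. D. Hooker",
        "date": "1844-01-11",
        "place": "Down",
    }
])

# Save the one-row dataset in a spreadsheet-friendly format.
letters.to_csv("darwin_sample.csv", index=False)
print(letters)
```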
Data wrangling is an iterative process, a recurring loop rather than a linear sequence, but we can distinguish seven data wrangling tasks.
1 Data import
Immediately after entering or collecting your data, you will open or import it into a programme. Common spreadsheet software (e.g. Calc and Excel) can be enough to perform elementary discovery and manipulation operations, whereas ‘data wranglers’ are more powerful programmes (e.g. OpenRefine) or sets of functions in a programming language (e.g. Python and R) purposefully assembled for the whole process of large-scale wrangling, allowing you to perform batch operations. If you have semi-structured or unstructured data, the first task of a data wrangler is to parse the data into a standard data model (e.g. a table) that you can explore.
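As a hedged illustration, the sketch below parses a TEI-encoded letter into a single tabular record using Python’s standard xml.etree.ElementTree module. The element names follow common TEI correspondence markup (correspAction elements of type ‘sent’ and ‘received’); the project’s actual encoding may differ, and the file name is hypothetical.

```python
# A sketch of parsing a TEI-encoded letter into one tabular record.
# Assumes common TEI correspondence markup; adapt the path and tags as needed.
import xml.etree.ElementTree as ET
import pandas as pd

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}  # the TEI namespace

def letter_to_record(path):
    root = ET.parse(path).getroot()
    sent = root.find(".//tei:correspAction[@type='sent']", TEI)
    received = root.find(".//tei:correspAction[@type='received']", TEI)

    def child_text(node, tag):
        found = node.find(f"tei:{tag}", TEI) if node is not None else None
        return found.text if found is not None else None

    date_node = sent.find("tei:date", TEI) if sent is not None else None
    return {
        "sender": child_text(sent, "persName"),
        "receiver": child_text(received, "persName"),
        "date": date_node.get("when") if date_node is not None else None,
        "place": child_text(sent, "placeName"),
    }

# Parse one (hypothetical) downloaded letter file into a one-row table.
df = pd.DataFrame([letter_to_record("darwin_letter.xml")])
print(df)
```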
2 Data exploration
While it is legitimate to assume that you already know what your manually collected and entered data look like (e.g. the sample dataset you have just created with the Darwin correspondence), the same may not be true of automatically collected data (e.g. from web scraping). If you have tabular data, for example, you may want to count the records, verify the number and names of the variables, examine the column headers, and perform a first inspection of the cell contents.
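For instance, a first pass over the illustrative CSV file created earlier could be made with pandas, roughly as follows.

```python
# A minimal sketch of a first inspection of a tabular dataset with pandas.
import pandas as pd

df = pd.read_csv("darwin_sample.csv")   # illustrative file from the activity

print(len(df))              # number of records (rows)
print(df.shape[1])          # number of variables (columns)
print(list(df.columns))     # column headers
print(df.dtypes)            # how each column has been interpreted
print(df.head())            # first rows, for an eyeball check of cell contents
```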
3 Structuring
Structuring includes operations that divide and classify the data into broad categories. Unlike data exploration, structuring consists of substantive, and in some cases substantial, changes to the dataset: you are manipulating and transforming the original data, adapting the dataset to your research needs and making it consistent. Typical operations include merging and splitting records, combining and dividing rows and columns, subsetting variables and observations, reshaping data (transposing from rows to columns or vice versa), and deleting records deemed irrelevant to your research.
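As a rough illustration, the pandas sketch below performs a few of these structuring operations on the illustrative Darwin table; the column names are assumptions carried over from the activity.

```python
# A minimal sketch of typical structuring operations with pandas.
import pandas as pd

df = pd.read_csv("darwin_sample.csv")

# Split one column into two, e.g. separate the sender's forename and surname.
df[["sender_forename", "sender_surname"]] = df["sender"].str.split(" ", n=1, expand=True)

# Subset variables and observations.
core = df[["letter_no", "sender", "receiver", "date"]]
sent_by_darwin = df[df["sender"] == "Charles Darwin"]

# Reshape: transpose rows and columns.
transposed = df.set_index("letter_no").T

# Delete records deemed irrelevant (here, rows with no date).
df = df.dropna(subset=["date"])
```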
4 Cleaning
All wrangling tasks are potentially recurrent, but cleaning is categorically recurrent, and repeated in combination with the other tasks. Common cleaning procedures include:
- Identification of missing data
- Removal or correction of corrupt or incorrect records
- Detection of redundant records and consequent deletion of duplicates
- Normalisation of fields according to international standards and/or research needs. For example, the International Organization for Standardization maintains a standard list of country codes (ISO 3166), and adopting such standards is good research practice. Other simple operations include choosing between a full name and an acronym (e.g. United Nations or UN), or adopting a single and consistent date format (e.g. yyyy-mm-dd).
Only a clean dataset will produce valid analytical results.
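A hedged sketch of these cleaning steps in pandas, again using the illustrative Darwin table, might look like this.

```python
# A minimal sketch of common cleaning procedures with pandas.
import pandas as pd

df = pd.read_csv("darwin_sample.csv")

# Identify missing data, column by column.
print(df.isna().sum())

# Remove or correct incorrect records, e.g. harmonise a mistyped sender name.
df["sender"] = df["sender"].replace({"C. Darwin": "Charles Darwin"})

# Detect redundant records and delete duplicates.
df = df.drop_duplicates(subset=["letter_no"])

# Normalise fields: adopt a single, consistent date format (yyyy-mm-dd).
df["date"] = pd.to_datetime(df["date"], dayfirst=True).dt.strftime("%Y-%m-%d")
```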
5 Appending
Your dataset can be enriched by deriving data from existing fields or by collecting more data; this is known as data appending. Three general types of intervention are:
- Deriving data is the simplest appending operation: adding new data based on values already present in the dataset, e.g. deriving geographical coordinates from the column ‘place’ you have created in the Darwin dataset (see the sketch after this list).
- Linking data is connecting your data to external sources, e.g. you could produce linked data by connecting the fields populated with correspondents in the Darwin dataset to a name authority file such as the Virtual International Authority File (VIAF).
- Data augmentation is the incorporation of new data, usually from new collections. Augmenting brings you back to data collection and you loop back to the beginning of the data wrangling process.
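As an example of the first type, the sketch below derives coordinates from the ‘place’ column. It assumes the third-party geopy package and its Nominatim geocoder, which queries the OpenStreetMap service over the internet; the function and column names are illustrative rather than prescribed.

```python
# A sketch of deriving geographical coordinates from the 'place' column,
# assuming the third-party geopy package (pip install geopy).
import pandas as pd
from geopy.geocoders import Nominatim

df = pd.read_csv("darwin_sample.csv")
geolocator = Nominatim(user_agent="darwin-correspondence-example")

def coordinates(place):
    """Return (latitude, longitude) for a place name, or (None, None)."""
    location = geolocator.geocode(place)
    if location is None:
        return None, None
    return location.latitude, location.longitude

df[["latitude", "longitude"]] = df["place"].apply(
    lambda place: pd.Series(coordinates(place))
)
```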
6 Validating
The integrity of your analysis and research depends on the quality of the underlying dataset. Validation is like quality control: you cross-check the dataset to ensure that all the transformations performed are correct and consistent. You might verify whether values are related to the right variables, confirm that the data is correctly formatted (e.g. text as strings, numbers as integers, etc.), and, lastly, confirm that the values are plausibly distributed within the context of the dataset and the respective research topic.
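A few such checks can be written as assertions, as in the pandas sketch below; the expected types, date format and date range are assumptions appropriate to the illustrative Darwin dataset.

```python
# A minimal sketch of validation checks expressed as assertions.
import pandas as pd

df = pd.read_csv("darwin_sample.csv")

# Values attached to the right variables: senders and receivers should be text.
assert df["sender"].map(lambda v: isinstance(v, str)).all()
assert df["receiver"].map(lambda v: isinstance(v, str)).all()

# Correct formatting: dates must parse in the adopted yyyy-mm-dd format.
dates = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
assert dates.notna().all(), "some dates are not in yyyy-mm-dd format"

# Plausible distribution: all letters should fall within Darwin's lifetime.
assert dates.between("1809-02-12", "1882-04-19").all()
```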
7 Exporting
This is the output of the data wrangling exercise and marks the end of the wrangling loop. Exporting can entail different, if complementary, actions. The most customary is converting the dataset into a format interpretable by the programme chosen to analyse the data and/or simply loading the wrangled dataset into the analytical package. Alternatively or cumulatively, the dataset can be published to guarantee the reproducibility of the research and/or archived in a repository for preservation. At every stage, you should keep a log of data-level information in which the adopted procedures and the underlying logic are duly recorded.
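In pandas, for example, exporting and logging could be sketched as follows; the file names and the log wording are illustrative.

```python
# A minimal sketch of exporting the wrangled dataset and logging the steps.
from datetime import date
import pandas as pd

df = pd.read_csv("darwin_sample.csv")

# Convert the dataset into formats the analysis software can read.
df.to_csv("darwin_letters_wrangled.csv", index=False)
df.to_json("darwin_letters_wrangled.json", orient="records", indent=2)

# Keep a data-level log recording the procedures applied and their rationale.
with open("wrangling_log.txt", "a", encoding="utf-8") as log:
    log.write(f"{date.today()}: normalised dates to yyyy-mm-dd; "
              "removed duplicates on letter_no\n")
```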