
Digital humanities: humanities research in the digital age

Session 3 Data wrangling

This session is written by Hugo Leal from The University of Cambridge.

Look around. Almost everything you see is data or can be turned into data. The text you are reading right now is data. The idea that characters on a screen are data may sound surprising or daunting. A nineteenth-century digitised letter, an online news article or a post on social media are all data too. They can all be created, collected, wrangled, queried, analysed, visualised and interpreted to address research questions.

In the last session you learned about accessing and collecting data. Most digital data we collect are ‘raw’ – they haven’t been prepared for any type of analysis. Raw data often contain omissions, errors and inconsistencies; we call such data ‘messy data’. You will look at the basic steps that take raw, messy data to a standard model, such as a table. You’ll learn about a vital process in digital research called ‘data wrangling’: the process through which we convert our raw data into an organised dataset suitable for analysis, visualisation and interpretation. In this session and in the next, you will be using digitised letters of Charles Darwin to illustrate both data wrangling and data analysis.
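To make the idea of moving from messy data to a table more concrete, here is a minimal sketch in Python. The records, field names and the `clean` function are all invented for illustration; they are not taken from the Darwin letters, and real wrangling usually involves many more checks.

```python
import csv
import io

# Hypothetical raw records, invented for illustration only.
# Note the stray spaces, inconsistent capitalisation and a missing date.
raw_records = [
    {"id": " 101 ", "sender": "smith, a. b.", "date": "1859-11-24"},
    {"id": "102", "sender": "Jones, C. D.", "date": ""},
    {"id": "103", "sender": "  JONES, C. D. ", "date": "1860-01-03"},
]


def clean(record):
    """Return a tidied copy of one raw record."""
    return {
        "id": int(record["id"].strip()),                  # text -> whole number
        "sender": record["sender"].strip().title(),       # consistent capitalisation
        "date": record["date"].strip() or None,           # empty string -> missing value
    }


# The 'table': one cleaned row per record, every row with the same columns.
table = [clean(r) for r in raw_records]

# Write the table out as CSV text, a common tabular exchange format.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "sender", "date"])
writer.writeheader()
writer.writerows(table)
print(buffer.getvalue())
```

After cleaning, the two spellings of the second sender become identical, so a later analysis would count them as one person rather than two.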

Transcript: Video 3

HUGO LEAL

Hi, this is Hugo at the University of Cambridge. We have with us two distinguished guests today: Alison Byrne, associate director of the Darwin Correspondence Project at the University of Cambridge, and Elizabeth Smith, associate editor for Digital Development at the Darwin Correspondence Project. So Alison, tell us about the project. What is the Darwin Correspondence Project?

ALISON BYRNE

The Darwin Correspondence Project is an international group of researchers who, since about 1975, have been publishing full edited transcriptions of more than 15,000 letters written by and to the 19th-century naturalist Charles Darwin. What began as a purely print edition has successfully transitioned into both print and digital. The letters are being made available to search and read for free, both on the project's own website and now also through Epsilon, a site that brings together metadata and transcriptions of 19th-century scientific correspondence.

HUGO LEAL

And that's the perfect cue to bring in Elizabeth. Could you guide us through the data pipeline? How do we go from the original manuscript to the digital documents?

ELIZABETH SMITH

We start either from a physical letter, if we have it in the building (in Cambridge University Library, we have 9,000 of the 16,000 letters we're aware of), or from an image that's been sent to us from somewhere in the world. The other letters are in both public and private collections everywhere. We transcribe them directly into TEI XML. In Oxygen, we've created templates so that our editors can do that easily. We assign a date to each letter and work out who the correspondent is, which is not always straightforward, because a lot of them are just dated 'Saturday' and are addressed to 'Dear Sir'.

Each letter is given an identifier, which is important when you are dealing with large numbers of letters. These transcriptions are then double checked by multiple people, and eventually fed to one of our editors, who will create an editorial apparatus that's mostly in the form of writing footnotes, which is part of our print legacy. They identify people, places, published works, species mentioned in the letters. They'll explain any historical oddities or scientific concepts that might be confusing to the general reader. And they cross-reference the letters to each other.

HUGO LEAL

You end up with a data set of encoded letters, which we will be using in this session. How relevant is the structuring operation, that is, the division and classification of the data into broad categories, to the whole exercise?

ELIZABETH SMITH

Some degree of structuring is vital. Even in the early days of the project, we had ways to indicate the identifier, and the date, the physical description of the letter, and the location of the letter. More recently, we're able to put the entire file into structured data in TEI XML, which is a globally recognised standard. And because it's so widely used, it makes it really straightforward to combine our collection with other collections and give a better sense of the 19th century science conversations as they were taking place.

HUGO LEAL

The DCP digital data are openly and freely available on the project's website. We invite you to explore and enjoy this incredible digital archive using the skills, methods, and techniques you have been learning during this course.

End transcript: Video 3
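
The video describes letters encoded in TEI XML, each with an identifier, a date and a correspondent. As a rough sketch of how such structure can be read into a table row, the Python example below parses a tiny, invented TEI-style letter. The sample letter, its identifier and names are hypothetical, and the choice of the TEI `correspDesc` and `correspAction` elements is an assumption: the Darwin Correspondence Project's actual files may encode this information differently.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace; xml:id uses the built-in XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# A hypothetical, heavily simplified letter, invented for illustration only.
SAMPLE_LETTER = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="letter-0001">
  <teiHeader>
    <profileDesc>
      <correspDesc>
        <correspAction type="sent">
          <persName>Example Sender</persName>
          <date when="1860-01-01"/>
        </correspAction>
        <correspAction type="received">
          <persName>Example Recipient</persName>
        </correspAction>
      </correspDesc>
    </profileDesc>
  </teiHeader>
  <text><body><p>Dear Sir, ...</p></body></text>
</TEI>"""


def person_name(action):
    """Return the persName text inside a correspAction, or None if absent."""
    if action is None:
        return None
    return action.findtext("tei:persName", default=None, namespaces=TEI_NS)


def letter_to_row(xml_text):
    """Turn one TEI-style letter into a flat dictionary (one table row)."""
    root = ET.fromstring(xml_text)
    sent = root.find(".//tei:correspAction[@type='sent']", TEI_NS)
    received = root.find(".//tei:correspAction[@type='received']", TEI_NS)
    date = sent.find("tei:date", TEI_NS) if sent is not None else None
    return {
        "id": root.get(XML_ID),
        "sender": person_name(sent),
        "recipient": person_name(received),
        # Letters dated only 'Saturday' would have no machine-readable date.
        "date": date.get("when") if date is not None else None,
    }


row = letter_to_row(SAMPLE_LETTER)
print(row)
```

Because every letter follows the same structure, the same function could be applied to a whole folder of files to build a table with one row per letter.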

By the end of this session, you should be able to: