Week 3: Cleaning up our act Part 1
Introduction
Welcome to Week 3.
Please note: in the following video, where reference is made to a study ‘week’, this corresponds to Weeks 3 and 4 of this course.
Download this video clip.Video player: new_lcdab_w3_intro.mp4
Transcript
RUTH ALEXANDER:
Hello again.
In the previous week the data you looked at was quite tidy ready to be analysed. Unfortunately it's often the case that data is published in a form that you can't immediately start to work with. For example, numbers and dates may be in inconsistent formats, textual data may contain spurious characters and there may even be missing or nonsensical data, for example due to a typo. Such data is often referred to as 'dirty data' and it needs to be put through a process known unsurprisingly as data cleaning or data cleansing before it can be analysed. Data cleaning is often a substantial but unglamorous part of data analysis. I've been talking to someone who does this as part of his job.
DAVID GOODY:
My work at the moment is looking at a wide range of schools data to predict when they might be having issues with financial problems. This allows us to work with them early on and help resolve problems that otherwise might escalate.
We use a range of statistical techniques to do this from standard linear progression approaches through to more advanced data science techniques but with all of these if the data's quality is poor this will lead to spurious results and you might lead to the wrong conclusions.
RUTH ALEXANDER:
What's the first thing you do when you start working with data?
DAVID GOODY:
The first thing we do is see whether the data sets are complete or not. With a lot of the data sets we work with we may well find these won't cover all schools and in some cases it won't just show up as a missing value, it might show up as a zero or something that might look like you have got a valid result. So we need to analyse through these, find where that missing data is and then work out whether we exclude those schools from our analysis and predictive modelling when we're setting things up or replace them with averages or values from previous years.
RUTH ALEXANDER:
What other data quality checks do you do?
DAVID GOODY:
Once we've worked out whether the data is complete or not we then look at whether the data is sort of trustworthy and okay for us to use. The issues might be obvious if we have a pupil who's a hundred and twelve years old we've probably got an issue there, what's happened there is someone might have entered a date of birth as 1902 rather than 2002. In other cases things will be more subtle and we need to look at the distribution of results to work out where the data might be misleading and then again make these decisions of whether we try and exclude those results from our analysis or replace them with more sensible values.
RUTH ALEXANDER:
Are there any other things you do once you're happy the data is accurate?
DAVID GOODY:
In some cases the data's in the wrong format for us to work with. We've been doing quite a bit of geographical mapping work recently. The EduBase database we use that has information on schools records their location using a coordinate system known as 'northing and easting' whereas the computer programmes we've been using want them in latitude and longitude. So there we have to convert the data from one coordinate system to another and then check the information we've got at the end is still accurate.
RUTH ALEXANDER:
What processes do you have for managing data quality?
DAVID GOODY:
We've got quite a detailed quality assurance procedure within the department and with how we document this work. For, depending on the importance and complexity of the piece of work, we may find that it goes from simple sense checking and comparing against similar results through to someone parallel running the entire piece of work and by having these processes in place it allows us to have confidence in the data that we work with.
RUTH ALEXANDER:
This week you'll learn some simple approaches to cleaning data. Once the data's clean a picture is worth a thousand words, so you'll be producing your first chart with, as you guessed, one line of code. Enjoy the week.
Interactive feature not available in single page view (see it in standard view).
In Week 1 and 2 you worked on a dataset that combined two different World Health Organization datasets: population and the number of deaths due to tuberculosis.
They could be combined because they share a common attribute: the countries. This week you will learn the techniques behind the creation of such a combined dataset.