Learn to code for data analysis
Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

Free course

Learn to code for data analysis

Week 3: Cleaning up our act Part 1


Welcome to Week 3.

Please note: in the following video, where reference is made to a study ‘week’, this corresponds to Weeks 3 and 4 of this course.

Download this video clip.Video player: new_lcdab_w3_intro.mp4
Skip transcript


Hello again.
In the previous week the data you looked at was quite tidy ready to be analysed. Unfortunately it's often the case that data is published in a form that you can't immediately start to work with. For example, numbers and dates may be in inconsistent formats, textual data may contain spurious characters and there may even be missing or nonsensical data, for example due to a typo. Such data is often referred to as 'dirty data' and it needs to be put through a process known unsurprisingly as data cleaning or data cleansing before it can be analysed. Data cleaning is often a substantial but unglamorous part of data analysis. I've been talking to someone who does this as part of his job.
My work at the moment is looking at a wide range of schools data to predict when they might be having issues with financial problems. This allows us to work with them early on and help resolve problems that otherwise might escalate.
We use a range of statistical techniques to do this from standard linear progression approaches through to more advanced data science techniques but with all of these if the data's quality is poor this will lead to spurious results and you might lead to the wrong conclusions.
What's the first thing you do when you start working with data?
The first thing we do is see whether the data sets are complete or not. With a lot of the data sets we work with we may well find these won't cover all schools and in some cases it won't just show up as a missing value, it might show up as a zero or something that might look like you have got a valid result. So we need to analyse through these, find where that missing data is and then work out whether we exclude those schools from our analysis and predictive modelling when we're setting things up or replace them with averages or values from previous years.
What other data quality checks do you do?
Once we've worked out whether the data is complete or not we then look at whether the data is sort of trustworthy and okay for us to use. The issues might be obvious if we have a pupil who's a hundred and twelve years old we've probably got an issue there, what's happened there is someone might have entered a date of birth as 1902 rather than 2002. In other cases things will be more subtle and we need to look at the distribution of results to work out where the data might be misleading and then again make these decisions of whether we try and exclude those results from our analysis or replace them with more sensible values.
Are there any other things you do once you're happy the data is accurate?
In some cases the data's in the wrong format for us to work with. We've been doing quite a bit of geographical mapping work recently. The EduBase database we use that has information on schools records their location using a coordinate system known as 'northing and easting' whereas the computer programmes we've been using want them in latitude and longitude. So there we have to convert the data from one coordinate system to another and then check the information we've got at the end is still accurate.
What processes do you have for managing data quality?
We've got quite a detailed quality assurance procedure within the department and with how we document this work. For, depending on the importance and complexity of the piece of work, we may find that it goes from simple sense checking and comparing against similar results through to someone parallel running the entire piece of work and by having these processes in place it allows us to have confidence in the data that we work with.
This week you'll learn some simple approaches to cleaning data. Once the data's clean a picture is worth a thousand words, so you'll be producing your first chart with, as you guessed, one line of code. Enjoy the week.
End transcript
Interactive feature not available in single page view (see it in standard view).

In Week 1 and 2 you worked on a dataset that combined two different World Health Organization datasets: population and the number of deaths due to tuberculosis.

They could be combined because they share a common attribute: the countries. This week you will learn the techniques behind the creation of such a combined dataset.


Take your learning further

Making the decision to study can be a big step, which is why you'll want a trusted University. The Open University has 50 years’ experience delivering flexible learning and 170,000 students are studying with us right now. Take a look at all Open University courses.

If you are new to University-level study, we offer two introductory routes to our qualifications. You could either choose to start with an Access module, or a module which allows you to count your previous learning towards an Open University qualification. Read our guide on Where to take your learning next for more information.

Not ready for formal University study? Then browse over 1000 free courses on OpenLearn and sign up to our newsletter to hear about new free courses as they are released.

Every year, thousands of students decide to study with The Open University. With over 120 qualifications, we’ve got the right course for you.

Request an Open University prospectus371