Learn to code for data analysis

1.3 Missing values

As you heard in the video at the start of the week, missing values (also called null values) are one of the reasons to clean data.

An image of a girl with the last piece of a jigsaw puzzle

Figure 5

Missing values in a particular column can be found with the column method isnull(), like this:

In []:

london['Events'].isnull()

The above code returns a series of Boolean values, where True indicates that the corresponding row in the 'Events' column is missing a value and False indicates the presence of a value. Here are the last few rows from the series:

...
360    False
361     True
362     True
363     True
364    False
Name: Events, dtype: bool
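To try isnull() without the full weather dataset, here is a minimal sketch on a small stand-in dataframe. The name weather, the 'Temp' column and all the values are hypothetical, made up for illustration only. Because True counts as 1 when Boolean values are summed, adding up the series gives the number of missing values.

```python
import pandas as pd

# Hypothetical stand-in for the london dataframe; None marks a missing event.
weather = pd.DataFrame({
    'Temp': [7, 9, 4, 6],
    'Events': ['Rain', None, 'Fog', None],
})

# Boolean series: True where the 'Events' value is missing.
missing = weather['Events'].isnull()
print(missing)

# True counts as 1 when summed, so this is the number of missing values.
missing_count = missing.sum()
print(missing_count)
```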

If, as you did with the comparison expressions, you put this code within square brackets after the dataframe’s name, it will return a new dataframe consisting of all the rows without recorded events (rain, fog, thunderstorm, etc.):

In []:

london[london['Events'].isnull()]

As you will see in Exercise 4 of the exercise notebook, this will return a new dataframe with 114 rows, showing that almost one in three days had no particular event recorded. If you scroll the table to the right, you will see that all values in the 'Events' column are marked NaN, which stands for ‘Not a Number’ but is also used to mark non-numeric missing values, as in this case (events are strings, not numbers).
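The share of rows with a missing value can be worked out from the filtered dataframe. The following is only a sketch on hypothetical data (the weather dataframe and its values are invented for illustration): it selects the rows without an event and divides their number by the total number of rows.

```python
import pandas as pd

# Hypothetical stand-in for the london dataframe; NaN marks a missing event.
weather = pd.DataFrame({
    'Day': [1, 2, 3, 4],
    'Events': ['Rain', float('nan'), 'Fog', float('nan')],
})

# Keep only the rows where 'Events' is missing.
no_events = weather[weather['Events'].isnull()]
print(no_events)

# Fraction of rows without a recorded event.
fraction_missing = len(no_events) / len(weather)
print(fraction_missing)
```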

Once you know how much and where data is missing, you have to decide what to do: ignore those rows? Replace with a fixed value? Replace with a computed value, like the mean?

In this case, only the first two options are possible, because events are strings and a mean cannot be computed for them. The method call london.dropna() will drop (remove) all rows that have a missing (non-available) value in any column, returning a new dataframe. It will therefore also remove rows that have a recorded event but a missing value in some other column.
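The effect of dropna() on rows with gaps in other columns can be seen in a small sketch. The weather dataframe and its values are hypothetical. The subset argument of dropna() is shown as one way to restrict the check to a single column; it is not used elsewhere in this section.

```python
import pandas as pd

# Hypothetical data: one row lacks a temperature, another lacks an event.
weather = pd.DataFrame({
    'Temp': [7.0, float('nan'), 4.0, 6.0],
    'Events': ['Rain', 'Fog', None, 'Rain'],
})

# Drops every row with a missing value in any column.
complete = weather.dropna()

# Drops only the rows where 'Events' is missing.
with_events = weather.dropna(subset=['Events'])

# dropna() returns a new dataframe, so the original keeps all its rows.
print(len(weather), len(complete), len(with_events))
```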

The column method fillna() will replace all non-available values with the value given as argument. For this case, each NaN could be replaced by the empty string.

In []:

london['Events'] = london['Events'].fillna('')

london[london['Events'].isnull()]

The second line above will now show an empty dataframe, because there are no longer missing values in the events column.
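The same replacement can be checked on a small series. This is a sketch with hypothetical values: after fillna(''), no value is reported as missing any more, and the days without events can instead be found by comparing with the empty string.

```python
import pandas as pd

# Hypothetical events column with two missing values.
events = pd.Series(['Rain', None, 'Fog', None], name='Events')

# Replace each missing value with the empty string.
filled = events.fillna('')

# No missing values remain after the replacement.
remaining_missing = filled.isnull().sum()

# The former gaps are now empty strings and can be counted as such.
empty_count = (filled == '').sum()
print(remaining_missing, empty_count)
```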

As a final note on missing values, pandas by default ignores them when computing numeric statistics, i.e. you don’t have to remove missing values before applying sum(), median() and other similar methods.
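A short sketch with hypothetical temperature values illustrates this. The statistics are computed from the three available values only. Passing skipna=False, shown here only for contrast, makes the missing value propagate into the result.

```python
import pandas as pd

# Hypothetical temperatures with one missing value.
temps = pd.Series([7.0, float('nan'), 4.0, 10.0])

total = temps.sum()        # adds 7.0, 4.0 and 10.0
middle = temps.median()    # median of the three available values
average = temps.mean()     # mean of the three available values
print(total, middle, average)

# With skipna=False the missing value is not ignored and the result is NaN.
strict_total = temps.sum(skipna=False)
print(strict_total)
```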

Learn about checking data types of each column in the next section.
