Learn to code for data analysis

1.3 Missing values

As you heard in the video at the start of the week, missing values (also called null values) are one of the reasons to clean data.

To find the missing values in a particular column, use the column method isnull(), like this:

In []:

london['Events'].isnull()

The above code returns a series of Boolean values, where True indicates that the corresponding row in the 'Events' column is missing a value and False indicates the presence of a value. Here are the last few rows from the series:

...
360    False
361     True
362     True
363     True
364    False
Name: Events, dtype: bool

If, as you did with the comparison expressions, you put this code within square brackets after the dataframe’s name, it will return a new dataframe consisting of all the rows without recorded events (rain, fog, thunderstorm, etc.):

In []:

london[london['Events'].isnull()]

As you will see in Exercise 4 of the exercise notebook, this will return a new dataframe with 114 rows, showing that nearly one in three days had no particular event recorded. If you scroll the table to the right, you will see that all values in the 'Events' column are marked NaN, which stands for ‘Not a Number’ but is also used to mark non-numeric missing values, as in this case (events are strings, not numbers).
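The filtering and counting described above can be tried on a small example. The following is only a sketch: the dataframe `sample` and its 'Temp' column are made up for illustration and stand in for the course's london dataframe, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the london dataframe.
sample = pd.DataFrame({
    'Temp': [7, 9, 4, 6, 8],
    'Events': ['Rain', np.nan, np.nan, 'Fog', np.nan],
})

# Boolean series: True where 'Events' has no value.
missing = sample['Events'].isnull()

# True counts as 1 when summed, so this is the number of missing values.
missing_count = missing.sum()

# Rows without a recorded event, and their share of all rows.
no_events = sample[missing]
missing_share = len(no_events) / len(sample)

print(missing_count, missing_share)
```

Summing the Boolean series is a quick way to get the number of missing values without displaying the filtered dataframe.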

Once you know how much data is missing and where, you have to decide what to do: ignore those rows? Replace the missing values with a fixed value? Replace them with a computed value, like the mean?

In this case, only the first two options are possible, because events are strings and so have no mean. The method call london.dropna() will drop (remove) all rows that have a missing (non-available) value in any column, returning a new dataframe. It will therefore also remove rows whose 'Events' value is present but which have missing values in other columns.
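To see the effect of dropna() on rows with gaps in different columns, here is a minimal sketch on a made-up dataframe (`sample` and the 'Temp' column are assumptions for illustration). It also uses the `subset` argument of dropna(), which this section does not cover, to restrict the check to one column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 lacks a temperature, row 2 lacks an event.
sample = pd.DataFrame({
    'Temp': [7.0, np.nan, 4.0, 6.0],
    'Events': ['Rain', 'Fog', np.nan, 'Rain'],
})

# Drops every row with a missing value in any column (rows 1 and 2).
complete_rows = sample.dropna()

# Drops only rows where 'Events' is missing (row 2); row 1 is kept
# even though its 'Temp' value is missing.
with_events = sample.dropna(subset=['Events'])

# dropna() returns a new dataframe, so sample still has all four rows.
print(len(sample), len(complete_rows), len(with_events))
```

Restricting the check to one column may be preferable when gaps in the other columns do not matter for the analysis at hand.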

The column method fillna() will replace all non-available values with the value given as argument. For this case, each NaN could be replaced by the empty string.

In []:

london['Events'] = london['Events'].fillna('')

london[london['Events'].isnull()]

The second line above will now show an empty dataframe, because there are no longer missing values in the events column.
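For comparison, the third option mentioned earlier, replacing with a computed value, only makes sense for numeric columns. The sketch below uses a made-up dataframe (`sample` and 'Temp' are assumptions, not part of the course data) to show both kinds of replacement side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a numeric and a string column.
sample = pd.DataFrame({
    'Temp': [4.0, np.nan, 8.0],
    'Events': ['Rain', np.nan, np.nan],
})

# Fixed value: every missing event becomes the empty string.
sample['Events'] = sample['Events'].fillna('')

# Computed value: mean() skips the NaN, giving (4 + 8) / 2 = 6.0,
# which then replaces the missing temperature.
sample['Temp'] = sample['Temp'].fillna(sample['Temp'].mean())

# Total number of missing values left in the whole dataframe.
remaining = sample.isnull().sum().sum()
print(remaining)
```

Whether a mean is a sensible replacement depends on the data; here it is shown only to illustrate the mechanics.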

As a final note on missing values, pandas ignores them by default when computing numeric statistics, so you don’t have to remove missing values before applying sum(), median() and other similar methods.
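A short sketch of this behaviour, using a made-up series of temperatures (the name `temps` and the values are assumptions for illustration). The `skipna` argument shown at the end is how pandas lets you switch the default off.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
temps = pd.Series([4.0, np.nan, 8.0, 6.0])

# The missing value is skipped in each statistic.
total = temps.sum()       # 4 + 8 + 6
middle = temps.median()   # median of 4, 6, 8
average = temps.mean()    # 18 divided by 3 values, not 4

# With skipna=False the missing value is not ignored,
# so the result itself is NaN.
strict_total = temps.sum(skipna=False)

print(total, middle, average, strict_total)
```

Note that the mean is computed over the three available values only, which is why it differs from dividing the sum by the length of the series.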

Learn about checking data types of each column in the next section.
