1.3 Missing values
As you heard in the video at the start of the week, missing values (also called null values) are one of the reasons to clean data.
Finding missing values in a particular column can be done with the column method isnull(), like this:
In []:
london['Events'].isnull()
The above code returns a series of Boolean values, where True indicates that the corresponding row in the 'Events' column is missing a value and False indicates the presence of a value. Here are the last few rows from the series:
...
360 False
361 True
362 True
363 True
364 False
Name: Events, dtype: bool
If, as you did with the comparison expressions, you put this code within square brackets after the dataframe’s name, it will return a new dataframe consisting of all the rows without recorded events (rain, fog, thunderstorm, etc.):
In []:
london[london['Events'].isnull()]
As you will see in Exercise 4 of the exercise notebook, this will return a new dataframe with 114 rows, showing that nearly one in three days (114 of 365) had no particular event recorded. If you scroll the table to the right, you will see that all values in the 'Events' column are marked NaN, which stands for ‘Not a Number’, but is also used to mark non-numeric missing values, as in this case (events are strings, not numbers).
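To make the counting step concrete, here is a minimal sketch. It assumes a small made-up dataframe named weather (a hypothetical stand-in for london, with invented values), and relies on the fact that True counts as 1 when a Boolean series is summed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the london dataframe; the values are invented.
weather = pd.DataFrame({
    'Mean TemperatureC': [7.0, np.nan, 9.0, 4.0],
    'Events': ['Rain', np.nan, np.nan, 'Fog'],
})

# Boolean series: True where the 'Events' value is missing.
missing_events = weather['Events'].isnull()

# Summing a Boolean series counts the True values.
n_missing = missing_events.sum()

# Rows without a recorded event, selected as in the text.
rows_without_events = weather[missing_events]

# Number of missing values in every column at once.
missing_per_column = weather.isnull().sum()
```

The length of rows_without_events and the value of n_missing should agree, which is a quick check that the selection did what was intended.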
Once you know how much and where data is missing, you have to decide what to do: ignore those rows? Replace with a fixed value? Replace with a computed value, like the mean?
In this case, only the first two options are possible, because events are strings and no mean can be computed from them. The method call london.dropna() will drop (remove) all rows that have a missing (non-available) value in any column, returning a new dataframe. It will therefore also remove rows whose 'Events' value is present but that have missing values in other columns.
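If only the missing events matter, one possible approach is to restrict dropna() to that column with its subset argument. The sketch below uses a made-up dataframe named weather (hypothetical, with invented values) to contrast the two calls.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the london dataframe; the values are invented.
weather = pd.DataFrame({
    'Max Gust SpeedKm/h': [np.nan, 40.0, 35.0, 50.0],
    'Events': ['Rain', np.nan, 'Fog', 'Rain'],
})

# Drops every row with a missing value in any column.
all_complete = weather.dropna()

# Drops only the rows where 'Events' is missing.
events_known = weather.dropna(subset=['Events'])
```

Both calls return new dataframes and leave weather unchanged, so the result has to be assigned to a name to be kept.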
The column method fillna() will replace all non-available values with the value given as argument. For this case, each NaN could be replaced by the empty string.
In []:
london['Events'] = london['Events'].fillna('')
london[london['Events'].isnull()]
The second line above will now show an empty dataframe, because there are no longer missing values in the events column.
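The third option mentioned earlier, replacing with a computed value, does not suit the text-valued events, but it could be applied to a numeric column. The following is only a sketch under that assumption: the dataframe weather and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing values; the values are invented.
weather = pd.DataFrame({'Mean TemperatureC': [6.0, np.nan, 10.0, np.nan]})

# mean() skips the missing values, so this is the mean of 6.0 and 10.0.
column_mean = weather['Mean TemperatureC'].mean()

# Replace each missing value with the computed mean.
weather['Mean TemperatureC'] = weather['Mean TemperatureC'].fillna(column_mean)
```

Whether filling with the mean is appropriate depends on the data and the question being asked, so treat it as one choice among several.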
As a final note on missing values, pandas by default ignores them when computing numeric statistics, i.e. you don’t have to remove missing values before applying sum(), median() and other similar methods.
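A small sketch of this behaviour, using an invented series named temperatures (hypothetical), is given here. It also shows the skipna argument, which can be set to False when a missing value should make the result missing too.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value; the values are invented.
temperatures = pd.Series([10.0, np.nan, 20.0])

# The missing value is skipped by default.
total = temperatures.sum()
middle = temperatures.median()

# With skipna=False the missing value propagates to the result.
strict_total = temperatures.sum(skipna=False)
```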
Learn about checking data types of each column in the next section.