1.4 Changing the value types of columns
The function read_csv() may, for many reasons, wrongly interpret the data type of the values in a column, so when cleaning data it’s important to check the data types of each column are what is expected, and if necessary change them.
The data type of every column in a dataframe can be determined by looking at the dataframe’s dtypes attribute, like this:
In []:
london.dtypes
Out[]:
GMT object
Max TemperatureC int64
Mean TemperatureC int64
Min TemperatureC int64
Dew PointC int64
MeanDew PointC int64
Min DewpointC int64
Max Humidity int64
Mean Humidity int64
Min Humidity int64
Max Sea Level PressurehPa int64
Mean Sea Level PressurehPa int64
Min Sea Level PressurehPa int64
Max VisibilityKm int64
Mean VisibilityKm int64
Min VisibilitykM int64
Max Wind SpeedKm/h int64
Mean Wind SpeedKm/h int64
Max Gust SpeedKm/h float64
Precipitationmm float64
CloudCover float64
Events object
WindDirDegrees object
dtype: object
In the above output, you can see the column names to the left and to the right the data types of the values in those columns.
- int64 is the pandas data type for whole numbers such as 55 or 2356
- float64 is the pandas data type for decimal numbers such as 55.25 or 2356.00
- object is the pandas data type for strings such as 'hello world' or 'rain'
Most of the column data types seem fine, however two are of concern, 'GMT' and 'WindDirDegrees' , both of which are of type object. Let’s take a look at 'WindDirDegrees' first.
Changing the data type of the 'WindDirDegrees' column
The read_csv() method has interpreted the values in the 'WindDirDegrees' column as strings (type object ). This is because in the CSV file the values in that column had all been suffixed with that html line break string so read_csv() had no alternative but to interpret the values as strings.
The values in the 'WindDirDegrees' column are meant to represent wind direction in terms of degrees from true north (360) and meteorologists always define the wind direction as the direction the wind is coming from. So if you stand so that the wind is blowing directly into your face, the direction you are facing names the wind, so a westerly wind is reported as 270 degrees. The compass rose shown below should make this clearer:
We need to be able to make queries such as ‘Get and display the rows where the wind direction is greater than 350 degrees’. To do this we need to change the data type of the ‘WindDirDegrees’ column from object to type int64 . We can do that by using the astype() method like this:
In []:
london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')
Now all the values in the 'WindDirDegrees' column are of type int64 and we can make our query:
In []:
london[london['WindDirDegrees'] > 350]
Out[]:
Note that the 'WindDirDegrees' column is on the far right of the table and the right of the table has been cropped to fit on the page.
Changing the data type of the ‘GMT’ column
Recall that I noted that the 'GMT' column was of type object , the type pandas uses for strings.
The 'GMT' column is supposed to represent dates. It would be helpful for the date values not to be strings to make it possible to make queries of the data such as ‘Return the row where the date is 4 June 2014’.
Pandas has a function called to_datetime() which can convert a column of object (string) values such as those in the 'GMT' column into values of a proper date type called datetime64 , just like this:
In []:
london['GMT'] = to_datetime(london['GMT'])
#Then display the types of all the columns again so we
#can check the changes have been made.
london.dtypes
Out[]:
GMT datetime64[ns]
Max TemperatureC int64
Mean TemperatureC int64
Min TemperatureC int64
Dew PointC int64
MeanDew PointC int64
Min DewpointC int64
Max Humidity int64
Mean Humidity int64
Min Humidity int64
Max Sea Level PressurehPa int64
Mean Sea Level PressurehPa int64
Min Sea Level PressurehPa int64
Max VisibilityKm int64
Mean VisibilityKm int64
Min VisibilitykM int64
Max Wind SpeedKm/h int64
Mean Wind SpeedKm/h int64
Max Gust SpeedKm/h float64
Precipitationmm float64
CloudCover float64
Events object
WindDirDegrees int64
dtype: object
From the above output, we can confirm that the 'WindDirDegrees' column type has been changed from object to int64 and that the 'GMT' column type has been changed from object to datetime64 .
To make queries such as ‘Return the row where the date is 4 June 2014’ you’ll need to be able to create a datetime64 value to represent June 4 2014. It cannot be:
london[london['GMT'] == '2014-1-3']
because ‘2014-1-3’ is a string and the values in the ‘GMT’ column are of type datetime64 . Instead you must create a datetime64 value using thedatetime() function like this:
datetime(2014, 6, 4)
In the function call above, the first integer argument is the year, the second the month and the third the day.
First import the `datetime()` function from the similarly named `datetime` package by running the following line of code:
In []:
from datetime import datetime
Let’s try the function out by executing the code to ‘Return the row where the date is 4 June 2014’:
In []:
london[london['GMT'] == datetime(2014, 6, 4)]
Out[]:
Note that the right side of the table has been cropped to fit on the page.
You can also now make more complex queries involving dates such as ‘Return all the rows where the date is between 8 December 2014 and 12 December 2014’, like this:
In []:
c
london[(london['GMT'] >= datetime(2014, 12, 8))
& (london['GMT'] <= datetime(2014, 12, 12))]
Out[]:
Note that the right side of the table has been cropped to fit on the page.
Exercise 4 Display rows from dataframe
Now try Exercise 4 in the Exercise notebook 2.
If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter.
Once the notebook is open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in Week 1 Exercise 1 [Tip: hold Ctrl and click a link to open it in a new tab. (Hide tip)] .