Skip to content
Skip to main content

About this free course

Download this course

Share this free course

Learn to code for data analysis
Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

1.4 Changing the value types of columns

The function read_csv() may, for many reasons, wrongly interpret the data type of the values in a column, so when cleaning data it’s important to check the data types of each column are what is expected, and if necessary change them.

The data type of every column in a dataframe can be determined by looking at the dataframe’s dtypes attribute, like this:

In []:

london.dtypes

Out[]:

GMT object

Max TemperatureC int64

Mean TemperatureC int64

Min TemperatureC int64

Dew PointC int64

MeanDew PointC int64

Min DewpointC int64

Max Humidity int64

Mean Humidity int64

Min Humidity int64

Max Sea Level PressurehPa int64

Mean Sea Level PressurehPa int64

Min Sea Level PressurehPa int64

Max VisibilityKm int64

Mean VisibilityKm int64

Min VisibilitykM int64

Max Wind SpeedKm/h int64

Mean Wind SpeedKm/h int64

Max Gust SpeedKm/h float64

Precipitationmm float64

CloudCover float64

Events object

WindDirDegrees object

dtype: object

In the above output, you can see the column names to the left and to the right the data types of the values in those columns.

  • int64 is the pandas data type for whole numbers such as 55 or 2356
  • float64 is the pandas data type for decimal numbers such as 55.25 or 2356.00
  • object is the pandas data type for strings such as 'hello world' or 'rain'

Most of the column data types seem fine, however two are of concern, 'GMT' and 'WindDirDegrees' , both of which are of type object. Let’s take a look at 'WindDirDegrees' first.

Changing the data type of the 'WindDirDegrees' column

The read_csv() method has interpreted the values in the 'WindDirDegrees' column as strings (type object ). This is because in the CSV file the values in that column had all been suffixed with that html line break string so read_csv() had no alternative but to interpret the values as strings.

The values in the 'WindDirDegrees' column are meant to represent wind direction in terms of degrees from true north (360) and meteorologists always define the wind direction as the direction the wind is coming from. So if you stand so that the wind is blowing directly into your face, the direction you are facing names the wind, so a westerly wind is reported as 270 degrees. The compass rose shown below should make this clearer:

Figure 6 A compass rose

We need to be able to make queries such as ‘Get and display the rows where the wind direction is greater than 350 degrees’. To do this we need to change the data type of the ‘WindDirDegrees’ column from object to type int64 . We can do that by using the astype() method like this:

In []:

london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')

Now all the values in the 'WindDirDegrees' column are of type int64 and we can make our query:

In []:

london[london['WindDirDegrees'] > 350]

Out[]:

Rows from the london dataframe where the value in the WindDirDegrees column is greater than 350.
Figure 7

Note that the 'WindDirDegrees' column is on the far right of the table and the right of the table has been cropped to fit on the page.

Changing the data type of the ‘GMT’ column

Recall that I noted that the 'GMT' column was of type object , the type pandas uses for strings.

The 'GMT' column is supposed to represent dates. It would be helpful for the date values not to be strings to make it possible to make queries of the data such as ‘Return the row where the date is 4 June 2014’.

Pandas has a function called to_datetime() which can convert a column of object (string) values such as those in the 'GMT' column into values of a proper date type called datetime64 , just like this:

In []:

london['GMT'] = to_datetime(london['GMT'])

#Then display the types of all the columns again so we

#can check the changes have been made.

london.dtypes

Out[]:

GMT datetime64[ns]

Max TemperatureC int64

Mean TemperatureC int64

Min TemperatureC int64

Dew PointC int64

MeanDew PointC int64

Min DewpointC int64

Max Humidity int64

Mean Humidity int64

Min Humidity int64

Max Sea Level PressurehPa int64

Mean Sea Level PressurehPa int64

Min Sea Level PressurehPa int64

Max VisibilityKm int64

Mean VisibilityKm int64

Min VisibilitykM int64

Max Wind SpeedKm/h int64

Mean Wind SpeedKm/h int64

Max Gust SpeedKm/h float64

Precipitationmm float64

CloudCover float64

Events object

WindDirDegrees int64

dtype: object

From the above output, we can confirm that the 'WindDirDegrees' column type has been changed from object to int64 and that the 'GMT' column type has been changed from object to datetime64 .

To make queries such as ‘Return the row where the date is 4 June 2014’ you’ll need to be able to create a datetime64 value to represent June 4 2014. It cannot be:

london[london['GMT'] == '2014-1-3']

because ‘2014-1-3’ is a string and the values in the ‘GMT’ column are of type datetime64 . Instead you must create a datetime64 value using thedatetime() function like this:

datetime(2014, 6, 4)

In the function call above, the first integer argument is the year, the second the month and the third the day.

First import the `datetime()` function from the similarly named `datetime` package  by running the following line of code:

In []:

from datetime import datetime

Let’s try the function out by executing the code to ‘Return the row where the date is 4 June 2014’:

In []:

london[london['GMT'] == datetime(2014, 6, 4)]

Out[]:

Described image
Figure 8

Note that the right side of the table has been cropped to fit on the page.

You can also now make more complex queries involving dates such as ‘Return all the rows where the date is between 8 December 2014 and 12 December 2014’, like this:

In []:

c

london[(london['GMT'] >= datetime(2014, 12, 8))

& (london['GMT']

Out[]:

Described image
Figure 9

Note that the right side of the table has been cropped to fit on the page.

Exercise 4 Display rows from dataframe

Now try Exercise 4 in the Exercise notebook 2.

If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter.

Once the notebook is open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in Week 1 Exercise 1 [Tip: hold Ctrl and click a link to open it in a new tab. (Hide tip)] .