Learn to code for data analysis

1.1 Removing rogue spaces

One of the problems often encountered with CSV files is rogue spaces before or after data values or column names.

Figure 3 An image of empty, numbered parking spaces

You learned earlier, in What is a CSV file?, that each value or column name is separated by a comma. However, if you opened ‘London_2014.csv’ in a text editor, you would see that the row of column names sometimes has a space after a comma:

GMT,Max TemperatureC,Mean TemperatureC,Min TemperatureC,Dew PointC,MeanDew PointC,Min DewpointC,Max Humidity, Mean Humidity, Min Humidity, Max Sea Level PressurehPa, Mean Sea Level PressurehPa, Min Sea Level PressurehPa, Max VisibilityKm, Mean VisibilityKm, Min VisibilitykM, Max Wind SpeedKm/h, Mean Wind SpeedKm/h, Max Gust SpeedKm/h,Precipitationmm, CloudCover, Events,WindDirDegrees

For example, there is a space after the comma between Max Humidity and Mean Humidity. When read_csv() reads the row of column names, it treats a space after a comma as part of the next column name. So the column name after 'Max Humidity' is read as ' Mean Humidity' rather than the intended 'Mean Humidity'. As a result, code such as:

london[['Mean Humidity']]

will cause a key error (see Selecting a column), as the column name is actually ' Mean Humidity', with a leading space.
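The problem can be reproduced without the real file. The following is a minimal sketch that assumes a made-up two-row sample (the variable names sample and weather, and the Max Humidity values, are illustrative and not taken from ‘London_2014.csv’). It reads the sample from memory and shows that the leading space ends up in the column name:

```python
from io import StringIO
from pandas import read_csv

# Hypothetical sample that mimics the spacing in the header row;
# the data values are invented for illustration.
sample = "GMT,Max Humidity, Mean Humidity\n2014-1-1,94,86\n2014-1-2,94,81\n"

weather = read_csv(StringIO(sample))

# The space after the comma is kept as part of the column name.
column_names = list(weather.columns)
print(column_names)

# Selecting the column by its intended name therefore fails.
try:
    weather[['Mean Humidity']]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```

Printing the list of column names is a quick way to spot rogue spaces, because the quotes make a leading space visible.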

This can easily be rectified by adding another argument to the read_csv() function:

skipinitialspace=True

which will tell read_csv() to ignore any spaces after a comma:

In []:

london = read_csv('London_2014.csv', skipinitialspace=True)

The rogue spaces after the commas will no longer be in the dataframe, so we can write code such as:

In []:

london[['Mean Humidity']].head()

Out[]:

   Mean Humidity
0             86
1             81
2             76
3             85
4             88
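To see the effect of the argument in isolation, here is a small sketch that uses an invented in-memory sample rather than the real file (the names sample and weather and the Max Humidity values are assumptions for illustration). With skipinitialspace=True, the column can be selected by its intended name:

```python
from io import StringIO
from pandas import read_csv

# Hypothetical sample with a space after the second comma in the header;
# the data values are invented for illustration.
sample = "GMT,Max Humidity, Mean Humidity\n2014-1-1,94,86\n2014-1-2,94,81\n"

# skipinitialspace=True drops spaces that directly follow a comma.
weather = read_csv(StringIO(sample), skipinitialspace=True)

clean_names = list(weather.columns)
print(clean_names)

# The column is now reachable under the intended name.
mean_humidity = weather['Mean Humidity'].tolist()
print(mean_humidity)
```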

Note that the skipinitialspace=True argument won’t remove a trailing space at the end of a column name.
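The limitation can be illustrated with another invented sample in which one column name ends with a space. The sketch below is not a step from this course: it assumes the pandas string method columns.str.strip() as one possible way to tidy such names, and the names sample and weather are illustrative.

```python
from io import StringIO
from pandas import read_csv

# Hypothetical header: 'Max Humidity ' has a trailing space and
# ' Mean Humidity' has a leading space. The values are invented.
sample = "GMT,Max Humidity , Mean Humidity\n2014-1-1,94,86\n"

weather = read_csv(StringIO(sample), skipinitialspace=True)

# The leading space is gone, but the trailing space is still there.
before = list(weather.columns)
print(before)

# One possible fix (an assumption, not the course's method):
# strip whitespace from both ends of every column name.
weather.columns = weather.columns.str.strip()
after = list(weather.columns)
print(after)
```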

Next, find out about extra characters and how to remove them.