1.1 Removing rogue spaces
One of the problems often encountered with CSV files is rogue spaces before or after data values or column names.
You learned earlier, in ‘What is a CSV file?’, that each value or column name is separated by a comma. However, if you opened ‘London_2014.csv’ in a text editor, you would see that in the row of column names there is sometimes a space after a comma:
GMT,Max TemperatureC,Mean TemperatureC,Min TemperatureC,Dew PointC,MeanDew PointC,Min DewpointC,Max Humidity, Mean Humidity, Min Humidity, Max Sea Level PressurehPa, Mean Sea Level PressurehPa, Min Sea Level PressurehPa, Max VisibilityKm, Mean VisibilityKm, Min VisibilitykM, Max Wind SpeedKm/h, Mean Wind SpeedKm/h, Max Gust SpeedKm/h,Precipitationmm, CloudCover, Events,WindDirDegrees
For example, there is a space after the comma between Max Humidity and Mean Humidity. This means that when read_csv() reads the row of column names, it treats a space after a comma as part of the next column name. So the column name after 'Max Humidity' is read as ' Mean Humidity' rather than the intended 'Mean Humidity'. As a result, code such as:
london[['Mean Humidity']]
will cause a key error (see Selecting a column), as the column name is actually ' Mean Humidity', with a leading space.
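To see the problem without opening the full file, here is a minimal sketch that reads a tiny in-memory sample instead. The sample text, its values and the names sample, df, column_names and lookup_failed are assumptions made for illustration; only the header style imitates 'London_2014.csv'.

```python
from io import StringIO
from pandas import read_csv

# Made-up two-row sample; the header imitates the spacing in 'London_2014.csv'
sample = StringIO(
    "GMT,Max Humidity, Mean Humidity, Min Humidity\n"
    "2014-1-1,94,86,73\n"
    "2014-1-2,94,81,60\n"
)
df = read_csv(sample)

# repr() keeps the quotes, so a leading space in a name becomes visible
column_names = [repr(name) for name in df.columns]
print(column_names)

# Selecting the intended name fails because the stored name starts with a space
try:
    df[['Mean Humidity']]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```

Printing the column names with repr() is a quick way to check for rogue spaces in any dataframe before selecting columns.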
This can easily be rectified by adding another argument to the read_csv() function:
skipinitialspace=True
which will tell read_csv() to ignore any spaces after a comma:
In []:
london = read_csv('London_2014.csv', skipinitialspace=True)
The rogue spaces will no longer be in the dataframe and we can write code such as:
In []:
london[['Mean Humidity']].head()
Out[]:
| | Mean Humidity |
|---|---|
| 0 | 86 |
| 1 | 81 |
| 2 | 76 |
| 3 | 85 |
| 4 | 88 |
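The argument applies to data values as well as to column names. The sketch below compares the two ways of reading a small made-up sample in which a space follows every comma; the sample text and the names text, raw and clean are assumptions for illustration, not part of the London data.

```python
from io import StringIO
from pandas import read_csv

# Made-up sample: a space follows every comma, in the header and in the data
text = (
    "GMT, Mean Humidity, Events\n"
    "2014-1-1, 86, Rain\n"
    "2014-1-2, 81, Fog\n"
)

# Read the same text twice: without and with skipinitialspace
raw = read_csv(StringIO(text))
clean = read_csv(StringIO(text), skipinitialspace=True)

# Without the argument, the text values keep their leading space
print(list(raw[' Events']))

# With the argument, both the column name and the values are clean
print(list(clean['Events']))
```

Rogue spaces in text values matter for the same reason as in column names: a comparison such as clean['Events'] == 'Rain' only matches when the stored value has no extra space.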
Note that the skipinitialspace=True argument only skips spaces that come directly after a comma, so it won’t remove a trailing space at the end of a column name.
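One possible way to deal with trailing spaces, sketched here under assumptions rather than taken from this course, is to strip the column names after reading the file. The sample text and the names text, df, before and after are made up for illustration.

```python
from io import StringIO
from pandas import read_csv

# Made-up sample: 'Mean Humidity ' has a trailing space before the comma
text = (
    "GMT,Mean Humidity ,Events\n"
    "2014-1-1,86,Rain\n"
)
df = read_csv(StringIO(text), skipinitialspace=True)

# skipinitialspace leaves the trailing space in place
before = list(df.columns)
print(before)

# Strip leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()
after = list(df.columns)
print(after)
```

This only cleans the column names; trailing spaces inside text values would need to be stripped from the affected columns separately.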
Next, find out about extra characters and how to remove them.