Learn to code for data analysis

1.3 Cleaning up

You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.

The expression frame[m:n], with n an integer bigger than m, represents the ‘sub-table’ from row m to row n-1, counting rows from zero. In other words, it is a slice of frame with exactly n minus m rows, provided frame has at least n rows. The expression is equivalent to the more convoluted expression frame.head(n).tail(n-m).

In []:

gdp[0:3]

Out[]:

Table _unit7.1.10
   country                          year  NY.GDP.MKTP.CD
0  Arab World                       2013  2.843483e+12
1  Caribbean small states           2013  6.680344e+10
2  Central Europe and the Baltics   2013  1.418166e+12
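
The equivalence between the slice and the head/tail expression can be checked on a small table. This is only a sketch: the dataframe below is a made-up stand-in for gdp, and the values of m and n are arbitrary.

```python
import pandas as pd

# Hypothetical toy table standing in for the course's gdp dataframe.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'value': [10, 20, 30, 40, 50],
})

m, n = 1, 4  # arbitrary positions, with n bigger than m

# Rows m to n-1, selected by position.
sliced = frame[m:n]

# The same rows obtained the long way round.
convoluted = frame.head(n).tail(n - m)

print(len(sliced) == n - m)       # the slice has n minus m rows
print(sliced.equals(convoluted))  # both expressions select the same rows
```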

To slice all rows from m onwards, you don’t have to count how many rows there are beforehand; just omit n.

In []:

gdp[240:]

Out[]:

Table _unit7.1.11
     country                year  NY.GDP.MKTP.CD
240  Uzbekistan             2013  5.679566e+10
241  Vanuatu                2013  8.017876e+08
242  Venezuela, RB          2013  3.713366e+11
243  Vietnam                2013  1.712220e+11
244  Virgin Islands (U.S.)  2013  NaN
245  West Bank and Gaza     2013  1.247600e+10
246  Yemen, Rep.            2013  3.595450e+10
247  Zambia                 2013  2.682081e+10
248  Zimbabwe               2013  1.349023e+10
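
As a small illustration of open-ended slices, the sketch below uses a made-up six-row table. It also shows the mirror case, omitting m to slice from the first row, which follows the same Python slicing rules but is not used in this section.

```python
import pandas as pd

# Hypothetical toy table; the names and values are placeholders.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'value': [0, 1, 2, 3, 4, 5],
})

# Omitting n: all rows from position 4 to the end.
last_rows = frame[4:]

# Omitting m: all rows from the start up to, but excluding, position 2.
first_rows = frame[:2]

print(last_rows['country'].tolist())   # ['E', 'F']
print(first_rows['country'].tolist())  # ['A', 'B']
```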

By trying out head(m) for different values of m, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.

In []:

gdp = gdp[34:]

gdp.head()

Out[]:

Table _unit7.1.12
    country         year  NY.GDP.MKTP.CD
34  Afghanistan     2013  2.031088e+10
35  Albania         2013  1.291667e+10
36  Algeria         2013  2.101834e+11
37  American Samoa  2013  NaN
38  Andorra         2013  3.249101e+09
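
Note that the sliced dataframe keeps the original row labels, which is why the table above starts at 34 rather than 0. The sketch below illustrates this on a made-up table. Looking up the position of the first country by name is an alternative to trial and error with head(m), not the approach taken in this course, and reset_index is shown only as an option for renumbering the rows.

```python
import pandas as pd

# Hypothetical toy table: two aggregate rows followed by two countries.
toy = pd.DataFrame({
    'country': ['Group X', 'Group Y', 'Afghanistan', 'Albania'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Find the position of the first individual country by name.
start = toy['country'].tolist().index('Afghanistan')

countries = toy[start:]

# The row labels are kept from the original table.
print(list(countries.index))  # [2, 3]

# Slicing still counts positions from zero, whatever the labels are.
first = countries[0:1]
print(first['country'].tolist())  # ['Afghanistan']

# Optionally renumber the rows from zero, discarding the old labels.
renumbered = countries.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```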

Unsurprisingly, some data is missing, so I remove the rows with missing values, as shown in Missing values in Week 4.

In []:

gdp = gdp.dropna()

gdp.head()

Out[]:

Table _unit7.1.13
    country      year  NY.GDP.MKTP.CD
34  Afghanistan  2013  2.031088e+10
35  Albania      2013  1.291667e+10
36  Algeria      2013  2.101834e+11
38  Andorra      2013  3.249101e+09
39  Angola       2013  1.241632e+11
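
The effect of dropna() can be reproduced on the five rows shown earlier. In this sketch the rows are typed in by hand rather than loaded from the World Bank file, and the variable name sample is my own.

```python
import pandas as pd

# The first five country rows from the table above, with their row labels.
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'year': [2013, 2013, 2013, 2013, 2013],
    'NY.GDP.MKTP.CD': [2.031088e+10, 1.291667e+10, 2.101834e+11,
                       float('nan'), 3.249101e+09],
}, index=[34, 35, 36, 37, 38])

# dropna() returns a new dataframe without the rows that contain a missing value.
cleaned = sample.dropna()

# Row 37 (American Samoa) is gone and the remaining labels are unchanged.
print(list(cleaned.index))  # [34, 35, 36, 38]

# The original dataframe is left as it was.
print(len(sample))  # 5
```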

Finally, I drop the irrelevant year column.

In []:

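COUNTRY = 'country'

# GDP_INDICATOR holds the name of the GDP column and must already be defined.
headings = [COUNTRY, GDP_INDICATOR]

# Keep only the columns listed in headings, which drops the year column.
gdp = gdp[headings]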

gdp.head()

Out[]:

Table _unit7.1.14
    country      NY.GDP.MKTP.CD
34  Afghanistan  2.031088e+10
35  Albania      1.291667e+10
36  Algeria      2.101834e+11
38  Andorra      3.249101e+09
39  Angola       1.241632e+11
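
Selecting the columns to keep is one way of dropping a column. The sketch below compares it with the drop() method, which names the column to remove instead. It assumes GDP_INDICATOR is the column name 'NY.GDP.MKTP.CD' seen in the tables, and it uses two hand-typed rows rather than the full dataset.

```python
import pandas as pd

COUNTRY = 'country'
GDP_INDICATOR = 'NY.GDP.MKTP.CD'  # assumed value, taken from the table headings

# Two rows copied from the table above.
sample = pd.DataFrame({
    COUNTRY: ['Afghanistan', 'Albania'],
    'year': [2013, 2013],
    GDP_INDICATOR: [2.031088e+10, 1.291667e+10],
}, index=[34, 35])

# Keep the listed columns, as done in this section.
kept = sample[[COUNTRY, GDP_INDICATOR]]

# Alternative: name the column to remove.
dropped = sample.drop(columns=['year'])

print(list(kept.columns))    # ['country', 'NY.GDP.MKTP.CD']
print(kept.equals(dropped))  # True
```

Listing the columns to keep also fixes their order, which can be handy when tables from different sources are later combined.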

And now I repeat the whole cleaning process for the life expectancy table.

In []:

headings = [COUNTRY, LIFE_INDICATOR]

life = life[34:].dropna()[headings]

life.head()

Out[]:

Table _unit7.1.15
    country              SP.DYN.LE00.IN
34  Afghanistan          60.931415
35  Albania              77.537244
36  Algeria              71.009659
39  Angola               51.866171
40  Antigua and Barbuda  75.829293

Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.
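
Since the same chain is applied to each table, it could be wrapped in a small function. The sketch below is one possible way to do so. The function name clean, its parameters and the toy table are my own inventions, not part of the course code.

```python
import pandas as pd

def clean(frame, first_country_row, headings):
    """Return frame without its leading aggregate rows, without rows
    that have missing values, and with only the columns in headings."""
    return frame[first_country_row:].dropna()[headings]

# Hypothetical toy table: one aggregate row, then three countries,
# one of which has a missing value.
toy = pd.DataFrame({
    'country': ['Group X', 'Country A', 'Country B', 'Country C'],
    'year': [2013, 2013, 2013, 2013],
    'indicator': [1.0, 2.0, float('nan'), 4.0],
})

cleaned = clean(toy, 1, ['country', 'indicator'])

print(cleaned['country'].tolist())  # ['Country A', 'Country C']
print(list(cleaned.columns))        # ['country', 'indicator']
```

Each step in the chain returns a new dataframe, so toy itself is not modified.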

Activity _unit7.1.4 Exercise 8 Cleaning up

Clean up the population data from Exercise 7 by completing Exercise 8 in exercise notebook 3.