1.3 Cleaning up

You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.

The expression frame[m:n], with n an integer bigger than m, represents the ‘sub-table’ from row m to row n-1. In other words, it is a slice of frame with exactly n minus m rows. The expression frame[m:n] is equivalent to the more convoluted expression frame.head(n).tail(n-m).

gdp[0:3]

   country                         year  NY.GDP.MKTP.CD
0  Arab World                      2013    2.843483e+12
1  Caribbean small states          2013    6.680344e+10
2  Central Europe and the Baltics  2013    1.418166e+12
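To check how row slices behave, the following minimal sketch compares a slice with a combination of head() and tail(). The table, its column names and its values are made up for illustration and are not part of the GDP data.

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for this sketch.
frame = pd.DataFrame({'country': ['A', 'B', 'C', 'D', 'E'],
                      'value': [10, 20, 30, 40, 50]})

m, n = 1, 4

# Rows m to n-1, that is, n - m rows in total.
sliced = frame[m:n]

# The same rows: keep the first n rows, then the last n - m of those.
convoluted = frame.head(n).tail(n - m)

print(len(sliced))                # 3
print(sliced.equals(convoluted))  # True
```

The slice keeps the original row labels (1, 2 and 3 here), just as gdp[0:3] keeps the labels 0, 1 and 2.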

To slice all rows from row m onwards, you don’t have to count how many rows there are beforehand; just omit n.

gdp[240:]

     country                year  NY.GDP.MKTP.CD
240  Uzbekistan             2013    5.679566e+10
241  Vanuatu                2013    8.017876e+08
242  Venezuela, RB          2013    3.713366e+11
243  Vietnam                2013    1.712220e+11
244  Virgin Islands (U.S.)  2013             NaN
245  West Bank and Gaza     2013    1.247600e+10
246  Yemen, Rep.            2013    3.595450e+10
247  Zambia                 2013    2.682081e+10
248  Zimbabwe               2013    1.349023e+10
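As a small illustration of the open-ended slice, the sketch below uses a hypothetical ten-row table (not the GDP data) to show that leaving out the end of the slice gives the same rows as spelling out the row count.

```python
import pandas as pd

# Hypothetical ten-row table, invented for this sketch.
frame = pd.DataFrame({'value': range(10)})

# All rows from row 7 onwards, without knowing how many rows there are.
open_ended = frame[7:]

# The same rows, but this version needs the row count.
explicit = frame[7:len(frame)]

print(open_ended.equals(explicit))  # True
```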

By trying out gdp.head(m) for different values of m, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.

gdp = gdp[34:]

gdp.head()

    country         year  NY.GDP.MKTP.CD
34  Afghanistan     2013    2.031088e+10
35  Albania         2013    1.291667e+10
36  Algeria         2013    2.101834e+11
37  American Samoa  2013             NaN
38  Andorra         2013    3.249101e+09

Unsurprisingly, there is missing data, so I remove those rows, as shown in Missing values in Week 4.

gdp = gdp.dropna()

gdp.head()

    country      year  NY.GDP.MKTP.CD
34  Afghanistan  2013    2.031088e+10
35  Albania      2013    1.291667e+10
36  Algeria      2013    2.101834e+11
38  Andorra      2013    3.249101e+09
39  Angola       2013    1.241632e+11
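Note that the row labels in the output above jump from 36 to 38: dropna() removes the rows with missing values but leaves the labels of the remaining rows unchanged. A minimal sketch of this behaviour, on a made-up table rather than the GDP data:

```python
import pandas as pd

# Hypothetical table with one missing value; names and values are invented.
table = pd.DataFrame({'country': ['P', 'Q', 'R', 'S'],
                      'value': [1.5, float('nan'), 2.5, 3.5]})

# Remove every row that contains at least one missing value.
cleaned = table.dropna()

# The surviving rows keep their original labels, so label 1 is now absent.
print(list(cleaned.index))  # [0, 2, 3]
```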

Finally, I drop the irrelevant year column.

COUNTRY = 'country'

headings = [COUNTRY, GDP_INDICATOR]

gdp = gdp[headings]

gdp.head()

    country      NY.GDP.MKTP.CD
34  Afghanistan    2.031088e+10
35  Albania        1.291667e+10
36  Algeria        2.101834e+11
38  Andorra        3.249101e+09
39  Angola         1.241632e+11
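Selecting the columns to keep is one way of dropping a column; naming the column to remove is another. The sketch below, on a made-up table, suggests that both give the same result. The constant VALUE and the data are assumptions for illustration, and drop(columns=...) is an alternative that this section does not use.

```python
import pandas as pd

COUNTRY = 'country'
YEAR = 'year'
VALUE = 'value'  # hypothetical stand-in for an indicator column

# Hypothetical table, invented for this sketch.
table = pd.DataFrame({COUNTRY: ['P', 'Q'],
                      YEAR: [2013, 2013],
                      VALUE: [1.0, 2.0]})

# Keep only the listed columns, in the order given by the list.
selected = table[[COUNTRY, VALUE]]

# Alternative: name the column to discard instead.
dropped = table.drop(columns=[YEAR])

print(selected.equals(dropped))  # True
```

Listing the columns to keep makes the final layout explicit, which is convenient when the same headings list is reused for several tables.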

And now I repeat the whole cleaning process for the life expectancy table.

headings = [COUNTRY, LIFE_INDICATOR]

life = life[34:].dropna()[headings]

life.head()

    country              SP.DYN.LE00.IN
34  Afghanistan               60.931415
35  Albania                   77.537244
36  Algeria                   71.009659
39  Angola                    51.866171
40  Antigua and Barbuda       75.829293

Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.
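To see that the chained line does the same as the separate steps, here is a minimal sketch on a made-up table in which the first two rows stand for country groups. All names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table: two aggregate rows, then three countries, one with missing data.
table = pd.DataFrame({'country': ['Group 1', 'Group 2', 'P', 'Q', 'R'],
                      'year': [2013, 2013, 2013, 2013, 2013],
                      'value': [9.0, 8.0, 1.0, float('nan'), 3.0]})
headings = ['country', 'value']

# Step by step: each operation returns a new dataframe.
step = table[2:]        # row slice: skip the two aggregate rows
step = step.dropna()    # method call: remove rows with missing values
step = step[headings]   # column slice: keep only the listed columns

# The same three operations chained in one line.
chained = table[2:].dropna()[headings]

print(chained.equals(step))  # True
```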

Exercise 8 Cleaning up

Clean up the population data from Exercise 7 by completing Exercise 8 in exercise notebook 3.