Learn to code for data analysis

1.3 Cleaning up

You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.

The expression frame[m:n], where m and n are non-negative integers with n bigger than m, represents the ‘sub-table’ from row m to row n-1, counting rows from 0. In other words, it is a slice of frame with exactly n minus m rows, provided frame has at least n rows. In that case the expression is equivalent to the more convoluted expression frame.head(n).tail(n-m).

In []:

gdp[0:3]

Out[]:

   country                         year  NY.GDP.MKTP.CD
0  Arab World                      2013  2.843483e+12
1  Caribbean small states          2013  6.680344e+10
2  Central Europe and the Baltics  2013  1.418166e+12
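
The equivalence between the slice and the head/tail expression can be checked on a small table. The following is a minimal sketch: the dataframe is a toy one built from a few rows shown in this section (so its row numbers differ from the real table), and the names by_slice and by_head_tail are only illustrative.

```python
import pandas as pd

# Toy table: a few rows copied from the outputs in this section.
frame = pd.DataFrame({
    'country': ['Arab World', 'Caribbean small states',
                'Central Europe and the Baltics', 'Afghanistan', 'Albania'],
    'NY.GDP.MKTP.CD': [2.843483e+12, 6.680344e+10, 1.418166e+12,
                       2.031088e+10, 1.291667e+10],
})

m, n = 1, 4
by_slice = frame[m:n]                     # rows m to n-1
by_head_tail = frame.head(n).tail(n - m)  # first n rows, then the last n-m of those

print(len(by_slice))                  # the slice has n - m rows
print(by_slice.equals(by_head_tail))  # both expressions select the same rows
```

The check relies on the toy table having at least n rows; with fewer rows the two expressions can select different rows.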

To slice all rows from m onwards, you don’t have to count how many rows there are beforehand: just omit n.

In []:

gdp[240:]

Out[]:

     country                year  NY.GDP.MKTP.CD
240  Uzbekistan             2013  5.679566e+10
241  Vanuatu                2013  8.017876e+08
242  Venezuela, RB          2013  3.713366e+11
243  Vietnam                2013  1.712220e+11
244  Virgin Islands (U.S.)  2013  NaN
245  West Bank and Gaza     2013  1.247600e+10
246  Yemen, Rep.            2013  3.595450e+10
247  Zambia                 2013  2.682081e+10
248  Zimbabwe               2013  1.349023e+10
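
As a small illustration of the open-ended slice, the sketch below compares frame[m:] with a slice whose end is spelled out using len(frame). The table is a toy one with a handful of country names from the output above; the variable names are only illustrative.

```python
import pandas as pd

# Toy table with a few country names from the output above.
frame = pd.DataFrame({
    'country': ['Uzbekistan', 'Vanuatu', 'Vietnam', 'Zambia', 'Zimbabwe'],
})

m = 2
open_ended = frame[m:]          # from row m to the end, no row count needed
explicit = frame[m:len(frame)]  # same rows, but the row count must be looked up

print(open_ended.equals(explicit))
print(list(open_ended['country']))
```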

By trying out head(m) for different values of m, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.

In []:

gdp = gdp[34:]

gdp.head()

Out[]:

    country         year  NY.GDP.MKTP.CD
34  Afghanistan     2013  2.031088e+10
35  Albania         2013  1.291667e+10
36  Algeria         2013  2.101834e+11
37  American Samoa  2013  NaN
38  Andorra         2013  3.249101e+09
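
The starting row was found above by inspection. Once the name of the first individual country is known, its row position could also be looked up in code. This is only a sketch of one possible alternative, not the approach used in this course: the table is a toy one (so the position is not 34), and first_country, start and countries_only are illustrative names.

```python
import pandas as pd

# Toy table: two groups of countries followed by individual countries,
# with values copied from the outputs in this section.
frame = pd.DataFrame({
    'country': ['Arab World', 'Caribbean small states',
                'Afghanistan', 'Albania', 'Algeria'],
    'NY.GDP.MKTP.CD': [2.843483e+12, 6.680344e+10,
                       2.031088e+10, 1.291667e+10, 2.101834e+11],
})

first_country = 'Afghanistan'  # assumption: already known from inspecting the table

# Position of the first matching name; raises ValueError if the name is absent.
start = frame['country'].tolist().index(first_country)

countries_only = frame[start:]
print(start)
print(countries_only)
```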

Unsurprisingly, there is missing data, so I remove those rows, as shown in Missing values in Week 4.

In []:

gdp = gdp.dropna()

gdp.head()

Out[]:

    country      year  NY.GDP.MKTP.CD
34  Afghanistan  2013  2.031088e+10
35  Albania      2013  1.291667e+10
36  Algeria      2013  2.101834e+11
38  Andorra      2013  3.249101e+09
39  Angola       2013  1.241632e+11
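
Note that row 37 has disappeared from the output while the other row numbers are unchanged. The sketch below shows the same behaviour on a toy table built from five of the rows above (renumbered from 0): dropna() returns a new dataframe without the rows containing NaN and keeps the original row labels.

```python
import pandas as pd

# Toy table: five rows copied from the outputs above, one with a missing value.
frame = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    'NY.GDP.MKTP.CD': [2.031088e+10, 1.291667e+10, 2.101834e+11,
                       float('nan'), 3.249101e+09],
})

cleaned = frame.dropna()  # new dataframe; frame itself is left unchanged

print(list(cleaned['country']))
print(list(cleaned.index))  # row labels are kept, leaving a gap at the dropped row
print(len(frame), len(cleaned))
```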

Finally, I drop the irrelevant year column.

In []:

COUNTRY = 'country'

headings = [COUNTRY, GDP_INDICATOR]

gdp = gdp[headings]

gdp.head()

Out[]:

    country      NY.GDP.MKTP.CD
34  Afghanistan  2.031088e+10
35  Albania      1.291667e+10
36  Algeria      2.101834e+11
38  Andorra      3.249101e+09
39  Angola       1.241632e+11
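
Selecting a list of headings keeps only those columns, which is why the year column is gone. Here is a self-contained sketch on a two-row toy table. The value given to GDP_INDICATOR is an assumption taken from the column heading in the outputs, and the comparison with drop(columns=...) is just another way of getting the same result, not the method used here.

```python
import pandas as pd

COUNTRY = 'country'
GDP_INDICATOR = 'NY.GDP.MKTP.CD'  # assumed value, taken from the column heading in the outputs

# Toy table: two rows copied from the outputs in this section.
frame = pd.DataFrame({
    COUNTRY: ['Afghanistan', 'Albania'],
    'year': [2013, 2013],
    GDP_INDICATOR: [2.031088e+10, 1.291667e+10],
})

headings = [COUNTRY, GDP_INDICATOR]
selected = frame[headings]              # keep only the listed columns, in the listed order
dropped = frame.drop(columns=['year'])  # alternative: name the column to remove

print(list(selected.columns))
print(selected.equals(dropped))
```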

And now I repeat the whole cleaning process for the life expectancy table.

In []:

headings = [COUNTRY, LIFE_INDICATOR]

life = life[34:].dropna()[headings]

life.head()

Out[]:

    country              SP.DYN.LE00.IN
34  Afghanistan          60.931415
35  Albania              77.537244
36  Algeria              71.009659
39  Angola               51.866171
40  Antigua and Barbuda  75.829293

Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.
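
To see why the chain works, the sketch below runs the three steps one at a time and then as a single expression, and checks that both give the same table. It uses a toy GDP table with rows copied from this section rather than the life expectancy data; the names table, step1, step2, step3 and chained are illustrative, and the value of GDP_INDICATOR is assumed from the column heading in the outputs.

```python
import pandas as pd

COUNTRY = 'country'
GDP_INDICATOR = 'NY.GDP.MKTP.CD'  # assumed value, taken from the column heading in the outputs

# Toy table: one group of countries, then individual countries, one with missing data.
table = pd.DataFrame({
    COUNTRY: ['Arab World', 'Afghanistan', 'Albania', 'American Samoa', 'Andorra'],
    'year': [2013, 2013, 2013, 2013, 2013],
    GDP_INDICATOR: [2.843483e+12, 2.031088e+10, 1.291667e+10,
                    float('nan'), 3.249101e+09],
})

start = 1  # first individual country in this toy table
headings = [COUNTRY, GDP_INDICATOR]

# One step at a time: each result is itself a dataframe.
step1 = table[start:]     # row slice
step2 = step1.dropna()    # method call
step3 = step2[headings]   # column selection

# The same three steps chained in a single expression.
chained = table[start:].dropna()[headings]

print(type(step1).__name__, type(step2).__name__, type(step3).__name__)
print(chained.equals(step3))
```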

Exercise 8 Cleaning up

In Exercise 8 of exercise notebook 3, clean up the population data from Exercise 7.