1.3 Cleaning up
You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.
The expression frame[m:n], with n an integer bigger than m, represents the ‘sub-table’ from row m to row n-1, where rows are numbered from 0. In other words, it is a slice of frame with exactly n minus m rows. The expression is equivalent to the more convoluted expression frame.head(n).tail(n-m).
In []:
gdp[0:3]
Out[]:
| | country | year | NY.GDP.MKTP.CD |
---|---|---|---|
0 | Arab World | 2013 | 2.843483e+12 |
1 | Caribbean small states | 2013 | 6.680344e+10 |
2 | Central Europe and the Baltics | 2013 | 1.418166e+12 |
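The equivalence between the slice and the head/tail expression can be checked on a small table. The following is a minimal sketch: the dataframe `frame` and its values are made up for illustration and are not part of the GDP data.

```python
import pandas as pd

# Toy table standing in for a real dataframe; the values are invented.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'value': [10, 20, 30, 40, 50, 60],
})

m, n = 1, 4
sliced = frame[m:n]                      # rows m to n-1
convoluted = frame.head(n).tail(n - m)   # first n rows, then the last n-m of those

print(len(sliced))                 # the slice has n - m rows
print(sliced.equals(convoluted))   # both expressions select the same rows
```

Note that this holds as long as the table has at least n rows, which is the case in the examples of this section.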
To slice all rows from m onwards, you don’t have to count how many rows there are beforehand; just omit n.
In []:
gdp[240:]
Out[]:
| | country | year | NY.GDP.MKTP.CD |
---|---|---|---|
240 | Uzbekistan | 2013 | 5.679566e+10 |
241 | Vanuatu | 2013 | 8.017876e+08 |
242 | Venezuela, RB | 2013 | 3.713366e+11 |
243 | Vietnam | 2013 | 1.712220e+11 |
244 | Virgin Islands (U.S.) | 2013 | NaN |
245 | West Bank and Gaza | 2013 | 1.247600e+10 |
246 | Yemen, Rep. | 2013 | 3.595450e+10 |
247 | Zambia | 2013 | 2.682081e+10 |
248 | Zimbabwe | 2013 | 1.349023e+10 |
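As a small illustration of open-ended slices, the sketch below uses an invented five-row table. It also shows that the start of a slice can be omitted in the same way, which the text above does not cover.

```python
import pandas as pd

# Invented five-row table; only the row positions matter here.
frame = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

tail_rows = frame[3:]   # from row 3 to the end, whatever the table length
head_rows = frame[:2]   # from the start up to row 1, like frame.head(2)

print(list(tail_rows.index))
print(head_rows.equals(frame.head(2)))
```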
By trying out head(m) for different values of m, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.
In []:
gdp = gdp[34:]
gdp.head()
Out[]:
| | country | year | NY.GDP.MKTP.CD |
---|---|---|---|
34 | Afghanistan | 2013 | 2.031088e+10 |
35 | Albania | 2013 | 1.291667e+10 |
36 | Algeria | 2013 | 2.101834e+11 |
37 | American Samoa | 2013 | NaN |
38 | Andorra | 2013 | 3.249101e+09 |
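Instead of trial and error with head(m), the starting row could also be looked up with a comparison. This is only a sketch under two assumptions: the table `toy` is invented, and the row labels are the default 0, 1, 2, ... so that a label can be used directly as a slice position.

```python
import pandas as pd

# Made-up miniature of the table: two aggregate rows, then countries.
toy = pd.DataFrame({
    'country': ['Arab World', 'World', 'Afghanistan', 'Albania'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Select the row(s) whose country is 'Afghanistan' and take the first
# row label of that selection; with default labels it is also the position.
first = int(toy[toy['country'] == 'Afghanistan'].index[0])

countries = toy[first:]   # everything from Afghanistan onwards
print(first)
print(countries)
```

This relies on the aggregate rows all coming before the first individual country, as they do in the World Bank tables used here.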
Unsurprisingly, there is missing data, so I remove those rows, as shown in Missing values in Week 4.
In []:
gdp = gdp.dropna()
gdp.head()
Out[]:
| | country | year | NY.GDP.MKTP.CD |
---|---|---|---|
34 | Afghanistan | 2013 | 2.031088e+10 |
35 | Albania | 2013 | 1.291667e+10 |
36 | Algeria | 2013 | 2.101834e+11 |
38 | Andorra | 2013 | 3.249101e+09 |
39 | Angola | 2013 | 1.241632e+11 |
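The output above jumps from row 36 to row 38 because dropna() removes rows but keeps the labels of the remaining ones. A minimal sketch on an invented three-row table shows the same effect:

```python
import numpy as np
import pandas as pd

# Invented values; the middle row has a missing measurement.
toy = pd.DataFrame({
    'country': ['Afghanistan', 'American Samoa', 'Andorra'],
    'value': [2.0e10, np.nan, 3.2e9],
})

cleaned = toy.dropna()   # returns a new dataframe; toy itself is unchanged

print(list(cleaned.index))   # original row labels are kept, leaving a gap
print(len(toy), len(cleaned))
```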
Finally, I drop the irrelevant year column.
In []:
COUNTRY = 'country'
headings = [COUNTRY, GDP_INDICATOR]
gdp = gdp[headings]
gdp.head()
Out[]:
| | country | NY.GDP.MKTP.CD |
---|---|---|
34 | Afghanistan | 2.031088e+10 |
35 | Albania | 1.291667e+10 |
36 | Algeria | 2.101834e+11 |
38 | Andorra | 3.249101e+09 |
39 | Angola | 1.241632e+11 |
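Selecting the columns to keep is one way to drop a column; naming the column to remove is another. The sketch below compares the two on an invented table, where `VALUE` is a hypothetical stand-in for GDP_INDICATOR.

```python
import pandas as pd

COUNTRY = 'country'
VALUE = 'value'   # hypothetical stand-in for GDP_INDICATOR

# Invented two-row table with the same three kinds of column.
toy = pd.DataFrame({
    COUNTRY: ['Afghanistan', 'Albania'],
    'year': [2013, 2013],
    VALUE: [2.0e10, 1.3e10],
})

kept = toy[[COUNTRY, VALUE]]            # keep the listed columns, in that order
dropped = toy.drop(columns=['year'])    # alternative: name the column to remove

print(list(kept.columns))
print(kept.equals(dropped))
```

Listing the columns to keep has the advantage that the resulting order of the columns is explicit.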
And now I repeat the whole cleaning process for the life expectancy table.
In []:
headings = [COUNTRY, LIFE_INDICATOR]
life = life[34:].dropna()[headings]
life.head()
Out[]:
| | country | SP.DYN.LE00.IN |
---|---|---|
34 | Afghanistan | 60.931415 |
35 | Albania | 77.537244 |
36 | Algeria | 71.009659 |
39 | Angola | 51.866171 |
40 | Antigua and Barbuda | 75.829293 |
Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.
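To see why the chain works, the three operations can be written out one at a time and compared with the single-line version. This is a sketch on an invented table, not the life expectancy data itself.

```python
import numpy as np
import pandas as pd

# Invented table: one aggregate row, then three countries, one with no value.
toy = pd.DataFrame({
    'country': ['World', 'Afghanistan', 'American Samoa', 'Angola'],
    'year': [2013, 2013, 2013, 2013],
    'value': [71.0, 60.9, np.nan, 51.9],
})
headings = ['country', 'value']

# Step by step, with a named intermediate dataframe at each stage ...
rows = toy[1:]                 # row slice
complete = rows.dropna()       # method call
stepwise = complete[headings]  # column slice

# ... and the same three operations chained in one expression.
chained = toy[1:].dropna()[headings]

print(type(rows).__name__, type(complete).__name__, type(stepwise).__name__)
print(stepwise.equals(chained))
```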
Exercise 8 Cleaning up
Clean up the population data from Exercise 7 by completing Exercise 8 in exercise notebook 3.