1.3 Cleaning up
You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.
The expression frame[m:n], with n an integer bigger than m, represents the ‘sub-table’ from row m to row n-1. In other words, it is a slice of frame with exactly n minus m rows. The expression frame[m:n] is equivalent to the more convoluted expression frame.iloc[m:n].
gdp[0:3]
| | country | year | NY.GDP.MKTP.CD |
|---|---|---|---|
| 0 | Arab World | 2013 | 2.843483e+12 |
| 1 | Caribbean small states | 2013 | 6.680344e+10 |
| 2 | Central Europe and the Baltics | 2013 | 1.418166e+12 |
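As a minimal sketch of the slicing rule, the code below builds a small made-up dataframe and checks that frame[m:n] has n minus m rows. The name frame and its values are illustrative assumptions, not the World Bank data.

```python
import pandas as pd

# Hypothetical toy table standing in for the real GDP data.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'value': [10.0, 20.0, 30.0, 40.0, 50.0],
})

m, n = 1, 4
part = frame[m:n]          # rows m to n-1, i.e. rows 1, 2 and 3

print(len(part))           # number of rows in the slice: n minus m
print(list(part.index))    # the original row labels are kept
```

Note that the slice keeps the row labels of the original table, which is why the sliced tables in this section do not restart at 0.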
To slice all rows from m onwards, you don’t have to count how many rows there are beforehand; just omit n.
gdp[240:]
| | country | year | NY.GDP.MKTP.CD |
|---|---|---|---|
| 240 | Uzbekistan | 2013 | 5.679566e+10 |
| 241 | Vanuatu | 2013 | 8.017876e+08 |
| 242 | Venezuela, RB | 2013 | 3.713366e+11 |
| 243 | Vietnam | 2013 | 1.712220e+11 |
| 244 | Virgin Islands (U.S.) | 2013 | NaN |
| 245 | West Bank and Gaza | 2013 | 1.247600e+10 |
| 246 | Yemen, Rep. | 2013 | 3.595450e+10 |
| 247 | Zambia | 2013 | 2.682081e+10 |
| 248 | Zimbabwe | 2013 | 1.349023e+10 |
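The open-ended slice can be checked on a toy example. This sketch assumes a made-up dataframe called frame and compares the short form with the version that spells out the end of the slice.

```python
import pandas as pd

# Hypothetical toy table, used only to illustrate open-ended slices.
frame = pd.DataFrame({'country': ['A', 'B', 'C', 'D', 'E']})

m = 3
tail = frame[m:]                  # from row m to the last row

# Same result as counting the rows and writing the end explicitly.
explicit = frame[m:len(frame)]
print(tail.equals(explicit))
```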
By trying out gdp.head(m) for different values of m, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.
gdp = gdp[34:]
gdp.head()
| | country | year | NY.GDP.MKTP.CD |
|---|---|---|---|
| 34 | Afghanistan | 2013 | 2.031088e+10 |
| 35 | Albania | 2013 | 1.291667e+10 |
| 36 | Algeria | 2013 | 2.101834e+11 |
| 37 | American Samoa | 2013 | NaN |
| 38 | Andorra | 2013 | 3.249101e+09 |
Unsurprisingly, there is missing data, so I remove the rows with missing values, as shown in ‘Missing values’ in Week 4.
gdp = gdp.dropna()
gdp.head()
| | country | year | NY.GDP.MKTP.CD |
|---|---|---|---|
| 34 | Afghanistan | 2013 | 2.031088e+10 |
| 35 | Albania | 2013 | 1.291667e+10 |
| 36 | Algeria | 2013 | 2.101834e+11 |
| 38 | Andorra | 2013 | 3.249101e+09 |
| 39 | Angola | 2013 | 1.241632e+11 |
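In the table above, row 37 (American Samoa) has disappeared and the remaining rows keep their original labels. A minimal sketch of the same behaviour, using a made-up dataframe rather than the real data:

```python
import pandas as pd

# Hypothetical toy table with one missing value (None is stored as NaN).
frame = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'value': [1.5, None, 3.5],
})

cleaned = frame.dropna()       # drop every row that contains a missing value

print(list(cleaned.index))     # row label 1 is gone; labels 0 and 2 are kept
print(len(frame))              # the original dataframe is left unchanged
```

Because dropna() returns a new dataframe, the result has to be assigned back to a name, as in gdp = gdp.dropna().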
Finally, I drop the irrelevant year column.
COUNTRY = 'country'
headings = [COUNTRY, GDP_INDICATOR]
gdp = gdp[headings]
gdp.head()
| | country | NY.GDP.MKTP.CD |
|---|---|---|
| 34 | Afghanistan | 2.031088e+10 |
| 35 | Albania | 1.291667e+10 |
| 36 | Algeria | 2.101834e+11 |
| 38 | Andorra | 3.249101e+09 |
| 39 | Angola | 1.241632e+11 |
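Selecting columns with a list of headings can be tried out on a toy table. The column names below are made up for illustration; only the pattern frame[headings] matches the code above.

```python
import pandas as pd

# Hypothetical toy table; column names are illustrative assumptions.
frame = pd.DataFrame({
    'country': ['A', 'B'],
    'year': [2013, 2013],
    'value': [1.5, 2.5],
})

headings = ['country', 'value']
subset = frame[headings]        # keep only the listed columns, in that order

print(list(subset.columns))
```

Listing the columns to keep, rather than naming the one to drop, makes the final shape of the table explicit.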
And now I repeat the whole cleaning process for the life expectancy table.
headings = [COUNTRY, LIFE_INDICATOR]
life = life[34:].dropna()[headings]
life.head()
| | country | SP.DYN.LE00.IN |
|---|---|---|
| 34 | Afghanistan | 60.931415 |
| 35 | Albania | 77.537244 |
| 36 | Algeria | 71.009659 |
| 39 | Angola | 51.866171 |
| 40 | Antigua and Barbuda | 75.829293 |
Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.
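To see that the chained line does the same as the separate steps, here is a sketch on a made-up dataframe whose first two rows play the role of the country groups. All names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy table: two aggregate rows first, then individual countries.
frame = pd.DataFrame({
    'country': ['Group 1', 'Group 2', 'A', 'B', 'C'],
    'year': [2013] * 5,
    'value': [9.0, 8.0, 1.5, None, 3.5],
})

headings = ['country', 'value']

# Row slice, then dropna(), then column selection, all in one line.
chained = frame[2:].dropna()[headings]

# The same steps written out one at a time.
step = frame[2:]
step = step.dropna()
step = step[headings]

print(chained.equals(step))
```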
Exercise 8 Cleaning up
Clean up the population data from Exercise 7 by completing Exercise 8 in exercise notebook 3.
Except for third party materials and otherwise, this content is made available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Licence, full copyright detail can be found in the acknowledgements section. Please see full copyright statement for details.
