Learn to code for data analysis

1.3 Cleaning up

You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.

The expression frame[m:n], where m and n are integers with n greater than m, represents the ‘sub-table’ from row m to row n-1, counting rows from 0. In other words, provided frame has at least n rows, it is a slice of frame with exactly n minus m rows. The expression is then equivalent to the more convoluted expression frame.head(n).tail(n-m).

In []:

gdp[0:3]

Out[]:

Table _unit7.1.10
	country	year	NY.GDP.MKTP.CD
0	Arab World	2013	2.843483e+12
1	Caribbean small states	2013	6.680344e+10
2	Central Europe and the Baltics	2013	1.418166e+12
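The equivalence between the slice and the head/tail combination can be checked on a small table. The following is a minimal sketch that assumes a made-up five-row dataframe (the names frame, sliced and convoluted are illustrative, not part of the course notebook).

```python
import pandas as pd

# Hypothetical five-row table, used only to illustrate slicing.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'value': [10, 20, 30, 40, 50],
})

m, n = 1, 4
sliced = frame[m:n]                      # rows at positions 1, 2 and 3
convoluted = frame.head(n).tail(n - m)   # first n rows, then the last n - m of those

print(len(sliced))                 # number of rows in the slice, n - m
print(sliced.equals(convoluted))   # whether both expressions give the same sub-table
```

Note that the slice keeps the original row labels, which is why the tables in this section start at labels such as 240 or 34 rather than 0.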

To slice all rows from m onwards, you don’t have to count how many rows there are beforehand; just omit n.

In []:

gdp[240:]

Out[]:

Table _unit7.1.11
	country	year	NY.GDP.MKTP.CD
240	Uzbekistan	2013	5.679566e+10
241	Vanuatu	2013	8.017876e+08
242	Venezuela, RB	2013	3.713366e+11
243	Vietnam	2013	1.712220e+11
244	Virgin Islands (U.S.)	2013	NaN
245	West Bank and Gaza	2013	1.247600e+10
246	Yemen, Rep.	2013	3.595450e+10
247	Zambia	2013	2.682081e+10
248	Zimbabwe	2013	1.349023e+10
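As a quick check of the open-ended slice, the sketch below compares it with the version that spells out the row count. It again assumes a made-up five-row dataframe rather than the GDP table.

```python
import pandas as pd

# Hypothetical five-row table, used only to illustrate the open-ended slice.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'value': [10, 20, 30, 40, 50],
})

m = 3
open_ended = frame[m:]            # every row from position m to the end
explicit = frame[m:len(frame)]    # same slice with the row count spelled out

print(open_ended.equals(explicit))
print(list(open_ended['country']))
```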

By trying out head(m) for different values of m, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.

In []:

gdp = gdp[34:]

gdp.head()

Out[]:

Table _unit7.1.12
	country	year	NY.GDP.MKTP.CD
34	Afghanistan	2013	2.031088e+10
35	Albania	2013	1.291667e+10
36	Algeria	2013	2.101834e+11
37	American Samoa	2013	NaN
38	Andorra	2013	3.249101e+09
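The starting row was found above by trial and error. As an alternative, not used in the course, the position can be looked up by country name. This is a minimal sketch on a miniature table built from a few rows shown earlier; it assumes the name 'Afghanistan' occurs in the column and that the first match is the one wanted.

```python
import pandas as pd

COUNTRY = 'country'
GDP_INDICATOR = 'NY.GDP.MKTP.CD'

# Miniature table: two aggregate rows followed by two individual countries.
table = pd.DataFrame({
    COUNTRY: ['Arab World', 'Caribbean small states', 'Afghanistan', 'Albania'],
    GDP_INDICATOR: [2.843483e+12, 6.680344e+10, 2.031088e+10, 1.291667e+10],
})

# Position of the first row whose country is 'Afghanistan'.
# Caution: argmax() returns 0 when there is no match, so this relies on
# the name being present in the column.
is_first_country = (table[COUNTRY] == 'Afghanistan').to_numpy()
start = int(is_first_country.argmax())

countries = table[start:]   # drop the aggregate rows above the first country
print(start)
print(list(countries[COUNTRY]))
```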

Unsurprisingly, there is missing data, so I remove those rows, as shown in Missing values in Week 4.

In []:

gdp = gdp.dropna()

gdp.head()

Out[]:

Table _unit7.1.13
	country	year	NY.GDP.MKTP.CD
34	Afghanistan	2013	2.031088e+10
35	Albania	2013	1.291667e+10
36	Algeria	2013	2.101834e+11
38	Andorra	2013	3.249101e+09
39	Angola	2013	1.241632e+11
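The effect of dropna() on the row labels can be seen in isolation. The sketch below rebuilds the five rows shown before the call (labels 34 to 38) as a standalone dataframe; the variable names table and cleaned are illustrative.

```python
import numpy as np
import pandas as pd

COUNTRY = 'country'
GDP_INDICATOR = 'NY.GDP.MKTP.CD'

# The five rows shown before dropna(), with their original row labels.
table = pd.DataFrame({
    COUNTRY: ['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra'],
    GDP_INDICATOR: [2.031088e+10, 1.291667e+10, 2.101834e+11, np.nan, 3.249101e+09],
}, index=[34, 35, 36, 37, 38])

cleaned = table.dropna()     # removes every row that contains a missing value

print(list(cleaned.index))   # label 37 is gone; the other labels are unchanged
```

Because the remaining rows are not renumbered, the gap in the labels (36 followed by 38) is a quick visual sign that a row was dropped.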

Finally, I drop the irrelevant year column.

In []:

COUNTRY = 'country'

headings = [COUNTRY, GDP_INDICATOR]

gdp = gdp[headings]

gdp.head()

Out[]:

Table _unit7.1.14
	country	NY.GDP.MKTP.CD
34	Afghanistan	2.031088e+10
35	Albania	1.291667e+10
36	Algeria	2.101834e+11
38	Andorra	3.249101e+09
39	Angola	1.241632e+11

And now I repeat the whole cleaning process for the life expectancy table.

In []:

headings = [COUNTRY, LIFE_INDICATOR]

life = life[34:].dropna()[headings]

life.head()

Out[]:

Table _unit7.1.15
	country	SP.DYN.LE00.IN
34	Afghanistan	60.931415
35	Albania	77.537244
36	Algeria	71.009659
39	Angola	51.866171
40	Antigua and Barbuda	75.829293

Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.
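To see that the chained line does the same work as the three separate steps, both versions can be run on a small table and compared. This is a minimal sketch: the dataframe raw, its values and the column name 'indicator' are made up for illustration and stand in for the life expectancy table and LIFE_INDICATOR.

```python
import numpy as np
import pandas as pd

COUNTRY = 'country'
INDICATOR = 'indicator'   # hypothetical column name, standing in for LIFE_INDICATOR

# Hypothetical table: two aggregate rows, then three countries, one with no value.
raw = pd.DataFrame({
    COUNTRY: ['Group 1', 'Group 2', 'A', 'B', 'C'],
    'year': [2013] * 5,
    INDICATOR: [70.0, 71.0, 60.5, np.nan, 75.2],
})
headings = [COUNTRY, INDICATOR]

# Step by step: each statement takes a dataframe and produces a dataframe.
step = raw[2:]          # row slice: skip the aggregate rows
step = step.dropna()    # method call: remove rows with missing values
step = step[headings]   # column slice: keep only the relevant columns

# The same three operations chained in a single expression.
chained = raw[2:].dropna()[headings]

print(chained.equals(step))
```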

Activity _unit7.1.4 Exercise 8 Cleaning up

Clean up the population data from Exercise 7 by completing Exercise 8 in Exercise notebook 3.
