Learn to code for data analysis

1.3 Cleaning up

You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.

The expression frame[m:n], with n an integer bigger than m, represents the ‘sub-table’ from row m to row n-1, counting rows from 0. In other words, it is a slice of frame with exactly n minus m rows, provided frame has at least n rows. The expression is equivalent to the more convoluted expression frame.head(n).tail(n-m).

In []:

gdp[0:3]

Out[]:

   country                         year  NY.GDP.MKTP.CD
0  Arab World                      2013  2.843483e+12
1  Caribbean small states          2013  6.680344e+10
2  Central Europe and the Baltics  2013  1.418166e+12
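The equivalence between the slice and the head/tail expression can be checked on a small table. The sketch below assumes a made-up dataframe called frame with invented values; it is not part of the course data.

```python
import pandas as pd

# Hypothetical toy table; the names and values are made up for illustration.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'value': [10, 20, 30, 40, 50, 60],
})

m, n = 2, 5
sliced = frame[m:n]                      # rows m to n-1
convoluted = frame.head(n).tail(n - m)   # first n rows, then the last n-m of those

print(len(sliced))                 # the slice has n - m rows
print(sliced.equals(convoluted))   # both expressions select the same rows
```

Both expressions pick out rows 2, 3 and 4 here, but the slice says so more directly.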

To slice all rows from m onwards, you don’t have to count how many rows there are beforehand; just omit n.

In []:

gdp[240:]

Out[]:

     country                year  NY.GDP.MKTP.CD
240  Uzbekistan             2013  5.679566e+10
241  Vanuatu                2013  8.017876e+08
242  Venezuela, RB          2013  3.713366e+11
243  Vietnam                2013  1.712220e+11
244  Virgin Islands (U.S.)  2013  NaN
245  West Bank and Gaza     2013  1.247600e+10
246  Yemen, Rep.            2013  3.595450e+10
247  Zambia                 2013  2.682081e+10
248  Zimbabwe               2013  1.349023e+10
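Omitting n behaves as if n were the total number of rows. The following sketch, again on a made-up table rather than the real gdp dataframe, also shows that the slice keeps the original row labels, just as the output above still starts at 240.

```python
import pandas as pd

# Hypothetical toy table standing in for gdp; the values are made up.
frame = pd.DataFrame({'country': ['A', 'B', 'C', 'D', 'E', 'F']})

m = 4
rest = frame[m:]                 # all rows from position m to the end
explicit = frame[m:len(frame)]   # the same slice with n spelled out

print(rest.equals(explicit))         # the two slices are identical
print(len(rest) == len(frame) - m)   # len(frame) - m rows remain
print(list(rest.index))              # the original row labels are kept
```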

By trying out head(m) for different values of m, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.

In []:

gdp = gdp[34:]

gdp.head()

Out[]:

    country         year  NY.GDP.MKTP.CD
34  Afghanistan     2013  2.031088e+10
35  Albania         2013  1.291667e+10
36  Algeria         2013  2.101834e+11
37  American Samoa  2013  NaN
38  Andorra         2013  3.249101e+09
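Instead of trying different values of m by hand, the starting row could also be looked up by name. This is only a sketch: it assumes, as in the table above, that all the country groups come first and that Afghanistan is the first individual country. The miniature table and its values are made up.

```python
import pandas as pd

# Hypothetical miniature of the GDP table: two group rows, then countries.
frame = pd.DataFrame({
    'country': ['Arab World', 'World', 'Afghanistan', 'Albania'],
    'value': [2.8e12, 7.6e13, 2.0e10, 1.3e10],
})

# Position of the first row whose country is Afghanistan.
start = frame['country'].tolist().index('Afghanistan')

countries = frame[start:]   # slice from that row onwards
print(start)
print(countries)
```

If the first country's name is not known in advance, inspecting the rows with head(m) remains the safer approach.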

Unsurprisingly, there is missing data, so I remove those rows, as shown in Missing values in Week 4.

In []:

gdp = gdp.dropna()

gdp.head()

Out[]:

    country      year  NY.GDP.MKTP.CD
34  Afghanistan  2013  2.031088e+10
35  Albania      2013  1.291667e+10
36  Algeria      2013  2.101834e+11
38  Andorra      2013  3.249101e+09
39  Angola       2013  1.241632e+11
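Before dropping rows, it can be useful to count how many values are missing. The sketch below uses a made-up table with one missing value. It shows that dropna() returns a new dataframe and leaves gaps in the row labels, like the jump from 36 to 38 above.

```python
import pandas as pd

# Hypothetical toy table with one missing value; the data is made up.
frame = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'value': [1.5, float('nan'), 3.0],
})

missing = frame['value'].isnull().sum()   # number of missing values
cleaned = frame.dropna()                  # new dataframe without those rows

print(missing)
print(list(cleaned.index))   # the label of the dropped row is skipped
print(len(frame))            # the original dataframe is unchanged
```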

Finally, I drop the irrelevant year column.

In []:

COUNTRY = 'country'

headings = [COUNTRY, GDP_INDICATOR]

gdp = gdp[headings]

gdp.head()

Out[]:

    country      NY.GDP.MKTP.CD
34  Afghanistan  2.031088e+10
35  Albania      1.291667e+10
36  Algeria      2.101834e+11
38  Andorra      3.249101e+09
39  Angola       1.241632e+11
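Selecting the columns to keep is one way to get rid of a column; naming the column to drop is another. The sketch below compares the two on a made-up table. The column names are placeholders, and the drop() alternative is my own addition rather than something this section uses.

```python
import pandas as pd

# Hypothetical toy table; column names and values are made up.
frame = pd.DataFrame({
    'country': ['A', 'B'],
    'year': [2013, 2013],
    'value': [1.0, 2.0],
})

headings = ['country', 'value']
selected = frame[headings]               # keep only the listed columns
dropped = frame.drop(columns=['year'])   # alternative: name the column to remove

print(list(selected.columns))
print(selected.equals(dropped))   # both approaches give the same table
```

Listing the columns to keep makes the result explicit, which helps when a table has many columns and only a few are needed.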

And now I repeat the whole cleaning process for the life expectancy table.

In []:

headings = [COUNTRY, LIFE_INDICATOR]

life = life[34:].dropna()[headings]

life.head()

Out[]:

    country              SP.DYN.LE00.IN
34  Afghanistan          60.931415
35  Albania              77.537244
36  Algeria              71.009659
39  Angola               51.866171
40  Antigua and Barbuda  75.829293

Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.
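To see why the chain works, each step can be stored separately and compared with the one-line version. This is a sketch on a made-up table with one group row and one missing value; none of the names or numbers come from the course data.

```python
import pandas as pd

# Hypothetical toy table: one group row, then three countries; data is made up.
frame = pd.DataFrame({
    'country': ['Group 1', 'A', 'B', 'C'],
    'year': [2013, 2013, 2013, 2013],
    'value': [100.0, 1.0, float('nan'), 3.0],
})
headings = ['country', 'value']

step1 = frame[1:]          # row slice: a dataframe
step2 = step1.dropna()     # method call: a dataframe
step3 = step2[headings]    # column slice: a dataframe

chained = frame[1:].dropna()[headings]   # the same three steps in one line

print(type(step1).__name__, type(step2).__name__, type(step3).__name__)
print(chained.equals(step3))
```

Storing the intermediate steps is handy while exploring; once the steps are known to work, the chained form is more compact.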

Exercise 8 Cleaning up

Clean up the population data from Exercise 7 by completing Exercise 8 in exercise notebook 3.
