Learn to code for data analysis

1.2 Getting real

Having tried out the data transformations and combination on small tables, I feel confident about using the full data from the World Bank, which I pointed you to in Life expectancy project.

Open a new browser window and go to the World Bank’s data page. Type ‘GDP’ (without the quote marks) in the ‘Find an indicator’ box in the centre of the page and select ‘GDP current US$’. Click ‘Go’. This will take you to the data page you looked at earlier. Look at the top of your browser window. You will notice the URL is http://data.worldbank.org/indicator/NY.GDP.MKTP.CD. Every World Bank dataset is for an indicator (in this case GDP in current dollars) with a unique name (in this case NY.GDP.MKTP.CD).

Knowing the indicator name, it’s a doddle to get the data directly into a dataframe, by using the download() function of the wb (World Bank) module, instead of first downloading a CSV or Excel file and then loading it into a dataframe. (Note that CoCalc’s free plan doesn’t allow connecting to other sites, so if you are using CoCalc you’ll need to download the data as a CSV or Excel file from the World Bank and upload it to CoCalc.)

Here’s the code to get the 2013 GDP values for all countries. It may take a little while for the code to fetch the data.

In []:

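# pandas.io.wb has been removed from pandas; the separate pandas-datareader
# package provides the same download() function.
from pandas_datareader.wb import download

YEAR = 2013
GDP_INDICATOR = 'NY.GDP.MKTP.CD'
data = download(indicator=GDP_INDICATOR, country='all',
                start=YEAR, end=YEAR)
data.head()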

Out[]:

		NY.GDP.MKTP.CD
country	year
Arab World	2013	2.843483e+12
Caribbean small states	2013	6.680344e+10
Central Europe and the Baltics	2013	1.418166e+12
East Asia & Pacific (all income levels)	2013	2.080794e+13
East Asia & Pacific (developing only)	2013	1.168563e+13

This table definitely has an odd shape. The three column headings are not side by side, and the row numbering (0, 1, 2, etc.) is missing. That’s because the first two ‘columns’ are in fact the dataframe index. You saw a similar table in Changing a dataframe’s index, when the index of the weather dataframe was set to be the ‘GMT’ column, with values of type datetime64. There’s a dataframe method, reset_index(), to do the inverse, i.e. to transform the row names into column values and thereby reinstate the default dataframe index.

In []:

gdp = data.reset_index()
gdp.head()

Out[]:

	country	year	NY.GDP.MKTP.CD
0	Arab World	2013	2.843483e+12
1	Caribbean small states	2013	6.680344e+10
2	Central Europe and the Baltics	2013	1.418166e+12
3	East Asia & Pacific (all income levels)	2013	2.080794e+13
4	East Asia & Pacific (developing only)	2013	1.168563e+13
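If you cannot download the data, the effect of reset_index() can still be tried on a small hand-made table. The sketch below copies two rows from the output above into a dataframe, moves ‘country’ and ‘year’ into the index to mimic the downloaded shape, and then resets it. The names rows, small, indexed and restored are made up for this illustration.

```python
import pandas as pd

# Two rows copied from the output above; built by hand, not downloaded.
rows = [
    ('Arab World', 2013, 2.843483e+12),
    ('Caribbean small states', 2013, 6.680344e+10),
]
small = pd.DataFrame(rows, columns=['country', 'year', 'NY.GDP.MKTP.CD'])

# Mimic the downloaded table: 'country' and 'year' become the index.
indexed = small.set_index(['country', 'year'])

# Do the inverse: the index levels become ordinary columns again and
# the rows are numbered 0, 1, ... as usual.
restored = indexed.reset_index()

print(list(indexed.columns))
print(list(restored.columns))
```

Comparing the two printed lists shows that the indexed table has only one true column, while the restored one has three.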

I repeat the whole process for the life expectancy:

search for ‘life expectancy’ on the World Bank site
choose the ‘total’ dataset, which includes both female and male inhabitants
note down its indicator (SP.DYN.LE00.IN)
use it to get the data
reset the dataframe index.

In []:

LIFE_INDICATOR = 'SP.DYN.LE00.IN'
data = download(indicator=LIFE_INDICATOR, country='all',
                start=YEAR, end=YEAR)
life = data.reset_index()
life.head()

Out[]:

	country	year	SP.DYN.LE00.IN
0	Arab World	2013	70.086392
1	Caribbean small states	2013	71.966306
2	Central Europe and the Baltics	2013	76.127583
3	East Asia & Pacific (all income levels)	2013	74.893439
4	East Asia & Pacific (developing only)	2013	73.981255

By defining the year as a constant, it’s very quick to change the code to load both datasets for any other year. If you wish to get GDP data for an earlier year than for life expectancy, then you need to define a second constant.
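As a rough sketch of the two-constant idea, the code below defines one year per dataset and wraps the download-and-reset steps in a small helper. The names GDP_YEAR, LIFE_YEAR and fetch() are invented for this example, the year 2012 is an arbitrary choice, and the import assumes download() comes from the pandas-datareader package rather than from pandas itself.

```python
GDP_INDICATOR = 'NY.GDP.MKTP.CD'
LIFE_INDICATOR = 'SP.DYN.LE00.IN'

GDP_YEAR = 2012   # hypothetical: GDP from one year earlier
LIFE_YEAR = 2013  # life expectancy year used in the text


def fetch(indicator, year):
    """Download one indicator for a single year, with the default index."""
    # Imported inside the helper so that defining it needs neither the
    # package nor a network connection; calling it needs both.
    from pandas_datareader.wb import download
    data = download(indicator=indicator, country='all',
                    start=year, end=year)
    return data.reset_index()


# Uncomment to fetch the data (requires an internet connection):
# gdp = fetch(GDP_INDICATOR, GDP_YEAR)
# life = fetch(LIFE_INDICATOR, LIFE_YEAR)
```

Keeping both years as named constants means a different pairing only requires changing two lines at the top.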

Exercise 7 Getting real

The approach described above requires an internet connection to download the data directly from the World Bank. The download may take some time, or fail altogether if the connection drops. Moreover, the World Bank sometimes changes its data format, which could break the code in the rest of this week.

Therefore, Exercise notebook 3 instead loads the GDP and life expectancy data from the files WB GDP 2013.csv and WB LE 2013.csv, and Exercise 7 uses the file WB POP 2013.csv, which you should add to your disk folder or CoCalc project. All files are in the normal tabular format, so there is no need to reset the index.
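Loading such a file might look like the sketch below. So that it runs without the course files, it reads a short in-memory stand-in instead of WB GDP 2013.csv. The column layout of the stand-in is an assumption based on the reset-index table shown earlier, and the two rows are copied from that table.

```python
import io

import pandas as pd

# Stand-in for the contents of 'WB GDP 2013.csv' (assumed column layout).
SAMPLE_CSV = """country,year,NY.GDP.MKTP.CD
Arab World,2013,2.843483e+12
Caribbean small states,2013,6.680344e+10
"""

# read_csv() accepts a file name or a file-like object such as StringIO.
gdp = pd.read_csv(io.StringIO(SAMPLE_CSV))

# With the real file in the notebook's folder, the call would be:
# gdp = pd.read_csv('WB GDP 2013.csv')

print(gdp.head())
```

The result already has the default row numbering, so no call to reset_index() is needed.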
