Skip to content
Skip to main content

About this free course

Download this course

Share this free course

Learn to code for data analysis
Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

1.2 Getting real

Having tried out the data transformations and combination on small tables, I feel confident about using the full data from the World Bank, which I pointed you to in Life expectancy project.

Open a new browser window and go to the World Bank’s data page [Tip: hold Ctrl and click a link to open it in a new tab. (Hide tip)] . Type ‘GDP’ (without the quote marks) in the ‘Find an indicator’ box in the centre of the page and select ‘GDP current US$’. Click ‘Go’. This will take you to the data page you looked at earlier. Look at the top of your browser window. You will notice the URL is http://data.worldbank.org/ indicator/ NY.GDP.MKTP.CD. Every World Bank dataset is for an indicator (in this case GDP in current dollars) with a unique name (in this case NY.GDP.MKTP.CD).

Knowing the indicator name, it’s a doddle to get the data directly into a dataframe, by using the download() function of the wb (World Bank) module, instead of first downloading a CSV or Excel file and then loading it into a dataframe. (Note that CoCalc’s free plan doesn’t allow connecting to other sites, so if you are using CoCalc you’ll need to download the data as a CSV or Excel file from the World Bank and upload it to CoCalc.)

Here’s the code to get the 2013 GDP values for all countries. It may take a little while for the code to fetch the data.

In []:

from pandas.io.wb import download

YEAR = 2013

GDP_INDICATOR = 'NY.GDP.MKTP.CD'

data = download(indicator=GDP_INDICATOR, country='all',

start=YEAR, end=YEAR)

data.head()

Out[]:

NY.GDP.MKTP.CD
countryyear
Arab World20132.843483e+12
Caribbean small states20136.680344e+10
Central Europe and the Baltics20131.418166e+12
East Asia & Pacific (all income levels)20132.080794e+13
East Asia & Pacific (developing only)20131.168563e+13

This table definitely has an odd shape. The three columns don’t have their headings side by side, and the row numbering (0, 1, 2, etc) is missing. That’s because the first two ‘columns’ are in fact the dataframe index. You saw a similar table in Changing a dataframe’s index , when the index of the weather dataframe was set to be the ‘GMT’ column, with values of type datetime64 . There’s a dataframe method to do the inverse, i.e. to transform the row names into column values and thereby reinstate the default dataframe index.

In []:

gdp = data.reset_index()

gdp.head()

Out[]:

countryyearNY.GDP.MKTP.CD
0Arab World20132.843483e+12
1Caribbean small states20136.680344e+10
2Central Europe and the Baltics20131.418166e+12
3East Asia & Pacific (all income levels)20132.080794e+13
4East Asia & Pacific (developing only)20131.168563e+13

I repeat the whole process for the life expectancy:

  • search for ‘life expectancy’ on the World Bank site
  • choose the ‘total’ dataset, which includes both female and male inhabitants
  • note down its indicator (SP.DYN.LE00.IN)
  • use it to get the data
  • reset the dataframe index.

In []:

LIFE_INDICATOR = 'SP.DYN.LE00.IN'

data = download(indicator=LIFE_INDICATOR, country='all',

start=YEAR, end=YEAR)

life = data.reset_index()

life.head()

Out[]:

countryyearSP.DYN.LE00.IN
0Arab World201370.086392
1Caribbean small states201371.966306
2Central Europe and the Baltics201376.127583
3East Asia & Pacific (all income levels)201374.893439
4East Asia & Pacific (developing only)201373.981255

By defining the year as a constant, it’s very quick to change the code to load both datasets for any other year. If you wish to get GDP data for an earlier year than for life expectancy, then you need to define a second constant.

Exercise 7 Getting real

The approach described above requires an internet connection to download the data directly from the World Bank. That may require some time, or sometimes not even work if the connection fails. Moreover, the World Bank sometimes changes its data format, which could break the code in the rest of this week.

Therefore, the Exercise notebook 3 loads instead the GDP and life expectancy data from files WB GDP 2013.csv and WB LE 2013.csv and Exercise 7 uses the file WB POP 2013.csv , which you should add to your disk folder or CoCalc project. All files are in the normal tabular format and need no resetting of the indices.