1.2 Getting real
Having tried out the data transformations and combination on small tables, I feel confident about using the full data from the World Bank, which I pointed you to in Life expectancy project.
Open a new browser window and go to the World Bank’s data page [Tip: hold Ctrl and click a link to open it in a new tab. (Hide tip)] . Type ‘GDP’ (without the quote marks) in the ‘Find an indicator’ box in the centre of the page and select ‘GDP current US$’. Click ‘Go’. This will take you to the data page you looked at earlier. Look at the top of your browser window. You will notice the URL is http://data.worldbank.org/ indicator/ NY.GDP.MKTP.CD. Every World Bank dataset is for an indicator (in this case GDP in current dollars) with a unique name (in this case NY.GDP.MKTP.CD).
Knowing the indicator name, it’s a doddle to get the data directly into a dataframe, by using the download() function of the wb (World Bank) module, instead of first downloading a CSV or Excel file and then loading it into a dataframe. (Note that CoCalc’s free plan doesn’t allow connecting to other sites, so if you are using CoCalc you’ll need to download the data as a CSV or Excel file from the World Bank and upload it to CoCalc.)
Here’s the code to get the 2013 GDP values for all countries. It may take a little while for the code to fetch the data.
In []:
from pandas.io.wb import download
YEAR = 2013
GDP_INDICATOR = 'NY.GDP.MKTP.CD'
data = download(indicator=GDP_INDICATOR, country='all',
start=YEAR, end=YEAR)
data.head()
Out[]:
NY.GDP.MKTP.CD | ||
---|---|---|
country | year | |
Arab World | 2013 | 2.843483e+12 |
Caribbean small states | 2013 | 6.680344e+10 |
Central Europe and the Baltics | 2013 | 1.418166e+12 |
East Asia & Pacific (all income levels) | 2013 | 2.080794e+13 |
East Asia & Pacific (developing only) | 2013 | 1.168563e+13 |
This table definitely has an odd shape. The three columns don’t have their headings side by side, and the row numbering (0, 1, 2, etc) is missing. That’s because the first two ‘columns’ are in fact the dataframe index. You saw a similar table in Changing a dataframe’s index , when the index of the weather dataframe was set to be the ‘GMT’ column, with values of type datetime64 . There’s a dataframe method to do the inverse, i.e. to transform the row names into column values and thereby reinstate the default dataframe index.
In []:
gdp = data.reset_index()
gdp.head()
Out[]:
country | year | NY.GDP.MKTP.CD | |
---|---|---|---|
0 | Arab World | 2013 | 2.843483e+12 |
1 | Caribbean small states | 2013 | 6.680344e+10 |
2 | Central Europe and the Baltics | 2013 | 1.418166e+12 |
3 | East Asia & Pacific (all income levels) | 2013 | 2.080794e+13 |
4 | East Asia & Pacific (developing only) | 2013 | 1.168563e+13 |
I repeat the whole process for the life expectancy:
- search for ‘life expectancy’ on the World Bank site
- choose the ‘total’ dataset, which includes both female and male inhabitants
- note down its indicator (SP.DYN.LE00.IN)
- use it to get the data
- reset the dataframe index.
In []:
LIFE_INDICATOR = 'SP.DYN.LE00.IN'
data = download(indicator=LIFE_INDICATOR, country='all',
start=YEAR, end=YEAR)
life = data.reset_index()
life.head()
Out[]:
country | year | SP.DYN.LE00.IN | |
---|---|---|---|
0 | Arab World | 2013 | 70.086392 |
1 | Caribbean small states | 2013 | 71.966306 |
2 | Central Europe and the Baltics | 2013 | 76.127583 |
3 | East Asia & Pacific (all income levels) | 2013 | 74.893439 |
4 | East Asia & Pacific (developing only) | 2013 | 73.981255 |
By defining the year as a constant, it’s very quick to change the code to load both datasets for any other year. If you wish to get GDP data for an earlier year than for life expectancy, then you need to define a second constant.
Exercise 7 Getting real
The approach described above requires an internet connection to download the data directly from the World Bank. That may require some time, or sometimes not even work if the connection fails. Moreover, the World Bank sometimes changes its data format, which could break the code in the rest of this week.
Therefore, the Exercise notebook 3 loads instead the GDP and life expectancy data from files WB GDP 2013.csv and WB LE 2013.csv and Exercise 7 uses the file WB POP 2013.csv , which you should add to your disk folder or CoCalc project. All files are in the normal tabular format and need no resetting of the indices.