1.1 Creating the data
I won’t yet work with the full data. Instead I will create small tables, to better illustrate this week’s concepts and techniques.
Small tables make it easier to see what is going on and to create specific data combination and transformation scenarios that test the code.
There are many ways of creating tables in pandas. One of the simplest is to define the rows as a list, with the first element of the list being the first row, the second element being the second row, etc.
Each row of a table has multiple cells, one for each column. The obvious way is to represent each row as a list too, the first element of the list being the cell in the first column, the second element corresponding to the second column, etc. To sum up, the table is represented as a list of lists.
Here is a table of the 2013 GDP of some countries, in US dollars:
In []:
table = [
    ['UK', 2678454886796.7],         # 1st row
    ['USA', 16768100000000.0],       # 2nd row
    ['China', 9240270452047.0],      # and so on...
    ['Brazil', 2245673032353.8],
    ['South Africa', 366057913367.1]
]
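To check how a list of lists maps onto rows and cells, individual elements can be picked out by position. The following is a small sketch using plain Python indexing on the same table; the names `first_row` and `uk_gdp` are mine, chosen only for illustration.

```python
# The same table as above: a list of rows, each row a list of cells.
table = [
    ['UK', 2678454886796.7],
    ['USA', 16768100000000.0],
    ['China', 9240270452047.0],
    ['Brazil', 2245673032353.8],
    ['South Africa', 366057913367.1]
]

first_row = table[0]     # Python counts from 0, so this is the 1st row
uk_gdp = table[0][1]     # 1st row, 2nd cell (the GDP column)

print(first_row)
print(uk_gdp)
print(len(table), 'rows,', len(table[0]), 'cells per row')
```

Note that positions start at 0, not 1, which is also why the dataframe's row labels will start at 0.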
To create a dataframe, I use a pandas function appropriately called DataFrame(). I give it two arguments: the names of the columns and the data itself. The column names are given as a list of strings, the first string being the first column name, etc.
In []:
from pandas import DataFrame

headings = ['Country', 'GDP (US$)']
gdp = DataFrame(columns=headings, data=table)
gdp
Out[]:
 | Country | GDP (US$) |
---|---|---|
0 | UK | 2.678455e+12 |
1 | USA | 1.676810e+13 |
2 | China | 9.240270e+12 |
3 | Brazil | 2.245673e+12 |
4 | South Africa | 3.660579e+11 |
Note that pandas shows large numbers in scientific notation, where, for example, 3e+12 means 3×10¹², i.e. a 3 followed by 12 zeros.
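To make the notation concrete, here is a minimal sketch in plain Python (no pandas needed) that compares the two ways of writing the same number. It uses Python's built-in `format()` function; the variable names are my own.

```python
uk_gdp = 2678454886796.7

# 3e+12 is just another way of writing a 3 followed by 12 zeros.
print(3e+12 == 3 * 10**12)

# The same value in scientific notation with 6 decimal places,
# and in ordinary notation with thousands separators.
scientific = format(uk_gdp, '.6e')
plain = format(uk_gdp, ',.1f')

print(scientific)
print(plain)
```

The scientific form is shorter but rounded; the underlying value stored in the dataframe is unchanged.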
I define a similar table for the life expectancy, based on the 2013 World Bank data.
In []:
headings = ['Country name', 'Life expectancy (years)']
table = [
    ['China', 75],
    ['Russia', 71],
    ['United States', 79],
    ['India', 66],
    ['United Kingdom', 81]
]
life = DataFrame(columns=headings, data=table)
life
Out[]:
 | Country name | Life expectancy (years) |
---|---|---|
0 | China | 75 |
1 | Russia | 71 |
2 | United States | 79 |
3 | India | 66 |
4 | United Kingdom | 81 |
To illustrate potential issues when combining multiple datasets, I’ve taken a different set of countries, with common countries in a different order. Moreover, to illustrate a non-numeric conversion, I’ve abbreviated country names in one table but not the other.
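The mismatch can be seen by comparing the country names in the two tables directly. The following is a rough sketch, assuming the two dataframes are built exactly as above; it uses Python sets, and the names `gdp_names`, `life_names` and `common` are introduced here only for illustration.

```python
from pandas import DataFrame

gdp = DataFrame(columns=['Country', 'GDP (US$)'], data=[
    ['UK', 2678454886796.7],
    ['USA', 16768100000000.0],
    ['China', 9240270452047.0],
    ['Brazil', 2245673032353.8],
    ['South Africa', 366057913367.1]
])
life = DataFrame(columns=['Country name', 'Life expectancy (years)'], data=[
    ['China', 75],
    ['Russia', 71],
    ['United States', 79],
    ['India', 66],
    ['United Kingdom', 81]
])

# Collect the names used in each table and see which ones match exactly.
gdp_names = set(gdp['Country'])
life_names = set(life['Country name'])
common = gdp_names & life_names

print(common)
```

Only China is spelled identically in both tables. The UK and the USA appear in both, but under different names, so the abbreviated names have to be converted before the tables can be combined on country.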
Exercise 1 Creating the data
Open exercise notebook 3 and save it in the disk folder, or upload it to the CoCalc project, that you created in Week 1. Then practise creating dataframes in Exercise 1.
If you’re using Anaconda, remember that to open the notebook you’ll need to navigate to it using Jupyter. Whether you’re using Anaconda or CoCalc, once the notebook is open, run the existing code before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in Week 1 Exercise 1.