Week 6: Combine and transform data Part 2
1 Joining left, right and centre
Let’s take stock for a moment. There’s the original, unchanged table (with full country names) about the life expectancy:
In []:
life
Out[]:
Country name | Life expectancy (years) | |
---|---|---|
0 | China | 75 |
1 | Russia | 71 |
2 | United States | 79 |
3 | India | 66 |
4 | United Kingdom | 81 |
… and a table with the GDP in millions of pounds and also full country names.
In []:
gdp
Out[]:
Country name | GDP (£m) | |
---|---|---|
0 | United Kingdom | 1711727 |
1 | United States | 10716029 |
2 | China | 5905202 |
3 | Brazil | 1435148 |
4 | South Africa | 233937 |
Both tables have a common column with a common name (‘Country name’). I can join the two tables on that common column, using the merge() function. Merging basically puts all columns of the two tables together, without duplicating the common column, and joins any rows that have the same value in the common column.
There are four possible ways of joining, depending on which rows I want to include in the resulting table. If I want to include only those countries appearing in the GDP table, I call the merge() function like so:
In []:
merge(gdp, life, on='Country name', how='left')
Out[]:
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727 | 81 |
1 | United States | 10716029 | 79 |
2 | China | 5905202 | 75 |
3 | Brazil | 1435148 | NaN |
4 | South Africa | 233937 | NaN |
The first two arguments are the tables to be merged, with the first table being called the ‘left’ table and the second being the ‘right’ table. The on argument is the name of the common column, i.e. both tables must have a column with that name. The how argument states I want a left join , i.e. the resulting rows are dictated by the left (GDP) table. You can easily see that India and Russia, which appear only in the right (expectancy) table, don’t show up in the result. You can also see that Brazil and South Africa, which appear only in the left table, have an undefined life expectancy. (Remember that ‘NaN’ stands for ‘not a number.)
A right join will instead take the rows from the right table, and add the columns of the left table. Therefore, countries not appearing in the left table will have undefined values for the left table’s columns:
In []:
merge(gdp, life, on='Country name', how='right')
Out[]:
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727 | 81 |
1 | United States | 10716029 | 79 |
2 | China | 5905202 | 75 |
3 | Russia | NaN | 71 |
4 | India | NaN | 66 |
The third possibility is an outer join which takes all countries, i.e. whether they are in the left or right table. The result has all the rows of the left and right joins:
In []:
merge(gdp, life, on='Country name', how='outer')
Out[]:
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727 | 81 |
1 | United States | 10716029 | 79 |
2 | China | 5905202 | 75 |
3 | Brazil | 1435148 | NaN |
4 | South Africa | 233937 | NaN |
5 | Russia | NaN | 71 |
6 | India | NaN | 66 |
The last possibility is an inner join which takes only those countries common to both tables, i.e. for which I know the GDP and the life expectancy. That’s the join I want, to avoid any undefined values:
In []:
gdpVsLife = merge(gdp, life, on='Country name', how='inner')
Out[]:
Country name | GDP (£m) | Life expectancy (years) | |
---|---|---|---|
0 | United Kingdom | 1711727 | 81 |
1 | United States | 10716029 | 79 |
2 | China | 5905202 | 75 |
Now it’s just a matter of applying the data transformation and combination techniques seen so far to the real data from the World Bank.
Exercise 5 Joining left, right and centre
Put your learning into practice by completing Exercise 5 in the Exercise notebook 3.
Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.