1 Joining left, right and centre
Let’s take stock for a moment. There’s the original, unchanged table (with full country names) about the life expectancy:
life
| Country name | Life expectancy (years) | |
|---|---|---|
| 0 | China | 75 |
| 1 | Russia | 71 |
| 2 | United States | 79 |
| 3 | India | 66 |
| 4 | United Kingdom | 81 |
… and a table with the GDP in millions of pounds and also full country names.
gdp
| Country name | GDP (£m) | |
|---|---|---|
| 0 | United Kingdom | 1711727 |
| 1 | United States | 10716029 |
| 2 | China | 5905202 |
| 3 | Brazil | 1435148 |
| 4 | South Africa | 233937 |
Both tables have a common column with a common name (‘Country name’). I can join the two tables on that common column, using the function. Merging basically puts all columns of the two tables together, without duplicating the common column, and joins any rows that have the same value in the common column.
There are four possible ways of joining, depending on which rows I want to include in the resulting table. If I want to include only those countries appearing in the GDP table, I call the function like so:
merge(gdp, life, on='Country name', how='left')
| Country name | GDP (£m) | Life expectancy (years) | |
|---|---|---|---|
| 0 | United Kingdom | 1711727 | 81 |
| 1 | United States | 10716029 | 79 |
| 2 | China | 5905202 | 75 |
| 3 | Brazil | 1435148 | NaN |
| 4 | South Africa | 233937 | NaN |
The first two arguments are the tables to be merged, with the first table being called the ‘left’ table and the second being the ‘right’ table. The argument is the name of the common column, i.e. both tables must have a column with that name. The argument states I want a left join , i.e. the resulting rows are dictated by the left (GDP) table. You can easily see that India and Russia, which appear only in the right (expectancy) table, don’t show up in the result. You can also see that Brazil and South Africa, which appear only in the left table, have an undefined life expectancy. (Remember that ‘NaN’ stands for ‘not a number.)
A right join will instead take the rows from the right table, and add the columns of the left table. Therefore, countries not appearing in the left table will have undefined values for the left table’s columns:
merge(gdp, life, on='Country name', how='right')
| Country name | GDP (£m) | Life expectancy (years) | |
|---|---|---|---|
| 0 | United Kingdom | 1711727 | 81 |
| 1 | United States | 10716029 | 79 |
| 2 | China | 5905202 | 75 |
| 3 | Russia | NaN | 71 |
| 4 | India | NaN | 66 |
The third possibility is an outer join which takes all countries, i.e. whether they are in the left or right table. The result has all the rows of the left and right joins:
merge(gdp, life, on='Country name', how='outer')
| Country name | GDP (£m) | Life expectancy (years) | |
|---|---|---|---|
| 0 | United Kingdom | 1711727 | 81 |
| 1 | United States | 10716029 | 79 |
| 2 | China | 5905202 | 75 |
| 3 | Brazil | 1435148 | NaN |
| 4 | South Africa | 233937 | NaN |
| 5 | Russia | NaN | 71 |
| 6 | India | NaN | 66 |
The last possibility is an inner join which takes only those countries common to both tables, i.e. for which I know the GDP and the life expectancy. That’s the join I want, to avoid any undefined values:
gdpVsLife = merge(gdp, life, on='Country name', how='inner')
| Country name | GDP (£m) | Life expectancy (years) | |
|---|---|---|---|
| 0 | United Kingdom | 1711727 | 81 |
| 1 | United States | 10716029 | 79 |
| 2 | China | 5905202 | 75 |
Now it’s just a matter of applying the data transformation and combination techniques seen so far to the real data from the World Bank.
Exercise 5 Joining left, right and centre
Put your learning into practice by completing Exercise 5 in the Exercise notebook 3.
Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.
OpenLearn - Introduction and guidance
Except for third party materials and otherwise, this content is made available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Licence, full copyright detail can be found in the acknowledgements section. Please see full copyright statement for details.
