Become an OU student

Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

Week 6: Combine and transform data Part 2

1 Joining left, right and centre

Let’s take stock for a moment. There’s the original, unchanged table (with full country names) about the life expectancy:

In []:

life

Out[]:

Country nameLife expectancy (years)
0China75
1Russia71
2United States79
3India66
4United Kingdom81

… and a table with the GDP in millions of pounds and also full country names.

In []:

gdp

Out[]:

Country nameGDP (£m)
0United Kingdom1711727
1United States10716029
2China5905202
3Brazil1435148
4South Africa233937

Both tables have a common column with a common name (‘Country name’). I can join the two tables on that common column, using the merge() function. Merging basically puts all columns of the two tables together, without duplicating the common column, and joins any rows that have the same value in the common column.

There are four possible ways of joining, depending on which rows I want to include in the resulting table. If I want to include only those countries appearing in the GDP table, I call the merge() function like so:

In []:

merge(gdp, life, on='Country name', how='left')

Out[]:

Country nameGDP (£m)Life expectancy (years)
0United Kingdom171172781
1United States1071602979
2China590520275
3Brazil1435148NaN
4South Africa233937NaN

The first two arguments are the tables to be merged, with the first table being called the ‘left’ table and the second being the ‘right’ table. The on argument is the name of the common column, i.e. both tables must have a column with that name. The how argument states I want a left join , i.e. the resulting rows are dictated by the left (GDP) table. You can easily see that India and Russia, which appear only in the right (expectancy) table, don’t show up in the result. You can also see that Brazil and South Africa, which appear only in the left table, have an undefined life expectancy. (Remember that ‘NaN’ stands for ‘not a number.)

A right join will instead take the rows from the right table, and add the columns of the left table. Therefore, countries not appearing in the left table will have undefined values for the left table’s columns:

In []:

merge(gdp, life, on='Country name', how='right')

Out[]:

Country nameGDP (£m)Life expectancy (years)
0United Kingdom171172781
1United States1071602979
2China590520275
3RussiaNaN71
4IndiaNaN66

The third possibility is an outer join which takes all countries, i.e. whether they are in the left or right table. The result has all the rows of the left and right joins:

In []:

merge(gdp, life, on='Country name', how='outer')

Out[]:

Country nameGDP (£m)Life expectancy (years)
0United Kingdom171172781
1United States1071602979
2China590520275
3Brazil1435148NaN
4South Africa233937NaN
5RussiaNaN71
6IndiaNaN66

The last possibility is an inner join which takes only those countries common to both tables, i.e. for which I know the GDP and the life expectancy. That’s the join I want, to avoid any undefined values:

In []:

gdpVsLife = merge(gdp, life, on='Country name', how='inner')

Out[]:

Country nameGDP (£m)Life expectancy (years)
0United Kingdom171172781
1United States1071602979
2China590520275

Now it’s just a matter of applying the data transformation and combination techniques seen so far to the real data from the World Bank.

Exercise 5 Joining left, right and centre

Put your learning into practice by completing Exercise 5 in the Exercise notebook 3.

Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.