Learn to code for data analysis

Week 8: Further techniques Part 2

1 The split-apply-combine pattern

In the exercise in Week 7, you downloaded data from Comtrade that could be described as ‘heterogeneous’, or mixed, in some way. For example, the same dataset contained information relating to both imports and exports.

Finding the partner countries with the largest export trade value means filtering the dataset to obtain just the rows containing export data and then sorting those rows by value. Finding the largest import partner requires the same filter-and-sort on just the import data.
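
The filter-then-sort step described above might look something like the following in pandas. This is only a minimal sketch: the small `trade` dataframe is invented for illustration, and its column names ('Trade Flow', 'Partner', 'Trade Value (US$)') are assumptions about how the downloaded data is laid out, so adjust them to match your own file.

```python
import pandas as pd

# Hypothetical miniature stand-in for a Comtrade download; the column
# names and the values are invented for illustration only.
trade = pd.DataFrame({
    'Trade Flow': ['Exports', 'Imports', 'Exports', 'Imports', 'Exports'],
    'Partner': ['France', 'France', 'Ireland', 'Ireland', 'Germany'],
    'Trade Value (US$)': [120, 300, 450, 80, 200],
})

# Keep only the export rows, then rank them by trade value, largest first.
exports = trade[trade['Trade Flow'] == 'Exports']
top_exports = exports.sort_values('Trade Value (US$)', ascending=False)

# The same filter-and-sort, applied to the import rows.
imports = trade[trade['Trade Flow'] == 'Imports']
top_imports = imports.sort_values('Trade Value (US$)', ascending=False)

print(top_exports)
print(top_imports)
```

Note that each question needs its own filter: one for exports and another for imports.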

Figure 1 A close-up image of someone putting glue onto a plank of wood

But what if you wanted to find out even more refined information? For example:

  • the total value of exports of product X from the UK to those countries on a year by year basis (group the information by year and then find the total for each year)
  • the total value of exports of product X from the UK to each of the partner countries by year (group the information by country and year and then find the total for each country/year pairing)
  • the average value of exports across all the countries on a month by month basis (group by month, then find the average value per month)
  • the average value of exports across each country on a month by month basis (group by month and country, then find the average value over each country/month pairing)
  • the difference month on month between the value of imports from, or exports to, each particular country over the five-year period (group by country, order by year and month, then find the difference between consecutive months).
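
As a rough illustration, each of the questions in the list above can be phrased as a grouping followed by a calculation. The sketch below uses the pandas `groupby()` method on a small invented dataframe. The column names ('Year', 'Month', 'Partner', 'Trade Value (US$)') are assumptions rather than the actual Comtrade headings, the data is assumed to have been filtered already to exports of a single product, and 'month' is taken here to mean the calendar month number.

```python
import pandas as pd

# Hypothetical export records for one product; all values are invented.
exports = pd.DataFrame({
    'Year': [2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    'Month': [1, 1, 2, 2, 1, 1, 2, 2],
    'Partner': ['France', 'Ireland'] * 4,
    'Trade Value (US$)': [100, 200, 110, 190, 120, 210, 90, 230],
})
value = 'Trade Value (US$)'

# Total exports for each year.
total_by_year = exports.groupby('Year')[value].sum()

# Total exports for each country/year pairing.
total_by_partner_year = exports.groupby(['Partner', 'Year'])[value].sum()

# Average export value for each month, across all countries.
mean_by_month = exports.groupby('Month')[value].mean()

# Average export value for each month/country pairing.
mean_by_month_partner = exports.groupby(['Month', 'Partner'])[value].mean()

# Change between consecutive records for each country: put the rows in
# date order first, then take differences within each country's rows.
# The first record for each country has no predecessor, so it gets NaN.
ordered = exports.sort_values(['Partner', 'Year', 'Month'])
monthly_change = ordered.groupby('Partner')[value].diff()

print(total_by_year)
print(total_by_partner_year)
```

In this toy data each country has only two months per year, so the 'consecutive' records are not always adjacent calendar months. With a complete monthly dataset the same code would give true month-on-month differences.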

In each case, the original dataset needs to be separated into several subsets, or groups of data rows, and then some operation performed on those rows. To generate a single, final report would then require combining the results of those operations in a new or extended dataframe.

This sequence of operations is common enough for it to have been described as the ‘split-apply-combine’ pattern. The sequence is to:

  • ‘split’ an appropriately shaped dataset into several components
  • ‘apply’ an operator to the rows contained within a component
  • ‘combine’ the results of applying the operator to each component to return a single combined result.
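
To make the three steps concrete, here is a small sketch that carries them out by hand, one at a time, and then compares the result with a single `groupby()` call in pandas. The dataframe and its column names are invented for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical mixed dataset containing both import and export rows.
trade = pd.DataFrame({
    'Trade Flow': ['Exports', 'Imports', 'Exports', 'Imports'],
    'Partner': ['France', 'France', 'Ireland', 'Ireland'],
    'Trade Value (US$)': [120, 300, 450, 80],
})

# Split: one sub-dataframe for each distinct trade flow.
pieces = {
    flow: trade[trade['Trade Flow'] == flow]
    for flow in trade['Trade Flow'].unique()
}

# Apply: total the trade value within each piece.
totals = {
    flow: rows['Trade Value (US$)'].sum()
    for flow, rows in pieces.items()
}

# Combine: gather the per-piece results into a single Series.
combined = pd.Series(totals, name='Trade Value (US$)')

# The same three steps expressed as one groupby operation.
shortcut = trade.groupby('Trade Flow')['Trade Value (US$)'].sum()

print(combined)
print(shortcut)
```

Both approaches should report the same totals. The manual version is only there to show what each step does.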

You will see how to make use of this pattern using pandas next.