Learn to code for data analysis
Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

Free course

Learn to code for data analysis

Week 8: Further techniques Part 2

1 The split-apply-combine pattern

In the exercise in Week 7, you downloaded data from Comtrade that could be described as ‘heterogenous’ or mixed in some way. For example, the same dataset contained information relating to both imports and exports.

To find the partner countries with the largest trade value in terms of exports means filtering the dataset to obtain just the rows containing export data and then ranking those. Finding the largest import partner requires a sort on just the import data.

But what if you wanted to find out even more refined information? For example:

  • the total value of exports of product X from the UK to those countries on a year by year basis (group the information by year and then find the total for each year)
  • the total value of exports of product X from the UK to each of the partner countries by year (group the information by country and year and then find the total for each country/year pairing)
  • the average value of exports across all the countries on a month by month basis (group by month, then find the average value per month)
  • the average value of exports across each country on a month by month basis (group by month and country, then find the average value over each country/month pairing)
  • the difference month on month between the value of imports from, or exports to, each particular country over the five year period (group by country, order by month and year, then find the difference between consecutive months).

In each case, the original dataset needs to be separated into several subsets, or groups of data rows, and then some operation performed on those rows. To generate a single, final report would then require combining the results of those operations in a new or extended dataframe.

This sequence of operations is common enough for it to have been described as the ‘split-apply-combine’ pattern. The sequence is to:

  • ‘split’ an appropriately shaped dataset into several components
  • ‘apply’ an operator to the rows contained within a component
  • ‘combine’ the results of applying to operator to each component to return a single combined result.

You will see how to make use of this pattern using pandas next.

LCDAB_1

Take your learning further

Making the decision to study can be a big step, which is why you'll want a trusted University. The Open University has 50 years’ experience delivering flexible learning and 170,000 students are studying with us right now. Take a look at all Open University courses.

If you are new to university level study, find out more about the types of qualifications we offer, including our entry level Access courses and Certificates.

Not ready for University study then browse over 900 free courses on OpenLearn and sign up to our newsletter to hear about new free courses as they are released.

Every year, thousands of students decide to study with The Open University. With over 120 qualifications, we’ve got the right course for you.

Request an Open University prospectus