Skip to main content

About this free course

Download this course

Share this free course

Learn to code for data analysis
Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

1.1 Splitting a dataset by grouping

‘Grouping’ refers to the process of splitting a dataset into sets of rows, or ‘groups’, on the basis of one or more criteria associated with each data row.

Grouping is often used to split a dataset into one or more distinct groups. Each row in the dataset being grouped around can be assigned to one, and only one, of the derived groups. The rows associated with a particular group may be accessed by reference to the group or the same processing or reporting operation may be applied to the rows contained in each group on a group by group basis.

A table contains commodity and amount columns, there are amounts grouped by commodity A, B and C.
Figure 2

The rows do not have to be ‘grouped’ together in the original dataset – they could appear in any order in the original dataset (for example, a commodity A row, followed by a two commodity B rows, then another commodity A row, and so on). However, the order in which each row appears in the original dataset will typically be reflected by the order in which the rows appear in each subgroup.

Let’s see how to do that in pandas. Create a simple dataframe that looks like the full table in the image above:

In []:

data=[['A',10],['A',15],['A',5],['A',20],

['B',10],['B',10],['B',5],

['C',20],['C',30]]

df=DataFrame(data=data, columns=["Commodity","Amount"])

df

Out[]:

CommodityAmount
0A10
1A15
2A5
3A20
4B10
5B10
6B5
7C20
8C30

Next, use the groupby() method to group the dataframe into separate groups of rows based on the values contained within one or more specified ‘key’ columns. For example, group the rows according to what sort of commodity each row corresponds to as specified by the value taken in the ‘Commodity’ column.

In []:

grouped = df.groupby('Commodity')

The number and ‘names’ of the groups that are identified correspond to the unique values that can be found within the column or columns (which will be referred to as the ‘key columns’) used to identify the groups.

You can see what groups are available with the following method call:

In []:

grouped.groups.keys()

Out []:

['A', 'C', 'B']

The get_group() method can be used to grab just the rows associated with a particular group.

In []:

grouped.get_group('B')

Out []:

CommodityAmount
4B10
5B10
6B5

Datasets can also be grouped against multiple columns. For example, if there was an extra ‘Year’ column in the above table, you could group against just the commodity, exactly as above, to provide access to rows by commodity; just the year, setting grouped = df.groupby( 'Year' ); or by both commodity and year, passing in the two grouping key columns as a list:

grouped = df.groupby( ['Commodity','Year'])

The list of keys associated with the groups might then look like [(‘A’, 2015), (‘A’, 2014), (‘B’, 2015), (‘B’, 2014)]. The rows associated with the group corresponding to commodity A in 2014 could then be retrieved using the command:

grouped.get_group( ('A',2014) )

This may seem to you like a roundabout way of filtering the dataframe as you did in Week 2; but you’ll see that the ability to automatically group rows sets up the possibility of then processing those rows as separate ‘mini-dataframes’ and then combining the results back together.

Exercise 2 Grouping data

Work through Exercise 2 in the Week 4 notebook.

As you complete the tasks, think about these questions:

  • For your particular dataset, how did you group the data and what questions did you ask of it? Which countries were the major partners of your reporter country for the different groupings?
  • With the ability to group data so easily, what other sorts of questions would you like to be able to ask?