Skip to content
Skip to main content

About this free course

Download this course

Share this free course

Learn to code for data analysis
Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

1.4 Filtering groups

Being able to group rows according to some criterion and then apply various operations to those groups is a very powerful technique.

However, there may be occasions when you only want to work with a subset of the groups that can be extracted from a single dataset based on a particular group property. For example, it might require that:

  • groups that contain a minimum number of rows, such as countries that engage in trade around a particular commodity with a minimum number of partner countries
  • groups for whom a summary statistic meets certain conditions (for example, the total value of exports for a particular commodity exceeds a particular threshold value, or whose minimum or maximum value are below a certain value)
  • a ranking of the groups based on a particular summary statistic, such as the total trade value, that returns only the top five or bottom three groups according to that ranking.

In the following example, where groups are selected based on group size, a filtering operation is applied to limit an original dataset so that it includes just those groups containing at least three rows, combining the rows from the selected groups back together again to produce the output dataset:

Dataframe for each commodity A, B and C and their amounts.
Figure 6

In pandas, groups can be filtered based on their group properties using the filter() method. Using the example dataframe again:

In []:

df

Out[]:

CommodityAmount
0A10
1A15
2A5
3A20
4B10
5B10
6B5
7C20
8C30

For example, the dataframe can be filtered to return just the rows from groups where there is a maximum number of rows in the group.

As a reference point, count how many rows are associated with each group.

In []:

grouped = df.groupby('Commodity')

grouped.aggregate(len)

Out[]:

Amount
Commodity
A4
B3
C2

The filter() method uses a function that returns a boolean ( True or False ) value to decide whether or not to filter through the rows associated with a particular group.

As with the apply() method, provide the filter() method with just a function name in order to pass each group to that function. For example, define a function that says whether or not a group contains three or fewer rows and use that as a basis for filtering the original dataset.

In []:

def groupsOfAtMostThreeRows(g):

return len(g)

grouped.filter(groupsOfAtMostThreeRows)

Out[]:

CommodityAmount
4B10
5B10
6B5
7C20
8C30

Alternatively, all the rows in a group can be filtered on an aggregate property of the group such as the sum total, or maximum, minimum or mean value, from one of the columns.

In []:

#Consider the following total amounts by group

grouped.aggregate(sum)

Out[]:

Amount
Commodity
A50
B25
C50

In []:

pivot_table(df,

index=['Commodity','Partner'],

values='Amount',

aggfunc=sum)

Out[]:

CommodityAmount
0A10
1A15
2A5
3A20
7C20
8C30

The ability to filter datasets based on group properties means that large datasets can more easily be limited to just those rows associated with groups of rows that are deemed to be relevant in some way.

Exercise 4 Filtering groups

Use the Exercise notebook 4 to practise filtering in Exercise 4.

As you complete the tasks, think about other questions you could ask of your data using the filter command.