Course content Course content

Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

More free courses

1.4 Filtering groups

Being able to group rows according to some criterion and then apply various operations to those groups is a very powerful technique.

However, there may be occasions when you only want to work with a subset of the groups that can be extracted from a single dataset based on a particular group property. For example, it might require that:

groups that contain a minimum number of rows, such as countries that engage in trade around a particular commodity with a minimum number of partner countries
groups for whom a summary statistic meets certain conditions (for example, the total value of exports for a particular commodity exceeds a particular threshold value, or whose minimum or maximum value are below a certain value)
a ranking of the groups based on a particular summary statistic, such as the total trade value, that returns only the top five or bottom three groups according to that ranking.

In the following example, where groups are selected based on group size, a filtering operation is applied to limit an original dataset so that it includes just those groups containing at least three rows, combining the rows from the selected groups back together again to produce the output dataset:

Dataframe for each commodity A, B and C and their amounts.

Figure 6

Show description|Hide description

Dataframe for each commodity A, B and C and their amounts. Filter operation applied to select rows associated with groups that pass a filtration test applied to each group and combined into one dataframe.

Figure 6

In pandas, groups can be filtered based on their group properties using the filter() method. Using the example dataframe again:

In []:

Out[]:

	Commodity	Amount
0	A	10
1	A	15
2	A	5
3	A	20
4	B	10
5	B	10
6	B	5
7	C	20
8	C	30

For example, the dataframe can be filtered to return just the rows from groups where there is a maximum number of rows in the group.

As a reference point, count how many rows are associated with each group.

In []:

grouped = df.groupby('Commodity')

grouped.aggregate(len)

Out[]:

	Amount
Commodity
A	4
B	3
C	2

The filter() method uses a function that returns a boolean ( True or False ) value to decide whether or not to filter through the rows associated with a particular group.

As with the apply() method, provide the filter() method with just a function name in order to pass each group to that function. For example, define a function that says whether or not a group contains three or fewer rows and use that as a basis for filtering the original dataset.

In []:

def groupsOfAtMostThreeRows(g):

return len(g) <= 3

grouped.filter(groupsOfAtMostThreeRows)

Out[]:

	Commodity	Amount
4	B	10
5	B	10
6	B	5
7	C	20
8	C	30

Alternatively, all the rows in a group can be filtered on an aggregate property of the group such as the sum total, or maximum, minimum or mean value, from one of the columns.

In []:

#Consider the following total amounts by group

grouped.aggregate(sum)

Out[]:

	Amount
Commodity
A	50
B	25
C	50

In []:

pivot_table(df,

index=['Commodity','Partner'],

values='Amount',

aggfunc=sum)

Out[]:

	Commodity	Amount
0	A	10
1	A	15
2	A	5
3	A	20
7	C	20
8	C	30

The ability to filter datasets based on group properties means that large datasets can more easily be limited to just those rows associated with groups of rows that are deemed to be relevant in some way.

Exercise 4 Filtering groups

Use the Exercise notebook 4 to practise filtering in Exercise 4.

As you complete the tasks, think about other questions you could ask of your data using the filter command.

Previous 1.3 Summary operations

Next 2 Pivot tables

My OpenLearn Profile

About this free course

Become an OU student

Download this course

Share this free course

1.4 Filtering groups

Exercise 4 Filtering groups