Learn to code for data analysis
Learn to code for data analysis

Start this free course now. Just create an account and sign in. Enrol and complete the course for a free statement of participation or digital badge if available.

Free course

Learn to code for data analysis

2.1 Pivot tables in pandas

The interactive pivot table provides a convenient way of exploring a relatively small dataset directly within a web browser. (A python package is also available that allows interactive pivot tables to be created directly from a pandas dataframe.)

You can also achieve a similar effect using code, one-line-at-a-time. In this step, you will learn how to ask – and answer – questions of a similar form to the ones you raised using the interactive pivot table, but this time using programming code.

There are several reasons why you might want to automate pivot table operations you might previously have done by hand. These include:

  • having a record of all the steps used to perform a particular task, or analysis, which can be useful if you need to check or provide evidence about what you have done (transparency)
  • being able to repeat the task automatically; this is particularly useful if you need to perform the same task repeatedly – for example, generating a new summary report each time a dataset is updated with new weekly or daily figures
  • being able to apply one analysis to another dataset. For example, you might want to produce the same sort of pivot table reports to similarly organised datasets but differently populated datasets (for example, Comtrade datasets that refer to different groups of countries and/or different commodity types).

In order to use the interactive pivot table, you had to identify:

  • what column(s) in the dataset to use to define the row groupings in the pivot table
  • what column(s) in the dataset to use to define the column groupings in the pivot table
  • what column in the dataset to use as the basis for the pivot table summary function
  • what summary function to use.

The process is similar when it comes to using pivot tables in pandas. Indeed, you might find it useful to use the interactive pivot table to help you identify just what needs to go where in order to generate a particular report using the pandas pivot table.

Working with pandas pivot tables

Let’s start by creating a sample dataset that includes several different columns that can be grouped around. The code below defines the dataframe column by column, instead of row by row as you have learned before.

In []:

df = DataFrame({"Commodity":["A","A","A","A","B","B","B","C","C"],

"Amount":[10,15,5,20,10,10,5,20,30],

"Partner":["P","P","Q","Q","P","P","Q","P","Q"],

"Flow":["X","Y","X","Y","X","Y","X","X","Y"]})

df

Out[]:

CommodityPartnerFlowAmount
0APX10
1APY15
2AQX5
3AQY20
4BPX10
5BPY10
6BQX5
7CPX20
8CQY30

Suppose, for example, that you have data for a particular reporter country, and that you want to find the total value of trade that country has for each commodity and each partner country. A pivot table can be used to split the data by ‘commodity’, and within that ‘partner’, and then apply some sort of aggregation function to each (‘commodity’, ‘partner’) group.

In the interactive pivot table, this would have meant ordering the ‘Commodity’ and ‘Partner’ labels in the rows area, setting the aggregation function to sum and applying it to the ‘Amount’ (that is, the ‘Trade Value’), and leaving the columns area free of any selections.

In turn, the pandas pivot_table() function uses:

  • the index parameter set as a list containing the ‘Commodity’ and ‘Reporter’ data elements, to define the row categories
  • the values parameter set to ‘Amount’
  • the aggfunc (aggregating function) set to sum .

In []:

pivot_table(df,

index=['Commodity','Partner'],

values='Amount',

aggfunc=sum)

Out[]:

FlowXY
CommodityPartner
AP1015
Q520
BP1010
Q5NaN
CP20NaN
QNaN30

To further subdivide the data, an additional ‘Flow’ grouping element could be added in. (In this case, the resulting pivot table just corresponds to the original dataset.)

In []:

pivot_table(df,

index=['Commodity','Partner','Flow'],

values='Amount',

aggfunc=sum)

Out[]:

CommodityPartnerFlow
APX10
Y15
QX5
Y20
BPX10
Y10
QX5
CPX20
QY30

Alternatively, you might decide that you want to pull out the ‘Flow’ items into separate columns for each of the original (‘commodity’, ‘partner’) groupings. To do this, add in a columns parameter:

pivot_table(df,

index=['Commodity','Partner'],

columns=['Flow'],

values='Amount',

aggfunc=sum)

FlowXY
CommodityPartner
AP1015
Q520
BP1010
Q5NaN
CP20NaN
QNaN30

In this case, some missing values arise for cases where there was no original row item. For example, there were no rows in the original dataset for Commodity/Partner/Flow values of B/Q/Y, C/P/Y or C/Q/X.

The pandas produced pivot table can be further extended to report ‘marginal’ items, that is, row and column based total amounts, by setting margins=True.

pivot_table(df,

index=['Commodity','Partner'],

columns=['Flow'],

values='Amount',

aggfunc=sum,

margins=True)

FlowXYAll
CommodityPartner
AP101525
Q52025
BP101020
Q5NaN5
CP20NaN20
QNaN3030
All5075125

In terms of the ‘split-apply-combine’ pattern, the pandas pivot table operates in much the same way as the interactive pivot table:

  • the list of original data columns assigned to the index parameter splits the data into a set of groups
  • the groups are further split into smaller cell level groupings by optionally setting the columns parameter.

The selected operator is then applied to each group and the results combined in an appropriately structured output display table.

Exercise 6 pivot tables with pandas

Use the Exercise notebook 4 to explore the creation of pivot tables using pandas in Exercise 6.

Did you manage to ask any new questions of your data using the pandas pivot table function? You could try using them in combination with other pandas functions, such as filter() , to limit the rows you generated the pivot table against. What did the pivot tables tell you about the levels of trade around the trade item and reporter country you selected?

One reason that pivot tables are often thought of as difficult to use is that there is a lot of data manipulation going on inside them. The data is grouped across rows, split across columns and may be aggregated in various ways. It can sometimes be hard to work out how to structure the output report you want, even before worrying about the programming code syntax. Given that, consider what you think the benefits of using code are as opposed to interactive pivot tables. Think about how you could use them to complement each other.

LCDAB_1

Take your learning further371

Making the decision to study can be a big step, which is why you'll want a trusted University. The Open University has 50 years’ experience delivering flexible learning and 170,000 students are studying with us right now. Take a look at all Open University courses372.

If you are new to university level study, we offer two introductory routes to our qualifications. Find out Where to take your learning next?373 You could either choose to start with an Access courses374or an open box module, which allows you to count your previous learning towards an Open University qualification.

Not ready for University study then browse over 1000 free courses on OpenLearn375 and sign up to our newsletter376 to hear about new free courses as they are released.

Every year, thousands of students decide to study with The Open University. With over 120 qualifications, we’ve got the right course for you.

Request an Open University prospectus371