Learn to code for data analysis

1.1 Creating the data

I won’t yet work with the full data. Instead I will create small tables, to better illustrate this week’s concepts and techniques.

Small tables make it easier to see what is going on and to create specific data combination and transformation scenarios that test the code.

There are many ways of creating tables in pandas. One of the simplest is to define the rows as a list, with the first element of the list being the first row, the second element being the second row, etc.

Each row of a table has multiple cells, one for each column. The obvious way is to represent each row as a list too, the first element of the list being the cell in the first column, the second element corresponding to the second column, etc. To sum up, the table is represented as a list of lists.

Here is a table of the 2013 GDP of some countries, in US dollars:

In []:

table = [
    ['UK', 2678454886796.7],       # 1st row
    ['USA', 16768100000000.0],     # 2nd row
    ['China', 9240270452047.0],    # and so on...
    ['Brazil', 2245673032353.8],
    ['South Africa', 366057913367.1]
]
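To make the list-of-lists idea concrete, here is a minimal sketch in plain Python (no pandas needed yet) of how rows and cells are reached by position. The names first_row, china_gdp and countries are only illustrative and are not used elsewhere in the course.

```python
# The same GDP table as above: a list of rows, each row a list of cells.
table = [
    ['UK', 2678454886796.7],
    ['USA', 16768100000000.0],
    ['China', 9240270452047.0],
    ['Brazil', 2245673032353.8],
    ['South Africa', 366057913367.1]
]

# Positions start at 0, so table[0] is the first row.
first_row = table[0]
print(first_row)        # ['UK', 2678454886796.7]

# A second index picks a cell within the row: third row, second column.
china_gdp = table[2][1]
print(china_gdp)        # 9240270452047.0

# Collecting the first cell of every row gives the first column.
countries = [row[0] for row in table]
print(countries)        # ['UK', 'USA', 'China', 'Brazil', 'South Africa']
```

Reading a column this way takes a loop over all rows, which is one reason it is more convenient to turn the list of lists into a dataframe.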

To create a dataframe, I use a pandas function appropriately called DataFrame(). I have to give it two arguments: the names of the columns and the data itself. The column names are given as a list of strings, the first string being the first column name, etc.

In []:

from pandas import DataFrame

headings = ['Country', 'GDP (US$)']
gdp = DataFrame(columns=headings, data=table)
gdp

Out[]:

        Country     GDP (US$)
0            UK  2.678455e+12
1           USA  1.676810e+13
2         China  9.240270e+12
3        Brazil  2.245673e+12
4  South Africa  3.660579e+11

Note that pandas shows large numbers in scientific notation, where, for example, 3e+12 means 3×10¹², i.e. a 3 followed by 12 zeros.
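The notation can be checked in plain Python, which accepts the same e+12 form when writing numbers. This is only a small sketch. The variable names are illustrative, and the format codes used are standard Python string formatting rather than anything specific to pandas.

```python
# 3e+12 is another way of writing 3 × 10**12.
three_trillion = 3e+12
print(three_trillion == 3 * 10**12)    # True

# The UK figure from the table, shown to 6 decimal places
# in scientific notation, as in the dataframe output.
uk_gdp = 2678454886796.7
as_scientific = f'{uk_gdp:.6e}'
print(as_scientific)                   # 2.678455e+12

# The same number written out in full, with thousands separators.
in_full = f'{uk_gdp:,.1f}'
print(in_full)                         # 2,678,454,886,796.7
```

The scientific form is shorter but rounded, so it only affects how the number is displayed, not the value stored in the table.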

I define a similar table for the life expectancy, based on the 2013 World Bank data.

In []:

headings = ['Country name', 'Life expectancy (years)']
table = [
    ['China', 75],
    ['Russia', 71],
    ['United States', 79],
    ['India', 66],
    ['United Kingdom', 81]
]
life = DataFrame(columns=headings, data=table)
life

Out[]:

     Country name  Life expectancy (years)
0           China                       75
1          Russia                       71
2   United States                       79
3           India                       66
4  United Kingdom                       81

To illustrate potential issues when combining multiple datasets, I’ve taken a different set of countries, with common countries in a different order. Moreover, to illustrate a non-numeric conversion, I’ve abbreviated country names in one table but not the other.
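A quick way to see how little the two tables line up as they stand is to compare the country names directly. The sketch below uses plain Python sets rather than pandas. The two lists are typed out from the tables above, and their names are illustrative only.

```python
# Country names as they appear in each table.
gdp_countries = ['UK', 'USA', 'China', 'Brazil', 'South Africa']
life_countries = ['China', 'Russia', 'United States', 'India', 'United Kingdom']

# Names spelled identically in both tables.
common = set(gdp_countries) & set(life_countries)
print(common)                                   # {'China'}

# Names that appear in only one of the tables.
only_gdp = sorted(set(gdp_countries) - common)
only_life = sorted(set(life_countries) - common)
print(only_gdp)     # ['Brazil', 'South Africa', 'UK', 'USA']
print(only_life)    # ['India', 'Russia', 'United Kingdom', 'United States']

# Even the one matching name sits in a different row of each table.
print(gdp_countries.index('China'), life_countries.index('China'))   # 2 0
```

Only 'China' matches exactly. 'UK' and 'United Kingdom', and 'USA' and 'United States', refer to the same countries but are treated as different strings until the names are made consistent.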

Exercise 1 Creating the data

Open exercise notebook 3 and save it in the disk folder, or upload it to the CoCalc project, that you created in Week 1. Then practise creating dataframes in Exercise 1.

If you’re using Anaconda, remember that to open the notebook you’ll need to navigate to it using Jupyter. Whether you’re using Anaconda or CoCalc, once the notebook is open, run the existing code before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch the video in Week 1 Exercise 1 again.