Learn to code for data analysis

1.1 What is a CSV file?

A CSV file is a plain text file that is used to hold tabular data. The acronym CSV is short for ‘comma-separated values’.

Figure 2: An image of many pins marking various countries on a globe

Take a look at the first few lines of a CSV file that holds the same data as the Excel file ‘WHO POP TB all.xls’ that you encountered in Week 2:

Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26
Angola,21472,6900.0
Antigua and Barbuda,90,1.2
Argentina,41446,570.0
Armenia,2977,170.0

Notice that the first line is a row of column names. The subsequent lines are rows of actual data that correspond to the column names. The row of column names is optional, but it helps in understanding the data in the following lines and in making sure the right values fall in the right place. In this example, the first value on every row is a string representing a country’s name, the second value is an integer representing that country’s population (in 1000s) and the third value is a decimal representing the number of deaths due to TB. The third value is a decimal (for example, 0.26 deaths for Andorra) rather than an integer because it is an estimate obtained from statistical processing of collected data.
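As a minimal sketch of how these value types show up in practice, the first few lines of the file can be read from a string rather than from a file on disk. The names csv_text and sample are chosen for this illustration only, and StringIO (from Python's standard io module) stands in for a real file; the exact type names printed may vary with the pandas version.

```python
from io import StringIO
from pandas import read_csv

# The first rows of the CSV data shown above, held in a string for this sketch.
csv_text = """Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26
"""

# StringIO lets read_csv() treat the string as if it were a file.
sample = read_csv(StringIO(csv_text))

# The column names are taken from the first line of the text.
print(list(sample.columns))

# pandas infers one type per column from the values it reads:
# text for the country, integers for the population, decimals for TB deaths.
print(sample.dtypes)
```

Because every population value is a whole number, that column is read as integers, while a single value such as 0.26 is enough for the whole TB deaths column to be read as decimals.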

In this file, each value or column name is separated by a comma. Other characters, such as spaces or tabs, can also be used to separate values, which is why CSV is sometimes taken to stand for ‘character-separated values’.
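To illustrate a separator other than the comma, the sketch below reads tab-separated text by passing the sep argument to read_csv(). The names tab_text and tabbed are made up for this example, and the data is held in a string (via StringIO) rather than in a file.

```python
from io import StringIO
from pandas import read_csv

# Two rows of the same data, with tabs (\t) instead of commas between values.
tab_text = (
    "Country\tPopulation (1000s)\tTB deaths\n"
    "Andorra\t79\t0.26\n"
    "Angola\t21472\t6900.0\n"
)

# sep tells read_csv() which character separates the values.
tabbed = read_csv(StringIO(tab_text), sep='\t')

print(tabbed)
```

The separator should be a character that does not occur inside the values themselves. A space, for instance, would be a poor choice for this data, because a country name such as Antigua and Barbuda contains spaces.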

Because CSV files are plain text, the data is easy to import into any spreadsheet program, database or pandas dataframe.

Before pandas can do anything with a CSV file, the following import statement must be executed:

In []:

from pandas import *

As you learned in Week 2, the import statement loads into memory all the code in the pandas module.

To read a CSV file into a dataframe, the pandas function read_csv() needs to be called.

In []:

df = read_csv('WHO POP TB all.csv')

The above code creates a dataframe from the data in the file WHO POP TB all.csv and assigns it to the variable df. This is the simplest usage of the read_csv() function, just using a single argument, a string that holds the name of the CSV file.

However, the function can take many additional arguments (some of which you’ll use later), which determine how the file is to be read.
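As a hedged illustration of what such additional arguments look like, the sketch below uses usecols to keep only some of the columns and nrows to limit how many rows of data are read. These two arguments are picked purely as examples and are not necessarily the ones used later in the course; the names csv_text and subset are also made up, and the data is read from a string so that the example does not depend on a file being present.

```python
from io import StringIO
from pandas import read_csv

# A few rows of the CSV data, held in a string for this sketch.
csv_text = """Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26
Angola,21472,6900.0
"""

# usecols selects which columns to keep;
# nrows limits how many rows of data (after the column names) are read.
subset = read_csv(StringIO(csv_text), usecols=['Country', 'TB deaths'], nrows=3)

print(subset)
```

With a file on disk, the same arguments would follow the file name, for example read_csv('WHO POP TB all.csv', nrows=3).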

In the next step, find out about dataframes and the ‘dot’ notation.