<?xml version="1.0" encoding="UTF-8"?>
<Item xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Autonumber="false" id="X-gopa_1_combined" TextType="CompleteItem" SchemaVersion="2.0" PageStartNumber="0" Template="Generic_A4_Unnumbered" ExportedEquationLocation="" SecondColour="None" ThirdColour="None" FourthColour="None" Logo="colour" Rendering="OpenLearn" xsi:noNamespaceSchemaLocation="http://www.open.edu/openlearn/ocw/mod/oucontent/schemas/v2_0/OUIntermediateSchema.xsd" x_oucontentversion="2023010300">
    <meta name="aaaf:olink_server" content="http://www.open.edu/openlearn/ocw"/>
    <meta content="https://www.open.edu/openlearn/money-business/retirement-planning-made-easy/content-section-0" name="dc:source"/>
    <meta content="false" name="vle:osep"/>
    <meta content="mathjax" name="equations"/>
    <!--ADD CORRECT OPENLEARN COURSE URL HERE:<meta name="dc:source" content="http://www.open.edu/openlearn/education/educational-technology-and-practice/educational-practice/english-grammar-context/content-section-0"/>-->
    <CourseCode>LCDAB_1</CourseCode>
    <CourseTitle>Learn to code for data analysis</CourseTitle>
    <ItemID><!--leave blank--></ItemID>
    <ItemTitle>Introduction and guidance</ItemTitle>
    <FrontMatter>
        <Imprint>
            <Standard>
                <GeneralInfo>
                    <Paragraph><b>About this free course</b></Paragraph>
                    <Paragraph>This version of the content may include video, images and interactive content that may not be optimised for your device.</Paragraph>
                    <Paragraph>You can experience this free course as it was originally designed on OpenLearn, the home of free learning from The Open University –</Paragraph>
                    <!--[course name] hyperlink to page URL make sure href includes http:// with trackingcode added <Paragraph><a href="http://www.open.edu/openlearn/money-management/introduction-bookkeeping-and-accounting/content-section-0?LKCAMPAIGN=ebook_&amp;amp;MEDIA=ol">www.open.edu/openlearn/money-management/introduction-bookkeeping-and-accounting/content-section-0</a>. </Paragraph>-->
                    <Paragraph>There you’ll also be able to track your progress via your activity record, which you can use to demonstrate your learning.</Paragraph>
                </GeneralInfo>
                <Address>
                    <AddressLine/>
                    <AddressLine/>
                </Address>
                <FirstPublished>
                    <Paragraph/>
                </FirstPublished>
                <Copyright>
                    <Paragraph>Copyright © 2017 The Open University</Paragraph>
                </Copyright>
                <Rights>
                    <Paragraph/>
                    <Paragraph><b>Intellectual property</b></Paragraph>
                    <Paragraph>Unless otherwise stated, this resource is released under the terms of the Creative Commons Licence v4.0 <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_GB">http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_GB</a>. Within that The Open University interprets this licence in the following way: <a href="http://www.open.edu/openlearn/about-openlearn/frequently-asked-questions-on-openlearn">www.open.edu/openlearn/about-openlearn/frequently-asked-questions-on-openlearn</a>. Copyright and rights falling outside the terms of the Creative Commons Licence are retained or controlled by The Open University. Please read the full text before using any of the content.</Paragraph>
                    <Paragraph>We believe the primary barrier to accessing high-quality educational experiences is cost, which is why we aim to publish as much free content as possible under an open licence. If it proves difficult to release content under our preferred Creative Commons licence (e.g. because we can’t afford or gain the clearances or find suitable alternatives), we will still release the materials for free under a personal end-user licence.</Paragraph>
                    <Paragraph>This is because the learning experience will always be the same high quality offering and that should always be seen as positive – even if at times the licensing is different to Creative Commons.</Paragraph>
                    <Paragraph>When using the content you must attribute us (The Open University) (the OU) and any identified author in accordance with the terms of the Creative Commons Licence.</Paragraph>
                    <Paragraph>The Acknowledgements section is used to list, amongst other things, third party (Proprietary), licensed content which is not subject to Creative Commons licensing. Proprietary content must be used (retained) intact and in context to the content at all times.</Paragraph>
                    <Paragraph>The Acknowledgements section is also used to bring to your attention any other Special Restrictions which may apply to the content. For example there may be times when the Creative Commons Non-Commercial Sharealike licence does not apply to any of the content even if owned by us (The Open University). In these instances, unless stated otherwise, the content may be used for personal and non-commercial use.</Paragraph>
                    <Paragraph>We have also identified as Proprietary other material included in the content which is not subject to Creative Commons Licence. These are OU logos, trading names and may extend to certain photographic and video images and sound recordings and any other material as may be brought to your attention.</Paragraph>
                    <Paragraph>Unauthorised use of any of the content may constitute a breach of the terms and conditions and/or intellectual property laws.</Paragraph>
                    <Paragraph>We reserve the right to alter, amend or bring to an end any terms and conditions provided here without notice.</Paragraph>
                    <Paragraph>All rights falling outside the terms of the Creative Commons licence are retained or controlled by The Open University.</Paragraph>
                    <Paragraph>Head of Intellectual Property, The Open University</Paragraph>
                </Rights>
                <Edited>
                    <Paragraph/>
                </Edited>
                <Printed>
                    <Paragraph/>
                </Printed>
                <ISBN><!--INSERT EPUB ISBN WHEN AVAILABLE (.kdl)-->
        <!--INSERT KDL ISBN WHEN AVAILABLE (.epub)--></ISBN>
                <Edition/>
            </Standard>
        </Imprint>
        <Covers>
            <Cover template="false" type="ebook" src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/lcdab_1_epub_1400x1200.jpg"/>
            <Cover template="false" type="A4" src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/lcdab_1_epub_1400x1200.jpg"/>
        </Covers>
    </FrontMatter>
<Unit><UnitID/><UnitTitle>Week 1: Having a go at it Part 1</UnitTitle><Session><Title>1 Install the software</Title><Paragraph>To code in the course notebooks that Ruth mentioned in the video you’ll need to install some software.</Paragraph><Paragraph>We’re going to use a program called Jupyter that opens in your web browser and allows you to write notebooks that include Python code. Jupyter and other software you will need to take part in the course are freely available and you have two options.</Paragraph><InternalSection><Heading>Online CoCalc service</Heading><Paragraph>The advantages of using CoCalc are that you don’t have to install any software and you can work on the course exercises from anywhere there is an internet connection. The disadvantages are that you will need a good internet connection, running the code in your notebook may take time if there are many simultaneous users on CoCalc and you may occasionally lose the latest changes you make in your notebook, because the service will periodically reset.</Paragraph><Paragraph>However, since notebooks are regularly auto-saved, the risk of losing work should be rather small. CoCalc offers a paid plan that has better performance and stability than the free plan. The Open University and the authors have no commercial affiliation with CoCalc.</Paragraph></InternalSection><InternalSection><Heading>Install Anaconda package</Heading><Paragraph>The other option is to install the free Anaconda package, which includes all necessary software for this course, on your laptop or desktop. If you plan to work on this course from multiple computers, you will need to install Anaconda on each one. You can use cloud storage, like Dropbox, to keep your notebooks in sync across machines. 
Anaconda doesn’t have the limitations of CoCalc, so we recommend you use Anaconda if you are always going to work on this course from the same computer.</Paragraph><Paragraph>You should now read the <a href="http://www.open.edu/openlearn/learn-to-code-installation">instructions for installing Anaconda or creating a CoCalc project</a> for this course. Don’t forget to test that everything is working, as explained in the instructions.</Paragraph><Paragraph>The installation of Anaconda is different for Windows, Macs and Linux. Please follow the appropriate instructions.</Paragraph><Paragraph>We advise you to accept the pre-filled defaults suggested during the installation process.</Paragraph></InternalSection><InternalSection><Heading>Notebooks</Heading><Paragraph>Each week you will use two notebooks (and any necessary data files): an exercise notebook and a project notebook. The notebooks are this course’s programming environment, where you will do your own coding.</Paragraph><Paragraph>The exercise notebook contains all the code shown throughout the week, so that you can try it out for yourself, any time you wish. The exercise notebook also contains all the week’s exercises. You will be able to solve several exercises just by slightly modifying our code.</Paragraph><Paragraph>The project notebook contains the week’s written-up data analysis project, including all necessary code. If you have the extra time, you’re encouraged to modify the project notebooks to write up your own data analyses.</Paragraph><Paragraph>You should now download the notebooks and data files needed for the whole course from <a href="https://github.com/mwermelinger/Learn-to-code-for-data-analysis">here</a>.</Paragraph><Paragraph>You’ll open the notebooks using Jupyter, which is part of both Anaconda and CoCalc. 
You will learn how to use notebooks later this week, after you’ve seen what Python code looks like.</Paragraph><Paragraph>Note: please ensure that you abide by any terms and conditions associated with these pieces of software.</Paragraph></InternalSection><Section><Title>1.1 Start with a question</Title><Paragraph>Data analysis often starts with a question or with some data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1026.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1026.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="9ec6e161" x_imagesrc="ou_futurelearn_learn_to_code_fig_1026.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b> </Caption><Description>An image with a young boy wearing a medical mask, in the foreground; a patient in a South African tuberculosis clinic </Description></Figure><Paragraph>A question leads to data that can answer it, and looking at the available data helps to make a question precise or may trigger new questions, which, in turn, may require further data. Data analysis is thus often an iterative process: the questions determine which data to obtain, and the data influences which questions to ask and what the scope of the analysis is. How this week’s project came about is an example of such an iterative process.</Paragraph><Paragraph>I (Michel) was watching a news programme mentioning the fight against tuberculosis (TB) as part of the United Nations Millennium Development Goals. Wishing to know how serious TB is, I browsed the World Health Organization (WHO) website and found a dataset with the number of TB cases and deaths per country per year, from 2007 to 2013. 
This in turn raised the question of whether a high (or low) number could be mainly due to the country having a large (or small) population. Some more browsing revealed the WHO also has population data from 1990 to 2013.</Paragraph><Paragraph>That was enough data for the fuzzy question: how serious is TB? It was time to make it precise. I chose to measure the effect of TB in terms of deaths, which led to the following questions:</Paragraph><BulletedList><ListItem>What is the total, smallest, largest, and average number of deaths due to TB?</ListItem><ListItem>What is the death rate (number of deaths divided by population) of each country?</ListItem><ListItem>Which countries have the smallest and largest number of deaths?</ListItem><ListItem>Which countries have the smallest and largest death rate?</ListItem></BulletedList><Paragraph>Answering these questions for the whole world and for seven years (2007–2013) would be a bit too much for this initial project. A subset was needed. I decided to take only the latest data for 2013 and, being Portuguese, to focus on the Portuguese-speaking countries. One of them, Brazil, is part of the BRICS group of major emerging economies, so for more diversity the other four countries would be included too: Russia, India, China and South Africa. The project was finally defined! I’ve added links to the data below if you’d like to take a look!</Paragraph><Activity><Heading>Activity 1 What would you ask?</Heading><Question><Paragraph>Before you embark on coding the analysis to get answers, what other questions could be asked of the datasets described?</Paragraph><Paragraph>What countries would you be interested in? 
What groups of countries might be interesting to analyse?</Paragraph><Paragraph>Note down some of your questions so that you can come back to them later.</Paragraph></Question><Interaction><FreeResponse size="paragraph" id="a1"/></Interaction></Activity><Paragraph><a href="https://github.com/mwermelinger/Learn-to-code-for-data-analysis/raw/master/1_Having_a_go_at_it/WHO%20POP%20TB%20all.xls">WHO POPULATION - DATA BY COUNTRY (LATEST YEAR)</a></Paragraph><Paragraph><a href="https://github.com/mwermelinger/Learn-to-code-for-data-analysis/raw/master/1_Having_a_go_at_it/WHO%20POP%20TB%20some.xls">WHO TB MORTALITY AND PREVALENCE - DATA BY COUNTRY (2007 - PRESENT)</a></Paragraph><Paragraph>Next, I’ll explain how I started to organise the information.</Paragraph></Section><Section><Title>1.2 Variables and assignments</Title><Paragraph>With the choice of data and questions confirmed, the coding can begin.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1027.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1027.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="5f8eafca" x_imagesrc="ou_futurelearn_learn_to_code_fig_1027.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 2</b> </Caption><Alternative>An image of a lot of boxes stacked up</Alternative><Description>An image of a lot of boxes stacked up</Description></Figure><Paragraph>To introduce the basics of coding, I will show you a very simple approach, only suitable for the smallest of datasets. Please bear with me. In the second part of the week I will show you the proper approach. Read through this step and the next – <b>you’re not expected to write code just yet</b>. 
In Exercise 1, a bit further on this week, you’ll be asked to start writing code.</Paragraph><Paragraph>OK, let’s start. I want the computer to calculate the total number of deaths in 2013. For the computer to do that, it must first be told the number of deaths in each country in that year. I’ll start with my home country.</Paragraph><Paragraph><b><ComputerCode>In []:</ComputerCode> </b></Paragraph><ComputerDisplay><Paragraph>deathsInPortugal = 100</Paragraph></ComputerDisplay><Paragraph>The ‘In[]’ line is Jupyter’s way of saying that what follows is code I typed in. And there it is: the first line of code! It is a command to the computer that could be translated to English as: ‘find in the attic an empty box, put the number 100 in the box, and write “deathsInPortugal” on the box’. (Aren’t you glad Python is more succinct than English?) In coding jargon, the attic is the computer’s memory, boxes are called <b>variables</b> (I’ll explain why shortly), what’s written on a box is the variable’s <b>name</b>, and storing a value in a variable is called an <b>assignment</b>.</Paragraph><Paragraph>By naming the boxes, I can later ask the computer to show the value in box <ComputerCode>
<b>thingamajig</b>
</ComputerCode> or take the values in boxes <ComputerCode>
<b>stuff</b>
</ComputerCode> and <ComputerCode>
<b>moreStuff</b>
</ComputerCode> and add them together.</Paragraph><Paragraph>To see what’s inside a box, I can just write the name of the box on a line of its own. Jupyter will write the variable’s value on the screen, preceded by ‘Out[]’, to clearly mark the output generated by the code. When you start to use the Jupyter notebooks, you will see numbers inside the square brackets, i.e. In[1], In[2], etc., to indicate in which order the various pieces of code are being executed. Here we have omitted the numbers to avoid confusion between what you see here and what you see in your notebook.</Paragraph><Paragraph><b> <ComputerCode>In []:</ComputerCode> </b></Paragraph><ComputerDisplay><Paragraph>deathsInPortugal</Paragraph></ComputerDisplay><Paragraph><b> <ComputerCode>Out[]:</ComputerCode> </b></Paragraph><ComputerDisplay><Paragraph>100</Paragraph></ComputerDisplay><Paragraph>Each assignment is written on a line of its own. The computer executes the assignments line by line, from top to bottom. Thus, the program would continue as follows:</Paragraph><Paragraph><b> <ComputerCode>In []:</ComputerCode> </b></Paragraph><ComputerDisplay><Paragraph>deathsInPortugal = 100</Paragraph><Paragraph>deathsInAngola = 200</Paragraph><Paragraph>deathsInBrazil = 300</Paragraph></ComputerDisplay><Paragraph>I don’t think I need to continue, you get the gist.</Paragraph><Paragraph>By the way, all numbers so far are fictitious. If I use real data, taken from the World Health Organization website, you’ll see a difference.</Paragraph><Paragraph><b> <ComputerCode>In []:</ComputerCode> </b></Paragraph><ComputerDisplay><Paragraph>deathsInPortugal = 140</Paragraph><Paragraph>deathsInAngola = 6900</Paragraph><Paragraph>deathsInBrazil = 4400</Paragraph><Paragraph>deathsInPortugal</Paragraph></ComputerDisplay><Paragraph><b> <ComputerCode>Out[]:</ComputerCode> </b></Paragraph><ComputerDisplay><Paragraph>140</Paragraph></ComputerDisplay><Paragraph>Notice what happened. 
When a value is assigned to an already existing variable, the value stored in that variable is unceremoniously chucked away and replaced by the new value. In the example, the second group of assignments replaced the values assigned by the first group and thus the current value of <ComputerCode>
<b>deathsInPortugal</b>
</ComputerCode> is 140 and no longer 100. That’s why the storage boxes are called variables: their content can vary over time.</Paragraph><Paragraph>To sum up, a <b>variable</b> is a named storage location for values and an <b>assignment</b> takes the value on the right-hand side of the equals sign (=) and stores it in the variable on the left-hand side.</Paragraph><Paragraph>In the next section, you will find out the importance of naming in Python.</Paragraph></Section><Section id="art_of_naming"><Title>1.3 The art of naming</Title><Paragraph>Python is relatively flexible about what you name your variables but rather picky about the format of names.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1028.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1028.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="b851738c" x_imagesrc="ou_futurelearn_learn_to_code_fig_1028.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b> </Caption><Alternative>An image of blank name labels headed, 'Hello my name is'</Alternative><Description>An image of blank name labels headed, 'Hello my name is'</Description></Figure><Paragraph>I could have chosen <ComputerCode>
<b>deaths_in_Brazil_in_2013, deathsBrazil, DeathsBrazil, dB</b>
</ComputerCode> or even <ComputerCode>
<b>stuff</b>
</ComputerCode> for my variables. If a box in your attic were labelled <ComputerCode>
<b>dB</b>
</ComputerCode> or <ComputerCode>
<b>stuff</b>
</ComputerCode> though, would you know what it contains a year later? So, although you can, it’s better not to use cryptic, general, or very long names.</Paragraph><Paragraph>You can’t use spaces to separate words in a name, you can’t start a name with a digit and names are case-sensitive, i.e. <ComputerCode>
<b>deathsBrazil</b>
</ComputerCode> and <ComputerCode>
<b>DeathsBrazil</b>
</ComputerCode> are not the same variable. Making one of those mistakes will result in a <b>syntax error</b> (when the computer doesn’t understand the line of code) or a <b>name error</b> (when the computer doesn’t know of any variable with that name).</Paragraph><Paragraph>Let’s see some examples. (Remember that you’re not expected to write any code for this step.) The first example has spaces between the words, the second example has a digit at the start of the name, and the third example changes the case of a letter, resulting in an unknown name. The kind of error is always at the end of the error message.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>deaths In Portugal = 140</Paragraph><Paragraph>File "&lt;ipython-input-7-ded1a063fe45&gt;", line 1</Paragraph><Paragraph>deaths In Portugal = 140</Paragraph><Paragraph>^</Paragraph><Paragraph>SyntaxError: invalid syntax</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>2013deathsInPortugal = 140</Paragraph><Paragraph>File "&lt;ipython-input-8-af085101fcfc&gt;", line 1</Paragraph><Paragraph>2013deathsInPortugal = 140</Paragraph><Paragraph>^</Paragraph><Paragraph>SyntaxError: invalid syntax</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>deathsinPortugal</Paragraph><Paragraph>---------------------------------------</Paragraph><Paragraph>
NameError Traceback (most recent call last)
</Paragraph><Paragraph>&lt;ipython-input-9-7d3c81b4fb34&gt; in &lt;module&gt;()</Paragraph><Paragraph>----&gt; 1 deathsinPortugal</Paragraph><Paragraph>NameError: name 'deathsinPortugal' is not defined</Paragraph></ComputerDisplay><Paragraph>Note that Jupyter doesn’t write any <ComputerCode>
<b>Out[]</b>
</ComputerCode> because the code is wrong and thus doesn’t generate any output.</Paragraph><Paragraph>In this course, to make names shorter to help fit lines of code on small screens, we’ll use capitalisation instead of underscores to separate the different words of a name, as shown in the code so far. Such practice is called <b>camel case</b> independently of the name having <b> <ComputerCode>oneHump</ComputerCode> </b> (‘dromedary case’ just doesn’t sound good, does it?) or <b> <ComputerCode>moreThanTwoHumps</ComputerCode> </b>. The convention in Python is to start variable names with lower case and we’ll stick to it.</Paragraph><Paragraph>In the next section, download the notebook for this week and work through the first exercise – your first line of code!</Paragraph></Section><Section id="exercise_1"><Title> 1.4 Downloading the notebook and trying the first exercise </Title><Paragraph>So far, I’ve done the coding and you’ve read along. Booooring. It’s time to use the Jupyter notebooks and work on the first exercise in the course.</Paragraph><Activity>
                    <Heading>Exercise 1 Variables and assignments</Heading>
                    <Question>
                        <Paragraph>If you haven’t yet installed the software package or created an account on CoCalc, do it now using these <a href="http://www.open.edu/openlearn/learn-to-code-installation">instructions</a>!</Paragraph>
                        <Paragraph>Open the Exercise notebook 1 (from <a href="https://github.com/mwermelinger/Learn-to-code-for-data-analysis">here</a>), and put it in the folder you created. (You’ll open the data later and learn how to use it in the notebook.)</Paragraph>
                        <Paragraph>Once you have downloaded the file, watch the video to learn how to work with Jupyter notebooks and complete Exercise 1. Pause the video frequently to repeat the demonstrated steps in your notebook. Throughout the week you’ll be directed back to the notebook to complete the other exercises.</Paragraph>
                        <MediaContent src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_vid_1046.mp4" type="video" width="512" x_manifest="ou_futurelearn_learn_to_code_vid_1046_1_server_manifest.xml" x_filefolderhash="cbfeded3" x_folderhash="cbfeded3" x_contenthash="12863eea" x_subtitles="ou_futurelearn_learn_to_code_vid_1046.srt">
                            <Transcript>
                                <Speaker>NARRATOR</Speaker>
                                <Remark> In this screencast I'm going to introduce you to Jupyter Notebooks. First you'll need to start the Anaconda launcher. This screencast was done on a Mac. However, the same process applies to Windows. If you're a Linux user, you'll need to use the command line as described in the installation instructions. To follow along on your computer, make sure you have created a folder for this course, and that it contains the exercise notebook for this week. This screencast uses earlier versions of the exercise notebook, and of the Anaconda software than you downloaded, so don't worry that things look slightly different. </Remark>
                                <Remark> Do not click on 'Update' buttons in the Anaconda Launcher, because you should use the version you installed to avoid compatibility problems with the notebooks of this course. Once the Anaconda launcher has booted up, launch the ipython notebook. Whenever you see a circle, the mouse has been clicked. After a couple of screens you should see Jupyter running in a browser, and the contents of your home folder. Navigate to the folder you created, and open the relevant notebook. The first thing to appreciate is that Jupyter notebook consists of a sequence of individual cells. You can see the individual cells as I click on the left of each one. Each cell can contain text or code. </Remark>
                                <Remark> Before starting any exercises, you should execute all the code already in the notebook. I'll explain why in a moment. Go to the 'Cell' menu and select 'Run all'. As the notebook executes all code, it may automatically scroll to a different part of the notebook. Just scroll back to the start. Go to the first exercise. It asks you to add assignments for more countries into the preceding code cell. To select a cell, click to the left of it. A grey border shows the currently selected cell. To edit a cell, click inside it. The border becomes green to show the cell is in editing mode. </Remark>
                                <Remark> Once inside the cell press 'Enter' a couple of times to put the cursor on a new line, then start typing an appropriate variable name, say 'deaths in Russia'. Once you've started typing, names that have been used within the notebook can be accessed via the Tab key. So once you've typed 'de' press the Tab key to get some auto-complete suggestions. Use the arrow keys to scroll through the options and press 'Enter' to select the appropriate option. In this case I'll accept the first suggestion in the list and edit it to complete the assignment. Next start a new line and just enter the new variable name you've just added. Remember to use auto-complete to avoid spelling mistakes. </Remark>
                                <Remark> Now we can run the code. To execute only the current code cell, click on the 'Play' button. The results appear below the code cell in a line titled 'Out'. If you wish to split a cell, for example to separate the supplied code from the code you are adding, then put the cursor where you want to split the cell, go to the 'Edit' menu and select 'Split Cell'. It's easy to move cells around - for example we can cut a selected cell......and then paste it below another cell that you select. To save a snapshot of the notebook, called a checkpoint, click the 'Save and Checkpoint' button. </Remark>
                                <Remark> If things go horribly wrong, you can revert the notebook to the last checkpoint by using the 'Revert to Checkpoint' option in the 'File' menu. To finish your session, go to the 'File' menu and select 'Close and Halt'. It is very important to note that opening a notebook does not execute any code cells. Any code output was saved from the previous session. So if I reopen the workbook and execute the code in the second cell alone, I'll get an error, because in this session (that is since the notebook was opened again) the first cell hasn't been executed and therefore the computer doesn't recognise the variable name 'deaths in Portugal'. And that's why you should run all code after opening a notebook! </Remark>
                                <Remark> It's easy to add your own notes to the notebook. For example if I select the first cell......and then click on the plus button, I can insert a new cell below the current one. By default, a new cell is a code cell. Select 'Markdown' to change it to a text cell. To edit a text cell, double-click inside the cell. Text is written in Markdown, a very simple formatting system. Here are some examples of what Markdown can do - pay attention to how prefixing or surrounding words with simple characters is all the information needed to format the text. </Remark>
                                <Remark> Also note the way a word or phrase to be hyperlinked is surrounded with square brackets and immediately followed by the URL, in round brackets. Once the text is written, click the 'Play' button to see the formatted text in the cell. The 'Help' menu contains links to information about Jupyter notebooks and Markdown formatting. As you get used to Jupyter, take a look at the keyboard shortcuts, as they will help you to work more efficiently. </Remark>
                            </Transcript>
                            <Figure>
                                <Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_vid_1046.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_vid_1046.jpg" x_folderhash="cbfeded3" x_contenthash="8d6b068c" x_imagesrc="ou_futurelearn_learn_to_code_vid_1046.jpg" x_imagewidth="512" x_imageheight="288"/>
                            </Figure>
                        </MediaContent>
                    </Question>
                </Activity><Paragraph>If you haven’t yet installed Jupyter and Anaconda, do it now using these <a href="http://www.open.edu/openlearn/learn-to-code-installation">instructions</a>.</Paragraph></Section><Section id="expressions"><Title>1.5 Expressions</Title><Paragraph>I’ve told the computer the deaths in Angola, Brazil and Portugal. I can now ask it to add them together to obtain the total deaths.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>deathsInAngola + deathsInBrazil + deathsInPortugal</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>11440</Paragraph></ComputerDisplay><Paragraph>A fragment of code that has a value is called an <b>expression</b>. Calculating the value of an expression is called <b>evaluating</b> the expression. If the expression is on a line of its own, the Jupyter notebook displays its value, as above.</Paragraph><Paragraph>The value of an expression can of course be assigned to a variable.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
totalDeaths = deathsInAngola + deathsInBrazil + deathsInPortugal
</Paragraph></ComputerDisplay><Paragraph>Note that no value is displayed because the whole line of code is not an expression; it’s a <b>statement</b>, a command to the computer. In this case the statement is an assignment. You will see another kind of statement later this week.</Paragraph><Paragraph>As you learned earlier, to see the value you must write the variable’s name.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>totalDeaths</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>11440</Paragraph></ComputerDisplay><Paragraph>This is really just a special case of the general rule that writing an expression on its own shows its value. A variable (which stores a value) is just an example of an expression (which is anything that has a value).</Paragraph><Paragraph>I can now write an expression to compute the average number of deaths: it’s the total divided by the number of countries.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>totalDeaths / 3</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>3813.3333333333335</Paragraph></ComputerDisplay><Paragraph>Python has, of course, all four arithmetic <b>operators</b>: addition (+), division (/), subtraction (-) and multiplication (*). I’ll use the last two later in the week. Python follows the conventional operator precedence: multiplication and division before addition and subtraction, unless parentheses are used to change the order. For example, (3+4)/2 is 3.5 but 3+4/2 is 5.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1030.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1030.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="fd7d2b27" x_imagesrc="ou_futurelearn_learn_to_code_fig_1030.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 4</b> </Caption><Alternative>An image of many colourful painted skulls with different drawings and textures.</Alternative><Description>An image of many colourful painted skulls with different drawings and textures.</Description></Figure><Paragraph>Now practise writing expressions and complete Exercise 2 in the notebook.</Paragraph><Activity><Heading>Exercise 2 Expressions</Heading><Question><Paragraph>Go back to the Exercise notebook 1 you used in Exercise 1. In Exercise 2 you’ll see an example of operator precedence and practise writing expressions.</Paragraph><Paragraph>If you’re using Anaconda, remember that to open the notebook you’ll need to navigate to it using Jupyter. 
Whether you’re using Anaconda or CoCalc, once the notebook is open, run all the code before doing the exercise.</Paragraph><Paragraph>Writing code for the first time can be difficult but stick with it.</Paragraph></Question></Activity><Paragraph>In the next section, you will find out about functions.</Paragraph></Section><Section id="functions"><Title>1.6 Functions</Title><Paragraph>After the total and the average, next on my to-do list is to calculate the largest number of deaths.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1031.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1031.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="c59b3db9" x_imagesrc="ou_futurelearn_learn_to_code_fig_1031.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 5</b> </Caption><Alternative>An image with an angel in prayer statue in the foreground of a graveyard</Alternative><Description>An image with an angel in prayer statue in the foreground of a graveyard</Description></Figure><Paragraph>This will be the <b>maximum</b>. It takes another single line of code to calculate it.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
max(deathsInAngola, deathsInBrazil, deathsInPortugal)
</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>6900</Paragraph></ComputerDisplay><Paragraph>In this expression, <b> <ComputerCode>max()</ComputerCode> </b> is a function – the parentheses are a reminder that the name <b> <ComputerCode>max</ComputerCode> </b> doesn’t refer to a variable. A <b>function</b> is a piece of code that calculates (<b>returns</b>) a value, given zero or more values (the function’s <b>arguments</b>). In this case, <b> <ComputerCode>max()</ComputerCode> </b> has three arguments and returns the greatest of them. Actually, <b> <ComputerCode>max()</ComputerCode> </b> can calculate the maximum of two, three, four or more values.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>max(deathsInBrazil, deathsInPortugal)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>4400</Paragraph></ComputerDisplay><Paragraph>The expressions above are function <b>calls</b>. I’m calling the <b> <ComputerCode>max()</ComputerCode> </b> function with three or two arguments, and the value of the expression is the value returned by the function. A function is called by writing its name, followed by the arguments, within parentheses and separated by commas. Function names follow the same rules as variable names.</Paragraph><Paragraph>As you might expect, Python also has a function to calculate the smallest (minimum) of two or more values.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
min(deathsInAngola, deathsInBrazil, deathsInPortugal)
</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>140</Paragraph></ComputerDisplay><Paragraph>The value returned by a function call can be assigned to a variable. Here is an example, which calculates the <b>range</b> of deaths. The range of a set of values is the difference between the largest and the smallest of them.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
largest = max(deathsInAngola, deathsInBrazil, deathsInPortugal)
</Paragraph><Paragraph>
smallest = min(deathsInAngola, deathsInBrazil, deathsInPortugal)
</Paragraph><Paragraph>range = largest - smallest</Paragraph><Paragraph>range</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>6760</Paragraph></ComputerDisplay><Activity><Heading>Exercise 3 Functions</Heading><Multipart><Paragraph>Identify different types of error (some of you may have experienced those already…) in Exercise 3. You’ll need to use the Week 1 notebook to answer question three.</Paragraph><Part><Question><Paragraph><b> 1. If the function name is misspelled as Min, what kind of error is it? </b></Paragraph></Question><Interaction><SingleChoice><Wrong><Paragraph>A syntax error</Paragraph><Feedback><Paragraph>Writing <ComputerCode>Min(…, …)</ComputerCode> instead of <ComputerCode>min(…, …)</ComputerCode> is not a syntax error, because both have the form of a function call. Names can use uppercase letters.</Paragraph><Paragraph>Take a look at <CrossRef idref="art_of_naming">The art of naming</CrossRef>.</Paragraph></Feedback></Wrong><Right><Paragraph>A name error</Paragraph><Feedback><Paragraph>The computer will understand that <ComputerCode>Min(…, …)</ComputerCode> is a function call but doesn’t know of any function with that name. Remember that names are case-sensitive.</Paragraph><Paragraph>Take a look at <CrossRef idref="art_of_naming">The art of naming</CrossRef>.</Paragraph></Feedback></Right></SingleChoice></Interaction></Part><Part><Question><Paragraph><b> 2. If a parenthesis or comma is forgotten, what kind of error is it? </b></Paragraph></Question><Interaction><SingleChoice><Wrong><Paragraph>A name error</Paragraph><Feedback><Paragraph>A parenthesis or comma is unrelated to how names are written.</Paragraph><Paragraph>Take a look at <CrossRef idref="art_of_naming">The art of naming</CrossRef>.</Paragraph></Feedback></Wrong><Right><Paragraph>A syntax error</Paragraph><Feedback><Paragraph>A function call requires two parentheses around the arguments, and one comma between successive arguments. 
Forgetting any of them therefore deviates from the syntax of the Python language.</Paragraph><Paragraph>Take a look at <CrossRef idref="functions">Functions</CrossRef>.</Paragraph></Feedback></Right></SingleChoice></Interaction></Part><Part><Question><Paragraph><b> 3. Use Exercise 3 in the Week 1 exercise notebook to answer this question. </b></Paragraph><Paragraph><b> What is the range of deaths among the BRICS countries (Brazil, Russia, India, China, South Africa)? </b></Paragraph></Question><Interaction><SingleChoice><Wrong><Paragraph>4400</Paragraph><Feedback><Paragraph>This is the minimum value (for Brazil), not the range.</Paragraph><Paragraph>Take a look at <CrossRef idref="functions">Functions</CrossRef>.</Paragraph></Feedback></Wrong><Wrong><Paragraph>65480</Paragraph><Feedback><Paragraph>This is the average number of deaths, not the range.</Paragraph><Paragraph>Take a look at <CrossRef idref="expressions">Expressions</CrossRef>.</Paragraph></Feedback></Wrong><Wrong><Paragraph>240000</Paragraph><Feedback><Paragraph>This is the maximum value (for India), not the range.</Paragraph><Paragraph>Take a look at <CrossRef idref="functions">Functions</CrossRef>.</Paragraph></Feedback></Wrong><Wrong><Paragraph>327400</Paragraph><Feedback><Paragraph>This is the total number of deaths, not the range.</Paragraph><Paragraph>Take a look at <CrossRef idref="functions">Functions</CrossRef>.</Paragraph></Feedback></Wrong><Right><Paragraph>235600</Paragraph><Feedback><Paragraph>The range is the maximum value (240 thousand for India) minus the minimum value (4400 for Brazil).</Paragraph></Feedback></Right></SingleChoice></Interaction></Part></Multipart></Activity></Section><Section><Title>1.7 Comments</Title><Paragraph>Last on my to-do list is the death rate, which is the number of deaths divided by the population.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1032.jpg" 
src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1032.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="10cd9adf" x_imagesrc="ou_futurelearn_learn_to_code_fig_1032.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 6</b> </Caption><Alternative>An image of different speech bubbles being held up</Alternative><Description>An image of different speech bubbles being held up</Description></Figure><Paragraph/><Paragraph>A quick glance at the WHO website tells me Portugal’s population in 2013.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>populationOfPortugal = 10608</Paragraph></ComputerDisplay><Paragraph>Wait a minute! This can’t be right. I know Portugal isn’t a large country, but ten and a half thousand people is ridiculous. I look more carefully at the WHO website. Oh, the value is given in thousands of people; it’s 10 million and 608 thousand people. I could change the assignment to</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>populationOfPortugal = 10608000</Paragraph></ComputerDisplay><Paragraph>but that could give the impression that the population had been counted exactly, whereas it’s more likely the number is an estimate based on a previous census. It also makes it easier to check my code against the WHO data if I use the exact same numbers.</Paragraph><Paragraph>I will therefore keep the original assignment but make a note of the unit, using a <b>comment</b> , a piece of text that documents what the code does.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph># population unit: thousands of inhabitants</Paragraph><Paragraph>populationOfPortugal = 10608</Paragraph><Paragraph># deaths unit: inhabitants</Paragraph><Paragraph>deathsInPortugal = 140</Paragraph></ComputerDisplay><Paragraph>A comment starts with a hash sign <b>(#)</b> and goes until the end of the line. Computers ignore all comments; they just execute the code. Comments are your insurance policy: they help you understand your own code if you come back to it after a long break.</Paragraph><Paragraph>I can now compute the death rate, making sure I first convert the population into number of inhabitants, the same unit as deaths.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>deathsInPortugal / (populationOfPortugal * 1000)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>1.3197586726998491e-05</Paragraph></ComputerDisplay><Paragraph>The death rate (roughly 140 people in 10 million) is a very small number, not very practical to display and reason about. Looking again at the WHO website, I note that other indicators, like TB prevalence, are given per 100 thousand inhabitants. I will do the same for the death rate. Since the population is already in thousands, dividing the deaths by the population gives me the number of deaths per thousand people. Thus, the number of deaths per 100 thousand people must be 100 times higher than that.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph># death rate: deaths per 100 thousand inhabitants</Paragraph><Paragraph>deathsInPortugal * 100 / populationOfPortugal</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>1.3197586726998491</Paragraph></ComputerDisplay><Paragraph>This finishes the basics of coding needed for this week. It took fewer than 30 lines of code…</Paragraph><Paragraph>Test this out for yourself in Exercise 4 of the Week 1 exercise notebook.</Paragraph><Activity><Heading>Exercise 4 Comments</Heading><Question><Paragraph>Complete the short exercise on the death rate in Exercise 4 in the Week 1 Exercise notebook.</Paragraph><Paragraph>Remember that, once the notebook is open, you should run all the code before doing the exercise.</Paragraph></Question></Activity></Section><Section><Title>1.8 Values have units</Title><Paragraph>Before I move on, let me explain the importance of using comments to record units of measurement.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1033.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1033.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="9e8a78d3" x_imagesrc="ou_futurelearn_learn_to_code_fig_1033.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 7</b> </Caption><Alternative>An image of four towers of liquorice allsorts</Alternative><Description>An image of four towers of liquorice allsorts</Description></Figure><Paragraph>Values are not just numbers; they have units: degrees Celsius, number of inhabitants, thousands of gallons, etc. Always make a note of the units the value refers to, using comments. This makes it easier to check whether the expressions are right. 
Disregarding the units will lead to wrong calculations and results.</Paragraph><Activity><Heading>Activity 2</Heading><Question><Paragraph>Have you come across ‘horror stories’ that have happened due to mistakes in the unit of measurement?</Paragraph><Paragraph>Think through what happened and consider what you have learned from them.</Paragraph></Question><Interaction><FreeResponse size="paragraph" id="a2"/></Interaction></Activity></Section></Session><Session><Title>2 This week’s quiz</Title><Paragraph>Check what you’ve learned this week by taking the end-of-week quiz.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78777">Week 1 practice quiz.</a></Paragraph><Paragraph>Open the quiz in a new window or tab then come back here when you’ve finished.</Paragraph></Session><Session><Title>3 Summary</Title><Paragraph>The first week of this course covered:</Paragraph><BulletedList><ListItem>installing software in course notebooks</ListItem><ListItem>starting data analysis with a question </ListItem><ListItem>the basics of coding</ListItem><ListItem>naming formats</ListItem><ListItem>recording units of measurement.</ListItem></BulletedList><Paragraph>Next week, you’ll be introduced to pandas. You'll use Jupyter notebooks to write and execute simple programs with Python and the pandas module. </Paragraph></Session></Unit>
<Unit><UnitID/><UnitTitle>Week 2: Having a go at it Part 2</UnitTitle><Session><Title>1 Enter the pandas</Title><Paragraph>As you probably realised, this way of coding is not practical for large-scale data analysis.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1034.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1034.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="7376536f" x_imagesrc="ou_futurelearn_learn_to_code_fig_1034.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b></Caption><Alternative>An image of four giant panda cubs climbing a bamboo fence</Alternative><Description>An image of four giant panda cubs climbing a bamboo fence</Description></Figure><Paragraph>Three lines of code were required for each country, to store the number of deaths, store the population, and calculate the death rate. With roughly 200 countries in the world, my trivial analysis would require 400 variables and typing almost 600 lines of code! Life’s too short to be spent that way.</Paragraph><Paragraph>Instead of using a separate variable for each datum, it is better to organise data as a table of rows and columns.</Paragraph><Table><TableHead>Table 1</TableHead><tbody><tr><th>Country</th><th>Deaths</th><th>Population</th></tr><tr><td>Angola</td><td>6900</td><td>21472</td></tr><tr><td>Brazil</td><td>4400</td><td>200362</td></tr><tr><td>Portugal</td><td>140</td><td>10608</td></tr></tbody></Table><Paragraph>In that way, instead of 400 variables, I only need one that stores the whole table. 
Instead of writing a mile-long expression that adds 200 variables to obtain the total deaths, I’ll write a short expression that calculates the total of the ‘Deaths’ column, no matter how many countries (rows) there are.</Paragraph><Paragraph>To organise data into tables and do calculations on such tables, you and I will use the pandas module, which is included in Anaconda and CoCalc. A <b>module</b> is a package of various pieces of code that can be used individually. The pandas module provides very extensive and advanced data analysis capabilities to complement Python. This course only scratches the surface of pandas.</Paragraph><Paragraph>I have to tell the computer that I’m going to use a module.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>from pandas import *</Paragraph></ComputerDisplay><Paragraph>That line of code is an <b>import</b> statement: from the pandas module, import everything. In plain English: load into memory all pieces of code that are in the pandas module, so that I can use any of them. In the above statement, the asterisk isn’t the multiplication operator but instead means ‘everything’.</Paragraph><Paragraph>Each weekly project in this course will start with this import statement, because all projects need the pandas module.</Paragraph><Paragraph>The words <b> <ComputerCode>from</ComputerCode> </b> and <b> <ComputerCode>import</ComputerCode> </b> are <b>reserved words</b>: they can’t be used as variable, function or module names. If you try, you will get a syntax error.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>from = 100</Paragraph><Paragraph>File "&lt;ipython-input-23-6958f0ebc10d&gt;", line 1</Paragraph><Paragraph>from = 100</Paragraph><Paragraph>^</Paragraph><Paragraph>SyntaxError: invalid syntax</Paragraph></ComputerDisplay><Paragraph>Jupyter notebooks show reserved words in boldface font to make them easier to spot. If you see a boldface name in an assignment (as you will for the code above), you must choose a different name.</Paragraph><Activity><Heading>Exercise 5 pandas</Heading><Multipart><Part><Question><Paragraph>Use Exercise 5 in the Exercise notebook 1 to help you answer these questions about errors you might come across.</Paragraph><Paragraph><b> 1. What kind of error will you get if you misspell 'pandas' as 'Pandas'? </b></Paragraph></Question><Interaction><SingleChoice><Wrong><Paragraph>A syntax error</Paragraph><Feedback><Paragraph>Remember that after the reserved word 'from' comes a module name.</Paragraph><Paragraph>Take a look at The art of naming.</Paragraph></Feedback></Wrong><Right><Paragraph>A name error, reported as an import error</Paragraph><Feedback><Paragraph>The computer is expecting a name but there is no module with the name 'Pandas' in the Anaconda distribution. Remember that names are case-sensitive.</Paragraph></Feedback></Right></SingleChoice></Interaction></Part><Part><Question><Paragraph><b> 2. What kind of error will you get if you misspell 'import' as 'impart'? </b></Paragraph></Question><Interaction><SingleChoice><Wrong><Paragraph>A name error</Paragraph><Feedback><Paragraph>A name error only occurs when a name is undefined, but import is not a name; it’s a reserved word.</Paragraph></Feedback></Wrong><Right><Paragraph>A syntax error</Paragraph><Feedback><Paragraph>The computer is expecting a reserved word and anything else will raise a syntax error.</Paragraph></Feedback></Right></SingleChoice></Interaction></Part><Part><Question><Paragraph><b> 3. 
What kind of error will you get if you forget the asterisk? </b></Paragraph></Question><Interaction><SingleChoice><Wrong><Paragraph>A name error</Paragraph><Feedback><Paragraph>An asterisk is not a name so the reported error can’t be this one.</Paragraph></Feedback></Wrong><Right><Paragraph>A syntax error</Paragraph><Feedback><Paragraph>The statement cannot end with the reserved word 'import'; the computer is expecting an indication of what to import.</Paragraph></Feedback></Right></SingleChoice></Interaction></Part></Multipart></Activity><Section><Title>1.1 This week’s data</Title><Paragraph>For the next part of the course you’ll need to download a file of data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1026.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1026.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="9ec6e161" x_imagesrc="ou_futurelearn_learn_to_code_fig_1026.jpg" x_imagewidth="512" x_imageheight="341"/><Caption> <b>Figure 2</b> </Caption><Alternative>An image with a young boy wearing a medical mask, in the foreground; a patient in a South African tuberculosis clinic</Alternative><Description>An image with a young boy wearing a medical mask, in the foreground; a patient in a South African tuberculosis clinic </Description></Figure><Paragraph>I have created a table with all the data necessary for the project and saved it in an Excel file. Excel is a popular application to create, edit and analyse tabular data. You won’t need Excel to complete this course, but many datasets are provided as Excel files.</Paragraph><Paragraph>Download the data file WHO POP TB some.xls. The file is encoded using UTF-8, a character encoding that allows for accented letters. 
Do <b>not</b> open or edit the file, as you may change how it is encoded, which will lead to errors later on. If you do want to look at its contents, make a copy of the file and look at the copy.</Paragraph><Paragraph>Put the data file in the same folder (or CoCalc project) where you saved your exercise notebook. Done? Great, let’s proceed to loading the data – you’ll learn how to do this in the next section.</Paragraph></Section><Section><Title>1.2 Loading the data</Title><Paragraph>Many applications can read files in Excel format, and pandas can too. Asking the computer to read the data looks like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>data = read_excel('WHO POP TB some.xls')</Paragraph><Paragraph>data</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>0</td><td>Angola</td><td>21472</td><td>6900</td></tr><tr><td>1</td><td>Brazil</td><td>200362</td><td>4400</td></tr><tr><td>2</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>3</td><td>Equatorial Guinea</td><td>757</td><td>67</td></tr><tr><td>4</td><td>Guinea-Bissau</td><td>1704</td><td>1200</td></tr><tr><td>5</td><td>India</td><td>1252140</td><td>240000</td></tr><tr><td>6</td><td>Mozambique</td><td>25834</td><td>18000</td></tr><tr><td>7</td><td>Portugal</td><td>10608</td><td>140</td></tr><tr><td>8</td><td>Russian Federation</td><td>142834</td><td>17000</td></tr><tr><td>9</td><td>Sao Tome and Principe</td><td>193</td><td>18</td></tr><tr><td>10</td><td>South Africa</td><td>52776</td><td>25000</td></tr><tr><td>11</td><td>Timor-Leste</td><td>1133</td><td>990</td></tr></tbody></Table><Paragraph>The variable name data is not descriptive, but as there is only one dataset in our analysis, there is no possible confusion with other data, and short names help to keep the lines of code short.</Paragraph><Paragraph>The function <ComputerCode>
<b>read_excel()</b>
</ComputerCode> takes a file name as an argument and returns the table contained in the file. In pandas, tables are called <b>dataframes</b>. To load the <ComputerCode>
<b>data</b>
</ComputerCode>, I simply call the function and store the returned dataframe in a variable.</Paragraph><Paragraph>A file name must be given as a <b>string</b>, a piece of text surrounded by quotes. The quote marks tell Python that this isn’t a variable, function or module name. Also, the quote marks state that this is a single name, even if it contains spaces, punctuation and other characters besides letters.</Paragraph><Paragraph>Misspelling the file name, or not having the file in the same folder as the notebook containing the code, results in a <b>file not found</b> error. In the example below there is an error in the file name.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>data = read_excel('WHO POP TB same.xls')</Paragraph><Paragraph>data</Paragraph><Paragraph>
<br/>
</Paragraph><Paragraph>---------------------------------------------</Paragraph><Paragraph>
FileNotFoundError Traceback (most recent call last)
</Paragraph><Paragraph>&lt;ipython-input-25-c017b2500afa&gt; in &lt;module&gt;()</Paragraph><Paragraph>----&gt; 1 data = read_excel('WHO POP TB same.xls')</Paragraph><Paragraph>2 data</Paragraph><Paragraph>
<br/>
</Paragraph><Paragraph>
/Users/mw4687/anaconda/lib/python3.4/site-packages/pandas/io/excel.py in read_excel(io, sheetname, **kwds)
</Paragraph><Paragraph>130 engine = kwds.pop('engine', None)</Paragraph><Paragraph>131</Paragraph><Paragraph>
--&gt; 132 return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)
</Paragraph><Paragraph>133</Paragraph><Paragraph>134</Paragraph><Paragraph>
<br/>
</Paragraph><Paragraph>
/Users/mw4687/anaconda/lib/python3.4/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
</Paragraph><Paragraph>
167 self.book = xlrd.open_workbook(file_contents=data)
</Paragraph><Paragraph>168 else:</Paragraph><Paragraph>
--&gt; 169 self.book = xlrd.open_workbook(io)
</Paragraph><Paragraph>
170 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
</Paragraph><Paragraph>171 self.book = io</Paragraph><Paragraph>
<br/>
</Paragraph><Paragraph>
/Users/mw4687/anaconda/lib/python3.4/site-packages/xlrd/__init__.py in open_workbook(filename, logfile,<br/> verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
</Paragraph><Paragraph>392 peek = file_contents[:peeksz]</Paragraph><Paragraph>393 else:</Paragraph><Paragraph>--&gt; 394 f = open(filename, "rb")</Paragraph><Paragraph>395 peek = f.read(peeksz)</Paragraph><Paragraph>396 f.close()</Paragraph><Paragraph>
<br/>
</Paragraph><Paragraph>
FileNotFoundError: [Errno 2] No such file or directory: 'WHO POP TB same.xls'
</Paragraph></ComputerDisplay><Paragraph>Jupyter notebooks show strings in red. If you see red characters until the end of the line, you have forgotten to type the second quote that marks the end of the string.</Paragraph><Paragraph>In the next section, find out how to select a column.</Paragraph></Section><Section id="selecting_a_column"><Title>1.3 Selecting a column</Title><Paragraph>Now that you have the data, let the analysis begin!</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1035.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1035.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="0053a248" x_imagesrc="ou_futurelearn_learn_to_code_fig_1035.jpg" x_imagewidth="512" x_imageheight="341"/><Caption> <b>Figure 3</b> </Caption><Alternative>An image of free-standing Roman columns against a blue sky</Alternative><Description>An image of free-standing Roman columns against a blue sky</Description></Figure><Paragraph>Let’s tackle the first part of the first question: ‘What are the total, smallest, largest and average number of deaths due to TB?’ Obtaining the total number will be done in two steps: first select the column with the TB deaths, then sum the values in that column.</Paragraph><Paragraph>Selecting a single column of a dataframe is done with an expression in the format: <b> <ComputerCode>dataFrame['column name']</ComputerCode> </b>.</Paragraph><Paragraph><br/></Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>data['TB deaths']</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>0 6900</Paragraph><Paragraph>1 4400</Paragraph><Paragraph>2 41000</Paragraph><Paragraph>3 67</Paragraph><Paragraph>4 1200</Paragraph><Paragraph>5 240000</Paragraph><Paragraph>6 18000</Paragraph><Paragraph>7 140</Paragraph><Paragraph>8 17000</Paragraph><Paragraph>9 18</Paragraph><Paragraph>10 25000</Paragraph><Paragraph>11 990</Paragraph><Paragraph>Name: TB deaths, dtype: int64</Paragraph></ComputerDisplay><Paragraph>Strings are verbatim text, which means that the column name must be written exactly as given in the dataframe, which you saw after loading the data. The slightest deviation leads to a <b>key error</b> , which can be seen as a kind of name error. You can try out in the Week 2 exercise notebook what happens when misspelling the column name. The error message is horribly long. In such cases, just skip to the last line of the error message to see the type of error.</Paragraph><Paragraph>Put this learning into practice in Exercise 6.</Paragraph><Activity><Heading>Exercise 6 selecting a column</Heading><Question><Paragraph>In your Exercise notebook 1, select the population column and store it in a variable, so that you can use it in later exercises.</Paragraph><Paragraph>Remember that to open the notebook you’ll need to launch Anaconda and then navigate to the notebook using Jupyter. Once it’s open, run all the code.</Paragraph></Question></Activity><Paragraph>Next, you’ll learn about making calculations on a column.</Paragraph></Section><Section><Title>1.4 Calculations on a column</Title><Paragraph>Having selected the column with the number of deaths per country, I’ll add them with the appropriately named sum() method to obtain the overall total deaths.</Paragraph><Paragraph>A <b>method</b> is a function that can only be called in a certain context. In this course, the context will mostly be a dataframe or a column. 
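</Paragraph><Paragraph>To make the idea of context concrete, the minimal sketch below contrasts calling a plain Python function with calling a method on a column. It is only an illustration: the three values and the name <ComputerCode>exampleColumn</ComputerCode> are made up, the column is built by hand with <ComputerCode>Series</ComputerCode> (the type pandas uses for a column), and pandas is imported as <ComputerCode>pd</ComputerCode>, which is an assumption rather than the way the course notebooks import it.</Paragraph><ComputerDisplay><Paragraph>
```python
import pandas as pd

# Hypothetical values, used only for this illustration
exampleColumn = pd.Series([18, 67, 140])

# Function call: there is no context, so the values are passed as arguments
largestByFunction = max(18, 67, 140)

# Method call: the column is the context and already holds the values
largestByMethod = exampleColumn.max()
total = exampleColumn.sum()

print(largestByFunction, largestByMethod, total)
```
</Paragraph></ComputerDisplay><Paragraph>Both calls find the same largest value; what differs is how the values reach the code.</Paragraph><Paragraph>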
A <b>method call</b> looks like a function call, but adds the context in which to call the method: <ComputerCode>
<b>context.methodName(argument1, argument2, ...)</b>
</ComputerCode> . In other words, a dataframe method can only be called on dataframes, a column method only on columns. Because methods are functions, a method call returns a value and is therefore an expression.</Paragraph><Paragraph>If all that sounded too abstract, here’s how to call the <ComputerCode>
<b>sum()</b>
</ComputerCode> method on the TB deaths column. Note that <ComputerCode>
<b>sum()</b>
</ComputerCode> doesn’t need any arguments because all the values are in the column.</Paragraph><Paragraph><br/></Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>tbColumn = data['TB deaths']</Paragraph><Paragraph>tbColumn.sum()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>354715</ComputerCode></Paragraph><Paragraph>The estimated total number of deaths due to TB in 2013 in the BRICS and Portuguese-speaking countries was over 350 thousand. An impressive number, for the wrong reasons.</Paragraph><Paragraph>Calculating the minimum and maximum number of deaths is done in a similar way.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>tbColumn.min()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>18</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>tbColumn.max()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>240000</ComputerCode></Paragraph><Paragraph>Like <ComputerCode>
<b>sum()</b>
</ComputerCode> , the column methods <ComputerCode>
<b>min()</b>
</ComputerCode> and <ComputerCode>
<b>max()</b>
</ComputerCode> don’t need arguments, whereas the Python functions <ComputerCode>
<b>min()</b>
</ComputerCode> and <ComputerCode>
<b>max()</b>
</ComputerCode> did need them, because there was no context (column) providing the values.</Paragraph><Paragraph>The average number is computed as before, dividing the total by the number of countries.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>tbColumn.sum() / 12</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>29559.583333333332</ComputerCode></Paragraph><Paragraph>This kind of average is called the <b>mean</b> and there’s a method for that.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>tbColumn.mean()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>29559.583333333332</ComputerCode></Paragraph><Paragraph>Another kind of average measure is the <b>median</b> , which is the number in the middle, i.e. half of the values are above the median and half below it.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>tbColumn.median()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>5650.0</ComputerCode></Paragraph><Paragraph>The mean is five times higher than the median. While half the countries had less than 5650 deaths in 2013, some countries had far more, which pushes the mean up.</Paragraph><Paragraph>The median is probably closer to the intuition you have of what ‘average’ should mean (pun intended). News reports don’t always make clear what average measure is being used, and using the mean may distort reality. For example, the mean household income in a country will be influenced by very poor and very rich households, whereas the median income doesn’t take into account how poor or rich the extremes are: it will always be half the households below and half above the median.</Paragraph><Paragraph>Put this learning into practice in Exercise 7.</Paragraph><Activity><Heading>Exercise 7 calculations on a column</Heading><Question><Paragraph>Practise the use of column methods by applying them to the population column you obtained in Exercise 6 in the Exercise notebook 1. Remember to run all code before doing the exercise.</Paragraph></Question></Activity></Section><Section><Title>1.5 Sorting on a column</Title><Paragraph>One of the research questions was: which countries have the smallest and largest number of deaths?</Paragraph><Paragraph>Being a small table, it is not too difficult to scan the TB deaths column and find those countries. However, such a process is prone to errors and impractical for large tables. It’s much better to sort the table by that column, and then look up the countries in the first and last rows.</Paragraph><Paragraph>As you’ve guessed by now, sorting a table is another single line of code.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>data.sort_values('TB deaths')</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>9</td><td>Sao Tome and Principe</td><td>193</td><td>18</td></tr><tr><td>3</td><td>Equatorial Guinea</td><td>757</td><td>67</td></tr><tr><td>7</td><td>Portugal</td><td>10608</td><td>140</td></tr><tr><td>11</td><td>Timor-Leste</td><td>1133</td><td>990</td></tr><tr><td>4</td><td>Guinea-Bissau</td><td>1704</td><td>1200</td></tr><tr><td>1</td><td>Brazil</td><td>200362</td><td>4400</td></tr><tr><td>0</td><td>Angola</td><td>21472</td><td>6900</td></tr><tr><td>8</td><td>Russian Federation</td><td>142834</td><td>17000</td></tr><tr><td>6</td><td>Mozambique</td><td>25834</td><td>18000</td></tr><tr><td>10</td><td>South Africa</td><td>52776</td><td>25000</td></tr><tr><td>2</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>5</td><td>India</td><td>1252140</td><td>240000</td></tr></tbody></Table><Paragraph>The dataframe method <ComputerCode>
<b>sort_values()</b>
</ComputerCode> takes as argument a column name and returns a new dataframe where the rows are in ascending order of the values in that column. Note that sorting doesn’t modify the original dataframe.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>data # rows still in original order</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>0</td><td>Angola</td><td>21472</td><td>6900</td></tr><tr><td>1</td><td>Brazil</td><td>200362</td><td>4400</td></tr><tr><td>2</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>3</td><td>Equatorial Guinea</td><td>757</td><td>67</td></tr><tr><td>4</td><td>Guinea-Bissau</td><td>1704</td><td>1200</td></tr><tr><td>5</td><td>India</td><td>1252140</td><td>240000</td></tr><tr><td>6</td><td>Mozambique</td><td>25834</td><td>18000</td></tr><tr><td>7</td><td>Portugal</td><td>10608</td><td>140</td></tr><tr><td>8</td><td>Russian Federation</td><td>142834</td><td>17000</td></tr><tr><td>9</td><td>Sao Tome and Principe</td><td>193</td><td>18</td></tr><tr><td>10</td><td>South Africa</td><td>52776</td><td>25000</td></tr><tr><td>11</td><td>Timor-Leste</td><td>1133</td><td>990</td></tr></tbody></Table><Paragraph>It’s also possible to sort on a column that has text instead of numbers; the rows will be sorted in alphabetical order.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>data.sort_values('Country')</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>0</td><td>Angola</td><td>21472</td><td>6900</td></tr><tr><td>1</td><td>Brazil</td><td>200362</td><td>4400</td></tr><tr><td>2</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>3</td><td>Equatorial Guinea</td><td>757</td><td>67</td></tr><tr><td>4</td><td>Guinea-Bissau</td><td>1704</td><td>1200</td></tr><tr><td>5</td><td>India</td><td>1252140</td><td>240000</td></tr><tr><td>6</td><td>Mozambique</td><td>25834</td><td>18000</td></tr><tr><td>7</td><td>Portugal</td><td>10608</td><td>140</td></tr><tr><td>8</td><td>Russian Federation</td><td>142834</td><td>17000</td></tr><tr><td>9</td><td>Sao Tome and Principe</td><td>193</td><td>18</td></tr><tr><td>10</td><td>South Africa</td><td>52776</td><td>25000</td></tr><tr><td>11</td><td>Timor-Leste</td><td>1133</td><td>990</td></tr></tbody></Table><Activity><Heading>Exercise 8 sorting on a column</Heading><Question><Paragraph>Use the Exercise notebook 1 to sort the table by population so that you can quickly see which are the least and the most populous countries. Remember to run all code before doing the exercise.</Paragraph></Question></Activity><Paragraph>In the next section you’ll learn about calculations over columns.</Paragraph></Section><Section id="calculations_over_columns"><Title>1.6 Calculations over columns</Title><Paragraph>The last remaining task is to calculate the death rate of each country.</Paragraph><Paragraph>You may recall that with the simple approach I’d have to write:</Paragraph><ComputerDisplay><Paragraph>
rateAngola = deathsInAngola * 100 / populationOfAngola
</Paragraph><Paragraph>
rateBrazil = deathsInBrazil * 100 / populationOfBrazil
</Paragraph></ComputerDisplay><Paragraph>and so on, and so on. If you’ve used spreadsheets, it’s the same process: create the formula for the first row and then copy it down for all the rows. This is laborious and error-prone, e.g. if rows are added later on. Given that data is organised by columns, wouldn’t it be nice to simply write the following?</Paragraph><Paragraph><ComputerCode>rateColumn = deathsColumn * 100 / populationColumn</ComputerCode></Paragraph><Paragraph>Say no more: your wish is pandas’s command.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>deathsColumn = data['TB deaths']</Paragraph><Paragraph>populationColumn = data['Population (1000s)']</Paragraph><Paragraph>rateColumn = deathsColumn * 100 / populationColumn</Paragraph><Paragraph>rateColumn</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>0 32.134873</Paragraph><Paragraph>1 2.196025</Paragraph><Paragraph>2 2.942576</Paragraph><Paragraph>3 8.850727</Paragraph><Paragraph>4 70.422535</Paragraph><Paragraph>5 19.167186</Paragraph><Paragraph>6 69.675621</Paragraph><Paragraph>7 1.319759</Paragraph><Paragraph>8 11.901928</Paragraph><Paragraph>9 9.326425</Paragraph><Paragraph>10 47.370017</Paragraph><Paragraph>11 87.378641</Paragraph><Paragraph>dtype: float64</Paragraph></ComputerDisplay><Paragraph>Tadaaa! With pandas, the arithmetic operators become much smarter. When adding, subtracting, multiplying or dividing columns, the computer understands that the operation is to be done row by row and creates a new column.</Paragraph><Paragraph>All well and nice, but how to put that new column into the dataframe, in order to have everything in a single table? In an assignment <ComputerCode>
<b>variable = expression</b>
</ComputerCode> , if the variable hasn’t been mentioned before, the computer creates the variable and stores in it the expression’s value. Likewise, if I assign to a column that doesn’t exist in the dataframe, the computer will create it.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>data['TB deaths (per 100,000)'] = rateColumn</Paragraph><Paragraph>data</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>Population (1000s)</th><th>TB deaths</th><th>TB deaths (per 100,000)</th></tr><tr><td>0</td><td>Angola</td><td>21472</td><td>6900</td><td>32.134873</td></tr><tr><td>1</td><td>Brazil</td><td>200362</td><td>4400</td><td>2.196025</td></tr><tr><td>2</td><td>China</td><td>1393337</td><td>41000</td><td>2.942576</td></tr><tr><td>3</td><td>Equatorial Guinea</td><td>757</td><td>67</td><td>8.850727</td></tr><tr><td>4</td><td>Guinea-Bissau</td><td>1704</td><td>1200</td><td>70.422535</td></tr><tr><td>5</td><td>India</td><td>1252140</td><td>240000</td><td>19.167186</td></tr><tr><td>6</td><td>Mozambique</td><td>25834</td><td>18000</td><td>69.675621</td></tr><tr><td>7</td><td>Portugal</td><td>10608</td><td>140</td><td>1.319759</td></tr><tr><td>8</td><td>Russian Federation</td><td>142834</td><td>17000</td><td>11.901928</td></tr><tr><td>9</td><td>Sao Tome and Principe</td><td>193</td><td>18</td><td>9.326425</td></tr><tr><td>10</td><td>South Africa</td><td>52776</td><td>25000</td><td>47.370017</td></tr><tr><td>11</td><td>Timor-Leste</td><td>1133</td><td>990</td><td>87.378641</td></tr></tbody></Table><Paragraph>That’s it! I’ve written all the code needed to answer the questions I had. Next I’ll write up the analysis into a succinct and stand-alone notebook that can be shared with friends, family and colleagues or the whole world. 
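</Paragraph><Paragraph>As a compact recap of this section, here is a minimal, self-contained sketch of the same steps. It rests on a few assumptions made purely for illustration: three rows of the table above are typed in by hand instead of being loaded from the Excel file, pandas is imported as <ComputerCode>pd</ComputerCode>, and the names <ComputerCode>tbTable</ComputerCode> and <ComputerCode>sortedTable</ComputerCode> are my own. The last step sorts by the new column with <ComputerCode>sort_values()</ComputerCode>, as in the previous section.</Paragraph><ComputerDisplay><Paragraph>
```python
import pandas as pd

# Three rows copied from the table above, typed in by hand so that the
# sketch runs without the Excel file (an assumption for illustration)
tbTable = pd.DataFrame({
    'Country': ['Angola', 'Brazil', 'Portugal'],
    'Population (1000s)': [21472, 200362, 10608],
    'TB deaths': [6900, 4400, 140],
})

# Arithmetic on whole columns is carried out row by row
rateColumn = tbTable['TB deaths'] * 100 / tbTable['Population (1000s)']

# Assigning to a column that does not exist yet creates it
tbTable['TB deaths (per 100,000)'] = rateColumn

# sort_values() returns a new dataframe; tbTable keeps its row order
sortedTable = tbTable.sort_values('TB deaths (per 100,000)')
print(sortedTable)
```
</Paragraph></ComputerDisplay><Paragraph>The sorted copy lists Portugal first and Angola last, in line with the rates in the full table. The write-up mentioned above works with the full table rather than this three-row extract.</Paragraph><Paragraph>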
You’ll find that in the next section.</Paragraph></Section></Session><Session><Title>2 Writing up the analysis</Title><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1070.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1070.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="16ac6296" x_imagesrc="ou_futurelearn_learn_to_code_fig_1070.jpg" x_imagewidth="512" x_imageheight="341"/><Caption> <b>Figure 4</b>  A map identifying Portuguese speaking countries</Caption><Alternative>A map of the world identifying Portuguese speaking countries.</Alternative><Description>A map of the world identifying Portuguese speaking countries.</Description></Figure><Paragraph>Once you’ve done your analysis, you may want to record it or share it with others. The best way is to write up what you’ve discovered.</Paragraph><Paragraph>There is no right or wrong way to write up data analysis but the important thing is to present the answers to the questions you had. To keep things simple, I suggest the following structure:</Paragraph><NumberedList><ListItem>A descriptive title</ListItem><ListItem>An introduction setting the context and stating what you want to find out with the data.</ListItem><ListItem>A section detailing the source(s) of the data, with the code to load it into the notebook.</ListItem><ListItem>One or more sections showing the processes (calculating statistics, sorting the data, etc.) 
necessary to address the questions.</ListItem><ListItem>A conclusion summarising your findings, with qualitative analysis of the quantitative results and critical reflection on any shortcomings in the data or analysis process.</ListItem></NumberedList><Paragraph>You don’t need to explain your code, but it’s helpful to write the text in such a way that even readers who know nothing about Python or pandas can follow your analysis.</Paragraph><Paragraph>You can see how I’ve written up the analysis in this week’s project notebook, project_1: Deaths by tuberculosis.</Paragraph><Paragraph>In the next section, amend this project to produce your own version.</Paragraph><Section><Title>2.1 Practice project</Title><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1036.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1036.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="2760a5ab" x_imagesrc="ou_futurelearn_learn_to_code_fig_1036.jpg" x_imagewidth="512" x_imageheight="341"/><Caption> <b>Figure 5</b> </Caption><Alternative>An image of many pins marking various countries on a globe.</Alternative><Description>An image of many pins marking various countries on a globe.</Description></Figure><Paragraph>Here’s a quick project for you, which is about looking at TB deaths in <i>all</i> countries.</Paragraph><Activity><Heading>Activity 1 The project</Heading><Question><NumberedList><ListItem>Locate the data file WHO POP TB all.xls. Do <b>not</b> open or edit this file, to avoid changing its encoding. 
If you want to see the contents of the file, make a copy and look at the copy.</ListItem><ListItem>Open the project notebook.</ListItem><ListItem>If you’re using CoCalc do the following two steps: <NumberedSubsidiaryList class="lower-alpha" start="1"> <SubListItem>Click on the File menu and select ‘Download as’ and then ‘IPython notebook (.ipynb)’.</SubListItem> <SubListItem>On your computer, rename the downloaded file so that it includes your name, e.g. ‘TB deaths all world – Michel Wermelinger.ipynb’. Then upload the renamed notebook to CoCalc and open it.</SubListItem> </NumberedSubsidiaryList></ListItem><ListItem>If you’re using Anaconda do the following two steps: <NumberedSubsidiaryList class="lower-alpha" start="1"> <SubListItem>Click on the File menu and select ‘Make a copy’.</SubListItem> <SubListItem>Click on the title of the new notebook (‘project 1-Copy1’) to rename it. Make sure to include your name in the file name, e.g. ‘TB deaths all world – Michel Wermelinger’.</SubListItem> </NumberedSubsidiaryList></ListItem><ListItem>In the new notebook, add your name to mine and update the date.</ListItem><ListItem>Edit the first code cell: change the file name to ‘WHO POP TB all.xls’, in order to load the data for all countries in the world.</ListItem><ListItem>Run all cells in the notebook. 
This might take a little while.</ListItem><ListItem>Add one line of code at the end to sort the table by the death rate, so that it’s easy to see the least and most affected countries.</ListItem><ListItem>Go through the notebook and change any text (in particular the conclusions) to reflect the new results.</ListItem><ListItem>Save and then close and halt the notebook.</ListItem></NumberedList><Paragraph>If you happen to know how to use a spreadsheet application, then you can do a personal project: open the Excel file, remove all countries you are not interested in, and then do the analysis only for the remaining subset.</Paragraph><Paragraph>You might like to share your experience of working on this project with friends, family or colleagues.</Paragraph></Question></Activity></Section><Section><Title>2.2 Sharing your project notebook</Title><Figure><Image width="100%" src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1037.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1037.jpg" x_folderhash="cbfeded3" x_contenthash="a827cb89" x_imagesrc="ou_futurelearn_learn_to_code_fig_1037.jpg" x_imagewidth="512" x_imageheight="341"/><Caption> <b>Figure 6</b> </Caption><Alternative>An image of a young man explaining a chart to a small group sitting around a table.</Alternative><Description>An image of a young man explaining a chart to a small group sitting around a table.</Description></Figure><Paragraph>Sharing work is a great way to solve problems and learn from others.</Paragraph><Paragraph>You are encouraged to share the analysis notebook that you created in the previous section. There are a few different ways you can do this. 
I will only mention two, sharing and publishing, depending on whether you want people to be able to change your notebook or only read it.</Paragraph><Paragraph>If you don’t mind people editing and extending your notebook, like you have done with mine, then you’ll need to give them the notebook file (e.g. ‘TB deaths all world – Michel Wermelinger.ipynb’) and all necessary data files (just the ‘WHO POP TB all.xls’ in this case). There are many ways you can share files with other people. One of the simplest is to create a zip archive, upload it to a cloud service like Dropbox or Google Drive, and publicise the download link. You could also share the link on your social media or via email.</Paragraph><Paragraph>If the intended recipients don’t have the necessary software (Python, pandas and Jupyter) or you don’t want anybody to change your notebook, you can still publish the analysis in read-only mode, i.e. people can read the text and code, see the resulting tables and numbers, but can’t modify anything.</Paragraph><Paragraph>To do this, open your project notebook, run all the cells, double-check that there are no error messages and that all values and tables are shown as you want them to be, and save the notebook (without closing it).</Paragraph><Paragraph>If you use Anaconda, export the notebook by clicking ‘Download as’ in the ‘File’ menu and selecting the option you prefer. I prefer HTML because it looks much nicer. You can then share the single PDF or HTML file as before, by email, via Dropbox or Google Drive, on your blog and via a link.</Paragraph><Paragraph>If you use CoCalc, just click on the ‘Publish’ button on the right side above your notebook, and you will get after a little while the link that you can share with others. Anyone can then read your notebook, even if they don’t have a CoCalc account. 
For example, look at my <a href="https://cloud.sagemath.com/projects/ff47a32e-e177-4d13-ad9a-625c859cc20b/files/Week_1_project.html">Project 1</a> (it’s best to right-click and open this link in a new tab).</Paragraph><Paragraph>Now choose the sharing or publishing method, and get sharing!</Paragraph></Section></Session><Session><Title>3 This week’s quiz</Title><Paragraph>Check what you’ve learned this week by taking the end-of-week quiz.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78778">Week 2 practice quiz</a></Paragraph><Paragraph>Open the quiz in a new window or tab then come back here when you’ve finished.</Paragraph></Session><Session><Title>4 Summary</Title><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1076_3d.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1076_3d.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="1a37f943" x_imagesrc="ou_futurelearn_learn_to_code_fig_1076_3d.jpg" x_imagewidth="512" x_imageheight="341"/><Caption> <b>Figure 7</b> </Caption><Alternative>An image of python code on a computer screen</Alternative><Description>An image of python code on a computer screen</Description></Figure><Paragraph>This week you used Jupyter notebooks to write and execute simple programs with Python and the pandas module. You've learned how to:</Paragraph><BulletedList><ListItem>load a table from an Excel file</ListItem><ListItem>select a column, and compute some simple statistics (like the total, minimum and median) about it. </ListItem><ListItem>create a new column with values calculated from other columns</ListItem><ListItem>sort a table by one of its columns.</ListItem></BulletedList><Paragraph>Next week you will learn further ways to manipulate dataframes, in particular to clean data. 
You will also produce your first data chart, showing variations of values over time.</Paragraph><InternalSection><Heading>Further reading</Heading><Reference><a href="http://apps.who.int/gho/data/node.main.POP107?lang=en">WHO population – data by country (2013)</a> </Reference><Reference><a href="http://apps.who.int/gho/data/node.country"> WHO mortality and prevalence – data by country (2007 – present) </a> </Reference></InternalSection><Section><Title>4.1 Week 1 and 2 glossary</Title><Paragraph>Here are alphabetical lists, for quick look up, of what this week introduced.</Paragraph><InternalSection><Heading>Programming and data analysis concepts</Heading><Paragraph>An <b>assignment</b> is a statement of the form <ComputerCode>
<b>variable = expression</b>
</ComputerCode> . It evaluates the expression and stores its value in the variable. The variable is created if it doesn’t exist. Each assignment is written on its own line.</Paragraph><Paragraph><b>CamelCase</b> is a naming style in which names made of various words have each word capitalized, except possibly the first.</Paragraph><Paragraph>A <b>comment</b> is a note about the code. It starts with the hash sign (#) and goes until the end of the line.</Paragraph><Paragraph>A <b>dataframe</b> is the pandas representation of a table.</Paragraph><Paragraph>An <b>expression</b> is a fragment of code that can be <b>evaluated</b> , i.e. that has a value, like a variable name.</Paragraph><Paragraph>A <b>file not found</b> error occurs if the computer can’t find the given file, e.g. because the name is misspelled or because it’s in another folder.</Paragraph><Paragraph>A <b>function</b> takes zero or more <b>arguments</b> (values) and <b>returns</b> (produces) a value.</Paragraph><Paragraph>A <b>function call</b> is an expression of the form <ComputerCode>
<b>functionName(argument1, argument2, ...)</b>
</ComputerCode>.</Paragraph><Paragraph>An <b>import statement</b> of the form <ComputerCode>
<b>from module import *</b>
</ComputerCode> loads all the code from the given module.</Paragraph><Paragraph>The <b>maximum</b> and <b>minimum</b> of a set of values are the largest and smallest values, respectively.</Paragraph><Paragraph>The <b>mean</b> of a set of numbers is the sum of those numbers divided by how many there are.</Paragraph><Paragraph>The <b>median</b> of a set of numbers is the number in the middle, i.e. half of the numbers are below the median and half are above.</Paragraph><Paragraph>A <b>method</b> is a function that can only be called in a certain context, like a dataframe or a column.</Paragraph><Paragraph>A <b>method call</b> is an expression of the form <ComputerCode>
<b>context.methodName(argument1, argument2, ...)</b>
</ComputerCode>.</Paragraph><Paragraph>A <b>module</b> is a package of various pieces of code that can be used individually.</Paragraph><Paragraph>A <b>name</b> is a case-sensitive sequence of letters, digits and underscores. Names cannot start with a digit. Function, variable and module names usually start with a lowercase letter.</Paragraph><Paragraph>A <b>name error</b> occurs if the computer doesn’t recognize a name, e.g. if it was misspelled.</Paragraph><Paragraph>An <b>operator</b> is a symbol that represents some operation on one or two expressions, e.g. the four basic arithmetic operators.</Paragraph><Paragraph>The <b>range</b> of a set of values is the difference between the maximum and the minimum.</Paragraph><Paragraph>A <b>reserved</b> word cannot be used as a name. Jupyter shows reserved words in green boldface.</Paragraph><Paragraph>A <b>statement</b> is a command for the computer to do something, e.g. to assign a value or to import some code.</Paragraph><Paragraph>A <b>string</b> is a verbatim piece of text, surrounded by quotes. Jupyter shows strings in red.</Paragraph><Paragraph>A <b>syntax error</b> occurs if the computer can’t understand the code because it is not in the expected form, e.g. if a reserved word is used instead of a name or some punctuation is missing.</Paragraph><Paragraph>A <b>variable</b> is a named storage for values.</Paragraph></InternalSection><InternalSection><Heading>Reserved words</Heading><BulletedList><ListItem><ComputerCode>
<b>from</b>
</ComputerCode></ListItem><ListItem><ComputerCode>
<b>import</b>
</ComputerCode></ListItem></BulletedList></InternalSection><InternalSection><Heading>Functions and methods</Heading><Paragraph><ComputerCode>
<b>max(value1, value2, …)</b>
</ComputerCode> returns the maximum of the given values.</Paragraph><Paragraph><ComputerCode>
<b>column.max()</b>
</ComputerCode> returns the maximum value in the column.</Paragraph><Paragraph><ComputerCode>
<b>min(value1, value2, …)</b>
</ComputerCode> returns the minimum of the given values.</Paragraph><Paragraph><ComputerCode>
<b>column.min()</b>
</ComputerCode> returns the minimum value in the column.</Paragraph><Paragraph><ComputerCode>
<b>column.mean()</b>
</ComputerCode> returns the mean of the values in the column.</Paragraph><Paragraph><ComputerCode>
<b>column.median()</b>
</ComputerCode> returns the median of the values in the column.</Paragraph><Paragraph><ComputerCode>
<b>column.sum()</b>
</ComputerCode> returns the total of the values in the column.</Paragraph><Paragraph><ComputerCode>
<b>dataFrame.sort_values(columnName)</b>
</ComputerCode> takes a string with a column’s name and returns a new dataframe, in which rows are sorted in ascending order according to the values in the given column.</Paragraph><Paragraph><ComputerCode>
<b>read_excel(fileName)</b>
</ComputerCode> takes a string with an Excel file name, reads the file, and returns a dataframe representing the table in the file.</Paragraph></InternalSection></Section></Session></Unit><Unit><UnitID/><UnitTitle>Week 3: Cleaning up our act Part 1</UnitTitle><Introduction><Title>Introduction</Title><Paragraph>Welcome to Week 3.</Paragraph><Paragraph><i>Please note: in the following video, where reference is made to a study ‘week’, this corresponds to Weeks 3 and 4 of this course.</i></Paragraph><MediaContent src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/new_lcdab_w3_intro.mp4" type="video" x_manifest="new_lcdab_w3_intro_1_server_manifest.xml" x_filefolderhash="d072e793" x_folderhash="d072e793" x_contenthash="51d91249"><Transcript><Speaker>RUTH ALEXANDER:</Speaker><Remark>Hello again. </Remark><Remark>In the previous week the data you looked at was quite tidy ready to be analysed. Unfortunately it's often the case that data is published in a form that you can't immediately start to work with. For example, numbers and dates may be in inconsistent formats, textual data may contain spurious characters and there may even be missing or nonsensical data, for example due to a typo. Such data is often referred to as 'dirty data' and it needs to be put through a process known unsurprisingly as data cleaning or data cleansing before it can be analysed. Data cleaning is often a substantial but unglamorous part of data analysis. I've been talking to someone who does this as part of his job. </Remark><Speaker>DAVID GOODY:</Speaker><Remark>My work at the moment is looking at a wide range of schools data to predict when they might be having issues with financial problems. This allows us to work with them early on and help resolve problems that otherwise might escalate. 
</Remark><Remark>We use a range of statistical techniques to do this from standard linear progression approaches through to more advanced data science techniques but with all of these if the data's quality is poor this will lead to spurious results and you might lead to the wrong conclusions. </Remark><Speaker>RUTH ALEXANDER:</Speaker><Remark>What's the first thing you do when you start working with data?</Remark><Speaker>DAVID GOODY:</Speaker><Remark>The first thing we do is see whether the data sets are complete or not. With a lot of the data sets we work with we may well find these won't cover all schools and in some cases it won't just show up as a missing value, it might show up as a zero or something that might look like you have got a valid result. So we need to analyse through these, find where that missing data is and then work out whether we exclude those schools from our analysis and predictive modelling when we're setting things up or replace them with averages or values from previous years.</Remark><Speaker>RUTH ALEXANDER:</Speaker><Remark>What other data quality checks do you do?</Remark><Speaker>DAVID GOODY:</Speaker><Remark>Once we've worked out whether the data is complete or not we then look at whether the data is sort of trustworthy and okay for us to use. The issues might be obvious if we have a pupil who's a hundred and twelve years old we've probably got an issue there, what's happened there is someone might have entered a date of birth as 1902 rather than 2002. In other cases things will be more subtle and we need to look at the distribution of results to work out where the data might be misleading and then again make these decisions of whether we try and exclude those results from our analysis or replace them with more sensible values. </Remark><Speaker>RUTH ALEXANDER:</Speaker><Remark>Are there any other things you do once you're happy the data is accurate? 
</Remark><Speaker>DAVID GOODY:</Speaker><Remark>In some cases the data's in the wrong format for us to work with. We've been doing quite a bit of geographical mapping work recently. The EduBase database we use that has information on schools records their location using a coordinate system known as 'northing and easting' whereas the computer programmes we've been using want them in latitude and longitude. So there we have to convert the data from one coordinate system to another and then check the information we've got at the end is still accurate. </Remark><Speaker>RUTH ALEXANDER:</Speaker><Remark>What processes do you have for managing data quality? </Remark><Speaker>DAVID GOODY:</Speaker><Remark>We've got quite a detailed quality assurance procedure within the department and with how we document this work. For, depending on the importance and complexity of the piece of work, we may find that it goes from simple sense checking and comparing against similar results through to someone parallel running the entire piece of work and by having these processes in place it allows us to have confidence in the data that we work with. </Remark><Speaker>RUTH ALEXANDER:</Speaker><Remark>This week you'll learn some simple approaches to cleaning data. Once the data's clean a picture is worth a thousand words, so you'll be producing your first chart with, as you guessed, one line of code. Enjoy the week. 
</Remark></Transcript><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/lcdab_w3_intro_wboards.png" src_uri="file:////esaki/LTS-common$/Dale%20Harry/Learn%20to%20code/week_videos/lcdab_w3_intro_wboards.png" x_folderhash="d072e793" x_contenthash="b286900c" x_imagesrc="lcdab_w3_intro_wboards.png" x_imagewidth="512" x_imageheight="289"/></Figure></MediaContent><Paragraph>In Weeks 1 and 2 you worked on a dataset that combined two different World Health Organization datasets: population and the number of deaths due to tuberculosis.</Paragraph><Paragraph>They could be combined because they share a common attribute: the countries. This week you will learn the techniques behind the creation of such a combined dataset.</Paragraph><!--
<Paragraph>This OpenLearn course is an adapted extract from the Open University course <a href="http://www3.open.ac.uk/study/undergraduate/course/l120.htm">module code <i>module title</i></a></Paragraph>
--></Introduction><Session><Title>1 Weather data</Title><Paragraph>This week you will be investigating historic weather data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1039.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1039.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="51178b01" x_imagesrc="ou_futurelearn_learn_to_code_fig_1039.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b></Caption><Alternative>An image of filter-like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Alternative><Description>An image of filter-like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Description></Figure><Paragraph>Of course, such data is hugely important for research into the large-scale, long-term shift in our planet’s weather patterns and average temperatures – climate change. However, such data is also incredibly useful for more mundane planning purposes. To demonstrate the learning this week, I, Rob Griffiths, will be using historic weather data to try and plan a summer holiday in the UK. 
You’ll use the data too and get a chance to work on your own project at the end of the week.</Paragraph><Paragraph>The dataset we’ll use to do this will come from the <a href="http://www.wunderground.com/">Weather Underground</a>, which creates weather forecasts from data sent to them by a worldwide network of over 100,000 weather enthusiasts who have personal weather stations on their house or in their garden.</Paragraph><Paragraph>In addition to creating weather forecasts from that data, the Weather Underground also keeps that data as historic weather records allowing members of the public to download weather datasets for a particular time period and location. These datasets are downloaded as CSV files, explained in the next step.</Paragraph><Paragraph>Datasets are rarely ‘clean’ and fit for purpose, so it will be necessary to clean up the data and ‘mould it’ for your purposes. You will then learn how to visualise data by creating graphs using the <ComputerCode><b>plot()</b></ComputerCode> function.</Paragraph><Section id="what_is_a_csv_file"><Title>1.1 What is a CSV file?</Title><Paragraph>A CSV file is a plain text file that is used to hold tabular data. 
The acronym CSV is short for ‘comma-separated values’.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1036.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1036.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="2760a5ab" x_imagesrc="ou_futurelearn_learn_to_code_fig_1036.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 2</b></Caption><Alternative>An image of many pins marking various countries on a globe</Alternative><Description>An image of many pins marking various countries on a globe</Description></Figure><Paragraph>Take a look at the first few lines of a CSV file that holds the same data as the Excel file ‘WHO POP TB all.xls’ that you encountered in Week 2:</Paragraph><ComputerDisplay><Paragraph>Country,Population (1000s),TB deaths</Paragraph><Paragraph>Afghanistan,30552,13000.0</Paragraph><Paragraph>Albania,3173,20.0</Paragraph><Paragraph>Algeria,39208,5100.0</Paragraph><Paragraph>Andorra,79,0.26 </Paragraph><Paragraph>Angola,21472,6900.0</Paragraph><Paragraph>Antigua and Barbuda,90,1.2</Paragraph><Paragraph>Argentina,41446,570.0 </Paragraph><Paragraph>Armenia,2977,170.0</Paragraph></ComputerDisplay><Paragraph>Notice that the first line is a row of column names. The subsequent lines are rows of actual data that correspond to the column names. The row of column names is optional, but it is helpful in understanding the data in the following lines and making sure the right values fall in the right place. In this example, the first value on every row must be a string representing a country’s name, the second value is an integer representing that country’s population (in 1000s) and the third value is a decimal representing the number of deaths due to TB. 
Note that the third value is a decimal (like 0.26 deaths for Andorra) and not an integer because it is an estimate obtained from statistical processing of collected data.</Paragraph><Paragraph>Each value or column name here is separated by a comma, but any other character, such as a space or a tab, can be used as the separator in a CSV file; hence CSV can also stand for ‘character-separated values’.</Paragraph><Paragraph>Because CSV files are plain text, the data is easy to import into any spreadsheet program, database or pandas dataframe.</Paragraph><Paragraph>Before pandas can be used to work with a CSV file, the following import statement must be executed:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>from pandas import *</ComputerCode></Paragraph><Paragraph>As you learned in Week 2, the import statement loads into memory all the code in the pandas module.</Paragraph><Paragraph>To read a CSV file into a dataframe, the pandas function <ComputerCode><b>read_csv()</b></ComputerCode> needs to be called.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df = read_csv('WHO POP TB all.csv')</ComputerCode></Paragraph><Paragraph>The above code creates a dataframe from the data in the file <ComputerCode><b>WHO POP TB all.csv</b></ComputerCode> and assigns it to the variable <ComputerCode><b>df</b></ComputerCode>. 
This is the simplest usage of the <ComputerCode><b>read_csv()</b></ComputerCode> function, just using a single argument, a string that holds the name of the CSV file.</Paragraph><Paragraph>However, the function can take many additional arguments (some of which you’ll use later), which determine how the file is to be read.</Paragraph><Paragraph>In the next step, find out about dataframes and the ‘dot’ notation.</Paragraph></Section><Section><Title>1.2 Dataframes and the ‘dot’ notation</Title><Paragraph>In Week 2 you learned that dataframes have methods, which are like functions that can only be called in the context of a dataframe.</Paragraph><Paragraph>For example, because the TB deaths dataframe <ComputerCode><b>df</b></ComputerCode> has a column named ‘Country’, the <ComputerCode><b>sort_values()</b></ComputerCode> method can be called like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.sort_values('Country')</ComputerCode></Paragraph><Paragraph>Because there is a variable name, followed by a dot, followed by the method, this is called <b>dot notation</b>. Methods are one kind of property of a dataframe. Dataframes also have another kind of property: attributes.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1040.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1040.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="33719294" x_imagesrc="ou_futurelearn_learn_to_code_fig_1040.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b></Caption><Alternative>A multi-coloured image of many different sized circles. </Alternative><Description>A multi-coloured image of many different sized circles. 
They could be described as bubbles.</Description></Figure><InternalSection><Heading>Attributes</Heading><Paragraph>A dataframe attribute is like a variable that can only be accessed in the context of a dataframe. One such attribute is <ComputerCode><b>columns</b></ComputerCode>, which holds a dataframe’s column names.</Paragraph><Paragraph>So the expression <ComputerCode><b>df.columns</b></ComputerCode> evaluates to the value of the <ComputerCode><b>columns</b></ComputerCode> attribute inside the dataframe <ComputerCode><b>df</b></ComputerCode>. The following code will get and display the names of the columns in the dataframe <ComputerCode><b>df</b></ComputerCode>:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.columns</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>Index(['Country', 'Population (1000s)', 'TB deaths'],</Paragraph><Paragraph>dtype='object')</Paragraph></ComputerDisplay></InternalSection></Section><Section><Title>1.3 Getting and displaying dataframe rows</Title><Paragraph>Dataframes can have hundreds or thousands of rows, so it is not practical to display a whole dataframe.</Paragraph><Paragraph>However, there are a number of dataframe attributes and methods that allow you to get and display either a single row or a number of rows at a time. Three of the most useful are the <ComputerCode><b>iloc</b></ComputerCode> attribute and the <ComputerCode><b>head()</b></ComputerCode> and <ComputerCode><b>tail()</b></ComputerCode> methods. 
Note that to distinguish methods and attributes, we write <ComputerCode>()</ComputerCode> after a method’s name.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1041.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1041.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="17592d1b" x_imagesrc="ou_futurelearn_learn_to_code_fig_1041.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 4</b></Caption><Alternative>An image of a data algorithm</Alternative><Description>An image of a data algorithm</Description></Figure><InternalSection><Heading>The iloc attribute</Heading><Paragraph>A dataframe has a default integer index for its rows, which starts at 0 (zero). You can get and display any single row in a dataframe by using the <ComputerCode><b>iloc</b></ComputerCode> attribute, followed by the index of the row you want to access in square brackets. 
For example, the following code will get and display the first row of data in the dataframe <ComputerCode><b>df</b></ComputerCode>, which is at index 0:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.iloc[0]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>Country Afghanistan</Paragraph><Paragraph>Population (1000s) 30552</Paragraph><Paragraph>TB deaths 13000.0</Paragraph><Paragraph>Name: 0, dtype: object</Paragraph></ComputerDisplay><Paragraph>Similarly, the following code will get and display the third row of data in the dataframe <ComputerCode><b>df</b></ComputerCode>, which is at index 2:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.iloc[2]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>Country Algeria</Paragraph><Paragraph>Population (1000s) 39208</Paragraph><Paragraph>TB deaths 5100.0</Paragraph><Paragraph>Name: 2, dtype: object</Paragraph></ComputerDisplay></InternalSection><InternalSection><Heading>The head() method</Heading><Paragraph>The first few rows of a dataframe can be printed out with the <ComputerCode><b>head()</b></ComputerCode> method.</Paragraph><Paragraph>You can tell <ComputerCode><b>head()</b></ComputerCode> is a method, rather than an attribute such as <ComputerCode><b>columns</b></ComputerCode>, because of the parentheses (round brackets) after the property name.</Paragraph><Paragraph>If you don’t give any argument, i.e. don’t put any number within those parentheses, the default behaviour is to return the first five rows of the dataframe. 
If you give an argument, it will print that number of rows (starting from the row indexed by 0).</Paragraph><Paragraph>For example, executing the following code will get and display the first five rows in the dataframe <ComputerCode><b>df</b></ComputerCode>.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>0</td><td>Afghanistan</td><td>30552</td><td>13000.00</td></tr><tr><td>1</td><td>Albania</td><td>3173</td><td>20.00</td></tr><tr><td>2</td><td>Algeria</td><td>39208</td><td>5100.00</td></tr><tr><td>3</td><td>Andorra</td><td>79</td><td>0.26</td></tr><tr><td>4</td><td>Angola</td><td>21472</td><td>6900.00</td></tr></tbody></Table><Paragraph>And, executing the following code will get and display the first seven rows in the dataframe <ComputerCode><b>df.</b></ComputerCode></Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.head(7)</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>0</td><td>Afghanistan</td><td>30552</td><td>13000.00</td></tr><tr><td>1</td><td>Albania</td><td>3173</td><td>20.00</td></tr><tr><td>2</td><td>Algeria</td><td>39208</td><td>5100.00</td></tr><tr><td>3</td><td>Andorra</td><td>79</td><td>0.26</td></tr><tr><td>4</td><td>Angola</td><td>21472</td><td>6900.00</td></tr><tr><td>5</td><td>Antigua and Barbuda</td><td>90</td><td>1.20</td></tr><tr><td>6</td><td>Argentina</td><td>41446</td><td>570.00</td></tr></tbody></Table></InternalSection><InternalSection><Heading>The tail() method</Heading><Paragraph>The <ComputerCode><b>tail()</b></ComputerCode> method is similar to the 
<ComputerCode><b>head()</b></ComputerCode> method.</Paragraph><Paragraph>If no argument is given, the last five rows of the dataframe are returned, otherwise the number of rows returned is dependent on the argument, just like for the <ComputerCode><b>head()</b></ComputerCode> method.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.tail()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>189</td><td>Venezuela (Bolivarian Republic of)</td><td>30405</td><td>480</td></tr><tr><td>190</td><td>Viet Nam</td><td>91680</td><td>17000</td></tr><tr><td>191</td><td>Yemen</td><td>24407</td><td>990</td></tr><tr><td>192</td><td>Zambia</td><td>14539</td><td>3600</td></tr><tr><td>193</td><td>Zimbabwe</td><td>14150</td><td>5700</td></tr></tbody></Table></InternalSection></Section><Section><Title>1.4 Getting and displaying dataframe columns</Title><Paragraph>You learned in Week 2 that you can get and display a single column of a dataframe by putting the name of the column (in quotes) within square brackets immediately after the dataframe’s name.</Paragraph><Paragraph>For example, like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df['TB deaths']</ComputerCode></Paragraph><Paragraph>You then get output like this:</Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>0    13000.00</Paragraph><Paragraph>1       20.00</Paragraph><Paragraph>2     5100.00</Paragraph><Paragraph>3        0.26</Paragraph><Paragraph>4     6900.00</Paragraph><Paragraph>5        1.20</Paragraph><Paragraph>6      570.00</Paragraph><Paragraph>...</Paragraph></ComputerDisplay><Paragraph>Notice that although there is an index, there is no column heading. 
This is because what is returned is not a new dataframe with a single column but an example of the <ComputerCode><b>Series</b></ComputerCode> data type.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1042.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1042.jpg" x_folderhash="cbfeded3" x_contenthash="9cff0938" x_imagesrc="ou_futurelearn_learn_to_code_fig_1042.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 5</b></Caption><Description>A perspective image of the aisle between many data storage towers. The floor and the storage units are lit up.</Description></Figure><InternalSection><Heading>Each column in a dataframe is an example of a series</Heading><Paragraph>The <ComputerCode><b>Series</b></ComputerCode> data type is a collection of values with, by default, an integer index that starts from zero. 
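</Paragraph><Paragraph>As a minimal sketch of this idea (it is not part of the course notebooks), the following code builds a small dataframe from the first rows of the CSV extract shown in Section 1.1 and inspects one of its columns. It assumes that pandas is installed. The standard-library <ComputerCode>StringIO</ComputerCode> class and the variable names <ComputerCode>csv_text</ComputerCode> and <ComputerCode>deaths</ComputerCode> are used here only for illustration, so that no data file is needed.</Paragraph><ComputerDisplay><Paragraph>
```python
from io import StringIO

from pandas import read_csv

# A few rows copied from the CSV extract shown in Section 1.1.
csv_text = """Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
"""

# StringIO wraps the text so that read_csv() can read it like a file.
df = read_csv(StringIO(csv_text))

# One column name in single square brackets gives a Series.
deaths = df['TB deaths']

print(type(deaths).__name__)  # Series
print(list(deaths.index))     # [0, 1, 2]
print(deaths.name)            # TB deaths
```
</Paragraph></ComputerDisplay><Paragraph>The series keeps the name of the column it came from, which is why ‘Name: TB deaths’ appears at the end of the output when a series is displayed.</Paragraph><Paragraph>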
In addition, the <ComputerCode><b>Series</b></ComputerCode> data type has many of the same methods and attributes as the <ComputerCode><b>DataFrame</b></ComputerCode> data type, so you can still execute code like:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df['TB deaths'].head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>0    13000.00</Paragraph><Paragraph>1       20.00</Paragraph><Paragraph>2     5100.00</Paragraph><Paragraph>3        0.26</Paragraph><Paragraph>4     6900.00</Paragraph><Paragraph>Name: TB deaths, dtype: float64</Paragraph></ComputerDisplay><Paragraph>and:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df['TB deaths'].iloc[2]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>5100.0</ComputerCode></Paragraph><Paragraph>However, pandas does provide a mechanism for you to get and display one or more selected columns as a new dataframe in its own right. To do this you need to use a <b>list</b>. A list in Python consists of items separated by commas and enclosed within square brackets, for example <ComputerCode><b>['Country']</b></ComputerCode> or <ComputerCode><b>['Country', 'Population (1000s)']</b></ComputerCode>. 
This list is then put within outer square brackets immediately after the dataframe’s name, like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[['Country']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th><b>Country</b></th></tr><tr><td>0</td><td>Afghanistan</td></tr><tr><td>1</td><td>Albania</td></tr><tr><td>2</td><td>Algeria</td></tr><tr><td>3</td><td>Andorra</td></tr><tr><td>4</td><td>Angola</td></tr></tbody></Table><Paragraph>Note that the column now has a heading. The expression <ComputerCode><b>df[['Country']]</b></ComputerCode> (with two pairs of square brackets) evaluates to a new dataframe (which happens to have a single column) rather than a series.</Paragraph><Paragraph>To get a new dataframe with multiple columns you just need to put more column names in the list, like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[['Country', 'Population (1000s)']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th><b>Country</b></th><th><b>Population (1000s)</b></th></tr><tr><td>0</td><td>Afghanistan</td><td>30552</td></tr><tr><td>1</td><td>Albania</td><td>3173</td></tr><tr><td>2</td><td>Algeria</td><td>39208</td></tr><tr><td>3</td><td>Andorra</td><td>79</td></tr><tr><td>4</td><td>Angola</td><td>21472</td></tr></tbody></Table><Paragraph>The code has returned a new dataframe with just the <ComputerCode><b>'Country'</b></ComputerCode> and <ComputerCode><b>'Population (1000s)'</b></ComputerCode> columns.</Paragraph><Activity>
                        <Heading>Exercise 1 Dataframes and CSV files</Heading>
                        <Question>
                            <Paragraph>Now that you’ve learned about CSV files and more about pandas, you are ready to complete Exercise 1 in the Exercise notebook 2.</Paragraph>
                            <Paragraph>Open the Exercise notebook 2 and the data file WHO POP TB all.csv, which holds the same data you used last week, and save both in the folder you created in Week 1.</Paragraph>
                            <Paragraph>If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter. Once it’s open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch the video in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83246&amp;targetdoc=Week+1%3A+Having+a+go+at+it+Part+1&amp;targetptr=1.4">Week 1 Exercise 1</a> again.</Paragraph>
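                            <Paragraph>Before opening the notebook, you may find it helpful to try the techniques from this section on a tiny table. The following is only a sketch and is not part of the exercise notebook. It assumes that pandas is installed, and it reads the first rows of the CSV extract from Section 1.1 out of a string (using the standard-library <ComputerCode>StringIO</ComputerCode> class) rather than from the data file. Variable names such as <ComputerCode>csv_text</ComputerCode> and <ComputerCode>third_row</ComputerCode> are chosen here purely for illustration.</Paragraph>
                            <ComputerDisplay><Paragraph>
```python
from io import StringIO

from pandas import read_csv

# First rows of the CSV extract shown in Section 1.1.
csv_text = """Country,Population (1000s),TB deaths
Afghanistan,30552,13000.0
Albania,3173,20.0
Algeria,39208,5100.0
Andorra,79,0.26
Angola,21472,6900.0
Antigua and Barbuda,90,1.2
"""

# StringIO wraps the text so that read_csv() can read it like a file.
df = read_csv(StringIO(csv_text))

# The columns attribute holds the column names.
column_names = list(df.columns)

# iloc with square brackets gets one row by its integer position.
third_row = df.iloc[2]

# head() and tail() return rows from the start and end of the dataframe.
first_two = df.head(2)
last_two = df.tail(2)

# A list of column names inside the brackets gives a new dataframe.
subset = df[['Country', 'Population (1000s)']]

print(column_names)
print(third_row['Country'])
print(first_two)
print(last_two)
print(subset)
```
</Paragraph></ComputerDisplay>
                            <Paragraph>If you try this, compare what is displayed for the dataframe <ComputerCode>subset</ComputerCode> with what you get from <ComputerCode>df['Country']</ComputerCode>, which is a series.</Paragraph>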
                        </Question>
                    </Activity></InternalSection></Section><Section><Title>1.5 Comparison operators</Title><Paragraph>In <a>Expressions</a>, you learned that Python has the arithmetic operators +, -, * and /, and that expressions such as 5 + 2 evaluate to a value (in this case the number 7).</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1043.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1043.jpg" x_folderhash="cbfeded3" x_contenthash="da320f19" x_imagesrc="ou_futurelearn_learn_to_code_fig_1043.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 6</b></Caption><Alternative>An illustration of two girls holding up signs. One sign says, 'YES', the other says, 'NO'.</Alternative><Description>An illustration of two girls holding up signs. One sign says, 'YES', the other says, 'NO'.</Description></Figure><Paragraph>Python also has comparison operators:</Paragraph><ComputerDisplay><Paragraph>==    equal to</Paragraph><Paragraph>!=    not equal to</Paragraph><Paragraph>&lt;     less than</Paragraph><Paragraph>&gt;     greater than</Paragraph><Paragraph>&lt;=    less than or equal to</Paragraph><Paragraph>&gt;=    greater than or equal to</Paragraph></ComputerDisplay><Paragraph>When applied to individual values such as numbers, expressions involving these operators evaluate to a Boolean value, that is <ComputerCode><b>True</b></ComputerCode> or <ComputerCode><b>False</b></ComputerCode>. 
Here are some examples:</Paragraph><ComputerDisplay><Paragraph>2 == 2       evaluates to True</Paragraph><Paragraph>2 + 2 == 5   evaluates to False</Paragraph><Paragraph>2 != 1 + 1   evaluates to False</Paragraph><Paragraph>45 &lt; 50      evaluates to True</Paragraph><Paragraph>20 &gt; 30      evaluates to False</Paragraph><Paragraph>100 &lt;= 100   evaluates to True</Paragraph><Paragraph>101 &gt;= 100   evaluates to True</Paragraph></ComputerDisplay><Paragraph>The comparison operators can be used with other types of data, not just numbers. Used with strings, they compare in alphabetical order, with all uppercase letters coming before lowercase ones. For example:</Paragraph><Paragraph><ComputerCode>'aardvark' &lt; 'zebra'</ComputerCode> evaluates to <ComputerCode><b>True</b></ComputerCode></Paragraph><Paragraph>In <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83246&amp;targetdoc=Week+2%3A+Having+a+go+at+it+Part+2&amp;targetptr=1.6">Calculating over columns</a> you saw that when applied to whole columns, the arithmetic operators did the calculations row by row. Similarly, an expression like <ComputerCode><b>df['Country'] &gt;= 'K'</b></ComputerCode> will compare the country names, row by row, against the string 'K' and record whether the result is <ComputerCode><b>True</b></ComputerCode> or <ComputerCode><b>False</b></ComputerCode> in a series like this:</Paragraph><ComputerDisplay><Paragraph>0    False</Paragraph><Paragraph>1    False</Paragraph><Paragraph>2    False</Paragraph><Paragraph>3    False</Paragraph><Paragraph>4    False</Paragraph><Paragraph>5    False</Paragraph><Paragraph>...</Paragraph><Paragraph>Name: Country, dtype: bool</Paragraph></ComputerDisplay><Paragraph>If such an expression is put within square brackets immediately after a dataframe’s name, a new dataframe is obtained with only those rows where the result is <ComputerCode><b>True</b></ComputerCode>. 
So:</Paragraph><Paragraph><ComputerCode>df[df['Country'] &gt;= 'K']</ComputerCode></Paragraph><Paragraph>returns a new dataframe with all the columns of <ComputerCode><b>df </b></ComputerCode>but with only the rows corresponding to countries starting with K or a letter later in the alphabet.</Paragraph><Paragraph>As another example, to see the data for countries with over 80 million inhabitants, the following code will return and display a new dataframe with all the columns of <ComputerCode><b>df</b></ComputerCode> but with only the rows where it is <ComputerCode><b>True</b></ComputerCode> that the value in the <ComputerCode><b>'Population (1000s)'</b></ComputerCode> column is greater than <ComputerCode><b>80000:</b></ComputerCode></Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[df['Population (1000s)'] &gt; 80000]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>13</td><td>Bangladesh</td><td>156595</td><td>80000</td></tr><tr><td>23</td><td>Brazil</td><td>200362</td><td>4400</td></tr><tr><td>36</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>53</td><td>Egypt</td><td>82056</td><td>550</td></tr><tr><td>58</td><td>Ethiopia</td><td>94101</td><td>30000</td></tr><tr><td>65</td><td>Germany</td><td>82727</td><td>300</td></tr><tr><td>77</td><td>India</td><td>1252140</td><td>240000</td></tr><tr><td>78</td><td>Indonesia</td><td>249866</td><td>64000</td></tr><tr><td>85</td><td>Japan</td><td>127144</td><td>2100</td></tr><tr><td>109</td><td>Mexico</td><td>122332</td><td>2200</td></tr><tr><td>124</td><td>Nigeria</td><td>173615</td><td>160000</td></tr><tr><td>128</td><td>Pakistan</td><td>182143</td><td>49000</td></tr><tr><td>134</td><td>Philippines</td><td>98394</td><td>27000</td></tr><tr><td>141</td><td>Russian 
Federation</td><td>142834</td><td>17000</td></tr><tr><td>185</td><td>United States of America</td><td>320051</td><td>490</td></tr><tr><td>190</td><td>Viet Nam</td><td>91680</td><td>17000</td></tr></tbody></Table><Activity><Heading>Exercise 2 Comparison operators</Heading><Question><Paragraph>You are ready to complete Exercise 2 in the Exercise notebook 2.</Paragraph><Paragraph>Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. </Paragraph></Question></Activity></Section><Section><Title>1.6 Bitwise operators</Title><Paragraph>You can build more complicated expressions by combining column comparisons with two bitwise operators.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1044.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1044.jpg" x_folderhash="cbfeded3" x_contenthash="72f00105" x_imagesrc="ou_futurelearn_learn_to_code_fig_1044.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 7</b></Caption><Alternative>An image of someone constructing a building from wooden blocks</Alternative><Description>An image of someone constructing a building from wooden blocks</Description></Figure><Paragraph>The <ComputerCode><b>&amp;</b></ComputerCode> operator means ‘and’, and the <ComputerCode><b>|</b></ComputerCode> operator (vertical bar, not the uppercase letter ‘i’) means ‘or’. 
So, for example, the expression:</Paragraph><ComputerDisplay><Paragraph>(df['Country'] &gt;= 'Latvia') &amp; (df['Country'] &lt;= 'Sweden')</Paragraph></ComputerDisplay><Paragraph>will evaluate to a series of Boolean values that are <ComputerCode><b>True</b></ComputerCode> only for the rows whose country name falls alphabetically between ‘<ComputerCode><b>Latvia</b></ComputerCode>’ and ‘<ComputerCode><b>Sweden</b></ComputerCode>’, inclusive. However, the following expression, which uses | (or) rather than &amp; (and):</Paragraph><Paragraph><ComputerCode>(df['Country'] &gt;= 'Latvia') | (df['Country'] &lt;= 'Sweden')</ComputerCode></Paragraph><Paragraph>will evaluate to <ComputerCode><b>True</b></ComputerCode> for all countries, because every country comes alphabetically after ‘<ComputerCode><b>Latvia</b></ComputerCode>’ (e.g. the ‘UK’) or before ‘<ComputerCode><b>Sweden</b></ComputerCode>’ (e.g. ‘<ComputerCode><b>Brazil</b></ComputerCode>’).</Paragraph><Paragraph>Note the round brackets around each comparison. 
Without them you will get an error.</Paragraph><Paragraph>The whole expression with multiple comparisons has to be put within <ComputerCode><b>df[…]</b></ComputerCode> to get a dataframe with only those rows that match the condition.</Paragraph><Paragraph>As a further example, using different columns, it is relatively easy to find the rows in <ComputerCode><b>df</b></ComputerCode> where '<ComputerCode><b>Population (1000s)</b></ComputerCode>' is greater than <ComputerCode><b>80000</b></ComputerCode> and where '<ComputerCode><b>TB deaths</b></ComputerCode>' are greater than <ComputerCode>10000</ComputerCode>.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[(df['Population (1000s)'] &gt; 80000) &amp; (df['TB deaths'] &gt; 10000)]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out []:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>13</td><td>Bangladesh</td><td>156595</td><td>80000</td></tr><tr><td>36</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>58</td><td>Ethiopia</td><td>94101</td><td>30000</td></tr><tr><td>77</td><td>India</td><td>1252140</td><td>240000</td></tr><tr><td>78</td><td>Indonesia</td><td>249866</td><td>64000</td></tr><tr><td>124</td><td>Nigeria</td><td>173615</td><td>160000</td></tr><tr><td>128</td><td>Pakistan</td><td>182143</td><td>49000</td></tr><tr><td>134</td><td>Philippines</td><td>98394</td><td>27000</td></tr><tr><td>141</td><td>Russian Federation</td><td>142834</td><td>17000</td></tr><tr><td>190</td><td>Viet Nam</td><td>91680</td><td>17000</td></tr></tbody></Table><Paragraph>These expressions can get long and complicated, making it easy to miss a crucial round or square bracket. In those cases it is best to break up the expression into small steps. 
The previous example could also be written as:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>population = df['Population (1000s)'] </Paragraph><Paragraph>deaths = df['TB deaths']</Paragraph><Paragraph>df[(population &gt; 80000) &amp; (deaths &gt; 10000)]</Paragraph></ComputerDisplay><Activity><Heading>Exercise 3 Bitwise operators</Heading><Question><Paragraph>Complete Exercise 3 in the Exercise notebook 2.</Paragraph></Question></Activity></Section></Session><Session><Title>1 Weather data</Title><Paragraph>This week you will investigate historic weather data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1039.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1039.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="51178b01" x_imagesrc="ou_futurelearn_learn_to_code_fig_1039.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b></Caption><Alternative>An image of filter-like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Alternative><Description>An image of filter-like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Description></Figure><Paragraph>Of course, such data is hugely important for research into the large-scale, long-term shift in our planet’s weather patterns and average temperatures – climate change. However, such data is also incredibly useful for more mundane planning purposes. To demonstrate the learning this week, I, Rob Griffiths, will be using historic weather data to try to plan a summer holiday in the UK. 
You’ll use the data too and get a chance to work on your own project at the end of the week.</Paragraph><Paragraph>The dataset we’ll use to do this will come from the <a href="http://www.wunderground.com/">Weather Underground</a>, which creates weather forecasts from data sent to them by a worldwide network of over 100,000 weather enthusiasts who have personal weather stations on their house or in their garden.</Paragraph><Paragraph>In addition to creating weather forecasts from that data, the Weather Underground also keeps that data as historic weather records allowing members of the public to download weather datasets for a particular time period and location. These datasets are downloaded as CSV files, explained in the next step.</Paragraph><Paragraph>Datasets are rarely ‘clean’ and fit for purpose, so it will be necessary to clean up the data and ‘mould it’ for your purposes. You will then learn how to visualise data by creating graphs using the <ComputerCode><b>plot()</b></ComputerCode> function.</Paragraph><Section id="awg_l3l_sxb"><Title>1.1 What is a CSV file?</Title><Paragraph>A CSV file is a plain text file that is used to hold tabular data. 
The acronym CSV is short for ‘comma-separated values’.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1036.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1036.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="2760a5ab" x_imagesrc="ou_futurelearn_learn_to_code_fig_1036.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 2</b></Caption><Alternative>An image of many pins marking various countries on a globe</Alternative><Description>An image of many pins marking various countries on a globe</Description></Figure><Paragraph>Take a look at the first few lines of a CSV file that holds the same data as the Excel file ‘WHO POP TB all.xls’ that you encountered in Week 2:</Paragraph><ComputerDisplay><Paragraph>Country,Population (1000s),TB deaths</Paragraph><Paragraph>Afghanistan,30552,13000.0</Paragraph><Paragraph>Albania,3173,20.0</Paragraph><Paragraph>Algeria,39208,5100.0</Paragraph><Paragraph>Andorra,79,0.26 </Paragraph><Paragraph>Angola,21472,6900.0</Paragraph><Paragraph>Antigua and Barbuda,90,1.2</Paragraph><Paragraph>Argentina,41446,570.0 </Paragraph><Paragraph>Armenia,2977,170.0</Paragraph></ComputerDisplay><Paragraph>Notice that the first line is a row of column names. The subsequent lines are rows of actual data that correspond to the column names. The row of column names is optional, but it is helpful in understanding the data in the following lines and making sure the right values fall in the right place. In this example, the first value on every row must be a string representing a country’s name, the second value is an integer representing that country’s population (in 1000s) and the third value is a decimal representing the number of deaths due to TB. 
The third value is a decimal (such as 0.26 deaths for Andorra) rather than an integer because it is an estimate obtained from statistical processing of collected data.</Paragraph><Paragraph>Here each value or column name is separated by a comma, but other characters, such as spaces or tabs, can also be used to separate values in a CSV file. Hence CSV can also stand for ‘character-separated values’.</Paragraph><Paragraph>Because CSV files are plain text, the data is easy to import into any spreadsheet program, database or pandas dataframe.</Paragraph><Paragraph>Before you can work with a CSV file in pandas, the following import statement must be executed:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>from pandas import *</ComputerCode></Paragraph><Paragraph>As you learned in Week 2, the import statement loads into memory all the code in the pandas module.</Paragraph><Paragraph>To read a CSV file into a dataframe, the pandas function <ComputerCode><b>read_csv()</b></ComputerCode> needs to be called.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df = read_csv('WHO POP TB all.csv')</ComputerCode></Paragraph><Paragraph>The above code creates a dataframe from the data in the file <ComputerCode><b>WHO POP TB all.csv</b></ComputerCode> and assigns it to the variable <ComputerCode><b>df</b></ComputerCode>. 
This is the simplest usage of the <ComputerCode><b>read_csv()</b></ComputerCode> function, just using a single argument, a string that holds the name of the CSV file.</Paragraph><Paragraph>However, the function can take many additional arguments (some of which you’ll use later), which determine how the file is to be read.</Paragraph><Paragraph>In the next step, find out about dataframes and the ‘dot’ notation.</Paragraph></Section><Section><Title>1.2 Dataframes and the ‘dot’ notation</Title><Paragraph>In Week 2 you learned that dataframes have methods, which are like functions that can only be called in the context of a dataframe.</Paragraph><Paragraph>For example, because the TB deaths dataframe <ComputerCode><b>df</b></ComputerCode> has a column named ‘Country’, the <ComputerCode><b>sort_values()</b></ComputerCode> method can be called like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.sort_values('Country')</ComputerCode></Paragraph><Paragraph>Because there is a variable name, followed by a dot, followed by the method, this is called <b>dot notation</b>. Methods are one kind of property of a dataframe. Dataframes also have another kind of property – attributes.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1040.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1040.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="33719294" x_imagesrc="ou_futurelearn_learn_to_code_fig_1040.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b></Caption><Alternative>A multi-coloured image of many different sized circles. </Alternative><Description>A multi-coloured image of many different sized circles. 
They could be described as bubbles.</Description></Figure><InternalSection><Heading>Attributes</Heading><Paragraph>A dataframe attribute is like a variable that can only be accessed in the context of a dataframe. One such attribute is <ComputerCode><b>columns</b></ComputerCode>, which holds a dataframe’s column names.</Paragraph><Paragraph>So the expression <ComputerCode><b>df.columns</b></ComputerCode> evaluates to the value of the <ComputerCode><b>columns</b></ComputerCode> attribute inside the dataframe <ComputerCode><b>df</b></ComputerCode>. The following code will get and display the names of the columns in the dataframe <ComputerCode><b>df</b></ComputerCode>:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.columns</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>Index(['Country', 'Population (1000s)', 'TB deaths'],</Paragraph><Paragraph>dtype='object')</Paragraph></ComputerDisplay></InternalSection></Section><Section><Title>1.3 Getting and displaying dataframe rows</Title><Paragraph>Dataframes can have hundreds or thousands of rows, so it is not practical to display a whole dataframe.</Paragraph><Paragraph>However, there are a number of dataframe attributes and methods that allow you to get and display either a single row or a number of rows at a time. Three of the most useful are the <ComputerCode><b>iloc</b></ComputerCode> attribute and the <ComputerCode><b>head()</b></ComputerCode> and <ComputerCode><b>tail()</b></ComputerCode> methods. 
Note that to distinguish methods and attributes, we write <ComputerCode>()</ComputerCode> after a method’s name.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1041.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1041.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="17592d1b" x_imagesrc="ou_futurelearn_learn_to_code_fig_1041.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 4</b></Caption><Alternative>An image of a data algorithm</Alternative><Description>An image of a data algorithm</Description></Figure><InternalSection><Heading>The iloc attribute</Heading><Paragraph>A dataframe has a default integer index for its rows, which starts at 0 (zero). You can get and display any single row in a dataframe by using the <ComputerCode><b>iloc</b></ComputerCode> attribute, followed by the index of the row you want to access in square brackets. 
For example, the following code will get and display the first row of data in the dataframe <ComputerCode><b>df</b></ComputerCode>, which is at index 0:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.iloc[0]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>Country Afghanistan</Paragraph><Paragraph>Population (1000s) 30552</Paragraph><Paragraph>TB deaths 13000</Paragraph><Paragraph>Name: 0, dtype: object</Paragraph></ComputerDisplay><Paragraph>Similarly, the following code will get and display the third row of data in the dataframe <ComputerCode><b>df</b></ComputerCode>, which is at index 2:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.iloc[2]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>Country Algeria</Paragraph><Paragraph>Population (1000s) 39208</Paragraph><Paragraph>TB deaths 5100.0</Paragraph><Paragraph>Name: 2, dtype: object</Paragraph></ComputerDisplay></InternalSection><InternalSection><Heading>The head() method</Heading><Paragraph>The first few rows of a dataframe can be printed out with the <ComputerCode><b>head()</b></ComputerCode> method.</Paragraph><Paragraph>You can tell <ComputerCode><b>head()</b></ComputerCode> is a method, rather than an attribute such as <ComputerCode><b>columns</b></ComputerCode>, because of the parentheses (round brackets) after the property name.</Paragraph><Paragraph>If you don’t give any argument, i.e. don’t put any number within those parentheses, the default behaviour is to return the first five rows of the dataframe. 
If you give an argument, it will print that number of rows (starting from the row indexed by 0).</Paragraph><Paragraph>For example, executing the following code will get and display the first five rows in the dataframe <ComputerCode><b>df</b></ComputerCode>.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>0</td><td>Afghanistan</td><td>30552</td><td>13000.00</td></tr><tr><td>1</td><td>Albania</td><td>3173</td><td>20.00</td></tr><tr><td>2</td><td>Algeria</td><td>39208</td><td>5100.00</td></tr><tr><td>3</td><td>Andorra</td><td>79</td><td>0.26</td></tr><tr><td>4</td><td>Angola</td><td>21472</td><td>6900.00</td></tr></tbody></Table><Paragraph>And, executing the following code will get and display the first seven rows in the dataframe <ComputerCode><b>df.</b></ComputerCode></Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.head(7)</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>0</td><td>Afghanistan</td><td>30552</td><td>13000.00</td></tr><tr><td>1</td><td>Albania</td><td>3173</td><td>20.00</td></tr><tr><td>2</td><td>Algeria</td><td>39208</td><td>5100.00</td></tr><tr><td>3</td><td>Andorra</td><td>79</td><td>0.26</td></tr><tr><td>4</td><td>Angola</td><td>21472</td><td>6900.00</td></tr><tr><td>5</td><td>Antigua and Barbuda</td><td>90</td><td>1.20</td></tr><tr><td>6</td><td>Argentina</td><td>41446</td><td>570.00</td></tr></tbody></Table></InternalSection><InternalSection><Heading>The tail() method</Heading><Paragraph>The <ComputerCode><b>tail()</b></ComputerCode> method is similar to the 
<ComputerCode><b>head()</b></ComputerCode> method.</Paragraph><Paragraph>If no argument is given, the last five rows of the dataframe are returned, otherwise the number of rows returned is dependent on the argument, just like for the <ComputerCode><b>head()</b></ComputerCode> method.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df.tail()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>189</td><td>Venezuela (Bolivarian Republic of)</td><td>30405</td><td>480</td></tr><tr><td>190</td><td>Viet Nam</td><td>91680</td><td>17000</td></tr><tr><td>191</td><td>Yemen</td><td>24407</td><td>990</td></tr><tr><td>192</td><td>Zambia</td><td>14539</td><td>3600</td></tr><tr><td>193</td><td>Zimbabwe</td><td>14150</td><td>5700</td></tr></tbody></Table></InternalSection></Section><Section><Title>1.4 Getting and displaying dataframe columns</Title><Paragraph>You learned in Week 2 that you can get and display a single column of a dataframe by putting the name of the column (in quotes) within square brackets immediately after the dataframe’s name.</Paragraph><Paragraph>For example, like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df['TB deaths']</ComputerCode></Paragraph><Paragraph>You then get output like this:</Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>0    13000.00</Paragraph><Paragraph>1       20.00</Paragraph><Paragraph>2     5100.00</Paragraph><Paragraph>3        0.26</Paragraph><Paragraph>4     6900.00</Paragraph><Paragraph>5        1.20</Paragraph><Paragraph>6      570.00</Paragraph><Paragraph>...</Paragraph></ComputerDisplay><Paragraph>Notice that although there is an index, there is no column heading. 
This is because what is returned is not a new dataframe with a single column but an example of the <ComputerCode><b>Series</b></ComputerCode> data type.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1042.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1042.jpg" x_folderhash="cbfeded3" x_contenthash="9cff0938" x_imagesrc="ou_futurelearn_learn_to_code_fig_1042.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 5</b></Caption><Description>A perspective image of the aisle between many data storage towers. The floor and the storage units are lit up.</Description></Figure><InternalSection><Heading>Each column in a dataframe is an example of a series</Heading><Paragraph>The <ComputerCode><b>Series</b></ComputerCode> data type is a collection of values with an integer index that starts from zero. 
In addition, the <ComputerCode><b>Series</b></ComputerCode> data type has many of the same methods and attributes as the <ComputerCode><b>DataFrame</b></ComputerCode> data type, so you can still execute code like:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df['TB deaths'].head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>0    13000.00</Paragraph><Paragraph>1       20.00</Paragraph><Paragraph>2     5100.00</Paragraph><Paragraph>3        0.26</Paragraph><Paragraph>4     6900.00</Paragraph><Paragraph>Name: TB deaths, dtype: float64</Paragraph></ComputerDisplay><Paragraph>And</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df['TB deaths'].iloc[2]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>5100.00</ComputerCode></Paragraph><Paragraph>However, pandas does provide a mechanism for you to get and display one or more selected columns as a new dataframe in its own right. To do this you need to use a <b>list</b>. A list in Python consists of one or more items separated by commas and enclosed within square brackets, for example <ComputerCode><b>['Country']</b></ComputerCode> or<ComputerCode><b> ['Country', 'Population (1000s)']</b></ComputerCode>. 
This list is then put within outer square brackets immediately after the dataframe’s name, like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[['Country']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th><b>Country</b></th></tr><tr><td>0</td><td>Afghanistan</td></tr><tr><td>1</td><td>Albania</td></tr><tr><td>2</td><td>Algeria</td></tr><tr><td>3</td><td>Andorra</td></tr><tr><td>4</td><td>Angola</td></tr></tbody></Table><Paragraph>Note that the column is now named. The expression <ComputerCode><b>df[['Country']]</b></ComputerCode> (with two pairs of square brackets) evaluates to a new dataframe (which happens to have a single column) rather than a series.</Paragraph><Paragraph>To get a new dataframe with multiple columns you just need to put more column names in the list, like this:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[['Country', 'Population (1000s)']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th><b>Country</b></th><th><b>Population (1000s)</b></th></tr><tr><td>0</td><td>Afghanistan</td><td>30552</td></tr><tr><td>1</td><td>Albania</td><td>3173</td></tr><tr><td>2</td><td>Algeria</td><td>39208</td></tr><tr><td>3</td><td>Andorra</td><td>79</td></tr><tr><td>4</td><td>Angola</td><td>21472</td></tr></tbody></Table><Paragraph>The code has returned a new dataframe with just the <ComputerCode><b>'Country'</b></ComputerCode> and <ComputerCode><b>'Population (1000s)'</b></ComputerCode> columns.</Paragraph><Activity><Heading>Exercise 1 Dataframes and CSV files</Heading><Question><Paragraph>Now that you’ve learned about CSV files and more about pandas, you are ready to complete Exercise 1 in the exercise notebook 2.</Paragraph><Paragraph>Open the exercise 2 notebook and the 
data file you used last week WHO POP TB all.csv and save it in the folder you created in Week 1.</Paragraph><Paragraph>If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter. Once it’s open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter watch again the video in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83246&amp;targetdoc=Week+1%3A+Having+a+go+at+it+Part+1&amp;targetptr=1.4">Week 1 Exercise 1.</a> </Paragraph></Question></Activity></InternalSection></Section><Section><Title>1.5 Comparison operators</Title><Paragraph>In <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83246&amp;targetdoc=Week+1%3A+Having+a+go+at+it+Part+1&amp;targetptr=1.5">Expressions,</a> you learned that Python has arithmetic operators: +, /, - and * and that expressions such as 5 + 2 evaluate to a value (in this case the number 7).</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1043.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1043.jpg" x_folderhash="cbfeded3" x_contenthash="da320f19" x_imagesrc="ou_futurelearn_learn_to_code_fig_1043.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 6</b></Caption><Alternative>An illustration of two girls holding up signs. One sign says, 'YES', the other says, 'NO'.</Alternative><Description>An illustration of two girls holding up signs. 
One sign says, 'YES', the other says, 'NO'.</Description></Figure><Paragraph>Python also has comparison operators. These are:</Paragraph><ComputerDisplay><Paragraph>==    equals</Paragraph><Paragraph>!=    not equal</Paragraph><Paragraph>&lt;     less than</Paragraph><Paragraph>&gt;     greater than</Paragraph><Paragraph>&lt;=    less than or equal to </Paragraph><Paragraph>&gt;=    greater than or equal to</Paragraph></ComputerDisplay><Paragraph>Expressions involving these operators always evaluate to a Boolean value, that is <ComputerCode><b>True</b></ComputerCode> or <ComputerCode><b>False</b></ComputerCode>. Here are some examples:</Paragraph><ComputerDisplay><Paragraph>2 == 2       evaluates to True</Paragraph><Paragraph>2 + 2 == 5   evaluates to False</Paragraph><Paragraph>2 != 1 + 1   evaluates to False</Paragraph><Paragraph>45 &lt; 50      evaluates to True</Paragraph><Paragraph>20 &gt; 30      evaluates to False</Paragraph><Paragraph>100 &lt;= 100   evaluates to True</Paragraph><Paragraph>101 &gt;= 100   evaluates to True</Paragraph></ComputerDisplay><Paragraph>The comparison operators can be used with other types of data, not just numbers. Used with strings, they compare using alphabetical order (with uppercase letters ordered before lowercase ones). For example:</Paragraph><Paragraph><ComputerCode>'aardvark' &lt; 'zebra' evaluates to True</ComputerCode></Paragraph><Paragraph>In <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83246&amp;targetdoc=Week+2%3A+Having+a+go+at+it+Part+2&amp;targetptr=1.6">Calculating over columns</a> you saw that when applied to whole columns, the arithmetic operators did the calculations row by row. 
Similarly, an expression like <ComputerCode><b>df['Country'] &gt;= 'K'</b></ComputerCode> will compare the country names, row by row, against the string 'K' and record whether the result is <ComputerCode><b>True</b></ComputerCode> or <ComputerCode><b>False</b></ComputerCode> in a series like this:</Paragraph><ComputerDisplay><Paragraph>0    False</Paragraph><Paragraph>1    False</Paragraph><Paragraph>2    False</Paragraph><Paragraph>3    False</Paragraph><Paragraph>4    False</Paragraph><Paragraph>5    False</Paragraph><Paragraph>...</Paragraph><Paragraph>Name: Country, dtype: bool </Paragraph></ComputerDisplay><Paragraph>If such an expression is put within square brackets immediately after a dataframe’s name, a new dataframe is obtained with only those rows where the result is <ComputerCode><b>True</b></ComputerCode>. So:</Paragraph><Paragraph><ComputerCode>df[df['Country'] &gt;= 'K']</ComputerCode></Paragraph><Paragraph>returns a new dataframe with all the columns of <ComputerCode><b>df </b></ComputerCode>but with only the rows corresponding to countries starting with K or a letter later in the alphabet.</Paragraph><Paragraph>As another example, to see the data for countries with over 80 million inhabitants, the following code will return and display a new dataframe with all the columns of <ComputerCode><b>df</b></ComputerCode> but with only the rows where it is <ComputerCode><b>True</b></ComputerCode> that the value in the <ComputerCode><b>'Population (1000s)'</b></ComputerCode> column is greater than <ComputerCode><b>80000:</b></ComputerCode></Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[df['Population (1000s)'] &gt; 80000]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out[]:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB 
deaths</th></tr><tr><td>13</td><td>Bangladesh</td><td>156595</td><td>80000</td></tr><tr><td>23</td><td>Brazil</td><td>200362</td><td>4400</td></tr><tr><td>36</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>53</td><td>Egypt</td><td>82056</td><td>550</td></tr><tr><td>58</td><td>Ethiopia</td><td>94101</td><td>30000</td></tr><tr><td>65</td><td>Germany</td><td>82727</td><td>300</td></tr><tr><td>77</td><td>India</td><td>1252140</td><td>240000</td></tr><tr><td>78</td><td>Indonesia</td><td>249866</td><td>64000</td></tr><tr><td>85</td><td>Japan</td><td>127144</td><td>2100</td></tr><tr><td>109</td><td>Mexico</td><td>122332</td><td>2200</td></tr><tr><td>124</td><td>Nigeria</td><td>173615</td><td>160000</td></tr><tr><td>128</td><td>Pakistan</td><td>182143</td><td>49000</td></tr><tr><td>134</td><td>Philippines</td><td>98394</td><td>27000</td></tr><tr><td>141</td><td>Russian Federation</td><td>142834</td><td>17000</td></tr><tr><td>185</td><td>United States of America</td><td>320051</td><td>490</td></tr><tr><td>190</td><td>Viet Nam</td><td>91680</td><td>17000</td></tr></tbody></Table><Activity><Heading>Exercise 2 Comparison operators</Heading><Question><Paragraph>You are ready to complete Exercise 2 in the Exercise notebook 2.</Paragraph><Paragraph>Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. 
</Paragraph></Question></Activity></Section><Section><Title>1.6 Bitwise operators</Title><Paragraph>To build more complicated expressions involving column comparisons, there are two bitwise operators.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1044.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1044.jpg" x_folderhash="cbfeded3" x_contenthash="72f00105" x_imagesrc="ou_futurelearn_learn_to_code_fig_1044.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 7</b></Caption><Alternative>An image of someone constructing a building from wooden blocks</Alternative><Description>An image of someone constructing a building from wooden blocks</Description></Figure><Paragraph>The <ComputerCode><b>&amp;</b></ComputerCode> operator means ‘and’ and the <ComputerCode><b>|</b></ComputerCode> operator (vertical bar, not uppercase letter ‘i’) means ‘or’. So, for example, the expression:</Paragraph><ComputerDisplay><Paragraph>(df['Country'] &gt;= 'Latvia') &amp; (df['Country'] &lt;= 'Sweden')</Paragraph></ComputerDisplay><Paragraph>will evaluate to a series of Boolean values that are <ComputerCode><b>True</b></ComputerCode> only for the rows whose country name falls alphabetically between ‘<ComputerCode><b>Latvia</b></ComputerCode>’ and ‘<ComputerCode><b>Sweden</b></ComputerCode>’, inclusive. However, the following expression, which uses | (or) rather than &amp; (and):</Paragraph><Paragraph><ComputerCode>(df['Country'] &gt;= 'Latvia') | (df['Country'] &lt;= 'Sweden')</ComputerCode></Paragraph><Paragraph>will evaluate to <ComputerCode><b>True</b></ComputerCode> for all countries, because every country comes alphabetically after ‘<ComputerCode><b>Latvia</b></ComputerCode>’ (e.g. the ‘UK’) or before ‘<ComputerCode><b>Sweden</b></ComputerCode>’ (e.g. 
‘<ComputerCode><b>Brazil</b></ComputerCode>’).</Paragraph><Paragraph>Note the round brackets around each comparison. Without them you will get an error.</Paragraph><Paragraph>The whole expression with multiple comparisons has to be put within <ComputerCode><b>df[…]</b></ComputerCode> to get a dataframe with only those rows that match the condition.</Paragraph><Paragraph>As a further example, using different columns, it is relatively easy to find the rows in <ComputerCode><b>df</b></ComputerCode> where '<ComputerCode><b>Population (1000s)</b></ComputerCode>' is greater than <ComputerCode><b>80000</b></ComputerCode> and where '<ComputerCode><b>TB deaths</b></ComputerCode>' are greater than <ComputerCode>10000</ComputerCode>.</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><Paragraph><ComputerCode>df[(df['Population (1000s)'] &gt; 80000) &amp; (df['TB deaths'] &gt; 10000)]</ComputerCode></Paragraph><Paragraph><ComputerCode><b>Out []:</b></ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th> </th><th>Country</th><th>Population (1000s)</th><th>TB deaths</th></tr><tr><td>13</td><td>Bangladesh</td><td>156595</td><td>80000</td></tr><tr><td>36</td><td>China</td><td>1393337</td><td>41000</td></tr><tr><td>58</td><td>Ethiopia</td><td>94101</td><td>30000</td></tr><tr><td>77</td><td>India</td><td>1252140</td><td>240000</td></tr><tr><td>78</td><td>Indonesia</td><td>249866</td><td>64000</td></tr><tr><td>124</td><td>Nigeria</td><td>173615</td><td>160000</td></tr><tr><td>128</td><td>Pakistan</td><td>182143</td><td>49000</td></tr><tr><td>134</td><td>Philippines</td><td>98394</td><td>27000</td></tr><tr><td>141</td><td>Russian Federation</td><td>142834</td><td>17000</td></tr><tr><td>190</td><td>Viet Nam</td><td>91680</td><td>17000</td></tr></tbody></Table><Paragraph>These expressions can get long and complicated, making it easy to miss a crucial round or square bracket. In those cases it is best to break up the expression into small steps. 
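</Paragraph><Paragraph>As a rough, self-contained sketch of the two operators, the code below rebuilds four rows of the table above by hand. The shortened dataframe and the variable names <ComputerCode>either</ComputerCode> and <ComputerCode>both</ComputerCode> are for illustration only and are not part of the exercise notebook. The round brackets are needed because Python applies <ComputerCode>&amp;</ComputerCode> and <ComputerCode>|</ComputerCode> before comparisons such as <ComputerCode>&gt;</ComputerCode>.</Paragraph><ComputerDisplay><Paragraph><![CDATA[
```python
from pandas import DataFrame

# Four rows copied by hand from the table above (illustration only).
df = DataFrame({
    'Country': ['Bangladesh', 'Brazil', 'Germany', 'India'],
    'Population (1000s)': [156595, 200362, 82727, 1252140],
    'TB deaths': [80000, 4400, 300, 240000],
})

# | keeps the rows where at least one comparison is True.
either = df[(df['Population (1000s)'] > 200000) | (df['TB deaths'] > 10000)]
print(either['Country'].tolist())

# & keeps only the rows where both comparisons are True.
both = df[(df['Population (1000s)'] > 80000) & (df['TB deaths'] > 10000)]
print(both['Country'].tolist())
```
]]></Paragraph></ComputerDisplay><Paragraph>With these four rows, the ‘or’ query keeps Bangladesh, Brazil and India, while the ‘and’ query, which is the one used earlier in this section, keeps only Bangladesh and India.</Paragraph><Paragraph>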
The previous example could also be written as:</Paragraph><Paragraph><ComputerCode><b>In []:</b></ComputerCode></Paragraph><ComputerDisplay><Paragraph>population = df['Population (1000s)'] </Paragraph><Paragraph>deaths = df['TB deaths']</Paragraph><Paragraph>df[(population &gt; 80000) &amp; (deaths &gt; 10000)]</Paragraph></ComputerDisplay><Activity><Heading>Exercise 3 Bitwise operators</Heading><Question><Paragraph>Complete Exercise 3 in the Exercise notebook 2.</Paragraph></Question></Activity></Section></Session><Session><Title>2 This week’s quiz</Title><Paragraph>Check what you’ve learned this week by taking the end-of-week quiz.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78779">Week 3 practice quiz</a></Paragraph><Paragraph>Open the quiz in a new window or tab then come back here when you’ve finished.</Paragraph></Session><Session><Title>3 Summary</Title><Paragraph>This week looked at the importance of dataframes and the ‘dot’ notation, and the various dataframe methods. It also covered:</Paragraph><BulletedList><ListItem>CSV files</ListItem><ListItem>Comparison operators</ListItem><ListItem>Bitwise operators.</ListItem></BulletedList><Paragraph>Next week looks at weather data and how to use the data to get answers to your questions.</Paragraph></Session></Unit><Unit><UnitID/><UnitTitle>Week 4: Cleaning up our act Part 2</UnitTitle><Session><Title>1 Loading the weather data</Title><Paragraph>You have learned some more about Python and the pandas module and tried it out on a fairly small dataset. 
You are now ready to explore a dataset from the Weather Underground.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1039.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1039.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="51178b01" x_imagesrc="ou_futurelearn_learn_to_code_fig_1039.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b> </Caption><Alternative>An image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Alternative><Description>An image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Description></Figure><Paragraph>Open the file London_2014.csv and save it in the disk folder or CoCalc project you created in Week 1.</Paragraph><Paragraph><b>Do not be tempted to open this file with Excel</b> as this application will attempt to localise the data in the file, i.e. use your country’s local data formats, which will make much of what follows rather incomprehensible! You can if you like open the file with a simple text editor, but <b>do not make any changes</b>.</Paragraph><Paragraph>The CSV file can be loaded into a dataframe by executing the following code:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>from pandas import *</Paragraph><Paragraph>london = read_csv('London_2014.csv')</Paragraph><Paragraph>london.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1006.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1006.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="47f77fd6" x_imagesrc="ou_futurelearn_learn_to_code_fig_1006.jpg" x_imagewidth="512" x_imageheight="214"/><Caption><b>Figure 2</b> </Caption><Alternative>First 5 rows of the London dataframe. </Alternative><Description>First 5 rows of the London dataframe. Note that only the first few columns are shown due to the limitation of page width.</Description></Figure><Paragraph><i> Note that the right-hand side of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>In the next section, you’ll find out how to remove rogue spaces.</Paragraph><InternalSection><Heading>Important notice for learners outside of the EU</Heading><Paragraph>The Weather Underground automatically localises data based on the country from which it detects you are accessing the website. So, for example, if you are accessing the website from the USA, wind speeds will be in MPH rather than km/h and temperatures in Fahrenheit rather than Celsius.</Paragraph><Paragraph>In order to change the settings so that the data is in European format you will need to click on the ‘head and shoulders’ icon on the top right of the Weather Underground web page and create a free Weather Underground account.</Paragraph><Paragraph>Once you have created an account, click on the ‘cog’ icon on the top right of the web page. 
Then:</Paragraph><BulletedList><ListItem>click on the C button to select Celsius</ListItem><ListItem>click on ‘More Settings’ and select Units: metric</ListItem><ListItem>click on ‘Save My Preferences’.</ListItem></BulletedList><Paragraph>Now, when you download the data, temperatures will be in Celsius and wind speeds in km/h etc.</Paragraph></InternalSection><Section><Title>1.1 Removing rogue spaces</Title><Paragraph>One of the problems often encountered with CSV files is rogue spaces before or after data values or column names.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1045.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1045.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="fef0227c" x_imagesrc="ou_futurelearn_learn_to_code_fig_1045.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b> </Caption><Alternative>An image of empty, numbered parking spaces</Alternative><Description>An image of empty, numbered parking spaces</Description></Figure><Paragraph>You learned earlier, in ‘What is a CSV file?’, that each value or column name is separated by a comma. However, if you opened ‘London_2014.csv’ in a text editor, you would see that in the row of column names there are sometimes spaces after a comma:</Paragraph><Quote><Paragraph>GMT,Max TemperatureC,Mean TemperatureC,Min TemperatureC,Dew PointC,MeanDew PointC,Min DewpointC,Max Humidity, Mean Humidity, Min Humidity, Max Sea Level PressurehPa, Mean Sea Level PressurehPa, Min Sea Level PressurehPa, Max VisibilityKm, Mean VisibilityKm, Min VisibilitykM, Max Wind SpeedKm/h, Mean Wind SpeedKm/h, Max Gust SpeedKm/h,Precipitationmm, CloudCover, Events,WindDirDegrees<br/></Paragraph></Quote><!--
Please leave the apostrophes as they are below as this is illustrating a point  it isn't wrong. 
--><Paragraph>For example, there is a space after the comma between <ComputerCode>
<b>Max Humidity</b>
</ComputerCode> and <ComputerCode>
<b>Mean Humidity</b>
</ComputerCode>. This means that when <ComputerCode>
<b>read_csv()</b>
</ComputerCode> reads the row of column names it will interpret a space after a comma as part of the next column name. So, for example, the column name after <ComputerCode>
<b>'Max Humidity'</b>
</ComputerCode> will be interpreted as <ComputerCode>
<b>' Mean Humidity'</b>
</ComputerCode> rather than what was intended, which is <ComputerCode>
<b>'Mean Humidity'</b>
</ComputerCode>. The ramification of this is that code such as:</Paragraph><Paragraph><ComputerCode>london[['Mean Humidity']]</ComputerCode></Paragraph><Paragraph>will cause a key error (see <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83247&amp;targetdoc=Week+2%3A+Having+a+go+at+it+Part+2&amp;targetptr=1.3">Selecting a column</a> ), as the column name is confusingly <ComputerCode>
<b>' Mean Humidity</b>
</ComputerCode> '.</Paragraph><Paragraph>This can easily be rectified by adding another argument to the <ComputerCode>
<b>read_csv()</b>
</ComputerCode> function:</Paragraph><Paragraph><ComputerCode>skipinitialspace=True</ComputerCode></Paragraph><Paragraph>which will tell <ComputerCode>
<b>read_csv()</b>
</ComputerCode> to ignore any spaces after a comma:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
london = read_csv('London_2014.csv', skipinitialspace=True)
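</Paragraph></ComputerDisplay><Paragraph>To see the effect without touching the real file, here is a minimal sketch that reads a two-line stand-in for the CSV file from a string. The text in <ComputerCode>csv_text</ComputerCode>, its made-up values and the names <ComputerCode>raw</ComputerCode> and <ComputerCode>clean</ComputerCode> are assumptions for illustration only:</Paragraph><ComputerDisplay><Paragraph><![CDATA[
```python
from io import StringIO
from pandas import read_csv

# A two-line stand-in for the real file, with a rogue space after the comma
# in the header row. The values are made up.
csv_text = 'Max Humidity, Mean Humidity\n94,86\n'

raw = read_csv(StringIO(csv_text))
clean = read_csv(StringIO(csv_text), skipinitialspace=True)

print(list(raw.columns))    # the second name keeps its leading space
print(list(clean.columns))  # the leading space has been skipped
```
]]>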
</Paragraph></ComputerDisplay><Paragraph>The rogue spaces will no longer be in the dataframe and we can write code such as:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[['Mean Humidity']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th><b>Mean Humidity</b></th></tr><tr><td>0</td><td>86</td></tr><tr><td>1</td><td>81</td></tr><tr><td>2</td><td>76</td></tr><tr><td>3</td><td>85</td></tr><tr><td>4</td><td>88</td></tr></tbody></Table><Paragraph>Note that a <ComputerCode>
<b>skipinitialspace=True</b>
</ComputerCode> argument won’t remove a trailing space at the end of a column name.</Paragraph><Paragraph>Next, find out about extra characters and how to remove them.</Paragraph></Section><Section><Title>1.2 Removing extra characters</Title><Paragraph>If you opened London_2014.csv in a text editor once again and looked at the last column name you would see that the name is 'WindDirDegrees<br/>'.</Paragraph><Paragraph>What has happened here is that when the dataset was exported from the Weather Underground website an html line break <ComputerCode>
<b>(<br/>)</b>
</ComputerCode> was added after the line of column headers which <ComputerCode>
<b>read_csv()</b>
</ComputerCode> has interpreted as the end part of the final column’s name.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1050.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1050.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="396fba4f" x_imagesrc="ou_futurelearn_learn_to_code_fig_1050.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 4</b> </Caption><Alternative>An image of two bouncers in suits standing in a corridor</Alternative><Description>An image of two bouncers in suits standing in a corridor</Description></Figure><Paragraph>In fact, the problem is worse than this. Let’s look at some values in the final column:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[['WindDirDegrees<br/>']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th><b> WindDirDegrees <br/> </b></th></tr><tr><td>0</td><td>186<br/></td></tr><tr><td>1</td><td>214<br/></td></tr><tr><td>2</td><td>219<br/></td></tr><tr><td>3</td><td>211<br/></td></tr><tr><td>4</td><td>199<br/></td></tr></tbody></Table><Paragraph>It seems there is an html line break at the end of each line. If I opened ‘London_2014.csv’ in a text editor and looked at the ends of all lines in the file this would be confirmed.</Paragraph><Paragraph>Once again I’m not going to edit the CSV file but rather fix the problem in the dataframe. To change <ComputerCode>
<b>'WindDirDegrees<br/>'</b>
</ComputerCode> to <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> all I have to do is use the <ComputerCode>
<b>rename()</b>
</ComputerCode> method as follows:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london = london.rename(columns={'WindDirDegrees<br/>':'WindDirDegrees'})
</ComputerCode></Paragraph><Paragraph>Don’t worry about the syntax of the argument for <ComputerCode>
<b>rename()</b>
</ComputerCode> , just use this example as a template for whenever you need to change the name of a column.</Paragraph><Paragraph>Now I need to get rid of those pesky <ComputerCode>
<b><br/></b>
</ComputerCode> html line breaks from the ends of the values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column, so that they become something sensible. I can do that using the string method <ComputerCode>
<b>rstrip()</b>
</ComputerCode> which is used to remove characters from the end or ‘rear’ of a string, just like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br/>')
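</ComputerCode></Paragraph><Paragraph>One detail worth knowing is that the argument of <ComputerCode>rstrip()</ComputerCode> is treated as a set of characters to remove, not as a fixed ending. The sketch below uses hand-typed values and the illustrative names <ComputerCode>raw</ComputerCode> and <ComputerCode>clean</ComputerCode> rather than the real column. It suggests why this is safe for values that end in digits but could surprise you with text:</Paragraph><Paragraph><ComputerCode><![CDATA[
```python
from pandas import Series

# Hand-typed values in the same shape as the raw 'WindDirDegrees' column.
raw = Series(['186<br/>', '214<br/>'])

# str.rstrip() removes any trailing run of the listed characters and stops
# at the first character (here a digit) that is not in the list.
clean = raw.str.rstrip('<br/>')
print(clean.tolist())

# Because the argument is a set of characters rather than a fixed suffix,
# letters that happen to be in the set are removed too: this prints 'ba'.
print('bar<br/>'.rstrip('<br/>'))
```
]]>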
</ComputerCode></Paragraph><Paragraph>Again don’t worry too much about the syntax of the code and simply use it as a template for whenever you need to process a whole column of values stripping characters from the end of each string value.</Paragraph><Paragraph>Let’s display the first few rows of the ' <ComputerCode>
<b>WindDirDegrees</b>
</ComputerCode> ' column to confirm the changes:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[['WindDirDegrees']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th><b>WindDirDegrees</b></th></tr><tr><td>0</td><td>186</td></tr><tr><td>1</td><td>214</td></tr><tr><td>2</td><td>219</td></tr><tr><td>3</td><td>211</td></tr><tr><td>4</td><td>199</td></tr></tbody></Table></Section><Section id="missing_values"><Title>1.3 Missing values</Title><Paragraph>As you heard in the video at the start of the week, missing values (also called null values) are one of the reasons to clean data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1051.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1051.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="8c07af78" x_imagesrc="ou_futurelearn_learn_to_code_fig_1051.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 5</b> </Caption><Alternative>An image of a girl with the last piece of a jigsaw puzzle</Alternative><Description>An image of a girl with the last piece of a jigsaw puzzle</Description></Figure><Paragraph>Finding missing values in a particular column can be done with the column method <ComputerCode>
<b>isnull()</b>
</ComputerCode> , like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london['Events'].isnull()</ComputerCode></Paragraph><Paragraph>The above code returns a series of Boolean values, where <ComputerCode>
<b>True</b>
</ComputerCode> indicates that the corresponding row in the <ComputerCode>
<b>'Events'</b>
</ComputerCode> column is missing a value and <ComputerCode>
<b>False</b>
</ComputerCode> indicates the presence of a value. Here are the last few rows from the series:</Paragraph><ComputerDisplay><Paragraph>...</Paragraph><Paragraph>360 False</Paragraph><Paragraph>361 True</Paragraph><Paragraph>362 True</Paragraph><Paragraph>363 True</Paragraph><Paragraph>364 False</Paragraph><Paragraph>Name: Events, dtype: bool</Paragraph></ComputerDisplay><Paragraph>If, as you did with the comparison expressions, you put this code within square brackets after the dataframe’s name, it will return a new dataframe consisting of all the rows without recorded events (rain, fog, thunderstorm, etc.):</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[london['Events'].isnull()]</ComputerCode></Paragraph><Paragraph>As you will see in Exercise 4 of the exercise notebook, this will return a new dataframe with 114 rows, showing that almost one in three days had no particular event recorded. If you scroll the table to the right, you will see that all values in the <ComputerCode>
<b>'Events'</b>
</ComputerCode> column are marked <ComputerCode>
<b>NaN</b>
</ComputerCode> , which stands for ‘Not a Number’, but is also used to mark non-numeric missing values, like in this case (events are strings, not numbers).</Paragraph><Paragraph>Once you know how much and where data is missing, you have to decide what to do: ignore those rows? Replace with a fixed value? Replace with a computed value, like the mean?</Paragraph><Paragraph>In this case, only the first two options are possible. The method call <ComputerCode>
<b>london.dropna()</b>
</ComputerCode> will drop (remove) all rows that have a missing (non-available) value somewhere, returning a new dataframe. This will therefore also remove rows that have missing values in other columns.</Paragraph><Paragraph>The column method <ComputerCode>
<b>fillna()</b>
</ComputerCode> will replace all non-available values with the value given as argument. For this case, each NaN could be replaced by the empty string.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london['Events'] = london['Events'].fillna('')</Paragraph><Paragraph>london[london['Events'].isnull()]</Paragraph></ComputerDisplay><Paragraph>The second line above will now show an empty dataframe, because there are no longer missing values in the events column.</Paragraph><Paragraph>As a final note on missing values, pandas ignores them when computing numeric statistics, i.e. you don’t have to remove missing values before applying <ComputerCode>
<b>sum(), median()</b>
</ComputerCode> and other similar methods.</Paragraph><Paragraph>Learn about checking data types of each column in the next section.</Paragraph></Section><Section><Title>1.4 Changing the value types of columns</Title><Paragraph>The function <ComputerCode>
<b>read_csv()</b>
</ComputerCode> may, for many reasons, wrongly interpret the data type of the values in a column, so when cleaning data it’s important to check the data types of each column are what is expected, and if necessary change them.</Paragraph><Paragraph>The data type of every column in a dataframe can be determined by looking at the dataframe’s <ComputerCode>
<b>dtypes</b>
</ComputerCode> attribute, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london.dtypes</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT object</Paragraph><Paragraph>Max TemperatureC int64</Paragraph><Paragraph>Mean TemperatureC int64</Paragraph><Paragraph>Min TemperatureC int64</Paragraph><Paragraph>Dew PointC int64</Paragraph><Paragraph>MeanDew PointC int64</Paragraph><Paragraph>Min DewpointC int64</Paragraph><Paragraph>Max Humidity int64</Paragraph><Paragraph>Mean Humidity int64</Paragraph><Paragraph>Min Humidity int64</Paragraph><Paragraph>Max Sea Level PressurehPa int64</Paragraph><Paragraph>Mean Sea Level PressurehPa int64</Paragraph><Paragraph>Min Sea Level PressurehPa int64</Paragraph><Paragraph>Max VisibilityKm int64</Paragraph><Paragraph>Mean VisibilityKm int64</Paragraph><Paragraph>Min VisibilitykM int64</Paragraph><Paragraph>Max Wind SpeedKm/h int64</Paragraph><Paragraph>Mean Wind SpeedKm/h int64</Paragraph><Paragraph>Max Gust SpeedKm/h float64</Paragraph><Paragraph>Precipitationmm float64</Paragraph><Paragraph>CloudCover float64</Paragraph><Paragraph>Events object</Paragraph><Paragraph>WindDirDegrees object</Paragraph><Paragraph>dtype: object</Paragraph></ComputerDisplay><Paragraph>In the above output, you can see the column names to the left and to the right the data types of the values in those columns.</Paragraph><BulletedList><ListItem><ComputerCode>
<b>int64</b>
</ComputerCode> is the pandas data type for whole numbers such as <ComputerCode>
<b>55</b>
</ComputerCode> or <ComputerCode>
<b>2356</b>
</ComputerCode></ListItem><ListItem><ComputerCode>
<b>float64</b>
</ComputerCode> is the pandas data type for decimal numbers such as <ComputerCode>
<b>55.25</b>
</ComputerCode> or <ComputerCode>
<b>2356.00</b>
</ComputerCode></ListItem><ListItem><ComputerCode>
<b>object</b>
</ComputerCode> is the pandas data type for strings such as <ComputerCode>
<b>'hello world'</b>
</ComputerCode> or <ComputerCode>
<b>'rain'</b>
</ComputerCode></ListItem></BulletedList><Paragraph>Most of the column data types seem fine, however two are of concern, <ComputerCode>
<b>'GMT'</b>
</ComputerCode> and <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> , both of which are of type <ComputerCode>
<b>object</b>
</ComputerCode>. Let’s take a look at <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> first.</Paragraph><InternalSection><Heading> Changing the data type of the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column </Heading><Paragraph>The <ComputerCode>
<b>read_csv()</b>
</ComputerCode> function has interpreted the values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column as strings (type <ComputerCode>
<b>object</b>
</ComputerCode> ). This is because in the CSV file the values in that column had all been suffixed with that html line break string <ComputerCode>
<b><br/></b>
</ComputerCode> so <ComputerCode>
<b>read_csv()</b>
</ComputerCode> had no alternative but to interpret the values as strings.</Paragraph><Paragraph>The values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column are meant to represent wind direction in terms of degrees from true north (360) and meteorologists always define the wind direction as the direction the wind is coming from. So if you stand so that the wind is blowing directly into your face, the direction you are facing names the wind, so a westerly wind is reported as 270 degrees. The compass rose shown below should make this clearer:</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1007.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1007.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="894f0d0a" x_imagesrc="ou_futurelearn_learn_to_code_fig_1007.jpg" x_imagewidth="512" x_imageheight="273"/><Caption><b>Figure 6</b> A compass rose </Caption></Figure><Paragraph>We need to be able to make queries such as ‘Get and display the rows where the wind direction is greater than 350 degrees’. To do this we need to change the data type of the ‘WindDirDegrees’ column from object to type <ComputerCode>
<b>int64</b>
</ComputerCode>. We can do that by using the <ComputerCode>
<b>astype()</b>
</ComputerCode> method like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')
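</ComputerCode></Paragraph><Paragraph>The same conversion can be tried on a small scale. This is only a sketch: the series <ComputerCode>degrees</ComputerCode>, its hand-typed values and the name <ComputerCode>as_numbers</ComputerCode> are assumptions for illustration and do not come from the weather data:</Paragraph><Paragraph><ComputerCode><![CDATA[
```python
from pandas import Series

# Hand-typed strings standing in for cleaned 'WindDirDegrees' values.
degrees = Series(['186', '214', '355'])

# astype() returns a new series with the values converted to whole numbers.
as_numbers = degrees.astype('int64')
print(as_numbers.dtype)

# Numeric comparisons now work as expected.
print(as_numbers[as_numbers > 350].tolist())
```
]]>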
</ComputerCode></Paragraph><Paragraph>Now all the values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column are of type <ComputerCode>
<b>int64</b>
</ComputerCode> and we can make our query:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[london['WindDirDegrees'] &gt; 350]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1008.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1008.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="471ef2e3" x_imagesrc="ou_futurelearn_learn_to_code_fig_1008.jpg" x_imagewidth="512" x_imageheight="254"/><Caption><b>Figure 7</b> </Caption><Alternative>Rows from the london dataframe where the value in the WindDirDegrees column is greater than 350.</Alternative><Description>Rows from the london dataframe where the value in the WindDirDegrees column is greater than 350. Note that the WindDirDegrees column is not shown as it is on the far right of the table and only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column is on the far right of the table and the right of the table has been cropped to fit on the page. </i></Paragraph></InternalSection><InternalSection><Heading>Changing the data type of the ‘GMT’ column</Heading><Paragraph>Recall that I noted that the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column was of type <ComputerCode>
<b>object</b>
</ComputerCode> , the type pandas uses for strings.</Paragraph><Paragraph>The <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column is supposed to represent dates. It would be helpful for the date values not to be strings to make it possible to make queries of the data such as ‘Return the row where the date is 4 June 2014’.</Paragraph><Paragraph>Pandas has a function called <ComputerCode>
<b>to_datetime()</b>
</ComputerCode> which can convert a column of <ComputerCode>
<b>object</b>
</ComputerCode> (string) values such as those in the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column into values of a proper date type called <ComputerCode>
<b>datetime64</b>
</ComputerCode>, just like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london['GMT'] = to_datetime(london['GMT'])</Paragraph><Paragraph>
#Then display the types of all the columns again so we
</Paragraph><Paragraph>#can check the changes have been made.</Paragraph><Paragraph>london.dtypes</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT datetime64[ns]</Paragraph><Paragraph>Max TemperatureC int64</Paragraph><Paragraph>Mean TemperatureC int64</Paragraph><Paragraph>Min TemperatureC int64</Paragraph><Paragraph>Dew PointC int64</Paragraph><Paragraph>MeanDew PointC int64</Paragraph><Paragraph>Min DewpointC int64</Paragraph><Paragraph>Max Humidity int64</Paragraph><Paragraph>Mean Humidity int64</Paragraph><Paragraph>Min Humidity int64</Paragraph><Paragraph>Max Sea Level PressurehPa int64</Paragraph><Paragraph>Mean Sea Level PressurehPa int64</Paragraph><Paragraph>Min Sea Level PressurehPa int64</Paragraph><Paragraph>Max VisibilityKm int64</Paragraph><Paragraph>Mean VisibilityKm int64</Paragraph><Paragraph>Min VisibilitykM int64</Paragraph><Paragraph>Max Wind SpeedKm/h int64</Paragraph><Paragraph>Mean Wind SpeedKm/h int64</Paragraph><Paragraph>Max Gust SpeedKm/h float64</Paragraph><Paragraph>Precipitationmm float64</Paragraph><Paragraph>CloudCover float64</Paragraph><Paragraph>Events object</Paragraph><Paragraph>WindDirDegrees int64</Paragraph><Paragraph>dtype: object</Paragraph></ComputerDisplay><Paragraph>From the above output, we can confirm that the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column type has been changed from <ComputerCode>
<b>object</b>
</ComputerCode> to <ComputerCode>
<b>int64</b>
</ComputerCode> and that the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column type has been changed from <ComputerCode>
<b>object</b>
</ComputerCode> to <ComputerCode>
<b>datetime64</b>
</ComputerCode>.</Paragraph><Paragraph>To make queries such as ‘Return the row where the date is 4 June 2014’ you’ll need to be able to create a <ComputerCode>
<b>datetime64</b>
</ComputerCode> value to represent 4 June 2014. It cannot be:</Paragraph><Paragraph><ComputerCode>london[london['GMT'] == '2014-6-4']</ComputerCode></Paragraph><Paragraph>because ‘2014-6-4’ is a string and the values in the ‘GMT’ column are of type <ComputerCode>
<b>datetime64</b>
</ComputerCode>. Instead you must create a <ComputerCode>
<b>datetime64</b>
</ComputerCode> value using the <ComputerCode>
<b>datetime()</b>
</ComputerCode> function like this:</Paragraph><Paragraph><ComputerCode>datetime(2014, 6, 4)</ComputerCode></Paragraph><Paragraph>In the function call above, the first integer argument is the year, the second the month and the third the day.</Paragraph><Paragraph>First import the <ComputerCode><b>datetime()</b></ComputerCode> function from the similarly named <ComputerCode><b>datetime</b></ComputerCode> module by running the following line of code:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>from datetime import datetime</ComputerCode></Paragraph><Paragraph>Let’s try the function out by executing the code to ‘Return the row where the date is 4 June 2014’:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[london['GMT'] == datetime(2014, 6, 4)]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1009.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1009.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="ef710c2d" x_imagesrc="ou_futurelearn_learn_to_code_fig_1009.jpg" x_imagewidth="512" x_imageheight="113"/><Caption><b>Figure 8</b> </Caption><Description>The row from the london dataframe where the date is 4 June 2014. Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the right side of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>You can also now make more complex queries involving dates such as ‘Return all the rows where the date is between 8 December 2014 and 12 December 2014’, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london[(london['GMT'] &gt;= datetime(2014, 12, 8))</Paragraph><Paragraph>    &amp; (london['GMT'] &lt;= datetime(2014, 12, 12))]</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1010.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1010.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="059167f9" x_imagesrc="ou_futurelearn_learn_to_code_fig_1010.jpg" x_imagewidth="512" x_imageheight="274"/><Caption><b>Figure 9</b> </Caption><Alternative/><Description>The rows from the london dataframe where the date is between 8 December 2014 and 12 December 2014 (inclusive). Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i>Note that the right side of the table has been cropped to fit on the page. </i></Paragraph><Activity><Heading>Exercise 4 Display rows from dataframe</Heading><Question><Paragraph>Now try Exercise 4 in the Exercise notebook 2.</Paragraph><Paragraph>If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter.</Paragraph><Paragraph>Once the notebook is open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83247&amp;targetdoc=Week+1%3A+Having+a+go+at+it+Part+1&amp;targetptr=1.4">Week 1 Exercise 1.</a></Paragraph></Question></Activity></InternalSection></Section></Session><Session><Title>2 Every picture tells a story</Title><Paragraph>It can be difficult and confusing to look at a table of rows of numbers and make any meaningful interpretation especially if there are many rows and columns.</Paragraph><Paragraph>Handily, pandas has a method called <ComputerCode>
<b>plot()</b>
</ComputerCode> which will visualise data for us by producing a chart.</Paragraph><Paragraph>Before using the <ComputerCode>
<b>plot()</b>
</ComputerCode> method, the following line of code must be executed (once) which tells Jupyter to display all charts inside this notebook, immediately after each call to <ComputerCode>
<b>plot()</b>
</ComputerCode>:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>%matplotlib inline</ComputerCode></Paragraph><Paragraph>To plot <ComputerCode>
<b>'Max Wind SpeedKm/h'</b>
</ComputerCode>, it’s as simple as this code:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london['Max Wind SpeedKm/h'].plot(grid=True)</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1023.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1023.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="3299cc36" x_imagesrc="ou_futurelearn_learn_to_code_fig_1023.jpg" x_imagewidth="512" x_imageheight="222"/><Caption><b>Figure 10</b> </Caption><Alternative>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe.</Alternative><Description>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe.</Description></Figure><Paragraph>The <ComputerCode>
<b>grid=True</b>
</ComputerCode> argument makes the gridlines (the dotted lines in the image above) appear, which make values easier to read on the chart. The chart comes out a bit small, so you can make it bigger by giving the <ComputerCode>
<b>plot()</b>
</ComputerCode> method a <ComputerCode>figsize</ComputerCode> argument, a (width, height) pair measured in inches.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['Max Wind SpeedKm/h'].plot(grid=True, figsize=(10,5))
</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1024.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1024.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="6297c031" x_imagesrc="ou_futurelearn_learn_to_code_fig_1024.jpg" x_imagewidth="512" x_imageheight="268"/><Caption><b>Figure 11</b></Caption><Description>Larger version of the first chart on this page</Description></Figure><Paragraph>That’s better! The argument given to the <ComputerCode>
<b>plot()</b>
</ComputerCode> method, <ComputerCode>
<b>figsize=(10,5)</b>
</ComputerCode>, simply tells <ComputerCode>
<b>plot()</b>
</ComputerCode> that the chart should be 10 inches wide and 5 inches high. In the above graph the x-axis (the numbers at the bottom) shows the dataframe’s index, so 0 is 1 January and 50 is 20 February.</Paragraph><Paragraph>The y-axis (the numbers on the side) shows the range of wind speed in kilometres per hour. It is clear that the windiest day in 2014 was somewhere in mid-February and the wind reached about 66 kilometres per hour.</Paragraph><Paragraph>By default, the <ComputerCode>
<b>plot()</b>
</ComputerCode> method will try to generate a line chart, although, as you’ll see in a later week, it can produce other chart types too.</Paragraph><Activity><Heading>Exercise 5 Every picture tells a story</Heading><Question><Paragraph>Now try Exercise 5 in the Exercise notebook 2.</Paragraph><Paragraph>If you’re using Anaconda, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter.</Paragraph></Question></Activity><Section id="changing_a_dataframes_index"><Title>2.1 Changing a dataframe’s index</Title><Paragraph>We have seen that by default every dataframe has an integer index for its rows which starts from 0.</Paragraph><Paragraph>The dataframe we’ve been using, <ComputerCode>
<b>london</b>
</ComputerCode>, has an index that goes from <ComputerCode>
<b>0</b>
</ComputerCode> to <ComputerCode>
<b>364</b>
</ComputerCode>. The row indexed by <ComputerCode>
<b>0</b>
</ComputerCode> holds data for the first day of the year and the row indexed by <ComputerCode>
<b>364</b>
</ComputerCode> holds data for the last day of the year. However, the column <ComputerCode>
<b>'GMT'</b>
</ComputerCode> holds <ComputerCode>
<b>datetime64</b>
</ComputerCode> values which would make a more intuitive index.</Paragraph><Paragraph>Changing the index to <ComputerCode>
<b>datetime64</b>
</ComputerCode> values is as easy as assigning to the dataframe’s <ComputerCode>
<b>index</b>
</ComputerCode> attribute the contents of the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london.index = london['GMT']</Paragraph><Paragraph>#Display the first 2 rows</Paragraph><Paragraph>london.head(2)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1011.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1011.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="2439d1a6" x_imagesrc="ou_futurelearn_learn_to_code_fig_1011.jpg" x_imagewidth="512" x_imageheight="199"/><Caption><b>Figure 12</b> </Caption><Alternative>First 2 rows of the london dataframe showing that the index has been changed to the datetime64 values from the GMT column</Alternative><Description>First 2 rows of the london dataframe showing that the index has been changed to the datetime64 values from the GMT column. Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the right of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>Notice that the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column still remains and that the index has been labelled to show that it has been derived from the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column.</Paragraph><Paragraph>You can still access a row using the <ComputerCode>
<b>iloc</b>
</ComputerCode> attribute, so to get the first line in the dataframe you can simply execute:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london.iloc[0]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT 2014-01-01 00:00:00</Paragraph><Paragraph>Max TemperatureC 11</Paragraph><Paragraph>Mean TemperatureC 8</Paragraph><Paragraph>Min TemperatureC 6</Paragraph><Paragraph>Dew PointC 9</Paragraph><Paragraph>MeanDew PointC 7</Paragraph><Paragraph>Min DewpointC 4</Paragraph><Paragraph>Max Humidity 94</Paragraph><Paragraph>Mean Humidity 86</Paragraph><Paragraph>Min Humidity 73</Paragraph><Paragraph>Max Sea Level PressurehPa 1002</Paragraph><Paragraph>Mean Sea Level PressurehPa 993</Paragraph><Paragraph>Min Sea Level PressurehPa 984</Paragraph><Paragraph>Max VisibilityKm 31</Paragraph><Paragraph>Mean VisibilityKm 11</Paragraph><Paragraph>Min VisibilitykM 2</Paragraph><Paragraph>Max Wind SpeedKm/h 40</Paragraph><Paragraph>Mean Wind SpeedKm/h 26</Paragraph><Paragraph>Max Gust SpeedKm/h 66</Paragraph><Paragraph>Precipitationmm 9.91</Paragraph><Paragraph>CloudCover 4</Paragraph><Paragraph>Events Rain</Paragraph><Paragraph>WindDirDegrees 186</Paragraph><Paragraph>Name: 2014-01-01 00:00:00, dtype: object</Paragraph></ComputerDisplay><Paragraph>But now you can also use the <ComputerCode>
<b>datetime64</b>
</ComputerCode> index to get a row using the dataframe’s <ComputerCode>
<b>loc</b>
</ComputerCode> attribute, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london.loc[datetime(2014, 1, 1)]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT 2014-01-01 00:00:00</Paragraph><Paragraph>Max TemperatureC 11</Paragraph><Paragraph>Mean TemperatureC 8</Paragraph><Paragraph>Min TemperatureC 6</Paragraph><Paragraph>Dew PointC 9</Paragraph><Paragraph>MeanDew PointC 7</Paragraph><Paragraph>Min DewpointC 4</Paragraph><Paragraph>Max Humidity 94</Paragraph><Paragraph>Mean Humidity 86</Paragraph><Paragraph>Min Humidity 73</Paragraph><Paragraph>Max Sea Level PressurehPa 1002</Paragraph><Paragraph>Mean Sea Level PressurehPa 993</Paragraph><Paragraph>Min Sea Level PressurehPa 984</Paragraph><Paragraph>Max VisibilityKm 31</Paragraph><Paragraph>Mean VisibilityKm 11</Paragraph><Paragraph>Min VisibilitykM 2</Paragraph><Paragraph>Max Wind SpeedKm/h 40</Paragraph><Paragraph>Mean Wind SpeedKm/h 26</Paragraph><Paragraph>Max Gust SpeedKm/h 66</Paragraph><Paragraph>Precipitationmm 9.91</Paragraph><Paragraph>CloudCover 4</Paragraph><Paragraph>Events Rain</Paragraph><Paragraph>WindDirDegrees 186</Paragraph><Paragraph>Name: 2014-01-01 00:00:00, dtype: object</Paragraph></ComputerDisplay><Paragraph>A query such as ‘Return all the rows where the date is between 8 December and 12 December’ which you did before (and can still do) with:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london[(london['GMT'] &gt;= datetime(2014, 12, 8))</Paragraph><Paragraph>    &amp; (london['GMT'] &lt;= datetime(2014, 12, 12))]</Paragraph></ComputerDisplay><Paragraph>can now be done more succinctly like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
london.loc[datetime(2014,12,8) : datetime(2014,12,12)]
</Paragraph><Paragraph/><Paragraph>
#The meaning of the above code is: get the rows between
</Paragraph><Paragraph>#and including the indices datetime(2014,12,8) and</Paragraph><Paragraph>#datetime(2014,12,12)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1012.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1012.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="094925aa" x_imagesrc="ou_futurelearn_learn_to_code_fig_1012.jpg" x_imagewidth="512" x_imageheight="337"/><Caption><b>Figure 13</b> </Caption><Alternative>Rows from the london dataframe where the index is between 2014-12-08 and 2014-12-12 (inclusive).</Alternative><Description>Rows from the london dataframe where the index is between 2014-12-08 and 2014-12-12 (inclusive). Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the right of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>Because the table is in date order, we can be confident that only the rows with dates between 8 December 2014 and 12 December 2014 (inclusive) will be returned. However, if the table had not been in date order, we would have needed to sort it first, like this:</Paragraph><Paragraph><ComputerCode>london = london.sort_index()</ComputerCode></Paragraph><Paragraph>Now that there is a <ComputerCode>
<b>datetime64</b>
</ComputerCode> index, let’s plot <ComputerCode>
<b>'Max Wind SpeedKm/h'</b>
</ComputerCode> again:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['Max Wind SpeedKm/h'].plot(grid=True, figsize=(10,5))
</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1013.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1013.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="eb20ec94" x_imagesrc="ou_futurelearn_learn_to_code_fig_1013.jpg" x_imagewidth="512" x_imageheight="313"/><Caption><b>Figure 14</b> </Caption><Alternative>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe.</Alternative><Description>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe. Note that the legend for the x-axis has changed from numbers to month names. </Description></Figure><Paragraph>Now it is much clearer that the worst winds were in mid-February.</Paragraph><Activity><Heading>Exercise 6 Changing a dataframe’s index</Heading><Question><Paragraph>Now try Exercise 6 in the Exercise notebook 2.</Paragraph></Question></Activity></Section><Section><Title>2.2 The project</Title><Paragraph>Your project this week is to find out what would have been the best two weeks of weather for a 2014 vacation in a capital of a BRICS country.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1039.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1039.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="51178b01" x_imagesrc="ou_futurelearn_learn_to_code_fig_1039.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 15</b> </Caption><Alternative>An image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky</Alternative><Description>An 
image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky</Description></Figure><Paragraph>I’ve written up my analysis of the best two weeks of weather in London, UK, which you can open in Project 2: Holiday weather.</Paragraph><Paragraph>The structure is very simple: besides the introduction and the conclusions, there is one section for each step of the analysis – obtaining, cleaning and visualising the data.</Paragraph><Paragraph>Once you’ve worked through my analysis you should open a dataset for just one of the BRICS capitals: Brasilia, Moscow, Delhi, Beijing or Cape Town. The choice of capital is up to you. You should then work out the best two weeks, according to the weather, to choose for a two-week holiday in your chosen capital city.</Paragraph><Paragraph>Download the dataset for your chosen location as follows:</Paragraph><BulletedList><ListItem>Right-click on the name of your chosen capital city above.</ListItem><ListItem>Choose to save the file via ‘Download Linked File As...’. Save the file with its default name to your downloads folder.</ListItem><ListItem>If necessary, rename the file so that it has a .csv extension.</ListItem><ListItem>Finally, move or copy the file to the disk folder or CoCalc project you created in Week 1.</ListItem></BulletedList><Paragraph>Once again, <b>do not open the file with Excel</b>, but you could take a look using a text editor.</Paragraph><Paragraph>In my project, because I’m in London, which is often cold and rainy, I was looking for a two-week period that had relatively high temperatures and little rain. If you choose a capital in a particularly hot and dry country you will probably be looking for relatively cool weather and low humidity.</Paragraph><Paragraph>Note that the London file has the dates in a column named ‘GMT’ whereas in the BRICS files they are in a column named ‘Date’. You will need to change the Python code accordingly. 
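</Paragraph><Paragraph>As a rough sketch of that change, the code below repeats the main London steps with ‘Date’ wherever the London code used ‘GMT’. It is only an illustration: the three rows of data are made up, and the names <ComputerCode>csv_text</ComputerCode>, <ComputerCode>moscow</ComputerCode> and <ComputerCode>first_days</ComputerCode> are assumptions chosen for the example, so substitute the file and capital you actually picked.</Paragraph><ComputerDisplay><Paragraph><![CDATA[
```python
from datetime import datetime
from io import StringIO

from pandas import read_csv, to_datetime

# Made-up extract standing in for a BRICS capital's CSV file (assumption:
# the real file has many more rows and columns, with dates under 'Date').
csv_text = StringIO(
    "Date,Max TemperatureC, Mean Humidity\n"
    "2014-01-01,-3,85\n"
    "2014-01-02,-5,80\n"
    "2014-01-03,-2,90\n"
)

# Same steps as for London, with 'Date' wherever the London code used 'GMT'
moscow = read_csv(csv_text, skipinitialspace=True)
moscow['Date'] = to_datetime(moscow['Date'])
moscow.index = moscow['Date']

# Rows for 2 and 3 January, selected through the datetime64 index
first_days = moscow.loc[datetime(2014, 1, 2) : datetime(2014, 1, 3)]
print(first_days)
```
]]></Paragraph></ComputerDisplay><Paragraph>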
You should also change the name of the variable, <ComputerCode>london</ComputerCode>, according to the capital you choose.</Paragraph></Section></Session><Session><Title>3 This week’s quiz</Title><Paragraph>Now it’s time to complete the Week 4 badge quiz. It is similar to previous quizzes, but this time there will be fifteen questions instead of five.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78780">Week 4 compulsory badge quiz</a></Paragraph><Paragraph>Remember, this quiz counts towards your badge. If you’re not successful the first time, you can attempt the quiz again in 24 hours.</Paragraph></Session><Session><Title>4 Summary</Title><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1052.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1052.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="e2772382" x_imagesrc="ou_futurelearn_learn_to_code_fig_1052.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 16</b> </Caption><Alternative>An image of storm clouds and a double rainbow above a field with a dirt road</Alternative><Description>An image of storm clouds and a double rainbow above a field with a dirt road</Description></Figure><Paragraph>This week you have learned how to: </Paragraph><BulletedList><ListItem>load a dataset into a dataframe from a CSV file</ListItem><ListItem>clean data</ListItem><ListItem>use the data to get answers to your questions.</ListItem></BulletedList><Paragraph>Next week you will learn about the techniques behind the creation of a combined dataset. </Paragraph><Paragraph>You are now halfway through the course. 
The Open University would really appreciate your feedback and suggestions for future improvement in our optional <a href="https://www.surveymonkey.co.uk/r/BOCENDlearntocode">end-of-course survey</a>, which you will also have an opportunity to complete at the end of Week 8. Participation will be completely confidential and we will not pass on your details to others.</Paragraph></Session><Session><Title>4.1 Week 4 glossary</Title><Paragraph>Here is an alphabetical list of the terms introduced this week, for quick look-up.</Paragraph><InternalSection><Heading>Programming and data analysis concepts</Heading><Paragraph>The <b>bitwise operators</b> <ComputerCode>
<b>&amp;</b>
</ComputerCode> (and) and <ComputerCode>
<b>|</b>
</ComputerCode> (or) are used in pandas to build more complicated expressions from two comparison expressions (typically involving column comparisons).</Paragraph><Paragraph>A <b>Boolean</b> has one of two possible values: <ComputerCode>
<b>True</b>
</ComputerCode> or <ComputerCode>
<b>False</b>
</ComputerCode>.</Paragraph><Paragraph>A <b>Comma Separated Values (CSV)</b> file is a plain text file that is used to hold tabular data.</Paragraph><Paragraph>A <b>list</b> is a sequence of values, separated by commas, and written within square brackets.</Paragraph><Paragraph>There are six <b>comparison operators</b> that can be used to compare number, string and date values. Expressions composed of these operators evaluate to <ComputerCode>
<b>True</b>
</ComputerCode> or <ComputerCode>
<b>False</b>
</ComputerCode>. These operators can also be used to compare every value in a column, row by row, against some number, string or date value. When used in this manner the operators return a series of Boolean values.</Paragraph><Paragraph>The <b>‘dot’ notation</b> is used to access a dataframe’s methods and attributes.</Paragraph><Paragraph>The <ComputerCode>
<b>Series</b>
</ComputerCode> data type is a collection of values with an index, which by default is made of integers starting from zero. Each column in a dataframe is an example of the <ComputerCode>
<b>Series</b>
</ComputerCode> data type. The <ComputerCode>
<b>Series</b>
</ComputerCode> data type has many of the same methods as the <ComputerCode>
<b>DataFrame</b>
</ComputerCode> data type.</Paragraph><Paragraph>The <ComputerCode>
<b>object</b>
</ComputerCode> data type is how pandas represents strings.</Paragraph><Paragraph>The <ComputerCode>
<b>datetime64</b>
</ComputerCode> data type is how pandas represents dates.</Paragraph><Paragraph>The <ComputerCode>
<b>int64</b>
</ComputerCode> data type is how pandas represents integers (whole numbers).</Paragraph><Paragraph>The <ComputerCode>
<b>float64</b>
</ComputerCode> data type is how pandas represents floating point numbers (decimals).</Paragraph></InternalSection><InternalSection><Heading>Functions and methods</Heading><Paragraph><ComputerCode>
<b>astype(aType)</b>
</ComputerCode> when applied to a dataframe column, the method changes the data type of each value in that column to the type given by the string <ComputerCode>
<b>aType</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>datetime(yyyy, mm, dd)</b>
</ComputerCode> the function takes three arguments, <ComputerCode>
<b>yyyy</b>
</ComputerCode> a four digit integer representing a year, <ComputerCode>
<b>mm</b>
</ComputerCode> a two digit integer representing a month and <ComputerCode>
<b>dd</b>
</ComputerCode> a two digit integer representing a day. From these arguments the function creates and returns a value of <ComputerCode>
<b>datetime64</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>dropna()</b>
</ComputerCode> when applied to a dataframe returns a new dataframe without the rows that have at least one missing value.</Paragraph><Paragraph><ComputerCode>
<b>head()</b>
</ComputerCode> gets and displays the first five rows of a dataframe. Optionally the method can take an integer argument to specify how many rows (from and including row 0) to get and display.</Paragraph><Paragraph><ComputerCode>
<b>iloc[index]</b>
</ComputerCode> gets and displays the row in the dataframe indicated by the integer argument <ComputerCode>
<b>index</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>isnull()</b>
</ComputerCode> is a series method that checks which rows in that series have a missing value.</Paragraph><Paragraph><ComputerCode>
<b>fillna(value)</b>
</ComputerCode> is a series method that returns a new series in which all missing values have been filled with the given value.</Paragraph><Paragraph><ComputerCode>
<b>plot()</b>
</ComputerCode> when applied to a dataframe column of numeric values, the method displays a graph of those values. The x-axis shows the dataframe’s index and the y-axis the range of the column’s values. Before the method is called you first need to execute <ComputerCode>
<b>%matplotlib inline</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>read_csv(csvFile)</b>
</ComputerCode> creates a dataframe from the dataset in the CSV file.</Paragraph><Paragraph><ComputerCode>
<b>rename(columns={oldName : newName})</b>
</ComputerCode> renames the column <ComputerCode>
<b>oldName</b>
</ComputerCode> to <ComputerCode>
<b>newName</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>str.rstrip(suffix)</b>
</ComputerCode> when applied to a dataframe column of string values, the method removes the argument <ComputerCode>
<b>suffix</b>
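</ComputerCode> from the end of each string value in the column. Strictly, the argument is treated as a set of characters, and every trailing character that belongs to that set is removed.</Paragraph><Paragraph><ComputerCode>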
<b>tail()</b>
</ComputerCode> gets and displays the last five rows of a dataframe. Optionally the method can take an integer argument to specify how many rows (until and including the last row) to get and display.</Paragraph><Paragraph><ComputerCode>
<b>to_datetime(aSeries)</b>
</ComputerCode> when applied to a series, typically a column from a dataframe, this function returns a new series in which each value in <ComputerCode>
<b>aSeries</b>
</ComputerCode> has been changed to type <ComputerCode>
<b>datetime64</b>
</ComputerCode>.</Paragraph></InternalSection></Session></Unit><Unit><UnitID/><UnitTitle>Week 4: Cleaning up our act Part 2</UnitTitle><Session><Title>1 Loading the weather data</Title><Paragraph>You have learned some more about Python and the pandas module and tried it out on a fairly small dataset. You are now ready to explore a dataset from the Weather Underground.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1039.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1039.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="51178b01" x_imagesrc="ou_futurelearn_learn_to_code_fig_1039.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b> </Caption><Alternative>An image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Alternative><Description>An image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky.</Description></Figure><Paragraph>Open the file London_2014.csv and save it in the disk folder or CoCalc project you created in Week 1.</Paragraph><Paragraph><b>Do not be tempted to open this file with Excel</b> as this application will attempt to localise the data in the file, i.e. use your country’s local data formats, which will make much of what follows rather incomprehensible! You can if you like open the file with a simple text editor, but <b>do not make any changes</b>.</Paragraph><Paragraph>The CSV file can be loaded into a dataframe by executing the following code:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>from pandas import *</Paragraph><Paragraph>london = read_csv('London_2014.csv')</Paragraph><Paragraph>london.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1006.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1006.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="47f77fd6" x_imagesrc="ou_futurelearn_learn_to_code_fig_1006.jpg" x_imagewidth="512" x_imageheight="214"/><Caption><b>Figure 2</b> </Caption><Alternative>First 5 rows of the London dataframe. </Alternative><Description>First 5 rows of the London dataframe. Note that only the first few columns are shown due to the limitation of page width.</Description></Figure><Paragraph><i> Note that the right hand side of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>In the next section, you’ll find out how to remove rogue spaces.</Paragraph><InternalSection><Heading>Important notice for learners outside of the EU</Heading><Paragraph>The Weather Underground automatically localises data based on the country from which it detects you are accessing the website. So, for example, if you are accessing the website from the USA, wind speeds will be in mph rather than km/h and temperatures in Fahrenheit rather than Celsius.</Paragraph><Paragraph>In order to change the settings so that the data is in European format you will need to click on the ‘head and shoulders’ icon on the top right of the Weather Underground web page and create a free Weather Underground account.</Paragraph><Paragraph>Once you have created an account, click on the ‘cog’ icon on the top right of the web page. 
Then:</Paragraph><BulletedList><ListItem>click on the C button to select Celsius</ListItem><ListItem>click on ‘More Settings’ and select Units: metric</ListItem><ListItem>click on ‘Save My Preferences’.</ListItem></BulletedList><Paragraph>Now, when you download the data, temperatures will be in Celsius and wind speeds in km/h etc.</Paragraph></InternalSection><Section><Title>1.1 Removing rogue spaces</Title><Paragraph>One of the problems often encountered with CSV files is rogue spaces before or after data values or column names.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1045.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1045.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="fef0227c" x_imagesrc="ou_futurelearn_learn_to_code_fig_1045.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b> </Caption><Alternative>An image of empty, numbered parking spaces</Alternative><Description>An image of empty, numbered parking spaces</Description></Figure><Paragraph>You learned earlier, in ‘What is a CSV file?’, that each value or column name is separated by a comma. However, if you opened ‘London_2014.csv’ in a text editor, you would see that in the row of column names sometimes there are spaces after a comma:</Paragraph><Quote><Paragraph>GMT,Max TemperatureC,Mean TemperatureC,Min TemperatureC,Dew PointC,MeanDew PointC,Min DewpointC,Max Humidity, Mean Humidity, Min Humidity, Max Sea Level PressurehPa, Mean Sea Level PressurehPa, Min Sea Level PressurehPa, Max VisibilityKm, Mean VisibilityKm, Min VisibilitykM, Max Wind SpeedKm/h, Mean Wind SpeedKm/h, Max Gust SpeedKm/h,Precipitationmm, CloudCover, Events,WindDirDegrees<br/></Paragraph></Quote><!--
Please leave the apostrophes as they are below as this is illustrating a point  it isn't wrong. 
--><Paragraph>For example, there is a space after the comma between <ComputerCode>
<b>Max Humidity</b>
</ComputerCode> and <ComputerCode>
<b>Mean Humidity</b>
</ComputerCode>. This means that when <ComputerCode>
<b>read_csv()</b>
</ComputerCode> reads the row of column names it will interpret a space after a comma as part of the next column name. So, for example, the column name after <ComputerCode>
<b>'Max Humidity'</b>
</ComputerCode> will be interpreted as <ComputerCode>
<b>' Mean Humidity'</b>
</ComputerCode> rather than what was intended, which is <ComputerCode>
<b>'Mean Humidity'</b>
</ComputerCode>. The consequence is that code such as:</Paragraph><Paragraph><ComputerCode>london[['Mean Humidity']]</ComputerCode></Paragraph><Paragraph>will cause a key error (see <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83247&amp;targetdoc=Week+2%3A+Having+a+go+at+it+Part+2&amp;targetptr=1.3">Selecting a column</a>), as the column name is confusingly <ComputerCode>
<b>' Mean Humidity</b>
</ComputerCode> '.</Paragraph><Paragraph>This can easily be rectified by adding another argument to the <ComputerCode>
<b>read_csv()</b>
</ComputerCode> function:</Paragraph><Paragraph><ComputerCode>skipinitialspace=True</ComputerCode></Paragraph><Paragraph>which will tell <ComputerCode>
<b>read_csv()</b>
</ComputerCode> to ignore any spaces after a comma:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
london = read_csv('London_2014.csv', skipinitialspace=True)
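</Paragraph></ComputerDisplay><Paragraph>The effect of the argument can be checked on a tiny example. The following is a minimal sketch rather than part of the course notebook: the two-column CSV text is made up for illustration, pandas is imported as <ComputerCode><b>pd</b></ComputerCode>, and <ComputerCode><b>StringIO</b></ComputerCode> stands in for a file on disk. It reads the same text with and without the argument and collects the resulting column names so they can be compared.</Paragraph><ComputerDisplay><Paragraph>
```python
from io import StringIO

import pandas as pd

# Made-up CSV text with a space after the comma (not the real weather file)
csv_text = "Max Humidity, Mean Humidity\n94, 86\n"

# Default behaviour: the space becomes part of the second column name
raw = pd.read_csv(StringIO(csv_text))
raw_names = list(raw.columns)

# With skipinitialspace=True the space after the comma is ignored
clean = pd.read_csv(StringIO(csv_text), skipinitialspace=True)
clean_names = list(clean.columns)

print(raw_names)    # second name keeps its leading space
print(clean_names)  # second name has no leading space
```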
</Paragraph></ComputerDisplay><Paragraph>The rogue spaces will no longer be in the dataframe and we can write code such as:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[['Mean Humidity']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th><b>Mean Humidity</b></th></tr><tr><td>0</td><td>86</td></tr><tr><td>1</td><td>81</td></tr><tr><td>2</td><td>76</td></tr><tr><td>3</td><td>85</td></tr><tr><td>4</td><td>88</td></tr></tbody></Table><Paragraph>Note that a <ComputerCode>
<b>skipinitialspace=True</b>
</ComputerCode> argument won’t remove a trailing space at the end of a column name.</Paragraph><Paragraph>Next, find out about extra characters and how to remove them.</Paragraph></Section><Section><Title>1.2 Removing extra characters</Title><Paragraph>If you opened London_2014.csv in a text editor once again and looked at the last column name you would see that the name is 'WindDirDegrees<br/>'.</Paragraph><Paragraph>What has happened here is that when the dataset was exported from the Weather Underground website an HTML line break <ComputerCode>
<b>(<br/>)</b>
</ComputerCode> was added after the line of column headers which <ComputerCode>
<b>read_csv()</b>
</ComputerCode> has interpreted as the last part of the final column’s name.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1050.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1050.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="396fba4f" x_imagesrc="ou_futurelearn_learn_to_code_fig_1050.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 4</b> </Caption><Alternative>An image of two bouncers in suits standing in a corridor</Alternative><Description>An image of two bouncers in suits standing in a corridor</Description></Figure><Paragraph>In fact, the problem is worse than this. Let’s look at some values in the final column:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[['WindDirDegrees<br/>']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th><b> WindDirDegrees <br/> </b></th></tr><tr><td>0</td><td>186<br/></td></tr><tr><td>1</td><td>214<br/></td></tr><tr><td>2</td><td>219<br/></td></tr><tr><td>3</td><td>211<br/></td></tr><tr><td>4</td><td>199<br/></td></tr></tbody></Table><Paragraph>It seems there is an HTML line break at the end of each line. Opening ‘London_2014.csv’ in a text editor and looking at the ends of the lines in the file would confirm this.</Paragraph><Paragraph>Once again I’m not going to edit the CSV file but rather fix the problem in the dataframe. To change <ComputerCode>
<b>'WindDirDegrees<br/>'</b>
</ComputerCode> to <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> all I have to do is use the <ComputerCode>
<b>rename()</b>
</ComputerCode> method as follows:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london = london.rename(columns={'WindDirDegrees<br/>':'WindDirDegrees'})
</ComputerCode></Paragraph><Paragraph>Don’t worry about the syntax of the argument for <ComputerCode>
<b>rename()</b>
</ComputerCode>; just use this example as a template whenever you need to change the name of a column.</Paragraph><Paragraph>Now I need to get rid of those pesky <ComputerCode>
<b><br/></b>
</ComputerCode> HTML line breaks from the ends of the values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column, so that they become something sensible. I can do that using the string method <ComputerCode>
<b>rstrip()</b>
</ComputerCode>, which removes characters from the right-hand end of a string (the ‘r’ stands for ‘right’), like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br/>')
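</ComputerCode></Paragraph><Paragraph>One detail of the call above is worth knowing: <ComputerCode><b>rstrip()</b></ComputerCode> treats its argument as a set of characters to remove from the right-hand end, not as one exact piece of text. The following is a minimal sketch, not part of the course notebook: the three values are made up for illustration and pandas is imported as <ComputerCode><b>pd</b></ComputerCode>. It suggests why this is harmless for values that end in digits but could matter for text: the two numeric strings come out intact, while the invented third value loses its final letters as well as the line break.</Paragraph><Paragraph><ComputerCode>
```python
import pandas as pd

# Made-up values for illustration, not taken from London_2014.csv
values = pd.Series(['186<br/>', '214<br/>', 'bar<br/>'])

# rstrip() keeps removing trailing characters for as long as they
# belong to the set given in its argument
stripped = values.str.rstrip('<br/>')

# Convert to a plain list so the results are easy to inspect
result = stripped.tolist()
print(result)  # the third value ends up as 'ba'
```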
</ComputerCode></Paragraph><Paragraph>Again, don’t worry too much about the syntax of the code; simply use it as a template whenever you need to strip characters from the end of every string value in a column.</Paragraph><Paragraph>Let’s display the first few rows of the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column to confirm the changes:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[['WindDirDegrees']].head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th><b>WindDirDegrees</b></th></tr><tr><td>0</td><td>186</td></tr><tr><td>1</td><td>214</td></tr><tr><td>2</td><td>219</td></tr><tr><td>3</td><td>211</td></tr><tr><td>4</td><td>199</td></tr></tbody></Table></Section><Section id="nmb_43l_sxb"><Title>1.3 Missing values</Title><Paragraph>As you heard in the video at the start of the week, missing values (also called null values) are one of the reasons to clean data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1051.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1051.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="8c07af78" x_imagesrc="ou_futurelearn_learn_to_code_fig_1051.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 5</b> </Caption><Alternative>An image of a girl with the last piece of a jigsaw puzzle</Alternative><Description>An image of a girl with the last piece of a jigsaw puzzle</Description></Figure><Paragraph>Finding missing values in a particular column can be done with the column method <ComputerCode>
<b>isnull()</b>
</ComputerCode>, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london['Events'].isnull()</ComputerCode></Paragraph><Paragraph>The above code returns a series of Boolean values, where <ComputerCode>
<b>True</b>
</ComputerCode> indicates that the corresponding row in the <ComputerCode>
<b>'Events'</b>
</ComputerCode> column is missing a value and <ComputerCode>
<b>False</b>
</ComputerCode> indicates the presence of a value. Here are the last few rows from the series:</Paragraph><ComputerDisplay><Paragraph>...</Paragraph><Paragraph>360 False</Paragraph><Paragraph>361 True</Paragraph><Paragraph>362 True</Paragraph><Paragraph>363 True</Paragraph><Paragraph>364 False</Paragraph><Paragraph>Name: Events, dtype: bool</Paragraph></ComputerDisplay><Paragraph>If, as you did with the comparison expressions, you put this code within square brackets after the dataframe’s name, it will return a new dataframe consisting of all the rows without recorded events (rain, fog, thunderstorm, etc.):</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[london['Events'].isnull()]</ComputerCode></Paragraph><Paragraph>As you will see in Exercise 4 of the exercise notebook, this will return a new dataframe with 114 rows, showing that almost one in three days had no particular event recorded. If you scroll the table to the right, you will see that all values in the <ComputerCode>
<b>'Events'</b>
</ComputerCode> column are marked <ComputerCode>
<b>NaN</b>
</ComputerCode>, which stands for ‘Not a Number’, but is also used to mark non-numeric missing values, as in this case (events are strings, not numbers).</Paragraph><Paragraph>Once you know how much data is missing and where, you have to decide what to do: ignore those rows? Replace with a fixed value? Replace with a computed value, like the mean?</Paragraph><Paragraph>In this case, only the first two options are possible, because events are strings and have no mean. The method call <ComputerCode>
<b>london.dropna()</b>
</ComputerCode> will drop (remove) all rows that have a missing (non-available) value somewhere, returning a new dataframe. This will therefore also remove rows that have missing values in other columns.</Paragraph><Paragraph>The column method <ComputerCode>
<b>fillna()</b>
</ComputerCode> will replace all non-available values with the value given as argument. For this case, each NaN could be replaced by the empty string.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london['Events'] = london['Events'].fillna('')</Paragraph><Paragraph>london[london['Events'].isnull()]</Paragraph></ComputerDisplay><Paragraph>The second line above will now show an empty dataframe, because there are no longer missing values in the events column.</Paragraph><Paragraph>As a final note on missing values, pandas ignores them when computing numeric statistics, i.e. you don’t have to remove missing values before applying <ComputerCode>
<b>sum(), median()</b>
</ComputerCode> and other similar methods.</Paragraph><Paragraph>Learn about checking data types of each column in the next section.</Paragraph></Section><Section><Title>1.4 Changing the value types of columns</Title><Paragraph>The function <ComputerCode>
<b>read_csv()</b>
</ComputerCode> may, for many reasons, wrongly interpret the data type of the values in a column, so when cleaning data it’s important to check that the data type of each column is what you expect and, if necessary, change it.</Paragraph><Paragraph>The data type of every column in a dataframe can be determined by looking at the dataframe’s <ComputerCode>
<b>dtypes</b>
</ComputerCode> attribute, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london.dtypes</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT object</Paragraph><Paragraph>Max TemperatureC int64</Paragraph><Paragraph>Mean TemperatureC int64</Paragraph><Paragraph>Min TemperatureC int64</Paragraph><Paragraph>Dew PointC int64</Paragraph><Paragraph>MeanDew PointC int64</Paragraph><Paragraph>Min DewpointC int64</Paragraph><Paragraph>Max Humidity int64</Paragraph><Paragraph>Mean Humidity int64</Paragraph><Paragraph>Min Humidity int64</Paragraph><Paragraph>Max Sea Level PressurehPa int64</Paragraph><Paragraph>Mean Sea Level PressurehPa int64</Paragraph><Paragraph>Min Sea Level PressurehPa int64</Paragraph><Paragraph>Max VisibilityKm int64</Paragraph><Paragraph>Mean VisibilityKm int64</Paragraph><Paragraph>Min VisibilitykM int64</Paragraph><Paragraph>Max Wind SpeedKm/h int64</Paragraph><Paragraph>Mean Wind SpeedKm/h int64</Paragraph><Paragraph>Max Gust SpeedKm/h float64</Paragraph><Paragraph>Precipitationmm float64</Paragraph><Paragraph>CloudCover float64</Paragraph><Paragraph>Events object</Paragraph><Paragraph>WindDirDegrees object</Paragraph><Paragraph>dtype: object</Paragraph></ComputerDisplay><Paragraph>In the above output, you can see the column names to the left and to the right the data types of the values in those columns.</Paragraph><BulletedList><ListItem><ComputerCode>
<b>int64</b>
</ComputerCode> is the pandas data type for whole numbers such as <ComputerCode>
<b>55</b>
</ComputerCode> or <ComputerCode>
<b>2356</b>
</ComputerCode></ListItem><ListItem><ComputerCode>
<b>float64</b>
</ComputerCode> is the pandas data type for decimal numbers such as <ComputerCode>
<b>55.25</b>
</ComputerCode> or <ComputerCode>
<b>2356.00</b>
</ComputerCode></ListItem><ListItem><ComputerCode>
<b>object</b>
</ComputerCode> is the pandas data type for strings such as <ComputerCode>
<b>'hello world'</b>
</ComputerCode> or <ComputerCode>
<b>'rain'</b>
</ComputerCode></ListItem></BulletedList><Paragraph>Most of the column data types seem fine; however, two are of concern: <ComputerCode>
<b>'GMT'</b>
</ComputerCode> and <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode>, both of which are of type <ComputerCode>
<b>object</b>
</ComputerCode>. Let’s take a look at <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> first.</Paragraph><InternalSection><Heading> Changing the data type of the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column </Heading><Paragraph>The <ComputerCode>
<b>read_csv()</b>
</ComputerCode> function has interpreted the values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column as strings (type <ComputerCode>
<b>object</b>
</ComputerCode>). This is because in the CSV file the values in that column had all been suffixed with the HTML line break string <ComputerCode>
<b><br/></b>
</ComputerCode> so <ComputerCode>
<b>read_csv()</b>
</ComputerCode> had no alternative but to interpret the values as strings.</Paragraph><Paragraph>The values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column are meant to represent wind direction in terms of degrees from true north (360) and meteorologists always define the wind direction as the direction the wind is coming from. So if you stand so that the wind is blowing directly into your face, the direction you are facing names the wind, so a westerly wind is reported as 270 degrees. The compass rose shown below should make this clearer:</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1007.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1007.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="894f0d0a" x_imagesrc="ou_futurelearn_learn_to_code_fig_1007.jpg" x_imagewidth="512" x_imageheight="273"/><Caption><b>Figure 6</b> A compass rose </Caption></Figure><Paragraph>We need to be able to make queries such as ‘Get and display the rows where the wind direction is greater than 350 degrees’. To do this we need to change the data type of the ‘WindDirDegrees’ column from object to type <ComputerCode>
<b>int64</b>
</ComputerCode>. We can do that by using the <ComputerCode>
<b>astype()</b>
</ComputerCode> method like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')
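</ComputerCode></Paragraph><Paragraph>As a small illustration of what <ComputerCode><b>astype()</b></ComputerCode> does, here is a minimal sketch, not part of the course notebook: the values are made up and pandas is imported as <ComputerCode><b>pd</b></ComputerCode>. It converts a series of strings to whole numbers, and it also suggests why the line breaks had to be stripped first: a value that still carries the extra characters cannot be converted, and the attempt ends in a <ComputerCode><b>ValueError</b></ComputerCode>.</Paragraph><Paragraph><ComputerCode>
```python
import pandas as pd

# Made-up wind directions stored as strings, as read_csv() left them
directions = pd.Series(['186', '214', '355'])

# astype('int64') returns a new series of whole numbers
as_numbers = directions.astype('int64')
dtype_name = str(as_numbers.dtype)
largest = int(as_numbers.max())

# A value that still has the line break attached cannot be converted
try:
    pd.Series(['186<br/>']).astype('int64')
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(dtype_name, largest, conversion_failed)
```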
</ComputerCode></Paragraph><Paragraph>Now all the values in the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column are of type <ComputerCode>
<b>int64</b>
</ComputerCode> and we can make our query:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[london['WindDirDegrees'] &gt; 350]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1008.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1008.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="471ef2e3" x_imagesrc="ou_futurelearn_learn_to_code_fig_1008.jpg" x_imagewidth="512" x_imageheight="254"/><Caption><b>Figure 7</b> </Caption><Alternative>Rows from the london dataframe where the value in the WindDirDegrees column is greater than 350.</Alternative><Description>Rows from the london dataframe where the value in the WindDirDegrees column is greater than 350. Note that the WindDirDegrees column is not shown as it is on the far right of the table and only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column is on the far right of the table and the right of the table has been cropped to fit on the page. </i></Paragraph></InternalSection><InternalSection><Heading>Changing the data type of the ‘GMT’ column</Heading><Paragraph>Recall that I noted that the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column was of type <ComputerCode>
<b>object</b>
</ComputerCode>, the type pandas uses for strings.</Paragraph><Paragraph>The <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column is supposed to represent dates. It would be helpful if the date values were not strings, so that you can make queries of the data such as ‘Return the row where the date is 4 June 2014’.</Paragraph><Paragraph>Pandas has a function called <ComputerCode>
<b>to_datetime()</b>
</ComputerCode> which can convert a column of <ComputerCode>
<b>object</b>
</ComputerCode> (string) values such as those in the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column into values of a proper date type called <ComputerCode>
<b>datetime64</b>
</ComputerCode>, just like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london['GMT'] = to_datetime(london['GMT'])</Paragraph><Paragraph>
#Then display the types of all the columns again so we
</Paragraph><Paragraph>#can check the changes have been made.</Paragraph><Paragraph>london.dtypes</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT datetime64[ns]</Paragraph><Paragraph>Max TemperatureC int64</Paragraph><Paragraph>Mean TemperatureC int64</Paragraph><Paragraph>Min TemperatureC int64</Paragraph><Paragraph>Dew PointC int64</Paragraph><Paragraph>MeanDew PointC int64</Paragraph><Paragraph>Min DewpointC int64</Paragraph><Paragraph>Max Humidity int64</Paragraph><Paragraph>Mean Humidity int64</Paragraph><Paragraph>Min Humidity int64</Paragraph><Paragraph>Max Sea Level PressurehPa int64</Paragraph><Paragraph>Mean Sea Level PressurehPa int64</Paragraph><Paragraph>Min Sea Level PressurehPa int64</Paragraph><Paragraph>Max VisibilityKm int64</Paragraph><Paragraph>Mean VisibilityKm int64</Paragraph><Paragraph>Min VisibilitykM int64</Paragraph><Paragraph>Max Wind SpeedKm/h int64</Paragraph><Paragraph>Mean Wind SpeedKm/h int64</Paragraph><Paragraph>Max Gust SpeedKm/h float64</Paragraph><Paragraph>Precipitationmm float64</Paragraph><Paragraph>CloudCover float64</Paragraph><Paragraph>Events object</Paragraph><Paragraph>WindDirDegrees int64</Paragraph><Paragraph>dtype: object</Paragraph></ComputerDisplay><Paragraph>From the above output, we can confirm that the <ComputerCode>
<b>'WindDirDegrees'</b>
</ComputerCode> column type has been changed from <ComputerCode>
<b>object</b>
</ComputerCode> to <ComputerCode>
<b>int64</b>
</ComputerCode> and that the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column type has been changed from <ComputerCode>
<b>object</b>
</ComputerCode> to <ComputerCode>
<b>datetime64</b>
</ComputerCode>.</Paragraph><Paragraph>To make queries such as ‘Return the row where the date is 4 June 2014’ you’ll need to be able to create a <ComputerCode>
<b>datetime64</b>
</ComputerCode> value to represent 4 June 2014. It cannot be:</Paragraph><Paragraph><ComputerCode>london[london['GMT'] == '2014-6-4']</ComputerCode></Paragraph><Paragraph>because ‘2014-6-4’ is a string and the values in the ‘GMT’ column are of type <ComputerCode>
<b>datetime64</b>
</ComputerCode>. Instead you must create a <ComputerCode>
<b>datetime64</b>
</ComputerCode> value using the <ComputerCode>
<b>datetime()</b>
</ComputerCode> function like this:</Paragraph><Paragraph><ComputerCode>datetime(2014, 6, 4)</ComputerCode></Paragraph><Paragraph>In the function call above, the first integer argument is the year, the second the month and the third the day.</Paragraph><Paragraph>First import the <ComputerCode><b>datetime()</b></ComputerCode> function from the similarly named <ComputerCode><b>datetime</b></ComputerCode> module by running the following line of code:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>from datetime import datetime</ComputerCode></Paragraph><Paragraph>Let’s try the function out by executing the code to ‘Return the row where the date is 4 June 2014’:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london[london['GMT'] == datetime(2014, 6, 4)]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1009.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1009.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="ef710c2d" x_imagesrc="ou_futurelearn_learn_to_code_fig_1009.jpg" x_imagewidth="512" x_imageheight="113"/><Caption><b>Figure 8</b> </Caption><Description>The row from the london dataframe where the date is 4 June 2014. Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the right side of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>You can also now make more complex queries involving dates such as ‘Return all the rows where the date is between 8 December 2014 and 12 December 2014’, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london[(london['GMT'] &gt;= datetime(2014, 12, 8)) </Paragraph><Paragraph>    &amp; (london['GMT'] &lt;= datetime(2014, 12, 12))]</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1010.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1010.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="059167f9" x_imagesrc="ou_futurelearn_learn_to_code_fig_1010.jpg" x_imagewidth="512" x_imageheight="274"/><Caption><b>Figure 9</b> </Caption><Alternative/><Description>The rows from the london dataframe where the date is between 8 December 2014 and 12 December 2014 (inclusive). Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i>Note that the right side of the table has been cropped to fit on the page. </i></Paragraph><Activity><Heading>Exercise 4 Display rows from dataframe</Heading><Question><Paragraph>Now try Exercise 4 in the Exercise notebook 2.</Paragraph><Paragraph>If you’re using Anaconda instead of CoCalc, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter.</Paragraph><Paragraph>Once the notebook is open, run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83247&amp;targetdoc=Week+1%3A+Having+a+go+at+it+Part+1&amp;targetptr=1.4">Week 1 Exercise 1.</a></Paragraph></Question></Activity></InternalSection></Section></Session><Session><Title>2 Every picture tells a story</Title><Paragraph>It can be difficult and confusing to look at a table of rows of numbers and make any meaningful interpretation especially if there are many rows and columns.</Paragraph><Paragraph>Handily, pandas has a method called <ComputerCode>
<b>plot()</b>
</ComputerCode> which will visualise data for us by producing a chart.</Paragraph><Paragraph>Before using the <ComputerCode>
<b>plot()</b>
</ComputerCode> method, the following line of code must be executed (once) which tells Jupyter to display all charts inside this notebook, immediately after each call to <ComputerCode>
<b>plot()</b>
</ComputerCode>:</Paragraph>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>%matplotlib inline</ComputerCode></Paragraph><Paragraph>To plot <ComputerCode>
<b>'Max Wind SpeedKm/h'</b>
</ComputerCode>, it’s as simple as this code:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london['Max Wind SpeedKm/h'].plot(grid=True)</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1023.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1023.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="3299cc36" x_imagesrc="ou_futurelearn_learn_to_code_fig_1023.jpg" x_imagewidth="512" x_imageheight="222"/><Caption><b>Figure 10</b> </Caption><Alternative>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe.</Alternative><Description>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe.</Description></Figure><Paragraph>The <ComputerCode>
<b>grid=True</b>
</ComputerCode> argument makes the gridlines (the dotted lines in the image above) appear, which make values easier to read on the chart. The chart comes out a bit small, so you can make it bigger by giving the <ComputerCode>
<b>plot()</b>
</ComputerCode> method some extra information. The figsize units are inches.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['Max Wind SpeedKm/h'].plot(grid=True, figsize=(10,5))
</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1024.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1024.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="6297c031" x_imagesrc="ou_futurelearn_learn_to_code_fig_1024.jpg" x_imagewidth="512" x_imageheight="268"/><Caption><b>Figure 11</b></Caption><Description>Larger version of the first chart on this page</Description></Figure><Paragraph>That’s better! The argument given to the <ComputerCode>
<b>plot()</b>
</ComputerCode> method, <ComputerCode>
<b>figsize=(10,5)</b>
</ComputerCode>, simply tells <ComputerCode>
<b>plot()</b>
</ComputerCode> that the chart should be 10 inches wide and 5 inches high. In the above graph the x-axis (the numbers at the bottom) shows the dataframe’s index, so 0 is 1 January and 50 is 20 February.</Paragraph><Paragraph>The y-axis (the numbers on the side) shows the range of wind speed in kilometres per hour. It is clear that the windiest day in 2014 was somewhere in mid-February and the wind reached about 66 kilometres per hour.</Paragraph><Paragraph>By default, the <ComputerCode>
<b>plot()</b>
</ComputerCode> method will try to generate a line chart, although, as you’ll see in a later week, it can produce other chart types too.</Paragraph><Activity><Heading>Exercise 5 Every picture tells a story</Heading><Question><Paragraph>Now try Exercise 5 in the Exercise notebook 2.</Paragraph><Paragraph>If you’re using Anaconda, remember that to open the notebook you’ll need to navigate to the notebook using Jupyter.</Paragraph></Question></Activity><Section id="ihp_43l_sxb"><Title>2.1 Changing a dataframe’s index</Title><Paragraph>We have seen that by default every dataframe has an integer index for its rows which starts from 0.</Paragraph><Paragraph>The dataframe we’ve been using, <ComputerCode>
<b>london</b>
</ComputerCode>, has an index that goes from <ComputerCode>
<b>0</b>
</ComputerCode> to <ComputerCode>
<b>364</b>
</ComputerCode>. The row indexed by <ComputerCode>
<b>0</b>
</ComputerCode> holds data for the first day of the year and the row indexed by <ComputerCode>
<b>364</b>
</ComputerCode> holds data for the last day of the year. However, the column <ComputerCode>
<b>'GMT'</b>
</ComputerCode> holds <ComputerCode>
<b>datetime64</b>
</ComputerCode> values which would make a more intuitive index.</Paragraph><Paragraph>Changing the index to <ComputerCode>
<b>datetime64</b>
</ComputerCode> values is as easy as assigning to the dataframe’s <ComputerCode>
<b>index</b>
</ComputerCode> attribute the contents of the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london.index = london['GMT']</Paragraph><Paragraph>#Display the first 2 rows</Paragraph><Paragraph>london.head(2)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1011.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1011.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="2439d1a6" x_imagesrc="ou_futurelearn_learn_to_code_fig_1011.jpg" x_imagewidth="512" x_imageheight="199"/><Caption><b>Figure 12</b> </Caption><Alternative>First 2 rows of the london dataframe showing that the index has been changed to the datetime64 values from the GMT column</Alternative><Description>First 2 rows of the london dataframe showing that the index has been changed to the datetime64 values from the GMT column. Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the right of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>Notice that the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column still remains and that the index has been labelled to show that it has been derived from the <ComputerCode>
<b>'GMT'</b>
</ComputerCode> column.</Paragraph><Paragraph>You can still access a row using the <ComputerCode>
<b>iloc</b>
</ComputerCode> attribute, so to get the first row in the dataframe you can simply execute:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london.iloc[0]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT 2014-01-01 00:00:00</Paragraph><Paragraph>Max TemperatureC 11</Paragraph><Paragraph>Mean TemperatureC 8</Paragraph><Paragraph>Min TemperatureC 6</Paragraph><Paragraph>Dew PointC 9</Paragraph><Paragraph>MeanDew PointC 7</Paragraph><Paragraph>Min DewpointC 4</Paragraph><Paragraph>Max Humidity 94</Paragraph><Paragraph>Mean Humidity 86</Paragraph><Paragraph>Min Humidity 73</Paragraph><Paragraph>Max Sea Level PressurehPa 1002</Paragraph><Paragraph>Mean Sea Level PressurehPa 993</Paragraph><Paragraph>Min Sea Level PressurehPa 984</Paragraph><Paragraph>Max VisibilityKm 31</Paragraph><Paragraph>Mean VisibilityKm 11</Paragraph><Paragraph>Min VisibilitykM 2</Paragraph><Paragraph>Max Wind SpeedKm/h 40</Paragraph><Paragraph>Mean Wind SpeedKm/h 26</Paragraph><Paragraph>Max Gust SpeedKm/h 66</Paragraph><Paragraph>Precipitationmm 9.91</Paragraph><Paragraph>CloudCover 4</Paragraph><Paragraph>Events Rain</Paragraph><Paragraph>WindDirDegrees 186</Paragraph><Paragraph>Name: 2014-01-01 00:00:00, dtype: object</Paragraph></ComputerDisplay><Paragraph>But you can now also use the <ComputerCode>
<b>datetime64</b>
</ComputerCode> index to get a row using the dataframe’s <ComputerCode>
<b>loc</b>
</ComputerCode> attribute, like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>london.loc[datetime(2014, 1, 1)]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GMT 2014-01-01 00:00:00</Paragraph><Paragraph>Max TemperatureC 11</Paragraph><Paragraph>Mean TemperatureC 8</Paragraph><Paragraph>Min TemperatureC 6</Paragraph><Paragraph>Dew PointC 9</Paragraph><Paragraph>MeanDew PointC 7</Paragraph><Paragraph>Min DewpointC 4</Paragraph><Paragraph>Max Humidity 94</Paragraph><Paragraph>Mean Humidity 86</Paragraph><Paragraph>Min Humidity 73</Paragraph><Paragraph>Max Sea Level PressurehPa 1002</Paragraph><Paragraph>Mean Sea Level PressurehPa 993</Paragraph><Paragraph>Min Sea Level PressurehPa 984</Paragraph><Paragraph>Max VisibilityKm 31</Paragraph><Paragraph>Mean VisibilityKm 11</Paragraph><Paragraph>Min VisibilitykM 2</Paragraph><Paragraph>Max Wind SpeedKm/h 40</Paragraph><Paragraph>Mean Wind SpeedKm/h 26</Paragraph><Paragraph>Max Gust SpeedKm/h 66</Paragraph><Paragraph>Precipitationmm 9.91</Paragraph><Paragraph>CloudCover 4</Paragraph><Paragraph>Events Rain</Paragraph><Paragraph>WindDirDegrees 186</Paragraph><Paragraph>Name: 2014-01-01 00:00:00, dtype: object</Paragraph></ComputerDisplay><Paragraph>A query such as ‘Return all the rows where the date is between 8 December and 12 December’ which you did before (and can still do) with:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>london[(london['GMT'] &gt;= datetime(2014, 12, 8))</Paragraph><Paragraph>    &amp; (london['GMT'] &lt;= datetime(2014, 12, 12))]</Paragraph></ComputerDisplay><Paragraph>can now be done more succinctly like this:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
london.loc[datetime(2014,12,8) : datetime(2014,12,12)]
</Paragraph><Paragraph/><Paragraph>
#The code above gets the rows between
</Paragraph><Paragraph>#and including the indices datetime(2014,12,8) and</Paragraph><Paragraph>#datetime(2014,12,12)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1012.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1012.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="094925aa" x_imagesrc="ou_futurelearn_learn_to_code_fig_1012.jpg" x_imagewidth="512" x_imageheight="337"/><Caption><b>Figure 13</b> </Caption><Alternative>Rows from the london dataframe where the index is between 2014-12-08 and 2014-12-12 (inclusive).</Alternative><Description>Rows from the london dataframe where the index is between 2014-12-08 and 2014-12-12 (inclusive). Note that only the first few columns are shown due to the limitation of page width. </Description></Figure><Paragraph><i> Note that the right of the table has been cropped to fit on the page. </i></Paragraph><Paragraph>Because the table is in date order, we can be confident that only the rows with dates between 8 December 2014 and 12 December 2014 (inclusive) will be returned. However, if the table had not been in date order, we would have needed to sort it first, like this:</Paragraph><Paragraph><ComputerCode>london = london.sort_index()</ComputerCode></Paragraph><Paragraph>Now that there is a <ComputerCode>
<b>datetime64</b>
</ComputerCode> index, let’s plot <ComputerCode>
<b>'Max Wind SpeedKm/h'</b>
</ComputerCode> again:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
london['Max Wind SpeedKm/h'].plot(grid=True, figsize=(10,5))
</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1013.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1013.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="eb20ec94" x_imagesrc="ou_futurelearn_learn_to_code_fig_1013.jpg" x_imagewidth="512" x_imageheight="313"/><Caption><b>Figure 14</b> </Caption><Alternative>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe.</Alternative><Description>Chart of the values in the Max Wind SpeedKm/h column of the london dataframe. Note that the legend for the x-axis has changed from numbers to month names. </Description></Figure><Paragraph>Now it is much clearer that the worst winds were in mid-February.</Paragraph><Activity><Heading>Exercise 6 Changing a dataframe’s index</Heading><Question><Paragraph>Now try Exercise 6 in the Exercise notebook 2.</Paragraph></Question></Activity></Section><Section><Title>2.2 The project</Title><Paragraph>Your project this week is to find out what would have been the best two weeks of weather for a 2014 vacation in a capital of a BRICS country.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1039.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1039.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="51178b01" x_imagesrc="ou_futurelearn_learn_to_code_fig_1039.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 15</b> </Caption><Alternative>An image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky</Alternative><Description>An 
image of filter like diagonal strips across various skies such as an orange sunset, a storm and a clear blue sky</Description></Figure><Paragraph>I’ve written up my analysis of the best two weeks of weather in London, UK, which you can open in Project 2: Holiday weather.</Paragraph><Paragraph>The structure is very simple: besides the introduction and the conclusions, there is one section for each step of the analysis – obtaining, cleaning and visualising the data.</Paragraph><Paragraph>Once you’ve worked through my analysis you should open a dataset for just one of the BRICS capitals: Brasilia, Moscow, Delhi, Beijing or Cape Town. The choice of capital is up to you. You should then work out the best two weeks, according to the weather, to choose for a two-week holiday in your chosen capital city.</Paragraph><Paragraph>Download the dataset for your chosen location as follows:</Paragraph><BulletedList><ListItem>Right-click on the name of your chosen capital city above.</ListItem><ListItem>Choose to save the file via ‘Download Linked File As...’. Save the file with its default name to your downloads folder.</ListItem><ListItem>If necessary, rename the file so that it has a .csv extension.</ListItem><ListItem>Finally, move or copy the file to the disk folder or CoCalc (formerly SageMathCloud) project you created in Week 1.</ListItem></BulletedList><Paragraph>Once again, <b>do not open the file with Excel</b>, but you could take a look using a text editor.</Paragraph><Paragraph>In my project, because I’m in London, which is often cold and rainy, I was looking for a two-week period that had relatively high temperatures and little rain. If you choose a capital in a particularly hot and dry country you will probably be looking for relatively cool weather and low humidity.</Paragraph><Paragraph>Note that the London file has the dates in a column named ‘GMT’ whereas in the BRICS files they are in a column named ‘Date’. You will need to change the Python code accordingly. 
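</Paragraph><Paragraph>As a minimal sketch of that change, the code below reads a tiny made-up two-row extract (standing in for a real BRICS file, which has many more columns) and uses the ‘Date’ column where the London code used ‘GMT’. The variable name <ComputerCode>moscow</ComputerCode> and the sample values are assumptions for illustration only.</Paragraph><ComputerDisplay><Paragraph>
```python
from datetime import datetime
from io import StringIO
from pandas import read_csv, to_datetime

# Made-up extract; a real file would be loaded with read_csv('your_file.csv')
sample = StringIO('Date,Max TemperatureC\n2014-1-1,-2\n2014-1-2,0\n')
moscow = read_csv(sample)

# Same steps as for London, but with 'Date' in place of 'GMT'
moscow['Date'] = to_datetime(moscow['Date'])
moscow.index = moscow['Date']

# Rows can now be looked up by date
firstDay = moscow.loc[datetime(2014, 1, 1)]
```
</Paragraph></ComputerDisplay><Paragraph>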
You should also change the name of the variable, <ComputerCode>london</ComputerCode>, according to the capital you choose.</Paragraph></Section></Session><Session><Title>3 This week’s quiz</Title><Paragraph>Now it’s time to complete the Week 4 badge quiz. It is similar to previous quizzes, but this time there will be fifteen questions instead of five.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78780">Week 4 compulsory badge quiz</a></Paragraph><Paragraph>Remember, this quiz counts towards your badge. If you’re not successful the first time, you can attempt the quiz again in 24 hours.</Paragraph></Session><Session><Title>4 Summary</Title><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1052.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1052.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="e2772382" x_imagesrc="ou_futurelearn_learn_to_code_fig_1052.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 16</b> </Caption><Alternative>An image of storm clouds and a double rainbow above a field with a dirt road</Alternative><Description>An image of storm clouds and a double rainbow above a field with a dirt road</Description></Figure><Paragraph>This week you have learned how to: </Paragraph><BulletedList><ListItem>load a dataset into a dataframe from a CSV file</ListItem><ListItem>clean data</ListItem><ListItem>use the data to get answers to your questions.</ListItem></BulletedList><Paragraph>Next week you will learn about the techniques behind the creation of a combined dataset. </Paragraph><Paragraph>You are now halfway through the course. 
The Open University would really appreciate your feedback and suggestions for future improvement in our optional <a href="https://www.surveymonkey.co.uk/r/BOCENDlearntocode">end-of-course survey</a>, which you will also have an opportunity to complete at the end of Week 8. Participation will be completely confidential and we will not pass on your details to others.</Paragraph></Session><Session><Title>4.1 Week 4 glossary</Title><Paragraph>Here is an alphabetical list of the terms introduced this week, for quick look-up.</Paragraph><InternalSection><Heading>Programming and data analysis concepts</Heading><Paragraph>The <b>bitwise operators</b> <ComputerCode>
<b>&amp;</b>
</ComputerCode> (and) and <ComputerCode>
<b>|</b>
</ComputerCode> (or) are used in pandas to build more complicated expressions from two comparison expressions (typically involving column comparisons).</Paragraph><Paragraph>A <b>Boolean</b> has one of two possible values: <ComputerCode>
<b>True</b>
</ComputerCode> or <ComputerCode>
<b>False</b>
</ComputerCode>.</Paragraph><Paragraph>A <b>Comma Separated Values (CSV)</b> file is a plain text file that is used to hold tabular data.</Paragraph><Paragraph>A <b>list</b> is a sequence of values, separated by commas, and written within square brackets.</Paragraph><Paragraph>There are six <b>comparison operators</b> that can be used to compare number, string and date values. Expressions composed of these operators evaluate to <ComputerCode>
<b>True</b>
</ComputerCode> or <ComputerCode>
<b>False</b>
</ComputerCode>. These operators can also be used to compare every value in a column, row by row, against some number, string or date value. When used in this manner the operators return a series of Boolean values.</Paragraph><Paragraph>The <b>‘dot’ notation</b> is used to access a dataframe’s methods and attributes.</Paragraph><Paragraph>The <ComputerCode>
<b>Series</b>
</ComputerCode> data type is a collection of values with an integer index that starts from zero. Each column in a dataframe is an example of the <ComputerCode>
<b>Series</b>
</ComputerCode> data type. The <ComputerCode>
<b>Series</b>
</ComputerCode> data type has many of the same methods as the <ComputerCode>
<b>DataFrame</b>
</ComputerCode> data type.</Paragraph><Paragraph>The <ComputerCode>
<b>object</b>
</ComputerCode> data type is how pandas represents strings.</Paragraph><Paragraph>The <ComputerCode>
<b>datetime64</b>
</ComputerCode> data type is how pandas represents dates.</Paragraph><Paragraph>The <ComputerCode>
<b>int64</b>
</ComputerCode> data type is how pandas represents integers (whole numbers).</Paragraph><Paragraph>The <ComputerCode>
<b>float64</b>
</ComputerCode> data type is how pandas represents floating point numbers (decimals).</Paragraph></InternalSection><InternalSection><Heading>Functions and methods</Heading><Paragraph><ComputerCode>
<b>astype(aType)</b>
</ComputerCode> when applied to a dataframe column, the method changes the data type of each value in that column to the type given by the string <ComputerCode>
<b>aType</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>datetime(yyyy, mm, dd)</b>
</ComputerCode> the function takes three arguments, <ComputerCode>
<b>yyyy</b>
</ComputerCode> a four-digit integer representing a year, <ComputerCode>
<b>mm</b>
</ComputerCode> an integer from 1 to 12 representing a month and <ComputerCode>
<b>dd</b>
</ComputerCode> an integer from 1 to 31 representing a day. From these arguments the function creates and returns a <ComputerCode>
<b>datetime</b>
</ComputerCode> value, which pandas can compare with <ComputerCode><b>datetime64</b></ComputerCode> values.</Paragraph><Paragraph><ComputerCode>
<b>dropna()</b>
</ComputerCode> when applied to a dataframe returns a new dataframe without the rows that have at least one missing value.</Paragraph><Paragraph><ComputerCode>
<b>head()</b>
</ComputerCode> gets and displays the first five rows of a dataframe. Optionally the method can take an integer argument to specify how many rows (from and including row 0) to get and display.</Paragraph><Paragraph><ComputerCode>
<b>iloc[index]</b>
</ComputerCode> gets and displays the row in the dataframe indicated by the integer argument <ComputerCode>
<b>index</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>isnull()</b>
</ComputerCode> is a series method that checks which rows in that series have a missing value.</Paragraph><Paragraph><ComputerCode>
<b>fillna(value)</b>
</ComputerCode> is a series method that returns a new series in which all missing values have been filled with the given value.</Paragraph><Paragraph><ComputerCode>
<b>plot()</b>
</ComputerCode> when applied to a dataframe column of numeric values, the method displays a graph of those values. The x-axis shows the dataframe’s index and the y-axis the range of the column’s values. Before the method is called you first need to execute <ComputerCode>
<b>%matplotlib inline</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>read_csv(csvFile)</b>
</ComputerCode> creates a dataframe from the dataset in the CSV file.</Paragraph><Paragraph><ComputerCode>
<b>rename(columns={oldName : newName})</b>
</ComputerCode> renames the column <ComputerCode>
<b>oldName</b>
</ComputerCode> to <ComputerCode>
<b>newName</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>str.rstrip(suffix)</b>
</ComputerCode> when applied to a dataframe column of string values, the method removes the argument <ComputerCode>
<b>suffix</b>
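</ComputerCode> from the end of each string value in the column. Strictly, the argument is treated as a set of characters, and any trailing run of those characters is removed, not only an exact suffix.</Paragraph><Paragraph><ComputerCode>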
<b>tail()</b>
</ComputerCode> gets and displays the last five rows of a dataframe. Optionally the method can take an integer argument to specify how many rows (until and including the last row) to get and display.</Paragraph><Paragraph><ComputerCode>
<b>to_datetime(aSeries)</b>
</ComputerCode> when applied to a series, typically a column from a dataframe, this function returns a new series in which each value in <ComputerCode>
<b>aSeries</b>
</ComputerCode> has been changed to type <ComputerCode>
<b>datetime64</b>
</ComputerCode>.</Paragraph></InternalSection></Session></Unit><Unit><UnitID/><UnitTitle>Week 5: Combine and transform data Part 1</UnitTitle><Session id="life_expectancy_project"><Title>1 Life expectancy project</Title><Paragraph>This week I wish to see (literally, via a chart) if the life expectancy in richer countries tends to be longer.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1047.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1047.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="2657bfa8" x_imagesrc="ou_futurelearn_learn_to_code_fig_1047.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b> </Caption><Alternative>A photograph of an adult hand holding the hand of a baby.</Alternative><Description>This is a photograph of an adult hand holding the hand of a baby.</Description></Figure><Paragraph>Richer countries can afford to spend more on healthcare and on road safety, for example, to reduce mortality. On the other hand, richer countries may have less healthy lifestyles.</Paragraph><Paragraph>The World Bank provides loans and grants to governments of middle and low-income countries to help reduce poverty. As part of their work, the World Bank has put together hundreds of datasets on a range of issues, such as health, education, economy, energy and the effectiveness of aid in different countries. I will use two of their datasets, which you can see online by following the links below. You do not need to download the datasets.</Paragraph><Paragraph>One dataset lists the gross domestic product (GDP) for each country, in United States dollars and cents; the other lists the life expectancy, in years, for each country. 
The latest life expectancy data I can access is for 2013, so that will be the year I take for the GDP. The disadvantage of using the GDP and the life expectancy values for the same year is that they do not account for the time it takes for a country’s wealth to have an effect on lifestyle, healthcare and other factors influencing life expectancy.</Paragraph><Paragraph>While it is useful to have all GDPs in a common currency to compare different countries, it doesn’t make much sense to report the GDP of a whole country to a supposed precision of a US cent. I noted that the value for the USA is a round number, but it is not for other countries. This is likely due in part to the conversion of local currencies to US dollars. It makes more sense to report the GDP values in a larger unit, e.g. millions of dollars. Moreover, for those who don’t live in a country using the US dollar as the official currency, it’s probably easier to understand GDP values in their own local currency.</Paragraph><Paragraph>To sum up, this week’s project will transform currency values and combine GDP and life expectancy data.</Paragraph><Paragraph>Note that the combination is made simple by the common country names in the two datasets, but in general care has to be taken that the common attribute really means the same thing. For example, if you were combining two datasets on a common unemployment attribute, you must be sure that it was obtained in the same way as there are various ways of measuring unemployment.</Paragraph><Paragraph>I’m aware that the GDP is a crude way of comparing wealth across nations. For example, it doesn’t take population or the cost of living into account. Some of this week’s exercises will ask you to add the population data. 
Think of other ways to improve the analysis method, of other conversions that might be needed, and of other ways to investigate life expectancy factors.</Paragraph><InternalSection><Heading>Links:</Heading><UnNumberedList><ListItem><a href="http://data.worldbank.org/indicator/NY.GDP.MKTP.CD">GDP in current US dollars</a></ListItem><ListItem><a href="http://data.worldbank.org/indicator/SP.DYN.LE00.IN">Life expectancy at birth</a></ListItem></UnNumberedList></InternalSection><Section><Title>1.1 Creating the data</Title><Paragraph>I won’t yet work with the full data. Instead I will create small tables, to better illustrate this week’s concepts and techniques.</Paragraph><Paragraph>Small tables make it easier to see what is going on and to create specific data combination and transformation scenarios that test the code.</Paragraph><Paragraph>There are many ways of creating tables in pandas. One of the simplest is to define the rows as a list, with the first element of the list being the first row, the second element being the second row, etc.</Paragraph><Paragraph>Each row of a table has multiple cells, one for each column. The obvious way is to represent each row as a list too, the first element of the list being the cell in the first column, the second element corresponding to the second column, etc. To sum up, the table is represented as a list of lists.</Paragraph><Paragraph>Here is a table of the 2013 GDP of some countries, in US dollars:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>table = [</Paragraph><Paragraph>['UK', 2678454886796.7], # 1st row</Paragraph><Paragraph>['USA', 16768100000000.0], # 2nd row</Paragraph><Paragraph>['China', 9240270452047.0], # and so on...</Paragraph><Paragraph>['Brazil', 2245673032353.8],</Paragraph><Paragraph>['South Africa', 366057913367.1]</Paragraph><Paragraph>]</Paragraph></ComputerDisplay><Paragraph>To create a dataframe, I use a pandas function appropriately called <ComputerCode>
<b>DataFrame()</b>
</ComputerCode>. I have to give it two arguments: the names of the columns and the data itself. The column names are given as a list of strings, the first string being the first column name, etc.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>headings = ['Country', 'GDP (US$)']</Paragraph><Paragraph>gdp = DataFrame(columns=headings, data=table)</Paragraph><Paragraph>gdp</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>GDP (US$)</th></tr><tr><td><b>0</b></td><td>UK</td><td>2.678455e+12</td></tr><tr><td><b>1</b></td><td>USA</td><td>1.676810e+13</td></tr><tr><td><b>2</b></td><td>China</td><td>9.240270e+12</td></tr><tr><td><b>3</b></td><td>Brazil</td><td>2.245673e+12</td></tr><tr><td><b>4</b></td><td>South Africa</td><td>3.660579e+11</td></tr></tbody></Table><Paragraph>Note that pandas shows large numbers in scientific notation, where, for example, 3e+12 means 3×10<sup>12</sup>, i.e. a 3 followed by 12 zeros.</Paragraph><Paragraph>I define a similar table for the life expectancy, based on the 2013 World Bank data.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
headings = ['Country name', 'Life expectancy (years)']
</Paragraph><Paragraph>table = [</Paragraph><Paragraph>['China', 75],</Paragraph><Paragraph>['Russia', 71],</Paragraph><Paragraph>['United States', 79],</Paragraph><Paragraph>['India', 66],</Paragraph><Paragraph>['United Kingdom', 81]</Paragraph><Paragraph>]</Paragraph><Paragraph>life = DataFrame(columns=headings, data=table)</Paragraph><Paragraph>life</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>China</td><td>75</td></tr><tr><td><b>1</b></td><td>Russia</td><td>71</td></tr><tr><td><b>2</b></td><td>United States</td><td>79</td></tr><tr><td><b>3</b></td><td>India</td><td>66</td></tr><tr><td><b>4</b></td><td>United Kingdom</td><td>81</td></tr></tbody></Table><Paragraph>To illustrate potential issues when combining multiple datasets, I’ve taken a different set of countries, with common countries in a different order. Moreover, to illustrate a non-numeric conversion, I’ve abbreviated country names in one table but not the other.</Paragraph><Activity><Heading>Exercise 1 Creating the data</Heading><Question><Paragraph>Open the exercise notebook 3 and save it in the disk folder or upload it to the CoCalc project you created in Week 1. Then practise creating dataframes in Exercise 1.</Paragraph><Paragraph>If you’re using Anaconda, remember that to open the notebook you’ll need to navigate to it using Jupyter. Whether you’re using Anaconda or CoCalc, once the notebook is open, run the existing code before you start the exercise. When you’ve completed the exercise, save the notebook. 
If you need a quick reminder of how to use Jupyter, watch the video in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83249&amp;targetdoc=Week+1%3A+Having+a+go+at+it+Part+1&amp;targetptr=1.4">Week 1 Exercise 1</a> again.</Paragraph></Question></Activity></Section><Section><Title>1.2 Defining functions</Title><Paragraph>To make the GDP values easier to read, I wish to convert US dollars to millions of US dollars.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1048.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1048.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="52394965" x_imagesrc="ou_futurelearn_learn_to_code_fig_1048.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 2</b></Caption><Alternative>An image of a large shipping container.</Alternative></Figure><Paragraph>I have to be precise about what I mean. For example, if the GDP is 4,567,890.1 (using commas to separate the thousands, millions, etc.), what do I want to obtain? Do I always want to round down to the nearest million, making it 4 million; round to the nearest million, making it 5 million; or round to one decimal place, making it 4.6 million? Since the aim is to simplify the numbers and not introduce a false sense of precision, let’s round to the nearest million.</Paragraph><Paragraph>I will define my own function to do such a conversion. It’s a generic function that takes any number and rounds it to the nearest million. I will later apply the function to each value in the GDP column. It’s easier to first show the code and then explain it.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>def roundToMillions(value):</Paragraph><Paragraph>    result = round(value / 1000000)</Paragraph><Paragraph>    return result</Paragraph></ComputerDisplay><Paragraph>A function definition always starts with <ComputerCode>
<b>def</b>
</ComputerCode>, which is a reserved word in Python.</Paragraph><Paragraph>After it come the function’s name and its arguments, enclosed in parentheses, and finally a colon (:). This function takes just one argument. If there’s more than one argument, use commas to separate them.</Paragraph><Paragraph>Next comes the function’s body, where the calculations are done, using the arguments like any other variables. The body must be indented, conventionally by four spaces.</Paragraph><Paragraph>For this function, the calculation is simple. I take the value, divide it by one million, and call the built-in Python function <ComputerCode>
<b>round()</b>
</ComputerCode> to convert that number to the nearest integer. If the number is exactly mid-way between two integers, <ComputerCode>
<b>round()</b>
</ComputerCode> will pick the even integer, i.e. <ComputerCode>
<b>round(2.5)</b>
</ComputerCode> is 2 but <ComputerCode>
<b>round(3.5)</b>
</ComputerCode> is 4.</Paragraph><Paragraph>Finally, I write a <b>return statement</b> to pass the result back to the code that called the function. The <ComputerCode>
<b>return</b>
</ComputerCode> word is also reserved in Python.</Paragraph><Paragraph>The <ComputerCode>
<b>result</b>
</ComputerCode> variable just stores the rounded value temporarily and has no other purpose. It’s better to write the body as a single line of code:</Paragraph><Paragraph><ComputerCode>return round(value / 1000000)</ComputerCode></Paragraph><Paragraph>Finally, I need to test the function by calling it with various argument values and checking whether the returned value is equal to what I expect.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>roundToMillions(4567890.1) == 5</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph>The art of testing is to find as few test cases as possible that cover all bases. And I mean all, especially those you think ‘Naaah, it’ll never happen’. It will, because data can be incorrect. Prepare for the worst and hope for the best.</Paragraph><Paragraph>So here are some more tests, even for the unlikely cases of the GDP being zero or negative, and you can probably think of others.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>roundToMillions(0) == 0 # always test with zero...</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>roundToMillions(-1) == 0 #...and negative numbers</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph><ComputerCode>
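</ComputerCode>The midway case described earlier is also worth a test. The sketch below repeats the one-line form of the function so that it runs on its own; the two test amounts and the names <ComputerCode>midwayDown</ComputerCode> and <ComputerCode>midwayUp</ComputerCode> are assumptions chosen for illustration.</Paragraph><ComputerDisplay><Paragraph>
```python
def roundToMillions(value):
    # one-line form of the function defined earlier
    return round(value / 1000000)

# Exactly midway between two integers: round() picks the even one
midwayDown = roundToMillions(2500000) == 2   # 2.5 rounds to 2
midwayUp = roundToMillions(3500000) == 4     # 3.5 rounds to 4
```
</Paragraph></ComputerDisplay><Paragraph><ComputerCode>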
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
roundToMillions(1499999) == 1 # test rounding to the nearest million
</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph>Now for the next conversion, from US dollars to a local currency, for example British pounds. I searched the internet for ‘average yearly USD to GBP rate’, chose a conversion service and took the value for 2013. Here’s the code and some tests.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>def usdToGbp(usd):</Paragraph><Paragraph>    return usd / 1.564768 # average rate during 2013</Paragraph><Paragraph>usdToGbp(0) == 0</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>usdToGbp(1.564768) == 1</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph><ComputerCode>
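</ComputerCode>The equality test above succeeds because dividing a number by itself gives exactly 1. For other amounts, the result of a division is a floating-point number that may differ from the expected value in its last digits, so tests on such values are safer with a small tolerance. The following is a minimal sketch, not part of the course notebook: it repeats the definition of usdToGbp() from above and assumes the isclose() function of Python’s standard math module as the way to compare.</Paragraph>

```python
from math import isclose

def usdToGbp (usd):
    return usd / 1.564768 # average rate during 2013

# floating-point arithmetic is not exact, so == can fail unexpectedly
print(0.1 + 0.2 == 0.3)

# isclose() accepts tiny differences in the last digits
print(isclose(0.1 + 0.2, 0.3))
print(isclose(usdToGbp(156.4768), 100))
```

<Paragraph>The first line printed is False and the other two are True.</Paragraph><Paragraph><ComputerCode>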
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>usdToGbp(-1) &lt; 0</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph>Defining functions is such an important part of coding, that you should not skip the next exercise where you will define your own functions.</Paragraph><Activity><Heading>Exercise 2 Defining functions</Heading><Question><Paragraph>Complete Exercise 2 in the Exercise notebook 3 to practise defining your own functions.</Paragraph></Question></Activity></Section><Section><Title>1.3 What if...?</Title><Paragraph>The third conversion, from abbreviated country names to full names, can’t be written as a simple formula, because each abbreviation is expanded differently.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1036.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1036.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="2760a5ab" x_imagesrc="ou_futurelearn_learn_to_code_fig_1036.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b> </Caption><Alternative>A world globe with pins all over the European continent</Alternative></Figure><Paragraph>What I need is the Python code equivalent of:</Paragraph><BulletedList><ListItem>if the name is ‘UK’, return ‘United Kingdom’,</ListItem><ListItem>otherwise if the name is ‘USA’, return ‘United States’,</ListItem><ListItem>otherwise return the name.</ListItem></BulletedList><Paragraph>The last part basically says that if the name is none of the known abbreviations, return it unchanged. Translating the English sentence to Python is straightforward.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>def expandCountry (name):</Paragraph><Paragraph>    if name == 'UK': # if the name is 'UK'</Paragraph><Paragraph>        return 'United Kingdom'</Paragraph><Paragraph>    elif name == 'USA': # otherwise if the name is 'USA'</Paragraph><Paragraph>        return 'United States'</Paragraph><Paragraph>    else: # otherwise</Paragraph><Paragraph>        return name</Paragraph><Paragraph>expandCountry('India') == 'India'</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>True</ComputerCode></Paragraph><Paragraph>Note that ‘otherwise if’ is written <ComputerCode>
<b>'elif'</b>
</ComputerCode> in Python, not <ComputerCode>
<b>'else if'</b>
</ComputerCode>. As you might expect, ‘if’, ‘elif’ and ‘else’ are reserved words.</Paragraph><Paragraph>The computer will evaluate one condition at a time, from top to bottom, and execute only the instructions of the first condition that is true. Note that there is no condition after <ComputerCode>
<b>'else'</b>
</ComputerCode>; it is a ‘catch all’ for the case where all previous conditions fail.</Paragraph><Paragraph>Note again the colon at the end of each condition line, and that the code after the colon must be indented. That is how Python distinguishes which lines of code belong to which condition.</Paragraph><Paragraph>There are almost always many ways to write the same function. A <b>conditional statement</b> does not need to have an <ComputerCode>
<b>'elif'</b>
</ComputerCode> or <ComputerCode>
<b>'else'</b>
</ComputerCode> part. In that case, if the condition is false, nothing happens. Here is the same function, written differently.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>def expandCountry (name):</Paragraph><Paragraph>	if name == 'UK':</Paragraph><Paragraph>		name = 'United Kingdom'</Paragraph><Paragraph>	if name == 'USA':</Paragraph><Paragraph>		name = 'United States'</Paragraph><Paragraph>	return name</Paragraph></ComputerDisplay><Paragraph>You will see later this week an example of an ‘if-else’ statement, i.e. without the <ComputerCode>
<b>'elif'</b>
</ComputerCode> part.</Paragraph><Activity><Heading>Exercise 3 What if…?</Heading><Question><Paragraph>Complete Exercise 3 in the Exercise notebook 3 to practise writing functions with conditional statements.</Paragraph></Question></Activity></Section><Section><Title>1.4 Applying functions</Title><Paragraph>Having coded the three data conversion functions, I can now apply them to the GDP table.</Paragraph><Paragraph>I first select the relevant column:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>column = gdp['Country']</Paragraph><Paragraph>column</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>0              UK</Paragraph><Paragraph>1             USA</Paragraph><Paragraph>2           China</Paragraph><Paragraph>3          Brazil</Paragraph><Paragraph>4    South Africa</Paragraph><Paragraph>Name: Country, dtype: object</Paragraph></ComputerDisplay><Paragraph>Next, I use the column method <ComputerCode>
<b>apply()</b>
</ComputerCode>, which applies a given function to each cell in the column and returns a new column in which each cell is the conversion of the corresponding original cell:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>column.apply(expandCountry)</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>0    United Kingdom</Paragraph><Paragraph>1     United States</Paragraph><Paragraph>2             China</Paragraph><Paragraph>3            Brazil</Paragraph><Paragraph>4      South Africa</Paragraph><Paragraph>Name: Country, dtype: object</Paragraph></ComputerDisplay><Paragraph>Finally, I add that new column to the dataframe, using a new column heading:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>gdp['Country name'] = column.apply(expandCountry)</Paragraph><Paragraph>gdp</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>GDP (US$)</th><th>Country name</th></tr><tr><td><b>0</b></td><td>UK</td><td>2.678455e+12</td><td>United Kingdom</td></tr><tr><td><b>1</b></td><td>USA</td><td>1.676810e+13</td><td>United States</td></tr><tr><td><b>2</b></td><td>China</td><td>9.240270e+12</td><td>China</td></tr><tr><td><b>3</b></td><td>Brazil</td><td>2.245673e+12</td><td>Brazil</td></tr><tr><td><b>4</b></td><td>South Africa</td><td>3.660579e+11</td><td>South Africa</td></tr></tbody></Table><Paragraph>In a similar way, I can convert the US dollars to British pounds, then round to the nearest million, and store the result in a new column. I could apply the conversion and rounding functions in two separate statements, but using <b>method chaining</b> , I can apply both functions in a single line of code. This is possible because the column returned by the first call of <ComputerCode>
<b>apply()</b>
</ComputerCode> is the context for the second call of <ComputerCode>
<b>apply()</b>
</ComputerCode>. Here’s how it’s written:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>column = gdp['GDP (US$)']</Paragraph><Paragraph>
result = column.apply(usdToGbp).apply(roundToMillions)
</Paragraph><Paragraph>gdp['GDP (£m)'] = result</Paragraph><Paragraph>gdp</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country</th><th>GDP (US$)</th><th>Country name</th><th>GDP (£m)</th></tr><tr><td><b>0</b></td><td>UK</td><td>2.678455e+12</td><td>United Kingdom</td><td>1711727</td></tr><tr><td><b>1</b></td><td>USA</td><td>1.676810e+13</td><td>United States</td><td>10716029</td></tr><tr><td><b>2</b></td><td>China</td><td>9.240270e+12</td><td>China</td><td>5905202</td></tr><tr><td><b>3</b></td><td>Brazil</td><td>2.245673e+12</td><td>Brazil</td><td>1435148</td></tr><tr><td><b>4</b></td><td>South Africa</td><td>3.660579e+11</td><td>South Africa</td><td>233937</td></tr></tbody></Table><Paragraph>Now it’s just a matter of selecting the two new columns, as the original ones are no longer needed.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>headings = ['Country name', 'GDP (£m)']</Paragraph><Paragraph>gdp = gdp[headings]</Paragraph><Paragraph>gdp</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>GDP (£m)</th></tr><tr><td><b>0</b></td><td>United Kingdom</td><td>1711727</td></tr><tr><td><b>1</b></td><td>United States</td><td>10716029</td></tr><tr><td><b>2</b></td><td>China</td><td>5905202</td></tr><tr><td><b>3</b></td><td>Brazil</td><td>1435148</td></tr><tr><td><b>4</b></td><td>South Africa</td><td>233937</td></tr></tbody></Table><Paragraph>Note that method chaining only works if the methods chained return the same type of value as their context, in the same way that you can chain multiple arithmetic operators (e.g. 3+4-5) because each one takes two numbers and returns a number that is used by the next operator in the chain. In this course, methods only have two possible contexts, columns and dataframes, so you can either chain column methods that return a single column (that is a <ComputerCode>
<b>Series</b>
</ComputerCode>), like <ComputerCode>
<b>apply()</b>
</ComputerCode>, or dataframe methods that return dataframes. For example, <ComputerCode>
<b>gdp.head(4).tail(2)</b>
</ComputerCode> is a dataframe just with China and Brazil, i.e. the last two of the first four rows of the dataframe shown above. You’ll see further examples of chaining (and an easier way to select multiple rows) later this week.</Paragraph><Paragraph>This concludes the data transformation part. After applying functions in the next exercise, you’ll learn how to combine two tables.</Paragraph><Activity><Heading>Exercise 4 Applying functions</Heading><Question><Paragraph>You can practise applying functions in Exercise 4 of your Exercise notebook 3.</Paragraph></Question></Activity></Section></Session><Session><Title>2 This week’s quiz</Title><Paragraph>Check what you’ve learned this week by taking the end-of-week quiz.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78781">Week 5 practice quiz</a></Paragraph><Paragraph>Open the quiz in a new window or tab then come back here when you’ve finished.</Paragraph></Session><Session><Title>3 Summary</Title><Paragraph>This week you learned how to transform currency values and combine GDP and life expectancy data by:</Paragraph><BulletedList><ListItem>creating the data</ListItem><ListItem>defining the functions of data conversion</ListItem><ListItem>applying the functions of data conversion.</ListItem></BulletedList><Paragraph>Next week you will learn more about combining, merging and transforming tables.</Paragraph></Session></Unit><Unit><UnitID/><UnitTitle>Week 6: Combine and transform data Part 2</UnitTitle><Session><Title>1 Joining left, right and centre</Title><Paragraph>Let’s take stock for a moment. There’s the original, unchanged table (with full country names) about the life expectancy:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>life</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>China</td><td>75</td></tr><tr><td><b>1</b></td><td>Russia</td><td>71</td></tr><tr><td><b>2</b></td><td>United States</td><td>79</td></tr><tr><td><b>3</b></td><td>India</td><td>66</td></tr><tr><td><b>4</b></td><td>United Kingdom</td><td>81</td></tr></tbody></Table><Paragraph>… and a table with the GDP in millions of pounds and also full country names.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>gdp</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>GDP (£m)</th></tr><tr><td><b>0</b></td><td>United Kingdom</td><td>1711727</td></tr><tr><td><b>1</b></td><td>United States</td><td>10716029</td></tr><tr><td><b>2</b></td><td>China</td><td>5905202</td></tr><tr><td><b>3</b></td><td>Brazil</td><td>1435148</td></tr><tr><td><b>4</b></td><td>South Africa</td><td>233937</td></tr></tbody></Table><Paragraph>Both tables have a common column with a common name (‘Country name’). I can <b>join</b> the two tables on that common column, using the <ComputerCode>
<b>merge()</b>
</ComputerCode> function. Merging basically puts all columns of the two tables together, without duplicating the common column, and joins any rows that have the same value in the common column.</Paragraph><Paragraph>There are four possible ways of joining, depending on which rows I want to include in the resulting table. If I want to include only those countries appearing in the GDP table, I call the <ComputerCode>
<b>merge()</b>
</ComputerCode> function like so:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>merge(gdp, life, on='Country name', how='left')</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>GDP (£m)</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>United Kingdom</td><td>1711727</td><td>81</td></tr><tr><td><b>1</b></td><td>United States</td><td>10716029</td><td>79</td></tr><tr><td><b>2</b></td><td>China</td><td>5905202</td><td>75</td></tr><tr><td><b>3</b></td><td>Brazil</td><td>1435148</td><td>NaN</td></tr><tr><td><b>4</b></td><td>South Africa</td><td>233937</td><td>NaN</td></tr></tbody></Table><Paragraph>The first two arguments are the tables to be merged, with the first table being called the ‘left’ table and the second being the ‘right’ table. The <ComputerCode>
<b>on</b>
</ComputerCode> argument is the name of the common column, i.e. both tables must have a column with that name. The <ComputerCode>
<b>how</b>
</ComputerCode> argument states I want a <b>left join</b>, i.e. the resulting rows are dictated by the left (GDP) table. You can easily see that India and Russia, which appear only in the right (life expectancy) table, don’t show up in the result. You can also see that Brazil and South Africa, which appear only in the left table, have an undefined life expectancy. (Remember that ‘NaN’ stands for ‘not a number’.)</Paragraph><Paragraph>A <b>right join</b> will instead take the rows from the right table, and add the columns of the left table. Therefore, countries not appearing in the left table will have undefined values for the left table’s columns:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>merge(gdp, life, on='Country name', how='right')</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>GDP (£m)</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>United Kingdom</td><td>1711727</td><td>81</td></tr><tr><td><b>1</b></td><td>United States</td><td>10716029</td><td>79</td></tr><tr><td><b>2</b></td><td>China</td><td>5905202</td><td>75</td></tr><tr><td><b>3</b></td><td>Russia</td><td>NaN</td><td>71</td></tr><tr><td><b>4</b></td><td>India</td><td>NaN</td><td>66</td></tr></tbody></Table><Paragraph>The third possibility is an <b>outer join</b> which takes all countries, i.e. whether they are in the left or right table. The result has all the rows of the left and right joins:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>merge(gdp, life, on='Country name', how='outer')</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>GDP (£m)</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>United Kingdom</td><td>1711727</td><td>81</td></tr><tr><td><b>1</b></td><td>United States</td><td>10716029</td><td>79</td></tr><tr><td><b>2</b></td><td>China</td><td>5905202</td><td>75</td></tr><tr><td><b>3</b></td><td>Brazil</td><td>1435148</td><td>NaN</td></tr><tr><td><b>4</b></td><td>South Africa</td><td>233937</td><td>NaN</td></tr><tr><td><b>5</b></td><td>Russia</td><td>NaN</td><td>71</td></tr><tr><td><b>6</b></td><td>India</td><td>NaN</td><td>66</td></tr></tbody></Table><Paragraph>The last possibility is an <b>inner join</b> which takes only those countries common to both tables, i.e. for which I know the GDP <i>and</i> the life expectancy. That’s the join I want, to avoid any undefined values:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>gdpVsLife = merge(gdp, life, on='Country name', how='inner')</Paragraph><Paragraph>gdpVsLife</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Country name</th><th>GDP (£m)</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>United Kingdom</td><td>1711727</td><td>81</td></tr><tr><td><b>1</b></td><td>United States</td><td>10716029</td><td>79</td></tr><tr><td><b>2</b></td><td>China</td><td>5905202</td><td>75</td></tr></tbody></Table><Paragraph>Now it’s just a matter of applying the data transformation and combination techniques seen so far to the real data from the World Bank.</Paragraph><Activity><Heading>Exercise 5 Joining left, right and centre</Heading><Question><Paragraph>Put your learning into practice by completing Exercise 5 in the Exercise notebook 3.</Paragraph><Paragraph>Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.</Paragraph></Question></Activity><Section><Title>1.1 Constant variables</Title><Paragraph>You may have noticed that the same column names appear over and over in the code.</Paragraph><Paragraph>If, someday, I decide one of the new columns should be called ‘GDP (million GBP)’ instead of ‘GDP (£m)’ to make clear which currency is meant (because various countries use the pound symbol), I need to change the string in every line of code it occurs.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1053.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1053.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="1f92a17f" x_imagesrc="ou_futurelearn_learn_to_code_fig_1053.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b> </Caption><Alternative>An abstract image of different coloured vertical strips with a column of numbers through each.</Alternative><Description>An abstract image of 
different coloured vertical strips with a column of numbers through each. The strips are distorted by an arrow moving horizontally through them </Description></Figure><Paragraph>Laziness is the mother of invention. If I assign the string to a variable and then use the variable everywhere instead of the string, whenever I wish to change the string, I only have to edit one line of code, where it’s assigned to the variable. A second advantage of using names instead of values is that I can use the name completion facility of Jupyter notebooks by pressing ‘TAB’. Writing code becomes much faster…</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>gdpInGbp = 'GDP (million GBP)'</Paragraph><Paragraph>gdpInUsd = 'GDP (US$)'</Paragraph><Paragraph>country = 'Country name'</Paragraph><Paragraph>gdp[gdpInGbp] = gdp[gdpInUsd].apply(usdToGbp)</Paragraph><Paragraph>headings = [country, gdpInGbp]</Paragraph><Paragraph>gdp = gdp[headings]</Paragraph></ComputerDisplay><Paragraph>Such variables are meant to be assigned once. They are called <b>constants</b> , because their value never changes. However, if someone else takes my code and wishes to adapt and extend it, they may not realise those variables are supposed to remain constant. Even I may forget it and try to assign a new value further down in the code! To help prevent such slip-ups the Python convention is to write names of constants in uppercase letters, with words separated by underscores. Thus, any further assignment to a variable in uppercase will ring an alarm bell (in your head, the computer remains silent).</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GDP_GBP = 'GDP (million GBP)'</Paragraph><Paragraph>GDP_USD = 'GDP (US$)'</Paragraph><Paragraph>COUNTRY = 'Country name'</Paragraph><Paragraph>gdp[GDP_GBP] = gdp[GDP_USD].apply(usdToGbp)</Paragraph><Paragraph>headings = [COUNTRY, GDP_GBP]</Paragraph><Paragraph>gdp = gdp[headings]</Paragraph></ComputerDisplay><Paragraph>Using constants is not just a matter of laziness. There are various advantages. First, constants stand out in the code.</Paragraph><Paragraph>Second, when making changes to the repeated values throughout the code, it’s easy to miss an occurrence. Using constants means the code is always consistent throughout.</Paragraph><Paragraph>Third, the name of the constant can help clarify what the value means. For example, instead of using the number 1995 throughout the code, define a constant that makes clear whether it’s a year, the cubic centimetres of a car engine or something else.</Paragraph><Paragraph>To sum up, using constants makes the code clearer, easier to change, and less prone to silly (but hard to find) mistakes due to inconsistent values.</Paragraph><Paragraph>Any value can be defined as a constant, whether it’s a string, a number or even a dataframe. For example, you could store the data you have loaded from the file into a constant, as a reminder to not change the original data. 
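</Paragraph><Paragraph>The sketch below illustrates both uses on two tiny tables, with values copied from the earlier examples. It is only an illustration under stated assumptions: the names LIFE, GDP_DATA and LIFE_DATA are made up for this example, the tables are typed in rather than loaded from a file, and pandas is imported as pd, which may differ from how the course notebooks import it. Python itself does not enforce constants; the original tables stay unchanged here simply because merging returns a new dataframe.</Paragraph>

```python
import pandas as pd

# column names, defined once and used everywhere below
COUNTRY = 'Country name'
GDP_GBP = 'GDP (£m)'
LIFE = 'Life expectancy (years)'

# original data, kept in constants as a reminder not to change it
GDP_DATA = pd.DataFrame({COUNTRY: ['United Kingdom', 'United States', 'Brazil'],
                         GDP_GBP: [1711727, 10716029, 1435148]})
LIFE_DATA = pd.DataFrame({COUNTRY: ['Russia', 'United States', 'United Kingdom'],
                          LIFE: [71, 79, 81]})

# the common column is named through the constant, not through a repeated string
gdpVsLife = pd.merge(GDP_DATA, LIFE_DATA, on=COUNTRY, how='inner')

headings = [COUNTRY, LIFE]
print(gdpVsLife[headings])
```

<Paragraph>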
In the rest of the week, I’ll use constants mainly for the column names.</Paragraph><Activity><Heading>Exercise 6 Constants</Heading><Question><Paragraph>To practise using constants, rewrite your exercises in the Exercise notebook 3 using them.</Paragraph></Question></Activity></Section><Section><Title>1.2 Getting real</Title><Paragraph>Having tried out the data transformations and combination on small tables, I feel confident about using the full data from the World Bank, which I pointed you to in Life expectancy project.</Paragraph><Paragraph>Open a new browser window and go to the World Bank’s <a href="http://data.worldbank.org/">data page</a>. Type ‘GDP’ (without the quote marks) in the ‘Find an indicator’ box in the centre of the page and select ‘GDP current US$’. Click ‘Go’. This will take you to the data page you looked at earlier. Look at the top of your browser window. You will notice the URL is <a href="http://data.worldbank.org/indicator/NY.GDP.MKTP.CD">http://data.worldbank.org/indicator/NY.GDP.MKTP.CD</a>. Every World Bank dataset is for an indicator (in this case GDP in current dollars) with a unique name (in this case NY.GDP.MKTP.CD).</Paragraph><Paragraph>Knowing the indicator name, it’s a doddle to get the data directly into a dataframe, by using the <ComputerCode>
<b>download()</b>
</ComputerCode> function of the <ComputerCode>
<b>wb</b>
</ComputerCode> (World Bank) module, instead of first downloading a CSV or Excel file and then loading it into a dataframe. (Note that CoCalc’s free plan doesn’t allow connecting to other sites, so if you are using CoCalc you’ll need to download the data as a CSV or Excel file from the World Bank and upload it to CoCalc.)</Paragraph><Paragraph>Here’s the code to get the 2013 GDP values for all countries. It may take a little while for the code to fetch the data.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph># older pandas versions provided this function in pandas.io.wb</Paragraph><Paragraph>from pandas_datareader.wb import download</Paragraph><Paragraph>YEAR = 2013</Paragraph><Paragraph>GDP_INDICATOR = 'NY.GDP.MKTP.CD'</Paragraph><Paragraph>data = download(indicator=GDP_INDICATOR, country='all',</Paragraph><Paragraph>              start=YEAR, end=YEAR)</Paragraph><Paragraph>data.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th/><th>NY.GDP.MKTP.CD</th></tr><tr><th><b>country</b></th><th>year</th><th/></tr><tr><td><b>Arab World</b></td><td><b>2013</b></td><td>2.843483e+12</td></tr><tr><td><b>Caribbean small states</b></td><td><b>2013</b></td><td>6.680344e+10</td></tr><tr><td><b>Central Europe and the Baltics</b></td><td><b>2013</b></td><td>1.418166e+12</td></tr><tr><td><b>East Asia &amp; Pacific (all income levels)</b></td><td><b>2013</b></td><td>2.080794e+13</td></tr><tr><td><b>East Asia &amp; Pacific (developing only)</b></td><td><b>2013</b></td><td>1.168563e+13</td></tr></tbody></Table><Paragraph>This table definitely has an odd shape. The three columns don’t have their headings side by side, and the row numbering (0, 1, 2, etc) is missing. That’s because the first two ‘columns’ are in fact the dataframe index. You saw a similar table in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83250&amp;targetdoc=Week+4%3A+Cleaning+up+our+act+Part+2&amp;targetptr=2.1">Changing a dataframe’s index,</a> when the index of the weather dataframe was set to be the ‘GMT’ column, with values of type <ComputerCode>
<b>datetime64</b>
</ComputerCode>. There’s a dataframe method to do the inverse, i.e. to transform the row names into column values and thereby reinstate the default dataframe index.</Paragraph><Paragraph><ComputerCode>
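</ComputerCode>Because the download needs an internet connection, it may help to try the method first on a small hand-made table. The following sketch is a stand-in, not the World Bank data itself: it builds a dataframe with the same two-level index (country and year) from the first two rows shown above, with the year typed as text, and then resets the index. It assumes pandas is imported as pd and uses the pandas function MultiIndex.from_tuples(), which is not otherwise used in this course.</Paragraph>

```python
import pandas as pd

# a two-level row index, like the one of the downloaded table
index = pd.MultiIndex.from_tuples(
    [('Arab World', '2013'), ('Caribbean small states', '2013')],
    names=['country', 'year'])
data = pd.DataFrame({'NY.GDP.MKTP.CD': [2.843483e+12, 6.680344e+10]},
                    index=index)

# turn the row names into ordinary columns
flat = data.reset_index()
print(list(flat.columns))
print(list(flat.index))
```

<Paragraph>The printed column names are country, year and NY.GDP.MKTP.CD, and the rows are numbered 0 and 1 again.</Paragraph><Paragraph><ComputerCode>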
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>gdp = data.reset_index()</Paragraph><Paragraph>gdp.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>year</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>0</b></td><td>Arab World</td><td>2013</td><td>2.843483e+12</td></tr><tr><td><b>1</b></td><td>Caribbean small states</td><td>2013</td><td>6.680344e+10</td></tr><tr><td><b>2</b></td><td>Central Europe and the Baltics</td><td>2013</td><td>1.418166e+12</td></tr><tr><td><b>3</b></td><td>East Asia &amp; Pacific (all income levels)</td><td>2013</td><td>2.080794e+13</td></tr><tr><td><b>4</b></td><td>East Asia &amp; Pacific (developing only)</td><td>2013</td><td>1.168563e+13</td></tr></tbody></Table><Paragraph>I repeat the whole process for the life expectancy:</Paragraph><BulletedList><ListItem>search for ‘life expectancy’ on the World Bank site</ListItem><ListItem>choose the ‘total’ dataset, which includes both female and male inhabitants</ListItem><ListItem>note down its indicator (SP.DYN.LE00.IN)</ListItem><ListItem>use it to get the data</ListItem><ListItem>reset the dataframe index.</ListItem></BulletedList><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>LIFE_INDICATOR = 'SP.DYN.LE00.IN'</Paragraph><Paragraph>data = download(indicator=LIFE_INDICATOR, country='all',</Paragraph><Paragraph>              start=YEAR, end=YEAR)</Paragraph><Paragraph>life = data.reset_index()</Paragraph><Paragraph>life.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>year</th><th>SP.DYN.LE00.IN</th></tr><tr><td><b>0</b></td><td>Arab World</td><td>2013</td><td>70.086392</td></tr><tr><td><b>1</b></td><td>Caribbean small states</td><td>2013</td><td>71.966306</td></tr><tr><td><b>2</b></td><td>Central Europe and the Baltics</td><td>2013</td><td>76.127583</td></tr><tr><td><b>3</b></td><td>East Asia &amp; Pacific (all income levels)</td><td>2013</td><td>74.893439</td></tr><tr><td><b>4</b></td><td>East Asia &amp; Pacific (developing only)</td><td>2013</td><td>73.981255</td></tr></tbody></Table><Paragraph>By defining the year as a constant, it’s very quick to change the code to load both datasets for any other year. If you wish to get GDP data for an earlier year than for life expectancy, then you need to define a second constant.</Paragraph><Activity><Heading>Exercise 7 Getting real</Heading><Question><Paragraph>The approach described above requires an internet connection to download the data directly from the World Bank. That may take some time, or fail altogether if the connection drops. Moreover, the World Bank sometimes changes its data format, which could break the code in the rest of this week.</Paragraph><Paragraph>Therefore, the Exercise notebook 3 instead loads the GDP and life expectancy data from the files WB GDP 2013.csv and WB LE 2013.csv, and Exercise 7 uses the file WB POP 2013.csv, which you should add to your disk folder or CoCalc project. All files are in the normal tabular format and need no resetting of the indices.</Paragraph></Question></Activity></Section><Section><Title>1.3 Cleaning up</Title><Paragraph>You may have noticed that the initial rows are not about countries, but groups of countries. Such aggregated values need to be removed, because we’re only interested in individual countries.</Paragraph><Paragraph>The expression <ComputerCode>
<b>frame[m:n]</b>
</ComputerCode>, with <ComputerCode>
<b>n</b>
</ComputerCode> an integer bigger than <ComputerCode>
<b>m</b>
</ComputerCode> , represents the ‘sub-table’ from row <ComputerCode>
<b>m</b>
</ComputerCode> to row <ComputerCode>
<b>n-1</b>
</ComputerCode>. In other words, it is a slice of frame with exactly <ComputerCode>
<b>n</b>
</ComputerCode> minus <ComputerCode>
<b>m</b>
</ComputerCode> rows. The expression is equivalent to the more convoluted expression <ComputerCode>
<b>frame.head(n).tail(n-m)</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
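</ComputerCode>The equivalence can be checked on a small made-up dataframe. In this sketch the table, its column name and the values of m and n are invented for illustration, and pandas is assumed to be imported as pd.</Paragraph>

```python
import pandas as pd

frame = pd.DataFrame({'value': [10, 20, 30, 40, 50]})
m = 1
n = 4

sliced = frame[m:n]                   # rows m to n-1, i.e. rows 1, 2 and 3
chained = frame.head(n).tail(n - m)   # first n rows, then the last n-m of those

print(len(sliced) == n - m)      # the slice has exactly n minus m rows
print(sliced.equals(chained))    # both expressions select the same rows
print(len(frame[m:]))            # omitting n slices up to the last row
```

<Paragraph>Both comparisons print True, and the open-ended slice has 4 rows.</Paragraph><Paragraph><ComputerCode>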
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>gdp[0:3]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>year</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>0</b></td><td>Arab World</td><td>2013</td><td>2.843483e+12</td></tr><tr><td><b>1</b></td><td>Caribbean small states</td><td>2013</td><td>6.680344e+10</td></tr><tr><td><b>2</b></td><td>Central Europe and the Baltics</td><td>2013</td><td>1.418166e+12</td></tr></tbody></Table><Paragraph>To slice all rows from <ComputerCode>
<b>m</b>
</ComputerCode> onwards, you don’t have to count how many rows there are beforehand, just omit <ComputerCode>
<b>n</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>gdp[240:]</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>year</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>240</b></td><td>Uzbekistan</td><td>2013</td><td>5.679566e+10</td></tr><tr><td><b>241</b></td><td>Vanuatu</td><td>2013</td><td>8.017876e+08</td></tr><tr><td><b>242</b></td><td>Venezuela, RB</td><td>2013</td><td>3.713366e+11</td></tr><tr><td><b>243</b></td><td>Vietnam</td><td>2013</td><td>1.712220e+11</td></tr><tr><td><b>244</b></td><td>Virgin Islands (U.S.)</td><td>2013</td><td>NaN</td></tr><tr><td><b>245</b></td><td>West Bank and Gaza</td><td>2013</td><td>1.247600e+10</td></tr><tr><td><b>246</b></td><td>Yemen, Rep.</td><td>2013</td><td>3.595450e+10</td></tr><tr><td><b>247</b></td><td>Zambia</td><td>2013</td><td>2.682081e+10</td></tr><tr><td><b>248</b></td><td>Zimbabwe</td><td>2013</td><td>1.349023e+10</td></tr></tbody></Table><Paragraph>By trying out <ComputerCode>
<b>head(m)</b>
</ComputerCode> for different values of <ComputerCode>
<b>m</b>
</ComputerCode>, I find that the list of individual countries starts in row number 34, with Afghanistan. Hence, I slice from row 34 onwards, and that’s my new dataframe.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>gdp = gdp[34:]</Paragraph><Paragraph>gdp.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>year</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>34</b></td><td>Afghanistan</td><td>2013</td><td>2.031088e+10</td></tr><tr><td><b>35</b></td><td>Albania</td><td>2013</td><td>1.291667e+10</td></tr><tr><td><b>36</b></td><td>Algeria</td><td>2013</td><td>2.101834e+11</td></tr><tr><td><b>37</b></td><td>American Samoa</td><td>2013</td><td>NaN</td></tr><tr><td><b>38</b></td><td>Andorra</td><td>2013</td><td>3.249101e+09</td></tr></tbody></Table><Paragraph>Unsurprisingly, there is missing data, so I remove those rows, as shown in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83250&amp;targetdoc=Week+4%3A+Cleaning+up+our+act+Part+2&amp;targetptr=1.3">Missing values</a> in Week 4.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>gdp = gdp.dropna()</Paragraph><Paragraph>gdp.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>year</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>34</b></td><td>Afghanistan</td><td>2013</td><td>2.031088e+10</td></tr><tr><td><b>35</b></td><td>Albania</td><td>2013</td><td>1.291667e+10</td></tr><tr><td><b>36</b></td><td>Algeria</td><td>2013</td><td>2.101834e+11</td></tr><tr><td><b>38</b></td><td>Andorra</td><td>2013</td><td>3.249101e+09</td></tr><tr><td><b>39</b></td><td>Angola</td><td>2013</td><td>1.241632e+11</td></tr></tbody></Table><Paragraph>Finally, I drop the irrelevant year column.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>COUNTRY = 'country'</Paragraph><Paragraph>headings = [COUNTRY, GDP_INDICATOR]</Paragraph><Paragraph>gdp = gdp[headings]</Paragraph><Paragraph>gdp.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>34</b></td><td>Afghanistan</td><td>2.031088e+10</td></tr><tr><td><b>35</b></td><td>Albania</td><td>1.291667e+10</td></tr><tr><td><b>36</b></td><td>Algeria</td><td>2.101834e+11</td></tr><tr><td><b>38</b></td><td>Andorra</td><td>3.249101e+09</td></tr><tr><td><b>39</b></td><td>Angola</td><td>1.241632e+11</td></tr></tbody></Table><Paragraph>And now I repeat the whole cleaning process for the life expectancy table.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>headings = [COUNTRY, LIFE_INDICATOR]</Paragraph><Paragraph>life = life[34:].dropna()[headings]</Paragraph><Paragraph>life.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>SP.DYN.LE00.IN</th></tr><tr><td><b>34</b></td><td>Afghanistan</td><td>60.931415</td></tr><tr><td><b>35</b></td><td>Albania</td><td>77.537244</td></tr><tr><td><b>36</b></td><td>Algeria</td><td>71.009659</td></tr><tr><td><b>39</b></td><td>Angola</td><td>51.866171</td></tr><tr><td><b>40</b></td><td>Antigua and Barbuda</td><td>75.829293</td></tr></tbody></Table><Paragraph>Note how a single line of code can chain a row slice, a method call and a column slice, because each takes a dataframe and returns a dataframe.</Paragraph><Activity><Heading>Exercise 8 Cleaning up</Heading><Question><Paragraph>Clean up the population data from Exercise 7, in Exercise 8 in the exercise notebook 3.</Paragraph></Question></Activity></Section><Section><Title>1.4 Joining and transforming</Title><Paragraph>With the little tables, I first transformed the columns and then joined the tables.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1054.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1054.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="d660b5f4" x_imagesrc="ou_futurelearn_learn_to_code_fig_1054.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 2</b> </Caption><Alternative>An image of a bride and groom holding hands with the minister between them in the background</Alternative><Description>An image of a bride and groom holding hands with the minister between them in the background</Description></Figure><Paragraph>As you may be starting to realise, there’s often more than one way to do it. Just for illustration, I’ll do the other way round for the big tables. Here are the tables, as a reminder.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>life.head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>SP.DYN.LE00.IN</th></tr><tr><td><b>34</b></td><td>Afghanistan</td><td>60.931415</td></tr><tr><td><b>35</b></td><td>Albania</td><td>77.537244</td></tr><tr><td><b>36</b></td><td>Algeria</td><td>71.009659</td></tr><tr><td><b>39</b></td><td>Angola</td><td>51.866171</td></tr><tr><td><b>40</b></td><td>Antigua and Barbuda</td><td>75.829293</td></tr></tbody></Table><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>gdp.head()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>34</b></td><td>Afghanistan</td><td>2.031088e+10</td></tr><tr><td><b>35</b></td><td>Albania</td><td>1.291667e+10</td></tr><tr><td><b>36</b></td><td>Algeria</td><td>2.101834e+11</td></tr><tr><td><b>38</b></td><td>Andorra</td><td>3.249101e+09</td></tr><tr><td><b>39</b></td><td>Angola</td><td>1.241632e+11</td></tr></tbody></Table><Paragraph>First, an inner join on the common column to combine rows where the common column value appears in both tables.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
gdpVsLife = merge(gdp, life, on='country', how='inner')
</Paragraph><Paragraph>gdpVsLife.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>NY.GDP.MKTP.CD</th><th>SP.DYN.LE00.IN</th></tr><tr><td><b>0</b></td><td>Afghanistan</td><td>2.031088e+10</td><td>60.931415</td></tr><tr><td><b>1</b></td><td>Albania</td><td>1.291667e+10</td><td>77.537244</td></tr><tr><td><b>2</b></td><td>Algeria</td><td>2.101834e+11</td><td>71.009659</td></tr><tr><td><b>3</b></td><td>Angola</td><td>1.241632e+11</td><td>51.866171</td></tr><tr><td><b>4</b></td><td>Antigua and Barbuda</td><td>1.200588e+09</td><td>75.829293</td></tr></tbody></Table><Paragraph>Second, the dollars are converted to millions of pounds.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>GDP = 'GDP (£m)'</Paragraph><Paragraph>column = gdpVsLife[GDP_INDICATOR]</Paragraph><Paragraph>
gdpVsLife[GDP] = column.apply(usdToGbp).apply(roundToMillions)
</Paragraph><Paragraph>gdpVsLife.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>NY.GDP.MKTP.CD</th><th>SP.DYN.LE00.IN</th><th>GDP (£m)</th></tr><tr><td><b>0</b></td><td>Afghanistan</td><td>2.031088e+10</td><td>60.931415</td><td>12980</td></tr><tr><td><b>1</b></td><td>Albania</td><td>1.291667e+10</td><td>77.537244</td><td>8255</td></tr><tr><td><b>2</b></td><td>Algeria</td><td>2.101834e+11</td><td>71.009659</td><td>134322</td></tr><tr><td><b>3</b></td><td>Angola</td><td>1.241632e+11</td><td>51.866171</td><td>79349</td></tr><tr><td><b>4</b></td><td>Antigua and Barbuda</td><td>1.200588e+09</td><td>75.829293</td><td>767</td></tr></tbody></Table><Paragraph>Third, the life expectancy is rounded to the nearest integer, with a by now familiar function.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>LIFE = 'Life expectancy (years)'</Paragraph><Paragraph>
gdpVsLife[LIFE] = gdpVsLife[LIFE_INDICATOR].apply(round)
</Paragraph><Paragraph>gdpVsLife.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>NY.GDP.MKTP.CD</th><th>SP.DYN.LE00.IN</th><th>GDP (£m)</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>Afghanistan</td><td>2.031088e+10</td><td>60.931415</td><td>12980</td><td>61</td></tr><tr><td><b>1</b></td><td>Albania</td><td>1.291667e+10</td><td>77.537244</td><td>8255</td><td>78</td></tr><tr><td><b>2</b></td><td>Algeria</td><td>2.101834e+11</td><td>71.009659</td><td>134322</td><td>71</td></tr><tr><td><b>3</b></td><td>Angola</td><td>1.241632e+11</td><td>51.866171</td><td>79349</td><td>52</td></tr><tr><td><b>4</b></td><td>Antigua and Barbuda</td><td>1.200588e+09</td><td>75.829293</td><td>767</td><td>76</td></tr></tbody></Table><Paragraph>Lastly, the original columns are discarded.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>headings = [COUNTRY, GDP, LIFE]</Paragraph><Paragraph>gdpVsLife = gdpVsLife[headings]</Paragraph><Paragraph>gdpVsLife.head()</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>GDP (£m)</th><th>Life expectancy (years)</th></tr><tr><td><b>0</b></td><td>Afghanistan</td><td>12980</td><td>61</td></tr><tr><td><b>1</b></td><td>Albania</td><td>8255</td><td>78</td></tr><tr><td><b>2</b></td><td>Algeria</td><td>134322</td><td>71</td></tr><tr><td><b>3</b></td><td>Angola</td><td>79349</td><td>52</td></tr><tr><td><b>4</b></td><td>Antigua and Barbuda</td><td>767</td><td>76</td></tr></tbody></Table><Paragraph>For the first five countries there doesn’t seem to be any relation between wealth and life expectancy, but that might be just for those countries.</Paragraph><Activity><Heading>Exercise 9 Joining and transforming</Heading><Question><Paragraph>Have a go at merging dataframes with an inner join in Exercise 9 in the Exercise notebook 3.</Paragraph></Question></Activity></Section></Session><Session><Title>2 Correlation</Title><Paragraph>To see if life expectancy grows when the GDP increases I will use a statistical measure known as the <b>Spearman rank correlation coefficient</b>.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1055.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1055.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="04acf66c" x_imagesrc="ou_futurelearn_learn_to_code_fig_1055.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b></Caption><Alternative>An image of many thread spools with threads drawn together below and twisted into one</Alternative><Description>An image of many thread spools with threads drawn together below and twisted into one</Description></Figure><Paragraph>It’s a number between -1 and 1 that describes how well two indicators correlate, in the following 
sense.</Paragraph><BulletedList><ListItem>A value of 1 means that if I rank (sort) the data from smallest to largest value in one indicator, it will also be in ascending order according to the other indicator. In other words, if one indicator grows, so does the other.</ListItem><ListItem>A value of -1 means a perfect inverse rank relation: if I sort the data from smallest to largest according to one indicator, I will see it is sorted from largest to smallest in the other indicator. When one indicator goes up, the other goes down.</ListItem><ListItem>A value of 0 means there is no rank relation between the two indicators.</ListItem></BulletedList><Paragraph>A positive value smaller than 1 (or a negative value larger than -1) means there is some direct (or inverse) correlation, but it is not systematic across the whole dataset.</Paragraph><Paragraph>The <b>p-value</b> indicates how significant the result is, in a particular technical sense. To say a correlation is statistically significant doesn’t necessarily mean it is important or strong in the real world, but only that there is reasonable statistical evidence that there is some kind of relationship. Typically, the obtained correlation coefficient is considered statistically significant if the p-value is below 0.05.</Paragraph><Paragraph>The pandas module doesn’t calculate complex statistics. There are other modules in the Anaconda distribution for that. In particular, <ComputerCode>
<b>scipy</b>
</ComputerCode> (Scientific Python) has a stats module that provides the <ComputerCode>
<b>spearmanr()</b>
</ComputerCode> function. The function takes as arguments the two columns of data to correlate. Contrary to the functions you’ve seen so far, it returns two values instead of one: the correlation and the p-value. To store both values, simply use a pair of variables, written in parentheses.</Paragraph><Paragraph>To show the results in a nicer way, I will use the Python <ComputerCode>
<b>print()</b>
</ComputerCode> function, which displays its arguments in a single line.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>from scipy.stats import spearmanr </Paragraph><Paragraph>gdpColumn = gdpVsLife[GDP]</Paragraph><Paragraph>lifeColumn = gdpVsLife[LIFE]</Paragraph><Paragraph>(correlation, pValue) = spearmanr(gdpColumn, lifeColumn)</Paragraph><Paragraph>print('The correlation is', correlation)</Paragraph><Paragraph>if pValue &lt; 0.05: </Paragraph><Paragraph>        print('It is statistically significant.') </Paragraph><Paragraph>else:</Paragraph><Paragraph>        print('It is not statistically significant.')</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>The correlation is 0.493179132478</Paragraph><Paragraph>It is statistically significant.</Paragraph></ComputerDisplay><Paragraph>Although there is a statistically significant direct correlation (life expectancy grows as GDP grows), it isn’t strong.</Paragraph><Paragraph>A perfect (direct or inverse) correlation doesn’t mean there is any cause-effect relationship between the two indicators. A perfect direct correlation between life expectancy and GDP would only state that the higher the GDP, the higher the life expectancy. It would not state that the higher life expectancy is due to the GDP. Correlation is not causation.</Paragraph><Activity><Heading>Exercise 10 Correlation</Heading><Question><Paragraph>Calculate the correlation between GDP and population in Exercise 10 in the Exercise notebook 3.</Paragraph><Paragraph>Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook.</Paragraph></Question></Activity><Section><Title>2.1 Scatterplots</Title><Paragraph>Statistics can be misleading. A coefficient of zero only states there is no ranking relation between the indicators, but there might be some other relationship.</Paragraph><Paragraph>In the next example, the correlation between x and y is zero, but they are clearly related (y is the square of x).</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>table = [ [-2,4], [-1,1], [0,0], [1,1], [2,4] ]</Paragraph><Paragraph>data = DataFrame(columns=['x', 'y'], data=table)</Paragraph><Paragraph>
(correlation, pValue) = spearmanr(data['x'], data['y'])
</Paragraph><Paragraph>print('The correlation is', correlation)</Paragraph><Paragraph>data</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>The correlation is 0.0</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>x</th><th>y</th></tr><tr><td>0</td><td>-2</td><td>4</td></tr><tr><td>1</td><td>-1</td><td>1</td></tr><tr><td>2</td><td>0</td><td>0</td></tr><tr><td>3</td><td>1</td><td>1</td></tr><tr><td>4</td><td>2</td><td>4</td></tr></tbody></Table><Paragraph>It’s therefore best to complement the quantitative analysis with a more qualitative view provided by a chart. In the case of correlations, <b>scatterplots</b> will do very nicely. Each country is a dot plotted at the x and y coordinates corresponding to the GDP and life expectancy values.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>%matplotlib inline</Paragraph><Paragraph>
gdpVsLife.plot(x=GDP, y=LIFE, kind='scatter', grid=True)
</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
&lt;matplotlib.axes._subplots.AxesSubplot at 0x10e2e6eb8&gt;
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1018.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1018.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="26a3c738" x_imagesrc="ou_futurelearn_learn_to_code_fig_1018.jpg" x_imagewidth="512" x_imageheight="214"/><Caption><b>Figure 4</b> </Caption><Alternative>A graph with GDP (£m) on x-axis and Life expectancy (years) on y-axis of the poorest and richest countries</Alternative></Figure><Paragraph>This graph is not very useful. The GDP difference between the poorest and richest countries is so vast that the whole chart is squashed to fit all GDP values on the x-axis. It is best to use a <b>logarithmic scale</b> , where the axis values don’t increase by a constant interval (10, 20, 30, for example), but by a multiplicative factor (10, 100, 1000, 10000, etc.). The parameter <ComputerCode>
<b>logx</b>
</ComputerCode> has to be set to <ComputerCode>
<b>True</b>
</ComputerCode> to get a logarithmic scale on the x-axis. Moreover, let’s make the chart a bit wider, by using the <ComputerCode>
<b>figsize</b>
</ComputerCode> parameter you saw last week.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>gdpVsLife.plot(x=GDP, y=LIFE, kind='scatter', grid=True, </Paragraph><Paragraph>              logx=True, figsize = (10, 4))</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
&lt;matplotlib.axes._subplots.AxesSubplot at 0x10e400588&gt;
</ComputerCode></Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1019.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1019.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="042d3371" x_imagesrc="ou_futurelearn_learn_to_code_fig_1019.jpg" x_imagewidth="512" x_imageheight="231"/><Caption><b>Figure 5</b></Caption><Alternative>A graph with GDP (£m) on x-axis and Life expectancy (years) on y-axis of the poorest and richest countries</Alternative></Figure><Paragraph>The major tick marks in the x-axis go from 10 <sup>2</sup> (that’s a one followed by two zeros, hence 100) to 10 <sup>8</sup> (that’s a one followed by eight zeros, hence 100,000,000) million pounds, with the minor ticks marking the numbers in between. For example, the eight minor ticks between 10 <sup>2</sup> and 10 <sup>3</sup> represent the values 200 (2 × 10 <sup>2</sup> ), 300 (3 × 10 <sup>2</sup> ), and so on until 900 (9 × 10 <sup>2</sup> ). As a further example, the country with the lowest life expectancy is on the second minor tick to the right of 10 <sup>3</sup> , which means its GDP is about 3 × 10 <sup>3</sup> (three thousand) million pounds.</Paragraph><Paragraph>Countries with a GDP around 10 thousand (10 <sup>4</sup> ) millions of pounds have a wide range of life expectancies, from under 50 to over 80, but the range tends to shrink both for poorer and for richer countries. 
Countries with the lowest life expectancy are neither the poorest nor the richest, but those with the highest life expectancy are among the richer countries.</Paragraph><Activity><Heading>Exercise 11 Scatterplots</Heading><Question><Paragraph>Practise using scatterplots in Exercise 11 in the Exercise notebook 3.</Paragraph></Question></Activity></Section><Section><Title>2.2 My project</Title><Paragraph>I’ve written up my analysis of this week’s project in a notebook, which you can open from your downloaded files.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1056.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1056.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="b6c2da6b" x_imagesrc="ou_futurelearn_learn_to_code_fig_1056.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 6</b></Caption><Alternative>An image of an older man holding a surf board on a beach</Alternative><Description>An image of an older man holding a surf board on a beach</Description></Figure><Paragraph>The structure is very simple: besides the introduction and the conclusions, there is one section for each step of the analysis – downloading, cleaning, transforming, and merging the data, then calculating and visualising the correlation.</Paragraph><Paragraph>Open Project 4: Life expectancy</Paragraph><Paragraph>If you have time, extend my project to answer different questions or create your own project in the activity below.</Paragraph><Activity><Heading>Activity 1</Heading><Multipart><Part><Heading>Extend the project</Heading><Question><Paragraph>Make a copy of Project 3: GDP and Life expectancy and change it to answer one or more of the following questions:</Paragraph><BulletedList><ListItem>To what extent do the ten countries with the highest GDP coincide 
with the ten countries with the longest life expectancy?</ListItem><ListItem>Which are the two countries in the right half of the plot (higher GDP) with life expectancy below 60 years? What factors could explain their lower life expectancy compared to countries with similar GDP? <Paragraph>Hint: use the filtering techniques you learned in Week 2 to find the two countries.</Paragraph></ListItem><ListItem>Redo the analysis using the countries’ GDP per capita (i.e. per inhabitant) instead of their total GDP. If you’ve done the workbook exercises, you already have a column with the population data. <Paragraph>Hint: write an expression involving the GDP and population columns, as you learned in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83250&amp;targetdoc=Week+2%3A+Having+a+go+at+it+Part+2&amp;targetptr=1.6">Calculating over columns</a> in Week 1.</Paragraph></ListItem><ListItem>Think about the units in which you display GDP per capita.</ListItem><ListItem>Redo the analysis using the indicator suggested at the end of the project notebook.</ListItem></BulletedList></Question></Part><Part><Heading>Create your own project</Heading><Question><Paragraph>If you have more time, create a completely new project and choose another two of the hundreds of World Bank indicators and see if there is any correlation between them. 
If there is a choice of similar indicators, choose one that leads to meaningful comparisons between countries.</Paragraph><Paragraph>Look at the results you obtained and take a few moments to assess how they differ from mine.</Paragraph></Question></Part></Multipart></Activity></Section></Session><Session><Title>3 This week’s quiz</Title><Paragraph>Check what you’ve learned this week by taking the end-of-week quiz.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78782">Week 6 practice quiz</a></Paragraph><Paragraph>Open the quiz in a new window or tab then come back here when you’ve finished.</Paragraph></Session><Session><Title>4 Summary </Title><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1058.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1058.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="b927eddd" x_imagesrc="ou_futurelearn_learn_to_code_fig_1058.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 7</b> </Caption><Description>An image of an elderly black couple riding their bicycles in a park </Description></Figure><Paragraph>This week you transformed and combined data by:</Paragraph><BulletedList><ListItem>computing a correlation coefficient between two series of values and checking whether the correlation is statistically significant</ListItem><ListItem>generating scatterplots to look for other relationships</ListItem><ListItem>using a logarithmic scale when an indicator had a wide range of values.</ListItem></BulletedList><Paragraph>Next week you'll learn how to group, export and import data to generate pivot table style reports.</Paragraph><Section><Title>4.1 Weeks 5 and 6 glossary</Title><Paragraph>Here are alphabetical lists, for quick look up, of what this week 
introduced.</Paragraph><InternalSection><Heading>Concepts</Heading><Paragraph>A <b>conditional statement</b> is of the form</Paragraph><ComputerDisplay><Paragraph>if condition1:</Paragraph><Paragraph>     statements1</Paragraph><Paragraph>elif condition2:</Paragraph><Paragraph>     statements2</Paragraph><Paragraph>...</Paragraph><Paragraph>else:</Paragraph><Paragraph>     statements</Paragraph></ComputerDisplay><Paragraph>The computer evaluates the conditions from top to bottom and executes <i>only</i> the statements for the <i>first</i> condition that is true. If all conditions are false, it executes the <ComputerCode>
<b>else</b>
</ComputerCode> statements. If there is no <ComputerCode>
<b>else</b>
</ComputerCode> part nothing happens. The <ComputerCode>
<b>elif</b>
</ComputerCode> parts are optional too. Each block of statements must be indented, usually by four spaces.</Paragraph><Paragraph>A <b>constant</b> is a variable that is assigned only once, i.e. its initial value never changes. Constant names are conventionally written in uppercase, with underscores to separate multiple words.</Paragraph><Paragraph>A <b>function definition</b> is typically of the form</Paragraph><ComputerDisplay><Paragraph>
def functionName(argumentName1, argumentName2, ...):
</Paragraph><Paragraph>    statements using arguments to compute the result</Paragraph><Paragraph>    return result</Paragraph></ComputerDisplay><Paragraph>All statements in the body of the function must have the same indentation, usually four spaces. The statements use the arguments like normal variables. The execution of the function ends when a return statement is encountered.</Paragraph><Paragraph>A <b>join</b> is the merging of two tables on a common column. The resulting table has all columns of both tables (the common column isn’t duplicated), and the rows are determined by the type of join. Rows in the two tables that have the same value in the common column become a joined row in the resulting table.</Paragraph><Paragraph>In a <b>logarithmic scale</b>, each major tick represents the value of the previous major tick multiplied by some constant (usually 10).</Paragraph><Paragraph>A <b>method chain</b> is an expression like <ComputerCode>
<b> context.method1(args1).method2(args2).method3(args3) </b>
</ComputerCode> where each method has and returns the same type of context, except possibly the last method, which can return any type of value.</Paragraph><Paragraph>The <b>p-value</b> is an indication of the significance of the result. Usually a p-value below 0.05 is taken to mean the result is statistically significant.</Paragraph><Paragraph>A <b>return statement</b> is of the form <ComputerCode>
<b>return expression</b>
</ComputerCode> and passes the value of the expression back to the code that called the function to which the return statement belongs.</Paragraph><Paragraph>The <b>Spearman rank correlation coefficient</b> of two series of values (e.g. two columns) is a number from -1 (perfect inverse correlation) to 1 (perfect direct correlation), with 0 meaning there is no rank correlation. Correlation doesn’t imply causation. A rank correlation of 1 merely states that both values increase and decrease together, while a correlation of -1 states that if one value increases, the other decreases.</Paragraph><Paragraph>A <b>test</b> is some code that checks whether some other code works as expected, e.g. a boolean expression that compares the return value of a function call with the expected value.</Paragraph></InternalSection><InternalSection><Heading>Reserved Words</Heading><Paragraph><ComputerCode>
<b>def, elif, else, if</b>
</ComputerCode> and <ComputerCode>
<b>return</b>
</ComputerCode> cannot be used as names.</Paragraph></InternalSection><InternalSection><Heading>Functions and methods</Heading><Paragraph><ComputerCode>
<b>col.apply(functionName)</b>
</ComputerCode> returns a new column, obtained by applying the given one-argument function to each cell in column <ComputerCode>
<b>col</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>DataFrame(columns=listOfStrings, data=listOfLists)</b>
</ComputerCode> returns a new dataframe, given the data as a list of rows, each row being a list of values in column order.</Paragraph><Paragraph><ComputerCode>
<b> download(indicator=string, country='all', start=number, end=number) </b>
</ComputerCode> is a function in the pandas.io.wb module that downloads the World Bank data for the given indicator and all countries and country groups from the given start year to the given end year.</Paragraph><Paragraph><ComputerCode>
<b> merge(left=frame1, right=frame2, on=columnName, how=string) </b>
</ComputerCode> returns a new dataframe, obtained by joining the two frames on the columns with the given common name. The <ComputerCode>
<b>how</b>
</ComputerCode> argument can be one of <ComputerCode>
<b>'left', 'right', 'inner'</b>
</ComputerCode> and <ComputerCode>
<b>'outer'</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>print()</b>
</ComputerCode> is a Python function that takes one or more expressions and prints their values on the screen in a single line.</Paragraph><Paragraph><ComputerCode>
<b>frame.reset_index()</b>
</ComputerCode> returns a new dataframe in which rows are labelled from 0 onwards.</Paragraph><Paragraph><ComputerCode>
<b>spearmanr()</b>
</ComputerCode> is a function in the scipy.stats module that takes two columns and returns a pair of numbers: the Spearman rank correlation coefficient of the two series of values, and its p-value.</Paragraph></InternalSection></Section></Session></Unit><Unit><UnitID/><UnitTitle>Week 7: Further techniques Part 1</UnitTitle><Session><Title>1 I spy with my little eye</Title><Paragraph>One of the ways you were shown for loading World Bank data into the notebook in Weeks 5 and 6 was to use the <ComputerCode>
<b>download ()</b>
</ComputerCode> function.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1059.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1059.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="c15454e0" x_imagesrc="ou_futurelearn_learn_to_code_fig_1059.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b></Caption><Alternative>A close-up image of a green eye</Alternative><Description>A close-up image of a green eye</Description></Figure><Paragraph>One way to find out for yourself what sorts of argument a function expects is to ask it. Running a code cell containing a question mark (?) followed by a function name should pop up a help area in the bottom of the notebook window. (Close it using the x in the top right hand corner of the panel.)</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>from pandas.io.wb import download</Paragraph><Paragraph>?download</Paragraph></ComputerDisplay><Paragraph>The function documentation tells you that you can enter a list of one or more country names using standard country codes as well as a date range. You can also calculate a date range from a single date to show the <ComputerCode>
<b>N</b>
</ComputerCode> years of data leading up to a particular year.</Paragraph><Paragraph>Note that if you are using the CoCalc free plan, you will not be able to use the <ComputerCode>
<b>download()</b>
</ComputerCode> function to download the data directly from the World Bank API, although you will still be able to inspect the documentation associated with the function.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>YEAR = 2013</Paragraph><Paragraph>GDP_INDICATOR = 'NY.GDP.MKTP.CD'</Paragraph><Paragraph>
gdp = download(indicator=GDP_INDICATOR, country=['GB','CN'], start=YEAR-5, end=YEAR)
</Paragraph><Paragraph>gdp = gdp.reset_index()</Paragraph><Paragraph>gdp</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>country</th><th>year</th><th>NY.GDP.MKTP.CD</th></tr><tr><td><b>0</b></td><td>China</td><td>2013</td><td>9.490603e+12</td></tr><tr><td><b>1</b></td><td>China</td><td>2012</td><td>8.461623e+12</td></tr><tr><td><b>2</b></td><td>China</td><td>2011</td><td>7.492432e+12</td></tr><tr><td><b>3</b></td><td>China</td><td>2010</td><td>6.039659e+12</td></tr><tr><td><b>4</b></td><td>China</td><td>2009</td><td>5.059420e+12</td></tr><tr><td><b>5</b></td><td>China</td><td>2008</td><td>4.558431e+12</td></tr><tr><td><b>6</b></td><td>United Kingdom</td><td>2013</td><td>2.678173e+12</td></tr><tr><td><b>7</b></td><td>United Kingdom</td><td>2012</td><td>2.614946e+12</td></tr><tr><td><b>8</b></td><td>United Kingdom</td><td>2011</td><td>2.592016e+12</td></tr><tr><td><b>9</b></td><td>United Kingdom</td><td>2010</td><td>2.407857e+12</td></tr><tr><td><b>10</b></td><td>United Kingdom</td><td>2009</td><td>2.308995e+12</td></tr><tr><td><b>11</b></td><td>United Kingdom</td><td>2008</td><td>2.791682e+12</td></tr></tbody></Table><Paragraph>Although many datasets that you are likely to work with are published in the form of a single data table, such as a single CSV file or spreadsheet worksheet, it is often possible to regard the dataset as being made up from several distinct subsets of data.</Paragraph><Paragraph>In the above example, you will probably notice that each country name appears in several rows, as does each year. This suggests that we can make different sorts of comparisons between different groupings of data using just this dataset. For example, compare the total GDP of each country calculated over the six years 2008 to 2013 using just a single line of code:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
gdp.groupby('country')['NY.GDP.MKTP.CD'].aggregate(sum)
</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>country</Paragraph><Paragraph>China             4.110217e+13</Paragraph><Paragraph>United Kingdom    1.539367e+13</Paragraph><Paragraph>Name: NY.GDP.MKTP.CD, dtype: float64</Paragraph></ComputerDisplay><Paragraph>Essentially what this does is to say ‘for each country, find the total GDP’.</Paragraph><Paragraph>The total combined GDP for those two countries in each year could be found by making just one slight tweak to our code (can you see below where I made the change?):</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>
gdp.groupby('year')['NY.GDP.MKTP.CD'].aggregate(sum)
</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>year</Paragraph><Paragraph>2008    7.350113e+12</Paragraph><Paragraph>2009    7.368415e+12</Paragraph><Paragraph>2010    8.447515e+12</Paragraph><Paragraph>2011    1.008445e+13</Paragraph><Paragraph>2012    1.107657e+13</Paragraph><Paragraph>2013    1.216878e+13</Paragraph><Paragraph>Name: NY.GDP.MKTP.CD, dtype: float64</Paragraph></ComputerDisplay><Paragraph>That second calculation probably doesn’t make much sense in this particular case, but what if there was another column saying which region of the world each country was in? Then, by taking the data for all the countries in the world, the total GDP could be found for each region by grouping on <i>both</i> the year <i>and</i> the region.</Paragraph><Paragraph>Next, you will consider ways of grouping data.</Paragraph><Section><Title>1.1 Ways of grouping data</Title><Paragraph>Think back to the weather dataset you used in Week 3. How might you group that data into several distinct groups? What sorts of comparisons could you make by grouping just the elements of that dataset? 
Or how might you group and compare the GDP data?</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1060.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1060.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="44c3939f" x_imagesrc="ou_futurelearn_learn_to_code_fig_1060.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 2</b></Caption><Alternative>A very colourful image of piles of different peppers</Alternative><Description>A very colourful image of piles of different peppers</Description></Figure><Paragraph>One thing the newspapers love to report are weather ‘records’, such as the ‘hottest June ever’ or the wettest location in a particular year as measured by total annual rainfall, or highest average monthly rainfall. How easy is it to find that information out from the data?</Paragraph><Paragraph>Or with the GDP data, if countries were assigned to economic groupings such as the European Union, or regional groupings such as Africa, or South America, how would you generate information such as lowest GDP in the EU or highest GDP in South America?</Paragraph><Paragraph>This week you will learn how to split data into groups based on particular features of the data, and then generate information about each separate group, across all of the groups, at the same time.</Paragraph><Activity><Heading>Activity 1 Grouping data</Heading><Question><Paragraph>Based on the data you have seen so far, or some other datasets you may be aware of, what other ways of grouping data can you think of, and why might grouping data that way be useful?</Paragraph></Question><Interaction><FreeResponse size="paragraph" id="ahx_23l_sxb"/></Interaction></Activity></Section><Section><Title>1.2 Data that describes the world of trade</Title><Paragraph>A news 
article from the <i>Guardian</i> announcing a gloomy export outlook for UK manufacturers (see the link below), got me wondering about what sorts of thing different countries actually export.</Paragraph><Paragraph>For example, it might surprise you that India was the world’s largest exporter by value of unset diamonds in 2014 (24 billion US dollars worth), or that Germany was the biggest importer of chocolate (over $2.5 billion worth) in that same year.</Paragraph><Paragraph>National governments all tend to publish their own trade figures, but the UN also collect data from across the world. In particular, the UN’s global trade database, Comtrade, contains data about import and export trade flows between countries for a wide range of goods and services.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1061.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1061.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="8089a7e8" x_imagesrc="ou_futurelearn_learn_to_code_fig_1061.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b></Caption><Alternative>An image of a China Shipping Lane cargo ship passing under the San Francisco bridge</Alternative><Description>An image of a China Shipping Lane cargo ship passing under the San Francisco bridge</Description></Figure><Paragraph>So if you’ve ever wondered where your country imports most of its T-shirts from, or exports most of its municipal waste to, Comtrade is likely to have the data.</Paragraph><Paragraph>In the next section, you will find out about the Comtrade data.</Paragraph></Section><Section><Title>1.3 Exploring the world of export data</Title><Paragraph>The Comtrade Data Extraction interface provides a user interface for selecting, previewing and exporting data from the 
Comtrade database.</Paragraph><Activity><Heading>Activity 2 Exploring export data</Heading><Question><Paragraph>Open the <a href="http://comtrade.un.org/data/">Comtrade Data Extraction interface</a> and keep it open alongside this page. You’ll explore the options and preview some data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1020.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1020.jpg" width="100%" webthumbnail="true" x_folderhash="cbfeded3" x_contenthash="1d733822" x_imagesrc="ou_futurelearn_learn_to_code_fig_1020.jpg" x_imagewidth="780" x_imageheight="433" x_smallsrc="ou_futurelearn_learn_to_code_fig_1020.small.jpg" x_smallfullsrc="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1020.small.jpg" x_smallwidth="512" x_smallheight="284"/><Caption><b>Figure 4:</b>Comtrade Data Extraction interface</Caption><Alternative>Comtrade Data Extraction interface</Alternative></Figure><Paragraph>In the text area marked <b>HS (as reported) commodity codes</b> , start to enter the name of various goods and services. You should see suggestions regarding different goods and services that Comtrade records trade flow data for.</Paragraph><Paragraph>If you don’t select too much data, you should be able to get a preview of the data by clicking the green ‘Preview’ button. 
Notice that the interface allows you to sort the data by a particular column, which provides a quick way of finding the countries that export the most, or the least, goods by value.</Paragraph><Paragraph>If you selected ‘All’ reporters, you will probably notice that a decreasing sort on the ‘Trade Value’ column always has ‘World’ at the top: in the ‘All’ reports dataset, individual country reports and reports from ‘areas not elsewhere specified’ (‘nes’) are complemented by the ‘World’ report, which represents the sum total of those other values.</Paragraph><Paragraph>The user interface is rather complicated at first glance, but with a bit of trial and error you should be able to work out:</Paragraph><BulletedList><ListItem>how to display trade flows between a particular country (the ‘Reporter’) and a particular country or region of the world (the ‘Partners’)</ListItem><ListItem>how to limit the display to show just imports, or exports, between ‘Reporter(s)’ and ‘Partner(s)’</ListItem><ListItem>how to display data for different years</ListItem><ListItem>how to display data for different months in a particular year, or all the months in a particular year.</ListItem></BulletedList><Paragraph>You might notice that the commodity codes are organised hierarchically, i.e. a code breaks down into further sub-codes. For example:</Paragraph><BulletedList><ListItem>3825 – Residual products of the chemical or allied industries <BulletedSubsidiaryList> <SubListItem>382510 – Municipal waste</SubListItem> <SubListItem>382520 – Sewage sludge</SubListItem> <SubListItem>382530 – Clinical waste</SubListItem> <SubListItem>…</SubListItem> </BulletedSubsidiaryList></ListItem></BulletedList><Paragraph>Adding up the results from the next level down on a particular code should generate trade value totals that correspond to the higher level totals, rounding errors aside. 
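</Paragraph><Paragraph>One way to check this for yourself is sketched below, assuming a made-up table of six-digit sub-codes (the trade values are hypothetical, not real Comtrade figures, and the ‘Parent Code’ column name is invented for this example). Taking the first four characters of each code gives the parent code, and the values can then be totalled for each parent:</Paragraph><ComputerDisplay><Paragraph>
```python
from pandas import DataFrame

# Hypothetical trade values, for illustration only
df = DataFrame(columns=['Commodity Code', 'Trade Value (US$)'],
               data=[['382510', 100], ['382520', 250], ['382530', 50]])

# The first four characters of a sub-code identify its parent code
df['Parent Code'] = df['Commodity Code'].str[:4]

# Total the sub-code values for each parent code
totals = df.groupby('Parent Code')['Trade Value (US$)'].sum()
```
</Paragraph></ComputerDisplay><Paragraph>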
This means that if you want to focus on the subcategories of a particular commodity type, you may well be able to do so.</Paragraph><Paragraph>For a particular category of goods, and a reporting period of a single month or year, select your country as the reporter and ‘All’ as the partner.</Paragraph><Paragraph>Does the range of goods and services listed within the database surprise you?</Paragraph></Question></Activity><Paragraph>Keep the Comtrade webpage open as you’ll use it again in the next section.</Paragraph></Section><Section><Title>1.4 Getting data from the Comtrade API</Title><Paragraph>Hopefully, you have a few ideas about data you’d like to explore from the Comtrade database.</Paragraph><Paragraph>In the previous section, I managed to identify a set of data that describes the amount of unset diamonds (commodity code 7102) imported into the UK from the Russian Federation, Angola and South Africa in 2013 and 2014.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1021.jpg" src_uri="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1021.jpg" width="100%" webthumbnail="true" x_folderhash="cbfeded3" x_contenthash="08a57f12" x_imagesrc="ou_futurelearn_learn_to_code_fig_1021.jpg" x_imagewidth="780" x_imageheight="399" x_smallsrc="ou_futurelearn_learn_to_code_fig_1021.small.jpg" x_smallfullsrc="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1021.small.jpg" x_smallwidth="512" x_smallheight="262"/><Caption><b>Figure 5</b>Comtrade Data Extraction interface</Caption><Alternative>Comtrade Data Extraction interface</Alternative></Figure><Paragraph>You can export the data you have selected as a CSV file that will be downloaded to your own 
computer by clicking on the <i>Download CSV</i> button. You may find it useful to change the filename of the downloaded file to something more meaningful than the comtrade.csv default name.</Paragraph><Paragraph>If you moved the downloaded CSV file into the same folder as your Exercise notebook 4 (that you’ll download later), you could use the following command to load the data into a pandas dataframe:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>filename='comtrade.csv'</Paragraph><Paragraph>
df=read_csv(filename, dtype={'Commodity Code':str, 'Reporter Code':str})
</Paragraph></ComputerDisplay><Paragraph>The ‘Commodity Code’ and ‘Reporter Code’ values are explicitly read in as a string <ComputerCode>
<b>(str)</b>
</ComputerCode>; otherwise, codes like 0401 would be read in as the number 401.</Paragraph><Paragraph>One of the problems of working with real data like this is that it may not be just the data you want. The data returned from Comtrade includes several columns that are essentially surplus to requirements for the reports you will produce. I suggest that you clean the dataframes so that they contain at most the following key columns: ‘Year’, ‘Period’, ‘Trade Flow’, ‘Reporter’, ‘Partner’, ‘Commodity’, ‘Commodity Code’, ‘Trade Value (US$)’.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
COLUMNS = ['Year', 'Period','Trade Flow','Reporter','Partner', 'Commodity','Commodity Code','Trade Value (US$)']
</Paragraph><Paragraph>df=df[COLUMNS]</Paragraph></ComputerDisplay><Paragraph>To avoid conflating data relating to all countries (the ‘World’ partner), and each separate country, create separate dataframes for each, using the comparison operators introduced in Week 3.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>world = df[df['Partner'] == 'World']</Paragraph><Paragraph>countries = df[df['Partner'] != 'World']</Paragraph></ComputerDisplay><InternalSection><Heading>A more direct way of getting the data</Heading><Paragraph>Just as there was a method for downloading data directly from the World Bank, there is also a more direct way of getting the Comtrade data into a dataframe – directly from the Comtrade website. You might have noticed that when you downloaded the file from the Comtrade website, a link appeared on the site labelled ‘View API Call’.</Paragraph><Paragraph>An API is an ‘application programming interface’ that provides a means for one computer to talk to another ‘in machine terms’. When you extracted data from the World Bank, you were calling the World Bank API using a set of functions provided by the pandas library. Behind the scenes, these functions create URLs (that is, web addresses) that call the World Bank API and allow requests to be made directly to it, putting the response into a pandas dataframe.</Paragraph><Paragraph>In the case of Comtrade, clicking the <i>View API Call</i> link reveals a URL that requests the data you selected in the search form as a data file, though not, by default, as a CSV data file.</Paragraph><Paragraph>This link can be used to download data directly into a pandas dataframe from Comtrade, although you will need to make a couple of modifications to the URL first. In particular, change the max value to 5000 (to increase the amount of data returned by each request) and add <ComputerCode>
<b>&amp;fmt=csv</b>
</ComputerCode> to the end of the URL to ensure that the data is returned in CSV format.</Paragraph><Paragraph>For example, if you copied the URL:</Paragraph><Paragraph>http://comtrade.un.org/api/get?max=500&amp;type=C&amp;freq=M&amp;px=HS&amp;ps=2015&amp;r=826&amp;p=all&amp;rg=1%2C2&amp;cc=0401%2C0402</Paragraph><Paragraph>you would need to modify it as follows:</Paragraph><Paragraph>http://comtrade.un.org/api/get?max=<b>5000</b>&amp;type=C&amp;freq=M&amp;px=HS&amp;ps=2015&amp;r=826&amp;p=all&amp;rg=1%2C2&amp;cc=0401%2C0402<b>&amp;fmt=csv</b></Paragraph><Paragraph>You can then load the data in using the pandas <ComputerCode>
<b>read_csv()</b>
</ComputerCode> function.</Paragraph><Paragraph><i> Note that if you are using the CoCalc free plan, you will not be able to download data directly from the Comtrade API into a pandas dataframe. </i></Paragraph><Paragraph>Set the datatypes as shown using the <ComputerCode>
<b>dtype</b>
</ComputerCode> argument to ensure that the codes are read in correctly.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
URL='http://comtrade.un.org/api/get?max=5000&amp;type=C&amp;freq=A&amp;px=HS&amp;ps=2014%2C2013%2C2012&amp;r=826&amp;p=all&amp;rg=all&amp;cc=0401%2C0402&amp;fmt=csv'
</Paragraph><Paragraph>
df=read_csv(URL, dtype={'Commodity Code':str, 'Reporter Code':str})
</Paragraph></ComputerDisplay><Paragraph>Having downloaded the data, you should then separate out the World data as before.</Paragraph><Paragraph>If you want to save a copy of data downloaded into pandas directly from the Comtrade API, call the <ComputerCode>
<b>to_csv()</b>
</ComputerCode> method on your dataframe, passing in the filename you want to save the file under, and setting <ComputerCode>
<b>index=False</b>
</ComputerCode> so that the dataframe’s automatically introduced index column is not included. For example:</Paragraph><Paragraph><ComputerCode>
countries.to_csv('saved_country_data_example.csv', index=False)
</ComputerCode></Paragraph><Paragraph>The file will be saved in the same folder as the notebook.</Paragraph></InternalSection></Section><Section><Title>1.5 Practice getting data</Title><Activity><Heading>Exercise 1 Getting data from API</Heading><Question><Paragraph>In Exercise 1, identify a dataset from the <a href="http://comtrade.un.org/data/">Comtrade Data Extraction interface</a>, selecting one or more commodity codes and a single reporter that are of interest to you, and import the data into pandas.</Paragraph><Paragraph>Open Exercise notebook 4 and save it in the disk folder or the CoCalc project you created in Week 1. You should also open the comtrade_pivot.html and comtrade_milk_uk_monthly_14.csv files and save them into the same folder or project.</Paragraph><Paragraph>Remember to run the existing code in the notebook before you start the exercise. When you’ve completed the exercise, save the notebook. If you need a quick reminder of how to use Jupyter, watch again the video in <a href="https://www.open.edu/openlearn/mod/oucontent/olink.php?id=83251&amp;targetdoc=Week+1%3A+Having+a+go+at+it+Part+1&amp;targetptr=1.4">Week 1 Exercise 1</a>.</Paragraph><Paragraph>For the commodities and reporter you chose, find out which countries are the biggest partners in recent years in terms of import and export trade flows.</Paragraph></Question></Activity></Section></Session><Session><Title>2 This week’s quiz</Title><Paragraph>Check what you’ve learned this week by taking the end-of-week quiz.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78783">Week 7 practice quiz</a></Paragraph><Paragraph>Open the quiz in a new window or tab then come back here when you’ve finished.</Paragraph></Session><Session><Title>3 Summary</Title><Paragraph>This week you’ve learned how to take a dataset that contains multiple possible groupings, or subsets of data, and work with those groups to perform a variety of transformations. 
You’ve explored:</Paragraph><BulletedList><ListItem>ways of grouping data</ListItem><ListItem>Comtrade data</ListItem><ListItem>the world of export data</ListItem><ListItem>how to get data.</ListItem></BulletedList><Paragraph>Next week looks at how to split the data contained in a dataframe into multiple groups based on the unique ‘key’ values in a single column, or unique combinations of values that appear across two or more columns.</Paragraph></Session></Unit><Unit><UnitID/><UnitTitle>Week 8: Further techniques Part 2</UnitTitle><Session><Title>1 The split-apply-combine pattern</Title><Paragraph>In the exercise in Week 7, you downloaded data from Comtrade that could be described as ‘heterogeneous’ or mixed in some way. For example, the same dataset contained information relating to both imports and exports.</Paragraph><Paragraph>Finding the partner countries with the largest trade value in terms of exports means filtering the dataset to obtain just the rows containing export data and then ranking those. Finding the largest import partner requires a sort on just the import data.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1062.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="72290ebe" x_imagesrc="ou_futurelearn_learn_to_code_fig_1062.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 1</b></Caption><Alternative>A close-up image of someone putting glue onto a plank of wood</Alternative><Description>A close-up image of someone putting glue onto a plank of wood</Description></Figure><Paragraph>But what if you wanted to find out even more refined information? 
For example:</Paragraph><BulletedList><ListItem>the total value of exports of product X from the UK to those countries on a year by year basis (group the information by year and then find the total for each year)</ListItem><ListItem>the total value of exports of product X from the UK to each of the partner countries by year (group the information by country and year and then find the total for each country/year pairing)</ListItem><ListItem>the average value of exports across all the countries on a month by month basis (group by month, then find the average value per month)</ListItem><ListItem>the average value of exports across each country on a month by month basis (group by month and country, then find the average value over each country/month pairing)</ListItem><ListItem>the difference month on month between the value of imports from, or exports to, each particular country over the five year period (group by country, order by month and year, then find the difference between consecutive months).</ListItem></BulletedList><Paragraph>In each case, the original dataset needs to be separated into several subsets, or groups of data rows, and then some operation performed on those rows. To generate a single, final report would then require combining the results of those operations in a new or extended dataframe.</Paragraph><Paragraph>This sequence of operations is common enough for it to have been described as the ‘split-apply-combine’ pattern. 
The sequence is to:</Paragraph><BulletedList><ListItem>‘split’ an appropriately shaped dataset into several components</ListItem><ListItem>‘apply’ an operator to the rows contained within a component</ListItem><ListItem>‘combine’ the results of applying the operator to each component to return a single combined result.</ListItem></BulletedList><Paragraph>You will see how to make use of this pattern using pandas next.</Paragraph><Section><Title>1.1 Splitting a dataset by grouping</Title><Paragraph>‘Grouping’ refers to the process of splitting a dataset into sets of rows, or ‘groups’, on the basis of one or more criteria associated with each data row.</Paragraph><Paragraph>Grouping is often used to split a dataset into one or more distinct groups. Each row in the dataset being grouped is assigned to one, and only one, of the derived groups. The rows associated with a particular group may be accessed by reference to the group, or the same processing or reporting operation may be applied to the rows contained in each group on a group by group basis.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1016.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="aa765348" x_imagesrc="ou_futurelearn_learn_to_code_fig_1016.jpg" x_imagewidth="512" x_imageheight="384"/><Caption><b>Figure 2</b></Caption><Alternative>A table contains commodity and amount columns, there are amounts grouped by commodity A, B and C.</Alternative><Description>The first table on the left contains commodity and amount columns, there are amounts grouped by commodity A, B and C. Data is: A 10; A 15; A 5; A 20; B 10; B 10; B 5; C 20; C 30. There are arrows to three further tables which split out the commodities into separate tables for each commodity. Data is from a dataset. 
</Description></Figure><Paragraph>The rows do not have to be ‘grouped’ together in the original dataset – they could appear in any order in the original dataset (for example, a commodity A row, followed by two commodity B rows, then another commodity A row, and so on). However, the order in which each row appears in the original dataset will typically be reflected by the order in which the rows appear in each subgroup.</Paragraph><Paragraph>Let’s see how to do that in pandas. Create a simple dataframe that looks like the full table in the image above:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>data=[['A',10],['A',15],['A',5],['A',20],</Paragraph><Paragraph>              ['B',10],['B',10],['B',5],</Paragraph><Paragraph>              ['C',20],['C',30]] </Paragraph><Paragraph>df=DataFrame(data=data, columns=["Commodity","Amount"])</Paragraph><Paragraph>df</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Commodity</th><th>Amount</th></tr><tr><td><b>0</b></td><td>A</td><td>10</td></tr><tr><td><b>1</b></td><td>A</td><td>15</td></tr><tr><td><b>2</b></td><td>A</td><td>5</td></tr><tr><td><b>3</b></td><td>A</td><td>20</td></tr><tr><td><b>4</b></td><td>B</td><td>10</td></tr><tr><td><b>5</b></td><td>B</td><td>10</td></tr><tr><td><b>6</b></td><td>B</td><td>5</td></tr><tr><td><b>7</b></td><td>C</td><td>20</td></tr><tr><td><b>8</b></td><td>C</td><td>30</td></tr></tbody></Table><Paragraph>Next, use the <ComputerCode>
<b>groupby()</b>
</ComputerCode> method to group the dataframe into separate groups of rows based on the values contained within one or more specified ‘key’ columns. For example, group the rows according to what sort of commodity each row corresponds to as specified by the value taken in the ‘Commodity’ column.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>grouped = df.groupby('Commodity')</ComputerCode></Paragraph><Paragraph>The number and ‘names’ of the groups that are identified correspond to the unique values that can be found within the column or columns (which will be referred to as the ‘key columns’) used to identify the groups.</Paragraph><Paragraph>You can see what groups are available with the following method call:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>grouped.groups.keys()</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>dict_keys(['A', 'B', 'C'])</ComputerCode></Paragraph><Paragraph>The <ComputerCode>
<b>get_group()</b>
</ComputerCode> method can be used to grab just the rows associated with a particular group.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>grouped.get_group('B')</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out []:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Commodity</th><th>Amount</th></tr><tr><td><b>4</b></td><td>B</td><td>10</td></tr><tr><td><b>5</b></td><td>B</td><td>10</td></tr><tr><td><b>6</b></td><td>B</td><td>5</td></tr></tbody></Table><Paragraph>Datasets can also be grouped against multiple columns. For example, if there was an extra ‘Year’ column in the above table, you could group against just the commodity, exactly as above, to provide access to rows by commodity; just the year, setting <ComputerCode>
<b>grouped = df.groupby( 'Year' );</b>
</ComputerCode> or by both commodity and year, passing in the two grouping key columns as a list:</Paragraph><Paragraph><ComputerCode>grouped = df.groupby( ['Commodity','Year'])</ComputerCode></Paragraph><Paragraph>The list of keys associated with the groups might then look like [(‘A’, 2015), (‘A’, 2014), (‘B’, 2015), (‘B’, 2014)]. The rows associated with the group corresponding to commodity A in 2014 could then be retrieved using the command:</Paragraph><Paragraph><ComputerCode>grouped.get_group( ('A',2014) )</ComputerCode></Paragraph><Paragraph>This may seem to you like a roundabout way of filtering the dataframe as you did in Week 2; but you’ll see that the ability to automatically group rows sets up the possibility of then processing those rows as separate ‘mini-dataframes’ and then combining the results back together.</Paragraph><Activity><Heading>Exercise 2 Grouping data</Heading><Question><Paragraph>Work through Exercise 2 in the Week 4 notebook.</Paragraph><Paragraph>As you complete the tasks, think about these questions:</Paragraph><BulletedList><ListItem>For your particular dataset, how did you group the data and what questions did you ask of it? 
Which countries were the major partners of your reporter country for the different groupings?</ListItem><ListItem>With the ability to group data so easily, what other sorts of questions would you like to be able to ask?</ListItem></BulletedList></Question></Activity></Section><Section><Title>1.2 Looking at apply and combine operations</Title><Paragraph>Having split a dataset by grouping, an operation is ‘applied’ to each group.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1063.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="c4ec9cb6" x_imagesrc="ou_futurelearn_learn_to_code_fig_1063.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 3</b></Caption><Alternative>A close-up image of interlaced cubes of wood, making a pattern</Alternative><Description>A close-up image of interlaced cubes of wood, making a pattern</Description></Figure><Paragraph>The operation often takes one of two forms:</Paragraph><BulletedList><ListItem>a ‘summary’ operation, in which a summary statistic based on the rows contained within each group is generated. A single value is returned for each group, for example, the group median or mean, the number of rows in the group, or the maximum or minimum value in the group. The final result will have <i>M</i> rows, one for each of the <i>M</i> groups created by the split (that is, <ComputerCode>
<b>.groupby()</b>
</ComputerCode> ) operation.</ListItem><ListItem>a ‘filtering’ or ‘filtration’ operation, in which groups of rows are retained or discarded based on a particular property of the group as a whole. For example, only groups of rows where the sum of all the values in the group is above some threshold are retained. The effect is that each group keeps the same number of rows, but the resulting dataset (after combination, see below) may contain fewer groups than the original.</ListItem></BulletedList><Paragraph>The results of applying the summary or filtration operation are then combined to provide a single output dataframe.</Paragraph><Paragraph>In the next section, you will see how to apply a variety of summary operations, and in a later step examples of filtration operations.</Paragraph></Section><Section><Title>1.3 Summary operations</Title><Paragraph>Summary, or aggregation, operations are used to produce a single summary value or statistic, such as the group average, for each separate group.</Paragraph><Paragraph>Find the ‘total’ amount within each group using a <b>summary</b> operation:</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1014.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="98bad1ce" x_imagesrc="ou_futurelearn_learn_to_code_fig_1014.jpg" x_imagewidth="512" x_imageheight="419"/><Caption><b>Figure 4</b></Caption><Alternative>A summary operator of the 'total' amount for Commodity A, B and C</Alternative><Description>Applying a summary operator to the rows contained within each group for each group separately: Three tables one for each Commodity A, B and C with their amounts. 
The amounts for each commodity are then totalled.</Description></Figure><Paragraph>To apply a summary operator to each group, such as a function to find the mean value of each group, and then automatically combine the results into a single output dataframe, pass the name of the function to the <ComputerCode>
<b>aggregate()</b>
</ComputerCode> method. Note that pandas will try to use this operator to summarise each column in the grouped rows separately if there is more than one column that can be summarised. For example, if there were also a ‘Volume’ column and the operator calculated totals, the result would include total volumes as well.</Paragraph><Paragraph>Let’s use the example dataframe defined earlier again:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>df</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Commodity</th><th>Amount</th></tr><tr><td>0</td><td>A</td><td>10</td></tr><tr><td>1</td><td>A</td><td>15</td></tr><tr><td>2</td><td>A</td><td>5</td></tr><tr><td>3</td><td>A</td><td>20</td></tr><tr><td>4</td><td>B</td><td>10</td></tr><tr><td>5</td><td>B</td><td>10</td></tr><tr><td>6</td><td>B</td><td>5</td></tr><tr><td>7</td><td>C</td><td>20</td></tr><tr><td>8</td><td>C</td><td>30</td></tr></tbody></Table><Paragraph>Group the data by commodity type and then apply the <ComputerCode>
<b>sum</b>
</ComputerCode> operation and combine the results in an output dataframe. The grouping elements are used to create index values in the output dataframe.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>grouped=df.groupby('Commodity')</Paragraph><Paragraph>grouped.aggregate(sum)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Amount</th></tr><tr><th>Commodity</th><td/></tr><tr><td><b>A</b></td><td>50</td></tr><tr><td><b>B</b></td><td>25</td></tr><tr><td><b>C</b></td><td>50</td></tr></tbody></Table><Paragraph>In this case, the <ComputerCode>
<b>aggregate()</b>
</ComputerCode> method applies the sum summary operation to each group and then automatically combines the results. For a <b>summary</b> operation such as this, the resulting combined dataframe contains as many rows as there were groups created by the splitting <ComputerCode>
<b>.groupby()</b>
</ComputerCode> operation.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1015.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="9a8770f0" x_imagesrc="ou_futurelearn_learn_to_code_fig_1015.jpg" x_imagewidth="512" x_imageheight="363"/><Caption><b>Figure 5</b></Caption><Alternative>A single dataframe of the previous summary operator.</Alternative><Description>The individual summaries, the results of the previous image, for each group Commodities A, B and C are combined into a single dataframe </Description></Figure><Paragraph>The slightly more general <ComputerCode>
<b>apply()</b>
</ComputerCode> method can also be substituted for the <ComputerCode>
<b>aggregate()</b>
</ComputerCode> method and will similarly take the rows associated with each group, apply a function to them, and return a combined result.</Paragraph><Paragraph>The <ComputerCode>
<b>apply()</b>
</ComputerCode> method can be really handy if you have defined a function of your own that you want to apply to just the rows associated with each group. Simply pass the name of the function to the <ComputerCode>
<b>apply()</b>
</ComputerCode> method and it will then call your function, once per group, on the sets of rows associated with each group.</Paragraph><Paragraph>For example, find the top two items by ‘Amount’ in each group:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>def top2byAmount(g): </Paragraph><Paragraph>        return g.sort_values('Amount', ascending=False).head(2)</Paragraph><Paragraph>grouped.apply(top2byAmount)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th/><th>Amount</th></tr><tr><td><b>Commodity</b></td><td/><td/></tr><tr><td><b>A</b></td><td>3</td><td>20</td></tr><tr><td/><td>1</td><td>15</td></tr><tr><td><b>B</b></td><td>4</td><td>10</td></tr><tr><td/><td>5</td><td>10</td></tr><tr><td><b>C</b></td><td>8</td><td>30</td></tr><tr><td/><td>7</td><td>20</td></tr></tbody></Table><Paragraph>The second index column containing the numbers 3, 1, 4 etc., contains the original index value of each row.</Paragraph><Paragraph>In Week 3 the <ComputerCode>
<b>apply()</b>
</ComputerCode> method was called on a column, to apply the given function to each cell. Here it was called on a grouped dataframe, to apply the given function to each group.</Paragraph><Activity><Heading>Exercise 3 Experimenting with split-apply-combine</Heading><Question><Paragraph>Work through Exercise 3 in your Exercise notebook 4 to practise the summary operations.</Paragraph><Paragraph>As you complete the tasks, think about these questions:</Paragraph><BulletedList><ListItem>For your dataset, which months saw the highest and lowest levels of trade activity? Did there appear to be any seasonal behaviour?</ListItem><ListItem>When graphically comparing total trade flows from the leading partner countries to the World total, did it look as if any partners particularly dominated that area of trade? If you have time, find news reports discussing why this should be the case.</ListItem></BulletedList></Question></Activity></Section><Section><Title>1.4 Filtering groups</Title><Paragraph>Being able to group rows according to some criterion and then apply various operations to those groups is a very powerful technique.</Paragraph><Paragraph>However, there may be occasions when you only want to work with a subset of the groups that can be extracted from a single dataset based on a particular group property. 
For example, you might want to select:</Paragraph><BulletedList><ListItem>groups that contain at least a minimum number of rows, such as countries that engage in trade around a particular commodity with a minimum number of partner countries</ListItem><ListItem>groups for which a summary statistic meets certain conditions (for example, the total value of exports for a particular commodity exceeds a particular threshold value, or the minimum or maximum value is below a certain value)</ListItem><ListItem>the top five or bottom three groups in a ranking based on a particular summary statistic, such as the total trade value.</ListItem></BulletedList><Paragraph>In the following example, groups are selected based on group size: a filtering operation limits the original dataset to just those groups containing at least three rows, and the rows from the selected groups are then combined to produce the output dataset:</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1017.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="bc13962b" x_imagesrc="ou_futurelearn_learn_to_code_fig_1017.jpg" x_imagewidth="512" x_imageheight="302"/><Caption><b>Figure 6</b></Caption><Alternative>Dataframe for each commodity A, B and C and their amounts.</Alternative><Description>Dataframe for each commodity A, B and C and their amounts. Filter operation applied to select rows associated with groups that pass a filtration test applied to each group and combined into one dataframe. </Description></Figure><Paragraph>In pandas, groups can be filtered based on their group properties using the <ComputerCode>
<b>filter()</b>
</ComputerCode> method. Using the example dataframe again:</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><Paragraph><ComputerCode>df</ComputerCode></Paragraph><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Commodity</th><th>Amount</th></tr><tr><td><b>0</b></td><td>A</td><td>10</td></tr><tr><td><b>1</b></td><td>A</td><td>15</td></tr><tr><td><b>2</b></td><td>A</td><td>5</td></tr><tr><td><b>3</b></td><td>A</td><td>20</td></tr><tr><td><b>4</b></td><td>B</td><td>10</td></tr><tr><td><b>5</b></td><td>B</td><td>10</td></tr><tr><td><b>6</b></td><td>B</td><td>5</td></tr><tr><td><b>7</b></td><td>C</td><td>20</td></tr><tr><td><b>8</b></td><td>C</td><td>30</td></tr></tbody></Table><Paragraph>For example, the dataframe can be filtered to return just the rows from groups that contain no more than a given number of rows.</Paragraph><Paragraph>As a reference point, count how many rows are associated with each group.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>grouped = df.groupby('Commodity')</Paragraph><Paragraph>grouped.aggregate(len)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Amount</th></tr><tr><td><b>Commodity</b></td><td/></tr><tr><td><b>A</b></td><td>4</td></tr><tr><td><b>B</b></td><td>3</td></tr><tr><td><b>C</b></td><td>2</td></tr></tbody></Table><Paragraph>The <ComputerCode>
<b>filter()</b>
</ComputerCode> method uses a function that returns a boolean ( <ComputerCode>
<b>True</b>
</ComputerCode> or <ComputerCode>
<b>False</b>
</ComputerCode> ) value to decide whether or not to filter through the rows associated with a particular group.</Paragraph><Paragraph>As with the <ComputerCode>
<b>apply()</b>
</ComputerCode> method, provide the <ComputerCode>
<b>filter()</b>
</ComputerCode> method with just a function name in order to pass each group to that function. For example, define a function that says whether or not a group contains three or fewer rows and use that as a basis for filtering the original dataset.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>def groupsOfAtMostThreeRows(g): </Paragraph><Paragraph>        return len(g) &lt;= 3 </Paragraph><Paragraph>grouped.filter(groupsOfAtMostThreeRows)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Commodity</th><th>Amount</th></tr><tr><td><b>4</b></td><td>B</td><td>10</td></tr><tr><td><b>5</b></td><td>B</td><td>10</td></tr><tr><td><b>6</b></td><td>B</td><td>5</td></tr><tr><td><b>7</b></td><td>C</td><td>20</td></tr><tr><td><b>8</b></td><td>C</td><td>30</td></tr></tbody></Table><Paragraph>Alternatively, all the rows in a group can be filtered on an aggregate property of the group such as the sum total, or maximum, minimum or mean value, from one of the columns.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>#Consider the following total amounts by group</Paragraph><Paragraph>grouped.aggregate(sum)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Amount</th></tr><tr><td><b>Commodity</b></td><td/></tr><tr><td><b>A</b></td><td>50</td></tr><tr><td><b>B</b></td><td>25</td></tr><tr><td><b>C</b></td><td>50</td></tr></tbody></Table><Paragraph><ComputerCode>
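</ComputerCode></Paragraph><Paragraph>The total is only one of the aggregate properties mentioned above. As a minimal sketch, several of them can be computed for each group in one call by passing a list of operation names to the <ComputerCode>agg()</ComputerCode> method, which is a shorter name for <ComputerCode>aggregate()</ComputerCode>. The sketch assumes pandas is imported under the alias <ComputerCode>pd</ComputerCode>, which differs from the import style used in the course notebooks, and the variable name <ComputerCode>summary</ComputerCode> is chosen here purely for illustration.</Paragraph><ComputerDisplay><Paragraph>
```python
import pandas as pd

# Rebuild the two-column example dataframe used in this section.
df = pd.DataFrame({
    "Commodity": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "Amount": [10, 15, 5, 20, 10, 10, 5, 20, 30],
})

# Split by commodity, then compute several aggregate properties
# of the 'Amount' column for each group in a single call.
summary = df.groupby("Commodity")["Amount"].agg(["sum", "mean", "max", "min"])
print(summary)
```
</Paragraph></ComputerDisplay><Paragraph>Each row of the result describes one group, so any of these columns could serve as the basis for deciding which groups to keep.</Paragraph><Paragraph><ComputerCode>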
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>#Keep only the groups whose total amount is greater than 25 (this excludes group B)</Paragraph><Paragraph>def groupsWithTotalAbove25(g): </Paragraph><Paragraph>        return g['Amount'].sum() &gt; 25 </Paragraph><Paragraph>grouped.filter(groupsWithTotalAbove25)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th/><th>Commodity</th><th>Amount</th></tr><tr><td><b>0</b></td><td>A</td><td>10</td></tr><tr><td><b>1</b></td><td>A</td><td>15</td></tr><tr><td><b>2</b></td><td>A</td><td>5</td></tr><tr><td><b>3</b></td><td>A</td><td>20</td></tr><tr><td><b>7</b></td><td>C</td><td>20</td></tr><tr><td><b>8</b></td><td>C</td><td>30</td></tr></tbody></Table><Paragraph>The ability to filter datasets based on group properties means that large datasets can more easily be limited to just those rows associated with groups of rows that are deemed to be relevant in some way.</Paragraph><Activity><Heading>Exercise 4 Filtering groups</Heading><Question><Paragraph>Use the Exercise notebook 4 to practise filtering in Exercise 4.</Paragraph><Paragraph>As you complete the tasks, think about other questions you could ask of your data using the filter command.</Paragraph></Question></Activity></Section></Session><Session><Title>2 Pivot tables</Title><Paragraph>One of the most useful, if poorly understood, features offered by many spreadsheet applications is the ‘pivot table’.</Paragraph><Paragraph>Pivot tables provide a way of creating summary reports over particular parts of a dataset, reshaping the data into grouped rows, itemised columns, and summary values within each group and item.</Paragraph><Paragraph>The screenshot of the interactive pivot table shown below, based on a widget originally created by Nicolas Kruchten at Datacratic, contains a small fragment of the Comtrade data describing milk imports to the UK.</Paragraph><Paragraph>The pivot table is organised as follows:</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1022.jpg" width="100%" webthumbnail="true" x_folderhash="cbfeded3" x_contenthash="6c0e19e8" x_imagesrc="ou_futurelearn_learn_to_code_fig_1022.jpg" x_imagewidth="780" x_imageheight="329" 
x_smallsrc="ou_futurelearn_learn_to_code_fig_1022.small.jpg" x_smallfullsrc="https://openuniv.sharepoint.com/sites/informal-lrning/learn-to-code-for-data-analysis/learntocodefordataanalysisopenlearnstudyunit/ou_futurelearn_learn_to_code_fig_1022.small.jpg" x_smallwidth="512" x_smallheight="216"/><Caption><b>Figure 7</b></Caption><Alternative>Screenshot of the interactive pivot table</Alternative><Description>Screenshot of the interactive pivot table</Description></Figure><Paragraph>You can see how the ‘Trade Flow’ and ‘Reporter’ columns are used to group the data, with each row representing a separate group. In addition, the values in the ‘Year’ column are broken out to create separate columns (although in this example there is only data for one year, and hence one ‘Year’ column, 2014). The function that is applied to the grouped data is a <ComputerCode>
<b>sum</b>
</ComputerCode> operation, and it is applied to the selected ‘Trade Value (US$)’ column in the original dataset. A marginal total value is calculated by summing across all the columns. The ‘Commodity’ and ‘Trade Value (US$)’ columns, while part of the original dataset, are not directly used to define the pivot table’s structure; that is, they are not used to set the row or column index header labels in the displayed pivot table.</Paragraph><Paragraph>In terms of the split-apply-combine pattern, the pivot table operates as follows:</Paragraph><BulletedList><ListItem>the column names from the original dataframe that are listed in the rows panel on the left-hand side of the interactive pivot table split the data into a set of groups, with each row specifying a group</ListItem><ListItem>the pivot table’s columns are set according to the unique values associated with the specified columns from the original dataframe; these break the data down into yet smaller groups that are associated with each cell.</ListItem></BulletedList><Paragraph>The selected operator is then applied to each cell-level group, the results are combined and an appropriately structured output table is displayed.</Paragraph><Paragraph>To create a pivot table report for a dataset, typically three actions will be needed:</Paragraph><BulletedList><ListItem>identify what elements will appear as the row index values – that is, how the rows will be grouped. Typically, groups will be created based on the unique values within a single column or a combination of values, one from each of multiple grouping columns.</ListItem><ListItem>identify what elements will appear as column headings. Again, the column heading may just be the unique values of a single variable, or combined values across multiple grouping columns.</ListItem><ListItem>identify what numbers will be reported on. 
This step may often break down into two smaller steps: <BulletedSubsidiaryList> <SubListItem>to count the number of rows associated with a particular combination of row and column index values, select the count operation</SubListItem> <SubListItem>to perform an operation on the value of cells in another column, select that column and then identify what operation to apply to it. For example, find the sum or mean values of a numerical quantity associated with rows keyed by the row and column index values, or count the number of unique values of a particular variable in rows identified by those key values.</SubListItem> </BulletedSubsidiaryList></ListItem></BulletedList><Paragraph>In addition, one or more ‘filters’ can be added to the selection of row and column index values, either limiting which unique values in each key column to report on, or, by default, selecting them all.</Paragraph><Paragraph>It is often easier to understand how a pivot table is organised by using it interactively. You’ll get a chance to do this in the next exercise.</Paragraph><Activity><Heading>Exercise 5 Interactive pivot table</Heading><Question><Paragraph>If you haven’t already, open the comtrade_pivot.html and save it into the same folder as the Exercise notebook 4. Then either re-run all the notebook cells, or just run the cell that contains the interactive pivot table.</Paragraph><Paragraph>Configuring a pivot table requires paying careful attention to the selection of row (grouping) values, columns (reported values) and summary (aggregating) function.</Paragraph><Paragraph>How easy did you find it to use the interactive pivot table? Could you work out how to select the row and column labels in order to ask particular questions of the data? 
What sorts of questions did you try to ask?</Paragraph></Question></Activity><Section><Title>2.1 Pivot tables in pandas</Title><Paragraph>The interactive pivot table provides a convenient way of exploring a relatively small dataset directly within a web browser. (A Python package is also available that allows interactive pivot tables to be created directly from a pandas dataframe.)</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1064.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="864b4f6b" x_imagesrc="ou_futurelearn_learn_to_code_fig_1064.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 8</b></Caption><Alternative>An image of several giant pandas lounging against some logs eating bamboo</Alternative><Description>An image of several giant pandas lounging against some logs eating bamboo</Description></Figure><Paragraph>You can also achieve a similar effect using code, one line at a time. In this step, you will learn how to ask – and answer – questions of a similar form to the ones you raised using the interactive pivot table, but this time using programming code.</Paragraph><Paragraph>There are several reasons why you might want to automate pivot table operations you might previously have done by hand. These include:</Paragraph><BulletedList><ListItem>having a record of all the steps used to perform a particular task, or analysis, which can be useful if you need to check or provide evidence about what you have done (transparency)</ListItem><ListItem>being able to repeat the task automatically; this is particularly useful if you need to perform the same task repeatedly – for example, generating a new summary report each time a dataset is updated with new weekly or daily figures</ListItem><ListItem>being able to apply one analysis to another dataset. 
For example, you might want to produce the same sort of pivot table report for datasets that are organised in the same way but populated differently (for example, Comtrade datasets that refer to different groups of countries and/or different commodity types).</ListItem></BulletedList><Paragraph>In order to use the interactive pivot table, you had to identify:</Paragraph><BulletedList><ListItem>what column(s) in the dataset to use to define the row groupings in the pivot table</ListItem><ListItem>what column(s) in the dataset to use to define the column groupings in the pivot table</ListItem><ListItem>what column in the dataset to use as the basis for the pivot table summary function</ListItem><ListItem>what summary function to use.</ListItem></BulletedList><Paragraph>The process is similar when it comes to using pivot tables in pandas. Indeed, you might find it useful to use the interactive pivot table to help you identify just what needs to go where in order to generate a particular report using the pandas pivot table.</Paragraph><InternalSection><Heading>Working with pandas pivot tables</Heading><Paragraph>Let’s start by creating a sample dataset that includes several different columns that the data can be grouped by. The code below defines the dataframe column by column, instead of row by row as you have learned before.</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>
df = DataFrame({"Commodity":["A","A","A","A","B","B","B","C","C"],
</Paragraph><Paragraph>"Partner":["P","P","Q","Q","P","P","Q","P","Q"],</Paragraph><Paragraph>"Flow":["X","Y","X","Y","X","Y","X","X","Y"],</Paragraph><Paragraph>"Amount":[10,15,5,20,10,10,5,20,30]})</Paragraph><Paragraph>df</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table class="normal" style="topbottomrules"><TableHead/><tbody><tr><th/><th>Commodity</th><th>Partner</th><th>Flow</th><th>Amount</th></tr><tr><td><b>0</b></td><td>A</td><td>P</td><td>X</td><td>10</td></tr><tr><td><b>1</b></td><td>A</td><td>P</td><td>Y</td><td>15</td></tr><tr><td><b>2</b></td><td>A</td><td>Q</td><td>X</td><td>5</td></tr><tr><td><b>3</b></td><td>A</td><td>Q</td><td>Y</td><td>20</td></tr><tr><td><b>4</b></td><td>B</td><td>P</td><td>X</td><td>10</td></tr><tr><td><b>5</b></td><td>B</td><td>P</td><td>Y</td><td>10</td></tr><tr><td><b>6</b></td><td>B</td><td>Q</td><td>X</td><td>5</td></tr><tr><td><b>7</b></td><td>C</td><td>P</td><td>X</td><td>20</td></tr><tr><td><b>8</b></td><td>C</td><td>Q</td><td>Y</td><td>30</td></tr></tbody></Table><Paragraph>Suppose, for example, that you have data for a particular reporter country, and that you want to find the total value of trade that country has for each commodity and each partner country. A pivot table can be used to split the data by ‘commodity’, and within that ‘partner’, and then apply some sort of aggregation function to each (‘commodity’, ‘partner’) group.</Paragraph><Paragraph>In the interactive pivot table, this would have meant ordering the ‘Commodity’ and ‘Partner’ labels in the rows area, setting the aggregation function to <ComputerCode>
<b>sum</b>
</ComputerCode> and applying it to the ‘Amount’ (that is, the ‘Trade Value’), and leaving the columns area free of any selections.</Paragraph><Paragraph>In turn, the pandas <ComputerCode>
<b>pivot_table()</b>
</ComputerCode> function uses:</Paragraph><BulletedList><ListItem>the <ComputerCode>
<b>index</b>
</ComputerCode> parameter set as a list containing the ‘Commodity’ and ‘Partner’ data elements, to define the row categories</ListItem><ListItem>the <ComputerCode>
<b>values</b>
</ComputerCode> parameter set to ‘Amount’</ListItem><ListItem>the <ComputerCode>
<b>aggfunc</b>
</ComputerCode> (aggregating function) set to <ComputerCode>
<b>sum</b>
</ComputerCode> .</ListItem></BulletedList><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>pivot_table(df,</Paragraph><Paragraph>index=['Commodity','Partner'],</Paragraph><Paragraph>values='Amount',</Paragraph><Paragraph>aggfunc=sum)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table class="normal" style="topbottomrules"><TableHead/><tbody><tr><th/><th/><th>Amount</th></tr><tr><th>Commodity</th><th>Partner</th><td/></tr><tr><td><b>A</b></td><td><b>P</b></td><td>25</td></tr><tr><td/><td><b>Q</b></td><td>25</td></tr><tr><td><b>B</b></td><td><b>P</b></td><td>20</td></tr><tr><td/><td><b>Q</b></td><td>5</td></tr><tr><td><b>C</b></td><td><b>P</b></td><td>20</td></tr><tr><td/><td><b>Q</b></td><td>30</td></tr></tbody></Table><Paragraph>To further subdivide the data, an additional ‘Flow’ grouping element could be added in. (In this case, the resulting pivot table just corresponds to the original dataset.)</Paragraph><Paragraph><ComputerCode>
<b>In []:</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>pivot_table(df,</Paragraph><Paragraph>               index=['Commodity','Partner','Flow'], </Paragraph><Paragraph>               values='Amount', </Paragraph><Paragraph>               aggfunc=sum)</Paragraph></ComputerDisplay><Paragraph><ComputerCode>
<b>Out[]:</b>
</ComputerCode></Paragraph><Table><TableHead/><tbody><tr><th>Commodity</th><th>Partner</th><th>Flow</th><th/></tr><tr><td>A</td><td>P</td><td>X</td><td>10</td></tr><tr><td/><td/><td>Y</td><td>15</td></tr><tr><td/><td>Q</td><td>X</td><td>5</td></tr><tr><td/><td/><td>Y</td><td>20</td></tr><tr><td>B</td><td>P</td><td>X</td><td>10</td></tr><tr><td/><td/><td>Y</td><td>10</td></tr><tr><td/><td>Q</td><td>X</td><td>5</td></tr><tr><td>C</td><td>P</td><td>X</td><td>20</td></tr><tr><td/><td>Q</td><td>Y</td><td>30</td></tr></tbody></Table><Paragraph>Alternatively, you might decide that you want to pull out the ‘Flow’ items into separate columns for each of the original (‘commodity’, ‘partner’) groupings. To do this, add in a columns parameter:</Paragraph><ComputerDisplay><Paragraph>pivot_table(df,</Paragraph><Paragraph>        index=['Commodity','Partner'],</Paragraph><Paragraph>        columns=['Flow'],</Paragraph><Paragraph>        values='Amount',</Paragraph><Paragraph>        aggfunc=sum)</Paragraph></ComputerDisplay><Table class="normal" style="topbottomrules"><TableHead/><tbody><tr><th/><th>Flow</th><th>X</th><th>Y</th></tr><tr><th>Commodity</th><th>Partner</th><th/><th/></tr><tr><td><b>A</b></td><td><b>P</b></td><td>10</td><td>15</td></tr><tr><td/><td><b>Q</b></td><td>5</td><td>20</td></tr><tr><td><b>B</b></td><td><b>P</b></td><td>10</td><td>10</td></tr><tr><td/><td><b>Q</b></td><td>5</td><td>NaN</td></tr><tr><td><b>C</b></td><td><b>P</b></td><td>20</td><td>NaN</td></tr><tr><td/><td><b>Q</b></td><td>NaN</td><td>30</td></tr></tbody></Table><Paragraph>In this case, some missing values arise for cases where there was no original row item. For example, there were no rows in the original dataset for Commodity/Partner/Flow values of B/Q/Y, C/P/Y or C/Q/X.</Paragraph><Paragraph>The pandas produced pivot table can be further extended to report ‘marginal’ items, that is, row and column based total amounts, by setting <ComputerCode>
<b>margins=True</b>
</ComputerCode></Paragraph><ComputerDisplay><Paragraph>pivot_table(df,</Paragraph><Paragraph>        index=['Commodity','Partner'],</Paragraph><Paragraph>        columns=['Flow'],</Paragraph><Paragraph>        values='Amount',</Paragraph><Paragraph>        aggfunc=sum,</Paragraph><Paragraph>        margins=True) </Paragraph></ComputerDisplay><Table class="normal" style="topbottomrules"><TableHead/><tbody><tr><th/><th>Flow</th><th>X</th><th>Y</th><th>All</th></tr><tr><td><b>Commodity</b></td><td><b>Partner</b></td><td/><td/><td/></tr><tr><td><b>A</b></td><td><b>P</b></td><td>10</td><td>15</td><td>25</td></tr><tr><td/><td><b>Q</b></td><td>5</td><td>20</td><td>25</td></tr><tr><td><b>B</b></td><td><b>P</b></td><td>10</td><td>10</td><td>20</td></tr><tr><td/><td><b>Q</b></td><td>5</td><td>NaN</td><td>5</td></tr><tr><td><b>C</b></td><td><b>P</b></td><td>20</td><td>NaN</td><td>20</td></tr><tr><td/><td><b>Q</b></td><td>NaN</td><td>30</td><td>30</td></tr><tr><td><b>All</b></td><td/><td>50</td><td>75</td><td>125</td></tr></tbody></Table><Paragraph>In terms of the ‘split-apply-combine’ pattern, the pandas pivot table operates in much the same way as the interactive pivot table:</Paragraph><BulletedList><ListItem>the list of original data columns assigned to the index parameter splits the data into a set of groups</ListItem><ListItem>the groups are further split into smaller cell level groupings by optionally setting the columns parameter.</ListItem></BulletedList><Paragraph>The selected operator is then applied to each group and the results combined in an appropriately structured output display table.</Paragraph><Activity><Heading>Exercise 6 pivot tables with pandas</Heading><Question><Paragraph>Use the Exercise notebook 4 to explore the creation of pivot tables using pandas in Exercise 6.</Paragraph><Paragraph>Did you manage to ask any new questions of your data using the pandas pivot table function? 
You could try using them in combination with other pandas functions, such as <ComputerCode>
<b>filter()</b>
</ComputerCode>, to limit the rows from which the pivot table is generated. What did the pivot tables tell you about the levels of trade around the trade item and reporter country you selected?</Paragraph><Paragraph>One reason that pivot tables are often thought of as difficult to use is that there is a lot of data manipulation going on inside them. The data is grouped across rows, split across columns and may be aggregated in various ways. It can sometimes be hard to work out how to structure the output report you want, even before worrying about the programming code syntax. Given that, consider what you think the benefits of using code are as opposed to interactive pivot tables. Think about how you could use them to complement each other.</Paragraph></Question></Activity></InternalSection></Section><Section><Title>2.2 Looking at the milk and cream trade</Title><Paragraph>This week’s project looks at the milk and cream trade between the UK and other countries in the first five months of 2015.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1065.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="5861e0ab" x_imagesrc="ou_futurelearn_learn_to_code_fig_1065.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 9</b></Caption><Alternative>A close-up image of a splash of milk being poured</Alternative><Description>A close-up image of a splash of milk being poured</Description></Figure><Paragraph>The written-up analysis is in the project notebook. 
You will also need to open the file comtrade_milk_uk_jan_jul_15.csv and save it to your Anaconda folder or CoCalc project.</Paragraph><Paragraph>The notebook’s structure is very simple: besides the introduction and the conclusions, there is one section for each research question.</Paragraph><Paragraph>Next, extend this project or create your own.</Paragraph></Section><Section><Title>2.3 Your project</Title><Paragraph>If you have time, extend my project to answer different questions or create your own project.</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1066.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="3e88e1cc" x_imagesrc="ou_futurelearn_learn_to_code_fig_1066.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 10</b></Caption><Alternative>An image of a young man writing in a notebook at a desk in front of a brick wall.</Alternative><Description>An image of a young man writing in a notebook at a desk in front of a brick wall. There is a second notebook, a cup and saucer and a vase of flowers on the desk.</Description></Figure><Activity><Heading>Activity 1 Extend the project</Heading><Question><InternalSection><Paragraph>Make a copy of the project notebook and change it to answer one or more of the following questions:</Paragraph><BulletedList><ListItem>Which are the regular exporters, i.e. which countries sell both unprocessed and processed milk and cream to the UK every month?</ListItem><ListItem>Where could the export market be further developed, i.e. which countries import the least? Do the figures look realistic?</ListItem><ListItem>What is the total amount of exports to and imports from the bilateral trade countries? 
Hint: pivot tables can have ‘marginal’ values.</ListItem><ListItem>Repeat the whole analysis for January–May 2014 and compare the results.</ListItem></BulletedList></InternalSection></Question></Activity><Activity><Heading>Activity 2 Create a project (optional)</Heading><Question><InternalSection><Paragraph>If you have more time, create a completely new project. You could choose completely different commodities, a different reporter (e.g. your country), a different period (e.g. two or more full years), and only a few select partners (e.g. just the ‘World’ partner for a global analysis).</Paragraph></InternalSection></Question></Activity></Section></Session><Session><Title>3 This week’s quiz</Title><Paragraph>Now it’s time to complete the Week 8 badge quiz. It is similar to previous quizzes, but this time there will be fifteen questions instead of five.</Paragraph><Paragraph><a href="https://www.open.edu/openlearn/ocw/mod/quiz/view.php?id=78784">Week 8 compulsory badge quiz</a></Paragraph><Paragraph>Remember, this quiz counts towards your badge. If you’re not successful the first time, you can attempt the quiz again in 24 hours.</Paragraph></Session><Session><Title>4 Summary</Title><Paragraph>Phew – you made it! 
Well done!</Paragraph><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_fig_1067.jpg" width="100%" x_folderhash="cbfeded3" x_contenthash="3edbabd5" x_imagesrc="ou_futurelearn_learn_to_code_fig_1067.jpg" x_imagewidth="512" x_imageheight="341"/><Caption><b>Figure 11</b></Caption><Alternative>An image of a gymnast on her front with her legs split over her head</Alternative><Description>An image of a gymnast on her front with her legs split over her head</Description></Figure><Paragraph>During this week you’ve learned how to take a dataset that contains multiple possible groupings or subsets of data, and work with those groups to perform a variety of transformations.</Paragraph><Paragraph>In particular, you have learned how to:</Paragraph><BulletedList><ListItem>split the data contained in a dataframe into multiple groups based on the unique ‘key’ values in a single column, or unique combinations of values that appear across two or more columns</ListItem><ListItem><ComputerCode>
<b>apply</b>
</ComputerCode> an <ComputerCode>
<b>aggregate</b>
</ComputerCode> (summary) function to generate a single summary result for each group, and then combine these results to generate a summary report with one row per group</ListItem><ListItem><ComputerCode>
<b>apply</b>
</ComputerCode> a <ComputerCode>
<b>filter</b>
</ComputerCode> function that uses the rows contained in each group as the basis for a filtering operation, returning the rows from each group whose group properties match the filter conditions</ListItem><ListItem>use a pivot table to generate a variety of summary reports.</ListItem></BulletedList><Paragraph>You may not have thought of performing gymnastics with data before, but as you’ve seen, we can start to work our data quite hard if we need to get it into shape!</Paragraph><Section><Title>4.1 Week 7 and 8 glossary</Title><Paragraph>Here are alphabetical lists, for quick look-up, of what this week introduced.</Paragraph><InternalSection><Heading>Concepts</Heading><Paragraph>An <b>API</b>, or <b>application programming interface</b>, provides a way for computer programs to make function or resource requests from a software application. Many online applications provide a <b>web API</b> that allows requests to be made over the internet using web addresses or <b>URLs</b> (uniform resource locators). A URL may include several parameters that act as arguments used to pass information into a function provided by the API. To prevent ambiguity, simple punctuation is avoided in URLs. Instead, ‘websafe’ encodings using the ASCII encoding scheme are typically used to describe punctuation characters.</Paragraph><Paragraph>The notion of <b>grouping</b> refers to the collecting together of sets of rows based on some defining characteristic. Grouping on one or more key columns splits the dataset into separate groups, one group for each unique combination of values that appears in the dataset across the key columns. 
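</Paragraph><Paragraph>As a minimal sketch of grouping, the code below rebuilds the small Commodity/Partner/Flow example table used in the pivot table section and groups it on its key columns. The variable names (grouped, groups_present, groups_possible and totals) are illustrative assumptions, not names taken from the course notebooks.</Paragraph><ComputerDisplay><Paragraph>
```python
import pandas as pd

# Toy dataset rebuilt from the example table in the pivot table section.
df = pd.DataFrame({
    'Commodity': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Partner':   ['P', 'P', 'Q', 'Q', 'P', 'P', 'Q', 'P', 'Q'],
    'Flow':      ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'X', 'Y'],
    'Amount':    [10, 15, 5, 20, 10, 10, 5, 20, 30],
})

# Split: one group per combination of key values that occurs in the data.
grouped = df.groupby(['Commodity', 'Partner', 'Flow'])

# Compare the combinations present with the combinations that are possible.
groups_present = grouped.ngroups
groups_possible = (df['Commodity'].nunique()
                   * df['Partner'].nunique()
                   * df['Flow'].nunique())
print(groups_present, 'of', groups_possible, 'possible combinations occur')

# Apply and combine: total Amount for each (Commodity, Partner) group.
totals = df.groupby(['Commodity', 'Partner'])['Amount'].sum()
print(totals)
```
</Paragraph></ComputerDisplay><Paragraph>Comparing the two counts shows how many of the possible key combinations are absent from the data.</Paragraph><Paragraph>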
Note that not all possible combinations of cross-column key values will necessarily exist in a dataset.</Paragraph><Paragraph>The <b>split-apply-combine</b> pattern describes a process in which a dataset is <b>split</b> into separate groups, some function is <b>applied</b> to the members of each separate group, and the results then <b>combined</b> to form an output dataset.</Paragraph></InternalSection><InternalSection><Heading>Functions and methods</Heading><Paragraph><ComputerCode>
<b>df.to_csv(filename,index=False)</b>
</ComputerCode> writes the contents of the dataframe <ComputerCode>
<b>df</b>
</ComputerCode> to a CSV file with the specified filename in the current folder. The <ComputerCode>
<b>index</b>
</ComputerCode> parameter controls whether the dataframe index is included in the output file.</Paragraph><Paragraph><ComputerCode>
<b>read_csv(URL,dtype={})</b>
</ComputerCode> can be used to read a CSV file in from a web location given the web address or URL of the file. We also made use of an additional parameter, <ComputerCode>
<b>dtype</b>
</ComputerCode>, to specify the data types of particular columns in a dataframe created from a CSV file.</Paragraph><Paragraph><ComputerCode>
<b>df.groupby(columnName)</b>
</ComputerCode> or <ComputerCode>
<b>df.groupby(listOfColumnNames)</b>
</ComputerCode> is used to split a dataframe into separate groups indexed by the unique values of <ComputerCode>
<b>columnName</b>
</ComputerCode> or unique combinations of the column values specified in the <ComputerCode>
<b>listOfColumnNames</b>
</ComputerCode>.</Paragraph><Paragraph><ComputerCode>
<b>grouped.get_group(groupName)</b>
</ComputerCode> is used to retrieve a particular group of rows by group name from a set of grouped items.</Paragraph><Paragraph><ComputerCode>
<b>grouped.groups.keys()</b>
</ComputerCode> is used to retrieve the names of groups that exist within a set of grouped items.</Paragraph><Paragraph><ComputerCode>
<b>grouped.aggregate(operation)</b>
</ComputerCode> applies a specified operation (such as sum) to each group and then combines the results into a single dataframe indexed by group.</Paragraph><Paragraph><ComputerCode>
<b>grouped.apply(myFunction)</b>
</ComputerCode> will apply a user defined function to the rows associated with each group in a set of grouped items and return a dataframe containing the combined rows returned from the user defined function.</Paragraph><Paragraph><ComputerCode>
<b>grouped.filter(myFilterFunction)</b>
</ComputerCode> applies a user-defined filter function to each group in a set of grouped items. The filter function tests each group and returns a Boolean True or False value to say whether that group has passed the filter test. The <ComputerCode>
<b>.filter()</b>
</ComputerCode> function then returns a single dataframe that contains the combined rows from groups that the user defined filter function let through.</Paragraph><Paragraph><ComputerCode>
<b>pivot_table(df, index=indexColumnNames, columns=columnsColumnNames, values=valueColumnName, aggfunc=aggregationFunction)</b>
</ComputerCode> generates a pivot table from a dataframe using unique combinations of values from one or more columns specified by the <ComputerCode>
<b>indexColumnNames</b>
</ComputerCode> list to define the row index and unique combinations of values from one or more columns specified by the <ComputerCode>
<b>columnsColumnNames</b>
</ComputerCode> list to define the columns. The pivot table cells are calculated by applying the <ComputerCode>
<b>aggfunc</b>
</ComputerCode> function to the <ComputerCode>
<b>valueColumnName</b>
</ComputerCode> column in the group of rows associated with each cell.</Paragraph></InternalSection></Section><Section><Title>4.2 What next?</Title><MediaContent src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/lcdab_w8_conclusion.mp4" type="video" x_manifest="lcdab_w8_conclusion_1_server_manifest.xml" x_filefolderhash="d072e793" x_folderhash="d072e793" x_contenthash="d1e3addb"><Transcript><Speaker>RUTH ALEXANDER</Speaker><Remark>Hello again. You've now reached the end of this course on 'learning to code for data analysis'. You've seen and experienced some of the fundamentals of programming such as basic data types and structures like numbers, strings and lists and assignments and variables to store intermediate results. You've also seen the basic techniques to obtain, clean, transform, aggregate and visualise data. With a single line of code you can filter out the missing values, join two tables or make a chart. Finally you have seen how to use interactive notebooks to write up your own data analysis as a mixture of explanatory text and runnable code. </Remark><Remark>Notebooks can be easily shared among a group of colleagues or publicly which means you can make a real contribution to ongoing research and debates. We hope you're keen to apply your newly gained skills to other data sets on issues that you care about. You'll be able to find data sets you can explore and interrogate in the fields of health, education, energy, climate change, poverty, crime and many more besides. Below this video you'll find links to open data sources but don't forget that your national government or even your local authority might provide open data that's more relevant to you. In a short course like this we could only scratch the surface of coding and data analysis. </Remark><Remark>We hope to have inspired you to learn more about programming, data science and data management, or even statistics. 
Below you'll also find links to Open University courses, qualifications and free online resources that are related to the topics of this course. Whether you continue your studies with the Open University or not we do hope you really enjoyed learning to code for data analysis. We'd love to hear your feedback and suggestions. Thanks for participating and all the best for the future. </Remark></Transcript><Figure><Image src="https://www.open.edu/openlearn/pluginfile.php/1393338/mod_oucontent/oucontent/71687/ou_futurelearn_learn_to_code_vid_1005.jpg" x_folderhash="cbfeded3" x_contenthash="84fc983e" x_imagesrc="ou_futurelearn_learn_to_code_vid_1005.jpg" x_imagewidth="512" x_imageheight="288"/></Figure></MediaContent><Paragraph>As the course comes to an end, what’s next in your learning journey? The <a href="https://nationalcareers.service.gov.uk/?utm_source=Learn%20to%20Code%20for%20Data%20Analysis&amp;utm_medium=website&amp;utm_campaign=skillstoolkit">National Careers Service</a> can help you decide your next steps with your new skills. Ruth also mentions extending your learning by investigating more open data.</Paragraph><InternalSection><Heading>Exploring open data further</Heading><Paragraph>The last few years have seen a wide variety of local and national governments and agencies publishing data as ‘open data’ that can be freely re-used by anyone. 
Explore some of this data yourself, at the following links:</Paragraph><BulletedList><ListItem><Paragraph><a href="http://data.gov.uk/">UK government open data site</a> – a directory of UK public datasets</Paragraph></ListItem><ListItem><Paragraph><a href="http://data.gov/">US government open data site</a> – the home of the US Government’s open data</Paragraph></ListItem><ListItem><Paragraph><a href="http://index.okfn.org/dataset/">Open Knowledge Global Open Data Index</a> – a comprehensive directory of national open data initiatives</Paragraph></ListItem><ListItem><a href="http://opendatainception.io">Open Data Inception</a> – a geographic list of over 1500 data portals around the world</ListItem><ListItem><a href="https://www.google.com/publicdata/directory">Google Public Data Explorer</a> – a further list of data providers, with charts for some datasets</ListItem><ListItem><Paragraph>Many towns and cities also have their own data sites: search for the name of your town and the keywords ‘open data store’</Paragraph></ListItem><ListItem><Paragraph>Open data published by government departments and agencies such as <a href="http://www.education.gov.uk/schools/performance/">performance of UK schools</a> or <a href="http://landregistry.data.gov.uk/app/ppd">prices paid for house sales in the UK</a></Paragraph></ListItem><ListItem><Paragraph>The pandas library supports a growing number of external data sources such as <a href="http://pandas.pydata.org/pandas-docs/stable/remote_data.html">Google Analytics</a>.</Paragraph></ListItem></BulletedList></InternalSection><Paragraph><b>Get careers guidance</b></Paragraph><Paragraph>The <a href="https://nationalcareers.service.gov.uk/find-a-course/the-skills-toolkit?utm_source=openlearn&amp;utm_medium=referral&amp;utm_campaign=skillstoolkit_completed">National Careers Service</a> can help you decide your next steps with your new skills.</Paragraph><InternalSection><Heading>Complete our survey</Heading><Paragraph>We would love to 
know what you thought of the course and what you plan to do next. Whether you studied the course all in one go or dipped in and out, please take our <a href="https://www.surveymonkey.co.uk/r/BOCENDlearntocode">end-of-course survey</a>. Your feedback is anonymous but will help us to improve what we deliver.</Paragraph></InternalSection></Section></Session><Session><Title>Tell us what you think</Title><Paragraph>Now you’ve come to the end of the course, we would appreciate a few minutes of your time to complete this short <a href="https://www.surveymonkey.co.uk/r/BOCENDlearntocode">end-of-course survey</a> (you may have already completed this survey at the end of Week 4).</Paragraph><Paragraph>Additionally, if you found this course through the Skills Toolkit launched by the UK government in April 2020 and would be willing to provide feedback on how this course has helped you, please get in touch <a href="mailto:openlearn@open.ac.uk?subject=Learn to code course feedback">by emailing us</a>.</Paragraph></Session></Unit>
    
    
    <BackMatter><Acknowledgements><Paragraph>This free course was written by Michel Wermelinger.</Paragraph><Paragraph>Except for third party materials and otherwise stated (see <a href="http://www.open.ac.uk/conditions">terms and conditions</a> ), this content is made available under a <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_GB"> Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Licence </a>.</Paragraph><Paragraph>The material acknowledged below is Proprietary and used under licence (not subject to Creative Commons Licence). Grateful acknowledgement is made to the following sources for permission to reproduce material in this free course:</Paragraph><!--
The full URLs if required should the hyperlinks above break are as follows: Terms and conditions link  http://www.open.ac.uk/ conditions; Creative Commons link: http://creativecommons.org/ licenses/ by-nc-sa/ 4.0/ deed.en_GB]
--><Heading>Images</Heading><Paragraph><b>Course image</b> © peterhowell/iStockphoto.com</Paragraph><Paragraph><b>Figure 1 </b> © AFP/Getty Images</Paragraph><Paragraph><b>Figure 2</b> © Savushkin/istockphoto.com</Paragraph><Paragraph><b>Figure 3</b> © Kameleon007/istockphoto.com</Paragraph><Paragraph><b>Figure 4</b> © Picardo/Getty Images.co.uk</Paragraph><Paragraph><b>Figure 5</b> © kerriekerr/istockphoto.com</Paragraph><Paragraph><b>Figure 6</b> © Rawpixel Ltd/istockphoto.com</Paragraph><Paragraph><b>Figure 7</b> © merc67/istockphoto.com</Paragraph><Paragraph><b>Figure 8</b> © Katherine Feng/Globio/Getty Images</Paragraph><Paragraph><b>Figure 9</b> © AFP/Getty Images</Paragraph><Paragraph><b>Figure 10</b> © Nicolas Raymond in Flickr https://creativecommons.org/licenses/by/2.0</Paragraph><Paragraph><b>Figure 11</b> created by Jonatan argento in Wikipedia, https://creativecommons.org/licenses/by-sa/2.5/deed.en</Paragraph><Paragraph><b>Figure 12</b> © Fuse/Getty Images</Paragraph><Paragraph><b>Figure 13</b> © Predrag Vuckovic/Getty Images</Paragraph><Paragraph><b>Figure 1</b> © David Sucsy/istockphoto.com</Paragraph><Paragraph><b>Figure 2</b> © santiphotois/istockphoto.com</Paragraph><Paragraph><b>Figure 3</b> © Fuse/Getty Images</Paragraph><Paragraph><b>Figure 4</b> © Biddiboo/Getty Images</Paragraph><Paragraph><b>Figure 5</b> © maximkabb/istockphoto.com</Paragraph><Paragraph><b>Figure 6</b> © D-BASE/Getty Images</Paragraph><Paragraph><b>Figures 1 and 2</b> © Igor Zhuravlov/via istockphoto.com</Paragraph><Paragraph><b>Figure 3</b> © MartinCParker/istockphoto.com</Paragraph><Paragraph><b>Figure 4</b> © Adri Berger/Getty Images</Paragraph><Paragraph><b>Figure 5</b> © Donald Iain Smith/Getty Images</Paragraph><Paragraph><b>Figure 6</b> Public Domain by US Federal Government https://commons.wikimedia.org/wiki/File:Compass_rose.png</Paragraph><Paragraph><b>Figure 7</b> © Jim Reed/Getty Images</Paragraph><Paragraph><b>Figure 1</b> © David 
Sucsy/istockphoto.com</Paragraph><Paragraph><b>Figure 2</b> © santiphotois/istockphoto.com</Paragraph><Paragraph><b>Figure 3</b> © Fuse/Getty Images</Paragraph><Heading>Text</Heading><Paragraph>Exercise notebook 3, Exercise 1 and other identifiable exercises and tables compiled using World Bank data (2013).</Paragraph><Paragraph><b>Figure 1</b> © Biddiboo/Getty Images</Paragraph><Paragraph><b>Figure 2</b> © maximkabb/istockphoto.com</Paragraph><Paragraph><b>Figure 3</b> © D-BASE/Getty Images</Paragraph><Paragraph><b>Figure 6</b> © Uwe Kreijci/Getty Images</Paragraph><Paragraph><b>Figure 7</b> © Ariel Skelley/Getty Images</Paragraph><Paragraph><b>Figure 1</b> © cgouin/istockphoto.com</Paragraph><Paragraph><b>Figure 2</b> © Sam Camp/istockphoto.com</Paragraph><Paragraph><b>Figure 3</b> © Terribil-T/istockphoto.com</Paragraph><Paragraph><b>Figures 4 and 5</b> adapted from Comtrade data extraction interface: https://comtrade.un.org/data/</Paragraph><Paragraph><b>Figure 6</b> © ronstik/istockphoto.com</Paragraph><Paragraph><b>Figure 8</b> © webking/istockphoto.com</Paragraph><Paragraph><b>Figure 13</b> pivot table originally created by Nicolas Krutchen at Datacritic containing some Comtrade data describing milk imports to the UK</Paragraph><Paragraph><b>Figure 14</b> © Hung_Chung_Chih/istockphoto.com</Paragraph><Paragraph><b>Figure 15</b> © Jack Andersen/Getty Images</Paragraph><Paragraph><b>Figure 16</b> © PeopleImages/istockphoto.com</Paragraph><Paragraph><b>Figure 17</b> © oleg66/istockphoto.com</Paragraph><Heading>Text</Heading><Paragraph>Activity 3 Project: compiled from World Health Organisation (WHO) information</Paragraph><Heading>Video</Heading><Paragraph>Exercise 1 video: © The Open University</Paragraph><Paragraph>Every effort has been made to contact copyright owners. 
If any have been inadvertently overlooked, the publishers will be pleased to make the necessary arrangements at the first opportunity.</Paragraph><Paragraph/><Paragraph><b>Don't miss out</b></Paragraph><Paragraph>If reading this text has inspired you to learn more, you may be interested in joining the millions of people who discover our free learning resources and qualifications by visiting The Open University – <a href="http://www.open.edu/openlearn/free-courses?LKCAMPAIGN=ebook_&amp;MEDIA=ol">www.open.edu/openlearn/free-courses</a>.</Paragraph></Acknowledgements></BackMatter>
</Item>
