Introduction to data wrangling

A core part of a data scientist's job is data wrangling: loading, cleaning, visualising, and transforming data. Indeed, an oft-quoted claim holds that data scientists spend 80% of their time cleaning data rather than creating insights. Hence, a key component of any data scientist's toolbox is a set of packages that simplify working with data arrays.

In Julia, groups of related items are usually stored in arrays, tuples, or dictionaries. As we saw in the previous session, arrays can be used for storing vectors and matrices.
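As a quick refresher, the three container types can be sketched as follows. The names and values (cage labels, oxygen readings) are made up for illustration and do not come from a real dataset.

```julia
# Hypothetical values, invented purely for illustration.
reading = ("cage_1", 7.8)                            # tuple: fixed-size and immutable
depths  = Dict("cage_1" => 12.5, "cage_2" => 15.0)   # dictionary: key => value lookup
oxygen  = [7.8, 8.1, 7.5]                            # vector: 1-dimensional array
grid    = [1 2 3; 4 5 6]                             # matrix: 2 rows, 3 columns

println(reading[2])         # second element of the tuple
println(depths["cage_2"])   # value stored under the key "cage_2"
println(length(oxygen))     # number of elements in the vector
println(size(grid))         # dimensions of the matrix
```

Note that Julia indexes from 1, so `reading[2]` is the second item of the tuple.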

An array is an ordered collection of elements. It's often indicated with square brackets and comma-separated items. You can create arrays that are full or empty, and arrays that hold values of different types or are restricted to values of a specific type.
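The different kinds of array mentioned above might look like this in practice. This is a minimal sketch using only base Julia; the variable names are illustrative.

```julia
full         = [1, 2, 3]          # element type is inferred from the values (Int)
empty_floats = Float64[]          # empty array restricted to Float64 values
mixed        = [1, "two", 3.0]    # values of different types, so the element type is Any

push!(empty_floats, 7.8)          # arrays can grow after creation
push!(empty_floats, 8)            # the integer 8 is converted to 8.0 on insertion

println(eltype(full))             # element type of the inferred array
println(eltype(mixed))            # element type of the mixed array
println(empty_floats)             # contents after the two push! calls
```

Restricting an array to a specific type, as with `Float64[]`, generally lets Julia store and process it more efficiently than an array of `Any`.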

A DataFrame is a data structure like a table or spreadsheet. You can use it for storing and exploring a set of related data values. Think of it as a smarter array for holding tabular data.
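To make this concrete, here is a minimal sketch of building and inspecting a small table. It assumes the third-party DataFrames package is installed; the column names and readings are hypothetical and chosen only for illustration.

```julia
using DataFrames  # third-party package; install once with `import Pkg; Pkg.add("DataFrames")`

# Hypothetical readings, invented for illustration only.
df = DataFrame(
    cage = ["cage_1", "cage_1", "cage_2"],
    oxygen_mg_l = [7.8, 8.1, 7.5],
    temperature_c = [11.2, 11.4, 10.9],
)

println(size(df))        # (number of rows, number of columns)
println(names(df))       # column names as strings
println(df.oxygen_mg_l)  # a single column behaves like an ordinary vector

# Keep only the rows where the cage column equals "cage_1".
cage1 = df[df.cage .== "cage_1", :]
println(cage1)
```

Each column has a name and its own element type, which is what makes a DataFrame more convenient than a plain matrix for tabular data.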


DataFrames are particularly valuable when working with data from multiple sources collected at different points in time. Consider the need to study the impact of dissolved oxygen on fish feeding patterns. One needs to combine data on dissolved oxygen and feed volumes from multiple cages in a farm. Naturally, the temporal evolution of both is critical, so timestamps must be assigned to both data arrays. Further, dissolved oxygen is influenced by several other variables, such as water temperature and flow speed. Hence, one needs the ability to act on multiple time series datasets from different locations independently, and to concatenate them in a form that lends itself to analysis.
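The fish farm scenario can be sketched roughly as follows. This is an illustrative outline rather than a full analysis: it assumes the DataFrames package is installed, and every table name, column name, and measurement below is hypothetical.

```julia
using DataFrames, Dates

# Hypothetical timestamps and measurements, invented for illustration only.
t1 = DateTime(2021, 10, 19, 8, 0)
t2 = DateTime(2021, 10, 19, 9, 0)

oxygen = DataFrame(
    timestamp = [t1, t2, t1, t2],
    cage = ["cage_1", "cage_1", "cage_2", "cage_2"],
    oxygen_mg_l = [7.8, 8.1, 7.5, 7.9],
)
feed_cage1 = DataFrame(timestamp = [t1, t2], cage = ["cage_1", "cage_1"], feed_kg = [120.0, 95.0])
feed_cage2 = DataFrame(timestamp = [t1, t2], cage = ["cage_2", "cage_2"], feed_kg = [110.0, 101.0])

# Stack the per-cage feed tables into one table (same columns, more rows).
feed = vcat(feed_cage1, feed_cage2)

# Match rows that share both a timestamp and a cage.
combined = innerjoin(oxygen, feed, on = [:timestamp, :cage])
println(combined)
```

Here `vcat` concatenates tables from different locations, while `innerjoin` lines up the oxygen and feed series on their shared timestamp and cage columns, keeping only the rows present in both.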

Luckily, Julia provides the excellent DataFrames package, which does just that. Let's explore data wrangling in Julia in our next live tutorial.


Last modified: Tuesday, 19 October 2021, 4:48 PM