Introduction to data wrangling
A core part of a data scientist's job is data wrangling: loading data,
cleaning data, visualising data, transforming data, etc. Indeed, an
oft-quoted factoid states that data scientists spend 80% of their time
cleaning data rather than creating insights. Hence, a key component of
any scientist's toolbox is a set of packages that simplify working with
data arrays.
In Julia, groups of related items are usually stored in arrays, tuples, or dictionaries. As we saw in the previous session, arrays can be used for storing vectors and matrices.
An array is an ordered collection of elements. It's often indicated
with square brackets and comma-separated items. You can create arrays
that are full or empty, and arrays that hold values of different types
or are restricted to values of a specific type.
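The array variants described above can be sketched as follows. This is a minimal illustration; the variable names and values are made up for the example.

```julia
# A full array: square brackets with comma-separated items.
temperatures = [12.5, 13.1, 12.9]      # element type inferred as Float64

# An empty array restricted to a specific element type.
readings = Float64[]
push!(readings, 7.8)                   # arrays grow in place with push!
push!(readings, 8.1)

# An array holding values of different types gets element type Any.
mixed = [1, "cage A", 3.5]

# An empty array with no type restriction.
anything = []

println(eltype(temperatures))
println(eltype(mixed))
println(length(readings))
```

Restricting the element type, as in `Float64[]`, lets Julia store the values compactly and catch accidental insertion of a value of the wrong type.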
DataFrames are particularly valuable when working with data from multiple sources collected at different points in time. Consider the need to assess the impact of dissolved oxygen on fish feeding patterns. One needs to combine data on dissolved oxygen and feed volumes from multiple cages in a farm. Naturally, the temporal evolution of both is critical, so timestamps must be attached to both datasets. Further, dissolved oxygen is influenced by several other variables, such as water temperature and flow speed. Hence, one needs the ability to act independently on multiple time series datasets from different locations, and to concatenate them in a form that lends itself to analysis.
Luckily, the Julia ecosystem provides the excellent DataFrames package, which does just that. Let's explore data wrangling in Julia in our next live tutorial.
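As a small preview, the sketch below shows how the fish-farm scenario might look with the DataFrames package. It assumes the package is installed. The column names, cage labels, and all values are hypothetical and chosen only for illustration.

```julia
using DataFrames, Dates

# Hypothetical dissolved-oxygen readings (mg/L) from two cages.
oxygen_a = DataFrame(
    timestamp = [DateTime(2023, 6, 1, 8), DateTime(2023, 6, 1, 9)],
    cage = ["A", "A"],
    oxygen = [7.9, 7.4],
)
oxygen_b = DataFrame(
    timestamp = [DateTime(2023, 6, 1, 8), DateTime(2023, 6, 1, 9)],
    cage = ["B", "B"],
    oxygen = [8.2, 8.0],
)

# Concatenate the per-cage tables into a single table.
oxygen = vcat(oxygen_a, oxygen_b)

# Hypothetical feed volumes (kg); cage B has no record for 09:00.
feed = DataFrame(
    timestamp = [DateTime(2023, 6, 1, 8), DateTime(2023, 6, 1, 9),
                 DateTime(2023, 6, 1, 8)],
    cage = ["A", "A", "B"],
    feed_kg = [12.0, 10.5, 14.0],
)

# Keep only the rows where both measurements exist
# for the same timestamp and the same cage.
combined = innerjoin(oxygen, feed, on = [:timestamp, :cage])

println(combined)
```

An inner join drops timestamps that appear in only one of the two tables. If those rows should be kept with missing values instead, `leftjoin` or `outerjoin` would be the alternatives to consider.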