Model Setup and Training
Data preprocessing focused on creating a curated matrix of environmental and hydroacoustic data that could be interrogated with statistical and machine learning methods. Key considerations included outlier removal, time-averaging, imputation, data augmentation, and the representation of temporal dependencies.
The Aquaculture Biomass Monitor (ABM) hydroacoustic sensor returns estimates of fish depth at sub-second frequency. Each data point reports the location (relative to the sensor) of an individual (random) fish in the cage and is based on a sensor-detected change in medium (water versus flesh). For a six-month study, the sensor generated about 45 GB of data, and these high-frequency estimates of individual fish position must be processed into robust measurements of caged fish condition.
Data were first grouped into 1 m bins based on the Echo Range (m) measurement, to represent the frequency of returns at different depth levels (i.e. the number of individual fish in each 1 m bin). Measurements outside the extent of the cage were removed as outliers, and the remaining data were then time-averaged into hourly intervals. The binned data were depth-averaged to generate a time series vector that is amenable to machine learning analysis.
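The binning, outlier removal, and hourly aggregation described above can be sketched as follows. This is a minimal, dependency-free illustration and not the code used in the study: the record layout (timestamp, echo range in metres), the cage extent `CAGE_DEPTH_M`, and the function names are all assumptions. The text does not define "depth-averaged" precisely, so the sketch takes one plausible reading, the count-weighted mean of the bin centres for each hour.

```python
from collections import Counter, defaultdict
from datetime import datetime
import math

CAGE_DEPTH_M = 15.0  # hypothetical cage extent; the text does not give a value


def hourly_depth_profile(returns, cage_depth_m=CAGE_DEPTH_M):
    """Count returns per (hour, 1 m depth bin), dropping out-of-cage returns."""
    counts = defaultdict(Counter)
    for timestamp, echo_range_m in returns:
        if not (0.0 <= echo_range_m < cage_depth_m):
            continue  # outside the cage: treated as an outlier
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        depth_bin = math.floor(echo_range_m)  # bin k covers [k, k + 1) m
        counts[hour][depth_bin] += 1
    return counts


def depth_average(counts):
    """Collapse each hourly profile to a count-weighted mean depth (m)."""
    series = {}
    for hour, bins in sorted(counts.items()):
        total = sum(bins.values())
        weighted = sum((k + 0.5) * n for k, n in bins.items())  # bin centres
        series[hour] = weighted / total
    return series


# Toy input: (timestamp, Echo Range in metres) pairs.
returns = [
    (datetime(2020, 1, 1, 0, 0, 1), 2.3),
    (datetime(2020, 1, 1, 0, 30, 0), 2.9),
    (datetime(2020, 1, 1, 0, 45, 0), 4.1),
    (datetime(2020, 1, 1, 0, 50, 0), 40.0),  # beyond the assumed cage extent
    (datetime(2020, 1, 1, 1, 5, 0), 7.5),
]
profile = hourly_depth_profile(returns)
series = depth_average(profile)
```

Keeping the per-bin counts in `profile` separate from the collapsed `series` means the full depth distribution remains available if a single hourly value proves too coarse.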
Data gaps or missing values were either imputed or removed: gaps shorter than 4 hours were imputed using nearest-neighbour linear interpolation, while for gaps of 4 hours or longer the affected portion was removed from the analysis (i.e. both the environmental and hydroacoustic data were removed). Autoregressive features (i.e. values at previous points in time) are often informative for machine learning models. We generated these features using a three-hour sliding window (i.e. the values at the previous one, two, and three hours). The resulting matrix was combined and time-aligned with the environmental data. We used our open-source packages, TSML and AutoMLPipeline, for this preprocessing step.
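To make the gap rule and the lag features concrete, the following is a minimal sketch in plain Python. It is not the TSML or AutoMLPipeline implementation used in the study; the function names are hypothetical, missing hours are represented as `None`, and gap length is assumed to mean the number of consecutive missing hourly values. Gaps at the very start or end of the series have only one neighbour and are left unfilled here.

```python
def fill_short_gaps(values, max_gap=4):
    """Linearly interpolate runs of None shorter than max_gap; leave longer runs."""
    filled = list(values)
    i, n = 0, len(filled)
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        gap = i - start  # the run covers indices start .. i - 1
        has_neighbours = start > 0 and i < n
        if gap < max_gap and has_neighbours:
            left, right = filled[start - 1], filled[i]
            for k in range(gap):
                frac = (k + 1) / (gap + 1)
                filled[start + k] = left + frac * (right - left)
    return filled


def lag_rows(values, n_lags=3):
    """Build (lag_1, ..., lag_n, target) rows, skipping rows that touch a gap."""
    rows = []
    for t in range(n_lags, len(values)):
        window = values[t - n_lags : t + 1]
        if any(v is None for v in window):
            continue  # a remaining (long) gap removes this row from the analysis
        lags = window[-2::-1]  # most recent first: t-1, t-2, t-3
        rows.append((*lags, window[-1]))
    return rows


# Toy hourly series with a 1-hour gap (filled) and a 4-hour gap (dropped).
hourly = [1.0, None, 3.0, 4.0, 5.0, None, None, None, None,
          10.0, 11.0, 12.0, 13.0]
filled = fill_short_gaps(hourly)
rows = lag_rows(filled)
```

In this sketch a long gap also removes the rows immediately after it, since their three-hour window would reach back into the missing stretch.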
The features (primarily environmental data) and label (fish location) data were split into two groups: a training data set comprising 90% of the data and a test data set comprising the remaining 10%. The training data set is the set of samples used to fit the model, while the test data set provides an unbiased evaluation of the final model fit on the training data. The test data must not be seen by the model at any stage during training.
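A 90/10 split of this kind can be sketched as below. The text does not state whether samples were assigned at random or in time order; this illustration assumes a chronological hold-out, which is a common choice when lagged features are present because it keeps test samples strictly later than training samples. The function name and the stand-in data are hypothetical.

```python
def chronological_split(rows, train_fraction=0.9):
    """Split time-ordered rows into an earlier training set and a later test set."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Stand-in for time-ordered samples; the integers play the role of row indices.
rows = list(range(5847))
train, test = chronological_split(rows)
```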
After preprocessing and hourly averaging, the total number of data points available was 5,847. We then used the trained machine learning model to interrogate how environmental data contributed to variations in fish location and behaviour. This can be considered the true goal of the machine learning implementation, and an accurate model simply served as a means to achieve this goal.