Model Setup and Training

Data preprocessing focused on creating a curated matrix of environmental and hydroacoustic data to support statistical and machine learning analysis of the relationships between them. Key considerations included outlier removal, time-averaging, imputation, data augmentation, and the representation of temporal dependencies.

The Aquaculture Biomass Monitor (ABM) hydroacoustic sensor returns estimates of fish depth at sub-second frequency. Each data point reports the location (relative to the sensor) of an individual (random) fish in the cage, based on the change in medium (water versus flesh) detected by the sensor. Over a six-month study, the sensor generated about 45 GB of data, so these high-frequency estimates of individual fish position must be processed into robust measurements of caged fish condition.

Data were first grouped into 1 m bins based on the Echo Range (m) measurement, to represent the frequency of returns at different depth levels (i.e. the number of individual fish in each 1 m bin). Measurements outside the extents of the cage were removed as outliers, and the remaining data were then time-averaged into hourly intervals. The binned data were depth-averaged to generate a time series vector that is amenable to machine learning analysis.
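The binning and averaging steps above can be sketched in plain Python. This is only an illustration, not the study's actual pipeline: the function name `hourly_mean_depth`, the cage depth of 15 m, and the reading of "depth-averaged" as a count-weighted mean of the bin centres are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime

CAGE_DEPTH_M = 15.0  # hypothetical cage extent; not given in the source


def hourly_mean_depth(returns, cage_depth_m=CAGE_DEPTH_M):
    """Reduce (timestamp, echo_range_m) returns to one depth value per hour.

    Steps: drop returns outside the cage, count returns per 1 m depth bin
    within each hour, then take the count-weighted mean of the bin centres
    (an assumed interpretation of "depth-averaged").
    """
    counts = defaultdict(Counter)  # hour -> {bin index -> number of returns}
    for ts, echo_range_m in returns:
        if not 0.0 <= echo_range_m < cage_depth_m:
            continue  # outside the cage extents: treat as an outlier
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[hour][int(echo_range_m)] += 1  # 1 m bin covering [k, k + 1)

    series = {}
    for hour in sorted(counts):
        bins = counts[hour]
        total = sum(bins.values())
        # The centre of bin k is k + 0.5 m; weight it by its return count.
        series[hour] = sum((k + 0.5) * n for k, n in bins.items()) / total
    return series
```

The result is a dictionary keyed by the start of each hour, which can be turned into the time series vector described above. Hours with no valid returns are simply absent and would show up as gaps.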

Data gaps or missing values were either imputed or removed. Gaps shorter than 4 hours were imputed by linear interpolation between the nearest neighbouring values, while gaps of 4 hours or longer were removed from the analysis (i.e. both the environmental and hydroacoustic data for that period were removed). Autoregressive features (i.e. values at previous points in time) are often informative for machine learning models. We generated these features using a three-hour sliding window (i.e. the values one, two, and three hours earlier). The resulting matrix was then combined and time-aligned with the environmental data. We used our open-source packages, TSML and AutoMLPipeline, for this preprocessing step.
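The gap rule and the lagged features can be illustrated with a short, standard-library Python sketch. It does not use or reproduce the TSML or AutoMLPipeline APIs; the function names `fill_short_gaps` and `lagged_rows` are hypothetical, missing hours are assumed to be marked as NaN in an hourly series, and gap length is assumed to mean the number of consecutive missing hourly values.

```python
import math

MAX_GAP_H = 4  # gaps of this many hours or more are not imputed (from the text)
N_LAGS = 3     # values at the previous one, two and three hours (from the text)


def fill_short_gaps(values, max_gap=MAX_GAP_H):
    """Linearly interpolate NaN runs shorter than max_gap; leave the rest NaN."""
    out = list(values)
    i, n = 0, len(out)
    while i < n:
        if not math.isnan(out[i]):
            i += 1
            continue
        j = i
        while j < n and math.isnan(out[j]):
            j += 1  # j ends up at the first valid index after the gap
        interior = i > 0 and j < n  # needs a valid neighbour on both sides
        if interior and (j - i) < max_gap:
            left, right = out[i - 1], out[j]
            for k in range(i, j):
                frac = (k - i + 1) / (j - i + 1)
                out[k] = left + frac * (right - left)
        i = j
    return out


def lagged_rows(values, n_lags=N_LAGS):
    """Return (lag features, target) pairs, skipping rows that touch a NaN."""
    rows = []
    for t in range(n_lags, len(values)):
        window = values[t - n_lags:t + 1]
        if any(math.isnan(v) for v in window):
            continue  # this row overlaps a gap that was left unfilled
        lags = window[:-1][::-1]  # ordered as t-1, t-2, t-3
        rows.append((lags, window[-1]))
    return rows
```

Calling `lagged_rows(fill_short_gaps(series))` fills the short gaps first and then drops every row whose window overlaps a long gap, which is one simple way to remove those periods from the analysis. In a full pipeline the same row mask would also be applied to the environmental data.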

The features (primarily environmental data) and label (fish location) data were split into two groups: a training data set comprising 90% of the data and a test data set comprising the remaining 10%. The training data set contains the samples used to create the model, while the test data set provides an unbiased evaluation of the final model fitted on the training data. The test data must therefore not be seen by the model at any stage during training.
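A 90/10 split can be sketched as follows. The text does not say whether the split was random or chronological; this example assumes a chronological split (the most recent 10% held out), which is a common choice when lagged features are used because it keeps future values out of training. The function name `chronological_split` is hypothetical.

```python
def chronological_split(rows, train_fraction=0.9):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)  # index of the first test row
    return rows[:cut], rows[cut:]
```

Under this truncating rule, the 5,847 hourly points reported for this study would give 5,262 training rows and 585 test rows.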

After preprocessing and hourly averaging, 5,847 data points were available. We then used the trained machine learning model to interrogate how environmental data contributed to variations in fish location and behaviour. This was the true goal of the machine learning implementation; an accurate model simply served as a means to achieve it.




Last modified: Tuesday, 19 October 2021, 5:05 PM