Machine Learning


Given sufficient data, machine learning (ML) models have the potential to successfully detect, quantify, and predict various phenomena in geosciences. While physics-based modelling provides a set of inputs to a model that generates the corresponding outputs via a non-linear mapping encoded in a set of governing equations, supervised ML instead learns the requisite mapping by being shown a large number of corresponding inputs and outputs. In ML parlance, the model is trained on a set of inputs (called features) and corresponding outputs (termed labels), from which it learns the prediction task. In our case, we wish to predict the distribution of fish in a cage (as sampled by a hydroacoustic sensor) from a set of environmental measurements, which serve as the features.
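As a minimal sketch of this workflow (using Python and scikit-learn purely for illustration; the data are synthetic and the feature and label meanings hypothetical), the model is fitted on example feature/label pairs and can then predict labels for new feature vectors:

    # Minimal supervised-learning sketch (hypothetical feature/label names).
    # Features: environmental measurements (e.g. temperature, oxygen,
    # current speed, hour of day); label: a fish-distribution metric.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor  # the model class discussed below

    rng = np.random.default_rng(0)
    n = 500
    X = rng.normal(size=(n, 4))  # one row per observation, one column per feature
    # y: the label the model should learn to predict (synthetic here)
    y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

    model = RandomForestRegressor(random_state=0)
    model.fit(X, y)             # learn the feature -> label mapping from examples
    y_pred = model.predict(X)   # predict labels for (new) feature vectors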

Random Forest (RF) is one of the most popular machine learning models and has demonstrated excellent performance in complex prediction problems characterised by a large number of explanatory variables and nonlinear dynamics. RF is a classification and regression method based on the aggregation of a large number of decision trees. A decision tree is a conceptually simple yet powerful prediction tool that recursively partitions a dataset into smaller and smaller subsets, incrementally building up the tree as it does so. The resulting intuitive pathway from explanatory variables to outcome makes the model easy to interpret, as the sketch below illustrates.
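To make the recursive-splitting idea concrete, the sketch below (again scikit-learn, with synthetic data standing in for real measurements and hypothetical feature names) fits a single shallow regression tree and prints its learned rules; each root-to-leaf path is a readable if/then chain over the explanatory variables:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 20, size=(300, 2))  # e.g. temperature, dissolved oxygen
    y = np.where(X[:, 0] > 12, 5.0, 15.0) + rng.normal(scale=1.0, size=300)

    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, y)

    # Print the splitting rules: the interpretable pathway from
    # explanatory variables to the predicted outcome.
    print(export_text(tree, feature_names=["temperature", "oxygen"]))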

In RF (Breiman 2001), each tree is a standard Classification and Regression Tree (CART) that uses what is termed node "impurity" as a splitting criterion and selects the splitting predictor from a randomly chosen subset of predictors (the subset differs at each split). Each node in a regression tree corresponds to the average of the response within the subdomain of the feature space belonging to that node. The node impurity measures how badly the observations at a given node fit the model; in regression trees, this is typically the residual sum of squares within that node. Each tree is constructed from a bootstrap sample drawn with replacement from the original dataset, and the predictions of all trees are finally aggregated, by majority voting for classification or by averaging for regression (Boulesteix 2012).
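These ingredients map directly onto parameters of scikit-learn's RandomForestRegressor (used here as an illustrative implementation, on synthetic data): bootstrap=True draws the per-tree bootstrap sample, max_features controls the random subset of predictors considered at each split, and the forest's regression prediction is the average of the individual trees' outputs:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(2)
    X = rng.normal(size=(400, 6))
    y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + rng.normal(scale=0.2, size=400)

    rf = RandomForestRegressor(
        n_estimators=200,     # number of CART trees in the ensemble
        max_features="sqrt",  # random subset of predictors tried at each split
        bootstrap=True,       # each tree sees a bootstrap sample of the data
        random_state=0,
    )
    rf.fit(X, y)

    # For regression, the aggregation is the mean of the per-tree
    # predictions; the two printed rows are identical.
    mean_of_trees = np.mean([t.predict(X[:5]) for t in rf.estimators_], axis=0)
    print(rf.predict(X[:5]))
    print(mean_of_trees)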

While RF is popular for achieving relatively good performance with little hyperparameter tuning (i.e. it works well with the default values specified in the software library), as with all machine learning models it is necessary to consider the bias-variance tradeoff: the balance between a model that tracks the training data perfectly but does not generalise to new data, and one that is too biased to capture the characteristics of the training data at all.
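A simple, standard way to see where a model sits on this tradeoff (sketched here on synthetic data) is to hold out a test set and compare training error against test error: a large gap signals overfitting (high variance), while large error on both signals underfitting (high bias):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=500)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    for depth in (2, None):  # shallow trees (more bias) vs unbounded (more variance)
        rf = RandomForestRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        print(depth,
              mean_squared_error(y_tr, rf.predict(X_tr)),  # training error
              mean_squared_error(y_te, rf.predict(X_te)))  # generalisation error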

Hyperparameters to tune include the number of trees, the maximum depth of each tree, the number of features to consider when looking for the best split, and the splitting criterion (Probst 2019). A cross-validated search over these is sketched below.
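In scikit-learn these correspond to n_estimators, max_depth, max_features, and criterion (names as in sklearn.ensemble.RandomForestRegressor, version 1.0 or later). A minimal cross-validated grid search might look like this; the grid values are illustrative, not recommendations:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 5))
    y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

    param_grid = {
        "n_estimators": [100, 300],     # number of trees
        "max_depth": [None, 5, 10],     # maximum depth of each tree
        "max_features": ["sqrt", 1.0],  # features considered per split
        "criterion": ["squared_error", "absolute_error"],  # splitting criterion
    }
    search = GridSearchCV(RandomForestRegressor(random_state=0),
                          param_grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print(search.best_params_)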

References:

Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.

Boulesteix, Anne‐Laure, et al. "Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2.6 (2012): 493-507.

Probst, Philipp, Marvin N. Wright, and Anne‐Laure Boulesteix. "Hyperparameters and tuning strategies for random forest." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9.3 (2019): e1301.
