- Alfred-Wegener-Institut - Helmholtz-Zentrum für Polar- und Meeresforschung
Thousands of ocean temperature and salinity measurements are collected every day around the world. Controlling the quality of these data is labor-intensive because automatic control procedures still produce many false alarms that only a human expert can identify. Quality control (QC) procedures have not yet benefited from recent advances in machine learning methods that predict simple targets from complex multi-dimensional features. As data volumes grow, algorithmic help is urgently needed, and artificial intelligence (AI) could play a major role. The use of data mining and machine learning in automatic oceanographic data quality control needs substantial further development. Such techniques provide a convenient framework for improving automatic QC, because supervised learning can be used to reduce the discrepancy between the automatic flags and the human expert's evaluation.
This work presents a comparative analysis of machine learning classification algorithms for ocean data quality control, targeting the detection of the suspect temperature gradient error. The objective is to identify an effective QC classification method for ocean data using a representative set of supervised machine learning algorithms. The work presented here is the second step of our overall system: the first step uses a deep convolutional neural network to classify profiles as good or bad, and the second step locates the bad samples within the profiles flagged as bad. For this reason, the dataset used to train the benchmarked models is composed only of bad profiles.
The following algorithms are used in this study, each with hyperparameter optimization: Multilayer Perceptron (MLP), Support Vector Machine (SVM) with different kernels, Random Forest (RF), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). The hyperparameters are optimized with grid search so that each algorithm is compared at its best classification performance.
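As an illustration of this kind of benchmark, the following is a minimal sketch using scikit-learn. It is not the configuration used in this study: the synthetic features from `make_classification`, the parameter grids, the 3-fold cross-validation, and the balanced-accuracy score are all assumptions chosen to keep the example small and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for per-sample features (assumption: this is NOT UDASH data).
# Label 0 = good sample, label 1 = bad sample (hypothetical encoding).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# One (estimator, grid) pair per algorithm; the grids are illustrative only.
candidates = {
    "MLP": (MLPClassifier(max_iter=1000, random_state=0),
            {"clf__hidden_layer_sizes": [(16,), (32, 16)],
             "clf__alpha": [1e-4, 1e-2]}),
    "SVM": (SVC(),
            {"clf__kernel": ["linear", "rbf", "poly"],
             "clf__C": [0.1, 1, 10]}),
    "RF": (RandomForestClassifier(random_state=0),
           {"clf__n_estimators": [50, 100],
            "clf__max_depth": [None, 5]}),
    "NB": (GaussianNB(),
           {"clf__var_smoothing": [1e-9, 1e-6]}),
    "KNN": (KNeighborsClassifier(),
            {"clf__n_neighbors": [3, 5, 9]}),
    "LDA": (LinearDiscriminantAnalysis(),
            {"clf__solver": ["lsqr"],
             "clf__shrinkage": [None, "auto"]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Scaling inside the pipeline keeps test folds out of the scaler fit.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, scoring="balanced_accuracy", cv=3)
    search.fit(X_train, y_train)
    # search.score evaluates the refit best model with the same scoring.
    results[name] = (search.best_params_, search.score(X_test, y_test))

for name, (params, score) in results.items():
    print(f"{name}: held-out balanced accuracy = {score:.3f}, best = {params}")
```

Tuning every model with the same search procedure and evaluating on the same held-out split is what makes the comparison between algorithms fair; the scores printed here describe only the synthetic data.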
The results obtained on the Unified Database for the Arctic and Subarctic Hydrography (UDASH) dataset are promising, especially with the MLP algorithm, which achieved an accuracy of 86.64% in the detection of good samples and 88.84% in the detection of bad samples, although room for improvement remains. This system has the potential to be used as a semi-automatic quality control system.
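Separate detection accuracies for good and bad samples can be computed as per-class rates, that is, the fraction of samples of each true class that the classifier labels correctly. The sketch below assumes this interpretation of the reported figures; the helper name `per_class_accuracy`, the label encoding (0 = good, 1 = bad), and the toy labels are hypothetical and are not taken from the study.

```python
def per_class_accuracy(y_true, y_pred, labels=(0, 1)):
    """Return, for each class, the fraction of its samples predicted correctly."""
    rates = {}
    for label in labels:
        # Predictions for the samples whose true class is `label`.
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        rates[label] = sum(p == label for p in preds) / len(preds)
    return rates


# Toy labels for illustration only: 0 = good sample, 1 = bad sample.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0]

rates = per_class_accuracy(y_true, y_pred)
print(f"good samples detected: {rates[0]:.2%}")
print(f"bad samples detected:  {rates[1]:.2%}")
```

Reporting the two rates separately matters when bad samples are much rarer than good ones, since a single overall accuracy could then hide poor detection of the bad samples.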