Poster

Machine-Learning-Based Comparative Study to Detect Suspect Temperature Gradient Error in Ocean Data.

Chouai, Mohamed¹, Reimers, F.¹, Vredenborg, M.¹, Pinkernell, S.¹, Mieruch-Schnülle, S.¹
  1. Alfred-Wegener-Institut - Helmholtz-Zentrum für Polar- und Meeresforschung
<p>Thousands of ocean temperature and salinity measurements are collected every day around the world. Controlling the quality of these data is labor-intensive, because automatic control procedures still produce many false alarms that only a human expert can identify. Quality control (QC) procedures have not yet benefited from recent advances in machine learning methods that predict simple targets from complex multi-dimensional features. As data volumes grow, algorithmic support is urgently needed, and artificial intelligence (AI) could play a dominant role. Automatic oceanographic data quality control needs to be transformed by developments in data mining and machine learning. Such techniques provide a convenient framework for improving automatic QC: supervised learning can reduce the discrepancy between automatic flags and the human expert's evaluation.</p>
<p>This work presents a comparative analysis of machine learning classification algorithms for ocean data quality control, targeting the suspect temperature gradient error. The objective is to identify an effective QC classification method for ocean data by benchmarking a representative set of supervised machine learning algorithms. The work presented here is the second step of our overall system: the first step uses a deep convolutional neural network to classify profiles as good or bad, and the second step locates the bad samples within a profile. For this reason, the dataset used to train the benchmarked models consists only of bad profiles.</p>
<p>The following algorithms are compared: Multilayer Perceptron (MLP), Support Vector Machine (SVM) with different kernels, Random Forest (RF), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). The hyperparameters of each algorithm are optimized using grid search to ensure the best classification results.</p>
<p>The results obtained on the Unified Database for the Arctic and Subarctic Hydrography (UDASH) dataset are promising, especially for the MLP algorithm, which achieved an accuracy of 86.64% in detecting good samples and 88.84% in detecting bad samples, although room for improvement remains. This system has the potential to be used as a semi-automatic quality control system.</p>
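The two ideas the abstract relies on, a grid search over hyperparameters and accuracy reported separately for good and bad samples, can be illustrated with a minimal standard-library sketch. Everything in it is an assumption made for illustration and not the authors' implementation or data: the toy 1-D k-nearest-neighbours classifier, the synthetic temperature-gradient values, the labels (1 = bad, 0 = good), the function names, and the choice of the mean per-class accuracy as the selection score. A single hold-out split stands in for cross-validation to keep the example short.

```python
from itertools import product


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified samples within each true class."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {label: hits[label] / totals[label] for label in totals}


def knn_fit_predict(X_train, y_train, X_val, k):
    """Toy 1-D k-nearest-neighbours classifier with majority vote."""
    preds = []
    for x in X_val:
        nearest = sorted(zip(X_train, y_train),
                         key=lambda pair: abs(pair[0] - x))[:k]
        votes = sum(label for _, label in nearest)  # neighbours labelled "bad"
        preds.append(1 if 2 * votes > k else 0)
    return preds


def grid_search(fit_predict, grid, X_train, y_train, X_val, y_val):
    """Try every hyperparameter combination; keep the first best-scoring one."""
    best_params, best_score = None, -1.0
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        y_pred = fit_predict(X_train, y_train, X_val, **params)
        acc = per_class_accuracy(y_val, y_pred)
        score = sum(acc.values()) / len(acc)  # mean of per-class accuracies
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic absolute temperature gradients (hypothetical values and units).
# The training sample 0.075 carries a deliberately noisy "bad" label.
X_train = [0.01, 0.02, 0.03, 0.05, 0.06, 0.075, 0.90, 1.00, 1.10, 1.30]
y_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_val = [0.02, 0.07, 0.95, 1.20]
y_val = [0, 0, 1, 1]

best_params, best_score = grid_search(
    knn_fit_predict, {"k": [1, 3, 5]}, X_train, y_train, X_val, y_val
)
print(best_params, best_score)
```

Reporting accuracy per class, rather than one overall figure, matters here because good and bad samples are rarely balanced within a profile, so a single number could hide poor detection of one class. In practice a library routine such as scikit-learn's `GridSearchCV` would typically replace the hand-written loop.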