IEDEL algorithm reduces data quality issue labeling workload by up to 95%

 

Submitter:

Li, Mia — Cooperative Institute for Severe and High-Impact Weather Research and Operations (CIWRO)

Area of research:

Surface Properties

Journal Reference:

Li L, KE Kehoe, J Hu, RA Peppler, AJ Sockol, and CA Godine. 2024. "Iterative Error-Driven Ensemble Labeling (IEDEL) Algorithm for Enhanced Data Quality Control for the Atmospheric Radiation Measurement (ARM) Program User Facility." JGR Machine Learning and Computation, e2024JH000192, 10.1029/2024JH000192.

Science

The IEDEL algorithm with unanimous voting can efficiently and effectively identify sporadic data-quality issues scattered throughout large-scale data sets. The algorithm iteratively reduces labeling noise (errors) by applying transfer learning with an ensemble of non-overfitting models, cutting the data review workload by up to 95%.
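The idea of iterative, ensemble-driven label cleaning can be illustrated with a toy sketch. This is not the published IEDEL implementation: the nearest-centroid models, the rule that a label goes to review only when every ensemble member contradicts it, the integer toy data, and all function names (fit_centroids, unanimous_suspects, iterative_relabel, review) are assumptions made for illustration. The sketch also omits transfer learning entirely.

```python
from statistics import mean

def fit_centroids(xs, ys):
    """Low-capacity stand-in model: one centroid per class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def unanimous_suspects(xs, labels, n_models=3):
    """Indices whose current label is contradicted by every ensemble member."""
    models = []
    for m in range(n_models):
        # Each member trains on a different subset (drops every n-th sample).
        keep = [i for i in range(len(xs)) if i % n_models != m]
        models.append(fit_centroids([xs[i] for i in keep],
                                    [labels[i] for i in keep]))
    return [i for i, x in enumerate(xs)
            if all(predict(model, x) != labels[i] for model in models)]

def iterative_relabel(xs, labels, review, max_rounds=10):
    """Repeat: flag unanimous suspects, send only those to review, retrain."""
    labels = list(labels)
    reviewed = set()
    for _ in range(max_rounds):
        suspects = [i for i in unanimous_suspects(xs, labels)
                    if i not in reviewed]
        if not suspects:
            break
        for i in suspects:
            labels[i] = review(i)  # hypothetical stand-in for a human reviewer
            reviewed.add(i)
    return labels, reviewed

# Toy data (assumed): values 0..19, a "DQ issue" (label 1) when x >= 10.
xs = list(range(20))
truth = [1 if x >= 10 else 0 for x in xs]
noisy = list(truth)
noisy[2], noisy[17] = 1, 0  # inject two labeling errors

cleaned, reviewed = iterative_relabel(xs, noisy, review=lambda i: truth[i])
print(f"reviewed {len(reviewed)} of {len(xs)} samples")
```

In this toy case only the samples that every model flags are sent to the reviewer, which is the mechanism by which a unanimous-voting scheme could shrink the review workload. The actual savings depend on the data and models and are not reproduced by this sketch.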

Impact

The IEDEL algorithm with unanimous voting enables the ARM Data Quality Office (DQO) to rapidly label diverse data-quality (DQ) issues spanning 30 years of ARM data. These labels support the creation of robust supervised machine learning models that identify DQ issues promptly, so that intervention can begin before further data are contaminated. With the curated DQ issue labels generated by IEDEL, the DQO can expedite transfer learning for DQ issue detection across various atmospheric measurements, different instruments, and all ARM sites.

Summary

By using AI algorithms for daily data-quality (DQ) issue examination, the ARM Data Quality Office (DQO) can produce DQ reports more quickly, allowing our instrument mentors to promptly address identified issues and prevent further data contamination. The primary challenge in constructing AI models for DQ issue detection is the lack of precise labels. Initially, the DQO experimented with several unsupervised learning algorithms but found them better suited to detecting outliers than to detecting the broad range of DQ issues, which often manifest as specific time series data patterns. To address this, the DQO developed the IEDEL algorithm with unanimous voting, which enables efficient labeling of target data patterns in over 30 years of ARM data and the creation of robust supervised learning models that detect similar patterns in new data. In summary, the IEDEL algorithm offers a promising approach that not only improves DQ issue detection but can also benefit the wider scientific community by supplying accurate labels for various data patterns.