| Impact of leakage on data harmonization in machine learning pipelines in class imbalance across sites |
| Journal Article | DZNE-2026-00301 |
2026
Elsevier
Amsterdam
Please use a persistent id in citations: doi:10.1016/j.neucom.2026.133146
Abstract: Because data collection in biomedical domains is costly and complex, data from multiple sites are commonly combined to obtain the large datasets that machine learning requires. However, undesired site-specific variability presents challenges. Data harmonization aims to address this issue by removing site-specific variance while preserving biologically relevant information. We show that the improvements obtained with widely used ComBat-based harmonization are driven by data leakage, that is, the illicit use of target information, when class labels are imbalanced across sites, a common scenario in biomedical domains. We propose a novel approach, PrettYharmonize, which leverages subtle differences in data harmonized using different pretended target values. Using controlled benchmark datasets and real-world magnetic resonance imaging and clinical ICU data, we demonstrate that our leakage-free PrettYharmonize method achieves performance comparable to leakage-prone methods. As such, it is a viable method to integrate ComBat-based methods into machine learning applications.