Journal Article DZNE-2026-00301

Impact of leakage on data harmonization in machine learning pipelines in class imbalance across sites

2026
Elsevier Amsterdam

Neurocomputing 680, 133146 [10.1016/j.neucom.2026.133146]

Please use a persistent id in citations: doi:10.1016/j.neucom.2026.133146

Abstract: Due to the cost and complexity of data collection in biomedical domains, it is common practice to combine data from multiple sites to obtain the large datasets required for machine learning. However, undesired site-specific variability presents challenges. Data harmonization aims to address this issue by removing site-specific variance while preserving biologically relevant information. We show that the improvements attributed to widely used ComBat-based harmonization are driven by data leakage, namely the illicit use of target information, when class labels are imbalanced across sites, a common scenario in biomedical domains. We propose a novel approach, PrettYharmonize, which leverages subtle differences in data harmonized using different pretended target values. Using controlled benchmark datasets and real-world magnetic resonance imaging and clinical ICU data, we demonstrate that our leakage-free PrettYharmonize method achieves performance comparable to leakage-prone methods. As such, it is a viable method to integrate ComBat-based methods into machine learning applications.

Contributing Institute(s):
  1. Artificial Intelligence in Medicine (AG Reuter)
Research Program(s):
  1. 354 - Disease Prevention and Healthy Aging (POF4-354)

Appears in the scientific report 2026
Database coverage:
Medline ; Creative Commons Attribution CC BY 4.0 ; OpenAccess ; Clarivate Analytics Master Journal List ; Current Contents - Engineering, Computing and Technology ; Ebsco Academic Search ; Essential Science Indicators ; IF >= 5 ; JCR ; SCOPUS ; Science Citation Index Expanded ; Web of Science Core Collection

The record appears in these collections:
Document types > Articles > Journal Article
Institute Collections > BN DZNE > BN DZNE-AG Reuter
Full Text Collection
Public records
Publications Database

 Record created 2026-03-31, last modified 2026-04-17