Journal Article DZNE-2024-00612

Development and validation of a reliable DNA copy-number-based machine learning algorithm (CopyClust) for breast cancer integrative cluster classification.


2024
Macmillan Publishers Limited, part of Springer Nature [London]

Scientific Reports 14(1), 11861 (2024) [10.1038/s41598-024-62724-6]


Please use a persistent id in citations: doi:10.1038/s41598-024-62724-6

Abstract: The Integrative Cluster subtypes (IntClusts) provide a framework for the classification of breast cancer tumors into 10 distinct groups based on copy number and gene expression, each with unique biological drivers of disease and clinical prognoses. Gene expression data are often lacking, so accurate classification of samples into IntClusts with copy number data alone is essential. Current classification methods achieve low accuracy when gene expression data are absent, warranting the development of new approaches to IntClust classification. Copy number data from 1980 breast cancer samples from METABRIC were used to train multiclass XGBoost machine learning algorithms (CopyClust). A piecewise constant fit was applied to the average copy number profile of each IntClust, and unique breakpoints across the 10 profiles were identified and converted into ~500 genomic regions used as features for CopyClust. Two modeling approaches were used: a 10-class model, in which the final IntClust label is predicted by a single multiclass model, and a 6-class model with binary reclassification, in which four pairs of IntClusts were combined for initial multiclass classification. Performance was validated on the TCGA dataset, with copy number data generated from both SNP array and WES platforms. CopyClust achieved 81% and 79% overall accuracy with the TCGA SNP and WES datasets, respectively, an improvement of nine percentage points or more in overall IntClust subtype classification accuracy. CopyClust achieves a significant improvement over current methods in classification accuracy of IntClust subtypes for samples without available gene expression data and is an easily implementable algorithm for IntClust classification of breast cancer samples with copy number data.
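
The feature-construction step described in the abstract (pooling the unique breakpoints of the piecewise constant profiles and turning them into genomic regions) can be illustrated with a minimal sketch. This is not the CopyClust implementation: the function names, the `(start, end, value)` segment representation, the single-chromosome coordinates, and the length-weighted averaging of a sample's copy number within each region are all assumptions made for illustration.

```python
# Illustrative sketch only; all names and data layouts here are hypothetical.

def breakpoints_of(profile):
    """Positions where a piecewise constant profile changes value.

    `profile` is a list of (start, end, value) segments sorted by start.
    """
    return {seg[0] for prev, seg in zip(profile, profile[1:]) if seg[2] != prev[2]}


def regions_from_profiles(profiles, chrom_start, chrom_end):
    """Pool the unique breakpoints of all profiles into contiguous regions."""
    cuts = set()
    for profile in profiles:
        cuts |= breakpoints_of(profile)
    inner = sorted(c for c in cuts if chrom_start < c < chrom_end)
    edges = [chrom_start] + inner + [chrom_end]
    return list(zip(edges, edges[1:]))


def region_features(sample_segments, regions):
    """Length-weighted mean copy number of one sample in each region.

    Assumption: stretches of a region not covered by any sample segment
    contribute a value of zero.
    """
    features = []
    for r_start, r_end in regions:
        total = 0.0
        for s_start, s_end, value in sample_segments:
            overlap = min(r_end, s_end) - max(r_start, s_start)
            if overlap > 0:
                total += overlap * value
        features.append(total / (r_end - r_start))
    return features


if __name__ == "__main__":
    # Two toy average profiles standing in for the 10 IntClust profiles.
    profiles = [
        [(0, 40, 0.0), (40, 100, 1.0)],
        [(0, 40, 0.0), (40, 70, 0.5), (70, 100, 0.0)],
    ]
    regions = regions_from_profiles(profiles, 0, 100)
    sample = [(0, 50, 2.0), (50, 100, 0.0)]
    print(regions)
    print(region_features(sample, regions))
```

In this sketch, each sample ends up with one numeric value per region. Applied genome-wide to the 10 average profiles, the same idea would give a fixed-length feature vector of the kind the abstract describes as input to the multiclass classifier.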

Keyword(s): Humans (MeSH) ; Breast Neoplasms: genetics (MeSH) ; Breast Neoplasms: classification (MeSH) ; Machine Learning (MeSH) ; Female (MeSH) ; DNA Copy Number Variations: genetics (MeSH) ; Algorithms (MeSH) ; Cluster Analysis (MeSH) ; Gene Expression Profiling: methods (MeSH)

Contributing Institute(s):
  1. Statistics and Machine Learning (AG Mukherjee)
Research Program(s):
  1. 354 - Disease Prevention and Healthy Aging (POF4-354)

Appears in the scientific report 2024
Database coverage:
Medline ; Creative Commons Attribution CC BY (No Version) ; DOAJ ; OpenAccess ; Article Processing Charges ; BIOSIS Previews ; Biological Abstracts ; Clarivate Analytics Master Journal List ; Current Contents - Physical, Chemical and Earth Sciences ; DOAJ Seal ; Ebsco Academic Search ; Essential Science Indicators ; Fees ; IF < 5 ; JCR ; PubMed Central ; SCOPUS ; Science Citation Index Expanded ; Web of Science Core Collection ; Zoological Record

The record appears in these collections:
Document types > Articles > Journal Article
Institute Collections > BN DZNE > BN DZNE-AG Mukherjee
Full Text Collection
Public records
Publications Database

 Record created 2024-05-27, last modified 2024-08-09

