REBALANCING DATA FOR CANCER-ASSOCIATED THROMBOSIS: COMPARISON OF DIFFERENT RESAMPLING APPROACH
DOI:
https://doi.org/10.22159/ajpcr.2026v19i2.57152
Keywords:
Cancer-associated thrombosis, Machine learning classification, Resampling techniques, Classifier-resampling interactions
Abstract
Objective: Cancer-associated thrombosis (CAT) presents a complex challenge in oncology, and class imbalance in CAT datasets often leads to suboptimal machine learning (ML) classification. Because many ML algorithms were originally designed for balanced datasets, this study evaluated how logistic regression (LR) and eXtreme Gradient Boosting (XGBoost) interact with data resampling techniques to improve prediction on the imbalanced Malaysian data on CAT (MDCAT).
Methods: Random oversampling (ROS), random undersampling (RUS), and a combined oversampling and undersampling approach (BOTH) were applied to the MDCAT dataset. Classification tasks were performed using LR and XGBoost in R version 4.3.1. Classifier performance was assessed using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC, hereafter AUC) to evaluate the impact of the different resampling techniques.
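The three resampling strategies named above can be illustrated with a minimal sketch. The study itself was run in R; the Python code below is not the authors' implementation, and the function names, the toy data, and the rule used for BOTH (keep the total sample size and draw each class toward an equal share) are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def split_by_class(data):
    """Group (features, label) pairs by label."""
    groups = defaultdict(list)
    for features, label in data:
        groups[label].append((features, label))
    return groups


def random_oversample(data, seed=0):
    """ROS: duplicate minority rows (with replacement) up to the majority size."""
    rng = random.Random(seed)
    groups = split_by_class(data)
    target = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)
        out.extend(rng.choices(g, k=target - len(g)))
    rng.shuffle(out)
    return out


def random_undersample(data, seed=0):
    """RUS: keep a random subset of each class, sized to the minority class."""
    rng = random.Random(seed)
    groups = split_by_class(data)
    target = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(rng.sample(g, target))
    rng.shuffle(out)
    return out


def resample_both(data, seed=0):
    """BOTH (assumed rule): keep the total size, equalize the class shares."""
    rng = random.Random(seed)
    groups = split_by_class(data)
    target = len(data) // len(groups)
    out = []
    for g in groups.values():
        if len(g) >= target:
            out.extend(rng.sample(g, target))  # undersample the larger class
        else:
            out.extend(g)
            out.extend(rng.choices(g, k=target - len(g)))  # oversample the smaller
    rng.shuffle(out)
    return out


# Hypothetical toy data: 90 non-events (0) and 10 events (1).
toy = [((i,), 0) for i in range(90)] + [((i,), 1) for i in range(10)]

ros_counts = Counter(label for _, label in random_oversample(toy))
rus_counts = Counter(label for _, label in random_undersample(toy))
both_counts = Counter(label for _, label in resample_both(toy))
print(ros_counts, rus_counts, both_counts)
```

In a sketch like this, resampling would normally be applied to the training split only, so that the test set keeps the original class distribution.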
Results: Applying LR and XGBoost to the imbalanced data revealed high specificity but low sensitivity in testing samples. A substantial decline in XGBoost performance was observed, with the AUC decreasing from 0.794 in training to 0.381 in testing. Metastasis, surgery, and Indian ethnicity were significantly associated with CAT events across all resampling techniques. Among XGBoost models, oversampling (XO) exhibited excellent training performance (accuracy 0.99; AUC 0.98) but showed a large performance drop on the test set (accuracy 0.82; AUC 0.72). Among LR models, undersampling yielded the highest training accuracy (0.83), with an AUC of 0.82. Tuning amplified the differences between resampling strategies and highlighted clear classifier–resampling interactions. XGBoost benefited most from tuning, particularly when trained on mixed and oversampled datasets, while LR remained comparatively stable.
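As a rough illustration of how accuracy, sensitivity, specificity, and AUC relate on imbalanced data, the sketch below computes them from scratch in Python. The labels and scores are made-up toy values, not MDCAT data, and the helper names and the 0.5 decision threshold are assumptions. The AUC is computed with the pairwise (Mann-Whitney) formulation: the fraction of event/non-event pairs in which the event receives the higher score, with ties counted as one half.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a fixed decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auroc(y_true, scores):
    """Pairwise AUC: P(score of an event > score of a non-event), ties = 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 2 events among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.6, 0.3, 0.1, 0.2, 0.2, 0.4, 0.1, 0.3, 0.2, 0.1]

metrics = classification_metrics(y_true, scores)
auc = auroc(y_true, scores)
print(metrics, auc)
```

In this toy case, accuracy is 0.9 and specificity is 1.0 while sensitivity is only 0.5, which shows why accuracy alone can look reassuring when the minority class is poorly detected.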
Conclusion: This study demonstrated that the effectiveness of prediction models on the imbalanced MDCAT dataset is strongly influenced by the interaction between classifier characteristics and resampling strategies. Given the primary importance of AUC, the performance gains of a tuned XGBoost model with mixed resampling outweighed the benefits of LR’s simplicity and stability, making the tuned XGBoost model our recommended approach.
Copyright (c) 2026 Faiza Naimat, Mathumalar Loganathan Fahrni, Nurul Hanis Amiruddin Jafry, Khairil Anuar Md Isa, Yusnaini Md. Yusoff, Kwok Wen Ng

This work is licensed under a Creative Commons Attribution 4.0 International License.
The publication is licensed under CC BY and is open access. Copyright remains with the authors, who retain publishing rights without restrictions.