REBALANCING DATA FOR CANCER-ASSOCIATED THROMBOSIS: COMPARISON OF DIFFERENT RESAMPLING APPROACH

FAIZA NAIMAT; KWOK-WEN NG; MATHUMALAR LOGANATHAN FAHRNI; NURUL HANIS AMIRUDDIN JAFRY; KHAIRIL ANUAR MD ISA; YUSNAINI MD YUSOFF

doi:10.22159/ajpcr.2026v19i2.57152

Authors

FAIZA NAIMAT Department of Pharmacy Practice and Clinical Pharmacy, Faculty of Pharmacy, Universiti Teknologi MARA, Selangor.
KWOK-WEN NG Department of Pharmaceutical Chemistry, Faculty of Pharmacy, Quest International University, Perak.
MATHUMALAR LOGANATHAN FAHRNI Department of Pharmacy Practice and Clinical Pharmacy, Faculty of Pharmacy, Universiti Teknologi MARA, Selangor.
NURUL HANIS AMIRUDDIN JAFRY Pusat Pengajian Citra Universiti, Universiti Kebangsaan Malaysia, Selangor.
KHAIRIL ANUAR MD ISA Department of Basic Sciences, Faculty of Health Science, Universiti Teknologi MARA, Selangor, Malaysia.
YUSNAINI MD YUSOFF Pusat Pengajian Citra Universiti, Universiti Kebangsaan Malaysia, Selangor.

DOI:

https://doi.org/10.22159/ajpcr.2026v19i2.57152

Keywords:

Cancer-associated thrombosis, Machine learning classification, Resampling techniques, Classifier-resampling interactions

Abstract

Objective: Cancer-associated thrombosis (CAT) presents a complex challenge in oncology, exacerbated by class imbalance in related datasets, which often leads to suboptimal outcomes in machine learning (ML) classification. Many ML algorithms were originally designed for balanced datasets. This study therefore evaluated how two classifiers, logistic regression (LR) and eXtreme Gradient Boosting (XGBoost), interact with data resampling techniques when predicting CAT from the imbalanced Malaysian data on CAT (MDCAT).

Methods: Random oversampling (ROS), random undersampling (RUS), and a combined oversampling and undersampling approach (BOTH) were applied to the MDCAT dataset. Classification tasks were performed using LR and XGBoost in R version 4.3.1. Classifier performance was assessed using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC, also written AUC) to evaluate the impact of the different resampling techniques.
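The three resampling strategies named in the Methods can be illustrated with a minimal sketch. This is not the authors' R code: it is a plain-Python illustration on toy data, the function names (`random_oversample`, `random_undersample`, `resample_both`) are hypothetical, and the half-and-half target used for BOTH is an assumption, since the abstract does not state the target class ratio.

```python
import random
from collections import Counter


def _split(rows):
    """Separate (features, label) rows into (minority, majority) lists."""
    counts = Counter(label for _, label in rows)
    minority_label = min(counts, key=counts.get)
    minority = [r for r in rows if r[1] == minority_label]
    majority = [r for r in rows if r[1] != minority_label]
    return minority, majority


def random_oversample(rows, rng):
    """ROS: duplicate minority rows (with replacement) until classes are equal."""
    minority, majority = _split(rows)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


def random_undersample(rows, rng):
    """RUS: keep a random majority subset the same size as the minority."""
    minority, majority = _split(rows)
    return rng.sample(majority, k=len(minority)) + minority


def resample_both(rows, rng):
    """BOTH: shrink the majority and grow the minority to half the original
    sample size each (assumed target; the abstract does not specify it)."""
    minority, majority = _split(rows)
    target = len(rows) // 2
    kept_majority = rng.sample(majority, k=min(target, len(majority)))
    extra = rng.choices(minority, k=max(0, target - len(minority)))
    return kept_majority + minority + extra


rng = random.Random(42)
# Toy data: 90 non-events (label 0) and 10 events (label 1); not the MDCAT data.
toy = [((i,), 0) for i in range(90)] + [((i,), 1) for i in range(90, 100)]

ros = random_oversample(toy, rng)
rus = random_undersample(toy, rng)
both = resample_both(toy, rng)
for name, data in [("ROS", ros), ("RUS", rus), ("BOTH", both)]:
    print(name, dict(Counter(label for _, label in data)))
```

Resampling of this kind is usually applied to the training split only, so that the test set keeps the original class ratio. Whether the study followed that convention is not stated in the abstract.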

Results: Applied to the imbalanced data, LR and XGBoost showed high specificity but low sensitivity on the testing samples. XGBoost performance declined substantially, with the AUC decreasing from 0.794 in training to 0.381 in testing. Metastasis, surgery, and Indian ethnicity were statistically significantly associated with CAT events across all resampling techniques. Among XGBoost models, oversampling (XO) gave excellent training performance (accuracy 0.99; AUC 0.98) but dropped markedly on the test set (accuracy 0.82; AUC 0.72). Among LR models, undersampling yielded the highest training accuracy (0.83), with an AUC of 0.82. Tuning amplified the differences between resampling strategies and highlighted clear classifier–resampling interactions. XGBoost benefited most from tuning, particularly when trained on mixed and oversampled datasets, while LR remained comparatively stable.
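The four metrics reported above can be computed from predicted labels and scores as in the following sketch. It is a plain-Python illustration on made-up scores, not the study's data or code; the helper names and the 0.5 decision threshold are assumptions. The toy example mirrors the pattern described for imbalanced data, where specificity stays high while sensitivity is low.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def auc(y_true, scores):
    """AUC as the probability that a random event outscores a random
    non-event (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 8 non-events and 2 events with made-up predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.15, 0.30, 0.05, 0.25, 0.40, 0.60, 0.35, 0.70]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # share of events correctly flagged
specificity = tn / (tn + fp)  # share of non-events correctly cleared
print(accuracy, sensitivity, specificity, auc(y_true, scores))
```

Comparing these values between the training and test sets gives the kind of performance gap the Results describe for the oversampled XGBoost model.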

Conclusion: This study demonstrated that the effectiveness of prediction models on the imbalanced MDCAT dataset is strongly influenced by the interaction between classifier characteristics and resampling strategies. The performance gains of a tuned XGBoost model with mixed resampling outweighed the benefits of LR’s simplicity and stability, making it our recommended approach given the primary importance of AUC.
