TY - JOUR
T1 - Performance analysis of cost-sensitive learning methods with application to imbalanced medical data
AU - Mienye, Ibomoiye Domor
AU - Sun, Yanxia
N1 - Publisher Copyright:
© 2021 The Authors
PY - 2021/1
Y1 - 2021/1
N2 - Many real-world machine learning applications require building models using highly imbalanced datasets. Usually, in medical datasets, the healthy patients or samples are dominant, making them the majority class, while the sick patients are few, making them the minority class. Researchers have proposed numerous machine learning methods to predict medical diagnosis. Still, the class imbalance problem makes it difficult for classifiers to adequately learn and distinguish between the minority and majority classes. Cost-sensitive learning and resampling techniques are used to deal with the class imbalance problem. This research focuses on developing robust cost-sensitive classifiers by modifying the objective functions of some well-known algorithms, such as logistic regression, decision tree, extreme gradient boosting, and random forest, which are then used to efficiently predict medical diagnosis. Meanwhile, as opposed to resampling techniques, our approach does not alter the original data distribution. Firstly, we implement the standard versions of these algorithms to provide a baseline for performance comparison. Secondly, we develop their corresponding cost-sensitive algorithms. For the proposed approaches, it is not necessary to change the distribution of the original data as the modified algorithms consider the imbalanced class distribution during training, thereby resulting in more reliable performance than when the data is resampled. Four popular medical datasets, including the Pima Indians Diabetes, Haberman Breast Cancer, Cervical Cancer Risk Factors, and Chronic Kidney Disease datasets, are used in the experiments to validate the performance of the proposed approach. The experimental results show that the cost-sensitive methods yield superior performance compared to the standard algorithms.
AB - Many real-world machine learning applications require building models using highly imbalanced datasets. Usually, in medical datasets, the healthy patients or samples are dominant, making them the majority class, while the sick patients are few, making them the minority class. Researchers have proposed numerous machine learning methods to predict medical diagnosis. Still, the class imbalance problem makes it difficult for classifiers to adequately learn and distinguish between the minority and majority classes. Cost-sensitive learning and resampling techniques are used to deal with the class imbalance problem. This research focuses on developing robust cost-sensitive classifiers by modifying the objective functions of some well-known algorithms, such as logistic regression, decision tree, extreme gradient boosting, and random forest, which are then used to efficiently predict medical diagnosis. Meanwhile, as opposed to resampling techniques, our approach does not alter the original data distribution. Firstly, we implement the standard versions of these algorithms to provide a baseline for performance comparison. Secondly, we develop their corresponding cost-sensitive algorithms. For the proposed approaches, it is not necessary to change the distribution of the original data as the modified algorithms consider the imbalanced class distribution during training, thereby resulting in more reliable performance than when the data is resampled. Four popular medical datasets, including the Pima Indians Diabetes, Haberman Breast Cancer, Cervical Cancer Risk Factors, and Chronic Kidney Disease datasets, are used in the experiments to validate the performance of the proposed approach. The experimental results show that the cost-sensitive methods yield superior performance compared to the standard algorithms.
KW - Cost-sensitive learning
KW - Imbalanced classification
KW - Machine learning
KW - Medical diagnosis
UR - http://www.scopus.com/inward/record.url?scp=85111923392&partnerID=8YFLogxK
U2 - 10.1016/j.imu.2021.100690
DO - 10.1016/j.imu.2021.100690
M3 - Article
AN - SCOPUS:85111923392
SN - 2352-9148
VL - 25
JO - Informatics in Medicine Unlocked
JF - Informatics in Medicine Unlocked
M1 - 100690
ER -