Abstract
Missing values are common in real-world datasets, particularly in healthcare data. They pose a challenge for machine learning, as most models perform poorly on incomplete data. This study evaluates seven imputation techniques on three healthcare datasets: Mean Imputation, Median Imputation, Last Observation Carried Forward (LOCF), K-Nearest Neighbor (KNN) Imputation, Interpolation, MissForest, and Multiple Imputation by Chained Equations (MICE). Missing values were introduced at rates of 10%, 15%, 20%, and 25%, and each technique was used to fill in the gaps. The methods were compared using root mean squared error (RMSE) and mean absolute error (MAE). The results indicate that MissForest performed best, followed by MICE. We also examined whether feature selection should be performed before or after imputation, using recall, precision, F1-score, and accuracy as evaluation metrics. The results suggest that performing imputation before feature selection gives better results on these metrics. Because research on the order of imputation and feature selection is limited and researchers continue to debate it, we hope these findings will encourage data scientists and researchers to prioritize imputation before feature selection when working with datasets containing missing values.
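The evaluation protocol described in the abstract (hide known values at a fixed rate, impute them, then score the imputed values against the originals with RMSE and MAE) can be sketched as follows. This is a minimal illustration and not the study's code. It assumes values are removed completely at random and that errors are computed only on the hidden entries, neither of which the abstract states. It covers only the three simplest techniques (mean, median, LOCF), and all function names and the synthetic series are hypothetical.

```python
import math
import random
import statistics


def mask_values(column, rate, seed=0):
    """Hide a fraction `rate` of entries completely at random (an assumption)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(column))
    hidden = set(rng.sample(range(len(column)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(column)]
    return masked, sorted(hidden)


def impute_constant(masked, fill):
    """Replace every missing entry with the same value."""
    return [fill if v is None else v for v in masked]


def impute_mean(masked):
    observed = [v for v in masked if v is not None]
    return impute_constant(masked, statistics.fmean(observed))


def impute_median(masked):
    observed = [v for v in masked if v is not None]
    return impute_constant(masked, statistics.median(observed))


def impute_locf(masked):
    """Carry the last observed value forward.

    Leading gaps have nothing to carry, so they fall back to the first
    observed value (a choice made for this sketch only).
    """
    last = next(v for v in masked if v is not None)
    filled = []
    for v in masked:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def rmse(truth, filled, hidden):
    """Root mean squared error over the hidden positions only."""
    return math.sqrt(sum((truth[i] - filled[i]) ** 2 for i in hidden) / len(hidden))


def mae(truth, filled, hidden):
    """Mean absolute error over the hidden positions only."""
    return sum(abs(truth[i] - filled[i]) for i in hidden) / len(hidden)


def evaluate(column, rates=(0.10, 0.15, 0.20, 0.25), seed=0):
    """Return {(rate, method): (rmse, mae)} for one fully observed column."""
    methods = {"mean": impute_mean, "median": impute_median, "locf": impute_locf}
    results = {}
    for rate in rates:
        masked, hidden = mask_values(column, rate, seed)
        for name, method in methods.items():
            filled = method(masked)
            results[(rate, name)] = (
                rmse(column, filled, hidden),
                mae(column, filled, hidden),
            )
    return results


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic trending series for illustration; not data from the study.
    synthetic = [120 + 0.5 * t + rng.gauss(0, 3) for t in range(200)]
    for (rate, name), (r, m) in evaluate(synthetic).items():
        print(f"{rate:.0%} {name:<6} RMSE={r:.2f} MAE={m:.2f}")
```

Scoring only the hidden positions keeps the observed entries, which every method reproduces exactly, from diluting the error. The model-based techniques named in the abstract (KNN, MissForest, MICE) would slot into the `methods` dictionary in the same way, but they need a multi-column dataset and are left out of this single-column sketch.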
| Original language | English |
|---|---|
| Pages (from-to) | 6357-6373 |
| Number of pages | 17 |
| Journal | International Journal of Data Science and Analytics |
| Volume | 20 |
| Issue number | 7 |
| Publication status | Published - Nov 2025 |
Keywords
- Healthcare datasets
- Imputation techniques
- Machine learning
- MissForest
- Missing data imputation
ASJC Scopus subject areas
- Information Systems
- Modeling and Simulation
- Computer Science Applications
- Computational Theory and Mathematics
- Applied Mathematics
Fingerprint
Dive into the research topics of 'A comparative study of imputation techniques for missing values in healthcare diagnostic datasets'. Together they form a unique fingerprint.