TY - JOUR
T1 - Partial imputation of unseen records to improve classification using a hybrid multi-layered artificial immune system and genetic algorithm
AU - Duma, Mlungisi
AU - Marwala, Tshilidzi
AU - Twala, Bhekisipho
AU - Nelwamondo, Fulufhelo
PY - 2013
Y1 - 2013
N2 - Missing data in large insurance datasets affects the learning and classification accuracies in predictive modelling. Insurance datasets will continue to increase in size as more variables are added to aid in managing client risk and will therefore be even more vulnerable to missing data. This paper proposes a hybrid multi-layered artificial immune system and genetic algorithm for partial imputation of missing data in datasets with numerous variables. The multi-layered artificial immune system creates and stores antibodies that bind to and annihilate an antigen. The genetic algorithm optimises the learning process of a stimulated antibody. The evaluation of the imputation is performed using the RIPPER, k-nearest neighbour, naïve Bayes and logistic discriminant classifiers. The effect of the imputation on the classifiers is compared with that of the mean/mode and hot deck imputation methods. The results demonstrate that when missing data imputation is performed using the proposed hybrid method, the classification improves and the robustness to the amount of missing data is increased relative to the mean/mode method for data missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). The imputation performance is similar to or marginally better than that of the hot deck imputation.
AB - Missing data in large insurance datasets affects the learning and classification accuracies in predictive modelling. Insurance datasets will continue to increase in size as more variables are added to aid in managing client risk and will therefore be even more vulnerable to missing data. This paper proposes a hybrid multi-layered artificial immune system and genetic algorithm for partial imputation of missing data in datasets with numerous variables. The multi-layered artificial immune system creates and stores antibodies that bind to and annihilate an antigen. The genetic algorithm optimises the learning process of a stimulated antibody. The evaluation of the imputation is performed using the RIPPER, k-nearest neighbour, naïve Bayes and logistic discriminant classifiers. The effect of the imputation on the classifiers is compared with that of the mean/mode and hot deck imputation methods. The results demonstrate that when missing data imputation is performed using the proposed hybrid method, the classification improves and the robustness to the amount of missing data is increased relative to the mean/mode method for data missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). The imputation performance is similar to or marginally better than that of the hot deck imputation.
KW - Correlation-based feature extraction
KW - Genetic algorithms
KW - Missing data
KW - Multi-layered artificial immune system
UR - http://www.scopus.com/inward/record.url?scp=84885094995&partnerID=8YFLogxK
U2 - 10.1016/j.asoc.2013.08.005
DO - 10.1016/j.asoc.2013.08.005
M3 - Article
AN - SCOPUS:84885094995
SN - 1568-4946
VL - 13
SP - 4461
EP - 4480
JO - Applied Soft Computing Journal
JF - Applied Soft Computing Journal
IS - 12
ER -