TY - GEN
T1 - Evaluating the impact of missing data imputation
AU - Pantanowitz, Adam
AU - Marwala, Tshilidzi
PY - 2009
Y1 - 2009
N2 - This paper presents an impact assessment for the imputation of missing data. The assessment measures the impact of missing data imputation on the statistical nature of the data, on a classifier, and on a logistic regression system. The data set used is HIV seroprevalence data from an antenatal clinic survey performed in 2001. Imputation is performed using Random Forests, selected because they gave the best imputation performance of the six techniques compared. Test sets are constructed from the original data and from imputed data in which varying numbers of specifically selected missing variables have been imputed. Results indicate that, for this data set, the evaluated properties and tested paradigms are fairly insensitive to missing data imputation. The impact is small: for example, under the logistic regression analysis, HIV status probability predictions from the full set and from a set with two imputed variables show a linear correlation of 96%.
KW - Impact
KW - Imputation
KW - Missing data
KW - Random forest
KW - Sensitivity
UR - http://www.scopus.com/inward/record.url?scp=70350325172&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-03348-3_59
DO - 10.1007/978-3-642-03348-3_59
M3 - Conference contribution
AN - SCOPUS:70350325172
SN - 3642033474
SN - 9783642033476
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 577
EP - 586
BT - Advanced Data Mining and Applications - 5th International Conference, ADMA 2009, Proceedings
T2 - 5th International Conference on Advanced Data Mining and Applications, ADMA 2009
Y2 - 17 August 2009 through 19 August 2009
ER -