Missing data imputation through the use of the random forest algorithm

Adam Pantanowitz, Tshilidzi Marwala

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

56 Citations (Scopus)

Abstract

This paper presents a comparison of different paradigms used for missing data imputation. The data set used is HIV seroprevalence data from an antenatal clinic study survey performed in 2001. Data imputation is performed through five methods: Random Forests; auto-associative neural networks with genetic algorithms; auto-associative neuro-fuzzy configurations; and two random forest and neural network based hybrids. Results indicate that Random Forests are superior in imputing missing data for the given data set in terms of accuracy and in terms of computation time, with accuracy increases of up to 32% on average for certain variables when compared with auto-associative networks. While the concept of hybrid systems has promise, the presented systems appear to be hindered by their auto-associative neural network components.

Original language: English
Title of host publication: Advances in Computational Intelligence
Publisher: Springer Verlag
Pages: 53-62
Number of pages: 10
ISBN (Print): 9783642031557
Publication status: Published - 2009
Event: 2nd International Workshop on Advanced Computational Intelligence, IWACI 2009 - Mexico City, Mexico
Duration: 22 Jun 2009 to 23 Jun 2009

Publication series

Name: Advances in Intelligent and Soft Computing
Volume: 61 AISC
ISSN (Print): 1867-5662

Conference

Conference: 2nd International Workshop on Advanced Computational Intelligence, IWACI 2009
Country/Territory: Mexico
City: Mexico City
Period: 22/06/09 to 23/06/09

Keywords

  • Auto-associative
  • Imputation
  • Missing data
  • Neural network
  • Random forest

ASJC Scopus subject areas

  • General Computer Science
