TY - GEN
T1 - Differentially expressed gene identification based on separability index
AU - Perez, Meir
AU - Featherston, Jonathan
AU - Rubin, David M.
AU - Marwala, Tshilidzi
AU - Scott, Lesley E.
AU - Stevens, Wendy
PY - 2009
Y1 - 2009
N2 - The identification of differentially expressed genes is central to microarray data analysis. Presented in this paper is an approach to differentially expressed gene identification based on a Separability Index (SI). Features are selected by identifying the optimal number of top ranking genes which result in maximum class separability. The approach was implemented on a training dataset comprising 400 samples from three types of cancers: colon, breast and lung cancer. The top 4222 genes resulted in a maximum separability of 91%. These genes were then used to classify a testing dataset comprising 250 samples, using a K-nearest neighbour (K-NN) classifier, achieving an accuracy of 92%. This outperformed a K-NN classifier trained on features selected based on p < 1.8311 × 10^-7 (Bonferroni corrected p-value cut-off criterion of p < 0.01), which achieved an accuracy of 89.6%. The performance is attributed to the non-arbitrary nature of the maximum SI selection criterion, which is an inherent property of the data, as opposed to the arbitrary assignment of a p-value cut-off. Hierarchical clustering was used to identify clusters of genes, amongst the 4222 genes, with similar expression patterns for each of the three cancers. These clusters were then examined for functional enrichment and significant biological pathways, which were identified for all three cancer types.
AB - The identification of differentially expressed genes is central to microarray data analysis. Presented in this paper is an approach to differentially expressed gene identification based on a Separability Index (SI). Features are selected by identifying the optimal number of top ranking genes which result in maximum class separability. The approach was implemented on a training dataset comprising 400 samples from three types of cancers: colon, breast and lung cancer. The top 4222 genes resulted in a maximum separability of 91%. These genes were then used to classify a testing dataset comprising 250 samples, using a K-nearest neighbour (K-NN) classifier, achieving an accuracy of 92%. This outperformed a K-NN classifier trained on features selected based on p < 1.8311 × 10^-7 (Bonferroni corrected p-value cut-off criterion of p < 0.01), which achieved an accuracy of 89.6%. The performance is attributed to the non-arbitrary nature of the maximum SI selection criterion, which is an inherent property of the data, as opposed to the arbitrary assignment of a p-value cut-off. Hierarchical clustering was used to identify clusters of genes, amongst the 4222 genes, with similar expression patterns for each of the three cancers. These clusters were then examined for functional enrichment and significant biological pathways, which were identified for all three cancer types.
KW - Analysis of variance
KW - Differential expression
KW - K-nearest neighbour
KW - Microarray
KW - Separability index
UR - http://www.scopus.com/inward/record.url?scp=77950842162&partnerID=8YFLogxK
U2 - 10.1109/ICMLA.2009.73
DO - 10.1109/ICMLA.2009.73
M3 - Conference contribution
AN - SCOPUS:77950842162
SN - 9780769539263
T3 - 8th International Conference on Machine Learning and Applications, ICMLA 2009
SP - 429
EP - 434
BT - 8th International Conference on Machine Learning and Applications, ICMLA 2009
T2 - 8th International Conference on Machine Learning and Applications, ICMLA 2009
Y2 - 13 December 2009 through 15 December 2009
ER -