TY - GEN
T1 - Effect of data parameters and seeding on K-means and K-medoids
AU - Olukanmi, Peter
AU - Nelwamondo, Fulufhelo
AU - Marwala, Tshilidzi
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/8
Y1 - 2020/8
N2 - k-means and k-medoids are arguably the two most popular clustering methods. This paper reports an empirical study of the relative (de)merits of these two methods. We compare their performance under varying data conditions. We also assess the impact of replacing random selection of initial cluster centers with a systematic approach. The technique we use for initial centroid selection is the state-of-the-art D² weighting technique used in k-means++, a variant of k-means. We use an extensive set of well-known and recommended benchmark datasets. These datasets allow for the study of performance under varying data size; varying cluster count, imbalance and overlap; and varying dimensionality. We establish that although k-means is much less accurate than k-medoids (which is grossly inefficient and less scalable), the accuracy gap is reduced significantly by seeding k-means. On the other hand, seeding does not improve the accuracy of k-medoids but significantly worsens its efficiency (typically doubling its running time). Increased dimensionality tends to improve the accuracy of both algorithms, while increased overlap tends to worsen it.
AB - k-means and k-medoids are arguably the two most popular clustering methods. This paper reports an empirical study of the relative (de)merits of these two methods. We compare their performance under varying data conditions. We also assess the impact of replacing random selection of initial cluster centers with a systematic approach. The technique we use for initial centroid selection is the state-of-the-art D² weighting technique used in k-means++, a variant of k-means. We use an extensive set of well-known and recommended benchmark datasets. These datasets allow for the study of performance under varying data size; varying cluster count, imbalance and overlap; and varying dimensionality. We establish that although k-means is much less accurate than k-medoids (which is grossly inefficient and less scalable), the accuracy gap is reduced significantly by seeding k-means. On the other hand, seeding does not improve the accuracy of k-medoids but significantly worsens its efficiency (typically doubling its running time). Increased dimensionality tends to improve the accuracy of both algorithms, while increased overlap tends to worsen it.
KW - Comparative study
KW - Comparison
KW - K-medoids
KW - Performance evaluation
KW - k-means
UR - http://www.scopus.com/inward/record.url?scp=85092021864&partnerID=8YFLogxK
U2 - 10.1109/icABCD49160.2020.9183892
DO - 10.1109/icABCD49160.2020.9183892
M3 - Conference contribution
AN - SCOPUS:85092021864
T3 - 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, icABCD 2020 - Proceedings
BT - 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, icABCD 2020 - Proceedings
A2 - Pudaruth, Sameerchand
A2 - Singh, Upasana
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, icABCD 2020
Y2 - 6 August 2020 through 7 August 2020
ER -