Effect of data parameters and seeding on K-means and K-medoids

Peter Olukanmi, Fulufhelo Nelwamondo, Tshilidzi Marwala

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

k-means and k-medoids are arguably the two most popular clustering methods. This paper reports an empirical study of the relative (de)merits of these two methods. We compare their performance under varying data conditions. We also assess the impact of replacing random selection of initial cluster centers with a systematic approach. The technique employed in our study for initial centroid selection is the state-of-the-art D² weighting technique used in k-means++, a variant of k-means. We use an extensive set of well-known and recommended benchmark datasets. These datasets allow for the study of performance under varying data size; varying cluster count, imbalance and overlap; and varying dimensionality. We establish that although k-means is much less accurate than k-medoids (which is grossly inefficient and less scalable), seeding k-means reduces the accuracy gap significantly. On the other hand, seeding does not improve the accuracy of k-medoids but significantly worsens its efficiency (typically doubling its running time). Increased dimensionality tends to improve the accuracy of both algorithms, while increased overlap tends to worsen it.
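
The D² weighting idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names (squared_distance, d2_seeding), the tuple-based point representation, and the fallback for coincident points are assumptions made for illustration. The first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_seeding(points, k, rng=None):
    """Pick k initial centers from `points` using D^2 weighting (k-means++ style).

    Hypothetical helper for illustration; `points` is a list of numeric tuples.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First center: chosen uniformly at random.
    first = rng.randrange(len(points))
    chosen = [first]

    # d2[i] holds the squared distance from point i to its nearest chosen center.
    d2 = [squared_distance(p, points[first]) for p in points]

    while len(chosen) < k:
        if sum(d2) == 0:
            # Assumed fallback: every point coincides with a chosen center,
            # so pick uniformly among the indices not yet used.
            remaining = [i for i in range(len(points)) if i not in chosen]
            nxt = rng.choice(remaining)
        else:
            # Sample an index with probability proportional to d2[i].
            nxt = rng.choices(range(len(points)), weights=d2, k=1)[0]
        chosen.append(nxt)
        # Update nearest-center distances with the newly added center.
        d2 = [min(old, squared_distance(p, points[nxt]))
              for old, p in zip(d2, points)]

    return [points[i] for i in chosen]
```

Because every selected center is itself a data point, the same routine could in principle supply initial medoids for k-medoids as well as initial centroids for k-means; whether the study seeded both algorithms in exactly this way is not stated in the abstract.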

Original language: English
Title of host publication: 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, icABCD 2020 - Proceedings
Editors: Sameerchand Pudaruth, Upasana Singh
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728167701
Publication status: Published - Aug 2020
Event: 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, icABCD 2020 - Durban, KwaZulu Natal, South Africa
Duration: 6 Aug 2020 to 7 Aug 2020

Publication series

Name: 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, icABCD 2020 - Proceedings

Conference

Conference: 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems, icABCD 2020
Country/Territory: South Africa
City: Durban, KwaZulu Natal
Period: 6/08/20 to 7/08/20

Keywords

  • Comparative study
  • Comparison
  • K-medoids
  • Performance evaluation
  • k-means

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Hardware and Architecture
  • Information Systems and Management
