Learning the k in k-means via the camp-meidell inequality

Peter O. Olukanmi, Fulufhelo Nelwamondo, Tshilidzi Marwala

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

2 Citations (Scopus)

Abstract

Determining k, the number clusters, is a key challenge in k-means clustering. We propose an approach, named Automatic k-means (k-means-auto), which judges a k value to be right, if at least 88.89% of D is less than twice its standard deviation, where D is a vector of length N (number of points) comprising the distance of each point to its cluster centroid. The bound is derived from Camp-Meidell's inequality for unimodal and symmetrical distributions. First in this paper, we show that k-means' equal-variance Gaussian cluster assumption induces Gaussian fit for D. Thus, if D is Gaussian, then all clusters are Gaussian, and the underlying k value is appropriate. We chose to test D for unimodality and symmetry instead of Gaussian fit, as a means of relaxation, since clusters in real data are hardly perfectly Gaussian. The proposed approach is fast since D does not have to be computed afresh: it is already available as a by-product of the clustering process. On 10 well-known datasets, k-means-auto finds the true k in 6 cases and comes very close (± 1 cluster) in 3 other cases. Compared with two well-known methods, gap statistics and silhouette, it outperforms the former and competes closely (statistically insignificant difference at 95% confidence) with the latter. Over all the cases, the running time range for k-means-auto, gap and silhouette are 0.02s-0.44s (factor of 22), 0.31s-13.63s (factor of 108) and 0.02s-70.78s (factor of 3539) respectively. Thus, k-means-auto surpasses the other two methods in efficiency.

Original languageEnglish
Title of host publication2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages211-216
Number of pages6
ISBN (Electronic)9781728145778
DOIs
Publication statusPublished - Nov 2019
Event6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019 - Johannesburg, South Africa
Duration: 19 Nov 201920 Nov 2019

Publication series

Name2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019

Conference

Conference6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
Country/TerritorySouth Africa
CityJohannesburg
Period19/11/1920/11/19

Keywords

  • Clustering
  • K-means
  • Learning k
  • Number of clusters

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Computational Mathematics
  • Modeling and Simulation

Fingerprint

Dive into the research topics of 'Learning the k in k-means via the camp-meidell inequality'. Together they form a unique fingerprint.

Cite this