Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities

Peter Olukanmi, Fulufhelo Nelwamondo, Tshilidzi Marwala, Bhekisipho Twala

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

We address two key challenges of k-means clustering. In the first part of the paper, we show that when a dataset is partitioned with an appropriate number of clusters (k), not more than 1/9 of D will exceed twice its standard deviation (2 s.d.) and not more than 4/9 of D will exceed one standard deviation (1 s.d., σ), where D is the vector of distances from each point to its cluster centroid. These bounds assume unimodal, symmetric clusters (a generalization of k-means' Gaussian assumption). In the second part of the paper, we show that a non-outlier will lie no further from its cluster centroid than 14.826 times the median of absolute deviations from the median of D. Conveniently, D is already computed during the k-means process. The first insight leads to an enhanced k-means algorithm (named Automatic k-means) that efficiently estimates k. Unlike popular techniques, ours eliminates the need to supply a search range for k. Since practical datasets may deviate from the ideal distribution, the 1 s.d. and 2 s.d. tests may yield different estimates of k; the two estimates then serve as effective lower and upper bounds. Thus, our algorithm also provides a general way to speed up and automate existing techniques via an automatically determined narrow search range. We demonstrate this by presenting enhanced versions of the popular silhouette and gap statistic techniques (Auto-Silhouette and Auto-Gap). We apply the second theoretical insight to incorporate automatic outlier detection into k-means. The resulting outlier-aware algorithm (named k-means#) is identical to standard k-means in the absence of outliers. In the presence of outliers, it is identical to a known outlier-aware algorithm, k-means−−, except for the crucial difference that k-means−− relies on the user to supply the number of outliers, whereas ours is automated. Our technique thereby solves a puzzle posed by the authors of k-means−− regarding the difficulty of complete automation, which they considered an open problem.
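To make the two tests concrete, here is a minimal sketch of both, assuming `D` is the vector of point-to-centroid distances produced by a k-means run. The function names are hypothetical (not from the paper), and the Chebyshev-type test is implemented under one plausible reading of the abstract, namely that "exceeding λ s.d." means deviating from the mean of D by more than λσ.

```python
# Sketch of the paper's two tests on D, the point-to-centroid distance
# vector. Function names are illustrative, not the authors' code.
from statistics import mean, median, pstdev

def k_is_acceptable(D):
    """Chebyshev-type test: under the paper's unimodal, symmetric-cluster
    assumption, at most 1/9 of distances deviate from the mean of D by
    more than 2 s.d., and at most 4/9 by more than 1 s.d., when k is
    appropriate. (Interpretation of the bounds is an assumption here.)"""
    mu, sd = mean(D), pstdev(D)
    n = len(D)
    frac_over_1sd = sum(abs(d - mu) > sd for d in D) / n
    frac_over_2sd = sum(abs(d - mu) > 2 * sd for d in D) / n
    return frac_over_2sd <= 1 / 9 and frac_over_1sd <= 4 / 9

def outlier_flags(D):
    """MAD-based test: flag a point as an outlier when its distance to
    its centroid exceeds 14.826 times the median absolute deviation
    from the median of D."""
    med = median(D)
    mad = median(abs(d - med) for d in D)
    threshold = 14.826 * mad
    return [d > threshold for d in D]
```

In the spirit of Automatic k-means, one would re-run k-means with increasing k until `k_is_acceptable` first passes (the 1 s.d. and 2 s.d. checks taken separately give the lower and upper bounds on k); in the spirit of k-means#, points flagged by `outlier_flags` would be excluded from the centroid-update step.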

Original language: English
Pages (from-to): 5939-5958
Number of pages: 20
Journal: Neural Computing and Applications
Volume: 34
Issue number: 8
Publication status: Published - Apr 2022

Keywords

  • Chebyshev
  • Number of clusters
  • Outliers
  • k-means clustering

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

