TY - JOUR
T1 - Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities
AU - Olukanmi, Peter
AU - Nelwamondo, Fulufhelo
AU - Marwala, Tshilidzi
AU - Twala, Bhekisipho
N1 - Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2022/4
Y1 - 2022/4
N2 - We address two key challenges of k-means clustering. In the first part of the paper, we show that when a dataset is partitioned with an appropriate number of clusters (k), not more than 1/9 of D will exceed twice its standard deviation (2 s.d.) and not more than 4/9 of D will exceed its standard deviation (1 s.d., σ), where D is the vector comprising the distance of each point to its cluster centroid. Our bounds assume unimodal symmetric clusters (a generalization of k-means’ Gaussian assumption). In the second part of the paper, we show that a non-outlier will not lie farther from its cluster centroid than 14.826 times the median of absolute deviations from the median of D. Conveniently, D is already available from the k-means process. The first insight leads to an enhanced k-means algorithm (named Automatic k-means) that efficiently estimates k. Unlike popular techniques, ours eliminates the need to supply a search range for k. Since practical datasets may deviate from the ideal distribution, the 1 and 2 s.d. tests may yield different estimates of k; the two estimates then serve as effective lower and upper bounds. Our algorithm thus also provides a general way to speed up and automate existing techniques, via an automatically determined narrow search range. We demonstrate this by presenting enhanced versions of the popular silhouette and gap statistic techniques (Auto-Silhouette and Auto-Gap). We apply the second theoretical insight to incorporate automatic outlier detection into k-means. Our outlier-aware algorithm (named k-means#) is identical to standard k-means in the absence of outliers. In the presence of outliers, it is identical to a known outlier-aware algorithm, k-means−−, except for the crucial difference that k-means−− relies on the user to supply the number of outliers, whereas our algorithm is automated. Our technique thereby resolves the difficulty of complete automation described by the authors of k-means−−, which was considered an open problem.
AB - We address two key challenges of k-means clustering. In the first part of the paper, we show that when a dataset is partitioned with an appropriate number of clusters (k), not more than 1/9 of D will exceed twice its standard deviation (2 s.d.) and not more than 4/9 of D will exceed its standard deviation (1 s.d., σ), where D is the vector comprising the distance of each point to its cluster centroid. Our bounds assume unimodal symmetric clusters (a generalization of k-means’ Gaussian assumption). In the second part of the paper, we show that a non-outlier will not lie farther from its cluster centroid than 14.826 times the median of absolute deviations from the median of D. Conveniently, D is already available from the k-means process. The first insight leads to an enhanced k-means algorithm (named Automatic k-means) that efficiently estimates k. Unlike popular techniques, ours eliminates the need to supply a search range for k. Since practical datasets may deviate from the ideal distribution, the 1 and 2 s.d. tests may yield different estimates of k; the two estimates then serve as effective lower and upper bounds. Our algorithm thus also provides a general way to speed up and automate existing techniques, via an automatically determined narrow search range. We demonstrate this by presenting enhanced versions of the popular silhouette and gap statistic techniques (Auto-Silhouette and Auto-Gap). We apply the second theoretical insight to incorporate automatic outlier detection into k-means. Our outlier-aware algorithm (named k-means#) is identical to standard k-means in the absence of outliers. In the presence of outliers, it is identical to a known outlier-aware algorithm, k-means−−, except for the crucial difference that k-means−− relies on the user to supply the number of outliers, whereas our algorithm is automated. Our technique thereby resolves the difficulty of complete automation described by the authors of k-means−−, which was considered an open problem.
KW - Chebyshev
KW - Number of clusters
KW - Outliers
KW - k-means clustering
UR - http://www.scopus.com/inward/record.url?scp=85123079241&partnerID=8YFLogxK
U2 - 10.1007/s00521-021-06689-x
DO - 10.1007/s00521-021-06689-x
M3 - Article
AN - SCOPUS:85123079241
SN - 0941-0643
VL - 34
SP - 5939
EP - 5958
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 8
ER -
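
For orientation, the sketch below illustrates in Python the two tests the abstract describes. It is not the authors' implementation: the function names, the simple incremental search for k (the paper's own procedure avoids a user-supplied range), and the reading of "exceed its standard deviation" as D-values above the threshold are all assumptions here; only the 4/9 and 1/9 tail bounds and the 14.826 × MAD cutoff come from the abstract.

```python
# Minimal sketch of the abstract's two tests, under assumed conventions:
#   - D is the vector of distances from each point to its assigned centroid.
#   - A value of k is accepted when the empirical tail fractions of D respect
#     the stated bounds: at most 4/9 of D beyond 1 s.d., at most 1/9 beyond 2 s.d.
#   - The outlier cutoff is 14.826 * MAD of D, as stated in the abstract.
import numpy as np
from sklearn.cluster import KMeans

def distance_vector(X, k, random_state=0):
    """Cluster X with k-means and return each point's distance to its centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    D = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return D, km

def passes_sd_tests(D):
    """Chebyshev-type tail checks from the abstract (one plausible reading:
    fraction of D-values above 1 and 2 standard deviations of D)."""
    sd = D.std()
    frac_beyond_1sd = np.mean(D > 1 * sd)   # must not exceed 4/9
    frac_beyond_2sd = np.mean(D > 2 * sd)   # must not exceed 1/9
    return frac_beyond_1sd <= 4 / 9, frac_beyond_2sd <= 1 / 9

def estimate_k(X, k_max_guess=50):
    """Smallest k whose distance vector satisfies both tail bounds.
    (Illustrative incremental search; the paper's exact procedure may differ.)"""
    for k in range(1, k_max_guess + 1):
        D, _ = distance_vector(X, k)
        ok1, ok2 = passes_sd_tests(D)
        if ok1 and ok2:
            return k
    return k_max_guess

def mad_outlier_mask(D):
    """Flag points whose centroid distance exceeds 14.826 * MAD of D."""
    mad = np.median(np.abs(D - np.median(D)))
    return D > 14.826 * mad
```

A typical call sequence under these assumptions would be `k = estimate_k(X)`, then `D, km = distance_vector(X, k)`, then `outliers = mad_outlier_mask(D)`; note that the MAD test reuses the distance vector the k-means pass already produced, which is the efficiency point the abstract makes.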