TY - GEN
T1 - Learning the k in k-means via the Camp-Meidell inequality
AU - Olukanmi, Peter O.
AU - Nelwamondo, Fulufhelo
AU - Marwala, Tshilidzi
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - Determining k, the number of clusters, is a key challenge in k-means clustering. We propose an approach, named Automatic k-means (k-means-auto), which judges a k value to be right if at least 88.89% of D is less than twice its standard deviation, where D is a vector of length N (the number of points) comprising the distance of each point to its cluster centroid. The bound is derived from the Camp-Meidell inequality for unimodal, symmetrical distributions. We first show that k-means' equal-variance Gaussian cluster assumption induces a Gaussian fit for D; thus, if D is Gaussian, then all clusters are Gaussian and the underlying k value is appropriate. As a relaxation, we test D for unimodality and symmetry rather than Gaussian fit, since clusters in real data are rarely perfectly Gaussian. The proposed approach is fast because D does not have to be computed afresh: it is already available as a by-product of the clustering process. On 10 well-known datasets, k-means-auto finds the true k in 6 cases and comes very close (within ±1 cluster) in 3 other cases. Compared with two well-known methods, the gap statistic and the silhouette, it outperforms the former and competes closely with the latter (the difference is statistically insignificant at 95% confidence). Over all the cases, the running-time ranges for k-means-auto, gap and silhouette are 0.02s-0.44s (a factor of 22), 0.31s-13.63s (a factor of 108) and 0.02s-70.78s (a factor of 3539), respectively. Thus, k-means-auto surpasses the other two methods in efficiency.
KW - Clustering
KW - K-means
KW - Learning k
KW - Number of clusters
UR - http://www.scopus.com/inward/record.url?scp=85081557949&partnerID=8YFLogxK
U2 - 10.1109/ISCMI47871.2019.9004417
DO - 10.1109/ISCMI47871.2019.9004417
M3 - Conference contribution
AN - SCOPUS:85081557949
T3 - 2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
SP - 211
EP - 216
BT - 2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
Y2 - 19 November 2019 through 20 November 2019
ER -
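
The 88.89% threshold in the abstract follows from the Camp-Meidell inequality for unimodal, symmetrical distributions, P(|X - mu| >= t*sigma) <= 4/(9*t^2): at t = 2 the tail mass is at most 4/36 = 1/9 (about 11.11%), so at least 8/9 (about 88.89%) of D should fall within two standard deviations. Below is a minimal sketch of the selection rule in Python, assuming scikit-learn's KMeans; the scan over candidate k values, the function names, and the literal reading of the threshold (the fraction of D below 2*std(D)) are illustrative assumptions, not the authors' exact implementation.

import numpy as np
from sklearn.cluster import KMeans

def camp_meidell_ok(D, t=2.0):
    # Camp-Meidell with t = 2: at most 4/(9*t^2) = 1/9 of the mass may
    # exceed t standard deviations, so at least 8/9 (~88.89%) must not.
    # Centering on zero (D < t*std) follows the abstract's literal
    # wording and is an assumption.
    return np.mean(D < t * D.std()) >= 1.0 - 4.0 / (9.0 * t * t)

def kmeans_auto(X, k_max=10, seed=0):
    # Scan candidate k values and accept the first k whose distance
    # vector D passes the Camp-Meidell test. D is a by-product of the
    # clustering itself: each point's distance to its assigned centroid.
    for k in range(2, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        D = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
        if camp_meidell_ok(D):
            return k, km
    return k_max, km

# Usage on three well-separated synthetic blobs (k = 3 expected):
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in [(0, 0), (5, 5), (0, 5)]])
k, model = kmeans_auto(X)

Computing D from km.labels_ and km.cluster_centers_ reflects the abstract's point that D is already available from the clustering run, so the acceptance test adds essentially no cost per candidate k.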