TY - JOUR
T1 - k-Means-MIND: comparing seeds without repeated k-means runs
T2 - Neural Computing and Applications
AU - Olukanmi, Peter
AU - Nelwamondo, Fulufhelo
AU - Marwala, Tshilidzi
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2022
Y1 - 2022
N2 - A key drawback of the popular k-means clustering algorithm is its susceptibility to local minima. This problem is often addressed by performing repeated runs of the algorithm and choosing the best run afterward. The approach is effective but computationally expensive: it multiplies the running time by the number of repeats. We observe in this paper that performing repeated k-means runs is equivalent to comparing different candidate initializations, using as the fitness function the k-means objective value obtained after running the algorithm to convergence. This observation implies that if it were possible to compare the initializations ab initio, without depending on the full algorithm for judgment, there would be no need for repeats. To our knowledge, this possibility has not been studied in the literature. We pursue the development of such a technique for comparing initializations directly: we choose as the "best" the initialization that possesses the largest minimum inter-center distance (MIND). The proposed technique also serves as a general method for optimizing k-means seeding algorithms. We demonstrate the concept with MIND-optimized versions of two popular algorithms: k-means and k-means++. Experiments on real-world and benchmark synthetic datasets, as well as mathematical analysis, establish drastic efficiency gains compared to repeated k-means. Furthermore, our approach significantly improves the accuracy of the standard versions of the algorithms and is easy to implement.
AB - A key drawback of the popular k-means clustering algorithm is its susceptibility to local minima. This problem is often addressed by performing repeated runs of the algorithm and choosing the best run afterward. The approach is effective but computationally expensive: it multiplies the running time by the number of repeats. We observe in this paper that performing repeated k-means runs is equivalent to comparing different candidate initializations, using as the fitness function the k-means objective value obtained after running the algorithm to convergence. This observation implies that if it were possible to compare the initializations ab initio, without depending on the full algorithm for judgment, there would be no need for repeats. To our knowledge, this possibility has not been studied in the literature. We pursue the development of such a technique for comparing initializations directly: we choose as the "best" the initialization that possesses the largest minimum inter-center distance (MIND). The proposed technique also serves as a general method for optimizing k-means seeding algorithms. We demonstrate the concept with MIND-optimized versions of two popular algorithms: k-means and k-means++. Experiments on real-world and benchmark synthetic datasets, as well as mathematical analysis, establish drastic efficiency gains compared to repeated k-means. Furthermore, our approach significantly improves the accuracy of the standard versions of the algorithms and is easy to implement.
KW - Efficient
KW - Initialization
KW - Repeat
KW - k-Means clustering
UR - http://www.scopus.com/inward/record.url?scp=85135473149&partnerID=8YFLogxK
U2 - 10.1007/s00521-022-07554-1
DO - 10.1007/s00521-022-07554-1
M3 - Article
AN - SCOPUS:85135473149
SN - 0941-0643
JO - Neural Computing and Applications
JF - Neural Computing and Applications
ER -