TY - GEN
T1 - K-means-Lite++
T2 - 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
AU - Olukanmi, Peter O.
AU - Nelwamondo, Fulufhelo
AU - Marwala, Tshilidzi
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
AB - The k-means-lite algorithm was recently introduced to address the poor scalability of the popular k-means algorithm. K-means-lite estimates its final centroids by applying standard k-means to the combined set of centroids obtained from initial applications of k-means to a few (say, five) samples. This paper modifies k-means-lite by applying D² sampling, the state-of-the-art seeding technique employed by k-means++, to each of these samples. Although k-means-lite achieves a drastic efficiency gain, the five-sample implementation trades away some of the standard algorithm's accuracy. One way to improve its accuracy is to increase the number of samples, but it may be difficult to determine a priori what performance level is attainable, or how many samples are required to attain it; both vary from problem to problem. In this paper, we show experimentally that, without increasing the sample size or the number of samples in k-means-lite, our modified algorithm, named k-means-lite++, addresses this problem and is even more accurate than standard k-means. For the cases tested, k-means-lite++ matches the accuracy of k-means++, which is widely considered the state-of-the-art k-means algorithm. The extra cost of seeding each sample is compensated for by faster convergence, so, although drastically more accurate, k-means-lite++ remains comparable in efficiency to k-means-lite. The proposed algorithm may therefore also be viewed as a solution to the scalability issues associated with k-means++'s sequential sampling.
KW - Accurate
KW - Clustering
KW - Efficient
KW - Initialization
KW - K-means
KW - K-means++
KW - K-means-lite
KW - Seeding
UR - http://www.scopus.com/inward/record.url?scp=85081537613&partnerID=8YFLogxK
U2 - 10.1109/ISCMI47871.2019.9004300
DO - 10.1109/ISCMI47871.2019.9004300
M3 - Conference contribution
AN - SCOPUS:85081537613
T3 - 2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
SP - 223
EP - 227
BT - 2019 6th International Conference on Soft Computing and Machine Intelligence, ISCMI 2019
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 November 2019 through 20 November 2019
ER -