Rethinking k-means clustering in the age of massive datasets: a constant-time approach

P. Olukanmi, F. Nelwamondo, T. Marwala

Research output: Contribution to journalArticlepeer-review

11 Citations (Scopus)

Abstract

We introduce a highly efficient k-means clustering approach. We show that the classical central limit theorem addresses a special case (k = 1) of the k-means problem and then extend it to the general case. Instead of using the full dataset, our algorithm named k-means-lite applies the standard k-means to the combination C (size nk) of all sample centroids obtained from n independent small samples. Unlike ordinary uniform sampling, the approach asymptotically preserves the performance of the original algorithm. In our experiments with a wide range of synthetic and real-world datasets, k-means-lite matches the performance of k-means when C is constructed using 30 samples of size 40 + 2k. Although the 30-sample choice proves to be a generally reliable rule, when the proposed approach is used to scale k-means++ (we call this scaled version k-means-lite++), k-means++’ performance is matched in several cases, using only five samples. These two new algorithms are presented to demonstrate the proposed approach, but the approach can be applied to create a constant-time version of any other k-means clustering algorithm, since it does not modify the internal workings of the base algorithm.

Original languageEnglish
Pages (from-to)15445-15467
Number of pages23
JournalNeural Computing and Applications
Volume32
Issue number19
DOIs
Publication statusPublished - 1 Oct 2020

Keywords

  • Clustering
  • Efficiency
  • Large datasets
  • Scalable
  • k-means

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Fingerprint

Dive into the research topics of 'Rethinking k-means clustering in the age of massive datasets: a constant-time approach'. Together they form a unique fingerprint.

Cite this