K-Means-Lite: Real Time Clustering for Large Datasets

Peter O. Olukanmi, Fulufhelo Nelwamondo, Tshilidzi Marwala

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

15 Citations (Scopus)

Abstract

We present a simple algorithm to address the poor scalability of k-means, arguably the most popular clustering algorithm. Our algorithm, named k-means-lite, is based on an intuitive extension of the classical central limit theorem. It obtains the k centroids which k-means seeks, by making inference from a few small samples, rather than by repeated exhaustive comparison of data points and centroids. Experiments show that, compared to k-means, k-means-lite achieves drastic efficiency gain, and solves large datasets (up to 1 million points tested) in real time. The efficiency gain is increasingly manifest as data size and number of clusters increase. Interestingly, k-means-lite also produces better clustering quality than k-means on the largest 7 of 10 datasets tested.

Original languageEnglish
Title of host publication5th International Conference on Soft Computing and Machine Intelligence, ISCMI 2018
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages54-59
Number of pages6
ISBN (Electronic)9781728113012
DOIs
Publication statusPublished - 2 Jul 2018
Event5th International Conference on Soft Computing and Machine Intelligence, ISCMI 2018 - Nairobi, Kenya
Duration: 21 Nov 201822 Nov 2018

Publication series

Name5th International Conference on Soft Computing and Machine Intelligence, ISCMI 2018

Conference

Conference5th International Conference on Soft Computing and Machine Intelligence, ISCMI 2018
Country/TerritoryKenya
CityNairobi
Period21/11/1822/11/18

Keywords

  • accurate
  • clustering
  • efficient
  • k-means
  • real time
  • scalable

ASJC Scopus subject areas

  • Control and Optimization
  • Artificial Intelligence
  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'K-Means-Lite: Real Time Clustering for Large Datasets'. Together they form a unique fingerprint.

Cite this