KMEANS
The K-Means Data Clustering Problem

KMEANS is a MATLAB library which handles the K-Means problem of organizing a set of N points in M dimensions into K clusters.

In the K-Means problem, a set of N points X(I) in M dimensions is given. The goal is to arrange these points into K clusters, with each cluster having a representative point Z(J), usually chosen as the centroid of the points in the cluster:

        Z(J) = Sum ( all X(I) in cluster J ) X(I) /
               Sum ( all X(I) in cluster J ) 1.
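
As an illustration of the centroid formula, here is a minimal MATLAB sketch. The sample points, the assignment vector c, and the choice to store one point per column are assumptions made for this example, not part of the library's interface.

```matlab
% Six points in M = 2 dimensions, stored one point per column (assumed layout).
x = [ 0.0, 1.0, 0.0, 5.0, 6.0, 5.0;
      0.0, 0.0, 1.0, 5.0, 5.0, 6.0 ];
% Hypothetical assignment of each point to one of K = 2 clusters.
c = [ 1, 1, 1, 2, 2, 2 ];
k = 2;

[ m, n ] = size ( x );
z = zeros ( m, k );
for j = 1 : k
  members = ( c == j );
  % Z(J) = ( sum of the points in cluster J ) / ( number of points in cluster J ).
  z(:,j) = sum ( x(:,members), 2 ) / sum ( members );
end
disp ( z );
```

Note that this sketch does not guard against an empty cluster, for which the denominator would be zero.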

The energy of cluster J is

        E(J) = Sum ( all X(I) in cluster J ) || X(I) - Z(J) ||^2

For a given set of clusters, the total energy is then simply the sum of the cluster energies E(J). The goal is to choose the clusters in such a way that the total energy is minimized. Usually, a point X(I) goes into the cluster with the closest representative point Z(J). So to define the clusters, it's enough simply to specify the locations of the cluster representatives.
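
The following sketch shows how a set of cluster representatives determines both the clusters and the energy. The data and the representatives z are made up for the example, and the one-point-per-column layout is an assumption.

```matlab
% Six points in M = 2 dimensions, one point per column (assumed layout).
x = [ 0.0, 1.0, 0.0, 5.0, 6.0, 5.0;
      0.0, 0.0, 1.0, 5.0, 5.0, 6.0 ];
% Hypothetical cluster representatives, one per column.
z = [ 0.0, 5.0;
      0.0, 5.0 ];

[ m, n ] = size ( x );
k = size ( z, 2 );
c = zeros ( 1, n );   % c(i) is the cluster chosen for point i
e = zeros ( 1, k );   % e(j) is the energy of cluster j

for i = 1 : n
  % Squared distance from point i to every representative.
  d2 = zeros ( 1, k );
  for j = 1 : k
    d2(j) = sum ( ( x(:,i) - z(:,j) ).^2 );
  end
  % The point goes into the cluster with the closest representative.
  [ d2min, c(i) ] = min ( d2 );
  e(c(i)) = e(c(i)) + d2min;
end

e_total = sum ( e );
fprintf ( 'Total energy = %g\n', e_total );
```

Here the representatives are given rather than computed as centroids, so the energies are measured relative to the supplied points Z(J).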

This is actually a fairly hard problem. Most algorithms do reasonably well, but cannot guarantee that the best solution has been found. It is very common for an algorithm to get stuck at a solution which is merely a "local minimum". For such a local minimum, every slight rearrangement of the solution makes the energy go up; however, a major rearrangement would result in a big drop in energy.
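
A small made-up example may help to show how a poor solution can look settled. Four points sit at the corners of a 4 by 1 rectangle. In both partitions below, every point is already closest to the centroid of its own cluster, so the closest-representative rule gives no reason to move anything, yet the energies differ by a factor of 16. The data and variable names are assumptions for this sketch; a method that considers other kinds of rearrangement might still escape the poorer partition.

```matlab
% Corners of a 4 by 1 rectangle, one point per column (assumed layout).
x = [ 0.0, 0.0, 4.0, 4.0;
      0.0, 1.0, 0.0, 1.0 ];
k = 2;
c_poor = [ 1, 2, 1, 2 ];   % bottom pair versus top pair
c_good = [ 1, 1, 2, 2 ];   % left pair versus right pair
partition = [ c_poor; c_good ];

energy = zeros ( 1, 2 );
stable = true ( 1, 2 );
for p = 1 : 2
  c = partition(p,:);
  % Centroids of the two clusters of this partition.
  z = zeros ( 2, k );
  for j = 1 : k
    members = ( c == j );
    z(:,j) = sum ( x(:,members), 2 ) / sum ( members );
  end
  for i = 1 : 4
    d2 = zeros ( 1, k );
    for j = 1 : k
      d2(j) = sum ( ( x(:,i) - z(:,j) ).^2 );
    end
    energy(p) = energy(p) + d2(c(i));
    % The partition is "stable" if each point is nearest its own centroid.
    [ d2min, nearest ] = min ( d2 );
    stable(p) = stable(p) && ( nearest == c(i) );
  end
end
disp ( energy );
disp ( stable );
```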

A simple algorithm for the problem is known as the "H-Means algorithm". It alternates between two procedures:

1. Using the given cluster centers, assign each point to the cluster with the nearest center;
2. Using the given cluster assignments, replace each cluster center by the centroid or average of the points in the cluster.

These steps are repeated until no points are moved, or some other termination criterion is reached.
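
The two alternating steps can be sketched in MATLAB as follows. This is only an illustration under assumed data, an assumed one-point-per-column layout, and an arbitrary initialization and iteration limit; it is not the code of hmeans_01.m or hmeans_02.m.

```matlab
% Six points in M = 2 dimensions, one point per column (assumed layout).
x = [ 0.0, 1.0, 0.0, 5.0, 6.0, 5.0;
      0.0, 0.0, 1.0, 5.0, 5.0, 6.0 ];
k = 2;
z = x(:,[1,2]);      % arbitrary start: the first two data points
it_max = 20;         % arbitrary iteration limit for this sketch

[ m, n ] = size ( x );
c = zeros ( 1, n );
for it = 1 : it_max
  % Step 1: assign each point to the cluster with the nearest center.
  c_old = c;
  for i = 1 : n
    d2 = zeros ( 1, k );
    for j = 1 : k
      d2(j) = sum ( ( x(:,i) - z(:,j) ).^2 );
    end
    [ d2min, c(i) ] = min ( d2 );
  end
  % Terminate when no point has moved.
  if ( all ( c == c_old ) )
    break
  end
  % Step 2: replace each center by the centroid of its points.
  for j = 1 : k
    members = ( c == j );
    if ( any ( members ) )
      z(:,j) = sum ( x(:,members), 2 ) / sum ( members );
    end
  end
end
disp ( c );
disp ( z );
```

A cluster that loses all its points simply keeps its old center in this sketch; other policies are possible.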

A more sophisticated algorithm, known as the "K-Means algorithm", takes advantage of the fact that it is possible to quickly determine the decrease in energy caused by moving a point from its current cluster to another. It repeats the following procedure:

For each point, move it to another cluster if that would lower the energy. If you move a point, immediately update the cluster centers of the two affected clusters.

This procedure is repeated until no points are moved, or some other termination criterion is reached.
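
The quick energy test can be made concrete. For unweighted clusters represented by their centroids, moving a point X from cluster A, with NA points, to cluster B, with NB points, changes the total energy by

        NB / ( NB + 1 ) * || X - Z(B) ||^2 - NA / ( NA - 1 ) * || X - Z(A) ||^2

The sketch below applies this test point by point. The data, the deliberately poor starting assignment, and all variable names are assumptions for illustration; the library routines may organize the computation differently.

```matlab
% Corners of a 4 by 1 rectangle, one point per column (assumed layout).
x = [ 0.0, 0.0, 4.0, 4.0;
      0.0, 1.0, 0.0, 1.0 ];
k = 2;
c = [ 1, 2, 1, 2 ];   % poor start: bottom pair versus top pair
[ m, n ] = size ( x );

% Cluster populations and centroids for the starting assignment.
pop = zeros ( 1, k );
z = zeros ( m, k );
for j = 1 : k
  members = ( c == j );
  pop(j) = sum ( members );
  z(:,j) = sum ( x(:,members), 2 ) / pop(j);
end

moved = true;
while ( moved )
  moved = false;
  for i = 1 : n
    a = c(i);
    if ( pop(a) <= 1 )
      continue          % never empty a cluster
    end
    % Energy released by taking point i out of cluster a.
    loss = pop(a) / ( pop(a) - 1 ) * sum ( ( x(:,i) - z(:,a) ).^2 );
    best = a;
    best_delta = 0.0;
    for b = 1 : k
      if ( b ~= a )
        % Energy added by putting point i into cluster b.
        gain = pop(b) / ( pop(b) + 1 ) * sum ( ( x(:,i) - z(:,b) ).^2 );
        if ( gain - loss < best_delta )
          best = b;
          best_delta = gain - loss;
        end
      end
    end
    if ( best ~= a )
      % Immediately update the centers of the two affected clusters.
      z(:,a) = ( pop(a) * z(:,a) - x(:,i) ) / ( pop(a) - 1 );
      z(:,best) = ( pop(best) * z(:,best) + x(:,i) ) / ( pop(best) + 1 );
      pop(a) = pop(a) - 1;
      pop(best) = pop(best) + 1;
      c(i) = best;
      moved = true;
    end
  end
end

% Total energy of the final clustering.
e_total = 0.0;
for i = 1 : n
  e_total = e_total + sum ( ( x(:,i) - z(:,c(i)) ).^2 );
end
disp ( c );
fprintf ( 'Total energy = %g\n', e_total );
```

On this data the transfers lower the energy from 16 at the start to 1, ending with the left pair and the right pair as the two clusters.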

The Weighted K-Means Problem

A natural extension of the K-Means problem allows us to include some more information, namely, a set of weights associated with the data points. These might represent a measure of importance, a frequency count, or some other information. The intent is that a point with a weight of 5.0 is twice as "important" as a point with a weight of 2.5, for instance. This gives rise to the "weighted" K-Means problem.

In the weighted K-Means problem, we are given a set of N points X(I) in M dimensions, and a corresponding set of nonnegative weights W(I). The goal is to arrange the points into K clusters, with each cluster having a representative point Z(J), usually chosen as the weighted centroid of the points in the cluster:

        Z(J) = Sum ( all X(I) in cluster J ) W(I) * X(I) /
               Sum ( all X(I) in cluster J ) W(I).

The weighted energy of cluster J is

        E(J) = Sum ( all X(I) in cluster J ) W(I) * || X(I) - Z(J) ||^2
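
The weighted formulas can be checked on a tiny example. The points, weights, and assignment below are made up, and the one-point-per-column layout is an assumption of this sketch rather than a statement about the library's interface.

```matlab
% Three points in M = 2 dimensions, one point per column (assumed layout).
x = [ 0.0, 2.0, 10.0;
      0.0, 0.0,  0.0 ];
w = [ 1.0, 3.0, 2.0 ];   % nonnegative weights W(I)
c = [ 1, 1, 2 ];         % hypothetical cluster assignment
k = 2;

[ m, n ] = size ( x );
z = zeros ( m, k );
e = zeros ( 1, k );
for j = 1 : k
  members = find ( c == j );
  wsum = sum ( w(members) );
  % Weighted centroid: sum of W(I) * X(I) divided by the sum of W(I).
  z(:,j) = x(:,members) * w(members)' / wsum;
  % Weighted energy: sum of W(I) * || X(I) - Z(J) ||^2.
  for i = members
    e(j) = e(j) + w(i) * sum ( ( x(:,i) - z(:,j) ).^2 );
  end
end
disp ( z );
disp ( e );
```

In cluster 1 the centroid lies at (1.5, 0), three times closer to the point of weight 3 than to the point of weight 1. A cluster whose weights sum to zero would need special handling, which this sketch omits.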

Licensing:

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

Languages:

KMEANS is available in a C version, a C++ version, a FORTRAN90 version, and a MATLAB version.

Related Data and Programs:

ASA058, a MATLAB library which implements the K-means algorithm of Sparks.

ASA136, a MATLAB library which implements the Hartigan and Wong clustering algorithm.

CITIES, a MATLAB library which handles various problems associated with a set of "cities" on a map.

CITIES, a dataset directory which contains sets of data defining groups of cities.

IMAGE_QUANTIZATION, a MATLAB library which demonstrates how the KMEANS algorithm can be used to reduce the number of colors or shades of gray in an image.

KMEANS_FAST, a MATLAB library which contains several different algorithms for the K-Means problem, which organizes a set of N points in M dimensions into K clusters, by Charles Elkan.

LORENZ_CLUSTER, a MATLAB library which takes a set of N points on a trajectory of solutions to the Lorenz equations, and applies the K-means algorithm to organize the data into K clusters.

MATLAB_KMEANS, MATLAB programs which illustrate the use of MATLAB's kmeans() function for clustering N sets of M-dimensional data into K clusters.

SAMMON_DATA, a MATLAB program which generates six sets of M-dimensional data for cluster analysis.

SPAETH, a dataset directory which contains a set of test data.

SPAETH2, a dataset directory which contains a set of test data.

Reference:

John Hartigan, Manchek Wong,
Algorithm AS 136: A K-Means Clustering Algorithm,
Applied Statistics,
Volume 28, Number 1, 1979, pages 100-108.

Wendy Martinez, Angel Martinez,
Computational Statistics Handbook with MATLAB,
Chapman and Hall / CRC, 2002.

David Sparks,
Algorithm AS 58: Euclidean Cluster Analysis,
Applied Statistics,
Volume 22, Number 1, 1973, pages 126-130.

Source Code:

cluster_energy_compute.m computes the energy of the clusters.
cluster_initialize_1.m initializes the clusters to data points.
cluster_initialize_2.m initializes the cluster centers to random values.
cluster_initialize_3.m initializes the cluster centers to random values.
cluster_initialize_4.m initializes the cluster centers to random values.
cluster_initialize_5.m initializes the cluster centers to random values.
cluster_print_summary.m prints a summary of data about a clustering.
cluster_variance_compute.m computes the cluster variance.
file_column_count.m counts the number of columns in the first line of a file.
file_row_count.m counts the number of row records in a file.
hmeans_01.m applies the H-Means algorithm.
hmeans_02.m applies the H-Means algorithm.
hmeans_w_01.m applies the weighted H-Means algorithm.
hmeans_w_02.m applies the weighted H-Means algorithm.
i4_uniform.m returns a pseudorandom I4.
i4mat_write.m writes an I4MAT file.
kmeans_01.m applies the K-Means algorithm.
kmeans_02.m applies the K-Means algorithm.
kmeans_02_optra.m carries out the optimal transfer stage.
kmeans_02_qtran.m carries out the quick transfer stage.
kmeans_03.m applies the K-Means algorithm.
kmeans_w_01.m applies the weighted K-Means algorithm.
kmeans_w_03.m applies the weighted K-Means algorithm.
r4_uniform_01.m returns a unit pseudorandom R4.
r8_uniform_01.m returns a unit pseudorandom R8.
r8mat_data_read.m reads a table from an R8MAT file.
r8mat_header_read.m reads a table header from an R8MAT file.
r8mat_uniform_01.m fills an R8MAT with pseudorandom numbers.
r8mat_write.m writes an R8MAT file.
r8vec_uniform_01.m returns a unit pseudorandom R8VEC.
random_initialize.m initializes the random number seed.
s_len_trim.m returns the length of a string to the last nonblank.
s_word_count.m counts the number of words in a string.
timestamp.m prints the current YMDHMS date as a timestamp.

Last revised on 06 February 2019.