KMEANS
the K-Means Data Clustering Problem

KMEANS is a C++ library which handles the K-Means problem, which organizes a set of N points in M dimensions into K clusters;

In the K-Means problem, a set of N points X(I) in M-dimensions is given. The goal is to arrange these points into K clusters, with each cluster having a representative point Z(J), usually chosen as the centroid of the points in the cluster.

        Z(J) = Sum ( all X(I) in cluster J ) X(I) /
               Sum ( all X(I) in cluster J ) 1.

The energy of cluster J is

        E(J) = Sum ( all X(I) in cluster J ) || X(I) - Z(J) ||^2

For a given set of clusters, the total energy is then simply the sum of the cluster energies E(J). The goal is to choose the clusters in such a way that the total energy is minimized. Usually, a point X(I) goes into the cluster with the closest representative point Z(J). So to define the clusters, it's enough simply to specify the locations of the cluster representatives.

This is actually a fairly hard problem. Most algorithms do reasonably well, but cannot guarantee that the best solution has been found. It is very common for algorithms to get stuck at a solution which is merely a "local minimum". For such a local minimum, every slight rearrangement of the solution makes the energy go up; however a major rearrangement would result in a big drop in energy.

A simple algorithm for the problem is known as the "H-Means algorithm". It alternates between two procedures:

Using the given cluster centers, assign each point to the cluster with the nearest center;
Using the given cluster assignments, replace each cluster center by the centroid or average of the points in the cluster.

These steps are repeated until no points are moved, or some other termination criterion is reached.

A more sophisticated algorithm, known as the "K-Means algorithm", takes advantage of the fact that it is possible to quickly determine the decrease in energy caused by moving a point from its current cluster to another. It repeats the following procedure:

For each point, move it to another cluster if that would lower the energy. If you move a point, immediately update the cluster centers of the two affected clusters.

This procedure is repeated until no points are moved, or some other termination criterion is reached.

The Weighted K-Means Problem

A natural extension of the K-Means problem allows us to include some more information, namely, a set of weights associated with the data points. These might represent a measure of importance, a frequency count, or some other information. The intent is that a point with a weight of 5.0 is twice as "important" as a point with a weight of 2.5, for instance. This gives rise to the "weighted" K-Means problem.

In the weighted K-Means problem, we are given a set of N points X(I) in M-dimensions, and a corresponding set of nonnegative weights W(I). The goal is to arrange the points into K clusters, with each cluster having a representative point Z(J), usually chosen as the weighted centroid of the points in the cluster:

        Z(J) = Sum ( all X(I) in cluster J ) W(I) * X(I) /
               Sum ( all X(I) in cluster J ) W(I).

The weighted energy of cluster J is

        E(J) = Sum ( all X(I) in cluster J ) W(I) * || X(I) - Z(J) ||^2

Licensing:

The computer code and data files described and made available on this web page are distributed under the GNU LGPL license.

Languages:

KMEANS is available in a C version and a C++ version and a FORTRAN90 version and a MATLAB version.

Related Data and Programs:

ASA058, a C++ library which implements the K-means algorithm of Sparks.

ASA136, a C++ library which implements the Hartigan and Wong clustering algorithm.

CITIES, a C++ library which handles various problems associated with a set of "cities" on a map.

CITIES, a dataset directory which contains sets of data defining groups of cities.

POINT_MERGE, a C++ library which considers N points in M dimensional space, and counts or indexes the unique or "tolerably unique" items.

SPAETH, a dataset directory which contains a set of test data.

SPAETH2, a dataset directory which contains a set of test data.

Reference:

John Hartigan, Manchek Wong,
Algorithm AS 136: A K-Means Clustering Algorithm,
Applied Statistics,
Volume 28, Number 1, 1979, pages 100-108.
Wendy Martinez, Angel Martinez,
Computational Statistics Handbook with MATLAB,
Chapman and Hall / CRC, 2002.
David Sparks,
Algorithm AS 58: Euclidean Cluster Analysis,
Applied Statistics,
Volume 22, Number 1, 1973, pages 126-130.

Source Code:

kmeans.cpp, the source code.
kmeans.hpp, the include file.

Examples and Tests:

kmeans_test.cpp, a sample problem.
kmeans_test.txt, the output file.

There are data files read by the sample code:

points_100.txt, 100 2D points, used as a case study by the sample problem.
points_100.png, an image of the points.
ruspini.txt, 75 points in 2D, with integer coordinates, and 0 < X < 120, 0 < Y < 160, which might naturally be grouped into 4 sets.
weights_equal_100.txt, 100 equal weights, for doing equally weighted calculations when a program expects weights.
weights_unequal_100.txt, 100 weights, not all equal, for testing programs that can use weights.

TEST01 applies HMEANS_01 to points_100.txt:

test01_clusters.txt the cluster assignments.
test01_centers.txt the cluster centers.

TEST02 applies HMEANS_02 to points_100.txt:

test02_clusters.txt the cluster assignments.
test02_centers.txt the cluster centers.

TEST03 applies KMEANS_01 to points_100.txt:

test03_clusters.txt the cluster assignments.
test03_centers.txt the cluster centers.

TEST04 applies KMEANS_02 to points_100.txt:

test04_clusters.txt the cluster assignments.
test04_centers.txt the cluster centers.

TEST05 applies KMEANS_03 to points_100.txt:

test05_clusters.txt the cluster assignments.
test05_centers.txt the cluster centers.

TEST06 applies HMEANS_01 + KMEANS_01 to points_100.txt:

test06_clusters.txt the cluster assignments.
test06_centers.txt the cluster centers.

TEST07 applies HMEANS_01 + KMEANS_02 to points_100.txt:

test07_clusters.txt the cluster assignments.
test07_centers.txt the cluster centers.

TEST08 applies KMEANS_01 + KMEANS_03 to points_100.txt:

test08_clusters.txt the cluster assignments.
test08_centers.txt the cluster centers.

TEST09 applies HMEANS_W_01 to points_100.txt and weights_equal_100.txt:

test09_clusters.txt the cluster assignments.
test09_centers.txt the cluster centers.

TEST10 applies HMEANS_W_02 to points_100.txt and weights_equal_100.txt:

test10_clusters.txt the cluster assignments.
test10_centers.txt the cluster centers.

TEST11 applies KMEANS_W_01 to points_100.txt and weights_equal_100.txt:

test11_clusters.txt the cluster assignments.
test11_centers.txt the cluster centers.

TEST12 applies KMEANS_W_03 to points_100.txt and weights_equal_100.txt:

test12_clusters.txt the cluster assignments.
test12_centers.txt the cluster centers.

TEST13 applies HMEANS_W_01 to points_100.txt and weights_unequal_100.txt:

test13_clusters.txt the cluster assignments.
test13_centers.txt the cluster centers.

TEST14 applies HMEANS_W_02 to points_100.txt and weights_unequal_100.txt:

test14_clusters.txt the cluster assignments.
test14_centers.txt the cluster centers.

TEST15 applies KMEANS_W_01 to points_100.txt and weights_unequal_100.txt:

test15_clusters.txt the cluster assignments.
test15_centers.txt the cluster centers.

TEST16 applies KMEANS_W_03 to points_100.txt and weights_unequal_100.txt:

test16_clusters.txt the cluster assignments.
test16_centers.txt the cluster centers.

List of Routines:

CH_CAP capitalizes a single character.
CH_EQI is true if two characters are equal, disregarding case.
CH_TO_DIGIT returns the integer value of a base 10 digit.
CLUSTER_ENERGY_COMPUTE computes the energy of the clusters.
CLUSTER_INITIALIZE_1 initializes the clusters to data points.
CLUSTER_INITIALIZE_2 initializes the cluster centers to random values.
LUSTER_INITIALIZE_3 initializes the cluster centers to random values.
LUSTER_INITIALIZE_4 initializes the cluster centers to random values.
LUSTER_INITIALIZE_5 initializes the cluster centers to random values.
LUSTER_PRINT_SUMMARY prints a summary of data about a clustering.
LUSTER_VARIANCE_COMPUTE computes the variance of the clusters.
FILE_COLUMN_COUNT counts the columns in the first line of a file.
FILE_ROW_COUNT counts the number of row records in a file.
MEANS_01 applies the H-Means algorithm.
MEANS_02 applies the H-Means algorithm.
MEANS_W_01 applies the weighted H-Means algorithm.
MEANS_W_02 applies the weighted H-Means algorithm.
I4_MAX returns the maximum of two I4's.
I4_MIN returns the minimum of two I4's.
I4_UNIFORM returns a scaled pseudorandom I4.
I4MAT_WRITE writes an I4MAT file with no header.
I4VEC_SUM sums the entries of an I4VEC.
I4VEC_ZERO zeroes an I4VEC.
I4VEC_ZERO_NEW creates and zeroes an I4VEC.
MEANS_01 applies the K-Means algorithm.
MEANS_02 applies the K-Means algorithm.
MEANS_02_OPTRA carries out the optimal transfer stage.
MEANS_02_QTRAN carries out the quick transfer stage.
MEANS_03 applies the K-Means algorithm.
MEANS_W_01 applies the weighted K-Means algorithm.
MEANS_W_03 applies the weighted K-Means algorithm.
R4_NINT returns the nearest integer to an R4.
R8_HUGE returns a "huge" R8.
R8_MAX returns the maximum of two R8's.
R8_MIN returns the minimum of two R8's.
R8_UNIFORM_01 returns a unit pseudorandom R8.
R8MAT_DATA_READ reads the data from an R8MAT file.
R8MAT_HEADER_READ reads the header from an R8MAT file.
R8MAT_MM_NEW multiplies two matrices.
R8MAT_UNIFORM_01 returns a unit pseudorandom R8MAT.
R8MAT_UNIFORM_01_NEW returns a unit pseudorandom R8MAT.
R8MAT_WRITE writes an R8MAT file.
R8VEC_ALL_NONPOSITIVE: ( all ( A <= 0 ) ) for R8VEC's.
R8VEC_ANY_NEGATIVE: ( any ( A < 0 ) ) for R8VEC's.
R8VEC_I4VEC_DOT_PRODUCT computes the dot product of an R8VEC and an I4VEC.
R8VEC_MIN_INDEX returns the index of the minimum value in an R8VEC.
R8VEC_SUM returns the sum of an R8VEC.
R8VEC_UNIFORM_01 returns a unit pseudorandom R8VEC.
R8VEC_ZERO zeroes an R8VEC.
R8VEC_ZERO_NEW creates and zeroes an R8VEC.
S_LEN_TRIM returns the length of a string to the last nonblank.
S_TO_R8 reads an R8 from a string.
S_TO_R8VEC reads an R8VEC from a string.
S_WORD_COUNT counts the number of "words" in a string.
TIMESTAMP prints the current YMDHMS date as a time stamp.

You can go up one level to the C++ source codes.

Last revised on 10 October 2011.

KMEANS the K-Means Data Clustering Problem