CVT_BASIS
Data Clustering by K-Means Techniques
CVT_BASIS is a FORTRAN90 program which computes good cluster centers
for a set of data.
The clustering process uses the K-Means algorithm, which can be
considered a discrete version of the CVT (Centroidal Voronoi
Tessellation) algorithm.
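To make the connection to CVT concrete, here is a minimal Python sketch of one K-Means pass: each data point is assigned to its nearest center (a discrete Voronoi region), and each center is then moved to the centroid of its region. The function names, the treatment of empty clusters, and the definition of "energy" as the sum of squared distances from each point to its assigned center are assumptions made for illustration; they are not taken from the CVT_BASIS source.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the center closest to the point (ties go to the lowest index)."""
    return min(range(len(centers)),
               key=lambda j: squared_distance(point, centers[j]))

def kmeans_step(points, centers):
    """One assign-then-recenter pass; returns (new_centers, assignment)."""
    assignment = [nearest_center(p, centers) for p in points]
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, a in zip(points, assignment) if a == j]
        if members:
            # Move the center to the centroid of its discrete Voronoi region.
            new_centers.append([sum(c) / len(members) for c in zip(*members)])
        else:
            # Assumption of this sketch: an empty cluster keeps its old center.
            new_centers.append(list(old))
    return new_centers, assignment

def energy(points, centers, assignment):
    """Assumed energy: sum of squared distances to the assigned centers."""
    return sum(squared_distance(p, centers[a])
               for p, a in zip(points, assignment))

# Four points forming two obvious groups, with two rough starting centers.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
centers = [[1.0, 0.0], [9.0, 0.0]]
for _ in range(10):
    centers, assignment = kmeans_step(points, centers)

print(centers, assignment, energy(points, centers, assignment))
```

For this small data set the centers settle at the midpoints of the two groups after the first pass, and further passes leave them unchanged.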
The data is a collection of vectors, with each vector stored in
a separate file. The files are presumed to have "sequential" names,
such as "fred01.txt", "fred02.txt", and so on. Each file must be a
TABLE file, that is, a series of N lines with M values on every line
(although comment lines may be inserted as well).
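The file conventions above can be sketched in Python as follows. This is an illustration only: the helper names (read_table, next_file_name, read_sequence) are hypothetical, the rule "increment the last run of digits in the file name" is an assumption about how sequential names advance, and flattening each N by M table row by row into a single data vector is likewise an assumption. Comment lines are taken to begin with "#" in column 1.

```python
import os
import re
import tempfile

def read_table(path):
    """Read a TABLE file: whitespace-separated numbers, one row per line.

    Blank lines and lines starting with '#' are skipped.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            rows.append([float(token) for token in line.split()])
    if any(len(row) != len(rows[0]) for row in rows):
        raise ValueError(f"{path}: rows have differing numbers of values")
    return rows

def next_file_name(name):
    """Increment the last run of digits in the base name, keeping its width.

    Assumed rule for this sketch: 'fred09.txt' becomes 'fred10.txt'.
    """
    folder, base = os.path.split(name)
    runs = list(re.finditer(r"\d+", base))
    if not runs:
        raise ValueError(f"no digits to increment in {name!r}")
    last = runs[-1]
    digits = last.group()
    bumped = str(int(digits) + 1).zfill(len(digits))
    return os.path.join(folder, base[:last.start()] + bumped + base[last.end():])

def read_sequence(first_name):
    """Read consecutive files, starting at first_name, until one is missing."""
    vectors = []
    name = first_name
    while os.path.exists(name):
        table = read_table(name)
        # Assumption: the N-by-M table is flattened row by row into one vector.
        vectors.append([value for row in table for value in row])
        name = next_file_name(name)
    return vectors

# Small demonstration using two temporary files.
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "fred01.txt")
    with open(first, "w") as handle:
        handle.write("# a comment line\n1.0 2.0\n3.0 4.0\n")
    with open(next_file_name(first), "w") as handle:
        handle.write("5.0 6.0\n7.0 8.0\n")
    vectors = read_sequence(first)

print(vectors)
```

Reading stops at the first missing file, so a gap in the numbering ends the sequence.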
The program is given the name of the first file in the sequence.
It reads the data from each file in the sequence, and carries out
the K-Means clustering process to determine K cluster centers.
It writes each of these cluster centers out to a separate file.
The cluster centers will generally be "well spread out" in the space
spanned by the set of data. Such a set might be useful, for instance,
in determining a basis for a low-dimensional approximation of the
data.
INPUT: at run time, the user specifies:
-
uv0_file, the name of the first data file (the program
will assume all the files are numbered consecutively).
Note that you may specify more than one family of solution files.
Enter "none" if there are no more families, or else the name of the
first file in the next family. Up to 10 separate families of
files are allowed.
-
cluster_lo, cluster_hi, the range of cluster counts (numbers of
clusters) to check. In most cases, you simply want to specify the
same number for both these values, namely, the requested basis size.
-
cluster_it_max, the number of different times you want to
try to cluster the data; I often use 15.
-
energy_it_max, the number of times you want to try to improve
a given clustering by swapping points from one cluster to another;
I often use 50 or 100.
-
comment, "Y" if initial comments may be included at the
beginning of the output files. These comments always start with
a "#" character in column 1.
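As a rough illustration of how these parameters might interact, the Python sketch below loops over cluster counts from cluster_lo to cluster_hi, makes cluster_it_max independent attempts for each count, caps each attempt at energy_it_max improvement passes, and keeps the lowest energy found. Several things here are assumptions rather than the program's actual behavior: the improvement pass is a plain assign-then-recenter iteration standing in for the point-swapping step described above, the energy is taken to be the sum of squared distances to the assigned centers, initial centers are sampled from the data, and all function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def centroids(points, labels, old_centers):
    """Recompute each center as the mean of its members (empty: unchanged)."""
    new = []
    for j, old in enumerate(old_centers):
        members = [p for p, a in zip(points, labels) if a == j]
        new.append([sum(c) / len(members) for c in zip(*members)]
                   if members else list(old))
    return new

def total_energy(points, centers, labels):
    """Assumed energy: sum of squared distances to the assigned centers."""
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, labels))

def best_clustering(points, k, cluster_it_max, energy_it_max, rng):
    """Try cluster_it_max random starts; return (lowest energy, its centers)."""
    best = None
    for _ in range(cluster_it_max):
        centers = [list(p) for p in rng.sample(points, k)]
        labels = assign(points, centers)
        for _ in range(energy_it_max):
            centers = centroids(points, labels, centers)
            new_labels = assign(points, centers)
            if new_labels == labels:  # no point changed cluster: converged
                break
            labels = new_labels
        e = total_energy(points, centers, labels)
        if best is None or e < best[0]:
            best = (e, centers)
    return best

def energy_by_size(points, cluster_lo, cluster_hi,
                   cluster_it_max, energy_it_max, seed=0):
    """Map each cluster count in [cluster_lo, cluster_hi] to its best energy."""
    rng = random.Random(seed)
    return {k: best_clustering(points, k, cluster_it_max, energy_it_max, rng)[0]
            for k in range(cluster_lo, cluster_hi + 1)}

# Nine points in three well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
          [20.0, 0.0], [20.0, 1.0], [21.0, 0.0]]
energies = energy_by_size(points, cluster_lo=1, cluster_hi=3,
                          cluster_it_max=15, energy_it_max=50)
for k, e in energies.items():
    print(k, e)
```

Setting cluster_lo equal to cluster_hi reduces this to a single clustering of the requested basis size, while a wider range yields size versus energy data.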
Licensing:
The computer code and data files described and made available on this web page
are distributed under
the GNU LGPL license.
Languages:
CVT_BASIS is available in
a FORTRAN90 version.
Related Data and Programs:
BRAIN_SENSOR_POD,
a MATLAB program which
applies the method of Proper Orthogonal Decomposition
to seek underlying patterns in sets of 40 sensor readings of
brain activity.
BURGERS,
a dataset directory which
contains solutions of the 1D Burgers equation;
CAVITY_FLOW,
a dataset directory which
contains solutions of a driven cavity flow in 2D;
CVT_BASIS_FLOW,
a FORTRAN90 program which
is similar to CVT_BASIS, but is specialized to handle
a particular family of fluid flow solutions.
CVTP,
a FORTRAN90 library which
creates a CVTP, that is, a Centroidal Voronoi Tessellation
on a periodic domain.
INOUT_FLOW,
a dataset directory which
contains solutions for flow in and out of a chamber in 2D;
INOUT_FLOW2,
a dataset directory which
contains solutions for flow in and out of a chamber in 2D,
using a finer grid and more timesteps;
SVD_BASIS,
a FORTRAN90 program which
uses the singular value decomposition to extract representative
modes from a set of data vectors.
TCELL_FLOW,
a dataset directory which
contains solutions for flow through a T-cell in 2D;
Reference:
-
Franz Aurenhammer,
Voronoi diagrams -
a survey of a fundamental geometric data structure,
ACM Computing Surveys,
Volume 23, Number 3, pages 345-405, September 1991.
-
John Burkardt, Max Gunzburger, Hyung-Chun Lee,
Centroidal Voronoi Tessellation-Based Reduced-Order
Modelling of Complex Systems,
SIAM Journal on Scientific Computing,
Volume 28, Number 2, 2006, pages 459-484.
-
John Burkardt, Max Gunzburger, Janet Peterson, Rebecca Brannon,
User Manual and Supporting Information for Library of Codes
for Centroidal Voronoi Placement and Associated Zeroth,
First, and Second Moment Determination,
Sandia National Laboratories Technical Report SAND2002-0099,
February 2002.
-
Qiang Du, Vance Faber, Max Gunzburger,
Centroidal Voronoi Tessellations: Applications and Algorithms,
SIAM Review, Volume 41, 1999, pages 637-676.
-
Lili Ju, Qiang Du, Max Gunzburger,
Probabilistic methods for centroidal Voronoi tessellations
and their parallel implementations,
Parallel Computing,
Volume 28, 2002, pages 1477-1500.
-
Wendy Martinez, Angel Martinez,
Computational Statistics Handbook with MATLAB,
Chapman and Hall / CRC, 2002.
Examples and Tests:
-
run 01, example seeking 2 clusters;
-
run 02, example seeking 4 clusters;
-
run 03, example seeking 8 clusters;
-
run 04, compute clusterings
of sizes 1 through 16, determine energies, and output size
versus energy data;
List of Routines:
-
MAIN is the main routine for the CVT_BASIS program.
-
ANALYSIS_RAW computes the energy for a range of cluster counts.
-
CH_CAP capitalizes a single character.
-
CH_EQI is a case insensitive comparison of two characters for equality.
-
CH_IS_DIGIT returns .TRUE. if a character is a decimal digit.
-
CH_TO_DIGIT returns the integer value of a base 10 digit.
-
CLUSTER_CENSUS computes and prints the population of each cluster.
-
CLUSTER_INITIALIZE_RAW initializes the cluster centers to random values.
-
CLUSTER_LIST prints out the assignments.
-
DATA_TO_GNUPLOT writes data to a file suitable for processing by GNUPLOT.
-
DIGIT_INC increments a decimal digit.
-
DIGIT_TO_CH returns the character representation of a decimal digit.
-
ENERGY_RAW computes the total energy of a given clustering.
-
FILE_COLUMN_COUNT counts the number of columns in the first line of a file.
-
FILE_EXIST reports whether a file exists.
-
FILE_NAME_INC generates the next filename in a series.
-
FILE_ROW_COUNT counts the number of row records in a file.
-
GET_UNIT returns a free FORTRAN unit number.
-
HMEANS_RAW seeks the minimal energy of a clustering of a given size.
-
I4_INPUT prints a prompt string and reads an integer from the user.
-
I4_RANGE_INPUT reads a pair of integers from the user, representing a range.
-
I4_UNIFORM returns a scaled pseudorandom I4.
-
I4VEC_PRINT prints an integer vector.
-
KMEANS_RAW tries to improve a partition of points.
-
NEAREST_CLUSTER_RAW finds the cluster nearest to a data point.
-
R8_UNIFORM_01 returns a unit pseudorandom R8.
-
R8MAT_DATA_READ reads data from an R8MAT file.
-
R8MAT_HEADER_READ reads the header from an R8MAT file.
-
R8MAT_WRITE writes an R8MAT file.
-
R8VEC_NORM2 returns the 2-norm of a vector.
-
R8VEC_RANGE_INPUT reads two double precision vectors from the user, representing a range.
-
R8VEC_UNIT_EUCLIDEAN normalizes an N-vector in the Euclidean norm.
-
RANDOM_INITIALIZE initializes the FORTRAN 90 random number seed.
-
S_BLANK_DELETE removes blanks from a string, left justifying the remainder.
-
S_EQI is a case insensitive comparison of two strings for equality.
-
S_INPUT prints a prompt string and reads a string from the user.
-
S_REP_CH replaces all occurrences of one character by another.
-
S_TO_R8 reads an R8 from a string.
-
S_TO_R8VEC reads an R8VEC from a string.
-
S_TO_I4 reads an I4 from a string.
-
S_TO_I4VEC reads an I4VEC from a string.
-
S_WORD_COUNT counts the number of "words" in a string.
-
TIMESTAMP prints the current YMDHMS date as a time stamp.
Last revised on 27 November 2012.