K-Means Clustering
K-Means clustering implementation. Features:
- Initialization methods: random, forgy, and k-means++.
- Uses the Silhouette method to determine the optimum cluster count.
- No external dependencies.
Examples
Generate 1000 rows of test data, clustered around 3 points:
# generate test data
./gen-data.rb 1000 3 > example.dat
# classify data, plot results as png
./km-test kmeans example.dat example.png
Run all tests and save best results in current directory along with PNGs of results:
for i in tests/*dat; do
# path to output data file
# (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.dat")
dst_path=kmeans-$(basename $i)
# path to output png
# (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.png")
png_path=kmeans-$(basename ${i/dat/png})
# run test (use kmeans for initialization)
echo $i
./km-test kmeans $i $png_path > $dst_path
done
Initialization Methods
Supported initialization methods:
rand
: Pick random points as the initial cluster centroids.- forgy: Pick random points from the data set as the initial cluster centroids.
- kmeans: Use the k-means++ initialization method to pick the initial cluster centroids. This is the recommended initialization method.
It's probably best to just stick with kmeans unless you know what you're doing.
Data File Format
Reads and writes newline-delimited plain text files in the following format:
- Each line is a row.
- Each row consists of one or more columns, delimited by a space.
- Columns are floating point or integer values.
- The first row is called the header row.
- The header row contains two unsigned integer columns which indicate the layout of the remaining rows.
- The first header row column indicates the number of floating
point columns per row (
num_floats
). - The second header row column indicates the number of integer
columns per row (
num_ints
). - The remaining rows contain
num_floats
floating point columns, followed bynum_ints
signed integer columns.
Example data file:
3 0
1.2 3.6 5.2
9.8 6.5 4.3
3.2 5.6 8.7
See the files in tests/
for additional example data files. You can
also use the top-level gen-data.rb
script to generate additional test
data.
Test Data
The test data files in the tests/
directory use the following naming
convention:
c<num_clusters>-<num_rows>-<N>.dat
Where:
num_clusters
: Number of clusters.num_rows
: Number of rows, in exponent notation (1e3
= 1000,1e4
= 10000, etc).N
: Distinguishing suffix (usually 0).