K-Means Clustering

K-Means clustering implementation.

Examples

Generate 1000 rows of test data clustered around 3 points, then classify the data and plot the results:

# generate test data
./gen-data.rb 1000 3 > example.dat

# classify data, plot results as png
./km-test kmeans example.dat example.png

Run all tests and save the best results, along with PNG plots of those results, in the current directory:

for i in tests/*.dat; do
  # path to output data file
  # (ex: src: "tests/c3-1e3-0.dat", dst: "kmeans-c3-1e3-0.dat")
  dst_path="kmeans-$(basename "$i")"

  # path to output png (strip the .dat suffix, then append .png)
  # (ex: src: "tests/c3-1e3-0.dat", dst: "kmeans-c3-1e3-0.png")
  png_path="kmeans-$(basename "$i" .dat).png"

  # run test (use kmeans for initialization)
  echo "$i"
  ./km-test kmeans "$i" "$png_path" > "$dst_path"
done

Initialization Methods

Supported initialization methods:

  • rand: Pick random points as the initial cluster centroids.
  • forgy: Pick random points from the data set as the initial cluster centroids.
  • kmeans: Use the k-means++ initialization method to pick the initial cluster centroids. This is the recommended initialization method.

Unless you have a specific reason to choose rand or forgy, stick with kmeans.
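
To make the difference between the three methods concrete, here is a minimal Ruby sketch of each one. It is not the code used by km-test: the function names (dist2, init_rand, init_forgy, init_kmeans_pp) are hypothetical, and drawing the rand centroids from the bounding box of the data is an assumption, since the description above does not say where the random points come from.

```ruby
# Squared Euclidean distance between two points (arrays of numbers).
def dist2(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# rand: k uniformly random points.
# Assumption: points are drawn from the bounding box of the data.
def init_rand(rows, k, rng: Random.new)
  dims = rows.first.size
  mins = (0...dims).map { |d| rows.map { |r| r[d] }.min }
  maxs = (0...dims).map { |d| rows.map { |r| r[d] }.max }
  Array.new(k) do
    (0...dims).map { |d| mins[d] + rng.rand * (maxs[d] - mins[d]) }
  end
end

# forgy: k distinct rows picked at random from the data set.
def init_forgy(rows, k, rng: Random.new)
  rows.sample(k, random: rng)
end

# kmeans (k-means++): the first centroid is a random row; each later
# centroid is a row chosen with probability proportional to its squared
# distance from the nearest centroid picked so far.
def init_kmeans_pp(rows, k, rng: Random.new)
  centroids = [rows.sample(random: rng)]
  while centroids.size < k
    weights = rows.map { |r| centroids.map { |c| dist2(r, c) }.min }
    total = weights.sum
    if total.zero?
      # every row coincides with a centroid; fall back to a uniform pick
      centroids << rows.sample(random: rng)
      next
    end
    target = rng.rand * total
    acc = 0.0
    # default to the farthest row in case rounding prevents a pick below
    pick = rows[weights.index(weights.max)]
    rows.each_with_index do |r, i|
      acc += weights[i]
      if acc > target
        pick = r
        break
      end
    end
    centroids << pick
  end
  centroids
end
```

Because k-means++ favors rows that are far from the centroids already chosen, the initial centroids tend to be spread across the data, which is the usual argument for preferring it over rand and forgy.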

Data File Format

Data files are read and written as newline-delimited plain text in the following format:

  • Each line is a row.
  • Each row consists of one or more columns, delimited by a space.
  • Columns are floating point or integer values.
  • The first row is called the header row.
  • The header row contains two unsigned integer columns which indicate the layout of the remaining rows.
  • The first header row column indicates the number of floating point columns per row (num_floats).
  • The second header row column indicates the number of integer columns per row (num_ints).
  • The remaining rows contain num_floats floating point columns, followed by num_ints signed integer columns.

Example data file:

3 0
1.2 3.6 5.2
9.8 6.5 4.3
3.2 5.6 8.7

See the files in tests/ for additional example data files. You can also use the top-level gen-data.rb script to generate additional test data.
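
The format rules above can be sketched as a small Ruby reader. This is an illustration, not the parser used by km-test: parse_data is a hypothetical name, and skipping blank lines and raising ArgumentError on malformed input are assumptions that the format description does not specify.

```ruby
# Parse the text of a data file.
# Returns [num_floats, num_ints, rows], where each row is [floats, ints].
def parse_data(text)
  # Assumption: blank lines are ignored.
  lines = text.each_line.map(&:strip).reject(&:empty?)
  raise ArgumentError, "missing header row" if lines.empty?

  # Header row: two unsigned integers, num_floats and num_ints.
  header = lines.shift.split(" ")
  raise ArgumentError, "header needs two columns" unless header.size == 2
  num_floats, num_ints = header.map { |s| Integer(s, 10) }
  if num_floats < 0 || num_ints < 0
    raise ArgumentError, "header values must be unsigned"
  end

  # Remaining rows: num_floats floats followed by num_ints signed integers.
  width = num_floats + num_ints
  rows = lines.map.with_index(1) do |line, row_no|
    cols = line.split(" ")
    unless cols.size == width
      raise ArgumentError,
            "row #{row_no}: expected #{width} columns, got #{cols.size}"
    end
    floats = cols.first(num_floats).map { |s| Float(s) }
    ints = cols.last(num_ints).map { |s| Integer(s, 10) }
    [floats, ints]
  end

  [num_floats, num_ints, rows]
end
```

For example, parse_data(File.read("example.dat")) would return the header counts followed by one [floats, ints] pair per data row.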

Test Data

The test data files in the tests/ directory use the following naming convention:

c<num_clusters>-<num_rows>-<N>.dat

Where:

  • num_clusters: Number of clusters.
  • num_rows: Number of rows, in exponent notation (1e3 = 1000, 1e4 = 10000, etc).
  • N: Distinguishing suffix (usually 0).
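
As an illustration of the naming convention, the following Ruby sketch splits a test file name into its parts. The helper name parse_test_name is hypothetical, and the pattern assumes that num_rows is always written as an integer mantissa followed by "e" and an integer exponent (for example 1e3), which matches the examples above but is not guaranteed by them.

```ruby
# Parse a test data file name of the form c<num_clusters>-<num_rows>-<N>.dat.
# Returns a hash of the parts, or nil if the name does not match.
def parse_test_name(path)
  # Assumption: num_rows looks like <mantissa>e<exponent>, e.g. "1e3".
  pattern = /\Ac(\d+)-(\d+)e(\d+)-(\w+)\.dat\z/
  m = pattern.match(File.basename(path))
  return nil unless m

  {
    num_clusters: m[1].to_i,
    num_rows: m[2].to_i * 10**m[3].to_i, # expand exponent notation
    suffix: m[4]
  }
end
```

Calling parse_test_name("tests/c3-1e3-0.dat") yields 3 clusters, 1000 rows, and the suffix "0"; a name that does not follow the convention, such as "example.dat", yields nil.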