K-Means Clustering ================== [K-Means][kmeans] clustering implementation. Features: * Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp]. * Uses the [Silhouette method][silhouette] to determine the optimum cluster count. * No external dependencies. Examples -------- Generate 1000 rows of test data, clustered around 3 points: # generate test data ./gen-data.rb 1000 3 > example.dat # classify data, plot results as png ./km-test kmeans example.dat example.png Run all tests and save best results in current directory along with PNGs of results: for i in tests/*dat; do # path to output data file # (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.dat") dst_path=kmeans-$(basename $i) # path to output png # (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.png") png_path=kmeans-$(basename ${i/dat/png}) # run test (use kmeans for initialization) echo $i ./km-test kmeans $i $png_path > $dst_path done Initialization Methods ---------------------- Supported initialization methods: * `rand`: Pick random points as the initial cluster centroids. * [forgy][forgy]: Pick random points from the data set as the initial cluster centroids. * [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method to pick the initial cluster centroids. This is the recommended initialization method. It's probably best to just stick with [kmeans][kmeanspp] unless you know what you're doing. Data File Format ---------------- Reads and writes newline-delimited plain text files in the following format: * Lines are delimited by newlines * Each line is a record. * Record fields are delimited by whitespace. * The first row specifies the *shape* of the remaining rows as two unsigned integers. The first unsigned integer -- `num_floats` -- indicates the number of floating point columns per row, and the second unsigned integer -- `num_ints` -- indicates the number of signed integer values per row. * The remaining lines contain data rows. Each row consists of `num_floats` floating point numbers, followed by `num_ints` signed integer values. Example data file: 3 0 1.2 3.6 5.2 9.8 6.5 4.3 3.2 5.6 8.7 See the files in `tests/` for additional example data files. You can also use the top-level `gen-data.rb` script to generate additional test data. Test Data --------- The test data files in the `tests/` directory use the following naming convention: c--.dat Where: * `num_clusters`: Number of clusters. * `num_rows`: Number of rows, in exponent notation (`1e3` = 1000, `1e4` = 10000, etc). * `N`: Distinguishing suffix (usually 0). [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering" [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method" [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method" [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"