K-Means Clustering ================== [K-Means][kmeans] clustering implementation. Features: * Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp]. * Uses the [Silhouette method][silhouette] to determine the optimum cluster count. * No external dependencies. Examples -------- Generate 1000 rows of test data, clustered around 3 points: # generate test data ./gen-data.rb 1000 3 > example.dat # classify data, plot results as png ./km-test kmeans example.dat example.png Run all tests and save best results in current directory along with PNGs of results: for i in tests/*dat; do # path to output data file # (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.dat") dst_path=kmeans-$(basename $i) # path to output png # (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.png") png_path=kmeans-$(basename ${i/dat/png}) # run test (use kmeans for initialization) echo $i ./km-test kmeans $i $png_path > $dst_path done Initialization Methods ---------------------- Supported initialization methods: * `rand`: Pick random points as the initial cluster centroids. * [forgy][forgy]: Pick random points from the data set as the initial cluster centroids. * [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method to pick the initial cluster centroids. This is the recommended initialization method. It's probably best to just stick with [kmeans][kmeanspp] unless you know what you're doing. Data File Format ---------------- Reads and writes newline-delimited plain text files in the following format: * Each line is a row. * Each row consists of one or more columns, delimited by a space. * Columns are floating point or integer values. * The first row is called the header row. * The header row contains two unsigned integer columns which indicate the layout of the remaining rows. * The first header row column indicates the number of floating point columns per row (`num_floats`). * The second header row column indicates the number of integer columns per row (`num_ints`). * The remaining rows contain `num_floats` floating point columns, followed by `num_ints` signed integer columns. Example data file: 3 0 1.2 3.6 5.2 9.8 6.5 4.3 3.2 5.6 8.7 See the files in `tests/` for additional example data files. You can also use the top-level `gen-data.rb` script to generate additional test data. Test Data --------- The test data files in the `tests/` directory use the following naming convention: c--.dat Where: * `num_clusters`: Number of clusters. * `num_rows`: Number of rows, in exponent notation (`1e3` = 1000, `1e4` = 10000, etc). * `N`: Distinguishing suffix (usually 0). [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering" [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method" [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method" [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"