K-Means Clustering ================== [K-Means][kmeans] clustering implementation. Features: * Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp]. * Uses [Silhouette method][silhouette] to determine the optimum cluster count. * No external dependencies. Examples -------- Generate 1000 rows of test data, clustered around 3 points: # generate test data ./gen-data.rb 1000 3 > example.dat # classify data, plot results as png ./km-test kmeans example.dat example.png Run all tests and save best clustering in current directory along with PNGs of results: for i in tests/*dat; do # build paths png_path=kmeans-$(basename ${i/dat/png}) dst_path=kmeans-$(basename $i) # run test echo $i ./km-test kmeans $i $png_path > $dst_path done Initialization Methods ---------------------- Supported initialization methods: * `rand`: Pick random points as the initial cluster centroids. * [forgy][forgy]: Pick random points from the data set as the initial cluster centroids. * [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method to pick the initial cluster centroids. This is the recommended initialization method. Data File Format ---------------- Reads and writes newline-delimited plain text files in the following format: * Lines are delimited by newlines * Each line is a record. * Record fields are delimited by whitespace. * The first row specifies the *shape* of the remaining rows as two unsigned integers. The first unsigned integer -- `num_floats` -- indicates the number of floating point columns per row, and the second unsigned integer -- `num_ints` -- indicates the number of signed integer values per row. * The remaining lines contain data rows. Each row consists of `num_floats` floating point numbers, followed by `num_ints` signed integer values. Example data file: 3 0 1.2 3.6 5.2 9.8 6.5 4.3 3.2 5.6 8.7 See the files in `tests/` for additional examples. You can also use the top-level `gen-data.rb` script to generate additional test data. [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering" [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method" [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method" [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"