diff options
author | Paul Duncan <pabs@pablotron.org> | 2019-02-05 03:51:13 -0500 |
---|---|---|
committer | Paul Duncan <pabs@pablotron.org> | 2019-02-05 03:51:13 -0500 |
commit | 7907db7627aafa35a45f7246e79f5369b6714828 (patch) | |
tree | e66dd3c36b1204d4ad5fee65882acb380bc92de1 /README.md | |
parent | 4041516e6cca8a44de6cb7f7e6feb0930df4c1b6 (diff) | |
download | kmeans-7907db7627aafa35a45f7246e79f5369b6714828.tar.bz2 kmeans-7907db7627aafa35a45f7246e79f5369b6714828.zip |
populate README, clean up comments, remove unused includes
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 55 |
1 files changed, 53 insertions, 2 deletions
@@ -1,7 +1,15 @@ -KMeans Classifier Implementation +K-Means Clustering +================== -Examples: +[K-Means][kmeans] clustering implementation. Features: +* Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp]. +* Uses [Silhouette method][silhouette] to determine the optimum cluster + count. +* No external dependencies. + +Examples +-------- Generate 1000 rows of test data, clustered around 3 points: # generate test data @@ -22,3 +30,46 @@ PNGs of results: echo $i ./km-test kmeans $i $png_path > $dst_path done + +Initialization Methods +---------------------- +Supported initialization methods: + +* `rand`: Pick random points as the initial cluster centroids. +* [forgy][forgy]: Pick random points from the data set as the initial + cluster centroids. +* [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method + to pick the initial cluster centroids. This is the recommended + initialization method. + +Data File Format +---------------- +Reads and writes newline-delimited plain text files in the following +format: + +* Lines are delimited by newlines +* Each line is a record. +* Record fields are delimited by whitespace. +* The first row specifies the *shape* of the remaining rows as two + unsigned integers. The first unsigned integer -- `num_floats` -- + indicates the number of floating point columns per row, and the second + unsigned integer -- `num_ints` -- indicates the number of signed + integer values per row. +* The remaining lines contain data rows. Each row consists of + `num_floats` floating point numbers, followed by `num_ints` signed + integer values. + +Example data file: + + 3 0 + 1.2 3.6 5.2 + 9.8 6.5 4.3 + 3.2 5.6 8.7 + +See the files in `tests/` for additional examples. You can also use +the top-level `gen-data.rb` script to generate additional test data. + + [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering" + [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method" + [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method" + [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method" |