From 7907db7627aafa35a45f7246e79f5369b6714828 Mon Sep 17 00:00:00 2001
From: Paul Duncan
Date: Tue, 5 Feb 2019 03:51:13 -0500
Subject: populate README, clean up comments, remove unused includes

---
 README.md | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 53 insertions(+), 2 deletions(-)

(limited to 'README.md')

diff --git a/README.md b/README.md
index 3667e8c..e43568f 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,15 @@
-KMeans Classifier Implementation
+K-Means Clustering
+==================
 
-Examples:
+[K-Means][kmeans] clustering implementation. Features:
+* Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp].
+* Uses [Silhouette method][silhouette] to determine the optimum cluster
+  count.
+* No external dependencies.
+
+Examples
+--------
 
 Generate 1000 rows of test data, clustered around 3 points:
 
     # generate test data
@@ -22,3 +30,46 @@ PNGs of results:
       echo $i
       ./km-test kmeans $i $png_path > $dst_path
     done
+
+Initialization Methods
+----------------------
+Supported initialization methods:
+
+* `rand`: Pick random points as the initial cluster centroids.
+* [forgy][forgy]: Pick random points from the data set as the initial
+  cluster centroids.
+* [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method
+  to pick the initial cluster centroids. This is the recommended
+  initialization method.
+
+Data File Format
+----------------
+Reads and writes newline-delimited plain text files in the following
+format:
+
+* Lines are delimited by newlines
+* Each line is a record.
+* Record fields are delimited by whitespace.
+* The first row specifies the *shape* of the remaining rows as two
+  unsigned integers. The first unsigned integer -- `num_floats` --
+  indicates the number of floating point columns per row, and the second
+  unsigned integer -- `num_ints` -- indicates the number of signed
+  integer values per row.
+* The remaining lines contain data rows. Each row consists of
+  `num_floats` floating point numbers, followed by `num_ints` signed
+  integer values.
+
+Example data file:
+
+    3 0
+    1.2 3.6 5.2
+    9.8 6.5 4.3
+    3.2 5.6 8.7
+
+See the files in `tests/` for additional examples. You can also use
+the top-level `gen-data.rb` script to generate additional test data.
+
+  [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering"
+  [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method"
+  [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method"
+  [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"
-- 
cgit v1.2.3
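
The "Data File Format" section added by this patch could be read by a parser along the following lines. This is only a sketch of the format as the README describes it, not code from the repository: the function name `parse_data` and the `Row` alias are hypothetical, and skipping blank lines is an assumption that the README does not state.

```python
from typing import List, Tuple

# One data row: the floating point columns, then the signed integer columns.
Row = Tuple[List[float], List[int]]


def parse_data(text: str) -> Tuple[Tuple[int, int], List[Row]]:
    """Parse the whitespace-delimited format described in the README.

    Returns ((num_floats, num_ints), rows). Hypothetical helper, not
    part of the project.
    """
    # Each line is a record; fields are delimited by whitespace.
    # Assumption: blank lines are ignored (the README does not say).
    records = [line.split() for line in text.splitlines() if line.strip()]
    if not records:
        raise ValueError("missing shape row")

    # The first row holds two unsigned integers: num_floats and num_ints.
    if len(records[0]) != 2:
        raise ValueError("shape row must have exactly two fields")
    num_floats, num_ints = (int(field) for field in records[0])
    if num_floats < 0 or num_ints < 0:
        raise ValueError("shape values must be unsigned")

    rows: List[Row] = []
    for fields in records[1:]:
        # Every data row has num_floats floats followed by num_ints ints.
        if len(fields) != num_floats + num_ints:
            raise ValueError("row does not match the declared shape")
        floats = [float(field) for field in fields[:num_floats]]
        ints = [int(field) for field in fields[num_floats:]]
        rows.append((floats, ints))
    return (num_floats, num_ints), rows


EXAMPLE = """3 0
1.2 3.6 5.2
9.8 6.5 4.3
3.2 5.6 8.7
"""

shape, rows = parse_data(EXAMPLE)
print(shape, len(rows))
```

The `EXAMPLE` string is the example data file from the README: the shape row `3 0` declares three floating point columns and no integer columns, so each parsed row carries an empty integer list.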