populate README, clean up comments, remove unused includes

author: Paul Duncan <pabs@pablotron.org> 2019-02-05 03:51:13 -0500
committer: Paul Duncan <pabs@pablotron.org> 2019-02-05 03:51:13 -0500
commit: 7907db7627aafa35a45f7246e79f5369b6714828 (patch)
tree: e66dd3c36b1204d4ad5fee65882acb380bc92de1 /README.md
parent: 4041516e6cca8a44de6cb7f7e6feb0930df4c1b6 (diff)
download: kmeans-7907db7627aafa35a45f7246e79f5369b6714828.tar.xz kmeans-7907db7627aafa35a45f7246e79f5369b6714828.zip
1 files changed, 53 insertions, 2 deletions
diff --git a/README.md b/README.md
index 3667e8c..e43568f 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,15 @@
-KMeans Classifier Implementation
+K-Means Clustering
+==================
 
-Examples:
+[K-Means][kmeans] clustering implementation.  Features:
 
+* Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp].
+* Uses [Silhouette method][silhouette] to determine the optimum cluster
+  count.
+* No external dependencies.
+
+Examples
+--------
 Generate 1000 rows of test data, clustered around 3 points:
 
     # generate test data
@@ -22,3 +30,46 @@ PNGs of results:
       echo $i
       ./km-test kmeans $i $png_path > $dst_path
     done
+
+Initialization Methods
+----------------------
+Supported initialization methods:
+
+* `rand`: Pick random points as the initial cluster centroids.
+* [forgy][forgy]: Pick random points from the data set as the initial
+  cluster centroids.
+* [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method
+  to pick the initial cluster centroids.  This is the recommended
+  initialization method.
+
+Data File Format
+----------------
+Reads and writes newline-delimited plain text files in the following
+format:
+
+* Lines are delimited by newlines.
+* Each line is a record.
+* Record fields are delimited by whitespace.
+* The first row specifies the *shape* of the remaining rows as two
+  unsigned integers.  The first unsigned integer -- `num_floats` --
+  indicates the number of floating point columns per row, and the second
+  unsigned integer -- `num_ints` -- indicates the number of signed
+  integer values per row.
+* The remaining lines contain data rows.  Each row consists of
+  `num_floats` floating point numbers, followed by `num_ints` signed
+  integer values.
+
+Example data file:
+
+    3 0
+    1.2 3.6 5.2
+    9.8 6.5 4.3
+    3.2 5.6 8.7
+
+See the files in `tests/` for additional examples.  You can also use
+the top-level `gen-data.rb` script to generate additional test data.
+
+  [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering"
+  [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method"
+  [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method"
+  [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"
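
The data file format described in the README above can be illustrated with a small reader. This is only a sketch in Python, not the project's own parser: the function name `parse_data` is hypothetical, and skipping blank lines is an assumption that the README does not state.

```python
def parse_data(text):
    """Parse the README's text format into ((num_floats, num_ints), rows).

    Each row is returned as a (floats, ints) pair of lists.
    Assumption: blank lines are ignored (the README does not say).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("missing shape line")

    # First line: the shape, two unsigned integers.
    shape = lines[0].split()
    if len(shape) != 2:
        raise ValueError("shape line must contain exactly two integers")
    num_floats, num_ints = int(shape[0]), int(shape[1])
    if num_floats < 0 or num_ints < 0:
        raise ValueError("shape values must be unsigned")

    rows = []
    for ln in lines[1:]:
        # Fields are delimited by whitespace.
        fields = ln.split()
        if len(fields) != num_floats + num_ints:
            raise ValueError("row does not match shape: %r" % ln)
        floats = [float(f) for f in fields[:num_floats]]
        ints = [int(f) for f in fields[num_floats:]]
        rows.append((floats, ints))
    return (num_floats, num_ints), rows


EXAMPLE = """3 0
1.2 3.6 5.2
9.8 6.5 4.3
3.2 5.6 8.7
"""

shape, rows = parse_data(EXAMPLE)
```

Here `EXAMPLE` is the example data file from the README, so `shape` describes three floating point columns and no integer columns, and `rows` holds the three data rows.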
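
The three initialization methods listed in the README differ only in how the initial centroids are chosen. The following Python sketch shows one common reading of each; it is not taken from the project's source. All function names are hypothetical, and treating "random points" for `rand` as uniform draws inside the bounding box of the data is an assumption.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_rand(points, k, rng):
    """rand: random points, here drawn uniformly from the data's
    bounding box (assumption; the README only says "random points")."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    return [[rng.uniform(lo[d], hi[d]) for d in range(dims)]
            for _ in range(k)]


def init_forgy(points, k, rng):
    """forgy: k distinct rows of the data set, chosen at random."""
    return [list(p) for p in rng.sample(points, k)]


def init_kmeanspp(points, k, rng):
    """k-means++: first centroid uniformly at random from the data,
    each later one with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a centroid; fall back to uniform.
            centroids.append(list(rng.choice(points)))
        else:
            picked = rng.choices(points, weights=weights, k=1)[0]
            centroids.append(list(picked))
    return centroids


points = [[1.2, 3.6, 5.2], [9.8, 6.5, 4.3], [3.2, 5.6, 8.7], [0.1, 0.2, 0.3]]
rng = random.Random(0)
rand_c = init_rand(points, 2, rng)
forgy_c = init_forgy(points, 2, rng)
pp_c = init_kmeanspp(points, 2, rng)
```

Forgy and k-means++ always return rows of the data set, while `rand` may return points that do not occur in it. The distance weighting in k-means++ tends to spread the initial centroids apart, which is consistent with the README recommending it.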