Diffstat (limited to 'README.md')
-rw-r--r-- README.md 55
1 files changed, 53 insertions, 2 deletions
diff --git a/README.md b/README.md
index 3667e8c..e43568f 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,15 @@
-KMeans Classifier Implementation
+K-Means Clustering
+==================
-Examples:
+[K-Means][kmeans] clustering implementation. Features:
+* Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp].
+* Uses the [silhouette method][silhouette] to determine the optimal
+ cluster count.
+* No external dependencies.
+
+Examples
+--------
Generate 1000 rows of test data, clustered around 3 points:
# generate test data
@@ -22,3 +30,46 @@ PNGs of results:
echo $i
./km-test kmeans $i $png_path > $dst_path
done
+
+Initialization Methods
+----------------------
+Supported initialization methods:
+
+* `rand`: Pick random points as the initial cluster centroids.
+* [forgy][forgy]: Pick random points from the data set as the initial
+ cluster centroids.
+* [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method
+ to pick the initial cluster centroids. This is the recommended
+ initialization method.
+
+Data File Format
+----------------
+Reads and writes newline-delimited plain text files in the following
+format:
+
+* Lines are delimited by newlines.
+* Each line is a record.
+* Record fields are delimited by whitespace.
+* The first row specifies the *shape* of the remaining rows as two
+ unsigned integers. The first unsigned integer -- `num_floats` --
+ indicates the number of floating point columns per row, and the second
+ unsigned integer -- `num_ints` -- indicates the number of signed
+ integer values per row.
+* The remaining lines contain data rows. Each row consists of
+ `num_floats` floating point numbers, followed by `num_ints` signed
+ integer values.
+
+Example data file:
+
+ 3 0
+ 1.2 3.6 5.2
+ 9.8 6.5 4.3
+ 3.2 5.6 8.7
+
+See the files in `tests/` for additional examples. You can also use
+the top-level `gen-data.rb` script to generate additional test data.
+
+ [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering"
+ [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method"
+ [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method"
+ [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"
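
The "Data File Format" section added in the hunk above can be illustrated with a small reader. This is a minimal sketch in Python, not the project's own parser: the function name `parse_data` and the returned tuple layout are hypothetical, and skipping blank lines is an assumption the README does not state.

```python
def parse_data(text):
    """Parse a data file: a shape line, then rows of floats followed by ints.

    Returns (num_floats, num_ints, rows), where each row is a
    (floats, ints) pair. The name and return layout are illustrative only.
    """
    # Assumption: blank lines are ignored (the README does not say).
    lines = [(n, ln) for n, ln in enumerate(text.splitlines(), start=1)
             if ln.strip()]
    if not lines:
        raise ValueError("missing shape line")

    # First line: two unsigned integers, num_floats and num_ints.
    shape = lines[0][1].split()
    if len(shape) != 2:
        raise ValueError("shape line must contain exactly two integers")
    num_floats, num_ints = int(shape[0]), int(shape[1])
    if num_floats < 0 or num_ints < 0:
        raise ValueError("shape values must be unsigned")

    rows = []
    for n, line in lines[1:]:
        # Fields are delimited by whitespace.
        fields = line.split()
        if len(fields) != num_floats + num_ints:
            raise ValueError(f"line {n}: expected {num_floats + num_ints} "
                             f"fields, got {len(fields)}")
        floats = [float(f) for f in fields[:num_floats]]
        ints = [int(f) for f in fields[num_floats:]]
        rows.append((floats, ints))
    return num_floats, num_ints, rows
```

Applied to the README's example file, `parse_data("3 0\n1.2 3.6 5.2\n9.8 6.5 4.3\n3.2 5.6 8.7\n")` yields a shape of 3 floats and 0 ints, followed by three rows whose integer lists are empty.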