aboutsummaryrefslogtreecommitdiff
path: root/README.md
blob: e43568fff472bb67ef358232626905052966148f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
K-Means Clustering
==================

[K-Means][kmeans] clustering implementation.  Features:

* Initialization methods: random, [forgy][forgy], and [k-means++][kmeanspp].
* Uses [Silhouette method][silhouette] to determine the optimum cluster
  count.
* No external dependencies.

Examples
--------
Generate 1000 rows of test data, clustered around 3 points:

    # generate test data
    ./gen-data.rb 1000 3 > example.dat

    # classify data, plot results as png
    ./km-test kmeans example.dat example.png

Run all tests and save best clustering in current directory along with
PNGs of results:

    for i in tests/*dat; do
      # build paths
      png_path=kmeans-$(basename ${i/dat/png})
      dst_path=kmeans-$(basename $i)

      # run test
      echo $i
      ./km-test kmeans $i $png_path > $dst_path
    done

Initialization Methods
----------------------
Supported initialization methods:

* `rand`: Pick random points as the initial cluster centroids.
* [forgy][forgy]: Pick random points from the data set as the initial
  cluster centroids.
* [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method
  to pick the initial cluster centroids.  This is the recommended
  initialization method.

Data File Format
----------------
Reads and writes newline-delimited plain text files in the following
format:

* Lines are delimited by newlines
* Each line is a record.
* Record fields are delimited by whitespace.
* The first row specifies the *shape* of the remaining rows as two
  unsigned integers.  The first unsigned integer -- `num_floats` --
  indicates the number of floating point columns per row, and the second
  unsigned integer -- `num_ints` -- indicates the number of signed
  integer values per row.
* The remaining lines contain data rows.  Each row consists of
  `num_floats` floating point numbers, followed by `num_ints` signed
  integer values.

Example data file:

    3 0
    1.2 3.6 5.2
    9.8 6.5 4.3
    3.2 5.6 8.7

See the files in `tests/` for additional examples.  You can also use
the top-level `gen-data.rb` script to generate additional test data.

  [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering"
  [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method"
  [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method"
  [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"