aboutsummaryrefslogtreecommitdiff
path: root/README.md
blob: ea107cd0855ecf9f768812c55c6a702ff12548ac (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
K-Means Clustering
==================

[K-Means][kmeans] clustering implementation.  Features:

* Initialization methods: random, [forgy][forgy], and
  [k-means++][kmeanspp].
* Uses the [Silhouette method][silhouette] to determine the optimum
  cluster count.
* No external dependencies.

Examples
--------
Generate 1000 rows of test data, clustered around 3 points:

    # generate test data
    ./gen-data.rb 1000 3 > example.dat

    # classify data, plot results as png
    ./km-test kmeans example.dat example.png

Run all tests and save best results in current directory along with
PNGs of results:

    for i in tests/*dat; do
      # path to output data file
      # (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.dat")
      dst_path=kmeans-$(basename $i)

      # path to output png
      # (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.png")
      png_path=kmeans-$(basename ${i/dat/png})

      # run test (use kmeans for initialization)
      echo $i
      ./km-test kmeans $i $png_path > $dst_path
    done

Initialization Methods
----------------------
Supported initialization methods:

* `rand`: Pick random points as the initial cluster centroids.
* [forgy][forgy]: Pick random points from the data set as the initial
  cluster centroids.
* [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method
  to pick the initial cluster centroids.  This is the recommended
  initialization method.

It's probably best to just stick with [kmeans][kmeanspp] unless you know
what you're doing.

Data File Format
----------------
Reads and writes newline-delimited plain text files in the following
format:

* Each line is a row.
* Each row consists of one or more columns, delimited by a space.
* Columns are floating point or integer values.
* The first row is called the header row.
* The header row contains two unsigned integer columns which indicate
  the layout of the remaining rows.
* The first header row column indicates the number of floating
  point columns per row (`num_floats`).
* The second header row column indicates the number of integer
  columns per row (`num_ints`).
* The remaining rows contain `num_floats` floating point
  columns, followed by `num_ints` signed integer columns.

Example data file:

    3 0
    1.2 3.6 5.2
    9.8 6.5 4.3
    3.2 5.6 8.7

See the files in `tests/` for additional example data files.  You can
also use the top-level `gen-data.rb` script to generate additional test
data.

Test Data
---------
The test data files in the `tests/` directory use the following naming
convention:

    c<num_clusters>-<num_rows>-<N>.dat

Where:

* `num_clusters`: Number of clusters.
* `num_rows`: Number of rows, in exponent notation (`1e3` = 1000, `1e4`
  = 10000, etc).
* `N`: Distinguishing suffix (usually 0).

  [kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering"
  [kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method"
  [forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method"
  [silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"