blob: ea107cd0855ecf9f768812c55c6a702ff12548ac (
plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
|
K-Means Clustering
==================
[K-Means][kmeans] clustering implementation. Features:
* Initialization methods: random, [forgy][forgy], and
[k-means++][kmeanspp].
* Uses the [Silhouette method][silhouette] to determine the optimum
cluster count.
* No external dependencies.
Examples
--------
Generate 1000 rows of test data, clustered around 3 points:
# generate test data
./gen-data.rb 1000 3 > example.dat
# classify data, plot results as png
./km-test kmeans example.dat example.png
Run all tests and save best results in current directory along with
PNGs of results:
for i in tests/*dat; do
# path to output data file
# (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.dat")
dst_path=kmeans-$(basename $i)
# path to output png
# (ex: src: "tests/c3-1e2-0.dat", dst: "kmeans-c3-1e3-0.png")
png_path=kmeans-$(basename ${i/dat/png})
# run test (use kmeans for initialization)
echo $i
./km-test kmeans $i $png_path > $dst_path
done
Initialization Methods
----------------------
Supported initialization methods:
* `rand`: Pick random points as the initial cluster centroids.
* [forgy][forgy]: Pick random points from the data set as the initial
cluster centroids.
* [kmeans][kmeanspp]: Use the [k-means++][kmeanspp] initialization method
to pick the initial cluster centroids. This is the recommended
initialization method.
It's probably best to just stick with [kmeans][kmeanspp] unless you know
what you're doing.
Data File Format
----------------
Reads and writes newline-delimited plain text files in the following
format:
* Each line is a row.
* Each row consists of one or more columns, delimited by a space.
* Columns are floating point or integer values.
* The first row is called the header row.
* The header row contains two unsigned integer columns which indicate
the layout of the remaining rows.
* The first header row column indicates the number of floating
point columns per row (`num_floats`).
* The second header row column indicates the number of integer
columns per row (`num_ints`).
* The remaining rows contain `num_floats` floating point
columns, followed by `num_ints` signed integer columns.
Example data file:
3 0
1.2 3.6 5.2
9.8 6.5 4.3
3.2 5.6 8.7
See the files in `tests/` for additional example data files. You can
also use the top-level `gen-data.rb` script to generate additional test
data.
Test Data
---------
The test data files in the `tests/` directory use the following naming
convention:
c<num_clusters>-<num_rows>-<N>.dat
Where:
* `num_clusters`: Number of clusters.
* `num_rows`: Number of rows, in exponent notation (`1e3` = 1000, `1e4`
= 10000, etc).
* `N`: Distinguishing suffix (usually 0).
[kmeans]: https://en.wikipedia.org/wiki/K-means_clustering "K-Means clustering"
[kmeanspp]: https://en.wikipedia.org/wiki/K-means%2B%2B "k-means++ initialization method"
[forgy]: https://en.wikipedia.org/wiki/K-means_clustering#Initialization_methods "Forgy initialization method"
[silhouette]: https://en.wikipedia.org/wiki/Silhouette_%28clustering%29 "Silhouette method"
|