-
Notifications
You must be signed in to change notification settings - Fork 12
70 clustering
Package clustering.kmeans.matrix
, available in Lua with
require("aprilann.clustering.kmeans.matrix")
.
The package clustering
is developed to contain different clustering
implementations. Currently, only one is available.
This package contains an implementation of k-means clustering designed to be very efficient with large databases (as large as it could be contained in your main memory), but with a small number of clusters.
This is the main function of the clustering algorithm. It receives a table with
different input arguments, and depending on them, the algorithm could be
specialized. The function returns a number with the distortion of the clustering
result, and a matrix
with the centroids (ordered by rows).
In this case, the function receives a matrix
with the initial set of
centroids. The given table argument must contain the following fields:
-
data
: it is amatrix
where the data points are ordered by rows. -
centroids
: it is amatrix
with the initial value of the centroids, oredered by rows. -
max_iter=100
: the maximum number of clustering iterations, by default it ismax_iter=100
. -
threshold=1e-05
: the threshold for stopping clustering algorithm, by default it isthreshold=1e-05
. -
verbose=false
: a boolean indicating verbosity.
The algorithm is executed and the given centroids matrix
will be updated with
the newer centroids, resulting from the clustering algorithm.
> data = matrix(5,2):linear()
> clusters = matrix(2,2,{0,0, 1,1})
> = clusters
0 0
1 1
# Matrix of size [2,2] in row_major [0x1131c30 data= 0x1131d00]
> res, C = clustering.kmeans.matrix{
data = data,
centroids = clusters
}
> = res
3.999998196261
> = C
1 2
6 7
# Matrix of size [2,2] in row_major [0x1131c30 data= 0x1131d00]
In this case, the algorithm is implemented in two parts, first the refine
algorithm published by
P.S. Bradley and U.M. Fayyad
is used to initialize the centroids matrix
. After that, the standard
clustering algorithm will be used. The given table argument must contain the
following fields:
-
data
: it is amatrix
where the data points are ordered by rows. -
K
: a number indicating how many clusters you want to compute using refine algorithm. -
random
: arandom
object used by the refine algorithm. -
subsamples=10
: how many random subsamples of the data will be used in the refine algorithm, by default it issubsamples=10
. -
percentage=0.01
: a percentage of the data used by refine algorithm, by default it ispercentage=0.01
, that is, a 1%. -
max_iter=100
: the maximum number of clustering iterations, by default it ismax_iter=100
. -
threshold=1e-05
: the threshold for stopping clustering algorithm, by default it isthreshold=1e-05
. -
verbose=false
: a boolean indicating verbosity.
The algorithm is executed and a centroids matrix
will be returned.
> data = matrix(5,2):linear()
> res,C = clustering.kmeans.matrix{
data = data,
K = 2,
random = random(1234),
}
> = res
3.999998196261
> = C
2 3
7 8
# Matrix of size [2,2] in row_major [0x25d1180 data= 0x25d1250]
This function classifies X
(a matrix
with data) in the closest centroid C
(a matrix
with the centroids), and returns the score
of the classification
and the tags T
(a matrixInt32
). The function receives positional arguments:
-
X
: amatrix
with the data ordered by rows (N rows, D columns). -
C
: amatrix
with the centroids ordered by rows (K centroids). -
T
: amatrixInt32
with size Nx1, which contains for every row ofX
the number of its closest centroid. This argument is optional, if not given, a newmartrixInt32
will be allocated. -
verbose=false
: a boolean indicating if verbosity is desired. By default it isfalse
> = data
0 1
2 3
4 5
6 7
8 9
# Matrix of size [5,2] in row_major [0x1717960 data= 0x1714f10]
> score,T = clustering.kmeans.matrix.find_clusters(data,C)
> = score
-51.4
> = T
1
1
1
2
2
# MatrixInt32 of size [5,1] in row_major [0x2862a70 data= 0x282e830]