Skip to content
Paco Zamora Martinez edited this page Jan 23, 2014 · 2 revisions

Introduction

Package clustering.kmeans.matrix, available in Lua with require("aprilann.clustering.kmeans.matrix").

The package clustering is developed to contain different clustering implementations. Currently, only one is available.

Package clustering.kmeans.matrix

This package contains an implementation of k-means clustering designed to be very efficient with large databases (as large as it could be contained in your main memory), but with a small number of clusters.

distortion,centroids = clustering.kmeans.matrix{ ... }

This is the main function of the clustering algorithm. It receives a table with different input arguments, and depending on them, the algorithm could be specialized. The function returns a number with the distortion of the clustering result, and a matrix with the centroids (ordered by rows).

Starting from a set of known centroids

In this case, the function receives a matrix with the initial set of centroids. The given table argument must contain the following fields:

  • data: it is a matrix where the data points are ordered by rows.

  • centroids: it is a matrix with the initial value of the centroids, oredered by rows.

  • max_iter=100: the maximum number of clustering iterations, by default it is max_iter=100.

  • threshold=1e-05: the threshold for stopping clustering algorithm, by default it is threshold=1e-05.

  • verbose=false: a boolean indicating verbosity.

The algorithm is executed and the given centroids matrix will be updated with the newer centroids, resulting from the clustering algorithm.

> data = matrix(5,2):linear()
> clusters = matrix(2,2,{0,0,  1,1})
> = clusters
 0           0         
 1           1         
# Matrix of size [2,2] in row_major [0x1131c30 data= 0x1131d00]
> res, C = clustering.kmeans.matrix{ 
    data = data,
    centroids = clusters
  }
> = res
3.999998196261
> = C
 1           2         
 6           7         
# Matrix of size [2,2] in row_major [0x1131c30 data= 0x1131d00]

Starting from scratch

In this case, the algorithm is implemented in two parts, first the refine algorithm published by P.S. Bradley and U.M. Fayyad is used to initialize the centroids matrix. After that, the standard clustering algorithm will be used. The given table argument must contain the following fields:

  • data: it is a matrix where the data points are ordered by rows.

  • K: a number indicating how many clusters you want to compute using refine algorithm.

  • random: a random object used by the refine algorithm.

  • subsamples=10: how many random subsamples of the data will be used in the refine algorithm, by default it is subsamples=10.

  • percentage=0.01: a percentage of the data used by refine algorithm, by default it is percentage=0.01, that is, a 1%.

  • max_iter=100: the maximum number of clustering iterations, by default it is max_iter=100.

  • threshold=1e-05: the threshold for stopping clustering algorithm, by default it is threshold=1e-05.

  • verbose=false: a boolean indicating verbosity.

The algorithm is executed and a centroids matrix will be returned.

> data = matrix(5,2):linear()
> res,C = clustering.kmeans.matrix{ 
    data = data,
    K = 2,
    random = random(1234),
  }
> = res
3.999998196261
> = C
 2           3         
 7           8         
# Matrix of size [2,2] in row_major [0x25d1180 data= 0x25d1250]

score,T=clustering.kmeans.matrix.find_clusters(X,C [,T [,verbose]])

This function classifies X (a matrix with data) in the closest centroid C (a matrix with the centroids), and returns the score of the classification and the tags T (a matrixInt32). The function receives positional arguments:

  1. X: a matrix with the data ordered by rows (N rows, D columns).

  2. C: a matrix with the centroids ordered by rows (K centroids).

  3. T: a matrixInt32 with size Nx1, which contains for every row of X the number of its closest centroid. This argument is optional, if not given, a new martrixInt32 will be allocated.

  4. verbose=false: a boolean indicating if verbosity is desired. By default it is false

> = data
 0           1         
 2           3         
 4           5         
 6           7         
 8           9         
# Matrix of size [5,2] in row_major [0x1717960 data= 0x1714f10]
> score,T = clustering.kmeans.matrix.find_clusters(data,C)
> = score
-51.4
> = T
          1
          1
          1
          2
          2
# MatrixInt32 of size [5,1] in row_major [0x2862a70 data= 0x282e830]
Clone this wiki locally