Introduction

The knn package contains algorithms related to the K-Nearest-Neighbors (KNN) algorithm. Currently an implementation based on k-d trees is available, which allows working with large databases (as long as they fit into main memory), but only of low dimensionality.

Class knn.kdtree

This class is the basic implementation of the k-d tree for KNN classification. It allows you to work with matrices of data (as many matrices as you want). After all the data has been given, the tree is built. No insertion or removal operations are implemented over a built tree, but it is possible to push new data and build the tree again if needed. Once the tree is ready, it can be queried for the nearest neighbor or the K nearest neighbors.
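
A typical session follows the push → build → search pattern; the following is a condensed preview of the per-method examples documented below:

> kdt = knn.kdtree(6, random(9248))  -- 6-dimensional data
> kdt:push( matrix(100,6):uniformf(-1,1,random(1234)) )
> kdt:build()                        -- the tree is built here
> id,dist = kdt:searchNN( matrix(1,6):uniformf(-1,1,random(42)) )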

Constructor: kdt = knn.kdtree(D,random)

The constructor receives two values: a number D with the number of dimensions (columns) of your data, and a random object (used by the build method).

> D = 6
> kdt = knn.kdtree(D,random(9248))
> = kdt
instance 0x214ad70 of knn.kdtree

kdt = kdt:push(matrix)

This method receives a matrix with data ordered by rows, and returns the caller kdtree object. The matrix pointer is retained by the kdtree, but its data is not inserted into the structure yet; the build method is the one that performs the insertion.

> rnd = random(1234)
> m1  = matrix(100,D):uniformf(-1,1,rnd)
> m2  = matrix(200,D):uniformf(-1,1,rnd)
> kdt:push(m1)
> kdt:push(m2)

kdt = kdt:build()

This method processes all the given data matrices, and builds the k-d tree structure.

> kdt:build() -- after that, it is ready for queries

id,distance = kdt:searchNN(matrix)

This method queries a built kdtree object for the nearest neighbor. It receives a bi-dimensional matrix of size 1xD, and returns two values:

  • id is the position of the nearest neighbor. If you concatenate all the pushed matrices by rows, this number is the row of the corresponding sample in the concatenated data.

  • distance is a number with the squared Euclidean distance.

> id,dist = kdt:searchNN(matrix(1,D):uniformf(-1,1,rnd))
> = id
26
> -- id 26 is located in the first matrix
> = dist
0.49883383064235
> id,dist = kdt:searchNN(matrix(1,D):uniformf(-1,1,rnd))
> = id
178
> -- id 178 is located in the second matrix
> = dist
0.3402419188674

point,matrix = kdt:get_point_matrix(id)

This method receives a number id of a point in the kdtree object (as returned by the kdt:searchNN(...) method), and returns point, a matrix of size 1xD with the corresponding point data, and the matrix object where the point is contained. Be careful: this method returns a reference to the original data, so any change in the data will lead to unexpected behavior of the kdtree object.

> NN = kdt:get_point_matrix(id)
> = NN
 0.657501   -0.604099    0.426221   -0.421949   -0.32904     0.75809   
# Matrix of size [1,6] in row_major [0xc80410 data= 0xc754b0]
> = m2(id-100,':')
 0.657501   -0.604099    0.426221   -0.421949   -0.32904     0.75809   
# Matrix of size [1,6] in row_major [0xc7ceb0 data= 0xc754b0]
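
If you need to modify the returned point, work on a copy so the kdtree data stays intact. Assuming the matrix class exposes the usual clone() method, a detached copy would be:

> safe = NN:clone() -- changes in safe won't touch the kdtree data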

result = kdt:searchKNN(K,matrix)

This method performs the K-Nearest-Neighbors search, using the given K as the number of neighbors and the given matrix (of size 1xD) as the query point. The method returns a table of {id, distance} pairs.

> result = kdt:searchKNN(4,matrix(1,D):uniformf(-1,1,rnd))
> = result
table: 0x197caa0
> for i=1,#result do print(result[i][1], result[i][2]) end
152	0.42534565841526
40	0.54756417013329
101	0.5931407024824
166	0.66157509210318

class = knn.kdtree.classifyKNN(result, get_class_function)

This method receives the result of the kdt:searchKNN(...) method and a function which transforms a pattern id into a class. All the result pairs are traversed, computing the corresponding class for each id, and the majority-vote class is returned. Normally, get_class_function will be a function which looks up the class of the given id in a dataset, a Lua table, or a matrix. In some tasks, because of the order of the data, it is possible to compute the class with a mathematical operation, as in the example below. How to implement this function depends on your data and your experiment framework; a table-based sketch is shown after the example.

> best_class = knn.kdtree.classifyKNN(result,
                                      function(id) return id%10 end)
> = best_class
2
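
When the classes do not follow such an arithmetic pattern, get_class_function is usually a table lookup. A minimal sketch, assuming a hypothetical Lua table train_labels whose i-th entry is the class of training row i:

> train_labels = {} -- hypothetical: train_labels[id] holds the class of row id
> -- ... fill train_labels from your corpus ...
> best_class = knn.kdtree.classifyKNN(result,
                                      function(id) return train_labels[id] end)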

table,cls,best = knn.kdtree.posteriorKNN(result, get_cls_func)

This method receives the result of the kdt:searchKNN(...) method and a function get_cls_func which transforms a pattern id into a class. All the result pairs are traversed, computing the class posterior probabilities. A table with {class, log posterior} pairs is returned. The posteriors are computed by taking the negated squared Euclidean distances as log-scores and normalizing over all the log-scores in the result table; a rough sketch of this computation is shown after the example below. In theory, the larger the number of neighbors K, the better the posterior estimate; in practice, values of K between 10 and 20 may be enough. Note that the posteriors table is traversed using pairs, not ipairs, for two reasons: first, the class identifier could be anything, not only a number, depending on your get_cls_func; and second, even with numeric class identifiers, not all classes have to be present in the result table, so the posteriors table is not an array and could contain gaps.

In addition to the posteriors table, this function returns the class cls with the maximum posterior and its value best.

> result = kdt:searchKNN(20,matrix(1,D):uniformf(-1,1,rnd))
> posteriors,bestcls,best = knn.kdtree.posteriorKNN(result,
                                                    function(id) return id%10 end)
> for c,p in pairs(posteriors) do print(c,p) end
1	-3.3052420168408
2	-2.7092774252334
4	-2.289865236587
5	-2.4045060888367
6	-1.8979527591223
7	-2.0151971414884
8	-2.0497985519464
0	-2.2390630830692
9	-1.6785405659992
> -- in this example, the best class is 9, which has the maximum posterior
> print(bestcls,best)
9	-1.6785405659992
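
For intuition, the normalization described above amounts to a log-softmax over the negated squared distances, followed by a sum of the probability mass of each class. A rough sketch in plain Lua (ours, not the library's code; the function name is illustrative):

> function sketch_posteriorKNN(result, get_cls_func)
    -- log-sum-exp normalizer over the negated squared distances
    local max_s = -math.huge
    for i=1,#result do max_s = math.max(max_s, -result[i][2]) end
    local Z = 0
    for i=1,#result do Z = Z + math.exp(-result[i][2] - max_s) end
    local logZ = max_s + math.log(Z)
    -- accumulate per-class probability mass, then go back to log scale
    local mass = {}
    for i=1,#result do
      local c = get_cls_func(result[i][1])
      mass[c] = (mass[c] or 0) + math.exp(-result[i][2] - logZ)
    end
    local posteriors = {}
    for c,m in pairs(mass) do posteriors[c] = math.log(m) end
    return posteriors
  end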

predicted,logp = knn.kdtree.regressionKNN(result, get_tgt_func)

This function receives a result table and a get_tgt_func, and computes predicted, a weighted mean of the neighbor targets in result, weighted by posteriors computed in a similar fashion to the posteriorKNN function, but ignoring the marginalization over classes. A rough sketch of this computation is shown after the example below.

The get_tgt_func is a Lua function which receives a training pattern id and returns the target (predicted) value associated with it. The returned value must be an instance of matrix, a Lua table, or a Lua number. In any case, the predicted value will be a matrix.

Additionally, this function returns logp, a value with the log-posterior of every neighbor.

> function price(id)
    -- DO_STUFF
    return price_of_id
  end
> predicted = knn.kdtree.regressionKNN(result, price)
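
As a rough sketch of the computation just described (ours, not the library's code; the function name is illustrative), assuming scalar targets:

> function sketch_regressionKNN(result, get_tgt_func)
    -- per-neighbor posterior weights (no marginalization over classes)
    local max_s = -math.huge
    for i=1,#result do max_s = math.max(max_s, -result[i][2]) end
    local Z = 0
    for i=1,#result do Z = Z + math.exp(-result[i][2] - max_s) end
    -- weighted mean of the neighbor targets
    local predicted = 0
    for i=1,#result do
      local w = math.exp(-result[i][2] - max_s) / Z
      predicted = predicted + w * get_tgt_func(result[i][1])
    end
    return predicted
  end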