knn
The package knn contains algorithms for the K-Nearest-Neighbors (KNN) method. Currently, an implementation based on a k-d tree is available, which allows working with large databases (as long as they fit into main memory), but only with low dimensionality.
This class is the basic implementation of the k-d tree for KNN classification. It works with matrices of data (as many matrices as you want). After all the data has been given, the tree is built. No incremental insertion or removal operations are implemented, but it is possible to push new data and build the tree again if needed. Once the tree is ready, it can be queried for the nearest neighbor or the K nearest neighbors.
The constructor receives two values: a number D with the number of dimensions (columns) of your data, and a random object (used by the build method).
> D = 6
> kdt = knn.kdtree(D,random(9248))
> = kdt
instance 0x214ad70 of knn.kdtree
The push method receives a matrix with data ordered by rows, and returns the caller kdtree object. The matrix pointer is retained by the kdtree, but the data is not inserted into the structure yet; the build method performs the insertion.
> rnd = random(1234)
> m1 = matrix(100,D):uniformf(-1,1,rnd)
> m2 = matrix(200,D):uniformf(-1,1,rnd)
> kdt:push(m1)
> kdt:push(m2)
The build method processes all the given data matrices and builds the k-d tree structure.
> kdt:build() -- after that, it is ready for queries
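As noted above, no incremental insertion is available, but the tree can be rebuilt after pushing more data. A minimal sketch of that pattern, reusing the objects defined above (the matrix m3 is hypothetical; the examples below assume only m1 and m2 were pushed):
> m3 = matrix(50,D):uniformf(-1,1,rnd)
> kdt:push(m3)
> kdt:build() -- rebuilds the tree including the new data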
The searchNN method queries a built kdtree object for the nearest neighbor. It receives a bi-dimensional matrix of size 1xD, and returns two values:
- id is the position of the nearest neighbor. If you take all the pushed matrices ordered by rows, this number is the row corresponding to the sample in the concatenated data.
- distance is a number with the squared Euclidean distance.
> id,dist = kdt:searchNN(matrix(1,D):uniformf(-1,1,rnd))
> = id
26
> -- id 26 is located in the first matrix
> = dist
0.49883383064235
> id,dist = kdt:searchNN(matrix(1,D):uniformf(-1,1,rnd))
> = id
178
> -- id 178 is located in the second matrix
> = dist
0.3402419188674
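Because id indexes the row-wise concatenation of all pushed matrices, a small helper can map a global id back to the matrix that contains it. A sketch assuming the push order m1, m2 used above:
> function locate(id)
    if id <= m1:dim(1) then return m1, id end
    return m2, id - m1:dim(1)
  end
> m,row = locate(178) -- m is m2, row is 78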
The get_point_matrix method receives the number id of a point in the kdtree object (as returned by the kdt:searchNN(...) method), and returns a point, which is a matrix of size 1xD with the corresponding point data, together with the matrix object where the point is contained. Be careful: this method returns a reference to the original data, so any change to the data will lead to unexpected behavior of the kdtree object.
> NN = kdt:get_point_matrix(id)
> = NN
0.657501 -0.604099 0.426221 -0.421949 -0.32904 0.75809
# Matrix of size [1,6] in row_major [0xc80410 data= 0xc754b0]
> = m2(id-100,':')
0.657501 -0.604099 0.426221 -0.421949 -0.32904 0.75809
# Matrix of size [1,6] in row_major [0xc7ceb0 data= 0xc754b0]
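Since the returned point is a reference into the pushed data, a safe pattern is to clone it before any modification (a sketch, assuming the standard matrix clone and scal methods):
> safe_NN = NN:clone() -- deep copy of the 1xD point
> safe_NN:scal(2.0)    -- changing the copy leaves the kdtree data untouched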
The searchKNN method performs the K-Nearest-Neighbors search, using the given K as the number of neighbors and the given matrix (of size 1xD) as the data point. The method returns a table with pairs {id, distance}.
> result = kdt:searchKNN(4,matrix(1,D):uniformf(-1,1,rnd))
> = result
table: 0x197caa0
> for i=1,#result do print(result[i][1], result[i][2]) end
152 0.42534565841526
40 0.54756417013329
101 0.5931407024824
166 0.66157509210318
The classifyKNN method receives the result of the kdt:searchKNN(...) method and a function which transforms a pattern id into a class. All the result pairs are traversed, computing the corresponding class from each id, and the majority-vote class is returned. Normally, the get_class_function will be a function which looks into a dataset, a Lua table, or a matrix for the class of the corresponding id number. In some tasks, because of the order of the data, it is possible to compute the class with a math operation. How to implement this function depends on your data and your experiment framework.
> best_class = knn.kdtree.classifyKNN(result,
function(id) return id%10 end)
> = best_class
2
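When the class of every training pattern is stored explicitly, for instance in a hypothetical Lua table labels indexed by pattern id, the function can be a plain lookup:
> labels = {} -- hypothetical: labels[id] = class of training pattern id
> -- fill labels from your experiment framework, then:
> best_class = knn.kdtree.classifyKNN(result,
                                      function(id) return labels[id] end)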
The posteriorKNN method receives the result of the kdt:searchKNN(...) method and a function get_cls_func which transforms a pattern id into a class. All the result pairs are traversed, computing the class posterior probabilities. A table with pairs of {class, log posterior} is returned. The posterior is computed by taking the negative of the squared Euclidean distances as log-scores, and normalizing over all the log-scores in the result table. In theory, the more K neighbors are taken, the better the posterior estimate will be. In practice, however, values of K between 10 and 20 may be enough. Note that the posteriors table is traversed using pairs, not ipairs, for two reasons: first, the class identifier could be anything, not only a number; it depends on your get_class_function. Second, even with numeric class identifiers, not all the classes have to be present in the result table, so the posteriors table is not an array and may contain gaps. In addition to the posteriors table, this function returns the maximum-posterior class cls and its value best.
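The following is a minimal sketch of how such log posteriors could be computed by hand from the result pairs {id, distance}; it illustrates the math above and is not the library implementation:
> function naive_posteriorKNN(result, get_cls_func)
    -- log-sum-exp normalizer over all log-scores (-distance)
    local max_s = -math.huge
    for i=1,#result do max_s = math.max(max_s, -result[i][2]) end
    local acc = 0
    for i=1,#result do acc = acc + math.exp(-result[i][2] - max_s) end
    local logZ = max_s + math.log(acc)
    -- accumulate the normalized probability of each class, return logs
    local post = {}
    for i=1,#result do
      local c = get_cls_func(result[i][1])
      post[c] = (post[c] or 0) + math.exp(-result[i][2] - logZ)
    end
    for c,p in pairs(post) do post[c] = math.log(p) end
    return post
  end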
> result = kdt:searchKNN(20,matrix(1,D):uniformf(-1,1,rnd))
> posteriors,bestcls,best = knn.kdtree.posteriorKNN(result,
function(id) return id%10 end)
> for c,p in pairs(posteriors) do print(c,p) end
1 -3.3052420168408
2 -2.7092774252334
4 -2.289865236587
5 -2.4045060888367
6 -1.8979527591223
7 -2.0151971414884
8 -2.0497985519464
0 -2.2390630830692
9 -1.6785405659992
> -- in this example, the best class is 9, which has the maximum posterior
> print(bestcls,best)
9 -1.6785405659992
The regressionKNN function receives a result table and a get_tgt_func, and computes a prediction which is a weighted mean of the neighbors in result, weighted by posteriors computed in a similar fashion to the posteriorKNN function, but ignoring the marginalization over classes. The get_tgt_func is a Lua function which receives a training pattern id and returns the target (predicted) value associated with it. The returned value must be an instance of matrix, a Lua table, or a Lua number. In any case, the predicted value will be a matrix. Additionally, this function returns logp with the log-posterior of every neighbor.
> function price(id)
    -- DO_STUFF
    return price_of_id
  end
> predicted = knn.kdtree.regressionKNN(result, price)
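As a concrete variant, if the targets live row-wise in a matrix aligned with the pushed data (the tgt matrix below is hypothetical), get_tgt_func can slice the corresponding row; the second returned value collects the log-posterior of every neighbor:
> tgt = matrix(300,1):uniformf(0,100,rnd) -- hypothetical targets, one row per pattern
> predicted,logp = knn.kdtree.regressionKNN(result,
                                            function(id) return tgt(id,':') end)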