The package `dataset` could be loaded via the standalone binary, or in Lua with `require("aprilann.dataset")`.
The `dataset` table is a namespace and a Lua abstract class which adds an abstraction layer of sets of patterns on top of multi-dimensional matrices. It also allows pattern pre-processing, union and join operations of different datasets, an identity matrix dataset, and so on.
Every dataset implements the following methods:

- `number = ds:numPatterns()`, it returns the number of patterns in the given `ds` dataset.
- `number = ds:patternSize()`, it returns the size of one pattern.
- `table = ds:getPattern(i)`, it receives a number between 1 and `numPatterns()`, and returns a table with the i-th pattern.
- `ds:putPattern(i,t)`, it receives a number between 1 and `numPatterns()`, and a table with `patternSize()` numbers, and overwrites the i-th pattern with the given table.
- `iterator = ds:patterns()`, an iterator function to use in Lua `for` statements: `for i,t in ds:patterns() do ... end`.
- `table = ds:mean()`, it returns the mean of each pattern component.
- `table,table = ds:mean_deviation()`, it returns the mean and standard deviation of each pattern component.
- `number,number = ds:min_max()`, it returns the minimum and maximum values of the dataset.
- `ds:normalize_mean_deviation()`, it receives two tables of `patternSize()` length, the first with means and the second with standard deviations, and normalizes the data subtracting the mean and dividing by the standard deviation.
- `matrix = ds:toMatrix()`, it returns a newly allocated bi-dimensional `matrix` object which contains all the dataset patterns (`numPatterns()` rows and `patternSize()` columns).
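As a quick illustration of this interface, here is a minimal interactive sketch using a `dataset.matrix` (described in the next section); the outputs are the ones expected from the definitions above:

```lua
> -- three patterns of size two, taken from the rows of a 3x2 matrix
> ds = dataset.matrix( matrix(3,2,{0,1, 2,3, 10,11}) )
> print(ds:numPatterns(), ds:patternSize())
3	2
> print(table.concat(ds:getPattern(2), ","))
2,3
> -- mean of each component: (0+2+10)/3 and (1+3+11)/3
> print(table.concat(ds:mean(), ","))
4,5
```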
The `dataset.matrix` is the most important kind of dataset. It allows to create patterns by moving a multi-dimensional window through a `matrix` object. The dataset takes the `matrix` by reference, so any change in the `matrix` will be reflected in the patterns produced by the `dataset`:
```lua
xor_in = matrix(4,2, {0,0,
                      0,1,
                      1,0,
                      1,1})
xor_out = matrix(4, {0, 1, 1, 0})
-- by default, dataset.matrix traverses the matrix by rows
ds_xor_in = dataset.matrix(xor_in)
ds_xor_out = dataset.matrix(xor_out)
```
For a given matrix with dimensions n1,n2,...,nK, by default the dataset contains n1 patterns of size n2 x ... x nK. For a bi-dimensional matrix, this is a row-major order traversal. For a vector, it is a traversal of all its elements:
```lua
> a = matrix(2, 2, {1,2,3,4})
> b = dataset.matrix(a)
> for i,j in b:patterns() do print(table.concat(j,",")) end
1,2
3,4
> a = matrix(2,2,2,{1,2,3,4,5,6,7,8})
> b = dataset.matrix(a)
> for i,j in b:patterns() do print(table.concat(j,",")) end
1,2,3,4
5,6,7,8
> a = matrix(4,{1,2,3,4})
> b = dataset.matrix(a)
> for i,j in b:patterns() do print(table.concat(j,",")) end
1
2
3
4
```
Up to this point, no benefit of `dataset` over `matrix` has been presented. We are going to show that, for the same given `matrix`, we can generate several different `dataset` objects by modifying some parameters which have been taken by default until now.
When we instantiate a `dataset.matrix`, the first argument is a K-dimensional `matrix` with size n1 x n2 x ... x nK. The second argument could be a Lua table with the following fields:
- `patternSize`, a table array with K positive integers. It indicates the size of each pattern taken from the underlying `matrix`. By default it is `patternSize={ 1, n2, n3, ..., nK }`.
- `offset`, a table array with K signed integers. It indicates the offset of the first pattern. A negative value is useful to compute a pattern which traverses the `matrix` limits. The first initial position is `0`. Its default value is `offset={ 0, 0, ..., 0 }`.
- `numSteps`, a table with K strictly positive integers (> 0). It indicates the number of steps used for each dimension to generate all the possible patterns. Its default value is `numSteps={ n1, 1, ..., 1 }`. The `numPatterns()` method returns the product of all the `numSteps` components.
- `stepSize`, a table with K signed integers. It indicates the number of coordinates which are slided for each dimension with every pattern. Its default value is `stepSize={ 1, ..., 1 }`. Obviously, in every `i` dimension where `numSteps[i]=1`, the `stepSize[i]` is not important. Depending on the values of `stepSize` and `patternSize`, the `matrix` will be traversed with or without overlapping between patterns.
- `orderStep`, a table with a permutation of the K dimensions, indicating the order of the `matrix` traversal. By default, the `matrix` is traversed in `row_major` order, so its value is `orderStep={ K-1, K-2, ..., 2, 1, 0 }`. Varying the order of these numbers, it is possible to produce a different traversal order, as for example a `col_major` order.
- `defaultValue`, a number (not necessarily an integer), used to fill the pattern positions which are out of the `matrix` limits. By default its value is `defaultValue=0`.
- `circular`, a table with K booleans (true or false) which indicate for every `matrix` dimension whether it is circular or not. By default it is false in all dimensions, `circular={ false, false, ..., false }`. When a dimension is not circular, the pattern positions out of the `matrix` limits are filled with `defaultValue`. When a dimension is circular, the pattern positions out of the `matrix` are re-interpreted starting at the first position of this dimension in the matrix. For example, a bi-dimensional `matrix` with one circular dimension looks cylindrical. If the two dimensions are circular, it looks toroidal (like a donut).
Let's look at a short example of these parameters. We want to generate a dataset with binary XOR patterns using only one matrix:
```lua
> m_xor = matrix.fromString[[
4 3
ascii
0 0 0
0 1 1
1 0 1
1 1 0
]]
> ds_input = dataset.matrix(m_xor,{patternSize={1,2}})
> ds_output = dataset.matrix(m_xor,{offset={0,2},patternSize={1,1}})
> for i=1,ds_input:numPatterns() do
>> printf("%d -> Input: %s Output: %s\n",i,
>> table.concat(ds_input:getPattern(i),","),table.concat(ds_output:getPattern(i),","))
>> end
1 -> Input: 0,0 Output: 0
2 -> Input: 0,1 Output: 1
3 -> Input: 1,0 Output: 1
4 -> Input: 1,1 Output: 0
```
We could implement the following function:
```lua
function dataset_pair(m,sizein,sizeout)
  local d_in  = dataset.matrix(m,{patternSize = {1,sizein}})
  local d_out = dataset.matrix(m,{offset={0,sizein},patternSize = {1,sizeout}})
  return d_in,d_out
end
-- which could be used like this
ds_input,ds_output = dataset_pair(m_xor,2,1)
```
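Another short sketch of these parameters, assuming the defaults described above: when `patternSize` is larger than `stepSize`, the `matrix` is traversed with overlapping between patterns, producing a sliding window:

```lua
> m = matrix(6,{1,2,3,4,5,6})
> -- 4 overlapping windows of size 3, starting at offsets 0, 1, 2 and 3
> ds = dataset.matrix(m, { patternSize={3}, stepSize={1}, numSteps={4} })
> for i,p in ds:patterns() do print(table.concat(p,",")) end
1,2,3
2,3,4
3,4,5
4,5,6
```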
The `dataset.identity` represents the traversal of an identity matrix. It receives as first argument the number of patterns (which is at the same time the `patternSize()`), a second optional argument with the value used for zero (by default `0.0`), and a third optional argument with the value used for one (by default `1.0`).
```lua
> ds_eye = dataset.identity(5)
> print(ds_eye:toMatrix())
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
# Matrix of size [5,5] in row_major [0x1418bd0 data= 0x1418cd0]
```
The `dataset.identity` is equivalent to the following code, but it is more efficient:
```lua
> ds_eye = dataset.matrix(matrix(5,5):zeros():diag(1))
> print(ds_eye:toMatrix())
1 0 0 0 0
0 1 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
# Matrix of size [5,5] in row_major [0x129f930 data= 0x12fb470]
```
The `dataset.indexed` allows to map indexes to patterns. It is useful to specify the output of a classification task, in which case the underlying `dataset` will be the association of an ANN output for each of the classes. Another possibility is to use `dataset.indexed` to select a random subset of patterns from the underlying `dataset`. NOTE that `dataset.indexed` uses float numbers to represent the indices, so the maximum integer number which could be indexed is 16777216 (2^24). If you need more resolution, use `dataset.index_filter` (which is less general than this one).

The constructor receives 2 arguments: the first is the base `dataset`; the second is a table array with as many `dataset` objects as the `patternSize()` of the base `dataset`, every one of them acting as a dictionary. The `patternSize()` of the resulting `dataset.indexed` object equals the sum of the `patternSize()` of all the dictionaries.

The following code is an example for a classification task ANN output:
```lua
> dict = dataset.identity(10)
> -- a random matrix with integers [1,10]
> m_base = matrix(100):uniform(1,10,random(1234))
> ds_base = dataset.matrix(m_base)
> indexed_ds = dataset.indexed( ds_base, { dict })
```
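Every pattern of `indexed_ds` is the dictionary row selected by the corresponding index stored in `ds_base`. A short sketch of what we would expect (the exact one-hot row depends on the random seed):

```lua
> -- numPatterns comes from the base, patternSize from the dictionary
> print(indexed_ds:numPatterns(), indexed_ds:patternSize())
100	10
> -- every pattern is a one-hot vector of size 10
> print(table.concat(indexed_ds:getPattern(1), " "))
```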
The following code selects a random subset of patterns from a given `dataset`:
```lua
> -- a matrix with 100 patterns with real numbers in [-1,1]
> m_dict = matrix(100, 10):uniformf(-1,1,random(1234))
> dict = dataset.matrix(m_dict)
> -- a random matrix with 10 integers in range [1,100], a selection of patterns
> m_base = matrix(10):uniform(1,100,random(1234))
> ds_base = dataset.matrix(m_base)
> indexed_ds = dataset.indexed( ds_base, { dict })
```
The `dataset.index_filter` is like `dataset.indexed`, but only for the case of indexing a random subset of patterns from a given base `dataset`, which is received as first argument. As second argument, a vector of unsigned integers (`util.vector_uint`) is expected.
```lua
> -- a dataset with 100 patterns of size 5, randomized at range [0,1]
> base_ds = dataset.matrix(matrix(100,5):uniformf())
> uint_vector = util.vector_uint()
> rnd = random(1234)
> -- a subset of 10 patterns from indices at range [1,100]
> for i=1,10 do uint_vector:push_back( rnd:randInt(1,100) ) end
> print(uint_vector)
48 84 39 54 77 25 16 50
24 27
# vector_uint of size 10
> index_filter_ds = dataset.index_filter(base_ds, uint_vector)
> print(index_filter_ds:toMatrix())
0.528819 0.915766 0.220549 0.828223 0.28173
0.73919 0.424762 0.354582 0.368474 0.0355779
0.512678 0.494687 0.731773 0.672073 0.411915
0.575729 0.169612 0.346667 0.925921 0.332662
0.298257 0.460495 0.179573 0.32725 0.610076
0.219746 0.15807 0.581498 0.531874 0.200707
0.00641197 0.86275 0.407079 0.279832 0.602674
0.456097 0.463612 0.521626 0.951389 0.659111
0.4136 0.734821 0.212726 0.314356 0.50499
0.662668 0.584882 0.457253 0.325801 0.217475
# Matrix of size [10,5] in row_major [0x12a2710 data= 0x13eaa10]
```
The `dataset.join` object joins the outputs of several `dataset` objects which have the same `numPatterns`. The `patternSize` of the resulting `dataset` equals the sum of the `patternSize` of its components. It requires as argument a table with the datasets which you want to join.
```lua
> -- ds1, ds2 and ds3 are three datasets with the same numPatterns
> join_ds = dataset.join{ ds1, ds2, ds3 }
```
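A minimal self-contained sketch of the expected shapes, assuming the `matrix` method `fill` (which sets all the components to a given value):

```lua
> -- two datasets with 4 patterns each, of sizes 2 and 3
> ds1 = dataset.matrix(matrix(4,2):fill(1))
> ds2 = dataset.matrix(matrix(4,3):fill(2))
> join_ds = dataset.join{ ds1, ds2 }
> print(join_ds:numPatterns(), join_ds:patternSize())
4	5
> print(table.concat(join_ds:getPattern(1), ","))
1,1,2,2,2
```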
The `dataset.union` allows to convert several `dataset` objects with the same `patternSize` into one unique `dataset` whose `numPatterns` equals the sum of all the `numPatterns` of the given datasets. It receives only one argument, a table with the `dataset` objects which will be unionized.
```lua
> -- ds1, ds2 and ds3 are datasets with the same patternSize
> union_ds = dataset.union{ ds1, ds2, ds3 }
```
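And a minimal sketch for the union, again assuming `matrix:fill`:

```lua
> -- two datasets with the same patternSize (3) and 2 and 4 patterns
> ds1 = dataset.matrix(matrix(2,3):fill(1))
> ds2 = dataset.matrix(matrix(4,3):fill(2))
> union_ds = dataset.union{ ds1, ds2 }
> print(union_ds:numPatterns(), union_ds:patternSize())
6	3
```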
The `dataset.slice` is useful to extract a contiguous subset of patterns from a given `dataset` (for more general subsets use `dataset.indexed` or `dataset.index_filter`). It requires 3 arguments. The first is the base `dataset`. The second and third arguments are the initial and final indices of the patterns which form the subset (the first valid index is 1, and the last valid index is the `numPatterns()` of the base `dataset`).
```lua
> -- slice with 100 patterns, from 101 to 200
> slice_ds = dataset.slice(base_ds, 101, 200)
```
The `dataset.deriv` receives a dataset and outputs the original data, the first derivative, and/or the second derivative, depending on the received parameters. It receives a table with a maximum of four fields:
- `dataset`: the base dataset, which contains the data for derivative computation.
- `deriv0`: an optional boolean, by default `true`, which indicates if the output of the dataset will contain the original pattern, without derivative.
- `deriv1`: an optional boolean, by default `true`, which indicates if the output of the dataset will contain the first derivative.
- `deriv2`: an optional boolean, by default `true`, which indicates if the output of the dataset will contain the second derivative.
```lua
> -- ds is the base dataset
> only_first_deriv_ds = dataset.deriv{ dataset=ds, deriv0=false, deriv1=true, deriv2=false }
```
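Since only the first derivative is enabled, a reasonable expectation (from the description above, not verified here) is that the resulting dataset keeps the `patternSize` of the base; with the three fields enabled it would be three times larger:

```lua
> -- expected: the same patternSize as the base dataset ds
> print(only_first_deriv_ds:patternSize(), ds:patternSize())
```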
The `dataset.contextualizer` is a `dataset` which adds context from the adjacent patterns (left and right). If any of the adjacent patterns is out of the base `dataset` limits, it is filled with the first or the last pattern. The constructor receives four arguments:

- The base `dataset`.
- The size of the left context.
- The size of the right context.
- An optional `boolean` argument indicating if the left and right contexts need to be swapped. By default it is `false`, which is what you need in almost all cases ;)
```lua
> ds = dataset.contextualizer(dataset.identity(2,0,1),1,1)
> print(ds:toMatrix())
1 0 1 0 0 1
1 0 0 1 0 1
# Matrix of size [2,6] in row_major [0x18357b0 data= 0x18358b0]
```
The `dataset.split` allows to select a subset of the components of the patterns produced by another `dataset`. So, the resulting `dataset` will have the same number of patterns, but a different pattern size. The subset is an interval of positions of the base `dataset` patterns. It receives three positional arguments:

- The base `dataset`.
- The first position in the interval (counting from 1).
- The last position in the interval (counting from 1).
```lua
> ds = dataset.split(dataset.identity(5,0,1), 2, 4)
> print(ds:toMatrix())
0 0 0
1 0 0
0 1 0
0 0 1
0 0 0
# Matrix of size [5,3] in row_major [0xcb0f80 data= 0xcb1080]
```
The `dataset.sub_and_div_normalization` applies on-the-fly a subtraction and division normalization, as for example a zero-mean one-standard-deviation normalization. So, for a `dataset` with patternSize N, given a vector of sub values `s1, s2, ..., sN`, and a vector of div values `d1, d2, ..., dN`, a `ds:getPattern(i)` of the resulting `dataset` will produce a pattern with `(v1-s1)/d1, (v2-s2)/d2, ..., (vN-sN)/dN`, being `vj` the j-th component of pattern `i`.
```lua
> eye_ds = dataset.identity(5,0,1)
> sub,div = {1,2,-1,2,-1},{0.1,0.1,0.1,0.1,0.1}
> ds = dataset.sub_and_div_normalization(eye_ds,sub,div)
> print(ds:toMatrix())
0 -20 10 -20 10
-10 -10 10 -20 10
-10 -20 20 -20 10
-10 -20 10 -10 10
-10 -20 10 -20 20
# Matrix of size [5,5] in row_major [0xf47d70 data= 0xcfa060]
```
Every `dataset.token` implements the following methods, which work with `token` values (for example `matrix` instances) instead of plain Lua tables:

- `token = ds:getPattern(number)`
- `token = ds:getPatternBunch(table)`

The `dataset.token.sparse_matrix` constructor builds a dataset over a sparse `matrix` in CSR format, where every row of the matrix is one pattern:

- `ds = dataset.token.sparse_matrix(sparse matrix in CSR)`
```lua
> m = matrix.sparse.diag{1,2,3,4,5,6}
> ds = dataset.token.sparse_matrix(m)
> print(ds:getPattern(1))
1 0 0 0 0 0
# SparseMatrix of size [1,6] in csr [0x2aea350 data= 0x2aea420 0x2aea4a0 0x2aea4e0], 1 non-zeros
> print(ds:getPatternBunch{3,5})
0 0 3 0 0 0
0 0 0 0 5 0
# SparseMatrix of size [2,6] in csr [0x2aeab70 data= 0x2aea4a0 0x2aea420 0x2aea7b0], 2 non-zeros
```
Other `dataset.token` constructors and methods are:

- `ds = dataset.token.union(table)`
- `ds = dataset.token.vector(psize)`, with the method `ds:push_back(token)` to append new patterns.
- `ds = dataset.token.filter(dataset, obj)`
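A minimal sketch of `dataset.token.vector`, assuming that `matrix` instances are accepted as tokens (as in the pure Lua class below):

```lua
> ds = dataset.token.vector(3)        -- an empty dataset with patternSize 3
> ds:push_back( matrix(1,3,{1,2,3}) ) -- every pushed token is one pattern
> ds:push_back( matrix(1,3,{4,5,6}) )
> print(ds:numPatterns(), ds:patternSize())
2	3
```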
It is possible to develop pure Lua `dataset` classes, which must comply with the interface of the `dataset.token` class. The unique restriction is that your Lua `dataset` couldn't be used as input to other C++ `dataset` objects. However, the Lua `dataset` can use C++ objects or Lua objects without making any distinction. The following is a piece of a pure Lua `dataset.token` which replicates the behavior of `dataset.join`, but using tokens. The `matrix` type is needed for the instances which you want to join.
```lua
ds_join,ds_join_methods = class("ds_join")

function ds_join:constructor(t)
  assert(type(t)=="table" and #t>0,
         "Needs an array of dataset.token instances as argument")
  local psize = 0  -- we sum here the pattern size of all the given datasets
  local nump  = 0  -- we store here the number of patterns, which must be
                   -- equal in all the given datasets
  local data  = {} -- this table will store the given datasets
  for _,v in ipairs(t) do
    psize = psize + v:patternSize()
    local aux_nump = v:numPatterns()
    assert(nump==0 or nump==aux_nump,
           "All the given datasets must have the same numPatterns")
    nump = aux_nump
    table.insert(data, v)
  end
  self.data = data
  self.num_patterns = nump
  self.pattern_size = psize
end

function ds_join_methods:numPatterns() return self.num_patterns end

function ds_join_methods:patternSize() return self.pattern_size end

function ds_join_methods:getPattern(idx)
  -- construct a new output matrix and copy every dataset pattern into the
  -- corresponding column slice
  local m = matrix(1,self:patternSize())
  local col_pos = 1
  for _,ds in ipairs(self.data) do
    local psize  = ds:patternSize()
    local dest_m = m:slice({1,col_pos}, {1,psize})
    dest_m:copy(ds:getPattern(idx))
    col_pos = col_pos + psize
  end
  return m
end

function ds_join_methods:getPatternBunch(idxs)
  -- construct a new output matrix with one row per given index
  local m = matrix(#idxs,self:patternSize())
  local col_pos = 1
  for _,ds in ipairs(self.data) do
    local psize  = ds:patternSize()
    local dest_m = m:slice({1,col_pos}, {#idxs,psize})
    dest_m:copy(ds:getPatternBunch(idxs))
    col_pos = col_pos + psize
  end
  return m
end
```
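A hypothetical usage sketch, where `token_ds1` and `token_ds2` stand for any two `dataset.token` instances with the same `numPatterns` whose patterns are `matrix` tokens:

```lua
> joined = ds_join{ token_ds1, token_ds2 }
> print(joined:numPatterns(), joined:patternSize())
> m = joined:getPatternBunch{1,2,3} -- a 3 x patternSize matrix
```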