
Introduction

The dataset package can be loaded via the standalone binary, or in Lua with require("aprilann.dataset").

The dataset table is a namespace and a Lua abstract class which adds a set-of-patterns abstraction layer on top of multi-dimensional matrix objects. It also supports pattern pre-processing, union and join operations over different datasets, an identity matrix dataset, and so on.

Every dataset implements the following methods (a short usage sketch follows this list):

  • number = ds:numPatterns(), it returns the number of patterns in the given ds dataset.
  • number = ds:patternSize(), it returns the size of one pattern.
  • table = ds:getPattern(i), it receives a number between 1 and numPatterns(), and returns a table with the i-th pattern.
  • ds:putPattern(i,t), it receives a number between 1 and numPatterns(), and a table with patternSize() numbers, and overwrites the i-th pattern with the given table.
  • iterator = ds:patterns(), an iterator function to use in Lua for statements: for i,t in ds:patterns() do ... end.
  • table = ds:mean(), it returns the mean of each pattern component.
  • table,table = ds:mean_deviation(), it returns the mean and standard deviation of each pattern component.
  • number,number = ds:min_max(), it returns the minimum and maximum values of the dataset.
  • ds:normalize_mean_deviation(), it receives two tables of patternSize() length, the first with means and the second with standard deviations, and the method normalizes the data by subtracting the mean and dividing by the standard deviation.
  • matrix = ds:toMatrix(), it returns a newly allocated bi-dimensional matrix object which contains all the dataset patterns (numPatterns() rows and patternSize() columns).
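
For example, a short usage sketch of this interface (using the dataset.matrix type described in the next section); the printed values are the expected results:

m  = matrix(4,3,{ 1, 2, 3,
                  4, 5, 6,
                  7, 8, 9,
                 10,11,12 })
ds = dataset.matrix(m)
print(ds:numPatterns(), ds:patternSize())  -- 4  3
print(table.concat(ds:getPattern(2), ",")) -- 4,5,6
-- normalize every component to zero mean and unit variance
means,devs = ds:mean_deviation()
ds:normalize_mean_deviation(means, devs)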

dataset.matrix

This is the most important kind of dataset. It allows creating patterns by moving a multi-dimensional window through a matrix object. This dataset takes the matrix by reference, so any change in the matrix will be reflected in the patterns produced by the dataset:

xor_in = matrix(4,2, {0,0,
                      0,1,
                      1,0,
                      1,1})
xor_out = matrix(4, {0, 1, 1, 0})
-- by default, dataset.matrix traverses the matrix by rows
ds_xor_in  = dataset.matrix(xor_in)
ds_xor_out = dataset.matrix(xor_out)

For a given matrix with dimensions n1,n2,...,nk, by default the dataset contains n1 patterns of size n2 x ... x nk. For a bi-dimensional matrix this is a row-major order traversal. For a vector, it is the traversal of all its elements:

> a = matrix(2, 2, {1,2,3,4})
> b = dataset.matrix(a)
> for i,j in b:patterns() do print(table.concat(j,",")) end
1,2
3,4
> a = matrix(2,2,2,{1,2,3,4,5,6,7,8})
> b = dataset.matrix(a)
> for i,j in b:patterns() do print(table.concat(j,",")) end
1,2,3,4
5,6,7,8
> a = matrix(4,{1,2,3,4})
> b = dataset.matrix(a)
> for i,j in b:patterns() do print(table.concat(j,",")) end
1
2
3
4

Up to this point, no benefit of dataset over matrix has been shown. We are going to see that, for the same given matrix, we can generate several different datasets by modifying some parameters which, until now, have been left at their default values.

When we instantiate a dataset.matrix, the first argument is a K-dimensional matrix of size n1 x n2 x ... x nK. The second argument can be a Lua table with the following fields:

  • patternSize, a table array with K positive integers. It indicates the size of each pattern taken from the underlying matrix. By default it is patternSize={ 1, n2, n3, ..., nK }.

  • offset, a table array with K signed integers. It indicates the offset of the first pattern. A negative value is useful to take patterns which cross the matrix limits. The first position is 0. Its default value is offset={ 0, 0, ..., 0 }.

  • numSteps, a table with K strictly positive integers (> 0). It indicates the number of steps used in each dimension to generate all the possible patterns. Its default value is numSteps={ n1, 1, ..., 1 }. The numPatterns() method returns the product of all the numSteps components.

  • stepSize, a table with K signed integers. It indicates the number of coordinates the window is slid in each dimension for every new pattern. Its default value is stepSize={ 1, ..., 1 }. Obviously, in every dimension i where numSteps[i]=1, the value of stepSize[i] is irrelevant. Depending on the values of stepSize and patternSize, the matrix will be traversed with or without overlapping between patterns.

  • orderStep, a table with a permutation of the K dimensions, indicating the order of the matrix traversal. By default, the matrix is traversed in row_major order, so its value is orderStep={ K-1, K-2, ..., 2, 1, 0 }. Varying the order of these numbers, it is possible to produce a different traversal order, for example a col_major order.

  • defaultValue is a number (not necessarily an integer) used to fill the pattern positions which fall outside the matrix limits. By default its value is defaultValue=0.

  • circular is a table with K booleans (true or false) which indicate, for every matrix dimension, whether it is circular or not. By default all dimensions are non-circular: circular={ false, false, ..., false }. When a dimension is not circular, the pattern positions outside the matrix limits are filled with defaultValue. When a dimension is circular, the pattern positions outside the matrix are re-interpreted starting from the first position of that dimension. For example, a bi-dimensional matrix with one circular dimension behaves like a cylinder. If both dimensions are circular, it behaves like a torus (a donut).

Let's look at a short example of these parameters. We want to generate a dataset with binary XOR patterns using only one matrix:

> m_xor = matrix.fromString[[
4 3
ascii
0 0 0
0 1 1
1 0 1
1 1 0
]]
> ds_input  = dataset.matrix(m_xor,{patternSize={1,2}})
> ds_output = dataset.matrix(m_xor,{offset={0,2},patternSize={1,1}})
> for i=1,ds_input:numPatterns() do
>> printf("%d -> Input: %s Output: %s\n",i,
>> table.concat(ds_input:getPattern(i),","),table.concat(ds_output:getPattern(i),","))
>> end
1 -> Input: 0,0 Output: 0
2 -> Input: 0,1 Output: 1
3 -> Input: 1,0 Output: 1
4 -> Input: 1,1 Output: 0

We could implement the following function:

function dataset_pair(m,sizein,sizeout)
  local d_in  = dataset.matrix(m,{patternSize = {1,sizein}})
  local d_out = dataset.matrix(m,{offset={0,sizein},patternSize = {1,sizeout}})
  return d_in,d_out
end
-- which could be used as this
ds_input,ds_output = dataset_pair(m_xor,2,1)
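
The sliding-window parameters also allow overlapping patterns. As a minimal sketch, assuming the stepSize and numSteps semantics described above, four overlapping windows of three elements can be taken from a vector of six elements:

> m  = matrix(6,{1,2,3,4,5,6})
> ds = dataset.matrix(m, { patternSize={3}, stepSize={1}, numSteps={4} })
> for i,p in ds:patterns() do print(table.concat(p,",")) end
1,2,3
2,3,4
3,4,5
4,5,6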

dataset.identity

This dataset represents the traversal of an identity matrix. It receives as first argument the number of patterns (which is, at the same time, the patternSize), a second optional argument with the value used for zero (by default 0.0), and a third optional argument with the value used for one (by default 1.0).

> ds_eye = dataset.identity(5)
> print(ds_eye:toMatrix())
 1           0           0           0           0
 0           1           0           0           0
 0           0           1           0           0
 0           0           0           1           0
 0           0           0           0           1
# Matrix of size [5,5] in row_major [0x1418bd0 data= 0x1418cd0]

The dataset.identity is equivalent to the following code, but more efficient:

> ds_eye = dataset.matrix(matrix(5,5):zeros():diag(1))
> print(ds_eye:toMatrix())
 1           0           0           0           0
 0           1           0           0           0
 0           0           1           0           0
 0           0           0           1           0
 0           0           0           0           1
# Matrix of size [5,5] in row_major [0x129f930 data= 0x12fb470]

dataset.indexed

The dataset.indexed allows mapping indices to patterns. It is useful to specify the output of a classification task, in which case the dictionary dataset associates an ANN output with each of the classes. Another possibility is to use dataset.indexed to select a random set of patterns from the underlying dataset. NOTE that dataset.indexed uses float numbers to represent the indices, so the maximum integer which can be indexed is 16777216. If you need more resolution, use dataset.index_filter (which is less general than this one).

The constructor receives two arguments: the first is the base dataset; the second is a table array with as many dataset objects as the patternSize() of the base dataset, each of them acting as a dictionary. The patternSize() of the resulting dataset.indexed object equals the sum of the patternSize() of all the dictionaries.

The following code is an example for a classification task ANN output:

> dict = dataset.identity(10)
> -- a random matrix with integers [1,10]
> m_base = matrix(100):uniform(1,10,random(1234))
> ds_base = dataset.matrix(m_base)
> indexed_ds = dataset.indexed( ds_base, { dict })

The following code selects a random subset of patterns from a given dataset:

> -- a matrix with 100 patterns with real numbers in [-1,1]
> m_dict = matrix(100, 10):uniformf(-1,1,random(1234))
> dict = dataset.matrix(m_dict)
> -- a random matrix with 10 integers in range [1,100], a selection of patterns
> m_base = matrix(10):uniform(1,100,random(1234))
> ds_base = dataset.matrix(m_base)
> indexed_ds = dataset.indexed( ds_base, { dict })

dataset.index_filter

The dataset.index_filter is like dataset.indexed, but only for the case of indexing a random subset of patterns from a given base dataset, which it receives as its first argument. As second argument it expects a vector of unsigned integers (util.vector_uint).

> -- a dataset with 100 patterns of size 5, with random values in range [0,1]
> base_ds = dataset.matrix(matrix(100,5):uniformf())
> uint_vector = util.vector_uint()
> rnd = random(1234)
> -- a subset of 10 patterns, with indices in range [1,100]
> for i=1,10 do uint_vector:push_back( rnd:randInt(1,100) ) end
> print(uint_vector)
      48       84       39       54       77       25       16       50
      24       27
# vector_uint of size 10
> index_filter_ds = dataset.index_filter(base_ds, uint_vector)
> print(index_filter_ds:toMatrix())
 0.528819    0.915766    0.220549    0.828223    0.28173
 0.73919     0.424762    0.354582    0.368474    0.0355779
 0.512678    0.494687    0.731773    0.672073    0.411915
 0.575729    0.169612    0.346667    0.925921    0.332662
 0.298257    0.460495    0.179573    0.32725     0.610076
 0.219746    0.15807     0.581498    0.531874    0.200707
 0.00641197  0.86275     0.407079    0.279832    0.602674
 0.456097    0.463612    0.521626    0.951389    0.659111
 0.4136      0.734821    0.212726    0.314356    0.50499
 0.662668    0.584882    0.457253    0.325801    0.217475
# Matrix of size [10,5] in row_major [0x12a2710 data= 0x13eaa10]

dataset.join

The dataset.join object joins the outputs of several dataset objects which have the same numPatterns. The patternSize of the resulting dataset equals the sum of the patternSize of all its components. It requires as argument a table with the datasets you want to join.

> -- ds1, ds2 and ds3 are three datasets with the same numPatterns
> join_ds = dataset.join{ ds1, ds2, ds3 }
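
A minimal self-contained sketch, assuming two dataset.matrix objects with the same number of patterns:

> ds1 = dataset.matrix(matrix(4,2,{1,2, 3,4, 5,6, 7,8}))
> ds2 = dataset.matrix(matrix(4,3):zeros())
> join_ds = dataset.join{ ds1, ds2 }
> print(join_ds:numPatterns(), join_ds:patternSize()) -- 4 patterns of size 2+3=5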

dataset.union

This dataset allows treating several dataset objects with the same patternSize as one unique dataset whose numPatterns equals the sum of the numPatterns of all the given datasets. It receives only one argument: a table with the datasets to be unified.

> -- ds1, ds2 and ds3 are datasets with the same patternSize
> union_ds = dataset.union{ ds1, ds2, ds3 }
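
A minimal self-contained sketch, assuming two dataset.matrix objects with the same patternSize:

> ds1 = dataset.matrix(matrix(3,2):zeros())
> ds2 = dataset.matrix(matrix(5,2):zeros())
> union_ds = dataset.union{ ds1, ds2 }
> print(union_ds:numPatterns(), union_ds:patternSize()) -- 3+5=8 patterns of size 2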

dataset.slice

The dataset.slice is useful to extract a contiguous subset of patterns from a given dataset (for more general subsets use dataset.indexed or dataset.index_filter). It requires 3 arguments. The first is the base dataset. The second and third arguments are the initial and final indices of the patterns which form the subset (the first valid index is 1, and the last valid index is the numPatterns() of the base dataset).

> -- slice with 100 patterns, from 101 to 200
> slice_ds = dataset.slice(base_ds, 101, 200)
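
A self-contained sketch, assuming a base dataset with at least 200 patterns; pattern i of the slice corresponds to pattern 100+i of the base dataset:

> base_ds = dataset.matrix(matrix(300,5):uniformf())
> slice_ds = dataset.slice(base_ds, 101, 200)
> -- slice_ds:getPattern(1) is base_ds:getPattern(101)
> print(slice_ds:numPatterns()) -- 100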

dataset.deriv

The dataset.deriv receives a dataset and outputs the original data, the first derivative, and/or the second derivative, depending on the received parameters. It receives a table with a maximum of four fields:

  • dataset: the base dataset, which contains data for derivative computation.
  • deriv0: an optional boolean, true by default, which indicates whether the output of the dataset will contain the original pattern, without derivatives.
  • deriv1: an optional boolean, true by default, which indicates whether the output of the dataset will contain the first derivative.
  • deriv2: an optional boolean, true by default, which indicates whether the output of the dataset will contain the second derivative.
> -- ds is the base dataset
> only_first_deriv_ds = dataset.deriv{ dataset=ds, deriv0=false, deriv1=true, deriv2=false }

dataset.contextualizer

The contextualizer is a dataset which adds context from the adjacent patterns (left and right). When any of the adjacent patterns falls outside the base dataset limits, it is replaced by the first or the last pattern. The constructor receives four arguments:

  1. The base dataset.
  2. The size of the left context.
  3. The size of the right context.
  4. An optional boolean argument indicating whether the left and right contexts need to be swapped. By default it is false, which in almost all cases is what you need ;)
> ds = dataset.contextualizer(dataset.identity(2,0,1),1,1)
>
> print(ds:toMatrix())
 1           0           1           0           0           1
 1           0           0           1           0           1
# Matrix of size [2,6] in row_major [0x18357b0 data= 0x18358b0]

dataset.split

This dataset allows selecting a subset of the components of the patterns produced by another dataset. The resulting dataset will have the same number of patterns, but a different pattern size. The subset is an interval over the components of the base dataset. It receives three positional arguments:

  1. The base dataset.
  2. The first position in the interval (counting from 1).
  3. The last position in the interval (counting from 1).
> ds = dataset.split(dataset.identity(5,0,1), 2, 4)
> print(ds:toMatrix())
 0           0           0
 1           0           0
 0           1           0
 0           0           1
 0           0           0
# Matrix of size [5,3] in row_major [0xcb0f80 data= 0xcb1080]

dataset.perturbation

dataset.salt_noise

dataset.sub_and_div_normalization

This dataset applies on-the-fly a subtraction and division normalization, for example a zero-mean one-standard-deviation normalization. So, for a dataset with patternSize N, given a vector of sub values s1, s2, ..., sN, and a vector of div values d1, d2, ..., dN, ds:getPattern(i) on the resulting dataset will produce the pattern (v1-s1)/d1, (v2-s2)/d2, ..., (vN-sN)/dN, where vj is the j-th component of pattern i.

> eye_ds = dataset.identity(5,0,1)
> sub,div = {1,2,-1,2,-1},{0.1,0.1,0.1,0.1,0.1}
> ds = dataset.sub_and_div_normalization(eye_ds,sub,div)
> print(ds:toMatrix())
 0          -20          10         -20          10
-10         -10          10         -20          10
-10         -20          20         -20          10
-10         -20          10         -10          10
-10         -20          10         -20          20
# Matrix of size [5,5] in row_major [0xf47d70 data= 0xcfa060]

The token dataset: dataset.token

Methods

numPatterns

patternSize

getPattern

token = ds:getPattern(number)

getPatternBunch

token = ds:getPatternBunch(table)

putPattern

putPatternBunch

patterns

dataset.token.sparse_matrix

ds = dataset.token.sparse_matrix(sparse matrix in CSR)

> m = matrix.sparse.diag{1,2,3,4,5,6}
> ds = dataset.token.sparse_matrix(m)
> print(ds:getPattern(1))
 1           0           0           0           0           0
# SparseMatrix of size [1,6] in csr [0x2aea350 data= 0x2aea420 0x2aea4a0 0x2aea4e0], 1 non-zeros
> print(ds:getPatternBunch{3,5})
 0           0           3           0           0           0
 0           0           0           0           5           0
# SparseMatrix of size [2,6] in csr [0x2aeab70 data= 0x2aea4a0 0x2aea420 0x2aea7b0], 2 non-zeros

dataset.token.union

ds = dataset.token.union(table)

dataset.token.vector

ds = dataset.token.vector(psize)

ds:push_back(token)
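
A hypothetical usage sketch, assuming that matrix instances are accepted as tokens by push_back:

> ds = dataset.token.vector(3)
> ds:push_back( matrix(1,3,{1,2,3}) )
> ds:push_back( matrix(1,3,{4,5,6}) )
> print(ds:numPatterns(), ds:patternSize()) -- 2  3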

dataset.token.filter

ds = dataset.token.filter(dataset, obj)

My own Lua dataset.token

It is possible to develop Lua dataset classes, which have to comply with the interface of the dataset.token class. The only restriction is that your Lua dataset cannot be used as input to other C++ dataset objects. However, the Lua dataset can use C++ objects or Lua objects without making any distinction.

The following is a piece of a pure Lua dataset.token which replicates the behavior of dataset.join, but using tokens. The matrix type is required for the instances you want to join.

ds_join,ds_join_methods = class("ds_join")

function ds_join:constructor(t)
  assert(type(t)=="table" and #t>0,
         "Needs an array of dataset.token instances as argument")
  local psize = 0  -- we sum here the pattern size of all the given datasets
  local nump  = 0  -- we store here the number of patterns, which must be
                   -- equal in all the given datasets
  local data  = {} -- this table will store the given datasets
  for _,v in ipairs(t) do
    psize = psize + v:patternSize()
    local aux_nump = v:numPatterns()
    assert(nump==0 or nump==aux_nump)
    nump = aux_nump
    table.insert(data, v)
  end
  self.data=data
  self.num_patterns=nump
  self.pattern_size=psize
end

function ds_join_methods:numPatterns() return self.num_patterns end

function ds_join_methods:patternSize() return self.pattern_size end

function ds_join_methods:getPattern(idx)
  -- construct a new matrix to hold the joined pattern
  local m = matrix(1,self:patternSize())
  local col_pos = 1
  for _,ds in ipairs(self.data) do
    local psize  = ds:patternSize()
    local dest_m = m:slice({1,col_pos}, {1,psize})
    dest_m:copy(ds:getPattern(idx))
    col_pos = col_pos + psize
  end
  return m
end

function ds_join_methods:getPatternBunch(idxs)
  -- construct a new matrix to hold the whole bunch of joined patterns
  local m = matrix(#idxs,self:patternSize())
  local col_pos = 1
  for _,ds in ipairs(self.data) do
    local psize  = ds:patternSize()
    local dest_m = m:slice({1,col_pos}, {#idxs,psize})
    dest_m:copy(ds:getPatternBunch(idxs))
    col_pos = col_pos + psize
  end
  return m
end
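
Assuming ds1 and ds2 are dataset.token instances with the same numPatterns, and that the class object returned by class(...) is callable as a constructor (as in APRIL-ANN's class helper), the new dataset could be used like this:

joined = ds_join{ ds1, ds2 }
print(joined:numPatterns(), joined:patternSize())
p = joined:getPattern(1) -- a matrix of size 1 x joined:patternSize()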