Paco Zamora Martinez edited this page Sep 30, 2015 · 19 revisions

Introduction

Package stats can be loaded via the standalone binary, or in Lua with require("aprilann.stats").

This package contains utilities for statistical purposes.

Matrix functions

stats.hist

stats.ihist

stats.amean

result = stats.amean(m[, dim])

stats.gmean

result = stats.gmean(m[, dim])

stats.hmean

result = stats.hmean(m[, dim])

stats.var

result = stats.var(m[, dim])

stats.std

result = stats.std(m[, dim])

stats.acf

result,lags = stats.acf(m[, { lag_max, lag_step, cor }])

stats.cov

result = stats.cov(x[, y[, { centered, true_mean }]])

stats.cor

result = stats.cor(x[, y[, { centered }]])

stats.percentile

r1,r2,... = stats.percentile(m, p1, p2, ...)

stats.center

result,center = stats.center(m[, center])

stats.standardize

result,center,scale = stats.standardize(m[,{ center, scale }])

stats.summary

tbl = stats.summary(m)

Statistical distributions

stats.dist.uniform

stats.dist.normal

stats.dist.lognormal

stats.dist.binomial

stats.dist.bernoulli

stats.dist.beta

stats.dist.exponential

Running measures

Mean and variance class

stats.running.mean_var

This class computes the mean and variance of a large number of elements efficiently, following [this method](http://www.johndcook.com/standard_deviation.html) to avoid numerical instability. This class has the following methods:

  • obj = stats.running.mean_var() a constructor which builds an instance.

  • obj = obj:add(number) adds a number to the set. It returns the caller object.

  • obj = obj:add(iterator) adds the sequence of numbers returned by the given iterator function. It returns the caller object.

  • obj = obj:add(table) adds all the elements of the given table (taken as an array) to the set. The elements can be numbers or functions. It returns the caller object.

  • mean,variance = obj:compute() computes and returns the mean and variance accumulated over all calls to the add method.

  • number = obj:size() returns the number of elements added.

  • obj = obj:clear() re-initializes the object.

> obj = stats.running.mean_var()
> obj:add(4)
> obj:add(10)
> print(obj:compute())
7	18
> obj:add({2,8,6,24})
> print(obj:compute())
9	62
> obj:add( pairs({ a=2, b=10 }) )
> print(obj:compute())
8.25	50.785714285714
> print(obj:size())
8
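The numerical recipe behind this class is Welford's online algorithm. The following is an illustrative Python sketch of that method; the class and method names mirror the Lua API but are otherwise hypothetical, not aprilann code:

```python
class RunningMeanVar:
    """Welford's online algorithm: numerically stable running mean/variance."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self
    def compute(self):
        # sample variance (divided by n-1), matching the session above
        return self.mean, self.m2 / (self.n - 1) if self.n > 1 else 0.0

r = RunningMeanVar()
for x in (4, 10):
    r.add(x)
print(r.compute())  # (7.0, 18.0), as in the Lua session above
```

Accumulating the mean and the sum of squared deviations incrementally avoids the catastrophic cancellation of the naive sum-of-squares formula.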

Combinatorial

stats.comb

result = stats.comb(n,k)

stats.log_comb

result = stats.log_comb(n,k)
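Both functions compute the binomial coefficient C(n,k); the log variant is useful when C(n,k) itself would overflow. A minimal Python equivalent using only the standard library (the function names are illustrative):

```python
import math

def comb(n, k):
    # number of k-element subsets of an n-element set
    return math.comb(n, k)

def log_comb(n, k):
    # log C(n,k) via log-gamma, avoiding overflow for large n
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

print(comb(5, 2))                       # 10
print(round(math.exp(log_comb(5, 2))))  # 10
```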

Bootstrapping and confidence intervals

boot

table = stats.boot{ ... }

This function receives a data sample, performs random resampling of the data with replacement, and passes the result to a statistic function. The procedure is repeated several times (the bootstrapping technique), and the function returns a matrix with the statistics computed for all repetitions.

The function receives a table with these fields:

  • size=number or table the sample population size, or a table with several sample population sizes.

  • R=number the number of repetitions of the procedure.

  • k=number the number of results returned by the statistic function. By default it is 1.

  • statistic=function a function which receives a matrixInt32 with a list of indices for resampling the data. The function may compute k statistics (with k>=1), returned as multiple results.

  • verbose=false an optional boolean indicating whether you want verbose output. By default it is false.

  • seed=1234 an optional number indicating the initial seed for random numbers. By default it is 1234.

  • random an optional random number generator. The fields seed and random are mutually exclusive; only one of them can be given.

  • ncores=1 an optional number indicating the number of CPU cores to use for the computation, which can speed up large bootstraps. By default it is 1.

The following example takes 1000 random values and performs bootstrap resampling over them, computing the mean of the resampled population:

-- the random population
> errors = stats.dist.normal():sample(random(567),1000)
> -- executes the bootstrap resampling function
> resampled_data = stats.boot{
  size = errors:size(), R = 1000, seed = 1234, k=2,
  statistic = function(sample)
    local s = errors:index(1, sample)
    local var,mean = stats.var(s)
    -- this function returns two results, the mean and the variance (k=2)
    return mean,var
  end,
}
> -- the 95% confidence interval is computed, being the range [a,b]
> a,b = stats.boot.ci(resampled_data, 0.95, 1)
> print(a,b)
-0.073265952430665	0.051443199906498

boot.ci

a,b = stats.boot.ci(data [, confidence=0.95 [, pos=1 ]])

This function receives a matrix with data computed by stats.boot and returns the confidence interval for the given confidence value and the given statistic function position. It returns two numbers: the left limit a and the right limit b of the interval [a,b].

-- the 95% confidence interval [a,b] of sample mean
local a,b = stats.boot.ci(resampled_data, 0.95, 1)
-- the 95% confidence interval [a,b] of sample variance
local a,b = stats.boot.ci(resampled_data, 0.95, 2)
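The whole procedure (resampling indices with replacement, collecting R replicates of the statistic, then taking percentiles) can be sketched in plain Python. The function names and the simple percentile-interval method below are illustrative assumptions, not the aprilann implementation:

```python
import random
import statistics

def boot(data, R, statistic, seed=1234):
    """Resample `data` with replacement R times and collect the statistic."""
    rng = random.Random(seed)
    n = len(data)
    results = []
    for _ in range(R):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        results.append(statistic(sample))
    return results

def boot_ci(results, confidence=0.95):
    """Percentile confidence interval from the bootstrap replicates."""
    s = sorted(results)
    alpha = (1.0 - confidence) / 2.0
    lo = s[int(alpha * (len(s) - 1))]
    hi = s[int((1.0 - alpha) * (len(s) - 1))]
    return lo, hi

# 1000 normal samples, 200 bootstrap replicates of the mean
rng = random.Random(567)
data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
a, b = boot_ci(boot(data, R=200, statistic=statistics.mean))
print(a, b)  # a narrow interval near the true mean 0
```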

Principal Components Analysis

pca.center_by_pattern

matrix = stats.pca.center_by_pattern(matrix)

This function modifies the given matrix in-place, centering every pattern to have zero mean. Patterns are ordered by rows and features by columns. The function returns the same matrix it was given.

> -- four patterns, 6 features
> m = matrix(4,6):uniformf(1,2)
> = m
 1.10795     1.55917     1.35379     1.22971     1.38055     1.81201   
 1.1907      1.58711     1.38786     1.55067     1.01365     1.6242    
 1.94076     1.33665     1.28377     1.72529     1.26619     1.60847   
 1.64965     1.65704     1.55421     1.80517     1.68546     1.92166   
# Matrix of size [4,6] in row_major [0x2575340 data= 0x249c490]
> stats.pca.center_by_pattern(m)
> = m
-0.299245    0.151974   -0.0534089  -0.177485   -0.0266466   0.404811  
-0.201669    0.194743   -0.00450444  0.15831    -0.378713    0.231833  
 0.413907   -0.190203   -0.243086    0.198439   -0.260668    0.081612  
-0.0625433  -0.0551615  -0.15799     0.0929705  -0.0267389   0.209462  
# Matrix of size [4,6] in row_major [0x2575340 data= 0x249c490]
> = m:sum(2):scal(1/m:dim(2)) -- each pattern is centered
-3.97364e-08
-1.58946e-07
-1.19209e-07
-1.39078e-07
# Matrix of size [4,1] in row_major [0x2519460 data= 0x2491bd0]
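The operation itself is simple: subtract each row's mean from that row. An illustrative pure-Python sketch of the same idea, on a list-of-lists matrix:

```python
def center_by_pattern(m):
    """Subtract each row's mean in-place (patterns by rows, features by columns)."""
    for row in m:
        mu = sum(row) / len(row)
        for j in range(len(row)):
            row[j] -= mu
    return m

m = [[1.0, 2.0, 3.0], [4.0, 8.0, 12.0]]
center_by_pattern(m)
print(m)  # [[-1.0, 0.0, 1.0], [-4.0, 0.0, 4.0]]
```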

pca

U,S,VT = stats.pca(matrix)

This function implements the standard PCA algorithm, based on covariance matrix computation and Singular Value Decomposition (SVD). The function receives a matrix where features are columns and data samples are rows; that is, for M patterns and N features the matrix will be of size MxN. The function expects the input matrix to be normalized, normally by centering each pattern (for example, using the stats.pca.center_by_pattern(...) function), or with another normalization depending on the task (see the UFLDL Tutorial for more information about data normalization).

The function returns three matrix objects, where:

  • U is an orthogonal matrix which contains the eigen-vectors of the covariance matrix, with one eigen-vector per column, sorted in decreasing order.

  • S is a vector (diagonal matrix) which contains the singular values of the covariance matrix, sorted in decreasing order.

  • VT is equivalent to U, and can be ignored.

> -- 10000 patterns, 100 features
> m = matrix(10000,100):uniformf(1, 2, random(1234))
> m = stats.pca.center_by_pattern(m)
> U,S,VT = stats.pca(m)
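To make the linear algebra concrete, here is a toy pure-Python sketch for the 2-feature case: it builds the 2x2 sample covariance matrix and solves its eigen-decomposition in closed form, returning the first eigenvector and the sorted eigenvalues. stats.pca works on any size via SVD; this sketch is only illustrative:

```python
import math

def pca_2d(data):
    """Closed-form PCA for 2-feature data (toy sketch, not aprilann code)."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    # eigenvalues, sorted in decreasing order (like the S vector)
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    d = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + d, tr / 2 - d
    # unit eigenvector for l1 (first column of U)
    v = (sxy, l1 - sxx) if abs(sxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm), (l1, l2)

data = [(float(i), float(i)) for i in range(5)]  # perfectly correlated
u1, (l1, l2) = pca_2d(data)
print(u1, l1, l2)  # u1 along the diagonal (~0.707, ~0.707); l1 = 5.0, l2 = 0.0
```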

pca.threshold

i,s_i,s_mass = stats.pca.threshold(S [,mass=0.99] )

This function computes the cutting threshold for a given S vector with the singular values sorted in decreasing order. It receives the matrix and an optional number with the probability mass you want to retain (by default mass=0.99). The function returns three values:

  • i is the index where you need to cut.

  • s_i is the singular value at the index i, that is S:get(i).

  • s_mass is the mass accumulated up to this index.

> = stats.pca.threshold(S, 0.5)
45	0.084392011165619	0.49575713463855
> = stats.pca.threshold(S, 0.7)
65	0.079482264816761	0.69391569135546
> = stats.pca.threshold(S)
97	0.06935129314661	0.98337004707581
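The computation can be sketched as a normalized cumulative sum: walk the sorted singular values and stop at the first index whose accumulated fraction reaches the requested mass. This Python sketch assumes the mass is accumulated over the values of S themselves:

```python
def pca_threshold(S, mass=0.99):
    """Smallest index i (1-based) whose normalized cumulative sum of the
    singular values reaches `mass`; returns (i, S[i], accumulated mass)."""
    total = sum(S)
    acc = 0.0
    for i, s in enumerate(S, start=1):
        acc += s
        if acc / total >= mass:
            return i, s, acc / total
    return len(S), S[-1], 1.0

print(pca_threshold([4.0, 3.0, 2.0, 1.0], 0.5))  # (2, 3.0, 0.7)
```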

pca.whitening

matrix = stats.pca.whitening(X, U, S [, epsilon=0.0] )

This function implements PCA whitening. It receives a data matrix X (patterns by rows, features by columns), the U and S matrix objects returned by stats.pca(X), and an optional regularization value epsilon, which by default is epsilon=0.0. The function returns a newly allocated matrix, the result of applying the whitening process.

> -- loading digits from TEST/digitos/digits.png
> img = ImageIO.read("TEST/digitos/digits.png")
> m = img:to_grayscale():matrix()
> ds = dataset.matrix(m, { patternSize={16,16}, stepSize={16,16},
                           numSteps={100,10} })
> data_matrix = ds:toMatrix():clone()
> data_matrix = stats.pca.center_by_pattern(data_matrix)
> -- PCA computation
> U,S,VT = stats.pca(data_matrix)
> -- PCA whitening
> data_whitened = stats.pca.whitening(data_matrix, U, S, 0.02)
> = data_whitened
Large matrix, not printed to display
# Matrix of size [1000,256] [0x1be71b0 data= 0x1b23540]

If you want, it is possible to do dimensionality reduction by giving slices of the U and S matrices instead of the whole matrices.

> -- PCA whitening and dimensionality reduction (256 => 100)
> out = stats.pca.whitening(data_matrix, U(':','1:100'), S('1:100'), 0.02)
> = out
Large matrix, not printed to display
# Matrix of size [1000,100] [0xe5c070 data= 0xe5c140]

See also documentation of ann.components.pca_whitening (when available).
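Conceptually, PCA whitening projects the data onto the principal axes and rescales each component to unit variance: result[:, j] = (X U)[:, j] / sqrt(S[j] + epsilon). A pure-Python sketch of that formula (illustrative only; aprilann's version operates on matrix objects):

```python
import math

def pca_whitening(X, U, S, epsilon=0.0):
    """Whiten X: project onto principal axes (columns of U), then divide
    each projected component j by sqrt(S[j] + epsilon)."""
    n, k = len(X), len(S)
    out = [[0.0] * k for _ in range(n)]
    for r in range(n):
        for j in range(k):
            proj = sum(X[r][f] * U[f][j] for f in range(len(X[r])))
            out[r][j] = proj / math.sqrt(S[j] + epsilon)
    return out

# trivial 1-feature case: U is identity, S = [4.0], so whitening divides by 2
print(pca_whitening([[2.0], [4.0]], [[1.0]], [4.0]))  # [[1.0], [2.0]]
```

Passing only the first k columns of U and the first k values of S yields the dimensionality reduction described above.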

zca.whitening

X = stats.zca.whitening(X,U,S,epsilon)

This function receives the same arguments as stats.pca.whitening, but computes ZCA whitening instead of PCA whitening. In this case the output is the given matrix X, so the computation is done in-place.

> -- ZCA whitening
> data_whitened = stats.zca.whitening(data_matrix:clone(), U, S, 0.02)
> -- write to disk
> aux = data_whitened:clone("row_major")
> for sw in aux:sliding_window():iterate() do sw:adjust_range(0,1) end
> ImageIO.write(Image(aux), "digits-whitened.png")
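ZCA whitening is PCA whitening followed by a rotation back with U^T, so the result stays in the original feature space. A pure-Python sketch of that composition (unlike the in-place aprilann API, this sketch returns a new matrix):

```python
import math

def zca_whitening(X, U, S, epsilon=0.0):
    """ZCA: PCA-whiten each pattern, then rotate back with U^T."""
    n, k, f = len(X), len(S), len(U)
    out = [[0.0] * f for _ in range(n)]
    for r in range(n):
        # project onto principal axes and rescale (PCA whitening)
        w = [sum(X[r][i] * U[i][j] for i in range(f)) / math.sqrt(S[j] + epsilon)
             for j in range(k)]
        # rotate back into the original feature space
        for i in range(f):
            out[r][i] = sum(w[j] * U[i][j] for j in range(k))
    return out

# with an identity U, ZCA whitening coincides with PCA whitening
print(zca_whitening([[2.0], [4.0]], [[1.0]], [4.0]))  # [[1.0], [2.0]]
```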

pca.gs_pca

T,P,R = stats.pca.gs_pca{ ... }

WARNING this implementation is in experimental stage.

Implementation of the GS-PCA algorithm, an efficient iterative algorithm for PCA computation. This code is translated from the GSL CBLAS implementation of the paper Parallel GPU Implementation of Iterative PCA Algorithms, M. Andrecut. The function receives a table with the following fields:

  • X=matrix: a MxN matrix, M number of patterns, N pattern size.

  • K=number: the number of components that you want to compute, K <= N.

  • max_iter=number: the maximum number of iterations when computing each component. It is an optional parameter; by default max_iter=10000.

  • epsilon=number: the convergence criterion. It is an optional parameter; by default epsilon=1e-07.

The function returns three matrices:

  • The T scores matrix, with size MxK.

  • The P loads matrix, with size NxK.

  • The R residuals matrix, with size MxN.
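The iteration underneath is NIPALS-like: for each component, alternate p = R^T t and t = R p with normalization until the score norm converges, then deflate R by subtracting t p^T. The following pure-Python sketch omits the extra Gram-Schmidt reorthogonalization of the full GS-PCA algorithm and assumes a nonzero residual, so it is only illustrative:

```python
def gs_pca(X, K, max_iter=10000, epsilon=1e-7):
    """NIPALS-style iterative PCA sketch (after Andrecut's GS-PCA).
    Returns scores T (MxK), loadings P (NxK), residuals R (MxN)."""
    M, N = len(X), len(X[0])
    R = [row[:] for row in X]          # residual matrix, deflated per component
    T = [[0.0] * K for _ in range(M)]
    P = [[0.0] * K for _ in range(N)]
    for k in range(K):
        t = [R[i][k % N] for i in range(M)]   # initial score vector
        lam = 0.0
        for _ in range(max_iter):
            # loadings: p = R^T t, normalized to unit length
            p = [sum(R[i][j] * t[i] for i in range(M)) for j in range(N)]
            nrm = sum(v * v for v in p) ** 0.5
            p = [v / nrm for v in p]
            # scores: t = R p; the score norm estimates the singular value
            t = [sum(R[i][j] * p[j] for j in range(N)) for i in range(M)]
            new_lam = sum(v * v for v in t) ** 0.5
            if abs(new_lam - lam) < epsilon:
                break
            lam = new_lam
        for i in range(M):
            T[i][k] = t[i]
        for j in range(N):
            P[j][k] = p[j]
        # deflate: R -= t p^T
        for i in range(M):
            for j in range(N):
                R[i][j] -= t[i] * p[j]
    return T, P, R
```

For perfectly correlated centered data the first component absorbs everything and the residual matrix R goes to zero.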

confusion_matrix

stats.confusion_matrix
