stats
Package stats could be loaded via the standalone binary, or in Lua with
`require("aprilann.stats")`.
This package contains utilities for statistical purposes.
result = stats.amean(m[, dim])
result = stats.gmean(m[, dim])
result = stats.hmean(m[, dim])
result = stats.var(m[, dim])
result = stats.std(m[, dim])
result,lags = stats.acf(m[, { lag_max, lag_step, cor }])
result = stats.cov(x[, y[, { centered, true_mean }]])
result = stats.cor(x[, y[, { centered }]])
r1,r2,... = stats.percentile(m, p1, p2, ...)
result,center = stats.center(m[, center])
result,center,scale = stats.standardize(m[,{ center, scale }])
tbl = stats.summary(m)
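As a reference for the three mean functions above, here is a minimal Python sketch of the arithmetic, geometric, and harmonic mean over a plain list (the library versions operate on `matrix` objects and accept an optional `dim` argument):

```python
import math

def amean(xs):
    # arithmetic mean: sum of the values divided by their count
    return sum(xs) / len(xs)

def gmean(xs):
    # geometric mean, computed in log-space for numerical stability
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def hmean(xs):
    # harmonic mean: reciprocal of the arithmetic mean of reciprocals
    return len(xs) / sum(1.0 / x for x in xs)

# for positive data the means are ordered: amean >= gmean >= hmean
xs = [1.0, 2.0, 4.0]
print(amean(xs), gmean(xs), hmean(xs))
```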
stats.running.mean_var
This class is useful to compute mean and variance over a large number of elements in an efficient way, following [this method](http://www.johndcook.com/standard_deviation.html) to avoid instability. This class has the following methods:
- `obj = stats.running.mean_var()`: a constructor which builds an instance.
- `obj = obj:add(number)`: adds a number to the set. It returns the caller object.
- `obj = obj:add(iterator)`: adds the sequence of numbers returned by the given iterator function. It returns the caller object.
- `obj = obj:add(table)`: adds all the elements of the given table (as an array) to the set. The elements can be numbers or functions. It returns the caller object.
- `mean,variance = obj:compute()`: computes and returns the mean and variance accumulated by all the calls to the `add` method.
- `number = obj:size()`: returns the number of elements added.
- `obj = obj:clear()`: re-initializes the object.
> obj = stats.running.mean_var()
> obj:add(4)
> obj:add(10)
> print(obj:compute())
7 18
> obj:add({2,8,6,24})
> print(obj:compute())
9 62
> obj:add( pairs({ a=2, b=10 }) )
> print(obj:compute())
8.25 50.785714285714
> print(obj:size())
8
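The numerically stable update behind this class (Welford's method, described in the article linked above) can be sketched in Python; `add` and `compute` mirror the methods above, with `compute` returning the sample variance (divisor n-1):

```python
class RunningMeanVar:
    """Online mean/variance via Welford's update, which avoids the
    catastrophic cancellation of the naive sum-of-squares formula."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self  # return self to allow chaining, as in the Lua API

    def compute(self):
        var = self.m2 / (self.n - 1) if self.n > 1 else 0.0
        return self.mean, var

obj = RunningMeanVar()
for x in (4, 10, 2, 8, 6, 24, 2, 10):
    obj.add(x)
print(obj.compute())  # mean 8.25, variance ~50.7857, as in the session above
```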
result = stats.comb(n,k)
result = stats.log_comb(n,k)
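Assuming `stats.comb(n,k)` is the binomial coefficient and `stats.log_comb(n,k)` its logarithm (useful when C(n,k) itself would overflow), a Python sketch:

```python
import math

def comb(n, k):
    # binomial coefficient C(n, k)
    return math.comb(n, k)

def log_comb(n, k):
    # log C(n, k) via log-gamma, stable for large n and k
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

print(comb(5, 2))                # 10
print(math.exp(log_comb(5, 2)))  # ~10.0
```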
table = stats.boot{ ... }
This function receives a data sample, performs a random resampling of the data and gives it to a statistic function. The procedure is repeated several times (bootstrapping technique), and the function returns a matrix with the statistics computed for all the repetitions.
The function receives a table with these fields:
- `size=number or table`: the sample population size, or a table with several sample population sizes.
- `R=number`: the number of repetitions of the procedure.
- `k=number`: the number of results returned by the `statistic` function. By default it is 1.
- `statistic=function`: a function which receives a `matrixInt32` with a list of indices for resampling the data. The function can compute k statistics (with k>=1), returned as multiple results.
- `verbose=false`: an optional boolean indicating whether you want verbose output. By default it is `false`.
- `seed=1234`: an optional number with the initial seed for the random numbers; by default it is `1234`.
- `random`: an optional random number generator. The fields `seed` and `random` are forbidden together; only one of them can be indicated.
- `ncores=1`: an optional number with the number of CPU cores to use for the computation, which allows speeding up large bootstraps. By default it is `1`.
The following example takes 1000 random values and performs bootstrap resampling over them, computing the mean and variance of each resampled population:
> -- the random population
> errors = stats.dist.normal():sample(random(567),1000)
> -- executes the bootstrap resampling function
> resampled_data = stats.boot{
size = errors:size(), R = 1000, seed = 1234, k=2,
statistic = function(sample)
local s = errors:index(1, sample)
local var,mean = stats.var(s)
-- this function returns two results, the mean and the variance (k=2)
return mean,var
end,
}
> -- the 95% confidence interval is computed, being the range [a,b]
> a,b = stats.boot.ci(resampled_data, 0.95, 1)
> print(a,b)
-0.073265952430665 0.051443199906498
a,b = stats.boot.ci(data [, confidence=0.95 [, pos=1 ]])
This function receives a matrix with data computed by `stats.boot` and returns the confidence interval for the given confidence value and the given statistic function position. It returns two numbers, the left limit and the right limit, being the interval [a,b].
-- the 95% confidence interval [a,b] of sample mean
local a,b = stats.boot.ci(resampled_data, 0.95, 1)
-- the 95% confidence interval [a,b] of sample variance
local a,b = stats.boot.ci(resampled_data, 0.95, 2)
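The bootstrap procedure and the percentile confidence interval can be sketched in plain Python; the names `boot` and `ci` here are illustrative, not the library API, which works index-based over `matrix` objects:

```python
import random
import statistics

def boot(data, R, statistic, seed=1234):
    # resample the data with replacement R times, collecting the statistic
    rng = random.Random(seed)
    n = len(data)
    return [statistic([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(R)]

def ci(results, confidence=0.95):
    # percentile confidence interval: cut (1-confidence)/2 from each tail
    s = sorted(results)
    alpha = (1.0 - confidence) / 2.0
    lo = s[int(alpha * (len(s) - 1))]
    hi = s[int(round((1.0 - alpha) * (len(s) - 1)))]
    return lo, hi

data = [float(x) for x in range(100)]  # sample mean is 49.5
means = boot(data, R=1000, statistic=statistics.mean)
a, b = ci(means, 0.95)
# the interval [a, b] brackets the sample mean
```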
matrix = stats.pca.center_by_pattern(matrix)
This function modifies the given `matrix` in-place, centering all the patterns to have zero mean. The patterns must be ordered by rows, and the features by columns. The function returns the same `matrix` as given.
> -- four patterns, 6 features
> m = matrix(4,6):uniformf(1,2)
> = m
1.10795 1.55917 1.35379 1.22971 1.38055 1.81201
1.1907 1.58711 1.38786 1.55067 1.01365 1.6242
1.94076 1.33665 1.28377 1.72529 1.26619 1.60847
1.64965 1.65704 1.55421 1.80517 1.68546 1.92166
# Matrix of size [4,6] in row_major [0x2575340 data= 0x249c490]
> stats.pca.center_by_pattern(m)
> = m
-0.299245 0.151974 -0.0534089 -0.177485 -0.0266466 0.404811
-0.201669 0.194743 -0.00450444 0.15831 -0.378713 0.231833
0.413907 -0.190203 -0.243086 0.198439 -0.260668 0.081612
-0.0625433 -0.0551615 -0.15799 0.0929705 -0.0267389 0.209462
# Matrix of size [4,6] in row_major [0x2575340 data= 0x249c490]
> = m:sum(2):scal(1/m:dim(2)) -- each pattern is centered
-3.97364e-08
-1.58946e-07
-1.19209e-07
-1.39078e-07
# Matrix of size [4,1] in row_major [0x2519460 data= 0x2491bd0]
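With NumPy, the same per-pattern centering is a one-liner (an illustration, not the library code):

```python
import numpy as np

def center_by_pattern(m):
    # subtract each row's mean so every pattern (row) becomes zero-mean;
    # modifies the array in-place and returns it, like the Lua function
    m -= m.mean(axis=1, keepdims=True)
    return m

rng = np.random.default_rng(1234)
m = rng.uniform(1.0, 2.0, size=(4, 6))
center_by_pattern(m)
print(m.mean(axis=1))  # all (numerically) zero
```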
U,S,VT = stats.pca(matrix)
This function implements the standard PCA algorithm, based on covariance matrix computation and Singular Value Decomposition (SVD). The function receives a `matrix` where features are columns and data samples are rows; that is, for M patterns and N features the `matrix` will be of size MxN. The function requires the input `matrix` to be normalized, normally by centering the mean of each pattern (for example, using the `stats.pca.center_by_pattern(...)` function, or another normalization depending on the task; see the UFLDL Tutorial for more information about data normalization). The function returns three `matrix` objects, where:
- `U` is an orthogonal matrix which contains the eigenvectors of the covariance matrix, one eigenvector per column, sorted in decreasing order.
- `S` is a vector (diagonal matrix) which corresponds to the singular values of the covariance matrix, sorted in decreasing order.
- `VT` is equivalent to `U`, and can be ignored.
> -- 10000 patterns, 100 features
> m = matrix(10000,100):uniformf(1, 2, random(1234))
> m = stats.pca.center_by_pattern(m)
> U,S,VT = stats.pca(m)
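The covariance-plus-SVD recipe that `stats.pca` follows can be sketched with NumPy (assuming the input is already normalized, as stated above):

```python
import numpy as np

def pca(m):
    # m: M patterns by rows, N features by columns, already normalized
    M = m.shape[0]
    cov = (m.T @ m) / (M - 1)       # NxN covariance matrix
    U, S, VT = np.linalg.svd(cov)   # singular values come out sorted
    return U, S, VT

rng = np.random.default_rng(1234)
m = rng.uniform(1.0, 2.0, size=(1000, 10))
m -= m.mean(axis=1, keepdims=True)  # center by pattern
U, S, VT = pca(m)
```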
i,s_i,s_mass = stats.pca.threshold(S [,mass=0.99] )
This function computes the cutting threshold for a given `S` vector with the singular values sorted in decreasing order. It receives the matrix and an optional number with the probability mass that you want to retain, by default `mass=0.99`. The function returns three values:
- `i` is the index where you need to cut.
- `s_i` is the singular value at index `i`, that is, `S:get(i)`.
- `s_mass` is the mass accumulated up to this index.
> = stats.pca.threshold(S, 0.5)
45 0.084392011165619 0.49575713463855
> = stats.pca.threshold(S, 0.7)
65 0.079482264816761 0.69391569135546
> = stats.pca.threshold(S)
97 0.06935129314661 0.98337004707581
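The cutting rule can be sketched as follows, assuming `i` is the largest (1-based) index whose accumulated singular-value mass does not exceed `mass`, which matches the session above where `s_mass` stays just below the requested mass:

```python
def pca_threshold(S, mass=0.99):
    # S: singular values sorted in decreasing order (plain list here)
    total = float(sum(S))
    acc, i = 0.0, 0
    for k, s in enumerate(S, start=1):   # 1-based index, as in the Lua API
        if (acc + s) / total > mass:
            break
        acc += s
        i = k
    return i, S[i - 1], acc / total      # index, S:get(i), retained mass

S = [4.0, 3.0, 2.0, 1.0]
print(pca_threshold(S, 0.7))  # (2, 3.0, 0.7)
```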
matrix = stats.pca.whitening(X, U, S [, epsilon=0.0] )
This function implements PCA whitening, given a data `matrix` `X` (patterns by rows, features by columns), the `U` and `S` `matrix` objects returned by `stats.pca(X)`, and an optional regularization value `epsilon`, which by default is `epsilon=0.0`. The function returns a newly allocated `matrix`, the result of applying the whitening process.
> -- loading digits from TEST/digitos/digits.png
> img = ImageIO.read("TEST/digitos/digits.png")
> m = img:to_grayscale():matrix()
> ds = dataset.matrix(m, { patternSize={16,16}, stepSize={16,16},
numSteps={100,10} })
> data_matrix = ds:toMatrix():clone()
> data_matrix = stats.pca.center_by_pattern(data_matrix)
> -- PCA computation
> U,S,VT = stats.pca(data_matrix)
> -- PCA whitening
> data_whitened = stats.pca.whitening(data_matrix, U, S, 0.02)
> = data_whitened
Large matrix, not printed to display
# Matrix of size [1000,256] [0x1be71b0 data= 0x1b23540]
If you want, it is possible to do dimensionality reduction by giving slices of the `U` and `S` matrices, instead of the whole matrices.
> -- PCA whitening and dimensionality reduction (256 => 100)
> out = stats.pca.whitening(data_matrix, U(':','1:100'), S('1:100'), 0.02)
> = out
Large matrix, not printed to display
# Matrix of size [1000,100] [0xe5c070 data= 0xe5c140]
See also documentation of ann.components.pca_whitening
(when available).
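In NumPy terms, PCA whitening projects the patterns onto the eigenvectors and rescales each direction by its singular value, so every direction ends up with unit variance (an illustration of the transform, not the library code; here the data is centered by feature for the covariance algebra to come out exactly):

```python
import numpy as np

def pca_whitening(X, U, S, epsilon=0.0):
    # project onto U and divide each component by sqrt(S + epsilon)
    return (X @ U) / np.sqrt(S + epsilon)

rng = np.random.default_rng(1234)
X = rng.normal(size=(2000, 8))
X -= X.mean(axis=0)                      # zero-mean features
cov = (X.T @ X) / (X.shape[0] - 1)
U, S, _ = np.linalg.svd(cov)
Xw = pca_whitening(X, U, S)
# the covariance of Xw is (numerically) the identity matrix
```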
X = stats.zca.whitening(X,U,S,epsilon)
This function receives the same arguments as `stats.pca.whitening`, but instead of doing a PCA whitening it computes a ZCA whitening. In this case, the output is the given `matrix` `X`, so the computation is done in-place.
> -- ZCA whitening
> data_whitened = stats.zca.whitening(data_matrix:clone(), U, S, 0.02)
> -- write to disk
> aux = data_whitened:clone("row_major")
> for sw in aux:sliding_window():iterate() do sw:adjust_range(0,1) end
> ImageIO.write(Image(aux), "digits-whitened.png")
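ZCA whitening is the PCA-whitened data rotated back with `U`, which keeps the result in the original feature space (again a NumPy illustration, not the library code):

```python
import numpy as np

def zca_whitening(X, U, S, epsilon=0.0):
    # PCA-whiten, then rotate back: X_zca = X U diag(1/sqrt(S+eps)) U^T
    return ((X @ U) / np.sqrt(S + epsilon)) @ U.T

rng = np.random.default_rng(1234)
X = rng.normal(size=(2000, 8))
X -= X.mean(axis=0)                      # zero-mean features
cov = (X.T @ X) / (X.shape[0] - 1)
U, S, _ = np.linalg.svd(cov)
Xz = zca_whitening(X, U, S)
# Xz also has identity covariance, but stays closer to X than the PCA version
```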
T,P,R = stats.pca.gs_pca{ ... }
WARNING: this implementation is in an experimental stage.
Implementation of PCA-GS algorithm, an iterative efficient algorithm for PCA computation. This code is translated from GSL CBLAS implementation of the paper Parallel GPU Implementation of Iterative PCA Algorithms, M. Andrecut. The function receives a table with the following fields:
- `X=matrix`: a MxN `matrix`, with M the number of patterns and N the pattern size.
- `K=number`: the number of components that you want to compute, K <= N.
- `max_iter=number`: the maximum number of iterations when computing every component. It is an optional parameter; by default it is `max_iter=10000`.
- `epsilon=number`: the convergence criterion. It is an optional parameter; by default it is `epsilon=1e-07`.
The function returns three matrices:
- The `T` scores matrix, with size MxK.
- The `P` loadings matrix, with size NxK.
- The `R` residuals matrix, with size MxN.
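A simplified sketch of the iterative idea in NumPy, using classic NIPALS with deflation rather than the paper's Gram-Schmidt reorthogonalization; by construction, the scores times the transposed loadings plus the residuals recovers the input:

```python
import numpy as np

def nipals_pca(X, K, max_iter=10000, epsilon=1e-7):
    X = X.astype(float).copy()
    M, N = X.shape
    T = np.zeros((M, K))   # scores
    P = np.zeros((N, K))   # loadings (unit-norm columns)
    for k in range(K):
        t = X[:, 0].copy()                   # initial score estimate
        for _ in range(max_iter):
            p = X.T @ t
            p /= np.linalg.norm(p)
            t_new = X @ p
            if np.linalg.norm(t_new - t) < epsilon:
                t = t_new
                break
            t = t_new
        T[:, k], P[:, k] = t, p
        X -= np.outer(t, p)                  # deflate the extracted component
    return T, P, X                           # X now holds the residuals R
```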
stats.confusion_matrix