numerics.hpp Documentation
Table of Contents
- Utility Methods
- Quadrature and Finite Differences
- Discrete Derivatives
- Interpolation
- Namespace neighbors
- Data Science & Machine Learning
  - Feature Maps
  - K-Fold Train Test Split
  - Label Encoding
  - Data Binning
  - K-Means Clustering
  - Linear Models
    - LASSO Regression
    - Ridge Regression
    - Splines
    - Logistic Regression
  - Neural Networks and SGD Models
    - Linear SGD Models
    - Neural Network Models
  - k Nearest Neighbors Estimators
    - k Nearest Neighbors Classifier
    - k Nearest Neighbors Regressor
  - Kernel Estimators
    - Kernel Density Estimation
    - Kernel Smoothing
numerics::optimization Documentation
numerics::ode Documentation
numerics::neuralnet Documentation
All definitions are members of namespace numerics, and all examples assume:
using namespace std;
using namespace arma; // typically only vec and mat
using namespace numerics;
using namespace ode;
using namespace optimization;
int mod(int a, int b);
In C++, the modulo operator (a%b) returns the remainder from dividing a by b, which is negative whenever a is negative; the mod(a,b) function instead returns the mathematical modulus, which lies in [0, b) for positive b even when a is negative.
Example:
int a = -3 % 11; // a = -3
int b = mod(-3, 11); // b = 8
void meshgrid(arma::mat& xgrid, arma::mat& ygrid,
const arma::vec& x, const arma::vec& y);
void meshgrid(arma::mat& xgrid, const arma::vec& x);
This function constructs a 2D grid of points (representing a region of the xy-plane), with non-uniform points defined in x and y.
Example:
vec x = linspace(0, 2, 10);
vec y = -cos(linspace(-M_PI, M_PI, 20));
mat XX, YY;
meshgrid(XX, YY, x, y);
Given a polynomial coefficient vector (produced from arma::polyfit), we can produce derivatives of the polynomial with the following function:
arma::vec polyder(const arma::vec& p, unsigned int k = 1);
where p is the coefficient vector, and k is the order of the derivative. By default, k = 1, which corresponds to the first derivative (k = 2 is the second derivative, and so on). The output is also a polynomial coefficient vector.
We can integrate the polynomial:
arma::vec polyint(const arma::vec& p, double c = 0);
where p is the coefficient vector, and c is the constant of integration; by default c = 0. The output is also a polynomial coefficient vector.
Given a discrete probability mass function, we can produce a random sample.
arma::uvec sample_from(int n, const arma::vec& pmf, const arma::uvec& labels=arma::uvec());
int sample_from(const arma::vec& pmf, const arma::uvec& labels=arma::uvec());
The output is either a set of integers between 0 and n-1 referring to the index associated with the pmf, or a set of labels sampled from labels using the pmf for random indexing.
If we simply need a decent approximation of the integral of a function, integrate() is more than sufficient:
double integrate(const function<double(double)>& f, double a, double b, const string& method="lobatto");
This function uses the default parameters for the selected integration method to approximate the integral. The method can be one of "lobatto", "simpson", "chebyshev".
The "lobatto" method corresponds to an adaptive Gauss-Lobatto 4 point approximation with a 7 point error estimate. This method adaptively modifies the evaluation mesh to bound the error below the requested tolerance. The Lobatto scheme is ideal for smooth functions which are easy to evaluate because this scheme does not recycle function evaluations. This method has the stand-alone function
double lobatto_integral(const function<double(double)>& f, double a, double b, double tol = 1e-8);
The "simpson" method corresponds to an adaptive Simpson's method scheme which uses quadratic interpolation to approximate the integral over an interval. This method modifies the evaluation mesh by comparing the approximation on each element of the mesh with an approximation over two elements in the same range in order to bound the error below a specified tolerance. This scheme is generally applicable and recycles function evaluations. This method has a stand-alone function with the added benefit of storing function evaluations to a standard library map.
double simpson_integral(map<double,double>& fmap, const function<double(double)>& f, double a, double b, double tol = 1e-5);
double simpson_integral(const function<double(double)>& f, double a, double b, double tol = 1e-5);
Finally, the "chebyshev" method corresponds to integration by polynomial interpolation. This scheme is only ideal for very smooth functions, but can achieve very high accuracy with very few function evaluations relative to both "simpson" and "lobatto". This method has the stand-alone function
double chebyshev_integral(const function<double(double)>& f, double a, double b, uint m = 32);
where m is the number of function evaluations; modest values already achieve high accuracy.
note: If the function is not (at least) continuous, the approximation may quickly become ill conditioned.
double f(double x) {return exp(-x*x);} // e^(-x^2)
double lower_bound = 0;
double upper_bound = 3.14;
double I = integrate(f, lower_bound, upper_bound);
double I_simp = integrate(f, lower_bound, upper_bound, "simpson");
double deriv(const function<double(double)>& f, double x,
double h = 1e-2, bool catch_zero = true);
This function uses 4th order finite differences with step size h to approximate the derivative at a point x. We can also ask the derivative function to catch zeros, i.e. round to zero whenever |f'(x)| < h; this option can improve the numerical stability of methods relying on results from deriv(), e.g. approximating sparse Hessians or Jacobians.
note: deriv() may never actually evaluate the function at the point of interest, which is ideal if the function is not well behaved there.
double f(double x) {return sin(x);}
double x = 0;
double df = deriv(f,x); // should return ~1.0
double g(double x) {return cos(x);}
double dg = deriv(g,x,1e-2,false); // should return ~0.0
/*
in this case we are better off with catch_zero = true because d(cos x)/dx = -sin(x) = 0 at x = 0.
*/
We can also approximate gradient vectors and Jacobian matrices:
vec grad(const function<double(const vec&)> f,
const vec& x,
double h = 1e-2,
bool catch_zeros = true);
mat approx_jacobian(const function<vec(const vec&)> f,
const vec& x,
double h = 1e-2, bool catch_zero = true);
vec jacobian_diag(const function<vec(const vec&)> f,
const vec& x,
double h = 1e-2);
These functions are wrappers for the deriv() function, so the functionality is similar. The function jacobian_diag() computes only the diagonal of the Jacobian matrix, which only makes sense when the Jacobian is a square matrix.
double f(const vec& x) {return dot(x,x);} // returns x^2 + y^2 + ...
vec F(const vec& x) {return x%(x+1);} // returns x.*(x+1)
vec x = {0,1,2};
vec gradient = grad(f,x); // should return [2*0, 2*1, 2*2] = [0,2,4]
mat Jac = approx_jacobian(F,x);
/* Jac = [1.0 0.0 0.0
          0.0 3.0 0.0
          0.0 0.0 5.0];
*/
Given a function defined over an interval, we can approximate the derivative of the function with spectral accuracy using the FFT. The function spectral_deriv() does this using polynomial interpolation: we sample the function at Chebyshev nodes scaled to the interval [a,b], and the function returns a polynomial object that can be evaluated anywhere on the interval. We can specify more sample points for more accuracy.
Polynomial spectral_deriv(const function<double(double)>& f, double a, double b, uint sample_points = 50);
Further details of the Polynomial class can be found here.
double f(double x) {return sin(x*x);}
auto df = spectral_deriv(f, -2, 2);
vec x = linspace(-2,2);
vec derivative = df(x);
For more discrete derivatives see numerics::ode.
The following class implements real-valued algebraic polynomials which can be used for interpolation and approximation. The class implements basic vector space operations such as addition, negation, and multiplication; additionally, it implements derivatives and integrals.
class Polynomial {
public:
const u_int& degree;
const vec& coefficients;
explicit Polynomial(double s=0);
explicit Polynomial(const vec& p);
explicit Polynomial(vec&& p);
explicit Polynomial(const vec& x, const vec& y);
explicit Polynomial(const vec& x, const vec& y, u_int deg);
Polynomial(const Polynomial& P);
void operator=(const Polynomial& P);
double operator()(double x) const;
vec operator()(const vec& x) const;
Polynomial derivative(u_int k=1) const;
Polynomial integral(double c=0) const;
Polynomial operator+(const Polynomial& P) const;
Polynomial operator+(double c) const;
Polynomial operator-() const;
Polynomial operator-(const Polynomial& P) const;
Polynomial operator-(double c) const;
Polynomial operator*(const Polynomial& P) const;
Polynomial operator*(double c) const;
Polynomial& operator+=(const Polynomial& P);
Polynomial& operator+=(double c);
Polynomial& operator-=(const Polynomial& P);
Polynomial& operator-=(double c);
Polynomial& operator*=(const Polynomial& P);
Polynomial& operator*=(double c);
};
The polynomial can be constructed by setting it to a scalar, or by providing a vector of coefficients ordered as in Armadillo's polyfit and polyval functions. Finally, the class can also be constructed by fitting a polynomial to a set of data (specifying deg sets the degree of the polynomial; if deg is not specified then it will be set to x.n_elem-1, i.e. the interpolating polynomial).
This class is the base for the PieceWisePoly class and is also an output of spectral_deriv and the solution form of BVPCheb.
If there is only a need to fit and interpolate a data set once, we may find it more efficient ($\mathcal O(n^3)\rightarrow\mathcal O(n^2)$) to interpolate using Lagrange interpolation:
mat lagrange_interp(const vec& x, const mat& Y, const vec& xgrid);
where xgrid is the set of values to interpolate over. Note that this function is also vectorized to accept a matrix Y, in which case the interpolation is performed column-wise, i.e. Y must have the same number of rows as x has elements.
This class implements a data structure for storing and evaluating piece-wise polynomials defined on subintervals of a single connected interval. It does not implement its own constructor beyond specifying an extrapolation method.
class PieceWisePoly {
public:
PieceWisePoly(const std::string& extrapolation="const", double val=0);
double operator()(double t) const;
vec operator()(const vec& t) const;
PieceWisePoly derivative(int k=1) const;
PieceWisePoly integral(double c=0) const;
};
The class implements an evaluation operation, and derivative and integral methods. The evaluation is well defined everywhere via the extrapolation method, which is one of:
- "const": constant value set by val on construction.
- "boundary": when evaluated at a point beyond the lower (or upper) boundary, it evaluates to the same value as exactly on that boundary.
- "linear": linear extrapolation which assures continuity in the function and its first derivative.
- "polynomial": uses the polynomial pieces at the boundaries for extrapolation.
- "periodic": uses periodic extension.
The derivative function computes the derivative of each piece. The integral function computes the integral of each piece but guarantees continuity between the pieces.
There are two functions currently implemented that actually construct PieceWisePoly objects:
PieceWisePoly natural_cubic_spline(
const vec& x,
const vec& y,
const string& extrapolation="boundary",
double val=0
);
PieceWisePoly hermite_cubic_spline(
const vec& x,
const vec& y,
const string& extrapolation="linear",
double val=0
);
PieceWisePoly hermite_cubic_spline(
const vec& x,
const vec& y,
const vec& yp,
const string& extrapolation="linear",
double val=0
);
The function natural_cubic_spline implements a cubic spline interpolant of (x,y) with "natural boundary conditions", i.e. the second derivative is zero at both end-points of the interpolation range.
The function hermite_cubic_spline implements a cubic Hermite spline interpolant of (x,y,y'). When yp is not specified, the derivative is inferred using first order finite differences (which is a classic Hermite spline).
Given sorted uniformly spaced points on an interval, we can interpolate the data using a linear combination of sinc functions
mat sinc_interp(const vec& x, const mat& Y, const vec& xgrid);
Note that this function is vectorized to accept a matrix Y, in which case the interpolation is performed column-wise, i.e. Y must have the same number of rows as x has elements.
This namespace holds utility classes supporting the more notable class KDTree, which implements a batch-initialized k-dimensional tree structure for fast nearest-neighbor queries.
class neighbors::KDTree {
public:
const mat& data; // read only view of the data
const u_long& p_norm;
const u_long& leaf_size;
explicit KDTree(int pnorm=2, int leafsize=30);
explicit KDTree(const string& pnorm, int leafsize=30);
KDTree(const KDTree& T);
void fit(const mat& x);
void fit(mat&& x);
double min(u_long dim) const;
double max(u_long dim) const;
void query(mat& distances, umat& neighbors, const mat& pts, u_long k) const;
void query(umat& neighbors, const mat& pts, u_long k) const;
};
The data structure is initialized by specifying a p-norm to induce a distance function between points; this parameter is any integer >= 1 or "inf" (i.e. a valid p value for Armadillo's norm()). Additionally, the leaf-size of the tree can be specified. For small data sets it is often more efficient to perform nearest neighbor queries by brute force than by a full tree traversal, so the tree terminates recursion once leafsize or fewer points remain under a node and stores those points in that node.
This class is the underlying data structure for KNeighborsEstimator.
Many of the models in this section inherit from the following classes:
template<typename eT> class AutoEstimator {
public:
virtual void fit(const mat& x) = 0;
virtual Col<eT> fit_predict(const mat& x) = 0;
virtual Col<eT> predict(const mat& x) const = 0;
};
template<typename eT> class Estimator {
public:
virtual void fit(const mat& x, const Col<eT>& y) = 0;
virtual Col<eT> predict(const mat& x) const = 0;
virtual double score(const mat& x, const Col<eT>& y) const = 0;
};
typedef Estimator<double> Regressor;
typedef Estimator<arma::uword> Classifier;
Classes inheriting from AutoEstimator operate on a single variable to extract useful information from it. Classes inheriting from Estimator operate on two variables to predict one from the other. The score function returns a quality-of-estimate metric for data pairs (x,y) relative to the learned model.
The following function implements the cubic kernel
mat cubic_kernel(const mat& x);
mat cubic_kernel(const mat& x1, const mat& x2);
The first instance, with only a single input, computes the Gram matrix of the kernel for a given data set, which is symmetric and can therefore be computed more efficiently. The second version computes a matrix with x1.n_rows rows and x2.n_rows columns.
These two functions are used in the implementation of the Splines class.
The following class implements multi-dimensional polynomial terms for featurization of data. Due to the combinatorial nature of multi-dimensional polynomials, the computational cost is exponential in degree, the highest order of terms desired.
class PolyFeatures {
public:
explicit PolyFeatures(int degree, bool include_intercept=false);
void fit(const mat& x);
mat fit_predict(const mat& x);
mat predict(const mat& x) const;
};
When performing cross validation we may want to split the data into training and testing subsets. The KFolds1Arr and KFolds2Arr procedures split the data into k equal sized, randomly selected subsets for repeated training and testing.
template<typename A> class KFolds1Arr {
public:
const Mat<A>& X;
explicit KFolds1Arr(int n_folds=2);
void fit(const Mat<A>& xx);
void fit(Mat<A>&& xx);
Mat<A> train(u_long i);
Mat<A> test(u_long i);
};
template<typename A, typename B> class KFolds2Arr {
public:
const Mat<A>& X;
const Mat<B>& y;
explicit KFolds2Arr(int n_folds=2);
void fit(const Mat<A>& xx, const Mat<B>& yy);
void fit(Mat<A>&& xx, Mat<B>&& yy);
Mat<A> trainX(u_long i) const;
Mat<B> trainY(u_long i) const;
Mat<A> testX(u_long i) const;
Mat<B> testY(u_long i) const;
};
typedef KFolds2Arr<double,double> KFolds;
The classes are initialized by specifying the number of splits. Calling fit copies or moves the arrays into the data structure. Calling train(i), trainX(i), or trainY(i) returns the i-th training set of the respective arrays. Calling test(i), testX(i), or testY(i) returns the i-th testing set. The const references X and y point to the fitted arrays in memory for read-only use.
mat X = randu(100,1);
mat Y = 2*X;
KFolds train_test(3);
train_test.fit(X, Y);
int j = 0;
mat train_X = train_test.trainX(j);
mat test_X = train_test.testX(j);
mat train_Y = train_test.trainY(j);
mat test_Y = train_test.testY(j);
Most classification models require labeled data to take the form of either integer labels or a one-hot encoding. The following classes transform data into these forms:
template<class T> class LabelEncoder {
public:
LabelEncoder();
void fit(const vector<T>& y);
uvec encode(const vector<T>& y) const;
vector<T> decode(const uvec& y) const;
};
class OneHotEncoder {
public:
explicit OneHotEncoder(bool plus_minus=false);
void fit(const uvec& x);
mat encode(const uvec& x) const;
uvec decode(const mat& x) const;
};
The class LabelEncoder can be used for converting standard library vectors containing any object to integer labels. The class OneHotEncoder can be used for encoding integer labels into one-hot labels. The one-hot labels always encode positive labels as 1; by default negative labels are encoded as 0, but they can be set to -1 if plus_minus is set to true. The first strategy is most often used for logistic regression and other cross-entropy classifiers, while the latter is used for support vector machines.
Whenever data is sampled from continuous features, it is sometimes useful (and easy) to bin the data into uniformly spaced bins; a primary benefit is a reduction in computational complexity. The following class produces bins for univariate data.
class BinData {
public:
const u_long& n_bins;
const double& bin_width;
const vec& bins;
const vec& counts;
explicit BinData(long bins=0);
void fit(const mat& x);
void fit(const mat& x, const vec& y);
};
We set the number of bins on initialization, but when bins is not specified, it is selected in fit: it will be set to 500 if the data size n > 1000, to n/10 if n/10 > 30, and to n/5 in all other cases.
The member variable bins stores the bins corresponding to the x variable, and counts corresponds to the number of elements belonging to each bin using a linear weighting strategy. If fitted with only x, then
sum(counts) == x.n_rows
and the result corresponds to a histogram. If fitted with both x and y, then
sum(counts) == sum(y)
where linear weights are used to average the values of y corresponding to each discrete value in bins.
vec x = randn(200);
vec y = exp(-x);
int n_bins = 20;
BinData distribution(n_bins);
distribution.fit(x);
// if we plot distribution.bins and distribution.counts we should expect a histogram approximating a normal distribution
BinData discretized(n_bins);
discretized.fit(x,y);
// if we plot discretized.bins and discretized.counts we should expect a decaying exponential
We can cluster unlabeled data using Lloyd's algorithm, accelerated both by kmeans++ for initialization and by the triangle inequality for updates between iterations. Further speed-up can be attained using stochastic gradient descent (i.e. mini-batch kmeans) for very large data.
class KMeans : public AutoEstimator<arma::uword> {
public:
const mat& clusters;
const mat& cluster_distances;
const mat& points_nearest_centers;
const uvec& index_nearest_centers;
explicit KMeans(int k, const string& pnorm, double tol=1e-2, long max_iter=100);
explicit KMeans(int k, int p_norm=2, double tol=1e-2, long max_iter=100);
virtual void fit(const mat& data) override;
uvec predict(const mat& data) const override;
virtual uvec fit_predict(const mat& data) override;
};
class KMeansSGD : public KMeans {
public:
explicit KMeansSGD(int k, int p_norm=2, int batch_size=100, double tol=1e-2, long max_iter=100);
explicit KMeansSGD(int k, const string& p_norm, int batch_size=100, double tol=1e-2, long max_iter=100);
void fit(const mat& data) override;
uvec fit_predict(const mat& data) override;
};
Both classes are constructed by specifying the number of clusters, the distance measure, the tolerance for the stopping criterion, and the maximum number of iterations; for KMeansSGD we can also specify the batch size used in each iteration. The distance measure can be an integer >= 1 or "inf".
The function fit performs the clustering on the specified data, and predict returns the row index in clusters corresponding to the nearest cluster center of each query point, i.e.
mat nearest_center = km.clusters.rows(km.predict(x));
// nearest_center.row(i) is the nearest cluster to x.row(i)
Note that calling fit_predict is more efficient in this instance than calling fit then predict.
The member variables listed below are cached results from fit:
- clusters: k (number of clusters) by dimension. Each row is a cluster center.
- cluster_distances: k by k. Distances between the cluster centers.
- points_nearest_centers: k by dimension. The points in the data which are nearest to each cluster center, such as in a k-medoids problem.
- index_nearest_centers: k by 1. The locations in data corresponding to points_nearest_centers.
Numerics implements a variety of linear models which inherit from the following class
class LinearModel : public Regressor {
public:
const vec& linear_coefs;
const double& intercept;
const double& lambda;
const double& eff_df;
explicit LinearModel(bool intercept);
void fit(const mat& X, const vec& y) = 0;
vec predict(const mat& X) const = 0;
double score(const mat& X, const vec& y) const = 0;
};
These models implement coefficient regularization with built-in cross-validation, owing to the efficiency of computation for these types of models. The member variable linear_coefs holds the learned model coefficients, and intercept is the corresponding intercept term, which is computed if intercept=true is set on construction. The variable lambda is the coefficient of the regularization term in the penalized objective. The variable eff_df is the effective degrees of freedom of the system, which is always less than or equal to X.n_cols - 1; this value is induced by the choice of lambda.
The score method of all linear models returns the coefficient of determination ($R^2$) between predict(X) and the y that are input to the function.
For data (x,y), the LASSO regression problem is:
$$\min_{\mathbf{w}} \|X\mathbf{w} - y\|_2^2 + \lambda\|\mathbf{w}\|_1$$
This implementation uses coordinate descent, and accepts tol and max_iter parameters as stopping criteria for the optimization task.
class LassoCV : public LinearModel {
public:
explicit LassoCV(bool intercept=true, double tol=1e-5, long max_iter=1000);
void fit(const mat& X, const vec& y) override;
vec predict(const mat& X) const override;
double score(const mat& X, const vec& y) const override; // mse
};
For this class, eff_df is the number of non-zero coefficients, and is therefore always an integer value.
There is also a stand-alone function without cross validation:
int coordinate_lasso(
mat& w,
const mat& X,
const mat& y,
double lambda,
double tol=1e-4,
u_long max_iter=1000,
bool verbose=false
);
int coordinate_lasso(
rowvec& b,
mat& w,
const mat& X,
const mat& y,
double lambda,
double tol=1e-4,
long max_iter=1000,
bool verbose=false
);
This solves the more general matrix problem; in the first version:
$$\min_W \|X W - Y\|_F^2 + \lambda \|W\|_{1,1}$$
and in the second version,
$$\min_{W,b} \|X W + \mathbf{1}b - Y\|_F^2 + \lambda \|W\|_{1,1}$$
This solver allows explicit selection of lambda, and can communicate progress by setting verbose=true, which displays a progress bar and an exit message.
For data (X,y), the Ridge regression problem is:
$$\min_{\mathbf{w}} \|X\mathbf{w} - y\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$
class RidgeCV : public LinearModel {
public:
const mat& cov_eigvecs;
const vec& cov_eigvals;
explicit RidgeCV(bool intercept=true);
void fit(const mat& x, const vec& y) override;
vec predict(const mat& x) const override;
double score(const mat& x, const vec& y) const override;
};
The Ridge regression problem has an explicit solution. Let
$$\tilde X = \begin{bmatrix} X & \mathbf{1} \end{bmatrix};$$
then
$$\begin{bmatrix}\mathbf{w} \\ b \end{bmatrix} = (\tilde{X}^T \tilde{X} + \lambda I)^{-1} \tilde{X}^T y.$$
RidgeCV solves the problem using the eigenvalue decomposition of the matrix $\tilde{X}^T \tilde{X}$. RidgeCV determines an optimal lambda by cross validation; the induced eff_df is not necessarily an integer.
Fit a nonlinear model to data (X,y) using the kernel trick, specifically the cubic kernel. The generated model solves a kernel ridge regression problem analogous to RidgeCV.
class Splines : public LinearModel {
public:
const vec& kernel_coefs;
const mat& X;
const mat& kernel_eigvecs;
const vec& kernel_eigvals;
explicit Splines();
void set_lambda(double l);
void set_df(double df);
void fit(const mat& x, const vec& y) override;
arma::vec predict(const mat& x) const override;
double score(const mat& x, const vec& y) const override;
};
Since the number of features scales with the number of data points, it may be expensive to search for an optimal lambda by cross validation. The methods set_lambda and set_df enable the user to specify the regularization term directly, either as a penalty value or as an effective degrees of freedom.
Fit a linear logistic regression model for data (X,y). By default, the regression is done via L2-regularization and the regularization parameter is determined by cross validation, but it can be set with set_lambda. The model minimizes a regularized log-likelihood:
$$\min_{W,b}\; \lambda\|W\|_F^2 - \sum_{i=1}^n y^T_i\log\sigma(x_i W + \mathbf{1}b).$$
The LBFGS solver is used to solve the minimization problem.
class LogisticRegression : public Classifier {
public:
const double& lambda;
const mat& linear_coefs;
const rowvec& intercepts;
const OneHotEncoder& encoder;
explicit LogisticRegression();
void set_lambda(double l);
void fit(const mat& x, const uvec& y) override;
uvec predict(const mat& x) const override;
mat predict_proba(const mat& x) const;
double score(const mat& x, const uvec& y) const override;
};
The model requires integer labels for fitting, and stores an instance of a OneHotEncoder in memory which can be referenced via encoder. The predict method returns predicted integer labels, and predict_proba returns the softmax output, which can be interpreted as the likelihood of each class given x. The score function evaluates the accuracy of predict(x) relative to y.
The linear SGD and neural network models described below inherit from the following class:
class ModelSGD {
public:
explicit ModelSGD(
const string& loss,
long max_iter,
double tol,
double l2,
double l1,
const string& optimizer,
bool verbose
);
};
These models implement a neural network which can be interfaced with a fit method inherited from an Estimator type. These models all solve the following problem:
$$\min_{W}\; \texttt{loss}(f(X;W), y) + \texttt{l2}\|W\|_F^2 + \texttt{l1}\|W\|_{1,1}.$$
The model parameters are not necessarily a single matrix as perhaps suggested here, but the interpretation of the loss functions is the same. The parameters l2 and l1 should therefore be non-negative values. The parameters loss and optimizer are explained in the neural-network section. The parameter max_iter is the maximum number of iterations for the solver, and verbose enables or disables communication with the solver, which displays a progress bar as well as an exit message.
The model solves the minimization problem using stochastic gradient descent (or one of its variants). As a result, the model training procedure is fast, but not necessarily as optimal as a model trained with the full data at each step of the optimization procedure.
There are two linear SGD models, first linear regression:
class LinearRegressorSGD : public Regressor, public ModelSGD {
public:
explicit LinearRegressorSGD(
const string& loss="mse",
long max_iter=200,
double tol=1e-4,
double l2=1e-4,
double l1=0,
const string& optimizer="adam",
bool verbose=false
);
void fit(const mat& x, const vec& y) override;
vec predict(const mat& x) const override;
double score(const mat& x, const vec& y) const override;
};
This class implements a regularized linear model whose parameters are trained via stochastic gradient descent. The accepted losses are either mse (mean-squared-error), in which case the model minimizes:
$$\min_{\mathbf{w},b} \|X\mathbf{w}+b-y\|_2^2 + \texttt{l2}\|\mathbf{w}\|_2^2 + \texttt{l1}\|\mathbf{w}\|_1$$
or mae (mean-absolute-error), in which case the model minimizes:
$$\min_{\mathbf{w},b} \|X\mathbf{w}+b-y\|_1 + \texttt{l2}\|\mathbf{w}\|_2^2 + \texttt{l1}\|\mathbf{w}\|_1.$$
The model's score is the coefficient of determination ($R^2$).
The second linear model is a linear classifier:
class LinearClassifierSGD : public Classifier, public ModelSGD {
public:
explicit LinearClassifierSGD(
const string& loss="categorical_crossentropy",
long max_iter=200,
double tol=1e-4,
double l2=1e-4,
double l1=0,
const string& optimizer="adam",
bool verbose=false
);
void fit(const mat& x, const uvec& y) override;
mat predict_proba(const mat& x) const;
uvec predict(const mat& x) const override;
double score(const mat& x, const uvec& y) const override;
};
This class implements a regularized linear classifier whose parameters are trained via stochastic gradient descent. Currently, the only loss function is categorical_crossentropy, so the model minimizes:
$$\min_{W,b}\; \texttt{l2}\|W\|_F^2 + \texttt{l1}\|W\|_{1,1} - \sum_{i=1}^n y^T_i\log\sigma(x_i W + \mathbf{1}b).$$
It behaves just as logistic regression in terms of predict, predict_proba, and score.
Another set of models trained using SGD are neural-network models. These models use a set of linear layers with non-linear activations functions to predict non-linear trends in data. Just as with the linear SGD models, there are also two neural network models. First the regressor:
class NeuralNetRegressor : public Regressor, public ModelSGD {
public:
explicit NeuralNetRegressor(
const vector<pair<int,string>>& layers={{100,"relu"}},
const string& loss="mse",
long max_iter=200,
double tol=1e-4,
double l2=1e-4,
double l1=0,
const string& optimizer="adam",
bool verbose=false
);
void fit(const mat& x, const vec& y) override;
vec predict(const mat& x) const override;
double score(const mat& x, const vec& y) const override;
};
In addition to the inputs accepted by ModelSGD, the hidden layer design should also be specified to the model on construction. This is achieved by providing a standard vector of pairs where first is an int specifying the number of nodes in the layer and second is a string specifying the activation function. Note that this is only the design of the hidden layers; the model constructs the input and output layers based on x and y during fit. By default the model has one hidden layer with 100 nodes and the ReLU activation.
The classifier:
class NeuralNetClassifier : public Classifier, public ModelSGD {
public:
explicit NeuralNetClassifier(
const vector<pair<int,string>>& layers={{100,"relu"}},
const string& loss="categorical_crossentropy",
long max_iter=200,
double tol=1e-4,
double l2=1e-4,
double l1=0,
const string& optimizer="adam",
bool verbose=false
);
void fit(const mat& x, const uvec& y) override;
mat predict_proba(const mat& x) const;
uvec predict(const mat& x) const override;
double score(const mat& x, const uvec& y) const override;
};
This model acts essentially like the regressor but instead predicts the one-hot encoding of y and maps the output to probabilities via the softmax function.
Numerics implements a kNN regressor and classifier which inherit from the following virtual class:
template<typename eT> class KNeighborsEstimator : public Estimator<eT> {
public:
const u_int& k;
const arma::uvec& ks;
const arma::vec& scores;
const neighbors::KDTree& X;
const arma::Col<eT>& y;
explicit KNeighborsEstimator(
int K,
int p_norm,
bool use_distance_weights,
int leaf_size
);
explicit KNeighborsEstimator(
int K,
const string& p_norm,
bool use_distance_weights,
int leaf_size
);
explicit KNeighborsEstimator(
uvec& Ks,
int p_norm,
bool use_distance_weights,
int leaf_size
);
explicit KNeighborsEstimator(
uvec& Ks,
const string& p_norm,
bool use_distance_weights,
int leaf_size
);
void fit(const arma::mat& xx, const arma::Col<eT>& yy) override;
void fit(arma::mat&& xx, arma::Col<eT>&& yy);
arma::Col<eT> predict(const arma::mat& xx) const;
virtual double score(const arma::mat& xx, const arma::Col<eT>& yy) const = 0;
};
This class implements inference from nearest neighbors using the KDTree class for fast queries. When predicting there are two options: (1) the average value of the k nearest neighbors, or (2) an inverse distance weighting of the nearest neighbors (set use_distance_weights=true). The number of nearest neighbors is specified by K or Ks. If more than one value is provided, then one will be selected by cross-validation. Additionally, the p_norm for computing distances can be provided, as well as the leaf_size of the KDTree. Since the data must be referenced for queries, the solver can be fit by copying the data or by moving it.
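The two prediction options can be sketched in plain C++ (a brute-force scan standing in for the KDTree; all names here are illustrative, not the library's API):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <numeric>
#include <cassert>

// Sketch of 1D k-nearest-neighbor regression with optional inverse
// distance weighting. The library uses a KDTree for neighbor queries;
// this linear scan is for illustration only.
double knn_predict(const std::vector<double>& x, const std::vector<double>& y,
                   double query, int k, bool use_distance_weights) {
    std::vector<std::size_t> idx(x.size());
    std::iota(idx.begin(), idx.end(), 0u);
    // move the k closest indices to the front
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](std::size_t i, std::size_t j) {
            return std::abs(x[i] - query) < std::abs(x[j] - query);
        });
    double num = 0.0, den = 0.0;
    for (int i = 0; i < k; ++i) {
        double d = std::abs(x[idx[i]] - query);
        double w = use_distance_weights ? 1.0 / (d + 1e-12) : 1.0;
        num += w * y[idx[i]];
        den += w;
    }
    return num / den;
}
```

With use_distance_weights=false this is the plain average of the k neighbors; with it enabled, closer neighbors dominate the prediction.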
The classifier is defined:
class KNeighborsClassifier : public KNeighborsEstimator<uword> {
public:
explicit KNeighborsClassifier(
int K,
int p_norm=2,
bool use_distance_weights=false,
int leaf_size=30
);
explicit KNeighborsClassifier(
int K,
const string& p_norm,
bool use_distance_weights=false,
int leaf_size=30
);
explicit KNeighborsClassifier(
uvec& Ks,
int p_norm=2,
bool use_distance_weights=false,
int leaf_size=30
);
explicit KNeighborsClassifier(
uvec& Ks,
const string& p_norm,
bool use_distance_weights=false,
int leaf_size=30
);
double score(const mat& xx, const uvec& yy) const override;
mat predict_proba(const mat& xx) const;
};
It differs from its parent only in its implementations of score (which is the accuracy) and predict (though via a private member rather than by override), and in providing predict_proba.
The regressor is defined:
class KNeighborsRegressor : public KNeighborsEstimator<double> {
public:
explicit KNeighborsRegressor(
int K,
int p_norm=2,
bool use_distance_weights=false,
int leaf_size=30
);
explicit KNeighborsRegressor(
int K,
const string& p_norm,
bool use_distance_weights=false,
int leaf_size=30
);
explicit KNeighborsRegressor(
const uvec& Ks,
int p_norm=2,
bool use_distance_weights=false,
int leaf_size=30
);
explicit KNeighborsRegressor(
const uvec& Ks,
const string& p_norm,
bool use_distance_weights=false,
int leaf_size=30
);
double score(const arma::mat& xx, const arma::vec& yy) const override;
};
It differs from its parent only in its implementations of score (which is R²) and predict (though via a private member rather than by override).
The following classes inherit from the base class:
class KernelEstimator {
public:
const mat& X;
const BinData& bin_data;
const double& bandwidth;
explicit KernelEstimator(const string& kernel, bool binning, long n_bins);
explicit KernelEstimator(double bdw, const string& kernel, bool binning, long n_bins);
};
These classes implement regression tasks by weighting neighbor information via a kernel. These kernels are radial basis functions:
- "gaussian" : $k(r) = \frac{1}{\sqrt{2\pi}}e^{-r^2/2}$
- "square" : $k(r) = 0.5I_{r\leq 1}(r)$
- "triangle" : $k(r) = (1-r)I_{r\leq 1}(r)$
- "parabolic" : $k(r) = \frac{3}{4}(1-r^2)I_{r\leq 1}(r)$
The bandwidth parameter controls the radius of information: a small value corresponds to a very wide radius, whereas a large value corresponds to a very narrow radius. When not specified, the bandwidth is inferred using a variety of strategies specific to each estimator.
Finally, the data may also be binned to a specified number of bins by setting binning=true and setting n_bins to the desired quantity (the latter is ignored if binning=false). We can make kernel smoothing more efficient (effectively $O(1)$ per query) for large data by first binning the data (linear binning): we compute the kernels with respect to the bins rather than the observations, without significantly reducing the quality of fit. For large data sets, Gramacki (2018) argues that 400-500 bins is almost always sufficient for univariate distributions.
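Linear binning itself can be sketched as follows (stdlib-only; the helper name `linear_bin` is illustrative, the library's BinData class handles this internally):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Linear binning: each observation splits its unit weight between
// the two bins bracketing it, proportional to proximity, so the
// total weight in the bins equals the number of observations.
std::vector<double> linear_bin(const std::vector<double>& x,
                               double lo, double hi, int n_bins) {
    std::vector<double> counts(n_bins, 0.0);
    double h = (hi - lo) / (n_bins - 1);    // bin spacing
    for (double xi : x) {
        double t = (xi - lo) / h;           // fractional bin index
        int j = static_cast<int>(std::floor(t));
        if (j < 0) { counts[0] += 1.0; continue; }
        if (j >= n_bins - 1) { counts[n_bins - 1] += 1.0; continue; }
        double frac = t - j;
        counts[j]     += 1.0 - frac;        // weight to left bin
        counts[j + 1] += frac;              // weight to right bin
    }
    return counts;
}
```

Kernels are then evaluated against the bin centers with these fractional counts as weights.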
There is also a small namespace bw
which implements basic 1D kernel estimation methods:
namespace bw {
vec eval_kernel(const vec& r, const string& K="gaussian");
double dpi(const vec& x, double s=0, const string& K="gaussian");
double dpi_binned(const BinData& bins, double s=0, const string& K="gaussian");
double rot1(int n, double s);
double rot2(const vec& x, double s = 0);
double grid_mse(
const vec& x,
const string& K,
double s=0,
int grid_size=20,
bool binning=false
);
double grid_mse(
const vec& x,
const vec& y,
const string& K,
double s=0,
int grid_size=20,
bool binning=false
);
}
- eval_kernel : evaluates any of the implemented kernels for values of r.
- dpi(x, s=0, K="gaussian") : compute the direct plug-in (L=2) bandwidth estimate for kernel density estimation.
  - x : data to estimate the bandwidth for.
  - s : precomputed standard deviation of x; if s <= 0, then s will be computed.
  - K : kernel.
- dpi_binned(bins, s=0, K="gaussian") : compute the direct plug-in (L=2) bandwidth estimate for kernel density estimation for pre-binned data.
  - bins : BinData object of pre-binned data.
  - s : precomputed standard deviation of the data; if s <= 0, then s will be computed.
  - K : kernel.
- rot1(n, s) : the original rule of thumb, bdw = 1.06 * s * n^(-1/5). Originally proposed by Silverman (1986); it is optimal whenever the true distribution is normal.
  - n : size of the data.
  - s : standard deviation of the data.
- rot2(x, s=0) : a more common rule of thumb, bdw = 0.9 * min(IQR/1.34, s) * n^(-1/5). More robust for mixture models and skewed distributions.
  - x : data to compute the bandwidth estimate for.
  - s : precomputed standard deviation of the data; if s <= 0, then s will be computed.
- grid_mse(x, K, s=0, grid_size=20, binning=false) : compute an optimal bandwidth for kernel density estimation by grid search cross-validation using an approximate RMSE, where the true density is estimated using a pilot density. The pilot density is computed using the entire data set and the rot2 estimate for the bandwidth. Then the MSE for each bandwidth is computed by predicting the density for a testing subset of the data using the rest of the data.
  - x : data to estimate the bandwidth for.
  - K : kernel to use.
  - s : precomputed standard deviation; if s <= 0, then s will be computed.
  - grid_size : number of bandwidths to test, the range being [0.05*s, range(x)/4] using log spacing.
  - binning : whether to pre-bin the data. Typically the estimate is just as good, but the computational cost can be significantly reduced.
- grid_mse(x, y, K, s=0, grid_size=20, binning=false) : compute an optimal bandwidth for kernel smoothing by grid search cross-validation using an approximate RMSE.
  - x : independent variable.
  - y : dependent variable.
  - K : smoothing kernel.
  - s : precomputed standard deviation of x; if s <= 0, then s will be computed.
  - binning : whether to bin the data or not.
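The two rule-of-thumb formulas translate directly to code. A stdlib-only sketch mirroring the bw::rot1 and bw::rot2 formulas (the crude quartile computation here is an assumption for illustration, not the library's implementation):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <cassert>

// Silverman's rule of thumb: bdw = 1.06 * s * n^(-1/5).
double rot1(int n, double s) {
    return 1.06 * s * std::pow(static_cast<double>(n), -0.2);
}

// The more robust variant: bdw = 0.9 * min(IQR/1.34, s) * n^(-1/5).
// Quartiles are taken crudely as order statistics at n/4 and 3n/4.
double rot2(std::vector<double> x, double s) {
    std::sort(x.begin(), x.end());
    std::size_t n = x.size();
    double q1 = x[n / 4], q3 = x[(3 * n) / 4];
    double iqr = q3 - q1;
    return 0.9 * std::min(iqr / 1.34, s) * std::pow(static_cast<double>(n), -0.2);
}
```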
The density function of random data can be estimated by a weighted sum of kernels centered on the data and evaluated at a query point. The class KDE
implements density estimation for one-dimensional data.
class KDE : public KernelEstimator, public AutoEstimator<double> {
public:
explicit KDE(
const string& kernel="gaussian",
const string& bandwidth_estimator="min_sd_iqr",
bool binning=false,
long n_bins=30
);
explicit KDE(
double bdw,
const string& kernel="gaussian",
bool binning=false,
long n_bins=30
);
void fit(const mat& x) override;
vec fit_predict(const mat& x) override;
vec predict(const mat& x) const override;
vec sample(int n) const;
};
The model is initialized by specifying a kernel as described in kernel estimators. Additionally, the bandwidth can be specified via bdw, or a bandwidth_estimator strategy can be specified, which is one of:
- rule_of_thumb : corresponds to bw::rot1.
- min_sd_iqr : corresponds to bw::rot2.
- plug_in : corresponds to bw::dpi.
- grid_cv : corresponds to bw::grid_mse.
The predict method returns the probability density at query points (rows of x).
The sample method returns n random samples from the distribution. The sampling is done by randomly sampling the data (with replacement) and subsampling from each of those points according to the CDF defined by the kernels centered at those points.
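The weighted sum of kernels that predict represents can be sketched for a Gaussian kernel (stdlib-only; `kde_at` is an illustrative name):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Kernel density estimate at a query point: the average of Gaussian
// kernels centered on the data, scaled by the bandwidth h.
double kde_at(const std::vector<double>& data, double query, double h) {
    const double norm = 0.3989422804014327;   // 1 / sqrt(2*pi)
    double sum = 0.0;
    for (double xi : data) {
        double r = (query - xi) / h;
        sum += norm * std::exp(-0.5 * r * r);
    }
    return sum / (data.size() * h);
}
```

Note that this sketch scales the kernel argument by h in the usual way; the library's bandwidth convention is described under kernel estimators.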
Kernel estimation may be applied to the task of non-linear regression. The class KernelSmooth
implements such an estimator for one-dimensional data.
class KernelSmooth : public KernelEstimator, public Regressor {
public:
const mat& y;
explicit KernelSmooth(
const string& kernel="gaussian",
bool binning=true,
long n_bins=30
);
explicit KernelSmooth(
double bdw,
const string& kernel="gaussian",
bool binning=true,
long n_bins=30
);
void fit(const mat& X, const vec& y) override;
vec predict(const mat& X) const override;
double score(const mat& x, const vec& y) const override;
};
The model is initialized by specifying a kernel as described in kernel estimators. A bandwidth can be specified with bdw, or one is selected with bw::grid_mse.
The score method evaluates the R² score between predict(x) and y.
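Kernel smoothing of this kind reduces to a Nadaraya-Watson weighted average, sketched here with a Gaussian kernel (stdlib-only; names illustrative):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Nadaraya-Watson kernel smoothing: the prediction is a
// kernel-weighted average of the observed responses y.
double kernel_smooth(const std::vector<double>& x, const std::vector<double>& y,
                     double query, double h) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double r = (query - x[i]) / h;
        double w = std::exp(-0.5 * r * r);  // Gaussian kernel, unnormalized
        num += w * y[i];                    // normalization cancels in the ratio
        den += w;
    }
    return num / den;
}
```

Because the weights are normalized by their sum, a constant response is reproduced exactly regardless of bandwidth.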
The following section is an overview of the namespace numerics::optimization, which implements a variety of methods for linear and nonlinear root finding and optimization.
The header file optimization.hpp declares the following types for convenience:
typedef function<vec(const vec&)> VecFunc; // R^n -> R^k
typedef function<mat(const vec&)> MatFunc; // R^n -> R^{k x n}
typedef function<double(const vec&)> dFunc; // R^n -> R
Armadillo features very robust direct solvers solve() and spsolve() for linear systems, but when less precise solutions of very large systems (especially sparse systems) are acceptable, iterative solvers may be more efficient. The function pcg() solves systems of linear equations
void pcg(mat& x, const mat& A, const mat& b, double tol = 1e-3, int max_iter = 0);
void pcg(mat& x, const sp_mat& A, const mat& b, double tol = 1e-3, int max_iter = 0);
if max_iter <= 0
, then max_iter = b.n_rows
.
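For reference, an unpreconditioned conjugate gradient iteration for a small dense symmetric positive definite system can be sketched in plain C++ (illustrative only; the library's pcg() operates on armadillo matrices):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Conjugate gradients for A x = b with A symmetric positive definite.
// Converges in at most n iterations in exact arithmetic.
std::vector<double> cg_solve(const std::vector<std::vector<double>>& A,
                             const std::vector<double>& b,
                             double tol = 1e-10) {
    std::size_t n = b.size();
    std::vector<double> x(n, 0.0), r = b, p = b;
    double rs = 0.0;
    for (double ri : r) rs += ri * ri;
    for (std::size_t iter = 0; iter < n && std::sqrt(rs) > tol; ++iter) {
        std::vector<double> Ap(n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                Ap[i] += A[i][j] * p[j];
        double pAp = 0.0;
        for (std::size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        double alpha = rs / pAp;            // step length
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];           // update solution
            r[i] -= alpha * Ap[i];          // update residual
        }
        double rs_new = 0.0;
        for (double ri : r) rs_new += ri * ri;
        for (std::size_t i = 0; i < n; ++i)
            p[i] = r[i] + (rs_new / rs) * p[i];  // new search direction
        rs = rs_new;
    }
    return x;
}
```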
For solving linear maximization problems with linear constraints, we have the simplex algorithm, which computes solutions using partial row reduction of the linear system of equations. The simplex method solves problems of the form:
$$\max_x f^T x \quad \text{subject to} \quad Ax \leq b.$$
double simplex(arma::vec& x, const arma::vec& f, const arma::mat& A, const arma::vec& b);
The function returns the maximum value of $f^Tx$ over the convex polytope defined by the constraints, and stores the maximizer in x.
All of the nonlinear solvers inherit from the NonLinSolver class:
class NonLinSolver {
public:
const double& tol;
const u_long& max_iter;
const u_long& n_iter;
const short& exit_flag;
explicit NonLinSolver(double tol, long maxiter, bool verbose);
void set_tol(double t);
void set_max_iter(long m);
std::string get_exit_flag() const;
};
All of the solvers have a tol specification which varies in interpretation for each method, but it is essentially a small parameter that determines convergence. In general, tol bounds either
$$\|x_{k+1} - x_k\|_\infty < \texttt{tol},$$
or
$$\|f(x_{k+1})\|_\infty < \texttt{tol}.$$
The parameter maxiter lets you specify the maximum number of iterations to compute before stopping the scheme; this parameter also varies from scheme to scheme based on the convergence properties and cost per iteration of each method.
If verbose == true
, then the solver will display a progress bar showing the number of iterations complete relative to max_iter
, and display a message upon completion.
The parameter exit_flag
takes on values {-1,0,1,2}
where:
- -1 : the solver was never given a problem to solve, e.g. fsolve was never called.
- 0 : the solver successfully converged.
- 1 : the maximum number of iterations was reached.
- 2 : a NaN or infinite value was encountered during a function evaluation.
The function get_exit_flag will provide a string which explains why the solver stopped, i.e. one of the options above.
This is an implementation of Newton's method for systems of nonlinear equations. A Jacobian function of the system of equations is required, as well as a good initial guess:
class Newton : public NonLinSolver {
public:
const vec& fval;
const mat& Jacobian;
explicit Newton(double tol=1e-3, long maxiter=100, bool verbose=false);
void use_pcg();
void use_lu();
virtual void fsolve(vec& x, const VecFunc& f, const MatFunc& jacobian);
};
The function fsolve solves the system given an initial guess x, and this value is updated with the solution found by Newton's method. The functions use_pcg and use_lu allow the user to specify how to invert the Jacobian at each iteration: use_pcg toggles the use of the conjugate gradient method, while use_lu toggles the use of arma::solve(); by default, Newton's method inverts the Jacobian directly.
The solver also stores the final value of fval for verifying the solution, as well as the final value of the Jacobian in Jacobian.
There is also a single variable version:
double newton_1d(const function<double(double)>& f,
const function<double(double)>& df,
double x,
double err = 1e-5);
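The iteration behind newton_1d can be sketched in a few lines (stdlib-only; the stopping rule on the update size and the iteration cap are assumptions for illustration):

```cpp
#include <cmath>
#include <functional>
#include <cassert>

// One-dimensional Newton iteration: repeat x <- x - f(x)/f'(x)
// until the update falls below err (or an iteration cap is hit).
double newton1d_sketch(const std::function<double(double)>& f,
                       const std::function<double(double)>& df,
                       double x, double err = 1e-10) {
    for (int i = 0; i < 100; ++i) {
        double dx = f(x) / df(x);
        x -= dx;
        if (std::abs(dx) < err) break;
    }
    return x;
}
```

For example, solving $x^2 - 2 = 0$ from $x_0 = 1$ converges quadratically to $\sqrt{2}$.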
The following solvers inherit from the virtual class QuasiNewton
:
class QausiNewton : public Newton {
public:
explicit QausiNewton(double tol=1e-3, long maxiter=100, bool verbose=false);
virtual void fsolve(arma::vec& x, const VecFunc& f) = 0;
};
This solver is similar to Newton's method, but does not require the Jacobian matrix to be evaluated at every step; instead, the solver takes rank 1 updates of the inverse of the estimated Jacobian using the secant equations (wikipedia). Providing a Jacobian function does improve the scheme (especially at initialization), but this solver requires far fewer Jacobian evaluations than Newton's method. If none is provided, the initial Jacobian is computed using finite differencing, as this drastically improves convergence.
class Broyden : public QausiNewton {
public:
explicit Broyden(double tol=1e-3, long maxiter=100, bool verbose=false);
void fsolve(vec& x, const VecFunc& f, const MatFunc& jacobian) override;
void fsolve(vec& x, const VecFunc& f) override;
};
This solver performs Newton-like iterations, replacing the Jacobian with a damped least squares version (wikipedia). The Jacobian is updated with Broyden's rank 1 updates of the Jacobian itself (rather than of its inverse). Just like Broyden's method, providing a Jacobian function is helpful but not necessary; when one is not provided, it will be approximated by finite differences.
class LmLSQR : public QausiNewton {
public:
explicit LmLSQR(double tol=1e-3, long maxiter=100, bool verbose=false);
void set_damping_parameters(double tau, double nu);
void fsolve(vec& x, const VecFunc& f, const MatFunc& jacobian) override;
void fsolve(vec& x, const VecFunc& f) override;
};
This solver has two tuneable parameters affecting convergence: tau, with default value 1.0e-2, and nu, with default value 2.0. Both values must be positive and can be set with set_damping_parameters.
Solves fixed point problems of the form $x = f(x)$:
class MixFPI : public NonLinSolver {
public:
explicit MixFPI(
int steps_to_remember=5,
double tol=1e-3,
long maxiter=100,
bool verbose=false
);
void fix(arma::vec& x, const VecFunc& f);
};
The parameter steps_to_remember specifies the number of previous iterates retained for the mixing.
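Plain (unaccelerated) fixed point iteration, which MixFPI improves on by remembering previous iterates, can be sketched as:

```cpp
#include <cmath>
#include <functional>
#include <cassert>

// Plain fixed point iteration x <- f(x), stopping when successive
// iterates agree to within tol. Converges when f is a contraction.
double fixed_point(const std::function<double(double)>& f,
                   double x, double tol = 1e-10, int max_iter = 1000) {
    for (int i = 0; i < max_iter; ++i) {
        double x_new = f(x);
        if (std::abs(x_new - x) < tol) return x_new;
        x = x_new;
    }
    return x;
}
```

For example, iterating $x \leftarrow \cos x$ converges to the fixed point of the cosine near 0.739.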
Adaptively selects between the secant method and inverse interpolation to find a simple root of a single variable function in the interval [a,b].
double fzero(const function<double(double)>& f, double a, double b, double tol=1e-8);
Uses the secant line as the approximation for the derivative used in Newton's method. It attempts to bracket the solution for faster convergence, so providing an interval rather than two initial guesses is best.
double secant(const function<double(double)>& f, double a, double b, double tol=1e-8);
Uses the bisection method to find the solution to a nonlinear equation within an interval.
double bisect(const function<double(double)>& f, double a, double b, double tol=1e-8);
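The bisection loop can be sketched as follows (stdlib-only; assumes f(a) and f(b) have opposite signs):

```cpp
#include <cmath>
#include <functional>
#include <cassert>

// Bisection: halve a sign-changing interval until it is smaller
// than tol; the root is bracketed at every step.
double bisect_sketch(const std::function<double(double)>& f,
                     double a, double b, double tol = 1e-10) {
    double fa = f(a);
    while (b - a > tol) {
        double c = 0.5 * (a + b), fc = f(c);
        if (fa * fc <= 0) {
            b = c;            // root lies in [a, c]
        } else {
            a = c;            // root lies in [c, b]
            fa = fc;
        }
    }
    return 0.5 * (a + b);
}
```

Each iteration halves the interval, so the error decreases linearly with a fixed rate of 1/2.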
Provided a continuous function f on the interval [a,b],
double fminbnd(const function<double(double)>& f, double a, double b, double tol=1e-8);
solves the problem:
$$\min_{x\in[a,b]} f(x).$$
Provided a continuous and finite function f,
double fminsearch(const function<double(double)>& f, double x0, double alpha=0);
solves the problem $\min_x f(x)$ in the neighborhood of x0. The parameter alpha specifies an initial step size for the algorithm in the positive direction. If one is not provided (or a non-positive value is provided), then a step size is selected automatically.
The following optimizers inherit from the virtual class:
class GradientOptimizer : public NonLinSolver {
public:
const vec& grad;
explicit GradientOptimizer(double tol=1e-3, long maxiter=100, bool verbose=false);
virtual void minimize(vec& x, const dFunc& f, const VecFunc& grad_f) = 0;
};
The function minimize solves the problem:
$$\min_x f(x),$$
given an initial guess x, which is updated with the minimizer found. The input grad_f computes the gradient of f, and the member grad is the gradient at the final point.
Uses the BFGS algorithm for minimization with the strong Wolfe conditions. This method applies symmetric rank-two updates to the inverse of the Hessian using the secant equation, with the further constraint that the Hessian remain symmetric positive definite.
class BFGS : public GradientOptimizer {
public:
const arma::mat& inv_hessian;
explicit BFGS(
double tol=1e-3,
long maxiter=100,
bool verbose=false,
double wolfe1=1e-4,
double wolfe2=0.9
);
void use_pcg();
void use_chol();
void enable_finite_differences();
void disable_finite_differences();
void minimize(vec& x, const dFunc& f, const VecFunc& grad_f) override;
void minimize(vec& x, const dFunc& f, const VecFunc& grad_f, const MatFunc& hessian);
};
The functions enable/disable_finite_differences allow the user to specify whether to approximate the initial Hessian by finite differences; this is disabled by default, so instead the initial Hessian is set to the identity matrix.
The parameters wolfe1, wolfe2 are the traditional parameters $c_1$, $c_2$ of the strong Wolfe conditions:
$$f(x_k + \alpha p_k) \leq f(x_k) + c_1 \alpha \, p_k^T \nabla f(x_k),$$
$$|p_k^T \nabla f(x_k + \alpha p_k)| \leq c_2 |p_k^T \nabla f(x_k)|,$$
with default values wolfe1 = 1e-4 and wolfe2 = 0.9.
Similarly to Newton and quasi-Newton methods, the Hessian can be inverted iteratively or directly, this time using the Cholesky decomposition when the Hessian is positive definite. The method rarely requires matrix inversion, and ideally never inverts the Hessian at all.
Note: like Broyden's method, BFGS stores the inverse Hessian in memory; this may become inefficient in space and time when the problem is sufficiently large.
Uses the limited memory BFGS algorithm, which differs from BFGS by storing a limited number of previous values of x
and grad_f(x)
rather than a full matrix. The number of steps stored can be specified by steps
.
class LBFGS : public GradientOptimizer {
public:
explicit LBFGS(
long steps=5,
double tol=1e-3,
long max_iter=100,
bool verbose=false,
double wolfe1=1e-4,
double wolfe2=0.9
);
void minimize(vec& x, const dFunc& f, const VecFunc& grad_f) override;
};
This class implements gradient descent in three ways:
- GD with line search for the step size
- GD with constant step size
- GD with constant step size and Nesterov momentum.
class MomentumGD :public GradientOptimizer {
public:
const double& damping_parameter;
const double& step_size;
explicit MomentumGD(double tol=1e-3, long maxiter=1000, bool verbose=false);
void set_step_size(double alpha, double damping_p=0.90);
void minimize(vec& x, const dFunc& f, const VecFunc& grad_f) override;
};
A good explanation of the parameter damping_parameter can be found in this article, where the author describes its role in the momentum update.
The step size can be specified by alpha; this can improve performance over the adaptive line minimization when the gradient is easy to evaluate, but may require more iterations until convergence.
Calling set_step_size and setting alpha and damping_p disables line minimization, which is otherwise used by default.
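Constant-step gradient descent with momentum can be sketched on a 1D problem (stdlib-only; a heavy-ball style update is shown as an illustration, not the library's exact update rule):

```cpp
#include <cmath>
#include <functional>
#include <cassert>

// Gradient descent with momentum on a scalar problem:
//   v <- beta*v - alpha*grad(x);  x <- x + v.
// alpha is the step size, beta the damping (momentum) parameter.
double momentum_gd(const std::function<double(double)>& grad,
                   double x, double alpha, double beta, int iters) {
    double v = 0.0;
    for (int i = 0; i < iters; ++i) {
        v = beta * v - alpha * grad(x);
        x += v;
    }
    return x;
}
```

On the quadratic $f(x) = (x-3)^2$ this converges to the minimizer 3; the momentum term smooths the trajectory rather than following the raw gradient at each step.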
The Nelder-Mead method is a derivative free method that constructs a simplex in n dimensions and iteratively updates its vertices in the direction where the function decreases in value. This method is ideal for low dimensional problems (e.g. 2, 3, maybe 10 dimensions, though probably not 100). The simplex is initialized around the guess x using a random orthonormal basis:
int n = x.n_elem;
mat U = orth(randn(n,n));
This is done to guarantee the simplex is spread out; moreover, the directions are random (as opposed to, e.g., the coordinate directions) to avoid pathological cases.
class NelderMead : public NonLinSolver {
public:
explicit NelderMead(double tol=1e-3, long maxiter=1000, bool verbose=false);
void set_step_size(double s);
void set_expansion_param(double e);
void set_contraction_param(double c);
void set_shrinking_param(double s);
void set_initial_simplex_size(double s);
void minimize(vec& x, const dFunc& f);
};
The function set_step_size sets the scaling of the reflection step; the default value is 1. The function set_expansion_param sets the scaling of the expansion step; the default value is 2. The function set_contraction_param sets the scaling of the contraction step; the default value is 0.5. The function set_shrinking_param sets the scaling of the shrinking step; the default value is 0.5. All of these parameters are explained here.
The parameter set via set_initial_simplex_size controls the overall size of the initial simplex.
This method uses a genetic algorithm for maximization.
class GeneticOptimizer : public NonLinSolver {
public:
explicit GeneticOptimizer(long seed=0, double tol=1e-1, long maxiter=100);
void set_search_radius(double r);
void set_population_size(long popsize);
void set_diversity_parameters(
double reproduction=0.5,
double mutation=0.5,
double diversity=0.2,
long cutoff=30
);
void maximize(
vec& x,
const dFunc& f,
const vec& lower_bound,
const vec& upper_bound
);
void maximize(vec& x, const dFunc& f);
};
This method has a variety of parameters controlling how the population of candidate solutions is updated:
- population_size : number of samples.
- reproduction : parameter for the geometric probability distribution of "reproducing agents". If reproduction is close to 1, then only the most fit will reproduce and the algorithm will converge more quickly, at the cost of possibly suboptimal results (such as getting stuck at local maxima). If reproduction is close to 0, then most members will be able to participate, at the cost of slower convergence. Default value = 0.5.
- mutation : rate at which to introduce random perturbations to the population. Values close to 1 result in a population with higher variance and slower convergence. Values close to 0 result in a population with lower variance and faster convergence, at the cost of possibly suboptimal results (such as getting stuck at local optima).
- diversity : rate at which to encourage diversity. A higher value means agents further from optimality will have a higher chance of reproduction, which increases the chance of leaving local minima.
- cutoff : number of iterations after which we stop incentivizing variance in the population. A lower value means quicker convergence, at the cost of possibly suboptimal results (such as getting stuck at local optima).
We can use this method for both box constrained and unconstrained maximization.
numerics.hpp
Documentation
- Differential Operators
- Initial Value Problems
- Boundary Value Problems
- Poisson's Equation
The following are all members of namespace numerics::ode.
Given an interval [L,R], the following functions construct a discrete differentiation operator over sample points in the interval:
void diffmat4(mat& D, vec& x, double L, double R, u_int npts);
void diffmat2(mat& D, vec& x, double L, double R, u_int npts);
void cheb(mat& D, vec& x, double L, double R, u_int npts);
void cheb(mat& D, vec& x, u_int npts);
In all of these functions the discrete operator is assigned to D, and the sample points are assigned to x. The parameters L and R define the end points of the interval, and npts defines how many points to sample from the interval.
The function diffmat4 samples the interval uniformly and provides a fourth order error term, i.e. the error is $\mathcal{O}(h^4)$.
The function diffmat2 samples the interval uniformly and provides a second order error term, i.e. the error is $\mathcal{O}(h^2)$.
The function cheb samples the interval at Chebyshev nodes and converges spectrally. If no interval is provided, the interval is set to $[-1,1]$.
In all three cases the operator D approximates the first derivative on the sample points x.
A more generic differentiation matrix is offered by the following two functions:
rowvec diffvec(const vec& x, double x0, u_int k=1);
void diffmat(mat& D, const vec& x, u_int k=1, u_int npt=2);
void diffmat(sp_mat& D, const vec& x, u_int k=1, u_int npt=2);
Where diffvec returns a row vector of finite difference weights whose product with function values sampled on x approximates the k-th derivative at x0.
The function diffmat produces a differentiation matrix for the k-th derivative over the (possibly nonuniform) points x. The npt parameter allows the user to select the number of points to use in the approximation; in general, the differentiation matrix will use a differencing scheme:
- on the points $x_{i-1}, x_i$ (backwards differencing) whenever npt == 2.
- on the points $x_{i-1},\dots,x_{i+\texttt{npt}-2}$ (forward biased) whenever npt % 2 == 0.
- on the points $x_{i-\lfloor\texttt{npt}/2\rfloor},\dots,x_{i+\lfloor\texttt{npt}/2\rfloor}$ (central difference) whenever npt % 2 != 0.
An exception is made at the points where these indices fall out of bounds (near the boundaries), in which case the stencil is shifted to the nearest points; requiring that the stencil still use npt points is essential for preserving the order of the scheme. The benefit of diffmat is that we can find a derivative of any order, for any (possibly nonuniform) set of sample points.
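As an illustration of how the stencil controls accuracy, here is a stdlib-only sketch of the three point central difference (the npt == 3 case for the first derivative; the helper name is illustrative):

```cpp
#include <cmath>
#include <cassert>

// Second order central difference for the first derivative:
// the error shrinks by a factor of ~4 when the spacing is halved.
double central_diff(double (*f)(double), double x, double h) {
    return (f(x + h) - f(x - h)) / (2.0 * h);
}
```

Comparing the error at spacings h and h/2 confirms the $\mathcal{O}(h^2)$ behavior.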
These matrices are especially useful for solving general linear differential equations.
Consider the linear ODE with initial condition:
$$u'(x) + \alpha u(x) = f(x), \qquad u(L) = \beta.$$
Using a differentiation matrix, we can solve it as follows:
vec f(const vec& x) {
    // do something
}
mat D;
vec x;
double L, R, alpha, beta;
int N;
diffmat2(D,x,L,R,N); // or diffmat4(), or cheb(), or diffmat()
mat A = D.rows(1,N-1).cols(1,N-1) + alpha*eye(N-1,N-1);
vec d0 = D.col(0);
vec F = f(x.rows(1,N-1)) - d0.rows(1,N-1)*beta;
vec y(N);
y(0) = beta; // boundary value
y.rows(1,N-1) = solve(A,F);
If we have a system of equations, the right hand side values form a matrix F, which should be vectorised before solving (i.e. F = vectorise(F.t())). Once a solution y is found, it can be reshaped back into matrix form (i.e. y = reshape(y,m,n).t()). For these larger systems, both initial value problems and boundary value problems may be solved.
We define a system of initial value problems as having the form:
$$u'(t) = f(t, u), \qquad u(t_0) = u_0.$$
The initial value problem solvers inherit from the InitialValueProblem
virtual class equipped with a solve_ivp
method, we also define the following types:
typedef function<vec(double,const vec&)> odefunc;
typedef function<mat(double,const vec&)> odejacobian;
class InitialValueProblem {
public:
const long& stopping_event;
const vector<double>& t;
const vector<vec>& U;
InitialValueProblem();
void add_stopping_event(
const function<double(double,const vec&)>& event,
event_direction dir = event_direction::ALL
);
virtual void solve_ivp(
const odefunc& f,
double t0,
double tf,
const vec& U0
) = 0;
};
Events are defined in a later section.
The member variables t and U define the dependent variable and solution values, respectively, such that U[i] approximates the solution at time t[i].
Besides the base class InitialValueProblem
, the solvers are further stratified into three subcategories:
- AdaptiveIVP : variable step size solvers:

class AdaptiveIVP {
public:
    const double& step_min;
    const double& max_err;
    explicit AdaptiveIVP(double tol, double minstep);
};

These solvers adaptively select a step size, restricted to be greater than minstep, to ensure the error is below tol. The member variables step_min and max_err are read only copies of minstep and tol, respectively.
- StepIVP : single step size solvers:

class StepIVP {
public:
    const double& step;
    explicit StepIVP(double step_size);
};

These solvers use a single step size for the entire integration, which is specified on initialization with step_size. The only times the step size is adjusted are at the final step, to reach tf exactly, and when an event is detected.
- ImplicitIVP : implicit solvers (solve a nonlinear equation at every step):

class ImplicitIVP {
public:
    const u_long& max_solver_iter;
    const double& max_solver_err;
    ImplicitIVP();
    void set_solver_parameters(double tol, long max_iter);
    virtual void solve_ivp(
        const odefunc& f,
        const odejacobian& J,
        double t0,
        double tf,
        const vec& U0
    ) = 0;
};

These solvers can be either constant or variable step integrators, but solve a nonlinear system of equations at each step. To solve these systems, Newton's method (or a variant) is employed; the tolerance and maximum iterations for the Newton solver can be specified via tol and max_iter by calling set_solver_parameters. These solvers overload solve_ivp to additionally accept a jacobian function.
Fourth order explicit Runge-Kutta solver with adaptive step size for error control.
class rk45 : public InitialValueProblem, public AdaptiveIVP {
public:
explicit rk45(double tol=1e-4, double minstep=1e-6);
};
The classic fourth order explicit Runge-Kutta solver with constant step size.
class rk4 : public InitialValueProblem, public StepIVP {
public:
explicit rk4(double step_size=0.01);
};
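A single step of the classic fourth order Runge-Kutta scheme, which rk4 applies repeatedly with a constant step size, can be sketched as (stdlib-only, scalar case for illustration):

```cpp
#include <cmath>
#include <functional>
#include <cassert>

// One classic RK4 step for u' = f(t, u): four slope evaluations
// combined with weights 1/6, 2/6, 2/6, 1/6.
double rk4_step(const std::function<double(double, double)>& f,
                double t, double u, double h) {
    double k1 = f(t, u);
    double k2 = f(t + 0.5*h, u + 0.5*h*k1);
    double k3 = f(t + 0.5*h, u + 0.5*h*k2);
    double k4 = f(t + h, u + h*k3);
    return u + (h / 6.0) * (k1 + 2.0*k2 + 2.0*k3 + k4);
}
```

The local error of one step is $\mathcal{O}(h^5)$, giving a fourth order global error.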
Fourth order diagonally implicit Runge-Kutta solver with adaptive step size controlled via a third order approximation. Method is A-stable and L-stable.
class rk45i : public InitialValueProblem, public AdaptiveIVP, public ImplicitIVP {
public:
explicit rk45i(double tol=1e-4, double minstep=1e-6);
};
Fifth order semi-implicit Runge-Kutta solver with constant step size. Method is A-stable and L-stable.
class rk5i : public InitialValueProblem, public StepIVP, public ImplicitIVP {
public:
explicit rk5i(double step_size = 0.01);
};
First order implicit Euler's method with constant step size. (Euler's method is not accurate but very stable):
class am1 : public InitialValueProblem, public StepIVP, public ImplicitIVP {
public:
explicit am1(double step_size = 0.01);
};
Second order implicit linear multistep method with constant step size. (Will likely be replaced by a B-stable alternative)
class am2 : public InitialValueProblem, public StepIVP, public ImplicitIVP {
public:
explicit am2(double step_size = 0.01);
};
All IVP solvers can handle events during integration which will trigger the solver to stop. These events are indicated to the solver via:
enum class event_direction {
NEGATIVE = -1,
ALL = 0,
POSITIVE = 1
};
void add_stopping_event(
const function<double(double,const vec&)>& event,
event_direction dir
);
An event function g(t, u) specifies to the solver a quantity whose zero crossings should stop the integration. If dir = ALL, the solver stops whenever g crosses zero; if dir = NEGATIVE, the solver will stop only if g crosses zero from positive to negative (and analogously from negative to positive for POSITIVE).
Multiple events can be added via this function, and all of them will be checked during integration.
The InitialValueProblem
member variable stopping_event
will be set to the event index (of the events added) that stopped it. e.g. if the third event function added stops the solver, then stopping_event = 2
.
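The sign-change check between steps can be sketched as (stdlib-only; the enum mirrors event_direction, and the helper name is illustrative):

```cpp
#include <cassert>

// Mirrors the library's event_direction values.
enum class EventDir { NEGATIVE = -1, ALL = 0, POSITIVE = 1 };

// An event triggers when the event function changes sign between
// consecutive steps, optionally only in one direction.
bool event_triggered(double g_prev, double g_next, EventDir dir) {
    bool down = (g_prev > 0.0) && (g_next <= 0.0); // + -> -
    bool up   = (g_prev < 0.0) && (g_next >= 0.0); // - -> +
    if (dir == EventDir::NEGATIVE) return down;
    if (dir == EventDir::POSITIVE) return up;
    return down || up;
}
```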
We can solve boundary value problems using finite difference methods. A procedure for simple linear problems was described in the operators section, but the following methods are applicable to a far wider set of problems of the form
$$u'(x) = f(x, u), \qquad g(u(L), u(R)) = 0,$$
on the interval [L,R]. The values of the solution are found using Newton's method. In order to solve this problem, the solvers require an initial guess of the solution; the quality of this initial guess can improve the rate at which a solution is found. One method for providing an initial guess is to solve a linearized version of the problem.
All BVP solvers inherit from the virtual class BoundaryValueProblem
, we also define the types (see also initial value problems):
typedef function<vec(double,const vec&)> odefunc;
typedef function<mat(double,const vec&)> odejacobian;
typedef function<vec(const vec&, const vec&)> boundary_conditions;
template<class SolutionT>
class BoundaryValueProblem {
public:
const u_long& num_iter;
const vec& x;
const mat& u;
const mat& du;
const vector<SolutionT>& solution;
explicit BoundaryValueProblem(double tol, long max_iter);
virtual void solve_bvp(
const odefunc& f,
const boundary_conditions& bc,
const vec& x,
const mat& U
) = 0;
virtual void solve_bvp(
const odefunc& f,
const odejacobian& J,
const boundary_conditions& bc,
const vec& x,
const mat& U
) = 0;
mat operator()(const vec& x) const;
vec operator()(double x) const;
string get_exit_flag();
};
The class is constructed by specifying tol, which indicates the (solver specific) tolerance for error, and max_iter, which is the maximum acceptable number of iterations for the solver.
The template SolutionT
defines the approximation (polynomials, etc.) and is uses for continuous evaluation of the solution. The solution along with its derivative is stored in u
and du
respectively which are the values of the collocation defined on grid points x
, so
The solution may be evaluated anywhere on [L,R] with the call method (operator()).
The method solve_bvp solves the boundary value problem using an initial grid x and an initial guess U. It is overloaded to accept the Jacobian of f. The member variable num_iter is the number of iterations needed by Newton's method to converge to a solution. The function get_exit_flag produces a message indicating the reason for stopping Newton's method; the output is one of:
"solution successfully found over the specified domain."
"solution could not be found within specified error tolerance."
"NaN or infinite value encountered."
"could not solve system of linear equations."
We first introduce the method of finite differences, where the derivatives of the solution are approximated using nearby values. The finite difference scheme is determined using the diffmat function over the grid specified by x during the call to solve_bvp, where the number of nearby values k (called npt in diffmat) is specified on construction. For uniformly spaced data, the method requires k > 1.
class BVPk : public BoundaryValueProblem<PieceWisePoly> {
public:
explicit BVPk(int k=4, double tol=1e-5, long max_iter=100);
};
In BVPk, the tol parameter is used in the stopping criterion for Newton's method, evaluated over the n = x.n_elem grid points, where d = U.n_rows is the dimension of the problem.
The solution type is a PieceWisePoly
object with constant boundary value extrapolation.
This method uses polynomial interpolation to achieve an exponentially converging scheme. The BVP is solved at Chebyshev nodes scaled to the interval of interest. Because the solver requires these specific points, the initial guess is interpolated (linearly) onto them, so the initial guess may be supplied on a grid with any number of points.
class BVPCheb : public BoundaryValueProblem<Polynomial> {
public:
explicit BVPCheb(long num_points = 32, double tol=1e-5, long max_iter=100);
};
Where num_points is the number of points used in the approximation. Note that for problems with analytic solutions, BVPCheb converges exponentially, so only a few points are ever needed (e.g. <50 points, while BVPk may require >1000 points for a comparable approximation).
In BVPCheb, the tol parameter is used in the stopping criterion for Newton's method, where d = U.n_rows is the dimension of the problem.
The solution type is a Polynomial
object.
This method uses a 4th order Lobatto IIIa collocation formula to construct a uniformly accurate solution with respect to the L2-norm by refining the grid at every iteration of the algorithm. The approximation is a C1 piece-wise cubic polynomial which collocates the solution at each grid point.
The class is defined:
class BVP3a : public BoundaryValueProblem<PieceWisePoly> {
public:
explicit BVP3a(double tol=1e-3, long max_iter=100);
};
In BVP3a
, the parameter tol
is for the stopping criteria:
$$\int_{x_{j-1}}^{x_j} \|\hat u'(x) - u'(x)\|_2^2 \, dx = \int_{x_{j-1}}^{x_j} \|\hat u'(x) - f(x, \hat u)\|_2^2 \, dx < \texttt{tol}$$
for each $j$, where $\hat u$ is the piecewise-cubic approximation. The integral is approximated by the five-point Gauss-Lobatto quadrature rule. The grid is adaptively refined: any interval $(x_{j-1}, x_j)$ whose integrated error exceeds tol is subdivided into two intervals.
Each Newton step solves a sparse system of linear equations, so the cost per iteration grows with n = x.n_elem (though x changes between iterations) and with d = U.n_rows, the dimension of the problem.
The solution type is a PieceWisePoly
object with constant boundary value extrapolation.
Given a rectangular region $[x_L, x_R] \times [y_L, y_R]$, we seek a solution of the Poisson/Helmholtz equation
$$u_{xx} + u_{yy} + \lambda u = f(x, y), \qquad u = g(x, y) \text{ on the boundary}.$$
We solve the problem with the function:
void poisson_helmholtz_2d(
mat& X,
mat& Y,
mat& U,
const function<mat(const mat&, const mat&)>& f,
const function<mat(const mat&, const mat&)>& bc,
double eig = 0,
int num_grid_points = 32);
We initialize X with the bounds in x, i.e. X = {xL, xR}, and the same for Y, i.e. Y = {yL, yR}. The matrix U does not need to be initialized. The function bc should equal g(x,y). The parameter eig is 0 by default, corresponding to Poisson's equation. On return, X, Y, U are overwritten with the grid points and solution values solved for. The parameter num_grid_points is the number of grid points along each axis, meaning the total number of points solved for will be num_grid_points^2. This solver uses the Chebyshev spectral method only.
The following section provides an overview of the namespace numerics::neuralnet, which implements feed forward neural network models for solving deep-learning tasks.
A dense feed forward network is composed of fully connected layers, with each layer taking the output of the previous layer, multiplying it by weights, adding a bias, and applying an activation function.
The most fundamental structure in the network is the individual layer. Layers are defined according to the following class:
class Layer {
public:
string name;
const u_long& input_shape;
const u_long& units;
const mat& weights;
const mat& bias;
const mat& cached_output;
explicit Layer(u_long outshape);
explicit Layer(u_long inshape, u_long outshape);
Layer(const Layer& L);
void set_activation(const string& activation);
void set_activation(const Activation& activation);
void set_weights(const mat& w);
void set_bias(const mat& b);
void disable_training_weights();
void enable_training_weights();
void disable_training_bias();
void enable_training_bias();
};
An instance of a Layer is initialized by setting the input_shape with inshape and units with outshape. These values indicate the number of inputs the layer should expect and the number of outputs it should produce. If Layer is initialized with only one argument, it is assumed to be the outshape, and the inshape will be inferred when the layer is compiled in a model. A layer computes the following:
$$y = \text{activation}(x w + b)$$
where:
- activation is set by set_activation, which can be either an activation function or the name of an existing activation function. For example, it can be set by "relu" or by Relu(). By default, the activation is "linear".
- w indicates the weights, which has shape input_shape by units. Its value can be set by calling set_weights. The value of weights can be locked by calling disable_training_weights, which will prevent an Optimizer from updating its values until enable_training_weights is called.
- b indicates the bias, which has shape 1 by units. Its value can be set by calling set_bias. The value of bias can be locked by calling disable_training_bias, which will prevent an Optimizer from updating its values until enable_training_bias is called.
The parameter cached_output is the output of the layer from the most recent call.
A feed forward network can be constructed via the class:
class Model {
public:
const u_long& total_parameters;
const vector<Layer>& layers;
Model();
explicit Model(Layer& input_layer);
Model(const Model& model);
void set_loss(const string& loss);
void set_loss(const Loss& loss);
void set_l2(double l2);
void set_l1(double l1);
void set_optimizer(const string& optim);
void set_optimizer(const Optimizer& optim);
void attach(const Layer& L);
void compile();
void save(const string& fname);
void load(const string& fname);
void fit(
const mat& x,
const mat& y,
const fit_parameters& fitp = fit_parameters()
);
mat predict(mat x) const;
};
This class stores a vector of Layer instances; a read-only view is accessed via layers.