diff --git a/docs/issue123.jmd b/docs/issue123.jmd
new file mode 100644
index 0000000..68f34a0
--- /dev/null
+++ b/docs/issue123.jmd
@@ -0,0 +1,145 @@
+---
+title: "Comments on issue #123"
+author: "Douglas Bates"
+date: "2018-10-02"
+---
+
+[The issue](https://github.com/dmbates/MixedModels.jl/issues/123) concerns missing linear algebra methods for some of the matrix types in the package.
+
+Sadly I haven't addressed the issue earlier and it is now quite old. Things have changed with the release of
+```{julia; term=true}
+using BlockArrays, DataFrames, InteractiveUtils, LinearAlgebra, MixedModels, Random
+versioninfo()
+```
+and the most recent `MixedModels` release (v1.1.0 is the latest as I write this).
+
+A function to generate an example showing the problem is given in the issue but it uses simulated data without setting the random number seed.
+I modified this function to take an `AbstractRNG` as the first argument, as is common now in simulation functions.
+```{julia; term=true}
+function simulatemodel(rng::AbstractRNG, n::Int, ng::Int)
+    df = DataFrame(Y = randn(rng, n), X = rand(rng, n), G = rand(rng, 1:ng, n), H = rand(rng, 1:ng, n))
+    f = @formula(Y ~ 1 + X + (1 + X|H) + (1|G)) # simple model with random slopes and intercepts
+    m = LinearMixedModel(f, df)
+end
+```
+
+## Internal representation of a LinearMixedModel
+
+A couple of things about the internal representation of the model in the `LinearMixedModel` type.
+The `trms` field is a vector of matrix-like objects constructed from the terms in the model formula.
+The random-effects terms are first, ordered by the number of random effects.
+These are followed by the model matrix for the fixed-effects and the response, represented as a matrix with only 1 column.
+
+```{julia;term=true}
+m1 = simulatemodel(MersenneTwister(1234321), 10_000, 100);
+show(typeof.(m1.trms))
+show(size.(m1.trms))
+```
+
+### The `A` and Λ fields
+
+The `A` field in the `LinearMixedModel` type is a blocked matrix corresponding to `M'M`, where `M` is the horizonal concatenation of these terms.
+The size of `M` would be `(10000,303)` so `A` is of size `(303,303)` but divided into blocks corresponding to the terms.
+```{julia;term=true}
+nblocks(m1.A)
+blocksizes(m1.A)
+```
+
+`A` is symmetric and sparse.  Only the lower triangle is stored explicitly.
+```{julia; term=true}
+Symmetric(m1.A, :L)
+```
+
+There is another matrix, `Λ`, stored implicitly in the random-effects terms and depending on the parameter vector `θ`.
+It is block-diagonal with the same block sizes as `A`.
+All the off-diagonal blocks are zero.
+The rightmost two diagonal blocks, corresponding to the fixed-effects and the response, are always the identity.
+The diagonal blocks for the random-effects terms are either a multiple of the identity, for a scalar random-effects term like `(1|G)`, or repetitions of small, lower-triangular matrix, `λ`, for vector-valued random-effects terms like `(1+X | H)`.
+
+The size of `λ` for a vector-valued random effects term is the dimension of the random effects for each level of the grouping factor.
+In this case `(1+X | H)` generates a two-dimensional random effect for each of the 100 levels of `H`. 
+
+
+```{julia; term=true}
+show(m1.θ)
+m1.λ[1]
+m1.λ[2]
+```
+
+Optimization of the log-likelihood is with respect to the `θ` parameters, as can be seen in the verbose output.
+```{julia; term=true}
+fit!(m1, true)
+show(m1.θ)
+m1.λ[1]
+m1.λ[2]
+```
+
+### The `L` field
+
+The `L` field in a `LinearMixedModel` is the lower-triangular Cholesky factor of
+
+\begin{equation}
+  \begin{bmatrix}
+    \Lambda^\prime\mathbf{Z}^\prime\mathbf{Z}\Lambda + \mathbf{I} & \Lambda^\prime\mathbf{Z}^\prime\mathbf{X} & \Lambda^\prime\mathbf{Z}^\prime\mathbf{y} \\
+    \mathbf{X}^\prime\mathbf{Z}\Lambda  & \mathbf{X}^\prime\mathbf{X} & \mathbf{X}^\prime\mathbf{y} \\
+    \mathbf{y}^\prime\mathbf{Z}\Lambda  & \mathbf{y}^\prime\mathbf{X} & \mathbf{y}^\prime\mathbf{y}
+  \end{bmatrix}
+\end{equation}
+
+in the same block pattern as `A`.
+```{julia; term=true}
+m1.L
+```
+
+An inelegant but fast summary of the block sizes and types of the `A` and `L` fields is available as
+```{julia; term=true}
+describeblocks(m1)
+```
+For each of the blocks in the lower triangle, the type and size of the block in `A` is listed followed by the type of the `L` block.
+The `(1,1)` block is always the biggest block and hence the most important in preserving sparsity.
+It will be `Diagonal` if the term with the most random effects is a scalar random-effects term.
+For a vector-valued random-effects term, as in this example, it is `UniformBlockDiagonal` which means that it is block diagonal with $k$ diagonal blocks of size $\ell\times\ell$ where $k$ is the number of levels of the grouping factor and $\ell$ is the dimension of the random effects associated with each level.
+In this example $k=100$ and $\ell=2$.
+
+Although the `(1,1)` block of `L` always has the same structure as that of `A`, the `(2,2)` block can be subject to "fill-in".
+Here the `(2,2)` block of `A` is `Diagonal` but the `(2,2)` block of `L` has non-zero off-diagonal elements and is stored as a full, dense matrix.
+
+Furthermore, the `(2,1)` block is created as a sparse matrix but may not have a high proportion of zeros, in which case it is converted to a dense matrix as has been done here.
+The "break-even" point where the complexity of sparse matrix operations is offset by storing and computing with only the non-zeros, as opposed to all the elements of a dense matrix even when most of them as zero, is surprisingly high, especially when using multi-threaded accelerated BLAS (Basic Linear Algebra Subroutines) such as [MKL](https://software.intel.com/en-us/mkl).
+
+### Updating `L`.
+
+Once the model representation has been created, evaluation of the profiled log-likelihood requires only installation of a new value of `θ` and updating of `L`.
+
+The `updateL!` function is now defined (in `src/pls.jl`) as
+```{julia;eval=false}
+function updateL!(m::LinearMixedModel{T}) where T
+    trms = m.trms
+    A = m.A
+    Ldat = m.L.data
+    nblk = nblocks(A, 2)
+    for j in 1:nblk
+        Ljj = scaleInflate!(Ldat[Block(j, j)], A[Block(j, j)], trms[j])
+        LjjH = isa(Ljj, Diagonal) ? Ljj : Hermitian(Ljj, :L)
+        for jj in 1:(j - 1)
+            rankUpdate!(-one(T), Ldat[Block(j, jj)], LjjH)
+        end
+        cholUnblocked!(Ljj, Val{:L})
+        for i in (j + 1):nblk
+            Lij = Λc_mul_B!(trms[i], A_mul_Λ!(copyto!(Ldat[Block(i, j)], A[Block(i, j)]), trms[j]))
+            for jj in 1:(j - 1)
+                αβA_mul_Bc!(-one(T), Ldat[Block(i, jj)], Ldat[Block(j, jj)], one(T), Lij)
+            end
+            rdiv!(Lij, isa(Ljj, Diagonal) ? Ljj : LowerTriangular(Ljj)')
+        end
+    end
+    m
+end
+```
+
+It relies on having appropriate methods for `scaleInflate!`, `rankUpdate!`, `cholUnblocked!`, `Λc_mul_B!`, `A_mul_Λ!`, `αβA_mul_Bc!` and `rdiv!`.
+All of the methods act in-place, as indicated by the `!` at the end of the name.
+
+Making sure that all of the available methods are available is a bit tricky.
+I keep accumulating examples but I don't have an exhaustive collection by any means so there may be situations that I have missed.
+I appreciate it when users bring attention to a (reproducible, if possible) example that fails because then I know what I need to add.
diff --git a/docs/jmd/MultipleTerms.jmd b/docs/jmd/MultipleTerms.jmd
new file mode 100644
index 0000000..c9e36d9
--- /dev/null
+++ b/docs/jmd/MultipleTerms.jmd
@@ -0,0 +1,512 @@
+
+# Models With Multiple Random-effects Terms
+
+The mixed models considered in the previous chapter had only one random-effects term, which was a simple, scalar random-effects term, and a single fixed-effects coefficient. Although such models can be useful, it is with the facility to use multiple random-effects terms and to use random-effects terms beyond a simple, scalar term that we can begin to realize the flexibility and versatility of mixed models.
+
+In this chapter we consider models with multiple simple, scalar random-effects terms, showing examples where the grouping factors for these terms are in completely crossed or nested or partially crossed configurations. For ease of description we will refer to the random effects as being crossed or nested although, strictly speaking, the distinction between nested and non-nested refers to the grouping factors, not the random effects.
+
+```{julia;term=true}
+using DataFrames, Distributions, FreqTables, MixedModels, RData, Random
+using Gadfly
+using Gadfly.Geom: density, histogram, line, point
+using Gadfly.Guide: xlabel, ylabel
+const dat = Dict(Symbol(k)=>v for (k,v) in 
+    load(joinpath(dirname(pathof(MixedModels)), "..", "test", "dat.rda")));
+const ppt250 = inv(500) : inv(250) : 1.;
+const zquantiles = quantile.(Normal(), ppt250);
+function hpdinterval(v, level=0.95)
+    n = length(v)
+    if !(0 < level < 1)
+        throw(ArgumentError("level = $level must be in (0, 1)"))
+    end
+    if (lbd = floor(Int, (1 - level) * n)) < 2
+        throw(ArgumentError(
+            "level = $level is too large from sample size $n"))
+    end
+    ordstat = sort(v)
+    leftendpts = ordstat[1:lbd]
+    rtendpts = ordstat[(1 + n - lbd):n]
+    (w, ind) = findmin(rtendpts - leftendpts)
+    return [leftendpts[ind], rtendpts[ind]]
+end
+```
+
+## A Model With Crossed Random Effects
+
+One of the areas in which the methods in the package for are particularly effective is in fitting models to cross-classified data where several factors have random effects associated with them.
+For example, in many experiments in psychology the reaction of each of a group of subjects to each of a group of stimuli or items is measured.
+If the subjects are considered to be a sample from a population of subjects and the items are a sample from a population of items, then it would make sense to associate random effects with both these factors.
+
+In the past it was difficult to fit mixed models with multiple, crossed grouping factors to large, possibly unbalanced, data sets.
+The methods in the package are able to do this.
+To introduce the methods let us first consider a small, balanced data set with crossed grouping factors.
+
+### The `Penicillin` Data
+
+The data are derived from Table 6.6, p. 144 of Davies (), where they are described as coming from an investigation to
+
+> assess the variability between samples of penicillin by the *B. subtilis* method. In this test method a bulk-innoculated nutrient agar medium is poured into a Petri dish of approximately 90 mm. diameter, known as a plate. When the medium has set, six small hollow cylinders or pots (about 4 mm. in diameter) are cemented onto the surface at equally spaced intervals. A few drops of the penicillin solutions to be compared are placed in the respective cylinders, and the whole plate is placed in an incubator for a given time. Penicillin diffuses from the pots into the agar, and this produces a clear circular zone of inhibition of growth of the organisms, which can be readily measured. The diameter of the zone is related in a known way to the concentration of penicillin in the solution.
+
+```{julia;term=true}
+describe(dat[:Penicillin])
+```
+
+of the data, then plot it
+
+
+The variation in the diameter is associated with the plates and with the samples. Because each plate is used only for the six samples shown here we are not interested in the contributions of specific plates as much as we are interested in the variation due to plates and in assessing the potency of the samples after accounting for this variation.
+Thus, we will use random effects for the factor.
+We will also use random effects for the factor because, as in the `Dyestuff` example, we are more interested in the sample-to-sample variability in the penicillin samples than in the potency of a particular sample.
+
+In this experiment each sample is used on each plate.
+We say that the and factors are *crossed*, as opposed to *nested* factors, which we will describe in the next section.
+By itself, the designation “crossed” just means that the factors are not nested.
+If we wish to be more specific, we could describe these factors as being *completely crossed*, which means that we have at least one observation for each combination of a level of `G` and a level of `H`.
+We can see this in Fig. [fig:Penicillindot] and, because there are moderate numbers of levels in these factors, we can check it in a cross-tabulation
+```{julia;term=true}
+freqtable(dat[:Penicillin][:H], dat[:Penicillin][:G])
+```
+
+Like the `Dyestuff` data, the factors in the `Penicillin` data are balanced.
+That is, there are exactly the same number of observations on each plate and for each sample and, furthermore, there is the same number of observations on each combination of levels.
+In this case there is exactly one observation for each combination of `G`, the plate, and `H`, the sample.
+We would describe the configuration of these two factors as an unreplicated, completely balanced, crossed design.
+
+In general, balance is a desirable but precarious property of a data set.
+We may be able to impose balance in a designed experiment but we typically cannot expect that data from an observation study will be balanced.
+Also, as anyone who analyzes real data soon finds out, expecting that balance in the design of an experiment will produce a balanced data set is contrary to “Murphy’s Law”.
+That’s why statisticians allow for missing data.
+Even when we apply each of the six samples to each of the 24 plates, something could go wrong for one of the samples on one of the plates, leaving us without a measurement for that combination of levels and thus an unbalanced data set.
+
+### A Model For the `Penicillin` Data
+
+A model incorporating random effects for both the plate and the sample is straightforward to specify — we include simple, scalar random effects terms for both these factors.
+
+```{julia;term=true}
+penm = fit!(LinearMixedModel(@formula(Y ~ 1 + (1|G) + (1|H)), dat[:Penicillin]))
+```
+
+This model display indicates that the sample-to-sample variability has the greatest contribution, then plate-to-plate variability and finally the “residual” variability that cannot be attributed to either the sample or the plate. These conclusions are consistent with what we see in the data plot (Fig. [fig:Penicillindot]).
+
+The prediction intervals on the random effects (Fig. [fig:fm03ranef])
+
+confirm that the conditional distribution of the random effects for has much less variability than does the conditional distribution of the random effects for , in the sense that the dots in the bottom panel have less variability than those in the top panel. (Note the different horizontal axes for the two panels.) However, the conditional distribution of the random effect for a particular , say sample F, has less variability than the conditional distribution of the random effect for a particular plate, say plate m. That is, the lines in the bottom panel are wider than the lines in the top panel, even after taking the different axis scales into account. This is because the conditional distribution of the random effect for a particular sample depends on 24 responses while the conditional distribution of the random effect for a particular plate depends on only 6 responses.
+
+In chapter [chap:ExamLMM] we saw that a model with a single, simple, scalar random-effects term generated a random-effects model matrix, $\mathbf Z$, that is the matrix of indicators of the levels of the grouping factor. When we have multiple, simple, scalar random-effects terms, as in model , each term generates a matrix of indicator columns and these sets of indicators are concatenated to form the model matrix $\mathbf Z$. The transpose of this matrix, shown in Fig. [fig:fm03Ztimage], contains rows of indicators for each factor.
+
+The relative covariance factor, $\Lambda_\theta$, (Fig. [fig:fm03LambdaLimage], left panel) is no longer a multiple of the identity. It is now block diagonal, with two blocks, one of size 24 and one of size 6, each of which is a multiple of the identity. The diagonal elements of the two blocks are $\theta_1$ and $\theta_2$, respectively. The numeric values of these parameters can be obtained as
+
+```{julia;term=true}
+show(penm.θ)
+```
+
+or as the `Final parameter vector` in the `opsum` field of `penm`
+
+```{julia;term=true}
+penm.optsum
+```
+
+The first parameter is the relative standard deviation of the random effects for `plate`, which has the value `0.8455646/0.5499331 = 1.53758` at convergence, and the second is the relative standard deviation of the random effects for `sample` (`1.7706475/0.5499331 = 3.21975`).
+
+Because $\Lambda_\theta$ is diagonal, the pattern of non-zeros in $\Lambda_\theta'\mathbf Z'\mathbf Z\Lambda_\theta+\mathbf I$ will be the same as that in $\mathbf Z'\mathbf Z$, shown in the middle panel of Fig. [fig:fm03LambdaLimage]. The sparse Cholesky factor, $\mathbf L$, shown in the right panel, is lower triangular and has non-zero elements in the lower right hand corner in positions where $\mathbf Z'\mathbf Z$ has systematic zeros. We say that “fill-in” has occurred when forming the sparse Cholesky decomposition. In this case there is a relatively minor amount of fill but in other cases there can be a substantial amount of fill and we shall take precautions so as to reduce this, because fill-in adds to the computational effort in determining the MLEs or the REML estimates.
+
+A bootstrap simulation of the model
+
+```{julia;term=true}
+@time penmbstp = bootstrap(10000, penm);
+```
+
+provides the density plots
+
+```{julia;echo=false;fig_width=8}
+plot(penmbstp, x = :β₁, density, xlabel("β₁"))
+```
+
+```{julia;echo=false;fig_width=8}
+plot(penmbstp, x = :σ, density, xlabel("σ"))
+```
+
+```{julia;echo=false;fig_width=8}
+plot(penmbstp, x = :σ₁, density, xlabel("σ₁"))
+```
+
+```{julia;term=true}
+plot(penmbstp, x = :σ₂, density, xlabel("σ₂"))
+```
+
+A profile zeta plot (Fig. [fig:fm03prplot]) for the parameters in model
+
+![Profile zeta plot of the parameters in model .](figure/fm03prplot-1)
+
+[fig:fm03prplot]
+
+leads to conclusions similar to those from Fig. [fig:fm1prof] for model in the previous chapter.
+The fixed-effect parameter, $\beta_1$, for the constant term has symmetric intervals and is over-dispersed relative to the normal distribution.
+The logarithm of $\sigma$ has a good normal approximation but the standard deviations of the random effects, $\sigma_1$ and $\sigma_2$, are skewed.
+The skewness for $\sigma_2$ is worse than that for $\sigma_1$, because the estimate of $\sigma_2$ is less precise than that of $\sigma_1$, in both absolute and relative senses.
+For an absolute comparison we compare the widths of the confidence intervals for these parameters.
+```
+                     2.5 %     97.5 %
+    .sig01       0.6335658  1.1821040
+    .sig02       1.0957893  3.5562919
+    .sigma       0.4858454  0.6294535
+    (Intercept) 21.2666274 24.6778176
+```
+In a relative comparison we examine the ratio of the endpoints of the interval divided by the estimate.
+```
+
+               2.5 %   97.5 %
+    .sig01 0.7492746 1.397993
+    .sig02 0.6188634 2.008469
+```
+
+The lack of precision in the estimate of $\sigma_2$ is a consequence of only having 6 distinct levels of the factor. The factor, on the other hand, has 24 distinct levels. In general it is more difficult to estimate a measure of spread, such as the standard deviation, than to estimate a measure of location, such as a mean, especially when the number of levels of the factor is small. Six levels are about the minimum number required for obtaining sensible estimates of standard deviations for simple, scalar random effects terms.
+
+shows patterns similar to those in Fig. [fig:fm1profpair] for pairs of parameters in model fit to the data. On the $\zeta$ scale (panels below the diagonal) the profile traces are nearly straight and orthogonal with the exception of the trace of $\zeta(\sigma_2)$ on $\zeta(\beta_0)$ (the horizontal trace for the panel in the $(4,2)$ position). The pattern of this trace is similar to the pattern of the trace of $\zeta(\sigma_1)$ on $\zeta(\beta_1)$ in Fig. [fig:fm1profpair]. Moving $\beta_0$ from its estimate, $\widehat{\beta}_0$, in either direction will increase the residual sum of squares. The increase in the residual variability is reflected in an increase of one or more of the dispersion parameters. The balanced experimental design results in a fixed estimate of $\sigma$ and the extra apparent variability must be incorporated into $\sigma_1$ or $\sigma_2$.
+
+Contours in panels of parameter pairs on the original scales (i.e. panels above the diagonal) can show considerable distortion from the ideal elliptical shape. For example, contours in the $\sigma_2$ versus $\sigma_1$ panel (the $(1,2)$ position) and the $\log(\sigma)$ versus $\sigma_2$ panel (in the $(2,3)$ position) are dramatically non-elliptical. However, the distortion of the contours is not due to these parameter estimates depending strongly on each other. It is almost entirely due to the choice of scale for $\sigma_1$ and $\sigma_2$. When we plot the contours on the scale of $\log(\sigma_1)$ and $\log(\sigma_2)$ instead (Fig. [fig:lpr03pairs])
+
+![Profile pairs plot for the parameters in model fit to the data. In this plot the parameters $\sigma_1$ and $\sigma_2$ are on the scale of the natural logarithm, as is the parameter $\sigma$ in this and other profile pairs plots.](figure/fm03lprpairs-1)
+
+[fig:fm03lprpairs]
+
+they are much closer to the elliptical pattern.
+
+Conversely, if we tried to plot contours on the scale of $\sigma_1^2$ and $\sigma_2^2$ (not shown), they would be hideously distorted.
+
+## A Model With Nested Random Effects
+
+In this section we again consider a simple example, this time fitting a model with *nested* grouping factors for the random effects.
+
+### The `Pastes` Data
+
+The third example from Davies (1972) is described as coming from
+
+> deliveries of a chemical paste product contained in casks where, in addition to sampling and testing errors, there are variations in quality between deliveries …As a routine, three casks selected at random from each delivery were sampled and the samples were kept for reference. …Ten of the delivery batches were sampled at random and two analytical tests carried out on each of the 30 samples.
+
+The structure and summary of the data object are
+
+```{julia;term=true}
+describe(dat[:Pastes])
+```
+
+As stated in the description in Davies (1972), there are 30 samples, three from each of the 10 delivery batches. We have labelled the levels of the `sample` factor with the label of the `batch` factor followed by `a`, `b` or `c` to distinguish the three samples taken from that batch.
+
+```{julia;term=true}
+freqtable(dat[:Pastes][:H], dat[:Pastes][:G])
+```
+
+When plotting the `strength` versus `sample` and in the data we should remember that we have two strength measurements on each of the 30 samples. It is tempting to use the cask designation (‘a’, ‘b’ and ‘c’) to determine, say, the plotting symbol within a `sample`. It would be fine to do this within a batch but the plot would be misleading if we used the same symbol for cask ‘a’ in different batches. There is no relationship between cask ‘a’ in batch ‘A’ and cask ‘a’ in batch ‘B’. The labels ‘a’, ‘b’ and ‘c’ are used only to distinguish the three samples within a batch; they do not have a meaning across batches.
+
+![Strength of paste preparations according to the and the within the batch. There were two strength measurements on each of the 30 samples; three samples each from 10 batches.](figure/Pastesplot-1)
+
+[fig:Pastesplot]
+
+In Fig. [fig:Pastesplot] we plot the two strength measurements on each of the samples within each of the batches and join up the average strength for each sample. The perceptive reader will have noticed that the levels of the factors on the vertical axis in this figure, and in Fig. [fig:Dyestuffdot] and [fig:Penicillindot], have been reordered according to increasing average response. In all these cases there is no inherent ordering of the levels of the covariate such as or . Rather than confuse our interpretation of the plot by determining the vertical displacement of points according to a random ordering, we impose an ordering according to increasing mean response. This allows us to more easily check for structure in the data, including undesirable characteristics like increasing variability of the response with increasing mean level of the response.
+
+In Fig. [fig:Pastesplot] we order the samples within each batch separately then order the batches according to increasing mean strength.
+
+Figure [fig:Pastesplot] shows considerable variability in strength between samples relative to the variability within samples. There is some indication of variability between batches, in addition to the variability induced by the samples, but not a strong indication of a batch effect. For example, batches I and D, with low mean strength relative to the other batches, each contained one sample (I:b and D:c, respectively) that had high mean strength relative to the other samples. Also, batches H and C, with comparatively high mean batch strength, contain samples H:a and C:a with comparatively low mean sample strength. In Sect. [sec:TestingSig2is0] we will examine the need for incorporating batch-to-batch variability, in addition to sample-to-sample variability, in the statistical model.
+
+#### Nested Factors
+
+Because each level of occurs with one and only one level of we say that is *nested within* . Some presentations of mixed-effects models, especially those related to *multilevel modeling*  or *hierarchical linear models* , leave the impression that one can only define random effects with respect to factors that are nested. This is the origin of the terms “multilevel”, referring to multiple, nested levels of variability, and “hierarchical”, also invoking the concept of a hierarchy of levels. To be fair, both those references do describe the use of models with random effects associated with non-nested factors, but such models tend to be treated as a special case.
+
+The blurring of mixed-effects models with the concept of multiple, hierarchical levels of variation results in an unwarranted emphasis on “levels” when defining a model and leads to considerable confusion. It is perfectly legitimate to define models having random effects associated with non-nested factors. The reasons for the emphasis on defining random effects with respect to nested factors only are that such cases do occur frequently in practice and that some of the computational methods for estimating the parameters in the models can only be easily applied to nested factors.
+
+This is not the case for the methods used in the package. Indeed there is nothing special done for models with random effects for nested factors. When random effects are associated with multiple factors exactly the same computational methods are used whether the factors form a nested sequence or are partially crossed or are completely crossed. There is, however, one aspect of nested grouping factors that we should emphasize, which is the possibility of a factor that is *implicitly nested* within another factor. Suppose, for example, that the factor was defined as having three levels instead of 30 with the implicit assumption that is nested within . It may seem silly to try to distinguish 30 different batches with only three levels of a factor but, unfortunately, data are frequently organized and presented like this, especially in text books. The factor in the data is exactly such an implicitly nested factor. If we cross-tabulate `cask` and `batch`
+```
+        batch
+    cask A B C D E F G H I J
+       a 2 2 2 2 2 2 2 2 2 2
+       b 2 2 2 2 2 2 2 2 2 2
+       c 2 2 2 2 2 2 2 2 2 2
+```
+we get the impression that the and factors are crossed, not nested. If we know that the cask should be considered as nested within the batch then we should create a new categorical variable giving the batch-cask combination, which is exactly what the factor is. A simple way to create such a factor is to use the interaction operator, ‘’, on the factors. It is advisable, but not necessary, to apply to the result thereby dropping unused levels of the interaction from the set of all possible levels of the factor. (An “unused level” is a combination that does not occur in the data.) A convenient code idiom is
+
+or
+
+In a small data set like we can quickly detect a factor being implicitly nested within another factor and take appropriate action. In a large data set, perhaps hundreds of thousands of test scores for students in thousands of schools from hundreds of school districts, it is not always obvious if school identifiers are unique across the entire data set or just within a district. If you are not sure, the safest thing to do is to create the interaction factor, as shown above, so you can be confident that levels of the district:school interaction do indeed correspond to unique schools.
+
+### Fitting a Model With Nested Random Effects
+
+Fitting a model with simple, scalar random effects for nested factors is done in exactly the same way as fitting a model with random effects for crossed grouping factors. We include random-effects terms for each factor, as in
+
+```{julia;term=true}
+pstsm = fit!(LinearMixedModel(@formula(Y ~ 1 + (1|G) + (1|H)), dat[:Pastes]))
+```
+
+Not only is the model specification similar for nested and crossed factors, the internal calculations are performed according to the methods described in Sect. [sec:definitions] for each model type. Comparing the patterns in the matrices $\Lambda$, $\mathbf Z'\mathbf Z$ and $\mathbf L$ for this model (Fig. [fig:fm04LambdaLimage]) to those in Fig. [fig:fm03LambdaLimage] shows that models with nested factors produce simple repeated structures along the diagonal of the sparse Cholesky factor, $\mathbf L$. This type of structure has the desirable property that there is no “fill-in” during calculation of the Cholesky factor. In other words, the number of non-zeros in $\mathbf L$ is the same as the number of non-zeros in the lower triangle of the matrix being factored, $\Lambda'\mathbf Z'\mathbf Z\Lambda+\mathbf I$ (which, because $\Lambda$ is diagonal, has the same structure as $\mathbf Z'\mathbf
+Z$).
+
+### Assessing Parameter Estimates in Model `pstsm`
+
+The parameter estimates are: $\widehat{\sigma_1}=$2.904, the standard deviation of the random effects for `sample`; $\widehat{\sigma_2}=$1.095, the standard deviation of the random effects for `batch`; $\widehat{\sigma}=$0.823, the standard deviation of the residual noise term; and $\widehat{\beta_0}=$60.053, the overall mean response, which is labeled `(Intercept)` in these models.
+
+The estimated standard deviation for `sample` is nearly three times as large as that for `batch`, which confirms what we saw in Fig. [fig:Pastesplot]. Indeed our conclusion from Fig. [fig:Pastesplot] was that there may not be a significant batch-to-batch variability in addition to the sample-to-sample variability.
+
+
+Plots of the prediction intervals of the random effects (Fig. [fig:fm04ranef])
+
+![95% prediction intervals on the random effects for model fit to the data.](figure/fm04ranef-1)
+
+[fig:fm04ranef]
+
+confirm this impression in that all the prediction intervals for the random effects for contain zero. Furthermore, a bootstrap sample
+
+```{julia;term=true}
+Random.seed!(4321234);
+@time pstsbstp = bootstrap(10000, pstsm);
+```
+
+```{julia;echo=false;fig_width=8}
+plot(pstsbstp, x = :β₁, density, xlabel("β₁"))
+```
+
+```{julia;term=true}
+plot(pstsbstp, x = :σ, density, xlabel("σ"))
+```
+
+```{julia;term=true}
+plot(x = pstsbstp[:σ₁], Geom.density(), Guide.xlabel("σ₁"))
+```
+
+```{julia;term=true}
+plot(x = pstsbstp[:σ₂], Geom.density(), Guide.xlabel("σ₂"))
+```
+
+and a normal probability plot of
+
+```{julia;term=true}
+plot(x = zquantiles, y = quantile(pstsbstp[:σ₂], ppt250), Geom.line,
+    Guide.xlabel("Standard Normal Quantiles"), Guide.ylabel("σ₂"))
+```
+
+```{julia;term=true}
+count(x -> x < 1.0e-5, pstsbstp[:σ₂])
+```
+
+Over 1/3 of the bootstrap samples of $\sigma_2$ are zero.  Even a 50% confidence interval on this parameter will extend to zero.  One way to calculate confidence intervals based on a bootstrap sample is sort the sample and consider all the contiguous intervals that contain, say, 95% of the samples then choose the smallest of these.  For example,
+
+```{julia;term=true}
+hpdinterval(pstsbstp[:σ₂])
+```
+
+provides the confidence interval
+
+![Profile zeta plots for the parameters in model .](figure/fm04prplot-1)
+
+[fig:fm04prplot]
+
+shows that the even the 50% profile-based confidence interval on $\sigma_2$ extends to zero.
+
+Because there are several indications that $\sigma_2$ could reasonably be zero, resulting in a simpler model incorporating random effects for only, we perform a statistical test of this hypothesis.
+
+### Testing $H_0:\sigma_2=0$ Versus $H_a:\sigma_2>0$
+
+One of the many famous statements attributed to Albert Einstein is “Everything should be made as simple as possible, but not simpler.” In statistical modeling this *principal of parsimony* is embodied in hypothesis tests comparing two models, one of which contains the other as a special case. Typically, one or more of the parameters in the more general model, which we call the *alternative hypothesis*, is constrained in some way, resulting in the restricted model, which we call the *null hypothesis*. Although we phrase the hypothesis test in terms of the parameter restriction, it is important to realize that we are comparing the quality of fits obtained with two nested models. That is, we are not assessing parameter values per se; we are comparing the model fit obtainable with some constraints on parameter values to that without the constraints.
+
+Because the more general model, $H_a$, must provide a fit that is at least as good as the restricted model, $H_0$, our purpose is to determine whether the change in the quality of the fit is sufficient to justify the greater complexity of model $H_a$. This comparison is often reduced to a *p-value*, which is the probability of seeing a difference in the model fits as large as we did, or even larger, when, in fact, $H_0$ is adequate. Like all probabilities, a p-value must be between 0 and 1. When the p-value for a test is small (close to zero) we prefer the more complex model, saying that we “reject $H_0$ in favor of $H_a$”. On the other hand, when the p-value is not small we “fail to reject $H_0$”, arguing that there is a non-negligible probability that the observed difference in the model fits could reasonably be the result of random chance, not the inherent superiority of the model $H_a$. Under these circumstances we prefer the simpler model, $H_0$, according to the principal of parsimony.
+
+These are the general principles of statistical hypothesis tests. To perform a test in practice we must specify the criterion for comparing the model fits, the method for calculating the p-value from an observed value of the criterion, and the standard by which we will determine if the p-value is “small” or not. The criterion is called the *test statistic*, the p-value is calculated from a *reference distribution* for the test statistic, and the standard for small p-values is called the *level* of the test.
+
+In Sect. [sec:variability] we referred to likelihood ratio tests (LRTs) for which the test statistic is the difference in the deviance. That is, the LRT statistic is $d_0-d_a$ where $d_a$ is the deviance in the more general ($H_a$) model fit and $d_0$ is the deviance in the constrained ($H_0$) model. An approximate reference distribution for an LRT statistic is the $\chi^2_\nu$ distribution where $\nu$, the degrees of freedom, is determined by the number of constraints imposed on the parameters of $H_a$ to produce $H_0$.
+
+The restricted model fit
+
+```{julia;term=true}
+pstsm1 = fit!(LinearMixedModel(@formula(Y ~ 1 + (1|G)), dat[:Pastes]))
+```
+
+is compared to model `pstsm` with
+
+```{julia;term=true}
+MixedModels.lrt(pstsm1, pstsm)
+```
+
+which provides a p-value of $0.5234$. Because typical standards for “small” p-values are 5% or 1%, a p-value over 50% would not be considered significant at any reasonable level.
+
+We do need to be cautious in quoting this p-value, however, because the parameter value being tested, $\sigma_2=0$, is on the boundary of set of possible values, $\sigma_2\ge 0$, for this parameter. The argument for using a $\chi^2_1$ distribution to calculate a p-value for the change in the deviance does not apply when the parameter value being tested is on the boundary. As shown in Pinheiro and Bates (2000), the p-value from the $\chi^2_1$ distribution will be “conservative” in the sense that it is larger than a simulation-based p-value would be. In the worst-case scenario the $\chi^2$-based p-value will be twice as large as it should be but, even if that were true, an effective p-value of 26% would not cause us to reject $H_0$ in favor of $H_a$.
+
+### Assessing the Reduced Model, `pstsm1`
+
+A bootstrap sample
+
+```{julia;term=true}
+@time psts1bstp = bootstrap(10000, pstsm1);
+```
+
+provides empirical density plots
+
+```{julia;echo=false;fig_width=8}
+plot(psts1bstp, x = :σ, density, xlabel("σ"))
+```
+
+and
+
+```{julia;echo=false;fig_width=8}
+plot(psts1bstp, x = :σ₁, density, xlabel("σ₁"))
+```
+
+The profile zeta plots for the remaining parameters in model (Fig. [fig:fm04aprplot])
+
+![Profile zeta plots for the parameters in model .](figure/fm04aprplot-1)
+
+[fig:fm04aprplot]
+
+are similar to the corresponding panels in Fig. [fig:fm04prplot], as confirmed by the numerical values of the confidence intervals.
+```
+                     2.5 %    97.5 %
+    .sig01       2.1579337  4.053589
+    .sig02       0.0000000  2.946591
+    .sigma       0.6520234  1.085448
+    (Intercept) 58.6636504 61.443016
+```
+
+```
+                     2.5 %    97.5 %
+    .sig01       2.4306377  4.122011
+    .sigma       0.6520207  1.085448
+    (Intercept) 58.8861831 61.220484
+```
+The confidence intervals on $\log(\sigma)$ and $\beta_0$ are similar for the two models. The confidence interval on $\sigma_1$ is slightly wider in model `pstsm1` than in `pstsm`, because the variability that is attributed to $\sigma_2$ in `pstsm` is incorporated into the variability due to $\sigma_1$ in `pstsm1`.
+
+![Profile pairs plot for the parameters in model fit to the data.](figure/fm04aprpairs-1)
+
+[fig:fm04aprpairs]
+
+The patterns in the profile pairs plot (Fig. [fig:fm04aprpairs]) for the reduced model are similar to those in Fig. [fig:fm1profpair], the profile pairs plot for model .
+
+## A Model With Partially Crossed Random Effects
+
+Especially in observational studies with multiple grouping factors, the configuration of the factors frequently ends up neither nested nor completely crossed. We describe such situations as having *partially crossed* grouping factors for the random effects.
+
+Studies in education, in which test scores for students over time are also associated with teachers and schools, usually result in partially crossed grouping factors. If students with scores in multiple years have different teachers for the different years, the student factor cannot be nested within the teacher factor. Conversely, student and teacher factors are not expected to be completely crossed. To have complete crossing of the student and teacher factors it would be necessary for each student to be observed with each teacher, which would be unusual. A longitudinal study of thousands of students with hundreds of different teachers inevitably ends up partially crossed.
+
+In this section we consider an example with thousands of students and instructors where the response is the student’s evaluation of the instructor’s effectiveness. These data, like those from most large observational studies, are quite unbalanced.
+
+### The `InstEval` Data
+
+The data are from a special evaluation of lecturers by students at the Swiss Federal Institute for Technology–Zürich (ETH–Zürich), to determine who should receive the “best-liked professor” award. These data have been slightly simplified and identifying labels have been removed, so as to preserve anonymity.
+
+The variables
+
+```{julia;term=true}
+names(dat[:InstEval])
+```
+
+have somewhat cryptic names. Factor `s` designates the student and `d` the instructor. The factor `dept` is the department for the course and `service` indicates whether the course was a service course taught to students from other departments.
+
+Although the response, `Y`, is on a scale of 1 to 5,
+
+```{julia;term=true}
+freqtable(dat[:InstEval][:Y])'
+```
+
+it is sufficiently diffuse to warrant treating it as if it were a continuous response.
+
+At this point we will fit models that have random effects for student, instructor, and department (or the combination) to these data. In the next chapter we will fit models incorporating fixed-effects for instructor and department to these data.
+
+```{julia;term=true}
+@time instm = fit(LinearMixedModel, @formula(Y ~ 1 + A + (1|G) + (1|H) + (1|I)), dat[:InstEval])
+```
+
+(Fitting this complex model to a moderately large data set takes a few seconds on a modest desktop computer. Although this is more time than required for earlier model fits, it is a remarkably short time for fitting a model of this size and complexity. In some ways it is remarkable that such a model can be fit at all on such a computer.)
+
+All three estimated standard deviations of the random effects are less than $\widehat{\sigma}$, with $\widehat{\sigma}_3$, the estimated standard deviation of the random effects for `dept`, less than one-tenth of $\widehat{\sigma}$.
+
+It is not surprising that zero is within all of the prediction intervals on the random effects for this factor (Fig. [fig:fm05ranef]). In fact, zero is close to the middle of all these prediction intervals. However, the p-value for the LRT of $H_0:\sigma_3=0$ versus $H_a:\sigma_3>0$
+
+`         `
+
+    Data: InstEval
+    Models:
+    fm05a: y ~ 1 + (1 | s) + (1 | d)
+    fm05: y ~ 1 + (1 | s) + (1 | d) + (1 | dept:service)
+          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
+    fm05a  4 237786 237823 -118889   237778                         
+    fm05   5 237663 237709 -118827   237653 124.43      1  < 2.2e-16
+
+is highly significant. That is, we have very strong evidence that we should reject $H_0$ in favor of $H_a$.
+
+The seeming inconsistency of these conclusions is due to the large sample size ($n=73421$). When a model is fit to a very large sample even the most subtle of differences can be highly “statistically significant”. The researcher or data analyst must then decide if these terms have practical significance, beyond the apparent statistical significance.
+
+The large sample size also helps to assure that the parameters have good normal approximations. We could profile this model fit but doing so would take a very long time and, in this particular case, the analysts are more interested in a model that uses fixed-effects parameters for the instructors, which we will describe in the next chapter.
+
+We could pursue other mixed-effects models here, such as using the factor and not the interaction to define random effects, but we will revisit these data in the next chapter and follow up on some of these variations there.
+
+![Image of the sparse Cholesky factor, $\mathbf L$, from model ](figure/Multiple-fm05Limage)
+
+[fig:fm05Limage]
+
+Chapter Summary
+---------------
+
+A simple, scalar random effects term in an model formula is of the form , where is an expression whose value is the *grouping factor* of the set of random effects generated by this term. Typically, is simply the name of a factor, such as in the terms or in the examples in this chapter. However, the grouping factor can be the value of an expression, such as in the last example.
+
+Because simple, scalar random-effects terms can differ only in the description of the grouping factor we refer to configurations such as crossed or nested as applying to the terms or to the random effects, although it is more accurate to refer to the configuration as applying to the grouping factors.
+
+A model formula can include several such random effects terms. Because configurations such as nested or crossed or partially crossed grouping factors are a property of the data, the specification in the model formula does not depend on the configuration. We simply include multiple random effects terms in the formula specifying the model.
+
+One apparent exception to this rule occurs with implicitly nested factors, in which the levels of one factor are only meaningful within a particular level of the other factor. In the data, levels of the factor are only meaningful within a particular level of the factor. A model formula of
+```
+strength ~ 1 + (1 | cask) + (1 | batch)
+```
+would result in a fitted model that did not appropriately reflect the sources of variability in the data. Following the simple rule that the factor should be defined so that distinct experimental or observational units correspond to distinct levels of the factor will avoid such ambiguity.
+
+For convenience, a model with multiple, nested random-effects terms can be specified as
+```
+strength ~ 1 + (1 | batch/cask)
+```
+which internally is re-expressed as
+```
+strength ~ 1 + (1 | batch) + (1 | batch:cask)
+```
+We will avoid terms of the form , preferring instead an explicit specification with simple, scalar terms based on unambiguous grouping factors.
+
+The data, described in Sec. [sec:InstEval], illustrate some of the characteristics of the real data to which mixed-effects models are now fit. There is a large number of observations associated with several grouping factors; two of which, student and instructor, have a large number of levels and are partially crossed. Such data are common in sociological and educational studies but until now it has been very difficult to fit models that appropriately reflect such a structure. Much of the literature on mixed-effects models leaves the impression that multiple random effects terms can only be associated with nested grouping factors. The resulting emphasis on hierarchical or multilevel configurations is an artifact of the computational methods used to fit the models, not the models themselves.
+
+The parameters of the models fit to small data sets have properties similar to those for the models in the previous chapter. That is, profile-based confidence intervals on the fixed-effects parameter, $\beta_0$, are symmetric about the estimate but overdispersed relative to those that would be calculated from a normal distribution and the logarithm of the residual standard deviation, $\log(\sigma)$, has a good normal approximation. Profile-based confidence intervals for the standard deviations of random effects ($\sigma_1$, $\sigma_2$, etc.) are symmetric on a logarithmic scale except for those that could be zero.
+
+Another observation from the last example is that, for data sets with a very large numbers of observations, a term in a model may be “statistically significant” even when its practical significance is questionable.
+
+Exercises
+---------
+
+These exercises use data sets from the package for . Recall that to access a particular data set, you must either attach the package
+
+or load just the one data set
+
+`   `
+
+We begin with exercises using the `ergostool` data from the `nlme` package. The analysis and graphics in these exercises is performed in Chap. [chap:Covariates]. The purpose of these exercises is to see if you can use the material from this chapter to anticipate the results quoted in the next chapter.
+
+1. Check the documentation, the structure () and a summary of the data from the package. (If you are familiar with the Star Trek television series and movies, you may want to speculate about what, exactly, the “Borg scale” is.) Are these factors are nested, partially crossed or completely crossed. Is this a replicated or an unreplicated design?
+
+2. Create a plot, similar to Fig. [fig:Penicillindot], showing the effort by subject with lines connecting points corresponding to the same stool types. Order the levels of the factor by increasing average .
+
+3. The experimenters are interested in comparing these specific stool types. In the next chapter we will fit a model with fixed-effects for the factor and random effects for , allowing us to perform comparisons of these specific types. At this point fit a model with random effects for both and . What are the relative sizes of the estimates of the standard deviations, $\widehat{\sigma}_1$ (for ), $\widehat{\sigma}_2$ (for ) and $\widehat{\sigma}$ (for the residual variability)?
+
+4. Refit the model using maximum likelihood. Check the parameter estimates and, in the case of the fixed-effects parameter, $\beta_0$, its standard error. In what ways have the parameter estimates changed? Which parameter estimates have not changed?
+
+5. Profile the fitted model and construct 95% profile-based confidence intervals on the parameters. (Note that you will get the same profile object whether you start with the REML fit or the ML fit. There is a slight advantage in starting with the ML fit.) Is the confidence interval on $\sigma_1$ close to being symmetric about its estimate? Is the confidence interval on $\sigma_2$ close to being symmetric about its estimate? Is the corresponding interval on $\log(\sigma_1)$ close to being symmetric about its estimate?
+
+6. Create the profile zeta plot for this model. For which parameters are there good normal approximations?
+
+7. Create a profile pairs plot for this model. Comment on the shapes of the profile traces in the transformed ($\zeta$) scale and the shapes of the contours in the original scales of the parameters.
+
+8. Create a plot of the 95% prediction intervals on the random effects for using (Substitute the name of your fitted model for in the call to .) Is there a clear winner among the stool types? (Assume that lower numbers on the Borg scale correspond to less effort).
+
+9. Create a plot of the 95% prediction intervals on the random effects for .
+
+10. Check the documentation, the structure () and a summary of the data from the package. Use a cross-tabulation to discover whether and are nested, partially crossed or completely crossed.
+
+11. Fit a model of the in the data with random effects for , and .
+
+12. Plot the prediction intervals for each of the three sets of random effects.
+
+13. Profile the parameters in this model. Create a profile zeta plot. Does including the random effect for appear to be warranted. Does your conclusion from the profile zeta plot agree with your conclusion from examining the prediction intervals for the random effects for ?
+
+14. Refit the model without random effects for . Perform a likelihood ratio test of $H_0:\sigma_3=0$ versus $H_a:\sigma_3>0$. Would you reject $H_0$ in favor of $H_a$ or fail to reject $H_0$? Would you reach the same conclusion if you adjusted the p-value for the test by halving it, to take into account the fact that 0 is on the boundary of the parameter region?
+
+15. Profile the reduced model (i.e. the one without random effects for ) and create profile zeta and profile pairs plots. Can you explain the apparent interaction between $\log(\sigma)$ and $\sigma_1$? (This is a difficult question.)
diff --git a/docs/jmd/Representation.jmd b/docs/jmd/Representation.jmd
new file mode 100644
index 0000000..220e820
--- /dev/null
+++ b/docs/jmd/Representation.jmd
@@ -0,0 +1,134 @@
+
+# Representation of mixed-effects models
+
+The representation of a linear mixed-effects model
+```jl
+type LinearMixedModel{T <: AbstractFloat} <: MixedModel
+    formula::Formula
+    mf::ModelFrame
+    wttrms::Vector
+    trms::Vector
+    sqrtwts::Diagonal{T}
+    Λ::Vector
+    A::Hermitian
+    L::LowerTriangular
+    optsum::OptSummary{T}
+end
+```
+underlies all the other types of mixed-effects models. The members of this struct are
+* `formula`: the formula for the model
+* `mf`: the model frame, the `terms` component for labelling fixed effects
+* `wttrms`: a length `nt` vector of weighted model matrices. The last two elements are `X` and `y`.
+* `trms`: a vector of unweighted model matrices.  If `isempty(sqrtwts)` the same object as `wttrms`
+* `Λ`: a length `nt - 2` vector of lower triangular matrices
+* `sqrtwts`: the `Diagonal` matrix of the square roots of the case weights.  Allowed to be size 0
+* `A`: a Hermitian blocked matrix representing `hcat(Z,X,y)'hcat(Z,X,y)`
+* `L`: a LowerTriangular blocked matrix - the Cholesky factor of `Λ'AΛ+I`
+* `opt`: an [`OptSummary`](@ref) object
+
+If there are no case weights then the size of `sqrtwts` is $0\times 0$ and `wttrms` is the same as `trms`.  To describe the other components, it is helpful to consider a few simple examples.
+
+## A model with a single, scalar random-effects term
+
+```{julia;term=true}
+using Feather, LinearAlgebra, MixedModels
+```
+
+```{julia;term=true}
+dsfilename = Pkg.dir("MixedModels", "test", "data", "Dyestuff.feather")
+dyestuff = Feather.read(dsfilename, nullable = false);
+```
+
+```{julia;term=true}
+m1 = lmm(Yield ~ 1 + (1 | Batch), dyestuff);
+```
+
+In the model formula there are three terms: the _response term_, `Yield`, the _Intercept_ term, `1`, which is part of the fixed-effects, and the scalar random-effects term, `(1 | Batch)`.  Random-effects terms consist of an expression on the left-hand side of the `|`, which is evaluated as a model matrix, and an expression on the right hand side: `Batch`, in this case, which is evaluated to a categorical array called the _grouping factor_ for the term.  When the left-hand side evaluates to a model matrix with a single column, as it does here, the term is said to be a _scalar random-effects term_.
+
+The `trms` member contains these three terms in the order `random-effects`, `fixed-effects`, `response`.
+
+```{julia;term=true}
+m1.trms
+```
+
+The `ScalarReMat` is a compact representation of the _indicator matrix_ for the grouping factor.  Its columns are indicators of the 6 levels of the grouping factor, `Batch`.
+
+```{julia;term=true}
+full(m1.trms[1])
+```
+
+The `A` member is a blocked matrix with $3$ blocks in each dimension, corresponding to the 3 terms.  The $(i, j)$ block is `m1.trms[i]'m1.trms[j]`  Because of symmetry only the blocks in the lower triangle are stored explicitly.  (Generally the term _Hermitian_ is used in Julia rather than _symmetric_ even for real-valued matrices, where they are synonyms.)
+
+```{julia;term=true}
+m1.A[1, 1]
+```
+
+```{julia;term=true}
+m1.A[2, 1]
+```
+
+```{julia;term=true}
+m1.A[2, 2]
+```
+
+```{julia;term=true}
+m1.A[3, 1]
+```
+
+```{julia;term=true}
+m1.A[3, 2]
+```
+
+```{julia;term=true}
+m1.A[3, 3]
+```
+
+Blocks on the diagonal of `A` will be positive-semi-definite Hermitian matrices.  More importantly, diagonal blocks corresponding to scalar random-effects terms are diagonal matrices.
+
+The `Λ` member is a vector of lower-triangular matrices whose values are modified during the iterations.  They correspond to the random-effects terms.  In the case of a scalar random-effects term, the corresponding element of `Λ` is a multiple of the identity.  The multiple is the parameter $\theta$ over which the log-likelihood is optimized.
+
+```{julia;term=true}
+m1.Λ
+```
+
+```{julia;term=true}
+setθ!(m1, [0.75258]);
+m1.Λ
+```
+
+After setting a value of $\theta$ the blocked lower Cholesky is updated.
+```jl
+function cholBlocked!{T}(m::LinearMixedModel{T})
+    A, Λ, L = m.A.data.blocks, m.Λ, m.L.data.blocks
+    n = LinAlg.checksquare(A)
+    for j in 1:n, i in j:n
+        inject!(L[i, j], A[i, j])  # like copyto! but L can be more general than A
+    end
+    for (j, λ) in enumerate(Λ)
+        for i in j:n
+            rmul!(L[i, j], λ)
+        end
+        for jj in 1:j
+            lmul!(adjoint(λ), L[j, jj])
+        end
+        L[j, j] += I
+    end
+    for j in 1:n
+        Ljj = L[j, j]
+        cholUnblocked!(Ljj, Val{:L})
+        Ljjlt = isa(Ljj, Diagonal) ? Ljj : LowerTriangular(Ljj)
+        for i in (j + 1):n
+            LinAlg.A_rdiv_Bc!(L[i, j], Ljjlt)
+        end
+        for i in (j + 1):n
+            Lij = L[i, j]
+            Lii = L[i, i]
+            rankUpdate!(-one(T), Lij, isa(Lii, Diagonal) ? Lii : Hermitian(Lii, :L))
+            for jj in (i + 1):n
+                A_mul_Bc!(-one(T), L[jj, j], Lij, one(T), L[jj, i])
+            end
+        end
+    end
+    m
+end
+```
diff --git a/docs/jmd/SimpleLMM.jmd b/docs/jmd/SimpleLMM.jmd
new file mode 100644
index 0000000..12d3e23
--- /dev/null
+++ b/docs/jmd/SimpleLMM.jmd
@@ -0,0 +1,888 @@
+# A Simple, Linear, Mixed-effects Model
+
+In this book we describe the theory behind a type of statistical model called *mixed-effects* models and the practice of fitting and analyzing such models using the [`MixedModels`](https://github.com/dmbates/MixedModels.jl) package for [`Julia`](https://julialang.org).
+These models are used in many different disciplines.
+Because the descriptions of the models can vary markedly between disciplines, we begin by describing what mixed-effects models are and by exploring a very simple example of one type of mixed model, the *linear mixed model*.
+
+This simple example allows us to illustrate the use of the `lmm` function in the `MixedModels` package for fitting such models and other functions 
+for analyzing the fitted model.
+We also describe methods of assessing the precision of the parameter estimates and of visualizing the conditional distribution of the random effects, given the observed data.
+
+## Mixed-effects Models
+
+Mixed-effects models, like many other types of statistical models, describe a relationship between a *response* variable and some of the *covariates* that have been measured or observed along with the response.
+In mixed-effects models at least one of the covariates is a *categorical* covariate representing experimental or observational “units” in the data set.
+In the example from the chemical industry discussed in this chapter, the observational unit is the batch of an intermediate product used in production of a dye.
+In medical and social sciences the observational units are often the human or animal subjects in the study.
+In agriculture the experimental units may be the plots of land or the specific plants being studied.
+
+In all of these cases the categorical covariate or covariates are observed at a set of discrete *levels*.
+We may use numbers, such as subject identifiers, to designate the particular levels that we observed but these numbers are simply labels.
+The important characteristic of a categorical covariate is that, at each observed value of the response, the covariate takes on the value of one of a set of distinct levels.
+
+Parameters associated with the particular levels of a covariate are sometimes called the “effects” of the levels.
+If the set of possible levels of the covariate is fixed and reproducible we model the covariate using *fixed-effects* parameters.
+If the levels that were observed represent a random sample from the set of all possible levels we incorporate *random effects* in the model.
+
+There are two things to notice about this distinction between fixed-effects parameters and random effects.
+First, the names are misleading because the distinction between fixed and random is more a property of the levels of the categorical covariate than a property of the effects associated with them.
+Secondly, we distinguish between “fixed-effects parameters”, which are indeed parameters in the statistical model, and “random effects”, which, strictly speaking, are not parameters.
+As we will see shortly, random effects are unobserved random variables.
+
+To make the distinction more concrete, suppose the objective is to model the annual reading test scores for students in a school district and that the covariates recorded with the score include a student identifier and the student’s gender.
+Both of these are categorical covariates.
+The levels of the gender covariate, *male* and *female*, are fixed.
+If we consider data from another school district or we incorporate scores from earlier tests, we will not change those levels.
+On the other hand, the students whose scores we observed would generally be regarded as a sample from the set of all possible students whom we could have observed.
+Adding more data, either from more school districts or from results on previous or subsequent tests, will increase the number of distinct levels of the student identifier.
+
+*Mixed-effects models* or, more simply, *mixed models* are statistical models that incorporate both fixed-effects parameters and random effects.
+Because of the way that we will define random effects, a model with random effects always includes at least one fixed-effects parameter.
+Thus, any model with random effects is a mixed model.
+
+We characterize the statistical model in terms of two random variables: a $q$-dimensional vector of random effects represented by the random variable $\mathcal{B}$ and an $n$-dimensional response vector represented by the random variable $\mathcal{Y}$.
+(We use upper-case “script” characters to denote random variables.
+The corresponding lower-case upright letter denotes a particular value of the random variable.)
+We observe the value, $\bf{y}$, of $\mathcal{Y}$. We do not observe the value, $\bf{b}$, of $\mathcal{B}$.
+
+When formulating the model we describe the unconditional distribution of $\mathcal{B}$ and the conditional distribution, $(\mathcal{Y}|\mathcal{B}=\bf{b})$.
+The descriptions of the distributions involve the form of the distribution and the values of certain parameters.
+We use the observed values of the response and the covariates to estimate these parameters and to make inferences about them.
+
+That’s the big picture.
+Now let’s make this more concrete by describing a particular, versatile class of mixed models called *linear mixed models* and by studying a simple example of such a model.
+First we describe the data in the example.
+
+## The `Dyestuff` and `Dyestuff2` Data
+
+Models with random effects have been in use for a long time.
+The first edition of the classic book, *Statistical Methods in Research and Production*, edited by O.L. Davies, was published in 1947 and contained examples of the use of random effects to characterize batch-to-batch variability in chemical processes.
+The data from one of these examples are available as `Dyestuff` in the `MixedModels` package.
+In this section we describe and plot these data and introduce a second example, the `Dyestuff2` data, described in Box and Tiao (1973).
+
+### The `Dyestuff` Data
+
+The data are described in Davies (), the fourth edition of the book mentioned above, as coming from
+
+> an investigation to find out how much the variation from batch to batch in the quality of an intermediate product (H-acid) contributes to the variation in the yield of the dyestuff (Naphthalene Black 12B) made from it. In the experiment six samples of the intermediate, representing different batches of works manufacture, were obtained, and five preparations of the dyestuff were made in the laboratory from each sample. The equivalent yield of each preparation as grams of standard colour was determined by dye-trial.
+
+First attach the packages to be used
+```{julia;term=true}
+using DataFrames, Distributions, GLM, Gadfly, LinearAlgebra
+using MixedModels, Random, RData, SparseArrays
+```
+and allow for unqualified names for some graphics functions
+```{julia;term=true}
+using Gadfly.Geom: point, line, histogram, density, vline
+using Gadfly.Guide: xlabel, ylabel, yticks
+```
+
+The `Dyestuff` data are available in the [`lme4`](https://github.com/lme4/lme4) package for [`R`](http://r-project.org).
+This data frame and others have been stored in saved RData format in the `test` directory within the `MixedModels` package.
+
+Access the `Dyestuff` data
+```{julia;term=true}
+const dat = Dict(Symbol(k)=>v for (k,v) in 
+    load(joinpath(dirname(pathof(MixedModels)), "..", "test", "dat.rda")));
+dyestuff = dat[:Dyestuff];
+describe(dyestuff)
+```
+and plot it
+```{julia;echo=false;fig_cap="Yield versus Batch for the Dyestuff data"; fig_width=8;}
+plot(dyestuff, x = :Y, y = :G, point, xlabel("Yield of dyestuff (g)"), ylabel("Batch"))
+```
+
+In the dotplot we can see that there is considerable variability in yield, even for preparations from the same batch, but there is also noticeable batch-to-batch variability.
+For example, four of the five preparations from batch F provided lower yields than did any of the preparations from batches B, C, and E.
+
+Recall that the labels for the batches are just labels and that their ordering is arbitrary.
+In a plot, however, the order of the levels influences the perception of the pattern.
+Rather than providing an arbitrary pattern it is best to order the levels according to some criterion for the plot.
+In this case a good choice is to order the batches by increasing mean yield, which can be easily done in R.
+
+```{julia;term=true}
+#dyestuffR = rcopy(R"within(lme4::Dyestuff, Batch <- reorder(Batch, Yield, mean))");
+#plot(dyestuffR, x = :Yield, y = :Batch, point, xlabel("Yield of dyestuff (g)"), ylabel("Batch"))
+```
+
+In Sect. [sec:DyestuffLMM] we will use mixed models to quantify the variability in yield between batches.
+For the time being let us just note that the particular batches used in this experiment are a selection or sample from the set of all batches that we wish to consider.
+Furthermore, the extent to which one particular batch tends to increase or decrease the mean yield of the process — in other words, the “effect” of that particular batch on the yield — is not as interesting to us as is the extent of the variability between batches.
+For the purposes of designing, monitoring and controlling a process we want to predict the yield from future batches, taking into account the batch-to-batch variability and the within-batch variability.
+Being able to estimate the extent to which a particular batch in the past increased or decreased the yield is not usually an important goal for us.
+We will model the effects of the batches as random effects rather than as fixed-effects parameters.
+
+### The `Dyestuff2` Data
+
+The data are simulated data presented in Box and Tiao (1973), where the authors state
+
+> These data had to be constructed for although examples of this sort undoubtedly occur in practice they seem to be rarely published.
+
+The structure and summary are intentionally similar to those of the `Dyestuff` data.
+```{julia;term=true}
+dyestuff2 = dat[:Dyestuff2];
+describe(dyestuff2)
+```
+As can be seen in a data plot
+```{julia;echo=false;fig_width=8}
+plot(dyestuff2, x = :Y, y = :G, point, xlabel("Simulated response"), ylabel(""))
+```
+the batch-to-batch variability in these data is small compared to the within-batch variability.
+In some approaches to mixed models it can be difficult to fit models to such data.
+Paradoxically, small “variance components” can be more difficult to estimate than large variance components.
+
+The methods we will present are not compromised when estimating small variance components.
+
+## Fitting Linear Mixed Models
+
+Before we formally define a linear mixed model, let’s go ahead and fit models to these data sets using `lmm` which takes, as its first two arguments, a *formula* specifying the model and the *data* with which to evaluate the formula. 
+
+The structure of the formula will be explained after showing the example.
+
+### A Model For the `Dyestuff` Data
+
+A model allowing for an overall level of the `Yield` and for an additive random effect for each level of `Batch` can be fit as
+```{julia;term=true}
+mm1 = fit!(LinearMixedModel(@formula(Y ~ 1 + (1 | G)), dyestuff))
+```
+
+As shown in the summary of the model fit, the default estimation criterion is maximum likelihood.
+The summary provides several other model-fit statistics such as Akaike’s Information Criterion (`AIC`), Schwarz’s Bayesian Information Criterion (`BIC`), the log-likelihood at the parameter estimates, and the objective function (negative twice the log-likelihood) at the parameter estimates.
+These are all statistics related to the model fit and are used to compare different models fit to the same data.
+
+The third section is the table of estimates of parameters associated with the random effects.
+There are two sources of variability in this model, a batch-to-batch variability in the level of the response and the residual or per-observation variability — also called the within-batch variability.
+The name “residual” is used in statistical modeling to denote the part of the variability that cannot be explained or modeled with the other terms.
+It is the variation in the observed data that is “left over” after determining the estimates of the parameters in the other parts of the model.
+
+Some of the variability in the response is associated with the fixed-effects terms.
+In this model there is only one such term, labeled the `(Intercept)`.
+The name “intercept”, which is better suited to models based on straight lines written in a slope/intercept form, should be understood to represent an overall “typical” or mean level of the response in this case.
+(For those wondering about the parentheses around the name, they are included so that a user cannot accidentally name a variable in conflict with this name.)
+The line labeled `Batch` in the random effects table shows that the random effects added to the intercept term, one for each level of the factor, are modeled as random variables whose unconditional variance is estimated as 1388.33 g$^2$.
+The corresponding standard deviations is 37.26 g for the ML fit.
+
+Note that the last column in the random effects summary table is the estimate of the variability expressed as a standard deviation rather than as a variance.
+These are provided because it is usually easier to visualize the variability in standard deviations, which are on the scale of the response, than it is to visualize the magnitude of a variance.
+The values in this column are a simple re-expression (the square root) of the estimated variances.
+Do not confuse them with the standard errors of the variance estimators, which are not given here.
+As described in later sections, standard errors of variance estimates are generally not useful because the distribution of the estimator of a variance is skewed - often badly skewed.
+
+The line labeled `Residual` in this table gives the estimate, 2451.25 g$^2$, of the variance of the residuals and the corresponding standard deviation, 49.51 g.
+In written descriptions of the model the residual variance parameter is written $\sigma^2$ and the variance of the random effects is $\sigma_1^2$.
+Their estimates are $\widehat{\sigma^2}$ and $\widehat{\sigma_1^2}$
+
+The last line in the random effects table states the number of observations to which the model was fit and the number of levels of any “grouping factors” for the random effects.
+In this case we have a single random effects term, `(1 | Batch)`, in the model formula and the grouping factor for that term is `Batch`.
+There will be a total of six random effects, one for each level of `Batch`.
+
+The final part of the printed display gives the estimates and standard errors of any fixed-effects parameters in the model.
+The only fixed-effects term in the model formula is the `(Intercept)`.
+The estimate of this parameter is 1527.5 g.
+The standard error of the intercept estimate is 17.69 g.
+
+### A Model For the `Dyestuff2` Data
+
+Fitting a similar model to the `dyestuff2` data produces an estimate $\widehat{\sigma_1^2}=0$.
+
+```{julia;term=true}
+mm2 = fit!(LinearMixedModel(@formula(Y ~ 1 + (1 | G)), dyestuff2))
+```
+
+An estimate of `0` for $\sigma_1$ does not mean that there is no variation between the groups.
+Indeed Fig. [fig:Dyestuff2dot] shows that there is some small amount of variability between the groups.
+The estimate, $\widehat{\sigma_1}$, is a measure of the “between-group” variability that is **in excess of** the variability induced by the "within-group" or residual variability in the responses.  
+
+If 30 observations were simulated from a "normal" (also called "Gaussian") distribution and divided arbitrarily into 6 groups of 5, a plot of the data would look very much like Fig. [fig:Dyestuff2dot]. 
+(In fact, it is likely that this is how that data set was generated.)
+It is only where there is excess variability between the groups that $\widehat{\sigma_1}>0$.
+Obtaining $\widehat{\sigma_1}=0$ is not a mistake; it is simply a statement about the data and the model.
+
+The important point to take away from this example is the need to allow for the estimates of variance components that are zero.
+Such a model is said to be *singular*, in the sense that it corresponds to a linear model in which we have removed the random effects associated with `Batch`.
+Singular models can and do occur in practice.
+Even when the final fitted model is not singular, we must allow for such models to be expressed when determining the parameter estimates through numerical optimization.
+
+It happens that this model corresponds to the linear model (i.e. a model with fixed-effects only)
+```{julia;term=true}
+lm1 = lm(@formula(Y ~ 1), dyestuff2)
+```
+
+The log-likelihood for this model
+```{julia;term=true}
+loglikelihood(lm1)
+```
+is the same as that of `fm2`.
+The standard error of the intercept in `lm1` is a bit larger than that of `fm2` because the estimate of the residual variance is evaluated differently in the linear model.
+
+### Further Assessment of the Fitted Models
+
+The parameter estimates in a statistical model represent our “best guess” at the unknown values of the model parameters and, as such, are important results in statistical modeling. However, they are not the whole story. Statistical models characterize the variability in the data and we must assess the effect of this variability on the parameter estimates and on the precision of predictions made from the model.
+
+In Sect. [sec:variability] we introduce a method of assessing variability in parameter estimates using the “profiled log-likelihood” and in Sect. [sec:assessRE] we show methods of characterizing the conditional distribution of the random effects given the data. Before we get to these sections, however, we should state in some detail the probability model for linear mixed-effects and establish some definitions and notation. In particular, before we can discuss profiling the log-likelihood, we should define the log-likelihood. We do that in the next section.
+
+## The Linear Mixed-effects Probability Model
+
+In explaining some of parameter estimates related to the random effects we have used terms such as “unconditional distribution” from the theory of probability.
+Before proceeding further we clarify the linear mixed-effects probability model and define several terms and concepts that will be used throughout the book.
+Readers who are more interested in practical results than in the statistical theory should feel free to skip this section.
+
+### Definitions and Results
+
+In this section we provide some definitions and formulas without derivation and with minimal explanation, so that we can use these terms in what follows.
+In Chap. [chap:computational] we revisit these definitions providing derivations and more explanation.
+
+As mentioned in Sect. [sec:memod], a mixed model incorporates two random variables: $\mathcal{B}$, the $q$-dimensional vector of random effects, and $\mathcal{Y}$, the $n$-dimensional response vector. In a linear mixed model the unconditional distribution of $\mathcal{B}$ and the conditional distribution, $(\mathcal{Y} | \mathcal{B}=\bf{b})$, are both multivariate Gaussian distributions,
+```math
+\begin{aligned}
+  (\mathcal{Y} | \mathcal{B}=\bf{b}) &\sim\mathcal{N}(\bf{ X\beta + Z b},\sigma^2\bf{I})\\\\
+  \mathcal{B}&\sim\mathcal{N}(\bf{0},\Sigma_\theta) .
+\end{aligned}
+```
+The *conditional mean* of $\mathcal Y$, given $\mathcal B=\bf b$, is the *linear predictor*, $\bf X\bf\beta+\bf Z\bf b$, which depends on the $p$-dimensional *fixed-effects parameter*, $\bf \beta$, and on $\bf b$. The *model matrices*, $\bf X$ and $\bf Z$, of dimension $n\times p$ and $n\times q$, respectively, are determined from the formula for the model and the values of covariates. Although the matrix $\bf Z$ can be large (i.e. both $n$ and $q$ can be large), it is sparse (i.e. most of the elements in the matrix are zero).
+
+The *relative covariance factor*, $\Lambda_\theta$, is a $q\times q$ lower-triangular matrix, depending on the *variance-component parameter*, $\bf\theta$, and generating the symmetric $q\times q$ variance-covariance matrix, $\Sigma_\theta$, as
+```math
+\begin{equation}
+  \Sigma_\theta=\sigma^2\Lambda_\theta\Lambda_\theta'
+\end{equation}
+```
+The *spherical random effects*, $\mathcal{U}\sim\mathcal{N}({\bf 0},\sigma^2{\bf I}_q)$, determine $\mathcal B$ according to
+```math
+\begin{equation}
+  \mathcal{B}=\Lambda_\theta\mathcal{U}.
+\end{equation}
+```
+The *penalized residual sum of squares* (PRSS),
+```math
+\begin{equation}
+  r^2(\theta,\beta,{\bf u})=\|{\bf y} -{\bf X}\beta -{\bf Z}\Lambda_\theta{\bf u}\|^2+\|{\bf u}\|^2,
+\end{equation}
+```
+is the sum of the residual sum of squares, measuring fidelity of the model to the data, and a penalty on the size of $\bf u$, measuring the complexity of the model.
+Minimizing $r^2$ with respect to $\bf u$,
+```math
+\begin{equation}
+  r^2_{\beta,\theta} =\min_{\bf u}\left\{\|{\bf y} -{\bf X}{\beta} -{\bf Z}\Lambda_\theta{\bf u}\|^2+\|{\bf u}\|^2\right\}
+\end{equation}
+```
+is a direct (i.e. non-iterative) computation.
+The particular method used to solve this generates a *blocked Choleksy factor*, ${\bf L}_\theta$, which is an lower triangular $q\times q$ matrix satisfying
+```math
+\begin{equation}
+  {\bf L}_\theta{\bf L}_\theta'=\Lambda_\theta'{\bf Z}'{\bf Z}\Lambda_\theta+{\bf I}_q .
+\end{equation}
+```
+where ${\bf I}_q$ is the $q\times q$ *identity matrix*.
+
+Negative twice the log-likelihood of the parameters, given the data, $\bf y$, is
+```math
+\begin{equation}
+d({\bf\theta},{\bf\beta},\sigma|{\bf y})
+=n\log(2\pi\sigma^2)+\log(|{\bf L}_\theta|^2)+\frac{r^2_{\beta,\theta}}{\sigma^2}.
+\end{equation}
+```
+where $|{\bf L}_\theta|$ denotes the *determinant* of ${\bf L}_\theta$.
+Because ${\bf L}_\theta$ is triangular, its determinant is the product of its diagonal elements.
+
+Negative twice the log-likelihood will be called the *objective* in what follows.
+It is the value to be minimized by the parameter estimates.  It is, up to an additive factor, the *deviance* of the parameters.  Unfortunately, it is not clear what the additive factor should be in the case of linear mixed models.  In many applications, however, it is not the deviance of the model that is of interest as much the change in the deviance between two fitted models.  When calculating the change in the deviance the additive factor will cancel out so the change in the deviance when comparing models is the same as the change in this objective.
+
+Because the conditional mean, $\bf\mu_{\mathcal Y|\mathcal B=\bf b}=\bf
+X\bf\beta+\bf Z\Lambda_\theta\bf u$, is a linear function of both $\bf\beta$ and $\bf u$, minimization of the PRSS with respect to both $\bf\beta$ and $\bf u$ to produce
+```math
+\begin{equation}
+r^2_\theta =\min_{{\bf\beta},{\bf u}}\left\{\|{\bf y} -{\bf X}{\bf\beta} -{\bf Z}\Lambda_\theta{\bf u}\|^2+\|{\bf u}\|^2\right\}
+\end{equation}
+```
+is also a direct calculation.
+The values of $\bf u$ and $\bf\beta$ that provide this minimum are called, respectively, the *conditional mode*, $\tilde{\bf u}_\theta$, of the spherical random effects and the conditional estimate, $\widehat{\bf\beta}_\theta$, of the fixed effects.
+At the conditional estimate of the fixed effects the objective is
+```math
+\begin{equation}
+d({\bf\theta},\widehat{\beta}_\theta,\sigma|{\bf y})
+=n\log(2\pi\sigma^2)+\log(|{\bf L}_\theta|^2)+\frac{r^2_\theta}{\sigma^2}.
+\end{equation}
+```
+Minimizing this expression with respect to $\sigma^2$ produces the conditional estimate
+```math
+\begin{equation}
+\widehat{\sigma^2}_\theta=\frac{r^2_\theta}{n}
+\end{equation}
+```
+which provides the *profiled log-likelihood* on the deviance scale as
+```math
+\begin{equation}
+\tilde{d}(\theta|{\bf y})=d(\theta,\widehat{\beta}_\theta,\widehat{\sigma}_\theta|{\bf y})
+=\log(|{\bf L}_\theta|^2)+n\left[1+\log\left(\frac{2\pi r^2_\theta}{n}\right)\right],
+\end{equation}
+```
+a function of $\bf\theta$ alone.
+
+The MLE of $\bf\theta$, written $\widehat{\bf\theta}$, is the value that minimizes this profiled objective.
+We determine this value by numerical optimization.
+In the process of evaluating $\tilde{d}(\widehat{\theta}|{\bf y})$ we determine $\widehat{\beta}=\widehat{\beta}_{\widehat\theta}$, $\tilde{\bf u}_{\widehat{\theta}}$ and $r^2_{\widehat{\theta}}$, from which we can evaluate $\widehat{\sigma}=\sqrt{r^2_{\widehat{\theta}}/n}$.
+
+The elements of the conditional mode of $\mathcal B$, evaluated at the parameter estimates,
+```math
+\begin{equation}
+\tilde{\bf b}_{\widehat{\theta}}=
+\Lambda_{\widehat{\theta}}\tilde{\bf u}_{\widehat{\theta}}
+\end{equation}
+```
+are sometimes called the *best linear unbiased predictors* or BLUPs of the random effects.
+Although BLUPs an appealing acronym, I don’t find the term particularly instructive (what is a “linear unbiased predictor” and in what sense are these the “best”?) and prefer the term “conditional modes”, because these are the values of $\bf b$ that maximize the density of the conditional distribution $\mathcal{B} | \mathcal{Y} = {\bf y}$.
+For a linear mixed model, where all the conditional and unconditional distributions are Gaussian, these values are also the *conditional means*.
+
+### Fields of a `LinearMixedModel` object
+
+The optional second argument, `verbose`, in a call to `fit!` of a `LinearMixedModel` object produces output showing the progress of the iterative optimization of $\tilde{d}(\bf\theta|\bf y)$.
+
+```{julia;term=true}
+mm1 = fit!(LinearMixedModel(@formula(Y ~ 1 + (1 | G)), dyestuff), verbose = true);
+```
+
+The algorithm converges after 18 function evaluations to a profiled deviance of 327.32706 at $\theta=0.752581$. In this model the parameter $\theta$ is of length 1, the single element being the ratio $\sigma_1/\sigma$.
+
+Whether or not verbose output is requested, the `optsum` field of a `LinearMixedModel` contains information on the optimization.  The various tolerances or the optimizer name can be changed between creating a `LinearMixedModel` and calling `fit!` on it to exert finer control on the optimization process.
+
+```{julia;term=true}
+mm1.optsum
+```
+
+The full list of fields in a `LinearMixedModel` object is
+
+```{julia;term=true}
+fieldnames(LinearMixedModel)
+```
+
+The `formula` field is a copy of the model formula
+
+```{julia;term=true}
+mm1.formula
+```
+
+The `mf` field is a `ModelFrame` constructed from the `formula` and `data` arguments to `lmm`.
+It contains a `DataFrame` formed by reducing the `data` argument to only those columns needed to evaluate the `formula` and only those rows that have no missing data.
+It also contains information on the terms in the model formula and on any "contrasts" associated with categorical variables in the fixed-effects terms.
+
+The `trms` field is a vector of numerical objects representing the terms in the model, including the response vector.
+As the names imply, the `sqrtwts` and `wttrms` fields are for incorporating case weights.
+These fields are not often used when fitting linear mixed models but are vital to the process of fitting a generalized linear mixed model, described in Chapter [sec:glmm].  When used, `sqrtwts` is a diagonal matrix of size `n`. A size of `0` indicates weights are not used.
+
+```{julia;term=true}
+mm1.sqrtwts
+```
+
+The `reterms` field is a vector of `ReMat` objects.
+
+```{julia;term=true}
+length(mm1.reterms)
+```
+
+And the `feterms` field is a vector of length 2.  The first element is $\bf X$, the $n\times p$ model matrix for the fixed-effects parameters, $\bf\beta$, and
+the last element is $\bf y$, the response vector stored as a matrix of size $n\times 1$.  In `mm1`, $\bf X$ consists of a single column of 1's
+
+```{julia;term=true}
+first(mm1.feterms).x
+```
+
+```{julia;term=true}
+last(mm1.feterms).x
+```
+
+The elements of `reterms` represent vertical sections of $\bf Z$ associated with the random effects terms in the model.  In `mm1` there is only one random effects term, `(1 | Batch)`, and $\bf Z$ has only one section, the one generated by this term, of type `ScalarReMat`.
+
+```{julia;term=true}
+first(mm1.reterms)
+```
+
+In practice these matrices are stored in a highly condensed form because, in some models, they can be very large.
+In small examples the structure is more obvious when the `ReMat` is converted to a sparse or a dense matrix.
+
+```{julia;term=true}
+sparse(first(mm1.reterms))
+```
+
+```{julia;term=true}
+Matrix(first(mm1.reterms))
+```
+
+The `A` field is a representation of the blocked, square, symmetric matrix $\bf A = [Z : X : y]'[Z : X : y]$.
+Only the upper triangle of `A` is stored.
+The number of blocks of the rows and columns of `A` is the number of vertical sections of $\bf Z$ (i.e. the number of random-effects terms) plus 2.
+
+```{julia;term=true}
+nblocks(mm1.A)
+```
+
+```{julia;term=true}
+mm1.A[Block(1, 1)]
+```
+
+```{julia;term=true}
+mm1.A[Block(2, 1)]
+```
+
+```{julia;term=true}
+mm1.A[Block(2, 2)]
+```
+
+```{julia;term=true}
+mm1.A[Block(3, 1)]
+```
+
+```{julia;term=true}
+mm1.A[Block(3, 2)]
+```
+
+```{julia;term=true}
+mm1.A[Block(3, 3)]
+```
+
+## Fields modified during the optimization
+
+Changing the value of $\theta$ changes the $\Lambda$ field.
+(Note: to input a symbol like $\Lambda$ in a Jupyter code cell or in the Julia read-eval-print loop (REPL), type `\Lambda` followed by a tab character.  Such "latex completions" are available for many UTF-8 characters used in Julia.)  The matrix $\Lambda$ has a special structure.  It is a block diagonal matrix where the diagonal blocks are [`Kronecker product`](https://en.wikipedia.org/wiki/Kronecker_product)s of an identity matrix and a (small) lower triangular matrix.  The diagonal blocks correspond to the random-effects terms.  For a scalar random-effects term, like `(1 | Batch)` the diagonal block is the Kronecker product of an identity matrix and a $1\times 1$ matrix.  This result in this case is just a multiple of the identity matrix.
+
+It is not necessary to store the full $\Lambda$ matrix.  Storing the small lower-triangular matrices is sufficient.
+
+```{julia;term=true}
+first(mm1.λ)
+```
+
+The `L` field is a blocked matrix like the `A` field containing the upper Cholesky factor of
+
+\begin{bmatrix}
+  \bf{\Lambda'Z'Z\Lambda + I} & \bf{\Lambda'Z'X} & \bf{\Lambda'Z'y} \\
+  \bf{X'Z\Lambda} & \bf{X'X} & \bf{X'y} \\
+  \bf{y'Z\Lambda} & \bf{y'Z} & \bf{y'y}
+\end{bmatrix}   
+
+```{julia;term=true}
+LowerTriangular(mm1.L)
+```
+
+```{julia;term=true}
+mm1.L[Block(1, 1)]
+```
+
+```{julia;term=true}
+mm1.L[Block(2, 1)]
+```
+
+```{julia;term=true}
+mm1.L[Block(2, 2)]
+```
+
+```{julia;term=true}
+mm1.L[Block(3, 1)]
+```
+
+```{julia;term=true}
+mm1.L[Block(3, 2)]
+```
+
+```{julia;term=true}
+mm1.L[Block(3, 3)]
+```
+
+All the information needed to evaluate the profiled log-likelihood is available in the `L` field; $\log(|\bf L_\theta|^2)$ is
+
+```{julia;term=true}
+2 * sum(log, diag(mm1.L[Block(1,1)]))
+```
+
+It can also be evaluated as `logdet(mm1)` or `2 * logdet(mm1.L[Block(1, 1)])`
+
+```{julia;term=true}
+logdet(mm1) == (2*logdet(mm1.L[Block(1, 1)])) == (2*sum(log.(diag(mm1.L[Block(1, 1)]))))
+```
+
+The penalized residual sum of squares is the square of the single element of the lower-right block, `L[3, 3]` in this case
+
+```{julia;term=true}
+abs2(mm1.L[Block(3, 3)][1, 1])
+```
+
+```{julia;term=true}
+pwrss(mm1)
+```
+
+The objective is
+
+```{julia;term=true}
+logdet(mm1) + nobs(mm1) * (1 + log(2π * pwrss(mm1) / nobs(mm1)))
+```
+
+## Assessing variability of parameter estimates
+
+### Parametric bootstrap samples
+
+One way to assess the variability of the parameter estimates is to generate a parametric bootstrap sample from the model.  The technique is to simulate response vectors from the model at the estimated parameter values and refit the model to each of these simulated responses, recording the values of the parameters.  The `bootstrap` method for these models performs these simulations and returns 4 arrays: a vector of objective (negative twice the log-likelihood) values, a vector of estimates of $\sigma^2$, a matrix of estimates of the fixed-effects parameters and a matrix of the estimates of the relative covariance parameters.  In this case there is only one fixed-effects parameter and one relative covariance parameter, which is the ratio of the standard deviation of the random effects to the standard deviation of the per-sample noise.
+
+First set the random number seed for reproducibility.
+
+```{julia;term=true}
+rng = Random.MersenneTwister(1234321);
+mm1bstp = bootstrap(100000, mm1);
+size(mm1bstp)
+```
+
+```{julia;term=true}
+describe(mm1bstp)
+```
+
+#### Histograms, kernel density plots and quantile-quantile plots
+
+I am a firm believer in the value of plotting results **before** summarizing them.
+Well chosen plots can provide insights not available from a simple numerical summary.
+It is common to visualize the distribution of a sample using a *histogram*, which approximates the shape of the probability density function.
+The density can also be approximated more smoothly using a *kernel density* plot.
+Finally, the extent to which the distribution of a sample can be approximated by a particular distribution or distribution family can be assessed by a *quantile-quantile (qq) plot*, the most common of which is the *normal probability plot*.
+
+The [`Gadfly`](https://github.com/GiovineItalia/Gadfly.jl) package for Julia uses a "grammar of graphics" specification, similar to the [`ggplot2`](http://ggplot2.org/) package for R.  A histogram or a kernel density plot are describes as *geometries* and specified by `Geom.histogram` and `Geom.density`, respectively.
+
+```{julia;term=true}
+plot(mm1bstp, x = :β₁, histogram)
+```
+
+```{julia;term=true}
+plot(mm1bstp, x = :σ, histogram)
+```
+
+```{julia;term=true}
+plot(mm1bstp, x = :σ₁, histogram)
+```
+
+The last two histograms show that, even if the models are defined in terms of variances, the variance is usually not a good scale on which to assess the variability of the parameter estimates.  The standard deviation or, in some cases, the logarithm of the standard deviation is a more suitable scale.
+
+The histogram of $\sigma_1^2$ has a "spike" at zero.  Because the value of $\sigma^2$ is never zero, a value of $\sigma_1^2=0$ must correspond to $\theta=0$.  There are several ways to count the zeros in `theta1`.  The simplest is to use `count` on the nonzeros, and subtrack that value from the total number of values in `theta1`.
+
+```{julia;term=true}
+length(mm1bstp[:θ₁]) - count(!iszero, mm1bstp[:θ₁])
+```
+
+That is, nearly 1/10 of the `theta1` values are zeros.  Because such a spike or pulse will be spread out or diffused in a kernel density plot,
+
+```{julia;term=true}
+plot(mm1bstp, density, x = :θ₁)
+```
+
+such a plot is not suitable for a sample of a bounded parameter that includes values on the boundary.
+
+The density of the estimates of the other two parameters, $\beta_1$ and $\sigma$, are depicted well in kernel density plots.
+
+```{julia;term=true}
+plot(mm1bstp, density, x = :β₁)
+```
+
+```{julia;term=true}
+plot(mm1bstp, density, x = :σ)
+```
+
+The standard approach of summarizing a sample by its mean and standard deviation, or of constructing a confidence interval using the sample mean, the standard error of the mean and quantiles of a *t* or normal distribution, are based on the assumption that the sample is approximately normal (also called Gaussian) in shape.  A *normal probability plot*, which plots sample quantiles versus quantiles of the standard normal distribution, $\mathcal{N}(0,1)$, can be used to assess the validity of this assumption.  If the points fall approximately along a straight line, the assumption of normality should be valid.  Systematic departures from a straight line are cause for concern.
+
+In Gadfly a normal probability plot can be  constructed by specifying the statistic to be generated as `Stat.qq` and either `x` or `y` as the distribution `Normal()`. For the present purposes it is an advantage to put the theoretical quantiles on the `x` axis.
+
+This approach is suitable for small to moderate sample sizes, but not for sample sizes of 10,000.  To smooth the plot and to reduce the size of the plot files, we plot quantiles defined by a sequence of `n` "probability points".  These are constructed by partitioning the interval (0, 1) into `n` equal-width subintervals and returning the midpoint of each of the subintervals.
+
+```{julia;term=true}
+function ppoints(n)
+    if n ≤ 0
+        throw(ArgumentError("n = $n should be positive"))
+    end
+    width = inv(n)
+    width / 2 : width : one(width)
+end
+const ppt250 = ppoints(250)
+```
+
+The kernel density estimate of $\sigma$ is more symmetric
+
+```{julia;echo=false;fig_width=8}
+zquantiles = quantile.(Normal(), ppt250);
+plot(x = zquantiles, y = quantile(mm1bstp[:β₁], ppt250), line,
+    Guide.xlabel("Standard Normal Quantiles"), Guide.ylabel("β₁"))
+```
+
+and the normal probability plot of $\sigma$ is also reasonably straight.
+
+```{julia;echo=false;fig_width=8}
+plot(x = zquantiles, y = quantile(mm1bstp[:σ], ppt250), line,
+     xlabel("Standard Normal quantiles"), ylabel("σ"))
+```
+
+The normal probability plot of $\sigma_1$ has a flat section at $\sigma_1 = 0$.
+
+```{julia;echo=false;fig_width=8}
+plot(x = zquantiles, y = quantile(mm1bstp[:σ₁], ppt250), line,
+     xlabel("Standard Normal Quantiles"), ylabel("σ₁"))
+```
+
+In terms of the variances, $\sigma^2$ and $\sigma_1^2$, the normal probability plots are
+
+```{julia;echo=false}
+plot(x = zquantiles, y = quantile(abs2.(mm1bstp[:σ]), ppt250), line,
+     xlabel("Standard Normal quantiles"), ylabel("σ²"))
+```
+
+```{julia;echo=false}
+plot(x = zquantiles, y = quantile(abs2.(mm1bstp[:σ₁]), ppt250), line,
+     xlabel("Standard Normal Quantiles"), ylabel("σ₁²"))
+```
+
+### Confidence intervals based on bootstrap samples
+
+When the distribution of a parameter estimator is close to normal or to a T distribution, symmetric confidence intervals are an appropriate representation of the uncertainty in the parameter estimates.  However, they are not appropriate for skewed and/or bounded estimator distributions, such as those for $\sigma^2$ and $\sigma_2^1$ shown above.
+
+The fact that a symmetric confidence interval is not appropriate for $\sigma^2$ should not be surprising.  In an introductory statistics course the calculation of a confidence interval on $\sigma^2$ from a random sample of a normal distribution using quantiles of a $\chi^2$ distribution is often introduced. So a symmetric confidence interval on $\sigma^2$ is inappropriate in the simplest case but is often expected to be appropriate in much more complicated cases, as shown by the fact that many statistical software packages include standard errors of variance component estimates in the output from a mixed model fitting procedure.  Creating confidence intervals in this way is optimistic at best. Completely nonsensical would be another way of characterizing this approach.
+
+A more reasonable alternative for constructing a $1 - \alpha$ confidence interval from a bootstrap sample is to report a contiguous interval that contains a $1 - \alpha$ proportion of the sample values.
+
+But there are many such intervals.  Suppose that a 95% confidence interval was to be constructed from one of the samples of size 10,000
+of bootstrapped values.  To get a contigous interval the sample should be sorted. The sorted sample values, also called the *order statistics* of the sample, are denoted by a bracketed subscript.  That is, $\sigma_{[1]}$ is the smallest value in the sample, $\sigma_{[2]}$ is the second smallest, up to $\sigma_{[10,000]}$, which is the largest.
+
+One possible interval containing 95% of the sample is $(\sigma_{[1]}, \sigma_{[9500]})$.  Another is $(\sigma_{[2]}, \sigma_{[9501]})$ and so on up to $(\sigma_{[501]},\sigma_{[10000]})$.  There needs to be a method of choosing one of these intervals.  On approach would be to always choose the central 95% of the sample.  That is, cut off 2.5% of the sample on the left side and 2.5% on the right side.  
+
+```{julia;term=true}
+sigma95 = quantile(mm1bstp[:σ], [0.025, 0.975])
+```
+
+This approach has the advantage that the endpoints of a 95% interval on $\sigma^2$ are the squares of the endpoints of a 95% interval on $\sigma$.
+
+```{julia;term=true}
+isapprox(abs2.(sigma95), quantile(abs2.(mm1bstp[:σ]), [0.025, 0.975]))
+```
+
+The intervals are compared with `isapprox` rather than exact equality because, in floating point arithmetic, it is not always the case that $\left(\sqrt{x}\right)^2 = x$.  This comparison can also be expressed
+in Julia as
+
+```{julia;term=true}
+abs2.(sigma95) ≈ quantile(abs2.(mm1bstp[:σ]), [0.025, 0.975])
+```
+
+An alternative approach is to evaluate all of the contiguous intervals containing, say, 95% of the sample and return the shortest shortest such interval.  This is the equivalent of a *Highest Posterior Density (HPD)* interval sometimes used in Bayesian analysis.  If the procedure is applied to a unimodal (i.e. one that has only one peak or *mode*) theoretical probability density the resulting interval has the property that the density at the left endpoint is equal to the density at the right endpoint and that the density at any point outside the interval is less than the density at any point inside the interval.  Establishing this equivalence is left as an exercise for the mathematically inclined reader.  (Hint: Start with the interval defined by the "equal density at the endpoints" property and consider what happens if you shift that interval while maintaining the same area under the density curve.  You will be replacing a region of higher density by one with a lower density and the interval must become wider to maintain the same area.)
+
+With large samples a brute-force enumeration approach works.
+
+```{julia;term=true}
+function hpdinterval(v, level=0.95)
+    n = length(v)
+    if !(0 < level < 1)
+        throw(ArgumentError("level = $level must be in (0, 1)"))
+    end
+    if (lbd = floor(Int, (1 - level) * n)) < 2
+        throw(ArgumentError(
+            "level = $level is too large from sample size $n"))
+    end
+    ordstat = sort(v)
+    leftendpts = ordstat[1:lbd]
+    rtendpts = ordstat[(1 + n - lbd):n]
+    (w, ind) = findmin(rtendpts - leftendpts)
+    return [leftendpts[ind], rtendpts[ind]]
+end
+```
+
+For example, the 95% HPD interval calculated from the sample of $\beta_1$ values is
+
+```{julia;term=true}
+hpdinterval(mm1bstp[:β₁])
+```
+
+which is very close to the central probability interval of
+
+```{julia;term=true}
+quantile(mm1bstp[:β₁], [0.025, 0.975])
+```
+
+because the empirical distribution of the $\beta_1$ sample is very similar to a normal distribution.  In particular, it is more-or-less symmetric and also unimodal.
+
+The HPD interval on $\sigma^2$ is 
+
+```{julia;term=true}
+hpdinterval(abs2.(mm1bstp[:σ]))
+```
+
+which is shifted to the left relative to the central probability interval
+
+```{julia;term=true}
+quantile(abs2.(mm1bstp[:σ]), [0.025, 0.975])
+```
+
+because the distribution of the $\sigma^2$ sample is skewed to the right.  The HPD interval will truncate the lower density, long, right tail and include more of the higher density, short, left tail.
+
+The HPD interval does not have the property that the endpoints of the interval on $\sigma^2$ are the squares of the endpoints of the intervals on $\sigma$, because "shorter" on the scale of $\sigma$ does not necessarily correspond to shorter on the scale of $\sigma^2$.
+
+```{julia;term=true}
+sigma95hpd = hpdinterval(mm1bstp[:σ])
+```
+
+```{julia;term=true}
+abs2.(sigma95hpd)
+```
+
+Finally, a 95% HPD interval on $\sigma_1$ includes the boundary value $\sigma_1=0$.
+
+```{julia;term=true}
+hpdinterval(mm1bstp[:σ₁])
+```
+
+In fact, the confidence level or coverage probability must be rather small before the boundary value is excluded
+
+```{julia;term=true}
+hpdinterval(mm1bstp[:σ₁], 0.798)
+```
+
+```{julia;term=true}
+hpdinterval(mm1bstp[:σ₁], 0.799)
+```
+
+### Empirical cumulative distribution function
+
+The empirical cumulative distribution function (ecdf) of a sample maps the range of the sample onto `[0,1]` by `x → proportion of sample ≤ x`.  In general this is a "step function", which takes jumps of size `1/length(samp)` at each observed sample value.  For large samples, we can plot it as a qq plot where the theoretical quantiles are the probability points and are on the vertical axis.
+
+```{julia;echo=false;fig_width=8}
+plot(layer(x = quantile(mm1bstp[:σ₁], ppt250), y = ppt250, line),
+     layer(xintercept = quantile(mm1bstp[:σ₁], [0.1, 0.9]), vline(color = colorant"orange")),
+     layer(xintercept = hpdinterval(mm1bstp[:σ₁], 0.8), vline(color=colorant"red")),
+     ylabel(""), xlabel("σ₁"), yticks(ticks=[0.0, 0.1, 0.9, 1.0])
+)
+```
+
+The orange lines added to the plot show the construction of the central probability 80% confidence interval on $\sigma_1$ and the red lines show the 80% HPD interval.  Comparing the spacing of the left end points to that of the right end points shows that the HPD interval is shorter, because, in switching from the orange to the red lines, the right end point moves further to the left than does the left end point.
+
+The differences in the widths becomes more dramatic on the scale of $\sigma_1^2$
+
+```{julia;echo=false;fig_width=8.}
+σ₁² = abs2.(mm1bstp[:σ₁])
+plot(
+    layer(x = quantile(σ₁², ppt250), y = ppt250, line),
+    layer(xintercept = quantile(σ₁², [0.1, 0.9]), vline(color = colorant"orange")),
+    layer(xintercept = hpdinterval(σ₁², 0.8), vline(color=colorant"red")),
+    ylabel(""), xlabel("σ₁²"), yticks(ticks=[0.0, 0.1, 0.9, 1.0])
+)
+```
+
+ADD: similar analysis for mm2
+
+## Assessing the Random Effects
+
+In Sect. [sec:definitions] we mentioned that what are sometimes called the BLUPs (or best linear unbiased predictors) of the random effects, $\mathcal B$, are the conditional modes evaluated at the parameter estimates, calculated as $\tilde{b}_{\widehat{\theta}}=\Lambda_{\widehat{\theta}}\tilde{u}_{\widehat{\theta}}$.
+
+These values are often considered as some sort of “estimates” of the random effects. It can be helpful to think of them this way but it can also be misleading. As we have stated, the random effects are not, strictly speaking, parameters—they are unobserved random variables. We don’t estimate the random effects in the same sense that we estimate parameters. Instead, we consider the conditional distribution of $\mathcal B$ given the observed data, $(\mathcal B|\mathcal Y=\mathbf  y)$.
+
+Because the unconditional distribution, $\mathcal B\sim\mathcal{N}(\mathbf 
+0,\Sigma_\theta)$ is continuous, the conditional distribution, $(\mathcal
+B|\mathcal Y=\mathbf  y)$ will also be continuous. In general, the mode of a probability density is the point of maximum density, so the phrase “conditional mode” refers to the point at which this conditional density is maximized. Because this definition relates to the probability model, the values of the parameters are assumed to be known. In practice, of course, we don’t know the values of the parameters (if we did there would be no purpose in forming the parameter estimates), so we use the estimated values of the parameters to evaluate the conditional modes.
+
+Those who are familiar with the multivariate Gaussian distribution may recognize that, because both $\mathcal B$ and $(\mathcal Y|\mathcal B=\mathbf  b)$ are multivariate Gaussian, $(\mathcal B|\mathcal Y=\mathbf  y)$ will also be multivariate Gaussian and the conditional mode will also be the conditional mean of $\mathcal B$, given $\mathcal Y=\mathbf  y$. This is the case for a linear mixed model but it does not carry over to other forms of mixed models. In the general case all we can say about $\tilde{\mathbf 
+  u}$ or $\tilde{\mathbf  b}$ is that they maximize a conditional density, which is why we use the term “conditional mode” to describe these values. We will only use the term “conditional mean” and the symbol, $\mathbf \mu$, in reference to $\mathrm{E}(\mathcal Y|\mathcal B=\mathbf  b)$, which is the conditional mean of $\mathcal Y$ given $\mathcal B$, and an important part of the formulation of all types of mixed-effects models.
+  
+The `ranef` extractor returns the conditional modes.
+
+```{julia;term=true}
+ranef(mm1)  # FIXME return an ordered dict
+```
+
+The result is an array of matrices, one for each random effects term in the model.  In this case there is only one matrix because there is only one random-effects term, `(1 | Batch)`, in the model. There is only one row in this matrix because the random-effects term, `(1 | Batch)`, is a simple, scalar term.
+
+To make this more explicit, random-effects terms in the model formula are those that contain the vertical bar () character. The variable is the grouping factor for the random effects generated by this term. An expression for the grouping factor, usually just the name of a variable, occurs to the right of the vertical bar. If the expression on the left of the vertical bar is , as it is here, we describe the term as a _simple, scalar, random-effects term_. The designation “scalar” means there will be exactly one random effect generated for each level of the grouping factor. A simple, scalar term generates a block of indicator columns — the indicators for the grouping factor — in $\mathbf Z$. Because there is only one random-effects term in this model and because that term is a simple, scalar term, the model matrix, 𝐙, for this model is the indicator matrix for the levels of `Batch`.
+
+In the next chapter we fit models with multiple simple, scalar terms and, in subsequent chapters, we extend random-effects terms beyond simple, scalar terms. When we have only simple, scalar terms in the model, each term has a unique grouping factor and the elements of the list returned by can be considered as associated with terms or with grouping factors. In more complex models a particular grouping factor may occur in more than one term, in which case the elements of the list are associated with the grouping factors, not the terms.
+
+Given the data, 𝐲, and the parameter estimates, we can evaluate a measure of the dispersion of $(\mathcal B|\mathcal Y=\mathbf y)$. In the case of a linear mixed model, this is the conditional standard deviation, from which we can obtain a prediction interval. The extractor is named `condVar`.
+
+```{julia;term=true}
+condVar(mm1)
+```
+
+## Chapter Summary
+
+A considerable amount of material has been presented in this chapter, especially considering the word “simple” in its title (it’s the model that is simple, not the material). A summary may be in order.
+
+A mixed-effects model incorporates fixed-effects parameters and random effects, which are unobserved random variables, $\mathcal B$. In a linear mixed model, both the unconditional distribution of $\mathcal B$ and the conditional distribution, $(\mathcal Y|\mathcal B=\mathbf b)$, are multivariate Gaussian distributions. Furthermore, this conditional distribution is a spherical Gaussian with mean, $\mathbf\mu$, determined by the linear predictor, $\mathbf Z\mathbf b+\mathbf X\mathbf\beta$. That is,
+
+\begin{equation*}(\mathcal Y|\mathcal B=\mathbf b)\sim
+  \mathcal{N}(\mathbf Z\mathbf b+\mathbf X\mathbf\beta, \sigma^2\mathbf I_n) .\end{equation*}
+
+The unconditional distribution of $\mathcal B$ has mean $\mathbf 0$ and a parameterized $q\times q$ variance-covariance matrix, $\Sigma_\theta$.
+
+In the models we considered in this chapter, $\Sigma_\theta$, is a simple multiple of the identity matrix, $\mathbf I_6$. This matrix is always a multiple of the identity in models with just one random-effects term that is a simple, scalar term. The reason for introducing all the machinery that we did is to allow for more general model specifications.
+
+The maximum likelihood estimates of the parameters are obtained by minimizing the deviance. For linear mixed models we can minimize the profiled deviance, which is a function of $\mathbf\theta$ only, thereby considerably simplifying the optimization problem.
+
+To assess the precision of the parameter estimates, we profile the deviance function with respect to each parameter and apply a signed square root transformation to the likelihood ratio test statistic, producing a profile zeta function for each parameter. These functions provide likelihood-based confidence intervals for the parameters. Profile zeta plots allow us to visually assess the precision of individual parameters. Density plots derived from the profile zeta function provide another way of examining the distribution of the estimators of the parameters.
+
+Prediction intervals from the conditional distribution of the random effects, given the observed data, allow us to assess the precision of the random effects.
+
+# Notation
+
+#### Random Variables
+
+- $\mathcal Y$: The responses ($n$-dimensional Gaussian)
+
+- $\mathcal B$: The random effects on the original scale ($q$-dimensional Gaussian with mean $\mathbf 0$)
+
+- $\mathcal U$: The orthogonal random effects ($q$-dimensional spherical Gaussian)
+
+Values of these random variables are denoted by the corresponding bold-face, lower-case letters: $\mathbf y$, $\mathbf b$ and $\mathbf u$. We observe $\mathbf y$. We do not observe $\mathbf b$ or $\mathbf u$.
+
+#### Parameters in the Probability Model
+
+- $\mathbf\beta$: The $p$-dimension fixed-effects parameter vector.
+
+- $\mathbf\theta$: The variance-component parameter vector. Its (unnamed) dimension is typically very small. Dimensions of 1, 2 or 3 are common in practice.
+
+- $\sigma$: The (scalar) common scale parameter, $\sigma>0$. It is called the common scale parameter because it is incorporated in the variance-covariance matrices of both $\mathcal Y$ and $\mathcal U$.
+
+- $\mathbf\theta$ The "covariance" parameter vector which determines the $q\times q$ lower triangular matrix $\Lambda_\theta$, called the *relative covariance factor*, which, in turn, determines the $q\times q$ sparse, symmetric semidefinite variance-covariance matrix $\Sigma_\theta=\sigma^2\Lambda_\theta\Lambda_\theta'$ that defines the distribution of $\mathcal B$.
+
+#### Model Matrices
+
+- $\mathbf X$: Fixed-effects model matrix of size $n\times p$.
+
+- $\mathbf Z$: Random-effects model matrix of size $n\times q$.
+
+#### Derived Matrices
+
+- $\mathbf L_\theta$: The sparse, lower triangular Cholesky factor of $\Lambda_\theta'\mathbf Z'\mathbf Z\Lambda_\theta+\mathbf I_q$
+
+#### Vectors
+
+In addition to the parameter vectors already mentioned, we define
+
+- $\mathbf y$: the $n$-dimensional observed response vector
+
+- $\mathbf\eta$: the $n$-dimension linear predictor,
+
+\begin{equation*}
+  \mathbf{\eta=X\beta+Zb=Z\Lambda_\theta u+X\beta}
+\end{equation*}
+
+- $\mathbf\mu$: the $n$-dimensional conditional mean of $\mathcal Y$ given $\mathcal B=\mathbf b$ (or, equivalently, given $\mathcal U=\mathbf u$)
+
+\begin{equation*}
+  \mathbf\mu=\mathrm{E}[\mathcal Y|\mathcal B=\mathbf b]=\mathrm{E}[\mathcal Y|\mathcal U=\mathbf u]
+\end{equation*}
+
+- $\tilde{u}_\theta$: the $q$-dimensional conditional mode (the value at which the conditional density is maximized) of $\mathcal U$ given $\mathcal Y=\mathbf y$.
+
+## Exercises
+
+These exercises and several others in this book use data sets from the package for . You will need to ensure that this package is installed before you can access the data sets.
+
+To load a particular data set,
+
+or load just the one data set
+
+
+1. Check the documentation, the structure () and a summary of the `Rail` data (Fig. [fig:Raildot]).
+
+1. Fit a model with as the response and a simple, scalar random-effects term for the variable `Rail`. Create a dotplot of the conditional modes of the random effects.
+
+1. Create a bootstrap simulation from the model and construct 95% bootstrap-based confidence intervals on the parameters. Is the confidence interval on $\sigma_1$ close to being symmetric about the estimate? Is the corresponding interval on $\log(\sigma_1)$ close to being symmetric about its estimate?
+
+1. Create the profile zeta plot for this model. For which parameters are there good normal approximations?
+
+1. Plot the prediction intervals on the random effects from this model. Do any of these prediction intervals contain zero? Consider the relative magnitudes of $\widehat{\sigma_1}$ and $\widehat{\sigma}$ in this model compared to those in model for the data. Should these ratios of $\sigma_1/\sigma$ lead you to expect a different pattern of prediction intervals in this plot than those in Fig. [fig:fm01preddot]?
diff --git a/docs/jmd/Simulations.jmd b/docs/jmd/Simulations.jmd
new file mode 100644
index 0000000..18cb5a0
--- /dev/null
+++ b/docs/jmd/Simulations.jmd
@@ -0,0 +1,365 @@
+
+# Simulation studies using Mixedmodels
+
+The most noticeable advantages of fitting mixed-effects models using the [`MixedModels`](https://github.com/dmbates/MixedModels.jl) package for [Julia](https://julialang.org) relative to, say, [`lme4`](https://github.com/lme4/lme4) in [R](https://www.r-project.org) are speed and reliability.
+
+For simple models fit to data sets of modest size, speed is not that much of an issue.
+If the model can be fit in less time than it takes to organize and plot the data to determine an appropriate model form then it is not too important if one implementation is faster than another.
+
+However, in a simulation study that may involve tens or hundreds of thousands of model fits speed of fitting even simple models can be important.
+The purpose of this notebook is to show how to perform simulation studies in Julia quickly and efficiently. The emphasis will be on simulation of linear mixed-effects models but many of the techniques and software practices apply more generally.
+
+## Overall strategy
+
+It is almost trivial to state that the purpose of a simulation study is to generate a set of simulated values that we can use to assess whether a claim is reasonable.
+By the way, it is important to realize that a simulation study can only refute a claim by showing that the observed results are inconsistent with that claim.
+You cannot establish a result through a simulation study.
+You can only show that your results in a particular case are non inconsistent with the claim.
+
+As with any other experiment it helps to determine up front what the scope of the study should be and what information from the study should be retained.
+The design space for the study can be very large, even for a simple case such as assessing the characteristics (Type I error and power) of a particular test, as described below.
+Also the amount of information generated can be very large and the expense of retaining a huge amount of information must be balanced with information loss by summarizing.
+There is a great temptation to perform too much data reduction, such as recording only the proportion of p-values falling below a threshold in a test.
+Retaining the simulated p-values themselves provides a richer source of information.
+
+A basic rule in Julia is that, because of the way just-in-time (JIT) compilation works, any substantial computation should take place within a function.
+Thus the early phases of the simulation study are exercises in designing functions.
+In the later phases these functions should be wrapped in a package.
+
+## A simple example
+
+In Matuschek, Kliegl, Vasishth, Baayen and Bates (2017) we used a simulation study to explore "Balancing Type I Error and Power in Linear Mixed Models".
+The experimental design consisted of a single binary experimental factor, `C`, which varied within subjects (`S`) and within items (`I`).  There are `m` items and `n` subjects.
+
+Begin by loading the packages to be used.
+The `MixedModels` package is obviously needed to define and fit the models as is the `DataFrames` package for the data representation.
+The `Distributions` package provides the complementary cumulative distribution function (`ccdf`) to evaluate p-values from test statistics assuming a given reference distribution, and the `Gadfly` package is used for plotting.
+
+```{julia;term=true}
+using Distributions, DataFrames, Gadfly, Iterators, MixedModels
+```
+
+Define a `mkdata` function to generate a data frame according to the number of subjects, the number of items, and the coding of the levels of the covariates. (My coauthors in that paper overruled me and used a `[-0.5, 0.5]` encoding. Here I use the `[-1, 1]` encoding preferred by statisticians.)
+
+```{julia;term=true}
+function mkdata(m=10, n=30)
+    chars = vcat('A':'Z', 'a':'z')
+    tuples = collect(product(['N','Y'], ('a':'z')[1:m], chars[1:n]))
+    vecs = push!(Any[pool(getindex.(tuples, i)) for i in 1:3], 
+        zeros(length(tuples)))
+    DataFrame(vecs, [:C, :I, :S, :Y])
+end
+```
+
+```{julia;term=true}
+const dat = mkdata()
+```
+
+```{julia;term=true}
+typeof.(dat.columns)  # check the column types
+```
+
+The response, `Y`, is initialized to zeros so that it is available to construct `LinearMixedModel` objects.
+The `const` modifier in the assignment indicates that the type of `dat` will not change, although the actual data values may change (but in this case they won't).
+Using `const`, if appropriate, when assigning global variables is a good practice.
+
+The `lmm` function creates, but does not fit, a `LinearMixedModel`.
+It is useful to have a model that has not yet been fit because it is the structure that can then be used to `simulate!` a response and `refit!` the model.
+
+The formula for the linear mixed-effects model can be constructed with the `@formula` macro or using an explicit call to the `Formula` constructor.
+For example
+```{julia;term=true}
+srand(1234321)
+refit!(simulate!(lmm(@formula(Y ~ 1 + (1|S) +(1|I)), dat),
+        β=[2000.], σ=300., θ=fill(1/3, 2)))
+```
+
+or
+
+```{julia;term=true}
+srand(1234321)
+const m0 = refit!(simulate!(lmm(Formula(:Y,:(1+(1|S)+(1|I))), dat),
+        β=[2000.], σ=300., θ=fill(1/3, 2)))
+```
+
+We obtain the simulated response and fit the alternative model as
+
+```{julia;term=true}
+const y0 = model_response(m0)
+const m1 = refit!(lmm(Formula(:Y, :(1+C+(1|S)+(1|I))), dat), y0)
+```
+
+The likelihood ratio test statistic for `H₀: β₂ = 0` versus `Hₐ: β₂ ≠ 0` is the difference in the objective (negative twice the log-likelihood) values for the fitted models.
+
+```{julia;term=true}
+lrtval = objective(m0) - objective(m1)
+```
+
+The p-value, assuming a `Chisq(1)` reference distribution, is
+
+```{julia;term=true}
+ccdf(Chisq(1), lrtval)
+```
+
+Notice that the conclusions from the likelihoood ratio test are essentially the same as those from the Z test on `β₂` in the coefficient summary table
+```
+  Fixed-effects parameters:
+             Estimate Std.Error  z value P(>|z|)
+(Intercept)   1955.61   34.7294    56.31  <1e-99
+C             5.81124    12.988 0.447432  0.6546
+```
+for the alternative model `m1`.
+This is because the square of the `z` value, which is the ratio of the estimate of `β₂` to its standard error, is essentially the same as `lrtval`
+```{julia;term=true}
+abs2(0.447432)
+```
+and a `Chisq(1)` distribution is the square of a standard normal distribution.
+
+Generally a hypothesis test based on fitting both the null model and the alternative model to the data (i.e. the likelihood ratio test) is more reliable than one based on fitting only the alternative model and inferring what the change to the null model fit will be (the Z-statistic).
+In this case they give essentially the same results.
+
+## Plotting the data
+
+Although it is tempting to continue with writing simulation functions, this is a good time to examine the responses that have been simulated.
+What do we expect them to look like?
+
+To make it more convenient to plot the data we copy the contents of `y0` vector to the `:Y` column of the data frame then look for a systematic shift
+
+```{julia;term=true}
+copyto!(dat[:Y], y0)
+plot(dat, x = :Y, Geom.density, Guide.xlabel("Simulated responses"))
+```
+
+```{julia;term=true}
+plot(dat, x = :Y, Geom.density, Guide.xlabel("Simulated responses"),
+    color = :C)
+```
+
+There is no evidence of a systematic shift according to the level of the condition, `C`, which is what we would expect because we simulated the data from the null model.
+
+Next consider the effect of the subjects and items
+
+```{julia;term=true}
+plot(dat, x = :S, y = :Y, Geom.boxplot, Guide.xlabel("Subject"),
+    Guide.ylabel("Simulated response"))
+```
+
+```{julia;term=true}
+plot(dat, x = :I, y = :Y, Geom.boxplot, Guide.xlabel("Item"),
+    Guide.ylabel("Simulated response"))
+```
+
+```{julia;term=true}
+plot(dat, x = :C, y = :Y, xgroup=:I, Geom.subplot_grid(Geom.boxplot),
+    Guide.ylabel("Simulated response"))
+```
+
+None of these plots show much structure in the data but that is not surprising because these data were simulated from the null model.  Notice that, even though there are simulated random effects for subject and for item the differences between groups are not substantial.
+
+## Simulating p values from a single test
+
+```{julia;term=true}
+function onetest(mods::Vector{LinearMixedModel{T}},
+        N, β::Vector{T}, σ::T, θ::Vector{T}) where {T}
+    if length(mods) ≠ 2
+        throw(ArgumentError(
+                "length(mods) == $(length(mods)) should be 2"))
+    end
+    obj1 = Vector{T}(N)
+    obj2 = similar(obj1)
+    lrt = similar(obj1)
+    pvalue = similar(obj1)
+    for i in 1:N
+        obj1[i] = o1 = objective(refit!(simulate!(mods[1],
+            β=β, σ=σ, θ=θ)))
+        obj2[i] = o2 = objective(refit!(mods[2],
+            model_response(mods[1])))
+        lrt[i] = l = o1 - o2
+        pvalue[i] = ccdf(Chisq(1), l)
+    end
+    DataFrame(Any[obj1, obj2, lrt, pvalue], [:d1, :d2, :lrt, :p])
+end
+srand(1234321)
+const mods = [lmm(Formula(:Y,:(1+(1|S)+(1|I))), dat),
+        lmm(Formula(:Y,:(1+C+(1|S)+(1|I))), dat)]
+onetest(mods, 10, [2000.], 300., fill(1/3, 2))
+```
+
+```{julia;term=true}
+@time const rr =  onetest(mods, 10000, [2000.], 300., fill(1/3, 2));
+```
+
+```{julia;term=true}
+const ppts250 = inv(500):inv(250):1
+```
+
+```{julia;term=true}
+plot(x = ppts250, y = quantile(rr[:p], ppts250), Geom.line,
+    Guide.xlabel("Theoretical quantiles"), 
+    Guide.ylabel("Observed quantiles"))
+```
+
+The so-called "maximal" model for such an experimental design is
+
+```{julia;term=true}
+const maxmod = lmm(@formula(Y ~ 1+C+(1+C|S)+(1+C|I)), dat);
+```
+
+To simulate a response we specify the fixed-effects parameters, `β`, the standard deviation of the per-observation noise, `σ`, and the relative covariance parameters, `θ`.  `β` and `θ` are vectors whereas `σ` is a scalar.  For this model `β` is of length 2 and `θ` is of length 6.  In this simulation the null hypothesis is that the second element of `β` is zero.
+
+```{julia;term=true}
+srand(1234321)   # set the random number seed
+refit!(simulate!(maxmod, β=[2000.,0.], σ=300., θ=[1/3,0.,0.,1/3,0.,0.]))
+```
+
+The value of `θ` corresponds to random intercepts only with a standard deviation of 100 for subject and for item.
+
+The null model for a likelihood ratio test is
+
+```{julia;term=true}
+const nullmod = lmm(@formula(Y ~ 1 + (1+C|S) + (1+C|I)), dat);
+refit!(nullmod, model_response(maxmod))
+```
+
+#### Note for statisticians
+
+> Testing the significance of the main effect for `C` in the presence of apparently non-ignorable interactions of `C` with `S` and with `I`, as done here, is quite peculiar. Some would say nonsensical.  Unfortunately, this practice has become enshrined in the subject matter literature so we include such nonsense comparisons in the simulation.
+
+The likelihood ratio test statistic for these two model fits is
+
+```{julia;term=true}
+const lrtstat = objective(nullmod) - objective(maxmod)
+```
+
+The p-value for the likelihood ratio test, assuming a χ² reference distribution with 1 degree of freedom, is
+
+```{julia;term=true}
+ccdf(Chisq(1), lrtstat)  # ccdf -> complementary cumulative dist fcn
+```
+
+Notice that this is essentially the same result as for the Z test described in the line for `C` in the coefficient table of `maxmod`
+```
+  Fixed-effects parameters:
+             Estimate Std.Error  z value P(>|z|)
+(Intercept)   1952.75   26.8259  72.7934  <1e-99
+C             5.81124   16.4813 0.352595  0.7244
+```
+This is because the `z-value`, which is the ratio of the coefficient estimate and its approximate standard error, is very close to the square root of `lrtstat`
+
+```{julia;term=true}
+sqrt(lrtstat)
+```
+
+## Repeating the simulation
+
+The whole purpose of a simulation study is to repeat the process of simulating a response vector and fitting various models to it many times. Because Julia performs just-in-time (JIT) compilation on methods it is important to perform such repetitive operations within a function call.  This is one of the most important lessons in using Julia efficiently - if you have any doubt about whether an operation should be done in a function or in global scope, do it in a function.
+
+To assess the Type I error rate we will simulate data from the null model `1+(1|S)+(1|I)` and evaluate the likelihood ratio test statistic for the fit of `1+(1|S)+(1|I)` versus `1+C+(1|S)+(1|I)`
+
+```{julia;term=true}
+function typeIsim(seed, N, null::LinearMixedModel{T},
+    alt::LinearMixedModel{T}, β::Vector{T}, σ::T, θ::Vector{T}) where T
+    srand(seed)
+    refdist = Chisq(1)
+    r = Matrix{T}(4,N)
+    for j in 1:N
+        r[1,j] = objn = objective(refit!(simulate!(null, β=β, σ=σ, θ=θ)))
+        r[2,j] = obja = objective(refit!(alt, model_response(null)))
+        r[3,j] = lrtstat = objn - obja
+        r[4,j] = ccdf(refdist, lrtstat)
+    end
+    r
+end
+```
+
+We fill out the matrix `r`, which will be the returned value, one column at a time because matrices in Julia are stored in [*column-major*](https://en.wikipedia.org/wiki/Row-_and_column-major_order) order.
+
+First, run a small test to make sure things are working as expected
+
+```{julia;term=true}
+typeIsim(1234321, 10, lmm(@formula(Y~1+(1|S)+(1|I)), dat),
+   lmm(@formula(Y~1+C+(1|S)+(1|I)), dat), [2000.], 300., [1/3,1/3])
+```
+
+```{julia;term=true}
+using Gadfly
+```
+
+```{julia;term=true}
+@time const r = typeIsim(1234321, 10000, lmm(@formula(Y~1+(1|S)+(1|I)), dat),
+   lmm(@formula(Y~1+C+(1|S)+(1|I)), dat), [2000.], 300., [1/3,1/3])
+```
+
+```{julia;term=true}
+plot(x=view(r, 4, :), Geom.histogram, Guide.xlabel("p-values"))
+```
+
+```{julia;term=true}
+const ppt250 = inv(500) : inv(250) : 1
+plot(x=quantile(view(r,4,:), ppt250), y = ppt250, Geom.line)
+```
+
+```{julia;term=true}
+const nullrhs = [
+    :(1+(1+C|S)+(1+C|I)),     # "maximal" model
+    :(1+(1|S)+(0+C|S)+(1|I)+(0+C|I)), # uncorrelated intercept/slope
+    :(1+(1|S)+(0+C|S)+(1|I)), # uncorrelated subject intercepts/slope 
+    :(1+(1|S)+(1|I)+(0+C|I)), # uncorrelated item intercept/slope
+    :(1+(1|S)+(1|I))];        # random intercepts only.
+```
+
+(Statisticians may find all but the last of these models to be peculiar because they are used to test the significance of a main effect in the presence of apparently non-ignorable interactions with grouping factors for the random effects.  They are indeed peculiar; some would say nonsensical.  Unfortunately this nonsensical approach is now enshrined in the subject matter literature.)
+
+```{julia;term=true}
+const altrhs = deepcopy(nullrhs);  # modify a copy of the null models
+for expr in altrhs
+    insert!(expr.args, 3, :C)  # add the main effect for C
+end
+altrhs
+```
+
+Just to check that we haven't accidently changed the null right-hand sides
+
+```{julia;term=true}
+nullrhs
+```
+
+Now generate `LinearMixedModel` objects for each of these formulas.
+
+```{julia;term=true}
+const nullmods = [lmm(Formula(:Y,rhs), dat) for rhs in nullrhs];
+```
+
+```{julia;term=true}
+const altmods = [lmm(Formula(:Y,rhs), dat) for rhs in altrhs];
+```
+
+```{julia;term=true}
+getfield.(altmods, :formula)
+```
+
+### Creating a single simulated response vector
+
+The `simulate!` function takes a `LinearMixedModel` and optional parameter values and overwrites the response with a simulated response.
+
+For convenience we will always simulate from the "maximal" alternative because all the other models can be represented as special cases of that model with restricted parameter values.
+
+```{julia;term=true}
+refit!(simulate!(altmods[1], 
+    β=[2000.,0.], σ = 300., θ = [1/3,0.,0.,1/3,0.,0.]))
+```
+
+```{julia;term=true}
+newy = model_response(altmods[1]);
+```
+
+```{julia;term=true}
+refit!(nullmods[1], newy)
+```
+
+```{julia;term=true}
+objective(altmods[1]) - objective(nullmods[1])
+```
+
+```{julia;term=true}
+```
diff --git a/docs/jmd/SingularCovariance.jmd b/docs/jmd/SingularCovariance.jmd
new file mode 100644
index 0000000..1523942
--- /dev/null
+++ b/docs/jmd/SingularCovariance.jmd
@@ -0,0 +1,404 @@
+# Singular covariance estimates in random regression models
+
+This notebook explores the occurrence of singularity in the estimated covariance matrix of random regression models.
+These are mixed-effects models with vector-valued random effects.
+
+First, fit a model to the `sleepstudy` data from [`lme4`](https://github.com/lme4/lme4).
+
+## Fitting a linear mixed-model to the sleepstudy data
+
+Load the required packages
+
+```{julia;term=true}
+using DataFrames, FreqTables, LinearAlgebra, MixedModels, Random, RData
+using Gadfly
+using Gadfly.Geom: density, histogram, point
+using Gadfly.Guide: xlabel, ylabel
+const testdir = normpath(joinpath(dirname(pathof(MixedModels)), "..", "test"));
+const dat = Dict(Symbol(k)=>v for (k,v) in load(joinpath(testdir, "dat.rda")));
+```
+
+The `LinearMixedModel` constructor creates a model structure but does not fit it.
+An explicit call to `fit!` is required to fit the model.
+As is customary (though not required) in Julia, a function whose name ends in `!` is a _mutating function_ that modifies one or more of its arguments.
+
+An optional second argument of `true` in the call to `fit!` produces verbose output from the optimization.
+
+```{julia;term=true}
+sleepm = fit!(LinearMixedModel(@formula(Y ~ 1 + U + (1+U|G)), dat[:sleepstudy]), verbose=true)
+```
+
+The variables in the optimization are the elements of a lower triangular matrix, $\Lambda$, which is the relative covariance factor of the random effects.
+The corresponding parameter vector is called $\theta$.
+
+```{julia;term=true}
+Λ = sleepm.λ[1]
+```
+
+The matrix $\Lambda$ is the left (or lower) Cholesky factor of the covariance matrix of the unconditional distribution of the vector-valued random effects, relative to the variance, $\sigma^2$, of the per-observation noise.
+That is
+```math
+\begin{equation}
+    \Sigma = \sigma^2\Lambda\Lambda'
+\end{equation}
+```
+In terms of the estimates,
+
+```{julia;term=true}
+s² = varest(sleepm)    # estimate of the residual variance
+```
+
+```{julia;term=true}
+s² * Λ * Λ'   # unconditional covariance matrix of the random effects
+```
+
+The estimated correlation of the random effects can, of course, be evaluated from the covariance matrix.
+Writing out the expressions for the elements of the covariance matrix in terms of the elements of Λ shows that many terms cancel in the evaluation of the correlation, resulting in the simpler formula.
+
+```{julia;term=true}
+Λ[2, 1] / sqrt(Λ[2, 1]^2 + Λ[2, 2]^2)
+```
+
+For a $2\times 2$ covariance matrix it is not terribly important to perform this calculation in an efficient and numerically stable way.
+However, it is a good idea to pay attention to stability and efficiency in a calculation that can be repeated tens of thousands of times in a simulation or a parametric bootstrap.
+The `norm` function evaluates the (geometric) length of a vector in a way that controls round-off better than the naive calculation.
+The `view` function provides access to a subarray, such as the second row of $\Lambda$, without generating a copy.
+Thus the estimated correlation can be written
+
+```{julia;term=true}
+Λ[2, 1] / norm(view(Λ, 2, :))
+```
+
+### Optimization with respect to θ
+
+As described in section 3 of the 2015 _Journal of Statistical Software_ [paper](https://www.jstatsoft.org/index.php/jss/article/view/v067i01/v67i01.pdf) by Bates, Maechler, Bolker and Walker, using the relative covariance factor, $\Lambda$, in the formulation of mixed-effects models in the [`lme4`](https://github.com/lme4/lme4) and [`MixedModels`](https://github.com/dmbates/MixedModels) packages and using the vector $\theta$ as the optimization variable was a conscious choice.
+Indeed, a great deal of effort went into creating this form so that the profiled log-likelihood can be easily evaluated and so that the constraints on the parameters, $\theta$, are simple "box" constraints.
+In fact, the constraints are simple lower bounds.
+
+```{julia;term=true}
+show(sleepm.lowerbd)
+```
+
+In contrast, trying to optimize the log-likelihood with respect to standard deviations and correlations of the random effects
+would be quite difficult because the constraints on the correlations when the covariance matrix is larger than $2\times 2$ are quite complicated.
+Also, the correlation itself can be unstable.
+Consider what happens to the expression for the correlation if both $\Lambda_{2,1}$ and $\Lambda_{2,2}$ are small in magnitude.
+Small perturbations in $\Lambda_{2,1}$ that result in sign changes can move the correlation from near $-1$ to near $+1$ or vice-versa.
+
+Some details on the optimization process are available in an `OptSummary` object stored as the `optsum` field of the model.
+
+```{julia;term=true}
+sleepm.optsum
+```
+
+### Convergence on the boundary
+
+Determining if an estimated covariance matrix is singular is easy when using the $\theta$  parameters because singularity corresponds to points on the boundary of the allowable parameter space.
+In other words, if the optimization converges to a vector in which either or both of $\theta_1$ or $\theta_3$ are zero, the covariance matrix is singular.
+Otherwise it is non-singular.
+
+The $\theta_1$ parameter is the estimated relative standard deviation of the random intercepts.
+If this is zero then the correlation is undefined and reported as `NaN`.
+If $\theta_3$ is zero and $\theta_2$ is non-zero then the estimated correlation is $\pm 1$ with the sign determined by the sign of $\theta_2$.
+If both $\theta_2$ and $\theta_3$ are zero the correlation is `NaN` because the standard deviation of the random slopes will be zero.
+
+Singular covariance matrices larger than $2\times 2$ do not necessarily result in particular values, like ±1, for the correlations.
+
+Users of `lmer` or `lmm` are sometimes taken aback by convergence on the boundary if this produces correlations of `NaN` or $\pm 1$.
+Some feel that this is a sign of model failure.
+Others consider such estimates as a sign that Bayesian methods with priors that pull singular covariance matrices away from the boundary should be used.
+
+This type of value judgement seems peculiar.
+An important property of maximum likelihood estimates is that these estimates are well-defined once the probability model for the data has been specified.
+It may be difficult to determine numerical values of the estimates but the definition itself is straightforward.
+If there is a direct method of evaluating the log-likelihood at a particular value of the parameters, then, by definition, the mle's are the parameter values that maximize this log-likelihood.
+Bates et al. (2015) provide such a method of evaluating the log-likelihood for a linear mixed-effects model.
+Indeed they go further and describe how the fixed-effects parameters and one of the variance components can be profiled out of the log-likelihood evaluation, thereby reducing the dimension of the nonlinear, constrained optimization problem to be solved.
+
+If the mle's correspond to a singular covariance matrix, this is a property of the model and the data.
+It is not a mistake in some way.
+It is just the way things are.
+It reflects the fact that often the distribution of the estimator of a covariance matrix is diffuse.
+It is difficult to estimate variances and covariances precisely.
+A search for papers or books on "covariance estimation" will produce many results, often describing ways of regularizing the estimating process because the data themselves do not provide precise estimates.
+
+For the example at hand a parametric bootstrap is one way of evaluating the precision of the estimates.
+
+## The bootstrap function
+
+The `MixedModels` package provides a `bootstrap` method to create a parametric bootstrap sample from a fitted model.
+
+For reproducibility, set the random number seed to some arbitrary value.
+
+```{julia;term=true}
+Random.seed!(1234321);
+```
+
+Arguments to the `bootstrap` function are the number of samples to generate and the model from which to generate them.
+By default the converged parameter estimates are those used to generate the samples.
+Addition, named arguments can be used to override these parameter values, allowing `bootstrap` to be used for simulation.
+
+`bootstrap` returns a `DataFrame` with columns:
+- `obj`: the objective (-2 loglikelihood)
+- `σ`: the standard deviation of the per-observation noise
+- `β₁` to `βₚ`: the fixed-effects coefficients
+- `θ₁` to `θₖ`: the covariance parameter elements
+- `σ₁` to `σₛ`: the estimated standard deviations of random effects.
+- `ρ₁` to `ρₜ`: the estimated correlations of random effects
+
+The `ρᵢ` and `σᵢ` values are derived from the `θᵢ` and `σ` values.
+
+```{julia;term=true}
+sleepmbstrp = bootstrap(10000, sleepm);
+show(names(sleepmbstrp))
+```
+
+Recall that the constrained parameters are $\theta_1$ and $\theta_3$ which both must be non-negative.
+If either or both of these are zero (in practice the property to check is if they are "very small", which here is arbitrarily defined as less than 0.00001) then the covariance matrix is singular.
+
+```{julia;term=true}
+issmall(x) = x < 0.00001   # defines a one-liner function in Julia
+```
+
+```{julia;term=true}
+freqtable(issmall.(sleepmbstrp[:θ₁]), issmall.(sleepmbstrp[:θ₃]))
+```
+
+Here the covariance matrix estimate is non-singular in 9,686 of the 10,000 samples, has an zero estimated intercept variance in 6 samples and is otherwise singular (i.e. correlation estimate of $\pm 1$) in 308 samples.
+
+Empirical densities of the θ components are:
+
+```{julia;echo=false;fig_width=8}
+plot(sleepmbstrp, x = :θ₁, density, 
+    xlabel("Relative standard deviation of intercept, θ₁"))
+```
+
+```{julia;echo=false;fig_width=8}
+plot(sleepmbstrp, x = :θ₃, density, xlabel("θ₃"))
+```
+
+A density plot is typically a good way to visualize such a large sample.
+However, when there is a spike such as the spike at zero here, a histogram provides a more informative plot.
+
+```{julia;echo=false;fig_width=8}
+plot(sleepmbstrp, x = :θ₃, histogram, xlabel("θ₃"))
+```
+
+```{julia;echo=false;fig_width=8}
+plot(x = filter(isfinite, sleepmbstrp[:ρ₁]), histogram, xlabel("ρ₁₂"))
+```
+
+### Reciprocal condition number
+
+The definitve way to assess singularity of the estimated covariance matrix is by its _condition number_ or, alternatively, its _reciprocal condition number_.
+In general the condition number, $\kappa$, of a matrix is the ratio of the largest singular value to the smallest.
+For singular matrices it is $\infty$, which is why it is often more convenient to evaluate and plot $\kappa^{-1}$.
+Because $\kappa$ is a ratio of singular values it is unaffected by nonzero scale factors.
+Thus
+```math
+\begin{equation}
+\kappa^{-1}(s^2\Lambda\Lambda') = \kappa^{-1}(\Lambda\Lambda') =
+[\kappa^{-1}(\Lambda)]^2
+\end{equation}
+```
+```{julia}
+function recipcond(bstrp::DataFrame)
+    T = eltype(bstrp[:θ₁])
+    val = sizehint!(T[], size(bstrp, 1))
+    d = Matrix{T}(undef, 2, 2)
+    for (t1, t2, t3) in zip(bstrp[:θ₁], bstrp[:θ₂], bstrp[:θ₃])
+        d[1, 1] = t1
+        d[1, 2] = t2
+        d[2, 2] = t3
+        v = svdvals!(d)
+        push!(val, v[2] / v[1])
+    end
+    val
+end
+rc = recipcond(sleepmbstrp)
+extrema(rc)
+```
+
+```{julia;echo=false;fig_width=8}
+plot(x = rc, histogram, xlabel("κ⁻¹"))
+```
+
+$\kappa^{-1}$ is small if either or both of $\theta_1$ or $\theta_3$ is small.
+
+```{julia;term=true}
+sum(issmall, rc)
+```
+
+The density of the estimated correlation
+
+```{julia;echo=false;fig_width=8}
+plot(x = filter(isfinite, sleepmbstrp[:ρ₁]), histogram, xlabel("ρ₁₂"))
+```
+
+```{julia;term=true}
+sum(isfinite, sleepmbstrp[:ρ₁])  # recall that ρ = NaN in 7 cases
+```
+
+```{julia;term=true}
+sum(x -> x == -1, sleepmbstrp[:ρ₁])  # number of cases of rho == -1
+```
+
+```{julia;term=true}
+sum(x -> x == +1, sleepmbstrp[:ρ₁])  # number of cases of rho == +1
+```
+
+In this case the bootstrap simulations that resulted in $\rho = -1$ were not close to being indeterminant with respect to sign.
+That is, the values of $\theta_2$ were definitely negative.
+
+```{julia;term=true}
+sleepmbstrp[:θ₂][findall(x -> x == -1, sleepmbstrp[:ρ₁])]
+```
+
+## The Oxboys data
+
+In the `nlme` package for R there are several data sets to which random regression models are fit.
+The `RCall` package for Julia provides the ability to run an embedded R process and communicate with it.
+The simplest form of writing R code within Julia is to use character strings prepended with `R`.
+In Julia strings are delimited by `"` or by `"""`.
+With `"""` multi-line strings are allowed.
+
+```{julia;term=true}
+using RCall
+R"""
+library(nlme)
+plot(Oxboys)
+"""
+```
+
+```{julia;term=true}
+oxboys = rcopy(R"Oxboys");
+show(names(oxboys))
+```
+
+```{julia;term=true}
+oxboysm = fit(LinearMixedModel, @formula(height ~ 1 + age + (1+age | Subject)), oxboys)
+```
+
+```{julia;term=true}
+show(getθ(oxboysm))
+```
+
+As seen in the plot and by the estimate $\widehat{\theta_1} = 12.0$, the estimated standard deviation of the random intercepts is much greater than the residual standard deviation.
+It is unlikely that bootstrap samples will include singular covariance estimates.
+
+```{julia;term=true}
+Random.seed!(4321234);
+oxboysmbtstrp = bootstrap(10000, oxboysm);
+```
+
+In this bootstrap sample, there are no singular estimated covariance matrices for the random effects.
+
+```{julia;term=true}
+freqtable(issmall.(oxboysmbtstrp[:θ₁]), issmall.(oxboysmbtstrp[:θ₃]))
+```
+
+The empirical density of the correlation estimates shows that even in this case the correlation is not precisely estimated.
+
+```{julia;echo=false;fig_width=8}
+plot(oxboysmbtstrp, x = :ρ₁, histogram, xlabel("ρ₁₂"))
+```
+
+```{julia;term=true}
+extrema(oxboysmbtstrp[:ρ₁])
+```
+
+The reciprocal condition number
+
+```{julia;term=true}
+rc = recipcond(oxboysmbtstrp);
+extrema(rc)
+```
+
+does not get very close to zero.
+
+```{julia;echo=false;fig_width=8}
+plot(x = rc, histogram, xlabel("κ⁻¹"))
+```
+
+## The Orthodont data
+
+```{julia;term=true}
+R"plot(Orthodont)"
+```
+
+The subject labels distinguish between the male and the female subjects.  Consider first the female subjects only.
+
+```{julia;term=true}
+orthfemale = rcopy(R"subset(Orthodont, Sex == 'Female', -Sex)");
+orthfm = fit!(LinearMixedModel(@formula(distance ~ 1 + age + (1 + age | Subject)), orthfemale))
+```
+
+```{julia;term=true}
+Random.seed!(1234123)
+orthfmbtstrp = bootstrap(10000, orthfm);
+```
+
+```{julia;term=true}
+freqtable(issmall.(orthfmbtstrp[:θ₁]), issmall.(orthfmbtstrp[:θ₃]))
+```
+
+For this model almost 1/3 of the bootstrap samples converge to singular covariance estimates for the vector-valued random effects.
+A histogram of the estimated correlations of the random effects is dominated by the boundary values.
+
+```{julia;echo=false;fig_width=8}
+plot(orthfmbtstrp, x = :ρ₁, histogram, xlabel("ρ₁₂"))
+```
+
+Even though the estimated correlation in the model is -0.30, more of the boundary values are at +1 than at -1.
+This may be an artifact of the optimization routine used.
+In some cases there may be multiple optima on the boundary.
+It is difficult to determine the global optimum in these cases.
+
+A histogram of the reciprocal condition number is also dominated by the boundary values.
+
+```{julia;echo=false}
+#plot(x = recipcond(orthfmbtstrp), density, xlabel("κ⁻¹"))
+```
+
+## Early childhood cognitive study
+
+This example from Singer and Willett (2003), *Applied Longitudinal Data Analysis* was the motivation for reformulating the estimation methods to allow for singular covariance matrices. Cognitive scores (`cog`) were recorded at `age` 1, 1.5 and 2 years on 103 infants, of whom 58 were in the treatment group and 45 in the control group.  The treatment began at age 6 months (0.5 years).  The data are available as the `Early` data set in the `mlmRev` package for R.  In the model, time on study (`tos`) is used instead of age because the zero point on the time scale should be when the treatment begins.
+
+```{julia;term=true}
+R"""
+suppressMessages(library(mlmRev))
+library(lattice)
+Early$tos <- Early$age - 0.5
+Early$trttos <- Early$tos * (Early$trt == "Y")
+xyplot(cog ~ tos | reorder(id, cog, min), Early, 
+    type = c("p","l","g"), aspect="xy")
+"""
+```
+
+Notice that most of these plots within subjects have a negative slope and that the scores at 1 year of age (`tos = 0.5`) are frequently greater than would be expected on an age-adjusted scale.
+
+```{julia;term=true}
+R"print(xtabs(cog ~ age + trt, Early) / xtabs(~ age + trt, Early))";
+```
+
+When the time origin is the beginning of the treatment there is not generally a "main effect" for the treatment but only an interaction of `trt` and `tos`.
+
+```{julia;term=true}
+early = rcopy(R"subset(Early, select = c(cog, tos, id, trt, trttos))");
+earlym = fit(LinearMixedModel, @formula(cog ~ 1 + tos + trttos + (1 + tos | id)), early)
+```
+
+The model converges to a singular covariance matrix for the random effects.
+
+```{julia;term=true}
+getθ(earlym)
+```
+
+The conditional (on the observed responses) means of the random effects fall along a line.
+
+```{julia;echo=false;fig_width=5}
+rr = ranef(earlym)[1];
+plot(x = view(rr, 1, :), y = view(rr, 2, :), point,
+    xlabel("Conditional mean of intercept random effect"),
+    ylabel("Conditional mean of slope random effect"))
+```
diff --git a/docs/jmd/SubjectItem.jmd b/docs/jmd/SubjectItem.jmd
new file mode 100644
index 0000000..de4f260
--- /dev/null
+++ b/docs/jmd/SubjectItem.jmd
@@ -0,0 +1,22 @@
+
+```{julia;term=true}
+using DataFrames, MixedModels, RData
+const dat = Dict(Symbol(k)=>v for (k,v) in
+    load(joinpath(dirname(pathof(MixedModels)), "..", "test", "dat.rda")));
+```
+
+```{julia;term=true}
+mm1 = fit(LinearMixedModel, @formula(Y ~ 1+S+T+U+V+W+X+Z+(1+S+T+U+V+W+X+Z|G)+(1+S+T+U+V+W+X+Z|H)), dat[:kb07])
+```
+
+```{julia;term=true}
+mm1.optsum
+```
+
+```{julia;term=true}
+mm1.trms[1].Λ
+```
+
+```{julia;term=true}
+mm1.trms[2].Λ
+```
diff --git a/docs/jmd/maximal.jmd b/docs/jmd/maximal.jmd
new file mode 100644
index 0000000..9ed61bd
--- /dev/null
+++ b/docs/jmd/maximal.jmd
@@ -0,0 +1,219 @@
+---
+title: Simulation of Subject-Item mixed models
+author: Douglas Bates
+date: 2019-12-01
+---
+Following the publication of Barr et al. (2011) there has been considerable interest in simulation of subject-item types of data from mixed-effects models to assess the effect of the choice of random-effects structure on the Type I error of tests on the fixed-effects.
+Here we show how such simulations can be carried out efficiently using the [`MixedModels`](https://github.com/dmbates/MixedModels.jl) package for [`Julia`](https://julialang.org).
+
+## Data characteristics
+
+The data characteristics for this simulation are those from the paper _Maybe maximal: Good enough mixed models optimize power while controlling Type I error_ by Seedorff, Oleson and McMurray, which is just one example of such a study.
+There are 50 subjects, 25 from each of two age groups, which are denoted by `'O'` (older) and `'Y'` (younger).
+Each subject's response is measured on 5 different occasions on each of 20 different items under two noise conditions, `'Q'` (quiet) and `'N'` (noisy).
+Such an experimental design yields a total of 10,000 measurements.
+
+In the data for this experimental design, the 25 younger subjects are labelled `'a'` to `'y'` while the older subjects are `'A'` to `'Y'` and the items are `'A'` to `'T'`.
+```{julia}
+using DataFrames, Distributions, FreqTables, Gadfly, MixedModels, Random, StatsModels, Tables
+```
+```{julia}
+df = (S = repeat(['A':'Y'; 'a':'y'], inner=40, outer=5),
+    Age = repeat(['O','Y'], inner=1000, outer=5),
+    I = repeat('A':'T', inner=2, outer=250),
+    Noise = repeat(['Q','N'], outer=5000),
+    Y = ones(10000));
+```
+The response column, `Y`, is added as a placeholder.
+
+#### ColumnTable versus DataFrame
+
+`df` is a `NamedTuple`, which is similar to a `list` in `R` except that the names are `Symbol`s, not `String`s,
+```{julia}
+typeof(df)
+```
+It is easily converted to a `DataFrame` if desired.
+```julia
+DataFrame(df)
+```
+
+The trend in Julia packages supporting data science, like the `StatsModels` package, is towards data representations as "column tables" (a `NamedTuple` of arrays) or "row tables" (a vector of `NamedTuple`s).
+Sometimes it is convenient to work on individual columns, sometimes it makes more sense to iterate over rows.
+The `columntable` and `rowtable` functions allow for conversion back and forth between the two representations.
+
+```{julia}
+rowtable(df)
+```
+
+`DataFrames.describe` provides a convenient summary of a `DataFrame`.
+```julia
+describe(DataFrame(df))
+```
+
+#### Checking properties of the design
+
+It is worthwhile checking that the design has the desired properties.
+`S` (subject) and `I` (item) should be balanced, which can be checked in a cross-tabulation
+```{julia;eval=false}
+freqtable(df, :I, :S)
+```
+```
+20×50 Named Array{Int64,2}
+I ╲ S │ 'A'  'B'  'C'  'D'  'E'  'F'  'G'  'H'  'I'  'J'  'K'  'L'  'M'  'N'  …  'l'  'm'  'n'  'o'  'p'  'q'  'r'  's'  't'  'u'  'v'  'w'  'x'  'y'
+──────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+'A'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10  …   10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'B'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'C'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'D'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'E'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'F'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'G'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'H'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'I'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'J'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'K'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'L'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'M'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'N'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'O'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'P'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'Q'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'R'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'S'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'T'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10  …   10   10   10   10   10   10   10   10   10   10   10   10   10   10
+```
+or, more compactly,
+```julia
+all(freqtable(df, :I, :S) .== 10)
+```
+
+Checking on the experimental variables, `Age` does not vary within levels of `S`
+```julia
+freqtable(df, :Age, :S)
+```
+However, `Age` does vary within levels of `I`
+```julia
+freqtable(df, :Age, :I)
+```
+and `Noise` varies within levels of `S`
+```julia
+freqtable(df, :Noise, :S)
+```
+and within levels of `I`
+```julia
+freqtable(df, :Noise, :I)
+```
+
+## Creating a LinearMixedModel
+
+A `LinearMixedModel` with fixed-effects for `Age` and `Noise` and for their interaction and with random intercepts for `S` and `I` is created as
+```{julia}
+contrasts = Dict(:Age => HelmertCoding(), :Noise => HelmertCoding());
+m1 = LinearMixedModel(@formula(Y ~ 1 + Age * Noise + (1|S) + (1|I)), df, contrasts = contrasts);
+m1.X                               # model matrix for fixed-effects terms
+```
+
+#### HelmertCoding of contrasts
+
+The third argument in the call to `LinearMixedModel` is a dictionary of "contrasts" to use when forming the contrasts for categorical covariates.
+`HelmertCoding` applied to a 2-level factor creates a  `±1` coding of the levels, as shown in the display of the model matrix.
+With this coding the `(Intercept)` coefficient will be a "typical" response level without regard to `Age` and `Noise`.
+In other words, the `(Intercept)` is not defined with respect to an arbitrary reference level for the categorical covariates.
+Note that when 2-level factors are coded as `±1` the interaction terms also have a `±1` coding.
+
+Sometimes coefficient estimates are called the "effect" of the condition in the covariate, e.g. "Noise" versus "Quiet".
+For the `HelmertCoding` the "effect" of changing from the lower level to the higher level is twice the coefficient, because the distance between the `±1` values in the model matrix is 2.
+
+## Simulating a response and fitting the model
+
+The `MixedModels.simulate!` function installs a simulated response in the model object, given values of the parameters.
+
+```{julia}
+rng = Random.MersenneTwister(2052162715);  # repeatable random number generator
+refit!(simulate!(rng, m1, β = [1000., 0, 0, 0], σ = 200., θ = [0.5, 0.5]))
+```
+
+The parameters are `β`, the fixed-effects coefficients, `σ`, the standard deviation of the per-observation, or "residual", noise term and `θ`, the parameters in the lower Cholesky factor of the relative covariance matrices for the random effects terms.
+
+In this case, both random-effects terms are simple, scalar random effects with standard deviations of $100 = 200 * 0.5$.
+
+Notice that the estimated standard deviations, 98.014 and 101.665 for the random effects and 199.16 for the residual noise, are very close to the values in the simulation.
+
+Similarly, the estimates of the fixed-effects are quite close to the values in the simulation.
+
+#### REML estimates
+
+To use the REML criterion instead of maximum likelihood for parameter optimization, add the optional `REML` argument.
+```{julia}
+refit!(m1, REML=true)
+```
+
+Because the experimental design is balanced across subjects, items, age and noise, the fixed-effects parameter estimates are the same under ML or under REML.
+This does not need to be the case for unbalanced designs.
+
+The REML standard errors of the fixed-effects parameter estimates and the estimated variance components are all somewhat larger than those from ML, as would be expected.
+Because the standard errors are larger for REML estimates, the p-values for the null hypothesis of no effect for a covariate or interaction are also larger.
+
+The REML estimates are preferred for evaluating p-values for the fixed-effects coefficients, because they are more conservative.
+However, REML estimates should not be used for comparing models in likelihood ratio tests or using various information criteria because the log-likelihood is not explicitly optimized.
+
+In this package the `loglikelihood` extractor function does not return a value for a model fit by REML,
+```julia
+loglikelihood(m1)
+```
+hence all the other information criteria extractors also fail
+```julia
+aicc(m1)    # the corrected Akaike's Information Criterion
+```
+
+## Simulation of Type I error and power
+
+The basic approach to simulating Type I error and power for a test is to simulate samples from the model, evaluate the test and accumulate the results.
+Because it is possible to calculate the p-values we store these.
+
+In practice, a simulation should be performed in a function to take advantage of the Just-In-Time (JIT) compilation of Julia functions but it helps first to show the development of the steps.
+
+The p-values for the fixed-effects coefficients from a model fit are available as the `pvalues` property of the model.
+```julia
+m1.pvalues
+```
+
+A loop could be used to accumulate such p-values for a large number of simulated responses.
+One of the great joys of programming in Julia is the ability to use `for` loops when convenient and not fear a catastrophic drop in efficiency.
+In fact, `for` loops are often the most effective way of writing an iterative calculation.
+
+Begin by creating the vector of vectors that will hold the result, say 10,000 replications of vectors of p-values, then run the loop to simulate a response, re-fit the model to the simulated response and extract and store the p-values.
+The straightforward way to write a function to do such a simulation is
+```julia
+function simulatepvalues(rng::AbstractRNG, nsamp::Integer, m::LinearMixedModel;
+        β = m.β, σ = m.σ, θ = m.θ)
+    samp = Vector{Vector{Float64}}(undef, nsamp)
+    for i in 1:nsamp
+        refit!(simulate!(rng, m1, β = m.β, σ = m.σ, θ = m.θ), REML=true)
+        samp[i] = m1.pvalues
+    end
+    samp
+end
+samp = simulatepvalues(rng, 10_000, m1, β = [1000., 0., 0., 0.], σ = 200., θ = [0.5,0.5]);
+typeof(samp)
+```
+
+Because the simulation is carried out inside a function, which will be compiled by the JIT compiler, it is fast.
+```julia
+@time simulatepvalues(Random.MersenneTwister(1234), 10_000, m1, β = [1000., 0., 0., 0.], σ = 200., θ = [0.5,0.5]);
+```
+
+The first element of the vectors of p-values should be very small as the value of $\beta_1$ used in the simulations, $1000.0$, is much larger than a typical standard error for that coefficient in this model.
+```julia
+extrema(getindex.(samp, 1))
+```
+
+## Fitting alternative models to the same response
+
+Define another model with random slopes with respect to `Noise` and random intercepts for both `S` and `I`.
+
+```{julia}
+m2 = LinearMixedModel(@formula(Y ~ 1 + Age * Noise + (1+Noise|S) + (1+Noise|I)), df,
+     Dict(:Age => hc, :Noise => hc));
+refit!(m2, response(m1))
+```
diff --git a/docs/md/maximal.md b/docs/md/maximal.md
new file mode 100644
index 0000000..63729e8
--- /dev/null
+++ b/docs/md/maximal.md
@@ -0,0 +1,420 @@
+---
+title: Simulation of Subject-Item mixed models
+author: Douglas Bates
+date: 2019-02-21
+---
+[Note: The "terms2.0 - Son of Terms" release of the `StatsModels` package is imminent.
+After that release and a corresponding release of `"MixedModels"` these calculations can be reproduced.
+To reproduce the results in this document now requires the `"terms2.0"` branch of MixedModels and the `"dfk/terms2.0"` branch of StatsModels.]
+
+Following the publication of Barr et al. (2011) there has been considerable interest in simulation of subject-item types of data from mixed-effects models to assess the effect of the choice of random-effects structure on the Type I error of tests on the fixed-effects.
+Here we show how such simulations can be carried out efficiently using the [`MixedModels`](https://github.com/dmbates/MixedModels.jl) package for [`Julia`](https://julialang.org).
+
+## Data characteristics
+
+The data characteristics for this simulation are those from the paper _Maybe maximal: Good enough mixed models optimize power while controlling Type I error_ by Seedorff, Oleson and McMurray, which is just one example of such a study.
+There are 50 subjects, 25 from each of two age groups, which are denoted by `'O'` (older) and `'Y'` (younger).
+Each subject's response is measured on 5 different occasions on each of 20 different items under two noise conditions, `'Q'` (quiet) and `'N'` (noisy).
+Such an experimental design yields a total of 10,000 measurements.
+
+In the data for this experimental design, the 25 younger subjects are labelled `'a'` to `'y'` while the older subjects are `'A'` to `'Y'` and the items are `'A'` to `'T'`.
+````julia
+using DataFrames, FreqTables, MixedModels, Random, StatsModels, Tables
+df = (S = repeat(['A':'Y'; 'a':'y'], inner=40, outer=5),
+    Age = repeat(['O','Y'], inner=1000, outer=5),
+    I = repeat('A':'T', inner=2, outer=250),
+    Noise = repeat(['Q','N'], outer=5000),
+    Y = ones(10000))
+````
+
+
+
+
+The response column, `Y`, is added as a placeholder.
+
+#### ColumnTable versus DataFrame
+
+`df` is a `NamedTuple`, which is similar to a `list` in `R` except that the names are `Symbol`s, not `String`s,
+````julia
+typeof(df)
+````
+
+
+````
+NamedTuple{(:S, :Age, :I, :Noise, :Y),Tuple{Array{Char,1},Array{Char,1},Arr
+ay{Char,1},Array{Char,1},Array{Float64,1}}}
+````
+
+
+
+
+It is easily converted to a `DataFrame` if desired.
+````julia
+DataFrame(df)
+````
+
+
+
+<table class="data-frame"><thead><tr><th></th><th>S</th><th>Age</th><th>I</th><th>Noise</th><th>Y</th></tr><tr><th></th><th>Char</th><th>Char</th><th>Char</th><th>Char</th><th>Float64</th></tr></thead><tbody><p>10,000 rows × 5 columns</p><tr><th>1</th><td>'A'</td><td>'O'</td><td>'A'</td><td>'Q'</td><td>1.0</td></tr><tr><th>2</th><td>'A'</td><td>'O'</td><td>'A'</td><td>'N'</td><td>1.0</td></tr><tr><th>3</th><td>'A'</td><td>'O'</td><td>'B'</td><td>'Q'</td><td>1.0</td></tr><tr><th>4</th><td>'A'</td><td>'O'</td><td>'B'</td><td>'N'</td><td>1.0</td></tr><tr><th>5</th><td>'A'</td><td>'O'</td><td>'C'</td><td>'Q'</td><td>1.0</td></tr><tr><th>6</th><td>'A'</td><td>'O'</td><td>'C'</td><td>'N'</td><td>1.0</td></tr><tr><th>7</th><td>'A'</td><td>'O'</td><td>'D'</td><td>'Q'</td><td>1.0</td></tr><tr><th>8</th><td>'A'</td><td>'O'</td><td>'D'</td><td>'N'</td><td>1.0</td></tr><tr><th>9</th><td>'A'</td><td>'O'</td><td>'E'</td><td>'Q'</td><td>1.0</td></tr><tr><th>10</th><td>'A'</td><td>'O'</td><td>'E'</td><td>'N'</td><td>1.0</td></tr><tr><th>11</th><td>'A'</td><td>'O'</td><td>'F'</td><td>'Q'</td><td>1.0</td></tr><tr><th>12</th><td>'A'</td><td>'O'</td><td>'F'</td><td>'N'</td><td>1.0</td></tr><tr><th>13</th><td>'A'</td><td>'O'</td><td>'G'</td><td>'Q'</td><td>1.0</td></tr><tr><th>14</th><td>'A'</td><td>'O'</td><td>'G'</td><td>'N'</td><td>1.0</td></tr><tr><th>15</th><td>'A'</td><td>'O'</td><td>'H'</td><td>'Q'</td><td>1.0</td></tr><tr><th>16</th><td>'A'</td><td>'O'</td><td>'H'</td><td>'N'</td><td>1.0</td></tr><tr><th>17</th><td>'A'</td><td>'O'</td><td>'I'</td><td>'Q'</td><td>1.0</td></tr><tr><th>18</th><td>'A'</td><td>'O'</td><td>'I'</td><td>'N'</td><td>1.0</td></tr><tr><th>19</th><td>'A'</td><td>'O'</td><td>'J'</td><td>'Q'</td><td>1.0</td></tr><tr><th>20</th><td>'A'</td><td>'O'</td><td>'J'</td><td>'N'</td><td>1.0</td></tr><tr><th>21</th><td>'A'</td><td>'O'</td><td>'K'</td><td>'Q'</td><td>1.0</td></tr><tr><th>22</th><td>'A'</td><td>'O'</td><td>'K'</td><td>'N'</td><td>1.0</td></tr><tr><th>23</th><td>'A'</td><td>'O'</td><td>'L'</td><td>'Q'</td><td>1.0</td></tr><tr><th>24</th><td>'A'</td><td>'O'</td><td>'L'</td><td>'N'</td><td>1.0</td></tr><tr><th>&vellip;</th><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td><td>&vellip;</td></tr></tbody></table>
+
+
+
+The trend in Julia packages supporting data science, like the `StatsModels` package, is towards data representations as "column tables" (a `NamedTuple` of arrays) or "row tables" (a vector of `NamedTuple`s).
+Sometimes it is convenient to work on individual columns, sometimes it makes more sense to iterate over rows.
+The `columntable` and `rowtable` functions allow for conversion back and forth between the two representations.
+
+````julia
+rowtable(df)
+````
+
+
+````
+10000-element Array{NamedTuple{(:S, :Age, :I, :Noise, :Y),Tuple{Char,Char,C
+har,Char,Float64}},1}:
+ (S = 'A', Age = 'O', I = 'A', Noise = 'Q', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'A', Noise = 'N', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'B', Noise = 'Q', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'B', Noise = 'N', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'C', Noise = 'Q', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'C', Noise = 'N', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'D', Noise = 'Q', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'D', Noise = 'N', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'E', Noise = 'Q', Y = 1.0)
+ (S = 'A', Age = 'O', I = 'E', Noise = 'N', Y = 1.0)
+ ⋮                                                  
+ (S = 'y', Age = 'Y', I = 'P', Noise = 'N', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'Q', Noise = 'Q', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'Q', Noise = 'N', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'R', Noise = 'Q', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'R', Noise = 'N', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'S', Noise = 'Q', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'S', Noise = 'N', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'T', Noise = 'Q', Y = 1.0)
+ (S = 'y', Age = 'Y', I = 'T', Noise = 'N', Y = 1.0)
+````
+
+
+
+
+
+`DataFrames.describe` provides a convenient summary of a `DataFrame`.
+````julia
+describe(DataFrame(df))
+````
+
+
+
+<table class="data-frame"><thead><tr><th></th><th>variable</th><th>mean</th><th>min</th><th>median</th><th>max</th><th>nunique</th><th>nmissing</th><th>eltype</th></tr><tr><th></th><th>Symbol</th><th>Union…</th><th>Any</th><th>Union…</th><th>Any</th><th>Union…</th><th>Nothing</th><th>DataType</th></tr></thead><tbody><p>5 rows × 8 columns</p><tr><th>1</th><td>S</td><td></td><td>'A'</td><td></td><td>'y'</td><td>50</td><td></td><td>Char</td></tr><tr><th>2</th><td>Age</td><td></td><td>'O'</td><td></td><td>'Y'</td><td>2</td><td></td><td>Char</td></tr><tr><th>3</th><td>I</td><td></td><td>'A'</td><td></td><td>'T'</td><td>20</td><td></td><td>Char</td></tr><tr><th>4</th><td>Noise</td><td></td><td>'N'</td><td></td><td>'Q'</td><td>2</td><td></td><td>Char</td></tr><tr><th>5</th><td>Y</td><td>1.0</td><td>1.0</td><td>1.0</td><td>1.0</td><td></td><td></td><td>Float64</td></tr></tbody></table>
+
+
+
+#### Checking properties of the design
+
+It is worthwhile checking that the design has the desired properties.
+`S` (subject) and `I` (item) should be balanced, which can be checked in a cross-tabulation
+````julia
+
+freqtable(df, :I, :S)
+````
+
+
+
+```
+20×50 Named Array{Int64,2}
+I ╲ S │ 'A'  'B'  'C'  'D'  'E'  'F'  'G'  'H'  'I'  'J'  'K'  'L'  'M'  'N'  …  'l'  'm'  'n'  'o'  'p'  'q'  'r'  's'  't'  'u'  'v'  'w'  'x'  'y'
+──────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
+'A'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10  …   10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'B'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'C'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'D'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'E'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'F'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'G'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'H'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'I'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'J'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'K'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'L'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'M'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'N'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'O'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'P'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'Q'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'R'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'S'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10      10   10   10   10   10   10   10   10   10   10   10   10   10   10
+'T'   │  10   10   10   10   10   10   10   10   10   10   10   10   10   10  …   10   10   10   10   10   10   10   10   10   10   10   10   10   10
+```
+or, more compactly,
+````julia
+all(freqtable(df, :I, :S) .== 10)
+````
+
+
+````
+true
+````
+
+
+
+
+
+Checking on the experimental variables, `Age` does not vary within levels of `S`
+````julia
+freqtable(df, :Age, :S)
+````
+
+
+````
+2×50 Named Array{Int64,2}
+Age ╲ S │ 'A'  'B'  'C'  'D'  'E'  'F'  …  't'  'u'  'v'  'w'  'x'  'y'
+────────┼──────────────────────────────────────────────────────────────
+'O'     │ 200  200  200  200  200  200  …    0    0    0    0    0    0
+'Y'     │   0    0    0    0    0    0  …  200  200  200  200  200  200
+````
+
+
+
+
+However, `Age` does vary within levels of `I`
+````julia
+freqtable(df, :Age, :I)
+````
+
+
+````
+2×20 Named Array{Int64,2}
+Age ╲ I │ 'A'  'B'  'C'  'D'  'E'  'F'  …  'O'  'P'  'Q'  'R'  'S'  'T'
+────────┼──────────────────────────────────────────────────────────────
+'O'     │ 250  250  250  250  250  250  …  250  250  250  250  250  250
+'Y'     │ 250  250  250  250  250  250  …  250  250  250  250  250  250
+````
+
+
+
+
+and `Noise` varies within levels of `S`
+````julia
+freqtable(df, :Noise, :S)
+````
+
+
+````
+2×50 Named Array{Int64,2}
+Noise ╲ S │ 'A'  'B'  'C'  'D'  'E'  'F'  …  't'  'u'  'v'  'w'  'x'  'y'
+──────────┼──────────────────────────────────────────────────────────────
+'N'       │ 100  100  100  100  100  100  …  100  100  100  100  100  100
+'Q'       │ 100  100  100  100  100  100  …  100  100  100  100  100  100
+````
+
+
+
+
+and within levels of `I`
+````julia
+freqtable(df, :Noise, :I)
+````
+
+
+````
+2×20 Named Array{Int64,2}
+Noise ╲ I │ 'A'  'B'  'C'  'D'  'E'  'F'  …  'O'  'P'  'Q'  'R'  'S'  'T'
+──────────┼──────────────────────────────────────────────────────────────
+'N'       │ 250  250  250  250  250  250  …  250  250  250  250  250  250
+'Q'       │ 250  250  250  250  250  250  …  250  250  250  250  250  250
+````
+
+
+
+
+
+## Creating a LinearMixedModel
+
+A `LinearMixedModel` with fixed-effects for `Age` and `Noise` and for their interaction and with random intercepts for `S` and `I` is created as
+````julia
+const hc = HelmertCoding();
+m1 = LinearMixedModel(@formula(Y ~ 1 + Age * Noise + (1|S) + (1|I)), df, Dict(:Age => hc, :Noise => hc));
+m1.X                               # model matrix for fixed-effects terms
+````
+
+
+````
+10000×4 Array{Float64,2}:
+ 1.0  -1.0   1.0  -1.0
+ 1.0  -1.0  -1.0   1.0
+ 1.0  -1.0   1.0  -1.0
+ 1.0  -1.0  -1.0   1.0
+ 1.0  -1.0   1.0  -1.0
+ 1.0  -1.0  -1.0   1.0
+ 1.0  -1.0   1.0  -1.0
+ 1.0  -1.0  -1.0   1.0
+ 1.0  -1.0   1.0  -1.0
+ 1.0  -1.0  -1.0   1.0
+ ⋮                    
+ 1.0   1.0  -1.0  -1.0
+ 1.0   1.0   1.0   1.0
+ 1.0   1.0  -1.0  -1.0
+ 1.0   1.0   1.0   1.0
+ 1.0   1.0  -1.0  -1.0
+ 1.0   1.0   1.0   1.0
+ 1.0   1.0  -1.0  -1.0
+ 1.0   1.0   1.0   1.0
+ 1.0   1.0  -1.0  -1.0
+````
+
+
+
+
+
+#### HelmertCoding of contrasts
+
+The third argument in the call to `LinearMixedModel` is a dictionary of "contrasts" to use when forming the contrasts for categorical covariates.
+`HelmertCoding` applied to a 2-level factor creates a  `±1` coding of the levels, as shown in the display of the model matrix.
+With this coding the `(Intercept)` coefficient will be a "typical" response level without regard to `Age` and `Noise`.
+In other words, the `(Intercept)` is not defined with respect to an arbitrary reference level for the categorical covariates.
+Note that when 2-level factors are coded as `±1` the interaction terms also have a `±1` coding.
+
+Sometimes coefficient estimates are called the "effect" of the condition in the covariate, e.g. "Noise" versus "Quiet".
+For the `HelmertCoding` the "effect" of changing from the lower level to the higher level is twice the coefficient, because the distance between the `±1` values in the model matrix is 2.
+
+## Simulating a response and fitting the model
+
+The `MixedModels.simulate!` function installs a simulated response in the model object, given values of the parameters.
+
+````julia
+rng = Random.MersenneTwister(2052162715);  # repeatable random number generator
+refit!(simulate!(rng, m1, β = [1000., 0, 0, 0], σ = 200., θ = [0.5, 0.5]))
+````
+
+
+````
+Linear mixed model fit by maximum likelihood
+ Y ~ 1 + Age + Noise + Age & Noise + (1 | S) + (1 | I)
+     logLik        -2 logLik          AIC             BIC       
+ -6.72750605×10⁴  1.34550121×10⁵  1.34564121×10⁵  1.34614593×10⁵
+
+Variance components:
+            Variance  Std.Dev.
+ S          9606.815  98.01436
+ I         10335.779 101.66503
+ Residual  39665.426 199.16181
+ Number of obs: 10000; levels of grouping factors: 50, 20
+
+  Fixed-effects parameters:
+                   Estimate Std.Error  z value P(>|z|)
+(Intercept)         993.065      26.7  37.1934  <1e-99
+Age: Y             -27.0764   14.0037 -1.93352  0.0532
+Noise: Q           -2.21268   1.99162 -1.11099  0.2666
+Age: Y & Noise: Q  0.257456   1.99162  0.12927  0.8971
+````
+
+
+
+
+
+The parameters are `β`, the fixed-effects coefficients, `σ`, the standard deviation of the per-observation, or "residual", noise term and `θ`, the parameters in the lower Cholesky factor of the relative covariance matrices for the random effects terms.
+
+In this case, both random-effects terms are simple, scalar random effects with standard deviations of $100 = 200 * 0.5$.
+
+Notice that the estimated standard deviations, 98.014 and 101.665 for the random effects and 199.16 for the residual noise, are very close to the values in the simulation.
+
+Similarly, the estimates of the fixed-effects are quite close to the values in the simulation.
+
+#### REML estimates
+
+To use the REML criterion instead of maximum likelihood for parameter optimization, add the optional `REML` argument.
+````julia
+refit!(m1, REML=true)
+````
+
+
+````
+Linear mixed model fit by REML
+ Y ~ 1 + Age + Noise + Age & Noise + (1 | S) + (1 | I)
+ REML criterion at convergence: 134528.135222923
+
+Variance components:
+            Variance  Std.Dev.
+ S          9867.226  99.33391
+ I         10736.171 103.61550
+ Residual  39673.394 199.18181
+ Number of obs: 10000; levels of grouping factors: 50, 20
+
+  Fixed-effects parameters:
+                   Estimate Std.Error  z value P(>|z|)
+(Intercept)         993.065   27.1684  36.5522  <1e-99
+Age: Y             -27.0764   14.1884 -1.90834  0.0563
+Noise: Q           -2.21268   1.99182 -1.11088  0.2666
+Age: Y & Noise: Q  0.257456   1.99182 0.129257  0.8972
+````
+
+
+
+
+
+Because the experimental design is balanced across subjects, items, age and noise, the fixed-effects parameter estimates are the same under ML or under REML.
+This does not need to be the case for unbalanced designs.
+
+The REML standard errors of the fixed-effects parameter estimates and the estimated variance components are all somewhat larger than those from ML, as would be expected.
+Because the standard errors are larger for REML estimates, the p-values for the null hypothesis of no effect for a covariate or interaction are also larger.
+
+The REML estimates are preferred for evaluating p-values for the fixed-effects coefficients, because they are more conservative.
+However, REML estimates should not be used for comparing models in likelihood ratio tests or using various information criteria because the log-likelihood is not explicitly optimized.
+
+In this package the `loglikelihood` extractor function does not return a value for a model fit by REML,
+````julia
+loglikelihood(m1)
+````
+
+
+<pre class="julia-error">
+ERROR: ArgumentError: loglikelihood not available for models fit by REML
+</pre>
+
+
+
+hence all the other information criteria extractors also fail
+````julia
+aicc(m1)    # the corrected Akaike's Information Criterion
+````
+
+
+<pre class="julia-error">
+ERROR: ArgumentError: loglikelihood not available for models fit by REML
+</pre>
+
+
+
+
+## Fitting alternative models to the same response
+
+Define another model with random slopes with respect to `Noise` and random intercepts for both `S` and `I`.
+
+````julia
+m2 = LinearMixedModel(@formula(Y ~ 1 + Age * Noise + (1+Noise|S) + (1+Noise|I)), df,
+     Dict(:Age => hc, :Noise => hc));
+refit!(m2, response(m1))
+````
+
+
+````
+Linear mixed model fit by maximum likelihood
+ Y ~ 1 + Age + Noise + Age & Noise + (1 + Noise | S) + (1 + Noise | I)
+     logLik       -2 logLik         AIC            BIC      
+ -6.7274890×10⁴  1.3454978×10⁵  1.3457178×10⁵ 1.34651094×10⁵
+
+Variance components:
+              Variance      Std.Dev.     Corr.
+ S          9607.09559138  98.01579256
+               7.46097821   2.73147912 -0.25
+ I         10335.72689027 101.66477704
+               0.78373263   0.88528675 -1.00
+ Residual  39657.13460277 199.14099177
+ Number of obs: 10000; levels of grouping factors: 50, 20
+
+  Fixed-effects parameters:
+                   Estimate Std.Error  z value P(>|z|)
+(Intercept)         993.065   26.7001  37.1933  <1e-99
+Age: Y             -27.0764   14.0038  -1.9335  0.0532
+Noise: Q           -2.21268   2.03817 -1.08562  0.2776
+Age: Y & Noise: Q  0.257456   2.02853 0.126917  0.8990
+````