@@ -31,40 +31,41 @@ little sub-optimal, for a couple of reasons:
and then interpolate them, but this is clearly not optimal in the way it
interacts with backoff. Consider the case where the sources would get the
same weight-- you'd want to combine them and estimate the LM together, to
- get more optimal backoff. Let's call this the `` estimate-then-interpolate
- problem'' .
+ get more optimal backoff. Let's call this the "estimate-then-interpolate
+ problem".

- The standard pruning method (e.g. Stolcke entropy pruning) is non-optimal
because when removing probabilities it doesn't update the backed-off-to
state.

## Our solution

- As part of our solution to the `` estimate-then-interpolate problem'' , we
+ As part of our solution to the "estimate-then-interpolate problem", we
interpolate the data sources at the level of data-counts before we estimate
the LMs. Think of it as a weighted sum. At this point, anyone familiar with
- how language models are discounted will say `` wait a minute-- that can't work''
+ how language models are discounted will say "wait a minute-- that can't work"
because these techniques rely on integer counts. Our method is to treat a count as a
collection of small pieces of different sizes, where we only care about the total
(i.e. the sum of the sizes of the pieces), and the magnitude of the three largest
pieces. Recall that modified Kneser-Ney discounting only cares about the first
three data counts when deciding the amount to discount. We extend
- that discounting method to variable-size individual counts, as discounting proportions
- D1, D2 and D3 of the top-1, top-2 and top-3 largest individual pieces.
+ modified Kneser-Ney discounting to variable-size individual counts by discounting proportions
+ D1, D2 and D3 of the top-1, top-2 and top-3 largest individual counts in
+ the collection.
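To make this concrete, here is a minimal Python sketch of that discounting rule. It is a toy illustration rather than the code in this repository; the function name and the D1, D2, D3 values are made up for the example.

```python
# Toy sketch of discounting a "count collection": an n-gram's count is a set of
# weighted pieces; we track the total and the three largest pieces, and discount
# proportions D1, D2, D3 of the largest, second-largest and third-largest pieces.
# The D values below are made-up illustrative numbers.

def discount_count_collection(pieces, D1=0.8, D2=0.4, D3=0.2):
    """pieces: sizes of the individual pieces making up one n-gram's count,
    e.g. data-source weights times raw counts.  Returns the discounted total
    and the mass removed (which would go to the lower-order state)."""
    top3 = sorted(pieces, reverse=True)[:3]
    removed = sum(d * p for d, p in zip([D1, D2, D3], top3))
    total = sum(pieces)
    return total - removed, removed

# An n-gram seen once in a source with weight 0.7 and twice in a source with
# weight 0.4 contributes the pieces [0.7, 0.4, 0.4]:
print(discount_count_collection([0.7, 0.4, 0.4]))   # roughly (0.70, 0.80)
```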

- ## Computing the gradients
+ ## Computing the hyperparameters

The perceptive reader will notice that there is no very obvious way to estimate
- the discounting constants D1, D2 and D3 for each n-gram order, and no very
- obvious way to estiate the interpolation weights for the language models . Our
- solution to this is to estimate all these hyperparameters based on the
+ the discounting constants D1, D2 and D3 for each n-gram order, nor
+ to estimate the interpolation weights for the data. Our
+ solution to this is to estimate all these hyperparameters to maximize the
probability of dev data, using a standard numerical optimization technique
(L-BFGS). The obvious question is, "how do we estimate the gradients?". The simplest
- way to estimate the gradients is by the "difference method". But if there
+ way is the "difference method". But if there
are 20 hyperparameters, we expect L-BFGS to take at least, say, 10 iterations
to converge acceptably, and computing the derivative would require estimating
- the LM 20 times on each iteration-- in total we're estimating the language model
- 200 times, which seems like a bit too much. Our solution to this is to
+ the LM 20 times on each iteration-- so in total we're estimating the language model
+ 200 times, which seems like too much. Our solution to this is to
estimate the gradients directly using reverse-mode automatic differentiation
(of which neural-net backprop is a special case). This allows us to remove
the factor of 20 (or however many hyperparameters there are) in the gradient
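As a rough illustration of the cost argument, here is a toy sketch (not this project's code): `dev_logprob` is a dummy stand-in for "estimate the LM with these hyperparameters and score the dev data", and the counter shows what the difference method costs for a single gradient.

```python
# Toy illustration of why the "difference method" is expensive: one extra
# objective evaluation per hyperparameter, where each evaluation would mean
# re-estimating the whole language model.

import numpy as np

num_evals = 0

def dev_logprob(hyperparams):
    """Dummy stand-in for estimating the LM and returning the dev-data
    log-probability; here just a smooth made-up function."""
    global num_evals
    num_evals += 1
    return -np.sum((hyperparams - 0.5) ** 2)

def difference_method_grad(f, x, eps=1e-4):
    """Forward-difference gradient: one extra call to f per hyperparameter,
    on top of the baseline value f(x)."""
    f0 = f(x)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_step = x.copy()
        x_step[i] += eps
        grad[i] = (f(x_step) - f0) / eps
    return grad

x = np.full(20, 0.3)                  # pretend we have 20 hyperparameters
grad = difference_method_grad(dev_logprob, x)
print(num_evals)                      # 21 calls, i.e. 20 extra LM estimations
                                      # for one gradient; reverse-mode autodiff
                                      # gives the whole gradient for roughly the
                                      # cost of one extra pass.
```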
@@ -77,22 +78,23 @@ the intermediate quantities of the computation in memory at once. In principle
this can be done almost as a mechanical process. But for language model
estimation the obvious approach is a bit limiting, as we'd like to cover
cases where the data doesn't all fit in memory at once. (We're using a version
- of Patrick Nguyen's sorting trick, see the `` MSRLM'' report). Anyway, we have a
+ of Patrick Nguyen's sorting trick, see his "MSRLM" report). Anyway, we have a
solution to this, and the details are extremely tedious. It involves
decomposing the discounting and probability-estimation processes into simple
individual steps, each of which operates on a pipe (i.e. doesn't hold too much
data in memory), and then working out the automatic differentiation operation
- for each individual pipe. We don't have to go in " backwards order" inside the
- individual operations, because operations like summing don't " care about" the
+ for each individual pipe. We don't have to go in backwards order inside the
+ individual operations, because operations like summing don't care about the
order of operations.
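As a minimal, hypothetical illustration of that idea for a single summing operation (the real pipeline is far more involved, and these function names are invented for the example):

```python
# A streaming forward op and its reverse-mode counterpart.  Because the
# derivative of a sum with respect to each of its terms is 1, the backward op
# can stream the records in the *same* order as the forward op.

from collections import defaultdict

def forward_sum_counts(records):
    """records: iterable of (state, count).  Returns total count per state."""
    totals = defaultdict(float)
    for state, count in records:        # one sequential pass, little memory
        totals[state] += count
    return dict(totals)

def backward_sum_counts(records, d_totals):
    """Given d(objective)/d(total) for each state, yield d(objective)/d(count)
    for each input record, in forward order."""
    for state, _count in records:
        yield d_totals.get(state, 0.0)  # each term just inherits its sum's derivative

records = [("a b", 0.7), ("a b", 0.4), ("b c", 1.0)]
totals = forward_sum_counts(records)                       # per-state totals
d_records = list(backward_sum_counts(records, {"a b": 2.0, "b c": -1.0}))
print(totals, d_records)                                   # [2.0, 2.0, -1.0]
```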

## Pruning

The language model pruning method is something we've done before in a different
- context, which is an improved, more-exact version of Stolcke pruning. It operates
- on the same principle, but manages to be more optimal because when it removes
- a probability from the LM it assigns the removed data to the backed-off-to-state
- and updates its probabilities accordingly. Of course, we take this change into
- account when selecting the n-grams to prune away.
+ context, which is an improved, more-exact version of Stolcke pruning. It
+ operates on the same principle as Stolcke pruning, but manages to be more
+ optimal because when it removes a probability from the LM it assigns the removed
+ data to the backed-off-to state and updates its probabilities accordingly. Of
+ course, we take this change into account when selecting the n-grams to prune
+ away.
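A toy sketch of that difference, working with counts rather than probabilities and using made-up data structures (not the actual pruning code): pruning an n-gram hands its count to the backed-off-to state, whose probabilities would then be re-estimated from the updated counts.

```python
# Toy sketch: a "state" is a dict from word to count, keyed by its history.
# Pruning an n-gram moves its count into the backed-off-to state, so that
# state's distribution changes (unlike plain Stolcke pruning).

def prune_ngram(states, history, word):
    """Remove `word` from the state for `history` and reassign its count to
    the backed-off-to state (the history with its earliest word dropped)."""
    count = states[history].pop(word)
    backoff_history = history[1:]       # e.g. ('a', 'b') backs off to ('b',)
    states[backoff_history][word] = states[backoff_history].get(word, 0.0) + count
    return count

states = {
    ("a", "b"): {"c": 3.0, "d": 1.0},
    ("b",):     {"c": 5.0, "d": 2.0, "e": 1.0},
}
prune_ngram(states, ("a", "b"), "d")
print(states[("b",)])   # 'd' now has count 3.0: the backed-off-to state was updated
```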