
Commit 55f7f3b

Updates to documentation, fix typos
1 parent 720a3c6 commit 55f7f3b

1 file changed: +23 -21 lines changed

docs/motivation.md

@@ -31,40 +31,41 @@ little sub-optimal, for a couple of reasons:
   and then interpolate them, but this is clearly not optimal in the way it
   interacts with backoff. Consider the case where the sources would get the
   same weight-- you'd want to combine them and estimate the LM together, to
-  get more optimal backoff. Let's call this the ``estimate-then-interpolate
-  problem''.
+  get more optimal backoff. Let's call this the "estimate-then-interpolate
+  problem".
 
 - The standard pruning method (e.g. Stolcke entropy pruning) is non-optimal
   because when removing probabilities it doesn't update the backed-off-to
   state.
 
 ## Our solution
 
-As part of our solution to the ``estimate-then-interpolate problem'', we
+As part of our solution to the "estimate-then-interpolate problem", we
 interpolate the data sources at the level of data-counts before we estimate
 the LMs. Think of it as a weighted sum. At this point, anyone familiar with
-how language models are discounted will say ``wait a minute-- that can't work''
+how language models are discounted will say "wait a minute-- that can't work"
 because these techniques rely on integer counts. Our method is to treat a count as a
 collection of small pieces of different sizes, where we only care about the total
 (i.e. the sum of the sizes of the pieces), and the magnitude of the three largest
 pieces. Recall that modified Kneser-Ney discounting only cares about the first
 three data counts when deciding the amount to discount. We extend
-that discounting method to variable-size individual counts, as discounting proportions
-D1, D2 and D3 of the top-1, top-2 and top-3 largest individual pieces.
+modified Kneser-Ney discounting to variable-size individual counts by discounting proportions
+D1, D2 and D3 of the top-1, top-2 and top-3 largest individual counts in
+the collection.
 
-## Computing the gradients
+## Computing the hyperparameters
 
 The perceptive reader will notice that there is no very obvious way to estimate
-the discounting constants D1, D2 and D3 for each n-gram order, and no very
-obvious way to estiate the interpolation weights for the language models. Our
-solution to this is to estimate all these hyperparameters based on the
+the discounting constants D1, D2 and D3 for each n-gram order, nor
+to estimate the interpolation weights for the data. Our
+solution to this is to estimate all these hyperparameters to maximize the
 probability of dev data, using a standard numerical optimization technique
 (L-BFGS). The obvious question is, "how do we estimate the gradients?". The simplest
-way to estimate the gradients is by the "difference method". But if there
+way is the "difference method". But if there
 are 20 hyperparameters, we expect L-BFGS to take at least, say, 10 iterations
 to converge acceptably, and computing the derivative would require estimating
-the LM 20 times on each iteration-- in total we're estimating the language model
-200 times, which seems like a bit too much. Our solution to this is to
+the LM 20 times on each iteration-- so in total we're estimating the language model
+200 times, which seems too much. Our solution to this is to
 estimate the gradients directly using reverse-mode automatic differentiation
 (which is the general case of neural-net backprop). This allows us to remove
 the factor of 20 (or however many hyperparameters there are) in the gradient
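
To make the count-splitting idea in the hunk above concrete: interpolating at the level of data-counts means a single occurrence in source s contributes a "piece" of size equal to that source's weight, and each stored count only needs to remember its total plus its three largest pieces. The following is a minimal Python sketch of that bookkeeping, not pocolm's actual code; the class name, source weights and discount values are made up for illustration.

```python
class PieceCount:
    """A data-count kept as its total plus the sizes of its three largest pieces."""

    def __init__(self):
        self.total = 0.0
        self.top3 = [0.0, 0.0, 0.0]  # largest, second and third largest piece

    def add_piece(self, size):
        # One occurrence in data source s contributes a piece of size weight[s].
        self.total += size
        self.top3 = sorted(self.top3 + [size], reverse=True)[:3]

    def discounted(self, d1, d2, d3):
        # Remove proportions D1, D2 and D3 of the top-1, top-2 and top-3 pieces;
        # the removed mass is what gets assigned to backoff, as in Kneser-Ney.
        removed = d1 * self.top3[0] + d2 * self.top3[1] + d3 * self.top3[2]
        return self.total - removed, removed


# Example: an n-gram seen twice in a source with weight 0.7 and once in a
# source with weight 0.3 (integer counts are the special case where every
# piece has size 1.0).
count = PieceCount()
for piece in [0.7, 0.7, 0.3]:
    count.add_piece(piece)

kept, backed_off = count.discounted(d1=0.8, d2=0.4, d3=0.2)  # made-up values
print(kept, backed_off)  # roughly 0.8 and 0.9 (the total was 1.7)
```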
@@ -77,22 +78,23 @@ the intermediate quantities of the computation in memory at once. In principle
 this can be done almost as a mechanical process. But for language model
 estimation the obvious approach is a bit limiting, as we'd like to cover
 cases where the data doesn't all fit in memory at once. (We're using a version
-of Patrick Nguyen's sorting trick, see the ``MSRLM'' report). Anyway, we have a
+of Patrick Nguyen's sorting trick, see his "MSRLM" report). Anyway, we have a
 solution to this, and the details are extremely tedious. It involves
 decomposing the discounting and probability-estimation processes into simple
 individual steps, each of which operates on a pipe (i.e. doesn't hold too much
 data in memory), and then working out the automatic differentiation operation
-for each individual pipe. We don't have to go in "backwards order" inside the
-individual operations, because operations like summing don't "care about" the
+for each individual pipe. We don't have to go in backwards order inside the
+individual operations, because operations like summing don't care about the
 order of operations.
 
 
 ## Pruning
 
 The language model pruning method is something we've done before in a different
-context, which is an improved, more-exact version of Stolcke pruning. It operates
-on the same principle, but manages to be more optimal because when it removes
-a probability from the LM it assigns the removed data to the backed-off-to-state
-and updates its probabilities accordingly. Of course, we take this change into
-account when selecting the n-grams to prune away.
+context, which is an improved, more-exact version of Stolcke pruning. It
+operates on the same principle as Stolcke pruning, but manages to be more
+optimal because when it removes a probability from the LM it assigns the removed
+data to the backed-off-to-state and updates its probabilities accordingly. Of
+course, we take this change into account when selecting the n-grams to prune
+away.
 
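On the remark in the second hunk that operations like summing don't care about the order of operations: for a pipeline step whose forward pass is just a sum, the reverse-mode rule copies the derivative of the output back to every input, so the backward pass can stream over the data in the same order the pipe delivers it. A minimal sketch with made-up function names (not pocolm code):

```python
def sum_forward(stream):
    # Forward pass of a "summing" pipe: consume a stream of values, emit the total.
    total = 0.0
    for x in stream:
        total += x
    return total


def sum_backward(stream, d_total):
    # Reverse-mode rule: d(total)/d(x_i) = 1 for every input, so each input's
    # derivative is just d_total, and the derivatives can be emitted in the
    # same (forward) order in which the data arrives on the pipe.
    for _ in stream:
        yield d_total


values = [0.5, 1.25, 2.0]
total = sum_forward(values)
grads = list(sum_backward(values, d_total=1.0))
print(total, grads)  # 3.75 [1.0, 1.0, 1.0]
```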