@@ -31,40 +31,41 @@ little sub-optimal, for a couple of reasons:
and then interpolate them, but this is clearly not optimal in the way it
interacts with backoff. Consider the case where the sources would get the
same weight-- you'd want to combine them and estimate the LM together, to
- get more optimal backoff. Let's call this the `` estimate-then-interpolate
- problem'' .
+ get more optimal backoff. Let's call this the "estimate-then-interpolate
+ problem".

- The standard pruning method (e.g. Stolcke entropy pruning) is non-optimal
because when removing probabilities it doesn't update the backed-off-to
state.

## Our solution

- As part of our solution to the `` estimate-then-interpolate problem'' , we
+ As part of our solution to the "estimate-then-interpolate problem", we
interpolate the data sources at the level of data-counts before we estimate
the LMs. Think of it as a weighted sum. At this point, anyone familiar with
- how language models are discounted will say `` wait a minute-- that can't work''
+ how language models are discounted will say "wait a minute-- that can't work"
because these techniques rely on integer counts. Our method is to treat a count as a
collection of small pieces of different sizes, where we only care about the total
(i.e. the sum of the sizes of the pieces), and the magnitude of the three largest
pieces. Recall that modified Kneser-Ney discounting only cares about the first
three data counts when deciding the amount to discount. We extend
- that discounting method to variable-size individual counts, as discounting proportions
- D1, D2 and D3 of the top-1, top-2 and top-3 largest individual pieces.
+ modified Kneser-Ney discounting to variable-size individual counts by discounting proportions
+ D1, D2 and D3 of the top-1, top-2 and top-3 largest individual counts in
+ the collection.
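To make this concrete, here is a minimal Python sketch of that discounting rule. It is a toy illustration rather than the code in this repository; the function name and the D1, D2, D3 values are made up for the example.

```python
# Toy sketch of discounting a "count collection": an n-gram's count is a set of
# weighted pieces; we track the total and the three largest pieces, and discount
# proportions D1, D2, D3 of the largest, second-largest and third-largest pieces.
# The D values below are made-up illustrative numbers.

def discount_count_collection(pieces, D1=0.8, D2=0.4, D3=0.2):
    """pieces: sizes of the individual pieces making up one n-gram's count,
    e.g. data-source weights times raw counts.  Returns the discounted total
    and the mass removed (which would go to the lower-order state)."""
    top3 = sorted(pieces, reverse=True)[:3]
    removed = sum(d * p for d, p in zip([D1, D2, D3], top3))
    total = sum(pieces)
    return total - removed, removed

# An n-gram seen once in a source with weight 0.7 and twice in a source with
# weight 0.4 contributes the pieces [0.7, 0.4, 0.4]:
print(discount_count_collection([0.7, 0.4, 0.4]))   # roughly (0.70, 0.80)
```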

- ## Computing the gradients
+ ## Computing the hyperparameters

The perceptive reader will notice that there is no very obvious way to estimate
- the discounting constants D1, D2 and D3 for each n-gram order, and no very
- obvious way to estiate the interpolation weights for the language models . Our
- solution to this is to estimate all these hyperparameters based on the
+ the discounting constants D1, D2 and D3 for each n-gram order, nor
+ to estimate the interpolation weights for the data. Our
+ solution to this is to estimate all these hyperparameters to maximize the
probability of dev data, using a standard numerical optimization technique
(L-BFGS). The obvious question is, "how do we estimate the gradients?". The simplest
- way to estimate the gradients is by the "difference method". But if there
+ way is the "difference method". But if there
are 20 hyperparameters, we expect L-BFGS to take at least, say, 10 iterations
to converge acceptably, and computing the derivative would require estimating
- the LM 20 times on each iteration-- in total we're estimating the language model
- 200 times, which seems like a bit too much. Our solution to this is to
+ the LM 20 times on each iteration-- so in total we're estimating the language model
+ 200 times, which seems like too much. Our solution to this is to
estimate the gradients directly using reverse-mode automatic differentiation
(of which neural-net backprop is a special case). This allows us to remove
the factor of 20 (or however many hyperparameters there are) in the gradient
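As a rough illustration of the cost argument, here is a toy sketch (not this project's code): `dev_logprob` is a dummy stand-in for "estimate the LM with these hyperparameters and score the dev data", and the counter shows what the difference method costs for a single gradient.

```python
# Toy illustration of why the "difference method" is expensive: one extra
# objective evaluation per hyperparameter, where each evaluation would mean
# re-estimating the whole language model.

import numpy as np

num_evals = 0

def dev_logprob(hyperparams):
    """Dummy stand-in for estimating the LM and returning the dev-data
    log-probability; here just a smooth made-up function."""
    global num_evals
    num_evals += 1
    return -np.sum((hyperparams - 0.5) ** 2)

def difference_method_grad(f, x, eps=1e-4):
    """Forward-difference gradient: one extra call to f per hyperparameter,
    on top of the baseline value f(x)."""
    f0 = f(x)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_step = x.copy()
        x_step[i] += eps
        grad[i] = (f(x_step) - f0) / eps
    return grad

x = np.full(20, 0.3)                  # pretend we have 20 hyperparameters
grad = difference_method_grad(dev_logprob, x)
print(num_evals)                      # 21 calls, i.e. 20 extra LM estimations
                                      # for one gradient; reverse-mode autodiff
                                      # gives the whole gradient for roughly the
                                      # cost of one extra pass.
```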
@@ -77,22 +78,23 @@ the intermediate quantities of the computation in memory at once. In principle
this can be done almost as a mechanical process. But for language model
estimation the obvious approach is a bit limiting, as we'd like to cover
cases where the data doesn't all fit in memory at once. (We're using a version
- of Patrick Nguyen's sorting trick, see the `` MSRLM'' report). Anyway, we have a
+ of Patrick Nguyen's sorting trick, see his "MSRLM" report). Anyway, we have a
solution to this, and the details are extremely tedious. It involves
decomposing the discounting and probability-estimation processes into simple
individual steps, each of which operates on a pipe (i.e. doesn't hold too much
data in memory), and then working out the automatic differentiation operation
- for each individual pipe. We don't have to go in " backwards order" inside the
- individual operations, because operations like summing don't " care about" the
+ for each individual pipe. We don't have to go in backwards order inside the
+ individual operations, because operations like summing don't care about the
order of operations.
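As a minimal, hypothetical illustration of that idea for a single summing operation (the real pipeline is far more involved, and these function names are invented for the example):

```python
# A streaming forward op and its reverse-mode counterpart.  Because the
# derivative of a sum with respect to each of its terms is 1, the backward op
# can stream the records in the *same* order as the forward op.

from collections import defaultdict

def forward_sum_counts(records):
    """records: iterable of (state, count).  Returns total count per state."""
    totals = defaultdict(float)
    for state, count in records:        # one sequential pass, little memory
        totals[state] += count
    return dict(totals)

def backward_sum_counts(records, d_totals):
    """Given d(objective)/d(total) for each state, yield d(objective)/d(count)
    for each input record, in forward order."""
    for state, _count in records:
        yield d_totals.get(state, 0.0)  # each term just inherits its sum's derivative

records = [("a b", 0.7), ("a b", 0.4), ("b c", 1.0)]
totals = forward_sum_counts(records)                       # per-state totals
d_records = list(backward_sum_counts(records, {"a b": 2.0, "b c": -1.0}))
print(totals, d_records)                                   # [2.0, 2.0, -1.0]
```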

## Pruning

The language model pruning method is something we've done before in a different
- context, which is an improved, more-exact version of Stolcke pruning. It operates
- on the same principle, but manages to be more optimal because when it removes
- a probability from the LM it assigns the removed data to the backed-off-to-state
- and updates its probabilities accordingly. Of course, we take this change into
- account when selecting the n-grams to prune away.
+ context, which is an improved, more-exact version of Stolcke pruning. It
+ operates on the same principle as Stolcke pruning, but manages to be more
+ optimal because when it removes a probability from the LM it assigns the removed
+ data to the backed-off-to state and updates its probabilities accordingly. Of
+ course, we take this change into account when selecting the n-grams to prune
+ away.
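A toy sketch of that difference, working with counts rather than probabilities and using made-up data structures (not the actual pruning code): pruning an n-gram hands its count to the backed-off-to state, whose probabilities would then be re-estimated from the updated counts.

```python
# Toy sketch: a "state" is a dict from word to count, keyed by its history.
# Pruning an n-gram moves its count into the backed-off-to state, so that
# state's distribution changes (unlike plain Stolcke pruning).

def prune_ngram(states, history, word):
    """Remove `word` from the state for `history` and reassign its count to
    the backed-off-to state (the history with its earliest word dropped)."""
    count = states[history].pop(word)
    backoff_history = history[1:]       # e.g. ('a', 'b') backs off to ('b',)
    states[backoff_history][word] = states[backoff_history].get(word, 0.0) + count
    return count

states = {
    ("a", "b"): {"c": 3.0, "d": 1.0},
    ("b",):     {"c": 5.0, "d": 2.0, "e": 1.0},
}
prune_ngram(states, ("a", "b"), "d")
print(states[("b",)])   # 'd' now has count 3.0: the backed-off-to state was updated
```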