Replies: 8 comments 15 replies
-
If I recall the previous discussion correctly, we observed that many of the use cases that potentially suffer from variant explosion can be simplified by using unit formatting, relative time formatting, or duration formatting. In the example that you shared, I can see two placeholders that could be handled by duration formatting:
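As a rough illustration of the unit-formatting idea (a sketch, not the snippet from the original comment), the standard `Intl.NumberFormat` API can pick the plural form of a unit by itself, so that placeholder no longer needs a plural selector in the message. The helper name `formatUnit` and the sample values are assumptions made for this example.

```typescript
// Sketch: let the number formatter choose the plural form of the unit,
// so the message needs no plural selector for this placeholder.
function formatUnit(locale: string, value: number, unit: string): string {
  return new Intl.NumberFormat(locale, {
    style: "unit",
    unit, // a sanctioned unit identifier such as "hour" or "month"
    unitDisplay: "long", // spell the unit out: "hours" rather than "hr"
  }).format(value);
}

// Hypothetical values, chosen only for illustration.
const hours = formatUnit("en", 1, "hour");
const months = formatUnit("en", 3, "month");
console.log(`${hours} is achievable in just over ${months}`);
```

Every placeholder handled this way drops out of the selector list, which divides the number of variants by the number of plural categories in the target language.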
-
Another way to approach this is to make use of custom functions to build localized glossaries of frequently occurring terms. I tried to demo this solution in my experiments/stasm/third branch. The glossary approach requires a localized collection of terms, complete with their grammatical forms corresponding to the natural language and the use case, and a custom function which can access these terms. In my experiments I used JSON to store the terms (glossary_en.json, glossary_pl.json). I tend to think that such glossaries are good candidates for being localized via a spreadsheet. They don't require any advanced features; they're just tabular data of translations. Then, a custom
I would use this approach only in extreme cases where variant explosion is actively hurting the translators' experience (i.e. hundreds of variants or more). The drawback of this approach is that it can make translations too computational (a glossary term for "people", a glossary term for "record", a glossary term for "day", etc.), which results in something hard to reason about. (The primary use case of this approach was to enable parameterizable dynamic terms, i.e. use a runtime-supplied
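A minimal sketch of what such a glossary lookup could look like, assuming plural forms only (grammatical case and gender are left out). The data shape, the `glossaries` object and the `term` function are hypothetical and are not taken from the experiments/stasm/third branch.

```typescript
// Plural forms of one term; "other" is always required as the fallback.
type PluralForms = Partial<Record<Intl.LDMLPluralRule, string>> & { other: string };
type Glossary = Record<string, PluralForms>;

// Hypothetical glossary data; in practice this could be loaded from
// per-locale JSON files exported from a spreadsheet.
const glossaries: Record<string, Glossary> = {
  en: {
    day: { one: "day", other: "days" },
    person: { one: "person", other: "people" },
  },
  pl: {
    day: { one: "dzień", few: "dni", many: "dni", other: "dnia" },
  },
};

// Custom function: look up a term and pick the form matching the count.
function term(locale: string, name: string, count: number): string {
  const forms = glossaries[locale]?.[name];
  if (!forms) throw new Error(`No glossary entry for "${name}" in ${locale}`);
  const category = new Intl.PluralRules(locale).select(count);
  return forms[category] ?? forms.other;
}

console.log(`5 ${term("en", "day", 5)}`);
console.log(`5 ${term("pl", "day", 5)}`);
```

The message itself then carries no plural variants for these nouns, which is exactly what makes it "computational": the translator has to trust that the glossary form fits the surrounding sentence.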
-
I agree that in almost all cases we want to avoid concatenation, but I think that in such an extreme example we shouldn't reject it outright. My position on "flattening" the selectors has, from the beginning, been that we should be open to adding non-flattened selectors later if we accumulate pathological cases that can't be handled by the flattened approach. This is one of them.
Stas's idea of using some sort of glossary may work here, but my concern, based solely on how it worked at Mozilla, is that such a scenario may introduce significant bureaucratic overhead. Introducing terms that are often reused is reasonable (brand names etc.), but introducing words like "clip" and "person" may be less welcome, as the glossary would grow into a dictionary of nouns that had to be pluralized at least once in any more complex sentence. Adding each such noun to a glossary shared across all contexts is likely unnecessary, and building multiple glossaries, each for a group of messages that require a given noun, seems to add little value over just adding them as messages referenced from the main message, each with its own selector.
So, at the risk of a strong reaction from this group, I suggest this should be solved with the use of message references:
If later the term |
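To make the message-reference idea concrete, here is a small sketch (not the syntax proposed in this comment, which did not survive in the thread). It assumes English only, and the message ids, wording and the `format` resolver are all hypothetical.

```typescript
type Args = Record<string, number>;
// A message receives the arguments and a resolver for referenced messages.
type Message = (args: Args, ref: (id: string) => string) => string;

// English-only plural choice, kept local to each referenced message.
const plural = (n: number, forms: { one: string; other: string }): string =>
  new Intl.PluralRules("en").select(n) === "one" ? forms.one : forms.other;

const messages: Record<string, Message> = {
  // Each referenced message carries its own selector ...
  "clip-count": ({ clips }) =>
    `${clips} ${plural(clips, { one: "clip", other: "clips" })}`,
  "person-count": ({ people }) =>
    `${people} ${plural(people, { one: "person", other: "people" })}`,
  // ... and the main message only references them.
  summary: (_args, ref) => `${ref("clip-count")} recorded by ${ref("person-count")}`,
};

function format(id: string, args: Args): string {
  const msg = messages[id];
  if (!msg) throw new Error(`Unknown message: ${id}`);
  return msg(args, (refId) => format(refId, args));
}

console.log(format("summary", { clips: 1, people: 12 }));
```

With two plural categories this stores 2 + 2 variants instead of 2 × 2; the saving grows with the number of selectors and categories, at the cost of the agreement problems that concatenation brings.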
-
Within Google we dealt with the combinatorics by disallowing more than one
plural. I think we've only hit problem cases with 2 placeholders, never
more. And there are very few cases that would need even 2.
The downside is that it forces the developer to break apart
messages that would need more, and to alert translators (a) that the 2
messages will be concatenated with a third pattern, and (b) that they should
avoid inflectable references from one sentence to another. It is somewhat
awkward, but because it arises so seldom, we didn't want to over-engineer a
solution.
So the problem cases were more like the following:
{ NUMBER($totalHours) ->
    [one] { $totalHours } hour
   *[other] { $totalHours } hours
} is achievable in just over { NUMBER($periodMonths) ->
    [one] { $periodMonths } month
   *[other] { $periodMonths } months
}
The developer would fix it by using two different messages concatenated,
and trying to avoid cross references. Example:
Message 1
The goal of { NUMBER($totalHours) ->
    [one] { $totalHours } hour
   *[other] { $totalHours } hours
} is achievable.
Message 2
That goal takes just over { NUMBER($periodMonths) ->
    [one] { $periodMonths } month
   *[other] { $periodMonths } months
}
Message 3 (a sentence concatenation). Typically a space, but
localizable for the language in question.
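A sketch of how an application might assemble the three messages at runtime, assuming English only. The function names are hypothetical; the point is that each message contains a single plural and that the join is itself a localizable pattern.

```typescript
const rules = new Intl.PluralRules("en");

// Message 1: one plural, on the number of hours.
function message1(totalHours: number): string {
  const unit = rules.select(totalHours) === "one" ? "hour" : "hours";
  return `The goal of ${totalHours} ${unit} is achievable.`;
}

// Message 2: one plural, on the number of months.
function message2(periodMonths: number): string {
  const unit = rules.select(periodMonths) === "one" ? "month" : "months";
  return `That goal takes just over ${periodMonths} ${unit}`;
}

// Message 3: the sentence concatenation. A space in English, but kept
// as its own pattern so a translator can change it.
function message3(first: string, second: string): string {
  return `${first} ${second}`;
}

console.log(message3(message1(1), message2(3)));
```

Each message now needs only as many variants as the language has plural categories, instead of the product over both placeholders.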
On Mon, Jan 16, 2023 at 3:46 PM Martin Dürst ***@***.***>
wrote:
… An extremely small correction below.
On 2023-01-16 19:15, Eemeli Aro wrote:
> tl;dr We ought to figure out how to do plurals when top-level selectors
don't work.
> The same message represented in MF2 requires 16 variants, which is
cumbersome, but perhaps at the limit of what's bearable:
> However, when translated to e.g. Welsh or another language that uses all
6 plural categories, this single message explodes to 4^6 = 1296 variants,
which breaks Pontoon.
This should be 6^4 = 1296. 4^6 is 4*4*4*4*4*4 = 4096.
Regards, Martin.
-
The purpose of my message is not to say that we should limit MF2.0 to
only one plural. (And in that particular case, Google was just imposing a
limitation on the use of plurals in messages its translation tooling would
take, not trying to impose a limitation on MF1.0). So I fully agree that "I
think we should support (until we can't) arbitrary replacements/selectors
or at least a reasonably large number (5 or 10) if we feel the need for a
limit."
The purpose of my message is to note that any big re-architecture is out of
scope for MF2.0, especially since there are known workarounds for the
(relatively rare) combinatorics problem. We should be refining the spec,
catching ambiguities, and so on — not expanding the scope.
For MF2.0, I want to caution people that we have to be very careful about
conceptual burdens placed on developers and translators. Those can be far
harder to deal with than simply more lines to translate. For example, I
anticipate that unless done very carefully, "submessages" are doomed to
failure. It is difficult enough to alert developers to the fact that a message
like "Welcome", or "Insert the disk" — with no placeholders — needs gender
selection; we cannot expect developers to understand the intricacies of
inflection agreement among parts of a message across a hundred languages.
…On Tue, Jan 17, 2023 at 8:00 AM Zibi Braniecki ***@***.***> wrote:
The real kicker, I think, is that rich messages such as the one Eemeli
quotes, tend to have high visibility
100% agree. This observation is important for adjusting our sensitivity to
importance of various features in MF2.0 - using share popularity is not a
sufficient indicator of importance because the "visibility" and
"significance" of complex messages is usually disproportionately high.
-
Added to the 2023-01-23 agenda per @stasm's request
-
What I said was "unless done very carefully".
My concern is the concatenation issue. Say developer X is looking at a
message that has number and gender. I'll just show one of those.
[one female] {number...} books can be delivered ...
X thinks, huh, I can pull out {number...} books into a submessage (with
its different count forms). Effectively, that submessage will need to be
concatenated with the rest of that message. But it turns out that in
language X, the verb needs to be inflected for the number and gender of the
word "books". And even worse, it turns out that the ... part here has an
effect on the 'definiteness' inflection of book.
So when a submessage is extracted, one has to have pretty sophisticated
mechanisms for declaring inflection (and even wording) dependencies, and
resolving all of those dependencies. It will need a lot of work to make
that something that developers and translators and a translation pipeline
can handle.
…On Tue, Jan 17, 2023 at 11:46 AM Zibi Braniecki ***@***.***> wrote:
Thank you Mark for this comment. I agree with your concerns and appreciate
your support in keeping MF2.0 scope bound.
I do, however, disagree with your conclusion:
For example, I anticipate that unless done very carefully, "submessages"
are doomed to failure.
Are you implying that, if we enable some form of submessages they will
have no chance to become useful in the context we described? Or just that
you do not anticipate them being used by existing large systems
such as Google?
I agree that the latter is likely, but I am much more optimistic than you
about the potential of the innovation we can enable if the MF2.0 system
allows for submessages. I think there is massive untapped potential
in the JS ecosystem for building innovative, sophisticated tools and I have
hope that MF2.0, implemented in ECMA-402 will serve as an enabler for that
potential.
While the route to mature large organizations expanding their CAT and infra
tooling to enable submessages is many years long, I wish to think of
MF2.0 as a foundational technology that serves that cohort *as well* as
the wild west of the web community. I suspect that the latter will be much
quicker in exploring, experimenting and prototyping highly advanced
features that will utilize such features and I am concerned about capping
MF2.0 capabilities based on historical experience of who develops
L10n tools and how.
-
tl;dr We ought to figure out how to do plurals when top-level selectors don't work.
As a step towards MF2 support, I've started work on migrating the internal data model used by Pontoon to match the MF2 shape, which has led to this issue: mozilla/pontoon#2703.
One of the projects localized via Pontoon is Common Voice, which at one point includes this UI:

That message uses this Fluent source:
The same message represented in MF2 requires 16 variants, which is cumbersome, but perhaps at the limit of what's bearable:
However, when translated to e.g. Welsh or another language that uses all 6 plural categories, this single message explodes to 6^4 = 1296 variants, which breaks Pontoon.
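The arithmetic can be checked with a short sketch. It assumes the message has four plural selectors (which is what 6^4 implies) and that every combination of categories needs its own variant; `variantCount` is a name made up for this example.

```typescript
// Worst-case number of variants when every plural selector is lifted to
// the top level: (plural categories of the locale) ^ (number of selectors).
function variantCount(locale: string, selectors: number): number {
  const categories = new Intl.PluralRules(locale).resolvedOptions().pluralCategories;
  return categories.length ** selectors;
}

console.log(variantCount("en", 4)); // English: one, other
console.log(variantCount("cy", 4)); // Welsh: zero, one, two, few, many, other
```

For English this gives 2^4 = 16 and for Welsh 6^4 = 1296, matching the counts above.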
I think we should have at least some idea how to deal with messages like this in MF2, where lifting the selectors to the top level is not practical. Perhaps implementations should be guided towards allowing a `:plural` "formatter" in excessive cases like this? Something like: