FR: Umlauts support #95
Comments
Interesting idea - I can see the value. I will take a look at how In the meantime, as a workaround, you could define an alias, e.g.
which would, I think, give the effect you're looking for. |
Here is how you would do it in Java:

String normalize(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKD)
        .replaceAll("\\p{M}", "")
        .replaceAll("ß", "s")
        .replaceAll("ẞ", "s");
}

void main() {
    var tests = List.of("Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka");
    for (var t : tests) {
        println(String.format("%s -> %s", t, normalize(t)));
    }
}

and in Go:

package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
	tests := []string{"Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka"}
	for _, t := range tests {
		n, _ := normalize(t)
		fmt.Printf("%s -> %s\n", t, n)
	}
}

func normalize(s string) (string, error) {
	t := transform.Chain(norm.NFD,
		runes.Remove(runes.In(unicode.Mn)),
		runes.Map(func(r rune) rune {
			switch r {
			case 'ß':
				return 's'
			case 'ẞ':
				return 's'
			}
			return r
		}),
		norm.NFC)
	r, _, err := transform.String(t, s)
	if err != nil {
		return "", err
	}
	return r, nil
}

Their output:

Hallo -> Hallo
Übergrößengeschäft -> Ubergrosengeschaft
mañana es sábado -> manana es sabado
kir à l’aÿ -> kir a l’ay
kočka -> kocka

Note the special case for |
…t are not present in code (#95)
Thanks @sdavids for the pointers, they were very helpful. In this case however, I do need to handle it slightly differently, as @sschneider-ihre-pvs's request was for However, in languages other than German, simply removing the combining mark works fine.

To that end, I've been working on a solution that tests both and will match either the case with the combining mark removed, or specially handled cases for the German umlaut where it is replaced with an So with this change, if the code contained And in other languages, e.g. French,

I hope this suits your needs, @sschneider-ihre-pvs ?

Regarding the You can see a preview of the proposed documentation change to describe this feature here: https://docs.test.contextive.tech/community/v/c961ccc/guides/defining-terminology/#unicode-and-diacritics - feedback welcome!

The test cases are here: https://github.com/dev-cycles/contextive/blob/main/src/language-server/Contextive.LanguageServer.Tests/E2e/HoverTests.fs#L84 |
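The matching rule described in this comment (accept the term with combining marks removed, or with the German umlaut written as vowel plus "e", and treat ß as "ss") can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not Contextive's actual implementation: the class name `TermMatcher` and all method names are hypothetical, and the non-ASCII letters are written as Unicode escapes so the sketch does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Contextive.
public final class TermMatcher {

    // Canonical decomposition, then drop combining marks (o-umlaut becomes "o").
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Sharp s (U+00DF) and capital sharp s (U+1E9E) do not decompose, so map them explicitly.
    static String replaceSharpS(String s) {
        return s.replace("\u00df", "ss").replace("\u1e9e", "SS");
    }

    // German convention: an umlaut is written as the base vowel plus "e".
    static String transliterateUmlauts(String s) {
        return s.replace("\u00e4", "ae").replace("\u00f6", "oe").replace("\u00fc", "ue")
                .replace("\u00c4", "Ae").replace("\u00d6", "Oe").replace("\u00dc", "Ue");
    }

    // All lower-cased spellings of a glossary term that an identifier may use.
    static Set<String> candidates(String term) {
        // Compose first so that decomposed input (vowel + U+0308) is also recognised.
        String composed = Normalizer.normalize(term, Normalizer.Form.NFC);
        Set<String> out = new LinkedHashSet<>();
        out.add(composed.toLowerCase(Locale.ROOT));
        out.add(stripMarks(replaceSharpS(composed)).toLowerCase(Locale.ROOT));
        out.add(stripMarks(replaceSharpS(transliterateUmlauts(composed))).toLowerCase(Locale.ROOT));
        return out;
    }

    // Case-insensitive match of an identifier against any candidate spelling.
    static boolean matches(String term, String identifier) {
        String id = Normalizer.normalize(identifier, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
        return candidates(term).contains(id);
    }

    public static void main(String[] args) {
        System.out.println(matches("Ausl\u00f6ser", "Ausloeser"));  // umlaut written as "oe"
        System.out.println(matches("Ausl\u00f6ser", "Ausloser"));   // combining mark removed
        System.out.println(matches("Gr\u00f6\u00dfe", "GROESSE")); // sharp s written as "ss"
        System.out.println(matches("Ausl\u00f6ser", "Auslaeser"));  // unrelated spelling
    }
}
```

Note that the sharp s replacement happens before lower-casing, which mirrors the point made later in the thread that a case-insensitive comparison after normalisation does not take care of the capital form by itself.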
Note: You forgot the capital
|
A German test case: Noun "Größe" (size)

Größe - correct spelling

All 7 variants would be considered the same word by a German (most Germans do not know about the capital

It gets hairy: Noun "Masse" (mass)

with the old upper case spelling: Masse ⇒ MASSE
with the new upper case spelling: Masse ⇒ MASSE

Uppercase conversion with the old rules is irreversible in German: MASSE could have been derived from two distinct words, "Masse" or "Maße". In this case one cannot use

Other languages have similar quirks. I favor a Pareto solution instead of a 100% solution. |
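The irreversibility of uppercasing mentioned above can be observed directly in Java, whose `String.toUpperCase` follows the old rule and expands ß to "SS". A minimal sketch (the class name `UpperCaseDemo` is made up for illustration; the sharp s is written as the escape `\u00df`):

```java
import java.util.Locale;

// Hypothetical demo class.
public final class UpperCaseDemo {

    // Locale.ROOT keeps the result independent of the machine's default locale.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String mass = "Masse";            // "mass"
        String dimensions = "Ma\u00dfe";  // "dimensions", spelled with sharp s

        // Both words collapse to the same upper-case string.
        System.out.println(upper(mass));
        System.out.println(upper(dimensions));
        System.out.println(upper(mass).equals(upper(dimensions)));

        // Lower-casing cannot tell which word it came from.
        System.out.println("MASSE".toLowerCase(Locale.ROOT));
    }
}
```

Once both words have been upper-cased, nothing in the string itself says whether "MASSE" should map back to "Masse" or "Maße".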
German orthography has changed quite a bit in recent years; especially with the Reform der deutschen Rechtschreibung 1996—
See above. A German would know what People with German as a second language might also write it that way because they are not familiar with the intricacies of Some use |
…n represented as 'ss' in code (#95)
Thanks @sdavids - I initially excluded it because we do an IgnoreCase comparison so thought it wouldn't matter, but your comment helped me realise that the IgnoreCase comparison is after the normalisation, so it does need doing explicitly. Added a test case and support for this now. |
With the new implementation, having experimented a bit, this is how it works over a few scenarios, hopefully it makes sense:
|
In a German software project…

Decide on the language used in source code/config files
Another consideration is the programming language one is using. Some do not support non-ASCII identifiers (Ruby) or did not in the past (Rust, Python); in that case one has to decide on how to handle
Decide on the language used in acceptance/BDD tests
Decide on the language used in documentation
Decide on the language used in stakeholder communication
In my experience the most common combination is:
I have heard of projects where everything is in German (legacy systems, banking, and insurance).

In a DDD context one might also have a ubiquitous language document and a canonical translation document:

German synonyms ⇒ a single ubiquitous term ⇒ canonical translation
"Auto", "Karre", "Wagen" ⇒ "Kraftfahrzeug" ⇒ "motor vehicle" |
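The two documents described here amount to two lookups chained together. A minimal sketch, assuming a plain in-memory representation (the class `Glossary`, its field names, and the `translate` method are invented for illustration and do not come from any tool mentioned in this thread):

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory form of the two documents.
public final class Glossary {

    // Ubiquitous language document: every synonym points at one agreed term.
    static final Map<String, String> SYNONYM_TO_TERM = Map.of(
            "Auto", "Kraftfahrzeug",
            "Karre", "Kraftfahrzeug",
            "Wagen", "Kraftfahrzeug");

    // Canonical translation document: agreed term to its English translation.
    static final Map<String, String> TERM_TO_TRANSLATION = Map.of(
            "Kraftfahrzeug", "motor vehicle");

    // Resolve a word via its ubiquitous term; empty if the glossary does not know it.
    static Optional<String> translate(String word) {
        String term = SYNONYM_TO_TERM.getOrDefault(word, word);
        return Optional.ofNullable(TERM_TO_TRANSLATION.get(term));
    }

    public static void main(String[] args) {
        System.out.println(translate("Karre"));         // resolved via the ubiquitous term
        System.out.println(translate("Kraftfahrzeug")); // already the ubiquitous term
        System.out.println(translate("Fahrrad"));       // not in the glossary
    }
}
```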
Thanks @sdavids for your thorough analysis. You might also find the discussion here interesting - #88 This ticket is primarily just about the appropriate handling of unicode text for likely expected mismatches between glossary terminology and code. That discussion explores a proposal for a more thorough handling of multi-language projects and could help with the scenarios you explore in this comment. |
Example

Let’s say we are in the box storing domain… We might come up with

{
  "id": 1,
  "mass": 5.5,
  "dimensions": "1x3x5"
}

public record Box(int id, double mass, String dimensions) {}

class Box
  attr_reader :id, :mass, :dimensions
  def initialize(id, mass, dimensions)
    @id = id
    @mass = mass
    @dimensions = dimensions
  end
end

Note: This is a really bad

This is how it would translate to “everything in German”:
{
  "Nummer": 1,
  "Masse": 5.5,
  "Maße": "1x3x5"
}

public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String maße) {}

Does not work (illegal Ruby identifier):

class Kiste
  def initialize(nummer, masse, maße)
    @nummer = nummer
    @masse = masse
    @maße = maße
  end
end

Does not work either (duplicate identifier):

class Kiste
  def initialize(nummer, masse, masse)
    @nummer = nummer
    @masse = masse
    @masse = masse
  end
end

Reach for a synonym, e.g.:

class Kiste
  include ActiveModel::Serializers::JSON
  attr_reader :nummer, :masse, :format
  def initialize(nummer, masse, format)
    @nummer = nummer
    @masse = masse
    @format = format
  end
  def as_json(options = {})
    h = super(options)
    h.store('Nummer', h.delete(:nummer))
    h.store('Masse', h.delete(:masse))
    h.store('Maße', h.delete(:format))
    h # return the hash itself, not the value of the last store call
  end
end

Most developers would reach for a synonym in the Java case as well, because using German special letters breaks easily when some team members use Windows and some use macOS/Linux with JDK < 18:

public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String format) {} |
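The JDK < 18 remark refers to the platform-dependent default charset: before JDK 18 (JEP 400) the default depended on the operating system and locale, so the same UTF-8 source or data file could be decoded differently on Windows and on macOS/Linux. The sketch below imitates that mismatch by decoding UTF-8 bytes as ISO-8859-1; `EncodingDemo` and `decodeAs` are illustrative names only, and ISO-8859-1 merely stands in for whatever legacy default a given machine might use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
public final class EncodingDemo {

    // Decode bytes with an explicitly chosen charset instead of the platform default.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "Ma\u00dfe"; // 4 characters
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // sharp s takes 2 bytes

        String correct = decodeAs(utf8, StandardCharsets.UTF_8);
        String garbled = decodeAs(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(correct.equals(original)); // round trip with the same charset
        System.out.println(garbled.equals(original)); // mismatch changes the text
        System.out.println(garbled.length());         // each byte became its own character
    }
}
```

Passing the charset explicitly, as `decodeAs` does, or compiling with `-encoding UTF-8`, avoids relying on the platform default.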
🎉 This issue has been resolved in version 1.17.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Closing this issue now as the original intent is satisfied. Further conversation about multi-language support to take place on discussion #88 . |
Usually in programming you do not use umlauts in code, but in the descriptions of terms they might be appropriate.
So in code you could have something like
Ausloeser
but, similar to the plural detection, this should reference
Auslöser
in the glossary.