FR: Umlauts support #95

Closed
sschneider-ihre-pvs opened this issue Mar 27, 2025 · 13 comments
@sschneider-ihre-pvs

Usually in programming you do not use umlauts in code, but in the descriptions of terms they might be appropriate.

So in code you could have something like Ausloeser but, similar to the plural detection, this should refer to Auslöser in the glossary.

@chrissimon-au
Contributor

Interesting idea - I can see the value. I will take a look at how oe could be considered equivalent to ö for the purposes of displaying a hover.

In the meantime, as a workaround, you could define an alias, e.g.

   - name: Auslöser
     aliases:
       - Ausloeser

which would, I think, give the effect you're looking for.

@sdavids

sdavids commented Apr 10, 2025

Here is how you would do it in Java:

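import java.text.Normalizer;
import java.util.List;

// Decompose, drop the combining marks, then map the sharp s letters,
// which are letters in their own right and do not decompose.
String normalize(String s) {
  return Normalizer.normalize(s, Normalizer.Form.NFKD)
      .replaceAll("\\p{M}", "")
      .replace("ß", "s")
      .replace("ẞ", "s");
}

void main() {
  var tests = List.of("Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka");
  for (var t : tests) {
    System.out.printf("%s -> %s%n", t, normalize(t));
  }
}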

and in Go:

package main

import (
	"fmt"
	"unicode"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
	tests := []string{"Hallo", "Übergrößengeschäft", "mañana es sábado", "kir à l’aÿ", "kočka"}
	for _, t := range tests {
		n, _ := normalize(t)
		fmt.Printf("%s -> %s\n", t, n)
	}
}

func normalize(s string) (string, error) {
	t := transform.Chain(norm.NFD,
		runes.Remove(runes.In(unicode.Mn)),
		runes.Map(func(r rune) rune {
			switch r {
			case 'ß':
				return 's'
			case 'ẞ':
				return 's'
			}
			return r
		}),
		norm.NFC)
	r, _, err := transform.String(t, s)
	if err != nil {
		return "", err
	}
	return r, nil
}

Their output:

Hallo -> Hallo
Übergrößengeschäft -> Ubergrosengeschaft
mañana es sábado -> manana es sabado
kir à l’aÿ -> kir a l’ay
kočka -> kocka

Note the special case for ß and ẞ: they are proper letters rather than letters with diacritical marks.

chrissimon-au added a commit that referenced this issue May 17, 2025
@chrissimon-au
Contributor

Thanks @sdavids for the pointers, they were very helpful. In this case, however, I do need to handle it slightly differently, as @sschneider-ihre-pvs's request was for Ausloeser in code to match Auslöser as a defined term. The scripts above remove the combining mark, resulting in Ausloser.

However, in languages other than German, simply removing the combining mark works fine.

To that end, I've been working on a solution that tests both and will match either the case with the combining mark removed, or specially handled cases for the German umlaut where it is replaced with an e, e.g. ö becoming oe. (The Contextive philosophy is generally to match more loosely than strictly.)

So with this change, if the code contained ausloser, ausloeser, OR auslöser they would all show the definitions of a term defined with Auslöser.

And in other languages, e.g. French, pere OR père would match a term defined as Père

I hope this suits your needs, @sschneider-ihre-pvs ?

Regarding the ß character, I understand it is commonly replaced with ss. Is that your experience? To that end it's now only matching with ss, e.g. strasse or straße would match a term defined as Straße, but strase would not. Is that acceptable?
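The matching rules described in this comment can be sketched in Java roughly as follows. This is a hypothetical illustration, not Contextive's actual code (its source files are F#): the class `LooseTermMatch` and its methods are invented names, and the sketch assumes the token from code is compared as-is, ignoring case, against a small set of spellings derived from the glossary term. String literals use Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

final class LooseTermMatch {
  // Spellings under which a glossary term is accepted in code.
  static Set<String> candidates(String term) {
    String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
    Set<String> out = new LinkedHashSet<>();
    // 1. The term exactly as defined.
    out.add(Normalizer.normalize(term, Normalizer.Form.NFC));
    // 2. All combining marks removed (o-umlaut becomes o, e-grave becomes e).
    out.add(sharpS(decomposed.replaceAll("\\p{M}", "")));
    // 3. Combining diaeresis U+0308 replaced with e (o-umlaut becomes oe),
    //    any other combining marks removed.
    out.add(sharpS(decomposed.replace("\u0308", "e").replaceAll("\\p{M}", "")));
    return out;
  }

  // Sharp s (lower and upper case) is only ever replaced with ss, never a single s.
  static String sharpS(String s) {
    return s.replace("\u00DF", "ss").replace("\u1E9E", "ss");
  }

  static boolean matches(String term, String token) {
    String t = Normalizer.normalize(token, Normalizer.Form.NFC);
    return candidates(term).stream().anyMatch(c -> c.equalsIgnoreCase(t));
  }

  public static void main(String[] args) {
    String term = "Ausl\u00F6ser";
    for (String token : new String[] {"ausloser", "ausloeser", "ausl\u00F6ser"}) {
      System.out.println(token + " -> " + matches(term, token));
    }
  }
}
```

Under these assumptions ausloser, ausloeser and auslöser all match Auslöser, pere matches Père, and strasse matches Straße while strase does not.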

You can see a preview of the proposed documentation change to describe this feature here: https://docs.test.contextive.tech/community/v/c961ccc/guides/defining-terminology/#unicode-and-diacritics - feedback welcome!

The test cases are here: https://github.com/dev-cycles/contextive/blob/main/src/language-server/Contextive.LanguageServer.Tests/E2e/HoverTests.fs#L84

@sdavids

sdavids commented May 17, 2025

Note: You forgot the capital ẞ:

s.Replace("\u0308", "e").Replace("ß", "ss")

@sdavids

sdavids commented May 17, 2025

A German test case:

Noun "Größe" (size)

Größe - correct spelling
Groesse - someone typing on a keyboard w/out German letters or someone being lazy
groesse - someone typing on a keyboard w/out German letters and not caring about capitalization or someone being lazy
größe - lower case or not caring about capitalization
GRÖSSE - old upper case spelling
GRÖẞE - new (since 2017) upper case spelling with ẞ
GRÖßE - incorrect spelling with lower case ß

All 7 variants would be considered the same word by a German (most Germans do not know about the capital ẞ though 😂).
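As a rough illustration of that test case, here is a minimal Java sketch of a fold that maps all seven spellings to one comparison key. The class `GroesseFold` and its `fold` method are hypothetical, and this symmetric fold is just one possible approach, not what Contextive does. The literals use Unicode escapes to stay independent of the source encoding.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Locale;

final class GroesseFold {
  // Fold a word to a comparison key: umlaut -> base letter + e, sharp s -> ss, lower case.
  static String fold(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD)
        .replace("\u0308", "e")   // combining diaeresis left behind by NFD
        .replace("\u00DF", "ss")  // lower case sharp s
        .replace("\u1E9E", "ss")  // capital sharp s
        .toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    List<String> variants = List.of(
        "Gr\u00F6\u00DFe",   // correct spelling
        "Groesse",
        "groesse",
        "gr\u00F6\u00DFe",   // lower case
        "GR\u00D6SSE",       // old upper case spelling
        "GR\u00D6\u1E9EE",   // new upper case spelling
        "GR\u00D6\u00DFE");  // upper case with lower case sharp s
    for (String v : variants) {
      System.out.println(v + " -> " + fold(v));
    }
  }
}
```

Every variant in the list folds to the key groesse.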


It gets hairy:

Noun "Masse" (mass)
Noun "Maße" (dimensions)

with the old upper case spelling:

Masse ⇒ MASSE
Maße ⇒ MASSE

with the new upper case spelling:

Masse ⇒ MASSE
Maße ⇒ MAẞE

Uppercase conversion with the old rules is irreversible in German—MASSE could have been derived from two distinct words "Masse" or "Maße".

In this case one cannot use ss and ß interchangeably because that would change the meaning.
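The collision can be reproduced with Java's standard upper-casing, which follows the old rule and turns ß into SS. A small sketch (the class name `UpperCaseCollision` is made up for illustration):

```java
import java.util.Locale;

final class UpperCaseCollision {
  static String upper(String s) {
    // String.toUpperCase maps the lower case sharp s to the two letters SS.
    return s.toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String mass = "Masse";
    String dimensions = "Ma\u00DFe"; // written with sharp s
    System.out.println(upper(mass));
    System.out.println(upper(dimensions));
    // Both upper case forms are identical, so the original word cannot be recovered.
    System.out.println(upper(mass).equals(upper(dimensions)));
  }
}
```

Lower-casing MASSE again yields masse in both cases, so the distinction between the two nouns is lost.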


Other languages have similar quirks.

I favor a Pareto solution instead of a 100% solution.

@sdavids

sdavids commented May 17, 2025

> Regarding the ß character, I understand it is commonly replaced with ss.

German orthography has changed quite a bit in recent years, especially with the Reform der deutschen Rechtschreibung 1996 (the German orthography reform of 1996); ss vs. ß was an important (and at that time contentious) part of it.

> To that end it's now only matching with ss, e.g. strasse or straße would match a term defined as Straße, but strase would not. Is that acceptable?

See above.


A German would know what strase means though.

People with German as a second language might also write it that way because they are not familiar with the intricacies of ss vs. ß yet.

Some use s, ss, and ß interchangeably in colloquial text—to the horror of German Grammatiknazis.

chrissimon-au added a commit that referenced this issue May 17, 2025
@chrissimon-au
Contributor

> Note: You forgot the capital ẞ:

contextive/src/core/Contextive.Core/GlossaryFile.fs

Line 39 in c961ccc

s.Replace("\u0308", "e").Replace("ß", "ss")

Thanks @sdavids - I initially excluded it because we do an IgnoreCase comparison so thought it wouldn't matter, but your comment helped me realise that the IgnoreCase comparison is after the normalisation, so it does need doing explicitly. Added a test case and support for this now.
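Why the order matters can be shown with a small Java sketch. It is a hypothetical analogue of the F# line quoted above, not the actual code: `lowerOnly` and `bothCases` are invented names, and Java's `equalsIgnoreCase` stands in for the IgnoreCase comparison.

```java
final class IgnoreCaseOrder {
  // Normalisation that only knows the lower case sharp s.
  static String lowerOnly(String s) {
    return s.replace("\u00DF", "ss");
  }

  // Normalisation that also replaces the capital sharp s (U+1E9E).
  static String bothCases(String s) {
    return lowerOnly(s).replace("\u1E9E", "ss");
  }

  public static void main(String[] args) {
    String term = "STRA\u1E9EE"; // upper case term written with capital sharp s
    // The capital letter survives lowerOnly, so the later case-insensitive
    // comparison sees six characters on one side and seven on the other.
    System.out.println(lowerOnly(term).equalsIgnoreCase("strasse"));
    // With the explicit replacement both sides have the same letters.
    System.out.println(bothCases(term).equalsIgnoreCase("strasse"));
  }
}
```

The first comparison fails and the second succeeds, because ignoring case cannot turn one letter into two.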

@chrissimon-au
Contributor

Having experimented a bit with the new implementation, this is how it works across a few scenarios; hopefully it makes sense:

ß in the glossary file

If the glossary file contains:

contexts:
  - terms:
      - name: Masse
        definition: Mass (English)
      - name: Maße
        definition: dimensions (English)

And the code contains masse then the following hover appears, showing both options, as we can't disambiguate:

Image

If the code contains maße then the following hover appears, showing only the Maße option:

Image

SS in the glossary file

If the glossary file contains:

contexts:
  - terms:
      - name: MASSE
        definition: Mass (English)
      - name: MASSE
        definition: dimensions (English)

And the code contains masse then again, both definitions are shown:

Image

If the code contains maße then nothing is shown. We don't reverse SS from the glossary file to match ß or ẞ in the code. From everything above, it seems unlikely that this would happen: if ß or ẞ is in the code, then it would also be used in the definitions file.

ẞ in the glossary file

If the glossary file contains:

contexts:
  - terms:
      - name: MASSE
        definition: Mass (English)
      - name: MAẞE
        definition: dimensions (English)

And the code contains masse then both options are shown:

Image

If the code contains maße then only MAẞE is shown:

Image
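The three scenarios above are consistent with a one-directional rule: a glossary name matches a token if either the name itself or the name with ß/ẞ replaced by ss equals the token, ignoring case. A minimal Java sketch of that reading follows. `GlossaryLookup`, `Term` and `hover` are hypothetical names, and the real implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class GlossaryLookup {
  record Term(String name, String definition) {}

  // Replace lower case and capital sharp s with ss.
  static String replaceSharpS(String s) {
    return s.replace("\u00DF", "ss").replace("\u1E9E", "ss");
  }

  // Definitions of every term whose name, or its ss form, equals the token ignoring case.
  // The token itself is never rewritten, so ss in the glossary is not turned back into sharp s.
  static List<String> hover(List<Term> glossary, String token) {
    List<String> definitions = new ArrayList<>();
    for (Term term : glossary) {
      if (term.name().equalsIgnoreCase(token)
          || replaceSharpS(term.name()).equalsIgnoreCase(token)) {
        definitions.add(term.definition());
      }
    }
    return definitions;
  }

  public static void main(String[] args) {
    List<Term> glossary = List.of(
        new Term("Masse", "Mass (English)"),
        new Term("Ma\u00DFe", "dimensions (English)"));
    System.out.println(hover(glossary, "masse"));      // both definitions
    System.out.println(hover(glossary, "ma\u00DFe"));  // only the sharp s term
  }
}
```

With a glossary that only contains MASSE, a token written with ß returns an empty list, matching the second scenario.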

Summary

Does this follow the most common expectations in your experience?

@sdavids

sdavids commented May 18, 2025

In a German software project…

Decide on the language used in source code/config files

  1. everything in German
  • “Autofabrik”, “Personenlager” - correct spelling
  • “AutoFabrik”, “PersonenLager” - incorrect spelling but better IDE and find/replace DX
  • "FabrikAuto", "LagerPersonen" - incorrect spelling but better sort DX
  2. everything in German with Anglicisms
  • “Autofactory”, “Personenrepository” - correct spelling but feels weird to a German
  • “AutoFactory”, “PersonenRepository” - incorrect spelling but better IDE and find/replace DX
  • “FactoryAuto”, “RepositoryPersonen” - incorrect spelling but better sort DX
  3. everything in English, which keeps the project open to non-German-speaking team members and a possible company/project merger in the future
  • "CarFactory", "PersonRepository"
  • "FactoryCar", "RepositoryPerson" - better sort DX

Another consideration is the programming language one is using.

Some do not support non-ASCII identifiers (Ruby) or did not in the past (Rust, Python). In that case one has to decide how to handle äöüßÄÖÜẞ: äöüßÄÖÜẞ ⇒ ae, oe, ue, ss, Ae, Oe, Ue, SS, or äöüßÄÖÜẞ ⇒ a, o, u, s, A, O, U, S, or go with the “everything in English” option for that programming language only.
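The two transliteration options can be sketched as a small lookup table. This is only an illustration: the class `AsciiIdentifier` and the scheme names are hypothetical, and the input is assumed to use precomposed (NFC) letters.

```java
final class AsciiIdentifier {
  enum Scheme { DIGRAPH, BASE_LETTER }

  // Columns: special letter, DIGRAPH replacement, BASE_LETTER replacement.
  private static final String[][] TABLE = {
    {"\u00E4", "ae", "a"}, // a-umlaut
    {"\u00F6", "oe", "o"}, // o-umlaut
    {"\u00FC", "ue", "u"}, // u-umlaut
    {"\u00DF", "ss", "s"}, // sharp s
    {"\u00C4", "Ae", "A"}, // capital a-umlaut
    {"\u00D6", "Oe", "O"}, // capital o-umlaut
    {"\u00DC", "Ue", "U"}, // capital u-umlaut
    {"\u1E9E", "SS", "S"}, // capital sharp s
  };

  static String transliterate(String identifier, Scheme scheme) {
    String out = identifier;
    for (String[] row : TABLE) {
      out = out.replace(row[0], scheme == Scheme.DIGRAPH ? row[1] : row[2]);
    }
    return out;
  }

  public static void main(String[] args) {
    String word = "\u00DCbergr\u00F6\u00DFe";
    System.out.println(transliterate(word, Scheme.DIGRAPH));
    System.out.println(transliterate(word, Scheme.BASE_LETTER));
  }
}
```

Either scheme can create collisions: with the digraph scheme Maße becomes Masse, which is already a different German word.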

Decide on the language used in acceptance/BDD tests

  • everything in German
  • source code in English, specifications in German
  • everything in English

Decide on the language used in documentation

  • everything in German
    • if the decision was made to use English in source code/config files then usually a document (wiki page) with canonical translations is created, i.e. German “Auto” is always translated as “Car” and not “Auto”, and source code snippets are exempt from the "everything in German" rule
  • documentation in English

Decide on the language used in stakeholder communication

  • everything in German
    • if any other step above was not "everything in German" a canonical translation document (wiki page) is necessary
  • everything in English - unrealistic because even though most Germans have knowledge of the English language most of them are not at a "native speaker" level or able to tell you the English translation of their term

In my experience the most common combination is:

  • source code/config files in English
  • source code of acceptance/BDD tests in English, specifications in German
  • documentation in English
  • canonical translation (English ⇔ German) document/wiki page
  • stakeholder communication in German
  • canonical translation (English ⇔ German) document/wiki page

I have heard of projects where everything is in German (legacy systems, banking, and insurance).


In a DDD context one might also have a ubiquitous language document and a canonical translation document

German synonyms ⇒ a single ubiquitous term ⇒ canonical translation

"Auto", "Karre", "Wagen" ⇒ "Kraftfahrzeug" ⇒ "motor vehicle"

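As a data structure, the two documents amount to two lookups chained together. A minimal Java sketch using the example above (the class `CanonicalTerms` and its maps are hypothetical, not part of any tool mentioned in this thread):

```java
import java.util.Map;
import java.util.Optional;

final class CanonicalTerms {
  // Ubiquitous language document: German synonyms -> single ubiquitous term.
  static final Map<String, String> UBIQUITOUS = Map.of(
      "Auto", "Kraftfahrzeug",
      "Karre", "Kraftfahrzeug",
      "Wagen", "Kraftfahrzeug");

  // Canonical translation document: ubiquitous term -> English term.
  static final Map<String, String> TRANSLATION = Map.of(
      "Kraftfahrzeug", "motor vehicle");

  static Optional<String> translate(String germanWord) {
    // A word that is not a known synonym is treated as a ubiquitous term itself.
    String term = UBIQUITOUS.getOrDefault(germanWord, germanWord);
    return Optional.ofNullable(TRANSLATION.get(term));
  }

  public static void main(String[] args) {
    System.out.println(translate("Karre"));
    System.out.println(translate("Kiste"));
  }
}
```

A word with no entry in either document yields an empty result instead of a guessed translation.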
@chrissimon-au
Contributor

Thanks @sdavids for your thorough analysis. You might also find the discussion here interesting - #88

This ticket is primarily just about the appropriate handling of unicode text for likely expected mismatches between glossary terminology and code.

That discussion explores a proposal for a more thorough handling of multi-language projects and could help with the scenarios you explore in this comment.

@sdavids

sdavids commented May 18, 2025

Example

Let’s say we are in the box storing domain…

We might come up with BoxRepository and Box.

{
  "id": 1,
  "mass": 5.5,
  "dimensions": "1x3x5"
}

public record Box(int id, double mass, String dimensions) {}

class Box
  attr_reader :id, :mass, :dimensions

  def initialize(id, mass, dimensions)
    @id = id
    @mass = mass
    @dimensions = dimensions
  end
end

Note: This is a really bad Box model!


This is how it would translate to “everything in German”:

Kistenlager and Kiste (note the additional n—Kistelager would be incorrect)

{
  "Nummer": 1,
  "Masse": 5.5,
  "Maße": "1x3x5"
}

public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String maße) {}

Does not work (illegal Ruby identifier):

class Kiste
  def initialize(nummer, masse, maße)
    @nummer = nummer
    @masse = masse
    @maße = maße
  end
end

Does not work either (duplicate identifier):

class Kiste
  def initialize(nummer, masse, masse)
    @nummer = nummer
    @masse = masse
    @masse = masse
  end
end

Reach for a synonym, e.g.:

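require 'active_model'

class Kiste
  include ActiveModel::Serializers::JSON

  attr_reader :nummer, :masse, :format

  def initialize(nummer, masse, format)
    @nummer = nummer
    @masse = masse
    @format = format
  end

  # ActiveModel serialization reads the attribute names from this hash.
  def attributes
    { 'nummer' => nil, 'masse' => nil, 'format' => nil }
  end

  def as_json(options = {})
    h = super(options)
    # The serialized hash has string keys; rename them to the German JSON names.
    h.store('Nummer', h.delete('nummer'))
    h.store('Masse', h.delete('masse'))
    h.store('Maße', h.delete('format'))
    h # return the hash itself, not the value of the last store
  end
end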

Most developers would reach for a synonym in the Java case too, because German special letters break easily when some team members use Windows and some use macOS/Linux with JDK < 18:

public record Kiste(@JsonProperty("Nummer") int nummer, @JsonProperty("Masse") double masse, @JsonProperty("Maße") String format) {}

@chrissimon-au
Contributor

🎉 This issue has been resolved in version 1.17.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@chrissimon-au
Contributor

Closing this issue now as the original intent is satisfied. Further conversation about multi-language support will take place in discussion #88.
