Preprocess text, Corpus or new separate widget: provide tool to convert British English to American or vice versa #1078

wvdvegte · 2024-08-15T08:45:54Z

Is your feature request related to a problem? Please describe.
When I'm working with a corpus that mixes documents in American and British English spelling, the two versions of the same word (e.g., behavior and behaviour) can influence analyses such as clustering because they may be treated as different words. Stemming might help in some cases, but it's hard to tell when it works and when it doesn't.
As an example, I had a case where, in Annotated Corpus Map, both "organize" and "organise" were identified as keywords within a cluster. It would be better if only one version were identified, as an even more significant keyword.

Describe the solution you'd like
It would be better to have an option to automatically convert all the documents so that they are analyzed as if written in only one version of English. I'm not sure if this should be an option in Corpus (where the language is selected first), in Preprocess Text (although this widget may be skipped if Document Embedding is used, as suggested here) or a separate widget altogether.
The conversion can be implemented easily with the code suggested here on Stack Overflow, which relies on a word list that is no longer available at its original location but is still available in the web archive here.

Describe alternatives you've considered
In the case I described above, I ended up with a quick fix: going back to the source data (which, fortunately, was already in a table rather than in separate documents), replacing "organis" with "organiz" by find-and-replace, and reloading the data into Orange. But this is not a comprehensive solution to the problem.

wvdvegte · 2024-08-20T09:25:55Z

Addendum: the suggested code has an error in its function definition, and the dictionary is incomplete. And of course, there is an alternative to be considered: writing a Python script to do the translation. The Python script in the attached workflow contains the corrected code and the complete dictionary.
Nevertheless, it would be nice to have harmonization/harmonisation of English spelling as an easier-to-access option in the Text add-on. Also, this script works on text in a table; it cannot process a corpus (I have no idea how to address a corpus in Python).

UK-US conversion.ows.zip

(edit: code adapted to replace whole words only, and convert to lowercase first. Added remark about corpus as input)
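
For readers who cannot open the attached workflow, here is a minimal sketch of the approach described in the edit note: lowercase the text first, then replace whole words only, using a dictionary. It is not the code from the attachment. The names `UK_TO_US` and `to_american` are made up for illustration, and the four-entry mapping is only a stand-in for the complete word list.

```python
import re

# Illustrative stand-in for the full British-to-American word list (assumption).
UK_TO_US = {
    "behaviour": "behavior",
    "organise": "organize",
    "organisation": "organization",
    "colour": "color",
}

# One alternation of all British forms, longest first; the \b anchors make
# sure only whole words match, never a fragment inside a longer word.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(w) for w in sorted(UK_TO_US, key=len, reverse=True))
    + r")\b"
)


def to_american(text: str) -> str:
    """Lowercase the text, then swap whole British words for American ones."""
    return _PATTERN.sub(lambda m: UK_TO_US[m.group(1)], text.lower())


print(to_american("The Organisation will organise its behaviour"))
```

Because only whole words are replaced, every inflected form ("colours", "organised", ...) needs its own dictionary entry. That is the trade-off against the "organis" to "organiz" prefix replacement mentioned earlier, which catches inflections but can also touch unrelated words.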
