Is your feature request related to a problem? Please describe.
When I'm working with a corpus that is a mixture of documents in American English and British English spelling, the two versions of the same word (e.g., behavior and behaviour) can influence analyses such as clustering because they may be treated as different words. Stemming might help in some cases but it's hard to find out when it does work and when it doesn't.
As an example, I had a case where, in Annotated Corpus Map, both "organize" and "organise" were identified as keywords within a cluster. It would be better if only one version were identified, as an even more significant keyword.
Describe the solution you'd like
It would be better to have an option to automatically treat all the documents so that they are analyzed as written in only one version of English. I'm not sure if this should be an option in Corpus (where the language is selected first), in Preprocess Text (however this widget may be skipped if Document Embedding is used as suggested here) or as a separate widget altogether.
The conversion can be implemented easily with the code suggested here on Stack Overflow, which relies on a word list that is no longer available at its original location but is still available in the web archive here.
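To make the idea concrete, here is a minimal sketch of a dictionary-based conversion. It is not the Stack Overflow code itself: the function name `americanize` and the four-entry `UK_TO_US` mapping are placeholders I made up for illustration, and a real solution would load the full word list instead.

```python
import re

# Hypothetical, tiny mapping for illustration only; the full word list
# mentioned above would replace this.
UK_TO_US = {
    "behaviour": "behavior",
    "organise": "organize",
    "organised": "organized",
    "colour": "color",
}

# Match whole alphabetic words so that only complete words are looked up.
_WORD = re.compile(r"[A-Za-z]+")


def americanize(text, mapping=UK_TO_US):
    """Replace British spellings found in `mapping`, keeping basic capitalization."""

    def swap(match):
        word = match.group(0)
        replacement = mapping.get(word.lower())
        if replacement is None:
            return word  # not in the list: leave the word untouched
        if word.isupper():
            return replacement.upper()
        if word[0].isupper():
            return replacement.capitalize()
        return replacement

    return _WORD.sub(swap, text)


print(americanize("Organise the colour of BEHAVIOUR."))
```

Because the lookup is per whole word, words that merely contain a British-looking fragment (such as "organism") are left alone. The same function could be applied to each text cell of a table.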
Describe alternatives you've considered
In the case I described before, I ended up with a quick fix: going back to the source data (which, fortunately, was already in a table and not in separate documents), replacing "organis" with "organiz", and reloading the data into Orange. But this is not a comprehensive solution to the problem.
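A small sketch of why the find-and-replace approach does not generalize, assuming the replacement is a plain substring substitution: it also rewrites unrelated words that happen to contain the same letters. The sample sentence and the suffix list in the pattern are made up for illustration.

```python
import re

text = "We organise workshops on how an organism adapts."

# Plain substring replacement, as in the quick fix: it also hits "organism".
naive = text.replace("organis", "organiz")

# Restricting the match to whole words with known endings avoids that,
# but every word family then needs its own hand-written pattern.
pattern = re.compile(r"\borganis(e|ed|es|ing|ation|ations)\b")
safer = pattern.sub(lambda m: "organiz" + m.group(1), text)

print(naive)
print(safer)
```

The first line printed contains "organizm"; the second keeps "organism" intact. Doing this by hand for every spelling pair is what a built-in option would avoid.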
Addendum: the suggested code has an error in its function definition, and the dictionary is incomplete. And of course, there is an alternative to be considered: write a Python script to do the translation. The Python script in the attached workflow contains the corrected code and the complete dictionary.
Nevertheless, it would be nice to have harmonization/harmonisation of English spelling as an easier-to-access option in the Text add-on. Also, this script only works on text in a table; it cannot process a corpus (I have no idea how to address a corpus in Python).