Transducer no longer meets Apertium Turkic standards #15
First of all, apologies for having broken the old workflows. Apertium-uzb and apertium-kaa are also affected by this. If we decide to restore the old organisation, apertium-kaz can simply be reverted to d9ee49d. All subsequent changes were also made in https://raw.githubusercontent.com/taruen/apertiumpp/master/apertiumpp-kaz/lexicon.rkt (stems from which I plan to merge back into apertium-kaz in some sensible way, once I finish proofreading them against the explanatory dictionary). Apertium-uzb and apertium-kaa had that organisation before GSoC, but committers didn't seem to be careful enough not to put adjectives into LEXICON Nouns, nouns into LEXICON Adjectives, etc. In short, the reasons why I reduced the lexicons to Common, Proper, Punctuation and Abbreviations were these:
Iirc, even the creators of .lexc admit (in the FSM book) that some more computationally processable format should be used for storing the lexicons (from which the .lexc files are then derived). Either lexc2dix should be polished up so that we can easily query lexicons (to count stems, etc.), or we should write lexicons in some other format. I see that as a real problem, but that's only my opinion. |
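As an aside, querying a .lexc file directly for stem counts per category is not too painful once every entry lives under a named LEXICON. A hedged sketch (toy.lexc and its entries are invented here, and this is not the actual countstems script):

```shell
# Hedged sketch: toy.lexc and its entries are made up for illustration;
# this is NOT the real apertium countstems script.
cat > toy.lexc <<'EOF'
LEXICON Nouns
alma:alma N1 ; ! "apple"
kitap:kitap N1 ; ! "book"
LEXICON Adjectives
zur:zur A1 ; ! "big"
EOF

# Count non-empty, non-comment entry lines under each LEXICON header.
awk '/^LEXICON/ {lex = $2; next}
     NF && !/^[[:space:]]*!/ {count[lex]++}
     END {for (l in count) print l, count[l]}' toy.lexc | sort > stemcounts.txt
cat stemcounts.txt   # prints: Adjectives 1, then Nouns 2
```

This only works cleanly when stems of one part of speech actually sit in the lexicon named after it, which is part of the argument for the standardised layout.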
Last time I looked at it, lexc2dix was making some errors which I don't recall anymore. |
Hello to everyone!
If you don't mind I would like to say a couple of words here.
My part in apertium-tat over the last few years has not been large and mostly consists of improving twol rules and working with the lexc file.
In my opinion, the lexc file's classic organisation really does have the shortcomings Ilnar described.
It is really difficult to improve this dictionary when, for example, you can't see whether the word you are editing appears in other sections or not. Jumping through the file all the time is not an option here, and search doesn't help that much either, unfortunately. It slows you down significantly.
The situation gets drastically worse as the dictionary grows.
The new layout will at least keep all of them (nouns, adjectives, adverbs...) next to each other alphabetically, and solve one of the biggest problems I have run into here.
Actually, I don't know what advantages the classic organisation gives us; if you don't mind, maybe it is time to consider some changes?
With best wishes,
Mansur
Ilnar Salimzianov <[email protected]> wrote on Mon., 2 Sep. 2019,
21:46:
… Last time I looked at it, lexc2dix was making some errors which I don't
recall anymore.
|
@mansayk, thank you for sharing your view on this—it's very helpful. I'd just like to clarify one point. You say:
"Jumping all the time through the file is not an option here and the search also doesn't help that much, unfortunately. It slows you down significantly."
I'm not sure I understand what the problem is. Could you provide more information on what you're having trouble with? |
When checking a lexc file for miscategorized stems (perhaps with an alphabetically sorted reference dictionary at hand, perhaps not, but especially if you have one), you need to see all occurrences of a particular stem in the lexc file (to see which continuations it has). That implies manual search: typing Control-S in Emacs, then the word you're looking for, and jumping through the file. Or selecting the word and then searching for it: https://stackoverflow.com/questions/202803/searching-for-marked-selected-text-in-emacs (I doubt that it's any faster in Vi(m) :P ) Imo that's significantly slower than going through an alphabetically sorted list and just deleting the lines where the stem has the wrong continuation lexicon. @mansayk, did you mean that? |
Another thing is that categories are not independent of each other, so to speak. Sure, some stems can belong to several categories at once, but there are also cases where belonging to one category excludes belonging to another. In my worldview at least, "foo A1" makes "foo ADV" redundant, just as "hargle CC" would make "hargle CS" redundant (or incorrect). Yet another issue is improperly lexicalised wordforms. Seeing "алдында ADV" right after "ал{д} N1" should make any conscientious lexicographer think. |
I think I like what Fran suggested. Indeed, pronouns especially tend to have lots of hardcoded entries anyway, so it makes sense to keep them and the other closed categories separate. |
Hi!
Jonathan, let's imagine the following situations:
1. I need to add a new word to the lexc file:
- I use my corpus to construct a frequency list.
- I use apertium to mark up all the words in that list.
- I remove the words that were successfully recognized and tagged by apertium.
- I take the unrecognized words from the top of the list (the most frequent ones) one by one and insert them into the lexc file. Usually I open them as 2 tabs in vim, or just split the screen (vsplit).
- Ilnar and I try to keep all the lists in the lexc file alphabetically sorted, though it is not easy because there are many of them.
- So I take a word, place it in, for example, the adj n1 section, and go on happily; later I accidentally discover that the word was already present in the adj n3 section, or even in adv.
Why didn't apertium tag it if it was already in the dict? There are several reasons, and one of the most frequent is mistakes in the twol rules. That's why I put all my attention during the last year into improving them. Let me know if there are still any problems in the twol rules. If all the pos categories were together in the same list, I would have seen that word with all its other pos tags in the first place.
Why didn't I use search first? Mostly I do, but when it comes to short words it produces many wrong matches, and I have to use regex syntax and additional symbols...
2. I need to change the POS tag of a word that already exists in the lexc file:
- I use search to locate the word. If the word is present in several sections, it is not convenient to jump through the whole file instead of seeing them all in one place. And we cannot just edit the tag of the currently miscategorized word; we also need to move it manually to the corresponding section, and to the right place among the other sorted words. What if I forget to do that last part? Judging by the many cases I have seen of words sitting in the wrong section, it is quite a common problem.
3. Let's remember koguzhan's last commits to the Tatar lexc file. There were many duplications, I think mostly because of the current lexc file's messy structure. And I made the same mistakes when I started contributing to apertium several years ago. We could have avoided that if we had a single big sorted list.
4. I have tried to go through the whole dictionary and check it word by word several times, because there are many miscategorized words in it. But when I see some word, it occurs to me that this word should also be present in some other section(s). I set vim's anchor on the current line and go searching here and there. Again the jumping, and it is pretty distracting. Instead, I could just see all the tags of the word in one place if the whole list were sorted as one piece.
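The frequency-list workflow in point 1 can be sketched as a small shell pipeline. Everything here is stand-in data: corpus.txt, known.txt and the word forms are invented, and in practice known.txt would come from running the corpus through the analyser (e.g. with hfst-proc or lt-proc) rather than being written by hand:

```shell
# Stand-in data; in practice known.txt would be derived from the
# analyser's output, not written by hand.
printf 'alma kitap alma bar\n' > corpus.txt
printf 'bar\nkitap\n' > known.txt

# Frequency list, most frequent first.
tr -s ' ' '\n' < corpus.txt | grep -v '^$' | sort | uniq -c | sort -rn > freq.txt

# Unique forms minus the recognised ones: candidates for the lexc file.
awk '{print $2}' freq.txt | sort > forms.txt
comm -23 forms.txt known.txt > unknown.txt
cat unknown.txt    # only 'alma' is left
```

`comm -23` drops the forms present in known.txt, leaving exactly the unrecognized forms that still need lexc entries.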
These are only a couple of the problems that came to my mind in the middle of the night... Making all those manipulations for a single word might not sound so terrible, but we deal with hundreds and thousands of them, where it becomes messy and time-consuming and leads to new mistakes... I hope you agree with me that the file structure should help us avoid those mistakes, save our time, and be easy to use.
Please do not take this post as complaining. It is just some notes from the experience of a non-professional apertiumer. Thank you!
Best,
Mansur
Jonathan Washington <[email protected]> wrote on Mon., 2 Sep.
2019, 23:32:
… @mansayk <https://github.com/mansayk>, thank you for sharing your view on
this—it's very helpful.
I'd just like to clarify one point. You say:
Jumping all the time through the file is not an option here and the search
also doesn't help that much, unfortunately. It slows you down significantly.
I'm not sure I understand what the problem is. Could you provide more
information on what you're having trouble with?
|
Okay, I have a better sense now of what the reasoning is. These are valid reasons, and I've experienced these issues myself. I like Fran's proposal—to keep "open" and "closed" categories separate. I would argue that closed categories should be broken down much the way we had them—or we could include conjunctions and the like with the open categories so they're near adverbs. Pronouns and determiners should definitely go together. Numbers should probably remain separate. In any case, I'm okay lumping various categories together for the reasons stated, but I also think there are certain ways that we should keep things separate. Does this make sense? Is my general philosophy towards it compatible with everyone else's? |
I was thinking something like:
LEXICON Root
Open ;
Closed ;
Proper ;
Punctuation ;
Numerals ;
LEXICON Open
bar:bar N1 ; ! ""
foo:foo N1 ; ! ""
foo:foo V-TV ; ! ""
LEXICON Closed
Pronouns ;
Determiners ;
Conjunctions ;
Postpositions ;
LEXICON Pronouns
blah:blah PRON-PERS ; ! ""
LEXICON Proper
LEXICON Punctuation
LEXICON Numerals
|
Hi!
I would suggest placing LEXICON Open at the very end of the file, so it is
easier to find where it ends when we sort it.
Or maybe use some kind of
@import TO LEXICON Open FROM FILE...
I am not sure about the second option. It doesn't seem a good choice here,
because that approach has its own shortcomings, and we try to keep the lexc
file self-contained.
I just want to say that the Open and Proper categories are going to be huge,
and to sort them we need to find the beginning and scroll down to the end
line of the category without interfering with another one.
Maybe we just need some anchors there, and some universal bash script (with
a different LC_COLLATE parameter for each language) to sort all the
categories in the lexc file when we run it?..
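One possible shape for such a script, sketched on a toy file: decorate each line with its block number and a header-first flag, sort stably within each block, then strip the decoration. LC_COLLATE is pinned to C here only so the example behaves the same everywhere; a per-language locale (e.g. tt_RU.UTF-8 for Tatar) would be substituted in practice:

```shell
# Toy input; real lexc files would be much larger.
cat > sample.lexc <<'EOF'
LEXICON Open
foo:foo N1 ; ! ""
bar:bar N1 ; ! ""
LEXICON Punctuation
EOF

# Decorate each line with (block number, header-first flag), sort
# stably within each block by the line text, then strip the decoration.
# LC_COLLATE=C keeps the example deterministic; swap in a per-language
# locale for real use.
TAB=$(printf '\t')
awk -v OFS='\t' '/^LEXICON/ {n++} {print n, (/^LEXICON/ ? 0 : 1), $0}' sample.lexc \
  | LC_COLLATE=C sort -s -t"$TAB" -k1,1n -k2,2n -k3 \
  | cut -f3- > sorted.lexc
cat sorted.lexc
```

Because the block number is the primary sort key, each LEXICON keeps its position and its header stays first; only the entry lines within a block are reordered.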
Best,
Mansur
Francis Tyers <[email protected]> wrote on Tue., 3 Sep. 2019,
08:08:
… I was thinking something like:
LEXICON Root
Open ;
Closed ;
Proper ;
Punctuation ;
Numerals ;
LEXICON Open
bar:bar N1 ; ! ""
foo:foo N1 ; ! ""
foo:foo V-TV ; ! ""
LEXICON Closed
Pronouns ;
Determiners ;
Conjunctions ;
Postpositions ;
LEXICON Pronouns
blah:blah PRON-PERS ; ! ""
LEXICON Proper
LEXICON Punctuation
LEXICON Numerals
|
I'm used to a particular ordering, but in any case, finding the end of a lexicon isn't difficult with vim: you just enter visual mode and jump to the next LEXICON heading. And I certainly don't mind the suggested layout. I propose a couple of adjustments to @ftyers's proposal:
|
The issue with the reorganisation of the lexicon in de4c77a is that different parts of speech are all lumped together.
Every single other Turkic transducer uses the lexicon names Nouns, Adjectives, Verbs, ProperNouns, etc. This is standardised for several reasons, one of which is that it gives us an easy way to count the number of stems of a particular type. E.g., note that the countstems script was broken by your changes.
@IlnarSelimcan, could you justify why you did this reorganisation? Also, in principle this sort of major restructuring should be done in consultation with and by consensus among everyone it affects—that is, everyone who has committed to this repo, or at least the apertium-turkic mailing list.