Ud kazakh ktb v2.7 #17
base: main
… through my changes one more time over the next few days and then finally make a pull request to apertium's repo and then move on to new stuff)
… erroneous 'fixes' of mine
5 - - PUNCT guio _ 6 punct _ SpaceAfter=No
6 есен есен ADJ adj _ 7 advmod _ _
6 есен есен ADJ adj _ 4 compound _ _
This flip is debatable. Let me know if it should be reverted. The rule of thumb in UD seems to be, all other things being equal, to make the first component the head (cf. conjunctions), which is probably the reason for this change.
Although in kk it is currently the other way around in case of numerical compounds: https://universaldependencies.org/kk/dep/compound.html
1 Кім кім PRON prn Case=Nom|PronType=Int 3 nsubj _ _
2 қалай қалай ADV adv PronType=Int 3 advmod _ _
3 тұрады тұр VERB v Mood=Ind|Number=Sing|Person=3|Tense=Aor|VerbForm=Fin 0 root _ SpaceAfter=No
4 ? ? PUNCT sent _ 3 punct _ _

# sent_id = kdt.tagged.txt:252:4113
# text = Кімге сенуге, кіммен істесуге болады?
# labels = checked_IFS
1 Кімге кім PRON prn Case=Dat|PronType=Int 2 obj _ _
iobj maybe?
… came before this particular sentence
3 алсам екен ал VERB v Mood=Vol|Number=Sing|Person=1|VerbForm=Fin 5 ccomp _ _
4 деген де VERB v Tense=Past|VerbForm=Part 11 acl _ _
5 надан надан ADJ adj _ 11 amod _ _
3 алсам екен ал VERB v Mood=Vol|Number=Sing|Person=1|VerbForm=Fin 4 ccomp _ _
The fact that this is a multi-token unit should probably be indicated.
Actually, this looks like a conversion error.
This one is no different from constructions like `Қазақстан республикасы', so I don't see a reason to label it differently.
Well, the reason why the deprel of `былай' was changed from `obl' to `advmod' is to make it consistent with `осылай', and then the validator complains about PRON being advmod... So this thing is debatable -- both `осылай' and `былай' could be PRON & obl. This is something mentioned in the paper iirc.
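The PRON-as-advmod complaint discussed above can be reproduced locally before running the full validator. The following is only a minimal sketch of a simplified check, not the validator's actual logic (the real rule may accept more UPOS values than `ADV`). The function name is made up for illustration, and the second sample token is a hypothetical line constructed for the example, not copied from the treebank.

```python
def advmod_with_other_upos(conllu_text):
    """Return (line number, form, UPOS) for advmod tokens whose UPOS is not ADV.

    Simplified: comments, multiword-token ranges and empty nodes are skipped,
    and only ADV is treated as acceptable (an assumption of this sketch).
    """
    hits = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
            continue
        form, upos, deprel = cols[1], cols[3], cols[7]
        # Compare the universal part of the relation, ignoring any subtype.
        if deprel.split(":")[0] == "advmod" and upos != "ADV":
            hits.append((lineno, form, upos))
    return hits


SAMPLE = "\n".join([
    # Token as it appears in the diff above.
    "\t".join(["2", "қалай", "қалай", "ADV", "adv", "PronType=Int", "3", "advmod", "_", "_"]),
    # Hypothetical token, written only to trigger the check.
    "\t".join(["1", "былай", "былай", "PRON", "prn", "PronType=Dem", "2", "advmod", "_", "_"]),
])

print(advmod_with_other_upos(SAMPLE))
```

Each hit then has to be decided by hand: either the UPOS changes to ADV or the relation changes to `obl`, which is exactly the open question in this thread.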
taruen/ud-tools@d0819a0 solves most of the validation issues related to auxiliaries except for the following:
@@ -4256,20 +4622,21 @@
6 , , PUNCT cm _ 10 punct _ _
7 қала қала NOUN n Case=Nom 9 compound _ SpaceAfter=No
8 - - PUNCT guio _ 9 punct _ SpaceAfter=No
9 қалаға қала NOUN n Case=Dat 10 nmod _ _
9 қалаға қала NOUN n Case=Dat 10 obl _ _
10 орнады орна VERB v Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root _ SpaceAfter=No
11 . . PUNCT sent _ 10 punct _ _

# sent_id = kdt.tagged.txt:35:570
[Line 4647 Sent kdt.tagged.txt:35:570 Node 16]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'nsubj'.
тұр in constructions of the form X Y-ABL тұрады was treated as the main verb in all other cases except this one. I think the right way to handle this is to make it the root.
…6 Node 2]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (2:болғанда:cop --> 3:да:advmod) by attaching да to the head of the copula
…l-upos-aux] 'aux' should be 'AUX' but it is 'VERB'
validation.py complains that [Line 3629 Sent akorda-random.tagged.txt:286:5111 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Сондықтан:cc --> 2:да:advmod), and I see no other way of fixing that. Besides, SCONJ and cc seem to be contradictory (unlike SCONJ and mark, for example)
…y, fix the following: [Line 6552 Sent kdt.tagged.txt:203:3220 Node 2]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN' [Line 8739 Sent Жиырма_Бесінші_Сөз.tagged.txt:9:220 Node 9]: [L3 Syntax rel-upos-cop] 'cop' should be 'AUX' or 'PRON'/'DET' but it is 'VERB'
…y, fix the following: [Line 7121 Sent kdt.tagged.txt:248:4021 Node 8]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
…y, fix the following: [Line 8274 Sent wikitravel.tagged.txt:7:59 Node 1]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON' [Line 8864 Sent Жиырма_Бесінші_Сөз.tagged.txt:17:420 Node 3]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON' [Line 8934 Sent Жиырма_Бесінші_Сөз.tagged.txt:21:528 Node 2]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'SCONJ' The first two changes are quite in line with the guidelines: <quote ud guidelines> There is a closed subclass of pronominal adverbs that refer to circumstances in context, rather than naming them directly; similarly to pronouns, these can be categorized as interrogative, relative, demonstrati> </unquote> I'm less certain about the last one. Another option would've been to give сөйтсе the UPOS SCONJ and label it as cc, but that also looks dubious.
…y, fix the following: [Line 9001 Sent Жиырма_Бесінші_Сөз.tagged.txt:25:614 Node 8]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [18, 19, 20] [Line 9160 Sent Жиырма_Бесінші_Сөз.tagged.txt:33:880 Node 1]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'SCONJ'
…ce the latest fixes were accidentally done only there
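The validator messages quoted in the commit messages above all share one shape, so the remaining issues can be grouped by test name to see which kinds dominate. This is a rough sketch that assumes the message format is exactly as quoted in this thread; the regular expression and the `tally` helper are illustrative and not part of the UD tools.

```python
import re
from collections import Counter

# Pattern inferred from the messages quoted in this PR, e.g.
# "[Line 4647 Sent kdt.tagged.txt:35:570 Node 16]: [L3 Syntax orphan-parent] ..."
MSG_RE = re.compile(
    r"\[Line (?P<line>\d+) Sent (?P<sent>\S+)(?: Node (?P<node>\d+))?\]: "
    r"\[L(?P<level>\d) (?P<cls>\w+) (?P<test>[\w-]+)\]"
)


def tally(log_text):
    """Count validator messages per test name (e.g. 'rel-upos-advmod')."""
    return Counter(m.group("test") for m in MSG_RE.finditer(log_text))


# Three messages copied from this conversation.
LOG = """\
[Line 4647 Sent kdt.tagged.txt:35:570 Node 16]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'nsubj'.
[Line 6552 Sent kdt.tagged.txt:203:3220 Node 2]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
[Line 7121 Sent kdt.tagged.txt:248:4021 Node 8]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
"""

for test_name, count in tally(LOG).most_common():
    print(f"{count:3d}  {test_name}")
```

In practice the validator output would be saved to a file and passed to `tally`; the inline `LOG` string only keeps the example self-contained.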
#16 conflates changes to the Constraint Grammar with the corrections in the UD treebank. To make merging easier/faster, I decided to make a separate PR out of the latter.
I worked on the conllu file directly. The changes to it will have to be "backpropagated" into `tagged.txt` files.

A brief description of the changes.

`nmod` were changed to `obl`, as they should be according to version 2 of the annotation guidelines:

The majority of the validation errors were like the following:

which was due to the incompleteness / out-of-datedness of the language-specific documentation rather than issues with the treebank itself (it turns out that these language-specific lists of auxiliaries & copulas are kept in the validation script itself. A pull request has been made to it, see below).

Note that as of yet the treebank does not fully validate; about 20 issues remain.
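For the `nmod` to `obl` change described above, candidates can be listed mechanically and then reviewed by hand. Below is a minimal sketch, assuming that an `nmod` dependent whose head is tagged VERB is worth a second look; it is not how the changes in this PR were actually made. The helper names are invented, and the sample is the pre-change fragment of the hunk shown earlier in this conversation (tokens 7 to 10 only).

```python
def read_tokens(conllu_sentence):
    """Map token id -> dict for the basic tokens of one CoNLL-U sentence."""
    tokens = {}
    for line in conllu_sentence.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue
        if not cols[0].isdigit() or not cols[6].isdigit():
            continue  # skip multiword ranges, empty nodes, missing heads
        tokens[int(cols[0])] = {
            "form": cols[1], "upos": cols[3],
            "head": int(cols[6]), "deprel": cols[7],
        }
    return tokens


def nmod_on_verbs(conllu_sentence):
    """List (id, form, head form) for nmod tokens attached to a VERB."""
    tokens = read_tokens(conllu_sentence)
    hits = []
    for tid, tok in sorted(tokens.items()):
        head = tokens.get(tok["head"])
        if tok["deprel"] == "nmod" and head and head["upos"] == "VERB":
            hits.append((tid, tok["form"], head["form"]))
    return hits


FRAGMENT = "\n".join("\t".join(row) for row in [
    ["7", "қала", "қала", "NOUN", "n", "Case=Nom", "9", "compound", "_", "SpaceAfter=No"],
    ["8", "-", "-", "PUNCT", "guio", "_", "9", "punct", "_", "SpaceAfter=No"],
    ["9", "қалаға", "қала", "NOUN", "n", "Case=Dat", "10", "nmod", "_", "_"],
    ["10", "орнады", "орна", "VERB", "v", "_", "0", "root", "_", "SpaceAfter=No"],
])

print(nmod_on_verbs(FRAGMENT))
```

The check only flags candidates. Whether a given token really should become `obl` still depends on the analysis of the sentence.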
I've double checked my own changes by going over https://github.com/apertium/apertium-kaz/pull/17/files having opened up the treebank before and after in UD-Annotatrix. Please let me know if you think that I've made things worse, especially if you notice that I made an error consistently.
Reviewing this PR carefully is likely to take three to four full working days. That is about how long double checking my own changes took me.
I really hope that the next release of UD (scheduled for November 15, data freeze is on November 1) will include this new version. What remains to be done:

- Auxiliaries need to be added to validate.py.
- `digging` in the sentence *The boss said to start digging* in the guidelines is labeled as `xcomp` of `start`, whereas in Kazakh we treat `баста` in `-A.<gna_impf> баста-` as an `aux`. Also, unlike all other verbs in the above pull request, `баста` is not listed among the auxiliary verbs in *Kazakh: A Comprehensive Grammar*, although it probably should be handled as such, as we do currently.

Қ. Қасабек, Ж. Балтабаев, Қ. Мұхамади, Р. Рахметова & Ж. Көпбаева. 2012. Қазіргі қазақ тілі. Алматы: Нур-Принт.