Ud kazakh ktb v2.7 #17

Open
wants to merge 42 commits into
base: main

Member

@IlnarSelimcan IlnarSelimcan commented Jul 3, 2020

#16 conflates changes to the Constraint Grammar with corrections to the UD treebank. To make merging easier and faster, I decided to make a separate PR out of the latter.

I worked on the conllu file directly. The changes to it will have to be "backpropagated" into tagged.txt files.

A brief description of the changes.

  • two things that did not conform to UD v2 were fixed:
  1. dependents of the predicate which were labeled nmod were changed to obl, as they should be according to version 2 of the annotation guidelines:

The nmod relation, which in v1 was used for nominals modifying either predicates or other nominals, is in v2 restricted to modifying nominals. A new relation obl (oblique) is introduced for oblique dependents of predicates. (https://universaldependencies.org/v2/summary.html)

  2. punctuation was re-attached projectively:

Coordinating conjunctions (cc) and punctuation (punct) inside coordinated structures are in v2 attached to the immediately succeeding conjunct (instead of the first conjunct as in v1). (https://universaldependencies.org/v2/summary.html)

  • a few other things which validate.py was complaining about were fixed, most notably in the case of words like `осылай', for which errors like the following were occurring:
[Line 1935 Sent akorda-random.tagged.txt:164:2942 Node 5]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON'

The majority of the validation errors were like the following:

[Line 115 Sent akorda-random.tagged.txt:8:120 Node 11]: [L5 Morpho aux-lemma] 'отыр' is not an auxiliary verb in language [kk]

which was due to the incompleteness or outdatedness of the language-specific documentation rather than to issues with the treebank itself (it turns out that these language-specific lists of auxiliaries and copulas are kept in the validation script itself; a pull request has been made to it, see below).

Note that the treebank does not fully validate yet; about 20 issues remain.

  • all other changes deal with what I believe to be annotation errors.
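The nmod-to-obl relabeling described in the first item could be sketched roughly as follows. This is only an illustration, not the procedure actually used for this PR (the changes were made by hand in the conllu file). The function name `relabel_nmod_to_obl`, the `PREDICATE_UPOS` heuristic, and the two-token toy sentence (adapted from one of the diffs, with renumbered IDs) are all assumptions.

```python
# Column order follows the CoNLL-U format:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
# Assumption: "predicate" is approximated by the UPOS of the head.
# Nominal predicates are NOT caught by this heuristic.
PREDICATE_UPOS = {"VERB", "AUX"}


def relabel_nmod_to_obl(sentence_lines):
    """Return a copy of one sentence with nmod -> obl under verbal heads."""
    # First pass: remember the UPOS of every token, keyed by its ID.
    upos_by_id = {}
    for line in sentence_lines:
        if line and not line.startswith("#"):
            cols = line.split("\t")
            upos_by_id[cols[0]] = cols[3]

    # Second pass: rewrite DEPREL where the head looks like a predicate.
    out = []
    for line in sentence_lines:
        if not line or line.startswith("#"):
            out.append(line)  # comments and blank lines pass through
            continue
        cols = line.split("\t")
        if cols[7] == "nmod" and upos_by_id.get(cols[6]) in PREDICATE_UPOS:
            cols[7] = "obl"
        out.append("\t".join(cols))
    return out


# Toy sentence (hypothetical IDs), tab-separated as in real CoNLL-U files.
sample = [
    "# sent_id = toy-example",
    "\t".join(["1", "қалаға", "қала", "NOUN", "n", "Case=Dat", "2", "nmod", "_", "_"]),
    "\t".join(["2", "орнады", "орна", "VERB", "v", "_", "0", "root", "_", "_"]),
]
fixed = relabel_nmod_to_obl(sample)
print(fixed[1].split("\t")[7])  # prints: obl
```

Any output of such a script would still need manual review, since the UPOS of the head is only a rough proxy for "modifies a predicate".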

I've double-checked my own changes by going over https://github.com/apertium/apertium-kaz/pull/17/files with the before and after versions of the treebank open in UD-Annotatrix. Please let me know if you think that I've made things worse, especially if you notice that I made an error consistently.

Reviewing this PR carefully is likely to take three to four full working days. That's what double-checking my own changes seemed to take me.

I really hope that the next release of UD (scheduled for November 15, data freeze is on November 1) will include this new version. What remains to be done:

  1. Making sure that the treebank validates against validate.py.
  2. Updating the documentation of the treebank.
  • at http://taruen.com/apertium-kaz/ (Section 7.1. Open questions about Kazakh UD) I tried to keep track of the issues which need to be discussed, but keep in mind that those notes were taken "in the heat of annotating" and that most of them at the moment are probably too brief to be discussable and thus will have to be re-checked by me first and turned into a more tangible form.
  • note that the `AN ASSESSMENT OF UNIVERSAL DEPENDENCY ANNOTATION GUIDELINES FOR TURKIC LANGUAGES' (2017) paper contains some more info / specific tests on cases marked as being discussed in the language-specific documentation
  3. There seemed to be some more sentences in the UD_KTB repo; those should be validated too.
  4. Auxiliaries need to be added to validate.py.
  • "complete the list of Kazakh auxiliary verbs" (UniversalDependencies/tools#69) does just that.
  • Note that in the guidelines, the verb "digging" in the sentence "The boss said to start digging" is labeled as xcomp of "start", whereas in Kazakh we treat баста in -A.<gna_impf> баста- as an aux. Also, unlike all other verbs in the above pull request, баста is not listed among the auxiliary verbs in Kazakh: A Comprehensive Grammar, although it probably should be handled as one, as we currently do.
  • Screenshot from 2020-10-13 01-01-14 in Оразбаева, Ф.Ш., Г. Сағидолда, Б. Қасым, А. Қобыланова, Қ. Есенова, Ұ. Исабекова,
    Қ. Қасабек, Ж. Балтабаев, Қ. Мұхамади, Р. Рахметова & Ж. Көпбаева. 2012.
    Қазіргі қазақ тілі. Алматы: Нур-Принт.

5 - - PUNCT guio _ 6 punct _ SpaceAfter=No
-6 есен есен ADJ adj _ 7 advmod _ _
+6 есен есен ADJ adj _ 4 compound _ _
Member Author

@IlnarSelimcan IlnarSelimcan Jul 3, 2020

This flip is debatable. Let me know if it should be reverted. The rule of thumb in UD seems to be, all other things being equal, to make the first component the head (cf. conjunctions), which is probably the reason for this change.

Although in kk it is currently the other way around in case of numerical compounds: https://universaldependencies.org/kk/dep/compound.html

1 Кім кім PRON prn Case=Nom|PronType=Int 3 nsubj _ _
2 қалай қалай ADV adv PronType=Int 3 advmod _ _
3 тұрады тұр VERB v Mood=Ind|Number=Sing|Person=3|Tense=Aor|VerbForm=Fin 0 root _ SpaceAfter=No
4 ? ? PUNCT sent _ 3 punct _ _

# sent_id = kdt.tagged.txt:252:4113
# text = Кімге сенуге, кіммен істесуге болады?
# labels = checked_IFS
1 Кімге кім PRON prn Case=Dat|PronType=Int 2 obj _ _

iobj maybe?

3 алсам екен ал VERB v Mood=Vol|Number=Sing|Person=1|VerbForm=Fin 5 ccomp _ _
4 деген де VERB v Tense=Past|VerbForm=Part 11 acl _ _
5 надан надан ADJ adj _ 11 amod _ _
3 алсам екен ал VERB v Mood=Vol|Number=Sing|Person=1|VerbForm=Fin 4 ccomp _ _

The multi-token nature of this should probably be indicated.

Actually, this looks like a conversion error.

This one is no different from constructions like `Қазақстан республикасы',
so I don't see a reason to label it differently.
Well, the reason why the deprel of `былай' was changed from `obl' to `advmod'
is to make it consistent with `осылай', and then the validator complains about
PRON being advmod... So this thing is debatable -- both `осылай' and `былай'
could be PRON & obl. This is something mentioned in the paper, iirc.
@IlnarSelimcan IlnarSelimcan marked this pull request as ready for review July 19, 2020 04:43
@IlnarSelimcan

taruen/ud-tools@d0819a0 solves most of the validation issues related to auxiliaries except for the following:

[Line 7883 Sent udhr.tagged.txt:7:305 Node 8]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (8:болса:cop --> 9:да:advmod)
[Line 10375 Sent Иран.tagged.txt:52:1481 Node 12]: [L3 Syntax rel-upos-aux] 'aux' should be 'AUX' but it is 'VERB'
[Line 12859 Sent wikipedia.tagged.txt:71:1163 Node 5]: [L5 Morpho aux-lemma] 'атан' is not an auxiliary verb in language [kk]
[Line 13815 Sent Шымкент.tagged.txt:6:156 Node 2]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (2:болғанда:cop --> 3:да:advmod)

атан at line 12859 looks like a main verb.
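To see which kinds of issues remain, validator messages of the shape quoted above can be tallied by test ID. The sketch below is not part of validate.py; the regular expression is inferred only from the messages quoted in this thread, and the names `MESSAGE_RE` and `tally_by_test` are hypothetical. Messages that lack a `Node` field, if any exist, would simply be skipped.

```python
import re
from collections import Counter

# Pattern inferred from messages such as:
# [Line 7883 Sent udhr.tagged.txt:7:305 Node 8]: [L3 Syntax leaf-aux-cop] ...
MESSAGE_RE = re.compile(
    r"^\[Line (?P<line>\d+) Sent (?P<sent>\S+) Node (?P<node>\d+)\]: "
    r"\[L(?P<level>\d+) (?P<cls>\w+) (?P<test>[\w-]+)\] (?P<text>.*)$"
)


def tally_by_test(log_lines):
    """Count validator messages per test ID (e.g. 'leaf-aux-cop')."""
    counts = Counter()
    for line in log_lines:
        match = MESSAGE_RE.match(line.strip())
        if match:  # lines that do not look like messages are ignored
            counts[match.group("test")] += 1
    return counts


# The four messages quoted above, used here as sample input.
sample_log = [
    "[Line 7883 Sent udhr.tagged.txt:7:305 Node 8]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (8:болса:cop --> 9:да:advmod)",
    "[Line 10375 Sent Иран.tagged.txt:52:1481 Node 12]: [L3 Syntax rel-upos-aux] 'aux' should be 'AUX' but it is 'VERB'",
    "[Line 12859 Sent wikipedia.tagged.txt:71:1163 Node 5]: [L5 Morpho aux-lemma] 'атан' is not an auxiliary verb in language [kk]",
    "[Line 13815 Sent Шымкент.tagged.txt:6:156 Node 2]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (2:болғанда:cop --> 3:да:advmod)",
]
counts = tally_by_test(sample_log)
for test_id, n in counts.items():
    print(test_id, n)
```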

@@ -4256,20 +4622,21 @@
6 , , PUNCT cm _ 10 punct _ _
7 қала қала NOUN n Case=Nom 9 compound _ SpaceAfter=No
8 - - PUNCT guio _ 9 punct _ SpaceAfter=No
-9 қалаға қала NOUN n Case=Dat 10 nmod _ _
+9 қалаға қала NOUN n Case=Dat 10 obl _ _
10 орнады орна VERB v Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root _ SpaceAfter=No
11 . . PUNCT sent _ 10 punct _ _

# sent_id = kdt.tagged.txt:35:570

[Line 4647 Sent kdt.tagged.txt:35:570 Node 16]: [L3 Syntax orphan-parent] The parent of 'orphan' should normally be 'conj' but it is 'nsubj'.

тұр in constructions of the form X Y-ABL тұрады was treated as the main verb in all
cases except this one. I think the right way to handle it is to make
it root.
…6 Node 2]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (2:болғанда:cop --> 3:да:advmod) by attaching да to the head of the copula
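The reattachment mentioned in this commit message (moving a dependent such as да from the copula to the copula's head) could be sketched as below. This is an illustration under assumptions: tokens are simplified to dicts with `id`, `head`, and `deprel` keys, the function name `reattach_copula_children` is hypothetical, and the predicate at node 1 in the toy input is invented, because the quoted message does not show the copula's head.

```python
def reattach_copula_children(tokens):
    """Move every dependent of a 'cop' node to that copula's own head.

    `tokens` is a list of dicts with integer 'id' and 'head' and a string
    'deprel'. The list is modified in place and also returned.
    """
    # Map each copula's ID to the ID of its head.
    head_of_cop = {t["id"]: t["head"] for t in tokens if t["deprel"] == "cop"}
    for t in tokens:
        if t["head"] in head_of_cop:
            t["head"] = head_of_cop[t["head"]]
    return tokens


# Toy input: node 1 is an invented predicate; 2 and 3 mirror the quoted
# message (2:болғанда:cop --> 3:да:advmod).
toy = [
    {"id": 1, "head": 0, "deprel": "root"},
    {"id": 2, "head": 1, "deprel": "cop"},
    {"id": 3, "head": 2, "deprel": "advmod"},
]
reattach_copula_children(toy)
print(toy[2]["head"])  # prints: 1
```

Whether the copula's head is the right attachment site linguistically is a separate question that this sketch does not address.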
…l-upos-aux] 'aux' should be 'AUX' but it is 'VERB'
validate.py complains that

[Line 3629 Sent akorda-random.tagged.txt:286:5111 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Сондықтан:cc --> 2:да:advmod

and I see no other way of fixing that. Besides, SCONJ and cc seem to be contradictory (unlike, for example, SCONJ and mark).
…y, fix the following:

[Line 6552 Sent kdt.tagged.txt:203:3220 Node 2]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
[Line 8739 Sent Жиырма_Бесінші_Сөз.tagged.txt:9:220 Node 9]: [L3 Syntax rel-upos-cop] 'cop' should be 'AUX' or 'PRON'/'DET' but it is 'VERB'
…y, fix the following:

[Line 7121 Sent kdt.tagged.txt:248:4021 Node 8]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'NOUN'
…y, fix the following:

[Line 8274 Sent wikitravel.tagged.txt:7:59 Node 1]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON'
[Line 8864 Sent Жиырма_Бесінші_Сөз.tagged.txt:17:420 Node 3]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'PRON'
[Line 8934 Sent Жиырма_Бесінші_Сөз.tagged.txt:21:528 Node 2]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'SCONJ'

The first two changes are quite in line with the guidelines:

<quote ud guidelines>
There is a closed subclass of pronominal adverbs that refer to circumstances in context, rather than naming them directly; similarly to pronouns, these can be categorized as interrogative, relative, demonstrati>
</unquote>

I'm less certain about the last one. Another option would've been to tag сөйтсе with UPOS SCONJ and label it as cc, but that also looks dubious.
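Candidates for the rel-upos-advmod message can be listed before running the validator with a check along these lines. It is a simplified sketch: the function name `find_advmod_upos_mismatches` is hypothetical, and accepting only ADV is an assumption based on the message text; the actual validator may accept additional tags. The toy rows are invented for illustration.

```python
def find_advmod_upos_mismatches(sentence_lines, allowed_upos=("ADV",)):
    """List (ID, FORM, UPOS) for advmod tokens whose UPOS is not allowed."""
    mismatches = []
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        cols = line.split("\t")
        # Compare only the universal part of the relation (advmod:emph etc.).
        base_deprel = cols[7].split(":")[0]
        if base_deprel == "advmod" and cols[3] not in allowed_upos:
            mismatches.append((cols[0], cols[1], cols[3]))
    return mismatches


# Toy rows (hypothetical), tab-separated as in real CoNLL-U files.
rows = [
    "\t".join(["1", "осылай", "осылай", "PRON", "prn", "_", "3", "advmod", "_", "_"]),
    "\t".join(["2", "қалай", "қалай", "ADV", "adv", "PronType=Int", "3", "advmod", "_", "_"]),
    "\t".join(["3", "тұрады", "тұр", "VERB", "v", "_", "0", "root", "_", "_"]),
]
print(find_advmod_upos_mismatches(rows))  # prints: [('1', 'осылай', 'PRON')]
```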
…y, fix the following:

[Line 9001 Sent Жиырма_Бесінші_Сөз.tagged.txt:25:614 Node 8]: [L3 Syntax punct-is-nonproj] Punctuation must not be attached non-projectively over nodes [18, 19, 20]
[Line 9160 Sent Жиырма_Бесінші_Сөз.tagged.txt:33:880 Node 1]: [L3 Syntax rel-upos-advmod] 'advmod' should be 'ADV' but it is 'SCONJ'
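The punct-is-nonproj message above lists the nodes that a punctuation attachment crosses. A simplified version of that check, written from the description of projectivity rather than taken from the validator's code, might look like this. The names `is_descendant` and `gap_nodes` and the four-node toy tree are assumptions made for illustration.

```python
def is_descendant(heads, node, ancestor):
    """True if `ancestor` is reached by following head links up from `node`."""
    seen = set()
    while node != 0 and node not in seen:  # `seen` guards against cycles
        seen.add(node)
        node = heads[node]
        if node == ancestor:
            return True
    return False


def gap_nodes(heads, node):
    """Nodes strictly between `node` and its head that the head does not dominate.

    `heads` maps each node ID to its head ID (0 for the root). A non-empty
    result means the arc to `node` is non-projective.
    """
    head = heads[node]
    lo, hi = sorted((node, head))
    return [n for n in range(lo + 1, hi) if not is_descendant(heads, n, head)]


# Toy tree: 2 is the root, 3 depends on 2, 1 depends on 3,
# and the punctuation node 4 is attached to 1, crossing 2 and 3.
heads = {1: 3, 2: 0, 3: 2, 4: 1}
print(gap_nodes(heads, 4))  # prints: [2, 3]

# Re-attaching node 4 to the root (2) makes the arc projective.
heads[4] = 2
print(gap_nodes(heads, 4))  # prints: []
```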