Merge bidixes into a multidix #12

IlnarSelimcan · 2019-04-03T04:19:40Z

Merge contents of:

apertium-turkic/apertium-kaz-tat.kaz-tat.dix,
apertium-turkic/apertium-kaz-rus.kaz-rus.dix,
apertium-turkic/apertium-eng-kaz.eng-kaz.dix,
apertium-turkic/apertium-kaz-tur.kaz-tur.dix,
apertium-turkic/apertium-tat-rus.tat-rus.dix,
apertium-turkic/apertium-tat-eng.tat-eng.dix, and
apertium-turkic/apertium-tat-bak.tat-bak.dix

into divan.dix. Delete lines which were copied over until bidixes are empty.

Adjust bak.lexc, tat.lexc, kaz.lexc, eng.dix, rus.dix and tur.lexc if necessary as you go.

Generate bidixes from divan.dix. Run a corpus test comparing performance of translators before and after.

Also see:


jonorthwash · 2019-04-03T18:46:47Z

A couple of questions:

1. To what extent can combining these, and checking that the output matches the original dictionaries, be automated?
2. How can the effectiveness of new dixes created with this approach (like eng-tur) be evaluated?
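One way the "output matches the original" check from the first question might be automated is sketched below. It is only an illustration under stated assumptions: a regenerated bidix is taken to "match" when it contains the same set of `<e>` entries as the original, ignoring entry order, whitespace and comments. The function names `entry_set` and `compare_bidixes` are hypothetical and not part of any Apertium tool.

```python
import xml.etree.ElementTree as ET


def entry_set(dix_xml: str) -> set:
    """Return the set of <e> entries in a bidix, each in canonical form."""
    root = ET.fromstring(dix_xml)
    entries = set()
    for entry in root.iter("e"):
        entry.tail = None  # text after </e> is not part of the entry
        raw = ET.tostring(entry, encoding="unicode")
        # Canonical XML sorts attributes; strip_text drops layout whitespace.
        entries.add(ET.canonicalize(raw, strip_text=True))
    return entries


def compare_bidixes(original_xml: str, regenerated_xml: str) -> tuple:
    """Return (entries lost in regeneration, entries that are new)."""
    original = entry_set(original_xml)
    regenerated = entry_set(regenerated_xml)
    return original - regenerated, regenerated - original
```

Two empty sets would mean that, under the assumption above, nothing was lost or invented when the bidix was regenerated from divan.dix. Since comments are ignored here, this does not check whether they survived.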

IlnarSelimcan · 2019-04-03T20:01:33Z

I don't have an automatic solution for this as of yet.

import wordgraph as wg

# Build a Wordgraph from the listed bidixes (the list is truncated here).
WG = wg.bidixes2wordgraph(
    wg.append_leftiso3_rightiso3([
        "apertium-kaz-tat.kaz-tat.dix",
        "apertium-kaz-tur.kaz-tur.dix",
        # ...
    ]))

creates a Wordgraph in seconds, but currently:

it doesn't handle LR/RL restrictions
it skips entries with <re>'s in them
it doesn't preserve comments (and I want to see them preserved in divan.dix next to the entries)
it isn't entirely clear to me how to turn/output a Wordgraph into a multidix like divan.dix

I guess beam search could help...

On a serious note, to start with, I figure that I need a simple command line tool which would merge selected ranges in emacs or vim. Say, selecting this range in the editor:

<e>
   <bak>hargle</bak>
   <tat>bargle</tat>
</e>
<e>
   <tat>bargle</tat>
   <kaz>herp</kaz>
   <rus>derp</rus>
</e>

and running the new merge command on it should replace the contents of that region with:

<e>
   <bak>hargle</bak>
   <tat>bargle</tat>
   <kaz>herp</kaz>
   <rus>derp</rus>
</e>

This would allow me to work on bidixes piece-wise, preserving their internal structure.
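A minimal sketch of what such a merge command could look like, assuming the divan.dix-style layout shown above (one child element per language inside each `<e>`). The function name `merge_entries` is hypothetical. Entries are merged whenever they share a (language, word) pair, which is an assumption about the intended behaviour. XML comments, attributes on `<e>` and LR/RL restrictions are not handled.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def merge_entries(region: str) -> str:
    """Merge <e> entries in `region` that share a (language, word) pair."""
    # Wrap the fragment so that it parses as a single XML document.
    root = ET.fromstring(f"<region>{region}</region>")
    groups = []  # each group is an ordered list of (tag, word) pairs
    for entry in root.findall("e"):
        pairs = [(child.tag, (child.text or "").strip()) for child in entry]
        # Every existing group that shares a pair with this entry.
        overlapping = [g for g in groups if any(p in g for p in pairs)]
        if overlapping:
            # Fold everything into the earliest group to keep its position.
            first = overlapping[0]
            for other in overlapping[1:]:
                first.extend(p for p in other if p not in first)
                groups.remove(other)
            first.extend(p for p in pairs if p not in first)
        else:
            groups.append(pairs)
    lines = []
    for group in groups:
        lines.append("<e>")
        lines.extend(f"   <{tag}>{escape(word)}</{tag}>" for tag, word in group)
        lines.append("</e>")
    return "\n".join(lines)
```

Wrapped in a script that reads standard input and prints the result, this could be run as a filter on a selected region, for example with `:'<,'>!` in vim or `C-u M-|` (`shell-command-on-region`) in emacs.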

By 2: do you mean how can we evaluate eng-tur.dix without having to write apertium-eng-tur and then evaluate it in the usual way?

ftyers · 2019-04-03T20:14:55Z

Suggestion: Instead of <tat> etc. use xml:lang="tat" :)

jonorthwash · 2019-04-04T03:50:06Z

> By 2: do you mean how can we evaluate eng-tur.dix without having to write apertium-eng-tur and then evaluate it in the usual way?

I mean something like "how will we know if the output of a pair newly generated using this approach is of decent quality or not?"

IlnarSelimcan self-assigned this Apr 3, 2019

IlnarSelimcan added a commit that referenced this issue Apr 3, 2019

minor changes for #12

b830cc0
