Merge bidixes into a multidix #12

Open
IlnarSelimcan opened this issue Apr 3, 2019 · 4 comments

@IlnarSelimcan
Member

IlnarSelimcan commented Apr 3, 2019

Merge contents of:

  1. apertium-turkic/apertium-kaz-tat.kaz-tat.dix,
  2. apertium-turkic/apertium-kaz-rus.kaz-rus.dix,
  3. apertium-turkic/apertium-eng-kaz.eng-kaz.dix,
  4. apertium-turkic/apertium-kaz-tur.kaz-tur.dix,
  5. apertium-turkic/apertium-tat-rus.tat-rus.dix,
  6. apertium-turkic/apertium-tat-eng.tat-eng.dix, and
  7. apertium-turkic/apertium-tat-bak.tat-bak.dix

into divan.dix. Delete lines which were copied over until bidixes are empty.

Adjust bak.lexc, tat.lexc, kaz.lexc, eng.dix, rus.dix and tur.lexc if necessary as you go.

Generate bidixes from divan.dix. Run a corpus test comparing performance of translators before and after.

Also see:

IlnarSelimcan self-assigned this Apr 3, 2019
IlnarSelimcan added a commit that referenced this issue Apr 3, 2019
@jonorthwash
Member

A couple questions:

  1. To what extent can merging these, and checking that the output matches the original dictionaries, be automated?
  2. How can the effectiveness of new dixes that could be created with this approach (like eng-tur) be evaluated?
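
On the checking half of question 1, here is a rough sketch of what an automated comparison could look like. It assumes the usual Apertium bidix layout (`<e>` entries holding a `<p>` pair with `<l>` and `<r>` sides, `<s n="..."/>` symbols, and an optional `r="LR"`/`r="RL"` restriction). The function names `side_to_string`, `bidix_pairs` and `compare` are made up for illustration, and the two toy dictionaries contain placeholder words only, not real entries.

```python
import xml.etree.ElementTree as ET


def side_to_string(elem):
    """Flatten an <l> or <r> element into text such as 'herp<n>'."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "s":        # symbol, e.g. <s n="n"/>
            parts.append("<" + child.get("n", "") + ">")
        elif child.tag == "b":      # blank between words of a multiword
            parts.append(" ")
        else:                       # anything else (e.g. <g>): keep its contents
            parts.append(side_to_string(child))
        parts.append(child.tail or "")
    return "".join(parts)


def bidix_pairs(dix_xml):
    """Return the set of (restriction, left, right) triples in a bidix."""
    root = ET.fromstring(dix_xml)
    pairs = set()
    for entry in root.iter("e"):
        pair = entry.find("p")
        if pair is None:            # entries without <p> (e.g. <i>, <re>) are skipped
            continue
        left, right = pair.find("l"), pair.find("r")
        if left is None or right is None:
            continue
        pairs.add((entry.get("r", ""), side_to_string(left), side_to_string(right)))
    return pairs


def compare(original_xml, generated_xml):
    """Return (entries lost in the generated bidix, entries it gained)."""
    original = bidix_pairs(original_xml)
    generated = bidix_pairs(generated_xml)
    return original - generated, generated - original


# Toy data with placeholder words; not taken from any real dictionary.
ORIGINAL = """<dictionary><section id="main" type="standard">
<e><p><l>herp<s n="n"/></l><r>bargle<s n="n"/></r></p></e>
<e r="LR"><p><l>derp<s n="n"/></l><r>bargle<s n="n"/></r></p></e>
</section></dictionary>"""

GENERATED = """<dictionary><section id="main" type="standard">
<e><p><l>herp<s n="n"/></l><r>bargle<s n="n"/></r></p></e>
</section></dictionary>"""

missing, extra = compare(ORIGINAL, GENERATED)
print("missing from generated:", missing)
print("new in generated:", extra)
```

Comparing sets ignores entry order and comments, so it only answers "is every translation pair still there", not "is the file identical". Entries without a `<p>` pair are skipped in this sketch and would need separate handling.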

@IlnarSelimcan
Member Author

IlnarSelimcan commented Apr 3, 2019

  1. I don't have an automatic solution for this yet. The following:

import wordgraph as wg

# Build a Wordgraph from the bidixes (the rest of the file list is elided)
WG = wg.bidixes2wordgraph(
    wg.append_leftiso3_rightiso3([
        "apertium-kaz-tat.kaz-tat.dix",
        "apertium-kaz-tur.kaz-tur.dix",
        ...
    ])
)

creates a Wordgraph in seconds, but currently:

  • it doesn't handle LR/RL restrictions
  • it skips entries that contain <re> elements
  • it doesn't preserve comments (and I want to see them preserved in divan.dix next to the entries)
  • it isn't entirely clear to me how to turn a Wordgraph into a multidix like divan.dix

I guess beam search could help...

On a serious note, to start with, I figure that I need a simple command line tool which would merge selected ranges in emacs or vim. Say, selecting this range in the editor:

<e>
   <bak>hargle</bak>
   <tat>bargle</tat>
</e>
<e>
   <tat>bargle</tat>
   <kaz>herp</kaz>
   <rus>derp</rus>
</e>

and running the new merge command on it should replace the contents of that region with:

<e>
   <bak>hargle</bak>
   <tat>bargle</tat>
   <kaz>herp</kaz>
   <rus>derp</rus>
</e>

This would allow me to work on bidixes piece-wise, preserving their internal structure.
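
A minimal sketch of such a merge step, assuming the entry format from the example above (one child element per language inside each `<e>`) and assuming that everything in the selected region is meant to end up in a single entry. The function name `merge_entries` is made up. Note that the standard-library parser used here drops XML comments, so this version does not yet preserve them.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def merge_entries(fragment):
    """Merge every <e> element in `fragment` into one <e> element."""
    # The selected region has no single root element, so wrap it before parsing.
    root = ET.fromstring("<selection>" + fragment + "</selection>")
    seen = set()
    merged = []  # (language tag, word) pairs in order of first appearance
    for entry in root.iter("e"):
        for word in entry:
            key = (word.tag, (word.text or "").strip())
            if key not in seen:  # drop exact duplicates such as a repeated <tat>
                seen.add(key)
                merged.append(key)
    lines = ["<e>"]
    lines += ["   <{0}>{1}</{0}>".format(tag, escape(text)) for tag, text in merged]
    lines.append("</e>")
    return "\n".join(lines)


SELECTION = """<e>
   <bak>hargle</bak>
   <tat>bargle</tat>
</e>
<e>
   <tat>bargle</tat>
   <kaz>herp</kaz>
   <rus>derp</rus>
</e>"""

print(merge_entries(SELECTION))
```

Wrapped in a script that reads stdin and writes stdout, this could be called from vim as a range filter (`:'<,'>!`) or from Emacs with `shell-command-on-region` and a prefix argument, both of which replace the selected text with the command's output. If two entries in the selection give different words for the same language, this sketch keeps both rather than choosing one.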

By 2: do you mean how can we evaluate eng-tur.dix without having to write apertium-eng-tur and then evaluate it in the usual way?

@ftyers
Member

ftyers commented Apr 3, 2019

Suggestion: Instead of <tat> etc. use xml:lang="tat" :)
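
For illustration, a small sketch of how entries tagged this way could be read with Python's standard library. The element name `<w>` is only an assumption, since the suggestion does not say which element would carry the attribute. The `xml:` prefix is predeclared by the XML specification, so the fragment needs no namespace declaration, and ElementTree exposes the attribute under the XML namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the xml:lang attribute (xml: is a reserved prefix).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry layout: <w> is an assumed element name.
ENTRY = """<e>
   <w xml:lang="bak">hargle</w>
   <w xml:lang="tat">bargle</w>
</e>"""


def words_with_language(entry_xml):
    """Return (language code, word) pairs for one entry, in document order."""
    entry = ET.fromstring(entry_xml)
    return [(w.get(XML_LANG), w.text) for w in entry.iter("w")]


print(words_with_language(ENTRY))
```

One practical upside of the attribute form is that adding a language does not add a new element name, so generic code like the above keeps working unchanged.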

@jonorthwash
Member

By 2: do you mean how can we evaluate eng-tur.dix without having to write apertium-eng-tur and then evaluate it in the usual way?

I mean something like "how will we know if the output of a pair newly generated using this approach is of decent quality or not?"
