Convert TEI anchors to TEITOK format #139

Open
matyaskopp opened this issue Apr 8, 2021 · 1 comment


@matyaskopp
Member

No description provided.

@matyaskopp
Member Author

Commit 05efd78 uses the following format:

<text>
  <body>
    <div type="debateSection">
      <div>
        <pb source="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001005.htm" n="1" id="ps2013-001-01-003-003.pb1" corresp="#ps2013-001-01-003-003.audio1" />
        <a href="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001005.htm" target="_blank" class="external-link page-link" />
        <media url="2013ps/audio/2013/11/25/2013112514381452.mp3">
          <note type="speaker">Předsedající Miroslava Němcová</note>
          <u who="#MiroslavaNemcova.1952" ana="#chair" id="ps2013-001-01-003-003.u1">
            <seg id="ps2013-001-01-003-003.u1.p1" />
            <seg id="ps2013-001-01-003-003.u1.p2" />
          </u>
          <note type="speaker">Poslanec Martin Kolovratník</note>
          <u who="#MartinKolovratnik.1975" ana="#regular" id="ps2013-001-01-003-003.u2" source="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001005.htm#r2">
            <a href="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001005.htm#r2" target="_blank" class="external-link speech-link" />
            <seg id="ps2013-001-01-003-003.u2.p1" />
            <seg id="ps2013-001-01-003-003.u2.p2" />
            <seg id="ps2013-001-01-003-003.u2.p3" />
            <seg id="ps2013-001-01-003-003.u2.p4" />
            <seg id="ps2013-001-01-003-003.u2.p5" />
            <seg id="ps2013-001-01-003-003.u2.p6" />
            <seg id="ps2013-001-01-003-003.u2.p7" />
            <seg id="ps2013-001-01-003-003.u2.p8" />
            <seg id="ps2013-001-01-003-003.u2.p9" />
            <seg id="ps2013-001-01-003-003.u2.p10" />
          </u>
          <note type="speaker">Předsedající Miroslava Němcová</note>
          <u who="#MiroslavaNemcova.1952" ana="#chair" id="ps2013-001-01-003-003.u3" source="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001005.htm#r3">
            <seg id="ps2013-001-01-003-003.u3.p1" />
            <seg id="ps2013-001-01-003-003.u3.p2" />
          </u>
        </media>
      </div>
      <div>
        <pb source="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001006.htm" n="2" id="ps2013-001-01-003-003.pb2" corresp="#ps2013-001-01-003-003.audio2" />
        <a href="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001006.htm" target="_blank" class="external-link page-link" />
        <media url="2013ps/audio/2013/11/25/2013112514481502.mp3">
          <note type="time"> <!-- LEADING NOTES MOVED BEFORE UTTERANCE -->
            <time when="2013-11-25T14:50:00">(14.50 hodin)</time>
          </note>
          <note type="speaker">Předsedající Miroslava Němcová</note> <!-- SPEAKER NOTE COPIED -->
          <u who="#MiroslavaNemcova.1952" ana="#chair" id="ps2013-001-01-003-003.u3" source="https://www.psp.cz/eknih/2013ps/stenprot/001schuz/s001005.htm#r3"> <!-- DUPLICATE ID -->
            <seg id="ps2013-001-01-003-003.u3.p3" />
            <seg id="ps2013-001-01-003-003.u3.p4" />
            <seg id="ps2013-001-01-003-003.u3.p5" />
            <seg id="ps2013-001-01-003-003.u3.p6" />
            <seg id="ps2013-001-01-003-003.u3.p7" />
            <seg id="ps2013-001-01-003-003.u3.p8" />
            <seg id="ps2013-001-01-003-003.u3.p9" />
            <seg id="ps2013-001-01-003-003.u3.p10" />
            <seg id="ps2013-001-01-003-003.u3.p11" />
          </u>
        </media>
      </div>
    </div>
  </body>
</text>

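Each page has its own <div> and <media>. An utterance that continues across a page break is split into two <u> elements that share the same id (a duplicate id!), its speaker <note> is copied, and any leading <note>s are moved before the utterance.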
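
To make the duplicate-id problem concrete, here is a minimal sketch of how the split utterances could be detected. It is not part of the conversion in 05efd78; the function name `duplicate_utterance_ids` and the shortened sample ids (`doc.u1`, `doc.u3`) are made up for illustration, and the sample only mirrors the structure shown above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened, hypothetical sample mirroring the structure above:
# one <div>/<media> per page, with utterance "doc.u3" split across two pages.
SAMPLE = """<text><body><div type="debateSection">
  <div><media url="page1.mp3">
    <u id="doc.u1"><seg id="doc.u1.p1" /></u>
    <u id="doc.u3"><seg id="doc.u3.p1" /></u>
  </media></div>
  <div><media url="page2.mp3">
    <u id="doc.u3"><seg id="doc.u3.p3" /></u>
  </media></div>
</div></body></text>"""


def duplicate_utterance_ids(xml_text):
    """Return the sorted ids that occur on more than one <u> element."""
    root = ET.fromstring(xml_text)
    # Count every <u> id in document order, ignoring <u> elements without an id.
    counts = Counter(u.get("id") for u in root.iter("u") if u.get("id"))
    return sorted(uid for uid, n in counts.items() if n > 1)


print(duplicate_utterance_ids(SAMPLE))
```

Each id reported this way corresponds to one utterance that was split at a page boundary.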

Each paragraph <seg> contains aligned sentences <s>. The start attribute holds the start time of the first aligned token in the sentence, and the end attribute holds the end time of the last aligned token.

<seg id="ps2013-001-01-003-003.u1.p1">
  <s id="ps2013-001-01-003-003.u1.p1.s1" start="378000.0" end="378110.0">
    <num id="ps2013-001-01-003-003.ne1" category="ni" label="">
      <tok id="ps2013-001-01-003-003.u1.p1.s1.w1" />
    </num>
    <tok id="ps2013-001-01-003-003.u1.p1.s1.w2" />
  </s>
  <s id="ps2013-001-01-003-003.u1.p1.s2" start="378120.0" end="384650.0">
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w1" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w2" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w3" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w4" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w5" />
    <name id="ps2013-001-01-003-003.ne2" type="ORG" category="io" label="">
      <tok id="ps2013-001-01-003-003.u1.p1.s2.w6" />
      <tok id="ps2013-001-01-003-003.u1.p1.s2.w7" />
    </name>
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w8" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w9" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w10" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w11" />
    <tok id="ps2013-001-01-003-003.u1.p1.s2.w12" />
  </s>
</seg>
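
As a rough illustration of how the start/end attributes could be consumed, the sketch below reads the sentence spans from one <seg> and derives the span of the whole paragraph. The helper names (`sentence_spans`, `seg_span`) and the shortened ids are hypothetical. Skipping sentences that lack start or end is an assumption about how unaligned sentences would appear, and no time unit is assumed.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample using the start/end values shown above.
SAMPLE_SEG = """<seg id="doc.u1.p1">
  <s id="doc.u1.p1.s1" start="378000.0" end="378110.0">
    <tok id="doc.u1.p1.s1.w1" />
  </s>
  <s id="doc.u1.p1.s2" start="378120.0" end="384650.0">
    <tok id="doc.u1.p1.s2.w1" />
  </s>
</seg>"""


def sentence_spans(seg_xml):
    """Return (sentence id, start, end) for every <s> carrying both attributes."""
    seg = ET.fromstring(seg_xml)
    spans = []
    for s in seg.iter("s"):
        start, end = s.get("start"), s.get("end")
        if start is None or end is None:
            continue  # assumption: a sentence without both attributes is unaligned
        spans.append((s.get("id"), float(start), float(end)))
    return spans


def seg_span(spans):
    """Return (earliest start, latest end) over the sentences, or None if empty."""
    if not spans:
        return None
    return (min(start for _, start, _ in spans), max(end for _, _, end in spans))


spans = sentence_spans(SAMPLE_SEG)
print(spans)
print(seg_span(spans))
```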
