Greek Language Utilities from the WMU Herculaneum Project.

This package provides a set of utilities for working with Greek text. It is designed to be used in conjunction with the WMU Herculaneum Project, but can be used independently.

Installation

poetry add wmu_greek_utils

Usage

Normalization Options

The Normalizer class provides several options for normalizing Greek text. These options can be combined with the | operator to achieve the desired normalization effect. Below are the available options:

LOWERCASE: Converts all characters to lowercase.
UPPERCASE: Converts all characters to uppercase.
REMOVE_SPACES: Removes all spaces from the text.
REMOVE_NEWLINES: Removes all newline characters from the text.
REMOVE_PUNCTUATION: Removes all punctuation marks from the text.
REMOVE_ACCENTS: Removes all accent marks from the text.
REMOVE_BREATHING: Removes all breathing marks from the text.
IOTA_ADSCRIPT: Converts iota subscript to iota adscript.
NORMALIZE_SIGMA: Normalizes all sigma characters to a single form.
NORMALIZE_THETA: Normalizes all theta characters to a single form.
NORMALIZE_PHI: Normalizes all phi characters to a single form.
NORMALIZE_APOSTROPHE: Normalizes all apostrophe characters to a single form.
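
Combinable options of this kind are commonly modelled as bit flags. The following is a minimal sketch of that idea using Python's enum.Flag. The Options class, the two translation tables, and apply_options are hypothetical names chosen for illustration; they are not the package's own implementation, and only three of the options are modelled.

```python
from enum import Flag, auto


class Options(Flag):
    """Hypothetical stand-in for a set of combinable normalization flags."""
    LOWERCASE = auto()
    NORMALIZE_THETA = auto()
    NORMALIZE_PHI = auto()


# Variant letterforms mapped to their ordinary counterparts.
THETA_TABLE = str.maketrans({"\u03d1": "\u03b8"})  # theta symbol -> theta
PHI_TABLE = str.maketrans({"\u03d5": "\u03c6"})    # phi symbol -> phi


def apply_options(text: str, config: Options) -> str:
    """Apply only the steps whose flag is set in config."""
    if Options.NORMALIZE_THETA in config:
        text = text.translate(THETA_TABLE)
    if Options.NORMALIZE_PHI in config:
        text = text.translate(PHI_TABLE)
    if Options.LOWERCASE in config:
        text = text.lower()
    return text


# Flags are combined with | and tested with "in".
config = Options.LOWERCASE | Options.NORMALIZE_THETA
result = apply_options("\u03d1\u0395\u039f\u039d", config)  # variant theta + "ΕΟΝ"
```

Because each option is a separate flag, a caller can enable theta normalization without lowercasing, or the reverse.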

Example Usage

from wmu_greek_utils.normalize import Normalizer, NormalizationOptions

# Standard normalization is LOWERCASE | NORMALIZE_THETA | NORMALIZE_PHI | NORMALIZE_APOSTROPHE
normalize = Normalizer()
# Note the variant theta (ϑ) in ϑεόν and ϑεὸς
text = "Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν ϑεόν, καὶ ϑεὸς ἦν ὁ Λόγος."
normalized_text = normalize(text)
print(normalized_text)  # Output: "ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν, καὶ θεὸς ἦν ὁ λόγος."

# Create a normalizer with multiple options

from wmu_greek_utils.normalize import UPPERCASE, REMOVE_SPACES, REMOVE_NEWLINES, REMOVE_PUNCTUATION, REMOVE_ACCENTS, REMOVE_BREATHING, IOTA_ADSCRIPT, NORMALIZE_SIGMA, NORMALIZE_THETA, NORMALIZE_PHI, NORMALIZE_APOSTROPHE

radical_normalizer = Normalizer(config=UPPERCASE
        | REMOVE_SPACES
        | REMOVE_NEWLINES
        | REMOVE_PUNCTUATION
        | REMOVE_ACCENTS
        | REMOVE_BREATHING
        | IOTA_ADSCRIPT
        | NORMALIZE_SIGMA
        | NORMALIZE_THETA
        | NORMALIZE_PHI
        | NORMALIZE_APOSTROPHE
)

# The above is equivalent to Normalizer(config=NormalizationOptions.ALL)

normalized_text = radical_normalizer(text)
print(normalized_text)  # Output: "ΕΝΑΡΧΗΙΗΝΟΛΟΓΟϹΚΑΙΟΛΟΓΟϹΗΝΠΡΟϹΤΟΝΘΕΟΝΚΑΙΘΕΟϹΗΝΟΛΟΓΟϹ"
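
To see why accents and breathings can be removed independently, it helps to look at how polytonic Greek decomposes in Unicode. The sketch below uses only the standard library's unicodedata module. The strip_marks helper and the two mark sets are assumptions made for illustration, not the code used by Normalizer.

```python
import unicodedata

# Combining marks, as they appear after canonical decomposition (NFD).
ACCENTS = {"\u0300", "\u0301", "\u0342"}  # grave, acute, perispomeni
BREATHINGS = {"\u0313", "\u0314"}         # smooth (psili), rough (dasia)


def strip_marks(text: str, marks: set) -> str:
    """Decompose, drop the given combining marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in marks)
    return unicodedata.normalize("NFC", kept)


# omicron with acute loses its accent
no_accent = strip_marks("\u03bb\u03cc\u03b3\u03bf\u03c2", ACCENTS)
# eta with smooth breathing and perispomeni keeps only the perispomeni
no_breathing = strip_marks("\u1f26\u03bd", BREATHINGS)
```

Decomposing first means a precomposed letter carrying several marks can lose one of them while keeping the others.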

AGDT morphological parsing

parse_morphology

The parse_morphology function parses an AGDT morphological code, a nine-character string with one character per position, into the value of each position.

Examples:

Parsing a verb morphology code:

>>> parse_morphology("v3sasm---", include_names=False)
['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle', None, None, None]

Parsing a noun morphology code:

>>> parse_morphology("n-s---mn-", include_names=False)
['noun', None, 'singular', None, None, None, 'masculine', 'nominative', None]

Including the position names in the output:

>>> list(parse_morphology("n-s---mn-"))
[('part_of_speech', 'noun'), ('person', None), ('number', 'singular'), ('tense', None), ('mood', None), ('voice', None), ('gender', 'masculine'), ('case', 'nominative'), ('degree', None)]
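
The positional scheme behind these outputs can be sketched as a table lookup per character. This is a hypothetical decoder, not the package's parse_morphology. The decode function and CODES table are illustrative names, and the table deliberately contains only the codes that appear in the examples above.

```python
POSITION_NAMES = [
    "part_of_speech", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
]

# Partial lookup tables: only the codes seen in the examples above.
CODES = {
    "part_of_speech": {"n": "noun", "v": "verb"},
    "person": {"3": "third person"},
    "number": {"s": "singular"},
    "tense": {"a": "aorist"},
    "mood": {"s": "subjunctive"},
    "voice": {"m": "middle"},
    "gender": {"m": "masculine"},
    "case": {"n": "nominative"},
    "degree": {},
}


def decode(code: str) -> list:
    """Return (position name, value) pairs; unknown codes and '-' give None."""
    if len(code) != len(POSITION_NAMES):
        raise ValueError("expected a nine-character morphology code")
    return [(name, CODES[name].get(ch)) for name, ch in zip(POSITION_NAMES, code)]


verb = [value for _, value in decode("v3sasm---")]
```

The same character can mean different things in different positions ("s" is singular in the number position and subjunctive in the mood position), which is why each position needs its own table.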

morphology_string

Given a list of forms, produce the corresponding morphology string on a best-effort basis. Positions that the forms do not specify are filled with '-'.

Examples:

Basic usage with a list of forms:

>>> morphology_string(['noun', 'masculine', 'singular', 'nominative'])
'n-s---mn-'

Usage with a randomized list of forms (in other words, the order of the forms does not matter):

>>> import random
>>> forms = ['noun', 'masculine', 'singular', 'nominative']
>>> random.shuffle(forms)
>>> morphology_string(forms)
'n-s---mn-'

Usage with abbreviated forms:

>>> morphology_string(['masc', 'sing', 'nom', 'n'])
'n-s---mn-'

Usage with a more complex list of forms:

>>> morphology_string(['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle', None, None, None])
'v3sasm---'

Usage with a partial list of forms:

>>> morphology_string(['verb', 'third person', 'singular', 'aorist', 'subjunctive', 'middle'])
'v3sasm---'
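
Order independence follows naturally if each form name identifies both its position and its code character. The following is a minimal sketch under that assumption. The encode function and FORM_TO_SLOT table are hypothetical, cover only the unabbreviated forms used above, and do not handle abbreviations such as 'masc'.

```python
# Each form maps to (0-based position, code character).
# Only the unabbreviated forms from the examples above are listed.
FORM_TO_SLOT = {
    "noun": (0, "n"),
    "verb": (0, "v"),
    "third person": (1, "3"),
    "singular": (2, "s"),
    "aorist": (3, "a"),
    "subjunctive": (4, "s"),
    "middle": (5, "m"),
    "masculine": (6, "m"),
    "nominative": (7, "n"),
}


def encode(forms) -> str:
    """Build a nine-character code; unspecified positions stay '-'."""
    slots = ["-"] * 9
    for form in forms:
        if form is None:  # placeholders for empty positions are skipped
            continue
        position, char = FORM_TO_SLOT[form]
        slots[position] = char
    return "".join(slots)


noun_code = encode(["nominative", "noun", "singular", "masculine"])
```

Since every form writes to its own slot, the input order never affects the result.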

position_to_name

Given a 0-based position, return the name of the position.

>>> position_to_name(0)
'part_of_speech'
>>> position_to_name(8)
'degree'

name_to_position

Given a name, return the 0-based position. Some short or alternate names, such as 'pos', are also accepted.

    >>> name_to_position('part_of_speech')
    0
    >>> name_to_position('pos')
    0
    >>> name_to_position('degree')
    8
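
Both lookups can be sketched with one ordered sequence of names plus an alias table. This is an illustration only. lookup_position, lookup_name, and ALIASES are hypothetical names, and 'pos' is the only alias included because it is the only one shown above.

```python
POSITION_NAMES = (
    "part_of_speech", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
)

# Alternate names resolve to a canonical name first (assumed alias table).
ALIASES = {"pos": "part_of_speech"}


def lookup_position(name: str) -> int:
    """Return the 0-based position; raises ValueError for unknown names."""
    canonical = ALIASES.get(name, name)
    return POSITION_NAMES.index(canonical)


def lookup_name(position: int) -> str:
    """Return the canonical name for a 0-based position."""
    return POSITION_NAMES[position]


round_trip = lookup_name(lookup_position("pos"))
```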

recreate_sentence

Given a list of (word, part of speech) pairs, recreate the sentence, along with the inclusive start and end character positions of each token in the sentence.

words = [
    ("The", "det"),
    ("cat", "noun"),
    (",", "punctuation"),
    ("the", "det"),
    ("dog", "noun"),
    (",", "punctuation"),
    ("and", "conj"),
    ("the", "det"),
    ("frog", "noun"),
    ("sat", "verb"),
    ("on", "prep"),
    ("the", "det"),
    ("mat", "noun"),
    (".", "punctuation"),
]
sentence, poss = agdt.recreate_sentence(words)
assert sentence == "The cat, the dog, and the frog sat on the mat."
assert poss == [
    (0, 2),
    (4, 6),
    (7, 7),
    (9, 11),
    (13, 15),
    (16, 16),
    (18, 20),
    (22, 24),
    (26, 29),
    (31, 33),
    (35, 36),
    (38, 40),
    (42, 44),
    (45, 45),
]
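
The positions above are inclusive character offsets, and punctuation attaches to the preceding word without a space. A minimal sketch that reproduces this behaviour follows. join_tokens is a hypothetical helper, not the package's recreate_sentence, and it assumes that every token tagged "punctuation" attaches to the left, which may not hold for opening quotes or brackets.

```python
def join_tokens(tokens):
    """Join (word, tag) pairs into a sentence and record inclusive spans."""
    sentence = ""
    spans = []
    for word, tag in tokens:
        # Put a space before every token except the first and punctuation.
        if sentence and tag != "punctuation":
            sentence += " "
        start = len(sentence)
        sentence += word
        spans.append((start, len(sentence) - 1))  # inclusive end offset
    return sentence, spans


text, spans = join_tokens([
    ("The", "det"),
    ("cat", "noun"),
    (",", "punctuation"),
    ("sat", "verb"),
    (".", "punctuation"),
])
```

With inclusive spans, a token can be recovered from the sentence as text[start:end + 1].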

Acknowledgements

This package was developed by the WMU Herculaneum Project.

I am grateful for James Tauber's greek_normalisation package, which served as a reference for the normalization options in this package; parts of that package are used directly.
