1 Mezira

Amir Zeldes Dissertation Format

Abstract

This article motivates and details the first implementation of a freely available part of speech tag set and tagger for Coptic. Coptic is the last phase of the Egyptian language family and a descendant of the hieroglyphs of ancient Egypt. Unlike classical Greek and Latin, few resources for digital and computational work have existed for ancient Egyptian language and literature until now. We evaluate our tag set in an inter-annotator agreement experiment and examine some of the difficulties in tagging Coptic data. Using an existing digital lexicon and a small training corpus taken from several genres of literary Sahidic Coptic in the first half of the first millennium, we evaluate the performance of a stochastic tagger applying a fine-grained and coarse-grained set of tags within and outside the domain of literary texts. Our results show that a relatively high accuracy of 94–95% correct automatic tag assignment can be reached for literary texts, with substantially worse performance on documentary papyrus data. We also present some preliminary applications of natural language processing to the study of genre, style, and authorship attribution in Coptic and discuss future directions in applying computational linguistics methods to the analysis of Coptic texts.

This article uses Coptic fonts which are not usually installed in most computers. You may need to install these fonts to view some of the symbols or text used here.

1 Introduction

Despite widespread illiteracy, late antique Egypt increasingly became a land that loved the book. The codex as an information technology device came to prominence in the Roman era. The earliest Christian monasteries in Egypt contributed to book production, book ownership, and book learning (Bagnall, 2009). The rules of the monastery of Pachomius (fourth century) required that all monks learn to read, even against their will:

Whoever enters the monastery uninstructed shall be taught first what he must observe; and when, so taught, he has consented to it all, they shall give him 20 Psalms or two of the Apostle’s epistles, or some other part of the Scripture. And if he is illiterate, he shall go at the first, third and sixth hours to someone who can teach him and who has been appointed for this. He shall stand before him and learn very studiously with all gratitude. Then the fundamentals of a syllable, the verbs, and nouns shall be written for him, and he shall be forced to read, even if he refuses. (Veilleux, 1981, pp. 166–7)

Pachomius’s rules, written in the 300s, were translated into Latin by the church father Jerome, who also created the Latin biblical translation known as the Vulgate. In this form, the rules would go on to influence the monasteries of medieval Europe, where monks imagined the desert fathers and mothers of Egypt to be their ascetic forebears (Boon, 1932; Lefort, 1956).

Pachomius, like many other literate late antique Egyptians, originally composed his letters and rules in Coptic, the last phase of the ancient Egyptian language family. In use during the Roman and early Islamic periods of Egyptian history, it evolved ultimately from the language of the hieroglyphs, and together with them, forms the longest chain of historical documentation of any language in the world.

In the computer age of the 21st century, scholars face a new challenge: deciphering, studying, and documenting the vast library of Coptic texts in digital formats. This means adapting accepted standards, software, and best practices to a domain that has seen little attention from computational quarters. At the same time, advances in computational and corpus linguistics offer great promise for unraveling new ways to study Coptic language and literature. This article aims to contribute to the new wave of Digital Coptic studies by presenting and evaluating a comprehensive part-of-speech (POS)-tagging schema for Sahidic Coptic, the classical dialect of the language, as well as discussing some applications. We begin by outlining the importance of some of the Coptic works that can be accessed using computational methods (Section 2). Section 3 briefly outlines Coptic ‘words’ and smaller morphological units, and Section 4 describes the proposed tag set for evaluation. Section 5 presents an inter-annotator agreement study to determine how well humans can apply the tag set, and Section 6 evaluates tagging performance on test data. Section 7 showcases some applications for tagged text, and Section 8 concludes with lessons from this work and suggestions for future development.

2 Why Tag Coptic?

Coptic emerged during the Roman era when Greek was the ‘lingua franca’ in the Eastern Empire. The alphabet is primarily Greek, though it includes a handful of Egyptian characters, taken from the previous phase of the Egyptian language, known as Demotic. But the language’s structure derives from Egyptian grammar and syntax. Coptic absorbed some Greek vocabulary, as well as to a lesser extent Latin and, later, Arabic loan words. These factors make Coptic a rich environment for studies of culture and language.

Sources in Coptic are pivotal for a range of humanistic disciplines, such as linguistics, biblical studies, history of Christianity, Egyptology, and ancient history. For example, some of the most important extra-canonical Christian texts (such as the Gnostic scriptures) survive in Coptic. Pachomius, his successors, and others among the earliest Christian monks also documented their history, theology, and ways of life in this language. In many cases, the correspondences between Coptic texts, indigenous religious works in adjacent traditions, and translations in the area cannot be studied based on one to one lexical correspondence, but necessitate reference to quantitative studies at more abstract levels, including signature author styles and grammar, which help scholars to study transmission histories, textual re-use, and authorship attribution. A richly annotated Coptic corpus, tagged for POS as well as language of origin for foreign words, enables research into questions of bilingualism, education, and translation practices in multilingual environments as well as fundamental questions about the Coptic language and Egyptian language family. Computational analyses of the corpus may allow us to identify Coptic texts in translation (e.g. original Greek) versus natively authored Coptic sources, preliminarily classify texts into genres, or identify authorship styles. Tagged data may ultimately be able to help scholars understand the large number of untranslated or understudied Coptic texts at an early stage in the digitization, editorial, and research process. The annotated corpus has also recently been used for digital pedagogy in Coptic at the Humboldt University in Berlin.

The realization of the potential in interdisciplinary work on computational methods for Coptic led us to establish Coptic SCRIPTORIUM (http://www.copticscriptorium.org), an open access, open source, and collaborative project on the study of Coptic in the digital age. No open source corpus in the Egyptian language family, tagged for both POS and language of origin, exists aside from the project’s developing body of work, which meant that creating it required the adaptation of standards and software. To produce this corpus and address these research questions, we have developed the first fully implemented POS-tagging schema for the Coptic language (though see below for some previous pioneering work). We next explain how Coptic morphology presents some challenges in applying POS tagging to the correct units of analysis.

3 Language as Lego: Picking apart and Piecing together Coptic

Coptic is an agglutinative language, in which several morphemes can join together to create complex noun and verb forms. Example (1) shows a sentence composed of three such forms, known as ‘bound groups’, in standardized Sahidic Coptic script, together with a transliteration, a glossed form, and a translation. We follow Layton (2011, pp. 12–8) for transliteration conventions. PST marks the past tense auxiliary, 3sgM the third person singular masculine pronoun, and 3sgF the equivalent feminine pronoun:

  • (1)  ⲁϥⲥⲱⲧⲙ̄   ⲉⲣⲟⲥ   ⲛ̄ϭⲓⲡⲣⲱⲙⲉ

  •    a.f.sōtm̩   ero.s   n̩kyi.p.rōme

  •   PST.3sgM.hear   to.3sgF   namely.the.man

  •   ‘He heard her, that is the man’

Though original Coptic literary texts are generally written on manuscripts in scriptio continua, without spaces, there are conventions for transcribing groups of morphemes together, leaving spaces between larger ‘word forms’ or ‘bound groups’. This is motivated both by some phonological considerations (clitics are written together with stressed stems) and by orthographic hints in manuscripts, such as the symbol ‘ ⳿ ’ in the diplomatic text from a manuscript in Example (2):

  • (2) ⲛ̄ⲟⲩϣⲏⲣⲉ⳿   ⲛ̄ⲁⲃⲣⲁϩⲁⲙ⳿

  •    n̩.u.šēre   n̩.Abraham

  •    of.a.son   of.Abraham

  •    ‘of a son of Abraham’

Such symbols often correspond to modern notions of ‘words’ or ‘bound groups’ in Coptic (but not always, see Layton 2011, pp. 19–20). Note that in the manuscript, there is no space between the two large graphemic units. Additionally, lines in a manuscript may break in the middle of a word. In the 19th and 20th centuries, scholars segmented words according to different standards, which means that published editions of Coptic do not show uniformity in the divisions between words. The emerging scholarly standard is now Layton’s, but even he notes that word segmentation in Coptic is a modern convenience. Another commonly used method was established by Till (1960); we follow Layton’s model.

For the purposes of tagging, however, the relevant units are not those delimited by spaces in (1–2), but rather the constituent morphemes, separated by dots in the transliteration and glosses (e.g. a noun like ‘son’, not the sequence ‘of a son’). Therefore, we must first segment the text into such units before submitting it to the tagger. While some texts may already be segmented in this way at digitization, many are not. We therefore prepared a simple segmentation script based on a lexicon lookup using the lexicon described in the next section. While correct segmentation is important for accurate tagging, the issue of segmentation is not within the scope of this article, and we assume correctly segmented text for the discussion below.
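A lexicon lookup of the kind mentioned above can be sketched as follows. This is only a minimal illustration and not the project's actual script, which the article does not describe: the tiny lexicon, the function names `strip_marks` and `segment`, and the longest-match-with-backtracking strategy are all assumptions made for the example.

```python
import unicodedata

# Hypothetical mini-lexicon of morphs (an assumption for illustration only;
# a real lexicon would hold thousands of entries).
LEXICON = {"ⲁ", "ϥ", "ⲥ", "ⲥⲱⲧⲙ", "ⲉⲣⲟ", "ⲛϭⲓ", "ⲡ", "ⲣⲱⲙⲉ"}


def strip_marks(text):
    """Drop combining marks such as the supralinear stroke before lookup."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def segment(bound_group, lexicon=LEXICON):
    """Split one bound group into lexicon morphs, trying longer morphs first.

    Returns a list of morphs, or None if no complete analysis exists.
    """
    text = strip_marks(bound_group)

    def solve(start):
        if start == len(text):
            return []
        # Try the longest candidate first and backtrack on failure.
        for end in range(len(text), start, -1):
            if text[start:end] in lexicon:
                rest = solve(end)
                if rest is not None:
                    return [text[start:end]] + rest
        return None

    return solve(0)


# The three bound groups of Example (1).
for group in ["ⲁϥⲥⲱⲧⲙ\u0304", "ⲉⲣⲟⲥ", "ⲛ\u0304ϭⲓⲡⲣⲱⲙⲉ"]:
    print(group, "->", segment(group))
```

A purely lexicon-driven splitter like this returns the first analysis it finds, so genuinely ambiguous bound groups would still need manual checking or a statistical ranking step.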

4 Tag sets

Designing a POS tag set for Coptic is complicated, since there has been almost no previous work on tagging for the language, and grammatical traditions and terminology vary. A notable exception can be found in Orlandi (2004), which included the development of an electronic full-form lexicon with some useful categories and an attempt at a dedicated rule-based system for Sahidic Coptic, but did not culminate in robust, publicly available tagging software. Our work builds on Orlandi (2004) by reusing the same lexicon, recoded for the tag set described below, but using the freely available TreeTagger (Schmid, 1994), a trainable stochastic tagger, instead of a dedicated rule-based approach.1

Creating Coptic training data for a stochastic tagger is work-intensive. Digitized text in multiple genres (for robustness) is hard to find and usually not normalized, meaning extensive manual work, and texts are often difficult to translate and understand, making tagging more laborious. Thus, in order to achieve high accuracy with comparatively little data, we opted to create a coarse tag set (SCRIPTORIUM Coarse, or SC), making only minimal distinctions, and a more fine-grained tag set (SCRIPTORIUM Fine, SF) that could be less robust. We begin with the coarse tag set.

4.1 Coarse tag set (SC)

SC comprises twenty-two distinct tags, only five of which are open-ended (i.e. allow items not found in the lexicon). The open classes can be seen in Table 1:

Table 1

Open classes in the coarse tag set (SC)

| Tag | Description |
| --- | --- |
| N | Noun |
| V | Verb |
| ADV | Adverb |
| NUM | Numeral (including letter combinations standing for numbers) |
| UNKNOWN | Missing or illegible/unknown words8 |

Assigning an open class to items not in the lexicon is one of the main challenges facing the tagger. However, Coptic syntax often makes it possible to tell nouns from verbs based on syntactic environment alone, and unknown adverbs are rare (primarily loans from Greek, which can also be recognized based on their suffixes).2 Unlisted numerals (large numbers, letter combinations) and unknown items are also rare, the latter also being recognizable by some notations for lacunae (e.g. ⲃ[….]), which can be treated as distinctive ‘affixes’ by the tagger. A greater challenge is posed by disambiguating the closed classes, since many Coptic morphemes are homographs in certain environments (though they are distinguishable in others). The closed classes can be seen in Table 2:

Table 2

Closed classes in the coarse tag set (SC)

| Tag | Description |
| --- | --- |
| A | Auxiliary (any Coptic conjugation base, see also next section) |
| ART | Article |
| C | Converter (several subordinators, e.g. relativizers; cf. Layton 2011, pp. 319–66 and the next section) |
| CONJ | Conjunctions (e.g. ⲁⲩⲱ awō ‘and’, ⲏ ē ‘or’) |
| COP | Copula |
| EXIST | Existential predicates (ⲟⲩⲛ wn/ⲙⲛ mn ‘there is/isn’t’) |
| FUT | Future marker (ⲛⲁ na) |
| IMOD | Inflected modifier (ⲧⲏⲣ- tēr- ‘all of’, ϩⲱⲱ- hō’- ‘also, for one’s part’) |
| NEG | Negations |
| PDEM | Pronoun, demonstrative |
| PINT | Pronoun, interrogative |
| PPER | Pronoun, personal |
| PPOS | Pronoun, possessive |
| PREP | Preposition |
| PTC | Particle (e.g. ⲇⲉ de ‘but’, ⲛϭⲓ nkyi ‘namely’) |
| PUNCT | Punctuation |
| VBD | Verboid (a closed class of suffixally conjugated predicates, e.g. ⲛⲁⲛⲟⲩ- nanu- ‘be good’) |

Disambiguating even the coarse closed classes can be difficult, as some forms can belong to multiple classes. For example, the letter ⲛ n can stand for a preposition (‘of’), an auxiliary (conjunctive, somewhat similar to an English ‘-ing’ form or a Latin ablativus absolutus), a negation, a plural definite article, or a personal pronoun (first person plural, ‘we’). These are not generally difficult for humans to distinguish in context (see Section 5 below), but nevertheless pose a substantial challenge to the tagger. Other distinctions which involve disambiguating different uses of the same morpheme are generally not attempted: the COP tag, for instance, is used for all instances of the predicative nexus marker, e.g. masculine singular ⲡⲉ pe- ‘(it) is’, whether it marks the theme in a nominal sentence (‘it is x’) or serves as a linking marker in a three-part predication (‘x is y’).3

Readers will note that we have not assigned a class of tags for adjectives. The Ancient Egyptian category of the attributive post-nominal adjective is not continued as a productive category in Sahidic Coptic, and is limited to a small class of about six lexemes (Lambdin 1983, p. 57), mostly very rare except for šēm ‘little’, in the expression šēre šēm/še’ere šēm, literally ‘boy little’ and ‘girl little’, but simply lexicalized to mean ‘boy’ and ‘girl’. The productive attributive construction is realized by the combination of a preposition and a noun without an article (see Shisha-Halevy, 1986, pp. 135–9 for word-order variants and discussion). For example:

  •   ⲡⲣⲱⲙⲉ   ⲙⲡⲟⲛⲏⲣⲟⲥ

  •   p.rōme   m.ponēros

  •   the.man   of.evil

  •   ‘the evil man’

Since ponēros may also serve as a noun, we follow Layton and speak of nouns used adjectivally, or note that some nouns may be used with either gender article (so-called ‘genderless nouns’, e.g. ponēros ‘evil one’ may also be feminine in Coptic, but rōme ‘man’ is always masculine), as Layton (2011, p. 90) also notes. We therefore tag all of these cases as ‘N’ uniformly, and regard modification as a syntactic construction. Predicative adjectives of the suffixally conjugating type are treated under the category VBD (suffixally inflecting verboid), as in nanu-f ‘he is good’.

4.2 Fine tag set (SF)

SF comprises forty-four distinct tags, which add to and expand on SC in the following ways. Firstly, an additional open class of proper nouns NPROP is distinguished from common nouns N. This distinction is primarily recognizable for unknown words by checking for the presence of an article, as proper nouns generally do not carry an article. However, this rule is not absolute, as some place names take articles, and at the same time common nouns occasionally occur without articles, especially in generic readings (e.g. ‘man’ to mean mankind, or any man in general).

Secondly, fifteen different auxiliaries are distinguished, which have multiple, partly overlapping spellings but otherwise form closed classes. These can be seen in Table 3:

Table 3

Auxiliary tags in the fine tag set (SF)

| Tag | Name | Example | Approx. translation |
| --- | --- | --- | --- |
| AAOR | Aorist | ϣⲁ | he always/generally does |
| ACAUS | Causative | ⲧⲣⲉ | he causes to do |
| ACOND | Conditional | ⲉⲣϣⲁⲛ | if he does |
| ACONJ | Conjunctive | ⲛⲧⲉ | doing |
| AFUTCONJ | Future Conjunctive | ⲧⲁⲣⲉ | he shall do |
| AJUS | Jussive | ⲙⲁⲣⲉ | let him do |
| ALIM | Limitative | ϣⲁⲛⲧⲉ | until he does |
| ANEGAOR | Negative Aorist | ⲙⲉ | he never does |
| ANEGJUS | Negative Jussive | ⲙⲡⲣⲧⲣⲉ | let him not do |
| ANEGOPT | Negative Optative | ⲛⲛⲉ | may he not do |
| ANEGPST | Negative Past | ⲙⲡⲉ | he did not do |
| ANY | Not Yet | ⲙⲡⲁⲧⲉ | he has not yet done |
| AOPT | Optative | ⲉⲣⲉ | may he do |
| APREC | Precursive | ⲛⲧⲉⲣⲉ | after he does |
| APST | Past | ⲁ | he did |

The remaining added tags specify subtypes of verbs, personal pronouns, and the aforementioned converters. Verbs distinguish morphological imperative (VIMP) and stative forms (VSTAT), where they are distinguishable. The former exist for only a handful of verbs (e.g. ⲁⲣⲓ ari ‘do’, ⲁϫⲓ ači ‘say’), and no attempt is made to tag other verbs used in the imperative as VIMP (cf. Schiller et al., 1999 for a similar decision in the standard tag set for German, STTS). The latter exist for most verbs and signify a state in the case of intransitive verbs (e.g. ϩⲟⲗϭ holky ‘be sweet’) or a passive for transitive verbs (ⲕⲏⲧ kēt ‘be built’).

For pronouns, subject, object, and independent forms are distinguished as PPERS, PPERO, and PPERI, respectively. The latter are used for emphatic purposes (‘As for me, I…’) or in nominal sentences (‘It is I’). Converters (the tag C in the coarse set) include: CREL for the relative converter (‘which’), CCIRC for the circumstantial (‘while’),4 CFOC for the focalizing converter (‘it is X!’, see below), and the preterit conversion CPRET, which signifies an anterior past (imperfect and pluperfect readings, depending on tenses it combines with). Though they have rather different semantics, the converters share morphosyntactic characteristics, including partly identical forms depending on their environment, an initial position before fully inflected sentences (which they ‘convert’), and fusional morphology together with adjacent pronouns.
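The relation between the two tag sets can be made concrete with a small mapping from fine to coarse tags. The sketch below is an illustration rather than the project's own code: the names `FINE_TO_COARSE` and `to_coarse` are invented here, the coarse labels `A` for auxiliaries and `V` for verbs are inferred from the fine tag names, and it is an assumption that the plain tags A, C, and PPER are fully replaced by their subtypes in SF. Under that assumption the counts come out to the twenty-two and forty-four tags stated in the text, not counting portmanteau tags.

```python
# Coarse tags (SC): 5 open classes and 17 closed classes.
SC_OPEN = ["N", "V", "ADV", "NUM", "UNKNOWN"]
SC_CLOSED = ["A", "ART", "C", "CONJ", "COP", "EXIST", "FUT", "IMOD", "NEG",
             "PDEM", "PINT", "PPER", "PPOS", "PREP", "PTC", "PUNCT", "VBD"]
SC_TAGS = SC_OPEN + SC_CLOSED

# Fine tags (SF) mapped to the coarse tag they refine.
FINE_TO_COARSE = {
    "NPROP": "N",
    "VIMP": "V", "VSTAT": "V",
    "PPERS": "PPER", "PPERO": "PPER", "PPERI": "PPER",
    "CREL": "C", "CCIRC": "C", "CFOC": "C", "CPRET": "C",
}
AUXILIARIES = ["AAOR", "ACAUS", "ACOND", "ACONJ", "AFUTCONJ", "AJUS", "ALIM",
               "ANEGAOR", "ANEGJUS", "ANEGOPT", "ANEGPST", "ANY", "AOPT",
               "APREC", "APST"]
FINE_TO_COARSE.update({aux: "A" for aux in AUXILIARIES})

# Assumption: these coarse tags do not occur as plain tags in SF.
REPLACED_IN_SF = {"A", "C", "PPER"}
SF_TAGS = [t for t in SC_TAGS if t not in REPLACED_IN_SF] + list(FINE_TO_COARSE)


def to_coarse(fine_tag):
    """Collapse an SF tag to its SC counterpart (shared tags map to themselves)."""
    return FINE_TO_COARSE.get(fine_tag, fine_tag)


print(len(SC_TAGS), len(SF_TAGS))
print([to_coarse(t) for t in ["CFOC", "PPERS", "V", "NPROP", "PREP"]])
```

An explicit table is used instead of a prefix rule because tags such as ADV, ART, CONJ, and COP share initial letters with the auxiliary and converter families.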

Thus, the primary differences between the fine- and coarse-grained tag sets revolve around more detailed distinctions in the closed classes, as well as the addition of proper names. How challenging the decision between closed classes can be is best illustrated by the form ⲉ e, which can have as many as five different tags in SF:

  • PREP: a preposition meaning ‘to’

  • CREL: a form of the relative converter in some environments ‘(…) which’

  • CCIRC: a form of the circumstantial converter ‘(…) while’

  • CFOC: a form of the focalizing converter ‘it’s that (…)’, stressing some element in the following sentence

  • PPERO: an object pronoun (second person feminine singular)

In some cases, especially when the text is fragmentary, even a human annotator cannot disambiguate these with absolute certainty, as in the following example, for which the preceding context is lost:

  • (3) ⲉϥϯⲙ̄ⲧⲟⲛ    ⲇⲉ ⲟⲛ  ⲛ̄ⲧⲙⲁⲁⲩ    ⲛ̄ⲧⲁⲥϫⲡⲟϥ

  • e.f.ti.mton    de on  n.t.ma’u    nt.a.s.čpo.f

  • ?.3sgM.give.rest but still of.the.mother that.PST.3sgF.bore.3sgM

The first e- in the sentence is definitely a converter, but in this environment, three converters share the same form, and a translation with any of the three is possible:

  • Relative5: ‘which however still gives rest to the mother that bore him’

  • Circumstantial: ‘while he still however gives rest…’

  • Focalizing: ‘But it is TO THE MOTHER WHO BORE HIM that he gives rest!’

These ambiguities are amongst the most difficult for the tagger, as are the distinctions between other homographs, such as the plural article n-, the pronoun -n (first person plural), the negation n-, and the preposition n- ‘of’. The converters in particular are also a major source of disagreement between human annotators (see Section 5).

4.3 Portmanteau tags

In some comparatively infrequent cases, a single orthographic form can contain two categories. For example, the verb ⲉⲓⲛⲉ eine ‘bring’ takes the form ⲛⲧ nt- before personal pronoun objects (e.g. ⲛⲧϥ nt.f ‘bring him’). However, if it takes the first person object form ⲧ -t ‘me’, then this is not written separately, leading to a plain ⲛⲧ nt ‘bring me’. In these cases, we assign a portmanteau tag consisting of both relevant categories separated by an underscore: V_PPERO (a verb and its object in one; cf. Schiller et al., 1999 for a similar decision for German).

The same can occur in many forms of the second person singular feminine subject, which is often realized as a ‘zero’, as in the case of the preterit conversion:

  • (4) ⲛⲉⲣⲉ-ⲡⲣⲱⲙⲉ ⲥⲱⲧⲙ̄   ⲛⲉⲕⲥⲱⲧⲙ̄   ⲛⲉⲣⲉⲥⲱⲧⲙ̄

  • nere.p.rōme sōtm   ne.k.sōtm   nere.sōtm

  • CPRET.the.man hear   CPRET.2sgM.hear   CPRET+2sgF.hear

  • ‘the man used to hear’   ‘you (m.) used to hear’   ‘you (f.) used to hear’

In the third case in (4), the converter takes the same form as in the first case (nere), but there is no overt realization of the word ‘you (fem.)’. For a masculine second person subject, the converter is ne, and the word ‘you (masc.)’ is realized as k. Thus, the tag for nere ‘you used to (fem.)’ is CPRET_PPERS, a converter form which also contains a personal pronoun marking.

We have so far assigned twelve combination tags (mostly second person singular feminine subjects connected to various auxiliaries), but these form only fifty-seven tokens within our test corpus of over 12,000 tokens, i.e. less than 0.4%.

5 Inter-annotator agreement

Automatic POS tagging is only useful, and can only be evaluated for accuracy, if human annotators can agree on the ‘gold standard’ tag for every word (or more realistically for most words) in a text. We therefore conducted a small experiment to evaluate our tag set. Both authors independently annotated the same two ‘subcorpora’ using the maximally granular SF tag set. The data were taken from two different texts in order to give a first indication whether agreement robustness might be affected by text type or genre. We selected a section from the letter Abraham Our Father by the classical monastic author Shenoute and a collection of short narrative anecdotes from the Sayings of the Desert Fathers (known by the Greek title Apophthegmata Patrum), which both have good orthography and few lacunae, but have rather different styles. The two text types contained 906 and 576 Coptic morphs, respectively. Our agreement on tagging is presented in Table 4.

Table 4

Percentage of agreement between two annotators by text

| Text | Identical SF | Identical SC |
| --- | --- | --- |
| Abraham Our Father | 854/906 (94.26%) | 872/906 (96.24%) |
| Apophthegmata Patrum | 542/576 (94.09%) | 553/576 (96.01%) |
| Total | 1,396/1,482 (94.19%) | 1,425/1,482 (96.15%) |

The figures in Table 4 are quite positive, with absolute agreement of around 94% for the fine-grained tag set and 96% for the coarse-grained one. However, these figures do not tell us how likely such agreement is to arise by chance (e.g. if most words are nouns, it is easier to just guess that something is a noun whenever in doubt). For this reason, the Kappa metric is commonly used to evaluate annotation schemes, since it takes into account how easily agreement can be reached by chance. Kappa ranges from 1 (perfect agreement) to 0 (agreement no better than chance, which is not the same as zero agreement). It is computed from the sum of squares of the number of annotators voting for a certain decision in each case (here, the POS for a single word). When all annotators agree, the squared value is maximal, but disagreement leads to a sum of lower squares. The weight of the decision for each possible category is proportional to the frequency with which it is assigned, meaning that the assignment of a frequent category is considered less surprising, or more likely to occur by chance. For our experiment, we calculated a Kappa value of 0.9396 for SF and 0.9569 for SC (93.96 and 95.69 when expressed as percentages), which can be considered very high (see Artstein and Poesio, 2008 for more details).
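The sum-of-squares computation described above corresponds to what is usually called Fleiss' kappa. The article does not name the exact variant it used, so the following is a sketch under that assumption, with invented toy annotations rather than the study's data.

```python
from collections import Counter


def fleiss_kappa(annotations):
    """Fleiss' kappa for a list of items, each a list with one label per annotator.

    Undefined (division by zero) if only a single category is ever used.
    """
    n_items = len(annotations)
    n_raters = len(annotations[0])
    category_totals = Counter()
    observed = 0.0
    for labels in annotations:
        counts = Counter(labels)
        category_totals.update(counts)
        # Per-item agreement from the sum of squared vote counts.
        squares = sum(c * c for c in counts.values())
        observed += (squares - n_raters) / (n_raters * (n_raters - 1))
    observed /= n_items
    # Chance agreement: squared overall share of each category.
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    return (observed - expected) / (1 - expected)


# Toy data: two annotators, four morphs, one disagreement (N vs. V).
annotator_a = ["N", "V", "N", "PREP"]
annotator_b = ["N", "V", "V", "PREP"]
items = [list(pair) for pair in zip(annotator_a, annotator_b)]
print(round(fleiss_kappa(items), 3))  # 0.619
```

In the toy data, raw agreement is 75%, but kappa is only about 0.62 because with three categories a sizeable share of that agreement could arise by chance. This is the effect the metric is meant to capture.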

The primary disagreements occurred in telling apart the open classes of nouns and verbs, and disambiguating closed classes, particularly the converters. Figure 1 shows the most-frequent confusion categories in SF.

The confusion of nouns and verbs may seem surprising given the linear, agglutinative nature of Coptic grammar. However, cases of confusion primarily arose in the context of nominalized verbs, as in Example (5). The morph under disagreement in this example is the verb ti ‘give’, which is part of the complex expression ti-tōn ‘quarrel (lit. give quarrel)’. The entire combination ti-tōn has been nominalized in the presence of an indefinite article u, so that the second bound group in (5) literally reads ‘in a give quarrel’, roughly meaning ‘argumentatively’ (or ‘in argument, while arguing’). The morph ti is morphologically a verb, but syntactically converted to a noun, which leads to disagreement. This type of issue can probably be resolved by refining guidelines.

  • (5) ⲥⲉⲁⲡⲟⲧⲁⲥⲥⲉ   ϩⲛⲟⲩϯⲧⲱⲛ

  •   n̩.se.apotasse   hn̩.u.ti.tōn

  •   and.they.renounce  in.a.give.quarrel

  •   ‘and they renounce (them) argumentatively’

A different class of problem occurs primarily in disagreements about converters, which can stem from subtle translation differences, as in (6). In this case, it is difficult to make a certain decision about the converter e (glossed C) in (6). Coptic relative clauses modifying an indefinite noun take the same converter form e as circumstantial clauses meaning roughly ‘while’. Therefore, in (6), the text could mean that there was a man who was in the habit of carrying a reed mat (relative), or that there was a man there carrying a reed mat at that point in time (circumstantial). Ambiguities like this are not likely to be resolved completely consistently by human annotators, and an automatic tagger is likely to vary as well, though generally preferring the option that is more frequent in training data.

  • (6) ⲛⲉⲟⲩⲛ   ⲟⲩϩⲗⲗⲟ ϩⲛⲛⲣⲓ … ⲉϥⲫⲟⲣⲉⲓ   ⲛⲟⲩⲧⲙⲏ

  • ne.wn̩    u.hl̩lo  hn̩.n̩.ri    e.f.phori   n̩.u.tmē

  • PRET.was an.old  in.the.cells  C.3sgM.carry  ACC.a.mat

  • ‘There was an old man in Kellia … (who carried/carrying) a reed mat’

6 Automated Tagging Accuracy

To train the tagger and evaluate accuracy, we tagged the texts in Table 5. Texts were selected for scholarly interest (linguistic and philological), and in order to offer a breadth of genres in literary Coptic, including religious discourse, letters, and Biblical and non-Biblical narrative (see Section 7 for more details on the texts and authors).

Table 5

Breakdown of texts used in the gold standard training corpus

| Text | Morphs |
| --- | --- |
| Shenoute / Abraham Our Father | 2,061 |
| Shenoute / Acephalous 22 | 229 |
| Shenoute / Not Because a Fox Barks | 1,767 |
| Besa / Letter to Aphthonia | 1,123 |
| Besa / Letter to Thieving Nuns | 785 |
| New Testament / Mark 1 | 1,229 |
| Apophthegmata Patrum (11 texts) | 1,388 |
| Artificial sentences | 62 |
| Total | 8,582 |

The inclusion of some artificial sentences at the bottom of the table was motivated by the need to generate examples for the tagger of some particularly infrequent combinations not otherwise attested in the corpus, in particular cases of portmanteau tags (Section 4.3) which we had foreseen based on combinatoric possibilities in Coptic grammar, e.g. possible second person singular feminine forms that were not attested in our corpus. This need was minimized, particularly in the context of rare 2sgF forms, by including Besa’s Letter to Aphthonia (fifth century), which addresses a nun in the second person.

To evaluate the tagger’s performance, we take a ten-fold cross-validation approach, dividing the data into ten portions of which each portion is held out once as test data, while the remaining nine are used as training data (excluding the artificial sentences, which are never in the test set). The training data were fed to the freely available and trainable TreeTagger (Schmid, 1994), which was also given a list of the open and closed tags and a POS-tagged lexicon containing 5,265 entries (derived from CMCL’s database mentioned above). The best model used trigram context (looking at probabilities in sequences of three morphs). Table 6 gives the accuracy per slice as well as the percentage of unknown words encountered by the tagger, which were missing from the lexicon.
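The splitting scheme described above can be sketched as follows. The function name `cross_validation_splits`, the placeholder sentences, and the round-robin assignment of sentences to slices are assumptions made for illustration, since the article does not say how its ten slices were cut. Training and running the tagger itself are left out.

```python
def cross_validation_splits(gold_sentences, artificial_sentences, k=10):
    """Yield k (train, test) pairs of sentence lists.

    Each gold sentence is held out exactly once; the artificial sentences
    are appended to every training set and never appear in a test set.
    """
    # Round-robin assignment to folds: an assumption, not the article's method.
    folds = [gold_sentences[i::k] for i in range(k)]
    for held_out, test in enumerate(folds):
        train = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        yield train + list(artificial_sentences), test


gold = [f"gold-{i}" for i in range(25)]        # placeholder sentences
artificial = ["artificial-0", "artificial-1"]  # placeholder sentences
for number, (train, test) in enumerate(cross_validation_splits(gold, artificial), 1):
    print(f"slice {number}: {len(train)} training, {len(test)} test sentences")
```

In a real run, each training set would be written out in the tagger's training format and the resulting model applied to the held-out slice.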

Table 6

Tagger accuracy in ten-fold cross validation

| Slice | % correct SF | % correct SC | % out of lexicon |
| --- | --- | --- | --- |
| 1 | 89.66 | 91.10 | 0.46 |
| 2 | 95.16 | 96.82 | 0.00 |
| 3 | 95.07 | 95.28 | 0.96 |
| 4 | 94.03 | 95.76 | 1.65 |
| 5 | 92.92 | 96.55 | 2.85 |
| 6 | 94.96 | 96.90 | 1.76 |
| 7 | 94.93 | 93.62 | 1.38 |
| 8 | 96.26 | 96.48 | 0.70 |
| 9 | 94.17 | 93.64 | 2.74 |
| 10 | 94.44 | 95.00 | 3.67 |
| Average | 94.16 | 95.12 | 1.62 |

A relatively high total accuracy of over 94% for SF and 95% for SC was reached, meaning that on average, about every 20th morph receives an incorrect tag (slightly more often for SF). While climbing even fractions of a percent higher will become exponentially more difficult, it should be noted that these figures are not very far below tagging performance for languages with much larger training sets. This is due in large part to the high coverage of the lexicon: on average, only 1.62% of morphs in the test texts had no lexicon information, meaning that at least for open classes, the tagger could usually rely on dictionary information to establish whether a word was known to be a noun or a verb. Most tagging errors were due to incorrect disambiguation, which was also the major cause of human disagreement in Section 5. For example, the top five tagging errors, making up 16.7% of all errors, all involved confusions among the possible tags for the morph ⲉ e, which has five different readings (cf. Section 4.2).
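Accuracy figures and ranked error types of the kind reported here can be derived from aligned gold and predicted tag sequences. The following sketch uses invented toy sequences, and the function name `evaluate` is hypothetical. It illustrates the bookkeeping only, not the evaluation script used for the article.

```python
from collections import Counter


def evaluate(gold_tags, predicted_tags):
    """Return accuracy and a Counter of (gold, predicted) error pairs."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    errors = Counter((g, p) for g, p in zip(gold_tags, predicted_tags) if g != p)
    correct = len(gold_tags) - sum(errors.values())
    return correct / len(gold_tags), errors


# Toy sequences (not real tagger output): two of nine morphs are mistagged.
gold = ["CCIRC", "PPERS", "V", "PREP", "ART", "N", "CREL", "PPERS", "V"]
pred = ["CREL", "PPERS", "V", "PREP", "ART", "N", "CREL", "PPERS", "N"]
accuracy, errors = evaluate(gold, pred)
print(f"accuracy: {accuracy:.2%}")
for (gold_tag, predicted_tag), count in errors.most_common(5):
    print(f"{gold_tag} tagged as {predicted_tag}: {count}")
```

Keeping the error pairs directional (gold tag first, predicted tag second) makes it possible to see whether, for instance, circumstantial converters are mistaken for relative ones more often than the reverse.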

These results suggest that the tagger may be vulnerable in texts with a higher proportion of out-of-data vocabulary, and possibly also in different genres. While we cannot offer a full exploration of this issue in this article, we give a toy evaluation on a much more ‘unruly’ text type, documentary papyri. We tested both models on two documentary papyri taken from papyri.info, together comprising 137 tokens. For SF, tagging accuracy degraded to 80.29%, while SC remained more robust at 87.59%. Of the 137 tokens, twenty-three were not found in the lexicon (16.78%). The difference between the two models’ performance is due in large part (but not only) to the distinction between proper and common nouns, as proper nouns are often out-of-lexicon items and difficult to distinguish from common nouns. At the same time, it is highly likely that the worse performance on papyrus data is due not only to out-of-data items and the frequency of proper nouns, but also to the fact that the tagger has been trained on completely different text types and language domains, all coming from literary Coptic. We therefore feel that there is room for much work on expanding the domains and text types on which the tagger is trained, as well as on obtaining more lexicon data, including lists of proper names and toponyms.
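Because the papyrus sample is so small, it can help to translate the percentages back into token counts. The snippet below is a back-of-the-envelope sketch: the counts of correctly tagged tokens are inferred from the reported percentages, not taken from the article, and the helper names are hypothetical.

```python
TOTAL_TOKENS = 137   # reported size of the two papyri combined
OUT_OF_LEXICON = 23  # reported number of tokens missing from the lexicon


def percentage(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100.0 * count / total


def implied_correct(accuracy_pct, total):
    """Whole number of correct tokens implied by a reported accuracy.

    This is an inference from the rounded percentage, not a reported count.
    """
    return round(accuracy_pct * total / 100.0)


sf_correct = implied_correct(80.29, TOTAL_TOKENS)
sc_correct = implied_correct(87.59, TOTAL_TOKENS)
print("SF correct tokens (inferred):", sf_correct)
print("SC correct tokens (inferred):", sc_correct)
print("tokens separating the two models:", sc_correct - sf_correct)
print(f"out-of-lexicon share: {percentage(OUT_OF_LEXICON, TOTAL_TOKENS):.2f}%")
```

On this reading the gap between the two models amounts to about ten tokens, which underlines how tentative a comparison on 137 tokens is.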

7 Applications in Research and Pedagogy

A Coptic corpus tagged for POS can enable research projects in a variety of disciplines. Coptic literature (hagiography, sermons, epistles) can be analyzed for knowledge about the rhetorical structure of diverse texts. Traditional scholarship on genre and literary formulae (e.g. Choat, 2007 on epistolary formulae) can be enhanced by the ability to query and analyze large corpora in terms of grammar and syntax as well as vocabulary. A statistical analysis of a corpus that spans several centuries of Coptic history can yield information about the evolution of the language over time, especially as Arabic enters Egypt.

We present here some preliminary research on genre and style. These results come from a subset of the corpora used in Section 6, but they illustrate the potential for research with a larger corpus. The documents include the letter Abraham Our Father by the monastic leader Shenoute writing in the late 300s or early 400s, portions of an untitled fragmentary text by Shenoute (Acephalous Work 22), two letters by Shenoute’s successor Besa, a selection of sayings from the Coptic Sayings of the Desert Fathers (the Apophthegmata Patrum), and the first six chapters from the Coptic Gospel of Mark.

Table 7 shows the frequencies for the most prevalent POS in each of the works above as deviations from the expected norm using a chi-square test on a contingency table tabulating the frequencies of each tag against the different corpora.6
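The general shape of such an analysis can be sketched as follows. This is a minimal illustration, not the authors' procedure: the counts in the usage example are invented toy numbers, the function names are hypothetical, and Pearson residuals are used as one common way of expressing deviation from expected frequencies (the article does not state which measure Table 7 reports).

```python
from math import sqrt


def expected_counts(table):
    """Expected cell counts under independence of rows (tags) and columns (works)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


def chi_square(table):
    """Pearson chi-square statistic summed over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def pearson_residuals(table):
    """Signed deviation of each cell from expectation: (observed - expected) / sqrt(expected)."""
    expected = expected_counts(table)
    return [
        [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


# Hypothetical toy counts: rows are tags, columns are works.
toy_counts = [
    [120, 90, 60],   # tag A
    [80, 110, 40],   # tag B
    [30, 20, 50],    # tag C
]
print("chi-square:", round(chi_square(toy_counts), 2))
for row in pearson_residuals(toy_counts):
    print([round(r, 2) for r in row])
```

Positive residuals mark tags that are more frequent in a work than the overall distribution would predict, and negative residuals mark tags that are less frequent.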
