Morphosyntactic annotation of the De Latinae Linguae Reparatione by Sabellicus in the Universal Dependencies formalism

CIRCSE/Sabellicus

Overview

This repository contains the text of De Latinae Linguae Reparatione authored by Marcus Antonius Sabellicus (1436–1506), annotated with respect to lemmas, part-of-speech tags, morphological features and syntactic dependencies.

A first version of the text was included in the test data of the EvaLatin Shared Task (Sprugnoli et al., 2022), which, however, did not focus on syntactic dependencies. Some changes have been implemented since that first version; see Annotation formalism and choices below for further details.

Acknowledgments

Annotator: Federica Gamba
Editor: Flavio Massimiliano Cecchini

We thank Timo Korkiakangas (Helsingin yliopisto, Helsinki, Finland) for funding the annotation through the Suomen Akatemia (Research Council of Finland) project grant no. 315176, Digital philology and Latin text production: a multimodal analysis of writing in the past.

Sources

The raw text was originally downloaded from ALIM - Archivio della Latinità Italiana del Medioevo as a txt file, and is reproduced in this repository. The text in ALIM is based on the critical edition by G. Bottari (1999).

The text is composed of four sections:

  • Epistola a Marcantonio Morosini
  • De Latinae Linguae Reparatione Marci Antonii Sabellici Dialogus, Qui Et Latinae Linguae Reparatio Inscribitur
  • Baptistae Guarini Dissertatio
  • Epistola ad Antonio Moretto

Annotation formalism and choices

The document is annotated according to the typological formalism of Universal Dependencies (UD), in particular as it is applied to Latin. It follows the current state of the art in annotation, as represented especially by the UDante treebank (Cecchini et al., 2020) and the harmonisation effort by Gamba & Zeman (2023a, 2023b), while implementing some slight (and compatible) variations. These are highlighted below, along with more general annotation choices that might be of interest to the user.

Lexicon

  • Lemmas are normalised: they are always lowercase and the couple v/u is neutralised in favour of u.
  • The annotation follows the current UD praxis of assigning "adverbs" (ADV) their own adverbial lemma.
    • This means that some etymologically related pairs are treated as different words (e.g. facile/faciliter, nimis/nimium, ...), and that, to keep compatibility with other Latin treebanks, elements possibly belonging to other parts of speech are not annotated as such (e.g. PRON for ubi, DET for eo, NOUN for sponte).
  • The current convention of other Latin treebanks is followed in mostly not splitting (i.e. treating as multi-word tokens) compounds of any kind (orthographical or morphological).
    • This means for example that tokens like admodum or praeterquam are analysed as unitary elements. The only exception to this is the token rempublicam, consisting of two morphologically independent words. Clitics such as que and ue are instead regularly split (see also Statistics - Words below).
  • A complete annotation for foreign words is implemented, choosing the code-switching strategy as detailed in the UD guidelines. Lemmas are represented as in their original language, possibly including different writing systems. On the other hand, the feature OrigLang for integrated foreign words has not yet been implemented.
    • We note that while this CoNLL-U overall passes the official UD validation process, some morphological features used for Foreign words are not (yet) recognised by their respective treebanks (e.g. InflClass for grc).
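The lemma normalisation described above (lowercase, v/u neutralised in favour of u) can be sketched as follows. This is only an illustration for Latin lemmas: the function name `normalise_lemma` is hypothetical and not part of the repository, and foreign lemmas, which are kept as in their original language, would need to be excluded.

```python
def normalise_lemma(lemma: str) -> str:
    """Lowercase a Latin lemma and neutralise the v/u pair in favour of u.

    Hypothetical helper: it only mirrors the two conventions stated in the
    Lexicon section and is not meant for lemmas of foreign words.
    """
    return lemma.lower().replace("v", "u")


print(normalise_lemma("Venetiae"))  # uenetiae
```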

Morphology

  • The lexical features NameType, NumValue and Proper are moved from the features to the miscellaneous (MISC) field.
    • The motivation with regard to NameType and Proper is that these features are of purely semantic type, with no reflexes whatsoever on morphosyntax, and they actually represent a different, independent layer of annotation (which will need to be better formalised in future), i.e. the one for named entities and multi-word expressions.
    • As for NumValue, the main reason is technical, as its values correspond to the infinite set of natural numbers, and so cannot be listed exhaustively in a UD documentation page.
  • NumValue is added to all words which are assigned a NumType.
  • The tripartite system of VerbForms described in Cecchini (2021) is used.
    • This excludes the values Gdv, Ger and Sup. The value Inf for infinitives, instead of a typologically more transparent Vnoun, is maintained for compatibility with the other Latin treebanks (we note that they are in fact equivalent notations for the same object).
    • The MISC field stores traditional denominations for Latin tenses and moods by means of the fields TraditionalMood and TraditionalTense.
  • The Latin Perfect tense (corresponding to forms with Perfectum stems) is analysed as having Aspect=Perf and Tense=Pres. This implies that Plusquamperfectum is represented by Aspect=Perf and Tense=Past (as opposed to Imperfectum Aspect=Imp and Tense=Past), and so Tense=Pqp is not used.
  • So-called "Inchoative" verbs (i.e. sc-verbs) are always annotated with Aspect=Inch in their imperfective forms. This is also valid for lemmas which are sometimes considered to "have lexicalised", e.g. cresco.
  • The features VerbForm and PronType are also used in an "etymological" sense: that is, they can be assigned to words which are not tagged as VERB or PRON/DET/ADV respectively, meaning that the original form was one of these, even if its current distribution belongs to another part of speech.
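As a minimal sketch of the tense analysis above, the following maps Aspect/Tense pairs back to traditional tense names. Only the three combinations spelled out in this section are covered. The label strings and the function name are assumptions for illustration; the actual values stored under TraditionalTense in the MISC field may differ.

```python
from typing import Optional

# Only the combinations explicitly described in the Morphology section.
# The label strings are illustrative, not the treebank's own values.
TRADITIONAL_TENSE = {
    ("Perf", "Pres"): "Perfectum",
    ("Perf", "Past"): "Plusquamperfectum",
    ("Imp", "Past"): "Imperfectum",
}


def traditional_tense(feats: dict) -> Optional[str]:
    """Return a traditional tense label for a feature dict, if known."""
    return TRADITIONAL_TENSE.get((feats.get("Aspect"), feats.get("Tense")))


print(traditional_tense({"Aspect": "Perf", "Tense": "Past"}))  # Plusquamperfectum
```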

Syntax

  • The semantic transversal subtypes lmod and tmod for space and time arguments are applied consistently through the annotation, and so are the corresponding values for AdvType.
  • So-called "free relatives", or also "double pronouns", are annotated as clauses containing their relative element (e.g. quando or quis) and as depending on the root of their main clause with the needed clausal relation type (advcl, ccomp/xcomp, csubj), using the relcl subtype for distinction.
    • This is in contrast to another style of annotation where the relative element is made dependent as a nominal argument of the main clause, and the rest of the free relative depends on the relative element as a "regular" relative clause (acl:relcl).

Statistics

Words

The annotated text consists of (for the definitions of tokens, syntactic words and multi-word tokens, please refer to the UD documentation):

  • 246 sentences
  • 10651 tokens, or 9070 without considering the 1581 punctuation marks (i.e. part of speech PUNCT)
  • 10755 syntactic words, or 9174 without considering the 1581 punctuation marks (i.e. part of speech PUNCT)

The difference is given by 104 multi-word tokens, which are always composed of 2 elements, the second of which is nearly always a functional clitic, distributed as follows:

  • 94 que (CCONJ)
  • 3 cum (ADP)
  • 2 ue (CCONJ)
  • 2 quis (PRON)
  • 1 quidem (PART)
  • 1 uero (ADV)

The only case where a token is split into two lexical components is rempublicam, made out of forms of res (NOUN) and publicus (ADJ) respectively.
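Since every multi-word token here has exactly 2 elements, each one adds one token but two syntactic words, so the gap between the two counts equals the number of multi-word tokens: 10755 - 10651 = 104. A minimal sketch of how such counts could be derived from a CoNLL-U file is given below. The function name, the helper `row` and the sample sentence are hypothetical and only serve the illustration.

```python
def count_units(conllu: str) -> dict:
    """Count sentences, tokens, syntactic words and multi-word tokens."""
    sentences = tokens = words = mwts = 0
    in_sentence = False
    covered_until = 0  # last word ID covered by the current multi-word token
    for line in conllu.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        token_id = line.split("\t")[0]
        if "-" in token_id:  # multi-word token range, e.g. 2-3
            mwts += 1
            tokens += 1
            covered_until = int(token_id.split("-")[1])
        elif "." in token_id:  # empty node, neither token nor word
            continue
        else:
            words += 1
            if int(token_id) > covered_until:
                tokens += 1  # a word that is also a surface token
    return {"sentences": sentences, "tokens": tokens,
            "words": words, "multiword_tokens": mwts}


def row(token_id: str, form: str) -> str:
    """Build a 10-column CoNLL-U line with unspecified annotation."""
    return "\t".join([token_id, form] + ["_"] * 8)


SAMPLE = "\n".join([
    "# text = senatus populusque",
    row("1", "senatus"),
    row("2-3", "populusque"),
    row("2", "populus"),
    row("3", "que"),
    "",
])

print(count_units(SAMPLE))
# {'sentences': 1, 'tokens': 2, 'words': 3, 'multiword_tokens': 1}
```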

There are 1866 different lemmas, or 1857 if punctuation marks are not included. This leads to a lexical richness (i.e. "type-token ratio") of ca. 20.24%. Please note that words assigned the same lemma may still be different words, e.g. when they belong to different parts of speech.
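The reported ratio appears to correspond to the number of lemmas divided by the number of syntactic words, both taken without punctuation. The short check below assumes this reading, and the variable names are illustrative.

```python
lemmas_without_punct = 1857   # distinct lemmas, punctuation excluded
words_without_punct = 9174    # syntactic words, punctuation excluded

type_token_ratio = lemmas_without_punct / words_without_punct
print(f"{type_token_ratio:.2%}")  # 20.24%
```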

There are 13 foreign words (marked with Foreign=Yes) in the text: 8 belong to Italian (it), 5 to Ancient Greek (grc). Especially the latter represent cases of code-switching.

Parts of speech

All UD's parts of speech apart from SYM and X are used. They are distributed as follows according to their form types and lemmas:

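| Part of speech | per form type | per lemma |
|---|---|---|
| NOUN + PROPN | 1572 + 404 | 454 + 221 |
| VERB | 1691 | 508 |
| ADJ | 862 | 344 |
| ADV | 845 | 208 |
| DET + NUM | 863 + 25 | 45 + 8 |
| ADP | 608 | 29 |
| PRON | 837 | 19 |
| SCONJ | 366 | 19 |
| PART | 276 | 16 |
| CCONJ | 560 | 13 |
| AUX | 264 | 1 |
| INTJ | 1 | 1 |
| PUNCT | 1581 | 9 |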

Please refer to UD's guidelines for parts of speech.
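The per form type counts can be cross-checked against the figures given under Words: summed over all parts of speech they should give the number of syntactic words. The snippet below is only a sanity check with the counts copied from the table, and the dictionary name is illustrative.

```python
# Occurrence counts per part of speech, copied from the table above.
FORM_COUNTS = {
    "NOUN": 1572, "PROPN": 404, "VERB": 1691, "ADJ": 862, "ADV": 845,
    "DET": 863, "NUM": 25, "ADP": 608, "PRON": 837, "SCONJ": 366,
    "PART": 276, "CCONJ": 560, "AUX": 264, "INTJ": 1, "PUNCT": 1581,
}

total = sum(FORM_COUNTS.values())
print(total)                         # 10755 syntactic words
print(total - FORM_COUNTS["PUNCT"])  # 9174 without punctuation marks
```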

Dependency relations

A total of 61 dependency relations are used, 31 of which represent subtypes of universal relations, as follows:

  • acl
    • acl:relcl
  • advcl
    • advcl:abs
    • advcl:cmp
    • advcl:pred
    • advcl:relcl
  • advmod
    • advmod:emph
    • advmod:lmod
    • advmod:neg
    • advmod:tmod
  • amod
  • appos
  • aux
    • aux:pass
  • case
  • cc
  • ccomp
  • conj
    • conj:expl
  • csubj
    • csubj:cleft
    • csubj:pass
    • csubj:relcl
  • det
    • det:numgov
  • discourse
  • dislocated (NB: never used without subtype)
    • dislocated:csubj
    • dislocated:nsubj
    • dislocated:obj
  • flat
    • flat:gov
    • flat:name
  • mark
  • nmod
  • nsubj
    • nsubj:cleft
    • nsubj:outer
    • nsubj:pass
  • nummod
  • obj
  • obl
    • obl:agent
    • obl:arg
    • obl:cmp
    • obl:lmod
    • obl:tmod
  • orphan
  • parataxis (NB: never used without subtype)
    • parataxis:rep
    • parataxis:reporting
    • parataxis:speaker
  • punct
  • root
  • vocative
  • xcomp

Please refer to UD's guidelines for dependency relations and to the specific guidelines for Latin, where present.
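In CoNLL-U, a subtype is attached to its universal relation with a colon (e.g. advcl:relcl). The following sketch shows one way to separate the two parts and to group relation labels by their universal relation. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def split_deprel(deprel: str) -> Tuple[str, Optional[str]]:
    """Split a relation label into (universal relation, subtype or None)."""
    base, _, subtype = deprel.partition(":")
    return base, subtype or None


def count_by_universal(deprels: Iterable[str]) -> Counter:
    """Count labels by universal relation, ignoring subtypes."""
    return Counter(split_deprel(d)[0] for d in deprels)


print(split_deprel("advcl:relcl"))  # ('advcl', 'relcl')
print(count_by_universal(["obl", "obl:tmod", "obl:lmod", "root"]))
# Counter({'obl': 3, 'root': 1})
```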

Features

The following 24 morpholexical features and their values (given as pipe-separated strings after the equals sign) are used:

  • Abbr=Yes
  • AdvType=Loc|Tim
  • Aspect=Imp|Inch|Perf|Prosp
  • Case=Abl|Acc|Dat|Gen|Loc|Nom|Voc
  • Compound=Yes
  • Degree=Abs|Cmp|Dim
  • Foreign=Yes
  • Form=Emp
  • Gender=Fem|Masc|Neut
  • InflClass=IndEurA|IndEurE|IndEurI|IndEurO|IndEurU|IndEurX|LatA|LatAnom|LatE|LatI|LatI2|LatPron|LatX
    • also as layer [nominal]
  • Mood=Imp|Ind|Sub
  • Number=Plur|Sing
    • also as layer [psor]
  • NumForm=Roman|Word
  • NumType=Card|Mult|Ord
  • PartType=Int
  • Person=1|2|3
    • also as layer [psor]
  • Polarity=Neg
  • Poss=Yes
  • PronType=Con|Dem|Ind|Int|Neg|Prs|Rel|Tot
  • Reflex=Yes
  • Tense=Fut|Past|Pres
  • Variant=Greek
  • VerbForm=Conv|Fin|Inf|Part
  • Voice=Act|Pass

Please refer to UD's guidelines for morpholexical features, layered features, and to the specific guidelines for Latin, where present.
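Note that in the list above the pipe separates the possible values of one feature, whereas in the FEATS column of a CoNLL-U file it separates Feature=Value pairs, and a layer is written in square brackets after the feature name (e.g. Number[psor]). A minimal sketch of parsing such a string is given below. The function name and the example string are illustrative and not taken from the treebank.

```python
from typing import Dict, Optional, Tuple


def parse_feats(feats: str) -> Dict[Tuple[str, Optional[str]], str]:
    """Parse a CoNLL-U FEATS string into {(feature, layer): value}.

    The layer is None for plain features such as Case, and e.g. "psor"
    for layered ones such as Number[psor].
    """
    if feats in ("_", ""):  # "_" marks an empty FEATS column
        return {}
    parsed = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        layer = None
        if name.endswith("]") and "[" in name:
            name, layer = name[:-1].split("[", 1)
        parsed[(name, layer)] = value
    return parsed


example = "Case=Acc|Number=Plur|Number[psor]=Sing|Person[psor]=1|Poss=Yes"
feats = parse_feats(example)
print(feats[("Number", None)])    # Plur
print(feats[("Number", "psor")])  # Sing
```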

Contacts

  • Federica Gamba, ÚFAL, MFF, Univerzita Karlova, Prague, Czech Republic: gamba at ufal.mff.cuni.cz
  • Flavio Massimiliano Cecchini, KU Leuven, Belgium (formerly CIRCSE, Università Cattolica del Sacro Cuore, Milan, Italy): flaviomassimiliano.cecchini at kuleuven.be

References
