Opera Graeca Adnotata: Building a 40M+ Token Multilayer Corpus for Ancient Greek
Identifier (Article)
Abstract
This article presents beta version 0.2.0 of Opera Graeca Adnotata (OGA), the largest open-access multilayer corpus for Ancient Greek (AG). OGA comprises 1,999 literary works and 40M+ tokens sourced from the canonical-greekLit, First1KGreek, and PatristicTextArchive GitHub repositories, which together host AG texts dating from approximately 900 BCE to 1400 CE. The texts are enriched with nine annotation layers: (i) tokenization; (ii) sentence segmentation; (iii) lemmatization; (iv) morphology; (v) dependency structure; (vi) dependency function; (vii) IPA transcription; (viii) composition date; and (ix) CTS structure. Each layer is described with attention to the main technical and annotation-related issues encountered. The corpus is released in the standoff formats PAULA XML and its derivative LAULA XML, and can be queried online through ANNIS.

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.


