Opera Graeca Adnotata: Building a 40M+ Token Multilayer Corpus for Ancient Greek

Giuseppe G. A. Celano

doi:10.11588/dco.2026.12.112290

Giuseppe G. A. Celano (Author)

https://orcid.org/0000-0002-7699-2566

Identifier (Article)

DOI: https://doi.org/10.11588/dco.2026.12.112290

Abstract

This article presents beta version 0.2.0 of Opera Graeca Adnotata (OGA), the largest open-access multilayer corpus for Ancient Greek (AG). OGA consists of 1,999 literary works and 40M+ tokens sourced from the canonical-greekLit, First1KGreek, and PatristicTextArchive GitHub repositories, which together host AG texts ranging from approximately 900 BCE to 1400 CE. The texts have been enriched with nine annotation layers: (i) tokenization; (ii) sentence segmentation; (iii) lemmatization; (iv) morphology; (v) dependency structure; (vi) dependency function; (vii) IPA transcription; (viii) composition date; and (ix) CTS structure. Each layer is described with a focus on the main technical and annotation-related issues encountered. The corpus is released in the standoff format PAULA XML and its derivative LAULA XML, and it can be queried online through ANNIS.
