Constructing Stoplists for Historical Languages
Stoplists are lists of words that are filtered from documents prior to text analysis tasks, usually words that are either high-frequency or of low semantic value. This paper describes the development of a generalizable method for building stoplists in the Classical Language Toolkit (CLTK), an open-source Python platform for natural language processing research on historical languages. Stoplists are not readily available for many historical languages, and those that are available often offer little documentation about their sources or method of construction. The development of a generalizable method for building historical-language stoplists offers the following benefits: (1) better support for well-documented, data-driven, and replicable results in the use of CLTK resources; (2) reduction of arbitrary decision-making in building stoplists; (3) increased consistency in how stopwords are extracted from documents across multiple languages; and (4) clearer guidelines and standards for CLTK developers and contributors, a helpful step forward in managing the complexity of a multi-language open-source project.
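As a rough illustration of the high-frequency criterion mentioned in the abstract, the sketch below builds a stoplist from raw token counts and then filters a text with it. It is not the CLTK implementation or the method developed in the paper: the function names `build_stoplist` and `remove_stopwords`, the `size` parameter, the regex tokenizer, and the three-line Latin sample corpus are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase word characters only.
    return re.findall(r"\w+", text.lower())


def build_stoplist(documents, size=10):
    # Count every token across all documents and keep the `size` most frequent.
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(size)]


def remove_stopwords(text, stoplist):
    # Drop any token that appears in the stoplist.
    stops = set(stoplist)
    return [token for token in tokenize(text) if token not in stops]


# Tiny hypothetical corpus, for demonstration only.
docs = [
    "arma virumque cano troiae qui primus ab oris",
    "et in arcadia ego et in roma",
    "in principio erat verbum et verbum erat apud deum",
]

stoplist = build_stoplist(docs, size=2)
filtered = remove_stopwords("et in arcadia ego", stoplist)
```

With this sample, the two most frequent tokens are `et` and `in`, so filtering "et in arcadia ego" leaves `arcadia` and `ego`. A raw frequency cutoff is only one possible basis for stopword selection; the choice of measure and list size is exactly the kind of decision a documented, replicable method needs to make explicit.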

Constructing Stoplists for Historical Languages Code/Data Repository (English)
Description: This is a zipped GitHub repository containing the code notebooks referenced in the paper and supporting data. It is included as a separate file because references to my GitHub repository have been redacted in the body of the paper to ensure blind peer review per the journal's guidelines.
Creator (or owner) of the file: Patrick J. Burns
Constructing Stoplists for Historical Languages Jupyter notebook results (English)
Description: HTML export file from the Jupyter notebook.
Creator (or owner) of the file: Patrick J. Burns
Figure 1 (English)
Description: The inheritance tree for the CLTK Stop module.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.