Constructing Stoplists for Historical Languages
Abstract
Stoplists are lists of words that are filtered from documents before text analysis tasks, usually because they are high frequency or have low semantic value. This paper describes the development of a generalizable method for building stoplists in the Classical Language Toolkit (CLTK), an open-source Python platform for natural language processing research on historical languages. Stoplists are not readily available for many historical languages, and those that are available often offer little documentation about their sources or method of construction. A generalizable method for building historical-language stoplists offers the following benefits: (1) better support for well-documented, data-driven, and replicable results in the use of CLTK resources; (2) reduction of arbitrary decision-making in building stoplists; (3) increased consistency in how stopwords are extracted from documents across multiple languages; and (4) clearer guidelines and standards for CLTK developers and contributors, a helpful step forward in managing the complexity of a multi-language open-source project.
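As a minimal illustration of the high-frequency criterion mentioned in the abstract, the sketch below builds a stoplist from raw token counts and then filters a document with it. It is not the CLTK API and is not taken from the paper: the names `build_stoplist` and `remove_stopwords`, the whitespace tokenization, and the use of raw frequency as the only criterion are all assumptions made for this example.

```python
from collections import Counter


def build_stoplist(documents, size=10):
    """Return the `size` most frequent tokens across `documents`.

    Hypothetical helper, not part of CLTK. Tokens are lowercased and split
    on whitespace; ties are broken alphabetically so results are replicable.
    """
    counts = Counter()
    for document in documents:
        counts.update(document.lower().split())
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:size]]


def remove_stopwords(document, stoplist):
    """Drop every token of `document` that appears in `stoplist`."""
    stops = set(stoplist)
    return [token for token in document.lower().split() if token not in stops]


# Toy Latin snippets, used only to exercise the functions.
docs = [
    "arma virumque cano et troiae qui primus ab oris",
    "et in arcadia ego et in terra pax",
    "gallia est omnis divisa in partes tres",
]
stoplist = build_stoplist(docs, size=2)
filtered = remove_stopwords(docs[2], stoplist)
```

Breaking ties deterministically is one small way to keep such a list replicable across runs; a fuller method would likely combine frequency with other statistical measures.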
Supplementary Content
Constructing Stoplists for Historical Languages Code/Data Repository
Description: This is a zipped GitHub repository including the code notebooks referenced in the paper and supporting data. This has been included as a separate file because references to my GitHub repository have been redacted in the body of the paper to ensure blind peer review per the journal's guidelines.
Creator (or owner) of file: Patrick J. Burns

Constructing Stoplists for Historical Languages Jupyter notebook results
Description: HTML export file from Jupyter notebook
Creator (or owner) of file: Patrick J. Burns

Figure 1
Description: The inheritance tree for the CLTK Stop module.
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.