Corpus Tagging: Concept and Domains
Abstract
This paper reviews corpus tagging, a topic rarely explored in the Arabic literature despite its importance to linguistics and natural language processing (NLP). It defines the terms corpus and corpus tagging, then reviews several studies that investigated corpus tagging but did not draw a clear boundary between the types of tags that can be added. The contribution of this paper lies in distinguishing three types of tags that can be added to a corpus: linguistic tags for words (Tagging), marks for text structure (Markup), and descriptive data about the corpus (Metadata). The paper explains the forms each type takes and the mechanism for adding it to Arabic corpora, with examples. It also describes how the three types can be combined in one corpus, which makes the corpus richer and more useful for researchers in linguistics and NLP.
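The combination of the three tag types can be illustrated with a minimal sketch. It is not the scheme used in the paper: the element names (`header`, `body`, `p`, `s`, `w`), the `pos` attribute, the tag labels (`VERB`, `NOUN`, `PREP`), the metadata fields, and the sample sentence are all assumptions chosen for illustration. The sketch stores metadata in a header, marks paragraph and sentence boundaries as markup, and attaches one linguistic tag to each word.

```python
import xml.etree.ElementTree as ET


def build_document(metadata, paragraphs):
    """Combine metadata, structural markup, and word-level tags in one XML tree.

    metadata:   dict of descriptive fields for the whole text (hypothetical names)
    paragraphs: list of paragraphs; each paragraph is a list of sentences;
                each sentence is a list of (word, tag) pairs
    """
    doc = ET.Element("text")

    # Metadata layer: descriptive data about the text as a whole.
    header = ET.SubElement(doc, "header")
    for key, value in metadata.items():
        ET.SubElement(header, key).text = value

    # Markup layer: paragraph and sentence boundaries.
    body = ET.SubElement(doc, "body")
    for sentences in paragraphs:
        p = ET.SubElement(body, "p")
        for tagged_words in sentences:
            s = ET.SubElement(p, "s")
            # Tagging layer: one linguistic tag per word.
            for word, pos in tagged_words:
                ET.SubElement(s, "w", pos=pos).text = word
    return doc


# Illustrative input: one paragraph with one sentence ("The boy went to the school").
metadata = {"title": "Sample text", "genre": "narrative", "language": "ar"}
paragraphs = [[[("ذهب", "VERB"), ("الولد", "NOUN"), ("إلى", "PREP"), ("المدرسة", "NOUN")]]]

doc = build_document(metadata, paragraphs)
print(ET.tostring(doc, encoding="unicode"))
```

Keeping the three layers in separate parts of the tree means a researcher could filter texts by metadata, count sentences from the markup, or query word tags without one layer interfering with another.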
DOI: http://dx.doi.org/10.35682/1934
Published by Mutah University