Corpus Tagging: Concept and Domains

عبدالله بن يحيى الفيفي

doi:10.35682/1934

Abstract

This paper reviews corpus tagging, a topic rarely addressed in the Arabic literature despite its importance to research in linguistics and natural language processing. It defines corpora and corpus tagging, then surveys several non-Arabic studies that examined corpus tagging but did not draw a clear boundary between the types of tags that can be added. The paper's contribution is to distinguish three types of annotation that can be added to a corpus: linguistic tags for words (Tagging), marking of text structure (Mark-up), and descriptive data about the corpus (Metadata). It explains the forms each type takes and how each is added to Arabic corpora, with examples, and describes how the three types can be combined in a single corpus, which makes the corpus richer and more useful to researchers in linguistics and natural language processing.
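As a rough illustration of how the three annotation types might coexist in one corpus file, the following Python sketch builds a small XML document. The element names (`header`, `body`, `p`, `w`), the attribute names, and the part-of-speech labels are assumptions chosen for this example; they are not taken from the paper or from any particular tagset.

```python
import xml.etree.ElementTree as ET


def build_document(metadata, paragraphs):
    """Combine metadata, structural markup, and word-level tags in one XML tree.

    All element and attribute names here are hypothetical.
    """
    doc = ET.Element("text")

    # Metadata layer: descriptive data about the text as a whole.
    header = ET.SubElement(doc, "header")
    for key, value in metadata.items():
        ET.SubElement(header, key).text = value

    # Mark-up layer: numbered paragraphs record the structure of the text.
    body = ET.SubElement(doc, "body")
    for number, tagged_words in enumerate(paragraphs, start=1):
        paragraph = ET.SubElement(body, "p", n=str(number))
        # Tagging layer: each word carries a linguistic tag.
        for word, pos in tagged_words:
            ET.SubElement(paragraph, "w", pos=pos).text = word

    return doc


# Illustrative input only: one short Arabic sentence with made-up tag labels.
metadata = {"title": "Sample text", "genre": "news", "year": "2020"}
paragraphs = [
    [("ذهب", "VERB"), ("الولد", "NOUN"), ("إلى", "PREP"), ("المدرسة", "NOUN")],
]

doc = build_document(metadata, paragraphs)
print(ET.tostring(doc, encoding="unicode"))
```

Keeping the three layers in separate parts of the tree lets a researcher filter texts by metadata, navigate by structure, and search by word tag without one layer interfering with another.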

Keywords

Tagging, dictionary, computational technologies, concordancer, lexical entries, corpora, word frequency.

Published by
MUTAH UNIVERSITY
