Corpus.mari-language.com: A Rudimentary Corpus Searchable by Syntactic and Morphological Patterns
DOI:
https://doi.org/10.7557/5.3468
Abstract
This paper introduces a rudimentary infrastructure for a searchable corpus of Mari, a highly agglutinative Uralic language spoken in the Volga and Ural regions of the Russian Federation. This infrastructure allows users to search the corpus by syntactic and morphological patterns. It makes use of the University of Vienna’s digital Mari-English dictionary, published under a Creative Commons License in 2014, and a morphological analyser following a simple item-and-arrangement approach. Texts fed into the corpus are subjected to a morphological analysis, the results of which are saved into the application’s database with the corpus materials and are accessed by the search algorithm. A demonstration of this open-source tool, covering 994,097 tokens taken from works not subject to copyright, can be found at corpus.mari-language.com, the source code at source.mari-language.com. While a non-representative text collection of this scope can only serve demonstrative purposes, the infrastructure could enable quantitative diachronic or sociolinguistic comparisons, if fed with a sufficiently wide text collection annotated with adequate metadata.
Published
2015-06-17
How to Cite
Bradley, J. (2015). Corpus.mari-language.com: A Rudimentary Corpus Searchable by Syntactic and Morphological Patterns. Septentrio Conference Series, (2), 57–68. https://doi.org/10.7557/5.3468
Section
Articles