Corpus.mari-language.com: A Rudimentary Corpus Searchable by Syntactic and Morphological Patterns

  • Jeremy Bradley, Ludwig Maximilian University of Munich, Institute for Finno-Ugric and Uralic Studies; Koneen Säätiö (Kone Foundation)

Abstract

This paper introduces a rudimentary infrastructure for a searchable corpus of Mari, a highly agglutinative Uralic language spoken in the Volga and Ural regions of the Russian Federation. This infrastructure allows users to search the corpus by syntactic and morphological patterns. It makes use of the University of Vienna’s digital Mari-English dictionary, published under a Creative Commons License in 2014, and a morphological analyser following a simple item-and-arrangement approach. Texts fed into the corpus are subjected to a morphological analysis, the results of which are saved into the application’s database with the corpus materials and are accessed by the search algorithm. A demonstration of this open-source tool, covering 994,097 tokens taken from works not subject to copyright, can be found at corpus.mari-language.com, the source code at source.mari-language.com. While a non-representative text collection of this scope can only serve demonstrative purposes, the infrastructure could enable quantitative diachronic or sociolinguistic comparisons, if fed with a sufficiently wide text collection annotated with adequate metadata.
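
The pipeline described above (analyse each token, store the analyses alongside the text, then search over the stored analyses) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the in-memory list standing in for the application's database, and the toy lexicon in simplified transliteration are hypothetical and are not taken from the tool's source code or from the Mari-English dictionary. The analyser treats a word as a stem followed by a chain of suffixes, which is one plain reading of an item-and-arrangement approach; it ignores allomorphy, vowel harmony, and orthographic conventions.

```python
# Toy lexicon and suffix inventory (hypothetical placeholders, simplified
# transliteration; not taken from the dictionary or the tool's source).
STEMS = {"pört": "house", "kniga": "book"}
SUFFIXES = {"vlak": "PL", "yn": "GEN", "lan": "DAT"}


def _suffix_chains(rest):
    """Return every way to segment `rest` into known suffixes, as tag tuples."""
    if not rest:
        return [()]  # nothing left: one valid (empty) chain
    chains = []
    for form, tag in SUFFIXES.items():
        if rest.startswith(form):
            for tail in _suffix_chains(rest[len(form):]):
                chains.append((tag,) + tail)
    return chains


def analyse(token):
    """Return all (stem, tags) analyses of a token as stem + suffix chain."""
    results = []
    for stem in STEMS:
        if token.startswith(stem):
            for tags in _suffix_chains(token[len(stem):]):
                results.append((stem, tags))
    return results


def build_index(tokens):
    """Analyse tokens once and keep the results next to the text.

    A list stands in here for the database mentioned in the abstract.
    """
    return [(tok, analyse(tok)) for tok in tokens]


def search(index, stem=None, tags=()):
    """Return (position, token) pairs with an analysis matching the pattern."""
    hits = []
    for pos, (tok, analyses) in enumerate(index):
        for found_stem, found_tags in analyses:
            stem_ok = stem is None or found_stem == stem
            tags_ok = all(tag in found_tags for tag in tags)
            if stem_ok and tags_ok:
                hits.append((pos, tok))
                break  # report each token at most once
    return hits


index = build_index(["pörtvlakyn", "knigalan", "pört"])
genitives = search(index, tags=("GEN",))
house_forms = search(index, stem="pört")
```

Storing the analyses at import time, as in `build_index`, means a query such as `search(index, tags=("GEN",))` only compares tags and never re-runs the analyser, which matches the division of labour the abstract describes between analysis and search.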
Published
2015-06-17