Corpus.mari-language.com: A Rudimentary Corpus Searchable by Syntactic and Morphological Patterns

Authors

  • Jeremy Bradley, Ludwig Maximilian University of Munich, Institute for Finno-Ugric and Uralic Studies; Koneen Säätiö (Kone Foundation)

DOI:

https://doi.org/10.7557/5.3468

Abstract

This paper introduces a rudimentary infrastructure for a searchable corpus of Mari, a highly agglutinative Uralic language spoken in the Volga and Ural regions of the Russian Federation. This infrastructure allows users to search the corpus by syntactic and morphological patterns. It makes use of the University of Vienna’s digital Mari-English dictionary, published under a Creative Commons License in 2014, and a morphological analyser following a simple item-and-arrangement approach. Texts fed into the corpus are subjected to a morphological analysis, the results of which are saved into the application’s database with the corpus materials and are accessed by the search algorithm. A demonstration of this open-source tool, covering 994,097 tokens taken from works not subject to copyright, can be found at corpus.mari-language.com, the source code at source.mari-language.com. While a non-representative text collection of this scope can only serve demonstrative purposes, the infrastructure could enable quantitative diachronic or sociolinguistic comparisons, if fed with a sufficiently wide text collection annotated with adequate metadata.
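
The pipeline described in the abstract (analyse each token, store the analyses with the text, search over the stored analyses) can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the stems, suffixes, tags, and function names below are invented placeholders rather than real Mari data or the project's API, and the analyser simply peels known suffixes off a known stem without enforcing any suffix order.

```python
from typing import Dict, List, Optional, Tuple

# Toy lexicon and suffix inventory: invented placeholders, NOT real Mari data.
STEMS: Dict[str, str] = {"kat": "N", "mur": "V"}
SUFFIXES: Dict[str, Tuple[str, str]] = {
    # surface form -> (part of speech it attaches to, morphological tag)
    "la": ("N", "PL"),
    "ste": ("N", "INE"),
    "em": ("V", "1SG"),
}

# One analysis: (stem, part of speech, suffix tags in surface order)
Analysis = Tuple[str, str, Tuple[str, ...]]


def strip_suffixes(rest: str, pos: str) -> List[Tuple[str, ...]]:
    """Return every way to segment `rest` into known suffixes for `pos`."""
    if not rest:
        return [()]
    results: List[Tuple[str, ...]] = []
    for form, (suffix_pos, tag) in SUFFIXES.items():
        if suffix_pos == pos and rest.startswith(form):
            for tail in strip_suffixes(rest[len(form):], pos):
                results.append((tag,) + tail)
    return results


def analyse(word: str) -> List[Analysis]:
    """Item-and-arrangement style analysis: known stem plus a chain of suffixes."""
    analyses: List[Analysis] = []
    for stem, pos in STEMS.items():
        if word.startswith(stem):
            for tags in strip_suffixes(word[len(stem):], pos):
                analyses.append((stem, pos, tags))
    return analyses


def index_text(tokens: List[str]) -> List[Tuple[str, List[Analysis]]]:
    """Stand-in for the database: each token stored with its analyses."""
    return [(token, analyse(token)) for token in tokens]


def search(corpus: List[Tuple[str, List[Analysis]]],
           pattern: List[Tuple[str, Optional[str]]]) -> List[int]:
    """Return start positions where consecutive tokens match the pattern.

    Each pattern slot is (part of speech, required tag or None for any).
    """
    hits: List[int] = []
    for start in range(len(corpus) - len(pattern) + 1):
        if all(
            any(pos == want_pos and (want_tag is None or want_tag in tags)
                for _, pos, tags in corpus[start + offset][1])
            for offset, (want_pos, want_tag) in enumerate(pattern)
        ):
            hits.append(start)
    return hits


corpus = index_text(["katlaste", "murem", "kat"])
matches = search(corpus, [("N", "INE"), ("V", None)])
```

Here the query asks for a noun carrying the toy tag INE followed directly by any verb. Storing the analyses at indexing time, as the abstract describes, means a search only has to compare tags and never has to re-analyse the text.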

Published

2015-06-17

How to Cite

Bradley, J. (2015). Corpus.mari-language.com: A Rudimentary Corpus Searchable by Syntactic and Morphological Patterns. Septentrio Conference Series, (2), 57–68. https://doi.org/10.7557/5.3468