A Rudimentary Corpus Searchable by Syntactic and Morphological Patterns


Jeremy Bradley, Ludwig Maximilian University of Munich, Institute for Finno-Ugric and Uralic Studies; Koneen Säätiö (Kone Foundation)



This paper introduces a rudimentary infrastructure for a searchable corpus of Mari, a highly agglutinative Uralic language spoken in the Volga and Ural regions of the Russian Federation. This infrastructure allows users to search the corpus by syntactic and morphological patterns. It makes use of the University of Vienna’s digital Mari-English dictionary, published under a Creative Commons License in 2014, and a morphological analyser following a simple item-and-arrangement approach. Texts fed into the corpus are subjected to a morphological analysis, the results of which are saved into the application’s database with the corpus materials and are accessed by the search algorithm. A demonstration of this open-source tool, covering 994,097 tokens taken from works not subject to copyright, can be found at, the source code at. While a non-representative text collection of this scope can only serve demonstrative purposes, the infrastructure could enable quantitative diachronic or sociolinguistic comparisons, if fed with a sufficiently wide text collection annotated with adequate metadata.
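The pipeline outlined above (item-and-arrangement analysis, storage of the analyses alongside the text, and pattern search over the stored analyses) can be illustrated with a minimal sketch. It is not the tool's actual implementation: the stems, suffixes, and tags are invented placeholders rather than Mari data, an in-memory list stands in for the application's database, suffix order is left unconstrained, and all function names (`analyse`, `build_index`, `search`) are hypothetical.

```python
# Toy lexicon and suffix inventory: invented forms, NOT real Mari data.
STEMS = {"tal": "N", "mir": "V"}
SUFFIXES = {
    "N": {"ak": "PL", "es": "INE"},
    "V": {"en": "PST", "am": "1SG"},
}


def _parse_suffixes(rest, pos):
    """Yield every way to split `rest` into known suffixes for `pos`."""
    if not rest:
        yield []
        return
    for form, tag in SUFFIXES[pos].items():
        if rest.startswith(form):
            for tail in _parse_suffixes(rest[len(form):], pos):
                yield [tag] + tail


def analyse(token):
    """Item-and-arrangement analysis: a stem followed by a chain of suffixes.

    Returns a list of (stem, part of speech, suffix tags) tuples; the list
    is empty if the token cannot be segmented.
    """
    analyses = []
    for stem, pos in STEMS.items():
        if token.startswith(stem):
            for tags in _parse_suffixes(token[len(stem):], pos):
                analyses.append((stem, pos, tuple(tags)))
    return analyses


def build_index(sentences):
    """Analyse each token once and keep the results with the text
    (a stand-in for saving analyses to a database)."""
    return [[(tok, analyse(tok)) for tok in sent.split()] for sent in sentences]


def _matches(analyses, constraint):
    """True if any analysis has the required part of speech and tags."""
    pos = constraint.get("pos")
    tags = set(constraint.get("tags", ()))
    return any(
        (pos is None or a_pos == pos) and tags <= set(a_tags)
        for _, a_pos, a_tags in analyses
    )


def search(index, pattern):
    """Return (sentence number, token offset) for each run of adjacent
    tokens that satisfies the sequence of constraints in `pattern`."""
    hits = []
    for s_id, sent in enumerate(index):
        for start in range(len(sent) - len(pattern) + 1):
            window = sent[start:start + len(pattern)]
            if all(_matches(an, c) for (_, an), c in zip(window, pattern)):
                hits.append((s_id, start))
    return hits


corpus = ["talak miren", "tal miram", "talakes mir"]
index = build_index(corpus)
# A plural noun immediately followed by any verb.
print(search(index, [{"pos": "N", "tags": ["PL"]}, {"pos": "V"}]))
# prints [(0, 0), (2, 0)]
```

Because the analyses are computed once at import time and stored with the tokens, a query only compares constraints against stored tags and never re-runs the analyser, which is the division of labour the abstract describes.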