The Finno-Ugric Languages and The Internet Project

Authors

  • Heidi Jauhiainen, University of Helsinki, Department of Modern Languages
  • Tommi Jauhiainen, University of Helsinki, Department of Modern Languages
  • Krister Lindén, University of Helsinki, Department of Modern Languages

DOI:

https://doi.org/10.7557/5.3471

Abstract

This paper describes the Kone Foundation-funded project "The Finno-Ugric Languages and The Internet" and some of the results achieved so far. The main activity of the project is to crawl the internet and gather texts written in small Uralic languages. The sentences and words of the texts found will be assembled into a freely available corpus. Crawling is done with Heritrix, the open-source crawler developed by the Internet Archive. Heritrix crawls through the pages and passes the texts it finds to a language identifier.

We use a state-of-the-art language identifier, which has been developed further within the project and evaluated on 285 languages. We describe the language identification evaluation results for the 34 Uralic languages known to the language identifier. We also describe the initial observations and results from the first five large crawls, which covered the national internet domains of Finland, Sweden, Norway, Russia, and Estonia.

Published

2015-06-17

How to Cite

Jauhiainen, H., Jauhiainen, T., & Lindén, K. (2015). The Finno-Ugric Languages and The Internet Project. Septentrio Conference Series, (2), 87–98. https://doi.org/10.7557/5.3471