Low-Resource Active Learning of North Sámi Morphological Segmentation

Authors

  • Stig-Arne Grönroos, Department of Signal Processing and Acoustics, Aalto University
  • Kristiina Jokinen, Institute of Behavioural Sciences, University of Helsinki
  • Katri Hiovain, Institute of Behavioural Sciences, University of Helsinki
  • Mikko Kurimo, Department of Signal Processing and Acoustics, Aalto University
  • Sami Virpioja, Department of Information and Computer Science, Aalto University

DOI:

https://doi.org/10.7557/5.3465

Abstract

Many Uralic languages have a rich morphological structure but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation of the North Sámi language with a large unannotated corpus and a small amount of human-annotated word forms selected using an active learning approach. For statistical learning, we use the semi-supervised Morfessor Baseline and FlatCat methods. After annotating 237 words with our active learning setup, we improve morph boundary recall by over 20% with no loss of precision.

Published

2015-06-17

How to Cite

Grönroos, S.-A., Jokinen, K., Hiovain, K., Kurimo, M., & Virpioja, S. (2015). Low-Resource Active Learning of North Sámi Morphological Segmentation. Septentrio Conference Series, (2), 20–33. https://doi.org/10.7557/5.3465