You can’t suggest that?!

Comparisons and improvements of speller error models

Authors

  • Heiki-Jaan Kaalep, Tartu ülikool
  • Flammie Pirinen, UiT Norgga árktalaš universitehta
  • Sjur Nørstebø Moshagen, UiT Norgga árktalaš universitehta

DOI:

https://doi.org/10.7557/12.6349

Keywords:

spell-checking, rule-based, FSA, machine learning, Sámi languages, Estonian

Abstract

In this article, we study the correction of spelling errors, specifically how spelling errors are made and how we can model them computationally in order to fix them.
The article describes two approaches to generating spelling correction suggestions for three Uralic languages: Estonian, North Sámi and South Sámi.
The first approach to modelling spelling errors is rule-based: experts write rules that describe the kinds of errors that are made, and these rules are compiled into a finite-state automaton that models the errors.
The second is data-based: we show a machine learning algorithm a corpus of errors that humans have made, and it creates a neural network that can model the errors.
Both approaches require collecting error corpora and understanding their contents; we therefore also describe in detail the actual errors we have seen.
We find that while both approaches yield error correction systems, with current resources the expert-built systems are still more reliable.
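
To make the rule-based idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the article compiles expert rules into a finite-state automaton, whereas this sketch applies weighted rewrite rules with a plain search and filters the results against a word list. The rules, weights, lexicon, and the names `RULES`, `LEXICON` and `suggest` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# A rule maps an erroneous substring to the intended one, with a cost.
# Lower cost means the error is assumed to be more common.
Rule = Tuple[str, str, float]

# Hypothetical toy rules and lexicon (not taken from the article).
RULES: List[Rule] = [
    ("t", "tt", 1.0),  # a doubled consonant was typed as a single one
    ("e", "a", 2.0),   # a vowel confusion
]
LEXICON: Set[str] = {"letter", "later", "latter"}


def suggest(
    word: str,
    lexicon: Set[str],
    rules: List[Rule],
    max_cost: float = 3.0,
) -> List[Tuple[str, float]]:
    """Return lexicon words reachable from `word` via the rules, cheapest first."""
    best: Dict[str, float] = {word: 0.0}  # cheapest known cost per string
    frontier = [(word, 0.0)]
    while frontier:
        current, cost = frontier.pop()
        for wrong, right, weight in rules:
            new_cost = cost + weight
            if new_cost > max_cost:
                continue  # the total cost budget bounds the search
            start = current.find(wrong)
            while start != -1:
                # Rewrite this one occurrence of the erroneous substring.
                candidate = current[:start] + right + current[start + len(wrong):]
                if new_cost < best.get(candidate, float("inf")):
                    best[candidate] = new_cost
                    frontier.append((candidate, new_cost))
                start = current.find(wrong, start + 1)
    hits = [(w, c) for w, c in best.items() if w in lexicon and w != word]
    return sorted(hits, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    for candidate, cost in suggest("leter", LEXICON, RULES):
        print(candidate, cost)
```

Ranking suggestions by accumulated rule cost plays the role that path weights play in a weighted finite-state error model; a compiled automaton performs the same kind of search far more efficiently over a full lexicon.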

Published

2022-08-30