All that glitters...

Interannotator agreement in natural language processing

Authors

DOI:

https://doi.org/10.7557/12.6348

Keywords:

evaluation, natural language processing, interannotator agreement, annotation

Abstract

Evaluation has emerged as a central concern in natural language processing (NLP) over the last few decades. Evaluation is done against a gold standard: a dataset with manual linguistic annotation, which is assumed to provide the ground truth against which the accuracy of an NLP system can be assessed automatically. This article discusses some methodological questions connected with the creation of gold-standard datasets, in particular the (non-)expectation of linguistic expertise in annotators, and the interannotator agreement measure that is standardly, but often unreflectively, used as a kind of quality index of NLP gold standards.
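
The agreement measure standardly reported in NLP is typically a chance-corrected coefficient such as the kappa statistic (Carletta 1996; Artstein and Poesio 2008, both listed below), defined as kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement between two annotators and P_e the agreement expected by chance given their individual label distributions. The following minimal Python sketch only illustrates that calculation; the annotators, labels, and the 0.8 "reliability" threshold in the final comment are illustrative assumptions, not material from the article.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a, "need parallel, non-empty label lists"
    n = len(labels_a)
    # Observed agreement: share of items both annotators labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of an accidental match, estimated from
    # each annotator's own label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(dist_a[lab] * dist_b[lab] for lab in dist_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical part-of-speech labels from two annotators for the same six tokens.
ann_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
ann_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
print(round(cohens_kappa(ann_1, ann_2), 2))  # 0.48 -- well below a 0.8 "reliability" threshold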

References

Alkon, Paul K. 1959. Behaviourism and linguistics: An historical note. Language and Speech 2(1): 37–51. https://doi.org/10.1177/002383095900200105.

Antonsen, Lene, Trond Trosterud, and Linda Wiechetek. 2010. Reusing grammatical resources for new languages. In Proceedings of LREC 2010, pp. 2782–2789. ELRA, Valletta.

Artstein, Ron. 2017. Inter-annotator agreement. In Handbook of Linguistic Annotation, edited by Nancy Ide and James Pustejovsky, pp. 297–313. Springer, Dordrecht. https://doi.org/10.1007/978-94-024-0881-2_11.

Artstein, Ron and Massimo Poesio. 2008. Survey article: Inter-coder agreement for computational linguistics. Computational Linguistics 34(4): 555–596. https://doi.org/10.1162/coli.07-034-R2.

Babarczy, Anna, John Carroll, and Geoffrey Sampson. 2006. Definitional, personal and mechanical constraints on part of speech annotation performance. Natural Language Engineering 12(1): 77–90. https://doi.org/10.1017/S1351324905003803.

Bayerl, Petra Saskia and Karsten Ingmar Paul. 2011. What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Computational Linguistics 37(4): 699–725. https://doi.org/10.1162/COLI_a_00074.

Bender, Emily M. 2011. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology 6(3).

Bender, Emily M. 2016. Linguistic typology in natural language processing. Linguistic Typology 20(3): 645–660. https://doi.org/10.1515/lingty-2016-0035.

Brown, Susan Windisch, Travis Rood, and Martha Palmer. 2010. Number or nuance: Which factors restrict reliable word sense annotation? In Proceedings of LREC 2010, pp. 3237–3243. ELRA, Valletta.

Carletta, Jean. 1996. Assessing agreement on classification tasks: The Kappa statistic. Computational Linguistics 22(2): 249–254.

Church, Kenneth Ward and Joel Hestness. 2019. A survey of 25 years of evaluation. Natural Language Engineering 25(6): 753–767. https://doi.org/10.1017/S1351324919000275.

Dąbrowska, Ewa. 2010. Naive v. expert intuitions: An empirical study of acceptability judgments. The Linguistic Review 27: 1–23. https://doi.org/10.1515/tlir.2010.001.

Dagan, Ido, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, edited by Joaquin Quiñonero-Candela, Ido Dagan, Bernardo Magnini, and Florence d’Alché-Buc, pp. 177–190. Springer, Berlin.

Dickinson, Markus. 2009. Correcting dependency annotation errors. In Proceedings of EACL 2009, pp. 193–201. ACL, Athens.

Dickinson, Markus. 2015. Detection of annotation errors in corpora. Language and Linguistics Compass 9(3): 119–138. https://doi.org/10.1111/lnc3.12129.

Dickinson, Markus and W. Detmar Meurers. 2003. Detecting errors in part-of-speech annotation. In Proceedings of EACL 2003, pp. 107–114. ACL, Budapest.

Gillick, Dan and Yang Liu. 2010. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 148–151. ACL, Los Angeles.

Gimpel, Kevin, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL/HLT 2011, pp. 42–47. ACL, Portland.

Harrigan, Atticus G., Katherine Schmirler, Antti Arppe, Lene Antonsen, Trond Trosterud, and Arok Wolvengrey. 2017. Learning from the computational modelling of Plains Cree verbs. Morphology 27(4): 565–598. https://doi.org/10.1007/s11525-017-9315-x.

Hetmański, Marek. 2018. Expert knowledge: Its structure, functions and limits. Studia Humana 7(3): 11–20. https://doi.org/10.2478/sh-2018-0014.

Hollenstein, Nora, Nathan Schneider, and Bonnie Webber. 2016. Inconsistency detection in semantic annotation. In Proceedings of LREC 2016, pp. 3986–3990. ELRA, Portorož.

Hovy, Dirk, Barbara Plank, and Anders Søgaard. 2014. Experiments with crowdsourced re-annotation of a POS tagging data set. In Proceedings of ACL 2014, pp. 377–382. ACL, Baltimore. https://doi.org/10.3115/v1/P14-2062.

Hämäläinen, Mika and Khalid Alnajjar. 2021. The Great Misalignment Problem in human evaluation of NLP methods. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pp. 69–74. ACL, Online.

Kato, Yoshihide and Shigeki Matsubara. 2010. Correcting errors in a treebank based on synchronous tree substitution grammar. In Proceedings of ACL 2010, pp. 74–79. ACL, Uppsala.

Kilgarriff, Adam. 1999. 95% replicability for manual word sense tagging. In Proceedings of EACL 1999, pp. 277–278. ACL, Bergen.

Klein, Gabriella B. 2018. Applied linguistics to identify and contrast racist ‘hate speech’: Cases from the English and Italian language. Applied Linguistics Research Journal 2(3): 1–16. https://doi.org/10.14744/alrj.2018.36855.

Kučera, Henry and W. Nelson Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press, Providence.

Liberman, Mark. 2012. Literary moist aversion. https://languagelog.ldc.upenn.edu/nll/?p=4389. Language Log post.

Lindahl, Anna, Lars Borin, and Jacobo Rouces. 2019. Towards assessing argumentation annotation – a first step. In Proceedings of the 6th Workshop on Argument Mining, pp. 177–186. ACL, Florence. https://doi.org/10.18653/v1/W19-4520.

Loftsson, Hrafn. 2009. Correcting a POS-tagged corpus using three complementary methods. In Proceedings of EACL 2009, pp. 523–531. ACL, Athens.

Manning, Christopher D. 2015. Last words: Computational linguistics and deep learning. Computational Linguistics 41(4): 701–707. https://doi.org/10.1162/COLI_a_00239.

Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2): 313–330.

de Marneffe, Marie-Catherine, Matias Grioni, Jenna Kanerva, and Filip Ginter. 2017. Assessing the annotation consistency of the Universal Dependencies corpora. In Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), pp. 108–115. LiUEP, Pisa.

McDonnell, Tyler, Matthew Lease, Mucahid Kutlu, and Tamer Elsayed. 2016. Why is that relevant? Collecting annotator rationales for relevance judgments. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, pp. 139–148. AAAI Press, Palo Alto.

Miller, George A. and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1): 1–28. https://doi.org/10.1080/01690969108406936.

Munro, Robert, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily. 2010. Crowdsourcing and language studies: The new generation of linguistic data. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 122–130. ACL, Los Angeles.

Palmer, Martha, Olga Babko-Malaya, and Hoa Trang Dang. 2004. Different sense granularities for different applications. In Proceedings of ScaNaLU 2004 at HLT-NAACL 2004, pp. 49–56. ACL, Boston.

Passos, Maria de Lourdes R. da F. and Maria Amelia Matos. 2007. The influence of Bloomfield’s linguistics on Skinner. The Behavior Analyst 30(2): 133–151.

Plank, Barbara, Dirk Hovy, and Anders Søgaard. 2014. Linguistically debatable or just plain wrong? In Proceedings of ACL 2014, pp. 507–511. ACL, Baltimore. https://doi.org/10.3115/v1/P14-2083.

Pradhan, Sameer, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of SemEval 2007, pp. 87–92. ACL, Prague.

Pustejovsky, James, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. The TIMEBANK corpus. In Proceedings of Corpus Linguistics 2003, pp. 647–656. Lancaster University, Lancaster.

Reiter, Ehud. 2007. Last words: The shrinking horizons of computational linguistics. Computational Linguistics 33(2): 283–287. https://doi.org/10.1162/coli.2007.33.2.283.

Şahin, Gözde Gül, Clara Vania, Ilia Kuznetsov, and Iryna Gurevych. 2020. LINSPECTOR: Multilingual probing tasks for word representations. Computational Linguistics 46(2): 335–385. https://doi.org/10.1162/coli_a_00376.

Sampson, Geoffrey and Anna Babarczy. 2008. Definitional and human constraints on structural annotation of English. Natural Language Engineering 14(4): 471–494. https://doi.org/10.1017/S1351324908004695.

Santana, Carlos. 2018. Why not all evidence is scientific evidence. Episteme 15(2): 209–227. https://doi.org/10.1017/epi.2017.3.

Snow, Rion, Brendan O’Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP 2008, pp. 254–263. ACL, Honolulu.

Strapparava, Carlo and Rada Mihalcea. 2007. SemEval-2007 task 14: Affective text. In Proceedings of SemEval 2007, pp. 70–74. ACL, Prague.

Trosterud, Trond. 2006. Grammatically based language technology for minority languages. In Lesser-Known Languages of South Asia: Status and Policies, Case Studies and Applications of Language Technology, edited by Anju Saxena and Lars Borin, pp. 293–315. Mouton de Gruyter, Berlin.

Trosterud, Trond. 2012. A restricted freedom of choice: Linguistic diversity in the digital landscape. Nordlyd 39(2): 89–104.

Uma, Alexandra N., Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, and Massimo Poesio. 2021. Learning from disagreement: A survey. Journal of Artificial Intelligence Research 72: 1385–1470.

Wilks, Yorick. 2000. Is word sense disambiguation just one more NLP task? Computers and the Humanities 34: 235–243.

Wintner, Shuly. 2009. What science underlies natural language engineering? Computational Linguistics 35(4): 641–644. https://doi.org/10.1162/coli.2009.35.4.35409.

Zaenen, Annie. 2006. Last words: Mark-up barking up the wrong tree. Computational Linguistics 32(4): 577–580.

Öhman, Emily. 2021. The validity of lexicon-based emotion analysis in interdisciplinary research. In Proceedings of the Workshop on Natural Language Processing for Digital Humanities (NLP4DH), pp. 7–12. ACL, Online.

Published

2022-08-30