Data citation in linguistics publications

A scholar-led, community-based initiative




Keywords: linguistics, data citation, scholar-led initiatives, reproducible research, research data



The creation and dissemination of reproducible research is receiving ever-growing attention in discussions on best practices in publication and education. A key element of these practices is appropriate citation of data sources. In this presentation we describe one scholar-led initiative to increase awareness of the value of data citation in scholarly communication across the discipline of linguistics. 

Practices in linguistics are varied, but the field is primarily a data-driven social science, in which inferences about the properties of language, human cognition, cultures and societies are drawn from observations of language. The primary data sets underlying the field are records of these observations in the form of, for instance, texts, audio/video recordings and annotations. While linguists have always relied on language data, they have not always facilitated access to those data in publications (Berez-Kroeker et al. 2018). A great deal of published linguistic research is therefore not reproducible, either in principle or in practice.

A primary factor hindering reproducible research in linguistics is the lack of standards for data citation in scholarly publishing. Lacking such standards, the field continues to emphasize linguistic analyses over linguistic data, and as a result, linguists have little incentive to make the data behind research publications accessible.

Since 2015, with funding from the US National Science Foundation, we have worked to develop and promote standards for citing data. We are an international team (Norway, US, Canada, Australia) of scholars including linguistic data practitioners, scholarly communication librarians, and digital archivists.

In this presentation we discuss our coordinated efforts over the past four years, including:

Network building

  • Three international workshops to identify technical and sociological barriers to research data citation in linguistics publications;
  • The formation of the Linguistics Data Interest Group within the Research Data Alliance, with nearly 100 members from the international linguistics scholarly community.

Outreach activities

  • Short-form technical courses and presentations offered through the Linguistic Society of America.

Deliverable products

  • An open-access position paper (Berez-Kroeker et al. 2018).
  • The Austin Principles of Data Citation in Linguistics, which annotates the FORCE11 Joint Declaration of Data Citation Principles (Data Citation Synthesis Group 2014) for linguistic scholarship.
  • Guidelines for citing linguistic data to be shared in late 2019 with linguistics journal editors and stylesheet curators.
  • The open-access Open Handbook of Linguistic Data Management (MIT Press Open, est. publication date 2020). 

With this presentation, we aim to encourage practitioners in other fields to initiate similar advancements, and to encourage decision-makers and publishers to actively collaborate with and support scholar-led initiatives working toward better research practices. 

Author Biographies

Helene N. Andreassen, UiT The Arctic University of Norway

Helene N. Andreassen is head of library teaching and learning support at UiT The Arctic University of Norway, where she teaches information literacy, research integrity, and research data management. She holds a PhD in French Linguistics from UiT and is deeply involved in the work on the Tromsø Repository of Language and Linguistics (TROLLing). Her research concentrates on L1 and L2 phonology.

Andrea Berez-Kroeker, University of Hawaii at Manoa

Andrea Berez-Kroeker is an Associate Professor in the Department of Linguistics at the University of Hawaii at Manoa where she teaches classes in Language Documentation and Conservation. She is active in the field of language data sustainability and preservation, especially for endangered languages, and has served as the director of the Kaipuleohone University of Hawaii Digital Language Archive since 2011.

Lauren Collister, University of Pittsburgh

Lauren B. Collister is the Director of the Office of Scholarly Communication and Publishing at the University Library System, University of Pittsburgh, where she oversees activities in library publishing, repositories, and copyright. Her research interests center around language change and technology and how language impacts choice of communication form. She is the current Chair of the Committee on Scholarly Communication in Linguistics for the Linguistic Society of America.

Philipp Conzett, UiT The Arctic University of Norway

Philipp Conzett is a Senior Research Librarian at UiT The Arctic University of Norway. In addition to being the subject librarian for Nordic and Finnish/Kven Languages and Literatures, he works mostly with Open Science and Digital Humanities support at UiT. He is one of the developers and service managers of the Tromsø Repository of Language and Linguistics (TROLLing). His research concentrates on North Germanic morphology.

Christopher Cox, Carleton University

Christopher Cox is an Assistant Professor in the School of Linguistics and Language Studies at Carleton University in Ottawa, Canada. His research centres on issues in language documentation, description, and revitalization, with a special focus on the creation and application of linguistic corpora. He has been involved with community-based language work, most extensively in partnership with speakers of Plautdietsch, the traditional language of the Dutch-Russian Mennonites, and with Dene communities in northern and western Canada.

Koenraad De Smedt, University of Bergen, Norway

Koenraad De Smedt is Professor of Computational Linguistics at the University of Bergen, Norway. He coordinates CLARINO, the Norwegian research infrastructure for language resources and technologies, which provides the Humanities, Social Sciences and related fields with access to language data and tools.

Lauren Gawne, La Trobe University

Lauren Gawne is a David Myers Research Fellow at La Trobe University. Her research focuses on evidentiality and gesture, with specialisation in Tibeto-Burman languages. This research is underpinned by an interest in critical approaches to language documentation.

Bradley McDonnell, University of Hawaii at Manoa

Bradley McDonnell is an Assistant Professor in the Department of Linguistics at the University of Hawai'i at Manoa. His specializations include documentary linguistics, Austronesian languages, interactional linguistics, and usage-based linguistics. He is also interested in improving data management workflows for reproducible research in linguistics.


References

Berez-Kroeker, Andrea L., Lauren Gawne, Susan Kung, Barbara Kelly, Tyler Heston, Gary Holton, Peter Pulsifer, David Beaver, Shobhana Chelliah, Stanley Dubinsky, Richard Meier, Nicholas Thieberger, Keren Rice & Anthony Woodbury. 2018. Reproducible research in linguistics: A position statement on data citation and attribution in our field. Linguistics 56(1): 1–18.

Data Citation Synthesis Group. 2014. Joint Declaration of Data Citation Principles. Martone M. (ed.). San Diego CA: FORCE11.