Monitoring Open Science beyond publications
Datasets and software as research products to be shared
Keywords: French Open Science Monitor, research data, open data, open science, metrics, AI, artificial intelligence
Since 2018, the French Open Science Monitor (BSO) has assessed the effectiveness of the national public policy in open science. This steering tool, developed by the French Ministry of Higher Education and Research, the University of Lorraine and Inria, measures the evolution of open science in France using reliable, open and controlled data updated every year. The result is a website presenting different dashboards, tracking for example the ratio of open access scientific publications by year, discipline or publisher.
Since its latest release in March 2023, the BSO also tracks, on a national scale, the production and openness of research datasets and software mentioned in scientific publications. To ensure realistic coverage, our platform relies on large-scale, open source deep learning techniques applied to the full texts of publications with at least one co-author with a French affiliation.
DataStet identifies every mention of datasets in scholarly publications, both implicit mentions and explicitly named datasets. SoftCite recognizes software mentions in scientific publications and is trained on the Softcite Dataset. Dataset and software mentions are then automatically characterized as used, created or shared by the research work described in the scientific document. These characterizations can be cumulative: a single mention may, for example, be both created and shared. Of the 1,608,839 publications in our corpus, we were able to analyze 655,954 with our tool DataStet. For this subset, we found 6,511,998 mentions of datasets characterized as used, 330,062 mentions characterized as created, and 78,178 mentions characterized as shared.
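To put these counts in perspective, the short Python sketch below derives a few ratios from the figures quoted above. It is illustrative arithmetic only, not part of the BSO pipeline; the variable names are our own, and the ratios are not official BSO indicators. Because characterizations can be cumulative, the three mention counts are not summed.

```python
# Figures quoted in the abstract.
corpus_publications = 1_608_839    # publications in the corpus
analyzed_publications = 655_954    # publications analyzed with DataStet
mentions_used = 6_511_998          # dataset mentions characterized as used
mentions_created = 330_062         # dataset mentions characterized as created
mentions_shared = 78_178           # dataset mentions characterized as shared

# Share of the corpus that could be analyzed.
coverage = analyzed_publications / corpus_publications

# Average number of "used" dataset mentions per analyzed publication.
used_per_publication = mentions_used / analyzed_publications

# Ratio of "shared" to "created" mention counts. The counts are cumulative
# characterizations, so this is only a rough ratio of two totals.
shared_to_created = mentions_shared / mentions_created

print(f"Analyzed share of corpus: {coverage:.1%}")
print(f"'Used' mentions per analyzed publication: {used_per_publication:.2f}")
print(f"'Shared' to 'created' mention ratio: {shared_to_created:.1%}")
```

These are mention-level ratios; the indicators reported by the BSO are expressed at the level of publications.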
With this methodology, the BSO can offer new indicators on the proportion of French publications mentioning the use, creation and sharing of data, as well as the proportion of publications in France that include a "Data Availability Statement". Similar indicators are dedicated to code and software. These indicators are further broken down by discipline, publisher and institution.
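As a rough illustration of how such publication-level indicators could be derived from characterized mentions, consider the minimal Python sketch below. The `Mention` and `Publication` records, the `indicators_by_discipline` function and the toy data are all hypothetical and made up for this example; they do not reflect the actual BSO data model or code.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List

KINDS = ("used", "created", "shared")


@dataclass
class Mention:
    # Hypothetical record: characterizations can be cumulative,
    # so each one is an independent flag.
    used: bool = False
    created: bool = False
    shared: bool = False


@dataclass
class Publication:
    # Hypothetical record: one discipline per publication, for simplicity.
    discipline: str
    mentions: List[Mention] = field(default_factory=list)


def indicators_by_discipline(
    publications: List[Publication],
) -> Dict[str, Dict[str, float]]:
    """Per discipline, share of publications with at least one mention
    characterized as used, created, or shared."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: {kind: 0 for kind in KINDS})
    for pub in publications:
        totals[pub.discipline] += 1
        for kind in KINDS:
            # A publication counts once per kind, however many mentions it has.
            if any(getattr(m, kind) for m in pub.mentions):
                counts[pub.discipline][kind] += 1
    return {
        d: {kind: counts[d][kind] / totals[d] for kind in KINDS}
        for d in totals
    }


# Made-up toy records, for illustration only.
sample = [
    Publication("biology", [Mention(used=True), Mention(created=True, shared=True)]),
    Publication("biology", [Mention(used=True)]),
    Publication("physics", []),
]
result = indicators_by_discipline(sample)
print(result)
```

The same grouping could be applied to publishers or institutions by swapping the grouping key.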
The project addresses major technical and organizational challenges: identifying French datasets and software with the help of artificial intelligence, in the absence of reference registries such as those that exist for publications; and producing relevant indicators for the different scientific communities. Deep learning plays a crucial role as the enabling technology for identifying research datasets and software. This presentation will share the latest results of the project, detail the methodology, and underline the reusability of the project results.
Copyright (c) 2023 Laetitia Bracco, Anne L'Hôte
This work is licensed under a Creative Commons Attribution 4.0 International License.