Scraped? How Database Rights Can Protect Academic Repositories
DOI: https://doi.org/10.7557/5.8223
Keywords: copyright, database right, database, AI training, open access, institutional repositories, AI tools, licence, web scraping
Abstract
(Watch the VIDEO.)
Institutional repositories are a cornerstone of the open science ecosystem, enabling global access to scholarly work and supporting long-term preservation. Yet in an era of automated data extraction and rapidly advancing AI technologies, these repositories face a new and largely unaddressed threat: large-scale, covert scraping of academic content by commercial entities, often for training proprietary AI models. This practice bypasses scholarly norms, offers no transparency or attribution, and exploits institutional infrastructure without consent. Despite the ethos of openness, such unchecked reuse risks distorting scholarly communication, undermining author rights, and compromising the trust that underpins open research infrastructure.
This paper proposes a pragmatic, legally grounded response to this emerging risk. While copyright law has limited utility in protecting open access materials, UK and EU database rights provide an underused but powerful mechanism. These rights recognise the institutional investment involved in maintaining repositories and can be used to enforce contractual restrictions on content reuse. Coupled with technical measures such as rate limiting, bot detection, and machine-readable licences, they allow universities to protect their repositories without impeding access for legitimate research or non-commercial users.
Based on implementation planning at the University of Edinburgh, this model includes a layered approach: asserting database rights, requiring end-user licence agreements, and enabling responsible access through APIs. The proposed strategy aligns with open science principles by encouraging transparency, accountability, and stewardship, rather than secrecy or restriction. It offers a scalable response to an infrastructural blind spot in the current scholarly communication ecosystem.
Repairing this gap is not about closing doors but reinforcing them against unethical exploitation. If repositories are to remain open and credible, they must also be governed. This proposal offers a forward-looking model that protects academic infrastructure while upholding the values of openness, equity, and responsible innovation in the age of AI.
License
Copyright (c) 2025 Eugen Stoica

This work is licensed under a Creative Commons Attribution 4.0 International License.