Response to the White House Office of Science and Technology Policy’s “Request for Information: Public Access to Peer-Reviewed Scholarly Publications, Data and Code Resulting From Federally Funded Research”.
Dear Dr. Nichols,
Thank you for requesting input on “Public Access to Peer-Reviewed Scholarly Publications, Data and Code Resulting From Federally Funded Research”. I am an Assistant Professor of Medical Biophysics and Computer Science at the University of Toronto, where my research focuses on machine learning and genomic data. I am a U.S. citizen.
I strongly support your efforts to make knowledge, information, and data generated by federally-funded research immediately, universally, and freely accessible upon publication. You should require that upon publication of work funded in part by federal sources, the publication itself, associated data, and associated software code must be deposited in a public repository and available immediately, with no embargo period or payment.
What current limitations exist to the effective communication of research outputs (publications, data, and code) and how might communications evolve to accelerate public access while advancing the quality of scientific research?
Publications
The major limitation to immediate public access of publications is journal paywalls. It should be a requirement of accepting federal funding that any resulting publications be freely available immediately, not after an embargo period. Embargo periods and tolls paid to publishers add harmful and unnecessary friction, reducing the impact of taxpayer-funded research.
Opponents of such requirements have brought up concerns about interference in a “private marketplace” (letter from publishers to President Trump, 18 December 2019, https://presspage-production-content.s3.amazonaws.com/uploads/1508/coalitionletteropposinglowerembargoes12.18.2019-581369.pdf). In truth, there is no such “private marketplace”. The publishers generally publish articles provided to them at zero cost, written by researchers whose salaries are paid for by public funds.
In exchange for receiving federal funding, recipients must agree to certain restrictions. For example, they usually are not allowed to set the money on fire. This is not interference with a private marketplace for banknote ash; it is a sensible restriction to ensure that the federal funding provides a public benefit. Similarly, maximizing public benefit rather than maximizing “choice” must be the driving force of access policies.
Private organizations such as the Gates Foundation have successfully enforced policies to maximize the impact of their research through zero-embargo public access (“Bill & Melinda Gates Foundation Open Access Policy”, https://www.gatesfoundation.org/how-we-work/general-information/open-access-policy), with few complaints from recipients. Such policies give the recipients leverage to demand beneficial changes in journal policy. No organization has more leverage in this respect than the U.S. government.
Data and code
The major limitation in communicating data and code is that some individual researchers do not find it to be necessary or beneficial to them personally. To counter this, we must have strong requirements for sharing of data and code, a strong enforcement program, and must prioritize funding for those who share data and code well.
Federal research agencies must have strong requirements for data and code sharing. Guidelines that discuss “expectations” or “recommendations” instead of strong requirements are self-defeating.
When requesting funding, applicants must include a data management and materials sharing plan that describes how these requirements will be implemented in detail, and how the plan addresses the 15 FAIR Principles for Findable, Accessible, Interoperable, and Reusable data (https://www.go-fair.org/fair-principles/). The plan must be considered and scored by technical reviewers and not just be an administrative afterthought.
Data management and materials sharing plans for funded projects must be placed on a public web site so that others know what to expect. Grantees' knowledge that their data management and sharing promises are readily available to the public will provide some measure of self-enforcement. The web site should include contact email addresses for the principal investigators of a grant, officials representing the grantee institution, and the funding agency. This will allow for solving issues at the most local level when possible, and for escalation when that proves ineffective.
Agencies should have incentives to encourage high-quality data and code sharing. I suggest that biographical sketches of key personnel include a section where they discuss their most significant contributions to data and resource sharing (including data, code, reagents, samples, and other materials). This should be separate from other contributions to avoid it getting short shrift due to lack of space. The past data and resource sharing record of the principal investigator and other key personnel should be explicitly added to scored review criteria.
What more can Federal agencies do to make tax-payer funded research results, including peer-reviewed author manuscripts, data, and code funded by the Federal Government, freely and publicly accessible in a way that minimizes delay, maximizes access, and enhances usability?
Data
To ensure good data management, any data that a progress report describes as collected must be deposited independently, and an accession code or digital object identifier (DOI) must be supplied. Without an independently verifiable accession code, funding agency officials and reviewers should not consider the existence of such data when deciding on competing or non-competing renewals.
Except when otherwise specified by the funding opportunity announcement, researchers may embargo data until publication, and not beyond. Grant opportunities specifically designated to create a shared resource must specify a date by which data must be available even in the absence of a publication.
There are a large number of digital repositories with different policies. You should require that acceptable digital repositories not allow recipients to unilaterally change or delete deposited data. The repositories may, however, allow recipients to add new versions of the data, provided that these are advertised in the metadata for the original dataset.
Code
Requiring availability of software code for published research is essential to maximize the public benefit of federally-funded research. I and others have written more about the importance of code availability in artificial intelligence research in a recent commentary (“The importance of transparency and reproducibility in artificial intelligence research”, https://arxiv.org/abs/2003.00898).
As with data, you should require that code be deposited in an independent repository that does not allow recipients to unilaterally change or delete the code.
Other materials
Other materials produced in part with federal funds, such as plasmids, cell lines, or mouse strains, must be available through third-party repositories. Agreements for access to these materials must be free of restrictions on the ability to perform or publish further research using these materials. This means that requirements for prior approval by, or collaboration or co-authorship with, the depositors of the materials render a repository unacceptable for this policy. Such requirements impede the action of the normal scientific process to ensure robustness and reproducibility of research, and to build on it to maximize public benefit.
Any additional information that might be considered for Federal policies related to public access to peer-reviewed author manuscripts, data, and code resulting from federally supported research.
Exceptions to access requirements must be narrowly tailored to a specific purpose, individually justified, and receive prior approval by peer reviewers, program staff, and an agency-level advisory committee of data management experts that includes data scientists and librarians. While privacy concerns sometimes prevent sharing of full data associated with individuals, it is often possible to share those data in de-identified form, via platforms that restrict access to qualified researchers, or in summary form.
It is important to protect human participant privacy, but it is also important that concerns about human participant privacy not be abused to eliminate appropriate data sharing. It is especially worth considering that many human participants expect that data from their participation will be shared with other qualified researchers. Ineffective sharing of the resulting data (assuming appropriate protective measures such as de-identification are in place) is unethical, as it wastes human participants’ contributions to research and may result in more patients being exposed to harm. Therefore, it should be an explicit goal of this policy and any submitted data management and materials sharing plans to maximize access subject to necessary restrictions.
Conclusion
Thank you for your work to increase the public benefit of federally-funded research. These benefits will be maximized by strong requirements for public and free availability upon publication of the publication itself and associated data, code, and other materials.
Sincerely yours,
Michael Hoffman
Acknowledgments
Thanks to Prachee Avasthi for helpful comments. Much of the text I used here comes from my previous “Comments on the draft NIH Policy for Data Management and Sharing”.