ARL Comments on the NIH Strategic Plan for Data Science 2023–2028

March 15, 2024

NIH Office of Data Science Strategy (ODSS)
9000 Rockville Pike
Bethesda, Maryland 20892

Re: Request for Information (RFI)—Inviting Comments on the National Institutes of Health (NIH) Strategic Plan for Data Science 2023–2028 (NOT-OD-24-037)

On behalf of the members of the Association of Research Libraries (ARL), thank you for the opportunity to provide comments on the National Institutes of Health (NIH) “Strategic Plan for Data Science 2023–2028.” We applaud NIH for its leadership in public access, specifically its investment in PubMed Central (PMC) and the recently implemented “NIH Policy for Data Management and Sharing.” ARL and its members are committed to the advancement of open scholarship and open access to accelerate scientific and medical advances and to expand diverse, public participation in federally funded research. We appreciate NIH’s commitment to making the results of federally funded research widely available, leveraging persistent identifiers to support scientific integrity, bolstering the data repository infrastructure, and ensuring equitable access.

Decisions made by NIH, one of the world’s largest funders of scientific research, will influence the entire research data ecosystem, with implications for researchers globally. ARL recommends that NIH consider the far-reaching, global impact of its work and priorities with regard to non-NIH-funded researchers in addition to those funded by NIH. 

ARL appreciates that the FAIR and CARE principles are emphasized in the “Strategic Plan for Data Science,” and applauds NIH for partnering with DataCite to support the ability to find and cite NIH-funded data via the use of persistent identifiers (PIDs), and for investing in the Generalist Repository Ecosystem Initiative (GREI). 

  1. Given the importance of generalist repositories, and not just those that are part of the GREI Initiative, ARL suggests NIH recognize Institutional Repositories (IRs) and/or Institutional Data Repositories (IDRs) as viable locations for sharing research data. In particular, data repository infrastructure includes the expertise of dedicated personnel, in the form of metadata specialists, data curators, data stewards, research data management librarians, and other data management specialists, who work with researchers across biomedical disciplines. Library administrators drive research data management and sharing practices and policies within their library departments and across their institutions. Through their strategic oversight and the deployment of specialized staff, libraries ensure the effective management and sharing of research data. Consequently, IRs and IDRs are instrumental in enhancing research data accessibility and reusability, acting as critical resources that support the research lifecycle and advance scientific discovery. Institutional repositories are also critical infrastructure for data sharing in the absence of a disciplinary data repository. 

Recent research[1] by ARL and the Data Curation Network shows an increase (Figure 1) in the number of data deposits in local repository infrastructure, whether institutional repositories or institutional data repositories, among ARL’s academic membership. This increase in data deposits signals the value researchers place on their local, library-based infrastructure.

Area graph, "Total Count of Datasets in Institutional Repositories (IRs), Institutional Data Repositories (IDRs), and Total Overall Number in 2017, 2020, and 2023."

Figure 1: The Total Count of Datasets in Institutional Repositories (IRs) and Institutional Data Repositories (IDRs), and the Total Overall Number in 2017, 2020, and 2023.

  2. Researchers typically incur direct costs related to data management and sharing activities throughout the life cycle of their projects. For this reason, ARL recommends that NIH provide guidance for NIH-funded researchers on how to estimate their data management and sharing expenses at the proposal stage. Recent ARL research[2] has found that NIH-funded researchers spend an average total of $36,000 on research data management and sharing activities across the lifetime of their projects. This research has also found that the average yearly institutional expense (researcher expenses plus institution-based service provider expenses) for data management and sharing is $2,500,000, with a range from approximately $800,000 to over $6,000,000. While researcher direct costs may be included in grant budgets, overall institutional expenses are not yet well accounted for through institutional direct or indirect cost reimbursement.

Notably, as seen in Figure 2, researchers who use institutional services, such as institutional repositories, for data sharing have lower average data management and sharing expenses than researchers who share data through other locations.

Bar graph of how data were shared (supplementary materials, personal website, institutional repository, generalist repository, disciplinary repository, and by request) by average DMS cost.

Figure 2: Average DMS costs per funded research project by how research data were shared.  

Based upon the information above and to ensure researchers have support to meet requirements, ARL recommends that NIH:

  • Minimize the administrative and financial burden of compliance on researchers and institutions by working with institution-based service providers to educate researchers and support the preparation of materials for public access sharing.
  • Specify allowable (and unallowable) costs for data management and sharing activities. This includes explicitly stating if data storage and repository expenses post-award are allowed. 
  • Develop a mechanism to ensure that funds are available post-closeout for publication and research data storage and/or sharing expenses. Post-award funding is particularly important for early-career, postdoctoral, and graduate student researchers whose publication and data-sharing costs may not have been factored into the original grant budget.

  3. ARL suggests that the Data Curation Network (DCN) be acknowledged as an NIH partner organization within the international biomedical data ecosystem (page 12 in the “Strategic Plan for Data Science”). As recognized in the Plan’s Appendix, the DCN has recently partnered with the NLM, providing ongoing training in data management and sharing for researchers, data resource staff, and NIH program staff. The DCN continues its collaboration with NIH through recent funding from the Office of Data Science Strategy’s program, the Data Management Center of Excellence (DMCOE).[3] A key outcome of this partnership is the Summit for Academic Institutional Readiness in Data Sharing (STAIRS), an event that aims to gather key stakeholders from academic institutions, including data service providers, institutional repository (IR) managers, and data curation professionals. The primary goal of the STAIRS summit is to foster connections between institutions, enhancing their capability to share high-quality datasets in a manner that is both FAIR (Findable, Accessible, Interoperable, and Reusable) and ethical.
  4. Require Section 508 compliance for shared data. Ensuring datasets are accessible to all is critical to enabling equitable delivery of federally funded research results. Open, non-proprietary, or commonly available data formats facilitate data accessibility and reusability, especially for individuals with print disabilities. Requiring and enforcing Section 508 compliance for data is essential.
  5. Given the compounding environmental impact associated with the energy consumption of artificial intelligence (AI), machine learning (ML), and large language models (LLMs), ARL recommends that NIH encourage NIH-funded researchers to assess any environmental offsets of the computational systems and infrastructures they utilize. This approach aligns with sustainable research practices and acknowledges the responsibility of the scientific community to mitigate climate change. By evaluating and considering the carbon footprint of these computational methods, researchers can make informed decisions. Encouraging such assessments not only promotes environmental stewardship but also sets a precedent for incorporating sustainability into scientific practices.

We look forward to continued engagement with NIH during the development of the agency’s public access plan. We are happy to work with NIH to identify ARL member institutions to participate in conversations regarding any of these specific topics. Please feel free to contact me or my colleague Cynthia Hudson Vitale, Director of Science Policy and Scholarship (cvitale@arl.org), with any questions about these comments.

Sincerely,

Andrew K. Pace
Executive Director

[1] Priesman Marques R, Narlock M, and Taylor S. “Essential Infrastructure Indeed: Institutional Repositories and Data.” 2024. https://doi.org/10.17605/OSF.IO/R68FE. Manuscript forthcoming.

[2] Hofelich Mohr, Alicia, Jake Carlson, Lizhao Ge, Joel Herndon, Wendy Kozlowski, Jennifer Moore, Jonathan Petters, Shawna Taylor, and Cynthia Hudson Vitale. Making Research Data Publicly Accessible: Estimates of Institutional & Researcher Expenses. Washington, DC: Association of Research Libraries, February 2024. https://doi.org/10.29242/report.radsexpense2024.

[3] See Data Curation Network. “Announcing the Summit for Academic Institutional Readiness in Data Sharing (STAIRS).” DCN Blog. March 15, 2024. https://datacurationnetwork.org/2024/03/15/announcing-the-summit-for-academic-institutional-readiness-in-data-sharing-stairs/
