Building a Community of Practice: Observations of the Current Use of DataCite DOIs as Project IDs
Project identifiers have been used in DataCite for nearly a decade, but adoption has been uneven. Consistent use of Project IDs could help track the provenance of data and other project assets, supporting transparency and compliance with data management practices. Widespread adoption could also help assess the impact of projects, which is vital for funding decisions and for measuring the success and outcomes of research endeavors. A community of practice around Project IDs could support data sharing, collaboration, and interdisciplinary research, enabling the integration of data and resources from various fields to address complex problems effectively.
So, if project identifiers are a valuable tool, how can we build a consistent community of practice? To explore this question, we researched the application of Project IDs within the DataCite corpus. With this understanding of how Project IDs are used, we piloted creating Project IDs and metadata for the Tetiaroa Ecostation and the Gump South Pacific Research Station, two small scientific facilities in French Polynesia.
It is only through this understanding that our communities can start to build better and more robust usage of these invaluable identifiers. Through this research, we show the power of using existing infrastructure to connect projects and their downstream outputs.
Project Metadata: existing practice within the DataCite community
The use of the DataCite metadata schema to capture and connect project metadata is well established. To analyze current practice, we reviewed how users define project information within the resourceType element, typically with resourceTypeGeneral = “Other”. This approach makes new types identifiable and discoverable, and it eases migration if a new type later becomes part of the shared vocabulary on the road to broader adoption.
Finding DataCite metadata that includes “Project” in the resourceType is straightforward using the DataCite API query: https://api.datacite.org/dois?query=types.resourceType:*Project*&page[size]=1 and the facets for this query provide an overview of repositories that are using this type:
Client | Name | Count |
cos.osf | Open Science Framework | 73,828 |
cern.zenodo | Zenodo | 28,687 |
tdl.tacc | Texas Advanced Computing Center | 621 |
gdcc.odum-library | UNC Libraries | 212 |
umich.library | University of Michigan Library | 193 |
fatj.ngeahg | WL - Publications | 167 |
cdl.cdl | California Digital Library | 93 |
unlv.ds | DigitalScholarship@UNLV | 68 |
tib.hawk | HAWK Hildesheim - Hornemann Institut | 58 |
tib.eurescom | Eurescom GmbH | 48 |
| Total | 103,975 |
Table 1. These ten repositories account for 99.75% of DataCite records with "Project" in the resourceType element. Data collected during May 2023.
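The query and the per-repository counts above can be sketched in Python. This is a minimal illustration, not the analysis code used for this post: it assumes the API response carries the repository facet as a list under `meta.clients` with `id`, `title`, and `count` keys, and the function names are hypothetical. To stay self-contained it makes no network call and instead reads a hand-made fragment filled with the values from Table 1.

```python
from urllib.parse import urlencode

DATACITE_DOIS = "https://api.datacite.org/dois"


def project_query_url(page_size=1):
    """Build the query URL quoted in the text (percent-encoded).

    page[size]=1 keeps the payload small, since only the facets are needed.
    """
    params = {
        "query": "types.resourceType:*Project*",
        "page[size]": page_size,
    }
    return f"{DATACITE_DOIS}?{urlencode(params)}"


def client_counts(response_json):
    """Return (client id, name, count) rows from the repository facet.

    Assumption: the facet lives at meta.clients as id/title/count objects.
    """
    facets = response_json.get("meta", {}).get("clients", [])
    return [(f["id"], f["title"], f["count"]) for f in facets]


# Hand-made fragment shaped like the assumed response, using the Table 1 rows.
sample_response = {"meta": {"clients": [
    {"id": "cos.osf", "title": "Open Science Framework", "count": 73828},
    {"id": "cern.zenodo", "title": "Zenodo", "count": 28687},
    {"id": "tdl.tacc", "title": "Texas Advanced Computing Center", "count": 621},
    {"id": "gdcc.odum-library", "title": "UNC Libraries", "count": 212},
    {"id": "umich.library", "title": "University of Michigan Library", "count": 193},
    {"id": "fatj.ngeahg", "title": "WL - Publications", "count": 167},
    {"id": "cdl.cdl", "title": "California Digital Library", "count": 93},
    {"id": "unlv.ds", "title": "DigitalScholarship@UNLV", "count": 68},
    {"id": "tib.hawk", "title": "HAWK Hildesheim - Hornemann Institut", "count": 58},
    {"id": "tib.eurescom", "title": "Eurescom GmbH", "count": 48},
]}}

rows = client_counts(sample_response)
total = sum(count for _, _, count in rows)
print(project_query_url())
print(total)  # 103975, the Total row of Table 1
```

Fetching the URL with any HTTP client and passing the decoded JSON to `client_counts` would give current counts, which will differ from the May 2023 snapshot.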
ResourceTypes
The DataCite resourceType is an optional free-text field that provides more information about the type of a resource than the mandatory resourceTypeGeneral element. This dataset includes records that have “Project” in their resourceTypes. Repositories implement this free-text differently, some with simply “Project” and some with more details about the type of project or the resource:
Client | resourceType |
cos.osf | Project |
cern.zenodo | Project Deliverable, Project milestone |
tdl.tacc | Project/Other, Project/Report, Project/Other/REU, Project/Other/Dataset, Project/Experimental, Project/Other/Check Sheet, Project/Other/Database, Project/Other/Poster, Project/Other/Report, Project/Other/None, Project/Other/Code, Project/Other/Other, Project/Simulation |
gdcc.odum-library | Capstone Project, Project |
umich.library | Project, Master's Project |
fatj.ngeahg | Projectrapport, Project report |
Table 2. Some repositories use simply "Project", others provide more detail.
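Because resourceType is free text, any analysis has to decide which values describe a project itself and which describe a part of one. The sketch below shows one possible way to make that distinction; the helper names are hypothetical, and the rule "exactly Project" is an assumption made for illustration rather than a community convention.

```python
def split_resource_type(resource_type):
    """Split a slash-delimited value such as 'Project/Other/REU' into parts."""
    return [part.strip() for part in resource_type.split("/") if part.strip()]


def describes_project_itself(resource_type):
    """True only when the free-text value is exactly 'Project' (any case)."""
    return resource_type.strip().lower() == "project"


# A few of the values listed in Table 2.
examples = [
    "Project",
    "Project Deliverable",
    "Project/Other/REU",
    "Master's Project",
    "Projectrapport",
]

for value in examples:
    print(value, describes_project_itself(value), split_resource_type(value))
```

Under this rule only the first example counts as a project record; the others are treated as project components or project-related documents.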
Note that the largest user of DataCite projects (cos.osf) has records that clearly describe projects, i.e. resourceType = “Project”. Other repositories include text that describes various parts of projects.
Projects as Hubs for Resources, People, and Organizations
Projects are composed of the resources used for planning, executing, and reporting on the work done during the project and the people and organizations that fund and participate in the project. Project metadata can be a hub for connecting all of these.
Resources are connected to a project through related resources and relation types in the project metadata. Detailed analysis of the DataCite project metadata shows that only two of the largest project repositories in DataCite include related resources and, in those cases, the connections are available for only a small portion of the projects. A few projects from Zenodo that stand out as connectors of many resources are shown in Figure 1.
Figure 1. Several projects with diverse relations and some inter-connections.
People and organizations are connected to projects with the ORCIDs and RORs that identify them. We examined the DataCite project metadata for these kinds of connections, and the results in Figure 2 show how connectivity varies across repositories:
- gdcc.odum-library and cos.osf repositories, shown in the upper left of Figure 2, include Resource Author Affiliation Identifiers (RORs),
- cos.osf and cern.zenodo repositories, in the center of Figure 2, include funder and award identifiers,
- cern.zenodo repository includes Resource Contact identifier metadata,
- cern.zenodo and tdl.tacc include related identifiers with many relation types.
Identifiers included in more than two repositories are not shown here. That includes Resource Author Identifiers (ORCIDs), included in four repositories, and Resource Author Affiliation strings without identifiers, included in five repositories.
Figure 2. Identifiers found in project metadata records.
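The kinds of identifiers listed above can be pulled out of a single metadata record with a few lines of Python. This is a sketch under assumptions: it expects the record's attributes in DataCite JSON form (`creators` with `nameIdentifiers` and `affiliation`, `fundingReferences`, `relatedIdentifiers`), the function name is hypothetical, and every identifier in the sample record is a placeholder rather than a real ORCID, ROR, or DOI.

```python
def _scheme(entry, key):
    """Upper-cased scheme name, tolerating missing or null values."""
    return (entry.get(key) or "").upper()


def identifier_summary(attributes):
    """Collect author ORCIDs, affiliation RORs, funders, and related identifiers."""
    summary = {"orcids": [], "rors": [], "funders": [], "related": []}
    for creator in attributes.get("creators", []):
        for name_id in creator.get("nameIdentifiers", []):
            if _scheme(name_id, "nameIdentifierScheme") == "ORCID":
                summary["orcids"].append(name_id.get("nameIdentifier"))
        for aff in creator.get("affiliation", []):
            # Affiliations given only as strings carry no identifier.
            if isinstance(aff, dict) and _scheme(aff, "affiliationIdentifierScheme") == "ROR":
                summary["rors"].append(aff.get("affiliationIdentifier"))
    for funding in attributes.get("fundingReferences", []):
        summary["funders"].append(
            (funding.get("funderIdentifier"), funding.get("awardNumber"))
        )
    for rel in attributes.get("relatedIdentifiers", []):
        summary["related"].append(
            (rel.get("relationType"), rel.get("relatedIdentifier"))
        )
    return summary


# Placeholder record: none of these identifiers are real.
record = {
    "creators": [{
        "name": "Example, Researcher",
        "nameIdentifiers": [{
            "nameIdentifier": "https://orcid.org/0000-0000-0000-0000",
            "nameIdentifierScheme": "ORCID",
        }],
        "affiliation": [
            {
                "name": "Example University",
                "affiliationIdentifier": "https://ror.org/placeholder",
                "affiliationIdentifierScheme": "ROR",
            },
            "Affiliation given only as a string",
        ],
    }],
    "fundingReferences": [{
        "funderName": "Example Funder",
        "funderIdentifier": "https://doi.org/10.99999/funder-placeholder",
        "awardNumber": "AWARD-0000",
    }],
    "relatedIdentifiers": [{
        "relatedIdentifier": "10.99999/placeholder-dataset",
        "relatedIdentifierType": "DOI",
        "relationType": "HasPart",
    }],
}

summary = identifier_summary(record)
print(summary)
```

Running a function like this over every project record in a repository and counting the non-empty lists is one way to arrive at a per-repository picture of connectivity.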
The FAIR Island Case Study: Leveraging Existing Infrastructure
Biological field stations, marine labs, and other scientific facilities are an important part of the global scientific research infrastructure, providing access and logistical support that makes scientific contributions across domains from ecology to archeology possible. Currently, researchers submit applications to do work at these field stations. These applications provide metadata, i.e. who, what, when, and where, about proposed projects, but they are generally not visible to other researchers wanting to do work in the same place. Could they form the basis for open project metadata?
The connections discussed above are valuable, but if you can’t see them, it is hard to realize the benefits and promote adoption. The FAIR Island Project prototyped an approach to work with researchers as they are planning their projects to create project metadata records and to mint DataCite DOIs when the projects are approved. Using this approach, we can leverage DataCite infrastructure and capabilities to 1) connect resources, people, and organizations to projects with identifiers, 2) update the metadata records as additional relationships are created (e.g. protocols, datasets, papers), and 3) use DataCite Commons to visualize the growing list of resources related to the field station.
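Step 2, updating a project record as new outputs appear, amounts to appending entries to the record's relatedIdentifiers. The sketch below illustrates that step only; it is not the FAIR Island Project's code. The function name is hypothetical, the DOIs are placeholders, the choice of the HasPart relation type is an assumption, and submitting the updated record to DataCite is left out.

```python
def add_related_identifier(attributes, identifier, identifier_type, relation_type):
    """Return a copy of the attributes with one more related identifier.

    The original dict is not modified, and an identical entry is not added twice.
    """
    related = list(attributes.get("relatedIdentifiers", []))
    entry = {
        "relatedIdentifier": identifier,
        "relatedIdentifierType": identifier_type,
        "relationType": relation_type,
    }
    if entry not in related:
        related.append(entry)
    return {**attributes, "relatedIdentifiers": related}


# Placeholder project record and placeholder output DOIs.
project = {"titles": [{"title": "Example field station project"}]}

project = add_related_identifier(
    project, "10.99999/placeholder-protocol", "DOI", "HasPart"
)
project = add_related_identifier(
    project, "10.99999/placeholder-dataset", "DOI", "HasPart"
)
# Adding the same dataset again leaves the record unchanged.
project = add_related_identifier(
    project, "10.99999/placeholder-dataset", "DOI", "HasPart"
)

print(len(project["relatedIdentifiers"]))  # 2
```

Returning a new dict rather than editing in place makes it easy to compare the record before and after an update when deciding whether it needs to be resubmitted.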
Figure 3 shows one of these projects in the DataCite Commons. The citation to the project and a description are shown on the Description tab while other tabs list Creators and Contributors to the project.
Figure 3. DataCite Commons page for a project with tabs showing descriptive metadata, creators, and contributors. The complete metadata are also available in several representations using the Download Metadata button.
Connecting these projects back to the field stations where they took place is another important goal of this work. Those connections were made by adding the field station as a contributor to the project along with its ROR. All of the projects for the Tetiaroa Ecostation are listed on the organization page in the DataCite Commons (Figure 4). This page includes a clickable list of creators and contributors to the projects, a time history of works from this organization, graphics summarizing work types and licenses (which still need to be added to the metadata), and links to reports with identifiers and more metadata on related works and funders on the far left.
Figure 4. DataCite Commons page for an organization with a ROR that lists all of the works connected to the organization. In this case resources with the resourceType = other are listed as these are the projects included in this experiment.
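In DataCite JSON terms, adding the field station as a contributor with its ROR could look like the sketch below. The details are assumptions: the post does not say which contributorType was used, so HostingInstitution is only one plausible choice, the helper names are hypothetical, and the ROR value is a placeholder rather than the station's actual identifier.

```python
def station_contributor(name, ror_id, contributor_type="HostingInstitution"):
    """Build a contributor entry for an organization identified by a ROR."""
    return {
        "name": name,
        "nameType": "Organizational",
        "contributorType": contributor_type,  # assumed; not stated in the post
        "nameIdentifiers": [{
            "nameIdentifier": ror_id,
            "nameIdentifierScheme": "ROR",
            "schemeUri": "https://ror.org",
        }],
    }


def contributor_rors(attributes):
    """List the RORs attached to a record's contributors."""
    return [
        name_id["nameIdentifier"]
        for contributor in attributes.get("contributors", [])
        for name_id in contributor.get("nameIdentifiers", [])
        if (name_id.get("nameIdentifierScheme") or "").upper() == "ROR"
    ]


STATION_ROR = "https://ror.org/placeholder"  # placeholder, not the real ROR

project = {
    "titles": [{"title": "Example field station project"}],
    "types": {"resourceTypeGeneral": "Other", "resourceType": "Project"},
    "contributors": [station_contributor("Tetiaroa Ecostation", STATION_ROR)],
}

print(contributor_rors(project))
```

Because every project carries the same station ROR, a service that indexes contributors by identifier can gather all of the station's projects on one page, which is the behavior shown in Figure 4.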
Conclusion and Next Steps
Persistent identifiers (PIDs) are crucial for scientific projects in many fields and contexts. They serve to ensure the long-term accessibility, traceability, and discoverability of projects and facilitate the linkage and integration of many kinds of resources that make up research projects.
The DataCite community is already using project metadata for a variety of use cases. Samples of DataCite metadata from ten repositories currently creating project metadata were selected to determine how over 100,000 projects are currently described. All of these repositories and their users can improve utilization of existing capabilities to increase connectivity of their projects. The FAIR Island Project is working to provide working examples that demonstrate how even more of the DataCite infrastructure in the metadata and in the DataCite Commons can be leveraged.
Figure 5 illustrates the resources that make up the research landscape and the relationships between them. Most of the resource types in this landscape are currently supported by the DataCite metadata schema and the growing infrastructure and capabilities built on top of it. Other elements of the landscape are also supported by open, reliable and available infrastructure and data systems, e.g. ORCID, Crossref, ROR, and protocols.io.
Figure 5. Schematic illustration of the existing global research infrastructure built on identifiers for a variety of resources and the partners that work together to add capabilities on top of it.
Discussions of the PID Graph, and of the connected global research infrastructure that it represents, have been going on for several years. This work demonstrates that the connections it facilitates, often hidden in metadata records, are emerging and becoming visible in functional capabilities. No new infrastructure was required to create these project IDs or the connections. We relied entirely on existing DataCite, Crossref, ROR, and ORCID infrastructure to identify these resources and connect them. This is the power of the global research infrastructure and it is exciting to see the connections being formed and displayed.
Now is the time for the research and repository communities to focus on creating, curating, and re-curating high-quality, well-connected metadata for the variety of resource types shown in Figure 5 and using the existing infrastructure to demonstrate the return on investment of that metadata. They can work together to evolve the DataCite schema with improvements for their resource types and use cases. In addition, DataCite could lead discussions to build community consensus and guidelines for Project IDs and other new resource types, increasing the usability and impact of these metadata.
____
Much of the analysis in this post was drawn from: https://metadatagamechangers.com/blog/2023/5/2/project-metadata-in-datacite
The FAIR Island Project Metadata experiment is more fully described in this blog post: https://metadatagamechangers.com/blog/2023/4/30/fair-island-experiments-with-connecting-project-resources-in-datacite and in a talk given at the Earth Science Information Partners’ July 2023 Meeting (recording starts at 50:00): https://youtu.be/nB6nrnrIcF0?feature=shared&t=3008