Worldwide, governments have started to ease or end the Covid-19 restrictions, signaling the beginning of the end of a pandemic which, according to the WHO, infected over 400 million people and caused 5.8 million deaths, not to speak of the devastating disruptions it caused to public and economic life.
The end of the pandemic seems to be in sight, a result of Herculean efforts by public authorities and all professional sectors which stood on the front line. It is also a testament to the power of science. As the pandemic unfolded, science helped develop vaccines in record time and supported the design of new treatment protocols and the search for therapeutic agents. It also informed containment measures by public authorities.
A clear indication of researchers’ engagement is the surge of scientific publications dealing with all aspects of the pandemic. 6% of the publications in 2020 indexed in PubMed have been reported to deal with Covid-19.
How was this new research on Covid-19 and earlier related research funded? In a recent paper, we sought to understand the extent to which open data infrastructures - specifically Crossref data - can help answer this question.
We used the COVID-19 Open Research Dataset (CORD-19), “a comprehensive collection of publications and preprints on COVID-19 and related historical coronaviruses such as SARS and MERS”, to operationalize Covid-19 and related research.
Our main objective was to explore the extent to which Crossref funding information can help in identifying the funding organizations behind the papers in CORD-19 and how this compares to the data from two established proprietary bibliometric databases: Scopus and Web of Science.
We focused on funding information extracted from funding statements in scientific papers. Those funding statements provide full funding information by listing one or more funding lines that supported the authors of the reported research. In contrast, funding information provided by funding agencies is only partial, because agencies focus on linking publications to their own organization without considering other organizations which may have co-funded the research. For this reason, we did not use Dimensions, a third commercial bibliometric database, as it does not distinguish between funding information obtained from funding agencies and funding information extracted from funding statements in papers.
We used the CORD-19 version released on 15 February 2021, which contained 474k unique records, of which 260k (55%) had Digital Object Identifiers (DOIs). We linked those 260k records via their DOIs to our Crossref database (snapshot downloaded on 5 March 2021) and also to Scopus and Web of Science (in-house bibliometric databases at CWTS). This enabled us to assess the funding information in the different databases in terms of coverage and accuracy:
- Coverage: to what extent do the databases have funding information for Covid-19 papers?
- Accuracy: to what extent does the funding information correspond to the text of funding statements?
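The DOI-based linking described above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the record layout, the toy DOIs, and the helper names `normalize_doi` and `link_by_doi` are assumptions made for the example. DOIs are case-insensitive, so the sketch lowercases them and strips common resolver prefixes before matching.

```python
def normalize_doi(raw):
    """Return a lowercase DOI without resolver prefix, or None if empty."""
    if not raw:
        return None
    doi = raw.strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def link_by_doi(cord19_records, database_index):
    """Match CORD-19 records to a database keyed by normalized DOI."""
    linked = {}
    for record in cord19_records:
        doi = normalize_doi(record.get("doi"))
        if doi is not None and doi in database_index:
            linked[doi] = database_index[doi]
    return linked


# Hypothetical toy data; real CORD-19 and Crossref records are much richer.
cord19 = [
    {"doi": "10.1000/abc"},
    {"doi": "https://doi.org/10.1000/XYZ"},
    {"doi": ""},  # record without a DOI cannot be linked
]
crossref_index = {
    "10.1000/abc": {"funder": [{"name": "Example Funder"}]},
    "10.1000/xyz": {"funder": []},
}

linked = link_by_doi(cord19, crossref_index)
# Coverage question: which linked records carry any funding information?
with_funding = [doi for doi, work in linked.items() if work.get("funder")]
```

Records without a DOI simply drop out of the linked set, which is why the analysis is restricted to the 260k records with DOIs.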
Coverage by database
The figure below shows the results of the data linking and the availability of funding information in the three databases.
Overall, the results show a limited coverage of funding information in Crossref compared to the proprietary databases. We find that, of the 260k records with DOIs, 45k have funding information in the Crossref database (17%), while 61k and 73k have funding information in Scopus and Web of Science (WoS) respectively, corresponding to a coverage of funding information of 24% for Scopus and 28% for WoS.
The differences in coverage are influenced by different indexing strategies of the databases. Only 66% of the 260k records are indexed in WoS. Scopus indexes 72% of the records, while almost all records (98%) can be found in Crossref.
Considering only publications indexed by a given database, we found that 33% of the CORD-19 publications indexed in Scopus have funding information, while this is the case for 43% of the CORD-19 publications indexed in WoS. The corresponding share for Crossref is 18%.
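The two ways of expressing coverage (relative to all records with DOIs, and relative to the records a database actually indexes) can be reproduced approximately from the rounded figures above. The sketch below uses only the numbers reported in this post. Because the counts are rounded to thousands and the indexing shares to whole percentages, the results agree with the reported percentages only to within about one percentage point.

```python
TOTAL_WITH_DOI = 260_000  # CORD-19 records with a DOI (rounded)

# Records with funding information per database (rounded, as reported).
with_funding = {"Crossref": 45_000, "Scopus": 61_000, "WoS": 73_000}

# Share of the DOI records indexed by each database (as reported).
indexed_share = {"Crossref": 0.98, "Scopus": 0.72, "WoS": 0.66}


def coverage(db):
    """Return (coverage over all DOI records, coverage over indexed records)."""
    overall = with_funding[db] / TOTAL_WITH_DOI
    among_indexed = with_funding[db] / (indexed_share[db] * TOTAL_WITH_DOI)
    return overall, among_indexed


for db in with_funding:
    overall, among_indexed = coverage(db)
    print(f"{db}: {overall:.1%} of all DOI records, "
          f"{among_indexed:.1%} of indexed records")
```

The gap between the two measures is largest for WoS, which indexes the smallest share of the records, and almost disappears for Crossref, which indexes nearly all of them.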
By comparing publications with funding information in the three databases, we found a relatively low overlap. Considering only publications indexed in all three databases, only 33% of the publications with funding information in at least one database have funding information in all three databases. For the two proprietary databases, the overlap is 64%.
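One way to read the overlap figure is as the number of publications with funding information in every database, divided by the number with funding information in at least one. The sketch below implements that reading with Python sets. The identifiers are hypothetical toy data, and the function assumes its inputs have already been restricted to publications indexed in all databases being compared.

```python
def overlap_share(funded_sets):
    """Share of publications funded in at least one set that are funded in all."""
    sets = [set(s) for s in funded_sets]
    if not sets:
        return 0.0
    in_any = set().union(*sets)
    in_all = set.intersection(*sets)
    return len(in_all) / len(in_any) if in_any else 0.0


# Hypothetical publication identifiers with funding information per database.
crossref_funded = {"a", "b"}
scopus_funded = {"a", "b", "c"}
wos_funded = {"a", "c", "d"}

three_way = overlap_share([crossref_funded, scopus_funded, wos_funded])
two_way = overlap_share([scopus_funded, wos_funded])
```

With this definition, adding a database can only lower the overlap or leave it unchanged, which is consistent with the three-way figure being lower than the two-way figure for the proprietary databases.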
Coverage by publisher
Funding information made available by Crossref is based on metadata submitted to Crossref by publishers (and, more recently, also by funding bodies). We looked at differences between publishers in the percentage of publications with funding information, comparing the percentage of publications with funding information in Crossref to the corresponding percentages in Scopus and WoS. As shown in the figure below, some publishers submit funding information for almost all their publications (notably Oxford University Press and American Chemical Society). Other publishers, such as American Medical Association, Cambridge University Press, and JMIR, do not submit any funding information at all. The majority of publishers submit funding information for only part of their records. Some of these publishers may have started submitting funding information only recently.
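A per-publisher breakdown of this kind can be sketched as below. The example assumes each record is a dictionary with a `publisher` string and an optional `funder` list, mirroring the field names of Crossref work metadata. The publisher and funder names are placeholders, and the code is an illustration rather than the analysis behind the figure.

```python
from collections import defaultdict


def funding_share_by_publisher(works):
    """Return, per publisher, the share of records with any funder entry."""
    totals = defaultdict(int)
    funded = defaultdict(int)
    for work in works:
        publisher = work.get("publisher", "unknown")
        totals[publisher] += 1
        if work.get("funder"):  # a missing or empty list counts as no funding info
            funded[publisher] += 1
    return {publisher: funded[publisher] / totals[publisher] for publisher in totals}


# Hypothetical toy records.
works = [
    {"publisher": "Publisher A", "funder": [{"name": "Funder X"}]},
    {"publisher": "Publisher A", "funder": []},
    {"publisher": "Publisher B"},
]

shares = funding_share_by_publisher(works)
```

A share of zero for a publisher says nothing about whether its papers were funded; it only shows that no funding metadata was deposited for them.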
The fact that a database has funding information for a given paper does not guarantee that this information is also correct.
In our analysis, we assessed the accuracy of funding information by determining – for small samples of papers – if the funding information in a given database corresponds to the funding statement contained in the full text of a paper.
Overall, we found that funding information in WoS has the highest accuracy. Compared to WoS, Crossref has a higher share of publications for which we could not confirm the correctness of the funding information (based on the full text of a publication). The highest share of publications for which the funding information seems incorrect was found for Scopus.
Another problem, which is more acute in Scopus than in the other two databases, is the erroneous identification of pharmaceutical companies as funders of a given study. In a small random sample, we found that in most cases in which Scopus lists a pharmaceutical company as a funder, this is an artefact resulting from an error made by the algorithm used by Scopus to extract and structure funding information. In those cases, the algorithm incorrectly treats the “conflict of interest” or “disclosure” section of a paper as a funding statement. As a result, pharmaceutical companies are incorrectly presented as funders of the research presented in a paper.
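To make this failure mode concrete, the sketch below shows one simple guard: funders are only looked for in sections whose heading suggests a funding statement, and sections headed as conflict-of-interest or disclosure statements are skipped. This is not the algorithm used by Scopus, whose internals we do not know. The heading patterns, the function name `funding_sections`, and the sample texts are assumptions chosen for illustration.

```python
import re

# Headings that plausibly introduce a funding statement (assumed patterns).
FUNDING_HEADING = re.compile(
    r"\b(funding|financial support|acknowledge?ments?)\b", re.IGNORECASE
)
# Headings that should never be parsed as funding statements (assumed patterns).
EXCLUDED_HEADING = re.compile(
    r"\b(conflicts? of interest|competing interests?|disclosures?)\b", re.IGNORECASE
)


def funding_sections(sections):
    """Keep only (heading, text) pairs that plausibly hold a funding statement."""
    kept = []
    for heading, text in sections:
        if EXCLUDED_HEADING.search(heading):
            continue  # conservative: skip even mixed "Funding and disclosure" headings
        if FUNDING_HEADING.search(heading):
            kept.append((heading, text))
    return kept


# Hypothetical paper sections.
sections = [
    ("Funding", "This work was supported by Example Research Council grant 123."),
    ("Conflict of interest", "Author A received speaker fees from Example Pharma."),
    ("Disclosures", "Author B is a consultant for Example Pharma."),
]

kept = funding_sections(sections)
```

A heading-based filter is deliberately crude: it will miss funding statements placed under unusual headings, so it trades some coverage for fewer false funders.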
Lessons learned and outlook
Our work gave us a number of unexpected insights. For example, we saw that, despite the limited coverage of Crossref, statistics based on funding information in Crossref approximate quite well the statistics based on data from proprietary platforms with higher coverage. When ranking funders by the number of publications they funded, Crossref gives largely the same picture as WoS. When considering only publications whose funding information includes a disambiguated funding entity, the differences between Crossref and the proprietary databases shrink to single-digit percentage points (14% for Crossref, 18% for WoS, and 20% for Scopus).
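A comparison of funder rankings across two databases can be sketched as follows. The example assumes funder names have already been disambiguated to a common identifier, counts each funder once per publication, and uses the overlap of the top-k funders as a simple similarity measure. The measure, the function names, and the toy data are illustrative assumptions, not necessarily what was used in the paper.

```python
from collections import Counter


def rank_funders(funders_per_publication):
    """Rank funders by the number of publications that list them."""
    counts = Counter()
    for funders in funders_per_publication:
        counts.update(set(funders))  # count each funder once per publication
    return [name for name, _ in counts.most_common()]


def top_k_overlap(ranking_a, ranking_b, k):
    """Share of the top-k funders that the two rankings have in common."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


# Hypothetical disambiguated funder identifiers per publication.
crossref_pubs = [["F1"], ["F1", "F2"], ["F1", "F2"], ["F3"]]
wos_pubs = [["F1", "F2"], ["F1"], ["F2", "F4"], ["F1"]]

crossref_ranking = rank_funders(crossref_pubs)
wos_ranking = rank_funders(wos_pubs)
top2 = top_k_overlap(crossref_ranking, wos_ranking, k=2)
```

In this toy example the two databases agree on the top two funders even though they hold funding information for different publications, which mirrors the observation that lower coverage does not necessarily distort the ranking.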
A second lesson is that funding information should not be blindly trusted as the information can give an inaccurate picture of the funding landscape. For example, pharmaceutical companies feature prominently among top funders of Covid-19 research according to Scopus, but this is largely an artefact of the extraction algorithm.
Reflecting on suggestions for improvements, we believe that publishers should be encouraged to sustain and intensify their efforts to submit funding information to Crossref. Several prominent publishers have not yet started to submit funding information. Given the importance of this information, we urge these publishers to start working on this as soon as possible.
Not all publishers have the resources to extract and structure funding information. An arguably simple but impactful improvement would be to also include the raw text of funding statements in the metadata submitted to Crossref. This may not only increase the share of publishers submitting funding metadata but also help improve the quality of funding information. For example, errors made by algorithms could be detected more easily and, as better algorithms become available, they could be applied, also retrospectively, to the available funding statements to turn them into high-quality structured funding information.
We also welcome the recent efforts made by Crossref to work together with funders to assign DOIs to research grants. This offers a great opportunity for funders to get better data on the results and impact of the research they fund.