The rise of open research information resources is transforming the way we track, analyse and study research systems. Increasingly, sources like OpenAIRE, OpenAlex, Crossref, DataCite, ORCID, ROR and others are being used as the basis for making decisions, designing interventions and understanding progress in the science system. This operates at both the small scale, where access to data and evidence is easier than it has ever been, and the very large scale of whole-system analysis.
Traditionally, the capacity to do large-scale analyses was restricted to a very small set of players, like specialised research centres or companies. This kind of large scale analysis usually requires access to an actionable version of the whole dataset, particularly if the goal is combining data resources. The set of sites with access to complete copies of proprietary databases is tiny.
Modern open data sources provide access, including access to full copies of the data, but there has been less focus on providing this access in a way that allows large scale complex querying and connecting of whole data archives - for example, to compare the coverage of research outputs by OpenAlex and OpenAIRE, or to analyse global information on clinical trials using affiliation data from OpenAlex and clinical trials information from PubMed. Another valuable possibility is incorporating local data enrichments from national or regional data sources to support local data needs, or to improve the overall pool of data.
Google BigQuery has emerged as one powerful tool for combining and working on these large datasets at scale. Multiple groups (including the MultiObs team, continuing the work of the InSySPo team at Campinas; SUB Göttingen; Sesame Open Science; and CWTS, amongst others) have created ‘public’ versions of specific open datasets in the BigQuery system, on which anyone can run their own analyses. Through these public versions, the ‘provider’ (i.e. the teams mentioned above) pays for storage, and the user freely accesses the ‘public’ versions, taking responsibility for covering the costs of data querying and processing.
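To make that access model concrete, here is a minimal sketch, in Python with the google-cloud-bigquery client, of how a user might query a hosted public table from their own project so that processing is billed to them and not to the provider. The project, dataset and table identifiers are placeholders for illustration, not the names of any actual hosted copy.

```python
from google.cloud import bigquery

# The client authenticates against your own Google Cloud project, which is the
# project billed for query processing; the provider only pays for storage.
client = bigquery.Client(project="my-own-project")  # placeholder project id

# A simple aggregation over a hosted public table. The dataset and table names
# below are placeholders, not the identifiers of any real hosted dataset.
query = """
    SELECT publication_year, COUNT(*) AS n_works
    FROM `provider-project.openalex.works`
    WHERE publication_year >= 2015
    GROUP BY publication_year
    ORDER BY publication_year
"""

# Run the query in BigQuery and pull back only the small aggregated result.
for row in client.query(query).result():
    print(row.publication_year, row.n_works)
```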
Having worked independently so far, this small group came together last year to ask whether we could coordinate actions. Could it be possible to build a comprehensive open research information resource where the load of providing specific core and relevant open data sources was distributed? Rather than each separately trying to tackle the whole, potentially duplicating efforts, could we collectively create a resource that was more than the sum of its parts?
We met with a series of key questions:
- Can we share resources and burdens to make available key open research information resources in actionable and connectable form in the cloud?
- Through sharing processes and systems, is it possible, over time, to build a standard for how these data sources should be made available?
- What are the challenges that we can usefully approach collectively?
- What are the benefits and risks of Google BigQuery as an environment and do we agree it is the best place to start?
- What are the blockers for engagement with such an effort? What is needed for different stakeholders to make it attractive both as users and (for some) as providers?
User and use case driven
Core to our shared interest in working together was the idea of making it easier for more people to undertake large scale analysis. There are many kinds of analysis for which access to APIs is sufficient. We share a belief that large scale analysis will be useful in multiple settings, but that it has been relatively inaccessible. This inaccessibility is a hurdle to realising the promise of democratization and broader adoption of open research information in all decision making processes around science and scholarship, as proposed by the Barcelona Declaration. APIs are also expensive to run; by taking some of the heavy-load use cases away from APIs we can support providers by reducing their costs, centralising distribution, and allowing APIs to focus on the use cases they are best suited for.
There is a growing set of research projects that are exploiting this capacity for large scale and combined analysis in a range of ways. Two recent pieces of work provide examples of what is possible. One, by Camilla Lindelow and Eline Vandewalle, used the combination of ORCID and OpenAlex provided by InSySPo (now MultiObs) to analyse researchers without a formal affiliation from around the world. The second, from Cespedes and colleagues associated with the UNESCO Chair in Open Science, used a global analysis of language in OpenAlex to examine affiliation. These examples sit alongside other efforts, including comparisons of metadata coverage across sources and combinations of datasets that exploit the capacity to do analysis at scale.
These use cases have a few things in common. They tend to be global in scope (or at least aspire to be) so they require analysis across the whole of a datasource (or a combination of datasources). They generally involve a complex form of query, requiring filtering or analysis on multiple database elements, or a combination of multiple data sources, that is difficult or impossible using the API for any given datasource. And the generated dataset is often very large in its own right - perhaps involving hundreds of millions of rows of data - and requires further reduction and analysis.
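As an illustration of this pattern, the sketch below joins two hypothetical hosted tables and writes the aggregated result into the user's own dataset, so the large intermediate result never has to leave BigQuery. All project, dataset, table and field names, including the nested authorship and institution fields, are assumptions made for illustration rather than the schema of any specific hosted dataset.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-own-project")  # placeholder project id

# Write the (potentially very large) result into our own dataset rather than
# pulling hundreds of millions of rows down to a local machine.
job_config = bigquery.QueryJobConfig(
    destination="my-own-project.scratch.trials_by_country",  # placeholder table
    write_disposition="WRITE_TRUNCATE",
)

# Join works from one provider with clinical trial records from another,
# filtering and grouping on nested affiliation fields. Names are illustrative.
query = """
    SELECT inst.country_code, COUNT(DISTINCT w.doi) AS n_trial_papers
    FROM `provider-a.openalex.works` AS w,
         UNNEST(w.authorships) AS auth,
         UNNEST(auth.institutions) AS inst
    JOIN `provider-b.pubmed.clinical_trials` AS ct
      ON LOWER(ct.doi) = LOWER(w.doi)
    GROUP BY inst.country_code
"""

client.query(query, job_config=job_config).result()
```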
Overall, the common theme here is analyses that require entire data sources to be combinable and actionable at scale. We believe if we focus on that set of use cases we can add something valuable to the overall Open Research Information ecosystem.
Opportunities for shared systems
If people are already doing this, what is the value of coordination? The first and most obvious benefit is that with a shared cloud system we only need to pay for online storage of each dataset once, and then anyone can use it (backups and versions over time are a separate issue, which we aim to address, but not as the first priority). Cloud storage costs are generally larger than the usage costs involved in running queries, so sharing this load is valuable in its own right.
The second advantage is the ability to share capacities. One example of this is data preprocessing. These datasets are not “clean” in the sense that they change over time, have some internal inconsistencies, and often contain elements that raise compatibility issues with database systems. Processing hundreds of millions of lines of JSON to convert hyphens to underscores in variable names takes time and computing power (and money!).
Systems developed within the Curtin Open Knowledge Initiative (COKI) use cloud VMs to do this on demand, which scales but adds costs. The team at Göttingen are using code derived from this on their own HPC resources. The team at CWTS uses their own code to process datasource dumps on local servers so that relational databases can be integrated into their internal database system, while also exporting the results to Google BigQuery. Within the Sesame Open Science system, a further evolution of the COKI code is used to process dumps on local computers. There is a clear benefit to be gained from using a common code base for necessary transformations, but also from having a community discussion on which pathways and transformations are necessary. The MultiObs team uses a quite different approach, creating relational structures from datasource dumps, with advantages (reduced costs, timestamps, etc.) but also disadvantages (lack of live data, need for updates, etc.) that we can learn from. Different approaches and experiences, but also different sets of resources such as HPC, could be shared amongst an effective collaboration.
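As a minimal sketch of the kind of transformation involved, assuming a gzipped JSON Lines dump and illustrative file names, the snippet below streams records and rewrites hyphenated field names to underscores. It is an illustration only, not the COKI, Göttingen, CWTS or Sesame code itself.

```python
import gzip
import json

def normalise_keys(obj):
    """Recursively replace hyphens with underscores in field names so that
    records load cleanly into systems that restrict column names."""
    if isinstance(obj, dict):
        return {k.replace("-", "_"): normalise_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalise_keys(v) for v in obj]
    return obj

# Stream the dump record by record so memory use stays flat, however large the file.
with gzip.open("works_dump.jsonl.gz", "rt", encoding="utf-8") as src, \
     gzip.open("works_clean.jsonl.gz", "wt", encoding="utf-8") as dst:
    for line in src:
        dst.write(json.dumps(normalise_keys(json.loads(line))) + "\n")
```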
This leads to the third advantage. If we use common systems we help to develop quasi-standards that can be adopted by others. That creates an opportunity to spread the load further, as well as to increase the diversity of datasets available (again, thinking of those highly curated national datasets that are used locally but not always recombined into the global data ecosystem). If we have a clear shared approach to the data and how it is managed, it becomes easier for others to contribute, and the whole set of resources becomes more valuable and sustainable. In essence, the more we share the load, the less each of us pays for our own contribution.
A final benefit of a shared approach would be a virtuous loop in which shared systems encourage shared approaches to analysis. Common approaches can form the basis for training resources that give end-users an easy point of entry to using these data sources at scale. They will also encourage the sharing of analysis scripts and protocols, creating advanced and transparent resources to support developing users.
Key to this is understanding both what has value to keep in common and what needs to be different to serve a diversity of use cases. We can see value in technical standards (where they are useful) and in agreements around archiving and preservation. Documentation, where it can reach common standards, will be helpful not just for users of the data, but potentially also for upstream producers in understanding how the data is being used and how to optimize the provision of their data snapshots to facilitate downstream usage.
The Google-shaped elephant in the room
A big question is: why Google BigQuery? It is certainly not an open system in any meaningful sense and Google is not an organisation many of us feel able to trust. The short answer is pragmatism. There are reasons why we independently arrived at GBQ as a useful tool. Google solves a bunch of the hard problems, including authentication without the need for institutional affiliation, systems provisioning and a highly performant database system. In practice, this means datasets can be made publicly available without the need for specific hardware or software on the user's side, and, from a user perspective, access to datasets hosted by different providers is possible using a single system. Standing up an independent infrastructure to do this is a big job and not one we’re equipped to tackle at the moment.
That said, none of us believe that reliance on Google is a desirable long term solution, nor that it is fully equitable. There are some emerging alternatives, both in the cloud and for local computing. These aren’t fully mature but they show promise. In the meantime we believe it is important to ensure we have an exit strategy. One such strategy could be a commitment to creating backups in the form of Parquet files. Parquet is an interesting interoperability format for databases and can be read by an increasing number of tools. It holds schema information and allows for database partitioning.
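As one hedged sketch of what such an exit route could look like, the snippet below exports a hosted table to Parquet files in a cloud storage bucket and then reads the schema back with an open tool. The table and bucket names are placeholders, and this is an illustration rather than an agreed archiving workflow.

```python
from google.cloud import bigquery
import pyarrow.parquet as pq

client = bigquery.Client(project="my-own-project")  # placeholder project id

# Export a hosted table to sharded Parquet files in a storage bucket we control.
extract_config = bigquery.ExtractJobConfig(destination_format="PARQUET")
client.extract_table(
    "provider-project.openalex.works",                  # placeholder source table
    "gs://my-archive-bucket/openalex/works-*.parquet",  # placeholder bucket path
    job_config=extract_config,
).result()

# After downloading a shard, the file can be read entirely outside Google, and
# the schema travels with the data in the Parquet metadata.
table = pq.read_table("works-000000000000.parquet")
print(table.schema)
```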
Perhaps the most important argument is that with Google BigQuery and external archiving, there is at least one plausible option to explore that can provide value immediately, but also provide a potential escape route. We can save the arguments for frozen duck lakes, glaciers, torrents and MySQL for later and for those who will want to have them! But we need to think seriously about how we will work towards more independence and resilience early on in the process.
Next steps and a call for interest
We have made a small start. Small, but useful for us. After all, we are already using these shared data resources. We have demonstrated that without much effort or technical hassle it is possible to share the load, reduce costs and maximize benefits and accessibility. We hope that by engaging a wider community we can make this more useful for more people and move us all closer to the ideals of democratization of open research information, supporting its adoption. How far this goes and how big a community we can create is an open question.
We have made a small start under the label of ORION-DBs, standing for Open Research Information Online Databases. There is now a website that details the datasets available, where they can be accessed and when they were most recently updated. We hope this will be a useful resource for people doing ad hoc analyses, for occasional users, or for those with a one-off interest in taking a look, as well as for those with bigger use cases and ongoing needs for data access. We hope a community of users and also of providers will be interested to coordinate through this platform to aid discovery, adoption and democratization of open research information.
Looking forward, we’re interested in how we can build on this base. We want to coordinate and build a shared capacity. If you have an interest in how this could be shaped, in demonstrating specific use cases, or in contributing additional hosted datasets, we’d love to hear from you. Coordination takes time, and time requires resources. If there is sufficient interest, we will look at how we could coordinate resources and build something as lightweight as possible and as formalised as necessary.
Above all, we want to hear from those who share the vision for creating data resources that can be combined and used together and to make them as useful as possible. You can contact us through info@orion-dbs.community and depending on interest, we will set up other forums. It is through using these data sources that we identify their issues and can correct and improve them. When we do that work together, we increase the quality of all data resources faster, more sustainably and more effectively.
Copyright © 2026 Cameron Neylon, Bianca Kramer, Alysson Fernandes Mazoni, Rodrigo Costas, Najko Jahn, Nees Jan van Eck. Distributed under the terms of the Creative Commons Attribution 4.0 License.