Opening Address: FAIR scientific information: State of the art and future directions

Transcript (English, auto-generated)
Hi everyone, and welcome to this talk on FAIR scientific information, which I'm pleased to deliver as the opening address of the 24th International Conference on Grey Literature. I'm Markus Stocker and I lead the Knowledge Infrastructure Research Group and co-lead the Open
Research Knowledge Graph Initiative at TIB Leibniz Information Centre for Science and Technology in Hannover, Germany. Imagine a world in which scientific information is not just buried in documents, but also organized in databases.
How would we conduct research today? And what would this mean for modern knowledge societies, which rely very much on scientific information, as we have seen during the last years of the COVID-19 pandemic, but also in other contexts of societal challenges like climate change? Scientific information
is really a key source for advancing knowledge and steering societies in the right direction. During the last decades, we have seen a strong digitalization, meaning
a transformative process towards information systems, in many sectors of the economy. Some of you might remember mail-order catalogs as they were sent around to homes. You could get informed about products, their characteristics and prices, and you
could also, to some extent, compare products, maybe even across different catalogs. Today, this is not common anymore. Today we use information systems like Amazon, systems that reflect a digitalization of the information originally printed in documents
towards information systems that are now capable of enabling new services. It's not just that search is easier, with functionality like faceted search;
you can also more automatically compare products across vendors, for instance the characteristics of mobile phones: not just price, but also memory and other characteristics. In a similar manner, we used to print maps on paper, and I remember my parents having printed maps on the back seats of cars,
and we used this approach to guide us from A to B on the hopefully shortest route.
But today, too, we have moved away from printed maps towards digital services, information systems like Google Maps or OpenStreetMap. And again, they have organized the information about the roads, but they also integrate information about the weather or traffic congestion,
and therefore they are not just guiding us on the shortest route from A to B, but on the fastest route, taking into consideration weather and street conditions.
So this is a completely different approach to information management, and it has enabled completely new services on top of this information, in contrast to how things were done a few decades ago.
In research, we have not seen quite such a transformation. We have certainly had a digitization, in contrast to a digitalization. Digitization is not a transformative process towards information systems, but rather a scanning of the originally
printed documents, or a creation of documents in natively digital formats, without moving away from document-based communication.
Indeed, it started several centuries ago with the first journals; as you can see here, Le Journal des Sçavans was probably one of the first scientific journals. And for a few centuries it went on in a similar manner: we have published in text, and the papers haven't really changed that much.
They have a title, they have authors, they have a publication date, and usually an abstract and some figures that used to be printed, or maybe even created by hand.
And all this has, of course, moved into the digital space, so today we create these documents natively in digital form, and we use a number of tools to create figures, tables, and other artifacts published in papers.
So there has been a transition, a digitization of these documents, but not really a transformative digitalization. So the information in these papers is still communicated in document form, and
we don't really have a database that curates, manages, and exposes the information for applications in an information system manner.
So scientific information is data, I think that's not controversial, and this data is not FAIR, certainly not for machines. Basically, when we go out and try to find scientific information, what we find and access are documents, which we then need to read to extract the information we need. Machines, yes, they support us in finding documents, largely based on keyword search,
but they don't help us very much in reusing the content, in comparing the content, for instance, or in giving us access to the data inside.
So we don't really find and access scientific information itself, and the information inside is not really interoperable, often not even for humans, and certainly not for machines. And therefore the reusability of this content, yes, for humans we can read
and extract, but it's very inefficient because of the lack of machine support. So let's look at this paper, this page here, in more detail. The first paragraph has a sentence: IRE binding activity was significantly reduced in failing hearts.
It also points to a figure, 1B, which I'm highlighting here. This figure is basically a plot of data, and this data happens to also be available in CSV format. As you can see, once you encode this data in this form, as a plot, an image,
it's useful for humans to quickly digest the data, but for machines it's not really reusable, because this data is now pixel data, and it's very difficult for machines to extract the actual values once it is in this form.
Note also that the p-value here is approximated as 0.001. That's also useful for humans, because all they need to see is that there is a highly significant difference between these two means. But for machines, and for reusability, for instance in systematic reviews, it would be much more useful to have the actual p-value.
So this is what machines should really be consuming. A machine should be consuming the information about the Student's t-test that was executed here.
The Student's t-test should be typed using a term of a particular ontology; here this is the Ontology for Biomedical Investigations, OBI, which has a term for Student's t-test. A Student's t-test generally also has a dependent variable; in this case, that would be the iron-responsive element binding.
Again, here we can use an existing term of the Gene Ontology. A Student's t-test typically also has a specified input, which is the data used in the t-test. And again, here we can use a direct link, the DOI of the dataset, so
that we can give direct access to humans, but in particular also to machines, to consume this machine-readable data. This stands very much in contrast to the plot, which cannot really be read by machines.
And the Student's t-test also has a specified output; that would be a p-value. Again, we use here a term of the Ontology for Biomedical Investigations. And the p-value has a scalar value specification, which here is the actual p-value: not the approximated value of 0.001, but the actual p-value that was computed.
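To make the structure concrete, here is a minimal sketch of such a description as a Python dictionary serialized to JSON. The ontology IRIs, the dataset DOI, and the p-value are placeholders for illustration, not the actual terms and values from the paper.

```python
# Minimal sketch of the OBI-typed description discussed above.
# All IRIs, the DOI, and the numeric values are illustrative placeholders.
import json

t_test_description = {
    "type": "Student's t-test",  # typed with the OBI term for Student's t-test
    "type_iri": "http://purl.obolibrary.org/obo/OBI_XXXXXXX",  # placeholder OBI IRI
    "has_dependent_variable": {
        "label": "iron-responsive element binding",
        "iri": "http://purl.obolibrary.org/obo/GO_XXXXXXX",  # placeholder Gene Ontology IRI
    },
    "has_specified_input": {
        "label": "IRE binding activity data",
        "doi": "https://doi.org/10.xxxx/example-dataset",  # DOI of the CSV dataset (placeholder)
    },
    "has_specified_output": {
        "type": "p-value",
        "scalar_value_specification": 0.0003,  # illustrative actual p-value, not the rounded 0.001
    },
}

print(json.dumps(t_test_description, indent=2))
```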
Barend Mons, over 15 years ago, said in an article that scientific writing can be called information burying. I think it's a very nice way to put it, because we produce information during the research lifecycle, often in a structured manner,
or we have the source data, used for instance in data analysis, in a structured manner, and then we bury this in products like visualizations and try to mine it again.
And of course this is very expensive to do, because you have to train models or algorithms to do so, and it's also highly inaccurate. Teresa Attwood and colleagues, a few years later, noted in a paper that we have essentially failed to organize information in rigorous ways,
so that finding what we want and understanding what is already known becomes an increasingly costly exercise. This is the inefficiency of extraction from text that creates that cost.
And Julian Higgins and colleagues, in the Cochrane handbook, which, as you might know, focuses on the extraction of information about
clinical trials in the life sciences, have this premise as well: that they would like to build databases with this information. And note that this is largely a manual process.
There have been, of course, developments in machine learning in recent decades, but this kind of data extraction from documents is still a manual process, and therefore very expensive to do. Still, it is no longer the 17th century, of course, and many things have happened, in particular at the level of metadata.
The research infrastructures, the scholarly communication infrastructures, have agreed on certain metadata standards, and also agreed on mechanisms to interlink metadata about artifacts:
articles and datasets, but also authors and organizations. These entities are increasingly also persistently identified, described with standardized metadata, and interlinked. That's great, because given any of these entities you can basically traverse this
graph, also called the PID Graph, and discover assets and entities that are related. And for this there are, of course, also a number of systems that manage this metadata, like current research information systems in the euroCRIS context, VIVO as an open-source system, or Pure
from Elsevier. Graph technologies have also been used increasingly: there are a number of projects, such as the Australian Research Graph, OpenAIRE has a research graph as well, and Springer Nature has the so-called SciGraph, which builds on these kinds of graph technologies as a database.
So that's interesting and certainly useful, because it organizes the metadata around articles and their links to datasets, organizations, people, instruments, and other artifacts relevant in the context of a research investigation.
But these systems don't really capture the actual content, the actual scientific information in articles. So that content, the scientific information, remains not really accessible and reusable for machines.
That said, many communities have identified that problem and created efforts, often in the context of systematic reviews, to execute this task of extracting information from the literature, from a corpus that is generally very limited,
of course, because they focus on a particular problem in a specific domain. And some of these communities have made this data available for reuse,
so they have a service where you can download it, and maybe they have also developed some kind of visualization of the data, so that they can enable analytics on top of this highly domain-specific data. There are good examples, for instance, in biodiversity.
Hi-Knowledge.org is a good example: in invasion biology, they have extracted information about hypotheses in invasion biology, their support or questioning in the literature, and their relationships,
so you can look at these networks of hypotheses in invasion biology in a visual manner. Plazi is another well-known example in the biodiversity domain. The Cooperation Databank is an example in the social sciences:
they have about 2,000 papers with extracted information about participants and the number of effects and studies, and they also built a visualization of this data to support analytics.
And the air quality data collection effort created in the context of the COVID-19 pandemic is another good example: they have about 150 publications with a thousand measurements that they extracted from the literature and turned into a database, which again powers some analytics.
Papers with Code is another good example, in machine learning, where they extract standard metrics for certain tasks like object detection, for which communities have developed algorithms and models using certain benchmarks.
You can then basically also show which algorithm is leading today and how the performance of various models trends over time, so you can see how the community progresses on a particular task in machine learning. MetaLab is yet another example, in cognitive development research. I'm sure there are many more that I'm not listing here.
But I hope you get the sense that there are a number of communities that extract information from the literature for their own purposes and try to organize this information. And the Open Research Knowledge Graph is yet another service which has in principle
the same issue at heart, and which aims to scale solutions to this problem beyond a particular domain and beyond a particular research question, making it more generic as a platform.
So the benefits, I think, are obvious. Once we have a database with structured information, we can, for instance, compare scientific information across the literature. You can think of this like Amazon product comparisons, just for scientific information; here we have the case of SARS-CoV-2,
with the basic reproduction number estimate, which has the value 3.1 in the first paper,
and then information about, for instance, where the study was conducted and the time period. So you can start not just comparing across the literature how the value, the basic reproduction number in this case, behaves, but also do some regional studies:
is there a difference between regions? Given this data, you can then also start integrating data from a number of comparisons. For instance, you can derive a mean for the case fatality rate as well as for the average basic reproduction number.
Then you can plot SARS-CoV-2 on such a map of infectious diseases and see its deadliness and contagiousness relative to other infectious diseases. And creating such products, once you have the data extracted from the literature in a system like the ORKG, is relatively easy:
for instance, by simply supporting the integration of a system like the ORKG with a computational environment, which then allows you to simply read this comparison data into a data frame.
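As a minimal sketch of what this integration can look like, assuming the comparison is available as a CSV export; the comparison identifier, export URL, and column name below are illustrative, not necessarily the documented ORKG endpoint:

```python
# Minimal sketch: load an ORKG comparison into a pandas DataFrame.
# The comparison identifier and export URL are illustrative placeholders;
# the exact endpoint or client-library call may differ.
import pandas as pd

comparison_id = "RXXXXX"  # hypothetical ORKG comparison identifier
csv_url = f"https://orkg.org/comparison/{comparison_id}/export?format=csv"  # assumed CSV export

df = pd.read_csv(csv_url)
print(df.head())

# Once the comparison is a DataFrame, reuse is straightforward, for example
# averaging the reported estimates (the column name is an assumption):
print(df["basic reproduction number"].astype(float).mean())
```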
And so it allows you to reuse this data very efficiently. Another thing that you can do is integrate the ORKG with the existing scholarly communication infrastructure, here for instance DataCite Commons, which is built on top of the PID Graph.
So you can discover these kinds of comparisons as published artifacts in DataCite Commons by looking up their DOI; you thereby allow for persistent identification of these artifacts in the ORKG,
and then you can discover these assets across the scholarly communication infrastructure. And of course we also relate the underlying literature that was used in a comparison, here again for the basic reproduction number, and you can discover these links.
These are now links from DataCite to Crossref DOIs, and you can discover this related or cited literature. You can also do it the other way around: given a paper that was cited in a particular comparison, here on the transmission potential of COVID-19 in Iran,
you can discover that this paper was used in an ORKG comparison. So the million-dollar question, I think, is: can we do what we did for bibliographic metadata, or metadata about other entities and assets,
and what some communities do for specific applications, also for article contents, as we saw in biodiversity, computer science, social science, and so on, for scientific information, but at scale, so efficiently? And I think the main challenge here is how to make FAIR scientific information production efficient.
This is a huge challenge. At TIB, in the context of the Open Research Knowledge Graph Initiative, we are essentially looking at three modalities to address this problem. One is crowdsourcing.
I call this the post-publication, manual approach, where readers or authors can essentially describe the research contributions made in a paper in a structured manner. Here is an example for the basic reproduction number, or rather a study estimating the basic reproduction number.
You can see the value 3.1 here, and you can of course also describe this value with its confidence interval, so it's a structured object. The location, Lombardy, Italy, can be linked to GeoNames to make this data interoperable. The time period, again, is a structured object with beginning and end dates.
And we can also assign a research problem, like the determination of the COVID-19 basic reproduction number. So this is great because it's accurate: humans, experts generally, when they read the paper, can accurately extract the information and ingest it in this manner into the ORKG.
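To make this concrete, here is a minimal sketch of such a structured contribution description serialized to JSON. The property names, GeoNames identifier, confidence interval, and dates are illustrative placeholders rather than the actual ORKG data model.

```python
# Minimal sketch of a manually curated contribution description.
# Property names, identifiers, the confidence interval, and the dates are
# illustrative; the real ORKG uses its own resources, predicates, and templates.
import json

contribution = {
    "research_problem": "Determination of the COVID-19 basic reproduction number",
    "basic_reproduction_number_estimate": {
        "value": 3.1,
        "confidence_interval_95": [2.9, 3.2],  # placeholder interval
    },
    "location": {
        "label": "Lombardy, Italy",
        "same_as": "https://www.geonames.org/XXXXXXX",  # placeholder GeoNames identifier
    },
    "time_period": {"begin": "2020-01-01", "end": "2020-03-01"},  # placeholder dates
}

print(json.dumps(contribution, indent=2))
```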
But since it's a manual approach, it of course doesn't scale very well. As you can imagine, we also tried to automate this process. So in the post-publication setting, we can also think of using NLP models or information extraction approaches to try to automate this task.
Here you can see how NLP models, given a title, an abstract, and potentially also the full text,
can try to suggest certain properties that you might want to use in the research contribution descriptions. That's great, because it is indeed efficient in the sense that it's automated, and you can potentially scale this approach to the millions of articles out there.
But unfortunately, the approaches aren't very accurate yet, in particular for the extraction of highly complex information objects with multiple relations; the performance of this automated extraction is usually quite low.
So you always need an expert, the human in the loop, again, to ensure the quality. Another approach is pre-publication. In contrast to the post-publication approaches, this one is a bit different in
the sense that it binds into the data analysis step in the research data lifecycle, and it ensures that information is created FAIR at birth,
when it is produced, for instance in R or Python data analysis scripts, as I'm showing here. Here you are fitting a linear mixed model to some data, and you can describe this activity: the linear mixed model has an input dataset,
a particular statistical model is described, and also the output dataset. The interesting thing here is that you can basically pass variables that you have in your script anyway, as you compute these values, directly to the step that ensures this information is machine readable.
And then you eventually serialize this into a file, so it stays in the environment of the user, of the researcher conducting this analysis.
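A minimal sketch in Python of what this could look like, assuming an analysis script that fits a linear mixed model with statsmodels; the input file, DOI, and field names are illustrative, and this is not the actual library used in the ORKG workflow:

```python
# Minimal sketch of the "FAIR at birth" idea: describe the analysis step
# inside the analysis script itself and serialize the description to JSON.
# The input file, DOI, and field names are illustrative placeholders.
import json
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("measurements.csv")  # illustrative input dataset
model = smf.mixedlm("response ~ treatment", data, groups=data["subject"])
result = model.fit()

description = {
    "type": "linear mixed model fitting",
    "has_input_dataset": {"doi": "https://doi.org/10.xxxx/input-data"},  # placeholder DOI
    "statistical_model": {"formula": "response ~ treatment", "groups": "subject"},
    "has_output_dataset": {
        # values passed directly from variables already present in the script
        "coefficients": result.params.to_dict(),
        "p_values": result.pvalues.to_dict(),
    },
}

# Serialized to a file that stays in the researcher's environment and can
# later be submitted as supplementary material.
with open("contribution.json", "w") as f:
    json.dump(description, f, indent=2)
```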
It can then be submitted as supplementary material to the publisher, together with the manuscript, into the review process. Here I'm using Copernicus Publications as an example of an open access publisher. So these assets all flow into the review process.
The manuscript is reviewed, and in principle you can also review this supplementary material. In the end, the publisher would essentially ensure that the article, in the metadata of its DOI, is linked via IsSupplementedBy relations to the JSON documents
that we have here as supplementary material, describing some of the information in the article in structured form as well. And you might guess what happens if you do that. Now that we have a published article with its DOI, we can use this DOI and simply harvest
the machine-readable description of some of the information in the article into a system like the ORKG. If we do this, you can see here, for a paper, the two research contributions that describe the linear mixed models that were executed.
You have the input dataset: that's a larger dataset, which is not part of the article itself, so we simply link to it, for instance in an external data repository. You have the input model for the linear mixed model, which describes
in particular the response and predictor variables, but also the output dataset. That is actually a table in the original article, as shown here, but here the data is really ingested into the ORKG, so we can very easily reuse this data:
for instance, again, in computational environments like Python or R, loading this data into a data frame so that we can reuse it very efficiently, or for instance in a systematic review that takes such data from a number of publications to power some meta-analysis.
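As a sketch of how such harvesting could work, assuming the article DOI is registered with DataCite and carries IsSupplementedBy related identifiers; the DOI below is a placeholder, and for a Crossref-registered DOI the lookup would differ:

```python
# Minimal sketch: follow IsSupplementedBy links in a DOI's DataCite metadata
# to find the machine-readable supplements a harvester could ingest.
# The article DOI is a placeholder, so this exact request would return 404.
import requests

article_doi = "10.xxxx/example-article"  # placeholder DOI

response = requests.get(f"https://api.datacite.org/dois/{article_doi}")
response.raise_for_status()
related = response.json()["data"]["attributes"].get("relatedIdentifiers", [])

supplements = [
    r["relatedIdentifier"]
    for r in related
    if r.get("relationType") == "IsSupplementedBy"
]

for identifier in supplements:
    print("Supplementary description to ingest into the ORKG:", identifier)
```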
So, some takeaways. I think scientific information can be expressed so that it is more machine readable than it is currently, and I think this is essential for improving the reusability of scientific information. Given the speed
of science, the need for scientific information digested in various forms and at various abstraction levels, for instance for policymakers, and the speed at which this information is needed in current knowledge societies to address our challenges,
it's really important to make all of this more efficient than it is currently. Some disciplines have indeed developed their own systems, as I showed you, for biodiversity or social science.
There are a number of examples of what this can look like and what you can do once you have information in a database, and they support very interesting tailored applications. But all this is not really general practice: as you know, we publish pretty much exclusively as PDF or maybe HTML documents.
So the question is: how do we realize this vision, and how do we move from prototypes to real systems that are used at scale?
This is indeed not exactly a trivial undertaking. In particular, I think FAIR scientific information production, the question of how we do that efficiently, is really a key challenge, and one that also relies on shifts in practices.
For instance, in the pre-publication modality that I showed you, it is of course required that researchers are aware of this possibility, of these mechanisms, and integrate them into the data analysis phase.
So that's a change that needs to happen, a shift in practice. And I also think that it relies on the infrastructure, on changing the current infrastructure that we have, so that all of this becomes as easy as possible.
So thank you very much for listening and I hope you enjoy this conference.