
Towards research infrastructures that curate scientific information: A use case in life sciences.


Formal Metadata

Title
Towards research infrastructures that curate scientific information: A use case in life sciences.
Number of Parts
12
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Scientific information communicated in scholarly literature remains largely inaccessible to machines. The global scientific knowledge base is little more than a collection of (digital) documents. The main reason lies in the fact that the document is the principal form of communication and, since underlying data, software and other materials mostly remain unpublished, essentially the only form used to communicate scientific information. Based on a use case in life sciences, we argue that virtual research environments and semantic technologies are transforming the capability of research infrastructures to systematically acquire and curate machine-readable scientific information communicated in scholarly literature.
Transcript: English(auto-generated)
I'd like to talk a little bit about a very early-stage prototype system, and also an idea for how we can move towards research infrastructures, in particular digital libraries, that curate scientific information. And I will say a little bit about what I mean by scientific information.
And since this is about life sciences here, we did a use case in life science to demonstrate this on a concrete paper. So the problem is that the global scientific knowledge base, if you want (maybe this is provocative), is little more than a collection of digital documents. So essentially documents are the primary and, in a way, the only representation for how we communicate scientific thought and scientific results. And this has been going on for centuries.
Here is, perhaps, one of the first publications, from 1600, and one from today. And you can see that, well, of course we went digital and so on, but in a way the form that we use to communicate scientific results, scientific thoughts and information hasn't really changed very much, except of course going from print, expensive print, to cheaper digital. And of course the tools have changed. Today you have ways to search these documents, keyword-based search, but also to analyze them
with natural language processing, entity recognition, even linking entities. There are formats to describe the paper and its structure, in particular what the title is, who the authors are, the abstract. All this has, of course, advanced.
And yet the scientific information that is conveyed is really still hardly accessible to machines. And I have a concrete example in the presentation for how this is the case. So let me briefly pause on scientific information and how we get to scientific information
and I'd like to ground this in the research life cycle. So research generally starts with an experiment: you have an idea, you design an experiment, and then you execute that experiment. And this leads, ultimately, these days, to some digital data. You might have data in physical form along the way, but at some point you translate
that data into a digital format. And you acquire this data as primary, uninterpreted data in this research life cycle. So this data at this point doesn't really mean anything; it hasn't been interpreted yet for what it means, for what it tells us in this particular context of the research life cycle. Then you process this primary data. You might aggregate, interpolate, whatever: Fourier transformations, signal processing. And you transform primary data into secondary and tertiary data.
And if there has been a data interpretation step or activity, then you start to have information. So it's really data interpretation that attaches meaning to the data, and to the derivative data, in the research data life cycle: for what the data mean in the particular context of the research life cycle that you are standing in. And then you process this secondary or tertiary data, which carries some information, into secondary information and so on. And this can be very complex; in various research life cycles you might have multiple research data life cycles.
They might join and diverge again and so on. But the point is that you process that information. Think about it: you move perhaps from information about particular individuals that you are studying, and then the secondary information might be statistical information about a population of these individuals. Now, ultimately, you get information out of this life cycle, the research data life cycle, and you feed this information back into scholarly communication, where you integrate information, your results, but also relate your results to existing
information in the literature. And you really learn information here; there is a process of learning in scholarly communication, and the output of that process, in some models, is learned information, which is equivalent to knowledge. This is just to give you a broad overview of how we really get from primary,
the uninterpreted data, in research life cycles to scientific information communicated in scholarly communication, and ultimately to pieces of knowledge: integrated information, both the information that you acquired and generated as well as the information that is already out there. And we focus really on this part: trying to capture the scientific information generated as output of the research data life cycle as it feeds into scholarly communication. So access to information in this sense is really tough for machines.
In print, it's really almost impossible or very, very difficult. You have to go through OCR processes. In PDFs, it's a little bit easier but still very difficult for machines. It's very hard to do anything useful except a keyword-based search.
Maybe you can run NLP and I'm not an expert in that field but you can perhaps identify some proteins and link proteins into text and so on. This is a paper that we used for the study here from HADAD, from the Medical School of Hanover.
And yeah, it's largely in PDF. Of course, the publishers these days give you a PDF, but they often also give you an HTML representation of the same content. That's a little bit easier; you can parse this content a little bit better. And often there are actually XML representations of the content as well, which really have the structure of the paper separated out, so you can ask for the title or the authors in a fine-grained, granular way. But again, access to the scientific content, the information conveyed, is still very difficult. Let's look at a concrete example I've highlighted here.
Sorry if you can't read it in the back, but I hope it's big enough to read. I highlighted here one particular sentence in the paper, and we can walk through it. So the first part, "as shown by electrophoretic mobility shift assays", points to this figure, Figure 1B, in particular that image.
The next part is IRE binding activity was significantly reduced in failing hearts. That points to this p-value there that is less than 0.001 and shows the difference between the two groups.
So on one end it's IRE binding activity, which you can see on the y-axis; that's the variable that is studied here, and you can see that it's reduced in failing hearts. It's the second group here, labeled F; that's the failing hearts. You can see the reduction, and the p-value gives you the statistical significance.
Next it's most pronounced in patients with ischemic cardiomyopathy. That is the second group that splits the failing hearts into two groups, ICM and DCM. And you can see that it's most pronounced in this group, that's this part of the statement.
And then you have this reference to Figure 1B, which is the last bit, referring to this figure. And of course, for a computer to try to figure out all this is quite challenging. So the question is: can we acquire, curate, publish and process formal, meaning machine-readable, scientific information as shown here? This is a claim made by authors, scientific information published in the scholarly record. Can this information be formally acquired, then curated in a system, and potentially published and, in particular, processed down the road? And the answer is: yes, we can. It's not exactly trivial, perhaps, but it doesn't need a major shift either; it's not rocket science, it's not going to the moon. So I think it's doable by addressing some of the challenges
that I will talk about later on. So the intuition is actually very simple. Scientific information does exist in more or less formalized representations before it is encoded in documents. So researchers, they go through the research data life cycle or the research life cycle
and then generate this information. And at some point in this pipeline, the information is there in more or less formalized ways. Usually you have it in an Excel sheet or MATLAB scripts or something. But the information is there. And what we would like to do is capture it along the research life cycle before it is encoded into the document, which is really the form for human consumption. Once you have it in that form, it's very difficult to reverse-engineer the information; it's expensive and, to some extent, also not very smart, I would say. So the question is: can we acquire the information at birth, basically, when it's generated
during the research life cycle? And I have here a couple of ingredients of a possible solution. One is vocabulary. The other is semantics-enhanced data analysis. And I will talk a little bit about this also. And then I suggest virtual research environments are also a useful key ingredient of a solution.
Vocabulary. There's a lot of vocabulary out there, as you all know. This is an example of vocabulary for statistical hypothesis tests, for one particular type of test. And you can find this vocabulary.
You can look it up in ontology lookup services, the EBI Ontology Lookup Service for instance. This is from the statistical methods ontology, a vocabulary concept. And these ontologies are more or less formal; this particular one has in fact rather deep semantics. It includes and describes the meaning of this term, also for its input and output. So the output, the p-value, is also some class. And you have an input: links even to the continuous variables of the data that were used underneath to compute the t-test statistic.
So vocabulary is absolutely important, and all the semantic technologies to encode this and curate this in systems, triple stores, graph databases; all this is of course relevant here. The second thing is that we need to really think about how we can integrate data analysis with these kinds of technologies. The problem is that in data analysis, these values are simply values out of context. Of course the context is there; the researcher knows the context. But it's not explicit, it's not captured; it's implicit in the minds of people and perhaps in the workflow. So what we want to do is make sure that the values generated by researchers, for instance in Jupyter, in Python or in other languages, really carry their context and meaning: that the information generated out of these workflows is in fact captured, explicit
and formal and curated in a system. And the third ingredient that I suggest is virtual research environments because these are systems we can engineer to meet these advanced requirements and then ensure that
things are standardized. So not just the workflow can be standardized, and the software, how something is done, but in particular we can also make sure that the values that come out of these workflows, for instance the p-value, are really semantic information objects, not just numbers without interpretation.
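To make the idea of a "semantic information object" concrete, here is a minimal sketch in Python. This is my own illustration, not the actual VRE implementation: instead of returning a bare float, the workflow returns the value bundled with its ontology concept and its provenance. The OBI identifier for p-value is given as I understand it; treat it as an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticValue:
    """A computed value bundled with its meaning: a sketch of a
    'semantic information object', not the actual VRE implementation."""
    value: float
    concept_iri: str   # ontology concept identifying what this value IS
    label: str         # human-readable label of the concept
    derived_from: str  # the activity that produced the value

# Hypothetical instance; the OBI IRI for 'p-value' is an assumption.
p = SemanticValue(
    value=1.3e-8,
    concept_iri="http://purl.obolibrary.org/obo/OBI_0000175",  # p-value
    label="p-value",
    derived_from="two sample t-test with unequal variance",
)
print(p.label, p.value)
```

The point is simply that the number never travels without its interpretation.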
And we have tested this a little bit with VREs on the D4Science infrastructure, the Italian CNR platform, where you can spawn, create, particular VREs for certain communities to address certain scientific problems or questions that they are looking at, and so implement a concrete data analysis workflow for a research life cycle. So, quickly to the results; it's very easy actually. We simply took, for that Figure 1B, the data, so the data that you saw in the plots.
And of course, it would be great to have that data somehow accessible, not just in the paper, where a machine would have to go and try to figure out what the original data is; that's extremely expensive to do. And what we need here is not that kind of reverse engineering: we need to make sure that the data is published in the first place. So again, capturing this data at birth, when it's generated in data analysis. But we were lucky: we had the PI of the previous paper as a co-author here as well, Tibor, and he emailed me this data.
Originally I actually took the points from the paper, but it's a point cloud; you can't actually tell how many data points there are. So then I asked Tibor: can I have the original data? And he said, yeah, let me check, I might still have it. And then he found it, and he copied it into an email to me. So this is where we stand; this is how things are done. Of course, this is small data; when you have big data, it perhaps cannot be done like this. But for the long tail of science, small groups, little data, little science projects, this is still how it's often done.
So I got those data points, and I simply replicated this t-test, which you can implement in Python very easily. And you get that value, 1.3 × 10⁻⁸. So that's the value, and this is what any Python environment, or Jupyter, or R, would give you as feedback. And this is what needs to be changed. This value is taken out of context. Once you have it like this, of course, the researcher knows it's a p-value, that it's related to this particular statistical hypothesis test and so on.
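As an illustration of exactly this problem, here is a minimal sketch of such a replication in Python, using invented stand-in numbers rather than the emailed data: a two-sample t-test with unequal variance (Welch's test) returns precisely this kind of bare, context-free number.

```python
# Sketch of the replication described above. The IRE binding activity
# values below are invented stand-ins, NOT the data from the paper.
import numpy as np
from scipy import stats

non_failing = np.array([1.02, 0.95, 1.10, 0.98, 1.05, 0.91, 1.08, 0.99])
failing     = np.array([0.41, 0.38, 0.52, 0.45, 0.36, 0.49, 0.40, 0.44])

# Welch's test: two-sample t-test with unequal variance
t_stat, p_value = stats.ttest_ind(non_failing, failing, equal_var=False)
print(p_value)  # a bare float, stripped of all context
```

Nothing in the output says "p-value", names the test, or points to the variable and data behind it; that context lives only in the researcher's head.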
But we need to capture this; we need to ensure that the infrastructure does this automatically behind the scenes, without the researcher even noticing. And one way to do it is to use this vocabulary, and it's OBO Foundry vocabulary, so a little bit hard to read from the identifiers. So I have here a commentary.
So it's basically "two sample t-test with unequal variance", the concept that I showed you before on the EBI Ontology Lookup Service. This has a specified output, a p-value in particular. And that p-value has a value specification, which is a scalar. And then eventually you have the numeric value associated, 1.3 × 10⁻⁸. So that's the p-value. This is what the infrastructure, in my opinion, should do behind the scenes. And if we do this, we can go further. We can relate the t-test, or the test description more generally,
not just to the output, so the p-value here, which is what I showed in the previous slide, but also to the input variable, which would be IRE binding activity in our case, and also to the data: you can relate it to the continuous variables underneath that were used to compute the test description or the test statistic,
but also to the script. So this is what we are currently doing: getting the scientific information itself captured. But of course, you can then relate this scientific information to the broader context, in particular the script that was involved, to which the test description was attributed.
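The chain described above can be written down as plain triples. The sketch below uses stdlib Python only (no triple store) and readable OBO-style identifiers that reflect my reading of the STATO/OBI terms (two sample t-test with unequal variance, has_specified_output, p-value, scalar value specification); verify the exact IRIs against the EBI Ontology Lookup Service before relying on them.

```python
# Plain (subject, predicate, object) triples sketching the chain described
# above; the CURIEs are assumptions, my best reading of the STATO/OBI terms.
triples = [
    ("ex:ttest1",  "rdf:type",                        "STATO:0000304"),  # two sample t-test, unequal variance
    ("ex:ttest1",  "OBI:has_specified_output",        "ex:pvalue1"),
    ("ex:pvalue1", "rdf:type",                        "OBI:0000175"),    # p-value
    ("ex:pvalue1", "OBI:has_value_specification",     "ex:scalar1"),
    ("ex:scalar1", "rdf:type",                        "OBI:0001931"),    # scalar value specification
    ("ex:scalar1", "OBI:has_specified_numeric_value", "1.3e-8"),
    # broader context, as described: input variable and provenance
    ("ex:ttest1",  "OBI:has_specified_input",         "ex:IRE_binding_activity"),
    ("ex:ttest1",  "prov:wasAttributedTo",            "ex:analysis_script"),
]

def find(graph, s=None, p=None, o=None):
    """Match triples against an optional subject/predicate/object pattern."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Once information is curated like this, simple queries become possible,
# e.g. "which resources are p-values?"
print(find(triples, p="rdf:type", o="OBI:0000175"))
```

In a real system these statements would live in a triple store and be queried with SPARQL, but even this toy pattern-matcher shows how curated statements become queryable.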
I'm using provenance vocabulary here. This description, this t-test, is published in some document that is identified by a DOI, authored by a person, and affiliated with an institution. And suddenly you can ask queries like: OK, give me all the people
that have published a paper on IRE binding activity, with a p-value lower than some threshold, and a data set with n greater than 200. I would like to discuss this briefly in the light of the Open Research Knowledge Graph, a TIB-led and initiated project,
where we are looking at creating digital libraries for particular communities that capture and curate this semantic scientific knowledge, so machine-readable scientific knowledge. And I would like to briefly discuss four pathways for how we can get to this information.
One is: the papers are out there, so we can ask, in a crowdsourced manner, people who read these papers to annotate them, and try to get structured scientific information out of this process. Another one is what we do here with Victor and Yaser; you might have heard of this, the DILS experiment. Basically you are the author, you are submitting a paper, which we are simulating here, and at the time of submission, during the submission workflow, you have maybe a form through which we can capture this information in a structured way.
Another way is: you write the paper, and at that point you can make annotations to capture this structured information. And the last one is basically what I showed you here: at the point of data analysis, trying to acquire this automatically through the infrastructure. Here I have an information extraction example as a game.
It's also quite fun. So this is "abstract pub"; you can check it out. Basically it asks one question: how many participants? It gives you an abstract, and you have to figure out the number. Usually there is an "N = something"; here it is 59. This study included 59 people, subjects, and this is the only question the system asks. But it's a fun way to extract this kind of information from abstracts in a gamification approach. A few challenges, technical and social. On the one hand, we really need changes
in the data analysis infrastructure here. We need to capture, as I tried to explain, and represent the context comprehensively. And this should all be invisible, in fact, to researchers; they shouldn't have to create RDF. And we have to deal with the technological heterogeneity
here and all the proprietary systems. Researchers use SPSS, Prism; it's hard to change those systems to meet these requirements. I think the social challenges are even harder, because we need a change in practice: a shift from local computing environments into engineered VREs. And this is a substantial paradigm shift, if you want, in how things are done today, especially in the long tail of research, where data analysis is largely done offline on your own computer.
A couple of mentions of related work. There is a lot of work done to link digital artifacts, in particular papers and articles. Here you have a number of projects; the Scholix framework is one. Or visualizing citations in context, as done in ResearchGate. They're also advancing access to information in a more flexible way. Research Graph, RMap and Research Objects are also projects that are looking at this. Also, vocabulary has a long history already that goes back at least a decade, with HypER, which was one of the first, at least that I'm aware of, approaches to try to link scientific claims to experimental evidence. The Annotation Ontology is also a good example; it has concepts and relations to encode scholarly articles. Nanopublications is perhaps the most famous one, and of course the OBO Foundry has a lot of vocabulary
that is relevant here. And virtual research environments, I would also say, are relevant here in order to standardize data analysis. There has been a lot of work going on in this space globally as well, such as in the US, where they are called science gateways, or virtual laboratories in Australia. So in conclusion, what we are looking at here is to not only have document-based scholarly communication, but to move into an environment where we have more knowledge-based information flows in science. I hope I gave you an overview
of at least four of the approaches here. They are complementary; we can look at all of them, and they complement each other. I don't think there is a competition among these various approaches. We're looking at a couple of them currently in the ORKG project. And really what I want to suggest here today
is that one approach I particularly care for is to try to capture this information at birth, basically, during the research life cycle. Significant challenges are there; I don't think they are impossible to solve, address or overcome, both socially and technically. In particular, socially, it's very hard for researchers to change practices in how things are done in research. But surely, I think it's clear that there are significant benefits to gain from such possibilities down the road.
Thank you very much.