Towards Knowledge Graph based Representation, Augmentation and Exploration of Scholarly Communications
Formal Metadata

Title: Towards Knowledge Graph based Representation, Augmentation and Exploration of Scholarly Communications
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/42465 (DOI)
Language: English
Transcript (English, auto-generated)
00:02
Thank you very much. Coming to the topic of today's talk, I want to start with a broader picture. If you look at the history of computing, this is roughly how it started. Of course, it started even earlier with Leibniz; we see that in the logo of Leibniz University, with the binary system developed by Gottfried Wilhelm Leibniz.
00:23
But the first IT systems were developed here, for example the Zuse Z3 from 1941. At that time, you really had to physically interact with the hardware: you had to push and pull registers and instruct the computer to do certain computations.
00:42
Then in the 70s and 80s, this kind of interaction with computers was quite popular and common. In between, there were also punch cards, where you also had to interact physically. By then you no longer had to interact with your muscles, but you still had to know which register stored which data
01:02
and how to shift data from one register to another. So it was very closely aligned to the hardware and its capabilities. Then we computer scientists realized that this is not the most intuitive way to deal with and process information. We got inspired by cooking recipes, which have been used for hundreds if not thousands of years
01:26
to represent procedures for combining ingredients. That gave rise to procedural and functional programming in the 80s and 90s. But that was not the end of the development.
01:42
Then there was the era of objects. Objects have a form, a shape and a function. This is a Chinese incense burner. It has a nice form, a nice shape, but also the function that you can create pleasant room scents
02:01
with the incense sticks you burn in this holder. That inspired computer scientists to come to the object-oriented paradigm of modeling, programming and interacting with the hardware, which organizes information, I think, in a more intuitive way.
02:25
We bundle functions, methods and data in these kinds of objects. We define relationships between different objects, but the data is still somehow hidden inside these objects. What we have seen in the last decade is that we are perhaps entering an era
02:44
where data, information and knowledge play an increasingly important role. I would say it is also a bit inspired by how we humans process information in our brains and exchange information with each other. I would like to talk a bit about this in my talk.
03:02
How can we represent and exchange data, information and knowledge in a cognitive way? How can we use that to improve scholarly communication? That is a topic particularly important here at TIB. One approach, now maybe the main approach for dealing with heterogeneous information,
03:26
is to follow the linked data principles, especially when the information is spread across a distributed information system. I'm a member of the Institute of Distributed Information Systems, and there we have URIs, which identify things in the data.
03:47
URIs are identifiers; in libraries we of course have other identifier systems as well, such as ISBNs. Many of you know the EAN barcodes, for example, which are used for products.
04:02
It's very important to be able to identify all kinds of things in the data, and then to provide a mechanism to look up this information. Just as we can retrieve web documents on the web, we can also retrieve data items using the same principles,
04:22
using URIs, which are dereferenceable via the web protocol and return a description everybody can understand. For web documents we have HTML as a standard representation, and that is the reason why we can build search engines and e-commerce applications,
04:41
and why we can exchange information on the web in this global information space. For data, HTML is not so well suited; instead there is the Resource Description Framework (RDF). I will tell you on the next slide how this works. The same mechanism by which we link between websites we can apply here to the data
05:00
and link between different data items and data stores. These linked data principles were coined by Tim Berners-Lee, who is not only the inventor of the web, but also one of the people behind linked data, or knowledge graphs, as it is now called in a more modern way.
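[To make the lookup mechanism concrete, here is a minimal sketch of dereferencing a URI with content negotiation, in Python with the requests and rdflib packages. It assumes dbpedia.org serves Turtle for this resource via content negotiation; the principle is the same for any linked-data URI.]

```python
import requests
from rdflib import Graph

uri = "http://dbpedia.org/resource/Hannover"
# content negotiation: ask for machine-readable Turtle instead of the HTML page
resp = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=10)
resp.raise_for_status()

g = Graph()
g.parse(data=resp.text, format="turtle")
print(len(g), "triples describing", uri)
```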
05:22
So what is RDF, this data representation formalism? The basic idea is that we organize information very similarly to how we humans organize information in sentences. Sentences consist of a subject, a predicate and an object. In German we have a lot of additional variations,
05:41
but I think more than 90% of languages are based on this simple paradigm, with some variations. We can, for example, say that the Faculty of Electrical Engineering and Computer Science organizes this Antrittsvorlesung (inaugural lecture) today, in 2019, with these as subject, predicate and object.
06:02
Just as we use objects of one sentence as subjects of other sentences in natural language, when we talk or write texts, we can do the same here and say that this Antrittsvorlesung 2019 takes place on the 20th of May 2019, and that it takes place in Hanover.
06:21
And that already illustrates how we can create a small knowledge graph here, consisting of how many triples? How many statements do we have here? Three; Javad knows the answer. So basically three statements create this small knowledge graph, and you can already imagine that you can attach more information to the nodes.
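[As a sketch, not shown in the talk, the three spoken statements expressed with rdflib. All example.org URIs and predicate names are made up for illustration; only the DBpedia URI for Hannover is real.]

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")  # illustrative namespace
g = Graph()

lecture = EX.Antrittsvorlesung2019
g.add((EX.FacultyEECS, EX.organizes, lecture))
g.add((lecture, EX.takesPlaceOn, Literal("2019-05-20", datatype=XSD.date)))
g.add((lecture, EX.takesPlaceIn, URIRef("http://dbpedia.org/resource/Hannover")))

print(g.serialize(format="turtle"))  # three triples, one small graph
```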
06:44
You could, for example, describe the faculty in more detail. We could describe the lecture here in more detail. Here, Hanover actually refers to another entity in DBpedia, a knowledge base, which we extracted from Wikipedia almost a decade ago
07:02
in Leipzig, which contains a lot of structured information about all entities which are described in Wikipedia. And the interesting thing about this data model, this triple statement data model, is that we can also write this down as subject predicate object statements
07:21
and we can easily integrate information from different sources. That is very different from, and difficult with, other data models. If you look at relational databases, at XML or at object-oriented representation paradigms, it takes a lot of effort and time to integrate and combine data models from different sources,
07:43
while here integration is in a way already built into the data model; it is almost trivial to integrate information from different sources. Of course, the idea is also to reuse these identifiers. We see here that organizes, starts, takes place:
08:03
these are predicates from a vocabulary, and when we reuse these predicates, as we are supposed to, the integration also makes sense semantically, and we can, for example, create a search engine for events in Hanover or worldwide by integrating information from many such sources.
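[A minimal illustration of why integration is almost trivial here: merging two RDF sources is a set union of triples. The two inline sources and the population figure are made up for this sketch.]

```python
from rdflib import Graph

source_a = """
@prefix ex: <http://example.org/> .
ex:Antrittsvorlesung2019 ex:takesPlaceIn <http://dbpedia.org/resource/Hannover> .
"""
source_b = """
@prefix ex: <http://example.org/> .
<http://dbpedia.org/resource/Hannover> ex:population 535000 .
"""

g1 = Graph().parse(data=source_a, format="turtle")
g2 = Graph().parse(data=source_b, format="turtle")

merged = g1 + g2  # triple-level union; shared URIs line up automatically
print(len(merged), "triples after merging")
```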
08:23
This kind of knowledge representation is already being taken up, and more and more companies and organizations are using it. Here is an example: information about an enterprise, DHL for instance, we can represent in this way. We can say DHL is active in an industry,
08:42
we can attach labels in different languages, for example Chinese and German, we can describe the headquarters, the height, we can represent units and data items, and in this way establish a knowledge representation which can span not only different languages,
09:02
but also different domains, cross-domain knowledge. We don't have to structure the information in advance; we can basically add more information on the go. And this paradigm of representing information in knowledge graphs is becoming more and more popular. Knowledge graphs are often a fabric of concept, class, property and instance relationships.
09:28
They use some formal representation formalism, for example RDF or the Web Ontology Language (OWL), which builds on top of RDF. A knowledge graph often integrates information from different sources,
09:41
from different domains, and of different granularity. This can be, for example, instance data, which can come from open sources or from private, closed data sources; supply chains and product models in companies are examples of such closed data. It can comprise derived and aggregated data,
10:01
schema data like vocabularies and ontologies, which can be represented in the same way as the data (that is also a difference from other data representation formalisms), metadata, taxonomies to categorize information, links between internal and external data, or mappings, for example, to other data representation formalisms
10:24
such as relational databases. All these types of information and data can be represented according to this subject-predicate-object RDF statement paradigm. Here we see an excerpt of the DBpedia knowledge graph
10:43
with facts about Bob Dylan, the United States, Steve Jobs. And there are many more. DBpedia was one of the first knowledge graphs; meanwhile, there are many others. One which is now better integrated with Wikipedia is Wikidata,
11:01
which basically followed up on the work of DBpedia. There are several in the library world; one example is the German GND, the Integrated Authority File...
11:21
Thank you very much, Lambert. And another one in industry, for example, is Google's Knowledge Graph, which spans many different domains and basically builds the backbone of Google services. Google invested a large amount of money, I think more than $100 million,
11:40
in acquiring a company called Freebase, expanded this work and invested even more in building this backbone, a knowledge graph organizing information about people, places and organizations, which Google now uses to provide services like Google Maps, e-commerce or web search. When you search on Google, you find not only documents,
12:03
but you also find results from the knowledge graph, for example facts about people: what political party they belong to, when they were born, who their children are. Or if you look for restaurants on Google Maps, you get information from this knowledge graph, and it is all interlinked and connected.
12:22
Now, how can we exploit this information? One project we performed, partially also here in Hanover, shows that we can use such knowledge graphs for question answering. The nice thing is that, since I have been here for almost two years, I can also report some results we have already achieved.
12:40
So we can take natural language questions and provide answers to users, exploiting information from knowledge graphs on the web of data, which is basically the web of interlinked knowledge increasingly available on the web, and then provide such question answering services to citizens,
13:01
communities and industry. Many of you have maybe used Google Now or Amazon Alexa; these are instantiations of such question answering systems, operating also on structured knowledge. If you have, for example, a question like "Who is the director of Clockwork Orange?", we have to understand the spoken question,
13:24
analyze it, find data to answer it, and then present the result to the user. As you can already see, it is not that simple,
13:41
because Clockwork Orange, for example, is a named entity: not an orange clockwork, but a piece of art, a movie in this case. Another example is "publications and health reports related to Alzheimer's in Greece".
14:01
That is a question more related to research, where we may need to integrate different data sources, not just one; we could integrate, for example, PubMed or data from the World Health Organization's clinical trials. That is an area where Maria-Esther Vidal, for example, is working intensively with us.
14:21
So how have we tackled this problem? Previously, question answering was often tackled with a monolithic approach. In the WDAqua Marie Curie training network, with a group of 15 PhD students, we tackled the problem
14:41
by developing a component-based architecture for building question answering systems. It is not a monolithic system which takes a question and then provides an answer; it is basically a pipeline with many different components for query decomposition, data source selection,
15:01
query execution, named entity recognition and answer generation. You can integrate many different components depending on the use case and application area. Let me show you how this works in reality; you can actually try it out if you go to wdaqua.eu/qa.
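[To make the pipeline idea concrete, here is a toy sketch in Python: swappable components passing a shared state along. All names and the stubbed lookup are invented; this is not the actual WDAqua implementation.]

```python
from dataclasses import dataclass, field

@dataclass
class QAState:
    """Shared state passed from component to component."""
    question: str
    entities: list = field(default_factory=list)
    query: str = ""
    answer: str = ""

def recognize_entities(state: QAState) -> QAState:
    # placeholder NER: a real component would link spans to KG entities
    if "Clockwork Orange" in state.question:
        state.entities.append("dbr:A_Clockwork_Orange_(film)")
    return state

def build_query(state: QAState) -> QAState:
    # naive query construction from the first recognized entity
    state.query = f"SELECT ?x WHERE {{ {state.entities[0]} dbo:director ?x }}"
    return state

def execute_query(state: QAState) -> QAState:
    state.answer = "Stanley Kubrick"  # stubbed; would run against the KG
    return state

pipeline = [recognize_entities, build_query, execute_query]
state = QAState("Who is the director of Clockwork Orange?")
for component in pipeline:  # components are freely swappable
    state = component(state)
print(state.answer, "|", state.query)
```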
15:24
You can test it yourself. Let's have a look. For example, if you are interested in our Clockwork Orange question, you can enter it here at the top
15:40
and then you actually get the answer. You not only see the answer, which is derived from Wikipedia; you also see a confidence value, that is, how confident the algorithm is about the result. You can also provide feedback. And of course, you can ask a lot of other questions.
16:02
For example: who is the mayor of London? Sadiq Khan. Or: who is the author of Le Petit Prince? Let's see: Antoine de Saint-Exupéry. Or you can also ask for paintings by Monet, for example.
16:23
You will get a collection of paintings by Monet. This nicely illustrates that structured knowledge gives us a lot of power to answer questions and to go a step beyond what question answering systems can currently do,
16:45
because question answering systems like Google Now or Apple's Siri basically work with relatively rigid templates. The type of question is often predefined, and they have thousands of people who create these templates.
17:02
Here it is more open: you can ask all kinds of questions, and everything contained in the knowledge graph can be answered, while in these commercial QA systems it basically has to be predefined what kinds of questions can be answered. On the other hand, you can also ask weird questions and get weird answers.
17:25
I think this is one of the reasons why Siri and Alexa, for example, don't allow all kinds of questions but adhere to these templates: the system has to be child-proof, for example, and satisfy certain safety constraints.
17:43
Now, how is this related to scholarly communication? I would like to come to the topic of how we can apply this at TIB, for libraries and digital libraries, to organize information flows in the digital era for us scientists and scholars.
18:02
If we look at examples of how publishing traditionally works: this was maybe one of the reasons why the wall in East Germany came down, because the East Germans all wanted to have these nice Otto and Quelle mail-order catalogs, where you had product descriptions, prices,
18:22
and also identifiers, numbers to identify the products. Twenty years ago you also had a lot of maps, street maps basically; there was a whole industry publishing these street maps, and you had to buy a new one every year.
18:40
Or maybe you remember the time when you had phone books to look up phone numbers. These were all industries with thousands of publishing houses worldwide, often billion-dollar industries, but they all disappeared, and things work completely differently nowadays. If you look at e-commerce, we don't use PDF versions of mail-order catalogs;
19:05
we use new applications like Amazon, eBay or Idealo, where you can really organize and drill down into the information in a completely new way. That is very different from just a PDF, a digitized version of a mail-order catalog.
19:23
The same for street maps: we don't use PDF versions of street maps, but navigation systems, which allow us to zoom in and out, personalize maps and locate information on them. It is a completely new way of interacting with this information.
19:42
What we can see is that this world of publishing and information exchange has profoundly changed; in many of these verticals and domains new possibilities were developed, like this zooming and dynamics, and the business models changed completely.
20:00
There is much more focus on data, on interlinking, on services and search in the data. The integration of crowdsourcing also plays a very important role: on Amazon, for example, you have lots of reviews crowdsourced from people who bought the products, and in Google Maps you have reviews of the businesses on the map.
20:24
So integrating this information, and also user contributions, is a very important aspect. How about scholarly communication, that is, how we researchers publish, share and exchange information? How does it look there? This was one of the first publications, from 1665: the Philosophical Transactions of the Royal Society, one of the first journals, in the 17th century.
20:44
The Philosophical Transactions of the Royal Society, one of the first journal publications in the 17th century. In the 19th century, publications looked like that here. In the 1970s, for example, here this famous paper about relational databases
21:03
which looked like this; digital typesetting was already in use. Nowadays we use PDF: we publish open access and share PDF documents on the internet, but PDF is only partially machine readable, doesn't preserve much of the structure,
21:23
and doesn't allow embedding of semantics or interactivity. It compares a bit to digitizing a mail-order catalog as a PDF and sending it around by email or putting it on a website, or digitizing a street map as a PDF. That is how we represent information in science nowadays.
21:46
Those other domains were completely disrupted, while in science we use, I think, very antiquated ways of knowledge sharing and information exchange. And there are a large number of issues and problems there.
22:02
We don't use the potential of digital collaboration in science. We also have a monopolization by commercial actors, which exploit their lock-in effects. We have a reproducibility crisis, a proliferation of publications and deficiencies in peer review.
22:21
Let me go into a bit more detail on the proliferation of publications. You can see here that in the decade from 2004 to 2014, the number of publications in science and technology (the areas we take care of at TIB) almost doubled, and it has probably continued to grow.
22:41
Not so much because we in Germany publish much more; we publish a bit more, but it didn't double in Germany or in the US, it only slightly increased. But it tripled in China, for example, and quadrupled in India. And countries like Brazil or Russia are now also entering this scientific publication market,
23:02
following in our footsteps and publishing large numbers of documents. With regard to reproducibility, there was a survey in Nature, in which 70% of scientists reported having failed
23:21
to reproduce another scientist's experiments, and 50% had failed to reproduce their own experiments. I think this even happens to us: maybe after a few years the Java runtime environment is no longer available, and we cannot run our own experiments anymore. This of course differs a bit between scientific areas and domains,
23:45
but I think it is a major issue, because reproducibility is one of the cornerstones of science. This results in a lot of duplication and inefficiency. Also, terminology is not clearly defined.
24:03
Every paper, of course, defines its terminology, but if you take 10 or 15 papers, even in the same area, addressing the same research problem, there are very subtle differences in the terminology, and that makes it extremely difficult to compare and integrate, if these problems, approaches, methods and characteristics
24:23
are not properly defined. Imagine how engineering or building construction would work if we could not identify the parts exactly and put them together precisely. Unfortunately, my impression is that in science we often have this situation:
24:41
the different bits and pieces don't fit together very well, and it takes enormous effort to make them fit later on. So we have a lack of transparency, with information hidden in text, and a lack of integratability: different research results do not fit together very well.
25:02
There is almost no machine assistance. We may use Google Scholar or our TIB portal, but these full-text search methods do not work very well in supporting our scientists. Beyond metadata, we don't have much identifiability.
25:22
Collaboration is a major issue, and so is getting an overview: it takes years for a scientist to get an overview of a certain research field. I want to illustrate that. If you search, for example, in our TIB portal for CRISPR, which is a genome editing method,
25:40
you find 9,000 results; we don't have that many because it is a biochemical method, and we don't focus much on bio-methods, only a bit on chemistry. If you do the same search on Google Scholar, you get 238,000 results. Now imagine you are interested in the precision,
26:03
the safety or the cost of the method, or you want to know what the specifics of genome editing are when you apply it to insects, or who has applied it to butterflies. You are basically lost in this haystack of publications. So now, after this depressing message:
26:23
how can we fix it? There was already a vision, in 1945, by Vannevar Bush, of organizing information in a kind of memex, as he described it at that time. And I would say that for its time, it must have sounded quite esoteric.
26:41
The idea was that as a researcher you have a desk, and on top of the desk a tablet, which shows you at your fingertips the information you are interested in for your research, magically provided by some apparatus below the desk.
27:00
At that time, 1945, it was really science fiction. Today, my impression is that we really have the opportunity to realize something like this and create this memex. We have started working in this direction here at TIB, using this knowledge graph approach
27:22
to identify overarching concepts in scientific publications, like research problems, definitions, research approaches and methods; artifacts, where we have not only publications but also data, software, images, audio, video, knowledge graphs and ontologies;
27:42
and then very domain-specific concepts: in mathematics, for example, definitions, theorems and proofs; in physics, experiments, data and models; in chemistry, substances, structures, reactions, and so on. We need to go much deeper into the publications
28:01
and identify and link these concepts, not just on the level of publications but on the level of the concepts themselves, and clarify them in such a knowledge graph. I want to show you how this can work. Take, for example, this publication, "Practical Guide to CRISPR-Cas9 Genome Editing in Lepidoptera";
28:23
Lepidoptera is the Latin name for butterflies. We can now represent not only the bibliographic metadata, like the author and the title of the document, but also information about the research problem addressed by this publication, about the methods
28:42
which are applied, the species they are applied to, or the experimental data, and represent it all in a kind of knowledge graph, linking these concepts with each other and establishing relationships between the different entities.
29:01
When we do this for many publications, reusing the identifiers and linking between the publications, we can then indeed query them and get an overview of the state of the art, for example. In order to do that, we need to lift knowledge graphs
29:20
to a more cognitive level, where they can represent uncertainty, disagreement, varying semantic granularity, and the evolution and provenance of information, while at the same time staying flexible and simple. In the ScienceGRAPH project,
29:41
an ERC Consolidator project which started in May, we want to follow the notion of knowledge molecules, where the contributions of a research artifact are captured in such a molecule: a relatively compact, simple, but still structured unit of knowledge,
30:00
which can then be incrementally enriched, annotated and interlinked. Knowledge graphs today describe real-world, atomic entities; we add and delete facts, enrich facts and collaborate by enriching facts. In the future, we need to make knowledge graphs more cognitive,
30:23
in the sense that the base entities are not only real-world entities but also conceptual ideas, as is often the case in research. Research is of course linked to the real world, but often it is also about ideas, algorithms and methods we develop
30:40
and conceive intellectually, which we interlink and annotate. Of course, we do not always agree; we have disagreement, we review and provide different assessments of contributions, and this should be captured in such knowledge graphs. There may also be a drift of these concepts over time,
31:00
and varying aggregation levels, and semantics emerges in a way from the collaboration of different scientists. As a result, the goal here too is to identify and provide information, for example using the question answering approach
31:21
I showed earlier, applied to such cognitive knowledge graphs capturing scientific knowledge. For example, we can answer the question "How do different genome editing techniques compare?" by parsing the question, discovering named entities, relationships and links between them,
31:43
then translating it into a formal query, here in SPARQL, a query language for knowledge graphs much inspired by SQL, constructing the query and then rendering the results to the user. Here, too, we apply the pipeline approach with different components,
32:02
because named entity recognition, for example, is something which works relatively well in specific domains, but it doesn't work well in open or generic domains. For example, for biomedical concepts, we can achieve quite good precision and recall, but for other domains, we need different approaches.
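[For concreteness, this is roughly what such a generated formal query looks like when executed against the public DBpedia SPARQL endpoint. A sketch using the SPARQLWrapper package; real systems generate the query automatically from the parsed question.]

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?director WHERE {
        <http://dbpedia.org/resource/A_Clockwork_Orange_(film)>
            <http://dbpedia.org/ontology/director> ?director .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["director"]["value"])  # expected: the URI for Stanley Kubrick
```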
32:20
So we need to plug in different systems and mechanisms for constructing such queries, for example specifically for questions in particular domains, and in the end provide an overview of the results to researchers.
32:40
For example, here is a comparison of different genome editing methods: their specificity, safety, ease of use and cost. This is a mock-up, not reality, but I will now show you an example of a prototype we started to develop last year; we have a first, early alpha version.
33:03
You can actually go to labs.tib.eu/orkg. ORKG stands for Open Research Knowledge Graph, and that is a service we want to develop over the next years here in Hanover, which implements and realizes exactly this vision
33:20
of structuring information for scholarly communication in such a knowledge graph. How does it work? The idea is that you describe research publications. Maybe in the medium to far future we won't even need publications anymore and will just describe our research findings in the knowledge graph; for the short term, I think we still need publications,
33:43
and we will still have them. So you start with a publication, and if you have a DOI, you can add it there. You can describe it or link it to the research field your publication is related to.
34:00
Then, based on a Crossref query, for example, you automatically get the bibliographic metadata if the publication is already indexed in some bibliographic database. And finally, most importantly, you describe the content of the publication in a semantic way. This is of course a very early prototype.
34:24
Our PhD student Allard Oelen, who is also here in the auditorium, developed this interface, but we have to experiment much more in the coming months and years. The idea is then to describe, for example, that this publication addresses the problem of sorting algorithms,
34:41
and then describe the approach: this is a merge sort algorithm, implemented in C++, with a stable implementation, a best-case complexity and a worst-case complexity; and you can add arbitrary additional properties and reuse properties from other publications that already tackled the same problem.
35:03
Once your publication is described, we can also run similarity computations, for example, and show an overview of the state of the art of work addressing a certain research problem, here for example sorting algorithms.
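[A sketch of this workflow in Python: the metadata lookup uses the real public Crossref API at api.crossref.org, while the contribution description schema below is invented for illustration and echoes the talk's merge-sort toy example. It is not the actual ORKG API.]

```python
import requests

def fetch_metadata(doi: str) -> dict:
    """Fetch bibliographic metadata for a DOI from the public Crossref API."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    msg = resp.json()["message"]
    return {
        "title": msg["title"][0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in msg.get("author", [])],
    }

# A real DOI (Codd's 1970 relational model paper), used here only to show
# the metadata lookup; the contribution below is the talk's toy example.
paper = fetch_metadata("10.1145/362384.362685")
paper["research_field"] = "Computer Science"
paper["contribution"] = {  # invented description schema
    "addresses_problem": "sorting algorithms",
    "approach": "merge sort",
    "implementation_language": "C++",
    "stable": True,
    "worst_case_complexity": "O(n log n)",
}
print(paper["title"], "->", paper["contribution"]["approach"])
```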
35:22
This is, of course, a classical example, a toy example for us computer scientists, and I will show you a somewhat more complicated one in a minute. But this maybe gives you an intuition of what we want to achieve in the future: that this works for arbitrary research problems, that you can quickly get an overview
35:41
of what the state of the art is in a certain field, that we represent information about the approaches there, and that we follow a crowdsourcing approach: researchers add this information themselves, we librarians at TIB or the library community contribute and help curate it, and of course
36:01
we also involve machine learning and automatic techniques that already suggest information, so that you don't have to fill out the forms completely manually. That is something we will work on in the coming years. In order to do that, behind this user interface,
36:21
we extended the RDF data model a bit. You see here this subject-predicate-object, or resource-predicate-resource, representation, so the base data model is still RDF, but we basically need a lot of additional metadata, because every statement can be debatable:
36:41
researchers can agree or disagree, so we need to attach further information to these statements. And that is what you can do: you can attach metadata to each individual resource, to each predicate, but also to the statements: arbitrary metadata. Always important, of course, is when the statement was made and by whom,
37:02
and to have an identifier for the statement; but you could also attach, for example, who agrees and who disagrees with the statement, and in that way also represent a kind of scientific debate in the data model.
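[One way to sketch such statement-level metadata in plain RDF is standard reification, making the statement itself a resource one can talk about. The vocabulary here is made up; the actual ORKG implementation uses a property graph, as described next.]

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")  # illustrative vocabulary
g = Graph()

# the statement itself
g.add((EX.MergeSort, EX.hasWorstCaseComplexity, Literal("O(n log n)")))

# a reified copy, so the statement becomes something we can describe
stmt = EX.statement1
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.MergeSort))
g.add((stmt, RDF.predicate, EX.hasWorstCaseComplexity))
g.add((stmt, RDF.object, Literal("O(n log n)")))

# statement-level metadata: provenance and (dis)agreement
g.add((stmt, EX.statedBy, EX.researcher42))
g.add((stmt, EX.statedOn, Literal("2019-05-20", datatype=XSD.date)))
g.add((EX.researcher17, EX.disagreesWith, stmt))
```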
37:20
We also developed, of course, an application with different layers, where you have a graph database (we use a property graph database, because it allows us to attach this kind of metadata) and then an API, which allows us to provide and develop different user interfaces
37:41
on top of the application; we are now experimenting with, expanding and extending this. Finally, I would like to show you a more complex example, where we try to lower the barrier for scientists even further and also increase the possibility of making things reproducible.
38:04
This is work by Markus Stocker, one of the postdocs in my group, who worked on making statistical analyses reproducible. In this paper, for example, a statistical hypothesis test is included,
38:22
which is represented using the statistical methods ontology, and we can then also store it in the Open Research Knowledge Graph. If you read the paper, there is a sentence here which basically says (I am translating it as a layperson now)
38:41
that patients who have heart failure often also suffer from iron deficiency, that is, their blood is not capable of binding enough iron. This is statistically validated, and there are some plots integrated here which illustrate that and basically provide the evidence.
39:02
But in papers these plots are often printed very small and difficult to read, and five or seven years later you cannot easily get the data back to verify or falsify the result. What increasingly many researchers are using are, for example, Jupyter notebooks, though you could also integrate this with SPSS or R
39:22
or other environments. What Markus developed here is an approach which stores the results of your Jupyter notebook in the Open Research Knowledge Graph. You don't actually have to leave your environment: while you perform a computation in your Jupyter notebook, the computation results or the data
39:42
are added to the knowledge graph at the same time. Once you then describe your paper, for which you again have a DOI, and add it to the Open Research Knowledge Graph, stating that it addresses iron deficiency
40:01
in heart failure patients, you can add a research contribution for this research problem, and we can then link it to the data captured from the Jupyter notebook; in this case, the statistically significant hypothesis test.
40:20
So we basically link the data generated by the Jupyter notebook to the contribution in this research publication. After linking it, we can browse the paper again, and we have here a rich description of the contribution data.
40:41
We have this statistically significant hypothesis test: it is a p-value computation, which includes a two-sample t-test with unequal variance, has three input variables, and we can then go into more detail and look at the particular values.
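[A sketch of the idea, assuming Python with scipy and rdflib; the measurements and URIs are made up. It runs the two-sample t-test with unequal variance (Welch's test) and records the outcome as triples that could be pushed to the knowledge graph.]

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD
from scipy import stats

EX = Namespace("http://example.org/")  # illustrative vocabulary

heart_failure = [105.0, 98.5, 110.2, 95.1, 101.7]  # made-up measurements
control       = [120.3, 118.9, 125.4, 119.8, 122.1]

# Welch's test: two-sample t-test without assuming equal variances
t_stat, p_value = stats.ttest_ind(heart_failure, control, equal_var=False)

g = Graph()
test = EX.hypothesisTest1
g.add((test, EX.testType, Literal("two-sample t-test, unequal variance")))
g.add((test, EX.tStatistic, Literal(float(t_stat), datatype=XSD.double)))
g.add((test, EX.pValue, Literal(float(p_value), datatype=XSD.double)))

print(g.serialize(format="turtle"))
```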
41:00
We are of course still experimenting. This is a generic user interface for looking into the graph; in the future we also want to develop more tailored ones. If you have such a hypothesis test, you could visualize it in a different way and make it more intuitive. You can basically browse and zoom into the graph
41:20
and look at all the values, down to the measurement values you measured and added to your Jupyter notebook. Here, for example, 105; and this numeric value corresponds exactly to one point in a plot, in a diagram, in your publication.
41:43
This also illustrates the resulting knowledge graph. We have the paper here as one entity, the authors linked to the paper, but then also the research result, and then a description of this research result,
42:00
with the statistical computations basically represented as a knowledge graph. That makes it really reproducible, so that at a later stage you can also compare or aggregate information from different scientific papers, which today requires a lot of cumbersome manual effort,
42:22
sometimes even emailing the authors to try to get the data. We hope that in the future this helps make research much more reproducible and verifiable, and represents information in a more structured, systematic way.
42:41
This brings me to the end of my overview talk. Of course, I didn't manage to present many other things we work on. There are a number of other projects we do at the faculty and L3S or at TIB, like SlideWiki or Inclusive OpenCourseWare,
43:01
a crowdsourcing wiki environment for open educational resources; or big data for factories, where we use these knowledge graph, vocabulary and semantic knowledge representation techniques to integrate data in manufacturing processes, in a large European consortium
43:22
with many companies; or nursing AI, coordinated by a postdoc in my group, Gábor Kismihók, which is about assessing the effect of AI on nursing in the future; or projects like iASiS and BigMedilytics,
43:43
which Maria-Esther Vidal, for example, takes care of and which apply these methods to medical data analysis and integration; or environmental data analysis with Henry Fair, where we are involved in the community. So this is to give an overview.
44:03
And of course, many people have been involved since I started almost two years ago. We have a sizable team, and many of them are here in the audience, like my postdocs Markus Stocker and Gábor Kismihók, but also Javad Chamanara, Jennifer D'Souza, and the PhD students and software developers
44:23
who work on these applications, and collaborators like Maria-Esther Vidal or the colleagues from Leipzig, from InfAI, who are also here today and also take part in the ScienceGRAPH project. So this is the team responsible for and supporting this research,
44:43
since we cover quite a wide variety of topics. We also have the ambition, of course, to bring these knowledge graph techniques to many different areas, ideally to all areas where TIB is active, like engineering, science and technology. I think we are still at the beginning there,
45:01
and there's a lot more work to do in the future. Thank you very much for your attention.