The DataCite PID graph

Formal Metadata

Title
The DataCite PID graph
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
This webinar explores the DataCite PID Graph, a model to describe resources identified by PIDs, and the connections between them - developed by DataCite in the context of Project FREYA.
Transcript: English (auto-generated)
Hello everyone, thanks for joining today's webinar. Before we start, I just wanted to mention that you're all muted because we expect many people on the line today. We do of course encourage you to ask questions, and you can do that using the Q&A functionality. You should be able to see
a Q&A button at the bottom of your screen. We will monitor the questions that come in. We can answer some already during the webinar, and we'll save the others for when the speakers are done. We will also be sharing a recording of the webinar afterwards. So today we're talking
about the DataCite PID graph, which we developed as part of the European project FREYA. If we move to the next slide, you can see the agenda for today. First, Robin will tell you a bit more about the concept behind the PID graph, what the PID graph actually is.
Then Martin will talk about how we built the DataCite PID graph, and Kristian will tell you a bit more about the kinds of questions you can ask the PID graph. Then, as I said, we'll have time for a Q&A session. So Robin, over to you. Okay, great, thanks. So yes, my part
of this is to introduce what the PID graph is anyway, because this is kind of a tricky concept we've taken from one of our European Union funded projects, the FREYA project, and it is causing some confusion for people. So we're going to talk a little bit about what the PID graph actually is. The approach that we use for this in FREYA
is thinking about research as already being a graph. Researchers, institutions, publications, data sets, all that stuff are already connected just by virtue of being created by different people and being related to different people, and all of those entities and the relationships between them already form a kind of conceptual graph of the connected research
landscape. What we want to do through FREYA and through these persistent identifiers is to make that conceptual graph into something that's actually traversable by humans and machines. To do this, we consider persistent identifiers, or PIDs, to be the backbone
of connected research. Having unique PIDs for researchers and their outputs is crucial to connecting pieces of the research landscape together. When we say that it's crucial, what we really mean is that unique PIDs, and especially, in the case of the connections, the metadata that backs them, are what helps enable machines
to make these connections for us. It's much easier to point to specific things when they're all uniquely identified, and having this sort of standardized metadata, with these unique IDs within the metadata, makes it a lot easier for us all to point to each other's stuff. So we think that PIDs already have the potential to enable this kind of connected research graph
that we're talking about, but we're not yet taking full advantage of their connecting powers the way that we could be, and that's what we envisioned for the PID graph. The idea behind the PID graph is that we can link these different PIDs together
through relations in their metadata, so when one PID says it's related to another PID through some kind of related identifier scheme, we can link all these together to enable the discovery of connections that are at least two hops away, we might say. Hopefully the diagram helps explain this a little bit. The idea currently is that between item A and item B
we might already know the relation between the two of them from their metadata. For instance, if A was an author, we might know that they wrote a particular publication by virtue of that being in that author's ORCID record. So from the metadata that is included with that PID we already know the relationship between A and B, and we might have something
similar happening for B and C. Again, if B is a publication and C is a data set, we might know from the metadata of either the publication or the data set that B and C are also related. So there are certain things we already know just by virtue of the metadata that a PID contains or that a PID is related to, but what we're really looking forward to doing with the PID graph is
being able to figure out that relationship between A and C. Knowing that A authored B and B was based on data set C, we would then be able to say that author A is in some way related to data set C, and it's those connections two hops away that we're really focusing on right now with the PID graph. So I'll tell you a little bit about the PID graph concept. As Helena
mentioned, this is part of our work on an EU-funded project called Project FREYA, and the idea here is that the various FREYA partners, of which DataCite is one, will implement services that enable their own local PID graph for the particular PIDs that they hold and the PIDs that they have access to. The idea would be that looking up a single PID
in those services should return the graph for that PID, all the things that it is related to. Then, through the infrastructural partners like DataCite, the people who make the infrastructure that helps some of this stuff run, and through the magic of these related identifiers that are in the metadata, we can help bring all of these local graphs together,
and then the idea would be that external users can tap into that PID graph to use in their own applications, to make their own implementations of the PID graph, and so on. So that's the general concept that we're working with here, this system of multiple PID graphs all coming together. I think we should really emphasize at this
point that the PID graph really is a framework. At this stage of the game, the PID graph is just a framework for connecting the PIDs. By the end of FREYA we will not have a single standalone entity or web service that is the sole PID graph. We're not looking at this point at creating
a giant all-encompassing PID graph where someone can just go to a website, push a button and get a giant graph. Again, we're working from that federated concept that I described earlier. So the FREYA team will be producing APIs, and we'll be producing documentation for this,
examples of how you can use the PID graph (we're doing some cool stuff with Jupyter notebooks in that regard), and then we'll also be doing our own implementations based on this framework that we're working on. By the end of FREYA, developers at other institutions will then have what they need to implement their own PID graph services to take this work forward.
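To make the two-hop idea from the diagram concrete, here is a minimal Python sketch. It assumes that each known relation is stored as a link between exactly two PIDs. The identifiers, relation names and function names are invented for illustration and are not taken from any DataCite or FREYA service.

```python
from collections import defaultdict

# Hypothetical links, each connecting exactly two PIDs (all values are made up).
LINKS = [
    ("orcid:A", "authored", "doi:B"),
    ("doi:B", "is_based_on", "doi:C"),
    ("doi:B", "cites", "doi:D"),
]


def build_adjacency(links):
    """Index links in both directions so the graph can be walked from either end."""
    adjacency = defaultdict(set)
    for source, _relation, target in links:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def two_hops_away(adjacency, start):
    """Return PIDs reachable via one intermediate PID that are not direct neighbours of start."""
    direct = adjacency[start]
    reachable = set()
    for neighbour in direct:
        reachable |= adjacency[neighbour]
    return reachable - direct - {start}


adjacency = build_adjacency(LINKS)
print(sorted(two_hops_away(adjacency, "orcid:A")))
```

In this toy graph, author A has no direct link to data set C, but the lookup finds C through publication B, which is the kind of connection the PID graph is meant to surface.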
So with that, let's talk a little bit about DataCite's implementation of the PID graph, and that's really what the rest of the session today is going to be about. We are building a GraphQL API specifically suited for querying the PID graph, and we're using the power of our metadata, plus connections to Crossref and ORCID as some of the other
large infrastructure providers, to present information on the connections our users have indicated to us in the related identifiers of their DOIs. By using this GraphQL API, anybody can then consume this body of information that we are making available through the power of connected metadata. So then the question we often get from people at this stage,
when we're talking about the PID graph and explaining the concept and what it means, is: how can I get my stuff in the PID graph? Again, to emphasize, at the end of FREYA we will not have one sole gigantic single PID graph, but there are some ways you can get your stuff into DataCite's PID graph. So for DataCite members,
if you're already creating DOIs via DataCite, any of the metadata you provide to us that is findable, your public metadata, will be exposed by our implementation of the PID graph. That's ultimately the goal here. We're currently working on this with the GraphQL API, but eventually the idea is to expose all the metadata you provide to us. This means you should remember
to include related identifiers, because we can't make a graph if there are no connections between things. If you want your stuff to show that it's connected to all this different stuff, if you want to show that your data sets are connected to the researchers at your institutions or to different publications and so on, then you need to make sure to include
those related identifiers that you know about in the metadata that you submit to DataCite, because otherwise we won't know about them and we won't be able to serve up these connections to people when they search for your stuff. Then, through DataCite's collaboration with these FREYA partners and others like Crossref and ORCID, your related identifiers will be connected with other things like Crossref IDs and
sorry, ORCID iDs and Crossref DOIs, so we'll be able to connect all these things together. Now, if you're not a member of DataCite, for now you'll still need to have some technical know-how to be able to construct your own PID graph implementation, but a lot of the stuff we do at
DataCite is open source, so you'll be able to check out what we've done if you want to see a similar approach to use for yourself. You can check out our GraphQL implementation at the link that I've provided. This is rolled up into our broader all-in-one API, but you can see what we're doing
here at that link. And then, as I alluded to previously, we're also working on making a nice pile of example Jupyter notebooks that you can use to navigate the PID graph and to explore some of the questions that you can answer with the PID graph, which is some of the stuff that Martin and Kristian are going to talk about next. So with that I will turn it
over to Martin to talk about how we built the PID graph, and I'll stop sharing my part so you can take that over. Thank you, Robin. So I will now talk about our initial work, how we built the PID graph, and I hope you can all see my slides as well. It's not full screen, Martin.
Oh yeah, of course. Okay, so that's better. So the first question is what you put in the PID graph, what kind of resources. Just to get you started, I'm sharing this word cloud that we generated at the last Research Data
Alliance plenary in April, on what kind of things you would like to see connected in the PID graph. An audience of about 35 people in total gave these answers, and there are some obvious things in there: research data, people, and software, which was very high on the list, but also
other things. So you can clearly see that the PID graph can very easily go in all kinds of directions and become very complicated. You also see things on there where PIDs are still at an early stage, for example instruments,
and the metadata linking these PIDs to other things, so this is probably more for the future. If you think about what we can achieve now, then a good starting point is what Research Graph, a closely related initiative that started out in the Research Data Alliance, is using for their graph,
and they're focusing on researchers, research data, publications and grants. We fully agree that research data sits at the center. You see similar pictures where, depending on where it comes from, the researcher or the publication sits in the center. There are a few things that
are close enough that there is common interest in adding them, and where the PIDs and the metadata are evolved enough, which could for example be software, organizations and funders. But this is a good starting point, and Research Graph has been doing this for
several years, and that covers a lot of territory. What you need is resources like data sets and publications that come with PIDs and with metadata. The metadata fall into two categories. One is describing metadata: this is the name of the researcher or the title of the publication.
The other is linking metadata, which is usually two PIDs linked together: for example, a link to an author via his or her ORCID identifier, citations via their identifiers, funding, etc.
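The two categories of metadata can be illustrated with a short Python sketch. It assumes a simplified, hypothetical record layout: the field names, relation names and identifiers below are invented for illustration and are not the DataCite metadata schema. The sketch separates the describing metadata from the linking metadata and turns the latter into links that each connect exactly two PIDs.

```python
# Hypothetical record; field names and identifiers are illustrative only.
RECORD = {
    "pid": "doi:10.1234/dataset-1",
    "title": "Example dataset",
    "creators": [
        {"name": "Researcher One", "orcid": "orcid:0000-0000-0000-0001"},
        {"name": "Researcher Two", "orcid": None},  # no ORCID iD available
    ],
    "citations": ["doi:10.1234/paper-1", "doi:10.1234/paper-2"],
    "funding": ["grant:example-1"],
}


def split_metadata(record):
    """Separate describing metadata from linking metadata.

    Returns a dict of describing metadata and a list of
    (source PID, relation, target PID) tuples, one per connection.
    """
    describing = {
        "pid": record["pid"],
        "title": record["title"],
        "creator_names": [creator["name"] for creator in record["creators"]],
    }
    links = []
    for creator in record["creators"]:
        if creator["orcid"]:  # only creators with a PID can be linked
            links.append((record["pid"], "has_author", creator["orcid"]))
    for citing_pid in record["citations"]:
        links.append((record["pid"], "is_cited_by", citing_pid))
    for grant_pid in record["funding"]:
        links.append((record["pid"], "is_funded_by", grant_pid))
    return describing, links


describing, links = split_metadata(RECORD)
for link in links:
    print(link)
```

Note that the creator without an ORCID iD still appears in the describing metadata but produces no link, since a connection needs a PID on both ends.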
So that's a very generic concept. What you need is PIDs that have rich metadata to support this. Some PIDs can only be at the receiving end. For example, the Crossref Funder ID is a persistent identifier, but the system itself doesn't store any links to other PIDs; it's only
linked to, if you will. The next step then is to take the linking metadata and look at them separately, process them, etc. You end up with all the linking metadata from one data set, which for example might be 10 citations, two funding references and three
authors with their identifiers, and you basically want to break this down into atomic pieces and put it all together, so that at the end you have resources described by
PIDs and describing metadata, and you have the connections between two PIDs, which may also come with some metadata, so that the kind of relation is described, but they are basically connections between two PIDs. I think you know where this is going: you then have different systems for this. At DataCite, for example, we have APIs that describe
DOIs and DOI metadata, and we have a system called Event Data that just focuses on these links, on these connections. This is exactly what, for example, the RDA Scholix framework is also doing: for these connections, it's much easier to work with
them if you just make them atomic and have connections between two PIDs at the simplest level. We have been at this point already for a few years with things like
Event Data and Scholix, but what we're trying now with the PID graph is bringing this together again. We decided, and that was only a few months ago, that we think GraphQL is the right technology to do this, because it allows you to bring all these connections and resources together
and allows sophisticated queries. Sophisticated means, for example, that you can query not only one place: you can query all the data sets for a particular keyword, look at all the linked publications, do a query there again, etc. With GraphQL you can also decide what you want
to show. I see that with my example I didn't do a good job, because in the results on the right side, for the first data sets returned by this query out of the six million data sets we currently have in our system,
the first few authors didn't have an ORCID iD, which is not unusual. It's something that's not so widespread yet; unfortunately, not every author has an ORCID iD in the metadata. But you get the point of how this works, and GraphQL is really the key technology for enabling queries, for
bringing the pieces back together. This is a technology which is quite popular and open source, and there are libraries in every language you might be using, but in the scholarly community it is not yet so widely used. We think that's a mistake, and we try to push this technology
because we think it is a perfect fit for what we are trying to do with the PID graph: everything is a graph, and every resource or node in the graph can be described by a globally unique identifier, which is of course the PID. There's a standardized
query interface: no matter how you access the PID graph, what you are interested in or where you come from, it's always the same, and that makes it much easier to build client applications. Also very important: this is a query interface. The backend services are exactly
as before. We didn't have to build new databases, search indexes or APIs in the background; those can be reused. It's really the query interface that has changed. Here you see the API endpoint, which we made available two months ago, and we call it a pre-release because
it is still changing. We're adding more things, we're enabling more kinds of queries, etc., but it's totally fine to use it, and you will find millions of resources in there. What is also important is that GraphQL supports federation, so that the GraphQL API
endpoint that we provide at DataCite not only queries DataCite but also re3data, ROR, the Crossref Funder Registry and, partially, ORCID. Partially means that if you have an ORCID identifier, it will return information about that person. Integrating the various PID services that sit at
different places together via GraphQL, and also having deeper integrations that allow more sophisticated queries, is something that we're working on in the FREYA project. GraphQL is,
for somebody who's used to working with APIs, relatively straightforward, but it still takes some work to explore the PID graph. To make this easier, we started to provide example Jupyter notebooks to work with this API. This is an example using R, and R has a GraphQL library
that we have included here. The same is true for other languages that are commonly used in Jupyter notebooks, like Python. So there are standard libraries, which means the queries are always the same, and you can really focus on writing notebooks that answer your questions
instead of spending a lot of time building a technology solution to work with the GraphQL APIs that we use for the PID graph. This is my final slide, just to give you an example of the
data sets, publications and researchers connected to a particular grant, which is the FREYA grant, showing, as Robin showed earlier, that the connections in the graph can be quite complex and confusing even for such a small number of resources, but also giving you much more information
about interesting relationships in this graph. Visualization of the PID graph using Jupyter notebooks is a typical use case, something you can do now, and you can go deep to answer all kinds
of questions, which is exactly what we will talk about in the last section, and Kristian will take you through this work. At the moment, as Martin mentioned, we are currently releasing the GraphQL API for
the PID graph as a pre-release version. With this API being the primary access point for the PID graph at the moment, you can expect that there will be some functionality now, but we will be adding further functionality in the near future, both as part of
development work and as part of the FREYA project. But now I want to show you what you can ask at the moment, and I'm going to talk about four specific questions that you can put to the PID graph. Hopefully that will give you a grasp of the type of questions that you can
ask now and the type of questions you can expect to ask in the near future. These questions are based around specific resources, things that I guess everybody in the audience is familiar with: data centers, funding grants, researchers and scientific software. So now I will go to the next slide, please. I think, Martin, you
are in control. Yeah, thank you. So the first question is about data centers. Data centers are in essence a collection of data sets that they register DOI metadata for, but in many
cases they are completely separate from the entity that collects all the citation data. These data centers are identified with an identifier, and those identifiers are connected in a certain way to the data sets, and those data sets have related identifiers
connected to the other items making these citations. So in the PID graph you can ask questions such as: how many citations does a data center have? Next, please. When you ask that question, you will get a list with all the citations and all the items from that data center. Here I'm taking the example of the London School
of Economics and bringing all that information in. This is a common response that you will get when you ask a question using the GraphQL API of the PID graph. I don't want you to worry too much about what this JSON response is, but it's a very standardized response, and really
we think that it's really clear, and it's really easy to transform this into something more legible. As Martin said, we are using this technology because we think it will be really easy for anybody else to implement their own systems and create applications around it. So if I transform this into a table,
and Martin will help me with that, it will look something like this. I know it's massive, but this is practically the same information in a table. It may not be the best example, but I hope it illustrates what we got when we asked that question. At the top you can see
all the information that, in this case, the London School of Economics already knows about the data center: that they have 433 publications, which you see in the top rows, and a list of all the publications they have, as you see in the left columns of the screen. You will also get who created these publications, and their ORCID iDs
for those that have them, and as you know not everybody has one. But the important part is the column on the far right, which is about all the information that comes from related identifiers. This is information that, in many cases and in this specific example,
the London School of Economics data center or repository does not have: it has no knowledge about these citations. This is the stuff that the PID graph will bring in when you ask questions such as this one. The next question that I want to talk about is about funding grants.
This is incredibly important: funders would like to be able to get an indication of the impact of the grants, as well as of the specific individuals that benefit from these grants, and by virtue of using persistent identifiers we get the possibility to ask questions about them. So one could ask questions such as: what is the
PID graph for a specific grant? If you go to the next slide: here I'm going to take a specific grant-funded project, the FREYA project, which is the one mentioned earlier by Robin, in which DataCite is working, and you can ask: what is the
PID graph for all the publications of the FREYA project? What you will get is all the items that are related to that grant. Again you get a response that is very standard compared with the other one. Let's go to the next slide.
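As a rough illustration of turning such a standardized JSON response into a table, here is a minimal Python sketch. It assumes a response in the usual GraphQL shape, with the result nested under a top-level "data" key. The field names and values inside it are hypothetical and are not taken from the DataCite GraphQL schema, which should be consulted for the real field names.

```python
import json

# Hypothetical response in the usual GraphQL shape ({"data": {...}}).
# Field names and values are invented for illustration.
SAMPLE_RESPONSE = json.loads("""
{
  "data": {
    "funder": {
      "name": "Example Funder",
      "publications": {
        "totalCount": 2,
        "nodes": [
          {
            "id": "doi:10.1234/paper-1",
            "title": "First paper",
            "relatedIdentifiers": [
              {"relationType": "Cites", "relatedIdentifier": "doi:10.1234/dataset-1"}
            ]
          },
          {
            "id": "doi:10.1234/paper-2",
            "title": "Second paper",
            "relatedIdentifiers": []
          }
        ]
      }
    }
  }
}
""")


def to_rows(response):
    """Flatten the nested response into one table row per publication."""
    nodes = response["data"]["funder"]["publications"]["nodes"]
    rows = []
    for node in nodes:
        related = [
            f'{item["relationType"]} {item["relatedIdentifier"]}'
            for item in node["relatedIdentifiers"]
        ]
        rows.append(
            {"id": node["id"], "title": node["title"], "related": "; ".join(related)}
        )
    return rows


for row in to_rows(SAMPLE_RESPONSE):
    print(row["id"], "|", row["title"], "|", row["related"])
```

Each row keeps the related identifiers in their own column, which corresponds to the far-right column of the tables shown in the talk.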
And this is actually something that Martin showed earlier: you can take that same response and transform it into this graph, which represents all the items connected to that specific funder grant, which is the yellow dot in the middle, and it's connected to all these other items, being researchers,
publications and funders. You can get that graph precisely from that response from the GraphQL API of the PID graph. But I want to show you this again in a less abstract way, just in a table, and this will be the kind of list that you will get
from that one. Here I separated it into two massive tables. The table at the top shows how many data sets are related to this grant from the project, and the table at the bottom shows how many publications are related to it. You will see the number two at the very top of the first table, and in the middle, for the second one, you will get 20, so 20
publications, all of them related to this grant from the project. You will also get practically all the organizations for those that I have included, together with all the related identifiers that these publications
have, and these related identifiers are actually many of the nodes that are in that graph. Next, please. Okay, so the third question that I think you can ask is about researchers. Researchers obviously would like to
know how their data and publications are accepted, and administrators would like to know which ones have more citations or not. For researchers, we know we have ORCID iDs, and we can start asking questions about researchers by using their ORCID iDs, such as: how many citations does a researcher have? Again, when you ask a question like
that, you will get a standard response. Here I'm using Henry Rzepa, who I think is in the audience, and I'm getting everything that is connected to him, and the items that have cited or are related to the items he has created. If I transform that again into a
table, just to give you an idea of how this looks, you will get the main information about the researcher at the very top, with a total count of all the data sets and publications that they have, together with all the information, again on the far right, that you
would not expect the repository to have. I know that Henry works for Imperial College, so we are pretty sure that most of the information on the left of this table will be something that this data center or repository already knows, but the citations are information that they might not know, and that is what the PID graph
is bringing to the table and making accessible. The last case that I want to show you is about software, and for this one I will jump immediately to the example and not show the JSON response, and just talk about how
you can ask questions when you use the PID Graph such as: what software on the subject of global warming is out there, and who has funded it? You will again get a JSON response that you can transform into a table or put into, say, your repository or your application and list row by row. Every hit of that response will give you an item that is a piece
of software, and the relation that it has with any funding body. I think you can see one here, probably in the fourth row, which is a piece of software that is used for global warming studies and that is actually funded by the National
Science Foundation. This is the kind of question that you can ask thanks to the power that we have with the PID connections: by making queries over those connections you can answer questions such as this. I think we can go to the next one, but I think this is the last one. Yeah, so those are the four cases
that I wanted to show you today: data centers, grants, researchers and software. There is a lot of work that we are still doing on the PID Graph and the GraphQL API, and I think Elena will probably have more comments about this. Thank you. Okay, great, thanks everyone.
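Editorial illustration: the step described in the use cases above, taking a JSON response and turning it into a table, might look roughly like the following Python sketch. The response shape and every field name in it (`person`, `works`, `citationCount`, and so on) are assumptions made for this example, not the actual schema of the DataCite GraphQL API, and the identifiers are placeholders.

```python
import json

# Hypothetical response: the structure and field names are assumptions for
# illustration only, not the real schema of the DataCite GraphQL API.
SAMPLE_RESPONSE = json.loads("""
{
  "data": {
    "person": {
      "id": "https://orcid.org/0000-0000-0000-0000",
      "name": "Example Researcher",
      "works": [
        {"id": "https://doi.org/10.1234/example-dataset",
         "type": "Dataset",
         "repository": "Example Repository",
         "citationCount": 3},
        {"id": "https://doi.org/10.1234/example-paper",
         "type": "Publication",
         "repository": "Example Journal",
         "citationCount": 7}
      ]
    }
  }
}
""")


def response_to_rows(response):
    """Flatten the nested response into one table row per work."""
    person = response["data"]["person"]
    rows = []
    for work in person["works"]:
        rows.append({
            "researcher": person["name"],
            "work": work["id"],
            "type": work["type"],
            "repository": work["repository"],
            "citations": work["citationCount"],
        })
    return rows


def total_citations(rows):
    """Sum the citation counts over all rows of the table."""
    return sum(row["citations"] for row in rows)


rows = response_to_rows(SAMPLE_RESPONSE)
for row in rows:
    # One tab-separated line per work, like a row in the tables shown.
    print(row["type"], row["work"], row["citations"], sep="\t")
print("total citations:", total_citations(rows))
```

The left-hand columns (work, type, repository) correspond to what a repository typically already knows, while the citation column stands in for the information that comes from following connections in the graph.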
So there are some questions coming in, and we'll take some time to answer those. If you have other questions, feel free to share them via the Q&A or in the chat if that's easier. Here is one question that has already been answered, according to a second message from the
same person, but let me ask it anyway for the benefit of others: in the PID Graph, how does one distinguish funders from grants? Who wants to take that? This is Martin, I can take that, and
you have sort of answered it yourself: in the dataset metadata we have the distinction between the funder and the grant, and you can obviously have multiple grants and also multiple funders for the same dataset or item. I want to raise one challenge here
which is typical for the PID Graph: the funder identifier is pretty standardized, and we all know how to use it in dataset metadata, Crossref metadata and OpenAIRE
metadata. But grant identifiers are still newer, partly because it's more complicated, and there's ongoing work on standardized grant identifiers, etc. That just shows a limitation of PIDs and PID graphs in general: there are always things that are more established, where
the PID is available, metadata that contains links is available, etc., and other things that are still evolving. The organization identifier is another example: you will find hardly any links for ROR identifiers, because that's a new identifier. So that's just, for things like
grants, something to keep in mind. And I see there are more questions. Yeah, so I see a very interesting one in the Q&A: the number of PIDs we will need for this PID Graph to work is virtually unlimited. Is this sustainable? Who will create and, more importantly, maintain all these PIDs so that
they do not get duplicated? Thinking not just of the PIDs we're starting to envision, like grants or org IDs, but also of PIDs for research equipment or for patents, for instance. So I guess, Martin, you might want to take that as well. Again a very good question. I think,
yeah, there are different ways you can answer that. One of them would be that the PID Graph tries to describe what exists, and obviously there are things that are easier to describe. For example, take PIDs for research data repositories: there are only two and a half thousand,
so that's very easy. Research organizations might be 100,000, and samples and datasets might be in the millions or billions, so that's just a fundamental problem of these things. I would not think that we would create extra PIDs for the PID Graph,
but this limits what we can do. Something that we didn't say specifically in this webinar, but that is very important for us, is that in the FREYA project we are building production-grade infrastructure,
so everything we are building can scale to millions and many millions of things. That's one reason we picked GraphQL as a technology, because that's obviously a challenge when you scale this up. On the other hand, it's also a question of what kind of graph is
really relevant to you. Since I know from what direction you're coming, I would say that the graph for an institution might be very big, but it is still much more focused than a much larger graph that goes across a whole discipline, for example. So I think doing something that is as focused as possible, and then seeing where you can scale up and what
the trade-off is if you add lots more things (does it slow you down or complicate the graph?), is very important. There's also the point that some PIDs are very good at making your graph distinctive and other PIDs create a lot of noise, for example PIDs for funders and
institutions, which potentially connect everything to everything, whereas grants and researchers, for example, are much more specific and make it easier to see the specific connections. Okay, thanks. The next question that came in: maybe I missed it, but how far along is the
integration of ROR IDs as PIDs for organizations? So maybe, Robin, you want to answer that? Sure. So the ROR registry exists, it is live, and it is able to be used. In
terms of integrating with DataCite services and DataCite metadata, currently, as part of the Metadata Working Group, we are looking at including ROR IDs, that is, including an affiliation
identifier in the metadata that we could put a ROR ID in. While the registry is live, there is currently no place in the DataCite metadata schema to put an affiliation identifier: you're able to specify an affiliation for a creator, but there's no actual field for an identifier for an affiliation. So we're working on that, and it should be out shortly,
in the next minor schema version that's coming. Once that's in place, you'll be able to use the ROR ID for its primary intended purpose of identifying affiliations. It is currently possible to use it as a name identifier for an organizational creator; that's
something that you could do, but it's not really the same as using it as an affiliation identifier, so we're working on that piece. So the registry exists; we've just got to incorporate it with the DataCite stuff. I also saw a question in the chat: in the future, how might one attach context to a data citation? Was the data reused or not, was the data replicated,
etc.? Kristian, maybe you want to answer that? Well, it's true that at the moment we have not looked into adding
that to the schema, or to how the events are created. It's certainly an interesting question, and definitely something we will probably have to look at as part of future projects, to take that into account. But I don't think that I have a satisfactory answer for that one at the moment.
I would add to that that in our presentations we have very much focused on the resources, or the nodes, and what they are connected with, but we haven't really talked much about the fact that the connections, for example between a dataset and a publication, can also have meaning. So
DataCite and others like it have a relation type, obviously there's a date when the connection was made, and also somebody who made it. So there's a lot of context in these connections that you can also expose with GraphQL, but we haven't really focused
much on that, to say, for example, I only want to see citations of a certain kind, etc. Yeah, and maybe I can add that in our webinar last month we talked about the work
we're doing around looking at views, downloads and citations of datasets and making that available; that webinar is also available through our YouTube channel. A couple more questions have come in. Zenodo uses something called a concept DOI, a mini PID graph if you will:
how does that relate to PID graphs, or is it simply a staging post towards that aim? I guess, Martin, you might want to... Yeah, I would say, some of us in the FREYA project had an in-person workshop last year to get this graph work started, and we collected use cases and tried to summarize them,
and one very important one that came out of this is versioning. Versioning of datasets and software is obviously very common and can get very complicated, and there are some issues that are currently not fully addressed. What Zenodo is doing is basically
versioning of DOIs to make it easier, for example if somebody cites a specific version, to aggregate all the citations across the different versions. That's what this concept identifier is good for. There is also the fear that if you create datasets
that are too granular, which is good for specificity, it becomes harder to aggregate, for example, all the citations that a data repository gets. With the PID Graph you can address these questions; you just have much more flexibility in how you
aggregate things, and Kristian showed an example of that for a repository. Okay, great. Someone said "a great idea", so thanks. Will the PID Graph support
analytical queries as well, such as: give me the researchers sorted descending by the number of citations to their publications, datasets and software produced within the last five years? Now, who wants to answer that? Well, if that's a use case, we would definitely be looking to
implement that, I guess. Yeah, so I'm not sure how many people on this call are familiar with GraphQL, because at the end of the day this is a normal API that you see in many places for all kinds of services. So the short answer is: basically everything
you can do with a REST API you can also do with GraphQL, with some things being easier and others more difficult. So it's not just connecting PIDs together; for example, we already support standard queries, and we just saw an example with a search on climate. Sorting,
pagination, all these things are of course also part of GraphQL, and that allows you to build all kinds of sophisticated queries. But as Kristian said, that's probably not something for the very near future, because it goes deeper
and deeper. But it's definitely possible, and it's something we are willing to do if there's enough demand for it. Okay, we also have this question: central registries for different types of globally unique PIDs seem to be at the core
of the graph being workable. Which ones are missing? I don't know, Martin, if you want to take that as well?
Yeah, so one that's currently missing is querying for Crossref DOIs, and obviously that's a very large number and very interesting content. But of course Crossref is another partner in FREYA, so this is something that will happen.
I think the bigger challenge is if you have the same data type in many different places. For example, for identifiers for funders, organizations and people there might be more than one option, but it's relatively straightforward. But if you look for publications and datasets,
you have to look in so many different places that I think that's a really hard challenge, one that we haven't really started to address. It's not just combining the datasets from one API with the publications from another API; of course there are many publications with
DataCite DOIs, there are close to two million datasets with Crossref DOIs, and how you bring all this together is not trivial. On the specific question regarding Crossref DOIs: in the FREYA project we have this month started working on a sort of common DOI search,
which will make these kinds of things very feasible. Okay, final question of the day: at the end of the project, will this give universities an overview of all their research outputs? Robin, do you want to answer that? Not exactly. I mean,
some of this could be possible, but at the end of the FREYA project we will not have a service where you as a university could, you know, go to a website, put in your name and see
everything related to you, not at this point in time. We are making the infrastructure that will enable that kind of thing, but at the end of FREYA it will not yet exist. It is definitely something that we're interested in looking at
in general, at least for DataCite members in the future, so that they are able to see information about themselves and what's happening with their DOIs. That kind of thing is on our radar to plan for later, but within the context of FREYA there are no magic buttons or anything as of yet; it's just the infrastructure piece that we're solving. I would like to add something
to that: I fully agree with what Robin said. I see this mostly as complementary, making it easier for institutions to enhance the information they already have, and
maybe, for example, they have some pieces, such as publications, datasets or ORCID iDs for people, that can be used as starting points to find more connections. That is of course basically what CRIS systems are doing, and what commercial vendors charge money for. It is also what Research Graph has been doing for several years, as they have something called their
Augment API, and that is something similar to what we are envisioning here: if you have 100 ORCID iDs for researchers in your institution, you have a lot of information already, but we might find additional things that you don't know, in particular if you follow the graph over
more than one connection. So that's definitely not the magic bullet, but it certainly should improve the information that is available to institutions. Depending on how they do this now, this might be a big win or a small win: if they have a CRIS system that's fully established, it's probably a smaller win. But fundamentally it's always possible that
additional information is found via this technology. GraphQL has another interesting feature which we didn't talk about, and which is a little bit further in the future, and that's subscriptions: you can take these queries and then get notifications whenever
something new has appeared, and that of course makes it much easier to keep track of the things that are of interest in your graph. Okay, I'd like to leave it at this for today.
So thanks everyone for joining. As I said, the recording will be made available via our YouTube channel, and we hope to see you again next time.
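Editorial illustration: the point made in the final answer, that starting from identifiers an institution already holds and following the graph over more than one connection can surface additional items, can be sketched as a bounded breadth-first traversal. The toy graph below is invented for this example (the identifiers are placeholders, and the links are treated as directed for simplicity); it does not use or represent the DataCite API.

```python
from collections import deque

# Hypothetical toy graph: every identifier and link here is made up.
LINKS = {
    "orcid:researcher-a": ["doi:dataset-1", "doi:paper-1"],
    "doi:dataset-1": ["doi:paper-2"],
    "doi:paper-1": ["grant:g-1"],
    "doi:paper-2": ["orcid:researcher-b"],
    "grant:g-1": [],
    "orcid:researcher-b": [],
}


def reachable(links, seeds, max_hops):
    """Return {identifier: hop count} for all nodes within max_hops of the seeds."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in links.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# What is known from direct connections versus what a second hop adds.
one_hop = reachable(LINKS, ["orcid:researcher-a"], max_hops=1)
two_hops = reachable(LINKS, ["orcid:researcher-a"], max_hops=2)
new_at_two = sorted(set(two_hops) - set(one_hop))
print(new_at_two)
```

Raising `max_hops` widens the view but also pulls in more loosely related nodes, which mirrors the trade-off between focus and scale discussed earlier in the Q&A.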