
After Make Data Count: the building blocks of data metrics


Formal Metadata

Title: After Make Data Count: the building blocks of data metrics
Number of Parts: 5
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
During the last 2 years a lot of work took place within the Make Data Count project to take the first steps towards data metrics. Now that the MDC project has come to an end and services are transitioning to DataCite, we would like to show you what infrastructure DataCite has available as a result of the project. We will demo how you can contribute usage stats and citations and show the different ways in which we're sharing these numbers with the community. There will also be plenty of time to discuss next steps.
Transcript (English, auto-generated)
So in the webinar today, we'll be talking about the building blocks for data metrics. And if we look at today's agenda, first I'll spend five minutes talking about why this is important,
why we need to think about making data count, and then Christian will take over and talk about the Make Data Count project and what DataCite built within that project. And Robin will give a summary of what that actually means for you, what we now have available
that you can use, and Martin will tell you where we go from here, so what the next steps are, and of course we really welcome your input. And then, as I said, we'll have some time for Q&A. So in case you're wondering why we're talking about Make Data Count all the time,
that is because Make Data Count was a project that set out to build the social and technical infrastructure necessary to start developing data metrics. It was funded by the Sloan Foundation and ran from May 2017 until now, and that's also the reason why we're doing this webinar now, because I think some of you
have been wondering what will happen after Make Data Count, and I really want to assure you that we're continuing that work. All three partners, DataCite, CDL, and DataONE, are very committed to continuing the work that we started within Make Data Count, and DataCite will also continue to host the infrastructure, and that's why we wanted
to show you today what that infrastructure is and what you can work on with us. So why do we need to make data count? Why is it so important to be thinking about this? And I think there are many images and figures I could be showing here, but actually yesterday
I saw a presentation about the European Open Science Monitor, and that's why I picked this image, because they did a survey and that showed that most researchers think it's very important to have access to research data and that it would benefit their research.
That's the first bar you see there. But then they also say that sharing research data is not really associated with credit or reward in their field, so only around 40% think there's credit or reward associated with sharing data. And we can't really start addressing that unless there are some indicators for the impact of data.
And before Make Data Count, there was another project called Making Data Count that had already started doing work on this, looking at what indicators we could use. And what came out of that was that citations, data citations, were seen as most important,
but other options would be downloads or links or views of landing pages. And now you may think, but sometimes I already come across views and downloads and citations, so don't we already have some of that? And I think what's different here from existing efforts
is that these were often individual efforts, and the numbers you see don't mean the same thing. And if you're comparing apples to oranges, you can't really start using those numbers. And that's why it's very important that we develop a standardized way to start looking at these things so that we can compare and we can start to assign meaning to these numbers.
And so that's what we wanted to do within the Make Data Count project. So this is basically the structure of the Make Data Count project. And we realized that data citations were a really important part, but some work had already been done on that in the context of the Research Data Alliance,
so we decided to leverage existing initiatives and incorporate that into the hub we were working on. But for usage metrics, views and downloads, there wasn't any kind of standard. So we developed a new recommendation together with Project COUNTER so that repositories could feed standardized information about views and downloads into this hub.
And that would then give us a way to bring all of that together, of course, in an open way, so that everyone can then extract that information and start displaying it. And so today we want to show you what that means, what we built, and what we have available.
So Christian will now tell you more about that. Thank you, Helena. I think you can see my screen.
So yeah, particularly in this part of the presentation, I would like to talk about what we have actually been building to make data count. And I'm going to start not with infrastructure, but with something that we helped to bring about and were actually driving to a certain degree. And that's, as Helena put it, the COUNTER Code of Practice, which is this
standard for how this usage data should be processed and how it should be reported. We actually created this standard together with the partners in the Make Data Count project, co-authored in close collaboration with COUNTER.
One year ago, almost exactly, research data was not yet well supported within the COUNTER standard. But after we did the first draft of this standard, we moved towards the goal of actually
having a standard for how to count, report, and process usage data. Earlier this year, we released release 1 of this Code of Practice for Research Data, and you can find it on the COUNTER website. The reason for this is, well, as Helena put it, we need a standard way to actually count usage for data.
And counting it for data is different from counting it for text documents and other types of resources that COUNTER had already standardized. The differences are mostly because we have different use cases when it comes to data. There is, for example, no need to track access by institution, as most research data is openly available.
We also have differences in granularity: datasets frequently include individual files, and sometimes they are aggregated and can be merged and split. We also have differences in versions. Research data frequently has many versions,
which is something that you don't have with documents, for example, or publications. And probably one of the most important differences is that we have non-human users of data. Sometimes there are scripts or automated tools that frequently use and fetch research data. And previous COUNTER specifications would actually filter those out and not count them.
But for research data, we have to make an exception for those. The COUNTER Code of Practice for Research Data can roughly be divided into two parts. The first part is about processing the usage data, and this comes in the shape of logs.
In this part of the code of practice you will find sections that tell you what's the minimum information your logs need to have in order to actually process them and extract this usage information. It will also tell you how to filter different things. For example, what should you do when you have double clicks by a user in your repository, how should that be counted and how should that be dealt with.
There are also filters for robots, meaning non-human users: some that you would like to keep and some that you want to actually exclude. And it tells you how to go about differentiating those,
and it provides a list for this. It also helps you deal with other variables like volume. This is about the size of the data that you have in your repository, and the difference between downloading a dataset that is one kilobyte versus a dataset that is one megabyte or one gigabyte.
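To make the log-processing side a little more concrete, here is a minimal sketch of the kind of double-click filtering described above. The 30-second window, the record layout, and the function name are assumptions for illustration only; the Code of Practice itself defines the exact rules.

```python
from datetime import datetime, timedelta

# Hypothetical log records: (session id, dataset DOI, timestamp).
# The 30-second double-click window is an assumption for illustration;
# the COUNTER Code of Practice for Research Data defines the exact rule.
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)

def filter_double_clicks(events):
    """Keep only one request per (session, DOI) within the window."""
    events = sorted(events, key=lambda e: e[2])
    last_seen = {}  # (session_id, doi) -> timestamp of last counted event
    counted = []
    for session_id, doi, ts in events:
        key = (session_id, doi)
        if key in last_seen and ts - last_seen[key] < DOUBLE_CLICK_WINDOW:
            continue  # treat as a double click, do not count again
        last_seen[key] = ts
        counted.append((session_id, doi, ts))
    return counted

raw = [
    ("abc", "10.1234/example", datetime(2019, 5, 1, 10, 0, 0)),
    ("abc", "10.1234/example", datetime(2019, 5, 1, 10, 0, 5)),   # double click
    ("abc", "10.1234/example", datetime(2019, 5, 1, 10, 5, 0)),   # counted again
]
print(len(filter_double_clicks(raw)))  # -> 2
```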
The second part of the code of practice actually deals with the reporting. I'm going to use the term sharing for these usage metrics; for the sake of this presentation I will say sharing, but it's also about reporting.
And this part of the code of practice helps you to identify the different metric types, which are divided into two big categories. Translated into human language, that is roughly: what is a view of a research dataset, and what is a download of a research dataset,
and what's the difference between those two. Also, what are the different access methods to those datasets, and how do you differentiate those when you are counting and aggregating. There is also the concept of sessions. This is for how long you would count, for example, a user to be
working within the same session, and say: well, this single user downloaded this dataset X amount of times, but we'll count it only one time and not more. And that's part of the specification of the code of practice. One final part tells you the format and the protocol that you should be using to actually share this usage data.
This specification is called SUSHI, which stands for Standardized Usage Statistics Harvesting Initiative. And this is the protocol and the format that we use to exchange usage data about research data in a standard way.
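As a rough illustration of that reporting format, here is a sketch of what a dataset usage report might look like. The field names, metric-type values, and release label below are approximations based on the Code of Practice and its report schema, not the authoritative format.

```python
# Approximate shape of a COUNTER-style dataset usage report (illustrative only;
# consult the Code of Practice for Research Data for the authoritative schema).
report = {
    "report-header": {
        "report-name": "dataset report",
        "release": "rd1",  # assumed label for the research data release
        "created": "2019-06-01",
        "created-by": "Example Repository",
        "reporting-period": {"begin-date": "2019-05-01", "end-date": "2019-05-31"},
    },
    "report-datasets": [
        {
            "dataset-id": [{"type": "doi", "value": "10.1234/example"}],
            "dataset-title": "Example dataset",
            "performance": [
                {
                    "period": {"begin-date": "2019-05-01", "end-date": "2019-05-31"},
                    "instance": [
                        # "investigations" roughly correspond to views,
                        # "requests" to downloads
                        {"metric-type": "total-dataset-investigations", "count": 42},
                        {"metric-type": "unique-dataset-requests", "count": 7},
                    ],
                }
            ],
        }
    ],
}
```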
Once you have read the whole COUNTER Code of Practice, you are practically ready to implement it. And here is where the part about infrastructure comes in.
There are, I would say, four steps in the infrastructure to make data count. The first one is processing usage logs, which is mostly the bulk of the code of practice and its implementation. The second part is about sharing those usage logs and that usage data, and that's the second part of the code of practice that I showed you.
And there are two additional parts, which are about consuming that usage data and about displaying it. And we have been doing work and building infrastructure in each of those areas. So what I'm going to do next is show you the different types of infrastructure we have built, to give you an insight into this.
The first part is about usage-processing infrastructure. And the first thing that I want to say here is not so much that we have built an application for doing this, but that there are already six institutions, six repositories, well, six organizations that have implemented the code of practice
and built an implementation to process usage logs according to it. That is Dryad, Dataverse, Zenodo, DataONE, CDL and DataCite. All of them have implemented a piece of software that actually processes logs.
They are slightly different implementations, but each one of them follows the code of practice, which is what every organization, every repository out there is expected to do: take the code of practice and implement it.
I can tell you that Dryad and CDL have a very similar implementation; there is an open-source implementation that Dataverse and Zenodo have used to create theirs. DataCite has also created an implementation of the log processing, and we will make it available soon for everybody to use as an open-source solution.
The next thing that we built in terms of infrastructure is for sharing usage reports. And here, I'm pretty sure that probably many of you that work in data repositories have seen this cartoon.
And practically, I think here DataCite is flipping the table on data repositories. What you have probably asked many times of all your users, to share their data, is what we now ask of you: it's your turn to share your usage data with us. I guess DataCite is now the data repository for usage data. And to help you with that, we have created an API that lives in DataCite, in the DataCite Event Data service.
This API is used to share usage reports. Any repository out there which has DOIs for its resources can actually share reports
through this API, and we will be accepting them and making them available to everybody. So once you have processed your usage and you have formatted it according to the code of practice, you can practically use this API to share it. It doesn't matter how big your report is; we have made capabilities available so
that you can send massive usage reports and we can process them and make them available. We also have plans in the near future to accept not only the standard format that we're using in the Make Data Count project, but also other COUNTER-compliant formats.
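Here is a hedged sketch of what submitting such a report might look like. The endpoint path, the bearer-token authentication, and the response shape are assumptions based on the public documentation for the usage reports API; follow DataCite's usage reports guide for the exact details.

```python
import json
import requests

# Assumptions for illustration: the endpoint path and the token-based
# authentication may differ; see DataCite's usage reports documentation.
API_URL = "https://api.datacite.org/reports"
TOKEN = "YOUR_REPOSITORY_TOKEN"  # hypothetical credential

def submit_usage_report(report: dict) -> str:
    """Send a COUNTER-style usage report and return its id (assumed response shape)."""
    response = requests.post(
        API_URL,
        data=json.dumps(report),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {TOKEN}",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["report"]["id"]

# report_id = submit_usage_report(report)  # e.g. the report sketched earlier
```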
And so this repository of usage data keeps growing and is available to everybody, so we can all share this usage data and make it comparable. In the next step of building this infrastructure, I want to talk about consuming usage.
I just want to talk a little bit about the service that we actually use for consuming, and that's the Event Data service from DataCite and Crossref. This is practically the main component of this infrastructure.
Event Data is a service between DataCite and Crossref that provides connections between persistent identifiers and other resources. It was actually built with a focus on social media and data citations. But in the Make Data Count project, with all this effort that we are doing towards sharing usage, we
are using Event Data as the main place to allow repositories and anybody out there to consume usage data. Having this in a centralized way, with all the usage there, actually helps us to eliminate silos and improve the
information flow, and it reduces complexity as well, because it provides a single place to retrieve usage and citations, not only usage. It also eliminates a lot of work in the repository, and I will show you some of the features that we have there that will help you to consume usage in a better way.
So every time that you share data usage reports with DataCite, we push all that usage to the Event Data service. And we make it available for everybody to access in the same way that you would access citations from that service.
We have also coupled that with a few very useful aggregations. You can aggregate all usage and citations by researcher from there, and also by repository or data center. And actually, I want to show you an example of how that would look. One interesting example, and I'm not going to show all three of them, but I want to show you how
a researcher, one who would probably like to know how their data is shared and accessed, can retrieve this. In the world of PIDs, we identify researchers with ORCID iDs, and those are also part of the Event Data service.
We're using a variety of methods to get ORCID iDs into Event Data. These are connected to all the datasets of the researcher and, in turn, to all the data usage reports. And therefore, you can query things like: how many citations and how much usage do my datasets have?
And here I put an example of Julia Davis from the University of California, San Francisco, and how she, with only her ORCID iD, can go to the Event Data service and obtain the links to all her datasets, and in turn get all the usage from those reports that you have submitted as a repository and extract all the usage from those as well.
In a similar fashion, you can do this for the other use cases I mentioned, but I'm not going to go into those; there is tons of documentation on our website and you can see how to use the Event Data service to consume this in a useful way in your repository.
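As a rough illustration of that kind of consumption, here is a sketch of querying the /events endpoint of the DataCite REST API for the usage events attached to a list of dataset DOIs and summing them per metric type. The query parameter names (doi, source-id, page[size]) and the response layout are assumptions based on the Event Data documentation and may need adjusting.

```python
import requests

EVENTS_URL = "https://api.datacite.org/events"

def usage_for_dois(dois):
    """Sum usage event totals for a set of dataset DOIs (illustrative sketch).

    The 'doi' and 'source-id' parameters and the response layout are
    assumptions based on the Event Data documentation.
    """
    totals = {}
    for doi in dois:
        resp = requests.get(
            EVENTS_URL,
            params={"doi": doi, "source-id": "datacite-usage", "page[size]": 100},
            timeout=30,
        )
        resp.raise_for_status()
        for event in resp.json().get("data", []):
            attrs = event["attributes"]
            # relation-type-id distinguishes e.g. investigations (views)
            # from requests (downloads)
            key = attrs.get("relation-type-id", "unknown")
            totals[key] = totals.get(key, 0) + attrs.get("total", 1)
    return totals

# e.g. usage_for_dois(["10.1234/example-dataset"])
```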
The last piece of infrastructure that I want to talk about is displaying usage data. We are using Event Data for this as well, to extract all the data, and we are putting it in a few places. First of all, we are displaying this in DataCite Search.
We use terms such as views and downloads, which is a more human-readable vocabulary than the language in the code of practice; we find it more accessible. If you go to Search, you will see that some of the resources
have views and downloads there; that is because they're coming directly from repositories that have been sharing that data. We not only show the counts, but also the distribution of how usage has been coming in and being reported over time.
We also have plans to display citations here, and to display DOI resolution counts, which is something we will be working on in the near future. Now, not everybody displays views and downloads in this way. There are many other ways, and I want to show you a few other implementations and how they are using Event Data and all the services, all the infrastructure, we have provided.
Here, first of all, I have the Dash repository. This is from the University of California. And you will see on the right side how they have a box for metrics, specifically views and downloads.
Another example would be DataONE. They aggregate usage by something that they call member nodes, which form a big network of repositories.
And they make a visualization of how many downloads and views these member nodes have over a period of time, using exactly the same services. Everything is processed according to the code of practice and displayed in a different way.
And the last example I want to show you is Zenodo. They are displaying views and downloads directly in all their interfaces. So if today you go to Zenodo, to any of the resources that are out there, you can probably see those. And those views and downloads are actually processed according to the code of practice.
So this is some of the stuff that we have been building. I think in the next section we are going to discuss a little bit about what you can do next. And I think I'm going to hand over to Robin for that. Robin? Yes. Thank you. Let's see. I think you're clicking through for us, right? Yes. Okay. Good. Okay.
So, yes. Helena's told you a little bit about the Make Data Count project and why this was important. And Christian's told you a little bit about what DataCite and the other Make Data Count partners have done so far. And so the question then for you guys is, you know, what do I do now? You might be wondering. So we're going to talk a little bit about the actual concrete steps you can take to start making data count at your institution,
using some of the resources and services that we have come up with. So next slide, please. Okay. So let's say you want to contribute usage stats. There are a couple of steps to this. So step one, as Christian alluded to, is that you should process your logs according to the COUNTER Code of Practice.
There's a link here to the COUNTER Code of Practice for Research Data specifically, at the Project COUNTER website. There are also codes of practice for things other than research data, but Make Data Count was specifically concerned with this particular code of practice. And as Christian mentioned, you can process your logs in a number of ways,
technically, as long as they meet the requirements that are put out by the COUNTER Code of Practice. If you need some help with that, or need to see an example of a tool that might help process these statistics, then you can look at the tool that CDL has developed. And there's a link there for the GitHub repo where that lives.
And technically, you could just stop here if you wanted and display these things in your own repository. But step two, which we, of course, would be interested in you doing, is submitting your logs to DataCite using our usage reports API. This way they can go to a central place, and this enables some of the more interesting
queries that Christian mentioned, where, for instance, a researcher could look themselves up in Event Data and then see their usage from various repositories included along with their citations, this kind of thing. So we like the idea of having this stuff in a central place. And there's a link here to tell you how to use our usage reports API to submit those logs to us.
And we have a full guide at the link on the very bottom that tells you a little bit more about contributing usage stats and consuming the usage stats that come out. OK, next slide. So let's say you want to contribute citations. As far as DataCite is concerned, if you include related identifiers in the DOI metadata that you submit to us,
and I have an example here of how you might do that for a particular related identifier within the metadata blob that you send us (there is also a sketch below), then we will include this information in the Event Data service that Christian mentioned. So as long as you're putting in related identifiers with their relation types,
then we will send that information to Event Data and it will be there to be retrieved. You can't yet add related identifiers in Fabrica, that is, through our form; this is something that will come soon. We are going to eventually make it so you can add all of the relevant fields via the Fabrica form, and related identifiers will be one of them.
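To make the related identifier part concrete, here is a hedged sketch of what such a related identifier might look like in the JSON attributes sent to the DataCite REST API when creating or updating a DOI. The DOI values are placeholders, and the exact attribute names should be checked against the metadata schema and REST API documentation.

```python
# Illustrative fragment of DOI metadata with a related identifier
# (placeholder DOIs; attribute names per the DataCite REST API / schema docs).
doi_attributes = {
    "doi": "10.1234/example-dataset",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.5678/example-article",
            "relatedIdentifierType": "DOI",
            # relationType is one of the values listed in the metadata schema
            "relationType": "IsCitedBy",
        }
    ],
}
```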
Adding related identifiers via the form has been a big request that people have sent us. There is a long list of different relation types that you can submit when you submit related identifiers, and this is all listed out in our schema documentation. You can find the latest version of that at schema.datacite.org, and it will tell you all the different available relations and give you
some examples and explanation of what those different relation types actually mean. OK, so that was a little bit about contributing these things. Let's say that you now want to display them in some way. So in this case, if you want to display views and downloads, step one, of course, is retrieving the usage information that you actually want to get.
And as mentioned, you can retrieve usage events from DataCite's Event Data API. This is just part of our own REST API with its own separate endpoint, /events. And we have a guide that will explain to you what the different available categories of usage events are according to the COUNTER Code of Practice and let you know how you can actually query for those.
And so once you retrieve the information that you want, then step two, of course, is displaying that relevant usage information on your website. As Christian mentioned in his part of the presentation, DataCite uses the more user-friendly terms views and downloads, as do the other Make Data Count partners and most of the people that we've seen. So you can see what that looks like in DataCite
Search, or you can look at the examples that Christian provided from some of the other partners and the people who've done this stuff. We also have, in our support documents, a full use case for CDL's experience of implementing the COUNTER Code of Practice, and you can see their rationale and a bit more discussion about how they decided to do certain things that they chose to do.
And OK, next slide, please. So let's say you want to display citations. Again, this follows a very similar pattern to the usage statistics. You can retrieve the citation or other relation information that you want from the Event Data API again, which is part of our REST API with its own endpoint. This is where all of your related identifier information goes.
We call these linking events, the ones that link particular DataCite DOIs to other things like Crossref DOIs, other DataCite DOIs, this kind of stuff. And so the Event Data guide that we have will explain what the different available categories of linking events are.
But again, these just come from the relation types that are in our DataCite metadata schema, so you can also view the full schema documentation to see what those might be, to see what you want to retrieve. And then, of course, again, step two is to display the relevant information on your website; there's a sketch of such a query below.
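As an illustration, here is a sketch of querying the same /events endpoint for linking events around a single DOI, filtered by relation type. The parameter names (doi, relation-type-id) and the chosen relation-type values are assumptions based on the Event Data documentation and the metadata schema; adjust them to whatever your repository treats as a citation.

```python
import requests

EVENTS_URL = "https://api.datacite.org/events"

def citation_events(doi, relation_types=("is-cited-by", "is-referenced-by")):
    """Fetch linking events for one DOI, keeping only 'citation-like' relations.

    Parameter names and relation-type values are assumptions based on the
    Event Data documentation; adjust to the current API if needed.
    """
    events = []
    for relation in relation_types:
        resp = requests.get(
            EVENTS_URL,
            params={"doi": doi, "relation-type-id": relation, "page[size]": 100},
            timeout=30,
        )
        resp.raise_for_status()
        events.extend(resp.json().get("data", []))
    return events

# e.g. len(citation_events("10.1234/example-dataset"))
```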
One note of caution with this, with the citation information and the relation information that we have, is that there are multiple relation types that you can use to describe a related identifier. And depending on kind of what you're talking about or how your particular data repository operates, there could be multiple relation types that could describe what you might consider to be a citation.
And we do leave the choice of relation type up to the submitting repositories. So you may need to have a bit of caution when you count, to see what other people include and what kind of numbers you're getting from this. Because, for instance, A can reference B, A can also cite B, and there are supplements and all kinds of other things.
So the world of data citation is a little bit muddy in terms of what it actually means. And so you'll want to explore some of those relation types and see what it is that you're actually trying to describe when you're putting out this kind of information. And step three, for bonus points, is that we have just very recently released our GraphQL API as a pre-release version.
This means it's not quite ready for prime time just yet. But you are able to try it out and see what it does for you and give that a shot. But it will be changing over the course of time. It's sort of a beta, not quite released version just as of yet. But this way you can have a little more fun with data relations.
The GraphQL stuff is not a part of the Make Data Count work; it's not something we did as part of that project. It's part of another project we're in called FREYA. But this is something that could be interesting for other people, to see what other kinds of relations are available. Because, again, this information comes from the kind of metadata you would submit, like these related identifiers.
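For that bonus step, here is a rough sketch of what a query against the pre-release GraphQL API might look like. The endpoint and the field names (citationCount, viewCount, downloadCount) are assumptions based on the documentation at the time, and since the API was still changing they may well differ.

```python
import requests

GRAPHQL_URL = "https://api.datacite.org/graphql"  # pre-release endpoint (assumed)

# Field names below are assumptions; the pre-release schema may differ.
QUERY = """
query($id: ID!) {
  dataset(id: $id) {
    titles { title }
    citationCount
    viewCount
    downloadCount
  }
}
"""

def dataset_metrics(doi):
    """Fetch citation and usage counts for one dataset DOI via GraphQL."""
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"id": doi}},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["dataset"]

# e.g. dataset_metrics("10.1234/example-dataset")
```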
Okay, and that's all for my part. So now I'll turn it over to Martin to talk about the future and where DataCite goes from here. Thank you, Robin. My name is Martin Fenner. I'm the DataCite technical director. In the next 15 minutes or so, I will talk about what is left to do.
So, yeah, thanks for advancing the slides for me. Like in every project, including Make Data Count, we did a lot of great things. But, of course, there's still work left. And I listed some of the things here on this slide, and I will go into more detail on most of these things.
So, the first one is really that data repositories have started to do the log processing and send us the stats. But, as Christian said, or showed, this is at a very early stage. There are only a few data repositories doing that,
and I will go into more detail in a moment. We should think about how we can increase adoption of usage stats, how we can make it easier for data repositories to implement all these things that we talked about if you're not in a grant-funded project with extra resources, et cetera. From the citation perspective, what is left to do is mostly for publishers.
So, when you look at the data citation information that's sent by data repositories and publishers, you see that the data repositories that are members of DataCite and work with DataCite DOIs are, for the most part, doing a great job.
We have about a million data citations linking to publications from Crossref members. In the other direction, the numbers are much, much smaller. So, there's obviously work to do there. And that's something that I will not go into in more detail because it's a little bit out of scope for DataCite, but suffice it
to say that we are working very closely with Crossref on this, in particular Helena, and I'm happy to go into more detail in the discussion. For DataCite, there is additional work to do finding data citations in other places. There's additional work to do with regards to aggregation, and Christian touched on this already.
Also, we have talked about citations and usage statistics in this presentation, and that was the work we were focusing on in the Make Data Count project.
But there are other kinds of data around data sets, in particular, altmetrics. And finally, we have been talking mostly about infrastructure so far, but where we really want to get to is data metrics, and there's work to do, and I will talk a little bit about what the next steps are there.
So, on the next slide, I will talk a little bit about usage stats and why adoption is still at a very early stage. Collecting usage stats is not trivial.
Log processing in a standard way, generating reports, all of this is quite resource intensive. And if you're not a large data repository with lots of resources, that might be something that looks a little bit scary. So, we have been starting to think about providing this as a service, as DataCite, for our members.
And obviously, that's not something that others haven't talked about before. Sort of the closest to this is very successfully happening in the United Kingdom with IRUS-UK, which provides a centralized service that basically every UK university takes part in,
generating COUNTER-compliant reports for text documents hosted in institutional repositories. They have started to work on data metrics as well. And what we would be implementing, if we do that, and we are at early planning stages for that, would be quite similar.
The service works not by processing log files, but by using a token that's embedded in web pages, which works similarly to how web tracking services like Google Analytics and Matomo work. The big challenge in this is obviously privacy, in particular if you cross country borders and
maybe also move inside and outside of the European Union, with its particular privacy rules since last year, et cetera. Hopefully there is a little bit of discussion around this slide, and that's definitely something we will be working on going forward in the next few months.
On the next slide, I will talk about additional sources for data citations, which is primarily two things. One is to look at datasets with dataset DOIs that are cited in publications that don't have Crossref DOIs.
One example for some communities is preprints on arXiv. But of course there are lots of text documents that don't use DOIs, and we should think about how we can collect those citations. And then citations might not always be put in metadata, and the best approach to find
those is text mining, which usually requires a license to have access to the full text. The Europe PubMed Central is doing this for life sciences for the open access corpus there. And that's something we have looked at.
There are additional data citations that we can include that don't appear in reference lists. And obviously this is focused on the life sciences, but we can take similar approaches with other disciplines and also with partners that do this text mining already. On the next slide, I will talk a little bit about aggregations.
Basically, Christian has sort of introduced you to the concept already. We are doing this already, but we want to make it much easier. And that's basically work we're doing right now in the EC-funded FREYA project that was mentioned before.
And we call this the PID graph. And in the next slide, you see an example of a PID graph, which looks super complicated, but it's basically all the connections.
Starting with a single researcher, what are all of his or her publications, in this case, not datasets, and what in turn is referenced by these publications. This work is at a fairly early stage, but it will allow us to do very sophisticated aggregations,
and you can expect much more on this in a few months. Finally, altmetrics. That's something that is used a lot for journal articles and other text publications, but nobody is really doing this for research data.
That's mainly because in the surveys that we and others have done in previous years, there wasn't so much of an interest. But what we haven't done is revisit whether this has changed since we did the survey a few years ago. And also, it might be that for specific disciplines, for specific kinds of
data, there is a lot of altmetrics that is interesting to capture and expose. And this could not just be tweets, but it could also be Wikipedia and other kinds of altmetrics that could provide useful information. Event Data, as Christian mentioned, is a collaboration with Crossref, and Crossref
has built a lot of infrastructure around tracking these kinds of information already. So it would be relatively straightforward to expand this to research data, and that's something that we definitely want to do. It just hasn't been a top priority, because we felt the citations and usage stats are more important.
But that's something that definitely will happen. Next. These are my final two slides. We want to move beyond building infrastructure. What we're really interested in is data metrics, and DataCite can contribute to that.
We think that data metrics is something that is not there yet, but that we are on a good path. We are sort of in the second stage. The first stage, and we all have worked really hard on this for many years,
is building community agreement that research data and data citations are critical for scholarship. In many cases, we could say that we have achieved that; there's still lots of work to do. But then we and others have moved to the second step, which is building infrastructure to collect these citations and usage stats.
And as we presented today, we have made good progress there. There's lots of work to do with adoption, for example of usage stats, but we have started to think about what comes next. And that is moving towards data metrics. And as a first step, sort of as one of the final things we do
within the Make Data Count project, we are starting to reach out to the bibliometrics community and work with them on what is needed for data metrics. And we have already planned a mini workshop as part of a bigger bibliometrics conference in a few months.
And on the next and final slide, I list some of the things we have to think about and consider as we start to develop these data metrics together with the bibliometrics community and the broader research community. Metrics for journal articles
are widely used, but not everything there is perfect and there are issues. We should learn from that and not repeat some of the mistakes made, for example using the journal where the article was published as a proxy for impact.
And that, of course, we want to try and avoid; for example, we shouldn't judge a dataset by the repository where the dataset was published and hosted. Initiatives like DORA stress best practices for responsible metrics, and we have started to talk to these communities.
And we should work more closely with them so that, when we move toward data metrics, this is not something that relies, for example, on a single number, with arbitrary boundaries where a metric above 10 is great
and if it's 9.9, it's not. This sounds silly, but this is what is actually happening, as many of you know, with some of the journal metrics that are currently in use. And finally, we want this to remain a community effort that's not locked behind paywalls and commercial providers.
So far, for data citations and usage stats, we are on a very good path, with, for example, the Scholix effort at RDA that many organizations are participating in, where everybody can contribute and consume this information, and we hope it stays like this. And with that, I hand back to Helena, and we can hopefully answer some questions.
Well, thanks a lot all three of you for this great overview. There was a lot of information in there. But we still have time for a couple of questions for all three speakers. You can use the Q&A button or you can also use the chat if that's easier.
We already have a couple of questions. So the first question is, you mentioned that this was already a second MDC project. Does that mean there could also be a third one? I don't know who wants to take that question.
I can answer that. Yes, that's something that project partners are discussing. Nothing is set in stone or even submitted. It's clear from the presentation today that there's both a lot of work left to do, in particular
with adoption, but also that there's still a way to go to move from infrastructure to data metrics. So definitely something that we want to move forward and something where we hope we can convince the funder to help us with that.
OK, great. Yeah, and I think the next question is probably also for you. So who would decide when these numbers are ready to be used as data metrics? That's a very good question, and I don't think there's an easy answer.
I think the short answer is, if you look at metrics for other scholarly content, you have to be patient. So the time from starting to reliably track, for example, citation counts, to when we start to think about them as indicators or metrics, that's more in the order of 25 years.
And hopefully this goes faster with data metrics. But just the fact that you can start to count something and there's still a lot of work to do doesn't mean it's metrics. The first step is obviously that we can trust these numbers. And for that, we have to count this sort of for a little bit longer than what we're doing now, maybe also independently by different organizations.
And then the second part is that, aside from the technical and data part, there needs to be a lot of community agreement. So simple questions like: are 10 data citations of something more meaningful than eight?
Or what is a good number? Or how do you manage the difference between disciplines where the citation culture, the data citation culture is very different? So does a count of five data citations in social sciences mean the same as in life sciences and most likely not, et cetera?
So because we know that impact assessment and metrics have a huge effect on how we all work as scholars, we should just be careful and sort of do one step after the next and not jump ahead of ourselves and say we have data metrics
and just use these numbers however you see fit. I think we will get there, but it's just a path that has more steps left before we can say we have data metrics. Okay, so I see a couple of implementation questions. So let me start with one from the Q&A.
As an IRUS-UK contributor, it would be preferable if all data from our repository could be in one place for analysis and benchmarking, rather than data metrics in DataCite and publication metrics in IRUS-UK. Forgive me if I misunderstood. So I think, yeah, there might be a misunderstanding. So I don't know who wants to take this question.
If it's okay, Christian, I'll try to answer that. We have worked with IRUS-UK from the beginning, mainly because they are also very involved in the COUNTER initiative where we wrote the code of practice. What is clear is that IRUS-UK is sharing this information with others.
So, for example, OpenAIRE is also using that. So it's not an either-or situation; but if repositories do all this work for IRUS-UK, it would be nice if we can also show it in the DataCite infrastructure. For example, if it goes via the Event Data service, it's very easy to combine citations and usage in one API, in DataCite Search, et cetera.
So there's no decision to be made to do one or the other. It's really more about exposing this information as widely as possible.
And of course, it's up to the data repository whether or not that is something they want to do, or whether they want to keep the information in IRUS-UK. Yeah, I hope that answers your question. And then in the chat, there's a question: are there plans for institutional repositories like DSpace to include the needed functionality?
I don't know if Robin or Christian wants to take that. I mean, when it comes to, for example, consuming the usage data, I think it's something that we have not discussed with other repositories.
I'm not sure if Robin knows about that. But with regards to providing usage, I think this is mostly about providing and reporting the usage. That is something where whoever is collecting the logs, the organization running the repository, should be providing the functionality.
With DSpace, as you understand, you install it and you will have all those logs there. Maybe they could work on making that process easier.
But I'm not sure if they have plans for that at the moment. Can I add two things? One is that Open Repositories is just around the corner; it's less than two weeks away.
And that's, of course, a great opportunity to discuss some of these things in detail there. We have already talked to the DSpace folks a little bit. The other aspect is that one of the challenges we saw is that many repositories don't host only data or only text documents.
A good example is Zenodo, which has implemented our code of practice. And right now the log processing is about 90 percent the same for text and data documents, with the differences that Christian described early on.
So we should figure out how we can make this easier for, for example, institutional repositories that host data and publications, to do log processing in a standard way that conforms both to the code of practice for research data and to the COUNTER code of practice for publications,
either the most recent release 5 or release 4, which many people still use, so that the processing is the same and the report generation has some differences, because they are slightly different things. And that is, of course, something that needs a little bit of work on the DSpace side,
but probably 90 percent of the work has already been done, because these reports are very similar. Okay, thanks. Then I think we have a last question, probably for Robin. If my developers want to start implementing, is there one place where they can find all the technical documentation to do this?
Yes. So we have on our support site, which we've recently changed up a little bit, a new section just for usage and citations. And I believe that was one of the links that was on the slides in the section that I was doing, where I said,
you know, read more about this in detail. We will be sharing those slides with everybody, and then you can actually click on the links and all this. But yes, on our support site, there is a usage and citations section that will link you to all the appropriate documentation.
Okay, great. And thanks again to all three speakers, and also thanks to all the people who joined today and asked great questions. As I said, we'll be making the recording available on our YouTube channel. And there will be a webinar next month where we'll be discussing the PID graph. So we hope to see you again next month.
Thank you.