
Analytics & Reporting at different levels for a CRIS based on DSpace


Formal Metadata

Title: Analytics & Reporting at different levels for a CRIS based on DSpace
Subtitle: the use case of the Peruvian National Platform
Title of Series:
Number of Parts: 9
Author:
Contributors:
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content, in adapted or unchanged form, for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers:
Publisher:
Release Date:
Language:
Production Year: 2022
Production Place: Berlin

Content Metadata

Subject Area:
Genre:
Abstract: 4Science was awarded a contract by the Consejo Nacional de Ciencia, Tecnología e Innovación Tecnológica (Concytec) for the development of the National Platform #PeruCRIS, based on DSpace-CRIS and funded by the World Bank. In the context of the project, a sophisticated solution was developed for the analytics and reporting functions. This solution provides DSpace-CRIS with a powerful set of tools for data analysis, reporting and visualization, based on a combination of state-of-the-art open-source technologies: OpenSearch, Superset and Dremio.
Keywords:
Transcript: English (auto-generated)
This meeting is being recorded. We have two interesting talks in the first session today, both looking at analytics for DSpace-CRIS. The first one is about analytics and reporting at different levels for a CRIS based on DSpace, looking at the use case of the Peruvian national platform. It is presented by Andrea Bollini from 4Science. Welcome, Andrea. The stage is yours.
Thank you, Pascal. Let me say that I'm very happy to open the second day of this precious event that annually brings together the DSpace users of the German-speaking countries. I'm happy to share with you today our understanding of analytics and reporting needs, based on years of work on CRIS implementation projects.
Today we will use one of our running projects, the implementation of the Peruvian national platform with DSpace-CRIS, as the basis for our narrative. Let me start with some context about the project. The Peruvian CRIS platform has been funded by the World Bank and aims to provide Peruvian citizens with a modern information ecosystem in science and technology, delivering broad value, accessibility and development. The main goal is to develop an open, interoperable and integrated national network that is able to collect and organize all the information related to Peruvian research activities, giving it visibility and transforming it into knowledge, so that decision makers and all interested parties can take valuable action to support research.
The expected benefits are many, mostly related to having timely statistics and reports on national research and development activities; better monitoring and evaluation of public funding and of the national open access policy; dissemination of research results; analysis of trends and impact; sharing and discovery of innovative technologies, ideas, new markets, competitors and partners; and better decision making at different levels: national, local, institutional, the private sector and the general population. As you see, all of these benefits are about analytics: understanding and extracting value from your information, from your data. The two pillars of the project are sustainability and interoperability. There are many aspects that need to be accounted for, but the key answer in our approach is provided by the adoption of enterprise-grade open source projects and the use of agile methodology.
Adopting open source technology requires careful scouting and evaluation of the available solutions. Both the technical qualities and the governance and license model must be considered. The selected solution must be monitored, and we need to be aware of the community mood. An active role is required for an effective use of the solution. 4Science is a lead contributor to DSpace, and we are happy when we can contribute to other open source projects as well. The agile methodology allows us to govern changes instead of wasting time fighting against them. In such a large project, it is obvious that changes will arise. We provide high-quality software, using agile methodology, that can be improved step by step. Agile is an abused word nowadays; at 4Science we take it seriously, investing time in training and having several members of our staff certified as Product Owners, Scrum Masters and Scrum Developers.
The architecture of the Peruvian national platform is quite complex, as you can imagine, due to the scale of the project. It is an interoperable project, not a monolithic platform. It is built around DSpace-CRIS, which is at the core of the project, but information is automatically collected from several sources. We extract and receive information from institutional CRIS systems that share their data with the central installation using standards such as OAI-PMH and CERIF, following the OpenAIRE Guidelines for CRIS Managers, and so on. But also from some commercial and non-commercial databases: Scopus, ALICIA, Crossref, PubMed and many, many others. Of course, there are also some governmental sources of information like RENACYT, SUNAT, SUNEDU and other Peruvian national databases. On top of that, this is just the Peruvian research data; but research works at the international scale and sits inside a broad community where open data exist and need to be used to enrich your information and extract value. So, what have we done for the analytics part, which is the main focus of today's presentation?
We have introduced mainly three components. OpenSearch is a community-driven open source search and analytics suite derived from Elasticsearch, but licensed under the Apache license, precisely because Elastic is no longer an open source initiative and no longer adopts an open source license. We were very careful about that. A technology partner experienced in open source knows that an in-depth analysis and understanding of the underlying communities is required to make sustainable designs and avoid exposing the project to risk. For this reason, we decided from the start to stay away from lock-in traps like Kibana with the X-Pack: we adopted the Apache-licensed Open Distro at the start, and later moved to OpenSearch when Elastic decided to change the license of the core analytics platform, which is no longer open source. The second component of our analytics solution is Dremio, another open source project that provides a data lake engine, creating a semantic layer and supporting interactive queries over distributed data sources. To provide a visualization of this information, we have adopted Apache Superset,
a modern data exploration and visualization platform. Here you see the flow of data in our analytics component. The main information comes from DSpace-CRIS; it is sent to OpenSearch through a queuing mechanism that allows near-real-time synchronization, if needed, or overnight synchronization. In this step, the data are pre-processed so that we are able to provide different views over the same data in OpenSearch. We will see more about that later. OpenSearch becomes one of the sources of information for Dremio, for the data lake, together with many other, unknown sources that exist out there.
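The ingest-time pre-processing just described can be sketched as follows. This is a minimal illustration, not the actual PeruCRIS code: the record shape, field names and index names are all assumptions.

```python
# Expand one publication record into the two "views" indexed into
# OpenSearch: a publication-level document and one document per author
# contribution. All field and index names here are illustrative.

def expand_views(publication):
    """Return (publication_doc, contribution_docs) for one CRIS record."""
    publication_doc = {
        "_index": "publications",
        "id": publication["id"],
        "title": publication["title"],
        "year": publication["year"],
        "type": publication["type"],
    }
    # One document per author, so that a department can count single
    # contributions instead of whole publications.
    contribution_docs = [
        {
            "_index": "contributions",
            "publication_id": publication["id"],
            "author": author["name"],
            "affiliation": author.get("affiliation"),
            "year": publication["year"],
        }
        for author in publication["authors"]
    ]
    return publication_doc, contribution_docs
```

A queue consumer would call something like this for every record received from DSpace-CRIS and bulk-index both document sets, which is what later makes the publication-level and contribution-level analyses possible.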
Well, it's important to note that I'm talking about unknown sources because the goal of the project is not to build analytics over a set of data that we know in advance: we want to give the Peruvian government the freedom to join the data from their information systems with any other source that could become available in the future, any open linked data, any local database or spreadsheet that they will provide with additional information. This data lake created in Dremio will be visualized and explored using Superset.
In such a large project, and in general in any CRIS system, there are different users, and the project scale can be quite different; the national scale of the Peruvian project will not apply, of course, to every other project. For this reason, we have followed our approach of progressive enhancement of the solution from the functional perspective, moving from the built-in support for analytics in DSpace-CRIS, to an analytics add-on powered by OpenSearch, to a data lake engine powered by Dremio and Superset. What do we mean by the built-in level? DSpace-CRIS supports a flexible search engine that allows data analytics and exploration. Results can be exported in configurable formats, including CSV, Excel files and PDF. Aggregations can be used to narrow the analysis, and faceted browsing can provide basic visualization. Here on the screen you see how, in the Peruvian project, we introduced graphical visualization for search results or for the whole database, and we can provide different types of graphs, such as pie, bar, line, and so on. These graphs can be built on top of any aggregation dimension that you want to use on your data: we are talking about publication types, involved institutions, authors, years, dimensions of your projects, and so on. This visualization can also be included at different levels of the platform. So, as well as in the search, they can be included when you visualize the details of a specific project or a specific person. Data can be extracted from DSpace-CRIS: you can of course use the REST API, but you can also very easily extract your data in Excel format, CSV format or any other custom format that you configure in the platform. So it's not only about local visualization. And all the search and export functions in DSpace-CRIS are contextual, based on the security of the user that accesses the system, so you can create a report about public data, but also about restricted data. What is the second level?
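As a concrete illustration of the export path described above, the sketch below builds a DSpace 7-style discovery search URL and writes result rows to CSV. The endpoint path and parameter names follow the DSpace 7 REST API pattern, but treat them as assumptions to be checked against your installation.

```python
import csv
import urllib.parse

def build_search_url(base_url, query, page=0, size=20):
    """Discovery search URL in the DSpace 7 REST style (assumed pattern)."""
    params = urllib.parse.urlencode({"query": query, "page": page, "size": size})
    return f"{base_url}/api/discover/search/objects?{params}"

def export_csv(rows, path):
    """Write a list of homogeneous dicts (one per item) to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Note that which items appear in the response at all depends on the permissions of the authenticated user, which is what makes these exports contextual.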
The second level is the analytics add-on that has been engineered in the Peruvian project, but that we are now able to offer as a generalized solution to any DSpace-CRIS installation, just for a share of the initial design cost. The analytics add-on provides self-service capability. In DSpace-CRIS you are able to configure many aspects: you can configure additional facets, additional filters, graphs, and so on; but if you change your mind, you need to change the configuration. The analytics add-on gives you self-service capability: you can change your reporting without changing the configuration, without changing the code. Moreover, during the ingest, the data are pre-processed, and this is very important to provide easy analysis from different perspectives. For instance, when you talk about publications, you can look at the publication itself, or you can look at the single contribution of an author to a publication. If you try to count how many publications an institution has, there are scenarios where it is also important to know how many contributions to a specific publication one department or another has provided. And it will be quite different if an author is the single author of the publication, or is one author in a large group where there are external authors or other co-authors from the same department, and so on. It's also important to know that the analytics add-on can be accessed from external applications as a normal SQL dataset, so that you can query this tool also from Power BI or Tableau or even Excel. What does it provide? It gives you the option to create several dashboards. Here you see the dashboards that are predefined to analyze your organization units, your persons, your projects, your publications.
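Because the add-on is reachable as plain SQL, the same question a Power BI or Tableau user would ask can also be sent, for example, to the OpenSearch SQL plugin's REST endpoint (`POST /_plugins/_sql`). The index and column names below are assumptions for illustration.

```python
import json

def sql_request_body(query):
    """JSON body for the OpenSearch SQL endpoint (POST /_plugins/_sql)."""
    return json.dumps({"query": query})

# Count publications by type since 2018, as an external BI tool might.
body = sql_request_body(
    "SELECT type, COUNT(*) AS n "
    "FROM publications "
    "WHERE year >= 2018 "
    "GROUP BY type "
    "ORDER BY n DESC"
)
```

Desktop tools like Excel would instead connect through the plugin's JDBC/ODBC drivers, but the query they issue has the same shape.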
You see that for publications there are several views, just to resume the previous argument: an analysis of publications as a whole, a single-contribution view of the publications, and a different point-in-time perspective, where publications are associated with a department not by means of the affiliation stated in the publication metadata, but by means of the current affiliation of the researcher. You see that a number of different widgets can be arranged in any dashboard. This is an example of the project dashboard, where you see the number of projects that are running in your institution, the distribution over the years of the projects, the different involvement of the different institutes in your projects, and also the economic value, the distribution of the economic value of the projects across institutes and project scales, so that you know that, say, institute B has a larger number of very costly or very highly funded projects.
For publications, another set of default dashboards is provided that allows you to see, again, the total number; to analyze how many peer-reviewed versus non-peer-reviewed publications you have; the distribution by year and by type; the contribution of each single institution or institute to the scientific output; the keywords, so the research areas that your publications focus on; and the distribution by author. You can also compare different performance indicators: you can use metrics like the Scopus citations or the Web of Science citations, or other metrics that you compute at the institutional level for your publications, to compare the performance of different institutes. Here you see an example based on a fake metric that we have generated, showing the average of the metric value for the publications and the median distribution of this metric for the publications.
So you can make some sophisticated statistical analysis of your data. And again, you can see on a heat map how your publications are distributed over the scale of your evaluation. About evaluation, it is important that you can base it on existing bibliometrics, but also on other metrics whose rules are defined at your institution, so that you give different weights to different publication types, or to the number of contributions, or to the impact, the social impact, of the research outputs that you are able to track somewhere in the system. The dashboards are all interactive, so you can click on any element, narrow your analysis to that specific element, and drill down to the detail of the raw data that contribute to the analysis, such as the list of publications. The self-service capability is quite simple: you can edit your dashboards, rearranging and resizing the elements just using drag and drop. You can create a new panel using panels that you have already configured, or create a completely new visualization using the set of predefined widgets that exist in the platform. So, the last layer of analytics. You see that the analytics add-on provides you self-service capability and a lot of value, but it is still limited to what you know in advance: you start from the DSpace-CRIS data, and you can enrich and integrate these data with a couple of external sources that you have identified in advance.
Dremio is a data lake engine. This means that, in the user interface, your administrators are able to register new data sources on demand, and to configure, again in the user interface, the way each data source needs to be joined with your existing data to create a virtual dataset.
So you can join your data with any OAI-PMH endpoint, SPARQL endpoint, Excel file, database, S3 data, and so on. Apache Superset is quite similar to what OpenSearch Dashboards, the fork of the former Kibana, can provide you in terms of widgets, dashboard capabilities, and so on; but it works on your virtual datasets, so it is not limited to the DSpace-CRIS data only. In this example, you see that for the Peruvian project we have joined the CRIS data with demographic data that come from another national database, to visualize normalized data about research activities and impact over the map of Peru.
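The map view just described rests on a virtual dataset of this kind in Dremio: CRIS counts joined with demographic figures and normalized per 100,000 inhabitants. The source, schema and column names below are invented for illustration; only the shape of the join reflects the setup described in the talk.

```python
# Illustrative Dremio virtual-dataset definition (standard SQL): research
# output per region, normalized by population from a demographic source.
VIRTUAL_DATASET_SQL = """
SELECT c.region,
       COUNT(*) AS publications,
       MAX(d.population) AS population,
       COUNT(*) * 100000.0 / MAX(d.population) AS pubs_per_100k
FROM cris.publications AS c
JOIN demographics.regions AS d ON c.region = d.region
GROUP BY c.region
"""

def per_100k(count, population):
    """The normalization applied per region in the dataset above."""
    return count * 100000.0 / population
```

Superset would then point at this virtual dataset rather than at either physical source, which is what keeps the map independent of where the demographic data actually live.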
This is just another example of a dashboard in Superset, showing the capability of the widgets, for patents and intellectual property in the Peruvian project. So, thank you. I hope there is time for some quick questions.
Thanks, Andrea. I guess we have time, and we already have a question for you: do you cater to different partner organizations in the sense that you provide an adaptable data model and multi-tenancy in DSpace-CRIS, or do you have providers map their data to your data model?
The question is also in the chat. Okay, I'm not sure I catch the whole point. In the Peruvian project there is some sort of multi-tenancy, because each institution has its data isolated from the others when these data are contributed to the national database. But these data are later aggregated at the national scale because, of course, there are collaborations among the institutions, and Concytec will provide the editorial check of these data, the normalization and the deduplication of these data across institutions. In the analytics part, the data are based mainly on the CERIF data model, but of course this can be extended, and is extended, to meet the local needs of the institutions, so that the extra information can still be processed by the analytics model and, of course, can also go into the data lake.
Are there any more questions? I think we give you another minute for questions, and in the meantime I might remind you of the great event we had together at Open Repositories 2018, where we were dancing as bears on a big stage in the Ideas Challenge. And if you don't want to see that again, you should come up with a question very soon.
I'm afraid I scared everybody. Thanks a lot for this great talk. Thanks. And I'm sure we'll see each other a little bit more later today. Sure. Bye-bye.