
Towards visualizations-driven navigation of the scholarship data


Formal Metadata

Title
Towards visualizations-driven navigation of the scholarship data
Title of Series
Number of Parts
16
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
One of the key goals of Cornell University Library (CUL) is to ensure the preservation of the scholarly works published by Cornell faculty members and other researchers. VIVO is an open source, semantic-technologies-driven application that enables the preservation of and open access to scholarship across institutions. Driven by different needs, users look at the VIVO implementation at Cornell from different viewpoints. The college requires structured data for reporting needs. The library is interested in the preservation of the scholarship data. University executives are interested in identifying the areas where they should invest in the near future. First, these viewpoints do not completely overlap with each other. Second, the current user interface presents the scholarship data in a list view format. Such a representation of the scholarship data is not easy for users to consume. In this presentation, we present our ongoing work on integrating D3 visualizations into VIVO pages. These visualizations are constructed on the fly from the underlying RDF data. A visualization-driven approach provides an efficient overview of the huge linked data network of interconnected resources. The visualizations are intuitive for users to interact with and offer the ability to visualize and navigate the large linked data network. We discuss the (data) gap analysis we performed as well as a few of the visualizations in detail and their integration into the VIVO framework.
Transcript: English (auto-generated)
I did a really annoying thing and fixed someone's name incorrectly for me, so make sure that that's right first. Yeah, so I work with metadata at Cornell, and I am here to present on behalf of two of my colleagues who were really excited to get this presentation accepted, but were
unable to make it to Germany, so they send their apologies. I'm here presenting the work of Muhammad Javed, who is an ontology engineer and semantic applications developer at Cornell University, sorry, I have my speaker notes here, and then Sandy Payette, who is the director of IT research and scholarship at Cornell University.
And we're going to be talking today about visualizing scholarly output, the drive towards visualization-driven navigation of scholarly data; the project name is Scholars at Cornell, so I'm talking specifically about that. Scholars at Cornell is a project at Cornell University where we are hoping to capture
and manage, enhance, display, serve up the scholarly output of Cornell University. We mean this as a whole. It's building off of previous work at Cornell on the Vivo platform, which I'll mention a little bit more in a couple of slides, but to really not look so much at just singular
faculty profiles, although that's part of it, but to really have the library help manage and curate and expose the scholarly output of the university. And so with that, we're sort of in phase one development. We have a test demo up, which hopefully will work, and we'll see it later.
But with this phase one development, we're looking to provide a high-integrity semantic knowledge base, which would then enable exploration and navigation of the scholarly record of Cornell University, and then the discovery of expertise, impact, and collaboration, whether existing or possible, of Cornell faculty and researchers.
So that's a lot of words. What do we mean? So we're looking to really provide a nice interface to say this is the impact of Cornell's research on the world. We want to grow faculty engagement in this work. So we are Cornell University. We have a Vivo instance. Vivo actually started at Cornell some years ago. Vivo is a researcher profiling system built with Java and linked data in the back.
It's had mixed engagement from various departments at Cornell, largely because of the data: there were a lot of duplicates, and there was a lot of ... does the faculty member update it here or update it there? What's going on with this? And so Scholars at Cornell is looking to possibly reboot engagement with faculty members by
thinking about faculty curation and ownership from the beginning. It's also meant to serve as a data and visualization service that is motivated by patterns in the data. And the best I can understand of this work is that we're looking for improved data algorithms
to match and merge many different sources of scholarly description and data about our faculty members. We'll see some of those data sources in a second. We want to find patterns, algorithms, and semi-automated pathways to get that data cleaned up, merged into some sort of format that can work in the platform, and then exposed.
So the Scholars at Cornell project has started by really focusing on who the stakeholders are, particularly internally, for this reboot. One of them, obviously, is the library; we have a couple of different questions we might ask of this platform.
Academic departments are another key internal stakeholder, then students, and then the university as a whole. What do we mean by this? Well, for library stakeholders, for example, we really want to be able to serve up data and visualizations and information that would answer questions like, well, what journals and resources are faculty members targeting?
What are they really using? We know we have that one person who always wants to send you purchase requests for something, but does that really capture what the scholarly output is and where our researchers are focused? Are those resources covered in our catalog and repositories, so are we getting appropriate access in some way, or are we purchasing incorrectly? And what should we prioritize for various efforts such as open access efforts or preservation
initiatives? For academic departments, it's really sort of a reporting tool. By academic departments, we mean deans and department chairs and academic staff. They want to answer questions like: how many articles are out there?
How much research is occurring? Where is it being published? How often do we collaborate with other departments? What does that look like? A lot of grant reporting information that's possible in this work, and we'll see some examples of that. And then, what research areas are being covered or are emerging? That doesn't necessarily mean chemistry is always going to be a chemistry department,
but perhaps there's some emerging sub-domain within that that we now want to consider getting more faculty for or supporting in some other way. For students, and we ... Scholars is more focused at PhD level and graduate students at this point, but that's not to say it couldn't expand.
For students, we're really thinking about how can we help you find experts in research areas, especially when there's so much cross-departmental and cross-discipline work going on now? How can you find someone who might be interested in a particular subset? Ask them to be your supervisor, hopefully. And also, look at various emerging scholarly areas and find articles on it, and we're
focusing very much on articles in phase one, so you'll see that quite a bit. For the university, we want to know, and I've said this pretty much before, we want to know what the high impact areas of Cornell work are and what our global impact is in the emerging trends and in collaboration opportunities. So with that background, we get a little bit into what data goes into this system,
what is Scholars, and what data can we leverage within it. Scholars at Cornell right now pulls data from a couple different sources. We have our existing Vivo production instance, which we're looking at to see if we can possibly pull and leverage some data from that.
Vivo, again, is our faculty researcher profiling system. It's got some data in it. Can we pull that? Can we clean it up? Can we get it usable for this new system? We definitely are looking at or pulling stuff from OSP, the Office of Sponsored Programs at Cornell University. They give us all their grant data, so we can pull that in and merge it and have
information about that in the system. We pull information from human resources. This is where we really capture all of the departmental affiliations and status of our faculty members and researchers. We pull information from the registrar to see what kind of course loads are occurring or that sort of output.
And then at the core of Scholars at Cornell is going to be Symplectic Elements feeding into a new Vivo instance. And we're looking at Symplectic Elements for this to manage all of the various streams of data and to help with merging, but also to be an access point for faculty members to own articles so they can go in and say, yes, that's my article.
Yes, this is me. Because often we all have seen where articles have Smith JS as the author name. Is that this person here? You'd be surprised how many Smith JS's there are in a particular work area. Symplectic Elements serves in that way, and then we serve that up to Scholars, which
is another Vivo instance. This is a slightly different diagram that shows a little bit more of how the data gets pushed back and forth. You can see the hand curation there. We have Symplectic Elements, which is Elements 5.0, really serving as the linchpin for
data refinement, cleaning, and as an access point for faculty members to self-curate. And then it goes into a cache and feeds into Scholars. The only data I'm aware of, and I would need to have this confirmed, that doesn't go through that Elements section right now would be the HR data, which is where we say this is the person's name and their affiliation within the college and that sort of information.
And that does eventually get pushed back to Elements. So this is a slightly different view of that data structure. Let's see if I missed anything. And what happens with this structure is that after Elements, what gets cached and then pushed
to Scholars is what we're calling an Uber record. It's basically where we've merged all these sources and made a record that can then serve as clean, curated, accessible data in Scholars. So for this Uber record, we take information from Web of Science, which we're using particularly heavily in this test instance; that gets pulled into Symplectic Elements and gets
curated by faculty members. We take information from the HR records that are coming in, and we directly query the Oracle database behind a couple of different campus systems to get information on grants or teaching efforts or whatever. That all goes into the Uber record and then gets pushed into Scholars.
This is just the start of a sample Uber record, since I was talking about it. Not particularly exciting, except that it is clean, aggregated data going into the system. So you can see here a little bit of what's going on. If you want to see a full one, I can pull a sample full one for you, but it's not really the heart of this.
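To make that merging step concrete, here is a minimal, hypothetical sketch in Python. Every field name, key, and toy value below is invented for illustration; the actual pipeline merges the Symplectic Elements, HR, and grants feeds through its own internal process.

```python
# Hypothetical sketch of assembling an "Uber record" for one researcher.
# All field names, keys, and toy values are invented for illustration.

def build_uber_record(hr_row, elements_pubs, grant_rows):
    """Merge per-person data from several feeds into one curated record."""
    return {
        "person_id": hr_row["netid"],          # assumed shared key across feeds
        "name": hr_row["display_name"],
        "unit": hr_row["department"],
        "publications": [
            {"title": p["title"], "journal": p.get("journal"), "doi": p.get("doi")}
            for p in elements_pubs
            if p.get("claimed")                # only items the faculty member has claimed
        ],
        "grants": [{"title": g["title"], "sponsor": g["sponsor"]} for g in grant_rows],
    }

# Toy example:
hr = {"netid": "ab123", "display_name": "A. Baker", "department": "Biomedical Engineering"}
pubs = [{"title": "Example article", "journal": "J. Example", "doi": "10.1000/xyz", "claimed": True}]
grants = [{"title": "Example grant", "sponsor": "Example Sponsor"}]
print(build_uber_record(hr, pubs, grants))
```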
What the heart of this is, in my opinion and for this conference, is that Scholars is building very much off the Vivo model and ontologies. So you can see here this is the modeling for one of the people, persons, that would be managed in Scholars and that would have connected to it eventually grants, departments,
research outputs, articles, which we would then also manage in Scholars. So you can see we have a foaf:Person, we have information that comes from the HR department that gets asserted on it, it gets related to positions, which we would manage from the HR data as well, and then we would start to apply more information according
to these other data sources like this. This is relatively simple but actually has been quite powerful for the visualizations we're going to see. This is where we have a person who is the author of an article from the article data that we get from Web of Science or elsewhere, but primarily Web of Science right now, we get the journal name. From the journal name, we get any sort of subject assignments.
We then infer the subject assignments and attach them back to both the article and the person. Not perfect, but it gets us a place to start with visualizing this data. And this doesn't indicate expertise, it just indicates that they've written something on this topic.
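As a rough illustration of that person-article-journal-subject chain and the inference step, here is a small sketch using Python and rdflib. It deliberately uses simplified properties in a made-up ex: namespace rather than the actual VIVO and BIBO terms, and every URI in it is hypothetical.

```python
# Simplified sketch of the model described above; ex: properties and URIs are
# hypothetical stand-ins, not the real VIVO/BIBO terms or Scholars identifiers.
from rdflib import Graph, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/scholars/")
g = Graph()

person, article, journal, subject = (
    EX.person1, EX.article1, EX.journal1, EX.subject_nanofabrication)

g.add((person, RDF.type, FOAF.Person))
g.add((person, EX.authorOf, article))         # person -> article (e.g. from Web of Science)
g.add((article, EX.publishedIn, journal))     # article -> journal
g.add((journal, EX.hasSubjectArea, subject))  # journal -> subject assignment

# Inference step: attach each journal's subject areas back to the
# articles published in it and to their authors.
inferred = set()
for jrnl, _, subj in g.triples((None, EX.hasSubjectArea, None)):
    for art, _, _ in g.triples((None, EX.publishedIn, jrnl)):
        inferred.add((art, EX.hasSubjectArea, subj))
        for auth, _, _ in g.triples((None, EX.authorOf, art)):
            inferred.add((auth, EX.hasSubjectArea, subj))
for triple in inferred:
    g.add(triple)

print(g.serialize(format="turtle"))
```

Running this prints a small Turtle graph in which the person now carries the inferred subject area, which is the hook the visualizations described below rely on.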
So the partnerships with Cornell with the faculty are extremely important for this process to work. It means that they need to go in and check and harvest and curate and look at it and really take part in building their own scholarly portfolio. Now I'm going to test the fates and see if I can show you something on the site.
And if it fails, don't worry, I took screen shots because internet failing at a library technology conference is a bylaw at this point. Nope, that's not what I want, sorry, I just showed someone else's. So this is the homepage for scholars at Cornell. You can see if we scroll down, we've got a couple of different entry points for how
you might start accessing the data captured here. You can go in and start looking around according to, I'm going to skip this, because what I really want to show you is these are the types of visualizations all built with
D3 that are pulled in from that Uber record that goes into this system and it is managed underneath with all of the data modeling and such we saw before. You can see we've got, keyword clouds are just ubiquitous now, but we've got those, we can figure out collaborations, research grants have been a really interesting thing
to sort of harvest and play with that data to see what's going on, and person to subject area, which we're going to look at. But I want to first go dig into academic units so you can see what it looks like. We've got quite a few different academic units, some of which you might argue aren't necessarily that, but we've got those here that you can start to navigate through.
I'm going to go to college and this has primarily the College of Engineering data in it right now because the scholars team has been partnering with them quite closely as a test case. It's working okay so far, thank God. We've got departments underneath and for each department or unit you could have visualization
at that level as well, but I'm going to drill down. Here we've got faculty members, now let's see if this will work for us. This should be a visualization of all the faculty members at Cornell, or rather in the Meinig School of Biomedical Engineering. Faculty members are in the middle and then those subject assignments that we were talking
about get out here on the edge. If I was to click on one person, he becomes the center all of a sudden, I can see what he's been working on, that's the topic, let me see if we can go back, you can reach over and then what pops out are other people that have also worked on that topic. You can sort of dig into this, play around with it as much as you like, but if I click
on the faculty member's page, you see we get a page in Scholars for this person, which also means we have a URI that we manage for this person. One of the things we're doing with Scholars going forward is to make sure that those URIs are persistent so that other universities can continue to query them, but indicate if a faculty member is no longer at Cornell.
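The shape of the data behind a faculty-to-subject view like the one just shown is, presumably, a nodes-and-links structure of the kind D3 force and radial layouts commonly consume. The sketch below is a guess at that shape, not the actual Scholars payload; the labels and the "group" field are illustrative only.

```python
# Rough guess at the nodes-and-links JSON a D3 layout for this view might consume.
# The "nodes"/"links"/"group" fields follow common D3 example conventions and are
# not necessarily the structure the Scholars front end actually uses.
import json

def to_d3_graph(person_subjects):
    """person_subjects: dict mapping a person label to a list of subject labels."""
    nodes, links, index = [], [], {}

    def node_id(label, group):
        if label not in index:
            index[label] = len(nodes)
            nodes.append({"id": label, "group": group})
        return index[label]

    for person, subjects in person_subjects.items():
        p = node_id(person, "person")
        for subject in subjects:
            s = node_id(subject, "subject")
            links.append({"source": p, "target": s})
    return {"nodes": nodes, "links": links}

print(json.dumps(
    to_d3_graph({"A. Baker": ["Nanofabrication", "Biomaterials"],
                 "C. Diaz": ["Biomaterials"]}),
    indent=2))
```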
The other thing I want to show really quickly, I've got to get back, oh come on, please. So if we are back in engineering, I'm going to go back to the Meinig School because I know that one works for me.
Something else you can sort of look at, if it can come up, and it's going to take a second to load, these are the grants that are associated with that department vis-a-vis the faculty member information. If you hover over it, you see the name for the grants, you can click on it, get more information, click on the grants, and we have another record in Scholars as well
as a URI for that particular grant application. And we have that linked to the faculty members through that HR data. One final thing I would like to show, if I can get it to work, is the possibility
for collaboration. Let me go back. So we do want to support people collaborating. This shows how many times someone from a particular department has collaborated with someone else in a particular other department at Cornell. You've got engineering department in the middle, you've got the high level code names pulled
from Cornell organizational data for the departments, and then you've got further on like sub-departments or units within it, you can hover over and see how many projects or something they collaborated on together, click on that, you get a new visualization where you have the people in their department and what they were up to.
So this is meant to help support both showing how we're interacting with each other at Cornell, but one of the phase two goals for this is to actually extend it to have it show collaborations with other universities as well. So let me go back to PowerPoint. So luckily the demo worked.
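A back-of-the-envelope version of the aggregation behind that department-to-department collaboration view might look like the following. The input shape and unit names are invented; the real counts would come from the merged Scholars graph rather than a flat list like this.

```python
# Invented input shape: each article lists the academic units of its authors.
# Counting cross-unit pairs per article approximates the collaboration numbers
# shown in the department-to-department view.
from collections import Counter
from itertools import combinations

def collaboration_counts(articles):
    counts = Counter()
    for article in articles:
        units = sorted(set(article["author_units"]))
        for a, b in combinations(units, 2):   # each cross-unit pair counts once per article
            counts[(a, b)] += 1
    return counts

sample = [
    {"title": "Paper 1", "author_units": ["Biomedical Engineering", "Computer Science"]},
    {"title": "Paper 2", "author_units": ["Biomedical Engineering", "Computer Science",
                                          "Applied Physics"]},
]
for (a, b), n in collaboration_counts(sample).items():
    print(f"{a} <-> {b}: {n}")
```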
Yes. That's all I wanted, so now I can die happy. I'm not going to go through the screenshots; they're in there just in case. Exposing data going forward: we don't have a SPARQL endpoint at present, but built into Vitro and Vivo are a variety of APIs, and so that's something that is actively being
developed. We do want other universities to start seeing that they can use this data in the context of Cornell's scholarly output; maybe Harvard wants to see it and see how it compares or what's going on. That's something we're actively thinking about building out. And like I mentioned before, we're keeping those URIs now in perpetuity, where we did
not with the previous Vivo. If somebody left Cornell, we basically no longer managed the URI. So if you want to know more about this presentation, you should really contact Javed, because he's brilliant and he's the guy behind all the visualizations especially. That's his email right there. Thank you for dealing with me presenting his work, and I appreciate y'all's attention.
Actually, before I ask the question, I wanted to ask about your very last remark.
Ring the bell for me. Oh, yay. You said, well, not necessarily in a good sense, that if somebody leaves Cornell, then his or her URI is not managed anymore. Absolutely. So what I can tell you about that, and this is something I can speak authoritatively
about, so thank you. One of the things we tried to do at Cornell was pull URIs for faculty members into our MARC records leveraging the subfield zero. That's an entirely separate project, but it's a context for the answer I'm giving you. We would want to pull that in particularly for dissertations so that we could have an identifier for that faculty member and we don't necessarily have an ORCID ID or a name
authority file record from the Library of Congress. When I started running that reconciliation process and trying to do some updating and some entity resolution stuff, I all of a sudden got a bunch of 404s and was like, what is going on? And that's where I uncovered that there was sort of an evolving process over time.
The data was still stored somewhere, but the URIs were no longer exposed. And we were like, this is a problem for our workflows in particular, but hey, could you reconsider that decision? With Scholars, they're saying, yes, that is no longer going to be the case.
OK, it will not be. Good. Sorry for the long story, but that's the context of it. Now my original question; I'm sorry, because this was just triggered here. How should I put it? So how much of that is Cornell-specific?
It's entirely Cornell-specific right now. So here comes the standardization person. What I would like to see is: if I do a scholarly publication and can at last convince my publisher to put up proper metadata for journals and authors, et cetera, I would
like that data to be in such a format that you could use it directly. So I want that type of metadata to be standardized enough that the scholarly community as a whole, Cornell included, could use it rather than having isolated silos.
Absolutely, absolutely. My response to that would be twofold. The actual model of the RDF data exposed through Scholars is based off Vivo. And so that is a community ontology that we're extending, and hopefully those extensions can get wrapped back in, or who knows. That's all about reusing where possible and clear modeling.
For those Uber records, there's such an internal process for the merging and loading that I don't even know if we would want to expose those necessarily, or just make sure we expose that really nice RDF data that comes out of the system. Probably not everything, but as much as possible should be general and not bound to a university, however great that university is. So you have a metadata person giving a presentation on behalf of a couple of programmers.
Yeah, it should be generalized and it should follow standards, absolutely. Okay, any other questions? Come on. Yes, sir.
Maybe not a question, but a comment continuing on what you said. Springer Nature are now publishing a graph of information on their journals and articles and so on in a project called SciGraph. And for now, they've elected not to reuse anything and to make their own ontology. I've spoken to them a couple of times.
There are pros and cons to this. The big question is how established is Vivo and how well does it cover the universe of science? Because before Vivo, there was CERIF here in Europe. Okay, it's an XML model, but a very general model of relations between things, a very generic thing.
And a few years ago, there were attempts to RDF-ize it. My understanding is that CERIF is the foundation of current research information systems in Europe on a mass scale. It's also the data model behind OpenAIRE, the European portal for articles produced by
Horizon 2020 and before that, FP7. So I think that a lot of data modeling for this domain is kind of already done, but there just needs to be more coming together, especially between the states and Europe, I think.
Sure, absolutely. I'd be interested to help support any collaborations between us and others. You two in the front seem particularly interested, but we can see what we can do together going forward. I would also say that the real focus of Scholars right now is not necessarily so much remodeling in the back end as trying to figure out our priorities for what this
application is meant to serve, and then having that data exposed so that reporting and visualizations can happen. Actually, just a remark on that. I may be mistaken, sorry if that's the case, but I think that there is bibliography work done at schema.org, which may be a way to standardize it rather than making it independent.
Right now, Scholars focuses on article data and data we can pull from Web of Science. Phase two definitely has monographs and books within scope, and what that's going to pull in is other projects at Cornell that are already looking at RDF modeling of that, in particular
LD4L, which we'll be hearing about tomorrow from another very brilliant speaker. There's one question here. Yes, we got the first question via Twitter. Oh, yay. I invite anybody watching the live stream to ask another question, so I will just read
this out. It's by Christian Hauschke, from TIB currently, I think. Can the person take their profile with them when they leave, for example via an ORCID synchronization? They could definitely export the data and pull it along, absolutely.
Taking the data, sure. Their profile still staying up? I would presume so, but there will be some sort of indication on it that this is no longer a faculty member at Cornell, and we wouldn't still be collecting more information about them because they would no longer be in the HR systems or something like that. It would be an interesting area of work to see, if they moved
to Harvard and they had something similar, how we could link those two. Still more? Just one short question. Is this demo system open for everybody to have a look? I believe so, and I'll triple check the GitHub link is public.
I know it's on GitHub. I have to confirm that it's open, and then we can see. It's closed now, okay, so we'll see if we can share it in some way. Okay. It would be great. It's really great to browse. It's great work by someone who's not me, so I hesitate to comment on the openness
of their code, but I don't see why we couldn't. But as I understood, the website is open. The website is a demo, and we're still figuring out the kinks, so you'll be asked for a password if you just sneakily copied that URL from my browser, and I apologize for that. Okay. Okay.
I think we have time for one more question. How do you cope with, I would say, flux in your organizational data? I mean, chair X moves to another institute, and the institute is moved to a different department. Professor A wrote a paper with Professor B. B wasn't at Cornell then, but he's now here,
as a professor. Or they're a visiting fellow, they change their title, yeah, they get a special chair. So I would imagine, like I said, this is a demo, and when I say that, again, it's because it's a demo tightly coupled with the partnership with the College of Engineering,
and so I imagine at this point, they're just trying to figure out what kind of...