Betty's Re Search Engine Talk
Formal Metadata
Title | Betty's Re Search Engine Talk
Number of Parts | 2
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/66914 (DOI)
Production Place | Event "Engineering Research Software" NFDI4Ing Community meets Archetype BETTY on 27.02.2024
Transcript: English (auto-generated)
00:01
Exactly. Okay then, an official hello, everyone. My name is Vasily Seibert. I'm super excited to be here and to share with you Betty's research engine: what it is, what it does, and what's new about it. I've had the chance to talk about
00:22
Betty's research engine a couple of times now, also in the context of the community clusters. So today I want to try something different. I want to do a demonstration, a more practical use case of how Betty's research engine can help you accelerate your research.
00:42
Of course, I'll reiterate what it is. Betty's research engine is a piece of software that you can use to search for research software directly and link it to corresponding metadata. It implements what we call cascading search.
01:02
We initiate the search by searching for research software where it's stored directly, for example, on platforms like GitHub. Then what we do is we look into that software repository and we try to get hold of all the metadata we can,
01:21
and especially references. And when we find references, we try to look these references up on other platforms. So what we essentially do is we try to get all the metadata about a software that we can and put it all back together to provide you with a nice search result.
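To illustrate the cascading search described here, below is a minimal sketch in TypeScript. The function names and the Candidate shape are illustrative assumptions, not BETTY's actual code or API.

```ts
// A minimal sketch of the cascading-search idea: search a source platform,
// harvest metadata and references from each repository, then follow the
// references to other platforms and merge the results. All names are assumed.

interface Candidate {
  repoUrl: string;
  metadata: Record<string, unknown>;
  references: string[]; // DOIs or URLs found in the repository
}

// Step 1: find candidate repositories on a source platform such as GitHub.
async function searchRepositories(term: string): Promise<string[]> {
  return []; // placeholder for a platform-specific search
}

// Step 2: pull whatever metadata and references the repository exposes
// (README, citation files, badges, ...).
async function harvestRepository(repoUrl: string): Promise<Candidate> {
  return { repoUrl, metadata: {}, references: [] }; // placeholder
}

// Step 3: look a reference up on the platform it points to
// (e.g. DataCite or OpenCitations for a DOI).
async function resolveReference(ref: string): Promise<Record<string, unknown>> {
  return {}; // placeholder
}

// The cascade: search, harvest, fan out over references, merge per repository.
async function cascadingSearch(term: string): Promise<Candidate[]> {
  const results: Candidate[] = [];
  for (const repoUrl of await searchRepositories(term)) {
    const candidate = await harvestRepository(repoUrl);
    for (const ref of candidate.references) {
      Object.assign(candidate.metadata, await resolveReference(ref));
    }
    results.push(candidate);
  }
  return results;
}
```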
01:40
All right, but what does that mean in practice? Let's say you're a researcher who is interested in research software in the domain of seismology, and you want to find research software that is
02:01
related to that topic. And you start off your search on a platform that promises to yield the best results, which of course is Google Scholar. So you type in seismology software and you see there are a couple of nice examples
02:21
right here up front. So we have here the first example, which is cited about 1500 times. We look it up. And the first thing we see is that the article is hidden behind a paywall. We see that there is an abstract that we cannot fully view, unfortunately,
02:41
because we do not have the rights to access this content. It is a software package, so it fits our description, but it's from 2001. So it's not that relevant for us. Let's look at the next result: ObsPy, a Python toolbox.
03:03
This seems great, but oh, it's in the same journal. So we do not have access to this. And we do not have access to the full abstract, but we really want it. So let's say we're willing to invest the money to buy this article. And it just so happens that I have this article right here.
03:25
And if we take a look at it, if you're this hypothetical researcher that I'm describing, you're in for a hypothetical surprise, because what you find when you look through this article is that it has no reference to the software itself.
03:42
It has references to things it is similar to, it has references to packages it uses. But if you look at the references, they are mostly references to literature.
04:00
So yeah, that's sort of a bummer, right? Because we bought this article, but there's no actual reference to the software itself. But maybe that gives us a chance to reflect because what we did is we were searching for research software, but we started off our search by looking for literature.
04:22
So maybe we should do something different. Maybe we should go to GitHub, right? And search for seismology there. And because we're advanced users and we're mostly interested in research software, it would be reasonable to assume
04:45
that research software is software where there is a DOI string in the README. And GitHub actually allows us to search specifically for that by decorating our search query. So what we type in here is doi in:readme.
05:02
That is GitHub language for: give me all the repositories that have anything to do with seismology and that have a DOI string in their README. If we search for that, we get 427 results. Well, what are we supposed to do with that? We cannot go through them one by one.
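The same query can be reproduced against GitHub's REST search API. The endpoint and the in:readme qualifier are real GitHub features; the surrounding code is a minimal sketch with no serious error handling.

```ts
// Search GitHub for seismology repositories whose README mentions "doi",
// mirroring the query from the demo. Requires a personal access token.
async function searchSeismologyRepos(token: string): Promise<void> {
  const query = encodeURIComponent("seismology doi in:readme");
  const res = await fetch(
    `https://api.github.com/search/repositories?q=${query}&sort=stars&order=desc`,
    {
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${token}`,
      },
    }
  );
  if (!res.ok) throw new Error(`GitHub search failed: ${res.status}`);
  const data = await res.json();
  console.log(`total matches: ${data.total_count}`); // 427 at the time of the talk
  for (const repo of data.items) {
    console.log(repo.full_name, repo.stargazers_count); // "most stars" ordering
  }
}
```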
05:21
We just don't have the time for that. Can we sort them by anything? Well, we can sort them by best match, whatever that means. We can sort them by most stars. If we do that, we quickly find out that most stars doesn't mean relevance in research. Most stars means relevance on GitHub,
05:42
and this is not exactly what we want. So how do we find research software in the domain of seismology? Well, it's a good thing that we all went to this Betty community cluster meeting and found out about Betty's research engine. They said something about a cascading search.
06:02
Let's see how that works out in practice. So what we do is we go to Betty's research engine. We type in seismology. It already gives us a hint on how to construct a search query. So we type in the same search query, doi in:readme.
06:20
And as we start the search process, we can see how it comes to life. It initiated the search on GitHub. It found 427 repositories. It's loading them. And as you can see, there are different platforms that we are cross-referencing. And we can sort them by, for example, total citations.
06:41
And the first thing that comes up, lo and behold, is a repository that has something to do with ObsPy. Let's take a look. Well, it has 72 contributors. That's nice. It's probably not the original software repository,
07:01
but it has its own tutorial, which is cool. It also has a link to a video on YouTube, which is a one-hour description of what that Python framework is and what it does. And we also found seven publications
07:21
that are correlated to it, which show up here. Right, let me just take a look at my notes. Right, so what's cool about this is that we search for seismology in Betty's research engine
07:40
and we get this nice table of software repositories and the number of citations we found on different platforms. We do not have to wait for this search to finish, because I already have the results here. What we can do is sort them by total citations,
08:01
which gives us an accumulated result of all the citations that we found on the different platforms. We can also sort according to specific platforms, like OpenCitations. Okay, this one doesn't seem to have any OpenCitations results at all,
08:23
but we can sort them according to DataCite. And what we can also do, if we take a closer look, is look at the README that we pulled from GitHub. We can look at the details, which sometimes give us additional information about the repository.
08:41
Like here, we got an abstract from Zenodo. We can show the publications that we found for it. We have a link to the GitHub repository, and there we have a new feature that we're very excited to talk about: the submit-to-the-ORKG feature. I'll get to that shortly.
09:02
We can export the search results by clicking the download button next to the repository name. We have, so far, three download formats that you can choose from: JSON, CSV, or BibTeX. You can also download all the results
09:20
by clicking the download button next to the progress bars. Let's do that. Let's download the JSON and open it. And what you get if you open the JSON file is, well, all the metadata that we collected for the software repositories, put into one data model
09:41
that we defined ourselves and that has all the important information, like the name, the URL, and a list of publications. This one does not have any publications; you would need to find a repository with publications, but then you would get a list of publications as well.
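As a sketch of what post-processing such an export might look like: the field names below are guesses based on what the demo shows (name, URL, and a list of publications with DOI, source, and author names), not BETTY's exact schema.

```ts
// Read a hypothetical BETTY JSON export and filter it. Field names are assumed.
import { readFileSync } from "node:fs";

interface Publication {
  doi?: string;
  name?: string;
  source?: string;
  authorNames?: string[];
  url?: string;
}

interface RepositoryResult {
  name: string;
  url: string;
  publications: Publication[];
}

const results: RepositoryResult[] = JSON.parse(
  readFileSync("betty-export.json", "utf8")
);

// Example post-processing step: keep only repositories with publications.
const cited = results.filter((r) => r.publications.length > 0);
console.log(`${cited.length} of ${results.length} repositories have publications`);
```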
10:02
So you can export the search results as a JSON file and then do your own post-processing with it. All right, so this was the export function. Let's take a look at how we can export this
10:21
to the ORKG. I don't know if you're familiar with the ORKG. The ORKG is the Open Research Knowledge Graph, which is also an NFDI initiative, by NFDI4DataScience. It is, I mean, it is what it basically says in the title: it is a knowledge graph.
10:41
And we can just check whether that repository is already in the knowledge graph by quickly looking it up. And we see it is not yet in the Open Research Knowledge Graph. So what we can do is submit it to the ORKG.
11:01
And this is sort of a user-in-the-loop kind of function, because the first thing it shows you is our entire data model. This is all the data that we were able to collect for this repository from GitHub, but also all the additional information,
11:20
like the publications. Previously I was looking for a repository with a list of publications; this one does have publications. For a publication, we have a DOI, a name, a source, author names, and a URL, if there were any. And what you can do is go through this,
11:41
you can inspect this, you can edit this to your liking. And if you agree with what it says right here, then you can go on to the next step. And what it does now is look up on the ORKG whether the software repository is already an instance in the ORKG.
12:04
And if it is, then it will grab the data from the ORKG so as to avoid any redundancies. But as this repository was not in the ORKG, we shouldn't see any conflicts here. And what we're trying to do here is
12:23
map our data model, which of course is different from the data model that is stored in the ORKG: we try to map the data that we were able to collect onto the ORKG data model. And we can inspect this here and check whether we agree with it.
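Conceptually, this mapping step turns the internal data model into property/value statements for the knowledge graph. The sketch below is purely illustrative; the property labels are hypothetical, not ORKG's actual vocabulary.

```ts
// Transform a repository result into flat property/value statements,
// roughly the shape a knowledge-graph submission needs. Names are assumed.

interface Publication { doi?: string; name?: string; }
interface RepositoryResult { name: string; url: string; publications: Publication[]; }
interface Statement { property: string; value: string; }

function toGraphStatements(repo: RepositoryResult): Statement[] {
  const statements: Statement[] = [
    { property: "name", value: repo.name },
    { property: "url", value: repo.url },
  ];
  for (const pub of repo.publications) {
    // One statement per citing publication, keyed by its DOI if present.
    if (pub.doi) statements.push({ property: "cited by", value: pub.doi });
  }
  return statements;
}
```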
12:44
We can also add information if we'd like to. For that, we can select a property; these are all the properties from the ORKG that we can choose from. And of course, we have here the papers that cite this software. What we can do now is click on Submit.
13:01
For that, we have to log into the ORKG. Fortunately, I have an ORKG account and my data is already stored here. And after I log in, I see a pop-up notification that says success: the repository has been submitted to the ORKG, and it already provides me with an ORKG link.
13:20
And if I click on it, what I see is that the ObsPy repository was added to the ORKG with all the information that we gathered. And what you can also see here is that the papers that cited this repository were also put into the ORKG.
13:43
So we created additional instances that reference this software repository. All right, what else is interesting to show? After the search is conducted, you can sort by citations,
14:02
but you can also sort according to other preferences. For example, on the right we have a languages column, and here you can filter by the languages that you would like to show. For example, if you're interested
14:22
only in repositories that are written in, let's say, C#, you can click on it and you will get the one repository that has C# as the language specified on GitHub. And yeah, I guess we're getting to the end now.
14:43
About Betty's research engine in general: what you will find if you open the landing page is a link to our paper, in which we describe everything there is to know from a technical perspective,
15:01
because you have to imagine, I mean, I'm talking about this and it maybe sounds a little bit easy, but it is a fairly complex search process. We start off by looking for repositories stored on, for example, GitHub, right? We collect all the metadata we can get, especially the references. And then we take the references
15:22
and search on other platforms, and you can see in the connection status which other platforms those are: GitHub; githubusercontent, which is an additional service that we use; Zenodo; OpenAlex; OpenCitations; and DataCite. You have to imagine that each of these services
15:41
has its own rate limits, primary rate limits; GitHub even has secondary rate limits. So this is one of the technical difficulties that occur when you use this service. And this is exactly the reason why we require you
16:00
to provide your own GitHub access token when you use Betty's research engine. The way it functions is a bit counter-intuitive: you do not have a central service that you communicate with. Instead, when you go to the URL that we provided, which is hosted by the Rechenzentrum at TU Clausthal,
16:23
we send you all the code that you require to conduct this search as the answer to your request. So you have the entire code essentially living in your browser, and from your browser,
16:40
so from your local network, you do all the requests. The research engine watches out for rate limits itself, and it does error handling, but you have to keep in mind that these are still your rate limits, right?
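This browser-side design means every API call counts against the user's own quota. Below is a minimal sketch of what a token-authenticated, rate-limit-aware request to GitHub can look like; the headers are GitHub's documented rate-limit headers, while the retry logic is illustrative, not BETTY's actual error handling.

```ts
// Fetch a GitHub API URL with the caller's own token, waiting out the
// primary rate limit if the quota is exhausted.
async function githubGet(url: string, token: string): Promise<unknown> {
  const res = await fetch(url, {
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
    },
  });

  // GitHub reports the remaining quota and the reset time in headers.
  const remaining = Number(res.headers.get("x-ratelimit-remaining"));
  if (res.status === 403 && remaining === 0) {
    const resetAt = Number(res.headers.get("x-ratelimit-reset")) * 1000;
    const waitMs = Math.max(0, resetAt - Date.now());
    await new Promise((resolve) => setTimeout(resolve, waitMs));
    return githubGet(url, token); // retry after the window resets
  }

  if (!res.ok) throw new Error(`GitHub request failed: ${res.status}`);
  return res.json();
}
```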
17:03
And we have a link to our paper on ing.grid. We have been trying to publish our paper there for about a year now; it turns out it's more complicated than we initially thought it would be. We have a link to our GitLab. The entire code is, of course, public.
17:20
You can look at it, you can edit it. If you'd like to add or change anything about it, you can. As you can see, this repository is actively maintained. And of course, you can open merge requests. If you go to "how does this work", you'll see a video that I once made
17:41
that explains, or reiterates, the search process and shows you what you need to do to get started with Betty's research engine. And we also have an instance on Docker Hub,
18:01
or rather, a Docker container stored on Docker Hub, which you can download and then use yourself. And here you can provide the GitHub token. The GitHub token is the one thing that you need to get started. Additionally, you can also provide Zenodo tokens
18:20
and OpenCitations tokens. These will give you additional freedom when it comes to rate limits, although they are not necessarily required. All right, and I think this is about, yeah, the most important stuff. Oh, maybe one thing to mention about reflection,
18:42
because I'm presenting this, I'm selling this right now, of course, but if you were to reflect on this service, there are a lot of technical difficulties. And there's this one difficulty that we haven't solved yet, and that is reliably finding references. So this is the one obstacle that we have to get over
19:02
to make this a super good service. And you can read about it in our paper. We stumbled upon the main difficulty when we did an evaluation. For the paper, we did an evaluation
19:22
of 400 software repositories. We went through them by hand, which was a lot of work, let me tell you. And what we found when we went through those 400 software repositories was that we only identified 35% of those references correctly.
19:42
So there are a lot of repositories where we didn't find the preferred reference, as we call it. And one possible explanation for that, which you can see here, is that only about 20% of those repositories
20:05
actually made use of structured text elements. And what we have with our cascading search is a very rule-based approach: we look for structured text elements in the README file, which works fine for structured text elements. But as it turns out,
20:21
the vast majority of preferred references is actually written in plain text. So you have this legacy software where some researcher put their repository on GitHub and said: if you like this software, then please cite us. Well, this sort of reference is hard to capture
20:41
with a rule-based approach. What we're actually working on right now is a data-driven approach to this problem. There's a subdomain of natural language processing called Named Entity Recognition, and this is how we're trying to introduce AI
21:01
to Betty's research engine. All right, thank you.
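The gap between structured and plain-text references can be made concrete with a small sketch. A rule-based pass over a README finds literal DOI strings (the pattern below follows the commonly used Crossref-style DOI regex), while a plain-text "please cite us" sentence contains nothing such a rule can latch onto, which is where Named Entity Recognition comes in. The example strings are made up for illustration.

```ts
// Rule-based DOI extraction from README text, with deduplication via a Set.
const DOI_PATTERN = /10\.\d{4,9}\/[-._;()\/:a-z0-9]+/gi;

function findDois(readme: string): string[] {
  return [...new Set(readme.match(DOI_PATTERN) ?? [])];
}

// Structured reference (e.g. a Zenodo badge): the DOI is found.
console.log(
  findDois("[![DOI](https://zenodo.org/badge/1.svg)](https://doi.org/10.5281/zenodo.123456)")
);

// Plain-text reference: nothing to extract with a rule-based approach.
console.log(findDois("If you like this software, please cite Smith et al. (2019)."));
```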
22:08
I mean, we have the insight from this evaluation that we did in the paper, which was from the domain of ecology, so not the engineering sciences. But as of now, we don't have any statistics
22:23
that are so general that we can say, okay, this applies to all research software in general or in the engineering sciences. However, my personal perception,
22:41
and I think you will share it if you do this on your own, is that it is in fact the case that most references that are made in GitHub or on GitHub repositories, these are not made with structured text elements or with services like Zenodo.
23:01
These are just plain-text references. But this is just a personal perception. It does not have any statistical proof behind it.
23:37
Oh, I mean, if you wanna increase visibility,
23:40
you can, because there are a lot of services that provide you with the opportunity to do that, like Zenodo. If you wanna do that, you can do that. The problem is that the vast majority of research software does not do that. And I think the right approach to this huge amount of legacy software
24:01
and even new software that doesn't do that is not to enforce any standards or enforce any technological solutions on them. I know this is kind of what NFDI tries to do in some cases, so I'm very careful about going onto that ice.
24:21
But another approach would be to create technological solutions like Betty's research engine that try to find this legacy software. So, go the other way around.
24:45
Yeah, exactly.
25:23
Imagine Betty's research engine were a central service that you communicate with. We would very quickly run into bottlenecks, because each of the services that we talk to, like GitHub or Zenodo or OpenCitations, has its own rate limits. And not only primary rate limits,
25:40
but also secondary rate limits that are connected to an IP address. So the workaround that we came up with is that when you go to the URL that we provided for Betty's research engine, we send you all the code that you need to perform the search on your local machine.
26:01
So it's basically your code, it's in your browser, and you're doing everything yourself, which means that if you want to talk to GitHub, you need to do that by talking to the GitHub API yourself. And the GitHub API requires you to provide a valid GitHub token,
26:21
and this is why you need your own GitHub token. It's a bit counterintuitive, yeah, but we don't see any other solution to this, because once you have some sort of centralized service,
26:45
it is not technically possible to implement this cascading search. As far as we see it, I don't know if anyone has another solution; please let us know, because this was one of the major obstacles that we faced, yeah.
27:48
Yes, absolutely. There are so many difficulties that we faced and that we're still facing. Like, for example, the one thing that we haven't solved yet is reliably identifying the preferred reference.
28:03
This is like the one big obstacle that we haven't gotten past yet. It would improve Betty's research engine drastically if we could reliably find preferred references that are made in plain text. Besides that, Betty's research engine lets you conduct a search
28:20
and then export the search results, so you can create data sets with Betty's research engine really fast and really easily. Besides that, there is the architectural perspective. Once you get your hands dirty and go into the code,
28:41
you'll see that it has room for improvement. But we made our code public and we invite everybody to contribute to that. So yeah, I guess the research possibilities that could be connected to Betty's research engine are practically endless.
29:09
What do you mean by conversation?
29:29
No, it doesn't have the capabilities of a large language model. However, you can export the search results and alter them to your liking.
29:41
And now that you say conversation: when you use the export-to-ORKG feature, it shows you the data model that we came up with, which you can alter and then proceed to publish to the ORKG. So that is a user-in-the-loop kind of functionality,
30:03
which I guess you could call a conversation on the highest level. Thank you.