Presenting ORKG Ask: a next-generation scholarly search and exploration system
Formal Metadata
Number of Parts: 3
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/69579 (DOI)
Transcript: English (auto-generated)
00:03
So I'll go directly to the title of the presentation and what I want to talk about today. Today I will be presenting a service called ORKG Ask. This is a service we released earlier this year as an addition to a project, or a system,
00:21
we have been working on for quite some time called the ORKG. The system aims to build, as you can see here, a next-generation scholarly search and exploration system, and indeed it uses LLMs. So what I will be talking about today:
00:41
first, I will very briefly introduce the ORKG. Then I will get into the details of the actual service I'd like to discuss. That is divided into two parts: the front-end part, so the features we have and what we offer users, and the back-end part, a bit on the technical level,
01:02
how we accomplish this. So first let's take a step back and look at the problem we're trying to address. We are working on scholarly communication, and as researchers you know the main way scholarly knowledge is communicated nowadays,
01:21
and actually has been for several hundred years: mostly in a narrative, document-based form. The files have become digital, of course, so we don't have physical papers anymore, but the rest is still remarkably similar. We write the knowledge up as a story and send it around.
01:41
In the end, journals publish it and researchers have to read it. This works quite well for humans, but it means machines have difficulty accessing the knowledge presented within the text. A machine really needs to extract that knowledge to be able to provide active support to users.
02:03
As a consequence, if you use a traditional search system such as Google Scholar, in this case with a very specific search query, I'm looking for a task, author name disambiguation, using a specific method, graph embeddings, we can see there are more than 18,000 results
02:22
for this specific query, and not all those articles are relevant, of course, even though the query is quite specific. So what we are trying to do with the ORKG is build a structured representation of the knowledge using knowledge graphs, where we not only provide the metadata,
02:41
but also the research data presented within articles. This is a very simple example here, where we describe the research problem and the approach, but this small graph can become much more complex, depending on the topic of the paper. And then finding information suddenly becomes running a query.
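To make the idea concrete, here is a toy sketch of pattern matching over subject-predicate-object triples. The predicates and paper IDs are made up for illustration; the real ORKG uses a much richer graph model and query language.

```python
# Toy triple store illustrating how "finding information becomes running
# a query". The predicates and paper IDs below are invented examples.
TRIPLES = [
    ("paper:1", "addresses", "author name disambiguation"),
    ("paper:1", "uses_method", "graph embeddings"),
    ("paper:2", "addresses", "author name disambiguation"),
    ("paper:2", "uses_method", "rule-based matching"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def papers_with(problem, method):
    # Intersect papers addressing the problem with papers using the method,
    # instead of keyword-searching thousands of loosely related articles.
    addressing = {t[0] for t in match(TRIPLES, p="addresses", o=problem)}
    using = {t[0] for t in match(TRIPLES, p="uses_method", o=method)}
    return sorted(addressing & using)
```

With such a structured representation, the Google Scholar query from the previous slide returns only the papers that actually combine that problem with that method.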
03:01
So we would only get back the results that actually address those topics. That's what we are doing with the ORKG. And with ORKG Ask, this newer service, we still support natural language documents, because we now have LLMs that are able to interpret them
03:20
and extract information from them. We then provide an interface that answers researchers' questions, also in natural text. Finally, the idea is to feed this back to the ORKG and build a knowledge graph while people are using Ask, but this is more of a future perspective, so I won't be focusing on that part today.
03:41
I will really be focusing on what we are doing with ORKG Ask and how we are building this system. Looking at our mission again: we're trying to build a system that helps researchers find and explore research articles. Here you can see what the main components
04:01
are that we're using. We're using a paper corpus with approximately 75 million items. All of those items have an abstract, and some also have the full text, in the case of open access articles. Then we use three main components to support the system. So that's semantic search.
04:21
That's then large language models and knowledge graphs. Those components work together to help users find and explore articles. This all comes together in the exploration user interface. That's the next part I will be talking about: the front end. So when you're using ORKG Ask,
04:42
this is what the search result page looks like. Here we can see the query we entered and a list of articles relevant to the query, but also answers to the specific question. Quite a lot of things are going on on this page, so I'd like to discuss some of them in more detail.
05:00
Here you can see some of the categories and features we offer, starting with the question answering part. As you saw, here we have the search query, and this question is answered in two different ways. We have the listed articles, where individual answers per article are shown,
05:22
but we also have the so-called synthesis answer on top, where the answer is synthesized based on, in this case, the top five articles, trying to come up with a general answer that also includes references to the specific articles. We also have some more general features built in,
05:42
such as a reference manager, where you can save articles to collections and then manage those collections yourself. A nice feature here is that it's also possible to import BibTeX files. So you could, for example, import a file from Zotero or Mendeley
06:00
and add the entries to an ORKG Ask collection. It's then possible to select items from your collection and ask questions about them. Currently this is an early-stage feature, but we think it could be quite powerful: in the end you can question your bibliography, try to find hidden relations,
06:23
and get inspiration from those articles when they are grouped together. Then we also offer certain filtering options. Apart from the common filters you often see, such as location, year, and language, we also offer filters based on specific research fields.
06:43
You can select fields that are in the database and filter the items by them. We also have filtering by topics. Those are topics linked to external knowledge bases, disambiguated using entity linking, and added to the system.
07:01
What that means in this specific case is that we can filter here, for example, on machine learning. It's a topic from DBpedia, and we can find all the papers that actually mention this topic in their abstracts. Then some other nice features, such as the top authors, provide ways to further explore the data.
07:23
The idea here is that we have a feature that suggests, for example, relevant reviewers or experts in the field based on a specific question. In this case, I entered a question into the system and got a list of people who might be experts
07:42
because they appear multiple times among the authors of the articles listed. This helps to get a better sense of research domains and, again, facilitates the exploration use case. It's possible to save searches as well. Ask can be used to find literature,
08:02
but it can also be used in a structured way, for example for a structured literature review. Of course, those structured literature reviews take time, so it's possible to save a search and come back to it at a later point to continue there. It's also possible to exclude items
08:21
from the search results when they're not relevant or, as I showed before, to include items from your bibliography. So it makes sense to save searches and be able to restore them later. Then briefly, some export features we have. To support the use case of exporting data
08:41
or importing it into the ORKG, we have the ORKG CSV export. We have several formats here that you can use to export the data. You can also export the table to LaTeX and include it in your research articles, of course after double-checking everything and probably manually changing certain things,
09:02
but we have quite a lot of different export formats. Another interesting feature here is custom extraction. By default, the answer to the question is displayed, as I already mentioned, but additional information is also extracted from those papers. That includes the insights,
09:23
a TL;DR or very short summary, conclusions, and so on. Users can add columns to extract specific information from those articles. That could be limitations or even numeric data like the number of participants, and as long as this is actually mentioned
09:41
in the article, it will be displayed here. Then briefly a bit about the technologies we're using, which ties into our goal of making this a sustainable service. We're trying to go beyond a research prototype and actually provide a platform that researchers
10:01
can go to when they are looking for literature. We're using the latest technologies for the front end, and for the back end I will show some of the tools we're using. This also relates to features such as responsiveness and dark mode, where you can use the tool on different screen sizes,
10:20
but importantly also with a high zoom level in the browser to support accessibility. The same goes for dark mode: it's nice for a lot of users, but for users with a disability it's sometimes required to be able to use the tool at all. So we try to think about accessibility to make sure the system is usable by everybody.
10:44
Finally, regarding reproducibility: this is a completely open system. All the components we're using are open source; the front end is open source, the back end is open source, and we try to be as open as possible. So next to every piece of content that's generated by an LLM,
11:04
we try to provide all the information people can use to reproduce those results. You can see the prompts, and we provide parameters such as the seed, but also things like the temperature and so on. Using all this data, when you run the model on your own, you can get the exact same results back.
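As a sketch of what such reproducibility metadata could look like, each generated answer can be stored alongside everything needed to rerun it. The field names here are assumptions for illustration, not the actual ORKG Ask schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical record of the information published next to each
# LLM-generated answer; the field names are illustrative only.
@dataclass(frozen=True)
class GenerationRecord:
    model: str        # which model produced the answer
    prompt: str       # the full prompt that was sent
    seed: int         # random seed, so sampling is repeatable
    temperature: float  # sampling temperature

    def to_json(self) -> str:
        # Serialized deterministically so records can be shared and compared.
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(s: str) -> "GenerationRecord":
        return GenerationRecord(**json.loads(s))
```

Anyone holding such a record and the same open-source model can replay the generation and check that the output matches.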
11:21
This is quite a unique feature that other systems do not offer. It's currently not yet available; it's work in progress that we hope to release soon. Then quickly on to the back end and the technologies we're using to make this possible. Here are some of the key components.
11:41
I will use this diagram to explain this a bit more. We start with asking a question, then we rank documents based on their relevance to the question, then we extract information from them, and then we try to answer the question. This follows a so-called RAG approach: retrieval-augmented generation.
12:01
If you are familiar with LLMs and have studied them a bit, you're definitely familiar with this term, but if not, I will explain it later. I'll start with the embeddings. What we do is create a vectorized version of all the content. That includes the abstracts and the full text,
12:20
but also the question that comes in. We do this using Nomic embeddings; here are some details for the people who are interested. We have the documents, we feed them into a model, and we get vectorized documents back. Then we have the semantic search. That's the next component,
12:40
where the search query comes in, and we rank the documents based on their relevance to this specific search query. We're using Qdrant for this. It's a vector database, and it supports filtering, which is quite important here. With these filters we are able to narrow down the search space
13:00
before we actually do the ranking. That makes it possible to first filter on dates or on authors, for example. Then, coming to the RAG approach: the workflow on top here already represents this approach, but as you know, with LLMs it's rather difficult to make sure the model gets its information
13:24
from a specific context rather than from what it was trained on. With a RAG approach, we first retrieve information, then we add this information to the prompt, the so-called augment step, and then we generate a response and display it to the user.
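The filter-retrieve-augment steps just described can be sketched as below. This is a minimal toy version: the bag-of-words vectors stand in for the real Nomic embeddings, the in-memory list stands in for Qdrant, and the final model call is left as a stub.

```python
import math
from collections import Counter

# Toy stand-in for a real embedding model (ORKG Ask uses Nomic
# embeddings; a word-count vector is used here just for illustration).
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer(question, corpus, top_n=2, min_year=None):
    # 1. Filter: narrow the search space before ranking (e.g. by year),
    #    as the vector database's filtering support allows.
    docs = [d for d in corpus if min_year is None or d["year"] >= min_year]
    # 2. Retrieve: rank documents by semantic similarity to the question.
    q_vec = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d["abstract"])),
                    reverse=True)
    context = ranked[:top_n]
    # 3. Augment: add the retrieved context and the question to the prompt.
    prompt = "Answer using only the context below.\n"
    for d in context:
        prompt += f"[{d['title']}] {d['abstract']}\n"
    prompt += f"Question: {question}"
    # 4. Generate: a real system would now send this prompt to the LLM.
    return prompt, context
```

The point of the augment step is that the model answers from the retrieved abstracts in the prompt, not from whatever it memorized during training.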
13:41
This here is what I described before. We generate embeddings and do the ranking: we rank all the articles and find the most relevant ones. For these top N articles, we take the context and add it to the prompt. Then we also add the research question to the prompt,
14:00
and then we use that together to generate and display the answer to the users. Finally, a bit about the LLMs. We use a rather small model here that works quite well for us. Here you can see the hardware, in case you're interested; interesting to see are the GPUs we're using. Since we run the model ourselves,
14:22
we need quite some GPU power, but this all works quite well for our use case. Maybe one more thing about the prompts, which you can see here again: they're rather large, because the model is small and we found that the more specific we are
14:41
in the prompt, the better our results are. Here we have the same kind of prompt for the synthesis as well. And then finally, a bit about caching responses. We try to invoke the model as little as possible, so that users don't have a long waiting time and are able to explore the system quickly.
15:03
These are the technologies I just mentioned, and the same for the back end. The system is available online, so everybody can already try it out themselves. There's a possibility to give feedback as well; we are really looking for feedback from researchers to further improve the system.
15:21
Thank you.