Semantic Scholar
Formal Metadata
Number of Parts: 3
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/56044 (DOI)
Transcript: English (auto-generated)
00:01
Hello, my name is Alex Wade, and I am joining you today from the Allen Institute for Artificial Intelligence. I'd like to take this opportunity to tell you a little bit about Semantic Scholar and a subset of the AI and ML use cases that we employ in building that tool.
00:20
The Allen Institute for AI is based in Seattle, Washington. We were founded by the late Paul Allen as a core AI research institution, but we also operate very much as a product team in order to fulfill our mission of AI for the common good. We want to do not only the core research; in many cases, we also look for services
00:44
that we can develop that apply that research to solve very specific problems. Semantic Scholar was built around the observation that we face an increasing number of scientific research publications every year. By the latest count, there are between five and six million new publications coming
01:04
out every year, and this number is growing year over year. The ramification for the research community is that there are increasingly more publications than anybody has time to read, even in very narrow fields. So the value proposition that we hope AI can provide is
01:24
tooling that makes it easier to discover, to decide on, and then really to understand the scientific research that is going on. That is our mission with Semantic Scholar. As a combined research team and product team, our research focuses mostly
01:41
on natural language processing and, in some cases, on science-of-science investigations over the data products that we build. That work then manifests itself in a very large number of algorithms and components in our pipeline that rely on AI and ML approaches.
02:00
I've listed just a few of them on this slide. For the sake of brevity, I'm going to focus on a few representative examples. The first example: we're dealing with a very large citation graph, a very large heterogeneous graph of papers, people, affiliations, and venues.
02:21
In many cases, the set of relationships between papers can be very informative about which papers are more influential, which areas are growing faster, et cetera. One of the areas we've been working on is classifying citations. This is the paper page for a preprint that we wrote about the CORD-19 Open Research Dataset.
02:45
If we navigate down on this page, we can not simply enumerate the set of papers citing this paper, but classify them in a couple of ways. You can see here that we have a dropdown where we can filter by citation type.
03:03
Effectively, all this is doing is looking at the section of the citing paper in which the mention of this paper occurs. So if, for example, I am interested in papers that haven't simply mentioned us casually in the background section, but discuss us in the methods section, I can filter the set of citing documents down to just those methods citations.
03:26
In addition to that, by looking at a number of different features, such as where in the paper this paper is being cited, how often it's being cited, and what the actual language is in the span where it's cited, we can classify citations into
03:44
those that are highly influential. In other words, the citing paper was building upon this paper, not simply mentioning it. And you can see here that we have this flag and the ability to filter down to just the highly influential citations.
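The talk doesn't spell out the model behind this feature, but as a rough sketch of the general technique, a citation-intent classifier can be trained on the citing sentence plus the section it appears in. Here is a minimal, self-contained illustration in Python with scikit-learn; the tiny training set, the labels, and the "[SEP]" convention are all invented for the example:

# Minimal sketch of citation-intent classification: predict a citation's
# intent from the citing sentence and the section it appears in.
# The tiny training set below is invented purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each example is "<section> [SEP] <citing sentence>". A production system
# would use richer features: citation position, frequency, full context span.
train_texts = [
    "introduction [SEP] Prior work on citation graphs includes [CITE].",
    "background [SEP] Several surveys discuss this area [CITE].",
    "method [SEP] We build directly on the dataset released by [CITE].",
    "method [SEP] Following [CITE], we fine-tune a transformer model.",
]
train_labels = ["background", "background", "methods", "methods"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["method [SEP] We reuse the corpus introduced by [CITE]."]))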
04:01
We can also surface some of the information from the citing paper via this excerpt section. Here we lift out the surrounding citation sentences from the citing document to see what it has said about this paper. So this is one tool that we provide to authors, helping them zero in if they're
04:22
looking for related research, or if they want to see how the research is being used downstream. A second example is what we call extreme summaries, or, affectionately, TLDRs internally. As scholars need to look at more and more papers, either skimming a set of search results
04:43
or, as here, scanning a list of citations to this particular paper, abstracts, while intended to be a summary of the paper, can still be quite lengthy and can contain some fairly technical jargon. So it's a bit of a commitment to go through and read a large number of abstracts.
05:04
So we've developed a language model that has been trained on the scientific literature and will take input like this abstract here, which is four or five sentences, and distill it down into a single-sentence summary.
05:21
Working with our users, we found that this provides a much more effective means for them to skim down a list of papers and decide whether they want to drill in and see more about a paper.
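The talk doesn't name the summarization checkpoint; as a sketch of the overall shape of such a tool, here is how an off-the-shelf abstractive summarizer can be called with the Hugging Face transformers library. The model named below is a generic stand-in, not the scientific TLDR model described in the talk:

# Sketch: distill an abstract into a one-sentence summary with an
# off-the-shelf abstractive summarizer. "facebook/bart-large-cnn" is a
# generic stand-in; the real system uses a model trained on scientific text.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = (
    "The COVID-19 Open Research Dataset (CORD-19) is a growing resource of "
    "scientific papers on COVID-19 and related historical coronavirus "
    "research. It is designed to facilitate the development of text mining "
    "and information retrieval systems over its rich collection of metadata "
    "and structured full text papers."
)

# max_length / min_length are in tokens; kept small to force one short sentence.
tldr = summarizer(abstract, max_length=30, min_length=5, do_sample=False)
print(tldr[0]["summary_text"])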
05:41
Finally, in the area of document understanding, we have done a number of things to build what we hope are more effective recommendations for our users. In this case, we have taken the papers and built document-level embeddings for each one, which allows us to look at document similarity: which documents are similar to which other documents.
06:02
The embeddings model also takes the citation graph into account, capturing where there are strong citation links between documents or clusters of documents. In doing so, we're able to make some really good recommendations to users.
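The talk doesn't name the embedding model here; AI2 has publicly released SPECTER, a citation-informed document embedder, so a sketch along those lines might look like the following (the paper titles and abstracts are invented for illustration):

# Sketch: document-level embeddings from title + abstract, in the style of
# AI2's publicly released SPECTER model, then cosine similarity between papers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [
    {"title": "CORD-19: The COVID-19 Open Research Dataset",
     "abstract": "A resource of scientific papers about COVID-19."},
    {"title": "Construction of a large academic citation graph",
     "abstract": "We describe a heterogeneous graph of papers and authors."},
]

# SPECTER's documented input format is "title [SEP] abstract".
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True,
                   return_tensors="pt", max_length=512)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state[:, 0, :]  # [CLS] vector

print(torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0))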
06:26
In this particular case, I have a folder in my library where I save papers that I am interested in, that I may want to cite, or that I want to read. Here I've created a folder that I've called Academic Graphs, and at the moment I've put about 10 papers into it. As I discover new papers, I add more to the folder. We then use these document-level embeddings to
06:45
look at those 10 papers in the folder and find newly published papers that are similar to the papers I already have in there. So on a daily basis, I get a number of new
07:00
items in my list here that I can scan down. And we do a couple of things to help make these lists better over time. If I find a document that I like, I can add it to my folder; the next time the recommendations are run, that paper will be taken into account in the recommendations they give me. But occasionally, I will get an
07:22
item in my feed that isn't really something I want to read. By making relevance judgments on these, by giving this one a downvote, I can mark it as not relevant. Then, in addition to the positive examples in my folder, saying these are the sorts of things I want to see, we can also supplement that with some negative
07:41
examples of things that I don't want to see: papers like this one. Those two work hand in hand in producing the next round of recommendations tomorrow, when we will suggest more papers to this user.
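Here is a minimal sketch of how saved papers (positive examples) and downvoted papers (negative examples) might be combined to rank candidates, assuming document embeddings like the ones above. Random vectors stand in for real embeddings, and the scoring rule is an illustrative guess, not the production algorithm:

# Sketch: rank candidate papers by similarity to saved papers (positives),
# penalized by similarity to downvoted papers (negatives). Random vectors
# stand in for real document embeddings.
import numpy as np

rng = np.random.default_rng(0)
folder = rng.normal(size=(10, 768))       # embeddings of saved papers
downvoted = rng.normal(size=(2, 768))     # embeddings of downvoted papers
candidates = rng.normal(size=(100, 768))  # embeddings of new papers

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

folder, downvoted, candidates = map(normalize, (folder, downvoted, candidates))

# Score: closest positive neighbor minus closest negative neighbor.
pos = (candidates @ folder.T).max(axis=1)
neg = (candidates @ downvoted.T).max(axis=1)
ranking = np.argsort(-(pos - neg))

print("top 5 candidates:", ranking[:5])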
08:02
So those are three examples of how we're using AI and ML in the product right now. I just want to call out two last things, which are resources that are available for folks to use today. We believe very strongly in open science. We try to make most, if not all, of our algorithms openly available on GitHub for reuse. And where we can, we also release the datasets that we used to train and evaluate those algorithms. These start to form nice bundles of information
08:25
for people who either want to use this in their own products, or want to benchmark their approaches against ours. One example is that the entire academic graph, over 200 million papers and over 50 million authors, is available both via an open API and as monthly snapshots.
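For instance, a paper's record can be fetched from the Academic Graph API with a plain HTTP request. A small sketch in Python; the fields requested here are just a few of the many the API exposes (see api.semanticscholar.org for the full reference):

# Sketch: fetch one paper from the Semantic Scholar Academic Graph API.
# Unauthenticated requests work at modest rate limits.
import requests

paper_id = "arXiv:2004.10706"  # the CORD-19 dataset preprint
url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"
resp = requests.get(url, params={"fields": "title,year,citationCount"})
resp.raise_for_status()

paper = resp.json()
print(paper["title"], paper["year"], paper["citationCount"])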
08:46
We have a number of different research groups and products using this API to bring data and features into their own products. Related to that is a separate corpus of information that we've made available for the NLP community. This is a smaller subset
09:04
of those papers: there are about 136 million papers in this graph, but most notably it also includes about 12 million full-text publications that we have converted from PDF into a more machine-readable JSON format, which allows people to do large-scale NLP projects over a wide range of disciplines in the literature.
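Because the full text is plain JSON, large-scale processing needs nothing more than a JSON-lines reader. A sketch, assuming an S2ORC-style schema in which each record carries a list of body-text paragraphs with their section names; the file name and exact field names are assumptions to adapt to the release you download:

# Sketch: iterate over full-text papers in a JSON-lines file.
# "papers.jsonl" and the field names are assumptions based on the
# S2ORC-style schema; adjust to the exact release you download.
import json

with open("papers.jsonl", encoding="utf-8") as f:
    for line in f:
        paper = json.loads(line)
        for para in paper.get("body_text", []):
            section = para.get("section", "")
            text = para["text"]
            if "citation" in text.lower():
                print(f'{paper["paper_id"]} [{section}]: {text[:80]}...')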
09:25
So I invite folks to check out these resources if you are interested. And finally, we're starting to look at a number of ways we can apply AI approaches to problems of accessibility, whether that's making
09:43
academic papers more amenable to screen readers, or providing affordances in the reading experience that help a user understand a paper, especially when you're reading something outside your own domain and may not know all of the terms of art. So we're applying new approaches to the reading experience, and I invite you to watch this space.
10:04
And with that, I thank you for your time.