DISCOVERY - BIOfid: Accessing legacy literature the semantic (search) way
Formal Metadata
Number of Parts: 14
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/60267 (DOI)
Transcript: English (auto-generated)
00:00
Adrian is a software and data engineer and aims to provide easier access to historical biodiversity literature. The stage is yours, Adrian. Thank you. So, welcome. I'm Adrian Pachtheit. I'm working for the BIOfid project. We are especially interested in legacy biodiversity
00:27
literature and how we can approach it in a semantic way, so users can search documents semantically. A few words about the BIOfid project: we have been funded by the DFG since 2017. Our special focus
00:43
is text annotation and text mining, specifically of German texts. Our texts currently focus on vascular plants, butterflies, and birds. So if you are searching, for example, for raccoons in the portal, you won't find them, because the portal does not know them. We are providing several tools
01:07
for the community: text annotation, bio-ontologies, and the thing that I will talk about here, semantic search on documents. So, let's jump into it.
01:21
When you query the BIOfid portal, you get an output at the top that first repeats your query and also shows you what the portal has recognized. In this case, it recognized Taxus baccata, which is a plant, and it realized that Germany was mentioned.
01:48
Both were recognized correctly. What the portal does now is resolve the names. So it looks up whether, next to Taxus baccata, there are also subspecies of Taxus baccata available in the database.
02:04
The same thing is applied for Germany. So Germany is not only taken as a term; the portal will also look up, for example, Bavaria, Berlin, Brandenburg, any place that is in Germany and occurs in the texts. When you hit one of the page hit buttons, you get a short preview of all
02:29
the hits within the document. The terms in bold reference back to your search query. The terms surrounded by green bubbles are locations.
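To make the name resolution step above more concrete, here is a minimal sketch of such a query expansion in Python. The small hierarchy, the function, and the subspecies placeholder are invented for illustration; the portal resolves names against its ontologies and gazetteer data, not a hard-coded dictionary.

```python
# Minimal sketch of query expansion: a recognized term is resolved to itself
# plus all narrower concepts (subspecies, contained places) known to the system.
NARROWER = {
    # placeholder children; the real portal gets these from its ontologies
    "Taxus baccata": ["Taxus baccata (some subspecies)"],
    "Germany": ["Bavaria", "Berlin", "Brandenburg"],
}

def expand(term: str) -> list[str]:
    """Return the term together with every narrower concept, recursively."""
    result = [term]
    for child in NARROWER.get(term, []):
        result.extend(expand(child))
    return result

if __name__ == "__main__":
    for query_term in ("Taxus baccata", "Germany"):
        print(query_term, "->", expand(query_term))
```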
02:43
All the bold ones are in Germany. You also see Taxus baccata here at the bottom, surrounded in purple. So, how does the portal know what is said in the documents, and
03:04
how do we get the data out? First, I will go through the pre-processing, and then I will show you how we query the data. We currently rely mainly on our own BIOfid corpus,
03:25
which is quite small, but we are expanding it with texts from the Biodiversity Heritage Library and from ZOBODAT. At least from BHL, we get a lot of public domain texts, which
03:42
is awesome. We download these texts automatically and throw them into our text annotation pipeline, which does some amazing stuff with machine learning; I won't go into the details here. Most importantly, it will figure out the part of speech: is it a noun? Is it an adjective?
04:05
It will sort out the dependencies: how are the words related to each other within a sentence? I will come to this later. And of course, the most important thing for us biologists is the named entities: is this word a taxon? Is this word a location? Is this a person? Is it
04:24
something else? The named entity recognition hopefully will show us. And finally, from the text annotation I get a very crude XML format that I have to transform into appropriately annotated
04:41
texts. I enrich this with metadata for the documents and also extract data from all these annotations. I generate metadata ontologies; these ontologies help me to query
05:01
the data faster and more efficiently. All these data are stored within three databases that have to be perfectly in sync; otherwise, nothing will work. I will go through these three databases in detail: a key-value database, just termed index database here,
05:22
a triple store, and a document database. I will now show you how these databases are related to each other and how they can, more or less, talk to each other. The index database is basically just, as I said, a key-value database. You give it the
05:42
string plants, and it will return the ID 6. You give it Taxus baccata, and it will return the number 528-something. In the triple store, the ontology is stored.
06:03
The ontology knows, for example, that Taxus baccata is a plant. If I query plants, I also get Taxus baccata as part of the results. It also knows, of course, the ID of Taxus baccata. It knows that Taxus baccata has yellow flowers. It knows that the trivial
06:25
name of Taxus baccata is European yew. You see immediately that the IDs in the index and in the triple store are the same. Great, so we have already connected them.
06:41
Now, how do we connect this knowledge back to the documents? This is done quite easily, because we have annotated texts. Every word is surrounded by some XML tags. We now know here that Taxus baccata is a plant, and we know exactly which ID Taxus baccata has.
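As an illustration of how the three stores hang together, here is a minimal sketch with plain Python data structures. The IDs, predicates, and the XML snippet are simplified stand-ins for the real key-value index, triple store, and document database described here.

```python
# Index database: string -> ID (a plain key-value lookup).
index_db = {
    "plants": "6",
    "Taxus baccata": "528",   # simplified stand-in for the real identifier
}

# Triple store: facts about IDs as (subject, predicate, object).
triples = [
    ("528", "is_a", "6"),                        # Taxus baccata is a plant
    ("528", "has_flower_color", "color_yellow"),
    ("528", "trivial_name", "European yew"),
]

# Document database: annotated text where the token carries the same ID.
document = '<sentence>... <taxon id="528">Taxus baccata</taxon> grows ...</sentence>'

# Going from the string "plants" to documents: look up the ID, find all taxa
# that are plants in the triples, then check which documents carry those IDs.
plant_id = index_db["plants"]
plant_taxa = {s for s, p, o in triples if p == "is_a" and o == plant_id}
hits = [doc for doc in [document] if any(f'id="{t}"' in doc for t in plant_taxa)]
print(plant_taxa, hits)
```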
07:08
Guess what? It's the same ID that already exists in the triple store and in the index database. Every taxon carries at least one of these URIs. Most locations
07:24
only carry a Wikidata URI, which we simply reuse. Okay, now I have shown you how the data is connected within the databases. The question now is, how do we get the data out? When the user puts in a query,
07:48
for example, in this case, plants with yellow flowers, I would expect Taxus baccata to be among the results, because we just saw that Taxus baccata has yellow flowers and is a plant.
08:01
The user query is then run through natural language processing. The language processing will recognize plants as a plant, surprise, and it will also recognize that yellow is an adjective and flowers is a noun. Furthermore, it will work out how these words depend on each other.
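To illustrate the kind of analysis the query goes through, here is a small sketch using spaCy and its small English model (en_core_web_sm); this is just a generic stand-in, not the portal's own pipeline.

```python
# Sketch: part-of-speech tags and dependencies for the example query.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("plants with yellow flowers")

for token in doc:
    # e.g. "yellow" is typically an ADJ attached to "flowers" as a modifier
    print(f"{token.text:8} {token.pos_:6} --{token.dep_}--> {token.head.text}")
```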
08:23
Yellow is a dependency of flowers, and flowers is a dependency of plants, so these words are connected to each other. The index database, on the other side, will return the respective values for these strings. In this case, as we saw in the index database,
08:47
plants has the ID 6. Just for demonstration purposes here, yellow has the internal ID color_yellow and flowers has the ID flower_color. Then everything goes through a very complex
09:06
if-else rule set that forges some database and data objects, which are handed to a template engine. This template engine takes all the data and just
09:22
puts it in the right place, generating a SPARQL query that says: the taxon we are searching for has a flower color, and this flower color is specified as being color_yellow. The triple store takes the SPARQL query and returns, among others, the ID of Taxus baccata. Finally, the triple store
09:51
response is just thrown against the document database, and the document database will find it, not because it is looking for Taxus baccata, but because it has also indexed the ID of the annotation
10:07
and treats it like a word. It doesn't care whether it's a proper word or an ID. Then we get back our documents, which are presented on the portal.
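To sketch the template-engine step, here is a minimal example that fills a SPARQL template and runs it with rdflib over a toy in-memory graph; the namespace, property names, IDs, and the tiny inverted index at the end are invented for illustration and are not the portal's actual vocabulary.

```python
# Sketch: fill a SPARQL template from recognized query parts and run it
# against a toy in-memory graph. Requires: pip install rdflib
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")      # invented namespace for the sketch
g = Graph()
g.add((EX["taxon_528"], EX.isA, EX["concept_6"]))              # Taxus baccata is a plant
g.add((EX["taxon_528"], EX.hasFlowerColor, EX["color_yellow"]))

# Template with slots for the values that the NLP step and the index database produced.
SPARQL_TEMPLATE = """
SELECT ?taxon WHERE {{
    ?taxon <{is_a}> <{concept}> .
    ?taxon <{property}> <{value}> .
}}
"""
query = SPARQL_TEMPLATE.format(
    is_a=EX.isA, concept=EX["concept_6"],                   # "plants" -> ID 6
    property=EX.hasFlowerColor, value=EX["color_yellow"],   # "yellow flowers"
)

taxon_ids = [str(row.taxon) for row in g.query(query)]

# The document database then simply looks the returned IDs up like words.
inverted_index = {"http://example.org/taxon_528": ["document_42"]}
print([inverted_index.get(t, []) for t in taxon_ids])
```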
10:22
The HTML is already stored, so I mostly don't have to do much to get it presentation-ready. What are the lessons learned from all this data going back and forth? If you are building a semantic search on your own, make sure that all the NLP enrichment stages
10:47
of your user query are stored somewhere. Even if you don't use them now, you will probably use them later for an expansion. You should prefer templates over
11:00
rules. You can do natural language processing, that's totally fine. However, I would suggest using natural language processing in combination with templates, because rules are very, very hard to maintain. There is an example called Lango, a GitHub repo.
11:22
It is a very good example of how you can combine natural language processing and templates. When you are doing stuff like this, you should be ready to drown in data; you have to make sure that your application can handle large database responses.
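As a small illustration of the lesson above about storing every NLP enrichment stage of the user query, the intermediate results could be carried through the pipeline in one structure; this dataclass is only a sketch, not the actual data model.

```python
# Sketch: one object that accumulates every NLP enrichment stage of a user query,
# so later features (query expansion, debugging, relevance tuning) can reuse them.
from dataclasses import dataclass, field

@dataclass
class EnrichedQuery:
    raw: str                                                            # what the user typed
    entities: dict[str, str] = field(default_factory=dict)              # surface form -> type
    dependencies: list[tuple[str, str]] = field(default_factory=list)   # (head, dependent)
    resolved_ids: dict[str, str] = field(default_factory=dict)          # surface form -> ID

q = EnrichedQuery(raw="plants with yellow flowers")
q.entities["plants"] = "plant"
q.dependencies.append(("flowers", "yellow"))
q.resolved_ids["plants"] = "6"
print(q)
```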
11:45
And you should not underestimate the power of normalized data. I just learned that my document database knows a lot about my IDs. For example, Taxus baccata is very often associated with IDs or
12:05
URIs for terms that are within the Alps or reference the Alps themselves, or that reference southern France or some parts of eastern Europe. And guess what? That is exactly the
12:21
distribution of Taxus baccata. So the document database, without any machine learning, knows exactly which terms are related to which other terms. There are, of course, also alternative approaches. You can, for example, go with topic modeling, so you can
12:44
give a document some keywords that say: this is about plants, this is about a specific plant, or this is about a specific animal. However, I learned that these approaches, at least the ones mentioned,
13:02
do not scale that well, because they require large databases if you have a lot of documents. And some of these algorithms, specifically semantic analysis, are very hard to parameterize
13:20
for getting good results. And then, of course, the elephant in the room is machine learning. You can do this; however, you have to get training data. I didn't have training data, so I went with the approach that I showed you. Machine learning may be the better way at some point, specifically because it doesn't use as many resources, but I'm not too sure; I still have
13:51
to figure this out. One last thing I want to mention is that we are currently working on information extraction. Here, we want to make it
14:07
easier for the users to extract data from the portal. Specifically, we want to say: this taxon, Taxus baccata, was mentioned to occur at this place. We put this into
14:29
Darwin Core datasets and will provide it first in our portal and then in other biodiversity infrastructures. This has lately been discussed, at least
14:44
partly, under the term nanopublications; it goes in this direction, but it's not exactly the same. And that's it for my part. I just want to highlight our GitHub repo, where you can also find some of the tools that I showed you,
15:03
and specifically our UBLabs blog, where I go into some more detail about the semantic search.
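As a footnote to the information extraction plans mentioned above, here is a rough sketch of how such an extracted statement could be expressed with Darwin Core terms; the field names are standard Darwin Core, while the values are invented for illustration.

```python
# Sketch: an extracted "taxon X was mentioned at place Y" statement expressed
# with Darwin Core terms. Values are illustrative placeholders, not real output.
occurrence = {
    "dwc:scientificName": "Taxus baccata",
    "dwc:country": "Germany",
    "dwc:locality": "Bavaria",                  # placeholder place name
    "dwc:basisOfRecord": "MaterialCitation",    # record derived from literature
}
print(occurrence)
```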