
DISCOVERY - BIOfid: Accessing legacy literature the semantic (search) way


Formal Metadata

Title
DISCOVERY - BIOfid: Accessing legacy literature the semantic (search) way
Title of Series
Number of Parts
14
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In BIOfid, we make full texts of legacy biodiversity literature available through a semantic search. The semantic search is capable of processing simple single queries for a species (e.g. "beeches"), but also handles restrictions for traits like "plants with red flowers" by applying Natural Language Processing combined with a rule-based generation of database queries. For this purpose, the semantic search is aware of biological systematics (e.g. beeches are plants). Subsequently, the semantic search returns all documents that contain these species. Future expansions will include the geolocated search for a species in a specific area (e.g. "beeches in the Alps"). To enable these search capabilities, the semantic search draws from a pool of both ontologies and semantically annotated full texts. Within BIOfid, these two kinds of data are intertwined to support the machine understanding of the full texts. In this talk, I will give an introduction to how literature and data are automatically harvested, processed, and prepared for being queried and presented in the BIOfid portal. Furthermore, I will give insights into how user query analysis is done in the portal, and discuss its pros and cons as well as alternative approaches (e.g. machine learning). Finally, I will give an insight into the current work in BIOfid that involves the extraction of facts from the full texts (Information Retrieval and Extraction).
Computer animation
Transcript: English (auto-generated)
Adrian is a software and data engineer and aims to provide easier access to historical biodiversity literature. The floor is yours, Adrian. Thank you. So, welcome. I'm Adrian Pachzelt. I'm working for the BIOfid project. We are especially interested in legacy biodiversity literature and how we can approach it in a semantic way, so users can search documents semantically. A few words about the BIOfid project: we have been funded by the DFG since 2017. Our special focus is text annotation and text mining, specifically of German texts. Our focus texts currently cover vascular plants, butterflies, and birds. So, if you are searching, for example, for raccoons in the portal, you won't find them, because the portal does not know them. We are providing several tools for the community: text annotation, bio-ontologies, and the thing I am going to talk about here, the semantic search on documents. So, let's jump into it.
When you query the BIOfid portal, you get an output at the top that first repeats your query and also shows you what the portal has recognized. In this case, it recognized Taxus baccata, which is a plant, and it recognized that Germany was mentioned. Both were recognized correctly. What the portal does now is resolve the names: it looks up whether, next to Taxus baccata, there are also subspecies of Taxus baccata available in the database. The same is applied to Germany. So, Germany is not only taken as a term; the portal will also look up, for example, Bavaria, Berlin, Brandenburg, and any other place that is in Germany and occurs in the texts. When you hit one of the page hit buttons, you get a short preview of all the hits within the document. The terms in bold reference back to your search query. The terms surrounded by green bubbles are locations; all the bold ones are in Germany. You also see Taxus baccata here at the bottom, surrounded in purple.
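To make the name-resolution step more concrete, here is a minimal sketch of how such a hierarchical expansion could be asked from an ontology. The URIs, the subclass relation, and the use of rdflib are my assumptions for illustration; they are not BIOfid's actual vocabulary or code.

```python
# Minimal sketch of hierarchical query expansion with rdflib; the URIs and the
# subclass triples below are invented and only illustrate the idea.
from rdflib import Graph, Namespace, RDFS

EX = Namespace("https://example.org/taxon/")
g = Graph()
g.add((EX.TaxusBaccataFastigiata, RDFS.subClassOf, EX.TaxusBaccata))
g.add((EX.TaxusBaccataRepandens, RDFS.subClassOf, EX.TaxusBaccata))

# Expand "Taxus baccata" to itself plus everything below it in the hierarchy.
results = g.query(
    """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?taxon WHERE {
        ?taxon rdfs:subClassOf* ?root .
    }
    """,
    initBindings={"root": EX.TaxusBaccata},
)
print([str(row.taxon) for row in results])
```

The same pattern would apply to places: "Germany" expands to every location recorded as lying within Germany.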
So, how does the portal know what the documents say, and how do we get the data out? First, I will go through the pre-processing, and later I will show you how we query the data. We currently rely mainly on our own BIOfid corpus, which is quite small, but we are expanding it with texts from the Biodiversity Heritage Library and from ZOBODAT. At least from BHL we get a lot of public domain texts, which is awesome. We download these texts automatically and throw them into our text annotation pipeline, which does some amazing stuff with machine learning. I won't go into the details here. Most importantly, it will figure out the part of speech: is it a noun? Is it an adjective?
It will also sort out the dependencies: how are the words related to each other within a sentence? I will come back to this later. And of course, the most important thing for us biologists is the named entities: is this word a taxon? Is it a location? Is it a person? Is it something else? The named entity recognition will hopefully show us.
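The BIOfid pipeline itself is custom, but a small spaCy sketch shows the same kinds of annotation layers (part of speech, dependencies, named entities). The model name and the example sentence are my assumptions, not project output.

```python
# Illustration only: the actual BIOfid annotation pipeline is custom. This spaCy
# sketch just shows the same annotation layers on a small German example.
import spacy

nlp = spacy.load("de_core_news_sm")  # assumes the small German model is installed
doc = nlp("Die Eibe (Taxus baccata) wächst in Bayern.")

for token in doc:
    # surface form, part of speech, dependency label, and the word it depends on
    print(token.text, token.pos_, token.dep_, token.head.text)

for ent in doc.ents:
    # generic named entities; a taxon recognizer would be an additional custom component
    print(ent.text, ent.label_)
```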
And finally, from the text annotation, I get out a very crude XML format that I have to transform into properly annotated texts. I enrich this with metadata for the documents and also extract data from all these annotations. I generate metadata ontologies that help me to query the data faster and more efficiently. All these data are stored within three databases that have to be perfectly in sync; otherwise, nothing will work. I will go through these three databases in detail: a key-value database, termed here the index database, a triple store, and a document database. I will now show you how these databases are related to each other and how they can, more or less, talk to each other. The index database is basically just, as I said, a key-value database. You give it the string "plants", and it returns you the ID 6. You give it "Taxus baccata", and it returns you the number 528-something. In the triple store, the ontology is saved. The ontology knows, for example, that Taxus baccata is a plant, so if I query plants, I also get Taxus baccata as part of the results. It also knows, of course, the ID of Taxus baccata; it knows that Taxus baccata has yellow flowers; and it knows that the trivial name of Taxus baccata is European yew. You see immediately that the IDs in the index database and in the triple store are the same. Great, so we have already connected them.
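As a toy illustration of the two stores and the shared IDs, here is a small rdflib sketch. Every URI, ID, and property name is invented and only mirrors the example above.

```python
# Toy illustration of the index database and the triple store; every URI, ID,
# and property name here is invented and only mirrors the example above.
from rdflib import Graph, Literal, Namespace, RDFS

# "Index database": a plain key-value lookup from surface string to internal ID
index_db = {
    "plants": "https://example.org/id/6",
    "Taxus baccata": "https://example.org/id/528",
}

EX = Namespace("https://example.org/id/")
g = Graph()
# "Triple store": Taxus baccata is a plant, has yellow flowers,
# and carries the trivial name "European yew".
g.add((EX["528"], RDFS.subClassOf, EX["6"]))
g.add((EX["528"], EX.hasFlowerColor, EX.colorYellow))
g.add((EX["528"], EX.trivialName, Literal("European yew", lang="en")))

# The shared ID (.../id/528) is what links the two stores together.
print(index_db["Taxus baccata"] in {str(s) for s in g.subjects()})
```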
Now, how do we connect this knowledge back to the documents? This is done quite easily, because we have annotated texts: every word is surrounded by some XML markup. Here we know that Taxus baccata is a plant, and we know exactly which ID Taxus baccata has. Guess what? It is the same ID that already exists in the triple store and in the index database. Every taxon carries at least one of these URIs. Most locations only carry a Wikidata URI, simply for reusing them.
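The real BIOfid annotation format differs, but a hypothetical snippet shows the idea: the annotated tokens carry the same URIs that the index database and the triple store use.

```python
# Hypothetical annotation markup (the real BIOfid XML differs); the point is that the
# annotated tokens carry the same URIs that the index database and triple store use.
import xml.etree.ElementTree as ET

snippet = """
<sentence>
  <token pos="NE" uri="https://example.org/id/528">Taxus</token>
  <token pos="NE" uri="https://example.org/id/528">baccata</token>
  <token pos="VVFIN">wächst</token>
  <token pos="APPR">in</token>
  <token pos="NE" uri="https://www.wikidata.org/entity/Q980">Bayern</token>
</sentence>
"""

for token in ET.fromstring(snippet):
    print(token.text, token.get("uri"))
```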
Okay, now I have shown you how the data is connected within the databases. The next question is: how do we get the data out? When the user puts in a query, for example "plants with yellow flowers", I would expect Taxus baccata to be among the results, because we just saw that Taxus baccata has yellow flowers and is a plant. The user query is then run through natural language processing. The NLP will recognize "plants" as a plant (surprise), and it will also recognize that "yellow" is an adjective and "flowers" is a noun. Furthermore, it will work out how these words depend on each other.
"Yellow" depends on "flowers", and "flowers" depends on "plants", so these words are connected to each other. The index database, on the other side, returns the respective values for these strings. In this case, as we saw in the index database, "plants" has the ID 6; just for demonstration purposes, "yellow" has the ID "color yellow" and "flowers" has the ID "flower color". This then goes through a very complex if-else rule set that forges some data objects, which are handed over to a template engine. The template engine takes all the data and puts it in the right place, generating a SPARQL query that says: the taxon we are searching for has a flower color, and this flower color is specified as being yellow. The triple store takes the SPARQL query and returns, among others, the ID of Taxus baccata.
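A minimal sketch of that template idea: the parsed query yields IDs, and a string template turns them into SPARQL. The prefixes, property names, and IDs are assumptions that only mirror the "plants with yellow flowers" example, not the actual BIOfid query.

```python
# Minimal sketch of template-based SPARQL generation; prefixes, property names,
# and IDs are invented and only mirror the "plants with yellow flowers" example.
SPARQL_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <https://example.org/id/>

SELECT ?taxon WHERE {{
  ?taxon rdfs:subClassOf* ex:{class_id} .
  ?taxon ex:{property_id} ex:{value_id} .
}}
"""

def build_query(class_id: str, property_id: str, value_id: str) -> str:
    # The rule set decides which slots to fill; the template decides how the query looks.
    return SPARQL_TEMPLATE.format(
        class_id=class_id, property_id=property_id, value_id=value_id
    )

# "plants with yellow flowers" -> class 6 (plants), a flower-color property, value yellow
print(build_query("6", "hasFlowerColor", "colorYellow"))
```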
Finally, the triple store response is simply thrown against the document database, and the document database will find it, not because it is looking for Taxus baccata, but because it has also indexed the ID of the annotation and treats it like a word. It doesn't care whether it is a proper word or an ID. And then we get back our documents, which are presented in the portal.
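As a toy stand-in for that last step, here is a tiny inverted index in which annotation IDs are indexed just like words. The document contents and URIs are invented.

```python
# Toy stand-in for the document database: annotation IDs are indexed just like words,
# so IDs returned by the triple store can be matched directly against documents.
from collections import defaultdict

documents = {
    "doc1": ["Die", "Eibe", "https://example.org/id/528", "wächst", "in", "Bayern"],
    "doc2": ["Rotbuchen", "https://example.org/id/612", "im", "Taunus"],
}

inverted_index = defaultdict(set)
for doc_id, tokens in documents.items():
    for token in tokens:
        inverted_index[token].add(doc_id)

triple_store_result = ["https://example.org/id/528"]  # e.g. the ID of Taxus baccata
hits = set().union(*(inverted_index[uri] for uri in triple_store_result))
print(hits)  # -> {'doc1'}
```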
The HTML is already stored, so I mostly don't have to do much to get the presentation ready. What are the lessons learned from all this data going back and forth? If you are building a semantic search on your own, make sure that all NLP enrichment stages of your user query are stored somewhere. Even if you don't use them now, you will probably use them later for an expansion.
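One possible way to keep those enrichment stages around is a simple record stored next to the hit list; this is an assumption on my part, not the BIOfid data model.

```python
# Hypothetical record for keeping every enrichment stage of a user query around;
# field names and values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class EnrichedQuery:
    raw: str                                                 # what the user typed
    tokens: list[str] = field(default_factory=list)          # tokenized query
    entities: dict[str, str] = field(default_factory=dict)   # surface form -> resolved URI
    sparql: str = ""                                         # the generated database query

query = EnrichedQuery(
    raw="plants with yellow flowers",
    tokens=["plants", "with", "yellow", "flowers"],
    entities={"plants": "https://example.org/id/6"},
    sparql="SELECT ?taxon WHERE { ... }",
)
print(query)
```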
You should also prefer templates over rules. You can do natural language processing, that's totally fine, but I would suggest using it in combination with templates, because rules are very, very hard to maintain. There is an example called Lango, a GitHub repo, which is a very good approach to combining natural language processing and templates. And when you are doing stuff like this, be ready to drown in data: you have to make sure that your application can handle a large database response.
And you should not underestimate the power of normalized data. I just learned that my document database knows a lot about my IDs. For example, Taxus baccata is very often associated with IDs or URIs of terms that lie within the Alps or reference the Alps themselves, southern France, or some parts of eastern Europe. And guess what? That is exactly the distribution of Taxus baccata. So the document database, without any machine learning, knows exactly which terms are associated with which other terms.
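To illustrate that observation, here is a small co-occurrence count over indexed annotation IDs. The documents and the Wikidata-style IDs are invented for the example.

```python
# Sketch of the "normalized data" observation: counting which location URIs co-occur
# with a taxon URI across documents. Documents and IDs are invented for the example.
from collections import Counter

TAXUS = "https://example.org/id/528"
docs = [
    [TAXUS, "wd:Q1286", "wächst"],    # a document mentioning the taxon and a place
    [TAXUS, "wd:Q18678", "Fundort"],  # another document, another place
    ["wd:Q64", "Berlin"],             # a document without the taxon
]

location_counts = Counter(
    uri
    for tokens in docs if TAXUS in tokens
    for uri in tokens
    if uri.startswith("wd:")
)
print(location_counts.most_common())
```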
There are, of course, also alternative approaches. You can, for example, go with topic modeling: you give a document some keywords that say this is about plants, this is about a specific plant, or this is about a specific animal. However, I learned that these approaches, at least the ones mentioned, do not scale that well, because they use large databases when you have a lot of documents. And some of these algorithms, specifically semantic analysis, are very hard to parameterize to get good results. And then, of course, the elephant in the room is machine learning. You can do this, but you have to get training data. I didn't have training data, so I went with the approach I showed you. Machine learning may be the better way at some point, specifically because it might not use as many resources, but I'm not too sure; I still have to figure this out.
One last thing I want to mention is that we are currently working on information extraction. Here, we want to make it easier for users to extract data from the portal. Specifically, we want to say that this taxon, Taxus baccata, was mentioned to occur at this place, put this into Darwin Core datasets, and provide these first in our portal and then in other biodiversity infrastructures.
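A rough idea of what such an extracted fact could look like as a Darwin Core style record; the field selection and all values below are my illustration, not actual BIOfid output.

```python
# Rough sketch of an extracted fact as a Darwin Core style occurrence record;
# the field selection and all values are illustrative, not actual BIOfid output.
occurrence = {
    "scientificName": "Taxus baccata",
    "locality": "Bavaria, Germany",
    "eventDate": "1905",                  # legacy literature often gives only a year
    "basisOfRecord": "MaterialCitation",  # the fact was extracted from literature
    "associatedReferences": "biofid:doc:12345",  # hypothetical document identifier
}
print(occurrence)
```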
This has lately been discussed, at least loosely, under the term nanopublications; it goes in that direction, but it is not exactly the same. And that's it for my part. I just want to highlight our GitHub repo, where you can also find some of the tools that I showed you, and especially our UBLabs blog, where I go into more detail about the semantic search.