
Improving Named Entity Recognition in the Biodiversity Heritage Library with Machine Learning


License: CC Attribution - ShareAlike 3.0 Unported
Abstract
Scientific names are important access points to biodiversity literature and significant indicators of content coverage. The Biodiversity Heritage Library (BHL) mines its content using the open source Global Names Recognition and Discovery (GNRD) tool from the Global Names Architecture (GNA) suite of machine learning and named entity recognition algorithms, to extract scientific names to index and attach to page records. The 2017 BHL National Digital Stewardship Residents (NDSR) are working collaboratively on a group of projects designed to deliver a set of best practices recommendations for the next version of the BHL digital library portal. NDSR Residents Katie Mika and Alicia Esquivel will discuss (i.) BHL and the significance of taxon names, (ii.) the current workflow, proposed improvements, and example workflows for linking content across scientific names including semantic linking to biodiversity aggregators such as Encyclopedia of Life and the Global Biodiversity Information Facility, (iii.) how to use scientific names for content analysis, and (iv.) optimizing manuscript transcription of archival content, which introduces problems like outdated and common names, misspellings, and antiquated taxonomies to GNA tools. Authors invite questions, comments, and discussion from audience members as the Residents prepare to submit their final recommendations at the end of the year.
Transcript (English, auto-generated)
Hi everyone, I am Alicia Esquivel and I'm here today with my colleague Katie Mika. We are two research fellows for the Biodiversity Heritage Library. We have spent the last year researching and developing a set of best practices and recommendations for the next iteration of BHL.
So today we are going to be discussing biodiversity informatics and semantic links between the BHL library collection and domain data repositories, specifically the links that occur between scientific names. Briefly, the Biodiversity Heritage Library, or BHL as we will refer to it often today, is a consortium of natural history and botanical libraries that digitize their material and make it freely available for people to use online. As of December 2017, there are over 53 million pages available online for people to read and use. Our project is funded by IMLS. We are National Digital Stewardship Residents, and for the past year, like I said, we have been working on improvements for the next iteration of BHL.
My main project has been to conduct a content analysis of the collection, and today I'm going to be talking about how scientific names are indexed in BHL, what these names are currently linked to, and how I was able to use these names to conduct a content analysis on the corpus. Katie will be speaking about how to create structured data from some of the archival material that's currently in BHL and how to move forward towards linked data in biodiversity informatics. So this is a look at the current BHL data model. I hear some laughter in the audience.
This model is organized in tables and stored in a relational database. The magnified insert that you can maybe see here is about the scientific names in BHL, and I'm going to talk a little bit more about how those are indexed out of the text and what BHL is currently doing to link out to other biodiversity informatics resources. Scientific names are important because names can be used as an index point in connecting different types of biodiversity data, including literature, observational data, images and videos, genetic data, and taxonomic catalogs. The majority of biodiversity data does relate to scientific names, which makes them a great index point for getting between these different types of data.
BHL indexes its scientific names from the OCR text using the Global Names Recognition and Discovery (GNRD) tool, which is made by Global Names Architecture. Global Names Architecture is a system of web services that helps people register, find, check, and organize biological scientific names and interconnects online information about species. There is a suite of different tools made by Global Names Architecture, and I'm going to be talking about three of them today: the Global Names Usage Bank, the Global Names Index, and Global Names Recognition and Discovery. The Global Names Index and the Global Names Usage Bank are central components of the Global Names Architecture, and they are what actually allows for semantic linking across the scientific databases. The Global Names Recognition and Discovery tool is what BHL specifically uses to index its own scientific names.
The Recognition and Discovery tool uses two different algorithms, called TaxonFinder and NetiNeti. TaxonFinder is a dictionary-based tool that uses the Global Names Index as its thesaurus to search for scientific names throughout the text. NetiNeti is a naive Bayesian algorithm that uses machine learning to identify and discover scientific names, and it does this based on the particular letter combinations that are common in scientific names. For example, the training data would teach NetiNeti that a scientific name looks like a two-word phrase that often begins with a capital letter and ends in "-us". Those are some of the indications of a phrase being a scientific name.
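To make that concrete, here is a minimal sketch of a naive Bayes name finder in the spirit of NetiNeti. The features, toy training phrases, and use of scikit-learn are all illustrative assumptions, not the actual NetiNeti implementation.

```python
# Illustrative naive Bayes scientific-name detector (not NetiNeti itself).
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

def name_features(phrase):
    """Surface features of the kind a learned name finder might rely on."""
    words = phrase.split()
    return {
        "two_words": len(words) == 2,
        "first_capitalized": bool(words) and words[0][:1].isupper(),
        "rest_lowercase": all(w.islower() for w in words[1:]),
        "latinate_ending": bool(re.search(r"(us|um|is|ii|ae)$", phrase)),
    }

# Toy labelled phrases (hypothetical): 1 = scientific name, 0 = not.
training = [("Felis catus", 1), ("Quercus alba", 1), ("Rattus rattus", 1),
            ("United States", 0), ("next Tuesday", 0), ("the genus", 0)]

vec = DictVectorizer()
X = vec.fit_transform(name_features(p) for p, _ in training)
clf = BernoulliNB().fit(X, [label for _, label in training])

print(clf.predict(vec.transform([name_features("Homo sapiens")])))  # -> [1]
```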
The semantic linking done through scientific names is made possible by the Global Names Usage Bank, which indexes every usage of a scientific name and the resources in which they are mentioned. Globally unique identifiers are given to three concepts in the Global Names Usage Bank: agents, references, and taxon name usage instances (TNUs). Agents are typically the people or organizations that are authors of the references; references are the published or unpublished works that the names are found in; and taxon name usage instances are every time a name is used in those references. So references can have multiple TNUs. The TNUs are used to decipher protonyms; a protonym is the first time a name is used to identify an organism. TNUs contain unique and persistent identifiers. They contain links to the reference, at the page level if possible, which is done in BHL, and links to the protonym TNU. They also include an indication of the taxonomic rank, the exact spelling of the name within the reference, and a link to the TNU that represents the immediate parent taxon. This is what the data model just for the TNUs looks like, and in this way different spellings of a name are brought together into one taxonomic concept.
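As a rough paraphrase of that data model, a TNU record can be thought of as something like the following; the field names are restatements of the concepts just described, not the actual Global Names Usage Bank schema.

```python
# Illustrative paraphrase of a taxon name usage (TNU) record; field
# names are assumptions, not the real Global Names Usage Bank schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaxonNameUsage:
    guid: str                         # globally unique, persistent identifier
    reference_guid: str               # the work the usage appears in
    page_link: Optional[str]          # page-level link (as done in BHL)
    protonym_guid: str                # TNU of the name's first use
    parent_taxon_guid: Optional[str]  # TNU of the immediate parent taxon
    rank: str                         # indication of taxonomic rank
    verbatim_spelling: str            # exact spelling within the reference
```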
Currently, BHL is linked to the Encyclopedia of Life through scientific names, and users can search by scientific name when they perform an advanced search. There's an option to search by scientific names, and I'm actually going to run through that process with you right now. Okay, so this is the advanced search page and the scientific name box. I will run a search on this scientific name, and the results show up here. If I click the first result, I get every instance in BHL, at the page level, where this name is mentioned.
So I can link directly to this page. The name is here on this page, and to the side is every scientific name that is found on this page, and there are a lot, because this is an index. These link out to the Encyclopedia of Life. BHL serves as the literature background for the Encyclopedia of Life, which aggregates data from multiple biodiversity data sources, and here are those literature references from BHL hosted at the Encyclopedia of Life. There are also maps that come from occurrence records from the Global Biodiversity Information Facility, and there are more scientific names here, organized by different taxonomic backbones. Now I'm going to step back to the original results in BHL, and you can see that these species and subspecies are not actually linked together; this is just a search for the string. If you look at all of these name sources, we have a display similar to the one in the Encyclopedia of Life. However, these are mostly static, and you can't really browse through the collection in this way.
Which is why, when I wanted to do a content analysis based on scientific names, I had to do it this way. First I had to take a download that is freely available on BHL; it's just a CSV file of all name instances that are found in the text. I can then feed that back through a Global Names tool that resolves each name to a particular taxonomic reference source, and then take those results and filter them to the taxonomic kingdom level. Quickly, that looks like this. It's a very easy tool to use: you can upload CSVs into the resolver tool. Make sure that you have headings that map correctly to the Global Names headings; in the simplest case, all that you need is a "scientific name" heading and a list of names. You select a data source; I used Catalogue of Life. And this is what the results page looks like. This sample size was only about 30 names, and it took two seconds to process. For the whole BHL corpus, I was able to run a hundred thousand names at once, which had to be done 38 times, and each batch took about 45 minutes. The results come back in a CSV file, and this is part of those results, just the classification paths.
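For readers who would rather script this step than use the upload form, something like the following is one way to batch names against the Global Names Resolver web service. The endpoint, parameters, and response fields follow the resolver's documented JSON API as best I recall, so treat this as a hedged sketch and verify against the current Global Names services before relying on it.

```python
# Hedged sketch: querying the Global Names Resolver for a small batch of
# names against Catalogue of Life (data source id 1). Verify endpoint,
# parameter names, and response fields against the current documentation.
import requests

names = ["Felis catus", "Quercus alba", "Puma concolor"]
resp = requests.get(
    "http://resolver.globalnames.org/name_resolvers.json",
    params={"names": "|".join(names), "data_source_ids": "1"},
    timeout=60,
)
resp.raise_for_status()
for datum in resp.json()["data"]:
    for result in datum.get("results", []):
        print(datum["supplied_name_string"], "->",
              result.get("classification_path"))
```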
I was then able to take these classification paths and run a Python script over them to filter them all to the kingdom level, and this is what those results look like. This is the whole BHL corpus split into taxonomic kingdoms: the unique names are in dark gray, the total occurrences of names are in light gray, and in the middle is a species estimate for the names. This gives the collections committee an idea of which kingdoms are better represented in BHL.
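The kingdom filtering itself can be as simple as splitting each classification path on its delimiter and counting; the column names below are assumptions about the resolver's CSV export, not verified headers.

```python
# Tally resolver results by kingdom: the first segment of a path like
# "Animalia|Chordata|Mammalia|...|Felis catus" is the kingdom. Column
# names are assumptions about the export format.
import csv
from collections import Counter

totals, uniques, seen = Counter(), Counter(), set()

with open("resolver_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        path = row.get("classification_path") or ""
        kingdom = path.split("|")[0] or "Unresolved"
        totals[kingdom] += 1
        name = row.get("supplied_name_string", "")
        if name not in seen:              # count each name string once
            seen.add(name)
            uniques[kingdom] += 1

for kingdom, count in totals.most_common():
    print(f"{kingdom}: {count} occurrences, {uniques[kingdom]} unique names")
```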
And there is potential to use visualizations like this to browse through the collection. So now I will pass this along to Katie, who's going to talk about field notebooks in BHL. Okay, hi. My project with BHL has largely focused on archives collections. One immediate issue that comes up all the time is that creators of manuscript items don't generally refer to taxa by their valid scientific names, so Global Names has a hard time identifying and indexing outdated, vernacular, and common names. In considering BHL's primary users, it became pretty clear that scientists and taxonomists are increasingly interested in doing large-scale computational research across our collections, in concert with biodiversity big data repositories: the Global Biodiversity Information Facility, or GBIF; the Ocean Biogeographic Information System, or OBIS; and the Encyclopedia of Life, which Alicia already showed you. So my question was whether it is possible, or useful in any way, or sustainable, to extract some of this vital research data, like species occurrence events, which are records of where and when a particular taxon was observed and recorded.
And then transform that into a format or schema that can be understood by these aggregators. In this example you can see an ornithologist's field notes, which combine scientific Latin binomial taxon names with symbols that refer to counting or indexing. Below that, you can see a lot of common vernacular names and abbreviations, and just a mess of unstructured handwritten data that's trapped in an image. There are some text mining programs and natural language processing systems that we can run over OCR or transcribed material, but they're often not good enough to manage the complex and outdated linguistic models that are really common in 18th- and 19th-century informal documents. Specifically, we would run into problems with negations and counting, trait descriptions that span and jump across different observations, interpreting tables or illustrations when they pop up, and understanding relationships between entity types like locations, abbreviations, and taxon names. As Alicia mentioned earlier, our research focuses on improving the next version of the BHL portal, and one thing they're particularly interested in implementing is a new IIIF-compatible image delivery system.
My research initially focused on how to use something like IIIF annotations to identify occurrence events in manuscript collections and in grey literature, and then use linked data to connect them to the published literature, to taxon treatments within that literature, to these biodiversity big data repositories, to digitized specimen collections from natural history museums, to taxonomic nomenclatures; there's a lot of stuff. But unfortunately, this is sort of a story of failure. There isn't really any linked data available for biodiversity so far; there's no knowledge graph yet. There are some ideas. These are a few conceptual models of what biodiversity informatics agents and data owners are imagining as a domain knowledge graph, along with some linked data models for Darwin Core and Darwin Core Archive files, which are the schemas we use to represent this data.
But there is no widespread adoption; no one has really implemented it to the extent that you can make it usable, and biodiversity data is still largely stored the way we store ours: in tables, in giant, awful relational databases. Also, several of our member institutions are already engaged in crowdsourcing transcription programs. These are the three platforms that are probably most widely used. Essentially, they turn images of manuscript items into machine-readable text, but they support many different levels of encoding: some can really only give you plain text, while others can encode it completely into TEI XML. And we have some people who are starting to implement IIIF and are using the Open Annotation data model and the Web Annotation data model. But since we are a consortial institution, we have a lowest common denominator: we need to be able to take in many different types of data, so we have to start at the bottom. This is going to be kind of a low-tech description of what we're doing, or what we're hoping to do soon. So generally I'm walking it back, and my question is now morphing a little bit into how we can use these existing transcription programs
and transform their output into some kind of interoperable schema, hooking them into the biodiversity data universe in a way that isn't going to impede future linked data efforts. The simplest, lowest-tech way is to just tag instances of taxa, locations, and dates; these are the primary components that are absolutely required to declare species occurrences. We can then also parse any kind of tag that our members give us, with whatever data export they use. We can parse XML really easily, and the Open Annotation or W3C Web Annotation models are fine, too. We can also utilize the indexed names identified by the Global Names recognition service that Alicia just told you about. This is an example of MediaWiki syntax, which is going to be the simplest, most flexible, easiest-to-implement process for collecting some kind of structured data from even just plain-text transcriptions.
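As a minimal sketch of that tagging step, here is one way tagged taxa, locations, and dates could be pulled out of a plain-text transcription. The {{taxon|...}}-style template markup is an invented stand-in for whatever tag syntax a given transcription platform actually exports.

```python
# Pull hypothetical wiki-style tags out of a plain-text transcription;
# the tag syntax is illustrative, not a specific platform's format.
import re

page = ("Saw two {{taxon|Sitta carolinensis}} near {{location|Rock Creek}} "
        "on {{date|May 3, 1899}}; heard {{taxon|white-breasted nuthatch}}.")

TAG = re.compile(r"\{\{(taxon|location|date)\|([^}]+)\}\}")
tags = {}
for kind, value in TAG.findall(page):
    tags.setdefault(kind, []).append(value.strip())
print(tags)
# {'taxon': ['Sitta carolinensis', 'white-breasted nuthatch'],
#  'location': ['Rock Creek'], 'date': ['May 3, 1899']}
```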
Once data is tagged, we just extract it, and then we need to do some intellectual work to create the relationships between the different data that we pull out. This is going to depend on the transcription platform: some will export XML, some will export CSV files of tagged subjects, and some will export just the plain text without parsing anything. But in general, the idea is that you want to associate names with their locations and dates. You can see in this CSV file that each row is an occurrence record, and the columns are the Darwin Core terms that you can use to keep going.
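One simple association rule, sketched below, is to pair every taxon tagged on a page with that page's location and date; real notebooks need more careful scoping than this, so treat it as an illustrative assumption rather than the actual workflow.

```python
# Illustrative association step: every taxon on a page is paired with
# the page's (single) location and date to form occurrence rows.
def to_occurrences(tags):
    locality = (tags.get("location") or [None])[0]
    event_date = (tags.get("date") or [None])[0]
    return [{"scientificName": taxon,
             "locality": locality,
             "eventDate": event_date}
            for taxon in tags.get("taxon", [])]

tags = {"taxon": ["Sitta carolinensis", "white-breasted nuthatch"],
        "location": ["Rock Creek"], "date": ["May 3, 1899"]}
print(to_occurrences(tags))
```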
This is just a slide about reconciling data. We use OpenRefine, along with an R package called taxize that helps do some taxon referencing to turn common names into standard names. I also used the R package ggmap to turn location strings into coordinates; it works very well, I was surprised by that. And then the Python library dateparser standardizes dates.
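For the date step, something like the following works with the dateparser library; the sample strings are invented, but dateparser.parse is the library's standard entry point.

```python
# Standardizing free-form field-notebook dates with dateparser.
import dateparser

for raw in ["May 3rd, 1899", "3 May 1899"]:
    parsed = dateparser.parse(raw)
    print(raw, "->", parsed.date().isoformat() if parsed else "unparsed")
```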
Once we have good data, we will crosswalk it into an official Darwin Core Archive file, which is still the standard schema for biodiversity data. One of the simplest ways, again, is just using a CSV file: we use Darwin Core terms as headers, and we can create URIs for occurrence records. So the occurrence record will have a URI for future linked data references, and then we can also add URIs for our bibliographic data.
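A minimal sketch of that CSV crosswalk follows, using a handful of real Darwin Core terms (occurrenceID, scientificName, locality, eventDate) as headers; the URI pattern minted here is purely illustrative, not an actual BHL identifier scheme.

```python
# Write occurrence rows with Darwin Core terms as CSV headers and a
# minted URI per record (the URI pattern is an illustrative assumption).
import csv
import uuid

rows = [{"scientificName": "Sitta carolinensis",
         "locality": "Rock Creek", "eventDate": "1899-05-03"}]

with open("occurrences.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["occurrenceID", "scientificName",
                       "locality", "eventDate"])
    writer.writeheader()
    for row in rows:
        row["occurrenceID"] = f"https://example.org/occurrence/{uuid.uuid4()}"
        writer.writerow(row)
```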
I don't think that's on this one, but farther down in the CSV you can see that all of our bibliographic data will have a URI. GBIF doesn't have linked data, but they do have a validation service, so you can give them whatever data you have, or whatever you would like them to index, and they will validate it for you, which is pretty cool. And then a final step is to go back and officially resolve all of the scientific names in the data that's been tagged and extracted, and make sure that when it goes back into BHL, Global Names can still find it and index it.
So now these data have been added to the Global Biodiversity Information Facility and other domain-specific databases for occurrence data. They've been identified by Global Names, indexed in BHL, and added to the taxon bibliography. Clearly we're at the very beginning of our linked data journey, and we would love some advice, if you have any, from anybody who has worked with really complicated vocabularies like taxonomic nomenclatures specifically, or who is trying to hook into a knowledge base that isn't really supporting linked data yet.
Okay, all right. We don't have a ton of time for questions, but while the next speakers are setting up their computer, if someone has a quick one... I'd also recommend you check out BHL's blog; they do some really good stuff. Any quick questions? Oh, okay, Tom. Oh, sorry.
Hello. Thank you for the talk. Having biodiversity data as linked data presupposes that one would have persistent, maintained URIs for taxa. Do you see any organizations that are candidates for providing persistent URIs for taxa? Definitely not. That's sort of the main problem: there are a lot of conflicting issues with the naming part of it. Organizations will be able to create URIs for name strings that are valid and that people can agree on, but the problem is when you connect that to an organism that might have a disputed taxonomic status. So there has to be a way to be a bit more flexible, I think: separating the concept of the organism from the name, which can be hard, because one concept can have many names and one name can have many concepts of organism. That sounds like a great topic for y'all to chat about over coffee, because I'm sure Tom has ideas to help with that.
Thank you again very much.