
Linking Data about the Past through Geography: Pelagios, Recogito & Peripleo


Formal Metadata

Title: Linking Data about the Past through Geography: Pelagios, Recogito & Peripleo
Number of Parts: 16
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
Pelagios is a community-driven initiative that facilitates better linkages between online resources documenting the past, based on the places they refer to. Our member projects are connected by a shared vision of a world in which the geography of the past is every bit as interconnected, interactive and interesting as the present. Pelagios has been working towards establishing conventions, best practices and tools in several areas of "Linked Ancient World Data":
i. Linking and aligning of gazetteers. Gazetteers are the primary knowledge organization mechanism in Pelagios. In order to foster integration of gazetteers from different communities, we have been developing an RDF profile for publishing gazetteer metadata as Linked Open Data.
ii. Tools to aid linking. To simplify the process of linking documents to the places they refer to, we have developed an Open Source geoannotation platform called Recogito.
iii. Tools to visualize and navigate. To make the growing pool of data in Pelagios more accessible to everyday users, we are working on a search engine called Peripleo. Peripleo will allow the navigation of the interconnected gazetteers that form the backbone of Pelagios, as well as the objects and documents that link to them.
iv. Infrastructure for re-use. Data created in Recogito is available under CC terms for bulk download. Peripleo will feature similar capabilities and, in addition, offers a comprehensive JSON API to enable re-use in 3rd party applications and mashups.
SWIB15 Conference, 23–25 November 2015, Hamburg, Germany. http://swib.org/swib15 #swib15
Transcript: English (auto-generated)
Okay, thank you for the introduction. In my presentation today, I want to speak a little bit about a project that I have been involved with for the past two years, or actually about an initiative that dates back even longer. I think we started around 2011 to assemble a small community of people with the very free-form goal of simply finding ways to better interconnect data. We wanted to be really quite open there and not specific: any kind of online data about the past, linked up according to linked data principles.
Just a quick question, I've never asked this before, but I think I can dare to do it now: has anybody ever heard of the Pelagios project before? Okay, a few people, that's cool. For those who haven't, I'm going to give you the whole introduction today. So what is Pelagios?
As I said, it's a very free-form initiative, a collective of people with an interest in connecting data about the past, and the way we do it is by geography. So space is the interconnection medium we use to connect online resources. At the moment, we have around 40 partners from around eight countries, and those partners together have accumulated one million annotations; what I mean by annotations in this context, I will make clear later in this presentation. The data is very heterogeneous. We have partners that maintain text archives, for example, databases of archaeological sites and archaeological objects, databases of inscriptions, museum databases, literature, image collections: any kind of data, each coming with its own formats and its own description standards. The only thing which is homogeneous, the core idea behind all this, is that every dataset in some way relates to place, and that's the principle we want to leverage to get some organization into the data and to connect it. So that's the idea behind Pelagios. At this point, it also makes sense to say a little bit about what Pelagios is not. Pelagios is not a data aggregator: we don't go to our partners and say, give us all your data, we'll host it at our own location, and then everything's going to be fine. That's not what we do; we're not a repository. And we're not a standard data model: we don't want to mandate a single data model which everybody should adhere to, so that once they adhere to it, everything becomes interoperable. That's also not the idea of Pelagios. So if we're not here to do anything useful, what are we doing?
The idea, which we have given the name of connectivity through common references rather than a common schema, is that everybody is free to express their data in whichever way they like, but when they make a reference to a place, they should do so using URIs. So Pelagios is based exactly on the idea of using URIs to refer to entities, and since Pelagios is about place, we're speaking specifically about place URIs. I'll give you an example of how that can work. What we see here is one page from Google Books. It's an English translation of the Histories by Herodotus, and for those who know Herodotus, it's full of references to places, of course.
Mentions of places like Athens and Sparta. As a human reader, that's completely clear to us. We have the context. We know what Athens is. We know that Athens is a place. We also know that it's about Athens in Attica, so the ancient Athens, not Athens, the capital of modern Greece. To computers, that's not obvious.
So what we want to do is have unambiguous identifiers and annotate the content with them. In this case, we assign identifiers saying: okay, I'm talking about Athens, and I give it the identifier Pleiades 579885; or for Sparta, the identifier Pleiades 570685.
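To make this concrete, here is a minimal sketch of what such a place annotation could look like as linked data, written with Python's rdflib. It is only an illustration in the spirit of the Open Annotation model that Pelagios builds on, not the definitive Pelagios profile; the example.org document URI is hypothetical.

```python
# Minimal sketch: a place annotation as RDF, in the spirit of the Open
# Annotation model that Pelagios builds on. The example.org URIs are
# hypothetical; the Pleiades URIs are the ones mentioned in the talk.
from rdflib import Graph, Namespace, URIRef, RDF

OA = Namespace("http://www.w3.org/ns/oa#")

g = Graph()
g.bind("oa", OA)

annotation = URIRef("http://example.org/annotations/1")        # hypothetical
document   = URIRef("http://example.org/herodotus/histories")  # hypothetical
athens     = URIRef("http://pleiades.stoa.org/places/579885")  # Athens in Pleiades

# The target is the document (or a fragment of it); the body is simply
# the gazetteer URI of the place the document refers to.
g.add((annotation, RDF.type, OA.Annotation))
g.add((annotation, OA.hasTarget, document))
g.add((annotation, OA.hasBody, athens))

print(g.serialize(format="turtle"))
```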
So that's just an ID number from somewhere which provides me with an identifier for those two places. Those who've been in the digital humanities domain have probably heard of Pleiades. What is Pleiades? Well, Pleiades is a gazetteer: basically a website, or a database, of places from the ancient world.
It has a website where you can search for places, it has geometry, it has names, all the stuff you expect from a gazetteer. But the real value, and what we're interested in from the Pelagios side of things, is those identifiers: in Pleiades, every place has a URI, and that's the value of Pleiades for us in Pelagios. We can also do this with different kinds of media. So we can have an image, and this image is actually a screenshot from a tool which I'm going to show in just a minute. We can also do this with different gazetteers, of course; you don't need to be limited to Pleiades. There are also other gazetteers.
Of course, you all know GeoNames, and you know the Getty Thesaurus of Geographic Names. Those are, I would say, global gazetteers. But there are also what I would call community gazetteers, which are very focused on a specific cultural domain, maybe a specific area or a specific time interval. This here, for example, is an example from PastPlace. That's an emerging gazetteer which aims to cover the whole world after the period of Pleiades, roughly speaking: medieval Europe, and equivalent time ranges all over the world. But we don't restrict which gazetteer you use; the principle is just this annotation process with gazetteer URIs.
From an abstract point of view, what is it that we're doing? On the bottom, we have our web resources. Those could be web pages, they could be produced from a database, whatever. They will have links among each other already, because they're documents on the web, but individually they might be isolated; there might be silos around the web. On the other end, we have the gazetteers. They will also have links between places; in Pleiades, for example, Athens is linked to Attica as the higher-order administrative division. But again, the gazetteers themselves might be isolated, they might be silos. What Pelagios achieves, first of all through this act of annotation, is that it creates links between the documents and the places. And if you look at the bottom, you can see that there are two documents which may have been isolated before, but because they're now referencing the same place, there's now a connection.
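As a toy illustration of this connectivity through common references, consider the following sketch: two records from entirely different datasets become connected the moment their annotations point at the same gazetteer URI. All document URIs here are invented.

```python
# Toy sketch: documents from unrelated datasets become linked simply
# because their annotations reference the same place URI. The document
# URIs are invented; the Pleiades URIs are from the talk.
from collections import defaultdict

annotations = [
    ("http://example.org/texts/herodotus-book-1", "http://pleiades.stoa.org/places/579885"),
    ("http://example.org/coins/hoard-42",         "http://pleiades.stoa.org/places/579885"),
    ("http://example.org/texts/herodotus-book-1", "http://pleiades.stoa.org/places/570685"),
]

documents_by_place = defaultdict(set)
for document, place_uri in annotations:
    documents_by_place[place_uri].add(document)

# Any two documents that share a place URI are now implicitly connected.
for place_uri, documents in sorted(documents_by_place.items()):
    if len(documents) > 1:
        print(f"{place_uri} connects: {sorted(documents)}")
```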
We also have a mechanism which is not based on annotation, but is very similar; it's based on linked data, in any case, and it creates connections between gazetteers. So you can also say: OK, here's Athens in my gazetteer and here's Athens in your gazetteer, let's create a connection. And again, through this network, we might make connections between documents on the bottom level which may not have existed before. So that was the abstract view. But from a user's point of view, that's not so useful: users like interfaces they can use to search. So we've experimented with something which is a bit more like Google Maps.
And I will show you what we built in the last months of the Pelagios 3 project. This is an interface which allows you to explore all sorts of data coming from our partners. You can explore the entire collection at once: you can see the coverage of the entire collection, and you can see here the temporal distribution.
You can see various facets: for example, from which sources the data comes, and which languages the documents are in. And you can search; you can set filters on pretty much anything. So let's set a filter, for example, on one of the collections.
So that's one of the partner data sets. You can see what the coverage of this one data set is, and what its temporal distribution is. Let's set a filter on one of the smaller data sets, the Aerial Photographic Archive of the Middle East. So again, a different distribution for this partner data set. You can also see preview images.
Many data sets come with thumbnail images for their records. You also have full-text search, so we can search for a term. One example I like to use is the tetradrachm, an ancient Greek coin type. We search for it and we can see: OK, here's the geographical distribution for this search term, and you can also see the temporal distribution for this specific search term. If you zoom in somewhere, things get updated in real time, so you can see the temporal distribution change, because that's the temporal distribution in just that map area. You can also see, from the different dot sizes, that there are different local centers where most of these items are located. We can zoom further, we can pan the map, and the temporal distribution will adapt to the current viewport. If we zoom out again, we can also drag the handles, and we can see the distribution change, because that's just the items in that particular time span. And if we drag it across, we can see how this footprint changes, and how, from those bigger dots, the local centers start to change and wander around. And that is really just based on people tagging up their items with a little bit of Dublin Core-type metadata for dating and these kinds of things, plus those gazetteer annotations. So it's a really nice interface that works across many collections, because those collections basically use linked data. I think that's the point to take home.
That's the power of linked data. Ah, thank you. I think I'm going to stop here so as not to spend too much time. Now, what you've seen was one specific example, for database-type content: there was a database of coin finds, and every coin had a single point. That doesn't really work when you're dealing with literature like the Histories of Herodotus, because there you have one object with lots of points and lots of coverage. This tool can handle that as well. The question is more: how do you actually produce those annotations? In the past two years, we had a project where we built some tooling which enabled us to take documents, specifically texts and maps, and annotate them. I also want to show you a short screencast of that tool. The tool is called Recogito, and we built it for ourselves, to process a specific corpus which we defined as part of the project.
What you can see here is the image annotation area. We could upload maps to the tool and then start to identify the places referenced in those maps. There is a simple click-and-drag tool with which you can very quickly create annotations and transcriptions. We started out by investigating ways to automate this, but it turns out that for handwritten maps, it's pretty much impossible at the current state of the art to automate the identification of toponyms. So instead we took the route of saying: OK, we're going to build a manual annotation tool, but we're going to build one which is really quick, so you can do the basic steps very rapidly. With two clicks, you click, you drag, you click again, and you have basically already identified the location of a toponym, a place name, on the map.
Then there is a way to transcribe: you can simply add a transcription, and this way you build up your annotation base within the tool. We have the same interface for texts. In this case, we can automate a lot of things. The text is already digital, plain text basically, and we ran the texts we worked on in the project through named entity recognition first. But again, it turns out named entity recognition is, of course, not perfect. So we wanted an interface which, on the one hand, allows us to tag and correct manually very quickly. Also, you can see the color coding: there are gray things and green things. Green means: here we already have a valid annotation. The basic principle in Pelagios has always been that we want to hand-validate these things, because we want to ensure the data quality.
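As a sketch of the workflow described here (not Recogito's actual NER pipeline, which is more sophisticated), imagine an automatic first pass that proposes candidate annotations, each starting out unverified (gray) until a human confirms it (green):

```python
# Sketch of the two-stage workflow: an automatic pass proposes candidate
# annotations, which start "unverified" (gray in the UI) until a human
# validates them (green). Recogito's real NER pipeline is more sophisticated.
KNOWN_TOPONYMS = {
    "Athens": "http://pleiades.stoa.org/places/579885",
    "Sparta": "http://pleiades.stoa.org/places/570685",
}

def auto_annotate(text: str) -> list[dict]:
    """Naive first pass: flag the first occurrence of each known toponym."""
    annotations = []
    for name, uri in KNOWN_TOPONYMS.items():
        offset = text.find(name)
        if offset != -1:
            annotations.append({
                "toponym": name,
                "offset": offset,
                "gazetteer_uri": uri,
                "status": "unverified",   # gray: awaiting hand-validation
            })
    return annotations

def verify(annotation: dict) -> None:
    """A human confirms the annotation; it turns green in the UI."""
    annotation["status"] = "verified"

candidates = auto_annotate("The fleet sailed from Athens towards Sparta.")
verify(candidates[0])
print(candidates)
```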
Here's another interface: once you have annotated your text, you also need to create the mappings to the gazetteer. That's a bit of a trickier interface, and I can't go into the details there, but it's basically a view where you can go through all the annotations you created. You can see which gazetteer match the computer has made, you can search for better matches if you think the computer made a bad match, and you can do quality control and correct automatic matches. If you want to learn a little more about the tool, the URL below points to a beginner's tutorial which you can read through to see how the tool works, and whether it's something you might want to use, too.
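To give a flavor of what searching for better matches might involve under the hood, here is a sketch that ranks candidate gazetteer entries against a toponym by simple string similarity. The candidate list is invented, and Recogito's actual matching logic is certainly more involved.

```python
# Sketch: rank candidate gazetteer entries against a toponym by string
# similarity. The candidates are invented (apart from the Pleiades URIs
# quoted in the talk); Recogito's real matching logic is more involved.
from difflib import SequenceMatcher

CANDIDATES = {
    "Athenae": "http://pleiades.stoa.org/places/579885",
    "Sparta":  "http://pleiades.stoa.org/places/570685",
    "Athribis": "http://example.org/places/athribis",   # hypothetical entry
}

def rank_matches(toponym: str, candidates: dict[str, str]):
    scored = [
        (SequenceMatcher(None, toponym.lower(), name.lower()).ratio(), name, uri)
        for name, uri in candidates.items()
    ]
    return sorted(scored, reverse=True)   # best-scoring candidate first

for score, name, uri in rank_matches("Athens", CANDIDATES):
    print(f"{score:.2f}  {name:10s}  {uri}")
```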
OK, just some quick numbers in terms of output, because one criticism, of course, is: that's all manual tooling, so how productive are you with it? I think we were quite productive. Just to show what we did in this two-year project, which was spent on the one hand on building those tools, but also on using them: my colleagues went through 317 documents in this tool, in eight different languages. They identified almost 130,000 toponyms in maps and texts, and hand-verified about half of those. Hand verification is, of course, the more tedious process; that's why we didn't get everything validated 100 percent. But 50 percent, I think, is already quite a high number. We also submitted the tool to the Open Humanities Awards, and we proposed that if we got the award,
we would hold two public workshops with students. We were lucky to win the award, so we held two workshops, one at the University of Heidelberg and one at the University of Mainz, where we just let students play with the tool. First of all, it was a lot of fun, I have to say, and it was also quite impressive how productive the students were.
Here are some impressions from the workshops. That's the session at Heidelberg: we had 27 students of different backgrounds, geography and archaeology primarily. And here's another picture from Mainz, from the second workshop, which we held with 22 students. Again mixed backgrounds, again archaeology, but this time also engineering students, so a different kind of angle. That was about one year ago, and we were really quite impressed with the quantity, first of all: they identified more than 5,000 places in texts. And if you play around with it yourself,
you can see that this is actually really quite productive: you just double-click on a place name in the text, so tagging up a text is almost as fast as reading it. They also located more than five and a half thousand toponyms on maps. Again, I think that shows that the tool is getting mature and that you can really work quickly with it. They made 1,450 map transcriptions; you can see that this requires more effort, hence the lower number. And they made 680 gazetteer resolutions, which is the most tedious part of the process and takes the most time. One thing also worth mentioning: in the first session, we had about 140 gazetteer resolutions; then we redesigned the user interface completely, and we got almost 500. That also shows that you can gather feedback very explicitly when you have people work with the tool at an early stage.
Here's an example of what a tagged map looks like in the tool afterwards. I'm not sure whether you can really see it perfectly, but you can see there are many grey boxes. The grey boxes mean you have an annotation which also has a transcription. The red boxes mean: we have the location of the toponym, but no transcription. And those few green ones mean: here we have a transcription plus a gazetteer match. Here's the green one, and you can see this imbalance: it's quick to tag, it's quick to transcribe, but it's not so quick to do the gazetteer matching. Again, I think with more user interface work invested, things will get better there too. So, the question is: how can you reuse the stuff that we do? Well, one thing is you can reuse the data as a user.
We are a community, and we try, on the one hand, to encourage people to prepare their data as linked data. This is something I haven't come across too often yet, but I hope we will come across it more often in the future: in this case, it's a portal run by the American Numismatic Society, and they have a link called Linked Data down here. One of the linked data icons is a Pelagios icon, and you can download RDF according to the Pelagios profile. We don't have our own ontology or anything; we are reusing ontologies, but we make recommendations about which properties to use and how to expose metadata about your material in a very limited way,
but in the same way that other people are doing it, which makes things more interoperable. Another thing is Recogito, our own tool: you can also download the data there, of course. All annotation data in Recogito is by definition CC0 licensed, so you can just visit the URL and download the data. In this case, it's CSV, because the data is tabular, which makes it easy to transform into all sorts of other formats. It's easy to import into a spreadsheet, and it's easy to import into a GIS system, which you might want to use for exploring the data, since we're dealing with place.
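For illustration, here is what loading such a CSV download might look like with pandas. The filename and the column name are assumptions, so inspect the header of an actual download first.

```python
# Sketch of re-using a Recogito CSV dump with pandas. The filename and the
# column name below are assumptions for illustration; inspect the header
# row of an actual download before relying on them.
import pandas as pd

df = pd.read_csv("recogito-annotations.csv")   # hypothetical local download
print(df.columns.tolist())                     # check the real column names

# For example, count annotations per gazetteer URI, if such a column exists:
if "gazetteer_uri" in df.columns:
    print(df["gazetteer_uri"].value_counts().head(10))
```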
Also, if you go to Recogito, there are preview pages which show a map view of a particular annotated document, and you can download the data right from those preview pages as well. We also have an API, one of those "evil" APIs. Of course, we're based primarily on the idea of making things available under stable URIs, but I think there was still one thing missing, and that is easy search. So the API is really there for people to define simple search queries and then discover the resources through them. The search API is basically simple: it allows you to search over the basic properties that Pelagios collects about objects. You can do things like: here's a query term on the full text, like "silver", so give me everything that matches the term silver, between 100 BC and AD 180, for a specific place. That's the kind of thing the search API does.
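As a sketch, a query like the one just described might be issued as follows. Note that the endpoint URL, parameter names and response shape here are assumptions for illustration, not the documented API.

```python
# Sketch of calling a Peripleo-style search API with the query from the
# talk: the term "silver", a time span, a place filter. The endpoint URL,
# parameter names and response shape are assumptions, not the documented
# API; consult the Pelagios documentation for the real interface.
import requests

response = requests.get(
    "http://pelagios.org/peripleo/search",    # assumed endpoint
    params={
        "query": "silver",
        "from": -100,                          # 100 BC (assumed parameter)
        "to": 180,                             # AD 180 (assumed parameter)
        "places": "http://pleiades.stoa.org/places/579885",  # assumed
    },
    timeout=10,
)
response.raise_for_status()
for item in response.json().get("items", []):  # response shape assumed too
    print(item)
```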
I should also say we primarily built it to eat our own dog food, because that's the API that Peripleo is running on. So everything that Peripleo does, you could actually do yourself, just based on this API. In general, all the tools are open source. We have a GitHub account at github.com/pelagios, and everything is hosted there, for every tool that we have.
We also run public instances. These are the instances where we have set up our own Peripleo and our own Recogito, so you can play around with them. The Peripleo one is a prototype, I should say; Recogito, at least, is more tested already. But we will be working, or we hope to be working, very heavily over the next two years on making these tools more widely available, so that you can upload your own content, for example, and work in your own workspaces, and also on improving the documentation for setting up your own instances in your own institution. That's all perfectly possible now, and people have done it, but obviously there's still room for improvement in the documentation, I would say. One more thing worth mentioning: when you want to work with Pelagios in any way, you need gazetteers. Gazetteers are at the heart of it.
So you should either align your data to a popular gazetteer like GeoNames or Pleiades, or really any gazetteer that you think is suitable for your domain. You can also bring your own gazetteer; that's perfectly possible too. Again, we have an RDF profile defined. That's not our own ontology; it just reuses things from other ontologies. We have a page where we document what other people have done, and where we recommend: OK, if you do it like that, if you use those kinds of properties, then you will have the same kind of structure that other people are using for their gazetteers. And that makes things interoperable.
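A minimal sketch of what such a gazetteer record could look like, again with rdflib: the property choices below (a SKOS label, a skos:closeMatch link to Pleiades, W3C WGS84 coordinates) are illustrative assumptions in the spirit of reusing existing ontologies; the Pelagios gazetteer profile page documents the actual recommendations.

```python
# Minimal sketch of publishing one gazetteer record as linked data. The
# property choices (SKOS label, closeMatch alignment, WGS84 coordinates)
# are illustrative assumptions; see the Pelagios gazetteer profile for
# the actual recommendations. The example.org URI is hypothetical.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import SKOS

GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")

g = Graph()
g.bind("skos", SKOS)
g.bind("geo", GEO)

place = URIRef("http://example.org/my-gazetteer/athens")
g.add((place, SKOS.prefLabel, Literal("Athens", lang="en")))
g.add((place, SKOS.closeMatch, URIRef("http://pleiades.stoa.org/places/579885")))
g.add((place, GEO.lat, Literal("37.9755")))
g.add((place, GEO.long, Literal("23.7348")))

print(g.serialize(format="turtle"))
```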
Finally, I want to end with a call: you should not only ask what Pelagios can do for you, but also what you can do for Pelagios. First of all, of course, you can use it and give us feedback; that's very important for us. We are looking for more use cases, specifically from the library domain. You can publish your data as linked open data and link it to Pelagios; there are profiles which you may want to take a look at. But really, even if you don't follow our profiles, the main point is: use URIs for places when you refer to places in your data. I think that's the key point, and that's the point I want to end with as well.
So thanks for your attention, and I look forward to your questions.