
Using LOD to crowdsource Dutch WW2 underground newspapers on Wikipedia


Formal Metadata

Title
Using LOD to crowdsource Dutch WW2 underground newspapers on Wikipedia
Title of Series
SWIB16
Number of Parts
16
Author
Olaf Janssen
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content, also in adapted form, is shared only under the conditions of this license.

Content Metadata

Abstract
During the Second World War some 1,300 illegal newspapers were issued by the Dutch resistance. Right after the war as many of these newspapers as possible were physically preserved by Dutch memory institutions. They were described in formal library catalogues that were digitized and brought online in the '90s. In 2010 the national collection of underground newspapers, some 200,000 pages, was full-text digitized in Delpher, the national aggregator for historical full-texts. With online metadata and full-texts created for these publications, the third pillar, context, was still missing, making it hard for people to understand the historic background of the newspapers. We are currently running a project to tackle this contextual problem. We started by extracting contextual entries from a hard-copy standard work on the Dutch illegal press and combined these with data from the library catalogue and Delpher into a central LOD triple store. We then created links between historically related newspapers and used Named Entity Recognition to find persons, organisations and places related to the newspapers. We further semantically enriched the data using DBpedia. Next, using an article template to ensure uniformity and consistency, we generated 1,300 Wikipedia article stubs from the database. Finally, we sought collaboration with the Dutch Wikipedia volunteer community to extend these stubs into full encyclopedic articles. In this way we can give every newspaper its own Wikipedia article, making these WW2 materials much more visible to the Dutch public, over 80% of whom use Wikipedia. At the same time the triple store can serve as a source for alternative applications, like data visualizations. This will enable us to visualize connections and networks between underground newspapers as they developed between 1940 and 1945.
SWIB16 Conference, 28 - 30 November 2016, Bonn, Germany
http://swib.org/swib16/ #swib16
Licence: CC-BY-SA https://creativecommons.org/licenses/by-sa/3.0/
Transcript: English (auto-generated)
[...] the Library of Latvia, and I will be leading this first session. So our first speaker is Olaf Janssen from the National Library of the Netherlands, with a presentation on using LOD to crowdsource Dutch World War II underground newspapers on Wikipedia.
Well, thank you for the introduction. I'm not only working at the National Library of the Netherlands, I'm also a volunteer on the Dutch Wikipedia, and for the next 20 minutes or so I'm going to talk about a project to make as many Dutch underground newspapers from the Second World War
as possible available on Wikipedia, using linked open data techniques and crowdsourcing. And I would like to give you a bit of background perspective on these underground newspapers. During the Second World War, the Dutch Resistance issued quite a number of illegal newspapers,
and they came in every shape and form. On the one hand, there were these really big newspapers, big titles, well-organized titles, nearly professional, often with tens or even hundreds of people working
on them, and these also had local and even regional editions. At the other end of the spectrum, there were these really small, amateur-made, pamphlet-like issues. For instance, there was this student who built an illegal radio, and he listened to the BBC in London, and he wrote down what he heard, and he handed that transcript to
his neighbor, and that neighbor handed it on to their neighbors, and in such a way the news was distributed. So from really big to really small newspapers, everything was collected. Right after the war, the Dutch Institute for War and Holocaust Studies in Amsterdam, the NIOD, started collecting those newspapers for posterity, and over the years they managed to collect
some 1,300 individual titles. I'm talking about titles here, not issues, and in total there are some 200,000 pages of illegal newspapers. The archive that you see at the back of the slide there, that's the entire archive.
And of course when you start to collect stuff, you want to make sure that it is properly described in a library catalogue, and this has happened in the Netherlands: there is a common Dutch catalogue where this is available, and all those 1,300 titles have been described, so this is obviously bibliographic metadata.
And throughout my talk, I will focus on this particular title as an example, De Geus onder studenten, which was a students' newspaper from the area of The Hague and Leiden, so you'll see this come back more often. Six years ago, all these 200,000 pages were digitized.
And now they are available in a website called Delpher, and Delpher, that's our Dutch national aggregator for historic full-text publications, so not only these illegal newspapers, but also other newspapers and books and magazines as well. And all those 1,300 titles are available there, for instance like this one, again,
De Geus onder studenten, and obviously you can read the newspaper, but you can also search it at the word level, because there is also full-text OCR, and ALTO files for word highlighting are available as well.
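(To illustrate the mechanics, not something from the talk itself: word-level highlighting like this generally works by reading word coordinates from the ALTO XML that accompanies the OCR. A minimal sketch, assuming an ALTO v2 file; the file name and search term are hypothetical.)

```python
# Minimal sketch: reading word coordinates from an ALTO XML file, the kind
# of data that makes word-level highlighting possible. Delpher's actual
# files may use a different ALTO version or namespace.
import xml.etree.ElementTree as ET

ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v2#}"

def find_word(alto_path, term):
    """Yield the text and position attributes of every word matching `term`."""
    tree = ET.parse(alto_path)
    for s in tree.iter(ALTO_NS + "String"):
        if s.get("CONTENT", "").lower() == term.lower():
            yield {k: s.get(k) for k in ("CONTENT", "HPOS", "VPOS", "WIDTH", "HEIGHT")}

# Hypothetical file name, purely for illustration:
for hit in find_word("de_geus_1943_p1.alto.xml", "studenten"):
    print(hit)
```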
But now if you want to ask contextual questions about these newspapers, like, for instance: what sort of newspaper was this? Was this a really radical illegal newspaper, or did it have a more moderate tone of voice? What was the history of the newspaper? Who wrote it? Where was it printed?
How was it financed? How was it distributed? And were there any relations with other illegal newspapers during the war in the Netherlands? And when you look at the kinds of questions there, the what, the who, the where, the how, and the relations, no surprise: you're already asking semantic questions.
But the bad news is that you can't answer these kinds of questions at the level of Delpher, because that's a big drawback of Delpher: there is no contextual information available about these illegal newspapers. And when I was thinking of this project, I thought, well, where would most normal people
go if they want to read about historical texts? For instance, these historical newspapers, where would they go? What would your bets be? Say you're interested in a particular newspaper, where would you most
likely go when you want to read about it? Google, yeah? And where would you end up then? Yes, that's the correct answer. Yeah, I would have a problem if you had answered differently.
Because that's true, yeah. So for this particular title, De Geus onder studenten, there is a decent Wikipedia article. And now we have a problem. Of course, you can't do a project when there are no problems. I was a bit nervous about using this particular slide here in Germany, but as you can imagine,
I haven't been chased away yet. So I guess it's okay. And when you do a project, you have a problem to solve. And the problem is that if you want to discover these newspapers, or study them, or understand them, or place them in a historical and cultural context, then it's currently harder
than it should be. And that's because the information on these illegal newspapers is distributed across multiple, even unconnected, resources. To be precise, we have the descriptions, which are the metadata in the library catalog,
we have the content, which is the full text in Delpher, and we have the context in those Wikipedia articles. Or so it seems, because there is another problem: the article I showed you is actually a really carefully chosen exception. The reality is that there are only very few illegal newspapers that have their
own Wikipedia articles. In other words, if you look at the list of available articles on illegal Dutch newspapers, at the beginning of the project, some two years ago, this was the full list. These are about 15 to 18 titles, while it could be 1,300.
So there was still a lot of work to do here. We had to tackle both problems, and for that I started a project on Wikipedia, which is why it's called a WikiProject. The aim of the project is to systematically and uniformly describe and also interlink all these 1,300 illegal newspapers from the
Second World War. And why did I choose Wikipedia? Of course, I could have chosen to do it in our own project environment, but I thought, well, if you want to reach a big audience, then you should do your project directly on Wikipedia, because
over 80% of the Dutch population uses Wikipedia. If you did it in your own project environment, you would have to make a lot of extra effort to get the audience to come to your environment. And the second consideration was that, because of the open nature of Wikipedia, you make sure that all the data you put into the project can also be taken
out again for other reuse: think of Wikidata or DBpedia, or you can even base data visualizations on it. So the big challenge of the project was to get hold of contextual information.
We have all the metadata, we have the full-text content; that's already available. What was lacking was contextual information. And I was really lucky, because I discovered there was a book printed in the Netherlands in the 1980s, The Underground Press, 1940-1945, and this book has contextual
entries about nearly all of those 1,300 newspapers. And when you look at an individual entry within the book, and again I chose the entry for De Geus onder studenten, every newspaper in the book has a unique ID, in this case
it's number 199. And also in the red box there, you see it says Den Haag. This is the place of publication. So you can make relations between newspapers and place names. And obviously, no surprise for this audience, that starts to smell a bit like semantics
already. Then there is this blob of text, and this is ideal raw material for your Wikipedia article. And when you look at it in a bit more detail, over there you see two persons mentioned. Those were the brothers J. and H. Drion, and they were law students at the time.
So you can make a relation between titles and person names. And when you look closely there, in the red box, you see three red numbers: 106, 360, and 748. And those are the IDs of other student newspapers with which number 199 collaborated.
So there were collaborations between that newspaper and those other three. And again, that gives you another building block of your semantic network. So we needed to free up and open up the information that was in this book.
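(To make these 'building blocks' concrete: relations like these map naturally onto RDF triples. A minimal sketch using rdflib, in which the namespace and property names are invented for illustration and are not the project's actual vocabulary.)

```python
# Illustrative sketch only: expressing one book entry (no. 199) as RDF
# triples. The namespace and property names are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

ILL = Namespace("http://example.org/illegal-press/")  # hypothetical namespace
g = Graph()

geus = ILL["title/199"]
g.add((geus, RDFS.label, Literal("De Geus onder studenten", lang="nl")))
g.add((geus, ILL.placeOfPublication, Literal("Den Haag")))
# The two persons mentioned in the entry
for person in ("J. Drion", "H. Drion"):
    g.add((geus, ILL.contributor, Literal(person)))
# The cross-references from the red box: IDs of collaborating newspapers
for other_id in (106, 360, 748):
    g.add((geus, ILL.collaboratedWith, ILL[f"title/{other_id}"]))

print(g.serialize(format="turtle"))
```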
And the first step we took was we actually scanned the book. We ran it through Acrobat so that we got full text OCR. And then we contacted the copyright holder of the book, which is the Dutch Institute for War and Holocaust Studies in Amsterdam.
And we asked them, well, could you please release the book under a Creative Commons license? And they very happily collaborated with us. So now this book is available online, just as a flat file PDF. And it's correctly licensed so that we can actually do further work with it.
And then there were three other problems that we needed to tackle. First of all, we needed to convert this PDF file into a structured database. And we needed to link the titles in the PDF to place names, person names, and other titles. Next, we needed to make links between those titles and the same titles in our library catalog
and in Delpher, so to the metadata and the full text. And the last step was to create links from the titles to external sources other than the library catalog and Delpher. Well, for those three challenges, this is where the co-author of this presentation, Gerhard,
comes into play. Gerhard couldn't be here today. And that presents me with a problem, because Gerhard did all the really hardcore technical work, while I only had my first introduction to linked open data yesterday. So I'm not going to make an effort to explain on a technical level what he did,
because I think you're all much bigger linked open data gurus than I am. So I'm just going to show you what he did. He used these kinds of tools, techniques and services. And then you'll all think, ah, that's what he did. And if you really want to know, he gave me his slides, which look like this.
And so you can come to me after the talk and look at the slides, and I hope you can draw your own conclusions about the work he did. Those slides will of course also be included in the presentation, which will, I hope, be available after this conference.
But the main outcome of his work is that those three red crosses change into three green checks. That's what it comes down to: we made a linked open database, a virtual triple store, containing information about these underground newspapers.
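(As a hedged illustration of what such a triple store enables, here is a sketch of a SPARQL query for collaboration links. The endpoint URL and vocabulary are invented; the project's actual store may look quite different.)

```python
# Hypothetical sketch: querying a triple store for collaboration links
# between underground newspapers. Endpoint and vocabulary are invented.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/illegal-press/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX ill: <http://example.org/illegal-press/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?title ?otherTitle WHERE {
        ?paper ill:collaboratedWith ?other .
        ?paper rdfs:label ?title .
        ?other rdfs:label ?otherTitle .
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["title"]["value"], "<->", row["otherTitle"]["value"])
```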
And actually, that was the first time this had been done in the Netherlands, which I found quite surprising, because, well, it's more than 70 years after the war, and there was still no proper database about underground newspapers. And now we could do two things with the database.
First of all, we could generate Wikipedia articles from the database; that's what we already did. And the second thing, which is much more future work, is that we could also, for instance, export it to Wikidata or DBpedia, or we could use it as a basis for data visualization.
So let me focus a bit further on the first thing. In my project goals, I highlighted the word 'uniformly'. I wanted to make sure that all the Wikipedia articles that we generated from the database would have the same layout, that they would all have an infobox,
and that all the source references would be in the same spot. So all the articles would look more or less the same. And when you talk about uniformity, you always have to think about templating. So we made a template, and using the database, we generated 1,300 Wikipedia articles.
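(A rough sketch of the idea of template-based stub generation; the infobox name, field names and values below are assumptions for illustration, not the project's actual template.)

```python
# Sketch of template-based stub generation: one database record in, one
# wikitext stub out. All names and values here are hypothetical.
STUB_TEMPLATE = """{{{{Infobox illegal newspaper
| name   = {name}
| place  = {place}
| period = {period}
}}}}
'''{name}''' was a Dutch underground newspaper from {place},
published between {period}.

== Sources ==
* Delpher full texts: {delpher_url}
"""

record = {  # one row from the database, values illustrative
    "name": "De Geus onder studenten",
    "place": "Den Haag / Leiden",
    "period": "1940 and 1945",
    "delpher_url": "https://www.delpher.nl/",
}
print(STUB_TEMPLATE.format(**record))
```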
We didn't yet generate full articles, but we generated article stubs. And just to show you what these stubs look like, this is a full Wikipedia article, again on the same newspaper, and in the yellow box at the top, you see the URL this page is available under.
And now I'm going to cross out everything that is not automatically generated. So all the stuff you see there, which is not grayed out, that comes from the database and the template. In other words, if I invert this slide,
the text in the red box, that was the only bit that was added by humans to make the stub into a full article. And this is the process we're currently in. It started about two months ago, and we are working with the Dutch Wikipedia community to expand those stubs into full articles.
Currently, we have a group of about 8 to 10 volunteers, who work on the project with, I must admit, varying intensity. But it's going on, gradually. It's not like 100 articles a month, but bit by bit, the list of available Wikipedia articles is expanding.
This was the list I showed you; this was before the project. And this is the list from about three weeks ago, when I last checked. So you see, gradually, the number of available articles is growing. And of course, we do this to make the Dutch population happy.
And I hope that there will be volunteers from other Wikipedia chapters who will translate the Dutch articles into English, for instance, or into German, so that even more people can learn the history of the Dutch illegal newspapers.
And this man won't be happy. Thank you. Thank you. And now for the questions. When you're asking questions, please also get a microphone.
These are working. OK, so.
No, I would be lying if I said that were true. The majority of the newspapers are notable.
But there are quite a few, especially those really small newspapers, of which little contextual information is known, and if you created an article, it wouldn't meet the notability threshold. So what we're doing for those really small titles is to make an overview page, like the article on small illegal newspapers.
And it has just an overview of all these smaller individual titles. All right. Next question. Mark? How did communication with the Wikipedia community work in this project?
In particular, I know that communities sometimes speak with sorrow about lots of generated data being put into Wikipedia. How did you communicate with the community to make this possible and accepted by the community?
We involved the community right from the start. We explained our project approach and our project goals right from the start of the project. And we asked for input from the Dutch community. Then, when we got their input, we generated a small number of test stubs.
And we asked the community, well, what do you think of them? Do they meet your standards? And we got feedback on that. And once we got an OK from the community, we generated all the Wikipedia stubs. But we generated them not in the main namespace of Wikipedia, but in a separate corner, in the project namespace.
And that's where they are at the moment. So every time a stub gets expanded, it gets copied over to the main namespace of Wikipedia. If you didn't do that, you would indeed probably get a lot of problems with the community. Yes.
Next question. The biographical data you attempted to link to, because you extracted it from the publication, did you succeed in linking it to library data?
Not yet. That's a stage of the project we still have to do. Or have there already been Wikipedia articles about, for instance, these two brothers? Yeah, you see that as a spin-off: not only are articles about those newspapers being written, but new articles about persons who worked on the newspapers have also been created.
Actually, some volunteers only write articles about the persons working on the newspapers. You see that happening, yeah. All right. And time for one more question.
I have a question about the references to the persons behind the newspapers. Some of them paid very dearly for their underground activities. I'm having a hard time understanding. I have a question about the persons, about the names you are linking to. You are linking to and referencing the names, in your project, of the persons who edited those underground newspapers, right?
And are you also referencing or linking to information about their ultimate fates? Because actually, in the Dutch resistance, many were executed, in concentration camps or at Gestapo headquarters.
Others survived the war. Are you also retrieving external data about the ultimate fates of the persons that you reference? Yeah, we used the links that are available from DBpedia
to get hold of that kind of information. There are links from the database to DBpedia, and from there we extracted the kind of information you asked for. Does that answer your question?
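(A hedged sketch of what following such DBpedia links can look like. The endpoint and the dbo: properties are real DBpedia terms, but the person URI is just an example and not necessarily one the project used.)

```python
# Illustration: retrieving biographical dates from DBpedia. The person
# URI is an example; the project's actual links may differ.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?birth ?death WHERE {
        <http://dbpedia.org/resource/Huib_Drion> dbo:birthDate ?birth .
        OPTIONAL { <http://dbpedia.org/resource/Huib_Drion> dbo:deathDate ?death }
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["birth"]["value"], row.get("death", {}).get("value"))
```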
Okay, that's okay. You'll have a chance to follow that up after this. Just to conclude: the data that you created as linked open data, is it available for us to play with? Yes, of course, yes. It's open data, so it's available. Okay, thank you.