RESEARCH INFRASTRUCTURE - Documenting & preserving programming languages & software in Wikidata
Formal Metadata
Title: Documenting & preserving programming languages & software in Wikidata
Number of Parts: 16
License: CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/60337 (DOI)
Production Place: Bonn, Germany
Transcript: English (auto-generated)
00:09
Yet another Wikidata talk mentioned here during SWIB.
00:27
Hi, I'm John Samuel. I'm an Associate Professor at CPE Lyon. And I'm Katherine Thornton, and I work in the Digital Preservation Department of Yale University Library. Our other collaborator, Kenneth Seals-Nutt, is the software engineer who built one of the systems that we'll be talking about today, and
00:46
he's not joining us today because he is presenting at WikiCite. Thank you. So, some years ago when I started programming, I started programming in BASIC and Logo. I am not sure that all of you have heard of these languages,
01:01
and I think many people in the audience may have heard of Fortran, COBOL, and Pascal, but these languages are no longer taught in my college either. So these languages that were taught in college some years ago have become some sort of extinct languages. I will not say they are completely extinct, but they should be passed on to the next generation, and this talk is about that aspect.
01:27
So how is this happening right now? If you go to the English Wikipedia, you have programming languages that have been described. Here I'm showing some very popular languages, Python, Java, and C, and their infoboxes, which describe these programming languages.
01:46
And there's another database, which we will be talking about today. It's called Wikidata. An interesting aspect of Wikidata is that it is a multilingual knowledge base, and here I'm showing you the languages which have the most
02:01
multilingual labels, and if you look closely, there are C, C++, Java, Python. They are the most popular languages because they have labels in many languages. And why do we need to talk about language families? If you take Indo-European languages,
02:21
you know that there are language trees. There is something similar for programming languages. How we program in a functional programming language is not the same approach that we use in procedural languages, and it is not the same approach when we do declarative programming. So there are different ways to program to achieve the same task,
02:41
and if you see how the programming languages are grouped, you find that the popular programming paradigms are procedural programming, imperative programming, functional programming, and object-oriented programming. You get this information thanks to the amount of current information on Wikidata.
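The paradigm grouping described here can be reproduced with a SPARQL query against the Wikidata Query Service. The talk does not give the exact query, so this is a minimal sketch, assuming Q9143 (programming language) and P3966 (programming paradigm) are the relevant identifiers; verify them on wikidata.org before relying on them.

```python
# Sketch: count programming languages per paradigm in Wikidata.
# Assumptions (not from the talk): Q9143 = programming language,
# P3966 = programming paradigm.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?paradigmLabel (COUNT(?lang) AS ?count) WHERE {
  ?lang wdt:P31 wd:Q9143 ;      # instance of: programming language
        wdt:P3966 ?paradigm .   # programming paradigm
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?paradigmLabel
ORDER BY DESC(?count)
"""

def run_query(query: str) -> list[dict]:
    """Send the query to the public endpoint and return its JSON bindings."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "paradigm-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

def simplify(bindings: list[dict]) -> list[tuple[str, int]]:
    """Flatten SPARQL JSON bindings into (paradigm label, count) pairs."""
    return [(b["paradigmLabel"]["value"], int(b["count"]["value"]))
            for b in bindings]
```

`simplify(run_query(QUERY))` would then yield pairs such as a paradigm label together with the number of languages that declare it.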
03:02
But there's another interesting aspect: if you look at current programming languages, they are not focused on a particular paradigm. They are languages which use multiple paradigms. For example, Java introduced lambda expressions recently, because people shift from one language to another and they need easy ways to program. And here we are showing a graph of
03:27
languages which support more than three or four programming paradigms within the same programming language. Now, if you want to see popular languages, you can take some examples of
03:43
programming languages and count their Wikipedia articles. At the center, you have PHP, Java, C++, and JavaScript. And the languages at the outside of the circle are not so popular; they have just a couple of articles on Wikipedia.
04:02
And when I say multiple Wikipedia articles, I mean they are described in many different languages. Similarly, if I take the reverse, I want to see how many programming languages are described in different human languages. English, as usual, takes the biggest circle.
04:22
And then you have the European languages at the second level of circles. These visualizations come about just because of things that have been linked together on Wikidata. I will show you the queries that have been used to create them. Again, if I take the number of labels, you'll find that English is still
04:41
in the middle, and the European languages are at the next level. So this is all about programming languages. But there is also the concept of software, which needs to be described as well. So here I'm showing two English Wikipedia infoboxes on software.
05:02
And I continue the same approach to see what's happening: the software with the most labels, the software with the most articles on Wikipedia. Again, look at the size of the English Wikipedia circle. It describes a lot of software compared to the number of programming languages.
05:24
Same for the labels on the Wikidata side, et cetera. But that's not enough. We also have operating systems and web services, because we are shifting from programming on our desktops and PCs to web services as well. So these things also have to be described
05:41
on Wikidata, and there is a lot of work to be done towards that goal. Shifting now: we see that Wikidata is a cross-domain knowledge base. How could we use this for digital preservation? Some of the organizations that we're working with are the Open Preservation Foundation,
06:02
which is working on providing open source tooling for the work of digital preservation; Software Heritage, which is an international program to archive software source code; and then the project that the work we'll be describing today comes from, which is called EaaSI,
06:22
which stands for Emulation-as-a-Service Infrastructure. This project provides pre-configured emulated computing environments for legacy software for which Yale University Library can no longer provide the relevant computing environment
06:41
physically on campus, so we provide it in a pre-configured emulated environment. And in order to do this, we need a lot of metadata about all of the different components that go into these pre-configured environments, and so we have decided to create all of that metadata as linked open data in Wikidata.
07:03
So what is Wikidata? A quick introduction. It started in 2012. What is interesting is that it's a free, open, linked, structured, collaborative, and multilingual knowledge base. But what I like most about Wikidata is the shift away from
07:20
the multi-subdomain, multilingual Wikipedia sites. Everybody comes to a single website called Wikidata, collaborates on it, and translates on the same website. Some years ago we had en.wikipedia.org, fr.wikipedia.org, de.wikipedia.org, and now we are moving to www.wikidata.org, where we are working to create a multidomain
07:44
knowledge base. So it's a collaborative, multilingual, multidomain ontology website. Why did I show those Wikipedia infoboxes on my first slides? Because currently most of our community is trying to import data that has already been written in the Wikipedia infoboxes in different languages.
08:04
There is software that has been described only in French, but not in English, and vice versa. And if you take other languages, some software is described in only one particular human language. So the current goal is to obtain all this data and put it on Wikidata, but
08:22
the future goal is that, if Wikidata can become a single source, you could have all up-to-date information on Wikidata, and the infoboxes in all languages would take their data from Wikidata, so you would have up-to-date information everywhere. You would not have to worry whether an article has become stale or not.
08:44
So what happened in the past? In the past, individual collaborators worked on individual languages, and they proposed the properties for those infoboxes. So here you see paradigm, designed by, developer, typing discipline, et cetera.
09:02
But if you go to the Italian Wikipedia, you have a field called "Utilizza" for a software, which is not used in the English Wikipedia. But now, with Wikidata, people are coming to Wikidata and proposing the different ways to describe the different programming languages or software. So here is an example of Python.
09:24
I just show labels in English, French, Spanish, and German. So you see the label, the description, and also the aka, "also known as", et cetera. People propose the properties, and you now have some of the properties that are used to describe Python. What I like, in the end, is "influenced by", because you can also get a
09:45
timeline or an influence graph of different programming languages. Here is an example. For each property, you have the label and the description, for example a property like P17 ("country"),
10:00
which can be used to describe, say, in which country something is located. So how are Wikidata properties created? Again, it's a collaborative website. That means community members propose properties, the proposals are put up for discussion and voting, and then the properties are created.
10:23
The proposers can also translate a property at proposal time, and after the property has been created, it can be translated further and then used. But don't think that the properties are there forever. A property can be deleted; there is always something called a proposal for deletion,
10:45
which is again discussed and voted on, and then the property is deleted. So this can happen. But the next question is: there are so many properties, around 5,000 properties right now, so how would I know which property to use? I'm working on programming languages, I'm working on software, or you're working on
11:04
archaeological websites, et cetera; how do I know which properties to use? For everything, there are communities behind a particular topic, and they have created WikiProjects. So here is an example of a WikiProject called WikiProject Informatics/Programming Languages, and if you come to this project, you will get details of the properties that are
11:25
required to describe a programming language, et cetera. This is very interesting, because finding them is very difficult. As a newcomer, when I started working on Wikidata, I had a lot of difficulty finding the right properties.
11:40
So here you can find properties that have been proposed, like Linux package, creator, developer, instance of, et cetera. Another interesting aspect is that, thanks to the structured knowledge base, you can also use Histropedia to understand the timeline of programming. So you know Swift, which is a recent programming language by Apple:
12:05
you find it in the list, and you can get the timeline: what were the languages, what were the popular paradigms in different periods of time? All thanks to tools like this. To give you a little bit of an overview of the status of software data in Wikidata right now,
12:23
there are currently more than 85,000 instances of software or a subclass of software in Wikidata, and this is really many different types of software. This is research software, commercial desktop applications, a lot of description of free and open source software, many different kinds. And
12:46
this is a bubble chart visualization showing licenses that have been approved by the Free Software Foundation, where the bubble size indicates how many software items currently described in Wikidata are
13:01
available under each of these licenses. And this is an example of how external IDs from some of our institutions are being used in combination with software data in Wikidata. So these are Unix utilities that have their own identifier in the Library of Congress
13:23
authority file or in the GND. Here we show software titles, as described in Wikidata, that members of the German Research Network have developed. And
13:40
this is a graph visualization generated by the Wikidata Query Service SPARQL endpoint, showing, for file formats that have a Library of Congress file format identifier, all the other external identifiers that are also used to describe those file formats.
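A query in the spirit of that visualization can be written by asking for every external-identifier statement on file formats that carry a Library of Congress identifier. This is a sketch, not the query used for the slide; P3266 (Library of Congress Format Description Document ID) is my assumption for the identifier property and should be verified on wikidata.org.

```python
# Sketch: for file formats with a Library of Congress format identifier,
# list all other external identifiers describing them.
# Assumption (not from the talk): P3266 = LoC Format Description Document ID.
from collections import defaultdict

QUERY = """
SELECT ?formatLabel ?propLabel ?idValue WHERE {
  ?format wdt:P3266 ?locId .            # has a LoC format identifier
  ?format ?direct ?idValue .
  ?prop wikibase:directClaim ?direct ;  # map statement back to its property
        wikibase:propertyType wikibase:ExternalId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def group_by_format(rows):
    """Group (format, property, value) result rows by file format."""
    grouped = defaultdict(list)
    for fmt, prop, value in rows:
        grouped[fmt].append((prop, value))
    return dict(grouped)
```

Grouping the flat result rows per format reproduces the one-node-per-format shape of the graph visualization.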
14:01
So what we're trying to show here is that if our institutions are willing to put their data into Wikidata, we can then combine our own data with all of the other data that's in Wikidata to learn even more about the resources that we're already describing. And our specific project
14:24
that's part of Emulation as a Service infrastructure is the Wikidata for Digital Preservation Portal. So this is a portal that sits on top of Wikidata and it is directly inspired by another portal called WikiGenomes and
14:41
the Wikidata for Digital Preservation Portal provides a streamlined interface specialized for people in the digital preservation domain. And we also have a few specialized searches that are, we hope, useful for people in digital preservation. You can search for file formats by PUID, by MIME type, other things like that.
15:03
And so this is collaborative work between our group and also the Open Preservation Foundation and we had the idea to make a portal on top of Wikidata because of the success that we saw with WikiGenomes. And so we were hoping that we could support the use case of people who have domain expertise in digital preservation
15:26
but may not be interested in joining the Wikidata community or learning all of the ins and outs of the Wikidata data model. We wanted to present them with an interface that has a property checklist of the most relevant properties for software and file formats.
15:44
And we wanted to do this so that people wouldn't be overwhelmed by the large number of properties available in Wikidata. The way we built this: it is a Flask app, powered by Wikidata Integrator, which is a tool that you can use to write bots
16:05
for Wikidata or also to get data in and out of Wikidata quickly. And then we also combine that with the MediaWiki API. This is a screenshot of the interface. You can view the project at wikidp.org.
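The portal's combination of Wikidata data with the MediaWiki API can be illustrated with a small helper that fetches an item via the standard `wbgetentities` action and extracts its multilingual labels. This is a sketch of the general pattern only, not code from wikidp.org.

```python
# Sketch: fetch a Wikidata item over the MediaWiki API and pull out
# its multilingual labels. Illustrative only; not the portal's actual code.
import json
import urllib.parse
import urllib.request

API = "https://www.wikidata.org/w/api.php"

def entity_url(qid: str) -> str:
    """Build a wbgetentities request URL for one item."""
    params = {"action": "wbgetentities", "ids": qid,
              "props": "labels|descriptions", "format": "json"}
    return API + "?" + urllib.parse.urlencode(params)

def fetch_entity(qid: str) -> dict:
    """Retrieve the raw JSON response for one item."""
    with urllib.request.urlopen(entity_url(qid)) as resp:
        return json.load(resp)

def extract_labels(response: dict, qid: str) -> dict:
    """Map language code -> label text from a wbgetentities response."""
    labels = response["entities"][qid]["labels"]
    return {lang: entry["value"] for lang, entry in labels.items()}
```

A streamlined portal interface can then render only a curated checklist of properties from the `claims` of such a response, instead of everything Wikidata holds.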
16:28
So we would like to finish by talking about a tool called WDProp, which I've been working on. This is something to understand how properties are being created. Is it really a multilingual approach?
16:44
If yes, how many languages are covered? This tool helps you get real-time statistics on how the properties are created: how many have been created, how many have been translated, and what the path of translation is.
17:00
Et cetera, et cetera. So this is a screenshot; you can check it out, I will put a link at the end. So, yeah, you can get real-time translation statistics for all the languages supported by Wikidata, and you can compare whether your language is performing better than English, for example.
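Translation statistics of this kind can be approximated with a single SPARQL query that counts property labels per language. A sketch under that assumption follows; the real WDProp tool is more elaborate, and over the full property set such a query may be slow without a `LIMIT`.

```python
# Sketch: count how many Wikidata properties have a label in each language,
# approximating WDProp-style translation statistics.
QUERY = """
SELECT ?lang (COUNT(?property) AS ?labelled) WHERE {
  ?property a wikibase:Property ;
            rdfs:label ?label .
  BIND(LANG(?label) AS ?lang)   # language code of each label
}
GROUP BY ?lang
ORDER BY DESC(?labelled)
"""

def coverage(labelled: int, total_properties: int) -> float:
    """Share of properties translated into a language, in percent."""
    if total_properties <= 0:
        raise ValueError("total_properties must be positive")
    return 100.0 * labelled / total_properties
```

With roughly 5,000 properties, a language holding 4,000 labels would cover 80% of them, which is the kind of per-language comparison the talk describes.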
17:24
And you can also check what the available properties for an entity are. Again, this tool uses the Wikidata SPARQL endpoint and the MediaWiki API. So finally, we would like to conclude: we have to think beyond preserving cultural heritage.
17:44
We have to think about preserving digital heritage. This is very important, because otherwise future generations will not know what the prominent languages of a particular time were. And with projects like Wikidata, we can
18:01
store such information for future use. And you have all the tools to build very cool APIs, I think. Thank you. Thank you very much.
18:20
Okay, we just put some links to tools and projects, if you wish, and references in the slides. Okay, thank you very much. Do we have any questions or comments for the speakers? I'll ask a question. As someone who works with Wikidata, I get asked this question all the time about
18:46
the value of the metadata there and fears about vandalism or incorrect changes or perhaps removal of properties that you're relying on. So can you express a little bit about how you've approached that?
19:04
So one thing the tool I was working on, WDProp, does is track that type of vandalism. For example, we looked at P856, whose English label is "official website". And if you check the
19:24
historical details of P856, it has been vandalized many times. Thanks to tools like WDProp, you can find what happened and during which period a particular item or property was vandalized. Since everything is open, you have the
19:42
track record of whatever has been done, so it is very easy to find out what has happened. Okay, any other questions, comments?
20:02
I wonder, I know you are active as individuals in Wikidata. I wonder if there has been feedback about your project from the Wikidata community. Have people joined in? How did it go?
20:22
If I talk about the programming languages and software: I usually add a lot of new items, and I see afterwards that new items have been created or new properties have been added. So I feel people are seeing this, and new
20:41
software and programming languages are being added. There's a tracking list for that purpose as well. I find that often people are saying: why isn't my thing there? Why isn't this thing I care about there? And so something wonderful about sharing queries of Wikidata, even if they're showing
21:00
current gaps and what's missing, sometimes that can really inspire people to say, oh, let me go add that. I would like to see this be more complete. Okay. Still, I think we have room for one question if you like. Okay.
21:28
I hesitate to ask only because the answer might be a really long one, but could you comment for a minute about your use of the data in Wikipedia as part of your emulation environment? I think I sort of sensed a connection there, but you didn't get to explain it much.
21:44
Yes, thank you for asking. So the Open Preservation Foundation provides characterization tools for files. Let's say that we have an unknown image; we characterize it and find out what file format it is. And then we look up in Wikidata
22:08
what software titles are known to be able to read that file format. And we use that to make a recommendation about which of the many thousands of pre-configured emulated environments we should show to the user in the interface for their final selection.
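The lookup step described here can be sketched as a query for software items that declare the identified format as a readable file format. P1072 ("readable file format") is my assumption for the relevant Wikidata property, and the actual EaaSI recommendation logic is certainly more involved than this.

```python
# Sketch: find software known to read a given file format.
# Assumption (not from the talk): P1072 = readable file format,
# a statement placed on software items in Wikidata.
QUERY_TEMPLATE = """
SELECT ?software ?softwareLabel WHERE {{
  ?software wdt:P1072 wd:{qid} .   # readable file format: the given format
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""

def build_query(format_qid: str) -> str:
    """Fill the template with a file-format QID, validating its shape."""
    if not (format_qid.startswith("Q") and format_qid[1:].isdigit()):
        raise ValueError("expected a Wikidata QID such as 'Q42332'")
    return QUERY_TEMPLATE.format(qid=format_qid)
```

For example, `build_query("Q123")` yields a query containing `wd:Q123` that the Wikidata Query Service can answer; the result list of software titles would then be matched against the available emulated environments.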
22:28
Okay. Is this working? Okay. Thank you again, and then we will move on.