Good citizenship and bad metadata: Wikidata at the National Library of Israel

License (CC Attribution 3.0 Unported): You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Transcript: English (auto-generated)
Hi, nice to see you. Nice to see you again. Yeah, so do you have any slides that you... Let me share the screen. So I'm looking forward to your presentation. It's been a year since we were talking at WikidataCon about some of the work you were doing in the National Library of Israel,
and I've been following it since. It seems like you've been making good progress, so I look forward to hearing about it. Thank God. Okay, you're seeing my slides, right? Because it's taking over my whole screen. I'm going to remove myself and leave the floor to you.
Okay, thank you. Hi, I'm Ahava Cohen. I'm the head of Hebrew cataloging at the National Library of Israel, and I teach cataloging and classification in the David Julian College MLIS program. I'm also the backup representative from Europe to the RDA Steering Committee,
with areas of interest in multilingualism and internationalization of metadata. So it's obvious why when National Library of Israel started to think about Wikidata projects, they turned to me and asked me to head up the project. Our road to multiple Wikidata projects has been a long and complicated journey.
The National Library of Israel has been involved with Wikipedia for quite a while. We've been running monthly edit-a-thons for almost 10 years. But thanks to a $54,000 grant, the National Library of Israel declared that 2020 would be our Year of Wiki at the library.
We got off to a running start, uploading photos from the Dan Hadani Collection to Wikimedia Commons, releasing one of our reference librarians to serve as a half-time Wikipedian in residence, running Wikipedia training for public school and academic librarians (of course, this was before all the shutdowns),
and opening up an Israeli wiki-libraries Facebook group. Our crowning achievement for 2020 was coming in as the top country in the world in the #1Lib1Ref campaign. While we had all this experience with Wikipedia, we were just beginning to understand the power of Wikidata.
Looking back, I wonder if it took us so long, because our Wikipedia work was the province of the reference librarians, and they were already used to packaging information for ease of use by patrons. Our Wikidata projects began when we brought on board a cataloger who felt far more comfortable
with creating discrete bits of metadata than with writing whole sentences inside Hebrew Wikipedia. Our first big Wikidata project stemmed from a local problem. We switched from Aleph to Alma as our integrated library system, and that meant that the multiple authority databases which we had managed had to either be combined,
moved to an external system, or discarded altogether. The Israeli publisher's database had been built up over the course of several years by our legal deposit staff, and it included information about publishers, their contact information, their history, and rights management.
After a reorganization of the department in 2016, the database was no longer in use, and it was no longer maintained, but when we discussed shutting it down and erasing the metadata as part of our move to Alma, we found out that there were researchers who had been using that information, and they protested the closure.
After we looked into alternatives, we realized that the information we had could be of use to Wikidata. We had high quality authoritative information about over 1,700 Israeli publishers.
Most of them were not yet represented in Wikidata. Any attempt in the future to make a large-scale Wikidata project for Hebrew bibliography would require the information that we were in danger of erasing. So we had to be good citizens, step up to the plate, and donate our information.
And so the Wikidata Israeli publishers database project at the National Library of Israel was born. As I said, our database was built for internal use. Since we used Ex Libris' Aleph for our acquisitions work, the information was stored as MARC data, but nearly every field was localized.
Even if we had found the perfect crosswalk from MARC authority data to Wikidata, it wouldn't have been of any help in this database. It's not that the metadata was bad, per se. It was perfect for its intended use.
The problem is that our Wikidata project wanted to make the metadata that had been gathered serve a different purpose. And to do that, we needed to get the information, both what was encoded and what was stored as narrative, into a standardized format.
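As a rough illustration of that step (a minimal sketch, not the library's actual export script): records like these could be read with a MARC library such as pymarc and flattened into a table for later cleanup. The field and subfield tags below are invented stand-ins for the localized fields.

```python
# Sketch: flatten localized MARC authority records into a CSV for cleanup.
# Assumes a binary MARC export in "publishers.mrc"; the 951 tag and its
# subfield codes are hypothetical examples of locally defined fields.
import csv
from pymarc import MARCReader

def first_subfield(record, tag, code):
    """Return the first value of a subfield, or '' if the field is absent."""
    for field in record.get_fields(tag):
        values = field.get_subfields(code)
        if values:
            return values[0]
    return ""

with open("publishers.mrc", "rb") as marc, \
     open("publishers_raw.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["name", "address", "phone"])
    for record in MARCReader(marc):
        writer.writerow([
            first_subfield(record, "110", "a"),   # corporate name
            first_subfield(record, "951", "a"),   # local address field (invented tag)
            first_subfield(record, "951", "b"),   # local phone field (invented tag)
        ])
```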
Even little things like consistency in the metadata were lacking. Sometimes an address would have "P.O. Box", sometimes "P.O Box", sometimes "P O Box". Sometimes the area codes came before the phone number, sometimes after, and sometimes there weren't any area codes at all.
It all had to be cleaned up before we felt comfortable uploading the metadata to Wikidata. Another issue was, since this was an internal database, public and private information was all mixed in together. Before we could upload any of the information we had, in keeping with National Library policy,
we had to make sure we cleaned the database of people's personal addresses and telephone numbers, leaving only the corporate information to be shared with the world. Before we closed down Aleph, we saved the database as an Excel file and began to normalize the metadata.
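To make that cleanup concrete, here is a minimal sketch of the kind of normalization involved, assuming free-text address and phone columns and a known default area code; the column handling and the default are assumptions, not the library's actual rules.

```python
# Sketch: normalize the inconsistencies described above before upload.
import re

def normalize_po_box(address: str) -> str:
    # Collapse "P.O. Box", "P.O Box", "P O Box", "PO Box", ... into one spelling.
    return re.sub(r"\bP\.?\s*O\.?\s*Box\b", "P.O. Box", address, flags=re.IGNORECASE)

def normalize_phone(phone: str, default_area_code: str = "02") -> str:
    digits = re.sub(r"\D", "", phone)        # strip spaces, hyphens, parentheses
    if len(digits) == 7:                     # no area code recorded at all
        digits = default_area_code + digits  # assumption: a known default for the publisher's city
    return f"{digits[:-7]}-{digits[-7:]}"    # area code, hyphen, seven-digit local number

print(normalize_po_box("P O box 1234, Jerusalem"))  # -> P.O. Box 1234, Jerusalem
print(normalize_phone("561 1234"))                   # -> 02-5611234
```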
Now, those of you who've done batch uploads to Wikidata are probably groaning here. Why did we use Excel? It's a good question. The answer, of course, is that we just didn't know any better. But we did try to learn. We started to teach ourselves OpenRefine.
Of course, that right away limited our options because the people doing the learning had to be able to follow tutorials that were out there, and those were mainly in English. So we started out the project with three strikes against us. The people working on the projects had to understand how the database was constructed. They had to understand how to map the database to Wikidata.
And if we wanted them to work in OpenRefine, they had to be able to understand the tutorials. We were lucky to get a grant from Wikimedia Israel to attend WikidataCon 2019, where Simon gave me an amazing workshop on OpenRefine.
We learned the basics of OpenRefine, but left Berlin with a lot of questions about working with multilingual information. And that's a crucial area for us because we work in four languages of cataloging, Hebrew, English, Arabic, and Russian. Another area in which we had problems was group work.
And here I'm going to make myself sound a little bit dim. Because of all the languages, we had to pass the project from department to department. The same people who know Russian don't necessarily know Arabic. We didn't realize that we could pass the project back and forth by exporting and importing it as an OpenRefine project.
So we kept taking it over to Excel, having to rejig some of the metadata, bringing it back into OpenRefine, and we made a huge mess. But we've learned better now.
Another problem was that some of the publishers already exist in Wikidata, but they're not standardized there either. When our student workers started to reconcile the metadata using the names we had in our database, they often missed existing entries because, for example, sometimes the term "publisher" had been added by a previous Wikidata editor
before the name of the publisher, sometimes after it, sometimes as a qualifier in parentheses, and sometimes the word "publications" was used instead. You know what? It's easy to think about all the problems we faced. But what's of help to the community is our successes. To date, we've uploaded more than 500 of the 1,700 publishers,
and a Wikidata editor who's also a librarian in another Israeli institution has been helping us merge the items which we created without realizing that other items for that same publisher existed.
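The matching problem itself is fairly mechanical, and a small amount of label normalization before reconciliation catches many of those duplicates. A minimal sketch (the publisher name and the list of noise words are illustrative, and the Hebrew equivalents would need the same treatment):

```python
# Sketch: collapse label variants so "Publisher X", "X Publisher", "X (publisher)"
# and "X Publications" all produce the same matching key before reconciliation.
import re

NOISE = re.compile(r"\b(publishers?|publishing|publications)\b|\(.*?\)", re.IGNORECASE)

def match_key(label: str) -> str:
    cleaned = NOISE.sub(" ", label)
    return re.sub(r"\s+", " ", cleaned).strip().lower()

variants = ["Keter Publisher", "Publisher Keter", "Keter (publisher)", "Keter Publications"]
print({match_key(v) for v in variants})   # -> {'keter'}: one key for all four variants
```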
With one successful project underway, we're beginning to work on the next so that mapping and planning can be finished when the staff that's uploading the publishers is ready to upload the next project. The Anonymous Jewish Classics project began as part of a hallway conversation
at the International Federation of Library Associations World Library and Information Congress held last summer in Athens. The IFLA cataloging section already publicizes lists of authorized access points representing works for a number of cultures. We agreed that there's a lot of literature discussing Jewish religious works
and many, many, many editions of those works, but libraries across the world use different terms to describe the resources. At first, we thought we would just make a list of what we in Israel call those works and what they're called in Library of Congress authority files.
And then we thought, you know what? Let's add in the Wikidata Q number. National Library of Israel authority librarians had already been in contact with Wikidata editors who were working on what we call the canon set of Jewish religious texts, so we knew there was a certain amount of coverage in Wikidata already.
We compiled a list of 535 anonymous Jewish classics, which we had had in another database from Aleph, which we had closed and folded into our regular authority file. It was called the database of uniform headings in Judaica.
And we began finding the relevant identifiers used by us, by the Library of Congress, by Wikidata, and by VIAF. When we heard that the PCC, the Program for Cooperative Cataloging, which is an international cooperative effort aimed at expanding access to library collections
by providing useful, timely and cost-effective cataloging that meets mutually accepted standards of libraries around the world, when we found out that they were launching a pilot project, which really hit a lot of the areas of our concerns, like production rates and how to document sources for data elements
and what meets the notability requirements, well, we jumped on it and we applied and we were accepted to the pilot project. This way we can take a project that we've already decided we were going to do and use it as a learning opportunity. In addition, because the PCC pilot exposes our project to other libraries
right from the planning stage, we've had expressions of interest from abroad. Instead of doing all the work in-house, we hope to cooperate with librarians across the world. Beyond our two major projects, we started creating ongoing workflows
which bear Wikidata in mind. We've begun on a select basis to enrich Wikidata from our authority records, particularly with information about related institutions and works and with identifiers. This is something that I worked on this morning because I happened to have been doing his authority record.
When we start authority projects, such as our current push to create authority records for printers who produced Hebrew texts prior to 1830, we build the workflow with Wikidata already in the package so that it doesn't become an added burden.
When we work within our own authority files, we've started to enhance them with Wikidata Q numbers. In the future, we hope to do a project where we'll pull out of Wikidata all the items which have our identifier (we actually have two identifiers in Wikidata, one for our old system and one for our Alma system) and then plant the Q numbers, in an automatic process, within our authority files.
We also work with external projects like the Ben Yehuda Project, which aims at making out-of-copyright Hebrew literature available for free online.
They have authority identifiers. Their names don't necessarily match the forms that are available at the National Library of Israel, VIAF, Wikidata, and we're helping them normalize their data so that they can pull information in from various sources
as the beginning of helping them work with linked open data. We also work with outside researchers. We have a project going with Yossi Galron, who's the founder of the Lexicon of Modern Hebrew Literature, in which he's improving his lexicon
by cross-checking his names with our authority database, Library of Congress, VIAF, and Wikidata, and he's helping us improve our authority files by giving us information we didn't have. We also give him our information so everyone cross-fertilizes everyone else.
And of course we're taking care of our own. We're working to improve the structured metadata and the links to Wikidata in the 24,000 photos that we've uploaded to Wikimedia Commons. When we have a good workflow built, then we will try to scale it up as we upload more and more pictures to Wikimedia Commons.
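As a sketch of the identifier round trip mentioned a moment ago (pulling every item that carries one of our identifiers out of Wikidata so the Q numbers can be planted back into the authority file), a query along these lines against the Wikidata Query Service would do it. The two property numbers are my assumption of the National Library of Israel identifier properties, so they should be checked before relying on this.

```python
# Sketch: list Wikidata items carrying a National Library of Israel identifier.
# P949 (old system ID) and P8189 (current system ID) are assumed property numbers.
import requests

QUERY = """
SELECT ?item ?nli WHERE {
  VALUES ?prop { wdt:P949 wdt:P8189 }   # assumed NLI identifier properties
  ?item ?prop ?nli .
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "nli-authority-sync-sketch/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row["nli"]["value"])   # item URI and the NLI record number
```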
As I said at the start of my presentation, in some ways the global pandemic hurt our Year of Wiki plans. I think it hurt everyone's plans this year, but on the Wikidata side, it really only did us good. Because we needed to find productive work for public services staff and for our student workers,
while the library was closed to visitors, we had the opportunity to train people from outside technical services to do the normalization work and the uploading. That created cross-divisional teams which allowed technical services to explain the complexities of metadata management.
When we started the Israeli Publishers Project, public services thought it would take a few weeks, maybe a month or two to upload the data. Now that we've trained them to work with the data alongside us, they realize how complex taking non-normative data and putting it into Wikidata is.
That means that they have a greater respect for what technical services does, and that's going to be of a lot of use in future cooperative projects, both within the Wikisphere and within our regular everyday library work. Also, working outside the hardcore of metadata librarians
and teaching people who didn't know about the work that we do gave us a chance to create documentation in Hebrew, which was sorely lacking. We identified knowledge and understanding of how to work with OpenRefine as a crucial component of our efforts, and the fact that all sorts of things had to go online
meant that we could take part in training sessions from places that we would never have been given the budget to travel to. We did two intensive sessions on OpenRefine with Monash University's Malaysia branch. There's a great series of OpenRefine and Wikidata workshops coming out of the University of Ottawa.
We would never have been able to participate in any of this stuff in a regular year; because of COVID-19, we're there. Our authority file is named Mazal, Ma'agar Zehuyot Le'umi, the Israel National Authority database,
but the word Mazal is also the Hebrew word for luck. Our plans for future Wikidata projects based on Mazal are more hard work than they are luck. We've set goals on the operational side.
We want to upload large parts of our authority database of Mazal based on our ability to add multilingualism to Wikidata entities, to add in other language forms, to add in other connections, for people, for places, for corporate bodies, for families.
We want to upload multiple IDs. In our own authority database, we have our IDs, Library of Congress IDs, VIAF IDs, rights management organization IDs, and, after a national push, we've added ORCID iDs for thousands of Israeli lecturers.
Pretty much every current university lecturer in Israel has an ORCID iD, and that ORCID is in their authority file within Mazal, and we can contribute that to Wikidata. Because we have a history of working with people and with their heirs when relevant,
and negotiating what information we're going to make public within our authority database and what information we won't record so that it stays private, we can also make significant contributions to the discussion of ethical use of personal information within the wiki world, and we're really looking forward to those conversations.
On the aspirational side, our goals include strengthening our reputation as an authoritative source of information, particularly about Israel and Jewish topics, and strengthening our partnerships within the library community, within the wiki community,
and especially with the librarians who also are doing Wikidata work. We also want to serve as a support to projects like WikiCite. Since all of our Wikidata projects are based around bibliographic and authority information, we think we can make a significant contribution, especially given our multilingual authority records
and the multiple IDs that we can contribute, which will really help with author disambiguation. Once we get our authority records into Wikidata, we can then start contributing things like article databases,
particularly Rambi, which is a database of articles about Jewish studies, and that would really boost that part of Wikidata and WikiCite, which at this point isn't sufficiently developed. What do we need to get there from here?
We have several concerns which we need to discuss in-house and with the wider community, which are also probably concerns of other mid-sized GLAM institutions. Our problems are multiplied by the mix of languages and cultures represented in our collections,
and if we can solve our problems, we become a model for non-Anglophone and non-Western GLAMs. One problem is customization versus standardization. Locally, over the nearly 130 years of the library's existence, we've created a number of databases for in-house use.
These databases, as you saw earlier, don't conform to international standards, and we need to decide how much to invest in bringing that metadata up to code so that it can be contributed. We might have jumped in over our heads in choosing the Israeli publisher's database as our first large-scale project.
We took a database, which in terms of standardized metadata was in really sad shape, and in terms of ethics had never been explored. Over the course of the year, we had to sell the project over and over because it never ran smoothly. Had we chosen a database which was more standardized in the metadata structure,
we might have had faster wins and so gotten easier buy-in from other departments. We have to sell it also by thinking about how much the information is worth to the wider world. Will its usefulness to others make it worth our investment in manpower?
Can we sell it as being good citizenship? But our main problem is how do we sell this investment to our management and to grant-giving institutions? Unlike Wikipedia and Wikimedia Commons, Wikidata doesn't directly drive traffic back to our catalog or website, so it's harder to provide numbers as return on investment.
On the Wikidata side, there are still issues of multiculturalism. Some of the metadata that we want to contribute, particularly about classical Jewish and Islamic authors, doesn't meet Wikidata's expectations, such as the use of culturally linked calendars.
Because there isn't a one-to-one correspondence between the calendars, say between the Jewish calendar and the Christian, or secular, calendar, we either have to try to ascertain the hour of birth to do the conversion, or give up on the granularity of the exact day of birth so that we can convert the date to a format that Wikidata accepts. There are still cultural issues involved here in what Wikidata will accept as metadata.
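A worked example of that granularity problem, using the convertdate library as one possible tool (the tooling is an assumption, not what the library actually uses):

```python
# Sketch: Hebrew/Gregorian conversion and why the hour of birth matters.
from convertdate import hebrew

# Take a civil date and find its Hebrew calendar equivalent, then convert back.
h_year, h_month, h_day = hebrew.from_gregorian(1948, 5, 14)
print(h_year, h_month, h_day)                        # Hebrew date of 14 May 1948 (daytime)
print(hebrew.to_gregorian(h_year, h_month, h_day))   # -> (1948, 5, 14)

# The catch: a Hebrew day runs from sunset to sunset, so a birth on the evening of
# 13 May 1948 carries the *same* Hebrew date. A source that records only the Hebrew
# day therefore maps to parts of two civil days, and without the hour of birth we can
# either assert day precision with a caveat or fall back to month precision.
```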
Our ability to convert the data from our cultures into what's acceptable to Wikidata will have implications for scalability, which is our next long-term concern. We've invested a lot of time and money in uploading the Israeli publishers database, and that's an investment we can't sustain over the other projects we would like to do, which are much bigger than the Israeli publishers database.
No matter how much we want to help the community, we still need to run the library. Future projects which are based on more standardized metadata will require less investment, but it's still really time consuming, particularly because OpenRefine Reconciliation
doesn't handle multiple languages in a single search very gracefully. What other GLAMs can do in one pass, we have to do in four or more passes, which takes a lot of time. We have to find a way to scale our work so that we don't need to find investors and champions for each and every future project.
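To illustrate why it is four or more passes rather than one: the Wikidata reconciliation service that OpenRefine talks to is addressed per language, so the same list has to be sent once per cataloging language. The endpoint pattern below is an assumption based on the public wikidata.reconci.link service, and the example label is invented.

```python
# Sketch: reconcile one label against Wikidata once per cataloging language.
import json
import requests

LANGS = ["he", "en", "ar", "ru"]          # our four languages of cataloging
label = "Keter"                           # example publisher label

for lang in LANGS:
    endpoint = f"https://wikidata.reconci.link/{lang}/api"   # assumed endpoint pattern
    queries = {"q0": {"query": label}}
    resp = requests.get(endpoint, params={"queries": json.dumps(queries)})
    candidates = resp.json()["q0"]["result"]
    print(lang, [(c["id"], c["name"], c["score"]) for c in candidates[:3]])
```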
Our last concern is reuse. The community will continue to find innovative ways of using the metadata we contribute, but we also need to find ways of taking the content that we're contributing to Wikidata and reuse it so that we can bring traffic to our catalog and to our website,
as opposed to the Vanderbilt project, where it was "let's think about what we can contribute to Wikidata." We understand what we're contributing to Wikidata, but for funding reasons, we also have to show what we're contributing to the National Library of Israel. We're currently looking at ways to reuse the publisher data
in support of both open library initiatives and to help digital humanities scholars. We suspect that as we add more metadata into Wikidata, the opportunities for reuse will grow, but any such reuse will always require further investment of time and manpower,
leading us back to the scalability issue. Is it worth all the work and all the salesmanship? I think so. I'm pretty sure everyone watching this does, but we still haven't fully convinced the people who make the funding decisions, and I think that, as much as anything else, is a task for those who are working at the intersection of libraries and Wikimedia.
I would be thrilled to continue the conversation, both over the course of the rest of this session in the Etherpad and beyond, on most social media. I go by my personal name; drop me an email, drop me a tweet,
hit me up on my user page on Telegram. I would love to continue the conversation with you. Great presentation. I really love the fact that you've made the most of COVID
and the opportunities to work with people all around the world whose events you wouldn't normally be able to participate in. It is a small positive out of quite a strange year, I would say. The other thing that really resonates with me is
you're talking about personal data and personal information in Wikimedia, and we were talking about that on Monday. I don't think anyone's got the answers about what the appropriate amount of information is that we should be holding about people. A question I was asked was about researchers,
whether we should be storing their email, for example, and just how much information we should have about people. So, if you're having those conversations, I'd very much like to join you. I know a lot of other people are interested in those topics as well. One of the advantages of Israel being a low power-distance culture is that
we don't play six degrees of Kevin Bacon, we play two degrees of Kevin Bacon. Everyone knows someone who's connected to you. So, authors feel no problem in contacting us and saying, you have my year of birth, get that off, I don't like it. Oh, but we need it to disambiguate, but okay, we understand you don't want it,
what can you give me instead? And we're constantly negotiating with authors about the information that's in their authority files. So, by the time we've finished with an authority file, it's information that they've agreed to make public. And we even have an author questionnaire that on top says,
we want you to look over your information because we will be making this available nationally and internationally, so make sure you're comfortable. And if at any point in the future you're not, let us know and we will work with you to get it to a state that you are comfortable with. So, if it's in our database, it's author-approved.
I think you have just answered the first question there, about communicating with researchers so that they're informed that they're going to have their information put into Wikidata. It sounds like you've got an ongoing conversation with people, or certainly channels of communication that are open if people have issues.
Oh yes, definitely. I mean, sometimes it goes to an extreme. I've had authors call me on the day of my daughter's wedding, half an hour before the wedding, when I told her I was kind of busy, she said I could call her back after the ceremony. But yeah, it's a really close communication.
We have another question as well about whether the catalogers were receptive to adding Wikidata into their workflows. Yeah, we had already been pushing Wikipedia, and some of them didn't feel very comfortable with Wikipedia,
but then suddenly we say here's a project that you can work on, you can get the same glory that the Wikipedia people are getting in the library, but it's the kind of stuff you like doing anyway. Did anyone, for example, ask or question whether it's just another identifier or was that actually a selling point to them?
It was a selling point. Because we had found that VIAF had problems disambiguating particularly Asian authors and once we started adding in multiple identifiers into our authority records and adding multiple identifiers from our authority records into Wikidata, we noticed the clustering got better.
So we could sell it. Here we have a proven case, business case, you want your data to work right, add in the identifiers on both sides. I think it's quite interesting to see how quickly VIAF update when Wikidata identifies a merge that's needed.
They're obviously paying a lot of attention to what our human community of editors is doing and I think it's really important that we're able to work together because obviously you can do a lot with machine reading and all clever algorithms but a human will spot things that a computer won't.
So they're able to monitor when Wikidata does identify those mergers, and they seem to be onto it, which is great. I think that's all of the questions. Let me just scan through the Etherpad. Just a comment saying, I love that your authority file is named after luck, which I think is quite charming as well.
It's actually a double thing, because mazal is luck, but also because it's an abbreviation, and in Hebrew abbreviations use quotation marks. So you've got the first letter, and then zal, the abbreviation for a person who's dead, "of blessed memory."
So it's also about the dead people, because that's one of our hobbies as librarians. We like to add in the 046 subfield $g and close off death dates as soon as someone dies. It is nice to have completeness in that.
Probably not so nice in that particular respect, not for the people anyway. That was great. Thank you very much for coming to talk to us today. I really enjoyed your presentation. Thank you for having me. Okay, you're very welcome. I think it's time for Alice now, so I'll switch over the screens.