Snippet - FAIR Findable #1 - Intro to FAIR and F for Findable I: Finding PARADISEC

CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Great, thanks very much, Keith, and I'd like to thank ANDS for the support of PARADISEC over the years.
PARADISEC has been running for some time and, as you can see, it's become a significant collection. We have 31, most likely more than that because it's increasing almost every day, but around 31 terabytes of material representing over 1,100 languages, and, you know, 162,000 files and seven and a half thousand hours of audio. So it's a significant collection and there's a huge management task involved in that, and one of those tasks also is making sure that this material is findable by the people we want to find it. We have a
catalog that we've been working on for a number of years. We've built our own; unfortunately we didn't find one on the shelf that we could use. The catalog allows you to look at material with a geographic point of entry into a faceted search. We have OAI, so Open Archives Initiative, and Dublin Core based metadata. We try to be as
lightweight as possible with the metadata because, in our experience (we're all researchers: I'm a linguist, my colleague Linda Barwick is a musicologist), people just won't enter metadata if it's too complicated. So we've tried to make it as simple as possible and to make the catalog do as much of the work for you as possible, so using controlled vocabularies, doing predictive data entry, and having, you know, a minimal number of fields. As you'll see here, this is a screenshot of the catalog. We have the possibility to make the metadata private, so, as Keith was just saying, FAIR doesn't mean that everything has to be made publicly accessible. If you're constructing a collection you can keep all the metadata private and then publish it when you're ready. You can also assign various kinds of access conditions, including, you know, open subject to normal conditions, or closed subject to whatever conditions you want to specify. Because our project is really focused on language materials from small languages, that is, all of the 7,000 other languages that are out there in the world, we include language identifiers for the subject language and content language of items in the collection, and this is the linchpin that lets us then feed to a number of different harvesting services that I'll show you in a minute. Our online catalogue lets you specify geographic coordinates, which then also allows you to search using that
geographic information. Because of the work we're doing we have lots of connections into the region, in particular the Pacific, and we are actively seeking collections in the Pacific, collections of analog tapes that need to be digitized. You can see the various agencies there that we've collaborated with, and continue to collaborate with, digitizing hundreds of tapes and then putting them into the collection and making them accessible. So
when we talk about findability we can talk about the sort of granularity of finding. We can find collections and we can find items, and we should be able to drill down into the collection to find things that we're particularly interested in. So we can sort of characterize findability on a scale, if you like, from zero to ten. So if we talk about
research materials, primary research materials that people have in their offices or in their homes, typically the findability of those things is about zero. It may be one if your colleagues know that you've done this work and you have these tapes sitting in your office, but a speaker of the language trying to locate recordings that you made with their grandparents is not going to be able to find that material. So from our point of view in PARADISEC, we infer that these records must exist because we know that the research has been done, so we can go looking for it. And then what we can do with that is we
could add records to our catalog pointing to analog materials, and we do this in some instances. We also point at websites that we know exist. So there are some fine websites that have language materials on them, but the website might be transient, and what we then do is point at the Wayback Machine or the Internet Archive entry for that. So here's an example of a text that was produced in Solomon Islands, put online by Project Canterbury, which is an Anglican online archive, but it's a website and there's no guarantee of longevity. So by us putting it into a catalog it then makes it available and findable via the search engines that we'll see in a moment. So then we increase the findability of that to perhaps 3 out of 10, and using the language identifier, so there you can see the three-letter ISO 639-3 code for languages, in this case it's lkn. What
we've also done is provided images of manuscripts. So this is a collection of papers produced by Arthur Capell during his life. He was a professor of linguistics at Sydney University. When he died he left a huge number of papers, which we then digitized. We just set up a camera and took images of all of these papers, and as you can see in the bottom right there, there are a lot of handwritten original manuscripts which are really valuable from a research perspective, but, you know, sitting in a box in his executor's house they're completely unfindable. So we put entries into the catalogue, and we put this through the Heritage data management system to put an HTML framework around it, and you can then find these items and, you know,
resolve to the level of the image. Now, what you can't get to is the transcript of the image, because at the moment all we have is images there. But one of the next things that we do to increase findability is to include transcripts together with recordings. So here's an image from our catalogue, and what we have is time-aligned transcripts of recordings. These allow us to play the recording, and you can imagine, because I won't show it to you, that as the recording plays it scrolls through that transcript. So this is increasing findability significantly: you can resolve down to the level of words and find them in the context of the recording. One of the
other things that we do is we embed some metadata into the header of the WAV files in our collection. We create a Broadcast Wave Format file, which is the European standard for archival formats of audio files, and you can see a little snippet of XML there which is extracted from the catalogue and inserted into the WAV file before it's all sealed up and put into our collection. And we use
persistent identifiers of various kinds. Because the collection started, as I say, 15 years ago, we have an internal persistent identification system, which is a collection followed by an item number. More recently, in the last couple of years, we've put DOIs through the whole collection, so we have DOIs from the level of each file up through items and up to the collection level. You can see also that we have Zotero and Mendeley integration, so that also makes things findable in that people will cite these items using this form and they can click and insert them into their Zotero or Mendeley databases. We have an API. We
have two feeds that we produce so people can link into collections. RIF-CS is at the collection level, and that's what's harvested by Research Data Australia and other services; Trove also harvests that material. And the OAI-PMH feed is primarily targeted at the Open Language Archives Community. So linguists have been very good at setting up services based around these language identifiers, and the OLAC page allows you then to look at all the material that's produced by any one of their 60 member archives for any given language. So it's a fantastic resource for finding information about the world's languages. And if we update an item in our catalog then the nightly harvest from OLAC will update that OLAC record the next day. So as you can see, Research Data
Australia takes the feed and presents it in interesting ways. So the benefit for us is not only that our material is more findable, but some of these services present the information in our catalog in ways that we don't, so you can do faceted searches in some of these services, and it also links into all kinds of other services and data providers that allow you then to do interesting new searches. There's the
Open Language Archives Community page. They have a faceted search on the right, and a whole lot of services that they provide are advertised on the left there. If you're interested in languages at all, they're really the one-stop shop for finding information about what's what in any archive in the world in their harvested system. This is the Virtual
Language Observatory, which is a European service funded by CLARIN in Europe. They also take our feed, and you can see that you can search the collection through that service as well. And WorldCat, the
international catalogue of all libraries, also takes our feed. So that's sort of on
the big-picture side of it, the international search engines. On the other side, the people that we want to find this material are in the Pacific, and we've been working very hard to get material available in forms that can be accessed by people in the Pacific. On the top right there's a really interesting little project that was run in Madang, where they took recordings and played them at a local market and asked people in the market to comment on the recordings, perhaps enrich the metadata in that way. They then sent that back to us in a spreadsheet, which we were able to record into the catalogue. At the bottom you can see a speaker of one of the languages who happened into my office in Melbourne and went through the collection and found his grandfather speaking, and he was quite amazed by that. So there's an example of how unfindable, I suppose, the material can be, that he had to come into my office to find it. And that's one of our big problems: how to make the material in our catalog accessible to people who aren't perhaps always looking around on the web, because they just don't expect to find material in their language. On the left there's a man who's working in our office in Sydney. This was an ANDS-funded project to enrich the PNG metadata in our collections, and he's going through listening to material and adding metadata where he can. So one of the
other ways that we're promoting the collection is by building a virtual reality project. So what you're looking
at there is a map of Vanuatu, and each of those shards of light coming up represents a language. Where there's a little symbol there you can listen to a snippet of the language, which comes out of the PARADISEC collection, and you can see some information about how much we know about the language, whether there's a grammar, if there's a lexicon and so on, and how many speakers there are of that particular language. Now this is generating a lot of publicity: as you can see on the right there's an article from the Papua New Guinea Post-Courier, and on the bottom right there's an article written about this in Pursuit, from Melbourne University, and so on. Getting this publicity is important exactly so that people will then go and look in the catalog to find information, or think about collections that they have that need to be digitized. So it's an investment of time and effort to build the virtual reality, but it's captured a lot of public attention, and it's also, you know, a research output in that it is driven by well-formed data in the PARADISEC collection. We've automatically snipped 20 seconds out of audio files and used the naming convention and the metadata that's in the catalog to then feed this virtual reality display. So ultimately we do want
to give this material out to the Pacific, and what's amazing really is that now most people in the Pacific have mobile phones that are accessing the internet. On the right you can see a poster for the internet on your phone in Port Vila in Vanuatu, and on the left you can see a church, but above the church there's a mobile phone tower, which is now the way that people are accessing all this kind of information. So we want to make material findable for people in these remote locations, even in the Highlands of Papua New Guinea or in the most remote parts of the Pacific. So the catalog is findable to them through various means including, of course, Google, but we also need to make the data accessible, interoperable and reusable for them, but I'm not going to talk about that now. So PARADISEC
created a standard metadata set, which means that as the data comes in it's described with a light touch. As I say, we apply as much metadata to items as possible, but for some of the legacy material there's just very little metadata and we have to infer what we can. We also rely on people putting that metadata in online if they can, or sending information to us; we are always open to enriching the metadata that's in the collection. The main point of the metadata is that you are able to then locate the primary records and have them play to you, or see them, or download them if you have the privileges. So all of that makes it more accessible and findable, and the harvesting of the metadata through APIs for our discipline-specific and more general search tools does that as well. We do many things to try and publicize the existence of the collection, including what may seem gimmicky, virtual reality or augmented reality, but all of this goes to increasing public knowledge of the collection, so that will increase the findability but also increase our location of analog data that needs to be digitized. Part of all of this also requires data management training, so that people know how to build their own collections. So we do a lot of training of researchers here in Australia but also in the Pacific, and we also have a lot of engagement with community agencies in the Pacific and try to get funding to run digitization programs with those agencies. So that's
our story about findability. Thank you.
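
A note on the OAI-PMH feed and language identifiers mentioned in the talk: the kind of per-language lookup that a harvester can build on such a feed might be sketched roughly as below. This is only an illustrative sketch, not PARADISEC's or OLAC's actual code. The base URL, the record identifiers and titles, and the use of plain `dc:language` elements carrying ISO 639-3 codes are all assumptions made for the example; only the `ListRecords` verb, the `metadataPrefix` parameter and the XML namespaces come from the OAI-PMH and Dublin Core specifications. The sample response is embedded in the code so that nothing is fetched over the network.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical harvested response: identifiers and titles are invented.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:XX1-001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical recording one</dc:title>
          <dc:language>lkn</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:XX1-002</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Hypothetical recording two</dc:title>
          <dc:language>bis</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def records_for_language(xml_text, iso_code):
    """Return (identifier, title) pairs for records tagged with iso_code."""
    root = ET.fromstring(xml_text)
    matches = []
    for record in root.iterfind(".//oai:record", NS):
        languages = [el.text for el in record.iterfind(".//dc:language", NS)]
        if iso_code in languages:
            identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
            title = record.findtext(".//dc:title", default="", namespaces=NS)
            matches.append((identifier, title))
    return matches


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(records_for_language(SAMPLE_RESPONSE, "lkn"))
```

Keying the lookup on the three-letter language code rather than on a language name is the design choice the talk points to: names vary between archives, while the code gives every harvester the same handle on a language.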