DCB2010: ARDC Party Identification Infrastructure
Data Capture Briefings
The ANDS Data Capture Briefing was held in Melbourne on September 2, 2010. The briefing was designed to provide an introduction to ANDS and its services for representatives of Melbourne-based research institutions engaged in data capture projects with ANDS. A number of participants also provided descriptions of their institutional projects.
I'm here to talk about the Australian Research Data Commons party infrastructure, one of our core infrastructure projects at ANDS. The purpose of this particular infrastructure project (I'll briefly mention some others at the end of this presentation) is to improve the discovery of data sets in Research Data Australia.
The way it attempts to do that is by linking research data sets via the researchers that are involved with them, and also by expanding the amount of information that might be available in Research Data Australia about that researcher or that research organization, which helps people understand the usefulness or the authority of the data.
The first question probably is, well, why do we need an infrastructure? The people who are providing the collection, activity and service records to ANDS will also provide party records, which say who this person is. The problem there is that there's a fairly common scenario where, you know, Dr John Smith might be a researcher at Monash University. He's responsible for a number of data sets that are published by Monash University. He's also working on a collaborative research project and is also responsible for data sets that are published by Queensland University of Technology. He also works as an advisor to a program at CSIRO, and there are data sets published by CSIRO with which he is also involved. So you can see there that we won't be able to link those data sets together unless there is some sort of common identifier, and this is where the party infrastructure comes in.

It attempts to enable this linking by persistently identifying researchers and research organizations using a public identifier. We heard Nick talking earlier about what an identifier is: it means that somebody is taking responsibility for managing that identifier. Having a public identifier for researchers and research organizations just improves overall management of researchers in the research sector and has a lot of spin-offs. To a certain extent the major beneficiary of having this infrastructure is not necessarily Research Data Australia. The beneficiaries will be the academic publishers and the research grant authorities like the ARC and the NHMRC. There are a lot of people who are trying to move towards this nirvana of being able to say who an individual is, no matter in what context they are producing outputs and no matter who they are affiliated with.

So we needed to
choose the public identifier scheme. ANDS, you know, is not an ongoing organization; we're not going to be here in the future. We needed to use a scheme that was going to continually be supported and managed by an ongoing organization. There was already the NLA's People Australia service providing public persistent identifiers for Australians and Australian organizations. They already had eight hundred and eighty thousand records, based originally on the Australian Name Authority File, so basically the people in there are people who publish or people who were published about. But they have also then included other contributors such as the Encyclopedia of Australian Science, the Australian Parliamentary Library for politicians, the Australian Women's Register and the Australian Dictionary of Biography Online. They now have Queensland University of Technology researchers in there. So we are getting more; we have other contributors in that space that would give us more information. So Dr John
Smith at the Queensland University of Technology, who also works at Monash University, may also have an entry in the Encyclopedia of Australian Science because he's an Australian scientist, and that has quite a lot of biographical information about him that's not going to come in through the party record that you're supplying in your RIF-CS feeds.

They have already established an aggregation service. They have got software and hardware and processes around the managing of aggregation for multiple contributors around Australia. Because of Libraries Australia, most Australian universities are already contributing records to the National Library of Australia about publications, and most institutional repositories are already submitting records out of their institutional repositories to the National Library of Australia, now for Australian Research Online. So it was good to sort of have all of these things connected in the one space.

They're also committed to integrating with international identifier services such as the Virtual International Authority File; the International Standard Name Identifier, which is another initiative; and ORCID, which I think is an initiative of the academic publishers to try and establish an international identifier. Because they're integrating with those, it means the NLA identifier, the Australian public identifier we're using for researchers, will map to these other identifiers and will help connect everything together: the research data sets as well as publications from the same person or organization.

How ANDS established this project was to separate it into two stages, because this was a difficult one. In a sense there was an infrastructure that was already in place, but it was uncertain how the research sector, in particular the universities, who are the predominant research organizations, and people like CSIRO and Geoscience Australia, for example, would actually feel about this
idea of assigning public persistent identifiers. And I stress public: the only information it will lead to is public information. It's not an Australia Card type identifier. So it was important that stage one of the project
involved a lot of consultation about gathering requirements, and establishing with representatives from that sector an infrastructure and processes that they actually felt would work and that would at least be supported by the majority of institutions. We're not anticipating that we'll get complete coverage here. There will always be, for whatever reason, perhaps some organizations that may opt out. But unless the majority was going to eventually participate in this, then it wasn't worth doing, and this is why we had to have a stage one and a stage two.

Now the advisory group met for the final time about last week or the week before, and they have endorsed the architecture
that's been described in the first stage and recommended to ANDS that we proceed with this project. So we are now moving into stage two, which is the implementation phase. There's a project wiki which you may go to, which has more information. The advisory group is now terminated, but a lot of the people in the advisory group are now in our early implementers group, and they will help ANDS and the NLA define the requirements for the sorts of tools and the exchange methods for information. Some of the matching rules may need to be varied in order to accommodate the needs of the universities. So the implementation stage,
the second stage of the contracted project with the NLA, is adapting their existing infrastructure. It was designed for librarians at the NLA to match identities amongst the contributors that we spoke of; it was not really designed for what would be needed for the efficient matching of large numbers of researchers. So that infrastructure needs to be adapted so that institutions can use the tools, the identity matching service needs to be modified, and the tool that's used for identity matching needs to be enhanced. In order for this to be implemented and the tool improved, we need to just work with a number of early implementers. This is stage two, which will probably finish in the first part of next year, and they will produce technical documentation and guidelines.

After that the project as
funded has ended, but of course there is ongoing support needed for this infrastructure. For ANDS that will mean providing training and guidelines to
contributors about how to use it, liaison, and some tools, for example making sure that the metadata store projects integrate with the party infrastructure project. As for the NLA, they will continue because they have to; for all sorts of other reasons it's their mandate to maintain the infrastructure. They will register contributors and the harvest arrangements with the universities and the government agencies, and continue to support and improve the data administration tool. So that's just an
ongoing commitment.

So just a little bit of a pictorial view. On the left we'll
have an ARDC contributor. It could be an institution, it could be part of an institution, somebody like yourselves who is contributing collection, service and activity records to the ARDC. Now at the moment you also may need to include party records, but the purpose of the infrastructure is that the party records can be submitted independently to the party infrastructure. So information about your researchers, about your research groups and your departments can be submitted directly to the party infrastructure, and in return a persistent ID, in other words an NLA party identifier, is assigned. That means that we don't need your party records anymore. What we need in your collection, service and activity records is a related object which says this NLA
party identifier is the manager of, or is the chief investigator of, or whatever the relationship is to, your collection or your service or your activity. Once you have that relationship you know that there is already a record in the ANDS registry for that identity, so there is no need then to supply party records.

But of course you do need to supply party records, or someone needs to supply party records. It may not be the individual project or data capture project. It may be that your library or your research office is already providing information to the NLA about all of your researchers that are involved with your projects, and so you really just need to be able to link to those records. But I'll talk about those implications later.

Then from the NLA party infrastructure the ANDS collections registry collects party records, which are linked to the collection records that we also receive from the contributors. So it's a little bit of a split, instead of the sort of model that you've seen so far.

Just a little bit about People Australia, which is now Trove,
if you're familiar with Trove, their people and organizations view. They of course don't use RIF-CS as the underlying schema for records about people and organizations, because RIF-CS was designed for a data registry, a collections and services registry, and it only captured enough information about people to provide a little bit of context. Whereas what they're looking at is getting a lot richer information
about people from the sorts of data sources that they're talking to: the Australian Dictionary of Biography, the Encyclopedia of Australian Science, et cetera. So they use a well-used, well-known standard in
the library and archives sector for describing corporate bodies, persons and families called EAC-CPF, which is the new version, which they're just now moving to from the older version, which was called EAC. As well as supporting names, it supports rich descriptive information, it supports related parties, it supports related resources, and it supports the description of functions and activities. And that's a link to that standard if you're interested. So just a little bit of
another diagram of the model. There are the data providers providing
party information. The NLA would obviously hope that most people have an OAI repository, and we know that's not true just yet, but most of you are building towards one if you haven't one already. But they will accept party information that's in a static XML file accessible over the internet as well.

They will negotiate with you when you set up a relationship. You, or whichever body within the institution is providing the party information, will contact them and say, I've got party record information to give to you to get identifiers for, and they will ask, well, do you want to supply records in the EAC format? Have you already got them in RIF-CS format? They'll take that if that's what you've got. But they can also work with whatever you use internally. You might use another directory standard for
describing people and organizations within your research management software or your staff directory system, and they can also use that format and map it to EAC. So that's just a negotiation phase.

They take that information in via their harvester into the People Australia database, if it's automatically matched. Now at the moment the rules for automatic matching are very, very tight, and most of your records would not get automatically matched, because an identical first name and surname is not sufficient to say that that's the same person. They use things like birth and death dates and other ways to do automatic matching, but most would not match unless they have a common identifier. So without some sort of common identifier to go with the name, it will probably end up not being matched and will go into the party data administration tool, where someone will have to manually say yes, that's the same person, or no, that John Smith is not the same person,
there's no one of that name already in People Australia, it's a new person and it's a new identity.

When it's in the People Australia database, it's not only available via Trove, where you can search and browse people and organizations. It's also available via machine query services using the SRU standard API, and it's also available for harvest from their OAI repository. So you can harvest those records back again, and you can harvest them with the NLA identifier, so your own systems can automatically populate the NLA party identifier against that person's entry in your system. So you can harvest it back again, or you can send a request.

The change now, because we're moving into using this service in the research data context, is that the party administration tool will now be available to the data providers as well as to the National Library, for them to do their own matching for their own records. This is an example of what the party
administration tool currently looks like. It'll be released at the end of October for a small number of early implementers, who are going to provide their records about their researchers to the National Library, for them to use and basically feed back improvements. As the tool is improved, then we'll
release it more widely next year.

You can see the entry there; that's the entry for Thomas Huxley. You can see on the side that there are actually three contributors, so there are three sources of information about Thomas Huxley: Libraries Australia, in other words the Australian Name Authority File, the Encyclopedia of Australian Science, and the Australian Dictionary of Biography Online all provided records about Thomas Huxley. And you can see that there are some biographies, and there are related people
and organizations and resources, things that are either about Thomas Huxley or by Thomas Huxley.

Now if we were to look at a researcher, for example someone from QUT, you can see here that the contributors are the Name Authority File and the Queensland University of Technology. Because when the Queensland University of Technology supplied their records it was out of their institutional repository, the National Library of Australia was also able to harvest from them related publications in their repository that were connected to Greg Hoon. So you now see the selected resources from their repository as well as other publications that were already in People Australia by or about him.

Then if you look at the back end, in the People Australia database, this is what happens. They don't change anyone's record. They just take what the person or the organization contributes, and it stays in there. So there's one identity for Douglas Mawson, one identity authority, and that's the party ID. But within the database there will be three quite separate records, each as it was provided, along with the local ID that was submitted as the identifier by the local institution when they submitted that data. They will keep that match point, so that can always be matched against. The Encyclopedia of Australian Science could always send an SRU request to Trove with their local ID and it will return the NLA party identifier. Also, if they then send an updated feed and they have expanded the information about Douglas Mawson, it knows that it's the same record because it's got the same local ID, and so it will replace that record rather than create a new one. It will do an automatic match.

What this means for data providers to the ARDC is that they're
encouraged to provide party information to the NLA party infrastructure service and to obtain NLA party identifiers. In this way we will build a common set of parties for the Australian research sector, rather than a slab of RIF-CS identities from this institution, this institution and this institution, many of
whom will overlap. Over time, affiliation information can also be captured in EAC, so that people will be able to see that this person was at Queensland University of Technology until 2008 and then they moved to Monash University. This sort of information is a benefit in the research and innovation context.
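
The way the database keeps contributed records, as described above, can be sketched in a few lines of Python. This is only an illustration of the behaviour mentioned in the talk: one party identity, one untouched record per contributor, the contributor's local ID kept as the match point, and an updated feed replacing a record rather than creating a new one. The class and method names, the identifier format and the sample data are all invented for the example and are not the NLA's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class PartyIdentity:
    """One identity: a single party ID plus one record per contributor."""
    party_id: str
    # (contributor, local_id) -> the record exactly as that contributor supplied it
    records: dict = field(default_factory=dict)


class PartyStore:
    """Hypothetical stand-in for the aggregated party database."""

    def __init__(self):
        self.identities = {}    # party_id -> PartyIdentity
        self.match_points = {}  # (contributor, local_id) -> party_id

    def submit(self, contributor, local_id, record, party_id=None):
        key = (contributor, local_id)
        if key in self.match_points:
            # Same contributor and local ID as before: automatic match,
            # so the earlier record is replaced instead of duplicated.
            party_id = self.match_points[key]
        elif party_id is None:
            # Nothing to match against: a new identity (ID format is made up).
            party_id = f"party-{len(self.identities) + 1}"
        identity = self.identities.setdefault(party_id, PartyIdentity(party_id))
        identity.records[key] = record
        self.match_points[key] = party_id
        return party_id

    def lookup(self, contributor, local_id):
        """Return the party ID for a contributor's local ID, or None."""
        return self.match_points.get((contributor, local_id))


store = PartyStore()
pid = store.submit("name-authority-file", "naf-001", {"name": "Mawson, Douglas"})
# A second contributor's record, matched (for example manually) to the same identity.
store.submit("encyclopedia", "eas-77", {"name": "Douglas Mawson"}, party_id=pid)
# An updated feed with the same local ID replaces that contributor's record.
store.submit("encyclopedia", "eas-77", {"name": "Douglas Mawson", "bio": "expanded"})
```

After these three submissions the store holds a single identity with two contributor records, and `store.lookup("encyclopedia", "eas-77")` returns the same party ID that the first submission created.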
What we also then want to see is that the collection, activity and service records that are provided to ANDS contain links to NLA party records. So inside those collection records or data set records, for example, there will just be this related object element, and you can be confident that when you see your data set record in RDA there will be a lot of information available
there about that person who is managing that data set, without you having to provide that information in your data set records.

For data capture projects it's a little bit different
from Seeding the Commons projects, because you're often working in a quite specific subset or area of the institution. So you really need to work with other groups within your institution who have also got contracts to provide data records to ANDS, so that you actually come to a common agreement on what key structure you'll use for party records and who is going to provide those party records. Are you each going to provide party records to ANDS, or is one area, like the research management office, going to do it on behalf of all of the areas of the institution? It doesn't really matter.
It's quite valid for the party infrastructure to get a set of party records from a particular research center which has very rich information in their system about those researchers, much richer than perhaps their staff personnel system or their research office system has. So if you provide records about that person and the research office provides records about that person, all that information gets connected into the one identity. Even though you're two different contributors you can be referring to the same person, and you get a sort of amalgam of information about the person: what you provide and what somebody else has provided.

But to make that happen easily, so you don't have to do a lot of manual matching, you want to agree what key you will use. What is some unique key within your institution that you can use, so that the record that you provide about the researcher in your research program can be matched with the record that might be provided by another area of the university about that person, because you've chosen to use the same key structure for identifying them? If you don't, that's okay too, but somewhere along the line someone may have to manually match, because there's not enough information to do an automatic match; the name being the same alone is not enough. But as part of this project the NLA and ANDS will be looking at other information we can use to do probabilistic matching, such as publications and field of research codes. There are other things that might be able to be used for automatic matching so that people don't have to do manual matching, and we'll look into that as we go along.
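
As a rough illustration of why an agreed key matters, the sketch below applies the rule described above: two records match automatically only when they carry the same institutional key, and a matching name on its own is sent for manual review. The record fields, the key format and the three outcome labels are assumptions made for the example; the actual matching rules used by the NLA are not spelled out in this talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartyRecord:
    contributor: str                  # which area supplied the record
    name: str
    shared_key: Optional[str] = None  # agreed institution-wide key, if any


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def classify(incoming: PartyRecord, existing: PartyRecord) -> str:
    """Return 'auto-match', 'manual-review' or 'no-match' for a pair of records."""
    if incoming.shared_key and incoming.shared_key == existing.shared_key:
        # Both areas used the same agreed key: safe to match automatically.
        return "auto-match"
    if normalise(incoming.name) == normalise(existing.name):
        # Same name alone is not enough, so a person has to decide.
        return "manual-review"
    return "no-match"


# Hypothetical records; the key format is invented for the example.
office = PartyRecord("research-office", "John Smith", "example.edu.au/staff/1234")
centre = PartyRecord("research-centre", "John  Smith", "example.edu.au/staff/1234")
other = PartyRecord("another-contributor", "John Smith")

print(classify(centre, office))  # same key
print(classify(other, office))   # same name, no key
```

Here the research centre and the research office agreed on one key, so their records about the same researcher match without anyone intervening, while the third record, which shares only the name, would be left for someone to check by hand.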