DCB2010: Storing Metadata

Video in TIB AV-Portal: DCB2010: Storing Metadata

Formal Metadata

DCB2010: Storing Metadata
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
The ANDS Data Capture Briefing was held in Melbourne on September 2, 2010. The briefing was designed to provide an introduction to ANDS and its services for representatives of Melbourne-based research institutions engaged in data capture projects with ANDS. A number of participants also provided descriptions of their institutional projects.
So what I want to do is give you an overview of what we're trying to do in the metadata stores program, talk about some of the solutions that we're funding, and then, I guess, throw it open for more general questions. Yeah, that'll do.
As I think I probably mentioned in a throwaway line earlier today, we're in the process of updating our business plan. I think Ross Wilkinson spent a chunk of yesterday closeted away in a room updating it, so we should have a business plan available soon for the next, well, it's no longer the next year, the next nine months, for you to look at. So I'm having to point you to what's in the current publicly available document, which is a project plan from last year, but this text hasn't actually changed much.
The research metadata stores infrastructure, which was the name of this program, was really about infrastructure for creation, management and harvesting of metadata about collections and objects. I won't read the rest of the definition because I'm actually going to talk about some of those concepts in a minute. So the problem we've got is that there are an
increasing number of data stores out there. Some of those are institutional:
Monash has a thing with the catchy title of the Large Research Data Store, Melbourne University is looking to put in something similar, and a number of other institutions around Australia have got or are building out large data stores. In addition there's the national infrastructure. ARCS have got the ARCS Data Fabric; some of you may have been on mailing lists and seen notices about the ARCS Data Fabric roadshow that's coming up. Some of you may also have seen announcements that they've now got integration between the ARCS Data Fabric and Amazon S3. And there was an allocation of money in the Super Science budget, at the same time as the 48 million for the ANDS ARDC activity, to contribute towards the national data storage infrastructure. 47 million of that has gone to an initiative at the University of Melbourne called NeCTAR, and I'm sure Steve Manos would be happy to answer questions on that. 50 million of it is going to infrastructure focused specifically on storage; there have been discussions about the business model for that and who the lead agent would be, but no announcement yet. That's in part caught up in caretaker mode.
And then of course there's international data infrastructure, things like Amazon S3 and equivalents, although S3 is less attractive for Australia because of the backhaul traffic costs.
Unfortunately all of those have relatively poor support for rich metadata about the data that's being stored. They're good at storing ones and zeros; they're not good at storing metadata. If people want, I can do a quick demo of the Data Fabric in question time for those of you who haven't seen it. It's not on my laptop, so I'll have to remember how to authenticate. So I won't do the Data Fabric demo now, and we'll see if people want that at the end.
So what we're trying to do in the metadata stores program is provide ways to enable you to
manage metadata that's going to drive and enable reuse. Now the reason I'm talking about this in a data capture briefing is because what you're capturing is data and metadata coming off the instruments, and that has to go somewhere. So that's why metadata stores are a key component of this infrastructure.
The way that we're currently talking about the kinds of metadata that we ideally want to have available is in terms of these four things. It's the second time I've used four things today; I'm not sure whether this is a trend, but I guess it gets away from the standard trend of using three of everything. So we like to talk about information for discovery, information for determination of value, information for access and information for reuse. You'll notice, by the way, that I haven't called this discovery metadata or determination-of-value metadata, but "information for". That's in part because those phrases seem to resonate better with researchers, who don't necessarily think in terms of metadata. It's not true for all disciplines, but it's certainly true for some. So what does that actually mean?
The easiest one is information for discovery. This is the thing that's closest to catalog metadata. You can infer some of it from other information or from linked information, you can extract some of it from other systems, as I talked about earlier today, and some of it you have to enter manually. But if you think back to the RDA demo earlier this afternoon, a lot of that was information for discovery.
So let's assume that I've gone searching, either on Google or in RDA, and I've found some data that is potentially of interest. I now need to move on to the next step, which is that I need to decide whether I care. This is where information for determination of value comes in: do I care about this data
enough to investigate further? And here the answer is really that it's all about the context. Two years ago, when we were first working out how ANDS was going to look, I think we thought we were mostly just going to build a discovery system. When we went out and consulted around our original model, people said: don't bother. If all you're doing is providing discovery, that's not going to be enough; you have to give people context for the data. So what we're trying to do in determination of value is provide as much context as possible,
and some of the stuff that you saw in the RDA demo (and in fact I might do a quick demo of that in a minute) was based around providing that context: information about the researcher, or the research program, or the institution they work for, or publications that were associated with the data, or what the experimental design looked like, or the availability of reuse metadata. Those are things that would help a researcher decide: yes, I care about that because it's associated with a Nature publication, or I know the researcher, or I trust the ARC, or I think that everything the University of Melbourne does is wonderful, or not.
The third thing: once you've decided that you care about this particular data set, you then need information for access. You need to be
able to get to the data. Now there are at least four possibilities here. Two of them are obvious; two of them, at least for me when I started doing this, were not obvious.
The first obvious one is a direct link to the open access data. You find the collections record, and the collections record has a way that allows you to click straight through to the data. That's, if you like, the simplest possible use case.
The second use case is that you have a link but you have to authenticate: register or log in for restricted access. Remember that ANDS doesn't control what your data store does; we simply point. So this would take you to whatever your data store is, and that can enforce its own access control regime: log in using an authentication mechanism you manage, log in using the AAF, whatever. We don't care. Those are the two, I guess, obvious ones: open access and restricted access.
The first of the two that are perhaps less obvious, and therefore useful to talk a little bit about, is login but for open access. The first time I saw this one I thought, wait a minute, what's going on here? An example of a system that uses this is an organization in the Netherlands called DANS, Data Archiving and Networked Services, which is part of the Royal Netherlands Academy of Arts and Sciences. DANS is the major social science data archive for Dutch research; it does some other disciplines as well, but primarily social science. In order to access data that they hold you have to log in, and the reason is twofold. Firstly, it's so the person who uploaded the data to them can see who else is downloading their data and using it. That one you can kind of see: I've contributed my data, I'd like to see who else cares about the stuff I'm doing. Fine. The other reason for logging in is a little more subtle. When you say, I want to download this data, you can see everybody else who's downloaded the data. So as a reuser of the data you can see all of the other reusers of the data. I asked them why they'd done that, and the answer was that they were trying to build a community of practice, or a community of interest, around a particular data set, so that you could discover other people who also cared about data that you cared about and say, I wonder why they're downloading that data set, and strike up a conversation with them. So that's actually quite a nice use of login for open access data. And anyone can get a login, including Australians; it's not restricted to Dutch users only.
The final information-for-access use case is again one that seemed a little weird the first time I saw it, but I now understand it. That is: there is no link to the data, there is no link to an underlying data store; there's an email address or a phone number or a physical address. A number of the researchers that we are working with are saying: no, before anyone can reuse my data I want to talk to them. I want to have a conversation with them, I want to check that they're not one of my competitors, I want to make sure that they understand the full brilliance of my research design, whatever. I want to be the gatekeeper. So we're going to see a number of instances of that, and all of those possibilities are fine from our point of view. We absolutely don't have a problem with those.
There is of course an additional dimension that you could overlay on top of that, which is an embargo. That is, it will be open access, or you have to log in to get it, or you can contact me, but there's going to be an embargo period within which you can't get it.
And the last one is information for
reuse. Now this obviously varies hugely by discipline, but it's the kind of metadata that you need to enable you to reuse the data once you've got it. It might be the methods section in your paper; many articles in Nature have now got an electronic supplement which is the detail about how they did the experiment, because the actual amount of space in Nature itself is constrained. It might be the reagents they used in a chemical experiment. It might be the calibration values. It might even be something subtle like what the variable names mean. For instance, say the data that you've downloaded is a spreadsheet. Some of the polar data that's available through Research Data Australia at the moment is a series of Excel spreadsheets, but in a number of cases the column headings for those spreadsheets are not explained, and so you have to guess. Now if you're someone who works in that discipline you can probably guess reasonably accurately. If you're someone outside the discipline trying to reuse the data, your guess might be wrong. And even if you are trying to guess, what do you do with a column that's called "temp"? Is it temporary? Is it temperature? If it's temperature, is it in Kelvin, or Fahrenheit, or Celsius, or Réaumur? There's no way of knowing unless you've got at least some kind of legend. Nick's shaking his head at Réaumur; I did it for you, Nick. So there's a significant need for information for reuse, and that's going to vary enormously.
So this is yet another view of ISO 2146.
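The four kinds of information walked through above, together with the access possibilities and the "temp" column problem, can be sketched as a small record structure. This is only an illustrative sketch: the class names, field names and example values are assumptions made up for this example, and it is not the schema used by the collections registry or by Research Data Australia.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Dict, List, Optional


class AccessMode(Enum):
    """The four access possibilities described in the talk (names are hypothetical)."""
    OPEN_LINK = "direct link to open access data"
    LOGIN_RESTRICTED = "log in for restricted access"
    LOGIN_OPEN = "log in, but the data is open access"
    CONTACT_CUSTODIAN = "contact the custodian, who acts as gatekeeper"


@dataclass
class AccessInfo:
    mode: AccessMode
    target: str  # URL, email address, phone number or physical address
    embargo_until: Optional[date] = None  # embargo overlaid on any access mode

    def available_on(self, day: date) -> bool:
        """Return True if no embargo applies on the given day."""
        return self.embargo_until is None or day >= self.embargo_until


@dataclass
class ColumnLegend:
    meaning: str
    unit: Optional[str] = None  # e.g. "degrees Celsius"; None if unitless


@dataclass
class CollectionRecord:
    title: str                      # information for discovery
    keywords: List[str]             # information for discovery
    context: Dict[str, str]         # information for determination of value
    access: AccessInfo              # information for access
    column_legend: Dict[str, ColumnLegend] = field(default_factory=dict)  # reuse

    def unexplained_columns(self, columns: List[str]) -> List[str]:
        """List the data columns a reuser would have to guess the meaning of."""
        return [name for name in columns if name not in self.column_legend]


# Hypothetical example record; every value here is invented for illustration.
record = CollectionRecord(
    title="Example reef survey",
    keywords=["octocoral", "rapid ecological assessment"],
    context={"institution": "Example Institute", "method": "transect survey"},
    access=AccessInfo(AccessMode.CONTACT_CUSTODIAN, "custodian@example.org"),
    column_legend={"temp": ColumnLegend("water temperature", "degrees Celsius")},
)

print(record.unexplained_columns(["temp", "vis", "depth"]))  # ['vis', 'depth']
```

Keeping the legend alongside the record means a reuser outside the discipline can see at a glance which column headings are still unexplained, rather than guessing at units.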
Sorry, I'm going to go and do that. Let me just quickly show you what that looks like for some real data. Here is the coral example that Sally was showing you before. So here's a collections record coming from the Australian Institute of Marine Science, and this is actually displaying a number of the things that I was talking about (discovery, access, determination of value, reuse), although not all in an optimal way.
Discovery: well, you might have been searching for octocoral, which is what I was searching for, or Stolonifera, whatever that is. You might have searched on species names: sea fans, sea whips, whatever. So you can imagine a range of searches that would have used some of that information to get you to this particular record.
Information for determination of value: well, you might say, I'm interested in rapid ecological assessment, so if they're using that particular methodology I'm more interested in the data set, because I think that's a good way to do things. You might say, down the bottom, that I trust the stuff that Katharina Fabricius does, or that I think the Australian Institute of Marine Science in general does good stuff. So that might be part of your information for determination of value.
Information for access, I think, is where this particular record is going to fall down, but let's try. Yeah, okay. So in this case this is going to take me to one of those metadata standards that Jobin talked about before, ISO 19115. You can see there that you can save this record as 19115. And in this particular case the online resource is not the actual data set; it's a link to the AIMS website. So in fact the way you get access to this data is you contact Katharina Fabricius to get it. So the information for access is one of those go-through-a-gatekeeper models.
And then the last one, which is information for reuse. Well, some of it they've actually put in the description, and you'll see this with a number of records. They've got quite a lot of what some disciplines might call methods information in the actual description. So they're telling you how they're doing their transects, they're telling you what depths they're using, they're telling you what the site variables were, so visibility and a modified Secchi technique. And again, if you're one of those kinds of researchers, this is the kind of stuff you care about. So there's a reasonable amount of reuse information in there. Now in this case I can't actually see what the actual data looks like; I'd need to contact Katharina Fabricius.
The last point I wanted to make about this one is that there is information hidden in the 19115 record that we are not surfacing at the moment, and I think Sally showed you some of this stuff before. Now a lot of this is stuff that you might think, well, I don't want to surface it. However, you probably do want to surface something like this: here's the related publication information. It would be nice if we could extract that and display it as related information in the collections record, because that would actually be helpful context information for someone. So we might have to have a look at the crosswalk we're doing from 19115 into our collections record. The reason why this isn't showing up is that they haven't put it in as a publication; they've put it in as supplemental information, so it's basically just a lump of text, and it would be relatively difficult to work out how you'd surface that in a collections record. But that kind of link to a publication is the sort of thing that would be nice, and if your projects are doing data capture and there are associated publications, it would be really nice to see links to the publications in the records as part of that determination-of-value information.
So, I won't talk about that. Well, that may be what I said, but that's not what I meant. So if we care about these kinds of information (information for discovery, for determination of value, for access, for reuse), where are we going to get them from? Well, down the left are the ISO 2146
entities: collections, parties, activities and services. The next column across is the research instances: what does that mean in a research context? Physical and digital collections; individuals and organizations for parties; a whole lot of information about research projects; and we're still trying to work out how to model things like synchrotrons, or beamlines on the synchrotrons, as services. Nick will have that solved by this weekend, right, Nick? Excellent. It's actually a relatively tricky modeling task, but it is one we know we need to solve. Okay, so we just need to recognize and experience and the problem will go away.
The identifier sources: Nick talked about the role of persistent identifiers for collections. Sally has already talked about People Australia for individuals and organizations. We're in active discussion with the ARC and the NHMRC about identifiers for research projects and providing linked data endpoints for those. We don't have a solution for services at the moment.
And inside institutions, this stuff may live in your institutional repositories (we haven't talked about that), in your institutional metadata stores, which I'll talk about in a minute, in your HR system, or in your research management system.
So how does that get into the collections registry? Remember from this morning that we use the collections registry to build the RDA pages. You can just do a series of feeds. In that data source administrator interface that Jobin showed you, you can have multiple data sources per institution. If you wanted to, you could have a repository and a metadata store and your HR system and your research management system as separate data sources; that's fine. Or you can feed stuff through a meta-aggregator and then do a single feed from the aggregator into the collections registry. That's an architecture decision that you will need to make.
So what are
the drivers for the metadata store? I've already talked about the paucity of metadata in the existing data solutions. Clearly something that you're providing as a metadata store needs to meet your needs as an institution for managing rich metadata and our needs to get feeds of information about collections, but also, as I think a couple of people have mentioned, it needs to solve the problems for Seeding the Commons and for data capture, remembering that all of the data capture funded universities have also got Seeding the Commons money. Those are a little bit different: certainly collection stuff plus associated information, and in the data capture world possibly some object metadata as well.
In an ideal world we would have funded all of the metadata stores projects last year, we would have got those solutions fully finished, we would have had them ready and deployed into institutions, and we then would have said: and now, guys, we'd like to spend money with you on data capture. Unfortunately, in the world that we live in, rather than my preferred parallel universe, we thought we had to spend all our money by the middle of next year, and so we did everything in parallel. As a result the early-activity metadata stores projects, things that we started talking to institutions about in September last year, are either still building solutions or in some cases haven't started yet. We're building the activity and party identifiers infrastructure that you've heard a little bit about today, and at the same time you're doing data capture and Seeding the Commons projects. So this is clearly not ideal, but it is unfortunately what we're going to have to live with. The alternative would have been, I guess, well, there really wasn't any alternative. We could in theory have said, when we got the time extension, just put a hold on all your data capture projects and we'll come back and talk to you in a year, but we would have been lynched, so we didn't say that. So we have to live with the fact that we're trying to do everything in parallel. If I had time at this point I would play a fantastic commercial that EDS showed 10 or 15 years ago about building an aeroplane in the air while you're flying it, which is sometimes how ANDS feels.
So what metadata stores solutions do we have at the moment? So
the first is a system based on Vitro. The University of Melbourne is the lead for this, which is why they're in italics. QUT and Griffith, who we're also funding on a research metadata hub, are picking that up and using it, and UWA, I believe, are also proposing to use it. It's an RDF triple store based solution using technology developed out of Cornell, and Simon Porter at the University of Melbourne would, I'm sure, be happy to answer questions on that if you had any.
The second solution is a thing called inject, because we can't come up with a better name for it yet. The lead agency for this is the Australian Digital Futures Institute at USQ; some of you know Peter Sefton. This is building on top of an existing institutional repository solution, and they're actually testing it out with the University of Newcastle to make sure it meets their needs. So they're not building a generic solution; they're building to a specific use case. The Newcastle instance of this is called ReDBox, for reasons that will become clear on the next slide. Swinburne University is also interested in picking this one up. Peter Sefton at USQ or Vicki Picasso at Newcastle would be able to help you with details on that.
The reason it's called ReDBox is this. All of the red stuff is what this is building, the green stuff is external systems that this talks to, and the blue stuff is the institutional repository. The red stuff is the new components: they have an event queue which monitors external sources of information, they have a pluggable harvester that can slurp stuff out of those, they have a form system, and they're using the institutional repository as the underlying store. And then they can feed stuff to us. If you're interested in this, Peter Sefton blogs about it reasonably frequently on his blog at ptsefton.com.
The third solution is... I forgot what you called it this morning, Anthony. It's no longer called the TARDIS derivative; it's the I Share My Res... I was going to say I Heart My Research Data, and I knew that was wrong. I Share My Research Data. So this is building on the TARDIS code that Steve Androulakis has done. I still haven't done the Twitter thing for your name yet: ATS pets nuts, if you want to follow him on Twitter. There we are, see, I've done it in real life.
So this is going to have the ability (and these are Anthony's bullet points, except for the last one) to stage data, upload data via a web interface, annotate data with new parameters, map experiments, provide a web service interface, and access systems around groups. This one is intended to be generic. The stuff that Steve built in the original TARDIS was specific to protein crystallography; this is a generalization of that, to sit on top of the Large Research Data Store. I'm saying it's potentially useful because
it looks like it would be a really good idea, but it's running late, so until I actually see it running I'm reluctant to say more than potentially useful. But I'm sure Anthony would be happy to talk to you about it afterwards.
The second-last option is that you can run the software that we use for the collections registry locally within your institution: you can run a local ORCA instance. That was designed to operate in a federated mode, but it's primarily focused on meeting the needs of the ANDS registry, so it's not intended as an institutional metadata solution. Having said that, ANU have said that for their data capture project they're going to use it and are probably going to be extending it. So that is another possibility that's available now, and extensions will probably be coming out of the ANU use of it. That's the architecture diagram; don't worry about that.
And then the last option is that you can use your existing institutional repository, but I would skip to the bottom bullet point, which is: the recommendation is don't. It's doable, but Jobin and Nick have spent some time looking at it, and it's a valley of pain. The valley is deeper and longer for some institutional repository solutions than for others, but none of them is less than a valley, and none is without a degree of pain. Most institutional repositories don't like storing large objects, they don't have good collection support, they're really only designed to do DC and MODS, and they don't cope well with doing RIF-CS, and there's a range of other problems. So think very hard before you go with that solution. And I'm fractionally over time, so I'll stop here.