DCB2010: Storing Metadata - TIB AV-Portal

Australian Research Data Commons (ARDC)

Treloar, Andrew

Formal Metadata

Title

DCB2010: Storing Metadata

Author

Treloar, Andrew

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/35912 (DOI)

Publisher

Australian Research Data Commons (ARDC)

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

The ANDS Data Capture Briefing was held in Melbourne on September 2, 2010. The briefing was designed to provide an introduction to ANDS and its services for representatives of Melbourne-based research institutions engaged in data capture projects with ANDS. A number of participants also provided descriptions of their institutional projects.

Transcript: English(auto-generated)

00:02

So what I want to do is give you an overview of what we're trying to do in the metadata stores program, talk about some of the solutions that we're funding, and then, I guess, throw it open for more general questions.

00:21

Yeah, that'll do. As I think I probably mentioned in a throwaway line earlier on today, we're in the process of updating our business plan. I think Ross Wilkinson spent a chunk of yesterday closeted away in a room updating it, so we should have a business plan available soon

00:44

for the next, well, it's no longer the next year, the next nine months, for you to look at. We don't yet, so I'm having to point you to what's in the current publicly available document, which is a project plan from last year, but this text hasn't actually changed much. So the research metadata store infrastructure, which was the name of this program,

01:06

was really about infrastructure for creation, management and harvesting of metadata about collections and objects. I won't read the rest of the definition, because I'm actually going to talk about some of those concepts in a minute.

01:22

So the problem we've got is that there are an increasing number of data stores out there. Some of those are institutional: Monash has a thing with the catchy title of the Large Research Data Store, Melbourne University is looking to put in something similar, and a number of other institutions around Australia have got or are building out large data stores. In addition, there's the national

01:45

infrastructure. So ARCS have got the ARCS Data Fabric. Some of you may have been on mailing lists and seen mention of the ARCS Data Fabric road show that's coming up. Some of you may have also seen announcements that they've now got integration between the ARCS Data Fabric and Amazon S3, and

02:04

there was an allocation of money in the Super Science budget, at the same time as the 48 million for the ANDS ARDC activity, to contribute towards the national data storage infrastructure. 47 million of that has gone to an initiative at the University of Melbourne called NeCTAR,

02:22

and I'm sure Steve Manos would be happy to answer questions on that. 50 million of it is going to infrastructure focused specifically on storage, and there have been discussions about the business model for that and who the lead agent would be, but no announcement yet. That's in part caught up in

02:42

caretaker mode. And then of course there's international data infrastructure, things like Amazon S3 and equivalents, although S3 is less attractive for Australia because of the backhaul traffic costs. Unfortunately, all of those have relatively poor support for rich metadata about the data that's being stored.

03:03

They're good at storing ones and zeros. They're not good at storing metadata. And if people want, I can do a quick demo of the Data Fabric. It's not on my laptop, so I'll have to remember how to authenticate. I could probably remember how to authenticate to show you what the Data Fabric looks like in question time, for those of you that haven't seen it.

03:23

So I won't do the Data Fabric demo now, and we'll see if people want that at the end. So what we're trying to do in the metadata stores program is provide ways to enable you to manage metadata that's going to drive and enable reuse. Now the reason I'm talking about this in a data capture briefing is because

03:44

what you're capturing is data and metadata coming off the instruments, and that has to go somewhere. And so that's why metadata stores are a key component in terms of this infrastructure. And the way that we're currently talking about the kinds of metadata that we want

04:03

ideally to have available is in terms of these four things. It's the second time I've used four things today. I'm not sure this is a trend; I guess it gets away from the standard trend of using three of everything. So we like to talk about information for discovery, information for determination of value,

04:22

information for access and information for reuse. You'll notice, by the way, that I haven't called this discovery metadata or determination-of-value metadata, but "information for". That's in part because those phrases seem to resonate better with researchers, who don't necessarily think in terms of metadata.
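As an illustration of the four kinds of information just listed, here is a minimal sketch of a collection record that groups its metadata under those four headings and reports which groups are still empty. The class name, field names and example values are assumptions made for illustration; they are not the ANDS schema.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionRecord:
    """Hypothetical record grouping metadata by the four purposes in the talk."""
    title: str
    discovery: dict = field(default_factory=dict)
    determination_of_value: dict = field(default_factory=dict)
    access: dict = field(default_factory=dict)
    reuse: dict = field(default_factory=dict)

    def missing_kinds(self):
        """Return the names of the information groups that are still empty."""
        kinds = ("discovery", "determination_of_value", "access", "reuse")
        return [kind for kind in kinds if not getattr(self, kind)]


record = CollectionRecord(
    title="Example coral survey",          # illustrative value only
    discovery={"keywords": ["octocoral"]},
)
print(record.missing_kinds())
```

A check like this could flag records that are discoverable but give a re-user nothing to judge value, gain access or reuse the data.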

04:42

It's not true for all disciplines, but certainly true for some. So what does that actually mean? The easiest one is information for discovery. This is the thing that's kind of closest to catalog metadata. You can infer some of it from other information or from linked information, you can extract some of it from other systems, as I talked about earlier today, and

05:05

some of it you have to enter manually. But if you think back to the RDA demo earlier on this afternoon, a lot of that was information for discovery. So let's assume that I've gone searching, either on Google or in RDA, and I've found some data that is potentially of interest. I

05:24

now need to move on to the next step. Well, the next step is I need to decide whether I care, and this is where information for determination of value comes in. Do I care about this data enough to investigate further? And here the answer is really that it's all about the context.

05:42

So two years ago, when we were first working out how ANDS was going to look, I think we thought we were just mostly going to build a discovery system. And when we went out and consulted around our original model, people said: don't bother. If all you're doing is providing discovery, that's not going to be enough. You have to give people context for the data.

06:04

And so what we're trying to do in determination of value is provide as much context as possible, and some of the stuff that you saw in the RDA demo (and in fact, I might do a quick demo of that in a minute) was based around providing that context: so information about the researcher, or the research program, or the institution they work for, or

06:25

publications that were associated with the data, or what the experimental design looked like, or the availability of reuse metadata. So those are things that would help a researcher decide: yes, I care about that because it's associated with a Nature publication, or I know the researcher, or I trust the ARC,

06:43

or I think that everything that the University of Melbourne does is wonderful, or not. Third thing: once you've decided that you care about this particular data set, you then need information for access. You need to be able to get to the data.

07:02

Now there's at least three possibilities here. One of them, well, two of them are obvious; two of them, at least for me when I started doing this, were non-obvious. The first obvious one is a direct link to the open access data. So you find the collections record, and the collections record has a way that allows you to click straight through to the data.

07:22

And that's, if you like, the simplest possible use case. The second use case is you have a link, but you have to authenticate. So it's that one there: register or log in for restricted access. Remember that ANDS doesn't control what your data store does; we simply point.

07:44

So this would take you to whatever your data store is, which can enforce its own access control regimes: log in using an authentication mechanism you manage, log in using the AAF, whatever. We don't care. Those are the two, I guess, obvious ones: open access and restricted access.

08:03

The two that are perhaps less obvious, and therefore useful to talk a little bit about: the first is login, but for open access. And the first time I saw this one, I thought, wait a minute, what's going on here? An example of a system that uses this is an organization in the Netherlands called DANS, Data Archiving and Networked Services, which is part of the

08:28

Royal Dutch Academy for Arts and Sciences. And DANS is the major social science data archive for Dutch research. It does some other disciplines as well, but primarily social science.

08:44

In order to access data that they hold, you have to log in, and the reason that you have to log in is twofold. Firstly, so the person who gave them the data, who uploaded the data to them, can see who else is downloading their data and using it.

09:04

That one you can kind of see: you know, I've contributed my data, I'd like to see who else cares about the stuff I'm doing. Fine. The other reason for logging in is a little bit more subtle. When you say I want to download this data, you can see everybody else who's downloaded the data. So as a re-user of the data,

09:24

you can see all of the other re-users of the data. And I asked them why they'd done that, and the answer was they were trying to build a community of practice or a community of interest around a particular data set, so that you could discover other people who also cared about data that you cared about and say, I wonder why they're

09:44

downloading that data set? I'll strike up a conversation with them. So that's actually quite a nice use of login for open access data. And anyone can get a login, including Australians; it's not restricted to Dutch users only. The final information for access use case, again, is one that seemed a little weird the first time

10:06

I saw it, but I now understand it, and that is: there is no link to the data. There is no link to an underlying data store. There's an email address or phone number or physical address. And a number of the researchers that we're working with are saying:

10:22

No, before anyone can reuse my data, I want to talk to them. I want to have a conversation with them. I want to check that they're not one of my competitors. I want to make sure that they understand the full brilliance of my research design, whatever. I want to be the gatekeeper. So we're going to see a number of instances of that, and all of those possibilities are fine from our point of view.

10:44

We absolutely don't have a problem with those. There is in fact, of course, an additional dimension that you could overlay on top of that, which is an embargo. That is, it will be open access, or you have to log in to get it, or you can contact me, but there's going to be an embargo period within which you can't get it.
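The four access cases and the embargo overlay described above can be sketched as a small data model. This is only an illustration under assumed names (`AccessMode`, `AccessInfo`, `describe_access` are hypothetical); as the talk notes, the actual access control is enforced by the institution's own data store, and the record merely points at it.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class AccessMode(Enum):
    OPEN_LINK = "direct link, no login"
    RESTRICTED_LOGIN = "login required, access restricted"
    OPEN_WITH_LOGIN = "login required, but anyone may register"
    CONTACT_GATEKEEPER = "no link; contact the data owner"


@dataclass
class AccessInfo:
    mode: AccessMode
    target: str                          # URL, or contact details for the gatekeeper case
    embargo_until: Optional[date] = None  # embargo can overlay any of the four modes


def describe_access(info: AccessInfo, today: date) -> str:
    """Explain to a would-be re-user how, and whether, the data can be obtained."""
    if info.embargo_until is not None and today < info.embargo_until:
        return f"Embargoed until {info.embargo_until.isoformat()}"
    if info.mode is AccessMode.CONTACT_GATEKEEPER:
        return f"Contact {info.target} to request the data"
    if info.mode is AccessMode.OPEN_LINK:
        return f"Download from {info.target}"
    return f"Log in at {info.target}"


info = AccessInfo(AccessMode.CONTACT_GATEKEEPER, "owner@example.org")
print(describe_access(info, date(2010, 9, 2)))
```

Treating the embargo as a separate optional field, rather than a fifth mode, mirrors the point that it is a dimension laid over whichever access model applies.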

11:05

And the last one is information for reuse. Now this obviously varies hugely by discipline, but it's the kind of metadata that you need to enable you to reuse the data once you've got it. It might be the methods section in your paper. So many

11:22

articles in Nature have now got an electronic supplement, which is the detail about how they did the experiment, because the actual amount of space in Nature itself is constrained. It might be the reagents they used in a chemical experiment. It might be the calibration values. It might even be something subtle, like: what do the variable names mean?

11:43

So for instance, if the data that you've downloaded is a spreadsheet: some of the polar data that's available through Research Data Australia at the moment is a series of Excel spreadsheets, but in a number of cases the column headings for those spreadsheets are not explained, and so you have to guess. Now,

12:05

if you're someone who works in that discipline, you can probably guess reasonably accurately. If you're someone outside the discipline trying to reuse the data, your guess might be wrong. And even if you are trying to guess, what do you do with a column that's called temp?

12:22

Is it temporary? Is it temperature? If it's temperature, is it in Kelvin, or is it in Fahrenheit, or is it in Celsius? Is it in Réaumur? There's no way of knowing unless you've got at least some kind of legend. Nick's shaking his head at Réaumur. I did it for unique so
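To make the "temp" example concrete, here is a minimal sketch of a column legend and a unit conversion that depends on it. The legend layout and the names `COLUMN_LEGEND`, `to_celsius` and `read_temperature_celsius` are assumptions for illustration; the point is only that the same number means very different things until a legend states the unit.

```python
# Hypothetical legend shipped alongside a spreadsheet: one entry per column.
COLUMN_LEGEND = {
    "temp": {"meaning": "water temperature", "unit": "kelvin"},
}


def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature reading to degrees Celsius."""
    if unit == "celsius":
        return value
    if unit == "kelvin":
        return value - 273.15
    if unit == "fahrenheit":
        return (value - 32) * 5 / 9
    if unit == "reaumur":
        return value * 1.25  # 80 degrees Réaumur is the boiling point of water
    raise ValueError(f"unknown temperature unit: {unit}")


def read_temperature_celsius(row: dict, column: str, legend: dict) -> float:
    """Refuse to guess: a column without a legend entry raises KeyError."""
    unit = legend[column]["unit"]
    return to_celsius(row[column], unit)


print(read_temperature_celsius({"temp": 300.0}, "temp", COLUMN_LEGEND))
```

Without the legend entry the function fails rather than guessing, which is the situation a re-user from outside the discipline is in with an unexplained column heading.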

12:41

there's a significant need for information for reuse, and that's going to vary enormously. So this is yet another view of ISO 2146. So rather than go through that, let me just quickly show you what that looks like for some real data. So here is

13:01

the coral example that Sally was showing you before. So here's a collections record coming from the Australian Institute of Marine Science, and this is actually displaying a number of the things that I was talking about: discovery, access,

13:21

determination of value, reuse, although not all in an optimal way. So discovery: you might have been searching for octocoral, which is what I was searching for, or Stolonifera, whatever that is. You might have searched on species names,

13:40

sea fans, sea whips, whatever. So you can imagine a range of searches that would have used some of that information to get you to this particular record. Information for determination of value: well, you might say I'm interested in rapid ecological assessment, so therefore, if they're using that particular methodology,

14:02

I'm more interested in the data set, because I think that's a good way to do things. You might say, down the bottom, that I trust the stuff that Katharina Fabricius does, or that I think that the Australian Institute of Marine Science in general does good stuff. So that might be part of your information for determination of value.

14:26

Information for access, I think, is where this particular record is going to fall down, but let's try. Yeah, okay. So in this case, this is going to take me to one of those metadata standards that Xiaobin talked about before, ISO 19115.

14:43

You can see there, you can save this record as ISO 19115, and in this particular case the online resource is not the actual data set. It's a link to the AIMS website. So in fact the way you get access to this data is

15:01

you contact Katharina Fabricius to get it. So the information for access is one of those go-through-a-gatekeeper models. And then the last one, which is information for reuse: well, some of it here they've actually put in the description, and you'll see this with a number of records.

15:21

They've actually got quite a lot of what some disciplines might call methods information in the actual description. So they're telling you how they're doing their transects. They're telling you what depths they're using. They're telling you what the site variables were, so visibility and modified Secchi technique.

15:41

And again, if you're one of those kinds of researchers, this is the kind of stuff you care about. So there's a reasonable amount of reuse information in there. Now in this case I can't actually see what the actual data looks like; I would need to contact Katharina Fabricius. The last point I wanted to make about this one is that there is information hidden in

16:03

the ISO 19115 record that we are not surfacing at the moment, and in particular, and I think Sally showed you some of this stuff before, now a lot of this is stuff that you might think,

16:20

well, I don't want to surface it. However, you probably do want to surface something like this. Here's the related publication. Yeah, so here's the related publication information. It would be nice if we could extract that and display that as related information in the collections record, because that would actually be

16:45

helpful context information for someone. So we might have to have a look at the crosswalk we're doing from ISO 19115 into our collections record. The reason why this isn't showing up is that they haven't put it in as a publication; they've put it in as supplemental information.

17:03

So it's basically just a lump of text, and it would be relatively difficult to work out how you'd surface that in a collections record. But that kind of link to a publication is the sort of thing that would be nice. And if your projects are doing data capture and there are associated publications,

17:20

it would be really nice to see links to the publications in the records as part of that determination of value information. So I won't talk about... That is not... well, that may be what I said I wanted, but that's not what I meant.
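The difficulty of surfacing a publication that sits in free-text supplemental information can be sketched as follows. This is not the ANDS crosswalk; it is a minimal illustration that pulls DOI-like strings out of a simplified ISO 19115 fragment, assuming the ISO 19139 XML namespaces. The sample record, the citation in it and the helper name `related_dois` are all made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Namespaces used by the ISO 19139 XML encoding of ISO 19115.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Simplified, invented fragment: the publication is only a lump of text.
SAMPLE = """<gmd:MD_DataIdentification
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:supplementalInformation>
    <gco:CharacterString>Related publication: Example, A. (2009).
    A hypothetical paper. doi:10.1234/example.5678</gco:CharacterString>
  </gmd:supplementalInformation>
</gmd:MD_DataIdentification>"""

# Loose pattern for DOI-like strings; a heuristic, not a validator.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def related_dois(xml_text: str) -> list:
    """Return DOI-like strings found in supplementalInformation free text."""
    root = ET.fromstring(xml_text)
    path = ".//gmd:supplementalInformation/gco:CharacterString"
    dois = []
    for node in root.iterfind(path, NS):
        for match in DOI_PATTERN.findall(node.text or ""):
            dois.append(match.rstrip(".,;)"))  # drop trailing punctuation
    return dois


print(related_dois(SAMPLE))
```

A heuristic like this only works when the text happens to contain an identifier; a citation without a DOI would still be an opaque lump of text, which is why recording the publication in a structured field is preferable.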

17:41

So if we care about these kinds of information (information for discovery, for determination of value, for access, for reuse), where are we going to get them from? Well, down the left are the ISO 2146 entities: collections, parties, activities and services. Next column across are the research instances: what does that mean in a research context? Physical and digital collections,

18:05

individuals and organizations for parties, a whole lot of information about research projects, and we're still trying to work out how to model things like synchrotrons, or beamlines on the synchrotrons, as services. Nick will have that solved by this weekend, right, Nick?

18:21

Excellent. It's actually a relatively tricky modeling task, but it is one we know we need to solve. Okay, so we just need to recognize Nick's brilliance and the problem will go away. The identifier source: Nick talked about the role of persistent identifiers for collections. Sally, Sally...

18:43

Monica's already talked about People Australia for individuals and organizations. We're in active discussion with the ARC and the NHMRC about identifiers for research projects and providing linked data endpoints for those. We don't have a solution for services at the moment, and

19:01

inside institutions this stuff may live in your institutional repositories (Xiaobin talked about that), in your institutional metadata stores, which I'll talk about in a minute, in your HR system, or in your research management system. So how does that get into the collections registry? Remembering from this morning that we use the collections registry to build the RDA pages,

19:22

you can just do a series of feeds. So that data source administrator interface that Xiaobin showed you: you can have multiple data sources for an institution. Now if you wanted to, you could have a repository, a metadata store, your HR system and your research management system as four separate data sources. That's fine.

19:42

Or you can feed stuff through a metadata aggregator and then do a single feed from the aggregator into the collections registry. That's an architecture decision that you will need to make. So what are the drivers for the metadata store? I've already talked about the paucity of metadata in the existing data solutions.
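The aggregator option mentioned above (one combined feed instead of several separate data sources) can be sketched as a merge keyed on a shared identifier. The record layout, the source names and the function `aggregate` are assumptions for illustration only; the real registry feeds use their own formats.

```python
def aggregate(feeds: dict) -> list:
    """Merge records from several institutional systems into one feed.

    `feeds` maps a source name to a list of records; records describing the
    same entity are assumed to share an "id". The first source to supply a
    field wins, and every contributing source is remembered.
    """
    merged = {}
    for source_name, records in feeds.items():
        for record in records:
            entry = merged.setdefault(
                record["id"], {"id": record["id"], "sources": []}
            )
            entry["sources"].append(source_name)
            for key, value in record.items():
                entry.setdefault(key, value)
    return list(merged.values())


feeds = {
    "repository": [{"id": "coll-1", "title": "Example collection"}],
    "hr_system": [{"id": "party-7", "name": "A. Researcher"}],
    "research_management": [{"id": "coll-1", "grant": "hypothetical-grant"}],
}
for entry in aggregate(feeds):
    print(entry)
```

The trade-off is the one described in the talk: separate feeds keep each system independent, while an aggregator gives one place to reconcile overlapping descriptions before they reach the registry.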

20:05

Clearly something that you're providing as a metadata store needs to meet your needs as an institution for managing rich metadata, and our needs to get feeds of information about collections, but also, as I think a couple of people have mentioned, it needs to solve the problems for Seeding the Commons and for data capture,

20:24

remembering that all of the data-capture-funded universities have also got Seeding the Commons money, and those are a little bit different: certainly collections stuff plus associated information; in the data capture world, possibly some object metadata as well.

20:41

In an ideal world we would have funded all of the metadata stores projects last year. We would have got those solutions finished; we would have had them ready and deployed into institutions. And we then would have said, and now, guys, we'd like to spend money with you on data capture. Unfortunately, in the world that we live in, rather than my preferred parallel universe,

21:03

we thought we had to spend all our money by the middle of next year, and so we did everything in parallel. As a result, the early-activity metadata stores projects, things that we started talking to institutions about in September last year, are either still building solutions or in some cases haven't started yet,

21:21

and we're building the activity and party identifier infrastructure that you've heard a little bit about today, and at the same time you're doing data capture and Seeding the Commons projects. So this is clearly not ideal, but is unfortunately what we're going to have to live with. The alternative would have been, I guess,

21:41

well, there really wasn't any alternative. We could, I guess, in theory have said, when we got the time extension, just put a hold on all your data capture projects and we'll come back and talk to you in a year. But we would have been lynched, so we didn't say that. So we have to live with the fact that we're trying to do everything in parallel. And if I had time at this point, I would play a fantastic commercial that EDS showed

22:03

ten or fifteen years ago now about building an aeroplane in the air while you're flying it, which is sometimes how ANDS feels. So what metadata store solutions do we have at the moment? The first is a system based on VITRO. University of Melbourne is the lead for this, which is why they're in italics.

22:23

QUT and Griffith, who we're also funding on a research metadata hub, are picking that up and using it, and UWA, I believe, are also proposing to use it too. It's an RDF triple store based solution, using technology developed out of Cornell. And Simon Porter at the University of Melbourne would, I'm sure, be happy to answer questions on that if you had any.
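For anyone unfamiliar with the triple store idea behind that first solution, here is a toy sketch in plain Python. It is not VITRO and uses no RDF library; the identifiers and the `match` helper are invented. It only shows the shape of the data: every statement is a (subject, predicate, object) triple, and queries are patterns with wildcards.

```python
# Each statement is a (subject, predicate, object) triple; all names are made up.
TRIPLES = {
    ("ex:dataset1", "ex:createdBy", "ex:researcherA"),
    ("ex:dataset1", "ex:describedIn", "ex:paper9"),
    ("ex:researcherA", "ex:memberOf", "ex:instituteX"),
    ("ex:dataset2", "ex:createdBy", "ex:researcherA"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Which data sets did researcher A create?
for s, p, o in match(TRIPLES, predicate="ex:createdBy", obj="ex:researcherA"):
    print(s)
```

Because researchers, publications, institutions and data sets all live in the same graph, the context needed for determination of value can be reached by following links from a collection.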

22:47

Second solution is a thing called inject, because we can't come up with a better name for it yet. Lead agency for this is the Australian Digital Futures Institute at USQ. Some of you may know Peter Sefton.

23:01

This is building on top of an existing institutional repository solution, and they're actually testing it out with the University of Newcastle to make sure it meets their needs. So they're building not a generic solution, but to a specific use case, and the Newcastle instance of this is called ReDBox, for reasons

23:21

that will become clear on the next slide. Swinburne University is also interested in picking this one up. Peter Sefton at USQ or Vicky Picasso at Newcastle would be able to help you with details on that. The reason it's called ReDBox is this. So all of the red stuff is what this is building,

23:41

the green stuff is external systems that this talks to, the blue stuff is the institutional repository, and the red stuff is the new components. So they have an event queue which monitors external sources of information, they have a pluggable harvester that can slip stuff out of those, they have a form system,

24:03

and they're using the institutional repository as the underlying store, and then they can feed stuff to us. If you're interested in this, Peter Sefton blogs about it reasonably frequently on his blog at ptsefton.com. Third solution is, I forgot what you called it this morning, Anthony. It's no longer called a TARDIS derivative.
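The event queue and pluggable harvester described for the second solution can be sketched as below. This is not the ReDBox code or its API; the class names, event layout and `process` function are hypothetical and only show the pattern: events from external systems queue up, and the first plugin that recognizes an event turns it into a record for the underlying store.

```python
from collections import deque


class Harvester:
    """Plugin interface: one subclass per kind of external source."""

    def can_handle(self, event: dict) -> bool:
        raise NotImplementedError

    def harvest(self, event: dict) -> dict:
        raise NotImplementedError


class HrHarvester(Harvester):
    """Hypothetical plugin turning HR-system events into party records."""

    def can_handle(self, event):
        return event["source"] == "hr"

    def harvest(self, event):
        return {"type": "party", "name": event["payload"]["name"]}


def process(queue: deque, harvesters: list, store: list) -> list:
    """Drain the queue; return events that no plugin could handle."""
    unhandled = []
    while queue:
        event = queue.popleft()
        for harvester in harvesters:
            if harvester.can_handle(event):
                store.append(harvester.harvest(event))
                break
        else:  # no harvester accepted the event
            unhandled.append(event)
    return unhandled


queue = deque([
    {"source": "hr", "payload": {"name": "A. Researcher"}},
    {"source": "instrument", "payload": {"run": 42}},
])
store = []
leftover = process(queue, [HrHarvester()], store)
print(store, leftover)
```

Supporting a new external system then means adding another `Harvester` subclass rather than changing the queue-processing loop.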

24:25

It's the I Share My Research... I was going to say I Heart My Research Data; I knew that was wrong. I Share My Research Data. So this is building on the TARDIS code that Steve Androulakis has done. I still haven't done the Twitter thing for your name yet, at spits nets if you follow want to follow me on Twitter

24:43

There we are. See, I've done it in real life. How times that So this is going to have the ability (these are Anthony's bullet points, except for the last one): the ability to stage data, upload data via web interface, annotate data with new parameters,

25:00

map experiments, provide a web service interface, and access systems around groups. This one is intended to be generic. I mean, the stuff that Steve built in the original TARDIS was specific to protein crystallography. This is a generalization of that, to sit on top of the Large Research Data Store. I'm saying it's potentially useful

25:20

because it looks like it would be a really good idea, but it's running late. So until I actually see it running, I'm reluctant to say more than potentially useful. But I'm sure Anthony would be happy to talk to you about it afterwards. And the last option, well, the second last option, is you can run the software that we use for the collections registry locally within your institution. You can run a local ORCA instance.

25:47

That was designed to operate in a federated mode, but it's primarily focused on meeting the needs of the ANDS registry, so it's not intended as an institutional metadata solution. Having said that, ANU have said that for their data capture project

26:03

they're going to use it and are probably going to be extending it. So that is another possibility that's available now, and extensions will probably be coming out of the ANU use of it. That's the architecture diagram; don't worry about that. And then the last option is you can use your existing institutional repository.

26:23

But I would skip to the bottom bullet point, which is: the recommendation is don't. So it's doable, but Xiaobin and Nick have spent some time looking at it and it's a valley of pain. The valley is deeper and longer for some institutional repository solutions than for others, but in none of them is it less than a valley,

26:43

and none is without a degree of pain. Most institutional repositories don't like storing large objects. They don't have good collection support. They really only are designed to do DC and MODS and don't cope well with doing RIF-CS, and there's a range of other problems. So think very hard before you go with that solution. I'm fractionally over time, but I'll stop there.