An Open Source Web Service For Registering and Managing Environmental Samples

Formal Metadata

Title
An Open Source Web Service For Registering and Managing Environmental Samples
Title of Series
Number of Parts
183
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language
Producer
Production Year
2015
Production Place
Seoul, South Korea

Content Metadata

Subject Area
Genre
Abstract
Records of environmental samples, such as minerals, soil, rocks, water, air and plants, are distributed across legacy databases, spreadsheets or other proprietary data systems. Sharing and integration of the sample records across the Web requires globally unique identifiers. These identifiers are essential in order to locate samples unambiguously and to manage their associated metadata and data systematically. The International Geo Sample Number (IGSN) is a persistent, globally unique label for identifying environmental samples. IGSN can be resolved to a digital representation of the sample through the Handle system. IGSN names are registered by end-users through allocating agents, which are the institutions acting on behalf of the IGSN registration agency. As an IGSN allocating agent, we have implemented a web service based on existing open source tools to streamline the processes of registering IGSNs and for managing and disseminating sample metadata. In this paper we present the design and development of the web service and its database model for capturing various aspects of environmental samples. Previous work by the System for Earth Sample Registration (SESAR) was aimed primarily at individual investigators, whereas our work focuses on curating sample descriptions from larger collaborative projects. The paper describes the linkage between the IGSN metadata elements and the sampling concepts specified in existing common data standards, e.g., the Open Geospatial Consortium (OGC) Observations and Measurements standard. This mapping allows the application of the IGSN model across different science domains. In addition, we show how existing controlled vocabularies are incorporated into the service development to support the metadata registration of different types of samples.
The proposed sample registration and curating approach has been trialled in the context of the Capricorn Distal Footprints project on a range of different sample types, varying from water to hard rock samples. The observed results demonstrate the effectiveness of the service while maintaining the flexibility to adapt to various media types, which is critical in the context of a multi-disciplinary project.
Transcript: English (auto-generated)
So, good afternoon, everyone. I'm Anusuriya, you can just call me Anu. I'm from the Mineral Resources Flagship of CSIRO. CSIRO is the federal organization for scientific research in Australia.
Today I'm going to talk about a metadata model and a web service which we developed to support the registration and management of different kinds of environmental samples in CSIRO.
Examples of environmental samples are physical specimens like water, plants, rocks, soil and insects. In current practice, these samples are collected and stored by different entities. These include, for example, individual researchers, laboratories, universities, state agencies such as geological surveys, and museums.
Each sample collector uses their own way of documenting the sample description, and this leads to several problems. For example, different names can be used to describe the same sample. It is also possible that the sample name changes over time, for example when a sample is relocated from one location to another and is then renamed.
If you use the sample within the same organization, you won't have any problem, because you probably already have some records about the samples and can easily identify them. But what happens when you want to expose a sample to somebody outside your organization? Then unique identification of the sample becomes an important factor.
This is similar to existing identification systems. For example, we have the International Standard Book Number, ISBN, a globally unique identifier which can be used to identify books. Similarly, we have the digital object identifier, DOI, which can be used to identify publications.
In a similar manner, we have the IGSN, which stands for International Geo Sample Number. It is a nine-character alphanumeric code which can be used to identify samples and specimens, and it is persistent.
What I mean by persistent is that it provides a stable link to the sample, in contrast to a URL, which can change over time. A persistent identifier has a stable link which points to the description of the sample.
So this is an example of an IGSN. I will talk about what an agent is later, but an IGSN consists of a namespace followed by a code. The namespace, the first two characters, represents the allocating agent. The code is assigned by the user, and it consists of the data center prefix followed by a combination of numbers and letters. In this case, this is a fossil which is from the Interdisciplinary Earth Data Alliance.
IEDA is one of the allocating agents formally registered with the IGSN agency, and they have the namespace IE. An existing center, project or individual researcher can register their samples through this namespace, IE.
In this case, for example, the core repository of the Lamont-Doherty Earth Observatory registered this fossil. CCR stands for core repository, and 001 is a number which is assigned to this fossil. So this is a persistent unique identifier which identifies this fossil.
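The composition just described (a two-character agent namespace, then a data center prefix, then a locally assigned code) can be illustrated with a short sketch. The identifier `IECCR0001`, the names `IgsnParts` and `split_igsn`, and the three-character prefix length are assumptions for illustration only; the talk does not fix the prefix length in general.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IgsnParts:
    namespace: str    # two characters for the allocating agent, e.g. "IE"
    data_centre: str  # prefix of the data center or project, e.g. "CCR"
    local_code: str   # letters/numbers assigned by the data center


def split_igsn(igsn: str, data_centre_len: int = 3) -> IgsnParts:
    """Split an IGSN into namespace, data center prefix and local code.

    data_centre_len is an assumption; real prefixes may differ in length.
    """
    igsn = igsn.strip().upper()
    if not igsn.isalnum():
        raise ValueError("an IGSN is expected to be alphanumeric")
    if len(igsn) <= 2 + data_centre_len:
        raise ValueError("IGSN too short to contain a local code")
    return IgsnParts(
        namespace=igsn[:2],
        data_centre=igsn[2:2 + data_centre_len],
        local_code=igsn[2 + data_centre_len:],
    )


# Illustrative value only, not a real registered sample.
print(split_igsn("IECCR0001"))
```

Because the prefix length is passed in rather than detected, the caller has to know the convention of the allocating agent in question.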
So, this is how the IGSN code works. Another important aspect of the IGSN is that, like I said before, any existing project, individual researcher or data center that wants to obtain an IGSN has to register through an allocating agent. IEDA is one example, and CSIRO is also one of the allocating agents formally registered with the IGSN top-level agency. You can only obtain this persistent identifier through an allocating agent that is formally registered with the top-level agency.
In CSIRO, as an allocating agent, we have a lot of samples, millions of samples of different types, but these samples are isolated: some are kept by the researchers, some are in the rock stores. We would like to assign IGSNs to them, and for this purpose we wanted to develop a system. This is what I'm going to talk about today.
Some history about how we became an IGSN member: we became a member in 2013, and it started from the flagship where I'm from, the Mineral Resources Flagship. Currently there are three projects, or rock stores, which will use the system that I will describe in this presentation. I would also like to point out that the work I'm going to present today is relevant to the ongoing effort with the other two allocating agents in Australia with which we collaborate, Geoscience Australia and Curtin University.
There is already some work from the Lamont-Doherty Earth Observatory: they have already developed a metadata model and services. But the existing work is mainly focused on geochemical samples, whereas in CSIRO we have different kinds of samples besides geochemical ones. There are therefore several technical limitations in terms of their services and metadata model, and that is the reason why we wanted to develop one for CSIRO, to cater for the registration of different types of samples.
So there are two contributions I will present today. The first is the descriptive metadata model, and the second is the web service, which I call the allocating agent web service. Both of these are currently being used in CSIRO.
Let's revisit this diagram, because I think it is very important for understanding which part this work belongs to. The clients are the three existing projects in CSIRO which I mentioned before, and they have different kinds of samples. They use the metadata model that we developed to send sample descriptions to the web service, which is run by the allocating agent, CSIRO. Our service then talks to the top-level service. The top-level service registers the persistent identifier and returns it to the allocating agent service, and this service then sends the persistent IGSN code to the client.
But why do we need another metadata model? Because from the allocating agent to the top-level agency, the metadata model only covers registration information; there is no information about sample descriptions. So as an allocating agent, it is our responsibility to develop a metadata model that can capture the core characteristics of different kinds of samples.
The whole idea is also to expose the data through the service to the public. Whatever sample descriptions are captured here, we would like to expose through OAI-PMH. OAI-PMH is a harvesting protocol, so the public can automatically get the descriptions from this service.
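As a rough illustration of what harvesting over OAI-PMH looks like from the client side, the sketch below builds a `ListRecords` request URL. The verb and the `metadataPrefix` and `from` arguments come from the OAI-PMH protocol; the endpoint address and the function name `list_records_url` are placeholders, since the talk does not give the address of the service.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint; the real service address is not given in the talk.
BASE_URL = "https://example.org/igsn/oai"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed since this datestamp.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL, from_date="2015-01-01"))
```

A harvester would fetch this URL, parse the XML response, and follow any resumption token the server returns until the list is complete.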
All right, now more information about the metadata model. I'm not going to explain each element in detail, but I have grouped the elements into several groups. Basically, we have some elements describing the sample identification, some elements describing how the samples were collected in the field, where they are stored and who stores them, the time dimension, and other related information. For other related information, for example, we have several relations which can be used to say that this sample is a sub-sample of another sample, or that there is data attached to the sample. These are the different kinds of relations you can use to describe the samples.
Although there are several elements, only a few are mandatory. For example, the sample number, which is the IGSN; the sample name, which is the local name of the sample; and whether it is public or private. I think this is very important, because in some projects you want to get the number but you don't want to release the metadata to the public yet, so you can set whether it is private or public. We also have the landing page. What is a landing page? The landing page provides further information about the sample. Like I said before, we only capture the core characteristics of a sample, but if you have more detailed information about the sample, it can be obtained through the landing page. And then of course the sample type, whether it is rock, water, plant, and so on.
And then the sample curation is very important: where the sample is located. For some information, we also use the concept of linked data. For example, we use controlled vocabularies to describe sample types and feature types. This uses the power of linked data to give the user more meaningful information about the concepts. We also reuse some elements from the IGSN registry schema; the registry schema is shown here. For example, we use the log elements and the related relations which I described before. These are all derived from the top-level schema.
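The mandatory elements named above could be represented along the following lines. This is only a sketch: the class name `SampleRecord`, the field names and the example values are invented for illustration and are not the element names of the actual CSIRO schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampleRecord:
    # Mandatory elements mentioned in the talk (field names are illustrative).
    sample_number: str  # the IGSN
    sample_name: str    # local name of the sample
    is_public: bool     # whether the metadata may be released yet
    landing_page: str   # URL offering more detailed information
    sample_type: str    # ideally a controlled-vocabulary term or URI
    # Optional example of a relation: the sample this one was taken from.
    parent_sample: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of mandatory text fields that are empty."""
        required = ("sample_number", "sample_name",
                    "landing_page", "sample_type")
        return [name for name in required
                if not getattr(self, name).strip()]


# Illustrative values only.
record = SampleRecord(
    sample_number="CSCAP0001",
    sample_name="water sample 1",
    is_public=False,
    landing_page="https://example.org/samples/CSCAP0001",
    sample_type="water",
)
print(record.missing_fields())
```

Keeping the public/private flag as a separate mandatory field mirrors the point made in the talk that a project may want an identifier before it is ready to release the metadata.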
This is to show that what we developed is not an entirely new schema; we use the existing schema and customize it to cater for different types of samples. All right, just some examples of what I mean by identification and sampling activity. Identification, for example, consists of the number, name, other names, sample type, classification concept, why it was collected, and so on. Sampling activity covers the collection information: the location where the sample was collected, the time, the sampling feature, the host from which the sample was collected, who collected the sample, the size, the measurement, the method, the campaign, and so on. For example, if you collect water from an observation well, the observation well is the sampling feature. So this is an overview of the descriptive metadata.
Okay, once we have the metadata, the next step is the service that we developed. This service is used by clients. Who are the clients? Existing projects, or it can be individual researchers. They format their sample descriptions according to this descriptive metadata model and then send them to the service. The service implements a REST API. These are some of the operations which are supported, for example to list all the namespaces, to register a namespace, to register samples, and to get the metadata of a sample through its sample number.
Let's look in more detail at, for example, registering samples. This is a POST method, and it returns a list of successful and unsuccessful sample registrations. I know this is really small. It may look good on the projector, but from here, I don't know. Anyway, you have the client, for example an existing project in CSIRO; the allocating agent, which is us, the service; and then the top-level agency. First, you send the XML, and then we do the validation: user validation, XML validation to check that the data is valid, and then we also validate the namespace. Each data center or client can request a unique name, and we want to ensure that only the holder of that unique name can register samples under it. In this process, you can assume the client program sends 200, 500 or 10,000 sample descriptions. But the problem is that this part, between the allocating agent and the IGSN agency, only supports sequential registration, so you can only register one sample at a time.
There is no support for registering multiple samples at once. So what we do is iterate, to make sure that all samples are registered. Only the samples which are successfully registered are inserted into our database, so we keep a copy of the sample registration here, and we then send the client the list of successful and unsuccessful samples. It is quite likely that not all samples can obtain an IGSN; if you send 100, maybe you will get only 80, due to network failure for example. So we ensure that we store only successfully registered sample descriptions here, and then tell the client which samples were registered successfully and which were not.
I would like to show you one example where the metadata schema and the data model have been applied. There is a project called the Capricorn Distal Footprints project. The members of this project come from UWA, the University of Western Australia, CSIRO, and the Geological Survey of Western Australia. This project is basically about finding minerals, such as gold or copper. Within this project, they collect different kinds of samples, like plant, water, soil and rock. We have implemented the system so that the Capricorn curation system sends the request, and the service mints the IGSN through the top-level registration agency, stores the description, and then sends the IGSN back to the client, which is the sample curation system of the Capricorn project. So these are the results.
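The batch registration flow described earlier (iterate over the submitted samples, register each one sequentially with the top-level agency, store only the successes, and report both lists to the client) can be sketched as follows. The names `register_batch` and `fake_register` and the dictionary keys are hypothetical; the real service and its database layer are not shown in the talk, so the upstream call is passed in as a function and simulated here.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Sample = Dict[str, str]


def register_batch(
    samples: Iterable[Sample],
    register_one: Callable[[Sample], str],
    store: List[Sample],
) -> Tuple[List[str], List[str]]:
    """Register samples one at a time and report the outcome.

    register_one stands in for the call to the top-level registration
    service: it returns the registered IGSN or raises on failure.
    """
    succeeded: List[str] = []
    failed: List[str] = []
    for sample in samples:
        name = sample["sample_name"]
        try:
            igsn = register_one(sample)
        except Exception:
            failed.append(name)  # e.g. a network failure
            continue
        # Keep a local copy only for samples that were actually registered.
        store.append({**sample, "igsn": igsn})
        succeeded.append(igsn)
    return succeeded, failed


def fake_register(sample: Sample) -> str:
    """Simulated upstream call, used here instead of a real agency."""
    if sample["sample_name"] == "bad":
        raise ConnectionError("simulated network failure")
    return "CSCAP" + sample["sample_name"]


database: List[Sample] = []
ok, not_ok = register_batch(
    [{"sample_name": "0001"}, {"sample_name": "bad"}],
    fake_register,
    database,
)
print(ok, not_ok)
```

Returning the failed names alongside the minted identifiers lets the client resubmit only the samples that did not go through.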
If you look, this is an example of the XML, which is created based on the metadata model I described before. And this is an IGSN, CS-CAP-001. CS is the namespace for CSIRO as allocating agent. CAP is the prefix for the Capricorn project, that is, the prefix for the data center, and this is followed by something which is assigned by the data center. And when we register it, this is the main registration agency, which shows that the sample has been registered. So there is a handle now; this is the persistent identifier. 10273 is the namespace for IGSN, and this is followed by the IGSN. This is an actionable persistent link, so if you click on it, it will give you more detailed information about the sample.
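How an IGSN becomes an actionable link can be shown in a few lines. The prefix 10273 is the one named in the talk. Using the public Handle proxy at hdl.handle.net as the resolver is an assumption here, and `igsn_to_url` and the identifier `CSCAP0001` are illustrative.

```python
IGSN_HANDLE_PREFIX = "10273"        # Handle namespace for IGSN (from the talk)
RESOLVER = "http://hdl.handle.net"  # assumed resolver: the public Handle proxy


def igsn_to_url(igsn: str) -> str:
    """Return the resolvable Handle URL for an IGSN."""
    return f"{RESOLVER}/{IGSN_HANDLE_PREFIX}/{igsn.strip().upper()}"


# Illustrative identifier, not necessarily a registered sample.
print(igsn_to_url("CSCAP0001"))
```

Resolving such a URL redirects to whatever landing page the allocating agent registered, which is why the link stays stable even if the landing page moves.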
All right, to conclude. What I have described so far is the development of a metadata model for samples and a web service implementation for CSIRO. The contribution here is that the metadata model is not domain specific. It is not developed for a specific type of sample; rather, it can be used to describe the main properties of different types of samples. This solution, both the schema and the service, is important for facilitating the sharing of sample descriptions inside and outside CSIRO.
What we would like to do next is to test the solution with the other two repositories. Actually, we have already registered some samples, about 1,500 sub-collections, from the Australian Resources Research Centre, ARRC. The next step is to register the samples from the Mineral Reflectance Spectra, another store, a mineral store. We would also like to formally document the mapping between our metadata model and existing metadata models, for example ISO and OGC. This is to ensure the correct application of the metadata model which we developed across different domains. And finally, we will also develop a web portal with Curtin University and Geoscience Australia, and this portal will be used to browse the sample descriptions from different allocating agents in Australia. Thank you for your attention.
Thank you, Anu. Are there any questions from the floor?
I wanted to ask you: you mentioned these ISO standards on the last slide. Is that where the one on observations and measurements comes in? How does that relate to what you are doing?
When we wanted to develop the metadata model, we looked into existing open standards, because we did not want to develop a new one. The problem with the ISO standard is that, first, it is not suitable for describing different kinds of samples, physical objects, and it is complex. Some concepts can be used, but not for all the characteristics which I described on the slide with the different groups. But OGC has Observations and Measurements, part two, which addresses sampling features. We see there is an overlap, but the modeling scope is different, because the OGC specification is more observation-centric, about how the sampling is done, whereas in our case it is more about the sample and its core properties. Of course, the observation is one of the elements. So what I meant is just to show what the elements in our metadata model are and how they are aligned with current standards like OGC.
So if you say ISO, you're talking about 19115, the metadata one? Yeah, 19115. Because there's also one on observations and measurements, 19156. And this is from there. No, it's from ISO also. It went to ISO. Yes, but this is originally from OGC, Observations and Measurements.
And there's part two, for sampling features. Thank you. You wanted to say something? Any other questions? I have another question. Sorry for all my questions. If you go back to the front, Anu, it's not something that you do, but I was just curious, right where you explain the IGSN. This one? Yes. There are only four digits here. So are there never more than 1,000 samples, or 9,999?
It works this way. According to the IGSN documentation, the recommended length is nine. That consists of the agent namespace and the code, and the code includes the data center prefix. But if the allocating agent thinks that there are samples, for example from the marine ships, which have really long numbers, then the allocating agent has the right to accommodate this kind of use case. So nine is recommended because it is easily readable in a barcode, but it is only a recommendation.
Have you thought about including the location code in the sample number as well? There is something called a location coding system.
With 12 digits, you can define the latitude and longitude of the sample.
Actually, that's the idea of the IGSN. Let me go to the last slide. This is an identifier, and it is an actionable link. When you click this link, it gives you more descriptive metadata, which includes the location information. Also, not all samples have a location, for example synthetic samples. Synthetic samples never have a location, and not all locations are georeferenced. For example, something generated with a 3D filter never has a location. It is more like a locality than a georeferenced location. But that's the nice thing about having an identifier: it is unique, persistent and actionable. When you click it, it goes to the landing page, which gives you more information about the sample.
Thank you. Any other questions? Then we must thank the speaker. Thank you very much.