Snippet - FAIR Accessible #2 - A for Accessible: Data Accessibility Practice at NCI


Formal Metadata

Title
Snippet - FAIR Accessible #2 - A for Accessible: Data Accessibility Practice at NCI
Title of Series
Author
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2017
Language
English

Content Metadata

Subject Area
Abstract
Jingbo Wang, Data Collections Manager at NCI presents on how they make data accessible through services over the data so they can be interrogated and used by humans and machines. The FAIR data principles were drafted by the FORCE11 group in 2015. The principles have since received worldwide recognition as a useful framework for thinking about sharing data in a way that will enable maximum use and reuse. This webinar series is a great opportunity to explore each of the 4 FAIR principles in depth - practical case studies from a range of disciplines and organisations from around Australia, and resources to support the uptake of FAIR principles.
Right, so my name is Jingbo Wang. I work at the National Computational Infrastructure (NCI), a supercomputer centre located on the Australian National University campus. Today I'm going to address the different flavours of data accessibility practice at NCI. Before that, I just wanted to make a comment that the FAIR principles are quite useful for governing our data management practice, and we use them in every single aspect of our data management.
This is a quick overview of the data sets we have. As you can see, I've listed the main data types that we support at NCI: national collections of climate models, satellite imagery, bathymetry and elevation, hydrology, and geophysics. Those data are quite geospatially focused, but we also have other social science data, genomic sequencing data, and astronomy data.
We aim to provide users with data as a service, as many digital repositories do. In our data management we catalogue data so that people can query the metadata database to find out what we have here. We also publish data through various data services; that's the focus I'm going to talk about in the next few slides. We offer data quality assurance, data quality control, and benchmarking. We provide data through virtual laboratories, and we also provide help with data visualisation. If I wanted to name something that makes us different from other digital repositories, it is that we are co-located with an HPC (high-performance computing) facility. Given the large scale of the data we host, more than 10 petabytes of research data, we really want to make good use of high-performance computing to advance science research.
These are the six dot points that I wanted to address today about data access. I've put the key words in red to show the difference between the points. Initially I will talk about how we control data access, and then I will present one example of how we use persistent identifiers to manage data access. Then I will talk about the two main data services that we offer at NCI for our users: one is THREDDS, and the other one is GSKY, which is a fancier, scalable, distributed data server. Finally, I'm going to cover very quickly data versioning and data quality.
The first point is about how we control data access. Most of our data come from stakeholders such as Geoscience Australia, the Bureau of Meteorology, CSIRO, and universities, and much of the data has been funded by the Australian government, so it naturally falls under a CC BY 4.0 licence. Some owners also impose non-commercial, no-derivatives, or share-alike variants of CC BY. We also have international partners, for example in Europe and the US, and they impose even stricter terms and conditions if we want to access their data. So that is the legal perspective on how we control data access through licences.

On the file system, we actually hard-code the data access control using ACLs. This is how we separate different groups of people accessing the same data. Basically, for each collection we have two access groups. The first group has read and write permission; those are the data managers, who are able to generate, write, and modify data. The second group is a read-only group: people in the read-only group can access the data on the file system, but they can't modify it. This way we protect the integrity of the data, because we only give write access to authorised people who really manage the data.

There is also a social aspect of data access. For a research project we often see an embargo period; for example, the data can be made available two years after the project. Some researchers also say, "I want to share my data after my journal article about the data is published." Another example is from the Bureau of Meteorology: for one data set there is a six-month delay between the data being developed and verified and it becoming operationally available on our THREDDS server.
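The two-group model described above can be sketched as follows; the group membership and user names here are purely illustrative, not NCI's actual configuration:

```python
# Hypothetical sketch of the per-collection two-group access model:
# one read-write group of data managers, one read-only consumer group.
READ_WRITE = {"alice", "bob"}    # data managers: may create and modify data
READ_ONLY = {"carol"}            # consumers: may read but not modify

def can_read(user):
    """Anyone in either group may read the collection's data."""
    return user in READ_WRITE or user in READ_ONLY

def can_write(user):
    """Only data managers may modify data, protecting its integrity."""
    return user in READ_WRITE
```

On a real POSIX file system this separation would be expressed with group ownership and ACL entries rather than application code, but the policy is the same: read access is broad, write access is restricted to the managing group.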
The second point I wanted to raise is our practice of implementing persistent identifiers. We often experience frustration when we give people a URL to access the data: it is only valid for a certain period of time, or only while somebody maintains it, and afterwards we can't really guarantee it. Also, look at the original URLs on the left-hand side of the slide; those are the metadata catalogue URL and the service endpoint URL. Take the second one, the service endpoint. From its naming convention you can tell that the latter part includes the project code, file path, and file name. If anything in this path changes, for example the project code changes, a file is renamed, or files are shuffled around, the link will be broken. So the original URL we provided here is not a very stable one. We adopted a product that CSIRO developed some time ago, a persistent identifier broker. Most of the time we give external users the naming convention on the right-hand side. As you can see, we have four main categories after the PID service domain: datasets, services, documentation, and vocabularies. The only unique part is the file identifier, or UUID. Basically, as long as the identifier stays the same, the URL on the right-hand side is consistent. If anything changes in the original URL on the left-hand side, all we need to do is update the mapping inside the PID service broker, without interrupting the URL that we gave to the external user. We have published the technical implementation in a journal article, so you are welcome to have a look.
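The broker pattern described here, a stable PID in front of a mutable mapping, can be sketched minimally; the domain, PID, and paths below are invented for illustration:

```python
# Minimal sketch of a PID broker: external users keep the stable PID,
# and only the internal mapping is updated when files move or are renamed.
# The identifier and URLs are illustrative, not NCI's real ones.
pid_map = {
    "dataset/abc-123": "https://example.org/thredds/projX/v1/data.nc",
}

def resolve(pid):
    """Return the current location behind a persistent identifier."""
    return pid_map[pid]

# The file is moved to a new path: update only the mapping.
pid_map["dataset/abc-123"] = "https://example.org/thredds/projX/v2/data.nc"
# resolve("dataset/abc-123") now points at v2; the PID itself is unchanged.
```

In production the broker would answer the PID URL with an HTTP redirect to the mapped location, so existing links in papers and catalogues never break.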
Now I'm going to talk about the main data services, which is really what I was asked to address, from an end user's perspective. I divide our data services into two main groups. One is OGC services; I'll explain what OGC is in a second. The other type of data service is more project-specific. For example, we host one of the largest nodes in the southern hemisphere of the Earth System Grid Federation, which is an aggregation of climate models from global research institutes; the way we provide that service is to replicate the main climate model data to serve Australian users. Another fancy data service I'm going to show you a bit more of is GSKY, a scalable data server that interacts directly with our file system.

So what is OGC? OGC is the Open Geospatial Consortium, an international non-profit organisation that makes quality open standards for the global geospatial community. We find OGC standards quite useful because we have a lot of geospatially featured data, and OGC has all sorts of standards for different types of mapping, features, coverages, and processing for us to use. Because they are so common and free for people to use, if we make data available through OGC standards, a lot of people can naturally access our data. That's the motivation.
So what is an OGC service? It is actually an API in the middle, between the data store and the user, and the user can request whatever is available from the OGC service. Let's say I want a map of an anomaly across the whole Australian continent, and NCI hosts this data; but we host the data, we don't host images. What the OGC web service does is extract an image and return it to the user, and the user can take the URL, which contains an image of the data, and put it on their own web portal. For example, you can copy and paste the URL onto the National Map to show the grids.
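A map request of the kind described, asking the service to render an image of the data, is just a parameterised URL in the standard WMS style. A sketch, where the endpoint and layer name are placeholders (only the parameter names are standard WMS 1.3.0):

```python
from urllib.parse import urlencode

def getmap_url(endpoint, layer, bbox, width=512, height=512):
    """Build a WMS 1.3.0 GetMap URL for a layer over a bounding box.

    `bbox` is (min_lat, min_lon, max_lat, max_lon) for EPSG:4326 axis order.
    """
    params = {
        "service": "WMS",
        "version": "1.3.0",
        "request": "GetMap",
        "layers": layer,
        "crs": "EPSG:4326",
        "bbox": ",".join(str(v) for v in bbox),
        "width": width,
        "height": height,
        "format": "image/png",
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical anomaly layer over the Australian continent:
url = getmap_url("https://example.org/wms", "anomaly", (-44.0, 112.0, -10.0, 154.0))
```

The URL returned by such a request is exactly what a user can paste into a portal such as the National Map: the portal fetches the rendered PNG, while the underlying data stays on the server.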
NCI has two main production data services. One is via THREDDS, and you can often find the THREDDS links in our data catalogues. This is the interface of GeoNetwork; the red-circled link is the NCI THREDDS server, which you can click and open. The second interface is the data catalogue. They more or less contain the same information but serve different purposes: GeoNetwork is mainly for data harvesters, that is, machine access, while the data catalogue is human-readable.

THREDDS, in very simple terms, is a data service which allows you to browse and access the data. I've listed here the six main types of service that THREDDS offers. The first is OPeNDAP, and the NetCDF Subset Service is for subsetting the data. We have a lot of very large data sets, but in practice, when scientists access the data, they don't necessarily have to access all of it; they might just need a very small piece from this big pool. What THREDDS offers is the ability to refine your query and get only the part of the data that you want, which really saves a lot of traffic on the internet. Then there are the two standard OGC services, the Web Map Service and the Web Coverage Service, which are very popular for accessing maps and coverages directly out of our data. THREDDS also offers a very quick data viewer: if you don't know what a data set is, you can have a quick look at it on the web without downloading it. And of course THREDDS also offers direct download, if you really want to download the data.
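Subsetting in the OPeNDAP style works by appending a constraint expression to the dataset URL, so only the requested slab of the array travels over the network. A sketch of building such a constraint; the dataset URL, variable name, and index ranges are illustrative:

```python
def opendap_subset(dataset_url, variable, slabs):
    """Build an OPeNDAP-style constraint expression for a hyperslab.

    `slabs` is a list of (start, stride, stop) index ranges, one per
    dimension, producing e.g. temp[0:1:9][100:1:199].
    """
    ranges = "".join(f"[{a}:{b}:{c}]" for a, b, c in slabs)
    return f"{dataset_url}.dods?{variable}{ranges}"

# Hypothetical example: first 10 time steps, a 100-cell latitude window.
url = opendap_subset("https://example.org/thredds/dodsC/projX/data.nc",
                     "temp", [(0, 1, 9), (100, 1, 199)])
```

In practice a client library (such as netCDF with OPeNDAP support) builds these expressions for you; the point is that the server does the slicing, which is what saves the traffic mentioned above.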
The other fancy, scalable, distributed data server I was talking about is GSKY. GSKY is an in-house product that NCI developed. What it does is this: we have a lot of data on the file system, millions of files, and if we want people to query this data, it would be very hard to create millions of metadata records, one for every single file. So what we've done is use a crawler to crawl the file system, get the header of each file, and formulate a metadata database. The database then becomes the query entry point for handling requests, for example "give me some images within this polygon at this time". The metadata database includes the essential geospatial information and returns to the user what they requested. We recently published the technical details of GSKY's implementation; you are more than welcome to have a look.
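The crawl-then-index approach described, reading each file's header once and answering spatial queries from the resulting metadata database, can be sketched in memory; the file names and bounding boxes below are made up, and a real crawler would of course parse actual file headers into a real database:

```python
# In-memory sketch of a header index: each record stores a file's spatial
# extent, and queries return the files whose extents intersect a region.
index = []

def crawl(path, bbox):
    """Pretend to read a file's header and record its bounding box
    as (min_x, min_y, max_x, max_y)."""
    index.append({"path": path, "bbox": bbox})

def query(region):
    """Return the paths of files whose bounding boxes intersect `region`."""
    rx0, ry0, rx1, ry1 = region
    return [r["path"] for r in index
            if not (r["bbox"][2] < rx0 or r["bbox"][0] > rx1 or
                    r["bbox"][3] < ry0 or r["bbox"][1] > ry1)]

crawl("a.nc", (110.0, -45.0, 130.0, -30.0))
crawl("b.nc", (140.0, -20.0, 155.0, -10.0))
```

The pay-off is the one the talk describes: the expensive file-system walk happens once, at crawl time, and every subsequent polygon-and-time request is answered from the compact index instead of millions of file opens.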
"I think we're getting close to time, and I just wanted to let you know there are only about a minute or two left, so if you could wrap up." Yes, the last two points. First, versioned data: again, because of the scale of the data, we can't really store every single step of the data. What we do instead is store the raw data and the final version, and keep the URIs of the metadata for the intermediate steps. That way the provenance information is kept, and we also save storage.
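The versioning compromise described, keeping full copies only of the raw and final data while intermediate steps survive as metadata URIs, might be recorded like this; all names and URIs here are invented:

```python
# Sketch of a provenance record: data is stored only for the raw and final
# versions, while each intermediate step keeps just a metadata URI. This
# preserves the provenance chain while saving storage.
record = {
    "raw":   {"stored": True,  "path": "/data/projX/raw.nc"},
    "steps": [
        {"stored": False, "metadata_uri": "https://example.org/meta/step1"},
        {"stored": False, "metadata_uri": "https://example.org/meta/step2"},
    ],
    "final": {"stored": True,  "path": "/data/projX/final.nc"},
}

# Only two full copies exist, however many processing steps there were.
stored_copies = sum(1 for v in (record["raw"], record["final"]) if v["stored"])
```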
The last point is data quality. As some users have said, we can't really assume that being able to access data means the data is flawless. By publishing data alongside a quality report, we want to provide data access with a certain level of assurance. We also have a publication on this that will be in place very soon. Thank you for your attention; that's our experience so far with data access.