Snippet - FAIR Accessible #2 - A for Accessible: Data Accessibility Practice at NCI
Formal Metadata
Title: Snippet - FAIR Accessible #2 - A for Accessible: Data Accessibility Practice at NCI
Author: Jingbo Wang
License: CC Attribution 4.0 International: You are free to use, adapt, copy, distribute and transmit the work or content, in adapted or unchanged form, for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date: 2017
Language: English

Content Metadata
Abstract: Jingbo Wang, Data Collections Manager at NCI, presents on how NCI makes data accessible through services over the data, so that it can be interrogated and used by humans and machines. The FAIR data principles were drafted by the FORCE11 group in 2015. The principles have since received worldwide recognition as a useful framework for thinking about sharing data in a way that enables maximum use and reuse. This webinar series is a great opportunity to explore each of the four FAIR principles in depth, with practical case studies from a range of disciplines and organisations around Australia, and resources to support the uptake of the FAIR principles.

00:00
Right, so my name is Jingbo Wang. I work at the National Computational Infrastructure (NCI), which is a supercomputer centre located on the Australian National University campus. Today I'm going to address the different flavours of data accessibility practice at NCI. Before I do that, I just want to make the comment that the FAIR principles are quite useful for governing our data management practice, and we use them in every single aspect of our data management.
00:37
This is a quick overview of the datasets we have. As you can see, I've listed here the main data types that we support at NCI: national collections of climate models, satellite images, bathymetry, elevation, hydrology and geophysics. Those data are quite geospatially focused, but we also have other social science data, genomic sequencing data and astronomy data.
01:10
We aim to provide users with data as a service, as many digital repositories do. In our data management we catalogue the data so that people can query the metadata database to find what we have. We also publish data through various data services; that's the focus I'm going to talk about in the next few slides. We offer data quality assurance, data quality control and benchmarking. We provide data through virtual laboratories, and we also provide help with data visualisation. If I had to name something that makes us different from other digital repositories, it is that we are co-located with a high-performance computing (HPC) facility. Given the large scale of the data (we host more than ten petabytes of research data), we really want to make good use of high-performance computing to advance science research.
02:12
These are the six dot points I want to address today about data access; I've put the key words in red to show the difference between the points. First I will talk about how we control data access. Then I will present one example of how we use persistent identifiers to manage data access. Then I will talk about the two main data services we offer at NCI for our users: one is THREDDS, and the other is GSKY, which is a fancier, scalable, distributed data server. Finally I will cover, very quickly, data versioning and data quality.
02:59
The first point is about how we control data access. Most of our data come from our stakeholders, such as Geoscience Australia, the Bureau of Meteorology, CSIRO and universities, and much of the data has been funded by the Australian government, so it naturally falls under a CC BY 4.0 licence. Some owners also impose non-commercial, no-derivatives or share-alike variants of CC BY. We also have international partners, in Europe and the US, and they impose even stricter terms and conditions if we want to access their data. So that is the legal perspective on how we control data access through licences. On the file system, we hard-code the data access control using ACLs. This is how we separate different groups of people accessing the same data: for each collection we have two access groups. The first group has read and write permission; those are the data managers, who are able to generate, write and modify data. The second group is a read-only group: people in the read-only group can access the data on the file system, but they can't modify it. This way we protect the integrity of the data, because we only give write access to authorised people who actually manage the data. There is also a social aspect of data access. For a research project we often see an embargo period, so that the data can only be made available, say, two years after the project. Some researchers also say, "I want to share my data only after my journal article about the data is published." Another example is from the Bureau of Meteorology: for one dataset there is a six-month delay between when the data is developed and verified and when it becomes operationally available on our THREDDS server.
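As a minimal sketch of the two-group model described above (a read-write group for data managers, a read-only group for everyone else), the idea can be modelled as a mapping from groups to permissions. The group names and paths here are hypothetical placeholders, not NCI's actual conventions:

```python
# Sketch of the two-group collection access model. Group names
# ("xx00_rw", "xx00_ro") and the path are hypothetical examples.
#
# On a real file system the equivalent might be POSIX ACLs, e.g.:
#   setfacl -R -m g:xx00_rw:rwx /g/data/xx00   # managers: read + write
#   setfacl -R -m g:xx00_ro:r-x /g/data/xx00   # consumers: read-only

ACL = {
    "xx00_rw": {"read", "write"},   # data managers
    "xx00_ro": {"read"},            # read-only consumers
}

def effective_permissions(user_groups):
    """Union of the permissions granted by each group the user belongs to."""
    perms = set()
    for group in user_groups:
        perms |= ACL.get(group, set())
    return perms

manager = effective_permissions(["xx00_rw"])   # can read and write
reader = effective_permissions(["xx00_ro"])    # can only read
outsider = effective_permissions(["other"])    # no access at all
```

Separating the write path this way is what protects the integrity of the published data: most users can only ever read.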
05:25
The second point I want to raise is our practice of implementing persistent identifiers. We often experience frustration when we give people a URL to access the data: it is only valid for a certain period of time, or only valid while somebody maintains it; after that we can't really guarantee it. On the left-hand side of the slide are the metadata catalogue URLs and service endpoint URLs. Look at the second one, the service endpoint: from this URL naming convention you can tell that the latter part includes the project code, file path and file name. If anything in this path changes (for example, the project code changes, the file is renamed, or the file is shuffled around), the link will be broken. So the original URL we provided here is not a very stable one. We adopted a product that CSIRO developed some time ago, a persistent identifier broker. Now we give external users the naming convention on the right-hand side. As you can see, we have four main categories after the PID service domain: datasets, services, documentation and vocabularies. The only unique part is the file identifier, or UUID. Basically, as long as the identifier stays the same, the URL on the right-hand side stays consistent. If anything changes in the original URL on the left-hand side, all we need to do is update the mapping inside the PID service broker, without interrupting the URL we gave to the external user. We have published the technical implementation in a journal paper, so you are welcome to have a look.
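The broker idea described above can be sketched in a few lines: the public, persistent URL embeds only a stable identifier, and the broker maps it to the current (changeable) location. All identifiers and URLs below are fabricated for illustration, not NCI's real ones:

```python
# Sketch of a PID broker: a persistent identifier resolves to whatever
# the current location happens to be. IDs and URLs are hypothetical.

mapping = {
    "dataset/4f21c0d8": "https://dap.example.org/thredds/xx00/v1/file.nc",
}

def resolve(pid_path):
    """Return the current location for a persistent ID, or None if unknown."""
    return mapping.get(pid_path)

# When the underlying file moves or the project code changes, only the
# broker's mapping is updated; the persistent URL handed out to external
# users never changes.
mapping["dataset/4f21c0d8"] = "https://dap.example.org/thredds/yy11/v2/file.nc"
```

In production this resolution step would be an HTTP redirect service rather than an in-process dictionary, but the contract is the same: external links stay stable while internal paths are free to change.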
07:42
Now I'm going to talk about the main data services, which is really what I was asked to address, from an accessibility perspective. I divide our data services into two main groups. One is the OGC services; I'll talk about what OGC is in a second. The other type of data service is more project-specific: for example, we host one of the largest nodes in the southern hemisphere of the Earth System Grid Federation, which is an aggregation of climate models from global research institutes, and the way we provide that service is to replicate much of the climate model data to serve Australian users. Another fancy data service I'm going to show you a bit more of is called GSKY; it's a scalable data server that interacts directly with our file system.
08:45
So what is OGC? It is the Open Geospatial Consortium, an international non-profit organisation that makes quality open standards for the global geospatial community. We find OGC standards quite useful because we have a lot of geospatially featured data, and OGC has all sorts of standards for different types of mapping, feature, coverage and processing services for us to use. Because the standards are so common, and free for people to use, making our data available through OGC standards means a lot of people can naturally access it. That's the motivation.
09:23
So what is an OGC service? It's actually an API sitting in the middle between the data store and the user, and the user can request whatever is available through the service. Let's say I want a map of an anomaly across the Australian continent, and NCI hosts this data; but we host the data, we don't host images. What the OGC web service does is extract the image and return it to the user, and the user can take the URL, which contains an image of the data, and put it on their own web portal. For example, you can copy and paste the URL onto NationalMap to show the grids.
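The map-image request described above corresponds to an OGC WMS `GetMap` call. The sketch below just assembles such a request URL; the endpoint and layer name are hypothetical placeholders, and only the parameter names follow the WMS 1.3.0 standard:

```python
from urllib.parse import urlencode

# Hypothetical WMS endpoint; the parameters are standard WMS 1.3.0.
endpoint = "https://thredds.example.org/thredds/wms/xx00/anomaly.nc"

params = {
    "service": "WMS",
    "version": "1.3.0",
    "request": "GetMap",
    "layers": "temperature_anomaly",      # hypothetical layer name
    "crs": "EPSG:4326",
    "bbox": "-44,112,-10,154",            # roughly the Australian continent
    "width": 800,
    "height": 600,
    "format": "image/png",
}

getmap_url = endpoint + "?" + urlencode(params)
# Fetching this URL returns a rendered PNG of the data, which a portal
# such as NationalMap can embed directly; the server never stores images,
# it renders them on demand from the underlying dataset.
```

Note that in WMS 1.3.0 with EPSG:4326 the bounding box is given in latitude/longitude order, which trips up many first-time users.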
10:17
NCI has two main production data services. One is via THREDDS, and you can often find the THREDDS links in our data catalogues. This is the interface of GeoNetwork; the red-circled link is the NCI THREDDS server, which you can click and open. The second interface is the data
10:37
catalogue. The two more or less contain the same information but serve different purposes: GeoNetwork is mainly for data harvesters, that is, machine access, while the data catalogue is for human readers.
10:53
THREDDS, in very simple terms, is a data service that allows you to browse and access the data. I've listed here the six main types of services THREDDS offers. The first two, OPeNDAP and the NetCDF Subset Service, are for subsetting the data. We have a lot of very large datasets, but in practice, when scientists access the data, they don't necessarily need all of it; they might need just a very small piece out of the big pool. What THREDDS offers is that you can refine your query and get only the part of the data that you want, which really saves a lot of traffic on the internet. The other two, the standard OGC Web Map Service and Web Coverage Service, are very popular for people accessing maps and coverages directly out of our data. THREDDS also offers a quick data viewer: if you don't know what a dataset contains, you can have a quick look at it on the web without downloading it. And of course THREDDS offers direct download if you really do want to download the data.
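The subsetting workflow described above, asking the server for only the variable, region and time range you need, can be sketched as a NetCDF Subset Service request. The endpoint and variable name below are hypothetical; the parameter names follow the NCSS convention:

```python
from urllib.parse import urlencode

# Hypothetical NCSS endpoint; "precip" is a made-up variable name.
endpoint = "https://thredds.example.org/thredds/ncss/xx00/precip.nc"

params = {
    "var": "precip",
    "north": -10, "south": -44, "west": 112, "east": 154,  # spatial subset
    "time_start": "2015-01-01T00:00:00Z",                  # temporal subset
    "time_end": "2015-12-31T23:59:59Z",
    "accept": "netcdf",
}

subset_url = endpoint + "?" + urlencode(params)
# The server returns a small NetCDF file containing only the requested
# slice, rather than the whole multi-gigabyte dataset, which is what
# saves the network traffic mentioned above.
```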
12:10
The scalable, distributed data server I was talking about is GSKY. GSKY is an in-house product developed at NCI. What does it do? We have a lot of data on the file system, millions and millions of files, and if we want people to query this data it would be very hard to create millions of metadata records, one for every single file. So what we've done is use a crawler to crawl the file system, get the header of each file, and formulate that as a metadata database. The database then becomes the query window through which people hand in a request, say, "give me some images within this polygon at such-and-such a time." The metadata database includes the essential geospatial information, and it returns to the user what they requested. We recently published the technical details of the GSKY implementation; you're more than welcome to have a look. [laughter]
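The crawl-and-index idea described above can be sketched very simply: extract each file's spatial extent and timestamp from its header into a small database, then answer region queries from that index instead of touching the files. The file paths and extents below are fabricated examples, and a stub list stands in for real header extraction:

```python
import sqlite3

# Fabricated "headers" as a crawler might extract them from file metadata:
# (path, west, south, east, north, timestamp)
headers = [
    ("/g/data/xx00/tile_a.nc", 112.0, -44.0, 130.0, -25.0, "2015-06-01"),
    ("/g/data/xx00/tile_b.nc", 130.0, -44.0, 154.0, -25.0, "2015-06-01"),
    ("/g/data/xx00/tile_c.nc", 112.0, -25.0, 154.0, -10.0, "2015-06-01"),
]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE files "
    "(path TEXT, west REAL, south REAL, east REAL, north REAL, ts TEXT)"
)
db.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)", headers)

def query_bbox(west, south, east, north):
    """Paths of files whose extent intersects the requested bounding box."""
    rows = db.execute(
        "SELECT path FROM files "
        "WHERE NOT (east < ? OR west > ? OR north < ? OR south > ?)",
        (west, east, south, north),
    )
    return [r[0] for r in rows]

# Request data over a small region in the continent's south-east:
hits = query_bbox(140.0, -40.0, 150.0, -30.0)
```

The real system naturally uses richer metadata and a production database, but the principle is the same: one lightweight index row per file replaces millions of hand-made catalogue records.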
13:26
[Host] I think we're getting close to time; there's only about a minute or two left, so if you could wrap up. [Jingbo Wang] Yes. The last two points: first, versioned data. Again, because of the scale of the data, we can't really store every single step of the data. What we do instead is store the raw data and the final version, and keep the URIs of the metadata for the intermediate steps. In that way the provenance information is kept, and we also save storage.
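The versioning approach just described, keeping only the raw input and the final product on disk while the intermediate steps survive as metadata references, could be modelled as a record like the following. All paths and URIs are hypothetical illustrations:

```python
# Sketch of the described versioning/provenance scheme: only raw and
# final versions occupy storage; intermediate steps are metadata URIs.
# Every path and URI here is a made-up placeholder.

provenance = {
    "raw": "/g/data/xx00/raw/obs_2015.nc",        # kept on disk
    "final": "/g/data/xx00/v2/obs_2015_qc.nc",    # kept on disk
    "intermediate": [                             # metadata references only
        "https://pid.example.org/dataset/step1-calibrated",
        "https://pid.example.org/dataset/step2-gridded",
    ],
}

def stored_files(record):
    """Only the raw and final versions actually consume storage."""
    return [record["raw"], record["final"]]
```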
14:03
The last point is data quality. As some users have said, we can't really assume that data is accessible and flawless. By publishing data alongside a quality report, we want to provide data access with a certain level of assurance. We also have a publication on this that will be in place very soon. Thank you for your attention; that's our experience so far with data access.
