Australia National Computational Infrastructure - Implementing a Data Quality Strategy to simplify access to data - March 2019
Formal Metadata

Number of parts: 19
License: CC Attribution 3.0 Unported: You may use, change and reproduce, distribute and make the work or content publicly available in unchanged or changed form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/42932 (DOI)
Content Metadata

Genre: Tech Talk, 18 / 19
Transcript: English (automatically generated)
00:00
Great. All right. Well, thank you, Ming and everyone for the opportunity to present here today. So the majority of my presentation is about a body of work that we investigated and wrote a paper on around two years ago, and it focuses more at the data level.
00:23
So a little deeper down, about how to work with data effectively and interoperably, and improve the quality of access and the experience for users. And then at the end of this talk, I'll talk about where we're at now and how we're trying to extend that model to some additional things. And I should also note right now that this has been a very big group effort
00:45
with many of my colleagues at the NCI. It's the brainchild of Ben Evans and Jingbo Wang, who's on maternity leave at the moment. She did a great job in getting our paper actually written and submitted, the one that Ming shared with this announcement today.
01:03
All right. So just a little bit of background about NCI for those of you who aren't as familiar. We're the National Computational Infrastructure here in Australia. A big part of our focus is traditional HPC usage, but we are also a data repository.
01:22
These slides are now about two years old, so at the time that was 10-plus petabytes, and it's even bigger at the moment. We also span a very big range of domains in terms of the data we host here. A very large part is climate, coasts, and oceans, but we also span into geophysics, astronomy, biodynamics, et cetera. So a lot of the approaches and best practices
01:46
that we try to share with our users really have to be broad enough to cover this full range of domains. And this kind of relates, I think, to one of the questions that came up in the earlier talk. So this is really at the forefront of our minds when we try to put together an approach.
02:03
And the big takeaway is that we want to maximize the usage and experience of our users and ensure that they really have a seamless way to get to the data. And that all comes back to this quality factor. And our users are looking to do things like
02:23
combine data and visualize it, and they really would like this to be as easy as possible. So again, this comes back to our motivating goals. We want value: it's a lot of work to manage data and metadata, as I'm sure everyone here has some experience with. And so you put all that effort
02:44
into that management, and you'd like at the end of the day to have that sort of positive feedback and result from all of that hard work. All right. So also just a little bit about where we're at. Our collections are accessed in several different ways. So again, this comes back to how we have to put
03:04
together an approach that works for traditional direct access on a file system, so local users at the repository. We have web and data services served through a lot of our cloud facilities and data portals, as well as virtual laboratories. So all of these things have to be considered when
03:22
we talk about and think about our approach for what our quality standards are. And then, as if it wasn't complicated enough, when we come back and look at this really wide range of domains that we host here in terms of the data, each domain has complicating factors that have to be considered. We have many gridded data sets,
03:44
but we also have a lot of non-gridded data, with different coordinate reference systems, projections, and resolutions. All of these things become a really important part of the picture when we start to put together some of these standards that we require. So now, moving on to what this
04:03
data quality strategy looks like. It's modeled off the data maturity approach from Shelley Stall at AGU. So this is what motivated the work for us. We took that and thought, okay, how can we apply that to the data that we host here? And
04:22
this is just a very simple schematic of the big components for us. We have an underlying high-performance data format that we have to consider. This might change, but we want to choose something that's flexible and robust enough to be ready for HPC use, but also for these other usages that I mentioned earlier, like cloud services
04:42
and portals and virtual labs. We have to work closely with the data custodians and providers of the data. So planning, deciding about how the data will be accessed, what's needed for,
05:00
you know, for that particular collection. Then we work through data quality control. We try to make sure we're consistent with any recognized community standards; that's the big one, and we make sure that we're compliant with it. And then for us, we want to make sure that anything we serve through our facility actually works across our platforms and tools and services.
05:22
So this is the big kind of final check for us. If we deliver data, we want to make sure that it's useful and, you know, not breaking. So I'll just speak a little to each of these points. So we started by looking at our climate, ocean, and weather data. It's a very large part of our collection. And so two years ago, these were some kind of rough numbers to give you a
05:43
sense of, you know, how many petabytes of data we're talking about in these domains. So we started there. They traditionally use a format called NetCDF, so that was our go-to: let's start there, that'll be the model that we begin with. So that's our underlying high-performance data format.
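To make the format layer concrete, here is a minimal sketch, assuming the netCDF4 Python library, of what a small CF-style NetCDF file looks like when written with the kinds of attributes that the checks discussed later rely on. The file name, variable names, and attribute values are illustrative only and are not taken from NCI's collections.

```python
# Minimal sketch (not NCI's actual pipeline): writing a small NetCDF file
# with CF-style attributes using the netCDF4 library. Names and values
# are illustrative only.
from netCDF4 import Dataset
import numpy as np

with Dataset("example_tas.nc", "w", format="NETCDF4") as ds:
    # Global attributes: the Conventions attribute is what downstream
    # checkers typically look for first.
    ds.Conventions = "CF-1.6"
    ds.title = "Example near-surface air temperature"

    # Dimensions and coordinate variables
    ds.createDimension("time", None)
    ds.createDimension("lat", 3)
    ds.createDimension("lon", 4)

    lat = ds.createVariable("lat", "f4", ("lat",))
    lat.units = "degrees_north"
    lat.standard_name = "latitude"
    lat[:] = [-10.0, 0.0, 10.0]

    lon = ds.createVariable("lon", "f4", ("lon",))
    lon.units = "degrees_east"
    lon.standard_name = "longitude"
    lon[:] = [110.0, 120.0, 130.0, 140.0]

    tas = ds.createVariable("tas", "f4", ("time", "lat", "lon"))
    tas.units = "K"
    tas.standard_name = "air_temperature"
    tas[0, :, :] = 288.0 + np.zeros((3, 4), dtype="f4")
```

The Conventions attribute and the units/standard_name pairs are the hooks that convention-based checks key off.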
06:05
This next slide is a pretty famous one from a few years ago; it's referred to as the National Environmental Research Data Interoperability Platform. The main thing, and I'm just going to simplify here, the takeaway is the levels of complexity that build up from the choices you make at, you
06:20
know, the underlying data format layer: what's required in terms of API layers that have to work with it, and likewise what types of conventions then have to work with the layers below. And then you get into services and tools. And finally, if you make appropriate choices on these bottom layers, the hope is that
06:41
the user communities, tools, services, portals, et cetera then fit with that model. So we try to put together checks that address these different levels on this figure. Working with the collaborators, with custodians and managers, of course, is a big part. And then we get to our compliance. And for the NetCDF collection,
07:09
there's already a community of very useful standards that exist out there. So we chose to jump on those and extend checkers that we can use in our
07:21
workflow. These consist of the Climate and Forecast (CF) conventions, which we use as our main check of the data contents within the file, and then some additional ones for more traditional metadata. And then of course, at the higher level of catalog collections and catalog records, we have the ISO standards. So you can see there are a few different
07:47
levels of standards to also consider when you get down into that actual data file.
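As a rough illustration of the kind of attribute-level checking involved, here is a simplified stand-in written with the netCDF4 library; the community CF and metadata checkers are far more thorough, and the required-attribute lists below are illustrative assumptions, not the actual rule sets.

```python
# Simplified stand-in for the kind of attribute-level checks that
# community CF/metadata checkers perform; the real tools go much further.
from netCDF4 import Dataset

REQUIRED_GLOBAL_ATTRS = ["Conventions", "title"]        # illustrative subset
REQUIRED_VARIABLE_ATTRS = ["units", "standard_name"]    # illustrative subset

def check_file(path):
    """Return a list of human-readable problems found in one NetCDF file."""
    problems = []
    with Dataset(path) as ds:
        for attr in REQUIRED_GLOBAL_ATTRS:
            if attr not in ds.ncattrs():
                problems.append(f"missing global attribute: {attr}")
        conventions = getattr(ds, "Conventions", "")
        if "CF" not in conventions:
            problems.append(f"Conventions does not reference CF: {conventions!r}")
        for name, var in ds.variables.items():
            if name in ds.dimensions:   # skip coordinate variables in this sketch
                continue
            for attr in REQUIRED_VARIABLE_ATTRS:
                if attr not in var.ncattrs():
                    problems.append(f"variable {name}: missing {attr}")
    return problems

if __name__ == "__main__":
    for problem in check_file("example_tas.nc"):
        print(problem)
```

In practice the published CF checkers cover far more rules (coordinate metadata, cell methods, controlled vocabularies) than this attribute scan.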
08:02
Then, when we did a bit of a survey of all our holdings at that time, before we started this body of work, we realized that there was a big spread. Even if we just focused on climate data using NetCDF, there was a big spread in the conventions used by the community. So even there, even though there were strong community standards, they weren't heavily adopted.
08:24
So this was another motivating factor for us to put that into our workflow and make it a best practice, to try and ensure that those publishing with us followed some of these standards, and also to try and get that column down as small as we could, which would be a big win. And I think I've spoken enough on these ones, so I'll move along.
08:46
Like I said, we took these standards and extended different checker scripts and routines that we could adopt at a repository scale. There were some modifications needed, which went into the work we did. And then reporting was something that
09:04
was really important for us, because we wanted to share the output of these checkers, or any of the quality information we were looking at. We wanted to summarize this in a way for our data providers so that it was easy to go and make the suggested changes.
09:21
But then also to be a reference, so that they saw the value in the improvements they would gain from putting in the work to make corrections. So we put a little bit of time into thinking about how the reporting would look for us.
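A sketch of how such checks might be rolled up into a provider-facing summary at repository scale is shown below; it reuses the hypothetical check_file() from the earlier sketch, and the module name and paths are assumptions for illustration.

```python
# Sketch of running checks over a whole collection and summarising the
# results for data providers; report format and paths are illustrative.
from pathlib import Path

from check_nc import check_file  # hypothetical module holding the earlier sketch

def report_collection(root):
    """Walk a directory tree of NetCDF files and print a simple summary."""
    checked = 0
    failures = {}
    for path in sorted(Path(root).rglob("*.nc")):
        problems = check_file(str(path))
        checked += 1
        if problems:
            failures[str(path)] = problems

    print(f"Checked {checked} files; {len(failures)} need attention.")
    for path, problems in failures.items():
        print(f"\n{path}")
        for problem in problems:
            print(f"  - {problem}")

# Example (hypothetical collection path):
# report_collection("/g/data/example_collection")
```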
09:41
The next slide shows one of the collections we worked closely with back at that time. And you can see from this chart, going a few months at a time, that we saw really quick improvements as we went through this process with the community. It was actually a geophysics community, not climate, so we tried to take what we could from the climate example
10:02
that was appropriate for them. And we saw this really nice improvement in quality in their case. And lastly, the part of this model I wanted to talk about is the assurance through demonstrated tests across all our tools and services. This part was really valuable
10:24
because it was testing that usability across everything that we hosted. So what did we do? We checked commonly used libraries, a lot of the core system-level libraries that are important across the climate and earth system domain fields. Then we were also testing
10:43
any of the services that we were sharing this data through, making sure that things like, at the time, Hyrax and GeoServer all worked with the format, along with validation with different programming tools, Python, MATLAB, et cetera. Those were really important for users,
11:00
and for visualization. So those are just a few of the big categories. We had a matrix of all these tests, and we wanted to make sure that anything we published worked with the appropriate tools for that collection. And so again, we put a little bit of work into how we would share that with our providers.
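The test matrix itself covered many more tools and services than can be shown here, but a minimal sketch of the pattern, assuming pytest and a couple of common Python readers, might look like this; the sample file path is an assumption, and service endpoints such as Hyrax or GeoServer would be exercised through their own URLs, which are omitted.

```python
# Sketch of a small functionality-test matrix in pytest: can the common
# Python tools at least open and read a published file? The real matrix
# covered far more tools and services; the path below is illustrative.
import pytest

SAMPLE_FILE = "example_tas.nc"  # hypothetical published file

def open_with_netcdf4(path):
    from netCDF4 import Dataset
    with Dataset(path) as ds:
        return list(ds.variables)

def open_with_xarray(path):
    import xarray as xr
    with xr.open_dataset(path) as ds:
        return list(ds.data_vars)

@pytest.mark.parametrize("opener", [open_with_netcdf4, open_with_xarray])
def test_tool_can_read_published_file(opener):
    variables = opener(SAMPLE_FILE)
    assert variables, "tool opened the file but found no variables"
```

Parametrizing over the openers keeps the matrix easy to extend as new tools or collections are added.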
11:23
And we learned a lot through this process too. Not only did it give a positive experience for our users, in that they could expect that, yes, our data will work with the range of tools, but it also really helped us. And the bonus that I'd just like to throw in here is that we have a lot of
11:41
feedback to provide to different local and international communities, because we did learn a lot. The functionality tests led to lots of reference and training material for our user community. As we put together the tests, it naturally led to the development of material that we could use and build into our library.
12:03
And of course, the real benefit is having something that is really interoperable and follows community standards where possible, which I think really comes back to this quality issue. So what's next, or what's been going on? We have a very large amount of data here.
12:22
Obviously, not everything is in NetCDF, so the goal is to also extend this to other formats that we host. At the moment, we're putting work into what the process would look like for GeoTIFF collections, and staying connected and working with international communities.
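For GeoTIFF that process is still being worked out, but one plausible starting point, analogous to the NetCDF attribute checks above, is a georeferencing sanity check with the rasterio library; the file name and the specific checks are assumptions for illustration, not NCI's agreed rule set.

```python
# A possible starting point for GeoTIFF checks: does the file carry the
# georeferencing and structure a downstream service would need?
import rasterio
from affine import Affine

def check_geotiff(path):
    problems = []
    with rasterio.open(path) as src:
        if src.crs is None:
            problems.append("no coordinate reference system defined")
        if src.transform == Affine.identity():
            problems.append("no geotransform (pixel-to-world mapping) defined")
        if src.nodata is None:
            problems.append("no nodata value set; masking may be ambiguous")
        if not src.overviews(1):
            problems.append("no overviews; large rasters may be slow to serve")
    return problems

if __name__ == "__main__":
    for problem in check_geotiff("example_scene.tif"):  # hypothetical file
        print(problem)
```

Service-level checks, for example that a tiled web service renders the raster correctly, would then sit on top of this, mirroring the NetCDF tool matrix.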
12:41
And then a big one right now is extending this framework to look at that higher level of data management plans and metadata catalog records. So how can we start to look at that information for all of our collections and take a similar approach? And I think that's my last slide. So thanks so much, and if there are any questions later when there's time,
13:02
I'm happy to answer them.