II. Improving the management of experimental data: Managing research data for diverse scientific experiments
Formal Metadata
Title |
Series title |
Number of parts | 15
Author |
License | CC Attribution 3.0 Germany: You may use and modify the work or content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided you credit the author/rights holder in the manner specified by them.
Identifiers | 10.5446/46312 (DOI)
Publisher |
Year of publication |
Language |
Content Metadata
Subject area |
Genre |
Abstract |
Keywords |
Transcript: English (automatically generated)
00:14
Yes, well, my talk comes straight after John's, and I will talk about Research Councils UK.
00:21
I'm from one of the seven UK Research Councils. This is the pointer. Yes, it's called the Science and Technology Facilities Council, one of the seven Research Councils of the UK. "Facilities" is the key word here: we are the one operating and providing services around large instrumentation, such as large high-power lasers,
00:45
neutrons, synchrotrons, well, synchrotrons not exactly, let's not get into that territory, but there are significant things there that are beyond me, basically. However, all these big facilities are physically
01:00
all located in one place, one very big physical site called the Rutherford Appleton Laboratory. This is where I'm coming from. I'm employed by the Science and Technology Facilities Council, STFC. So this is what it looks like; I'll explain a bit more later. Now, the title of this talk is very important, and actually quite nicely connected with John's talk,
01:22
because if you look at the title, it's about data management. Now, what do we mean by data? If you look at the whole pipeline of the data generation process, or lifecycle, we are upstream, generating the data and other information relating to the data. So, as the upstream generator of the data,
01:42
or as those who help people generate large amounts of data, we are interested not only in the experimental data, but also in the data generated downstream, down the pipeline, in data processing, and perhaps all the way to the computational aspect, computational data. So there is a whole range of data encapsulated
02:00
by this one word, "data". It is also a diverse range of data. This partly explains the current situation: there is a whole diverse range of data available at the Rutherford Appleton Laboratory.
02:22
First of all, the site operates lasers, neutrons, and a synchrotron, as well as data centres: we host co-located data centres for other councils in the UK, such as the Natural Environment Research Council. So we have a large range of data, astronomy data and particle physics data,
02:41
all actually co-located in one place. So we have a huge range of data. But in this talk, particularly for this audience, I'm going to focus on the data that directly relates to neutrons and synchrotrons. I myself am from the Scientific Computing Department, and I've got two hats in the department, in the laboratory.
03:01
One is in the data services division under this department. I'm also the National Laboratories services liaison officer. The way STFC is organised is that it has a directorate, National Laboratories, which is a kind of umbrella that has ISIS,
03:20
the lasers, scientific computing, technology, et cetera. So we are under the same directorate structure. Now, this is the site, the physical look of the site. It's a bit dated, because our department is actually not shown there, but it doesn't matter. So this is where we sit now.
03:41
It's a very modern building, et cetera. This is RAL Space, space science. This is the particle physics department here, running the Tier 1 centre and doing a lot of work related to experiments at CERN, collaborating with scientists elsewhere in the particle physics community as well. Now this one is Diamond, the light source,
04:01
ISIS, with TS1, Target Station 1, about here, and Target Station 2 here. And this is the Central Laser Facility. Now, one distinctive thing about RAL is that it's perhaps one of the very few sites in the world that has all these different facilities in one single place. And this is perhaps the reason that,
04:20
well, there are many reasons. One of the important ones is that the physical location allows all this to happen: the facilities are not just physically in one place, but we also have the opportunity for very convenient interactions,
04:41
interactions between people, that allow us to build things. It doesn't happen automatically or very quickly, but over the last 10 years we have built a common infrastructure serving all these different communities. Here I highlight Diamond Light Source, with 30-ish beamlines, and ISIS, TS1 and TS2, with 40-plus-ish. Some of them are clearly not in full operation.
05:03
Some are phase three, being planned and constructed, and similarly here. So let's not get into the detail. But all of them are sharing this infrastructure, which, as I said, we have built up over the last 10 years or so. Now, once upon a time, long in the past,
05:20
everybody's familiar with this, right? Even today, we're still seeing portable devices being used. Well, today's portable devices are a lot more powerful than those ones, right? 4P this and all that. Those were talking about megabytes, however; these are talking about terabytes of capacity, hard drives. People do tomography imaging research.
05:42
They do need to bring these and take their data away, for example. Now, all these things, emails, disks, and simple webpage dissemination, worked very well even up to five or six years ago, serving the community quite well. So this was all we needed.
06:01
And of course, there are lots of issues with those devices over time, because even today, even this year, there are still people coming back to us to ask for data. They have the data, but they can't use it because it's stuck on those devices. They don't have the means to get access to the data. So there are lots of issues about individual users
06:21
or researchers maintaining and managing their data over time. The time issue is the key point here. They don't have the means or the expertise to look after their data over time. These are just very recent emails that have come back to us. As I said, this worked very well. People seemed quite happy for a long time.
06:43
ISIS has been in operation since the 70s, well, in various stages, of course. Very well. However, going back a bit, to the structure of this talk: there are three parts. This is the first part, where I give you an overview of the data infrastructure, the context, the setting.
07:01
The second part is the core: the data catalogue, as well as other metadata, which is of a lot more interest to this group. And the last part will be how people access their data, as well as open access and data access issues. Now, looking into the data infrastructure, we can ask what things happened in the last 10 years
07:21
that have influenced where we are today. There are lots of things. For example, there were significant investments around 2003, with the UK e-Science programme, in terms of infrastructure and storage investment. Some of you may remember SRB and those kinds of things.
07:41
There was also the push from data-intensive science and discovery. All these things were coupled with the on-site developments at the facilities, Diamond and ISIS. One thing people outside this community may not really be aware of is the continuous development and upgrading of the facilities.
08:02
What that means on the ground, for a computing department like ours, is that we're not just seeing a rapid increase, particularly in recent years, in the data rate and the data volume, which have increased dramatically, but also in the diversity and the complexity of the data. Now, the complexity of the data is the one thing
08:22
that perhaps most influences what is happening here, on top of the volumes and the rates. Now, this is where we are now. You can argue about which bits are actually happening in which facilities, and whether they have all of them;
08:40
we can get into a detailed discussion of that. But roughly speaking, it all starts with the data monitoring, or the data acquisition and control systems, in the facilities. This is just Target Station 1 and other facilities here. This is just monitoring the load on the data acquisition systems, to see how heavily loaded they are, et cetera. And there are data replication and synchronisation systems
09:02
across copies within the facility, to make the data immediately available to the scientists conducting the experiment at the time, so people can access it over wider networks and so on. And there is a network monitoring system from ISIS all the way to the data archive, to the data storage elsewhere on the lab, on the physical site, in our department.
09:22
And then, most important to this audience, there is the data cataloguing. The important thing about data cataloguing and data storage is that if you take the time factor into consideration, it makes things a lot more complicated. And this is where the majority of the cost,
09:40
the development cost, the maintenance cost, and the complexity come in: when you keep the data for a longer time, things can get costly as well as complicated. And of course, at the back there are robotic tape systems continuously operating, serving all these different facilities, Diamond, ISIS, CLF, as well as other data storage.
10:03
This is where we are now. We're talking about data management, and there is infrastructure behind the scenes. Now I'm getting into the tools we are working on or have been using today. The key thing here, as in Simon Coates' presentation, is
10:21
the understanding of this whole pipeline, from the proposal to scheduling, registration, sample capture, safety and environmental information, and all that information capture. All these things happen well before people come to do the experiment, physically on site for three or four days.
10:41
So all those things are captured by various systems, the user office system, business systems, the sample safety system, et cetera, and all this information is captured over time and pushed into this central metadata catalogue system. It's called ICAT. At the back is the core schema,
11:02
which I will talk about later. Now, the key thing about this is that, as the facility operator, our utmost focus is really making sure the experiment takes place, takes place in an efficient manner, and that people can conduct their science in a productive way.
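The flow described a moment ago, in which separate systems each capture part of the pre-experiment information and push it into one central catalogue, can be sketched roughly as follows. This is a minimal illustration only: the system names, field names, and the use of a proposal identifier as the join key are assumptions made for the example, not the actual ICAT ingest mechanism.

```python
from collections import defaultdict

# Hypothetical records emitted by separate facility systems before an experiment.
# All identifiers and field names here are illustrative assumptions.
user_office = [{"proposal": "P-001", "title": "Example study", "investigators": ["A. Example"]}]
scheduling = [{"proposal": "P-001", "instrument": "INSTRUMENT-A", "start": "2013-03-04"}]
sample_safety = [{"proposal": "P-001", "sample": "sample-1", "hazard": "none"}]


def build_catalogue(*sources):
    """Merge per-system records into one catalogue entry per proposal."""
    catalogue = defaultdict(dict)
    for source in sources:
        for record in source:
            entry = catalogue[record["proposal"]]
            # Each system contributes its own fields; the proposal id is the join key.
            entry.update({k: v for k, v in record.items() if k != "proposal"})
    return dict(catalogue)


catalogue = build_catalogue(user_office, scheduling, sample_safety)
print(sorted(catalogue["P-001"]))
```

The point of the sketch is only that the catalogue entry accumulates over time from several sources, well before any experimental data exists.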
11:21
There are, however, important steps between experiment and publication. And these clearly depend on the beamline, on the facility, and on the type of research or science you do, and they take varying amounts of time. Sometimes it can take a year or two to go from here to here.
11:42
So in recent years we have been trying to expand our range of coverage in terms of what we capture into this ICAT database, to support this downstream pipeline of processing, reduction, and analysis, and, more recently,
12:00
even the computational aspect. However, the reality is that the real world is messy. The data is messy and the workflow is very complex. This is some early work we did. As John mentioned, it's not just about the raw data: there is raw data, derived data, and the resulting data that ends up in the paper.
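The chain from raw data through derived data to the data that ends up in a paper can be pictured as a set of "derived from" links. The sketch below is a hedged illustration of tracing such links backwards from a published result; the item names and the dictionary representation are invented for the example and do not describe how any facility actually records provenance.

```python
# Each item records what it was derived from. All identifiers are made up.
derived_from = {
    "paper_figure_3": ["fitted_curve"],
    "fitted_curve": ["reduced_run_42"],
    "reduced_run_42": ["raw_run_42", "calibration_7"],
}


def trace_back(item, links):
    """Return every upstream item reachable from `item`, nearest first."""
    seen = []
    queue = list(links.get(item, []))
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            # Follow the chain further upstream, towards the raw data.
            queue.extend(links.get(current, []))
    return seen


lineage = trace_back("paper_figure_3", derived_from)
print(lineage)
```

The traversal itself is simple; the hard part is recording the links in the first place when the analysis happens outside the facility.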
12:21
Now, people may be interested in this data because they find a paper and want to trace back. However, capturing all this provenance is really a challenge. From a facility provider's perspective, things are happening outside the lab. Traditionally, people take the data away, do the analysis, and end up with a paper.
12:42
If they're nice, or if they come back to apply for more beam time, they will need to tell you about the publication, and then you know what has come out at the other end. But traditionally that has not been the case, so it's quite difficult to capture. We had some early projects, actually funded by Simon's Managing Research
13:01
Data programme, trying to capture those things. But even that captured only a fraction of this. It can be very complicated. In reality, there is a whole range of issues in capturing this provenance, such as which bits of the research are trial and error, particularly in this kind of experimental data analysis. Lots of the data you generate is actually not relevant
13:22
or not very useful, or at the time you think it's not relevant, but at a later stage the parameters or whatever you used turn out to be useful. So there is really a trade-off between what you capture and what you don't. There are also software versions and all sorts of things. And the people element, I think, is very important.
13:41
People are now more and more doing their analysis collaboratively. Different parts of the analysis are done by different people, who may be in different places. So collaborative data analysis is becoming more and more common, even in small teams, as we've seen. Now, another thing: this particular one is,
14:00
well, about a leading academic in the UK describing their processing pipeline, from gathering the data to doing the analysis at Diamond. This is tomography image processing using I12 and I13 at Diamond. It's so small on the slide, but the pipeline is that this part
14:21
is doing the experiment and the initial 3D image reconstruction at Diamond, using Diamond's facilities. However, down the pipeline there is a whole range of other things happening, because tomography imaging in particular generates so much data and the data pipeline is quite complex. You can argue this is a more advanced
14:40
processing pipeline by an advanced academic. However, it's very difficult even for leading academics to go through this pipeline, not to mention the expertise and knowledge required. So lots of the things happening here are already making a difference in terms of how we manage our data.
15:02
This one is a very early pilot study, trying to understand what's involved in this kind of large-scale data analysis pipeline. This one is particularly to do with imaging, but there are other things happening and other conversations we are having
15:20
with different audiences and different communities, about different aspects of post-experiment processing. The central thing is that in order to drive this, what we call computational data management, the infrastructure underpinning that helps to accelerate the research after the experiment,
15:40
you do need a lot of infrastructure support. What that means is that the starting point is data availability from the data catalogue, what kind of metadata you can obtain from the ICAT database, together with a whole range of other services: fast, integrated access to your data from your cluster, and from different types of cluster
16:02
directly connected to our data, as well as visualisation capability that allows you to disseminate or better understand your data. So a large, complex set of issues involving infrastructure, software, and people is part of this whole process. Now this raises issues,
16:20
and as we generate a lot of data, if we are still thinking about the traditional model of moving data around in order to use it, will that still be the right or suitable way for this type of science? There are a lot of question marks here, all sorts of things we are thinking about, and this is why this little pilot study
16:42
is going on, trying to understand all the things involved. Now, as I said, there is also data reduction, normalisation, and analysis happening in the lab. This particular one is with ISIS. The next talk is about HDF and NeXus and so on.
17:00
Now most of the facilities, in particular ISIS and Diamond, are already, well, I think Diamond is perhaps more advanced, because ISIS has legacy instruments, but gradually they are all moving to NeXus. These two data formats are the predominant ones.
17:21
However, in moving from the data they get from the facilities up to the actual insights they want from that data, there are a lot of steps they have to go through. I assume that most of you here have manipulated these files directly yourselves.
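Manipulating such files by hand usually means walking a hierarchy of groups and datasets to find the few values of interest. The sketch below imitates that step with a plain nested dictionary standing in for an HDF5-style tree, so that it runs without any HDF5 library; the group names and values are illustrative assumptions and are not taken from a real NeXus file.

```python
# A nested dict standing in for the group/dataset hierarchy of an HDF5-based file.
# Group and field names are illustrative, not from a real instrument file.
nexus_like = {
    "entry": {
        "title": "test run",
        "instrument": {"name": "example", "detector": {"counts": [3, 5, 8]}},
        "sample": {"temperature": 290.0},
    }
}


def list_datasets(node, prefix=""):
    """Yield (path, value) for every leaf below `node`."""
    for key, value in node.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            # A dict plays the role of a group: descend into it.
            yield from list_datasets(value, path)
        else:
            # Anything else plays the role of a dataset.
            yield path, value


datasets = dict(list_datasets(nexus_like))
print(datasets["/entry/instrument/detector/counts"])
```

With a real file the same walk would be done through an HDF5 library, and the user would still need to know which paths matter for their technique.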
17:42
If you have ever done that, you will find it is quite a tedious process. There are a lot of reasons why it can be done better, basically. This demonstrates exactly that: why, how, and what can be done to make it better. So the aim is to help scientists move straight from this kind of format, without drilling into the files,
18:02
and go straight to this, basically by providing all these different computing technologies in the middle to allow these things to happen. For example, we're making this HDF data, as well as NeXus data, available straight away to view, to plot, and to interact with, whatever the structure or format,
18:22
allowing users to interact with the data as well as to take it away directly, because users come in different shapes and sizes. Some are novice users, some are very experienced users. So some of them really want to take, for example, an Excel file straight away with the bits they need, and then go away to do the analysis
18:41
and continue with the simulation or whatever. So we are trying to cater for this whole range of users and user capabilities, providing all these different tools to allow them to do these kinds of things a lot more effectively than they currently do. Now, the data catalogue and tools. As a facility, we work a lot
19:01
with the European facilities, the list of facilities here. We have a big project called PaNdata ODI, for photon and neutron sources; you can see the long title here. Basically it's a collaboration between us, Diamond, ISIS, and the other synchrotron and neutron facilities around Europe.
19:21
The key thing about this project is that we are building a federated data catalogue across the facilities. It looks into the policy aspects and the file format aspects, for example standardisation of the NeXus format across these facilities. We're also looking at many other things: data provenance and
19:41
tracking of data across facilities, especially for people who use multiple facilities for their experiments. We already have evidence for this from analysing anonymised user office records: about 11% of the people covered by all these communities,
20:03
all these facilities, are using more than one facility. So there is concrete evidence on the table that more and more users are using more than one facility, more than one experimental technique, in their research. The key thing I want to emphasise here is ICAT.
20:21
ICAT is, in essence, at its core a database. It's a very large database with lots of tables, and these are the core entities. There are two things I want to highlight. One is that it is increasingly being extended to cover not just experimental data, but also the downstream,
20:42
so we are extending not just the database schema or structure to accommodate that, but also integrating ICAT, and the services and tools around ICAT, with the HPC processing and data processing in the facilities. And another important aspect is that
21:02
it is a database, it's about cataloguing, but it's also about continuous access to the data. One very important thing here is the investigator; I do want to highlight this. By the way, ICAT is an open source project, and several facilities are contributing to its development.
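To make the idea of core catalogue entities, and the role of the investigator in continued access, a little more concrete, here is a deliberately simplified sketch. The class and field names are assumptions chosen for illustration; the real ICAT schema has many more entities and relationships and is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Datafile:
    name: str
    size_bytes: int


@dataclass
class Dataset:
    name: str
    files: list = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    investigators: list = field(default_factory=list)
    datasets: list = field(default_factory=list)

    def total_size(self):
        """Sum the sizes of all files in all datasets of this investigation."""
        return sum(f.size_bytes for d in self.datasets for f in d.files)

    def can_access(self, user):
        """Simplified rule: only recorded investigators may access the data."""
        return user in self.investigators


raw = Dataset("raw", [Datafile("run_1.nxs", 2000), Datafile("run_2.nxs", 3000)])
inv = Investigation("Example experiment", investigators=["a.example"], datasets=[raw])
print(inv.total_size(), inv.can_access("a.example"), inv.can_access("someone.else"))
```

Keeping the investigator list with the investigation is what lets the catalogue answer, years later, who is entitled to retrieve a given dataset.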
21:21
One very important thing in this longer-term access to data is that you have to keep the user information in here to allow it. This is a NeXus application profile, if some of you are
21:40
into the area. So in the PaNdata ODI project, this is a work package led by DESY. They have three profiles: one for tomography, one for small-angle scattering, one for powder diffraction. So they follow the NeXus application profile approach to derive three NeXus formats for DESY.
22:01
The important differentiator, if you look at the NeXus application profiles, is that the user identities are being incorporated into these NeXus files, which is very important. There is a suite of tools associated with ICAT, the cataloguing tools.
22:21
There are lots of things here; you can get into the detail later. We have also done a lot of work on standardising the terminology, the controlled vocabularies, across the facilities. This particular one, in the project, is trying to standardise the descriptions, the controlled vocabulary, for facilities,
22:41
instruments, and techniques, and the relationships between them. There are lots of applications for this. The final thing is open data access. Now, the process goes from left to right. It starts with the paper. In the paper you've got the DOI; you can go to DataCite, managed by the British Library, and then you get to the landing page of the experimental data,
23:01
so it points to that landing page at STFC, and then you get to TopCAT, which is the web portal interface to ICAT, and then you need to log on. I will say a bit more about that. Then you get to the actual data itself, and you can click download, et cetera. This is the current ISIS way of managing open access to their data.
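The path just described, from the paper through the DOI and landing page to the catalogue portal, the login, and finally the download, can be written down as an ordered sequence with one gate. The sketch below is only a schematic of that sequence; the step names are labels invented for the example and no real service is contacted.

```python
# Ordered steps from publication to data, as described in the talk.
# The step names are illustrative labels, not real service endpoints.
STEPS = ["paper", "doi", "landing_page", "catalogue_portal", "login", "data", "download"]


def walk(registered):
    """Return the steps a reader completes; an unregistered reader stops at login."""
    reached = []
    for step in STEPS:
        reached.append(step)
        if step == "login" and not registered:
            # Without a registered account the reader cannot get past the login.
            break
    return reached


print(walk(registered=True)[-1])
print(walk(registered=False)[-1])
```

Everything up to the login is publicly resolvable; the data itself sits behind the registration step.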
23:22
Now, one very important thing to say is that ISIS does do open access, but open access is not equivalent to unrestricted access. That means you need to register. That's why the login dialogue is there. And there is a three-year embargo period, and there's a 10-year commitment
23:41
from the facilities. There are lots of implications; I'm sure you will have more questions later. Let's look at what we are looking at now. As I mentioned, there are lots of things happening in the facilities. In our view this will have significant implications for the way we manage our data, for where we locate the data, and for the connection with the computation,
24:02
the processing, the downstream processing. These are some of the things along those lines. I won't get into the details of all of them here. And a whole range of people have contributed, directly or indirectly, to my understanding of the topic,
24:21
as the subject itself has been in development at RAL over the last 10 years. Some people have left; more people are coming. So, thank you to this audience.