Supporting Data Management Workflows at STFC
Formal Metadata
Number of Parts: 22
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/46335
Transcript: English (auto-generated)
00:00
So, thank you, and good morning everybody. I'm Brian Matthews, from the Scientific Computing Department. I'm not a crystallographer, and in fact we're kind of two steps removed from the actual crystallography and the experimental teams, so I'm not going to talk much about it; I don't think I mention crystallography once in this talk.
00:26
Also, some of you may have seen some of this before, because my colleague Erica Yang presented at previous meetings of this group and some of the slides are similar, but I thought I would repeat some of it because I know there are new people here.
00:42
So, what am I going to talk about? I'm going to talk about a variety of things: what we are doing now and what we want to do, in two different aspects, particularly supporting user workflows, and sharing and publishing data. And finally I'll come back to metadata; this is a
01:01
workshop about metadata, so I thought I would offer some comments about approaches to metadata and some things you might want to do there. So, what do we do now? Raw data management. My job is in the Scientific Computing Department at STFC; we have a data centre on the laboratory campus, which is at the top there, our building,
01:21
with the machine room, and amongst many other things we do, we support the three STFC-funded facilities on the RAL campus: the ISIS neutron and muon source, the Diamond synchrotron light source, and the Central Laser Facility, and what we
01:41
do for them is provide data archiving and management tools, so data storing and archiving, and a variety of management tools built around a common core set of tools and common core expertise. What we do for each of them is actually slightly different; each facility
02:01
has different requirements. For ISIS we provide tools that support the data workflows much more intimately, right through their scientific life cycle, from proposal to experimental collection and beyond, so there is much richer metadata there, from our
02:21
point of view at least, and we also provide data archiving in our building. For the Diamond synchrotron our role is rather more limited, much more to do with data archiving, so we actually store much less metadata; it doesn't mean Diamond doesn't collect a lot of metadata, but we
02:41
actually store less. The interesting or challenging part, from our point of view, of what Diamond brings is real challenges in scale and how to scale up the archive: they have data rates and data sizes which are significantly more challenging, much more so than the ISIS neutron source, so different
03:04
features of our infrastructure are tested by the two different big facilities. We also do some work for the Central Laser Facility, a smaller facility, which is much more to do with real-time data management and feedback to users, with rich metadata on laser configuration.
03:22
I won't talk about the laser facility any more than that; I'll concentrate on ISIS and Diamond, the ones of most interest here. So this is a picture of the kind of thing we do, the infrastructure we supply for the Diamond archive:
03:42
data is streamed, or rather passed over, from the data collection systems, the Lustre file systems that Diamond run, through a system called StorageD, which is a storage daemon, essentially to store data onto tape
04:01
but also to capture metadata as we go along, into the system we call ICAT. A few people have mentioned ICAT already, and I'll talk a lot about it during this talk. And then we provide tools for data access through a web front end and through some downloader systems and data browsers.
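To make that metadata capture step concrete, here is a minimal sketch in Python (hypothetical function and field names, not the actual StorageD or ICAT code) of the kind of per-file metadata an ingest step might derive automatically and register against a dataset record in the catalogue.

```python
# Minimal sketch (not STFC's actual ingest code): the kind of metadata a
# storage daemon might capture for each file as it is archived to tape,
# before registering it against a dataset record in a catalogue like ICAT.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def capture_datafile_metadata(path: str, dataset_id: int) -> dict:
    """Collect basic, automatically derivable metadata for one archived file."""
    p = Path(path)
    stat = p.stat()
    sha256 = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    return {
        "datasetId": dataset_id,          # links the file to its parent dataset
        "name": p.name,
        "location": str(p),               # where it sits in the archive namespace
        "fileSize": stat.st_size,
        "checksum": sha256.hexdigest(),
        "datafileModTime": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
    }

# Example: record = capture_datafile_metadata("/dls/i04/data/scan_0001.nxs", 42)
```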
04:20
Now, I mentioned that the challenges we're getting with Diamond archiving are to do with scale rather than complexity of metadata, from our point of view. As of July we're currently at about 3.3 petabytes of data in total, and it keeps creeping up; we've captured all the data that Diamond has produced so far, and in a vast
04:41
number of files. Diamond is very file-rich, so it's not just the total volume, it's also the sheer number of items: 846 million files, and this is increasing enormously; six months before this it was two thirds the size, so it has grown by half as much again in six months,
05:01
so a real challenge for us, and really fast rates, going through 12,000 files per minute to catalogue them all, and even that struggles to keep up with the rates we're getting. So it's a real challenge in scale on the system, which really pushes us at the moment. The graph at the bottom shows how the data rates are increasing, so it raises
05:24
questions for Diamond about how they will sustain this; there are cost implications for all this. But this is what we're doing at the moment, and I think there's a talk on Diamond next so I won't talk much more about that. So,
05:42
to go back to how we do this in general, we have to build these data pipelines for managing this data; as I've said, different situations for different facilities, but a rather common set of tools. So, as several people have already mentioned, we've been building a system over several years now, which we call ICAT, which is essentially
06:03
a catalogue of experimental data; it's a metadata catalogue which captures information about experiments, and this has developed over the last few years into a set of tools, so it's not just one single catalogue, it's become a much wider set of tools which we use. And these tools are
06:22
buried quite deeply in our infrastructure, so they're not seen as tools which just sit on top of your data and present things to a user; they are quite deeply embedded into the information and data management systems,
06:40
doing as much automatic metadata capture as possible, particularly from proposal systems, from tracking of files, and from data acquisition systems on instruments, so we don't have to depend too much on the user to provide metadata; it is streamed off where we can. And to do that
07:02
it's integrated into the user office systems and into the data acquisition systems, so it has that feature of being very deeply embedded. And then that metadata can itself be used during subsequent processing to control the way things are subsequently accessed and used, so it has this notion
07:22
of active metadata, which Simon mentioned earlier, or metadata as middleware, where it isn't just a user-facing thing, it's part of the information system itself. We do provide a front end to the user as well, a separate tool which works with ICAT,
07:41
called TopCAT, but we can also integrate it into other tools through APIs, particularly data analysis frameworks like Mantid for neutrons and DAWN for X-rays. So this is kind of a schematic which we'll use for a while.
08:01
So ICAT is, well, essentially a database really, but it is designed to provide flexible data searching and to capture information right through the pipeline of a process, from when proposals are submitted and accepted, through
08:21
scheduling information, through the experimental setup, data acquisition, collection and storage, and then also capturing information for subsequent analysis, derived data and finally publications. Some lines are solid, some are dotted: the solid lines are where things are
08:44
heavily automated, the dotted lines are where things are less automated. As I mentioned, the metadata and data analysis tooling is meant to be scalable and extensible, and hopefully scale up to manage the data
09:01
rates we're having to deal with; we can access high performance resources through it, so HPC systems; it can link to other outputs, the sort of thing that Suzanne has mentioned, linking to other research outputs; and it is policy-aware, which is something I'll mention a little bit later as well. The core of it is a metadata model which we call the CSMD, the Core Scientific Metadata Model,
09:26
which is a relatively simple model of scientific experiments, and very general as well. It's centred around this notion of an investigation, which is our notion of an experiment, particularly an experiment on a facility; it has quite a close relationship with the proposal,
09:44
but it can also have sub-experiments as well. An investigation is associated with investigators, with people and teams, with instruments, with the context in which the experiment was done, so the proposal itself, and with
10:04
authorisation conditions. And then when you come through the experiment it's associated with samples, experimental settings, and then datasets and datafiles, so you can link all these things together. Then there are all these very general things on the right-hand side called parameters. So this is
10:24
a very structural way of looking at metadata at quite a high level, very general, and all the very domain-specific parts are kind of hidden in these datasets, in these parameter fields, which are, from our point of view, very free, so we can capture any kind of domain-specific metadata in them.
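As a rough illustration only, the CSMD-style structure just described might be rendered like this (simplified field names, not the exact ICAT schema): an investigation linked to investigators, an instrument, samples, datasets and datafiles, with free-form parameters hanging off several of the entities.

```python
# A loose, illustrative rendering of the CSMD-style entities described above
# (field names are simplified; this is not the exact ICAT schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Parameter:              # free-form, domain-specific name/value pairs
    name: str
    value: str
    units: str = ""

@dataclass
class Datafile:
    name: str
    location: str
    parameters: List[Parameter] = field(default_factory=list)

@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)

@dataclass
class Sample:
    name: str
    parameters: List[Parameter] = field(default_factory=list)

@dataclass
class Investigation:          # CSMD's central notion: an experiment on a facility
    title: str
    proposal: str             # close relationship with the proposal
    instrument: str
    investigators: List[str] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)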
10:44
So one thing we might want to do is add more detail in that area; some of the work that Andy Götz was talking about yesterday, on capturing parameter sets, would help give us much more detail there, so from our point of view it's a very interesting area. Those are some URLs for where to find more information about the model. As I mentioned, ICAT is now a
11:06
tool suite, so there's a whole bunch of tools associated with it. The ICAT core at the bottom is essentially a database, an Oracle or MySQL database, and then there's a series of APIs, client-server APIs,
11:21
through which you can do different authorisations; everything's quite pluggable and flexible, so you can plug in different authorisation plugins. There's a separate web interface, which is a separate tool we've been changing a lot recently, so there's a new version coming out very soon.
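As an example of what that client-server API looks like in practice, here is a minimal sketch using the python-icat client library; the server URL, authenticator name and credentials are placeholders, and the exact calls and query syntax can differ between ICAT versions, so treat it as indicative rather than definitive.

```python
# Sketch of querying an ICAT server through its client/server API using the
# python-icat library (URL, authenticator plugin name and credentials are
# placeholders; check your deployment for the real values).
import icat

client = icat.Client("https://icat.example.org/ICATService/ICAT?wsdl")
client.login("db", {"username": "reader", "password": "secret"})

# Search the catalogue with ICAT's JPQL-like query language.
investigations = client.search(
    "SELECT i FROM Investigation i WHERE i.title LIKE '%diffraction%'"
)
for inv in investigations:
    print(inv.name, inv.title)

client.logout()
```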
11:40
Then there are other tools on the left-hand side which we've been building: the ICAT Job Portal, which I'll talk about later, for accessing HPC and applications, and IDS, the ICAT Data Service, a tool we've added very recently which separates out the whole data handling aspect, a crucial adjunct to the whole metadata system. And on the right-hand side
12:01
there are tools contributed by other people, not done by the core team but contributed from quite a few places: ICAT Manager, ICAT Administrator, and so on. And then it links through to analysis systems like DAWN and Mantid. I mentioned the ICAT Data Service, which is a fairly recent tool; this has been done in response to the
12:21
data scaling problem: we've separated out the whole data ingest and access component from ICAT, and it uses the same authentication and authorisation models ICAT uses for the data, so we can provide scalable services for this.
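As an indicative sketch of what retrieving data through the IDS looks like over plain HTTP (the deployment URL, session id and parameter names here are illustrative; consult the IDS documentation for the exact REST interface):

```python
# Sketch of pulling data back out through the ICAT Data Service (IDS) REST
# interface over HTTP (URL, session id and parameter names are illustrative).
import requests

IDS_URL = "https://ids.example.org/ids"   # placeholder deployment URL
session_id = "00000000-0000-0000-0000-000000000000"  # from an ICAT login

response = requests.get(
    f"{IDS_URL}/getData",
    params={"sessionId": session_id, "datafileIds": "1001,1002", "zip": "true"},
    stream=True,
)
response.raise_for_status()
with open("datafiles.zip", "wb") as out:
    for chunk in response.iter_content(chunk_size=1 << 20):
        out.write(chunk)
```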
12:42
One thing we've particularly added to this is data transfer services as well, so it's not just uploading and downloading data, but also packaging data up onto storage media so it can be transferred elsewhere, which is a real problem for us. We can provide HTTP transfer, but we can also provide more
13:02
performant protocols like Globus Online, to help with these very large data transfer problems we've been getting. So ICAT is now an international collaboration, an open source project; it's in production use on our campus, but also now internationally. Andy mentioned the work at the ESRF;
13:23
it's also being used at Oak Ridge, and various other places are looking at it as well and actively contributing. Andy Götz is now head of the steering committee of the open source project. Okay, so where do we want to go from here?
13:45
Two things really. The first is supporting workflows in a richer manner. This is just a life-cycle view of what we do, from proposal to publication. Traditionally facilities do the first part of this; they are pretty good at managing the left-hand side, the early stages,
14:03
but don't provide a great deal of support for the data analysis steps, which are usually left to the user; the user takes the data home and does what they like with it. However, this is kind of key for providing user insight, and this is where things are starting to break down somewhat. So this data analysis
14:23
challenge is becoming much, much greater. We're finding that we have to manage very wide areas of science, but also hugely varying levels of expertise, from real hackers and co-developers right through to people who just want to be given a final result, so people need a lot more help.
14:43
Data is getting so much bigger that it becomes very hard to move and very hard to store at user institutions. The software requirements are becoming much more complex, and the analysis much more compute-intensive as well, again with very
15:01
variable capabilities at user institutions, and the whole tracking problem falls on the users, as I already mentioned. So we have serious problems in the data analysis area, and this is an area facilities are beginning to focus on. So we've been looking at how we might provide more support for data analysis,
15:22
for these parts of the workflows. We've been modifying ICAT to handle the provenance information of derived jobs, derived data, software, and access to HPC, and we've been putting that into a variety of tools, these three tools. Mantid I've already mentioned; Mantid will
15:42
launch jobs and give results, and that is also all recorded by ICAT. The ICAT Job Portal is another system, again launching jobs mediated by ICAT, onto your HPC system of choice.
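For illustration, the kind of provenance record captured around such a job might look roughly like this (a hypothetical structure, not the actual ICAT Job Portal schema; names like SCARF and the version strings are just examples):

```python
# Illustrative sketch of the provenance that gets captured when an analysis
# job is launched against catalogued data: which software, which inputs,
# which outputs (not the actual ICAT Job Portal schema).
from datetime import datetime, timezone

job_provenance = {
    "job": {
        "application": "Mantid",          # analysis framework used
        "version": "3.4",                 # hypothetical version string
        "submittedBy": "user123",
        "submitted": datetime.now(timezone.utc).isoformat(),
        "computeResource": "SCARF",       # an HPC cluster name, as an example
    },
    "inputDatasets": [1234],              # catalogue ids of the data consumed
    "outputDatasets": [5678],             # derived data registered back
    "parameters": {"reduction_script": "reduce_run.py"},
}
```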
16:02
And ISIS have got data reduction systems, so people can run their data reduction jobs through a very, very simple interface, and it's actually also a mobile interface, again using ICAT to access data and provenance. So we need to provide tools for all these things. In general, and I know this slide was shown last time, we realised that there's a general problem
16:23
we can solve here, where we can use our high-performance computing infrastructure inside scientific computing to support pipelines for particular fields, particularly data-intensive fields. We're concentrating on tomography at the moment;
16:43
MX is an area we also want to do fairly soon, down the line. This is where we would provide the whole pipeline, the whole data analysis pipeline, within the computing centre, accessing our HPC, and we're making significant progress on this; this is all being set up now,
17:03
so when the IMAT instrument comes online later in the autumn this should be available to users as a service. We want to generalise this, and Andy mentioned yesterday a project called PaNDaaS, which was a proposal led by the ESRF to build a very general infrastructure to do this,
17:21
federated across different institutions, using ICAT, or rather using metadata, as a controlling feature at the top, as Andy mentioned yesterday, and then providing virtualised cloud access for users to their analysis tools as a general service. This wasn't funded;
17:42
it reviewed well, but wasn't actually funded. But we think this is really important, so we are pursuing it anyway at a much lower level locally. This probably won't mean much to anybody, but this is the architecture of what we're setting up in this area, so that we can provide, in the middle, private cloud access to
18:03
analysis tools for user communities, which talks to our data backplane, where you have the data already in the centre, in the archives, and gives access to various high performance computing resources in the top right,
18:23
and then provide users with various interfaces to access those tools in a very general way, all backed up by capturing metadata as we go along about what's happening. So that's one area we're working on.
18:41
The second area we've done a lot of work on is sharing and publishing data. We've been issuing DOIs for datasets for ISIS data for some time, and this is actually beginning to get used. This is very similar to what Suzanne has been talking about with the crystal structures; we do it for experiments
19:01
and the raw data associated with experiments. So there's a DOI somewhere in this paper, or you can go and search the dataset metadata, search the ISIS metadata, for the same DOI that's in the paper; that will take you to a landing page that we have which may
19:20
provide some more information about the dataset and the data collection process, and then, given the right permissions, you can go on from that system and access the data yourself. So this is open data; it fits within the facility's data policy of releasing data after three years,
19:43
so hopefully it will get more take-up, and it also gives mechanisms for linking to data and tracking the whole research object, the various artifacts that came out of the experiment.
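As a sketch of what sits behind such a dataset DOI, here is roughly the minimal DataCite-style metadata involved (the DOI values and names are placeholders, not a real ISIS record):

```python
# A hedged sketch of minimal DataCite-style metadata backing a dataset DOI
# for an experiment (DOIs and values are placeholders, not a real record).
datacite_record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.xxxx/example-dataset"},
    "creators": [{"creatorName": "Example, Investigator"}],
    "titles": ["Raw data from experiment RB0000000"],
    "publisher": "STFC ISIS Facility",
    "publicationYear": "2015",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    "relatedIdentifiers": [
        # link forward to the paper that used the data
        {"relationType": "IsReferencedBy", "relatedIdentifierType": "DOI",
         "relatedIdentifier": "10.xxxx/example-paper"},
    ],
}
```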
20:00
One of the other things we've been doing in metadata is publishing to general purpose harvesters as well; there are quite a number of these being produced by the data community in general, across all disciplines. This is one from a project called EUDAT, which is building data infrastructure across disciplines in Europe; they have a data discovery service, B2FIND,
20:22
so we've published ISIS metadata, the same data that's in DataCite actually, to that discovery service, and we've been mapping from the ICAT metadata to the metadata schema they provide; they have a much simpler set of metadata than ICAT, around 17 Dublin Core-style fields for general purpose discovery.
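The mapping itself is essentially a flattening exercise; a hedged sketch (illustrative field choices, not the actual EUDAT/B2FIND mapping) might look like this:

```python
# Rough sketch of flattening a rich catalogue record down to the dozen or so
# generic fields a cross-disciplinary discovery service expects (field names
# follow Dublin Core conventions; the mapping shown is illustrative).
def to_discovery_record(investigation: dict) -> dict:
    return {
        "dc:title": investigation["title"],
        "dc:creator": "; ".join(investigation.get("investigators", [])),
        "dc:subject": ", ".join(investigation.get("keywords", [])),
        "dc:description": investigation.get("summary", ""),
        "dc:publisher": "STFC ISIS Facility",
        "dc:date": investigation.get("startDate", ""),
        "dc:identifier": investigation.get("doi", ""),
        "dc:type": "Dataset",
        "dc:rights": "Released under the facility data policy after 3 years",
    }
```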
20:42
One of the ways we want to take that forward is a new project called NFFA-Europe, which stands for Nanoscience Foundries and Fine Analysis; the project is just about to start.
21:03
It's mostly about transnational access to facilities, both large and small scale facilities, but there's a significant part of it which is on data management and sharing of the scientific experiments, so we want to publish the data generated in the project and manage it in a data
21:24
infrastructure similar to what we've been doing, in a federated way. I'll come back to this a little bit later. So, back to metadata, some final words. What we're kind of getting towards in this talk is three levels of metadata.
21:43
First, very general discovery metadata, which allows people to search for things and which is used by search engines, so Dublin Core, such as in EUDAT and DataCite, between 10 and 20 fields;
22:02
systems like Dryad, figshare and Zenodo, which were mentioned yesterday, work at this sort of level. This is where you typically also associate your DOIs; you probably need to add some domain-specific terms to make it really useful. And then there's the sort of metadata which we've been talking about in this talk,
22:22
which I call access metadata, or you could call it structural metadata or control metadata or something; it's much more about how data is organised, who it belongs to, how it's accessed, what was done to it, sort of provenance, and this is where metadata can be used in the data management process to control things as well.
22:42
So CSMD and the other metadata frameworks that are around fit into this space; they're still very cross-discipline, very generic, like the discovery metadata. And then there's what I think was being talked about yesterday, what I'd call usage metadata, where we're talking much more about the samples,
23:02
the parameters, instruments and techniques. I think this is the area where we need to combine these things together, so the ESRF approach was very interesting from that point of view. So I think we'll need to provide and support all these areas, and combine and integrate them.
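To make the three levels concrete, here is one hypothetical dataset described at each level (field names and values are purely illustrative):

```python
# The three levels sketched above, for one hypothetical dataset: generic
# discovery fields, structural/access metadata, and domain-specific usage
# metadata. Field names and values are illustrative only.
discovery = {
    "title": "Powder diffraction study of sample X",
    "creator": "Example, Investigator",
    "identifier": "10.xxxx/example-dataset",
    "date": "2015-07-01",
}
access = {  # how the data is organised, owned, and controlled
    "investigation": "RB0000000",
    "owner": "user123",
    "embargoUntil": "2018-07-01",
    "files": 2048,
    "provenance": {"derivedFrom": None, "software": None},
}
usage = {  # the sample, instrument and technique detail needed to reuse it
    "instrument": "WISH",
    "technique": "neutron powder diffraction",
    "sample": {"formula": "CaCO3", "temperature_K": 300},
}
```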
23:21
As an area of collaboration, as I mentioned, in this NFFA-Europe project one thing we've got to do is metadata management, so in support of information discovery we need to define metadata standards for that area so we can build this common infrastructure, and we're just about to start work within the
23:42
Research Data Alliance to start to produce some common formats. There are a number of existing working groups: the materials interest group, the photon and neutron science interest group, the metadata working group, and I've now just discovered a chemistry interest group is about to start, which may well be
24:01
very useful to take part in. And there are some starting points already; there's also a CODATA framework for nanostructures. I'm leading this activity, so I'd really like to collaborate on it.
24:24
A final plug: while we're talking about CODATA, there is the CODATA Data Science Journal, which has recently been relaunched and which is dedicated to articles on data science and policies, practices
24:42
and management, well, you can read this, so general issues around data. I've recently been appointed section editor for large-scale data facilities, so I'm plugging that as a possible place for publishing more information about this sort of thing.
25:02
So, a final word. Managing raw data is complex; to do it well we need good, systematic metadata collection, automating as much as possible and tracking what happens to the data. We should extend support across the lifecycle as much as possible, with data analysis and publication, linking data
25:20
from different sources, and support building this whole notion of the research object. We don't really just want to preserve data; ultimately we want to preserve science. Metadata will be of different use at different levels, and we should be exploiting, and in fact we will be using, public infrastructure.
25:41
Okay, I'm sorry I've overrun a little bit, but thank you very much. [applause] Questions?
26:04
So I noticed that you use Google Code... - Yeah, right, in two days that's going away; that was doing all the work, and it's now on GitHub. - I did the same thing with my project, and that kind of brings me to my point:
26:23
how long do you envision keeping this raw data at your facility? Is there some sort of sunset on data? - It depends on the facility. ISIS, whose total data collection in the grand scheme of things is relatively small,
26:44
half a petabyte in total, have kept all the raw data they've ever collected, going back to 1982, and have no plans to stop doing that. Diamond, because they are much...
27:04
oh, sorry, ISIS also have a much more rigorous data policy; they actually state that they will make best efforts to keep data as long as possible. Diamond, because of the
27:20
data volumes, and partly because of the culture, have much less commitment. They have kept all the data generated, so it is all archived, but they don't guarantee to, and whilst they plan to keep on doing it, they have to seriously
27:44
consider this exponential growth; some of our projections of what data they might have in five years' time are quite frightening, and that has a cost implication. We have the capacity, so we can do it, no problem, but it's a cost implication, and it's whether they want to keep on doing that. And that's a matter of the value to the community, so if there's
28:04
demand from the community, like you guys, to say we really want to keep this in particular, or to triage the datasets we really want, then there will be much more incentive for them to do so. - Brian, it's obviously an exemplary
28:21
effort that's going on here, and we made an extensive investigation as a working group, and congratulations on that. Admittedly some time ago, I went to the ISIS website to access data that was more than three years old, and it wasn't clear to me what I could access,
28:44
and what I tried, in a random sort of way, consistently led to a request for a username and password. So I think the management of the interface for a person like me who randomly comes along, not so randomly,
29:03
to get access to data that is more than three years old... - I entirely agree. There are two aspects there. Firstly, there is a username, but it's free; essentially anyone can register. Mickey Mouse at Google probably wouldn't, but anyone else
29:26
can register, and it's free from that point of view. It's not particularly clear and it's not particularly easy to use, we know this; in fact, one of the things we're doing at the moment is changing the interfaces. We've got new versions of the interfaces, lots of new tools,
29:43
hopefully rolling out about now or in the next few months, to change a lot of that, so we know there are things to fix in that area. - You probably have lots of guinea pigs, but if you need another one I'd be happy to oblige. - Let's thank Brian again, thank you very much.
30:01
[applause]