
Supporting Data Management Workflows at STFC


Formal Metadata

Title: Supporting Data Management Workflows at STFC
Number of Parts: 22
Author: Brian Matthews
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
STFC has developed a systematic approach to managing and archiving the data generated by its large-scale analytical facilities, which is used, with variations, by the ISIS Neutron Source, the Diamond Light Source and the Central Laser Facility. The approach is centred on the ICAT experiment metadata catalogue. ICAT acts as a core middleware component, recording and guiding the storage of raw data and the subsequent access and reuse of that data; it has evolved into a suite of tools which can be used to build data management infrastructure. In this talk, I shall describe the current status of ICAT. The data rates and volumes generated by facilities are ever increasing, and experimental science is becoming more complex, which presents challenges to the user community in accessing, handling and processing data. I shall describe some approaches to these problems and consider how we are exploring further support for data analysis and publication workflows within a large-scale facility. Finally, I shall consider how we might develop metadata to capture and share this information across communities.
Transcript: English (auto-generated)
So, thank you, and good morning everybody. I'm Brian Matthews, from the Scientific Computing Department. I'm not a crystallographer; in fact we're kind of two steps removed from the actual crystallography and the experimental teams, so I don't think I mention crystallography once in this talk.
Also, some of you may have seen some of this before, because my colleague Erica Yang presented at previous meetings of this group, and some of the slides are similar, but I thought I would repeat some of it because I know there are new people here.
So, what am I going to talk about? A variety of things: what we're doing now and what we want to do, in two particular aspects, supporting user workflows and sharing and publishing data; and finally I'll come back to metadata. This is a
workshop about metadata, so I have some comments about approaches to metadata and some things you might want to do there. So, what do we do now? Raw data management. My job is in the Scientific Computing Department at STFC. We have a data centre on the laboratory campus, that's our building at the top there,
with the machine room, and amongst many other things we support the three STFC funded facilities on the Harwell campus: the ISIS Neutron and Muon Source, the Diamond synchrotron light source, and the Central Laser Facility. What we
do for them is provide data archiving and management tools, so data storing and archiving, and a variety of management tools to manage that data, around a common core toolset and common core expertise. What we do for each of them is actually slightly different; each facility
has different requirements. For ISIS we provide tools that support their data workflows much more intimately, right through the scientific lifecycle, from proposal to experimental collection and beyond, so there's much richer metadata there, from our
point of view at least, and we also provide data archiving in our building. For the Diamond synchrotron our role is rather more limited, much more to do with data archiving, so we actually store much less metadata; that doesn't mean Diamond doesn't collect a lot of metadata, but we
store less. The interesting, or challenging, part of what Diamond brings, from our point of view, is a real challenge in scale and how to scale up the archive: they have data rates and data sizes which are significantly more challenging, much more so than the ISIS neutron source. So different
features of our infrastructure are tested by the two different big facilities. We also do some work for the Central Laser Facility, a smaller facility, which is much more to do with real-time data management and feedback to users, with rich metadata on laser configuration.
I won't talk about the laser facility any more than that; I'll concentrate on ISIS and Diamond, the ones of most interest here. So this is a picture of the kind of thing we do, the infrastructure we supply for the Diamond archive:
data is streamed, well, passed over from the data collection systems, the Lustre file systems that Diamond run, through a system called Storage-D, which is a storage daemon, essentially to store data onto tape,
but also to capture metadata as we go along, in the system we call the ICAT. A few people have mentioned the ICAT already, and I'll talk a lot about it during this talk. Then we provide tools for data access through a web front end and through some downloader systems and data browsers.
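As a rough sketch of what that ingest step amounts to, here is a minimal Python illustration. Storage-D and the ICAT are real systems, but everything below (the archive path, the catalogue list, the ingest function) is a hypothetical stand-in, not their actual interfaces.

```python
# Hypothetical sketch of an archive-ingest step: copy a raw file to an
# archive store and record its metadata in a catalogue. Stand-ins only;
# this is not the Storage-D or ICAT API.
import hashlib
import os
import shutil
from datetime import datetime, timezone

ARCHIVE_ROOT = "/tmp/archive"   # stand-in for the tape-backed store
catalogue = []                  # stand-in for the ICAT database

def ingest(path, investigation):
    """Archive one raw data file and capture basic metadata about it."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    dest = os.path.join(ARCHIVE_ROOT, investigation, os.path.basename(path))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(path, dest)    # Storage-D would stream this to tape
    record = {
        "name": os.path.basename(path),
        "investigation": investigation,
        "location": dest,
        "size": os.path.getsize(path),
        "checksum": digest,
        "ingested": datetime.now(timezone.utc).isoformat(),
    }
    catalogue.append(record)    # ICAT would hold this as a Datafile entry
    return record
```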
Now, I mentioned that the challenges we're getting from the Diamond archiving are to do with scale rather than complexity of metadata, from our point of view. As of July we're at about 3.3 petabytes of data in total, and it keeps creeping up; we've captured all the data that Diamond has produced so far, and in a vast
number of files. Diamond is very file-rich, so it's not just the total volume, it's also the sheer number of items: 846 million files. And this is increasing enormously; six months before this it was two thirds the size, so it has grown by half as much again in six months.
So it's a real challenge for us, at really fast rates, you know, going through 12,000 files per minute to catalogue them all, just to keep up with the rates we're getting at the moment: a real challenge in scale which really pushes the system. This graph at the bottom shows how the data rates are increasing, so it raises
questions for Diamond about how they will sustain this, and there are cost implications for all of it, but this is what we're doing at the moment. I think there's a talk on Diamond next, so I won't say much more about that.
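Just to put those figures together, a back-of-envelope check using only the numbers quoted in the talk:

```python
# Back-of-envelope arithmetic on the quoted figures: 3.3 PB, 846 million
# files, catalogued at 12,000 files per minute.
total_bytes = 3.3e15
n_files = 846e6
avg_file_mb = total_bytes / n_files / 1e6     # ~3.9 MB per file
files_per_second = 12_000 / 60                # 200 files/s
days_to_walk = n_files / 12_000 / (60 * 24)   # ~49 days to re-catalogue all
print(f"{avg_file_mb:.1f} MB average, {files_per_second:.0f} files/s, "
      f"{days_to_walk:.0f} days")
```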
So, to go back to how we do this in general: we have to build data pipelines for managing this data, in different situations for different facilities, but with a rather common set of tools. As several people have already mentioned, we've been building a system over several years now which we call the ICAT, which is essentially
a catalogue of experimental data: a metadata catalogue which captures information about experiments. This has developed over the last few years into a set of tools, so it's not just one single catalogue, it's become a much wider set of tools which we use, and these tools are
buried quite deeply in our infrastructure. They're not tools which just sit on top of your data and present things to a user; they are quite deeply embedded into the information management and data management systems,
doing as much automatic metadata capture as possible, particularly from proposal systems, particularly tracking files, particularly from data acquisition systems on instruments, so we don't have to depend on the user too much to provide metadata; it's streamed off where you can. To do that
it's integrated into user office systems and integrated into data acquisition systems, so it has that feature of being very deeply embedded. Then that metadata itself can be used during subsequent processing to control the way things are subsequently accessed and used, so it has this notion
of active metadata, which Simon mentioned earlier, or metadata as middleware, where it isn't just a user-facing thing, it's part of the information system itself. We do provide a front end to the user as well, as a separate tool which works with the ICAT,
the TopCAT, but we can also integrate it into other tools through APIs, particularly data analysis frameworks like Mantid for neutrons and DAWN for X-rays. So this is a kind of schematic which we'll use for a while.
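For a flavour of that API access, here is a hedged sketch using python-icat, the ICAT Python client; the service URL, authenticator plugin, credentials and instrument name are placeholders to adapt to a real deployment, and the query assumes the standard ICAT schema.

```python
# Sketch of querying an ICAT server with python-icat; URL, authenticator
# and credentials are placeholders, and the query assumes the standard
# ICAT schema (Investigation linked to Instrument).
import icat

client = icat.Client("https://icat.example.org/ICATService/ICAT?wsdl")
client.login("simple", {"username": "reader", "password": "secret"})

# JPQL-style ICAT query: recent investigations on a named instrument.
investigations = client.search(
    "SELECT inv FROM Investigation inv "
    "JOIN inv.investigationInstruments ii JOIN ii.instrument ins "
    "WHERE ins.name = 'WISH' ORDER BY inv.startDate DESC"
)
for inv in investigations[:10]:
    print(inv.name, inv.title)
```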
So the ICAT is, well, essentially a database really, but it is designed to provide flexible data searching, to capture information right through the pipeline of a process: from when proposals are submitted and accepted, through
scheduling information, through the experimental setup, data acquisition, collection and storage, and then also capturing information for subsequent analysis, derived data and finally publications. Some lines are solid, some lines are dotted: the solid lines are where things are
heavily automated, the dotted lines are where things are less automated. As I mentioned, it's meant to be scalable and extensible, and hopefully to expand to manage the data
rates we're having to deal with; we can access high-performance resources through it, so HPC systems, link to other outputs, the sort of thing that Suzanne has mentioned, linking to other research outputs, and make it policy-aware, which is something I'll mention a little later as well. The core of it is a metadata model which we call the CSMD, the Core Scientific Metadata Model,
which is a relatively simple model for experiments, and very general as well. It's centred around this notion of an investigation, which is our notion of an experiment, particularly an experiment on a facility; it has quite a close relationship with the proposal,
but it can also have sub-experiments. An investigation is associated with investigators, with people and teams, associated with instruments, associated with the context the experiment was done in, so the proposal itself, associated with
authorisation conditions; and then, when you come through the experiment, it's associated with samples, experimental settings, and then datasets and data files, so you can link all these things together. Then there are all these very general things on the right-hand side called parameters. So this is
a very structural way of looking at metadata at quite a high level, very general, and all the very domain-specific parts are kind of hidden in these parameter fields on the datasets, which are, from our point of view, free-form, so we can capture any kind of domain-specific metadata in them.
One area where we might want to do more is to make that area more detailed; some of the work Andy was talking about yesterday, on capturing parameter sets, would give us much more detail there, so from our point of view it's a very interesting area. There are some URLs here about where to find more information about the model.
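To make the shape of the model concrete, here is a minimal sketch of the CSMD structure just described, as plain Python dataclasses; the field names are illustrative rather than the exact CSMD schema.

```python
# Illustrative sketch of the CSMD shape: an investigation with datasets,
# datafiles and free-form parameters. Field names are indicative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Parameter:          # the free name/value/units slots for domain terms
    name: str
    value: str
    units: str = ""

@dataclass
class Datafile:
    name: str
    location: str
    parameters: List[Parameter] = field(default_factory=list)

@dataclass
class Dataset:
    name: str
    datafiles: List[Datafile] = field(default_factory=list)
    parameters: List[Parameter] = field(default_factory=list)

@dataclass
class Investigation:      # the central notion: one experiment on a facility
    name: str
    title: str
    instrument: str
    investigators: List[str] = field(default_factory=list)
    samples: List[str] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)
```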
As I mentioned, the ICAT is now a tool suite, so there's a whole bunch of tools associated with it. The ICAT core at the bottom is essentially a database, an Oracle database or a MySQL database, and then there's a series of APIs, client-server APIs,
which can handle different authorisations, and everything's quite pluggable and flexible, so you can plug in different authorisation plugins. There's a separate web interface, which is a separate tool; we've been changing that a lot recently, so there's a new version coming out very soon. Then there are other tools on the left-hand side which we've been building:
the ICAT Job Portal, which we'll talk about later, for accessing HPC and applications, and the IDS, the ICAT Data Server, a tool we've added very recently which separates out the whole data-handling aspect, a crucial adjunct to the whole metadata system. On the right-hand side
there are tools contributed by other people, not done by the core team: an ICAT manager, an ICAT administrator, and so on, and then it links through to analysis systems like DAWN and Mantid. I mentioned the ICAT Data Server, which is a fairly recent tool; this has been done in response to the
data scaling problem, where we've separated out the whole data ingest and access component from the ICAT. It uses the same authentication and the same models the ICAT uses for the data, so we can provide scalable services for this.
One thing we've particularly added to it is data transfer services as well, so it's not just uploading and downloading data: data can be packaged up for transfer elsewhere, which is a real problem for us. We can provide HTTP transfer, but we can also provide more
performant protocols like Globus Online, to help with these very large data transfers we've been getting.
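As a sketch of what downloading through the IDS looks like over HTTP: the endpoint names below follow the IDS REST style, but the URL, session id and dataset id are placeholders, so check the IDS documentation for the exact calls on a real deployment.

```python
# Sketch of restoring and downloading a dataset via the ICAT Data Server's
# HTTP interface; URL, session id and dataset id are placeholders.
import requests

IDS = "https://ids.example.org/ids"
session_id = "00000000-0000-0000-0000-000000000000"  # from an ICAT login

# Ask the IDS to stage the data (it may need restoring from tape first)...
prep = requests.post(f"{IDS}/prepareData",
                     data={"sessionId": session_id,
                           "datasetIds": "12345", "zip": "true"})
prep.raise_for_status()
prepared_id = prep.text.strip()

# ...then stream the packaged data down once it is ready.
with requests.get(f"{IDS}/getData", params={"preparedId": prepared_id},
                  stream=True) as r:
    r.raise_for_status()
    with open("dataset-12345.zip", "wb") as out:
        for chunk in r.iter_content(1 << 20):
            out.write(chunk)
```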
So the ICAT is now an international collaboration and an open source project; it's in production use on our campus, but also now internationally. Andy mentioned the work at ESRF; it's also being used at Oak Ridge, and various other places are looking at it as well and actively contributing, and Andy is now head of the steering committee of the open source project. Okay, so where do we want to go from here?
Two things, really. Firstly, supporting workflows in a richer manner. This is just a lifecycle view of what we do, from proposal to publication. Traditionally, facilities are pretty good at managing the left-hand side, the early stages,
but don't provide a great deal of support for the data analysis steps, which are usually left to the user: the user takes the data home and does what they like with it. However, this is key to providing user insight, and this is where things are starting to break down somewhat. The data analysis
challenge is becoming much, much greater. We're finding that we have to support very wide areas of science, but also hugely varying levels of expertise, from real hackers and co-developers right through to people who just want to be given a final result, so people need a lot more help.
Data is getting so much bigger that it becomes very hard to move and very hard to store at user institutions. The software requirements are becoming much more complex, and the processing much more compute-intensive as well, again with very
variable capabilities at user institutions, and we have to deal with the whole tracking problem for users, as I already mentioned. So we have serious problems in the data analysis area, and this is an area facilities are beginning to focus on. So we've been looking at how we might provide more support for data analysis,
these parts of the workflows. We've been modifying the ICAT to handle provenance information for derived jobs, derived data and software, and access to HPC, and we've been putting that into a variety of tools. Three tools: Mantid, already mentioned, which will
launch jobs and give results, with all of that recorded by the ICAT; the ICAT Job Portal, another system which launches jobs mediated by the ICAT, onto the HPC system of your choice;
and ISIS have data reduction systems, so people can run their data reduction jobs through a very, very simple interface, which is also a mobile interface, all using the ICAT to access data and provenance. So we need to provide tools for all of these things in general.
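A hypothetical sketch of that "active metadata" pattern, recording provenance as jobs are launched: the batch submission uses Slurm's sbatch, and the provenance store and script name are stand-ins, not the real ICAT Job Portal API.

```python
# Hypothetical sketch: launch a reduction job on a batch system and record
# provenance (inputs, software, job id) for the catalogue as we go.
import subprocess
from datetime import datetime, timezone

provenance_log = []   # stand-in for Job/derived-data entries in the ICAT

def launch_reduction(input_files, app, version):
    """Submit a reduction job via Slurm and record what was run on what."""
    result = subprocess.run(
        ["sbatch", "--parsable", "reduce.sh", *input_files],
        capture_output=True, text=True, check=True)
    job_id = result.stdout.strip()
    provenance_log.append({
        "job_id": job_id,
        "application": f"{app}/{version}",
        "inputs": list(input_files),
        "submitted": datetime.now(timezone.utc).isoformat(),
    })
    return job_id
```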
I know this slide was shown last time, but we realised there's a general problem we can solve here: we can use our high-performance computing infrastructure to support pipelines for particular fields, particularly data-intensive fields. We're concentrating on tomography at the moment;
MX is an area we also want to do fairly soon down the line. This is where we would provide the whole data analysis pipeline within the computing centre, accessing our HPC, and we're making significant progress on this; it's all being set up now,
so when the IMAT instrument comes online later in the autumn this should be available to users as a service. We want to generalise this, and Andy mentioned yesterday a project called PaNDaaS, a proposal led by the ESRF to build a very general infrastructure to do this,
federated across different institutions, using the ICAT, or rather using metadata, as a controlling feature at the top, as Andy mentioned yesterday, and then providing virtualised cloud access for users to take up their analysis tools as a general service. This wasn't funded;
it reviewed very well, but wasn't actually funded. But we think this is really important, so we are pursuing it anyway at a much lower level, locally. This diagram won't mean much to anybody, but it's the architecture of what we're setting up in this area: in the middle we provide private cloud access to
analysis tools for user communities, which talks to our data backplane, since the data is already in the centre, in the archives; access to various high-performance computing resources in the top right;
and then we provide users with various interfaces to access those tools in a very general way, all backed up by capturing metadata as we go along about what's happening. So that's one area we're working on.
The second area we've done a lot of work on is sharing and publishing data. We've been issuing DOIs for datasets at ISIS for some time, for ISIS raw data, and this is actually beginning to get used. It's very similar to what Suzanne has been talking about with the crystal structures, but we do it for experiments
and the raw data associated with experiments. So there's a DOI somewhere in this paper, and you can go and look through the DataCite metadata search for the same DOI that's in the paper; that takes you to a landing page we provide which may
give some more information about the dataset and the data collection process, and then, given the right permissions, you can go off on top of that system and access the data yourself. So this is open data; it fits within the facility's data policy of releasing data after three years.
Hopefully we'll get more take-up, and also mechanisms for linking to data and tracking the whole research object, the various artefacts that came out of the experiment.
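For illustration, the DataCite record behind such a DOI looks roughly like this; the 10.5286 prefix is STFC's, but the suffix, creator, title and dates here are invented for the example.

```python
# Illustrative DataCite-style record for a raw dataset DOI; the suffix,
# creator, title and dates are invented for the example.
datacite_record = {
    "identifier": {"identifier": "10.5286/ISIS.E.EXAMPLE",
                   "identifierType": "DOI"},
    "creators": [{"creatorName": "Example Experiment Team"}],
    "titles": [{"title": "Raw data from an example ISIS experiment"}],
    "publisher": "STFC ISIS Neutron and Muon Source",
    "publicationYear": "2015",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    "dates": [{"date": "2012-06-01", "dateType": "Collected"}],
}
```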
One of the other things we've been doing with metadata is publishing to general-purpose harvesters as well; there are quite a number of these being produced by the data community in general, across all disciplines. This one is in a project called EUDAT, which is building data infrastructure across disciplines in Europe. They have a data discovery service called B2FIND
where you can find things, so we've published ISIS metadata, the same data that's in DataCite actually, to that discovery service, and we've been mapping from the ICAT metadata to the metadata they provide; they have a much simpler set of metadata than ICAT, 17 fields, akin to Dublin Core, general-purpose discovery metadata.
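The mapping itself is essentially a flattening. A sketch of the idea, where the target field names are Dublin Core-ish placeholders rather than EUDAT's exact schema:

```python
# Sketch of flattening a rich catalogue record down to a small, generic
# discovery record; target fields are Dublin Core-ish placeholders.
def to_discovery(inv):
    return {
        "title": inv["title"],
        "creator": "; ".join(inv.get("investigators", [])),
        "publisher": "STFC ISIS Neutron and Muon Source",
        "date": inv.get("startDate", ""),
        "identifier": inv.get("doi", ""),
        "subject": "neutron scattering",
        "description": inv.get("summary", ""),
        "rights": "Open after a three-year embargo",
    }

print(to_discovery({"title": "Example experiment",
                    "investigators": ["A. Scientist"],
                    "doi": "10.5286/ISIS.E.EXAMPLE"}))
```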
One of the ways we want to take that forward is a new project called NFFA-Europe, Nanoscience Foundries and Fine Analysis, which was just about to start.
It is mostly about transnational access to facilities, both large- and small-scale facilities, but there's a significant part of it on data management and sharing of the scientific experiments, so we want to publish the data generated in the project and manage it in a data
infrastructure similar to what we've been doing, in a federated way. I'll come back to this a little later. So, back to metadata, some final words. What we're getting towards in this talk is three levels of metadata.
First, very general discovery metadata, which allows people to search for things and which is used by search engines, so Dublin Core, say, and DataCite, between 10 and 20 fields;
systems like Dryad, figshare and Zenodo, which were mentioned yesterday, are at this sort of level, and this is where you typically also associate your DOIs, though you probably need to add some domain-specific terms if you really want to make it useful. Then there's the sort of metadata we've been talking about in this talk,
which I call access metadata, though you could call it structural metadata or control metadata or something; it's much more about how data is organised, who it belongs to, how it's accessed, what was done to it, a sort of provenance, and this is metadata that can be used in the data management process to control things as well.
So we're looking at CSMD and the other metadata frameworks that are around to fit into this space; they're still very cross-discipline, very generic on the discovery side. And then there's what I think was being talked about yesterday, usage metadata, where we're talking much more about the samples,
the parameters, the instruments and the techniques. I think this is the area where we need to combine these things together, so the ESRF approach was very interesting from that point of view. I think we'll need to provide and support all these areas, and combine and integrate them.
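One dataset seen at those three levels might look like this; all values are invented for illustration.

```python
# The same dataset at the three metadata levels: discovery (generic,
# searchable), access/structural (organisation, rights, provenance) and
# usage (samples, instruments, techniques). All values are invented.
discovery = {"title": "Example powder diffraction experiment",
             "identifier": "10.5286/ISIS.E.EXAMPLE",
             "subject": "neutron scattering"}
access = {"owner": "RB1234567",
          "location": "/tmp/archive/example_cycle/",
          "embargo_until": "2018-06-01",
          "derived_from": None}
usage = {"instrument": "WISH",
         "technique": "powder diffraction",
         "sample": "example oxide powder",
         "temperature_K": 4.2}
```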
One area of collaboration, as I mentioned, is this NFFA-Europe project. One thing we've got to do there is metadata management, so in support of information discovery we need to define metadata standards for that area so we can build this common infrastructure, and we're just about to start work within the
Research Data Alliance to produce some common formats. There are a number of existing working groups: the materials interest group, the photon and neutron interest group, the metadata working group, and I've just discovered a chemistry interest group that's about to start, which may well be
very useful to take part in. There are some starting points already; there's also a CODATA framework for nanostructures. I'm leading this activity, so I'd really like to collaborate on it.
A final plug, while we're talking about CODATA: there is the CODATA Data Science Journal, which has recently been relaunched and which is dedicated to articles on data science and on data policies, practices
and management, well, you can read this, general issues around data. I've recently been appointed section editor for large-scale data facilities, so I'm plugging it as a possible place for publishing more information about this sort of thing.
So, a final word: managing data is complex. To do it well we need good systematic metadata collection, automating as much as possible and tracking what happens to the data; we need to extend support across the lifecycle as much as possible, into data analysis and publication, linking data
from different sources, and support building this whole notion of the research object. We don't just want to preserve data; ultimately we want to preserve science. Metadata will be of different use at different levels, and we should be, and in fact will be, using public infrastructure.
Okay, so I'm sorry I've overrun a little bit, but thank you very much. [applause] Questions?
Q: So I noticed that you use Google Code as...

A: Yeah, right, in two days that's going to... That was doing all the work; it's now on GitHub.

Q: I did the same thing with my project. And that kind of brings me to my point: how long do you envision keeping this raw data at your facility? Is there some sort of a sunset on the data?

A: It depends on the facility. ISIS, whose total data collection in the grand scheme of things is relatively small,
half a petabyte in total, have kept all the raw data they've ever collected, going back to 1982, and have no plans to stop doing that. Oh, sorry, ISIS also have a much more rigorous data policy, which actually states that they will make best efforts to keep data as long as possible. Diamond, because of the
data volumes, and partly because of the culture, have much less commitment. They have kept all the data generated, they've archived it all, but they don't guarantee to, and whilst they plan to keep on doing it, they have to seriously
consider this exponential growth; some of our projections of what data they might have in five years' time are quite frightening, and there is a cost implication. We have the capacity, so we can do it, no problem, but it's a cost implication, and it's whether they want to keep on doing that; that's a matter of the value to the community. So if there's
demand from the community, like you guys, to say we really want to keep this, or to triage the datasets we really want, there will be much more incentive for them to do so.

Q: Brian, it's obviously an exemplary effort that's going on here, and congratulations on that. We made an extensive investigation as a working group, admittedly some time ago, but I went to the ISIS website to access data that was more than three years old. It wasn't clear to me what I could access, and what I tried, in a random sort of way, consistently led to a request for a username and password. So I think there's an issue with the management of the interface for a person like me who comes along, not so randomly, to get access to data that is more than three years old.

A: I entirely agree; there are two aspects there. Firstly, there is a username, but it's free; essentially anyone can register. A mickeymouse@ address probably wouldn't get through, but anyone else
can register, it's free from that point of view. Secondly, it's not particularly clear and it's not particularly easy to use; we know this. In fact, one of the things we're doing at the moment is changing the interfaces: we've got new versions of the interfaces, lots of new tools, hopefully rolling out about now or in the next few months, which will change a lot of that, so we know there are things to improve in that area.

Q: You probably have lots of guinea pigs, but if you need another one, I'd be happy to oblige.

Host: Let's thank Brian again, thank you very much. [applause]