
II. Improving the management of experimental data: The data explosion and the need to manage diverse data sources in scientific research


Formal Metadata

Title
II. Improving the management of experimental data: The data explosion and the need to manage diverse data sources in scientific research
Title of Series
Number of Parts
15
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Crystal structure determination has become a high-throughput activity, and even at the level of the departmental laboratory, automation is increasingly important in managing the experiment and, crucially, the data collected from the experiment and its subsequent processing, analysis and dissemination. The UK National Crystallography Service is a medium-scale facility which needs to address issues of data management, accountability and dissemination, on top of its efforts to achieve best experimental practice. In providing a service to chemists as well as crystallographers, it has amassed considerable experience in cross-discipline ontology building, data publication via repository platforms, and integration with laboratory management systems.
Transcript: English(auto-generated)
Thank you very much.
Yes, in a session surrounding data management, I'm going to start by telling you why I got into this. I've been involved with this national facility for about 20-odd years. So we collect data on behalf of others, whether that be structure factors that we hand out or completed CIFs. And therefore, I feel we have a duty of care in terms of managing that data responsibly and appropriately.
And that sort of underpins absolutely everything we do, but this first cropped up and whacked me in the face, if you like, when I had to move this centre from Wales to the south of England about 100-odd miles down the road.
And we did that ourselves: we didn't get Pickfords, the moving people, in; I did it in a Transit van myself. And we'd been running this national facility for quite some time in that place. And so in physically shifting the laboratory, I discovered a lot of skeletons, an awful lot of skeletons. And they came in magnetic tape form, in magnetic disc form, they came in paper form,
line printouts, filing cabinets full of these things, I even uncovered punch cards. And that move happened in 1998, and I've only just finished now trying to claw back and save some of the data that was actually held on those sort of obsolete media.
That's been a very thought-provoking exercise. And of course, since that period in 1998, as we've seen from previous talks, detector rates, computing power and speed have all increased,
and that's leading to what we currently call the data explosion. But actually we've had a lot of data around that we've needed to manage for many, many decades. And so that's why I got into this, but in a data management session, I feel I need to actually talk a bit more broadly, and I think that's what I'm here to talk about.
And so when we talk about data management, to the practising crystallographer, most of them think about the problems of day-to-day coping with a deluge of information. How do I audit this? How do I find the thing that I did two, three, four years ago? Someone's come to me to write up a paper. Where's all the stuff? That kind of thing.
And it's sort of the day-to-day auditing of what we do. But actually, more important are aspects of scientific responsibility. Is the work that we do reproducible? Have we recorded accurately? Have we archived accurately?
And can we actually enable the results of what we do to be incorporated into future studies and drive future science? This is really, really rather important. But also these days, diversity is kind of important. And by that I mean interdisciplinarity. I'm not entirely sure what my small molecule crystal structures are actually going to be incorporated into,
or what studies they're going to be contributing to when they go out there in the big wide world. But also in terms of actually collecting the data. So I run a mid-range facility. It operates out of a university laboratory. It's very high-powered equipment. We interact with people who have looked at some of these samples,
or related samples on their home laboratory sources, micro-sources, what have you. And we have access to synchrotrons. So we sit in the middle of a whole spectrum of different kinds of ways of collecting data, and different, what I call here, institutional boundaries. And there are also issues about accountability, telling our funders how much work we've done,
being able to adhere to mandates for openness and rigorous data management are all part of that. Very quickly, we have to actually think about what structures are going to be used for in the future.
Not necessarily think about it, but protect against it, guard against it, and make sure our data is available so that they can be incorporated into broader scientific studies. So we're not just doing crystal structure for the sake of analyzing a crystal structure. It's going to go out there and be used in the broader spectrum of things.
And that's what I really want to talk about, the context of crystal structure in the broader scientific landscape, if you like. And that's really the essence of my talk. And therefore, to enable that, good crystallographic practice requires recording information that goes beyond the current remit of the CIF.
And the bottom line is, you do not know how your data is going to be used by other people when it's published and it goes out there. There are many, many examples of this. But in terms of the data deluge, and just simply thinking about chemical crystallography,
we're just simply looking at pure crystallographic data and studies that incorporate just crystallographic data. We are getting more and more complicated. So in our research in Southampton, we generate large systematic series of compounds.
We try thorough crystallization screens and attempt to collect crystal structures of everything that comes out. So we have these sort of matrices of crystal structures. Every dot on this diagram, if you like, is a crystal structure. There can be many polymorphic forms of each of these elements in the matrix.
So we have studies which incorporate 250, 300 crystal structures in one go. And we try to derive rules out of that, and look for patterns in packing and the like. And we're now beginning to go beyond that and say, well, okay, can we adopt this approach with all the data that's out there?
Can we mine the CSD? Can we pull in chemistry from elsewhere? And therefore we need to be able to do that. Very briefly, this kind of charts the process of doing scientific research. It's not a diagram I want to go into any detail in, but the point is that various different parts of this process
are covered by different standards, different ways of going about things, and different ways of managing data. And that can lead to a big issue when it comes to interoperability, compatibility, and trying to cover the whole cycle, if you like. So there are a number of ways in which we can manage electronic data.
Starting from the top here, there are sort of repositories like the CSD, if you like, where final results data can reside. There's systems that we use in our own laboratories to be able to collect, coordinate, and manage data.
And what I'll go on to call later in my talk structured data, data that we understand the form of, we understand the provenance, where it's come from, generally stuff streaming off instruments, and we can very easily structure and catalog that data and put it into these kind of systems. I'll also go on to talk about all the stuff that supports all this experiment,
what we do in our lab notebooks, all the things that we can't easily capture streaming off a machine, the stuff that people do. And that comes from electronic lab notebooks. And there's all these big data stores that we're quite familiar with. But the biggest problem is the fact that everything I do probably goes through this thing.
And if I dropped it off the side of the desk here, I could be in a problem situation. And so we use these devices to control the management or even to be the management of much of the work we do, and increasingly these mobile things kind of confuse the situation as well.
And how do we actually get all these talking to each other? So firstly I'll address what I call this structured data, which is mainly stuff streaming off machines. And I'll draw on what we have to do in the national facility. So as part of the national facility we process about 1,600 to 2,000 data sets per annum,
and that mounts up and that comes from 100-odd different people around the country. So we have a lot coming at us that we have to process and deal with, and we need to underpin the whole life cycle. So we start with proposals to use the facility,
and that is where we get the very beginning of the context for doing, the driving force for doing this work. So when I track back in five years in the future, I think, oh, this crystal structure that I want to publish has been orphaned, what is the scientific context? I've actually got the driving force, the reason,
the scientific program under which these people are actually trying to do the work. So it is actually pretty important to capture all that. We're lucky that we have this kind of system and this sort of process that people have to go through when they apply to use the facility. If you don't have that, you end up with orphan data and you have absolutely no idea of its context.
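The point about orphan data can be sketched in a few lines. This is only an illustration, assuming each dataset record carries a reference to the proposal it was collected under; the class names, field names and identifiers below are hypothetical and are not taken from the facility's actual system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    proposal_id: str   # hypothetical identifier format
    rationale: str     # scientific context supplied by the applicant


@dataclass
class Dataset:
    dataset_id: str
    proposal_id: Optional[str]  # None if the dataset was never linked


def find_orphans(datasets, proposals):
    """Return the ids of datasets that cannot be traced back to a known proposal."""
    known = {p.proposal_id for p in proposals}
    return [d.dataset_id for d in datasets if d.proposal_id not in known]


proposals = [Proposal("P-001", "Polymorph screening of a compound series")]
datasets = [
    Dataset("DS-1", "P-001"),  # linked to a known proposal
    Dataset("DS-2", None),     # never linked
    Dataset("DS-3", "P-999"),  # linked to a proposal we have no record of
]
orphans = find_orphans(datasets, proposals)
print(orphans)
```

A check like this only works if the link to the proposal is recorded at submission time, which is the reason for capturing the rationale up front.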
So grabbing from the originator of your samples the reason, the scientific rationale behind doing this work, is really, really important. And then we do a whole lot of peer review and approval and all that kind of stuff, so we developed a system to manage all that, and I have to chart all that, that's not so interesting. But then we get to somebody actually sending us stuff
and recording what somebody's done, what they think they're submitting to us. This stuff that you're not meant to be able to read is all about safety information, how one's meant to handle this sample. And of course, as I said earlier, we follow on to the synchrotron,
so we have to actually be very wary of what we're taking on site to synchrotrons and how we're handling it all. So we've got to grab all this ancillary information about the material that we're handling and to enable us to go on to do the experiment. Of course, so this is stepping through our interface right now.
This is a part of the experiment which I'm going to come to a little bit further down the line in the talk. It's how you start to manipulate and play around and screen samples. And in chemical crystallography this is not really very well captured and documented. So we have a whole bunch of samples that users provide us with
and we select one and head off down the route of doing the experiment. And this is where we get really important information on individual samples, the reaction scheme and the expected product, all the solvents that were used in synthesizing this and preparing it, etc.
And we do a whole bunch of stuff in the lab which is very structured information. It's all captured very accurately by the computers that drive our diffractometers and we have systems that we use on our PCs that work up the results, if you like. So at the top here, this is how we store all our results information.
This can be a publishing platform, I call it publication, but we can publish to ourselves or publish just to our users or publish to the entire world through the system. So we store everything in a very structured repository, all the files that we generate along the way of the analysis
and routes into the diffraction data and underlying data stores. But this provides us and our users with a very sort of accurate summary report and it sits there until we decide that we want to make it available to the big wide world. So we have policies with our users that say, you know, after three years we're going to make this public if you don't come back to us
and say why you don't want it made public or whether you're going to publish or not. And so we implement policies that will enable us to effectively make all the stuff we do publicly available, one way or another, ultimately. But we have to come back to this and it's important that we have the whole accurate record
so that we can basically publish it. So talking briefly now here about management across facilities. We sit here in the middle of all this, I've shown that this cycle has been what we've just gone around in our facility, but we of course have to mesh in with what people do in their home institutions.
Some of our users are crystallographers who need access to higher-powered instrumentation than they currently have in their laboratory to address the problem they're looking at, and some don't have any diffraction facilities available to them at all. But we want to grab as much information as they have at the point around about here
where they submit into our system. But more importantly we interface with central facilities, so we have regular, periodic beam time at Diamond, and with the number of samples that we're processing we don't want to have to manually put them through the processes
of getting them to the synchrotron and processing at the synchrotron. So we look at interfacing to the management systems that the synchrotron has. I'm not going to go into any great detail in this because Erica will talk about it in much more detail and I'm not experienced enough. But the bottom line is we need to do mapping between what we do in our local facility
and what the synchrotron does. And so we have projects where we've interfaced between the two and this core scientific metadata schema is basically the way we go about this or what we have to map onto. And so we're mindful of what we're doing, joining up very well with what's happening at the synchrotron.
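As a rough illustration of what such a mapping involves, the sketch below renames fields from a local record into the names a central facility might expect, and reports anything that has no counterpart. All field names on both sides are invented for the example; they are not the actual fields of the core scientific metadata schema or of either system.

```python
# Hypothetical mapping from local field names to facility field names.
LOCAL_TO_FACILITY = {
    "sample_code": "sample_name",
    "proposal_ref": "investigation_id",
    "hazard_notes": "safety_information",
}


def map_record(local_record, mapping=LOCAL_TO_FACILITY):
    """Rename known fields; collect the names of fields that have no mapping."""
    mapped = {}
    unmapped = []
    for key, value in local_record.items():
        if key in mapping:
            mapped[mapping[key]] = value
        else:
            unmapped.append(key)
    return mapped, sorted(unmapped)


local = {
    "sample_code": "S-42",
    "proposal_ref": "P-001",
    "crystal_colour": "red",  # no counterpart in the hypothetical target schema
}
mapped, unmapped = map_record(local)
print(mapped)
print(unmapped)
```

Reporting the unmapped fields rather than silently dropping them is a deliberate choice here: it shows what information would be lost when a sample crosses the institutional boundary.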
And so I alluded to the fact that when we start doing an experiment in the laboratory it's not very well documented. There's a lot of unstructured information, things that we need to record
for the scientific record that aren't captured by machines. And by that I mean when you pour something out of the sample pot and start poking around underneath the microscope. And so we started developing into our system ways in which we can try and start to capture that information.
And it's actually mobile technology that's kind of enabled us to do this. So we basically go around the lab capturing images with our mobile phones and uploading them to the system as part of the record. And we have to think quite a lot about sort of the information or the evidence we present back to the people who submit to us
because they're not just down the corridor, they're from miles away. And they send stuff to us by post. So we use software that natively sits on devices like our mobile phones to capture things like the sample as it's sent to us.
Did it arrive smashed up? We can back up the way we've been working. But also was there mother liquor in the pot? How much sample was in there? That kind of thing affects the analysis further downstream. And so we take images and we can add notes on our mobile phones
and then upload it to our system. And we do a whole bunch of that for how we manipulate the bulk sample and the crystal that we select and how many of us report the fact that we've derived a crystal structure from a block when actually much earlier down the process you chisel that off the end of a big rod
and it's not actually representative of the shape of the bulk sample. And that now becomes very important. We try and mine as many sources as we can to look at crystal form and I can't rely on any of them to be honest and I know that because I do it. We've all chiselled a bit off a plate and called it a block or what have you.
So very accurate recording of how you manipulate your sample, what actually ends up on the diffractometer and trial screens of diffraction are all important. And that can all be then uploaded into our management system
and forms a report of this part of the process which is rather more nebulous to try and capture. And we actually just do that through native browsers so we don't worry about writing apps. It all kind of works just through the browser on the phone. What I want to talk a bit more about is unstructured data and in particular lab notebooks.
So we all make records in the lab of what we're up to and actually as a chemical crystallographer I need to integrate in with what people are doing in the synthesis labs. Not a new problem. I'm a bit of a fan of Faraday. You can't read what it says here but he basically did 30,000 experiments. He catalogued them all in lab notebooks
and invented his own way of indexing them and being able to retrieve information about them. And this was also a record of the things that didn't work as much as the things that did work. And so we spent a lot of time building a lab notebook. It's open to all and we've spent the last year or two putting it on a cloud platform
so we can make it available to the UK higher education community. It's a very generic lab notebook unlike the specific ones that are sold to chemists in industry. And so it can be used across disciplines. This is engineering, environmental science. This is somebody noting their MATLAB code.
This is instruments automatically logging into lab notebooks. And this is sort of discussions of stuff that we did on the instrument. So this group actually uses a lab notebook as a way of coordinating their weekly meetings and what we've done in the lab.
Biochemical assays, open source science projects. So this is basically putting your lab notebook up there live as you're doing the experiments. It's funded by the Gates Foundation and everyone around the world is basically trying experiments and putting the results up online as they're doing them.
But we use it to actually support basically the publication of chemistry as a whole. So in these lab notebooks alongside stuff about single-crystal work is the spectroscopy, et cetera. Now, Lewis is looming, so I'm going to whiz through this. We try and impose a bit of structure on all this data
with some ontology work that we've done. I won't go into it, but we try and capture what we plan to do, what we actually do, and then compare the two things. And this is basically one of those plans for a single crystal diffraction experiment. And we can expose that in a machine-readable way through our data repositories.
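The idea of comparing what was planned with what was actually done can be sketched as follows. This is a minimal illustration that treats a plan and a record as ordered lists of step names; the step names are made up, and the real ontology is far richer than plain strings.

```python
def compare_plan(planned_steps, recorded_steps):
    """Report planned steps that were not recorded and recorded steps that were not planned."""
    planned = set(planned_steps)
    recorded = set(recorded_steps)
    missing = [s for s in planned_steps if s not in recorded]
    unplanned = [s for s in recorded_steps if s not in planned]
    return {"missing": missing, "unplanned": unplanned}


# Hypothetical plan and record for a single crystal diffraction experiment.
plan = ["mount crystal", "screen", "collect data", "solve structure"]
actual = ["mount crystal", "screen", "recrystallise", "collect data"]

report = compare_plan(plan, actual)
print(report)
```

Keeping both lists, rather than overwriting the plan with what happened, preserves the record of things that did not go as intended.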
So machines can now come along and look at all the files they've got and work out which ones were generated from each other and walk through the experiment. So now, just briefly, what we're doing currently and the way I see the future: we want to integrate the crystal structures and the chemistry they're doing with structure,
as in 2D structure, of compounds, their properties, et cetera, and be able to sort of do real informatics, mining what's out there. We're working with the Royal Society of Chemistry, ChemSpider, and have basically underpinned their database of hundreds of thousands of chemical compounds
with this same machine-readable format. So we're now working on how we can match up what we have in our crystal structure repositories with what's in compound databases that are online. And this is supporting publication. So this is basically an article that's about to come out
where all the electronic supplementary information for all the chemistry in the article is basically just hooking into our electronic lab notebooks. A bit of evidence of that there. We're registering all this with DOIs. It's really, really very important to go through registration authorities so all our crystal structure stuff has got DOIs for data.
We have a sort of institutional way of doing this and all the lab notebook stuff that we publish, we register with DOIs as well, so it's all findable. And in collaboration with Rigaku, we're looking at sample information management going beyond what we do in the diffraction lab and pulling all this together
where CIF is sort of at the core of it all, but we've sort of extended around the CIF and we're doing similar stuff with electronic lab notebook records as well. Sorry for running over time. This mess is my attempt at thanking all the people that have been involved over the years.
It's sort of a way of... I'm trying to do wordles, but with logos. It hasn't quite worked. Thank you. No, scale is not a factor here. The JISC one would be huge. Thank you.