CCDC metadata initiatives
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 22 | |
Author | ||
Contributors | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/46322 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
| |
Keywords |
Transcript: English(auto-generated)
00:00
So thank you to Brian and John for inviting me to talk today, and thank you all for coming so early after a late dinner last night. So today I'm going to reflect back on the metadata associated with entries in the CSD, and some of this will touch on some of the points that Simon made in the previous talk. I'm going to look at some of the recent
00:21
metadata initiatives, some of the challenges we have faced and some of the challenges we still face. So I think as Herb said yesterday, not all metadata comes from a single source. So if we think about a single crystal structure, then we have metadata associated with the crystal itself. So we have data about the synthesis of the crystal, the crystallization
00:44
conditions of the crystal, maybe some physical properties of the crystal, like melting points or solubility. Then there's data about the diffraction experiment itself, so the instrumentation used, the conditions, et cetera. And then we have the raw data, the processed
01:01
data and the derived data, and in some cases, an associated publication. And we use some of this data to create a CSD entry. And I think what was said a couple of times yesterday is what we're really aiming to do is go from data to information to knowledge to wisdom. So historically at the CCDC, our emphasis has been on the deposition
01:26
of derived data. So typically, what we collect or what we receive from depositors is the electronic CIF file, and this contains atomic coordinates. And we use those coordinates
01:41
to create an entry in the CSD that has chemical connectivity, a 2D diagram and more chemical information such as compound name. And over the 50 years that we've been creating the CSD, the number of crystal structures or small molecule crystal structures published has continued to rise. So we've now got over 790,000 crystal structures in the CSD.
02:04
And just as with the PDB, as the number of structures in the database rises, the complexity and size of the structures also continues to rise. So we really had to evolve the ways we deal with data and metadata at CCDC. So I'm going to take a step back and reflect
02:22
on the types of metadata associated with a CSD entry. And so my simplistic view of metadata is that it's a set of data that describes and gives information about other data. And the data that I'm going to focus on this morning is data that describes the substance studied, and this is important for discovery, analysis and mining of the data. Data that
02:43
describes the data set as a whole, and this is important for provenance and attribution of the data. And then data that describes the experiment, so things like where it was done, how it was done, who did it, and who funded it. And when I go through those categories, I'm going to look at some of the recent initiatives that we've had
03:02
at CCDC, some of the changes we've made, and some of the challenges we still face. So as you all know, back in the 1990s, the crystallographic information framework was launched under the guidance of the IUCr, and this was widely and extensively adopted by both the crystallographic community and the publishing community. And this allows people
03:26
to store derived results, raw and processed data, experimental conditions and publication data. And we can see that it's really revolutionized how small molecule crystal structures are published. So pre-CIF, when a crystal structure was published, the publication
03:42
would include hand-typed tables of coordinates, and we used to have to use those hand-typed coordinates to create an entry in the CSD. Thankfully, now things are different. So you can see here the rise of crystal structures, and then the rise of the deposits that are electronic CIFs, and so virtually all of the new depositions are now electronic CIFs.
04:05
Because we have a standard way of people depositing data, that's allowed us to create an interactive web deposition process. And following the IUCr's lead, we're strongly encouraging the deposition of structure factors alongside the CIF, and we're working with the
04:23
IUCr, with the community and with publishers to make this mandatory in the future. We can also check the data that's being deposited. So we've got syntax checking during the deposition process, and just last month, we've been collaborating with the IUCr to integrate the IUCr's checkCIF service into the process as well. So depositors
04:46
can check the integrity of their data, and I think this is really important, both for the depositor, for the crystallographic community, and for the publishers. During the deposition process, we also highlight some key metadata that is used to create
05:01
a CSD entry to the depositor. So metadata is pulled out and extracted from the CIF, items like compound name, colour, habit, and melting points. And the depositor is asked to review their data and enhance their data. But I think we all know that sometimes allowing free text and allowing people to enter data manually is not a good
05:25
thing, and is not always the right thing to do. And I think it's important to try and capture metadata in as semantic or defined a way as possible. So if we look at things like colours reported in the CSD, we see colours such as wine-coloured
05:41
crystals, but there's no information about what coloured wine people are actually talking about. And if we look at morphologies, we often get different descriptions of the same shape. So we've got penguin-shaped crystals, we've got pear-shaped crystals, we've got lozenge-shaped crystals. So perhaps there should be some more defined
06:02
dictionary of what should be allowed in certain metadata fields. Sometimes it's hard to know what the metadata is actually referring to as well. So if we look at morphology again, sometimes we don't know whether the crystallographer is reporting the morphology of the growing crystal, or the crystal as it was cut and put
06:21
on the diffractometer. I think manually entering information also means that sometimes the depositor doesn't always get things right. So when we look at CIFs deposited at CCDC, in some cases we see that the melting point is actually lower than the study temperature of the diffraction experiment.
06:42
And when we looked at this further, we saw that it was predominantly due to people putting in the melting point of their crystal in Celsius and not Kelvin, so the wrong units. So I think it's really important, and I think this echoes some of what Simon said, to try and capture the metadata directly from the equipment in as standard a way as possible.
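The Celsius and Kelvin mix-up described above lends itself to an automated check. The following is a minimal sketch, not CCDC's actual validation: the function name, the argument names and the heuristic are assumptions made for illustration. It flags a melting point that lies below the measurement temperature and tests whether reading the value as Celsius would resolve the conflict.

```python
from typing import Optional

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin


def check_melting_point(melting_point_k: float, measurement_temp_k: float) -> Optional[str]:
    """Return a warning if the melting point looks implausible, else None.

    Both arguments are expected in kelvin. The names and the heuristic are
    hypothetical; they are not taken from any CCDC tool.
    """
    if melting_point_k >= measurement_temp_k:
        return None  # measured at or below the melting point: nothing to flag

    # A crystal should not be measured above its melting point, so the
    # reported value is suspect. Test the "entered in Celsius" explanation.
    as_if_celsius = melting_point_k + KELVIN_OFFSET
    message = (
        f"melting point {melting_point_k} K is below the measurement "
        f"temperature {measurement_temp_k} K"
    )
    if as_if_celsius >= measurement_temp_k:
        message += f"; was it entered in Celsius ({as_if_celsius:.2f} K)?"
    return message
```

A check of this kind could only prompt the depositor to review the value; it cannot decide on its own which unit was intended.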
07:01
Maybe have some defined dictionaries to use for certain metadata, and to increase the validation of the metadata that can be done. So maybe validating things like showing what the probable colours are or improbable colours are for a crystal based on the chemistry of the structure
07:20
and highlighting the most improbable melting points. So to deal with all of the deposited data and the metadata and the increasing volume of data that we receive at CCDC, we launched a new infrastructure in 2013 called CSD-Xpedite.
07:41
And this tries to automate as much of the process as we can while keeping some of the manual parts of the process where our scientific input really adds value. So the boxes in green here are the automated parts of the process, and you can see it's quite a linear workflow. This new system is based on Microsoft Dynamics CRM
08:01
and it's used to manage all of the transactions and interactions that go on behind the scenes. So it's not just the CSD entries that we have to deal with. So in the new system there are a number of different entities or record types, and each entity has its own metadata associated with it. So we've got an entity for things like the deposit,
08:22
the CSD entry, publication, journal, publishers, and there really is quite a lot of information to manage. One of the key things we do when creating the CSD is to add chemistry to the coordinates. And we do this using an automated process in the first instance and software that was developed at CCDC called DeCIFer.
08:45
And this uses a probabilistic approach. So it looks at all of the 750,000 structures in the CSD and uses Bayes' theorem to work out what the most probable chemistry and chemical connectivity is from the coordinates. The chemistry and chemical connectivity then gets assigned a reliability score
09:04
so we know how good the chemical assignment might be. And then each entry is checked manually by our editors. And I think the assignment of chemistry is really important to allow users to discover data, do data mining, analysis, and allow us to do interoperability of the structures.
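To make the probabilistic idea concrete, here is a toy sketch of Bayes' theorem applied to a single bond. It is not the algorithm used at CCDC: the candidate bond types, the prior frequencies, the Gaussian likelihood and every numeric value are invented for illustration, and the posterior of the winning candidate merely stands in for a reliability score.

```python
import math

# Hypothetical statistics for a carbon-carbon bond: (prior frequency,
# mean length in angstroms, standard deviation). The numbers are
# illustrative placeholders, not values mined from the CSD.
CANDIDATES = {
    "single": (0.6, 1.53, 0.03),
    "double": (0.3, 1.33, 0.03),
    "triple": (0.1, 1.20, 0.03),
}


def gaussian(x: float, mean: float, sd: float) -> float:
    """Probability density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior(bond_length: float) -> dict:
    """Bayes' theorem: P(type | length) is proportional to P(length | type) * P(type)."""
    unnormalised = {
        bond_type: prior * gaussian(bond_length, mean, sd)
        for bond_type, (prior, mean, sd) in CANDIDATES.items()
    }
    total = sum(unnormalised.values())
    if total == 0.0:
        # Every likelihood underflowed: the length fits no candidate at all.
        raise ValueError(f"bond length {bond_length} matches no candidate type")
    return {bond_type: value / total for bond_type, value in unnormalised.items()}


def assign(bond_length: float) -> tuple:
    """Return the most probable bond type and its posterior as a score."""
    probabilities = posterior(bond_length)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]
```

In this toy model a bond length close to one of the means gives a score near 1, while a length midway between two means gives a noticeably lower score, which is the kind of assignment a manual check would want to look at.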
09:24
So as well as data about the crystallographic information, like the cell parameters, assigning the chemical connectivity allows us to add things like compound names, chemical diagrams, and this is really useful when people are doing searches and analysis of the data.
09:45
So let's go on to look at the data that describes the data set as a whole. And I think this is important for provenance and attribution of the data. So this includes publication data, so when the data was published, who published it, and where it was published. And this could be scientific literature,
10:03
or it could be private communication in the CSD, like Simon mentioned. And then we could also think a bit further and think about authorship data. So instead of just who published the corresponding publication,
10:22
who created the sample, who did the data collection, and who performed the data analysis. And we have quite a lot of complicated workflows in place to allow us to associate CSD data with publication data. And we have workflows in place with all the major publishers that let us know
10:41
when data is included in the manuscript, when it goes to just accepted or ASAP stage, and when it's then fully published. We also have interactions with third party services such as CrossRef, so we can find automated ways of, from some of the publication information,
11:00
finding out what the DOI of the publication is, or vice versa. And all of these complicated workflows and interactions allow us to make the data available to the right people at the right time. So the data is available, both the deposited data and the most probable chemistry and chemical assignments at the point of refereeing.
11:22
So referees and publishers during the peer review process can obtain data from the CCDC. And then the data is available immediately when a structure is published in a publication and links are in place from the publications to the data that's stored at CCDC for publishers such as the ACS, RSC,
11:45
Elsevier, Wiley and the IUCr. When we think about publications, we've also been thinking about data and data publications and we've started assigning data set DOIs to CSD entries. We started last year and we've assigned over 500,000 DOIs
12:04
and DOIs are now assigned within an hour of publication. And I think this really forms a foundation for formalising data citation and interoperability. So if we think about a typical publication, we have a list of authors. And the list of authors might not necessarily include the crystallographer,
12:24
so they don't always get credit for their contribution. The DOI associated with an article doesn't identify the data, it identifies the paper. And sometimes if the CIF file is uploaded with the publication, it can be buried away in supporting information.
12:41
So I think assigning DOIs helps pave the way. And what we've also done is on our Get Structures service, so where people come and get data from us, we've got a CCDC citation box. And at the moment this mirrors publication information, but in the future we could see this becoming more of the details
13:03
about who created the data itself. When people download data from CCDC, we also add the key metadata into the downloaded files. So we add things like the CSD DOI so people can get access to more information,
13:20
the publication information, and when it was deposited and downloaded. Assigning DOIs means that we can link to other services as well and facilitate interoperability. So in minting a DOI with DataCite, we have to provide DataCite with some key metadata. And people like Thomson Reuters and the Data Citation Index
13:41
can then pick up this metadata and create CSD entries in other services such as the Data Citation Index. So now the Data Citation Index also covers crystal structures. As Simon touched on earlier, we're also working with a number of other projects,
14:05
and Ian Bruno in particular has been working heavily with Jisc and the RDA. And you can see here one of the recent outputs from the RDA, which is the Data Literature Interlinking Service, which ingests data article links. And you can go on to their beta servers and here are some of the graphics and information that you can find out.
14:24
So about data sets and publications and how they're linked together. Because we assign chemical connectivity to structures in the CSD, we can also link to other resources and other metadata. And we can do this based on the chemistry.
14:42
So we've been working hard to assign reliable InChIs to a subset of our structures in the CSD. We've added links to CSD entries to more general repositories such as ChemSpider. So you can see there's a link and the user can also visualise the crystal structure.
15:01
So they can see crystal structure data alongside things like NMR data, IR data and wider context data. And in the next day or two we'll also be adding links to PubChem and that should go live very soon. As well as more general chemistry, we can also link between crystallographic databases.
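Linking on chemistry can be sketched as a join on a structure identifier such as the standard InChIKey. The example below is a minimal illustration with assumed record layouts: the identifiers `ENTRY01`, `ENTRY02`, `EXT-42` and `EXT-7` are hypothetical, and this is not a description of how CCDC, ChemSpider or PubChem implement their links. The InChIKeys used are those of benzene, ethanol and water.

```python
from collections import defaultdict


def link_by_inchikey(structure_entries: dict, external_records: dict) -> dict:
    """Map each structure entry to the external records that share its InChIKey.

    Both arguments map a record identifier to a standard InChIKey string.
    Entries without a matching external record are left out of the result.
    """
    records_by_key = defaultdict(list)
    for record_id, inchikey in external_records.items():
        records_by_key[inchikey].append(record_id)

    return {
        entry_id: sorted(records_by_key[inchikey])
        for entry_id, inchikey in structure_entries.items()
        if inchikey in records_by_key
    }


# Hypothetical identifiers; the InChIKeys are benzene, ethanol and water.
structure_entries = {
    "ENTRY01": "UHOVQNZJYSORNB-UHFFFAOYSA-N",  # benzene
    "ENTRY02": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
}
external_records = {
    "EXT-42": "UHOVQNZJYSORNB-UHFFFAOYSA-N",  # benzene
    "EXT-7": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",   # water
}
links = link_by_inchikey(structure_entries, external_records)
```

Here only the benzene entry finds a partner, so `links` contains a single key. A join like this is only as good as the identifiers fed into it, which is why the reliability of the assigned InChIs matters.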
15:24
So we've been working with the PDB to match PDB ligands to best representative CSD molecules. And of the 20,000 chemical components in the PDB chemical component dictionary, there are about 1,500 exact matches for structures in the CSD
15:44
and PDB users can now link between the two. So our second call-out for Brian's talk next, and I think he's going to mention ICAT in that talk, is an investigation that we did last year
16:00
about trying to link between the raw data stored at STFC and the model structure stored at CCDC. So we looked at publication DOIs in the two systems to see if we could match the data, and we also looked at some of the metadata contained in the CIF. I think we could find some matches but I think we need to have a more systematic approach
16:20
to match raw data and model structures in the future. So when we talk about publications and DOIs, assigning DOIs also means that researchers can add data to their researcher IDs such as ORCID alongside their publications. So they're getting more attribution and credit
16:43
for their data as well as their publications. We might also want to think about how we could add data about funders, about grant numbers and associate that with data and publications and then about institutions, so working with people like ISNI and Ringgold to associate data with institutions as well.
17:04
So some of the immediate things that we're working on in the area of metadata is the creation of a new CSD deposition portal to allow depositors to log on, see the data, edit the data and enhance the data ready to create a CSD entry and extend some of the integrity checks
17:23
involved in that deposition process. We're also looking at extending the linking between different datasets and extending the programmatic access to data. So earlier this year we launched our CSD Python API and this gives users of the CSD programmatic access
17:42
to data in the CSD entry and we're looking to extend that to more data and more of the deposited data making more of the deposited data available in a programmatic way. We're also looking to extend the data available to the wider community. So we've got a free service on our website
18:01
where people can access crystal structure data and we're extending the search functionality including the addition of InChIs into this and we're trying to make the data more accessible to a broader audience so present it in a way that non-crystallographers could understand.
18:21
So that was a brief roundup of some of the things that we've been involved with. I'd like to thank you for your time and if there's any questions please go ahead.
18:42
Thank you very much Susannah. So did I understand that one of the requirements from CCDC is that COMCIFS should add the specification of penguin-shaped crystal to its dictionary? Definitely. We can't take that out.
19:00
Hi, it's Mike Probert of Newcastle. Given the ambiguity over some crystal shapes and colours and things is there not an argument for actually storing images of crystals alongside the deposited data? So that leads on to a second thing which possibly falls
19:22
for the next two speakers as well. Can you envisage a situation where the CCDC is acting as a repository for raw diffraction data having already stated that you're trying to move to a situation where it's a requirement for structure factors to be deposited
19:40
or would that cause a data avalanche that the company couldn't deal with and the reason it could be more pertinent to the next two speakers is that possibly the amount of data that's collected in home sources compared to the amount of data that's collected per day at central facilities would mean that this could be something that's handled
20:00
by the central facilities only. I think we'd have to think very carefully if we started handling raw data and how it would reflect or how it would change our sustainability model. So at the moment we've got processes in place where we can deal with the amount of data and we've kind of future-proofed it as much as we can but as soon as we start talking about
20:20
a lot of data then I think that would change and it would change how we would need to charge for some of our services. So I think we're very keen to support initiatives like this and try and link to as much data as we can and have linking between data. Just to follow up on that I think the PDB solution
20:41
of allowing for an entry for DOIs for the raw data is the way to go rather than imagining that you would host raw data sets. I think particularly for the chemical crystallography where my experience in Manchester was that probably 97%
21:01
of the data was measured locally and maybe 3% of the projects went off to the central facilities. The university repository is then the outlet for raw data sets and the DOIs. The whole process is simpler for you anyway
21:21
to make allowance for the DOI. Which is something that we will definitely do. Thank you. Just to follow up on that point the DDDWG has been interested in the possibility of individual repositories that collect all the diffraction data
21:41
and there's an economic case to be considered. People very often say these days storage is cheap but of course it's not so much the bulk storage that is costly but the organization that allows you to make good use of that and characterizing the metadata framework that would make
22:02
the costs come down to those of the bulk storage. It's part of the exercise that we're engaged in. Two or three years ago when I looked in the differences between Cambridge database
22:20
and Acta E, I believe Acta E collected experimental details, and I remember that Colin told me at some point that you are planning to have some experimental details. I think we're planning to extend the metadata that's available
22:41
through this testing but we need to make sure that's captured in the right way. Next question. How many structures which you have were never published? We've got an initiative on that at the moment and we've looked in our repository and we've got about 120,000
23:03
crystal structures in the repository that have never been published so we're going back and contacting the depositors and putting automated workflows in place to allow us to publish those. We've also changed the embargo period on our deposition process to try and get that feedback loop
23:22
shorter so people haven't moved on by the time we contact them. You see, sometimes I wonder how many structures are not going to Cambridge database. They are not going anywhere. Because
23:40
I know people who are doing 600 structures a year. So I worked out for a recent talk that it could be about 15-20% of structures that actually get deposited at CCDC, whether that's true or not. I'm just being optimistic, I don't know, but I think
24:00
we need to make sure that we're part of the crystallographer's workflow. So there's no barrier for deposition and it's just part of the process. Coming to you back to the lens Florence. Considering that you can
24:21
using the same raw data, have different structural interpretations, for example through misinterpreted symmetry, missed twinning or missed satellite reflections, can you consider a clear descriptor
24:41
of the same raw data for different entries in the CCDC based on the same data collection? At the moment we allow reinterpretations of the same data and we allow
25:00
redeterminations of the same structure and we link between those and we have what's called Refcode families between the structures but I do think we need to go further. Thank you very much indeed Susannah.