
CCDC metadata initiatives


Formal Metadata

Title
CCDC metadata initiatives
Title of Series
Number of Parts
22
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
For half a century the Cambridge Crystallographic Data Centre (CCDC) has produced the Cambridge Structural Database (CSD) to allow scientists worldwide to share, search and reuse small molecule crystal structure data. An entry in the CSD is often seen as 'just' a set of coordinates, but the associated metadata (data that describes and gives information about other data), is essential to contextualise an entry. Data that describe the substance studied, the experiment performed and the dataset as a whole are all vital. This presentation, timed to coincide with the 50th anniversary of the CSD, will look at how metadata is used from deposition to dissemination of the CSD. We will look at how recent developments surrounding metadata have been targeted to improve the discoverability, validation and reuse of crystal structure data before looking to see what the future may hold.
Transcript: English (auto-generated)
So thank you to Brian and John for inviting me to talk today, and thank you all for coming so early after a late dinner last night. So today I'm going to reflect on the metadata associated with entries in the CSD, and some of this will touch on points that Simon made in the previous talk. I'm going to look at some of the recent
metadata initiatives, some of the challenges we have faced and some of the challenges we still face. So I think as Herb said yesterday, not all metadata comes from a single source. So if we think about a single crystal structure, then we have metadata associated with the crystal itself. So we have data about the synthesis of the crystal, the crystallization
conditions of the crystal, maybe some physical properties of the crystal, like melting points or solubility. Then there's data about the diffraction experiment itself, so the instrumentation used, the conditions, et cetera. And then we have the raw data, the processed
data and the derived data, and in some cases, an associated publication. And we use some of this data to create a CSD entry. And I think what was said a couple of times yesterday is what we're really aiming to do is go from data to information to knowledge to wisdom. So historically at the CCDC, our emphasis has been on the deposition
of derived data. So typically, what we collect or what we receive from depositors is the electronic CIF file, and this contains atomic coordinates. And we use those coordinates
to create an entry in the CSD that has chemical connectivity, a 2D diagram and more chemical information such as compound name. And over the 50 years that we've been creating the CSD, the number of small molecule crystal structures published has continued to rise. So we've now got over 790,000 crystal structures in the CSD.
And just as with the PDB, as the number of structures in the database rises, the complexity and size of the structures also continues to rise. So we've really had to evolve the ways we deal with data and metadata at CCDC. So I'm going to take a step back and reflect
on the types of metadata associated with a CSD entry. And so my simplistic view of metadata is that it's a set of data that describes and gives information about other data. And the data that I'm going to focus on this morning is data that describes the substance studied, and this is important for discovery, analysis and mining of the data. Data that
describes the data set as a whole, and this is important for provenance and attribution of the data. And then data that describes the experiment, so things like where it was done, how it was done, who did it, and who funded it. And when I go through those categories, I'm going to look at some of the recent initiatives that we've had
at CCDC, some of the changes we've made, and some of the challenges we still face. So as you all know, back in the 1990s, the Crystallographic Information Framework (CIF) was launched under the guidance of the IUCr, and this was widely and extensively adopted by both the crystallographic community and the publishing community. And this allows people
to store derived results, raw and processed data, experimental conditions and publication data. And we can see that it's really revolutionized how small molecule crystal structures are published. So pre-CIF, when a crystal structure was published, the publication
would include hand-typed tables of coordinates, and we used to have to use those hand-typed coordinates to create an entry in the CSD. Thankfully, now things are different. So you can see here the rise of crystal structures, and then the rise of the deposits that are electronic CIFs, and so virtually all of the new depositions are now electronic CIFs.
Because we have a standard way of people depositing data, that's allowed us to create an interactive web deposition process. And following the IUCr's lead, we're strongly encouraging the deposition of structure factors alongside the CIF, and we're working with the
IUCr, with the community and with publishers to make this mandatory in the future. We can also check the data that's being deposited. So we've got syntax checking during the deposition process, and just last month, we've been collaborating with the IUCr to integrate the IUCr's checkCIF service into the process as well. So depositors
can check the integrity of their data, and I think this is really important, both for the depositor, for the crystallographic community, and for the publishers. During the deposition process, we also highlight some key metadata that is used to create
a CSD entry to the depositor. So metadata is extracted from the CIF, things like compound name, colour, habit and melting point. And the depositor is asked to review and enhance their data. But I think we all know that sometimes allowing free text and allowing people to enter data manually is not a good
thing, and is not always the right thing to do. And I think it's important to try and capture metadata in as semantic or defined a way as possible. So if we look at things like colours reported in the CSD, we see colours such as wine-coloured
crystals, but there's no information about what coloured wine people are actually talking about. And if we look at morphologies, we often get different descriptions of the same shape. So we've got penguin-shaped crystals, we've got pear-shaped crystals, we've got lozenge-shaped crystals. So perhaps there should be some more defined
dictionary of what should be allowed in certain metadata fields. Sometimes it's hard to know what the metadata is actually referring to as well. So if we look at morphology again, sometimes we don't know whether the crystallographer is reporting the morphology of the growing crystal, or the crystal as it was cut and put
on the diffractometer. I think manually entering information also means that the depositor doesn't always get things right. So when we look at CIFs deposited at CCDC, in some cases we see that the melting point is actually lower than the study temperature of the diffraction experiment.
And when we looked at this further, we saw that it was predominantly due to people putting in the melting point of their crystal in Celsius and not Kelvin, so the wrong units. So I think it's really important, and I think this echoes some of what Simon said, to try and capture the metadata directly from the equipment in as standard a way as possible.
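The melting-point check described here can be sketched in a few lines of Python. This is only an illustration and not CCDC's validation code: the function name `flag_suspect_melting_point` is hypothetical, and the sketch assumes both values are meant to be in kelvin, as the CIF core items `_chemical_melting_point` and `_diffrn_ambient_temperature` expect.

```python
from typing import Optional

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in kelvin


def flag_suspect_melting_point(melting_point_k: float,
                               study_temperature_k: float) -> Optional[str]:
    """Return a warning if the reported melting point is not above the
    temperature at which the diffraction experiment was run.

    Both arguments are assumed to be in kelvin. Returns None when the
    values look physically plausible.
    """
    if melting_point_k > study_temperature_k:
        return None  # crystal was solid during the experiment: plausible

    # Common mistake: the melting point was typed in Celsius.
    as_if_celsius = melting_point_k + KELVIN_OFFSET
    if as_if_celsius > study_temperature_k:
        return (f"melting point {melting_point_k} K is not above the study "
                f"temperature {study_temperature_k} K; if it was entered in "
                f"Celsius it would be {as_if_celsius:.2f} K")
    return (f"melting point {melting_point_k} K is not above the study "
            f"temperature {study_temperature_k} K")


# Hypothetical deposited values: melting point 120, data collected at 150 K.
print(flag_suspect_melting_point(120.0, 150.0))
```

A check like this can only flag a value for the depositor to review; it cannot decide on its own which unit was intended.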
Maybe have some defined dictionaries to use for certain metadata, and increase the validation of the metadata that can be done. So maybe validating things like showing what the probable or improbable colours are for a crystal based on the chemistry of the structure
and highlighting the most improbable melting points. So to deal with all of the deposited data and the metadata and the increasing volume of data that we receive at CCDC, we launched a new infrastructure in 2013 called CSD Xpedite.
And this tries to automate as much of the process as we can while keeping some of the manual parts of the process where our scientific input really adds value. So the boxes in green here are the automated parts of the process, and you can see it's quite a linear workflow. This new system is based on Microsoft Dynamics CRM
and it's used to manage all of the transactions and interactions that go on behind the scenes. So it's not just the CSD entries that we have to deal with. So in the new system there are a number of different entities or record types, and each entity has its own metadata associated with it. So we've got an entity for things like the deposit,
the CSD entry, publication, journal, publishers, and there really is quite a lot of information to manage. One of the key things we do when creating the CSD is to add chemistry to the coordinates. And we do this using an automated process in the first instance and software that was developed at CCDC called DeCIFer.
And this uses a probabilistic approach. So it looks at all of the 750,000 structures in the CSD and uses Bayes' theorem to work out what the most probable chemistry and chemical connectivity is from the coordinates. The chemistry and chemical connectivity then gets assigned a reliability score
so we know how good the chemical assignment might be. And then each entry is checked manually by our editors. And I think the assignment of chemistry is really important to allow users to discover data, do data mining and analysis, and to support interoperability of the structures.
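To make the probabilistic idea concrete, here is a toy sketch of Bayes' theorem used to choose between candidate bond types for an observed geometry. It is not the DeCIFer algorithm, whose details are not given in the talk; the function name `posterior`, the two hypotheses and all the numbers are invented for illustration.

```python
from typing import Dict


def posterior(priors: Dict[str, float],
              likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Apply Bayes' theorem: P(h | obs) = P(h) * P(obs | h) / P(obs).

    priors      -- P(h), e.g. how often each bond type occurs in known structures
    likelihoods -- P(obs | h), how well the observed geometry fits each type
    """
    evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(obs)
    if evidence == 0:
        raise ValueError("observation is impossible under every hypothesis")
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Invented numbers: single bonds are more common overall, but the observed
# bond length fits a double bond much better.
priors = {"single": 0.7, "double": 0.3}
likelihoods = {"single": 0.1, "double": 0.8}

scores = posterior(priors, likelihoods)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The posterior of the winning hypothesis plays the role of a reliability score in this sketch: a value close to 1 suggests a confident assignment, while a value near 0.5 would be a natural candidate for manual review.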
So as well as data about the crystallographic information, like the cell parameters, assigning the chemical connectivity allows us to add things like compound names, chemical diagrams, and this is really useful when people are doing searches and analysis of the data.
So let's go on to look at the data that describes the data set as a whole. And I think this is important for provenance and attribution of the data. So this includes publication data, so when the data was published, who published it, and where it was published. And this could be scientific literature,
or it could be private communication in the CSD, like Simon mentioned. And then we could also think a bit further and think about authorship data. So instead of just who published the corresponding publication,
who created the sample, who did the data collection, and who performed the data analysis. And we have quite a lot of complicated workflows in place to allow us to associate CSD data with publication data. And we have workflows in place with all the major publishers that let us know
when data is included in the manuscript, when it goes to just accepted or ASAP stage, and when it's then fully published. We also have interactions with third party services such as Crossref, so we can find automated ways of, from some of the publication information,
finding out what the DOI of the publication is, or vice versa. And all of these complicated workflows and interactions allow us to make the data available to the right people at the right time. So the data is available, both the deposited data and the most probable chemistry and chemical assignments at the point of refereeing.
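As a rough illustration of the lookup in both directions, the sketch below builds query URLs for the public Crossref REST API: a bibliographic search to find a DOI from citation text, and a direct lookup to get publication metadata from a DOI. The helper names and example inputs are hypothetical, no request is actually sent, and this is not a description of CCDC's own workflow.

```python
from urllib.parse import quote, urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def citation_query_url(citation_text: str, rows: int = 1) -> str:
    """URL for a free-text bibliographic search (citation -> candidate DOIs)."""
    params = {"query.bibliographic": citation_text, "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def doi_lookup_url(doi: str) -> str:
    """URL for fetching the metadata record of a known DOI (DOI -> citation)."""
    return f"{CROSSREF_WORKS}/{quote(doi)}"


# Placeholder inputs, not real publications.
print(citation_query_url("Smith 2015 Acta Cryst"))
print(doi_lookup_url("10.1000/example"))
```

The JSON returned by the search endpoint is ranked by relevance, so a real workflow would still need to confirm that the top hit matches the journal, year and authors before recording the DOI.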
So referees and publishers during the peer review process can obtain data from the CCDC. And then the data is available immediately when a structure is published in a publication and links are in place from the publications to the data that's stored at CCDC for publishers such as the ACS, RSC,
Elsevier, Wiley and the IUCr. When we think about publications, we've also been thinking about data and data publications and we've started assigning dataset DOIs to CSD entries. We started last year and we've assigned over 500,000 DOIs
and DOIs are now assigned within an hour of publication. And I think this really forms a foundation for formalising data citation and interoperability. So if we think about a typical publication, we have a list of authors. And the list of authors might not necessarily include the crystallographer,
so they don't always get credit for their contribution. The DOI associated with an article doesn't identify the data; it identifies the paper. And sometimes if the CIF file is uploaded with the publication, it can be buried away in supporting information.
So I think assigning DOIs helps pave the way. And what we've also done is on our Get Structures service, so where people come and get data from us, we've got a CCDC citation box. And at the moment this mirrors publication information, but in the future we could see this becoming more of the details
about who created the data itself. When people download data from CCDC, we also add the key metadata into the downloaded files. So we add things like the CSD DOI so people can get access to more information,
the publication information, and when it was deposited and downloaded. Assigning DOIs means that we can link to other services as well and facilitate interoperability. So in minting a DOI with DataCite, we have to provide DataCite with some key metadata. And people like Thomson Reuters and the Data Citation Index
can then pick up this metadata and create records for CSD entries in other services such as the Data Citation Index. So now the Data Citation Index also covers crystal structures. As Simon touched on earlier, we're also working with a number of other projects,
and Ian Bruno in particular has been working heavily with Jisc and the RDA. And you can see here one of the recent outputs from the RDA, which is the Data Literature Interlinking Service, which ingests data-article links. And you can go on to their beta server and here are some of the graphics and information that you can find out.
So about data sets and publications and how they're linked together. Because we assign chemical connectivity to structures in the CSD, we can also link to other resources and other metadata. And we can do this based on the chemistry.
So we've been working hard to assign reliable InChIs to a subset of our structures in the CSD. We've added links to CSD entries in more general repositories such as ChemSpider. So you can see there's a link and the user can also visualise the crystal structure.
So they can see crystal structure data alongside things like NMR data, IR data and wider context data. And in the next day or two we'll also be adding links to PubChem and that should go live very soon. As well as more general chemistry, we can also link between crystallographic databases.
So we've been working with the PDB to match PDB ligands to best representative CSD molecules. And of the 20,000 chemical components in the PDB Chemical Component Dictionary, there are about 1,500 exact matches for structures in the CSD
and PDB users can now link between the two. So our second call-out for Brian's talk next, and I think he's going to mention ICAT in that talk, is an investigation that we did last year
about trying to link between the raw data stored at STFC and the model structures stored at CCDC. So we looked at publication DOIs in the two systems to see if we could match the data and we also looked at some of the metadata contained in the CIF. I think we could find some matches but I think we need to have a more systematic approach
to match raw data and model structures in the future. So when we talk about publications and DOIs, assigning DOIs also means that researchers can add data to their researcher IDs such as ORCID alongside their publications. So they're getting more attribution and credit
for their data as well as their publications. We might also want to think about how we could add data about funders, about grant numbers and associate that with data and publications and then about institutions, so working with people like ISNI and Ringgold to associate data with institutions as well.
So some of the immediate things that we're working on in the area of metadata is the creation of a new CSD deposition portal to allow depositors to log on, see the data, edit the data and enhance the data ready to create a CSD entry and extend some of the integrity checks
involved in that deposition process. We're also looking at extending the linking between different datasets and extending the programmatic access to data. So earlier this year we launched our CSD Python API and this gives users of the CSD programmatic access
to data in the CSD entry and we're looking to extend that to more data and more of the deposited data making more of the deposited data available in a programmatic way. We're also looking to extend the data available to the wider community. So we've got a free service on our website
where people can access crystal structure data and we're extending the search functionality including the addition of InChIs into this and we're trying to make the data more accessible to a broader audience so present it in a way that non-crystallographers could understand.
So that was a brief roundup of some of the things that we've been involved with. I'd like to thank you for your time and if there are any questions please go ahead.
Thank you very much Susannah. So did I understand that one of the requirements from CCDC is that COMCIFS should add the specification of penguin-shaped crystal to its dictionary? Definitely. We can't take that out.
Hi, it's Mike Probert of Newcastle. Given the ambiguity over some crystal shapes and colours and things is there not an argument for actually storing images of crystals alongside the deposited data? So that leads on to a second thing which possibly falls
for the next two speakers as well. Can you envisage a situation where the CCDC is acting as a repository for raw diffraction data having already stated that you're trying to move to a situation where it's a requirement for structure factors to be deposited
or would that cause a data avalanche that the company couldn't deal with and the reason it could be more pertinent to the next two speakers is that possibly the amount of data that's collected in home sources compared to the amount of data that's collected per day at central facilities would mean that this could be something that's handled
by the central facilities only. I think we'd have to think very carefully if we started handling raw data and how it would reflect or how it would change our sustainability model. So at the moment we've got processes in place where we can deal with the amount of data and we've kind of future-proofed it as much as we can but as soon as we start talking about
a lot of data then I think that would change and it would change how we would need to charge for some of our services. So I think we're very keen to support initiatives like this and try and link to as much data as we can and have linking between data. Just to follow up on that I think the PDB solution
of allowing for an entry for DOIs for the raw data is the way to go rather than imagining that you would host raw data sets. I think particularly for the chemical crystallography where my experience in Manchester was that probably 97%
of the data was measured locally and maybe 3% of the projects went off to the central facilities. The university repository is then the outlet for raw data sets and the DOIs. The whole process is simpler for you anyway
to make allowance for the DOI. Which is something that we will definitely do. Thank you. Just to follow up on that point the DDDWG has been interested in the possibility of individual repositories that collect all the diffraction data
and there's an economic case to be considered. People very often say these days storage is cheap but of course it's not so much the bulk storage that is costly but the organization that allows you to make good use of that and characterizing the metadata framework that would make
the costs come down to those of the bulk storage. It's part of the exercise that we're engaged in. Two or three years ago when I looked into the differences between the Cambridge database
and Acta E, I believe collected experimental details and I remember that Colin told me at some point that you are planning to have some experimental details. I think we're planning to extend the metadata that's available
through this testing but we need to make sure that's captured in the right way. Next question. How many structures which you have were never published? We've got an initiative on that at the moment and we've looked in our repository and we've got about 120,000
crystal structures in the repository that have never been published so we're going back and contacting the depositors and putting automated workflows in place to allow us to publish those. We've also changed the embargo period on our deposition process to try and get that feedback loop
shorter so people haven't moved on by the time we contact them. You see, sometimes I wonder how many structures are not going to Cambridge database. They are not going anywhere. Because
I know people who are doing 600 structures a year. So I worked out for a recent talk that it could be about 15-20% of structures that actually get deposited at CCDC, whether that's true or not. I'm just being optimistic, I don't know, but I think
we need to make sure that we're part of the crystallographer's workflow. So there's no barrier for deposition and it's just part of the process. Coming to you back to the lens Florence. Considering that you can
using the same raw data you can have different structural interpretations, like you can have misinterpreted symmetry or missed twinning or satellite reflections. Can you consider the clear descriptor
of the same raw data of different entries in the CCDC based on the same data collection. At the moment we allow reinterpretations of the same data and we allow
redeterminations of the same structure and we link between those and we have what's called Refcode families between the structures but I do think we need to go further. Thank you very much indeed Susannah.