III. The integrity of published information: Validating a small-unit-cell structure; understanding checkCIF reports
Formal Metadata
Number of Parts | 15 | |
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/46319 (DOI) | |
Transcript: English (auto-generated)
00:14
Good afternoon everybody, and thank you Brian for the invitation to speak this afternoon. Mike observed that I had rather a lot of slides in this talk,
00:23
and I've just reduced them by a third because he has covered some of these already, so I apologize for any duplication. Let's think about history here for a moment. Who remembers what it was like when we didn't have motor cars in the world? I guess nobody, right? Who remembers what it was like when we didn't have CIF?
00:42
A few hands, but if you're younger than about 45 years old, you only ever grew up with CIF. Think what it was like when we had to publish data before we had the CIF facility available. For a start, a lot of papers were submitted on typewritten sheets,
01:01
so you had to transcribe data from one place to the other. Maybe you had a funny little floppy disk at some stage. How did we know that everything had been done correctly? How do we validate things? Back then, one crystal structure in the small molecule world took some days, maybe even months before you published it. We had a lot of time,
01:22
and we had to do a lot of things very manually, and so as you're doing things manually, you tended to check things as you were going. So, validation was sort of part of your life at that time. Now things have gone a lot faster. We have data sets every couple of hours. We have data sets every 10 seconds in some facilities, and it's very hard to find enough time to think about all of these things carefully and observe everything.
01:45
So we need tools to help us validate what we're doing, to avoid making goofs and egg-on-face situations when you're publishing data. As we've heard before, in 1991 came the paper by Sid Hall and others which defined the initial phase of CIF, and
02:06
that was great. I picked that paper up, I read it from front to back, and I really enjoyed it, and very quickly I was submitting papers in CIF format to Acta Cryst. C, whereas I had probably only published one or two structures prior to that. So there was a big spike in my output simply because I didn't
02:22
have to worry about the crystallographic data anymore. I only had to type a few sentences to describe my structure, to describe the science. But the utility of all that wouldn't have happened had Sid not been, as I guess I've heard someone say already, very persuasive. He got people on board with the software in the small molecule world to get CIF
02:42
incorporated, and therefore you could use it. The users could use it, and that enabled the journal to start saying, hey, you can submit things in CIF, and a bit later, hey, you must submit things in CIF. And so it went on, and then Sid Hall had the idea, well, you know, there are some things not really right with these submissions.
03:02
Perhaps they should be checked more thoroughly, and the guys in Chester, the editorial staff, were doing a lot of that manual checking as Mike said. Why not automate that? Put it back and let the authors use some tool to validate their work. It saves them submitting stuff and having to resubmit it and resubmit it. They do all that before they get to submission. And so validation was born. Mike Hoyland developed some
03:25
things in Chester, Ton Spek was brought on board, and between them over the years they have developed a very comprehensive validation tool. And as Brian said earlier, you know, this is not about validation, but this is about CIF, and see what it's become over the years.
03:43
So why do we need checkCIF? Well, when my timer comes through, there we go, we got those CCD detectors in particular, area detectors, around 1994 when Bruker released their SMART system, and so the numbers of structures we were generating went up dramatically, the speed went up dramatically.
04:04
Software developers started to write nice graphical user interfaces, especially the diffractometer companies, so, you know, we were able to click buttons and do pretty things and have pretty pictures and instant refinements, but what that tends to do is it stops people looking at the output files,
04:21
at the LST or the log files, and the information about, hey, hang on a minute, there's something not quite right here, is in those files and not on the screen, and so it's lost, and people tend to miss out on it. So, you know, you need other ways to prompt people. Crystallography became so easy that there were more of what I call non-experts: not the people in this room, who are expert
04:43
crystallographers or doing crystallography as a career, but people who might be chemists or physicists or whatever, using crystallography as a tool. They're doing it quite competently, but they perhaps lack the formal training, so they can easily have oversights and make mistakes. You can also use the validation to set the standards you want, in other words, to encourage best practice.
05:04
Whether you agree with the standards checkCIF has promulgated doesn't really matter; this is where to aim. If you can aim up there, you have a consistent target for everybody. You know, different structures behave differently, you can't always meet that target, but that's the aim, it's a consistent thing.
05:23
Authors don't need to revise as much if they have more flawless submissions, and ultimately this goes through to more rapid publication, which is what everybody likes. Well, I can skip this slide because Mike showed that we're testing against a whole series of different criteria.
05:42
We look for the CIF itself being a consistent file, with no mistakes that kill the syntax of the CIF; at space groups, symmetry, anisotropic displacement parameters; that all of the derived parameters are consistent; at geometric parameters, bond lengths, angles, errors; and even at the reflection data, as Mike talked about,
06:04
the structure factors, and a whole lot more things. And this checkCIF program is really out there to help everybody. It's to help you to check your work, make sure everything's okay, to follow best practice ideals. It's not designed as an "I'm sorry, you can't publish in this journal."
06:22
It's not a brick wall, but a lot of people unfortunately got that impression when it came out. That's not really true, okay? If you can't meet the criteria, that's okay, provided there's a good scientific understanding of the reason why you don't get there. And at the same time, reviewers can use this.
06:42
So, you know, the shoe is on the other foot sometimes: you are reviewing papers, and it's a tool to help you look at what the authors have been doing. There are many sources of outlier parameters: incorrect structures, or some feature like disorder that you haven't treated fully, could lead to validation alerts,
07:03
or you're using some non-optimal procedures that should be done better. You can go through a whole list of these things, or it could be some real observation that's unusual, and hey, if that's the case, then it's a good thing to discuss. You might say, well, you know, we've made everybody aware of the need for validation and so on,
07:20
and do we still need it? Unfortunately, yes, because even in publications in various journals today there are problems, okay? This comes from inexperience or just plain laziness, ignoring some of the less severe alerts. Even though they're not severe, they might still mean something, or just not being able to understand the alerts properly.
07:42
And then there are two problems. Blind reliance on checkCIF: "If there's no alert, there's no problem." Are you sure? Reviewers tend to do the other thing: "Oh, there's an alert. I'm sorry, there must be something wrong." And so people need to learn a little bit more to understand what the alerts are telling them.
08:02
Another issue is the generation of new instruments that are fully automated and the diffractometer companies like to sell this. Hey, you can just drop a crystal in, bang, you get out a pretty picture in the end, you don't need much experience to do it. Are you really sure? On my diffractometer, I have one of these automated structure solution refinement things
08:21
and it works really well. It's a great tool, but for a certain percentage of structures it still gets element assignments mixed up. It's not always possible to distinguish carbon and nitrogen and so on. So you need a little bit of expertise and knowledge to make sure it's doing the job correctly. So Mike showed you how to upload a CIF into the system
08:44
at the checkCIF site. You get feedback which gives you a summary at the beginning and then some of the alerts. And one feature of that is you get some alerts and you don't understand the text. It's very short. If we made a long description for each of those,
09:00
the web page would go on and on and on. So it's kept short, but they are clickable. So if you click on these alerts there, then up should pop a window which gives you a little bit of help to interpret what's going on. And at the bottom you get just an ellipsoid plot which is created out of the CIF and not something that somebody has created by another route.
09:25
Alternatively, you can use PLATON itself. PLATON will actually have a few more tests switched on in it that in Chester we disable. So if you want to really be thorough, you can use PLATON itself. It works the same way. You click the validation button up there
09:47
and it just runs the suite. The output you get back looks much the same as what you get from the online version of checkCIF. So then we get this feedback with alerts in it. There are three types of alert indicators.
10:00
There's this A, B, C, and actually G, which tells you something about the severity of the alert. There's another one here, an alert number, which was introduced a bit later on, which tries to tell you about the type of alert. And this might give you a hint as to what's really necessary here. Type 1 tells you there's a syntax error in the CIF.
10:21
It's usually trivial, should be easy to correct. Type 2 says, hey, there's a hint here that your model might not be quite correct. Type 3 says, it's not really the problem with the model, but the data overall are not of the greatest quality. And the other two are really just information or maybe suggestions that you might be able to do things
10:40
in a slightly different way that might lead to a more optimal result. So if you get an A alert, this is usually quite serious and needs to be attended to or thought about. And it could be trivial. I mean, you miss out the crystal dimensions. That's nothing really major, but it's a requirement for the journal publication,
11:01
so it's a serious alert. Easy to fix. Other things may be not so easy to fix, but you have to decide is this really, in my case, something I can't do anything about. It's a property of the structure. If so, okay, and I'll tell you how to indicate that in a moment. B, well, it's a little bit less serious, but I think you still should have a look at it.
11:21
I'm not going to go through that in detail. C tends to be, well, it's not too bad. It's a little bit of an outlier. Well, if it's there, it's something to be thought about. Moiety formula not given. Well, that's okay; it's easy to correct. There's a short contact here. Well, maybe that's okay.
11:41
Small ADP for an atom. You know, it's only slightly different. So each of those individually might be, well, not too bad, but what happens if we consider those three together? What does that, what could that mean? Well, if you have the pyridinium cation, it probably means that you've switched the position of the nitrogen
12:03
because it's very difficult to work out where this guy, when he's positively charged, should lie. But this short distance here is probably from here, and so that's a hydrogen bond. So if the nitrogen was in the other position, then all these alerts would go away.
12:21
There is this G alert now, and the number of categories that give that alert is actually increasing. It's not necessarily an error; it's a prompt to say, are you sure? Okay, but you should think about those. We've seen that procedure from Mike, so I won't go through that one again.
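As a rough illustration of the two indicators described above (severity level and alert type), here is a minimal Python sketch that splits an alert code into its parts. It assumes codes laid out as `<TEST>_ALERT_<type>_<level>`, for example `PLAT241_ALERT_2_C`; the regular expression, the function names and the wording of the descriptions are assumptions made for this example and paraphrase the talk rather than quote checkCIF.

```python
import re
from typing import NamedTuple

# Severity levels, paraphrased from the talk (not official wording).
LEVELS = {
    "A": "usually serious; fix it or explain it",
    "B": "a little less serious, still worth a look",
    "C": "a bit of an outlier; think about it",
    "G": "general prompt: are you sure?",
}

# Alert types, paraphrased from the talk (not official wording).
TYPES = {
    1: "syntax or construction problem in the CIF",
    2: "the model might not be quite correct",
    3: "the data overall may be of lower quality",
    4: "information or suggestion",
    5: "information or suggestion",
}


class Alert(NamedTuple):
    test: str        # e.g. "PLAT241"
    alert_type: int  # 1..5
    level: str       # "A", "B", "C" or "G"


# Assumed layout of an alert code: <TEST>_ALERT_<type>_<level>
_ALERT_RE = re.compile(r"^([A-Z]+\d+)_ALERT_([1-5])_([ABCG])$")


def parse_alert(code: str) -> Alert:
    """Split an alert code into test name, type number and severity level."""
    match = _ALERT_RE.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognised alert code: {code!r}")
    return Alert(match.group(1), int(match.group(2)), match.group(3))


def describe(code: str) -> str:
    """Return a one-line, human-readable reading of an alert code."""
    alert = parse_alert(code)
    return (
        f"{alert.test}: level {alert.level} ({LEVELS[alert.level]}); "
        f"type {alert.alert_type} ({TYPES[alert.alert_type]})"
    )
```

Calling `describe("PLAT241_ALERT_2_C")` would report a level C alert of type 2, that is, a mild outlier hinting at a possible model problem, which is how the combination of several small alerts in the pyridinium example should be read.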
12:43
The thing is you get a list of validation alerts. It's important to try and address each one, look at each one and think about it for a few seconds to decide is it important, is it something I can easily deal with. If it is, deal with it. The criteria are based on what we'd say is normally expected. There are many cases where
13:01
the normal situation won't apply, but it's up to you to understand that. And the benefits are that you can get things through the publication process much more quickly. So if you've still got trouble with A and B alerts and you know you can't resolve that, then all you have to do is say, okay, I know what's wrong there,
13:21
and you put in a short explanation, and it doesn't have to be a page. Just explain why this is a true feature. Now maybe that should be in the experimental section of your paper. We're supposed to be documenting our scientific observations and results and procedures. It doesn't have to be an excuse filled into the CIF, but it should be put in the appropriate place,
13:40
and some of the appropriate places, if it's not in your paper, would be in such sections of the CIF. So if you get this level A, then what you get back from the system is a validation reply form which you just plug into your CIF, and on the response line here you can give an answer.
14:02
When you've done that, if you rerun checkCIF, the answer appears in the checkCIF output. Now the handiness of that is you can see you've done something, and the second thing is a reviewer can see that you've done something. So there it is for everybody to see.
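To make the validation reply form concrete, the following Python sketch builds one as text that could be pasted into a CIF. It assumes the form is a data item named `_vrf_<TEST>_<blockname>` whose value is a semicolon-delimited text field with `PROBLEM:` and `RESPONSE:` lines; the function name `make_vrf`, the block name and the wording in the usage example are hypothetical, and the form that checkCIF itself returns should always be preferred.

```python
def make_vrf(alert_code: str, block_name: str, problem: str, response: str) -> str:
    """Build a validation reply form as CIF text (assumed layout, see lead-in)."""
    # Keep only the test name, e.g. "PLAT029" from "PLAT029_ALERT_3_A".
    test = alert_code.split("_ALERT_")[0]

    # A line starting with ';' would end the CIF text field early.
    for text in (problem, response):
        if any(line.startswith(";") for line in text.splitlines()):
            raise ValueError("text lines must not start with ';'")

    return "\n".join(
        [
            f"_vrf_{test}_{block_name}",
            ";",
            f"PROBLEM: {problem}",
            f"RESPONSE: {response}",
            ";",
        ]
    )
```

For example, `print(make_vrf("PLAT029_ALERT_3_A", "I", "Low data completeness", "The crystal decayed during data collection."))` prints a five-line block. The response only needs to be a sentence or two explaining why the feature is genuine.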
14:21
Our validation is not the be-all and end-all. There are limits. Maybe there's some test we haven't yet thought about implementing. If you think of something, let Ton know. It may not be practical to apply that test in a computer program. So you still have to be a little bit careful and be vigilant yourself
14:40
when you're checking your own structures. There are other things you can put in the CIF which make no sense: you can say an iron(II) complex has a certain colour which has never had anything to do with iron(II), but you can't validate that so easily. So you have to say, does the structure make sense to you? Does the structure look right? When you look at the ellipsoid plot from all angles,
15:00
are there strange ellipsoids? These often give you a visual impression very quickly of odd things. Does the chemistry match the structure or if the structure is something different can you understand how you got there with the chemistry? If you're using restraints in a difficult structure don't overly restrain the structure to be that which you want it to be
15:23
or you can even use the databases. CSD has a lot of information about certain geometries which you can use to compare with your structure and see that everything is okay. And look at the listing files, the output files, the log files as well if you've got problems. I just want to give two examples here.
15:41
A situation where somebody had four lactams. Four lactams. But one was claimed to be a carbonyl here. We have OH and an N here. Well, is it really? The R factor on that structure
16:00
is 0.059 for an organic molecule. No, it's okay. Nothing unusual there. But there is a B alert. The Hirshfeld test says there's something wrong with the ellipsoids, the ADPs of two atoms that are bonded together. So there's a problem for that O1-C2 bond, which is supposed to be this carbonyl, or it's an OH in this model,
16:23
and a peak, a residual peak, which is slightly larger than the rest. And so if you run a difference density map, you see here there's a lot of negative electron density around this oxygen and there's a slight peak here. That's this one that comes up here. So what is it?
16:41
Well, instead of OH, it's NH2. That peak was another hydrogen that was missing, and now if we change the oxygen to a nitrogen, that negative electron density goes away. That's 0.046. Okay, so that's the answer. Of course, the chemist now has to understand what's gone wrong
17:00
in their synthesis. In the second example, we have here a strange thing: this should apparently be a CH2 group. It's missing a hydrogen, and that actually only generates a G alert saying, well, the angles are a bit odd here, because if there is only one hydrogen,
17:20
then it should be planar, and it's not planar, obviously. So what's gone wrong there? The largest peak is here in the difference map, 0.84 electrons per cubic angstrom. So there's definitely a missing hydrogen in the model. Okay, can happen, right? But in the CIF, there's no mismatched formula. So what's happened here
17:40
is the author decided, well, there was a mismatched formula so I've changed the formula so it matches my model. Okay, that went away. And in fact, the author said, well, you be the judge. Okay. Then comes structure factor validation
18:00
which has only been around a few years now and was one of the reasons we could pick up on the fraud. But it can find all sorts of things, like a mismatch between the data block names in the CIF and the FCF file. Well, it just means someone's uploading the wrong pair of files. It happens. It's not uncommon.
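The data block name check mentioned above is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not how checkCIF does it: the function names are made up for the example, and the parser is deliberately naive (it looks for `data_` headers at the start of a line and skips semicolon-delimited text fields).

```python
from typing import List


def block_names(text: str) -> List[str]:
    """Return the data block names found in CIF or FCF text, in file order."""
    names = []
    in_text_field = False
    for line in text.splitlines():
        # A ';' in column 1 opens or closes a multi-line text field.
        if line.startswith(";"):
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue
        stripped = line.strip()
        if stripped.lower().startswith("data_"):
            # Drop the "data_" prefix to get the block name.
            names.append(stripped.split()[0][5:])
    return names


def mismatched_blocks(cif_text: str, fcf_text: str) -> List[str]:
    """Return block names that appear in only one of the two files."""
    return sorted(set(block_names(cif_text)) ^ set(block_names(fcf_text)))
```

An empty list from `mismatched_blocks` means every data block in the CIF has a partner of the same name in the FCF. Anything else suggests the wrong pair of files was uploaded.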
18:22
It can find that the FCF did not come from the same refinement that produced the CIF: someone's done some updating of a CIF and it's got out of kilter somehow. Missed twinning can be found. Someone does a transformation and pastes in new atomic coordinates but doesn't transform the Uij. I mean, this cut and paste business of stuff into the CIF, instead of just generating a completely fresh new one,
18:42
is a problem. And there are other sorts of things that can be found, including cheating: something was omitted from the refinement used to generate the FCF file, but it was in the model used to generate the CIF.
19:01
And you can see that you get mismatches all over the place in the R factors, and lots of alerts saying there are problems with the R factors and goodness of fit and the reported and calculated density and all sorts of things. So that's a fairly clear sign that there was some mismatch there. As I said a bit earlier, you start editing into the guts of a CIF.
19:27
Instead, generate a nice fresh CIF and then all of these sorts of things go away. Now I just want to finish off. This is a session on CIF and, hey, CIF is a great tool.
19:41
As I said at the beginning, a lot of people, young people, don't know what life was like without CIF. The problem is, the better you make something, the easier you make something, concerning computer programs, the more a user gets upset as soon as it doesn't work properly. A tool is
20:01
only as good as its acceptance by the users. It needs to be useful, it needs to be easy to use, it needs almost to be transparent, especially for people who are not specialists in the field. Now over the last year or so at the IUCr journals we've been looking into trying to improve the journals,
20:20
make them more attractive to authors, and one of the things I've been doing at Acta Cryst. C is trying to find out if people are reluctant at the moment to publish in Section C, and if so, why. There are a number of reasons, but one is that CIF is too hard to understand or work with, or they cannot prepare a paper
20:40
in CIF. That may seem incredible to some people, yet that is the attitude of the chemical community, or parts of the chemical community, so we have to be careful about that. We have publCIF; publCIF is a great tool, and it takes time to learn publCIF,
21:01
as easy as it is. It's Word or nothing. Why? Are people not as capable with computers as they were 25 years ago? I don't know. Anyway, what this has led to, unfortunately, with Acta Cryst. C is that we have recently decided that we will accept the text sections of a C paper as a Word document.
21:24
It seems a bit sad to go that way, but it seems some authors will not submit unless that happens. The reason is there is a tendency at the moment for a proliferation of non-defined, non-standard data names in CIFs.
21:40
You're allowed to do that. The CIF syntax allows you to do that. However, it's getting away from us a little bit. It's getting a little bit out of hand, and I urge COMCIFS: perhaps they need to take this back to themselves and try and cooperate with or negotiate with software developers
22:03
so that this is better organised. Otherwise we'll have a plethora of things that nobody knows the definitions for. Each diffractometer manufacturer has a different sequence of things for the same data. So I would encourage COMCIFS to actively go out there
22:21
and talk to the software developers. There are things that we use these days in small molecule science that weren't very often used way back 20 years ago, and there are no CIF definitions for them. We don't have those and we need to be able to get them quickly. I'm happy if a data name caters
22:40
for 98% of the situations and the other 2% we can work on later. If we delay while we try and work out how to cater for 100%, it's a bit difficult. Some of the CIF definitions, existing ones, need to be curated a little bit and gone over to make sure they still are good for things that we use today. For example, the items that deal
23:03
with absorption corrections: we don't just do numerical absorption corrections anymore. We have these programs that treat CCD data and do all sorts of corrections, and we have one item for it, and we need more items for those sorts of things. I don't mean to be negative here, but I think one needs to pay a little bit more attention to these things.
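To show what keeping track of non-standard data names could look like in practice, here is a small Python sketch that lists the data names in a CIF that are absent from a supplied set of known names. It is a naive scan, not a CIF parser: it only recognises names at the start of a line, and the set of known names has to be provided by the caller (for instance, extracted from a dictionary file). The function names and the data names in the usage note are hypothetical.

```python
import re
from typing import Iterable, List

# A data name is a token starting with '_' (here: at the start of a line).
_DATA_NAME_RE = re.compile(r"^\s*(_\S+)")


def data_names(cif_text: str) -> List[str]:
    """Return the data names found at line starts, lower-cased, in file order."""
    names = []
    in_text_field = False
    for line in cif_text.splitlines():
        # A ';' in column 1 opens or closes a multi-line text field.
        if line.startswith(";"):
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue
        match = _DATA_NAME_RE.match(line)
        if match:
            # CIF data names are case-insensitive, so compare in lower case.
            names.append(match.group(1).lower())
    return names


def undefined_names(cif_text: str, dictionary_names: Iterable[str]) -> List[str]:
    """Return data names in the CIF that are not in the supplied dictionary."""
    known = {name.lower() for name in dictionary_names}
    unknown = []
    for name in data_names(cif_text):
        if name not in known and name not in unknown:
            unknown.append(name)
    return unknown
```

Given a CIF containing `_cell_length_a`, `_atom_site_label` and a made-up `_vendor_scan_mode`, with only the first two in the known set, `undefined_names` returns `["_vendor_scan_mode"]`. A list like that, collected over many deposited files, would show which local names are common enough to deserve a proper definition.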
23:22
So that's all I have to say. I won't even read my summary; I think you can read it. Except to say, on structure factors, that in the coming years we can encourage non-crystallographic journals to also require structure factors to always be submitted.
23:42
George's new version of his program puts the raw data automatically into the CIF, which is a good step in the right direction. So thank you very much.