I. Standard information exchange formalisms: A coherent information flow in crystallography
Formal metadata
Number of parts | 15
License | CC Attribution 3.0 Germany: You may use, adapt, copy, distribute and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/46306 (DOI)
Transcript: English (automatically generated)
00:05
A nomenclature commission, and back then, 65 years ago, a commission on crystallographic data. And this was at a time, not quite before computers had been invented, but certainly before what we now recognize as the information age,
00:22
when nowadays data is almost synonymous with computer processing. I won't go through other details on the timeline, because that's sort of reproduced in the program booklet as well, other than just to comment that it's consistently, throughout this long history, had a number of initiatives and projects to address this very important aspect of science, of performing science.
00:52
So I approach this introduction by presenting to you a paradigm for scientific communication. It's not necessarily applicable everywhere,
01:01
but there are certain practices within science for which this is appropriate. You come along and you frame a hypothesis. To test the hypothesis, you perform some experiment. That experiment typically generates raw data that you have to reduce or process in order to make sure that it's in a form proper for your onward analysis.
01:26
From that reduced data, you apply some thought. You derive a model, and if you're a good scientist, you validate that model. You look at how the model informs or flows from your hypothesis and the experiment and so on,
01:44
and that might cause you to repeat the experiment. There's a sort of feedback loop here. But eventually you converge on a model which you consider to be the one that you want to go forward with and convey to your colleagues. Traditionally, you do that by submitting a paper in a scholarly journal.
02:01
That paper is subject to peer review. And the purpose of the peer review is that one of your peers independently gives some thought as to the validity of the model that you propose. And in the more or less traditional cycle, that's the extent of the loop.
02:21
Typically, a reviewer is not able to penetrate further back and assess the quality of the data from which you derive the model. So in some sense, he's relying upon your interpretation and looking at the reasonableness and self-consistency of that model. If the peer is satisfied, you go on to publish the article,
02:43
and the article communicates your findings, your views, your new theory, and ideally, you would like the supporting data for that to be disseminated to the wider community. And I put that in brackets because the recognition of the importance of that
03:01
is only relatively slowly seeping out into the general scholarly communication environment. So if we apply this paradigm to some aspects of our science, to an X-ray structure determination, for example, the hypothesis might have something to do with relating structure to function within the pharmaceutical environment.
03:23
A typical experiment will be an X-ray diffraction experiment from a single crystal or a powder sample. And from that, the raw data is typically in the form of images. A data set typically is of order a gigabyte in size. It's quite a significant quantity of computer storage, and
03:43
as technology improves, the volumes involved will only increase. But in the process of reducing your data to generate the structure factors on which your refinement will be based, you in fact reduce the volume of data as well.
04:01
And in recent years, this has become a very manageable quantity, a few megabytes perhaps of data that can be carried along quite easily in a computer environment. Typically, the derivation of the model leads to a molecular or crystal structure.
04:22
And within our community, we actually have automated processes which allow you to validate that model against a set of objective criteria that give you a handle on how reasonable, how self-consistent your model has been. Again, within our community, we have adopted the practice that's now well established
04:41
that we will submit the paper to certain of the IUCr journals in CIF format. That includes all the data at the level of describing the model, and typically it will also include the structure factors, in the way in which our journals work.
05:02
The peer review process includes that aspect of validation. It includes the output from checkCIF. It also includes the ability to search databases for prior refinements or similar structures. And so you have a rather rich part of the pathway that you traverse here.
05:24
And in doing that, you eventually get to publish the article, and the article can be disseminated in forms that people are comfortable with, as a PDF or on the web as HTML. But within our journals we also publish the CIF, so that the data also finds its way into the wider community
05:49
as part of that publication process. And we therefore transmit onwards the data in the form of the deposited material.
06:02
The journals do it as supplementary core CIF files. mmCIF files or PDB format files are available for protein depositions in the PDB. And so we're conforming in a very full way to this general framework that I've outlined.
06:21
And you'll notice that most of this pathway is in bold type. So because of the effort we've put into providing standard tools, we're actually able to traverse this pathway backwards and forwards very effectively. This allows us to construct a schematic.
06:42
I'm not going to go through this in detail because it's a picture that might come up a couple of times later in the course of the day. But through our standardization efforts, there are possibilities for taking data all the way through from the raw data through the various reduction and processing stages within the laboratory,
07:05
through to publication, to deposition in databases, and to publication from the journals, or redistribution of publication from the databases. It's a very rich, it's a very full pathway. It's what I've characterized in the title of the talk as a coherent information flow.
07:24
And it's one that as a community we're very fortunate to be able to support. Now in mentioning that this is, sorry, I think the feedback's coming from here. If you look at my browser and close it.
07:45
So a lot of this has been directly attributable to CIF, to the crystallographic framework that COMCIFS has overseen. But there have also been a number of standardization efforts that have worked alongside CIF to provide this coherent overview.
08:06
CIF itself dates from 1991, when the initial standard was published as a paper, even then quite a substantial paper of some 30 pages, I think, in extent. But in the time since then, the documentation for CIF has grown somewhat
08:23
so that the current user manual is quite a hefty tome. There's a lot of reading, but we've tried to make it easily readable, even to non-specialists. But the significance of this book is not just its heft, but the fact that it's part of the International Tables.
08:42
So this is the standard reference series that crystallographers look to to define their subjects. And I think it's highly significant that the Union has chosen to dedicate an entire volume of that series to the whole business of data representation, definition and exchange.
09:01
CIF started life as a format description, a very simple description, a human-readable ASCII-based file, with a very simple and lightweight syntax. And that would be familiar to many of you who use it to submit structures directly to some of our journals. But from the outset, we made the design decision that within the format,
09:23
tags such as this that define particular quantities were not defined as part of the format specification, but the definitions were externalized. So the semantics of those tags were defined very carefully in another place.
09:40
In fact, they were defined in machine-readable files that had the same format as CIF, which was very interesting because it allowed the same software to process the data, but also to refer back to the definitions and pick up machine-readable attributes of the definitions. And this was a very far-sighted design approach.
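The design choice described here, data tags whose meaning lives in a separate dictionary written in the same syntax, can be illustrated with a small sketch. This is not the real CIF grammar or a real CIF dictionary: the parser handles only block headers and one-line tag/value pairs, it borrows shell-style quoting from `shlex` as a stand-in for CIF quoting, and the dictionary attributes `_name`, `_type` and `_units` are simplified assumptions made for illustration.

```python
import shlex

# Hypothetical, heavily simplified data file: one block, one-line tag/value pairs.
DATA_TEXT = """
data_example
_cell_length_a 10.234(2)
_chemical_name_common 'example compound'
"""

# Hypothetical dictionary written in the same syntax: one block per definition.
DICT_TEXT = """
data_cell_length_a
_name '_cell_length_a'
_type numb
_units angstroms

data_chemical_name_common
_name '_chemical_name_common'
_type char
"""


def parse_simple_cif(text):
    """Parse block headers and one-line '_tag value' pairs (a tiny CIF-like subset)."""
    blocks = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("data_"):
            current = blocks.setdefault(line[len("data_"):], {})
        elif line.startswith("_") and current is not None:
            parts = shlex.split(line)  # shell-style quoting, an approximation
            current[parts[0]] = " ".join(parts[1:])
    return blocks


def load_definitions(dictionary_text):
    """The same parser reads the dictionary; index definitions by the tag they define."""
    return {block["_name"]: block for block in parse_simple_cif(dictionary_text).values()}


def typed_value(tag, raw, definitions):
    """Use the externally defined type to interpret a raw string value."""
    if definitions.get(tag, {}).get("_type") == "numb":
        return float(raw.split("(")[0])  # drop a trailing standard uncertainty such as (2)
    return raw


definitions = load_definitions(DICT_TEXT)
data = parse_simple_cif(DATA_TEXT)["example"]
for tag, raw in data.items():
    units = definitions[tag].get("_units", "no units")
    print(tag, typed_value(tag, raw, definitions), units)
```

The point of the sketch is that `parse_simple_cif` is called twice, once on the data and once on the dictionary, so a program that can read the file can also read the machine-readable attributes that say how to interpret it.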
10:03
The community, I think, rapidly saw the benefits of this level of standardization. And within five years of its introduction, we were able to mandate CIF as the only allowable submission format for one of our journals.
10:20
And it is, of course, a format that's accepted for the deposition of material across all the journals. So the editorial in 1996 brought this news to the world. Seems a long time ago now. But reflect that the web itself, the HTML, was released to the wider community only around about 1994.
10:44
And we had CIF going a few years before that. So we were really ahead of the curve even then. And not long after that, the PDB, the Protein Data Bank, which is this enormously valuable repository of biological structures,
11:03
was re-engineered under the curatorship of the Research Collaboratory for Structural Bioinformatics, and to accommodate both the growing volume of data and the complexity of the structures that they were required to store and to handle,
11:22
they adopted an extended CIF format that was designed to be future-proof and to accommodate the enormous growth that, in fact, we have seen come to pass in the decade or so since then. Notice, by the way, that organizationally the PDB has, in the last ten years,
11:43
been formed of a worldwide consortium of organizations, and that to keep their information synchronized they actually transfer material between themselves using an XML carrier format. But the structure of that XML schema maps completely onto the mmCIF definitions,
12:04
onto the semantic ontology. And this is really just illustrating the fact that, though CIF started with a particular concrete file format, that's not an essential or indeed the most important aspect of what we've done.
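A small sketch may help show why the concrete file format is secondary to the definitions. The same set of named items is written once as CIF-style text and once as XML, and the XML is read back into the identical items. The element layout here is an assumption made for illustration and is not the actual schema used by the PDB consortium; the item names in `category.attribute` form are likewise only examples.

```python
import xml.etree.ElementTree as ET

# Illustrative items in "category.attribute" form; names and values are examples only.
ITEMS = {"cell.length_a": "10.234", "cell.length_b": "11.500", "symmetry.cell_setting": "monoclinic"}


def to_cif_text(block_name, items):
    """Serialize the items as a CIF-style block of one-line tag/value pairs."""
    lines = [f"data_{block_name}"]
    lines += [f"_{name} {value}" for name, value in items.items()]
    return "\n".join(lines)


def to_xml(block_name, items):
    """Serialize the same items as XML: one element per category, one child per attribute."""
    root = ET.Element("datablock", {"name": block_name})
    for name, value in items.items():
        category, attribute = name.split(".", 1)
        category_element = root.find(category)
        if category_element is None:
            category_element = ET.SubElement(root, category)
        ET.SubElement(category_element, attribute).text = value
    return root


def from_xml(root):
    """Recover the item names and values from the XML form."""
    return {
        f"{category.tag}.{attribute.tag}": attribute.text
        for category in root
        for attribute in category
    }


xml_bytes = ET.tostring(to_xml("example", ITEMS))
round_tripped = from_xml(ET.fromstring(xml_bytes))
print(to_cif_text("example", ITEMS))
print(xml_bytes.decode())
```

Because both serializations are driven by the same item names, a change of carrier format loses nothing as long as the names, and the dictionary that defines them, stay fixed.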
12:21
Then again, not long after that, a need was perceived for standardizing the formats in which diffraction images were collected from a multiplicity of vendors, and a workshop took place in the late 90s to work towards a standard, an extension of CIF, an image CIF,
12:41
and a binary equivalent of that that would be able to handle diffraction images effectively and efficiently. And here's a photograph of the workshop at Brookhaven that presaged that and a number of the faces here are also present in the room today.
13:00
So you had this momentum building towards standardization, towards moving towards more complex and larger data that all fitted into the CIF framework. And in more recent times people have begun to ask the question, now that we're dealing with these large volumes of experimental data,
13:20
should the union be systematizing the way in which those are collected, stored, disseminated, reused within the community? And a couple of years ago the executive committee of the union called or convened a working group to address this, to look at the problems of standardization that underpin a desire to recommend
13:45
or perhaps one day mandate routine deposition of experimental data. That working group isn't under the remit of COMCIFS, but I include it in this series because it does overlap very much with what we're trying to do in this coherent framework.
14:04
And some of you I think will remember that we ran a workshop in Bergen at last year's ECM to chew over many of the management issues involved at that time. And round about, slightly before that, a number of us were also involved in a series of open access publications
14:26
that we've made available through xt to inform our own community, certainly, but also the wider world, of many of the ramifications of this type of exercise.
14:41
And in recent times, indeed in the last two days just before this symposium, a group of us from COMCIFS have run a workshop on the premises here in order to look at how you can refine still further the value of the dictionary framework. And one of the things that we've been looking at is extensions to the formalism
15:06
so that, where previously we defined a particular quantity with a small piece of text and a few constraints on units or other attributes, now we think it would be possible to build into the dictionary definition
15:24
machine-readable and ultimately machine-executable algorithms which will relate a particular datum, a particular concept expressed as data, to other data items within the same file, allowing you again to validate,
15:43
to check the self consistency of the file, of the data in that file. They will allow you to query a file and to retrieve the reflection calculated phase if it is not present in the file but quantities from which it may be derived
16:03
are located elsewhere in the file. And it also has the potential in the long term of actually defining algorithmically a lot of the quantities that are embedded in computer programs and data processing and so forth and that therefore this extends the meanings that as humans we embed
16:26
in these little textual descriptions into the execution environment within computer code. It's a very exciting prospect and you'll hear a little bit more later in the day about what potential that might open up into the future. And it was a fun workshop, I think we all worked very hard
16:44
but I think we came away with a feeling that we're moving forward and achieving a lot, and that though COMCIFS has been around for 20 years, it still has a lot of work to do and a lot of impetus behind it. And part of the reason I'm here addressing this youthful audience
17:03
of people from the more general fields of crystallography who will wander in during the course of the day and hear more of the details of this approach is that we want you to appreciate that COMCIFS isn't simply a dry and dusty little committee of grey-bearded old fogies
17:23
but it's central to... The comment, yes it is, wasn't picked up by the mic happily so the wider world didn't get to hear that. But it is a very vigorous activity at the centre of much that's important
17:41
and still rapidly evolving within crystallography. So there are lots of ways in which the younger generation can become involved with these activities, should you choose to do so. So if I just go back fleetingly to the framework I outlined at the beginning, the idea with the DDLm extensions
18:00
is that you carry the ontology all the way through this paradigm, so that right from your actual hypothesis to publication of all the associated supporting evidence and back, you can traverse this fore and aft as effectively as you want. We have been involved in the project for over 20 years.
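The DDLm idea mentioned earlier in the talk, dictionary definitions that carry executable methods so that a missing item such as a calculated phase can be derived from items that are present, can be sketched as follows. This is an illustration under stated assumptions and not the dictionary method language itself: plain Python functions stand in for dictionary-embedded methods, and the item names and the relations between them (phase and amplitude obtained from the real and imaginary parts of a calculated structure factor) are chosen only as a familiar example.

```python
import math

# Hypothetical definitions: each derivable item lists the items it needs and a method.
DEFINITIONS = {
    "refln.phase_calc": {
        "units": "degrees",
        "needs": ("refln.A_calc", "refln.B_calc"),
        "method": lambda a, b: math.degrees(math.atan2(b, a)) % 360.0,
    },
    "refln.F_calc": {
        "units": "electrons",
        "needs": ("refln.A_calc", "refln.B_calc"),
        "method": lambda a, b: math.hypot(a, b),
    },
}


def lookup(data, name):
    """Return a stored value, or derive it from other items via its definition."""
    if name in data:
        return data[name]
    definition = DEFINITIONS.get(name)
    if definition is None:
        raise KeyError(f"{name} is neither stored nor derivable")
    arguments = [lookup(data, needed) for needed in definition["needs"]]
    return definition["method"](*arguments)


def is_consistent(data, name, tolerance=1e-6):
    """Check a stored value against the value its definition would derive."""
    others = {key: value for key, value in data.items() if key != name}
    return math.isclose(data[name], lookup(others, name), abs_tol=tolerance)


# One reflection with only the real and imaginary parts stored (illustrative numbers).
reflection = {"refln.A_calc": 3.0, "refln.B_calc": 4.0}
print(lookup(reflection, "refln.F_calc"))
print(lookup(reflection, "refln.phase_calc"))

# A stored amplitude can be validated against the same definition.
print(is_consistent({**reflection, "refln.F_calc": 5.0}, "refln.F_calc"))
```

The same definitions serve both uses described in the talk: `lookup` answers a query for an item that is absent from the file, and `is_consistent` checks the self-consistency of an item that is present.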
18:23
A large number of people have been involved, either directly with COMCIFS or in projects from which we've received input. I'm not going to read this slide. I've probably forgotten a lot of other names, for which I apologise; it wasn't a deliberate slight. But you'll see it's something that an enormous cross-section
18:43
of the community has already contributed to and we look forward to extending that community further. And finally I just wish to thank the organisations who have sponsored today's activities, they've made it possible, they've made the streaming and recording possible. There are a lot of familiar names in the crystallographic world,
19:02
CCDC, Protein Data Bank, some of the equipment vendors. There are also some possibly slightly surprising contributors, the Digital Curation Centre, British Library, CODATA. These are all organisations that we've worked with very closely in recent years and have expressed a great interest in what we're doing.
19:22
So thank you for your attention. I have a really good slide.