Spotlight on Free Software Building Blocks for a Secure Health Data Infrastructure
Formal Metadata
Number of Parts | 490
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/46923 (DOI)
FOSDEM 2020, part 132 / 490
Transcript: English (auto-generated)
00:15
Okay, so we are right on time and thank you for organizing this track.
00:22
Thank you for the opportunity to speak to you. This is Marcel. My name is Markus. We are both research associates at the Department of Medical Informatics in the city of Göttingen in Germany. And now we switch from the neurosciences domain to the medical research domain.
00:43
And our plan is to give you a quick, very quick overview of some of the tools that we use in our domain of research. And just to get an overview, who has ever been in contact with the domain of medical informatics?
01:00
Quite a few. Well, it's overwhelming actually. So you might recognize some of the tools, but for all the others, we will try to introduce the discipline in general a little bit. So we thought about how to characterize our domain, and we came up with four major fields of research in medical informatics.
01:24
So at the core of it, there are systems for primary care. I think everybody can relate to that: when you are sick, when you go to the hospital, your case creates data, your condition is documented, and all of this is done in electronic
01:43
health care record systems. Maybe there are images taken: MRI, CT, echocardiography. Those are put into some information systems. And all of this is the primary care domain. And then we have a couple of research domains, or research-heavy domains.
02:00
First of all, biomedical research, where you study the systems biology within the body and derive therapy ideas. The other route would be to go down here: take routine data, run statistics on it, and do secondary use.
02:22
It's called secondary use because data that was collected in primary care is used again, this time for research, to develop new therapy options. And finally, because we practice something called evidence-based medicine, we want statistical, scientific evidence that the therapies we run in primary care are actually not harmful
02:43
and even beneficial for your condition. So all of this goes up to the top into what's called clinical trials. Basically, you do experiments in humans to make sure that everything we do in therapy is actually sound.
03:00
We will go through a couple of software tools that are part of some of those fields. One thing I would like to mention is that primary care is probably the domain with the least open source software, which is due to the regulations that are
03:21
in place in primary care. I mean, you're treating patients, which means that even software will soon have to comply with European medical device regulations and so on. So many companies create proprietary software and sell it because they say: our software
03:42
complies with all the regulations. So, to take you through the software tools that we want to introduce, a short story. This is Bob. Bob suffers from chronic heart insufficiency. This means that Bob regularly has to visit the hospital to have his condition checked.
04:05
Every time he goes in for a routine checkup, they document the medication he's on. They take blood samples, and they also do echocardiographic imaging. All of this is stored in the clinical information systems.
04:22
So this is a model hospital because they actually use an open source tool for primary care documentation. In this case, they use XNAT, which is a picture archiving and documentation system that is open source and that can store the images created and the structured data that
04:44
is derived from those images, that is, the vital parameters that are actually interesting for further research on the data. XNAT is a very extensible open source tool. You can not only store the images and share them with your colleagues, you can also
05:03
plug in analysis pipelines like ImageJ to run analysis on the images and data that is stored in there. In the end, it gives you web-based user interfaces for image uploads, also image
05:21
viewers, et cetera. Since Bob is not only at a model hospital but is also a model patient, his data is used for research purposes, just like that. And this leads us to, of course, Alice.
05:43
And Alice is a health data engineer at a place called the Medical Data Integration Center. So that's basically where we work. And her job is to get all the data that is created and documented in primary care systems
06:01
out of those systems. So we extract this data and then mask the patient identity, because in clinical systems you always have the medical data stored for each patient and you know who this patient is. In research, we do not want to know who the patient is. So we want to anonymize the data. So the data has to be masked.
06:21
Then the data has to be transformed according to the formats that you can use in research. And finally, it is put into some kind of research data repository to make it accessible again for researchers. The tool that Alice uses is open source, and it's Talend Open Studio for Data Integration.
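The extract, mask, transform and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`name`, `birth_date`, `lvef`), the helper functions and the in-memory "clinical system" are hypothetical, and the keyed hash is a stand-in for a real pseudonymization service. It is not how Talend Open Studio represents a workflow.

```python
import hashlib
import hmac


def extract(clinical_system):
    """Extract: copy the raw records out of the (simulated) clinical system."""
    return [dict(record) for record in clinical_system]


def mask(records, secret_key):
    """Mask: replace the identifying fields with a keyed pseudonym."""
    masked = []
    for record in records:
        identity = f"{record['name']}|{record['birth_date']}".encode()
        pseudonym = hmac.new(secret_key, identity, hashlib.sha256).hexdigest()[:12]
        medical = {k: v for k, v in record.items() if k not in ("name", "birth_date")}
        masked.append({"pseudonym": pseudonym, **medical})
    return masked


def transform(records):
    """Transform: convert values into the (hypothetical) research format."""
    return [
        {"pseudonym": r["pseudonym"], "lvef_percent": float(r["lvef"].rstrip("%"))}
        for r in records
    ]


def load(records, repository):
    """Load: append the research-ready records to the repository."""
    repository.extend(records)
    return repository


def run_pipeline(clinical_system, repository, secret_key):
    """Chain the four steps in the order described in the talk."""
    return load(transform(mask(extract(clinical_system), secret_key)), repository)


# Hypothetical example record; the key is for demonstration only.
clinical_system = [{"name": "Bob Example", "birth_date": "1950-01-01", "lvef": "35%"}]
research_repository = run_pipeline(clinical_system, [], b"demo-key-not-for-production")
```

Keeping each step as its own function mirrors the idea of encapsulated subprocesses that can be reused in different workflows.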
06:42
So this is a data workflow manager with a graphical user interface. I'll give you a little better image of that. You can create data transformation workflows with a graphical user interface. It's based on Eclipse. It's also provided as a product by a company, but all in all it's an open source tool and
07:06
you can create these kinds of workflows by dragging and dropping subprocesses that are encapsulated and can be reused in different workflows. You can export all of that as a JAR file and then run it on servers and orchestrate
07:22
this. And this is basically what we do to extract data from all the different clinical information systems and put it into a single research repository in the end. One of the subprocesses that are part of these ETL jobs run through Talend is the masking
07:41
of the patient identity. The example tool we use for that is the Mainzelliste. It's a German name for a tool mostly created in Germany, so please excuse that these slides are in German. It just says that we pseudonymize the identifying data of a patient.
08:00
So you see the name on the left, and when you pseudonymize it, you just get some random ID number back. You can also use this masking service to create different secondary IDs that belong to this first ID. So there's lots of math behind that, and you can actually do lots of stuff.
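To make the idea of a random primary ID with attached secondary IDs concrete, here is a minimal sketch. It assumes a hypothetical in-memory `PseudonymService` class; it is not the Mainzelliste API or its algorithm, and the normalization of names is deliberately naive.

```python
import hashlib
import hmac
import secrets


class PseudonymService:
    """Toy masking service: random primary IDs plus derived secondary IDs."""

    def __init__(self, key: bytes):
        self._key = key
        self._primary = {}    # normalized identifying data -> primary ID
        self._secondary = {}  # secondary ID -> primary ID

    def pseudonymize(self, name: str, birth_date: str) -> str:
        """Return a random primary ID; the same person always gets the same ID."""
        identity = (name.strip().lower(), birth_date)
        if identity not in self._primary:
            self._primary[identity] = secrets.token_hex(8)
        return self._primary[identity]

    def secondary_id(self, primary_id: str, domain: str) -> str:
        """Derive a domain-specific ID that belongs to the primary ID."""
        message = f"{domain}:{primary_id}".encode()
        tag = hmac.new(self._key, message, hashlib.sha256).hexdigest()[:16]
        self._secondary[tag] = primary_id
        return tag

    def same_patient(self, id_a: str, id_b: str) -> bool:
        """Only the service can tell whether two secondary IDs share a primary ID."""
        a = self._secondary.get(id_a)
        b = self._secondary.get(id_b)
        return a is not None and a == b


service = PseudonymService(b"demo-key-not-for-production")
bob = service.pseudonymize("Bob Example", "1950-01-01")
imaging_id = service.secondary_id(bob, "imaging")
lab_id = service.secondary_id(bob, "lab")
```

The two secondary IDs look unrelated to anyone outside the service, yet the service itself can still resolve them to the same primary ID.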
08:26
So you can do de-identified record linkage. Say we have research data from two different systems that have separate IDs. You do not know that those are data from the same patient, but you can
08:41
use the service to link those data back together in the end. So each of those de-identified data packages that we have now created in the first step of our workflow has to be stored, and we are working in science: we want to have all the data pieces we use in the scientific process archived and
09:04
stored persistently. For this, we use an open source tool called CDSTAR. It is developed in Göttingen. It's basically data storage middleware: we mask all the underlying block storage options that are running in the data center.
09:26
We put a REST API in front of it and we can just use REST calls to store data items together with access control lists and metadata about this item. We also generate persistent identifiers so that each data package that was used in one of
09:44
those processes can be identified afterwards permanently. Hopefully forever because we use a data center that's in the forever business because they are linked to our university library.
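The combination of stored item, access control list, metadata and persistent identifier can be sketched as follows. This is an in-memory stand-in under stated assumptions: the `ArchiveStore` class, its methods and the identifier prefix are hypothetical, and nothing here reflects the actual CDSTAR REST API, which would sit behind HTTP calls instead.

```python
import uuid
from dataclasses import dataclass


@dataclass
class StoredItem:
    payload: bytes
    metadata: dict
    acl: dict  # user name -> set of permissions, e.g. {"alice": {"read"}}


class ArchiveStore:
    """Toy storage middleware: hides the backend, issues persistent identifiers."""

    def __init__(self, pid_prefix: str = "example-prefix"):
        self._items = {}
        self._prefix = pid_prefix

    def put(self, payload: bytes, metadata: dict, acl: dict) -> str:
        """Store a data item with metadata and ACL; return its identifier."""
        pid = f"{self._prefix}/{uuid.uuid4()}"
        acl_copy = {user: set(perms) for user, perms in acl.items()}
        self._items[pid] = StoredItem(payload, dict(metadata), acl_copy)
        return pid

    def get(self, pid: str, user: str) -> bytes:
        """Return the payload if the user holds the 'read' permission."""
        item = self._items[pid]
        if "read" not in item.acl.get(user, set()):
            raise PermissionError(f"{user} may not read {pid}")
        return item.payload

    def describe(self, pid: str) -> dict:
        """Return a copy of the metadata stored with the item."""
        return dict(self._items[pid].metadata)


store = ArchiveStore()
package_id = store.put(
    b"pseudonym,lvef_percent\nabc123,35.0\n",
    metadata={"study": "heart-insufficiency-demo"},
    acl={"alice": {"read", "write"}},
)
```

Because callers only ever see the identifier, the underlying storage can change without breaking references to a data package.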
10:01
So this is the cue to switch.
10:29
So, the awkward switching is done. I hope the cartoon was long enough to distract all of you. We've seen that Alice has stored her data in CDSTAR, and now she has this data in some
10:41
format she may have thought of. But we would like to have more. We would like to have semantic annotation, meaning that the data file itself should contain something that tells another person what the data is about. You may know the problem from CSV files, where you just have data tables and don't know what they mean, and most of the time a table header that is some weird combination
11:03
of letters that you don't understand. In order to avoid that, we use openEHR. openEHR is actually a foundation, so it creates a lot of stuff, but the two things we want to focus on are the specification and the clinical modeling part of openEHR.
11:20
As for the specification, the people at openEHR created a two-level modeling architecture in which the specification defines different reference models. These are the most basic parts that you can store data in. Think of them as the data types in every programming language, like character or integer or something like that.
11:41
Using this reference model, you can go a level higher, into, let's say, user space, where different users with clinical knowledge can take these reference model pieces of information and put them together into an archetype. An archetype would be a, let's say, logical compound value that you can store in a database
12:04
afterwards. That means that instead of just storing a number, if you want to have the blood pressure, you need two numbers and maybe some units or something like that. So an archetype would, for example, be Bob's blood pressure. At the topmost level, the templates are an even higher-level collection of archetypes,
12:21
meaning that you can model even higher-level constructs like a visit. When a patient comes to a hospital, you are able to define what a visit should look like: which archetypes, which parameters a study nurse has to record in order to map everything to a common data format.
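The layering of basic types, archetypes and templates can be illustrated with plain Python dataclasses. This is only an analogy under stated assumptions: the class names `Quantity`, `BloodPressure` and `RoutineVisit` and the unit check are hypothetical and are not taken from the openEHR reference model or any published archetype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """Reference-model-like basic type: a number together with its unit."""
    magnitude: float
    unit: str


@dataclass(frozen=True)
class BloodPressure:
    """Archetype-like compound value: two quantities that belong together."""
    systolic: Quantity
    diastolic: Quantity

    def __post_init__(self):
        # A clinical modeler would constrain the allowed units here.
        for quantity in (self.systolic, self.diastolic):
            if quantity.unit != "mmHg":
                raise ValueError(f"unexpected unit: {quantity.unit}")


@dataclass(frozen=True)
class RoutineVisit:
    """Template-like collection: what has to be recorded at one visit."""
    patient_pseudonym: str
    blood_pressure: BloodPressure
    medication: tuple[str, ...]


visit = RoutineVisit(
    patient_pseudonym="abc123",
    blood_pressure=BloodPressure(Quantity(130.0, "mmHg"), Quantity(85.0, "mmHg")),
    medication=("example-drug",),
)
```

The point of the layering is that the constraint lives with the compound value, so every record built from it carries its meaning along instead of relying on a cryptic column header.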
12:42
If you do that, a template may look like this: pretty complex, pretty big, and it's really hard to just look at the data itself. Imagine this as a really big JSON or XML file; it's hard to find the data in that. So obviously the people at openEHR thought of that and created the Archetype Query Language, which is a kind of hybrid between SQL and XPath that allows you to traverse the
13:03
hierarchical data and get the data that you want out of it. So, very nice: we now have the data in a common format that can be understood by other people. Let's introduce another researcher called Carmen, and she wants to get the data. First of all, she does research on heart insufficiency, so just the
13:25
condition that Bob has, and she now wants to have some data from the openEHR repository. If she's capable of doing that, she will specify some kind of AQL query, get the data out of the database that is set up at the hospital, and is then able to use another
13:43
platform or another tool to analyze the data, which is called i2b2/tranSMART. i2b2 and tranSMART are actually two tools, but they are being merged right now. This tool is a clinical data warehouse that allows you to do some simple
14:00
analytics on data. We will focus on tranSMART for this presentation because it's the tool that we run at the moment, but in due time we will switch to the merged i2b2/tranSMART tool. In this picture you can see what this can look like: you can look at very basic statistics, like the age distribution or the gender distribution,
14:24
to get a feel for the data that you have. You can even run more sophisticated analyses using R scripts: you just write an R script, upload it there, and the analytics engine will give you the output that you want. We use this tool primarily to do a kind of data review.
14:45
In this example you can see that there is a correlation between age and height. You can see that there are a lot of data points on the left but only one on the right. This is because the person depicted there has a height of 165 meters, which is, let's say, unlikely. But it
15:02
shows that data that you integrate from heterogeneous IT infrastructures is usually erroneous, and you have to keep that in mind. Let's say she had a research hypothesis, did her research, and was able to
15:24
either confirm or reject the hypothesis. She writes up a paper, everything is nice, everything is clean, and she submits this paper to an open access journal. What we'd like to see is that she opens up her data as well. So, in the spirit of open research and open data, she
15:41
should publish the data somewhere. She could do that in a FAIRDOM-SEEK instance run at the hospital. FAIRDOM-SEEK is basically a data repository which follows the ISA standard, which stands for Investigation, Study and Assay. Thank you.
16:02
And most importantly, it stores rich metadata with the data files that you upload there. You can assign a license; you can say this piece of data was collected in this study, as part of this investigation, and so on. With that you can publish the data much better, and it is stored persistently, so even after several years you can provide the same link,
16:26
you can call that link, and you can see: okay, this data was used in this publication, and I can maybe open up this publication to find out more about the data, or vice versa. So, thanks to all the tools and the people involved who were able to use these tools,
16:43
maybe Bob can get healthy and all his fellow patients can lead longer and happier lives. For one hospital this may even be the case somewhere, I don't know; most certainly not in Germany. But what about sharing data with other hospitals? This is something that is very hard to do,
17:03
for us especially. We're working on such a project right now, and we can see that there is a lot of data infrastructure on the global level that tries to do that, to link data together and create this internet of data objects. There is a lot of work put into that in different domains.
17:21
For example, in the medical informatics domain on a national scale, we have the Medical Informatics Initiative, which has basically offered us some money to build this up for Germany. On the international scale, OHDSI or EHDEN do basically the same with other data formats and other technologies, but have the same goal. So it's very, very good to see movements that
17:42
want to link the data together. Across domains there are also developments that most of you may know, like the Research Data Alliance or the W3C working group Data on the Web, which has specified the DOIP protocol to access digital objects from different domains
18:01
or from different IT infrastructures. In order to use this data across the boundaries of our hospital, we have to think, especially in medicine, about security. Medical data is very sensitive. As we said, we are masking the patient data,
18:22
but this may still not be enough. If you think of rare diseases, it's very likely that a doctor who knows a lot about rare diseases or works in the field is able to spot a specific combination of diseases and know: okay, I know this patient, I've seen that before, because it's rare.
18:40
So this medical data bears both high value for research and potential for misuse. We think that the benefit of linked medical data has to be exploited: we want to have the data, we want to have it accessible and do research on it to improve healthcare. But what we need to do is create secure IT infrastructures for that, so that not
19:05
everybody is able to download everything just like that, have it in a really big, nice repository, and do some, I don't know, big data analytics or something like that as he wishes. We do need some regulation and some safeguards in place, and to do so
19:22
we need accountable and transparent workflows. Not only do we have to make it secure, but the patient should know: where is my data going, what is happening with my data, where is research being published with my data? That is in order to empower the patient. Scandals like the Cambridge Analytica data leak, or what they did with the data, don't really
19:44
make it particularly easy to go up to a patient and say, hey, can we have your data, we want to do research. So they're pretty hesitant about that. All this counts not only for the data but obviously also for the tools that we
20:01
use themselves. We try to use tools that are off the shelf and available for download, but not every tool fits our purpose, so we sometimes have to extend a tool or even build new tools. From our perspective these have to be open source tools, because this is the only way to really get transparency and empower the patient. So
20:25
things like Public Money, Public Code show that we're going in the right direction and have a, let's say, political layer to talk about these things. We need that for medical research especially, and we should emphasize that everyone who's working in this medical field should
20:43
think about that: not only making the data open just as is, but thinking about the tools that you are using and the data flows you're creating, and making these open. So, global data infrastructures: we saw them, they are being built, and we should establish decentralized
21:01
and free technologies to ensure secure IT infrastructures. Typically the medical information systems are really not free or open source whatsoever; these are the big proprietary blocks that we saw, which are very black, and nobody touches them. But the tools in medical informatics research are frequently used and could very well be rolled out more into primary care.
21:25
But we have to advance that; we have to politically raise our voice for that and say: hey, there are tools you could use, and please do so if you are capable. We saw that there are some people from the medical domain here: raise your voice. It would really be a step forward to
21:42
have more openness in medical informatics research. As for the references: obviously thanks to the team in Göttingen. We have a lot of people there who are very open to the whole open source community, and we try to advocate its further use. And, like basically everyone
22:02
said: if you're interested, contact us, we have jobs. So thank you very much. Questions? Doesn't
22:21
work if I turn off the microphone. So, any questions? Yes.
23:16
So the first comment was that we do not necessarily need open source tools or research
23:26
tools, but especially open formats and open standards, right? Like openEHR, which we presented; there are a lot of other standards: FHIR, OMOP CDM. We agree totally. We do need that to exchange data and enrich it semantically so others can reuse the same data. The second comment,
23:44
or rather the question, was: where is our field of application, where are we going with that? Basically, as medical informaticists we're kind of in between everything. We're trying to help the researchers get the data, analyze it, and derive research
24:03
hypotheses from it. We're trying to get to the patient, to empower his view on his data. And we're also trying to go a little bit into primary care and say: hey, look at that, we have these tools, we use this path to get the data from A to B, and wouldn't it be a possibility
24:21
for you guys in IT? Yes.
24:41
Okay, the comment was that there are similar movements in the Netherlands right now. So basically what we called for, they are already doing in the Netherlands. So, great. Please come to us after the talk; we would like to talk, I guess. I don't know which project you are associated with. There are some projects that do that internationally, but I agree there's not much communication. So everybody gets
25:05
a grant and tries to build up something open source, and then we have 60 different versions of open source standards which are very, very badly compatible. So I guess the answer is: let's talk. We have to talk. I'll come and find you. Yeah, sure.
25:38
So your question was regarding the Medical Informatics Initiative in Germany
25:49
and how the four consortia that are being funded there act together, especially relating to the open source tools. So we have, like, an organization that sits above all of it and tries to increase
26:05
the communication between them. But of course these are professors talking about problems that have to be solved. On a working level we try to establish that as well. I mean, in Germany it's not a huge community, so basically everyone knows everyone somehow, and we have meetups and things like that. So we have to increase the talking bit in between.
26:25
We try to do so. And, well, let's say the standard for open source is set. I've never seen a researcher in medical informatics who says, nah, open source, we don't need that. Everyone is the same there. But what we have to do is carry that
26:44
more into primary care, and to the people who are not affiliated with medical informatics itself but are more in medical IT, who are somewhat more hesitant regarding that.