Interoperability Through Python Modules, Unit-Testing and Continuous Integration

Formal Metadata

Title
Interoperability Through Python Modules, Unit-Testing and Continuous Integration
Title of Series
Part Number
4
Number of Parts
9
Author
Kraft, Angelina
0000-0002-6454-335X (ORCID)
57188695916 (SCOPUS)
E-5011-2016 (RESEARCHERID)
Leinweber, Katrin
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI
Publisher
Technische Informationsbibliothek (TIB)
Release Date
2018
Language
English

Content Metadata

Subject Area
Today's topic is interoperability. The agenda for today, in short, is as follows.
First we start with some definitions and maybe some use cases, also from inside the TIB and its new director, Sören Auer, about the vision of knowledge-based information flows. We will continue with some practical work on data and software management projects and then, as said before, continue with Python modules and probably also packages, we will see, to be interoperable.
Why is this so often discussed? Among scientists working closely within their disciplines, interoperability is maybe the toughest one of the FAIR principles. Why? Well, many scientific disciplines focus so much on their actual work in the laboratories and in the field, doing their observations, that they are still neglecting the information technology part, so to say. They are lacking appropriate ontologies, they are lacking appropriate vocabularies in their field, and they often complain that they do not have the time or the capacity within their community to develop such vocabularies.
But as the FAIR principles state, for machine-to-machine interoperability you need data and metadata that use a formal, accessible, shared and broadly applicable language for knowledge representation. The vocabularies themselves need to follow the FAIR principles as well, so the languages describing the scientific fields need to be FAIR too. It is not done with data packages and scientific software being FAIR; let's say the layer describing them also needs to be FAIR. The last one on this list is that data and metadata include qualified references to other data and metadata, and these vocabularies are the way to do that, to describe them in a semantic way.
On the side of the institutions and repositories there are several actions to be taken, such as providing machine-readable data and metadata in a well-established formalism. That means they should use structured, discipline-established vocabularies, ontologies and thesauri, which are based, for example, on the RDF format or use JSON-LD, and there are many more.
Before it all started in science, the semantic world, let's say, also came from the governmental data side. There were governmental databases with, I don't know, the size of cities, the geolocation of cities and countries across the world, the timetables of trains and public transport, and things like that. All of those were available in governmental databases and repositories, and they were among the first to be described through a common schema. It is applied widely, and it can be extended to the scientific community as well.
Another point on the institutions' side is to support referencing in metadata fields between datasets through a certain schema. For example, we have the DataCite metadata schema, which allows us to include properties like the related identifier field and the relation types, so you can interlink different datasets, and you can interlink DataCite datasets with an academic publication. This is one way to go. It was there very early, from the very beginning of the DataCite metadata schema, which started in 2009, and it is still being improved. In the meantime DataCite has also started to include RDF, so for all the datasets that are described with a DOI in the DataCite UI you can now get the metadata in an RDF format too. Other than that, the metadata and data ingest should happen from good sources. There are also initiatives going on, for example to extract such information from PDFs; we will come to that in a moment.
As a scientist you should be really precise and complete about the metadata you provide for a dataset or any publication. This point has been mentioned before for the other principles, F and A, but here again you cannot stress its importance enough, I would say. As a scientist you should also look around in your community for vocabularies, ontologies or thesauri that are already in use. Ask around, don't be shy to get into them a little bit, and maybe you are even interested enough to assist in their development. As I said before, in many scientific disciplines they are still lacking, or they are not developed as well as they should be.
Then again, and this also depends on the metadata, you should clearly identify relationships between datasets, or between the data and the publications that are based on it. That is really important, and here many metadata schemas assist you in providing such information. Of course, on the publisher side, journal editors should also take more care about the interlinking between datasets and publications. And again, don't be shy: just request support if you need it. There are data managers out there, there are journal editors out there, so you shouldn't be shy to ask them for help. For software, we often follow established code style guides; Katrin will come to that in a moment.
So what are we at TIB doing in this context? Actually quite a lot. We established new scientific research groups, and they focus on semantic descriptions of research data and its structures.
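As a rough illustration of the related identifier and relation type properties mentioned above, the following sketch builds a small metadata record in Python and prints it as JSON. The names `relatedIdentifier`, `relatedIdentifierType` and `relationType`, and the value `IsSupplementTo`, follow the DataCite Metadata Schema. Everything else is an assumption made for illustration: the DOIs are made-up placeholders, the helper function is hypothetical, and the surrounding record layout is simplified (mandatory DataCite properties such as creators and titles are left out).

```python
import json

# Made-up placeholder DOIs, used only for illustration.
DATASET_DOI = "10.1234/example-dataset"
ARTICLE_DOI = "10.1234/example-article"


def link_to_publication(dataset_doi, article_doi):
    """Return a simplified record linking a dataset to the article based on it."""
    return {
        "doi": dataset_doi,  # simplified; not the full DataCite identifier property
        "relatedIdentifiers": [
            {
                "relatedIdentifier": article_doi,
                "relatedIdentifierType": "DOI",
                "relationType": "IsSupplementTo",
            }
        ],
    }


record = link_to_publication(DATASET_DOI, ARTICLE_DOI)
print(json.dumps(record, indent=2))
```

Because the relation is stated with a controlled term rather than in free text, a machine can follow the link from the dataset to the publication without parsing a description field.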
This is an initiative brought forward by our director, Sören Auer, and he really drives this switch away from text-based PDFs to research data and, actually, to the Open Research Knowledge Graph. The vision is that research will move, in the long term (it is not a short-term development), from the document-based style to an information and knowledge graph. So if you look up information, for example, it will no longer be the case that we need to index PDFs, describe them, and use keywords that are included in the PDFs for the search; it will all be accessible through such interlinked knowledge and information systems. That is the vision. Of course it is not going to happen very fast, because again we are missing the tools in many scientific disciplines to allow that, but it is one of the things to consider. And you wanted to comment on that?
Currently a lot of this extraction from text-based documents back into data, and then putting this information into knowledge graphs, happens after publication, obviously. If you think about it, this knowledge graph idea is actually a much better representation of how our brain processes information. Years of scientific work go into thinking: OK, what do we already know, what kind of new experiments can we develop, which kind of data can we then collect? So I think it is quite a good direction to go in, although it is probably very long term. That is all I wanted to say about this.
Yes, so our groups that are concerned with working on this: one is called Data Science and Digital Libraries, with a focus on the actual development and testing of some trials of those research knowledge graphs. The next one is Scientific Data Management, working on the ontology development side. We have the Visual Analytics group; they extract information from non-textual material, for example context analysis such as mathematical formulas within a scientific video lecture. And there is Scientific Knowledge Engineering, which also contributes to ontologies, but they try to interlink them: they look at what ontologies are already out there and how they could be adapted and used for certain scientific use cases.
Before this whole, let's say, semantic development came up, there had already been discussions going on within the PID systems world, which we introduced on the first day, concerning this interoperability.
One example I wanted to give here, which is already in place, is an automatic update function. When a DOI is minted, that means it is registered officially, and it is searchable within 24 hours at the latest. There is a close cooperation going on between the three big PID system providers: Crossref for article publications, DataCite for datasets and publications, and of course ORCID with the ORCID iD. These three agreed to establish a mechanism so that your ORCID profile is updated automatically as soon as you publish an article or a dataset. You have to authorize them to do that in the first instance, but after that it is done fully through this interconnection, and the publishers associated with Crossref also agreed to do this. So in the end it is looking quite nice, and we already have some use cases in place where current research information systems, CRIS, at local institutions use those auto-update functions to keep their institutional information systems up to date. We at TIB are still a little bit in the testing phase of doing that with our information system, but it is working quite well. Just as a side note here. And now Katrin will continue with the replication crisis and some more details.
This will be a little detour. You have probably heard that the field of psychology, for example, has been discovering that many of its foundational experiments, which are usually performed with students because at a university that is the population you can easily access and invite to experiments, may not generalize to the whole population, and some of the theories have been shaken quite a lot. This applies to some other fields as well, so I don't want to single out the psychologists here. In the whole scientific community, the talk about "we cannot replicate experiments, this is a crisis" has been increasing in the last few years, especially thanks to some initiatives that try, in an organized manner, to replicate some of the older research instead of always chasing the next exciting discovery. To be honest, it also leaves quite a bad impression of science. We are funded by public money, so we have to ensure that the results we put out are valid.
Some of the data here is from private companies that have taken, I think this was cancer research, the results of university and institutional studies and tried to replicate them with their own methods. As you can see, these are the percentages of irreproducibility. I am not sure what the commercial interests are here; maybe they want to prove that their own research is better. But still, even in the independent studies, about half of the results are not reproducible. In such an important field as medical science, that means the basic research never reaches patients. That is of course a really big problem, and also a financial problem; this is just for preclinical research in the United States.
Looking at the finer-grained data: where in these blocks, the laboratory protocols, the data analysis, the study design, or the reagents and reference materials, do you think software plays a role, so that software problems might also lead to replication problems? I would say at least three of them definitely contain some kind of software. There is a nice set of figshare presentations where, usually at the front of the presentation, some examples are explained of science failing or a scientific paper having to be retracted because of some software. The Rogoff economics paper mentioned on Monday or Tuesday is in there as well, explained in more detail. So this is a responsibility for all of us: if we do software, and of course data management as well, it should be done correctly. Today we will teach you some methods, for example for automated testing of Python scripts, and R then tomorrow, to help prevent errors.
But before that, I also want to interpret the interoperability principle here, in terms of software, as some very basic practices. The first one is simply organizing your files. These are some ideas for folder and file structures on the computers that you use, one of them taken from a paper. If you are working with really huge datasets, you may even be encountering other storage systems, like object-based storage. These slides, for example, we made in Google Drive, and as you can see, the address is a kind of hash or identifier. The idea is that this is not placed in some kind of folder or file; rather, it is accessible via the link, and if I send you the link, it does not matter where the file actually sits on any hard drive. That is also a way to store data, which some of you may have encountered.
But if you have not, and you are keeping your projects on your own computers, please still have a look at whether the repository that you may have found in our re3data exercise on Monday suggests a particular file organization scheme. Check around whether your colleagues use a particular one. When we do package development in R, for example, there will also be a standard which we will follow. Please just choose one. It is not always that easy, but in the long run it is definitely better than having your own file and folder structure, because that will only make sense to you, not to your students, not to your collaborators. Any kind of structure that is predetermined will make more sense to everybody.
Version control is of course a super combination with this kind of thing, because you can start with any structure, even an unstructured one. If you have a project folder under version control, you can always change the folder structure without losing the previous structure that you might have taken from a different source. This is why version control integrates so nicely into so many workflows.
One very important thing is date formats. You know the American version with the month first, then the day, and so on; here the Europeans have day, month, year. But really, the most useful one for computers, and also for people all around the world, is the year first, then the month and the day. Because it is hierarchical, the computer can sort it more easily, so it is a really helpful suggestion. You don't see the file names here, but we started doing this for our presentations.
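The sorting argument for year-first dates can be checked with a few lines of Python. This is only a sketch: the file names and the helper `dated_name` are made up for illustration, and plain alphabetical sorting stands in for what a file browser typically does.

```python
from datetime import date


def dated_name(day, stem):
    """Prefix a file name with the date in year-month-day order."""
    return f"{day.isoformat()}_{stem}"


# Three hypothetical presentation dates, deliberately out of order.
days = [date(2017, 12, 3), date(2018, 1, 21), date(2017, 6, 15)]

year_first = [dated_name(d, "slides.pdf") for d in days]
day_first = [f"{d:%d-%m-%Y}_slides.pdf" for d in days]

# Alphabetical order equals chronological order only for the year-first names.
print(sorted(year_first))
print(sorted(day_first))
```

With the day-first names, the December 2017 file sorts before the June 2017 file, because the comparison starts with the day of the month.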
Another interpretation of interoperability for software comes from Unix. Maybe you know the Unix philosophy of software being written in a way that it works well together with other software, so that, for example, the output of one program can also be used as input for another. That is interoperability on a very technical level, but also on an extremely practical level, because it is what you have probably learned if you use Unix shell piping and filtering: it just works if the programs adhere to this philosophy. Today we are therefore also going to learn about modules and, as I said, if we have time, packages, to follow interoperability principles. Packaging your software up into reusable packages or modules (the terminology is a bit different in some cases) is a way to share it with people, which then also reflects on the reusability principle for tomorrow.
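A Python script can take part in such a pipeline if it reads from standard input and writes to standard output. The following is a minimal sketch of that idea, not a replacement for `grep`; the function names and the script name `keep_matching.py` are assumptions made for illustration, and in-memory streams stand in for a real pipe so the example runs on its own.

```python
import io
import sys


def keep_matching(lines, pattern):
    """Yield only the lines that contain `pattern`."""
    for line in lines:
        if pattern in line:
            yield line


def main(argv, stdin=sys.stdin, stdout=sys.stdout):
    """Filter stdin to stdout, keeping lines that contain argv[1]."""
    pattern = argv[1] if len(argv) > 1 else ""
    for line in keep_matching(stdin, pattern):
        stdout.write(line)
    return 0


# Demonstration with in-memory streams instead of a real pipe.
fake_in = io.StringIO("alpha.csv\nbeta.txt\ngamma.csv\n")
fake_out = io.StringIO()
main(["keep_matching.py", ".csv"], stdin=fake_in, stdout=fake_out)
print(fake_out.getvalue(), end="")
```

Saved as a script that ends with `sys.exit(main(sys.argv))`, the same code could sit between other tools, for example `ls | python keep_matching.py .csv | sort`.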
Some basic rules. As I said in the functions presentation, this is not a quality ladder from scripts to functions to packages; you don't necessarily have to go down this path. But if you write scripts, as you all said in the survey, there are some rules to follow and some things to avoid. For example, if you depend on external packages, it is a really good idea to load or import them all at the very beginning of a script and not further down, because the user of the script needs to know that they may have to install these things, and then running the script is easier.
Also avoid calling absolute paths. This reflects a bit on the file and folder organization topic I mentioned just now: if you use your own scheme of file and folder organization, the paths that you have on your computer are not present anywhere else. So avoid hard-coding paths like these, with your own user name, "my projects" and so on, but rather always use paths relative to the script file, to the project folder, and so on. Setting a working directory is also not very useful, because nobody else will have the same working directory on their computer. In R you can use a project file; in Python there is something similar as well, or you at least have to switch to the project folder first.
The advice here I got from Jenny Bryan, who also belongs in this category of very wise persons, along with Hadley Wickham, and who maintains many, many R packages. We will learn how to build a module or package later. In this special case of organizing the files, this is definitely better than just having hard-coded paths in your scripts.
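A small Python sketch of these two rules, imports at the top and paths relative to the script instead of hard-coded absolute paths, might look as follows. The helper `project_path`, the folder names and the commented-out absolute path are all hypothetical.

```python
# All imports sit at the very top, so a reader sees at once what is needed.
from pathlib import Path


def project_path(script_file, *parts):
    """Build a path relative to the folder that contains `script_file`."""
    return Path(script_file).resolve().parent.joinpath(*parts)


# Avoid: an absolute path that exists on one person's computer only.
# data_file = Path("/home/alice/my_projects/analysis/data/raw.csv")

# Prefer: a path relative to the script. In a real script, pass __file__
# as the first argument; a made-up script location is used here instead.
data_file = project_path("analysis/run_analysis.py", "data", "raw.csv")
print(data_file.parent.name, data_file.name)
```

Whoever copies the project folder to another computer gets working paths without editing the script, as long as the folder layout inside the project stays the same.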
Next, software testing is one topic for today. This is some data from a recent survey; because it is a bit too small to read, I summarized it roughly. As you can see, about half of the scientific software surveyed, from several dozen scientific software engineers, is tested by the developers themselves. One quarter is tested by the users, which is interesting actually. And then there are a few cases where there is either no testing or there are special test engineers. Test engineers: well, I would like to see the research group that has test engineers dedicated to this task. This could be a very good thing, because someone who is qualified in software testing may catch more bugs and may write better tests. On the other hand it could also be a worse thing, because then the developer no longer has to care about the tests, and if the communication between these two people is not optimal, the test engineer might test something that is not relevant, and the developer might never know that there are actually serious bugs in their software. So as soon as you have specialization, it of course depends on how well the specialized roles in the project team talk to each other. That is the summary from the survey, and this is the source.
As I mentioned before, there are many, many cases where a bug in the software led to paper retractions, inconclusive results, or extremely surprising results which were then even published. And then, when it became clear that the problem actually was the software, everybody said: ah, OK, that is why the result was so surprising. The sources are here.
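To give an idea of what testing by the developers themselves can look like in Python, here is a minimal sketch using the standard library's `unittest` module. The function `mean` and the test cases are invented for this example; they only show the pattern of checking a typical case, an edge case and an error case.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_typical_values(self):
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_single_value(self):
        self.assertEqual(mean([5]), 5)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            mean([])


# Run the three tests programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If the function and the test class live in a file whose name starts with `test`, running `python -m unittest` in that folder finds and runs the tests automatically, which is also what an automated check on every commit would do.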
The vocabulary part of the interoperability principle, to use a common language, will in our case be interpreted, as we started to discuss yesterday, as code styles. This is something where you give a little bit of your own taste and control over to, maybe, a computer program. As you can see in RStudio, the "Reformat Code" option just does the indentation, the closing brackets, the aligning of brackets and things like this automatically. It is simply faster than manual, or I would even say menial, formatting; shifting stuff around is really not a useful use of time. In particular in a team of people, where you have to read each other's code, if it looks the same regardless of who wrote it, it is much more understandable. This can of course be automated, and it ties into the version control part as well: if a commit that has some kind of meaning, because you changed something in your code, is mixed with formatting changes, it will be more difficult for yourself and for others to understand your code.
Just some mentions here: the Python universe has, for example, pycodestyle, as well as automatic code formatters. In R, I would very highly recommend the rOpenSci packaging guide. That is a community of research software engineers and scientists building R packages, and they have very helpful learning material about this, some of which we will also use. The tidyverse of course also has a style guide; either of them is perfectly fine. There are also, more generally, the style guides from Google for many different languages. Again, it does not matter which one you pick; it is important that you pick one.
Also, as soon as we start packaging our software, we of course intend it to be used by others, and then we have to deal with versioning. That also applies to something that you may not publish, because as soon as somebody else uses your code, like a student or a collaborator, you should try to convey some meaning about what updates to your software imply. We of course presume that your software will evolve and get better all the time. One common vocabulary to use is so-called semver, semantic versioning. It defines specific meanings for each part of the version number, and the meanings reflect the code changes that underlie the update. Very simply, if you just fix a little typo, that is probably not worth any version change.
However, take the example here, version 0.4.2. Of the three parts, let's start from the last one: it is the patch version. If I change my software package from version 0.4.2 to 0.4.3, it means I have implemented some backwards-compatible bug fixes. To the user this signals: I should probably install the update if I care about updates, but I can be pretty confident that my own code will not need to be updated, assuming that I am not relying on some buggy behavior in my own program. It also signals that you should not rely on bugs; you should report the bug and then use the correct behavior, the correct output, for example.
The middle part is the so-called minor version number. It indicates that there are changes and additions which are, however, backwards compatible. People can then say: well, I don't need any new features from this software, so I might as well skip this version; or I just install it, and I can again be pretty confident that I will not need to change anything.
Then there is the number at the very front, the major version. That is of course the most important one, and it indicates that something has actually been changed in a non-backwards-compatible manner, for example the removal or renaming of functions. If some other package is calling my functions, its authors will definitely need to update their code, and maybe they would even like to wait a little bit with the update before they do that. Maybe tomorrow or on Friday we will have a live demo about visualizing some data using ggplot2.
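The meaning of the three version parts described above can be expressed in a few lines of Python. This is a simplified sketch with invented function names; it only handles plain `MAJOR.MINOR.PATCH` strings and ignores the pre-release and build suffixes that the semantic versioning specification also allows.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def kind_of_update(old, new):
    """Name the most significant version part that differs, or 'none'."""
    names = ("major", "minor", "patch")
    for name, before, after in zip(names, parse_version(old), parse_version(new)):
        if before != after:
            return name
    return "none"


print(kind_of_update("0.4.2", "0.4.3"))  # bug fixes only
print(kind_of_update("0.4.3", "0.5.0"))  # backwards-compatible additions
print(kind_of_update("0.5.0", "1.0.0"))  # possibly breaking changes
```

Comparing the parsed tuples instead of the raw strings also orders versions correctly: as text, "0.10.0" sorts before "0.9.0", while `parse_version("0.10.0") > parse_version("0.9.0")` is true.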
I prepared this example with an earlier ggplot2 version, but just last week an update to version 3 came out. So we will try, live, to run the code first (you can also practice with this), and then, as the last thing, we will update ggplot2 on this computer and try to run the code again. Maybe we will have exactly this problem, that the demo code I prepared has to be updated.
The whole idea here is that you value the outcome for the user more than the output by the developers, because the users are the ones using the software and, as I mentioned, code is read more often than it is written. So it is a very short-sighted calculation if you say: I am a developer, I am going to save myself a little bit of time, when it will mean that more time investment is necessary for other people.
Here are some very practical suggestions if you have to deal with this. Maybe it is a good idea to set a certain upgrade time, definitely not on a Friday evening, but some time in the week where you do all updates in one go, and then also set time aside to maybe update your own code. A way out of this is, to some extent, the continuous integration and continuous development idea that is coming from the software development world into science more and more. This can help because things just change, and it is an acceptance of the fact that if you have to change as well, it had better be a continuous process in small steps, rather than waiting until you publish your paper and then having to resolve all the update conflicts that may have occurred in the months, or however long it takes, between first doing the analysis and then having to make the code reproducible, publishing the code along with the paper, and so on.
Here again version control systems play a role: GitHub, for example, and GitLab as well are integrating dependency checks and security alerts. This is of course driven by commercial interests, but an automatic notification that some dependency has been updated will also be useful for scientists.
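A much simpler, local version of such a dependency check can be sketched in Python with `importlib.metadata` from the standard library (Python 3.8 or later). This is not how the hosted services mentioned above work; it only compares the major version of each installed package with the one the code was last tested with. The package names and version numbers are invented, and a stand-in lookup function replaces the real one so the example runs without installing anything.

```python
from importlib import metadata


def major_of(version_text):
    """Return the leading (major) number of a version string like '3.0.0'."""
    return int(version_text.split(".")[0])


def check_dependencies(tested_versions, get_version=metadata.version):
    """List packages that are missing or whose major version has changed."""
    messages = []
    for package, tested in sorted(tested_versions.items()):
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            messages.append(f"{package}: not installed (tested with {tested})")
            continue
        if major_of(installed) != major_of(tested):
            messages.append(f"{package}: installed {installed}, tested with {tested}")
    return messages


# Stand-in for the real lookup, with invented packages and versions.
fake_installed = {"plotlib": "3.0.0", "tablelib": "1.4.2"}


def fake_get_version(package):
    if package not in fake_installed:
        raise metadata.PackageNotFoundError(package)
    return fake_installed[package]


tested = {"plotlib": "2.2.1", "tablelib": "1.3.0", "statlib": "0.9.0"}
report = check_dependencies(tested, fake_get_version)
for message in report:
    print(message)
```

Called without the second argument, the function asks the current Python environment for the installed versions, so it could run at the start of an analysis or as one step of a continuous integration job.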