Developing Research Infrastructures the DevOps way

Video in TIB AV-Portal: Developing Research Infrastructures the DevOps way

Formal Metadata

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
Distributed Research Infrastructures are built to support scholars from various disciplines in their work. In the case of CENDARI, a toolset aimed at historians has been developed with support from the European Commission. We will explain how popular open source solutions like Jenkins and Puppet have been employed in building the infrastructure, which is composed of open source applications, both existing and specifically developed.
OK, hello everyone, good morning, and welcome to the first talk today. This is the science track, which is a new feature: we want to have more science-related talks, and this is the first one in this series. The speaker is Mr. [name unclear]. He studied in [unclear] and in Göttingen, did his PhD in [unclear], and is now at the State and University Library in Göttingen, where he works on supporting scientific work, especially on the technical side. So this should be a very interesting talk.

OK, so I want to talk about our adventures in research infrastructures. As was already said, I'm working in IT infrastructure for the digital humanities at the State and University Library in Göttingen. What I'm presenting here is what we did in the CENDARI project, where we did the technology work together with the French research institute, King's College London, the Serbian Academy of Sciences, us in Göttingen, and Trinity College Dublin, so overall an international group of IT people. The project ran from 2012 to 2016 with funding from the European Commission.

"Research infrastructures", as defined by the European Commission, refers to facilities, resources and related services used by the scientific community to conduct top-level research in their respective fields. In our case that means web services and web interfaces that users can use to advance their research. "Distributed" in this sense means that we have components running at different institutions that are connected either through APIs or at least through centralized user authentication, so that users can switch from one application to the other and continue working on their data.

In the humanities we have a high variety of research questions, from historians, who are the main focus group of this project, to people working on literature studies, languages and so on. There are lots of special-purpose solutions for every single one of those research questions and research problems, and there is a strong focus on word processing: quite a lot of what they do, they do in Microsoft Word. One problem that this project was trying to address is access to resources that are held at various cultural heritage institutions, archives and libraries.

CENDARI is a project that built a virtual research environment targeted at historians; the term was originally an enquiry environment for research into historical studies. One aim was to integrate existing resources and existing sources around Europe, enabling access to so-called hidden archives. Many archives are still not accessible, or rather their contents are not accessible through the Internet, and very often even the archives themselves don't have web pages where you can find much information about them, so it is very hard to find the material you want for your research. One traditional problem of a historian is this: he is traveling to some place next month to visit a conference, and what he would like to know is whether there is an archive in the area that has something related to his research problems, so that while he is there he could go to that archive and look at it. It is still a really hard problem to find out what archives there are and what they actually hold.

Another aim was fostering transnational access. Again, it's a European project, so we were working together collaboratively all around Europe, and one idea is also to have people from one country go to other countries and find resources there. The project had 14 partners from 8 European countries, among them humanities scholars, computer scientists and cultural heritage institutions themselves, archives and libraries. We had two main focus areas: one was World War I and the other was medieval research, which are quite different in terms of the content and the questions they have, and also in terms of the material available. In medieval research you have one page, and that is already a lot, whereas for the First World War you even have videos and things like that available, sometimes in the archive, very often not digitally.

This is one view of what the application can do in the end. On the left you have your project tree, and in the middle the text you are working on, which can be a transcript of a document from an archive, a scan, an image that you have. What you see in the colors are the results of the named entity recognition, applied automatically and then edited by the researcher. The colors indicate people, places and so on; people are red, and you see the list of people here. You can highlight them when you hover over these bars. I don't have it visible here, but this is a map where you see the currently selected places. People can share these projects with each other, work collaboratively on them, and at some point maybe even decide to make them available publicly.

That all happens in the application. From the server side it looks a bit different, and that is what we have been focused on; I'll explain this in a few minutes. As I said, the project ran for four years. First I'm going to tell you how we started out, and then I'm going to tell you how we tried to fix some of the problems we had. We started out with
one virtual machine where everyone was playing around. Everyone had access, everyone had root rights, everyone did what they thought was the best idea, which didn't work that well, because everyone was interfering with everyone else: someone broke one config and nothing else worked. So we started to have more machines. Every team had their own machine to play around on, which caused less interference, but things started to grow apart. There was also a long trial-and-error phase: many, many different things were tried out, installed, removed, or partially removed, or just abandoned and left running for months or years.

We also did manual EAD encoding. EAD is a standard for archival description, an XML standard. So people were manually writing, and when I say people I mean historians, scholars, they were manually encoding XML files and checking them into SVN. So they first had to learn how to write XML and then how to use SVN. They did use oXygen as an editor, but still, what happened was that they wanted to have an object with two IDs. oXygen said no, you can only have one ID on an element in XML, but, well, historians know better, they wanted two IDs, which led to lots of problems later when we tried to actually parse those files and make them available in the interface.

There was also a phase where we tried Semantic MediaWiki as editing software, because, well, it's MediaWiki, which is basically Wikipedia, and everybody knows and uses Wikipedia, so that's easy; and it's also semantic, so we get all these great semantic features. Yeah, well, that doesn't work automatically; it isn't magic. You have to actually do something, and that was more complicated than was originally hoped.

So, as I said, things grew apart. We had SLES machines, Ubuntu machines, Debian machines, and a SLES machine with a Debian chroot, because, you know, there is that one package that isn't available for SLES but is for Debian, and it's easier to install a Debian chroot and then install the package from there. But no one was doing upgrades in the Debian chroot, because all the people who were doing upgrades of the system packages didn't even know there was a Debian chroot on that machine. So it was basically the same as installing from source.

We had applications installed from packages, and we had applications compiled manually, directly on the server; again, things that people did once, completely forgot about, never wrote down, never told anyone else, and then left the project. In terms of standards: installing from source usually doesn't give you automatic backup routines and things like that. So init scripts were missing when a server rebooted for various reasons, a power outage in the data center or whatever, and we had to find that one developer who knew how to start that one database on the development server.

And there were lots of experiments. One example: this is the actual URL that we started to use for the reference to the EAD schema file, with an IP address hard-coded in it. Yeah, that's not very sustainable, and it became a problem before the end of the project, because we switched to a new IP for some reason.

Collaboration: well, we had some kind of shared responsibility. One was responsible for the one server and the other one was responsible for the other server, but in terms of the applications it was not so clear once we started putting them together. At first there were several silos, several applications all working on their own, but we needed to combine them, and that started to cause problems because every one was so very different. So again: documentation incomplete, sometimes lacking, big issues of silos and knowledge loss. Knowledge loss in particular, because it is a research project where many of the people, at least, are PhD students paid to work on the project. Usually the advisor is a professor at some institute. He is not exactly paid by the project, because he has his salary, but he uses the grant money to pay his PhD students, who sometimes finish their PhD, or even leave without finishing it. So sometimes, in the middle of the project, the one developer who knew something is suddenly gone to SAP, and then it takes two weeks to actually find the source code he wrote.

So we looked at something else: DevOps, the big buzzword. It is a clipped compound of "development" and "operations": a culture, movement or practice that emphasizes collaboration and communication while automating the process of software delivery and infrastructure changes. That's what the English Wikipedia says. So the important things are collaboration, communication, and at the same time automation, and that's what we tried to use to fix at least some of the problems we had.

When you have a research project across Europe, with people from, as I said, France, Britain, Germany, Serbia and Ireland, and the people working there often come from yet other countries, you have all those cultural clashes in terms of how you work and what you think you should do or not. And all of those teams were working independently, because we are not a company that has one goal or something. These are researchers who have their research projects, and one of those projects is this CENDARI project, and the
goal is to get something that works, so that we can show it to the European Commission. In some sense, of course, everyone wants there to be something that people can use, that has an added value. But at the same time, the one thing you get paid for is to deliver something that is presentable to the funder. No matter how great it is, the most important thing is that the funder approves of the result, because then you get another grant for the next project. So you have to have impact for your project to be valuable, but impact is not measured in the same sense as it is with companies.

Going back to the DevOps words: what we did was include the building of the architecture in the agile development processes which the teams had individually started to use, and we tried to combine them into one process. We also defined this infrastructure, in some sense. What this picture shows is the front-end applications; then in the middle we have the API layer, which connects all the front-end applications to all the back-end applications, which basically came down to two back-end applications from originally six, because we realized we don't need that many of them; and then at the bottom layer you have things like databases and storage. These are the core components, and these are some externally hosted services. As I said, it's a distributed infrastructure, so we have external services that are not part of this infrastructure, that existed before we started creating it and exist independently. One is this red box here, the NERD, which does the named entity recognition and disambiguation. It is hosted by the French INRIA developers, and we only access it through an API. But these are all hosted internally, and you can switch between the applications.

So what are those applications? We have AtoM, which is a PHP application; the name stands for Access to Memory. It's a standard application that's used to encode archival descriptions, where people can enter: well, there's an archive, this is the address, and this is what they have in terms of content. You can describe this, and what it puts out are those standard EAD files that we first tried to enter manually with an XML editor. CKAN is a Python-based repository software where all our data gets stored. We have half a million datasets, or a bit more than that, in there, which were mostly harvested directly from archives through open standards. It's collection content: we have descriptions on item level, which means there is this image which shows that person, held at this archive, but we also have more global things that are just a basic textual description of an archive and its holdings. The main application is the note-taking environment, which was the image I showed you. It's forked from an application called Editors' Notes, which is Python, based on Django. Then we have our very own applications: Litef Conductor, the main back-end component that does all the transformations, and Pineapple, which is a browser for the triple store. We have a Virtuoso triple store in the back end, and this is the browser for that knowledge base. We are using MySQL, PostgreSQL, Elasticsearch and Virtuoso.

So what we needed was a lot more communication and a lot of automation. As I said, we are not talking about scale; we didn't need hundreds of servers, which is often why you introduce automation. What we were interested in was defined state and reuse.

What we tried to do in order to get more communication was to shorten the sprints and have more releases. We went from maybe one release every six months to more or less weekly ones. We had weekly sessions with the developers and the historians who were using the current version of the software, talking about current issues and what the next steps were to fix something, to get closer to what
the application was supposed to do. It was possible to create tickets directly from the application, so that people could say: well, this just failed, for that reason.

As for config management: we chose one Ubuntu version; that was two years ago, so we chose 14.04. And we chose Puppet for config management. We set up a staging and a production environment, both managed through Puppet. We use Jenkins to build the software, and when I say "build" with PHP: well, there is still some processing of the CSS files and static files. Then we package everything as Debian packages with FPM, because they are very easy to version. In terms of Puppet, you can easily know which version you have installed, and you can easily go back to an older version if you need to. We did this even for static files like documentation, because it is just the easiest way to deploy the software.

So what does it look like? We have the developer with his laptop, who pushes his changes to GitHub, which then triggers a build on the Jenkins server, which creates the Debian package and puts it into aptly, an apt repository, and then we install it on the server. The servers are managed by Puppet, for which we have our internal project code hosting, and from that you can create a Vagrant machine on your local laptop, which looks, well, identical to the production system, with the obvious differences of data and things like passwords, hostnames and certificates; apart from that it is identical. You can also test your software against the latest versions from other people locally if you want to, or you can just push your changes to GitHub and they will go to the staging environment. I'll explain that in a moment.

We had lots of meetings, mostly virtual, because with people all across Europe it is very expensive to get everyone together, so mostly Skype sessions. Everything went into version control, including the code for the infrastructure. Everyone had access to everything, so every single developer was still able to change the production Puppet code, and was also able to log into the server and make changes manually, which then of course caused different problems, but they knew that was not a good idea. We automated builds and tests upon push. We installed all applications from one apt repository. The infrastructure and the applications were developed together, because this image I showed you of what the infrastructure looks like is something like the tenth iteration. Before that we had many more back-end applications, and things were moved around. In this image there are actually two servers, the front-office server and the back-office server, which host two different parts of the stack. Things were moved around between them, and all these changes happened simultaneously with the changes to the applications. Also, on a second level, whenever we had a new version of a tool deployed, we could change, for instance, its config file or things like that at the same time.

So we had a Vagrant machine, the staging system and the production system, and they were all basically identical except for data. That difference is very important, because a triple store with, well, a few hundred thousand triples and a triple store with a few billion triples behave very differently. Elasticsearch, for instance, is much more flexible in that respect.

As I said, there were two environments, the staging and the production environment. They were created from two different branches in our Puppet git repository, which means we had a staging branch, where we first tried all changes, and when we were satisfied with the changes we merged them into the production branch, and then those changes went into the production systems. Our apt repository had two, as it's called, components. Jenkins always uploaded the Debian packages into the staging component, and that is what the staging server used to install the packages: it always installed the latest package available from that component of the repository. When we wanted to deploy something to production, we just copied the Debian package from one part of the repo to the other, and the production server got that version too. That allowed for coordinated changes: the new version of a component and the changes to its config file all happened together.

Some things we learned. Very nice is the reuse and reproducibility: we are now able to recreate the entire server from scratch if we need to.
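The promotion scheme with a staging and a production component can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the project's actual tooling: the `Repo` class, the package name `notes-app` and the version strings are assumptions made up for the example, and real Debian version ordering has more rules than the simple tuple comparison used here.

```python
"""Toy model of an apt repository with a staging and a production component."""


def version_key(version: str) -> tuple:
    # Simplified ordering: "1.10.0" -> (1, 10, 0). Real Debian version
    # comparison has more rules (epochs, tildes, revisions).
    return tuple(int(part) for part in version.split("."))


class Repo:
    def __init__(self):
        # component name -> package name -> set of available versions
        self.components = {"staging": {}, "production": {}}

    def upload(self, package: str, version: str) -> None:
        """What the build server does after a build: always into staging."""
        self.components["staging"].setdefault(package, set()).add(version)

    def promote(self, package: str, version: str) -> None:
        """Copy one specific, already tested version to production."""
        if version not in self.components["staging"].get(package, set()):
            raise ValueError(f"{package} {version} was never in staging")
        self.components["production"].setdefault(package, set()).add(version)

    def latest(self, component: str, package: str):
        """The version a server tracking this component would install."""
        versions = self.components[component].get(package)
        return max(versions, key=version_key) if versions else None


repo = Repo()
repo.upload("notes-app", "1.9.0")    # hypothetical package and versions
repo.upload("notes-app", "1.10.0")
repo.promote("notes-app", "1.9.0")

print(repo.latest("staging", "notes-app"))     # 1.10.0
print(repo.latest("production", "notes-app"))  # 1.9.0
```

In this model the staging server always picks up the newest build, while production only moves when someone explicitly promotes a version, which is the property the copy step between repository components provides.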
Or if we want to. And if we need a test instance of something, we can just create one. It is no longer the case that we need that one person who knows how to deploy that one piece of software, who is unfortunately on extended sick leave and will only be back in a month, so everyone has to wait for him to return to continue work on the project.

In scientific terms, reproducibility of your whole software stack is very important. We have a defined state of the software and also of the configuration, and in some sense this is provenance data on the infrastructure. It is of course not entirely complete, because depending on what level of provenance you need, it can be very important which version of which library you have installed, which we don't manage to that extent. If you install Ubuntu 14.04 today, there are differences from the Ubuntu 14.04 of two years ago. The biggest thing is OpenSSL, which had quite a few changes over the years, with drastic results. These are probably not that important to some of the applications that we are using, but still, those are big changes in the infrastructure that we are not entirely managing.

We had shared ownership, in the sense that most developers don't really care about all those OpenSSL problems, which cipher order you are implementing on the server, the NTP config, or what exactly your firewall settings are. For them the best firewall setting is "everything's open", because then the developer can do whatever he wants. But they still want to change something for their application, such as a header setting for a new feature and things like that, and everyone was able to do that. Puppet best practices can help a lot there; there is a talk about that tomorrow, I believe. So we were able to share that: developers didn't have to care about things like SSL first, and they were still able to set up their own things themselves.

One thing is that security settings like the firewall and so on are important right from the start, even before you have all the user data in there. Otherwise you end up in a situation where you never thought about it, and then you have all the user data and suddenly realize that you have the [unclear] of someone, which
is also something you have to take care of.

Config management causes a lot of overhead. It is harder to set up a server with config management than it is to just do "apt-get install" and be happy with what you have. You have to rethink how you work with the systems, because you no longer SSH into the server to make a change: you make the change in your Puppet code, you check that in with git, then it gets deployed, and then hopefully the thing you wanted to happen will happen. In some sense it is a new programming language you have to learn, and by "you" I also mean system administrators, who usually don't consider themselves developers, and now they have to do programming, also in the sense that they have to use git and workflows like branching, because, as I said, we had two branches mapping to the different environments. So changing that one setting in that one file becomes much more complicated. Of course, now it changes on all the servers. If it's only two servers, maybe it is faster to do it by hand, but apart from this research infrastructure we have more of them, and NTP settings and things like that are identical on all of them. If we want to change something on 140 servers, doing it by hand starts to get more complicated.

Also, Puppet, like most config management systems, will undo what you did manually. So if a developer, or the admin, just SSHes into the system and changes a setting to make something work, Puppet will revert that on its next run, and then it is broken again. This led to some friction. People were afraid of the automation because they never knew what would happen next. Well, of course Puppet knew what it was going to do, but the thing is that you can't see the whole Puppet code unless you are actually working on it all the time.

And when you control the configuration through something like Puppet, you don't necessarily end up with the defaults that you expect for some settings, because the Puppet default is a bit different sometimes. This can cause problems when you think: well, I just install the package and then everything works. With the default setting in the Ubuntu package it works, but with the default setting that gets put there by Puppet it won't. That happened to us, and it took us some time to trace. And of course it is always "the problem with the automation", because the automation does things it is not supposed to do. But the other thing is also that some sort of automation has its own view of the default, which is different from the system default and may not be the best setting either.
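To make the revert behaviour concrete, here is a minimal sketch of what a config management run does conceptually. It is not Puppet and not Puppet's API: the `converge` function and the setting names and values are hypothetical, chosen only to show why a manual change disappears on the next run while unmanaged settings keep whatever value they had.

```python
"""Conceptual sketch of a config management run (not Puppet itself)."""


def converge(actual: dict, declared: dict) -> list:
    """Force every declared setting onto the system; report what changed.

    Settings that are not declared are left untouched, so they keep the
    distribution default or whatever was set by hand.
    """
    changes = []
    for key, wanted in declared.items():
        if actual.get(key) != wanted:
            changes.append((key, actual.get(key), wanted))
            actual[key] = wanted
    return changes


# Hypothetical settings: the packaged default and the managed value differ.
server = {"max_upload_mb": 2, "log_level": "warn"}
declared = {"max_upload_mb": 16}

print(converge(server, declared))   # [('max_upload_mb', 2, 16)]

# Someone logs in and changes settings by hand to make something work.
server["max_upload_mb"] = 64
server["log_level"] = "debug"

print(converge(server, declared))   # [('max_upload_mb', 64, 16)]
print(server)                       # {'max_upload_mb': 16, 'log_level': 'debug'}
```

The managed setting snaps back to the declared value on every run, whereas the hand-edited but undeclared `log_level` survives, which is one reason the state of a machine is hard to predict without reading the declared configuration.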
People also complained that it is far too early to put everything into config management. You can use config management very well to configure NTP, firewalls and Apache settings, but it is "far too early" to decide where the config file lives and what is in the config file, things like that; it is "far better" to do that at the end, like, when we are two days before the end of the project, then we can think about this.

And, I mentioned this before, system administrators usually don't see themselves as developers, so learning git and learning Puppet is overhead. One thing I have heard several times is: "I don't know how to code", or "I'm too embarrassed to show my code on GitHub; I can't possibly publish my little shell script that does something, because it was just hacked together in a few seconds." The other common fear is that if we publish our configuration, everyone will know how we set up everything, in every tiny detail, and they will find all the attack vectors that we overlooked. That is not entirely wrong, but on the other hand many people do it anyway, because many of those things you can find out as well if you know what you are looking for. I should also say that our Puppet code isn't actually on GitHub for everything: for most of the configuration management we are using an internal git repository, to which we chose to grant everyone access. Others do this: the United Kingdom government has all of its Puppet code completely available on GitHub, well, obviously except for passwords, certificates and stuff like that.

How we did the Puppet code; this is more Puppet-specific. First we tried some custom abstractions which basically modeled the image of our infrastructure that I showed you. We had these definitions of the front-end machine and the back-end machine, on those were the components, our software, that we installed, and they in turn relied on resources like databases and Elasticsearch. If we decided to switch one component from one role to another, from the back end to the front end or something, we just had to change one line. And we had one internal Puppet module that we shared with the other research infrastructures we have, which sets up things like passwords, certificates, firewalls and so on. But that is hard to migrate once a project ends, because now you have two completely different git repositories and you want to put all of this together. You want to make a change to things like the SSL settings because of the latest problems, and then you still have to go in and change it in several code repositories, which still causes overhead. So we changed that into what the standard is in terms of Puppet now. We basically have one Puppet module on GitHub which you can use to install the CENDARI applications. Of course it relies on our apt repository, which is publicly available now.
and it's open source and reusable and we have these chat roles and profiles which is the standard
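
The roles-and-profiles layout mentioned above can be sketched roughly as follows. This is not the project's Puppet code, only a small Python model of the idea; every profile, role and resource name in it is a hypothetical placeholder.

```python
# Illustrative model of the "roles and profiles" pattern.
# All names below are hypothetical, not taken from the CENDARI modules.

# A profile bundles the resources that one piece of technology needs.
PROFILES = {
    "base": ["ntp", "firewall", "ssl_certificates"],
    "webserver": ["apache"],
    "search": ["elasticsearch"],
    "database": ["database_server"],
    "application": ["research_app"],
}

# A role is just a list of profiles; each machine is assigned exactly one role.
# Moving a component between machines means moving one entry between lists.
ROLES = {
    "frontend": ["base", "webserver"],
    "backend": ["base", "database", "search", "application"],
}


def resources_for(role: str) -> list[str]:
    """Expand a role into the ordered, de-duplicated resources it installs."""
    seen: dict[str, None] = {}
    for profile in ROLES[role]:
        for resource in PROFILES[profile]:
            seen.setdefault(resource, None)
    return list(seen)
```

In this model, switching the application from the back end to the front end would mean moving the single `"application"` entry from one role's list to the other, which is the "change one line" property described in the talk.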
And that we can use across all of our projects, and through those we pull in the shared settings and so on.
Yeah, a bit on sustainability. As I said, one problem is that the project only runs for four years. Once the project is over, everyone moves on to the next research project and people leave. But now you have this application, hopefully running. In the case of CENDARI we have about a thousand users a month, which is a lot for research applications in the humanities. Now there are people who are using it and there is data in there, so you have to keep installing software updates, package updates, security updates, to be sure that everything will continue working and that everything is still secure. Maybe you have to look at the code to fix something.
And this is why we think that, at least to some extent, this will help. It won't help us if there's a problem with the actual applications that our developers programmed, but it will at least give us some way of fixing these. I mean, those are proper Python and PHP applications. If in two years' time we have to move to a new operating system version, it's probably possible to recompile them on the new version and reinstall them on the new system. So the hope is that this will enable us to get to that stage as well.
But before that, at least we now have this as a centralized application which is integrated into other projects. One of them is of course DARIAH, the Digital Research Infrastructure for the Arts and Humanities, which is a large European project that tries to sustain some of these projects, some of these infrastructures: to keep them alive, keep installing security updates, monitor them, notice if something breaks, like storage has disappeared or the machine is really broken, and decide whether or not there are enough users to invest some people's time to go in and update things. And this became much easier with this aligned config management, where we have basically the very same layout and we know where to look if we want to change something. We did this for CENDARI and also for TextGrid, which is a German research infrastructure.
Let's come to the end. It does cost a lot of time initially, and we also experienced a lot of friction, but it later paid off, because one thing we had was this feature of trusted automation. So we have one bug fixed in staging, and it's half an hour before this big workshop with 30 international historians coming in, and they all want to see how great CENDARI is, but there is this one critical bug. Well, it's working in staging, so let's just deploy it to production, because the last 30 times it worked as well. And it worked then as well. So that was one of the good things. And with that, I'll take your questions.
Yep, thank you. So, as we don't have the mics, I'm repeating the question: it was about named entity recognition, what training parameters we have, and how to implement them with the Puppet code. The thing is that the named entity recognition is happening outside, so it's not managed by our Puppet code. INRIA is still developing that. They originally did some training with Wikipedia texts, which are a bit different from the texts that historians write. So what they then started working on was that the interface has two steps: you click on a button and the automatic named entity recognition runs through the text, and then you can go in and fix the entities and say, "well, this is actually not a place but a person", or "this is a person it forgot", and that can then be fed back into the named entity recognition to improve the results. In terms of the parameters: if it's the contents of a configuration file that doesn't have to change dynamically, then you can manage it with Puppet. If it's something that changes dynamically over time, then you can't. So Puppet always gives you access to the configuration but not to the data; it's really hard to manage the data. Also, what our Puppet code does not do: if you set up a new server, it will not initialize any databases. You have to do that manually. There are scripts, and it deploys the scripts to do that, but you have to run them manually.
[Audience question, inaudible]
Yeah. So the first question is a bit more on the change in mindset, and the second was on putting this in different locations, in a different setup. Maybe the second one first: we have not done that yet. As I said, we have two infrastructures, CENDARI and TextGrid, and we always have one instance of each, but in theory it's possible to set it up again. With TextGrid we are in fact in discussion with a group from Switzerland on exactly that, because they really want to keep this stuff in Switzerland and not on servers in Germany. In many cases it's still easier to put it on servers that are at a university somewhere in Europe than to put it on Amazon AWS, for instance, but it's still sometimes politically important to have it in your own country.
On changing the mindset: the problem is that you always know what you are changing in your config files. You have installed your applications several times, your databases several times, on that specific version of an operating system, and you just log in, you know exactly where the file is, you just write something in there and then you're done. The problem that arises here is that we have two staging environments and there are differences, like what password they use, which database they use, where the server lives. So you have to have ways to say: well, depending on whether this code is applied on this server or on that server, it is different. And these abstractions are what makes it complicated to change something. If you want to change one setting that's the same everywhere, that's probably very easy, because you just open the file and write it in there. But if it depends on which server you will actually be applying the thing on, then you have to put in variables that are then resolved on the server, depending on the host for instance, which creates a lot of overhead. And that caused a lot of friction, because it was much more complicated to change something.
And also, once someone had decided on something, well, that was it. So they were like: well, the code's on GitHub, you can install it, there are basic installation instructions. Those included a Makefile which, to compile the static files, required me to have an Elasticsearch instance running with the correct Elasticsearch index set up. On a production server that's the case anyway, so that's not a problem, and on a development machine you need that as well. But you don't need it on Jenkins, where you just want to compile the static files. And this causes a lot of friction, because you have to change how you are working. You can't just have one single make target that does everything; you need one specific make target to create the static files that doesn't have dependencies on things you don't really need to compile static files.
And then in terms of deploying the package that you create on the Jenkins server to your actual production server, you have changes. We at first had a Jenkins server that ran on Debian, and there are also some Node applications in there, and Node lives, well, I'm not entirely sure, in /usr/lib or /usr/bin or just /bin, somewhere. There's a difference between Ubuntu and Debian, and that caused problems with symlinks and things like that. So these things are hard, and that took a lot of time initially, but as I said, it paid off in the end.
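
The host-dependent settings described above (different passwords, databases and server locations per environment, different paths on Ubuntu and Debian) are usually handled with a layered lookup, where the most specific layer that defines a key wins. Puppet's own tool for this is Hiera; the sketch below is not Hiera, only a minimal Python illustration of the idea. All host names, keys and values are made-up assumptions.

```python
# Minimal layered settings lookup. Every host name, key and value here is a
# hypothetical example, not taken from the CENDARI or TextGrid setup.

COMMON = {"db_port": 5432, "node_binary": "/usr/bin/node"}

BY_OS = {
    "debian": {"node_binary": "/usr/bin/nodejs"},  # assumed path difference
    "ubuntu": {},
}

BY_ENVIRONMENT = {
    "staging": {"db_host": "db.staging.example.org"},
    "production": {"db_host": "db.example.org"},
}

BY_HOST = {
    "staging2.example.org": {"db_host": "db-alt.staging.example.org"},
}


def lookup(key: str, host: str, environment: str, os_family: str):
    """Return the value from the most specific layer that defines the key."""
    layers = (
        BY_HOST.get(host, {}),                # most specific
        BY_ENVIRONMENT.get(environment, {}),
        BY_OS.get(os_family, {}),
        COMMON,                               # least specific
    )
    for layer in layers:
        if key in layer:
            return layer[key]
    raise KeyError(f"no value for {key!r} in any layer")
```

A setting that is the same everywhere lives only in `COMMON`, which corresponds to the easy case from the talk. The overhead comes from the settings that need an entry in one of the more specific layers.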
Because it was easier to get a new version into production, no one was afraid to actually deploy that thing just half an hour before the workshop. That literally happened.
[Audience question, inaudible]
Yes and no. So Jenkins did run the tests if they were in... sorry, repeating the question: whether we are using automated tests. We are using automated tests where they exist: if the script that runs in Jenkins, say the Maven target, includes testing, then the tests will be run. But, well, the standard problem was that it was way more important to finish the product than to finish the tests, which will cause a lot of problems when someone has to go in and actually change things and fix bugs. During the project it was more like: the developer changed something, it worked on his machine, he deployed it to staging, it worked in staging, and he was sufficiently confident to push it to production.
[Audience question, inaudible]
Yeah. Yes, we did that originally. So the question is why we were using different operating systems in the first place. Basically, every team got a machine and they did what they wanted; that was how this happened. And SLES came in because the university data center was using that, it was the default, and so they gave us SLES machines. Some people reinstalled them with a different operating system, others didn't. That's how we ended up with the different systems.
[Audience question, inaudible]
OK, so acceptance by humanities researchers. Most of the development was done by computer scientists, together with the researchers who joined us in those weekly test sessions and who did the actual testing. They did not work so much on GitHub. What they did was this: we originally had a Jira instance where they were tracking things. As I said, we had these data from these archives in our repository, and they had to go to those archives and talk to them so that they would allow us to take the data into our database, and there were agreements signed and so on. This they tracked from the beginning with Jira. They specifically chose Jira as this workflow engine where you can do things like that. So first a person goes out to the institution; once this is all done, the ticket gets moved on to the project coordinator, who has to sign the paperwork with this archive; then it goes on to the developer, who does the actual harvesting of all the data from them.
And when I said we have the ticket system directly in our application, that was also originally the Jira ticketing system. In the end we put our tickets on GitHub, because we no longer use Jira. In that case people needed to have a GitHub account to create the issues, but those who were used to creating issues by the time the project was basically over, they kept doing that as well. But it was a hard process. Also, in the beginning, when they were forced to use SVN, well, they didn't. And they were manually encoding some XML files. When I joined the project I learned that we had historians who had no idea what they were actually trying to do with these XML files; they were just being told: well, it has to be this kind of file, it has to be the standard, and here is this validator. Well, things happened like ignoring the validator, errors like "you can only have one ID".
Yes, thank you.
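
The archive-agreement workflow tracked in the ticket system (researcher contacts the archive, project coordinator signs the paperwork, developer harvests the data) can be modelled as a simple linear state machine. The sketch below is only an illustration in Python, not an export of the project's actual Jira workflow; the stage names are assumptions.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names for one archive's ticket."""
    ARCHIVE_CONTACT = "researcher talks to the archive"
    PAPERWORK = "project coordinator signs the agreement"
    HARVESTING = "developer harvests the data"
    DONE = "data is in the repository"


# Each stage hands the ticket on to exactly one following stage.
NEXT_STAGE = {
    Stage.ARCHIVE_CONTACT: Stage.PAPERWORK,
    Stage.PAPERWORK: Stage.HARVESTING,
    Stage.HARVESTING: Stage.DONE,
}


def advance(stage: Stage) -> Stage:
    """Move a ticket to the next stage, refusing to move past the end."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"ticket is already in its final stage: {stage.name}")
    return NEXT_STAGE[stage]
```

The point of using a workflow engine for this, as described in the talk, is that the ticket changes owner at each step, so nobody has to remember who is responsible next.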