Molecular Evolution, Genomic Analysis and FreeBSD

Video in TIB AV-Portal: Molecular Evolution, Genomic Analysis and FreeBSD

Formal Metadata

Molecular Evolution, Genomic Analysis and FreeBSD
Title of Series
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
The Bielawski group at Dalhousie University is focused on molecular evolution, phylogenetics and genomics. At the moment, the research is entirely computational, involving model development, simulation, and analysis of real genetic data. Since 2009 we have used FreeBSD almost exclusively for our work. We use our FreeBSD-based cluster for 1) running computationally demanding models of molecular evolution and genomic analysis and 2) storage of genetic sequence data. In this talk I will introduce you to the type of work we do and describe how FreeBSD meets the challenges.
It's a little bit after, so thanks for coming. My name's Joseph Mingrone. I'm a graduate student studying statistics at Dalhousie University. That's in Halifax, Nova Scotia, so go about as far east as you can on the mainland in Canada and you'll find us in Halifax. I'm happy if anybody wants to interrupt me and ask questions; as long as we don't get too far off track, I think we'll have lots of time. I'm here today to tell you about our research and how we use FreeBSD to get our research done. We're a small
group; we usually fluctuate between 5 and 10 members. We study in the fields of molecular evolution and genomic analysis. In a nutshell, we take genetic sequence data, DNA data, and we analyze it to infer processes and to say something about the organisms from which the DNA came. The research is purely computational, and by that I mean we don't have a wet lab. We're not the ones sequencing the data; we get it from collaborators and from online databases. But the field is multidisciplinary, in that we have statisticians, my supervisor has a background in genetics, and we collaborate with biologists, health researchers and pharmacologists. We're actually working with a human paleobiologist at the moment. So it's quite multidisciplinary. My
plan is to tell you a little about the hardware we have, and then I'll switch gears and tell you a little bit about two research tracks within the group. The first one is modeling evolution at the molecular level, and the second track is microbiomes and metagenomics. Then I'll switch gears again and tell you about software, design decisions and other observations, a little bit about the workflow, and things like that.
Our primary computing resource is a cluster that we purchased from Sun in 2006. We call it Awarnach. My supervisor's into Arthurian legend, and apparently Awarnach is a giant, or has some sort of significance related to a giant, in Arthurian legend. The other systems have names in the same vein, so if you're into Arthurian legend, that means something. It has a Sun Fire V40z master node with an AMD-based Opteron chip, and it's got 16 gigs of ECC RAM. It came with 20 compute nodes, X4100s, also AMD-based, with ECC RAM, 73-gig disks and four gigabit Ethernet ports. Of those 20 nodes, 18 are still in use. One of them was a little flaky right from the beginning; I think it was the disk controller. In around 2013 another one gave up the ghost. We also had another rack. I'm a graduate student, I don't know a lot and I don't have a lot of experience with hardware, I haven't been around much, so I probably can't describe this very well, but we had this other rack that had these
1U vertically stacked nodes. They were grouped in groups of four, they all shared the same power supplies, and they all had their own power button on the front. There was no remote management on these nodes, so if we had to power cycle one, you'd go in and hit the power button and something random would happen every time. That node may or may not come on, and the three beside it may or may not come on, or might go off. Fortunately, or unfortunately, we had a little bit of a natural disaster, or rather an unnatural disaster. We're attached to the Life Sciences building, and they were working on the roof there and apparently had some equipment out, and in the hot sun a fire started. All the sprinklers went on in the Life Sciences building. We're in the adjacent building, the math and stats building, and when the sprinklers went on, all the water went down into the basement server room. Apparently there was some shorting and some smoke, and we had a funeral for those nodes. So we're down to one rack for our computing resources. Another thing we had to replace was the switch. We had these issues where I'd be on a compute node and try to ping, and it wouldn't respond; you'd go to the console and everything would be fine. But we had these delays, so you'd SSH into a compute node and wait, wait, wait. I just assumed that it had something to do with NFS, tried to look into it, and couldn't figure it out right away. Later on, when I was using link aggregation after we got a storage server, it would work for like five minutes and go down. Finally I connected by a serial cable to the port and realized the firmware was constantly crashing. Unfortunately, by this time SMC didn't even list this model on their site, so there was no support. So we swapped it out for a 40-port Cisco, which ended all those problems. We also have a Cyclades system, along with ILOM, for remote management. That generally works well, as long as the Java plugin works in the browser,
which, well, it's worked well recently. So in
academia, I don't know if anybody here is involved in academia, but funding tends to come in boom-and-bust cycles. After a year or two we had outgrown the storage, but we never had funding to buy anything new. There was an old 3U system in the server room that nobody else was using. It had a decent disk controller, so I threw a couple of drives in there and used GEOM to join them. That worked relatively well, but we filled it up pretty quickly. Finally, in around 2012, we had some funding and we got a new storage server, a 1U server. It's actually the only Intel-based system we have in the server room. I got four 60-gig SSDs. I planned to mirror two of them for the ZIL, and that would work, but after doing some testing I realized that NFS mounting asynchronously gave quite a bit better performance, and it's worked well for us. I use the other two SSDs for L2ARC, but I've seen that we're only getting about a 20% hit rate, so I'm probably going to get rid of the L2ARC soon. It's got a simple LSI SAS controller, and most importantly we got a 40-bay chassis, so we have room to grow. Right now we only have 10 disks in there. We went with the Western Digital Reds simply because they're cheaper, and so far they've been fine. I was probably a little too nervous; I used RAID-Z3, and if I could go back and do it again I'd probably set up ZFS a little bit differently, but it's generally working well for us. At the same time, we also got a new compute node. It's also an AMD-based system. It's got four 12-core chips, so 48 cores, and 256 gigs of RAM. The main reason we went with the AMD-based systems, even though the technology seems to be falling behind, was the price. This ran us about 7 thousand Canadian. When I looked at Intel-based systems with hyperthreading, I think we'd get 20 cores, and with the same amount of RAM it was quite a bit more expensive; if I'm not mistaken, we were really pushing 20 thousand dollars. I looked at spec.org, so that's
how I did most of the research to find out what we could get and what we could afford. If you adjust for the cost, AMD still gives you pretty good bang for your buck. One important point in the calculations is that we don't have to pay for power or deal with cooling, and these things throw a lot of heat, so keep that in mind, I guess, if you're looking for something similar. All right.
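As a rough back-of-the-envelope check on the price comparison above, the sketch below divides the two approximate prices quoted in the talk by the core counts mentioned. The function name is hypothetical, the Intel figure is the speaker's recollection rather than a real quote, and the calculation deliberately ignores per-core performance, power and cooling.

```python
# Approximate figures quoted in the talk, in Canadian dollars.
amd_price, amd_cores = 7_000, 48        # four 12-core AMD chips
intel_price, intel_cores = 20_000, 20   # recalled Intel quote, same amount of RAM


def cost_per_core(price, cores):
    """Purchase price divided by the number of physical cores."""
    return price / cores


amd = cost_per_core(amd_price, amd_cores)
intel = cost_per_core(intel_price, intel_cores)

print(f"AMD:   ${amd:,.2f} per core")
print(f"Intel: ${intel:,.2f} per core")
print(f"Intel / AMD cost-per-core ratio: {intel / amd:.1f}")
```

A fairer comparison would weight each core by a benchmark score, which is the kind of adjustment the speaker describes doing by hand.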
I'll switch gears now and talk a little bit about our research. Evolution, except in extreme cases, is a slow, gradual process, so the study of evolution is really a study of the past. We study it by using clues from the present day to make inferences about the past. For example, we might look at present-day species and compare their morphology to the fossil record. If we have a species with a really short generation time, like fruit flies, we can grow them in the lab and observe them over generations. But in general, the genetic material provides a much clearer picture; it has much more information for inferring the past. So that's exactly what we use: we use the genetic material. More
specifically, we look at the evolution acting on proteins, or on the individual units of proteins, the amino acids. We classify the selection pressure into three different categories: purifying selection pressure, which is selection against change; neutral evolution, where a change is selectively inconsequential; and positive selection, where change is selected for. Positive selection is the needle in the haystack, if you will, and it's what we're typically most interested in. A
useful analogy for understanding the situations or places where we see positive selection in the natural world is an arms race. We have two entities with an adversarial relationship. One makes a change and has some sort of advantage, so the other side also has to make a change, and we go back and forth, back and forth, and change really quickly.
In the natural world we see this sort of situation with pathogen-host relationships. HIV is a really good example because, number one, the virus can mutate really quickly, and it has really short generation times, so it can adapt really quickly to avoid the host or avoid the drugs that we give patients. For example, we might have a virus, and because of the higher mutation rate there's some variability in the viruses. The majority of the virus strains in the patient are susceptible to a drug. We supply the drug and it's quite effective, but after a few weeks those viruses that have changes that make them resistant to the drug become more common, and then the drug is no longer effective. The important point here is that understanding the evolution of the virus can help as part of the approach for controlling it. As I said, we might give drug A for a while; it's effective, and then it's no longer effective. We might try a cocktail of drugs, or take them away so the dynamics of the virus change, and then give the drug in really high doses. So again, the important point is that understanding evolution can be helpful. So we looked at the
HIV virus, at a couple of its genes. The envelope gene codes for a protein that helps the virus detect certain types of cells and helps it invade the cell. DNA polymerase encodes a protein that helps the virus insert its genetic material into the host DNA. And viral infectivity factor encodes a protein that's responsible for inhibiting the host's antiviral activity. We found 3 of 91, 11 of 947 and 10 of 190 amino acids to be under positive selection in each of those cases. Now I'm going to tell
you a little bit about how the models work, and to do that I'll give you just a little bit of background for anybody that's not familiar. DNA is a macromolecule. It's made up of individual units called nucleotides, and depending on the nucleotide base there are four different types of nucleotides: adenine, cytosine, guanine and thymine. So we have DNA strands of nucleotides, and those nucleotides in groups of three are codons, and it's codons that code for amino acids, the building blocks of proteins. So when we talk about the
genetic code, we mean which codons code for which amino acids. In this table, which represents the universal genetic code, the row labels are the first nucleotide in the codon and the column labels are the second nucleotide. For example, to find out what the codon TCG codes for, we go to the T row, find the C column, find the cell, and see that TCG codes for the amino acid serine. What's interesting here is that we have four different nucleotides and three make a codon, so there are 4^3 = 64 possible codons. In fact, in the universal genetic code only 61 of them code for amino acids, but there are only 20 amino acids, so the code is redundant. So if we took a codon and swapped out one of the nucleotides, one of two things can happen: the amino acid it codes for may change or may not change. If we make that
substitution and there's no change, we call it a synonymous substitution: swap a nucleotide in the codon, still the same amino acid, a synonymous change. Conversely, if we swap out a nucleotide and there is a change, it's called a non-synonymous substitution. Now consider selection on the amino acid. If changes are inconsequential in terms of selection, then we expect the rates of non-synonymous and synonymous substitutions to be about the same. When there's purifying selection, so a change to the amino acid reduces fitness, we expect the ratio of the non-synonymous to the synonymous substitution rate to be less than 1. Remember, synonymous substitutions don't change the amino acid product, so in terms of selection they're inconsequential. And vice versa: when there's positive selection pressure, we expect the ratio to be greater than 1. The important point here is that the ratio of these two types of substitutions, which we often refer to as omega, is a measure of the strength and direction of selection pressure.
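The codon counts and the synonymous/non-synonymous distinction described above can be sketched in a few lines of Python. This is only an illustration, not the group's software: the table is the standard genetic code written in the usual T, C, A, G order, and the function name `classify` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard genetic code);
# '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(codon, pos, new_base):
    """Classify a single-nucleotide change at position pos (0, 1 or 2)."""
    mutant = codon[:pos] + new_base + codon[pos + 1:]
    if CODE[codon] == "*" or CODE[mutant] == "*":
        return "stop"
    return "synonymous" if CODE[codon] == CODE[mutant] else "nonsynonymous"


sense = [c for c, aa in CODE.items() if aa != "*"]
amino_acids = set(CODE.values()) - {"*"}
print(len(CODE), len(sense), len(amino_acids))   # codons, sense codons, amino acids
print("TCG codes for", CODE["TCG"])              # S is the one-letter code for serine
print("TCG -> TCA:", classify("TCG", 2, "A"))    # third position, still serine
print("TCG -> TTG:", classify("TCG", 1, "T"))    # second position, serine to leucine

# Tally every possible single-nucleotide change from a sense codon.
counts = {"synonymous": 0, "nonsynonymous": 0, "stop": 0}
for codon in sense:
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                counts[classify(codon, pos, base)] += 1
print(counts)
```

The tally shows why the ratio has to be computed per opportunity: far more single-nucleotide changes are non-synonymous than synonymous, so raw counts alone would overstate the non-synonymous rate.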
OK, so how do we estimate these parameters? We use something called a Markov process. What a Markov process does is model changes from one state to another, and in this case our states are the 61 sense codons. An important property of a Markov process is that the probability of going from one state to another depends only on the current state. So if we want to see the probability of going from one codon to another, say to CTA, the present states are in the rows and the possible future states are in the columns; we go to the row and the column, and it's the cell in the matrix that's highlighted there. The important point is that those probabilities include those parameters.
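To make the idea of a 61-state matrix that carries the parameters concrete, here is a minimal sketch of a codon rate matrix. The talk does not name the exact model, so this assumes a simplified Goldman-Yang-style parameterization with equal codon frequencies, a transition/transversion ratio `kappa` (an assumption; it is one of the other parameters such models commonly include) and the ratio `omega` discussed above. The function names are hypothetical. Note that these are instantaneous rates; the transition probabilities over a branch come from exponentiating this matrix.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]            # the 61 states
TRANSITIONS = {frozenset("AG"), frozenset("CT")}        # purine<->purine, pyrimidine<->pyrimidine


def rate(i, j, kappa, omega):
    """Relative instantaneous rate from codon i to a different codon j."""
    diffs = [(a, b) for a, b in zip(i, j) if a != b]
    if len(diffs) != 1:
        return 0.0                      # only one nucleotide changes at a time
    r = kappa if frozenset(diffs[0]) in TRANSITIONS else 1.0
    if CODE[i] != CODE[j]:
        r *= omega                      # non-synonymous changes are scaled by omega
    return r


def rate_matrix(kappa, omega):
    """61 x 61 matrix: rows are present states, columns are future states."""
    q = [[rate(i, j, kappa, omega) if i != j else 0.0 for j in SENSE] for i in SENSE]
    for k, row in enumerate(q):
        row[k] = -sum(row)              # each row of a rate matrix sums to zero
    return q


Q = rate_matrix(kappa=2.0, omega=0.5)
idx = {codon: k for k, codon in enumerate(SENSE)}
print(Q[idx["TCG"]][idx["TCA"]])   # synonymous transition: kappa
print(Q[idx["TCG"]][idx["TTG"]])   # non-synonymous transition: kappa * omega
print(Q[idx["TCG"]][idx["AAA"]])   # three differences: rate 0
```

Changing `omega` changes every non-synonymous entry, which is how the selection parameter ends up inside the probabilities that the likelihood is built from.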
So what we do is we take our sequences, the related sequences, and we use a phylogenetic tree, which is the inferred evolutionary history between the sequences, and we go through all the possible substitutions. So at the first site here: these two sequences had this common ancestor, and there was some codon state here that we don't know, and it switched to TCG at some point. What we do is traverse through all those possible state changes using our Markov process. A little bit of a problem here is that we don't know what these unknown states were, so we have to do something that's computationally intensive. We use these conditional probabilities, where we condition on all possible states in the past: given that the ancestral state was AAA, it went from AAA to TCG; or if the ancestral state was AAC, it went from AAC to TCG; and so on and so forth. You can imagine how computationally intensive that is, and we have to go through that for all the different possible state changes. And yes, a lot of phylogenetic trees show distance. Actually, distance is the measure, because you can't quite say time: you don't know what the mutation rate was, so you're confounded between time and mutation rate. So yes, it's a phylogenetic tree; it shows the inferred evolutionary relationship between those different sequences. We assume that it's true. There is some uncertainty to it; we don't have a crystal ball or a time machine so we can go and be certain about it, but we take it as true, otherwise the computational demand would just get out of control. The main point here is that these models are already really computationally demanding, because we have to do this for all possible state changes, and that's just the beginning; then we have to move on to the next site. You can have really long sequences or really large trees, so you have a lot of
different state changes. There's a whole bunch of steps that I'm leaving out here: there's some matrix exponentiation, and there's a whole bunch of other parameters. What we get out of this is a great big likelihood function, and the next step is to optimize that function to get the maximum likelihood parameter estimates. I know that's a lot to take in, especially at a BSD conference, so the important point is that it's very computationally intensive. We couldn't do this work without a cluster.
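The steps described above (condition on every possible ancestral state, multiply across sites, then maximize the likelihood) can be sketched on the smallest possible tree: two sequences joined by one unobserved ancestor. To keep it short, this sketch swaps the 61-state codon model for the four-state Jukes-Cantor nucleotide model, whose transition probabilities have a closed form (so no matrix exponentiation is needed), and it uses a crude grid search instead of a real optimizer. The sequences and function names are made up for illustration.

```python
import math


def jc69(t):
    """Jukes-Cantor probabilities after branch length t (expected substitutions
    per site): (P[base unchanged], P[change to one specific other base])."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


def log_likelihood(seq1, seq2, t):
    """Log-likelihood of two aligned sequences, each on a branch of length t
    below an unobserved common ancestor."""
    same, diff = jc69(t)
    total = 0.0
    for x, y in zip(seq1, seq2):
        site = 0.0
        for ancestor in "ACGT":              # condition on every possible ancestral state
            px = same if ancestor == x else diff
            py = same if ancestor == y else diff
            site += 0.25 * px * py           # 0.25 = prior probability of each ancestor
        total += math.log(site)
    return total


def ml_branch_length(seq1, seq2, step=0.001, upper=2.0):
    """Crude maximum likelihood estimate of t by grid search."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda t: log_likelihood(seq1, seq2, t))


a = "ACGTACGTACGTACGTACGT"
b = "ACGTACGAACGTACCTACGT"      # differs from a at 2 of 20 sites
t_hat = ml_branch_length(a, b)

# For this model the estimate can be checked against the analytical distance.
p = sum(x != y for x, y in zip(a, b)) / len(a)
analytic = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(f"estimated distance between the sequences: {2 * t_hat:.3f}")
print(f"analytical Jukes-Cantor distance:         {analytic:.3f}")
```

With 61 codon states instead of 4, many internal nodes instead of one, and several parameters to optimize jointly, the same calculation grows quickly, which is the computational burden the speaker is describing. Note also that the result is a distance, not a time, matching the point made in the talk about time and mutation rate being confounded.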
OK, so that's all I'll say for the first research track. For the second research track, I'm not directly involved with this except in an administrative support role, so I'll be more general and a little bit more brief. So, microbes: they're ubiquitous. There are something like 10 times more bacterial cells in you than your own cells. We all understand how a single strain of bacteria or some microbe can affect more complex systems, like your health or certain parts of the environment. But what this research is concerned with is how the dynamics of those microbial communities affect more complex systems. Traditionally, how we sequence microbes, bacteria, is we isolate them in the lab and sequence those individual strains, but we really miss a lot of the diversity when we only look at a single strain. The technology exists nowadays to go out in the environment, and by the environment I might mean your gut, or a certain depth in the ocean at a certain time of year, and just sequence everything. So that's not a problem; the technology is there. What is a problem is making sense of the vast amount of data that you get, and that's where this research comes in. A postdoc in our lab, Mahdi Shafiei, wrote these two programs, BiomeNet and BioMiCo, and they basically group microbial populations, or communities, according to their properties. What I mean by a microbiome is a micro community. If you wear glasses, then you have very different communities of bacteria up on the bridge of your nose and behind your ears. So microbiomes are these small little communities of bacteria. Actually, I shouldn't say small; sometimes they're very complex, and they might have hundreds of thousands of different strains or species. And what we mean by metagenomics is just going out and not isolating a strain, but sequencing everything and dealing with it later. One way they applied this research was with
patients with Crohn's disease. It's believed that one of the reasons patients with Crohn's have the disease is that the communities in their guts are out of whack. To test this, they took patients, sequenced the bacterial communities in their guts, gave them a treatment, and then weeks later sequenced the communities again to find out if the treatment did anything, and correlated that with changes in the communities and the state of the disease. And
so these models are also quite complex and high-dimensional. I put this plot together just to drive home the point that this field and new technology are tightly coupled. The middle plot everybody is probably familiar with; it's basically a representation of Moore's law, so the density, or the number of transistors on a chip, over time. Take note that the y-axis is on the log scale, so these are exponential growth functions. The top plot is basically a representation of the amount of DNA that's been sequenced, so it's the number of nucleotides in these online databases of DNA. The bottom plot basically shows how intense, or how much, research is going on in this field. I went to Web of Science, which is sort of like a search engine for academic papers or citations, and I put in a search for "molecular" and "evolution", and you can see that the field is also growing exponentially.
OK, so let's switch gears again and I'll talk a little bit about the software that we use on our cluster. Like I said, we got the cluster in 2006. It originally ran 64-bit Solaris, and we had a dedicated system administrator until 2009, when funding ran out for him, so I was asked to take over the system administration role. It was about this time that Brooks Davis was talking about his work at the Aerospace Corporation, where he ran FreeBSD on a much larger cluster, and he had ported over a lot of the tools that we were used to; probably most importantly, Sun Grid Engine as a resource manager. So there were no deal breakers with installing FreeBSD on the cluster: no software that we depended on that was incompatible, that I was aware of, and no hardware issues. So I took the plunge and tried to install FreeBSD on there, and initially it was actually a minor challenge. It wasn't the fault of FreeBSD; I'd say it's probably the fault of Sun. I don't know if you remember the picture of the cluster I showed; nobody noticed this problem. So, to back up a second: I first tried to PXE boot. It didn't work right away, and I was like, you know, there are 20 nodes, they're only compute nodes, I don't have to do a lot of configuration, I'll just burn a couple of DVDs and stick them in there. It had these DVD drives, not the ones where the tray comes out; you have to pop a DVD in and it sucks it in. For some reason, about one in three times it would just suck the DVD in and wouldn't spit it out; it just didn't acknowledge there was a DVD in there. So since about 2007, I think about half of the compute nodes have had 7-point-something DVDs still in them. So I thought, OK, I'll get a couple of USB sticks. I went through USB stick after USB stick, thinking, I can't install FreeBSD on it: it sucks in DVDs and it won't boot off USB. Eventually I got a hold of a USB DVD drive. It took about 20 minutes; for the compute nodes you basically just accept all the defaults,
and after that there were really no problems. I've stuck not with STABLE but with RELEASE, simply because I just run freebsd-update to upgrade and it takes like, you know, 10 or 15 minutes. I have actually switched to STABLE on the storage server, because there was a patch that I wanted to pull in, and on the new compute node; probably most importantly, I switched to STABLE there because I wanted to grab the bhyve stuff. I said that in the past there were no real software incompatibilities, but this microbiome research has a lot of software written by biologists. I'll talk a little bit more about that in a minute, but it wasn't so friendly to FreeBSD. We actually have a couple of desktops in the lab that run FreeBSD. One graduate student actually has probably the most powerful processor in the group; it's a Haswell-based system with four cores, and it was basically sitting idle at night. So I thought, well, I'll run poudriere on there and build all the packages. That works really well; I really like it, it's a nice tool. The same graduate student is pretty comfortable with MATLAB and likes to test new ideas in MATLAB, so we put VirtualBox on there. He's mostly just exploring, so speed is not an issue. He's actually doing a bit more with it now, so we might have to look into getting Windows running in bhyve or something. And of course we're running ZFS on the storage server. We're NFS mounting home directories from the storage server to the compute nodes and the master node. I stuck with NFSv3 just because I'm most comfortable with it. It hasn't really been a problem. Recently, though, I've had a little bit of a minor issue where I'll submit like hundreds of thousands or even a million jobs, and they run fine, but then when they're done, on about half the nodes it's almost like they're swapping. You try to SSH in and you wait 30 seconds for a prompt. Even when you're not in an NFS-mounted directory, it's
slow. I eventually figured out that if you unmount and remount the home directories on the affected nodes, everything's back to normal. I haven't figured this out yet, but generally NFS hasn't caused us any problems, except for that one exception. For authentication and authorization, I'm using about the most simple setup you can imagine. I basically just take the password file, in a cron job, and send it to the compute nodes. It works really well, there's no overhead, and you don't have to mess with LDAP or anything like that. Users just know that if they change their password on a compute node it's just going to be lost, and if they change the password on the master node it might take up to five minutes for the cron job to send it out. But generally it works well. Most of the jobs we run are, I guess, what people call embarrassingly parallel, or uncoupled, so we just run them all, they run independently, and then when they're done we use shell scripts or other scripts to make sense of everything. We make use of the statistical package R quite a bit. As a math and stats major, I have to write a lot of documents with mathematical text, and I haven't found anything better than LaTeX. I actually enjoy using LaTeX; this presentation is done in LaTeX. The implementation of these models is in a package called PAML. I was actually quite shocked: this is a pretty obscure piece of software, but it's actually in the ports tree. It was abandoned after a while, but we eventually took it over. It's a really exceptional piece of software in this field, because it basically runs without any patches. It's like 33 thousand lines of C code and it doesn't even need autotools or anything; it just works. That's really exceptional; most of the software I have to get running isn't like that, and I'll talk about that in a minute. This next one is something that my supervisor started. It's basically a reimplementation of those models. As I said, there's nothing wrong with PAML; it's just that
we extend these models a lot and add new models, and it's a little bit more modular, which makes it easy for us to extend those models. We use a couple of other ports. BLAST is basically a search engine for sequences: you put in a sequence and you search to see whether it matches, say, an E. coli strain or whatever. DIAMOND is a similar application to BLAST. The other applications mentioned there [names unclear] are used in the microbiome research. Another tool
that I find useful for administering the cluster [name unclear] was originally a bash script written on top of [unclear]. What it does is give you a terminal session to all your nodes and mirror your input to all the other nodes. If anybody has used ClusterSSH, it's basically the same idea; I think the main difference is that it doesn't require X.
[Audience question, unclear] OK, so SaltStack, or something like Ansible, that type of thing. So far I've resisted doing that; I'll talk a little bit about that in a second. I really don't need interactive sessions all that often. I usually use it for something like a FreeBSD update, and it works really well. It's very simple; it's just a shell script. It was originally a bash script, and I worked with the guy upstream to take out all the bashisms; he was pretty receptive. There is a helpful tool in the ports tree, checkbashisms I think it's called, that quickly tells you all the bashisms in your script, so that was helpful. Anyway, he accepted all these changes, and so we threw it in the ports tree. I find it helpful, though I don't use it all that often. If you had a bigger cluster, more than 23 nodes, it might not be as useful. Another
tool, which I think was ported over by Brooks Davis and the guys at The Aerospace Corporation, is Ganglia. It's a nice tool for
monitoring the current state of the cluster. I particularly like this plot, which shows you an aggregated load of everything that's happening in the cluster, so I can just look at it right away and see everything that's going on. Usually I keep a monitor on that shows this plot, and if I see that the load goes down, I rush to get my jobs onto the cluster, because I know it's free, before somebody else does. So Brooks stressed
this in his last talk: simple Unix tools. I found that just brushing up on Unix tools is probably the most efficient way to get a lot of things done on the cluster. Something like this, when I don't need something interactive, just sends a command off to all the nodes. Maybe SaltStack or something is more helpful when you don't have such a homogeneous system. These compute nodes are all basically identical, so that really helps with administration. OK.
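A minimal sketch of the kind of non-interactive fan-out described here, written in Python rather than shell. The transcript does not capture the actual command shown in the talk, so everything below is an assumption for illustration: the hostnames (`node01`, `node02`, ...), the helper names, and the example remote command `uptime` are all hypothetical. The only external tool assumed is the standard `ssh` client with its `BatchMode` option.

```python
import shlex


def node_names(prefix, count):
    """Generate hypothetical node hostnames such as node01, node02, ..."""
    return [f"{prefix}{i:02d}" for i in range(1, count + 1)]


def build_ssh_commands(nodes, remote_command):
    """Return one ssh argument list per node.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    which suits unattended loops over many hosts.
    """
    return [["ssh", "-o", "BatchMode=yes", node, remote_command] for node in nodes]


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for argv in build_ssh_commands(node_names("node", 3), "uptime"):
        print(shlex.join(argv))
```

Each argument list could be handed to `subprocess.run(argv)` to actually execute it. Because the compute nodes are described as essentially identical, the same command can be sent to every node without per-host special cases.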
So, Grid Engine is a piece of software that we relied on quite heavily. It does basically 3 things. First, it's a resource manager, so you say, I want this many nodes that have at least this much memory and this many cores, and I want them for this amount of time. We don't use that feature all that often. I mean, we're only like 5 to 10 members in the lab, so resource management is most often walking across the hall, knocking on the door and saying, "Hey, do you mind if I use this many nodes for 3 days?", or our supervisor saying, "You're going to run your stuff now because it's more important, and you have to wait."
The 2nd feature that Grid Engine provided was job submission, so basically getting your jobs out onto the nodes, and that's the feature that we rely on quite a lot. As I said, most of our jobs are uncoupled, they run independently, so something like Grid Engine might be a little bit overkill on our cluster. I thought maybe it would be more appropriate to write some sort of MPI wrapper that is very efficient, that can help you get the jobs on there very quickly, get them off, and get new ones on when they're done.
I think the metric to look at here is the proportion of the idealized time, the most efficient time. Say you run 1 job on 1 node and you know it takes 1 minute, and you have to submit a hundred thousand of these. We have 120 cores. Consider how long it would take if you could run those 120 in parallel perfectly efficiently, so that when one's done the next one's right on there. Then you look at the actual wall time with Grid Engine and take it as a proportion of the idealized time. It worked out pretty well with Grid Engine, so I decided not to reinvent the wheel and to stick with this. You know, I want to graduate eventually, someday, so when anything works, I just move on.
The 3rd feature that Grid Engine provides is that it queues your jobs, so when
all the resources are in use, it queues your submitted jobs and runs them when those resources become free.
Around 2013 I wasn't following the ports mailing list closely enough, and apparently Grid Engine wasn't building on some architectures, so it was removed from the ports tree. I sent e-mails trying to get it back in, and nobody was really receptive, so I kept a local port going for a while. There were a few hiccups here and there, but eventually, with the switch from 9 to 10, something broke. I spent like a day or two on it, and it was just too much of a time sink, so I started searching for alternatives.
I wanted to include some sort of metric, some sort of objective measure, that would show how difficult it was dealing with Grid Engine. It's got about 20 years of legacy code. There's no more development in the main branch because Oracle sold it off; there is development, but it's not open anymore. There are a few projects trying to resurrect the old code; I think there's Son of Grid Engine and Open Grid Engine. I tried both of those and neither worked very well.
So, this metric, this objective measure of how difficult Grid Engine was to deal with: I looked at the length of the Makefiles of all the 27 thousand ports. The font's a little small, but there's Grid Engine. I think it's a little bit of an outlier, and so we decided to move on. Incidentally, can anybody guess what port has the largest Makefile? No. No. It's nginx-devel, by a long shot, at something like a thousand lines, and the 2nd one is nginx. I just thought that was interesting information. I'm not making any judgment on this; in fact I think a lot of it has to do with the modules, there's a big long list. And actually Grid Engine had a separate Makefile [name unclear] that was maybe like a hundred lines, so if
I included that, it would be even larger. So anyway, we moved
on. I looked at a bunch of different schedulers and eventually discovered Slurm. It was already in the ports tree, so that was a bonus, and despite the name, Simple Linux Utility for Resource Management, it's quite portable. A guy by the name of Jason Bacon ported it over. There were a few patches at the time, but nothing too serious. He's a busy guy, so he handed over maintainership, and it hasn't been too much work so far. We're down to 1 patch, and I think it's actually just a single-line include of [unclear]. It runs generally well without too much intervention. There are a few hiccups, a few gotchas to look out for: certain configurations will segfault the daemon right away, but I found that if you avoid those, everything works really well. They document how you can submit, I think, something like a thousand jobs a second. We're not even close to that; we might be somewhere around 100. I don't know if it has to do with FreeBSD or our hardware, but the performance we get is good enough for us to keep the cluster running, and we're happy with it for the time being. OK, so I
said I was going to talk about this. We didn't have many issues or any problems with software until we started this microbiome research. As surprising as it may be, biologists don't write the most portable software. A member of our group will come to me and say, "I'm trying to get this running." I'd look at the page, and if there was any documentation, it might say, "I prepared this on Ubuntu Sneaky Salamander"; I think that's actually a name, by the way. I would try to mess around with it. It might be a matter of just changing a few files; there might be a shebang with bash, so change that. But eventually, you know, I want to graduate, and this was just a huge time sink.
Fortunately it was about this time that bhyve support in FreeBSD came out. So on our new 48-core node I threw bhyve on there, and it's running fantastically. I don't have any objective measures, but users who also have access to a shared cluster, where a lot of these things run on Linux, tell me they don't notice any difference in performance versus the Linux cluster. The great thing is, you know, one utility only works on Ubuntu or whatever, the other one only works on CentOS, so we just fire up a different VM. I think if bhyve didn't exist we probably wouldn't have FreeBSD on the cluster, so I'm very grateful to the bhyve developers.
An interesting point to note is that the motivation in academia is basically to get your results published and forget about the software. Research is all about reproducibility, so this is, I think, a real issue. I think, hopefully, something will change, where people start understanding that these tools are important for the research and a little bit more attention is paid to having more developers in groups. But yeah, I think it's really an issue.
Our cluster is getting a little bit older these days. It's pushing 10 years old, and there have been some discussions about replacing it or just getting rid of it. If you are a researcher in Canada there are other options. There's Compute Canada, which is a pooling of resources of Canadian universities, and there's ACENET, which is a partner in Atlantic Canada. There are some members of the group that use ACENET, and I'm told that it works generally well. But because we have such high demands on our cluster, even though it's getting older and it takes some time, I think it's really worth it for us. Using something like ACENET would be a problem because, number one, some of the microbiome datasets are gigabytes in size and we have to analyse a couple hundred of them, so even just getting the storage space and getting them onto the clusters is a challenge.
I guess the last thing I'll say is that Brooks gave a talk about 7 or 8 years ago saying that FreeBSD, if you don't have any constraints in terms of hardware or software, can be a really good cluster platform, and that's been our experience too. And as I say, with bhyve, software constraints are becoming less of a problem. So if you're in a similar situation, a small group with a small cluster, I think FreeBSD can actually do a pretty good job. So that's all I have.
[Audience question, unclear] I'm in my 4th year right now, so I think I'm close to graduating. [unclear] Anybody else? All right, thank you very much.

