
Tips for the scientific programmer


Formal Metadata

Title
Tips for the scientific programmer
Subtitle
Case studies for parallelism, data storage, memory and performance
Title of Series
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
This is a talk for people who need to perform large numeric calculations. They could be scientists, developers working in close contact with scientists, or even people working in finance and other quantitative fields. Such people are routinely confronted with issues like (1) parallelism: how to parallelize calculations efficiently; (2) data: how to store and manage large amounts of data efficiently; (3) memory: how to avoid running out of memory; (4) performance: how to be fast. The goal of the talk is to teach some lessons learned after several years of doing numeric simulations in a context where micro-optimizations are the least important factor, while overall architecture, design choices and good algorithms are of paramount importance.
Keywords
Transcript: English (auto-generated)
You know, it's always nice to be here at EuroPython. Actually, I'm a regular, I think I've been here for 15 or 16 years, a long time, but it's always fun. So right now I'm working at the GEM Foundation in Pavia.
GEM stands for Global Earthquake Model. And what we do, oh, no. Sorry. Technical.
Okay, the keyboard was not working for a moment. Now it's working, yeah. What we do is things like that. This is the global risk map. So essentially, in all countries of the world, we can make an assessment of how big the losses will be.
They could be economic losses, in the sense that there are buildings that get destroyed. They can be human losses: on the bottom right corner there is the fatalities map, showing where people would die.
On the left we have the hazard map, showing the places in the world that are more likely to have earthquakes. And in the middle there is the exposure map, showing where the buildings, bridges, hospitals are, the things that can be destroyed. So this map was released in December last year.
It was a big, big, big enterprise, because GEM was founded something like 10 years ago, and after 10 years we were able to produce this map for the entire world. And since we are at a Python conference, you are probably interested in knowing how this was done.
This was done all in Python: SciPy, NumPy, all the usual stack. And it took a few months, two or three months of computation on our cluster, which is more or less 500 cores plus other machines. And lots of trial and error and problems,
running out of memory and things like that. So this is where I come from. I also want to make a disclaimer here, because this morning we learned that the high performance computing community is a bit closed, a bit hard to enter, even a bit evil, I understood.
So don't worry, here we are all nice, because this is about middle performance computing, okay? It's something more than doing calculations on your laptop, but it's not a supercomputer, it's not Fortran or C++, it's just Python, okay? So don't be scared.
And I expect that in the audience there are lots of people that are scientists or work in computational physics, or maybe in finance. I have been working for seven years in finance too. Actually, I'm originally a physicist. So I think most of you feel comfortable
programming in Python, but want to do something that is more than a laptop can handle, so this talk is for you. This talk will not be very much performance oriented. It's true that sometimes I really need to profile code and find issues.
I need to profile code and find issues. For instance, last month I had some legacy code was written before my time. It was load, I had to start the Python profiler, see exactly which line was low, why, and there was a function which was called
too many times and in the loops and moved outside the loop. So this is a skill that you need, it's important. But honestly, I do this one or two times per year. It's not so common that I have to go this low level of detail. And this is a bit surprising for me
because I would have expected to be spending all my time in the profiler, but this is not indeed the kind of middle performance computer that we are doing. Actually, it's very rarely that we go to this level of detail when it's the Python profile. Essentially, what we are interested in is the total runtime of your computation
and knowing more or less where are the points where it is low. More or less means that you don't go to the details which Python line. What you have to do is to instrument your code. So typically you have some context manager, so you say we've, and you have this context manager
and inside the block of code and we've this context, compute things like start time and time and save some places, how much long it took to run that block of code, how much memory was allocated, and that's the main thing. So as your computation goes,
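To make this concrete, here is a minimal sketch of such an instrumentation context manager (the names and the storage are hypothetical, not the engine's actual API): it records the elapsed time and the peak memory allocated inside a named block of code.

import time
import tracemalloc
from contextlib import contextmanager

performance = {}  # accumulated measurements, keyed by block name

@contextmanager
def monitor(name):
    tracemalloc.start()
    t0 = time.time()
    try:
        yield
    finally:
        dt = time.time() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        tot_time, max_mem = performance.get(name, (0.0, 0))
        performance[name] = (tot_time + dt, max(max_mem, peak))

# usage: wrap the interesting blocks, then print a report at the end
with monitor('computing distances'):
    total = sum(i ** 0.5 for i in range(10 ** 6))
for name, (seconds, peak_bytes) in performance.items():
    print('%s: %.2fs, peak %.1f MB' % (name, seconds, peak_bytes / 1024 ** 2))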
So this talk is not about micro-optimizations.
It's not like the talk that preceded mine, about Cython, which was a very good talk. But this is not what I'm going to talk about. Instead, I want to point out that what is more important for this kind of middle level computing is the architecture that you are using, the data structures that you are using,
the libraries that you are using. For instance, an example: we have to compute distances, okay? You want to know, this is the hypocenter of the earthquake, how far is it from the cities, so you compute distances. In some calculations this was slow. So I tried to use Cython,
and I tried to use Numba and various things, to no avail essentially. I was getting some improvement, but if I get a 20% improvement, it's not enough to justify having a big dependency on some external library.
Instead, at the end, we found out that in SciPy there is a module, scipy.spatial.distance, which is able to compute Euclidean distances, and it was an order of magnitude faster than what we were doing with NumPy. So what we did was to convert: we have longitude and latitude, we convert them to 3D coordinates X, Y, Z, and then use the Euclidean distance.
It was an order of magnitude faster, and that's that. So essentially, nowadays it is probably more important to know which is the right library to use, because the work is already done, than to go low level. It depends on what you are doing, but for the things that we are doing, you should read the documentation and know your NumPy and SciPy well.
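As a rough sketch of that idea (not the engine's actual code): convert longitude and latitude to 3D Cartesian coordinates on a spherical Earth, then let scipy.spatial.distance compute all the Euclidean distances at once.

import numpy
from scipy.spatial.distance import cdist

EARTH_RADIUS = 6371.0  # km, spherical approximation

def to_xyz(lons, lats):
    # convert arrays of degrees into an (N, 3) array of Cartesian points
    lons, lats = numpy.radians(lons), numpy.radians(lats)
    return EARTH_RADIUS * numpy.column_stack([
        numpy.cos(lats) * numpy.cos(lons),
        numpy.cos(lats) * numpy.sin(lons),
        numpy.sin(lats)])

hypocenter = to_xyz(numpy.array([9.2]), numpy.array([45.2]))    # one point
sites = to_xyz(numpy.random.uniform(0, 20, 10_000),
               numpy.random.uniform(35, 50, 10_000))            # 10,000 points
dists = cdist(hypocenter, sites)  # (1, 10000) array of chord distances in km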
Okay, having said that, I want to start from the input and output formats, because I suppose you are writing some scientific code and maybe you want other people to use your code.
So, one thing that is really important, that maybe people don't think enough about, is how users will interact with your application: which are the inputs, which are the outputs. Okay, and one thing that I discovered here
is that once you have decided "this is my input", you are stuck with it, essentially forever. Why? Because there will always be somebody coming and saying, ah, I ran this calculation 10 years ago and now I want to repeat it,
I want to reproduce that result. Typically it is a scientist or engineer or user, and they want to reproduce something that was published in an article, even a long time ago. So you really need to be able to run the old calculations.
But also for myself, if I want to compare an old version and a new version for regressions, I must be sure the engine can do it. When I say the engine, I mean our software, which is called the OpenQuake Engine. This is software where you give it an input, things like the seismic sources and the exposure,
and in output you get maps of the regions where there are more earthquakes, more damage, things like that. The engine is a command line tool. There is also a web interface, and there is also a QGIS interface via a plugin,
so you can run calculations and display the results that way too. So, I was saying, you really need to keep your input formats working forever, because originally I thought, okay, maybe I can give a converter to the user: in the old version this was the format, you convert, and then after a few releases
nobody will use the old format and I can remove the old parser. No, really, it does not work, because if you need to provide a converter, you can just as well put that converter inside your source code and do the conversion internally every time,
and the user will not need to do anything, it will be done internally. That's the right way if you want to do that. And you must do that, because I'm pretty convinced that it's impossible to get the right format at the beginning. For instance,
we decided to use XML for the sources. XML maybe is not the best format, but there were reasons at the time it was done this way, because other software in the seismic community was using it. That software was written in Java, so XML was meaningful for them.
So we used that, there were some standards. So there are reasons that make you say, okay, this format is probably good, but after 10 years it's not good anymore. So you must be prepared for that. What you can do instead is to have an internal format. So what you should do is to have some internal conversion
from the input, which must stay the same forever, to the internal format, which you can change in such a way that it is performant, okay. The output formats are a bit better. We were able to change the output formats because, okay, some users may protest, but usually they're happy, because before we were giving XML as output.
So, they had to write their own parsers for the XML. Now, we give CSV, they are more happy with the CSV. So, it's easier especially if the format is simple. Okay, so let me talk a bit more about in practice what we are using.
There is an INI file where you set the parameters that you are using, I don't know, like the investigation time. You want to know how many earthquakes there will be in the next 100 years, that's the investigation time. Or parameters like the discretizations, things like that.
And the INI format is good. Right now I would use the TOML format. I don't know, do you know the TOML format? Never heard about this format, somebody? It is a new, well, not so new, relatively new format. At the time, that format did not exist, so we could not use it.
It's used especially in the IT world; there is a Python PEP to use the TOML format for the distribution tools. It's a nice format. It's essentially an extension of INI which is also hierarchical, so it's in the middle between JSON and INI.
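Just to give an idea of why TOML sits between INI and JSON: it has INI-like sections but also nesting and typed values. Here is a toy snippet read with tomllib from the standard library (Python 3.11+); the parameter names are invented, not the engine's real ones.

import tomllib

config = tomllib.loads("""
[general]
description = "example hazard calculation"
investigation_time = 100            # years, typed as an integer

[site_params.discretization]
region_grid_spacing = 10.0          # nested section, float value
""")
print(config['general']['investigation_time'])                          # 100
print(config['site_params']['discretization']['region_grid_spacing'])   # 10.0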
So that format was not available at the time. We would be happy to use it, but we are still supporting the INI format, and we will support it forever. That's what we're using. Then there is the XML. There was a period when I was really against XML,
well, not so much against. The big mistake, I think, was using XSD, the XML schema, which was really not needed, because the XML schema is a way to constrain the XML, but it was not really working in our case, it was making everything more complex.
In the end, you need additional constraints which are specific to earthquakes, so you have to write your own validator in Python anyway. You cannot trust it; the validation at the XML level is not enough. So this has been removed; the XSD schemas are no more, fortunately.
The XML could have been a bit simpler, but okay. CSV, we can use the CSV. Now, we're using input CSV more and more, especially the engineers are very happy with this CSV, because typically they have Excel sheets, so they have input in CSV, so it's good for them. We are using HDF5 in some cases,
because of the other scientists. There is a big split in this earthquake community. There is the hazard side, which means essentially the hazard scientists: geologists, geophysicists, et cetera. Then there are the risk guys,
who are typically engineers, so they know about structural engineering, buildings, and things like that. They come from different backgrounds, they have a different mindset, and we are in the middle, because we have to support both of them. So, let's say the hazard guys, geologists, et cetera,
they like to code, to program, so they use formats like HDF5, for instance. Well, for them that's okay. So they ask us: the model for California, which is UCERF3, the seismic model for California, is written inside this big HDF5 file.
So we need to support that format, because this is the format they give us. Or, in Canada, they use these ground motion prediction equations which are HDF5 files, so there are tables and numbers inside these files. We have to support it, because they use that. And then we also support zip,
because if you have lots of input files, you put everything in a zip and you are done, so it's very convenient to ship a calculation from one person to another, et cetera. No problem with that. In output, we are using XML.
When I write NRML, it means Natural Risk Markup Language; it is a flavor of XML specific for this earthquake stuff. And we are removing this output, because people have to parse it, and it's very inconvenient.
Apparently, nobody in our community wants to write an NRML parser. So we are trying to export CSV instead. With CSV everybody's happy. We have CSV files where we have a header with the names of the fields, and before the header you have a sharp character,
like a comment in Python, and inside there you can put some metadata, like the date this file was generated, the version of the engine that generated it, and maybe some parameters that are relevant.
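A sketch of that kind of CSV layout (the field names here are invented): metadata lines starting with '#' before the header, then the data rows; when reading it back you skip the '#' lines like comments.

import csv, io

metadata = '# generated_by=OpenQuake engine X.Y, investigation_time=100\n'
rows = [('site_id', 'lon', 'lat', 'loss'),
        (1, 9.15, 45.18, 1234.5),
        (2, 9.20, 45.20, 678.9)]

buf = io.StringIO()
buf.write(metadata)
csv.writer(buf).writerows(rows)
print(buf.getvalue())

# reading it back: drop the metadata lines, then parse normally
lines = [ln for ln in buf.getvalue().splitlines() if not ln.startswith('#')]
for record in csv.DictReader(lines):
    print(record['site_id'], record['loss'])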
Sometimes we also generate HDF5, and there is NPZ. Do you know what NPZ is? NPZ is essentially a zipped set of NumPy arrays. If you have more than one NumPy array, you can zip them together; it looks like a dictionary mapping names to arrays, and you zip everything.
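In a nutshell (a generic NumPy example, not the engine's actual data): several named arrays are saved into one .npz file and read back like a dictionary.

import numpy

magnitudes = numpy.random.uniform(4.0, 8.0, 1000)
losses = numpy.random.random((1000, 3))
numpy.savez_compressed('/tmp/result.npz', magnitudes=magnitudes, losses=losses)

with numpy.load('/tmp/result.npz') as npz:
    print(npz.files)             # ['magnitudes', 'losses']
    print(npz['losses'].shape)   # (1000, 3)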
This format we use by necessity. Essentially, we need to communicate from the engine server to the QGIS plugin; I mentioned before that we also have a QGIS interface.
So we need to transfer these large arrays from the engine server to the QGIS client. Originally, the idea was that we would transfer them as HDF5 files, but it didn't work. Actually, it worked for us, because we were using Ubuntu and the versions of the libraries were consistent
between the engine and the QGIS plugin. But then we discovered that on other platforms there were inconsistencies: QGIS was using its own version of the HDF5 libraries, and if you tried to read a file generated by us
with another version, you got segmentation faults, total disasters, it was very, very difficult. So we said, okay, let's transfer these arrays as NPZ files, because NPZ is a format that NumPy introduced 10 or 15 years ago, it's very stable and it works everywhere,
so we did that by necessity. This is an example of the web interface that we have, and of the kind of outputs you can get. This is a classical calculation that produces hazard curves. A hazard curve is essentially the probability
of having an earthquake bigger than some value at each point. You can download lots of different formats: CSV, XML, NPZ. The full report is in reStructuredText,
which is another format I did not mention, but it is useful, because inside the report you get information like which seismic source was the slowest, or how many ruptures there were in this source model, things which are useful to understand if there are problems or why things are slow.
This one is very interesting: the input. One output is called "input", and it is actually the input files that were used to run that calculation. This is really valuable, because if you have another guy,
because this is a multi-user situation, okay, we have a cluster with several users, one of my users runs a calculation and has a problem, calls me and says, ah, it doesn't work, I have an error. So I can go here, download that input output, unzip it,
and I have all the input files that were used, so I can run the calculation again on my machine, debug it, and see what's happening. So it's useful in this way. And then at the end there is this "download datastore", and we call datastore this HDF5 file, which is the internal format that contains everything
about the calculation, contains another copy of the input, and all the outputs, and when somebody has a problem, I can tell him, please send me via Skype or something, the hdf5 file, the data store, and from that, I can repeat and understand everything. So that's a good tip.
Have a way to get the input that your users are using, because typically they say, and this happened, they say, ah, but I ran these files, and last year they were working, now they don't work. And then you discover that the files they ran are not the same as before, they changed some parameters. So it's very important
to have a way to get what they really ran. The internal formats that we are using: as I said, HDF5 is the main one, because everything, the real data, the arrays, are there. TOML we are starting to use internally, because it's a readable hierarchical format. SQLite, because we have a database
that stores information like who submitted that calculation, when it was submitted, and then you can do statistics, like who was the user that used the cluster the most, the most runtime, the longest calculations, things like that, so a database is good for this kind of stuff.
But this is, let's say, metadata with respect to the real data, which are inside the HDF5. And these formats are good, we are happy with them. Of course, the choice of the format has a big performance impact. This is clear, well, it's clear to me,
but probably not to the one that originally thought that exporting everything in XML was a good idea, because we had cases where the export time, the time to export the results of the calculation, was longer than the calculation itself, because we were trying to export XML. It looks totally foolish now.
Now we removed the XML export and we export CSV, which is a lot faster. And again, if what you want to export is too big, there are other options, like exporting in HDF5, or giving out the datastore, where the information that needs to be exported is already there.
Also the importers: very recently, for instance, I changed the importers. We have a format for the exposure, so the buildings, in XML, and we also introduced later a format in CSV.
But for legacy reasons, this CSV importer was actually reading the records and converting these records into node objects. These node objects are similar to the nodes that you would get from the XML, and then it reused the XML importer logic,
which made sense at the time, because, okay, you always have a deadline, so you try to reuse whatever you already have. But in the end the importer was slow, because it was doing all these conversions. So what was recently changed is that now the importer just reads the CSV, and the XML importer instead reuses the CSV importer.
So I did the opposite of before, and now it is five times faster, so you can read two million assets, which is, for instance, the exposure of Canada or California, where you have millions and millions of buildings. You can read these millions of buildings, two, three million, okay?
So they are not an issue now. One more word about HDF5: we are happy with that, really, I would recommend using it. Okay, if you have a question on this, because this was half of my talk, the other half will be about something else, but if you have a question, that's the right moment.
On the input formats, that's the right moment. Otherwise, I'm going to talk about task distribution, which I think is probably even more interesting than the first part, but I wanted to start with the inputs and outputs,
because they are more important than people realize. Task distribution: what we are doing. Originally we started from Celery, for historical reasons, because the people that were there before my time liked Celery, they knew Celery,
so the engine started out with only Celery. And with Celery we have RabbitMQ. It is a very nice queue mechanism, but it is not meant for high performance computing, it's not meant for that. RabbitMQ works very well if you have a lot of small messages.
But on the contrary, we have few messages which are large arrays, gigabytes of data in one message. And then RabbitMQ has things like fault tolerance: it can keep everything in a database, which is called Mnesia, so if the calculation dies, you can restart it. These features are very nice,
but we do not use those features. For us, if a calculation dies, runs out of memory, there is some problem; it dies, you don't restart it, okay? Dead means that you made some mistake, you have to change the parameters and repeat everything. So it has features that we don't use, and these features have a cost, in the sense
that, for instance, the data transfer is really slow, and there are a lot of conversions between Python and Erlang, because RabbitMQ is done in Erlang, so it is complicated.
But still, we kept using Celery, essentially because of the revoke functionality, which works very nicely. The revoke functionality is the ability to kill a calculation without killing the calculations of the other users, because we have a cluster with multiple users. I run a calculation,
after 10 hours I discover that it is slow, I'm still at 1%, so I say, let me kill it, I will retry. But if another user in the meantime started something, I cannot kill the whole cluster, I need to kill only my calculation and not the others. This is the revoke functionality, and it works well in Celery.
And we tried other things, but they don't have this revoke functionality. Dask, for example, did not have this functionality. I asked the main Dask developer, what about that? And at the time they said no, we don't plan to have this functionality. So we don't use Dask, we stuck with Celery. But we solved the big problem of the data transfer
by using ZeroMQ, essentially. The trick was to use ZeroMQ to receive the outputs. If your calculation produces a lot of outputs that you want to collect, by transferring those via ZeroMQ you can transfer one terabyte of data
without big issues, essentially. On the contrary, if you try that with Celery, you immediately run out of memory, out of disk space, everything becomes slow, it's a disaster. So ZeroMQ was introduced just to solve the problem of the data transfer of the outputs back. For the inputs instead,
there are fewer inputs than outputs in our case, because what we are doing is generating, for instance, the ground motion fields of the earthquakes, which are big arrays, and we generate more than one. So we are using ZeroMQ to receive the outputs, both on a single machine and in the cluster situation.
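Here is a minimal sketch of that idea (not the engine's actual implementation, just the pattern with pyzmq): the workers PUSH each output through a ZeroMQ socket as soon as it is ready, and the master PULLs and stores the results one at a time, so they never pile up in memory.

import pickle
import zmq

URL = 'tcp://127.0.0.1:5555'  # hypothetical address

def worker(task_id):
    # worker() and master() would run in different processes
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(URL)
    result = {'task_id': task_id, 'data': list(range(task_id))}  # toy output
    sock.send(pickle.dumps(result))
    sock.close()

def master(num_tasks):
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind(URL)
    for _ in range(num_tasks):
        result = pickle.loads(sock.recv())
        # in reality: append to the HDF5 datastore, then discard the result
        print('received output of task', result['task_id'])
    sock.close()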
The biggest issue that we have with the task distribution is not so much the technology or the libraries that we are using; it is really related to the fact that when you do a seismic simulation,
you have some objects which are called seismic sources; essentially they are containers for earthquakes. A seismic source contains all the potential earthquakes that can be generated. In Python speak, they are iterators, okay, they are iterators. And you can have a seismic source that contains
one million ruptures inside, so it's very big, and another that contains three. So they are very inhomogeneous, and when you have to send the tasks to the workers, you may end up with a task which contains a big source and other tasks which contain small sources. And the task that contains the big source
will take forever to run. So the problem is essentially the slow task. This is a case where we have a mean time of around 600 seconds, 10 minutes, so the typical task runs around 10 minutes, but there is this one running more than 50 minutes.
And this is a good situation, because now it has improved a lot. But in the past we had calculations that were taking, I don't know, 10 hours, and actually 10 hours was the slowest task, while all the rest finished in one hour. So we were waiting 10 hours for something that could have been done in one hour. Okay, this is the biggest problem
that we have with the task distribution. And recently we were able to, I will not say totally solve, but improve a lot on that by changing essentially the task distribution mechanism. So the idea was that we can essentially split
the big tasks into smaller tasks. How is it done internally? I will try to explain a bit. When you receive the output of a task, from ZeroMQ essentially, what you receive typically is a dictionary of arrays or things like that.
But we decided that if I receive a tuple whose first argument is a callable, followed by some arguments, then I can think of this tuple as a sub-task. So the worker sends back information from which I can start a new task, okay.
And this way I can essentially introduce a task splitter which is able to split a bigger calculation into smaller calculations. Here is an example that I implemented. In this task splitter, the first argument is the seismic sources, then the other parameters.
And to these seismic sources you can give a weight, essentially how many earthquakes they contain. You can have a max weight parameter somewhere, so you can decide: this is a big task, with a big weight, so I will split it in blocks
and yield. When you yield this tuple, the first argument is a function, then the other arguments. When you yield this, the receiver will resubmit that task function with that block of parameters. So in this way the sub-tasks work and we are able to split the big tasks.
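A simplified sketch of that mechanism (invented names, not the real engine code): a task generator that, when its input is too heavy, yields (callable, arguments...) tuples instead of results, and a master loop that resubmits such tuples as subtasks.

def compute_hazard(sources, params):
    # stand-in for the real work on a manageable block of sources
    return sum(src['weight'] for src in sources)

def task_splitter(sources, params, max_weight=1000):
    block, weight = [], 0
    for src in sources:
        block.append(src)
        weight += src['weight']
        if weight > max_weight:            # heavy enough: spawn a subtask
            yield (compute_hazard, block, params)
            block, weight = [], 0
    if block:                              # leftover sources: compute here
        yield compute_hazard(block, params)

def consume(outputs, resubmit):
    # master side: a tuple starting with a callable is a subtask to resubmit,
    # anything else is a regular result to store
    for out in outputs:
        if isinstance(out, tuple) and callable(out[0]):
            resubmit(out[0], *out[1:])
        else:
            print('got result', out)       # in reality: save to the datastore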
The tricky part is that it is difficult to associate the weight, so how many ruptures there are inside a source, with how much time it will take to perform the calculation,
because it depends on various things. I will give you one example. I could have a complex fault source that is long, 5,000 or 10,000 kilometers, so it affects 10,000 sites, it affects a lot of sites. Even if there is only one rupture of this kind, it affects a lot of sites, and lots of sites means you will have a large NumPy array.
Okay, the NumPy array will have 10,000 slots. And this NumPy array will be so large that it will essentially be out of the cache of the processor. If the array is small, you are within the cache of the processor
and everything is fast. If the array is big, you are outside of the cache and everything will be a lot slower. And this depends on the processor you have, on how big your cache is. There are implementation details that make it very difficult to estimate, knowing the number of ruptures, how long it will take.
So what we did, and this was a very recent breakthrough, one or two months ago: inside each task, if you have, I don't know, one million ruptures to compute, you take a sample of the bigger calculation,
you measure the time it takes to compute this sample, and then you estimate how long it will take to run the entire calculation. When you have that, you essentially have an association between the weight of the calculation and how long it takes. So you can decide how to split, and the user can say,
I want tasks that are 10 minutes long, or half an hour long, or two hours long, and the engine will understand more or less how long they will take and will split accordingly. This is really what solved the problem of the slow tasks.
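A back-of-the-envelope sketch of that estimation trick (invented names, not the engine's actual code): time a small sample of the ruptures, extrapolate the total duration, and derive how many blocks to split the task into for a target task duration.

import time

def estimate_duration(ruptures, compute_one, sample_size=100):
    sample = ruptures[:sample_size]
    t0 = time.time()
    for rup in sample:
        compute_one(rup)
    per_rupture = (time.time() - t0) / len(sample)
    return per_rupture * len(ruptures)        # estimated total seconds

def num_blocks(ruptures, compute_one, task_duration=600):
    # how many subtasks are needed so that each runs ~task_duration seconds
    total = estimate_duration(ruptures, compute_one)
    return max(1, round(total / task_duration))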
Then we made another improvement. Now the engine is so smart that it does not require the user to set the task duration parameter, but is able to figure out what the right task duration is. Of course, the engine is not perfect, so maybe sometimes it does not figure out the right task duration, so that parameter is still user configurable,
and it's very important that you give the user control in some cases, because essentially what I did is a system that works well for all the models that we have in the global mosaic. By global mosaic I mean that we split the world into 30 regions,
I don't know, Northern Africa, South America, Canada, Australia, these 30 regions, and we have source models for all of these 30 regions, and we also have automatic tests that run during the night on samples of the models. Of course, the models are very large; it would take weeks to run one
and months to run all of them, so we run simplified versions. But we have tests to make sure that the software is able to run what we need to run, and will keep working in the future.
So now I am nearly at the end. As I said, the data transfer issue was solved on the output side using ZeroMQ, which is very good, so I recommend it. On the input side instead, the trick is that we are using a shared file system, NFS for instance; you can use whatever shared file system you want in a cluster.
The idea is that if you store your input, for instance the HDF5, in a location, in a directory which is visible to the workers, the workers can read directly from that directory instead of transferring the data via RabbitMQ, and this solved the problem with the input
data transfer. And finally, a note: if you have a task-splitting procedure that produces a lot of tasks, you must be careful, because if you produce too many, you run out of memory anyway, so you must be careful.
We had problems with memory occupation in the past, and even now we always have memory issues. Memory issues are not necessarily always bad; sometimes it's good if you run out of memory early,
because what happens is that a scientist tries to run a calculation for the first time. He is probably not the person that made the model in the first place, he doesn't know which are the right parameters, he tries to run it and it doesn't work. It's better if you run out of memory after 10 minutes
rather than keep working without running out of memory for 10 days and then, at the end, discover that you cannot complete it because it's just too big. So it can be good to run out of memory. Yeah, memory allocation is very important, actually, also from the performance point of view, because, I don't know, I had this idea
that a complicated mathematical computation, like sine, cosine, square root, would be the expensive part. Instead, you know, typically that is ultra-fast; what is slow is allocating the arrays. So be careful about that.
How to reduce the required memory? There are a few tricks which are well known. I would say: try to use NumPy arrays as much as possible, because if you have NumPy arrays, you know how much memory they will take: just multiply the shape of the array by the size of the element, 32-bit or 64-bit floating point.
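For instance, a quick check of the kind you can do with NumPy (the array here is hypothetical):

import numpy

gmfs = numpy.zeros((100_000, 50), dtype=numpy.float32)  # hypothetical array
expected = 100_000 * 50 * 4          # number of elements times 4 bytes each
print(expected, gmfs.nbytes)         # both 20,000,000 bytes, about 19 MB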
So use NumPy arrays as much as possible. Also, for instance, recently we were running out of memory when computing the statistics of the hazard curves, and I solved it by using a site-by-site algorithm.
So for each of, I don't know, 100,000 sites, for each site I do the computation. This is not the most efficient way. Actually, before, it was done in blocks of sites, which is faster, but in the end, doing it in blocks of sites, I had two options: either I ran out of memory,
or I used too many tasks. Of course, if you have more tasks, the blocks are smaller, but there were so many tasks that in the end everything was ultra-slow. So in the end I went back to a site-by-site algorithm. You have to look at it case by case. Finally, another way to solve the memory issue
is to have a parallelization framework which is able to yield partial results. This is an example: you want to accumulate results, okay, you have a processor for each source, and when you have the result of a source you accumulate it, for instance, in a list.
If you are able to yield the partial results instead, then you avoid accumulating everything in a huge list. In order to do that, you need a parallelization approach that is able to do that. Unfortunately, we don't have it so easily,
so I had to implement it with ZeroMQ. But if there is somebody here who is a core developer of Python and can touch this, I think the multiprocessing module should work this way: if there is a return, it returns one result; if there is a yield, it returns partial results.
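Here is a toy illustration of the pattern with the standard library (the engine actually does this with ZeroMQ, so this is just the idea, not its code): the work is split into chunks and each partial result is folded into an accumulator as soon as it arrives, instead of being collected into one huge list.

import multiprocessing

def partial_result(chunk):
    return sum(x * x for x in chunk)       # stand-in for a real computation

def main():
    chunks = [range(i * 10_000, (i + 1) * 10_000) for i in range(100)]
    acc = 0
    with multiprocessing.Pool() as pool:
        for partial in pool.imap_unordered(partial_result, chunks):
            acc += partial                 # accumulate, then discard the partial
    print(acc)

if __name__ == '__main__':
    main()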
That's an idea I had 10 years ago, and it feels very natural to me. I don't know why nobody is doing it, but anyway, it helped a lot with the memory. And with this I'm done, I'm ready for questions.
We have four minutes for questions, so we have some microphones there and there, so please line up behind them and fire your question away.
Thank you for your talk. You mentioned that Dask was missing a feature compared to another framework, and that you talked to the main developer and they had no plans for it. Can you repeat it? I missed which feature was missing. When you were talking about ZeroMQ, I think,
you said that Dask was missing a feature. No, no, no. ZeroMQ is the solution, it has the feature. It is the multiprocessing module, essentially, in the standard library, that does not have the feature of yielding partial results. That, I think, would be useful.
You did not speak about Dask, the framework? Dask, yes, I spoke about it. The feature of Dask that was missing was the revoke functionality, the ability to kill, revoke, the ability of killing my calculation without killing the calculation of another guy. And that is available now? Ah, maybe it's available now; it was one year ago that I looked at it.
Okay, I see. Okay, thank you. Okay, sorry if you misunderstood, it was my fault. Any other question? Did you ever look into using Apache Spark for a similar framework?
And if so, why did you choose to use your own system? Apache Spark is too big for us. For me it would be more on the high performance side, and we are at the middle level. It would also be too big from the infrastructure point of view; our system is already very busy,
and it would be a big change for us. So we try to do whatever we can do in the small. Okay, any other question? Well, one question I have is: have you tried to use MPI and mpi4py?
Yeah, I'm aware of MPI. We do not have a cluster with MPI, so we cannot test it, essentially. Okay. And again, it would be too difficult for us, we are trying to stay at the middle level. I have another question: did you have a look at Apache Parquet?
Parquet, I know about Apache Parquet, but we did not have a look at it. Actually, you should also consider that I'm spending 99% of my time fixing bugs, answering the mailing list, doing training,
and the time we have for changing the distribution mechanism is very little. So I made my talk on that because it's an interesting topic for developers, but it's not really where most of my time is spent. Okay. Any other question?
Well, what should you do if you realize that you live in a city which is on one of your maps, like the fatalities map, that kind of stuff? If you are curious, we have, sorry? Or that building over there. There is this application, there is a GEM website
where you can go and see your city. You look in there and see what the hazard is where you live. I live there, okay. And we also have risk profiles for each country in the world, more than 200 countries, where we give you information
like which was the biggest earthquake recently. Okay, maybe I don't want to know. How many people are at risk? Everything is there, on the GEM site. Nepal. And Nepal is one of the most at-risk countries.
Okay, let's leave it like this, because we are getting depressed. So let's thank Michele again, and the next speaker can come. Okay.