
Tracking Performance of a Big Application from Dev to Ops


Formal Metadata

Title: Tracking Performance of a Big Application from Dev to Ops
Number of Parts: 490
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
This talk describes how performance aspects of a big Air Traffic Flow Management mission critical application are tracked from development to operations. Tracking performance is needed when new functionality is added, to balance the additional services versus the resource increase needed. Measuring and tracking performance is also critical to ensure a new release can cope with the current or expected load. We will discuss various aspects such as which tools and techniques are used for performance tracking and measurements, what are the traps and pitfalls encountered for these activities. The application in question is using Ada, but most of the items discussed are not particularly Ada related.
Transcript (English, auto-generated)
OK, our next speaker is Philippe Waroquiers from Belgium, and he's working for Eurocontrol, where he manages about 2.5 million lines of Ada code. 2.3.
2.3, correct. OK, so today I will speak about tracking performance of a big application from development to operations. In fact, what I will discuss is not very much related to Ada. In fact, yesterday there were even zero slides in my presentation containing a line of Ada code,
so I was a little bit afraid that the devroom organizers would kill me after the presentation, so I added one slide. In fact, I was joking when I said that they would kill me after the presentation, because Jean-Pierre, one of the devroom organizers,
in fact, he's working for me, because I'm paying him to support the AdaControl tool that we use to check the coding rules of our code. And Dirk Craeynest, absent today, the second devroom organizer, is working in my team, so they have reasons not to kill me, except maybe, of course,
Dirk might have a reason to kill me because he's working for me. Voilà. So, let's switch to the next slide. What are the objectives of performance tracking? The objective of performance tracking is to evaluate and measure the resources needed by new functionalities.
That's one objective. Another objective is to verify the estimated resource budget, CPU and memory, of what you develop.
So, we also want to ensure that the new release will cope with the current or expected new load, and we want to avoid performance degradation during development. For example, imagine that we have a team of 20 developers working six months on a new release. That's about the size of the team working on this application.
And let's imagine that each developer integrates x changes per month, and that one change out of x degrades the performance by one percent. Then, optimistically, after six months, we have a new release which is 2.2 times slower.
We start from performance 100% and add six months times 20 developers times one percent. That's the optimistic view. The pessimistic view is that the new release is 3.3 times slower: we start from 100% and multiply by 1.01 to the power of 120. So, clearly, we have to do something,
and we cannot wait until the end of the release to check the performance and see where the degradation is coming from. So the objective is to do daily tracking of the performance during development. We have a development performance tracking objective which is relatively precise, because we want to reliably detect performance differences of one percent or less.
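As a quick check of those 2.2 and 3.3 figures (the arithmetic here is mine, reconstructed from the numbers given in the talk): 20 developers times six months, with one degrading change of one percent per developer per month, gives 120 degradations in total.

    optimistic, degradations simply add up:   1 + 120 × 0.01 = 2.2
    pessimistic, degradations compound:       1.01^120 ≈ 3.30

So without daily tracking, the release ends up somewhere between 2.2 and 3.3 times slower.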
So, a little bit of explanation about Eurocontrol, about where I'm working and what we are doing. Eurocontrol is the European Organisation for the Safety of Air Navigation.
So, it's an international organization with 41 member states, and I would like to draw your attention to the fact that it is 41 and not 40 since last midnight. We have several sites, directorates, etc.
As for activities, we have operations, handling the management of the traffic in Europe. We do concept development. We do some European-wide project implementation and support, etc. More info is on our website. Where I'm working is the Directorate of Network Management.
It has nothing to do with network routers, cables, and so on. It manages the network which is used by the flights to fly. So, we have routes, we have points, we have air routes. This is a network and that's what we manage. We develop and operate the air traffic management network.
So, we have different operational phases. We have strategic: we prepare the network, restructure the network and so on, months or years in advance. Pre-tactical: a few days in advance. Tactical: today, tomorrow. Post-operations: we have to analyze what has happened and maybe improve for the future.
So, we are handling airspace and route data, flight plan processing for the whole of Europe, and flow and capacity management. If you have an airspace control center and too many flights are planned to fly through it, the controllers might be overloaded, which is not good for your safety.
And the idea is that our systems will detect this load and then provide mechanisms to balance the capacity and the load. So, network management: we have two mission- and safety-critical systems. These are IFPS, flight plan processing, and ETFMS, flow and capacity management.
And my team is developing these two applications, which share quite a lot of code, in Ada. Ada is by far the main language used; the core of the system is something like 99% Ada. So, some more words about IFPS and ETFMS.
So, it's a big application. The core is 2.3 million lines of code. And then, around it, we have shell scripts for the build infrastructure and for the monitoring, we have a lot of test code, etc. In terms of capacity, let's speak about a peak day of ETFMS, the flow management system.
So, in Europe, on a peak day, we have more than 37,000 flights to handle. And we have around 8.6 million radar positions, and these are planned to increase to 18 million within one year, because we will add new sources of position, namely the ADS-B data provided by new systems.
So, we have external users: the aircraft operators, Air France, British Airways, and so on, the airports, the air traffic control centers. They all query our systems in order to get data, and this results in about 3.3 million queries per day at peak.
And we are publishing changes on the flight data through publish-subscribe mechanisms such as AMQP, and we have more than 3.5 million messages per day to publish. So, what hardware do we run on? The online processing is done on a 28-core Linux server.
So, it is not a small server, but it is by far not a huge, huge setup. We have some workstations that are running a graphical user interface for our internal flow controllers. And on these workstations, because they are quite powerful, we are also doing some batch processing and some background jobs.
So, we have many heavy queries, and we have complex algorithms that are called a lot. For example, we have queries such as counts, or give me a flight list of all the flights traversing France between 10 o'clock and 20 o'clock. We have algorithms like lateral route prediction or route proposal optimization.
We have vertical trajectory calculation; I'll show a drawing afterwards to give an idea of the complexity. And a lot of other things, as you can imagine: in 2.3 million lines of code, we do a lot. So, this is a graph of a flight departing somewhere in Turkey, arriving somewhere outside of the European Union.
And you see what it looks like when we have to calculate a trajectory.
So, for example, the orange things that you see here are radar positions. Each time the system receives a radar position, it has to detect whether it is far from the planned trajectory or not, and if it deviates a lot, it has to recompute the trajectory. And this is just computing one or several routes when the aircraft operator changes the flight plan for a flight.
But what we also provide as a service is a route optimization service. They can call our system and say, please find me a good route. And then we have to search maybe this way, maybe this way, maybe this way. Of course, basically it's a shortest path algorithm, Dijkstra, but we can't use a pure Dijkstra.
It's a lot more complex than that, because there are constraints which cannot be modeled in a simple graph. And so even searching, let's say, the n shortest routes is a lot more complex than the typical textbook problem.
This is the same flight that we have seen, but in a vertical view. We see it departs here and lands there. And we see that we have plenty of constraints, like forbidden airspaces, levels that have to be respected, blocks of airspace that can be traversed or not depending on the direction and other conditions.
So, you see that the algorithms we have to write are quite complex. For this algorithm, for example, the one that searches where a flight can fly, we have several hundreds of millions of calculations to do per day, done either on the central server or in batch or background jobs on workstations.
So, performance needs in ETFMS: scalability. As I have indicated, we have a lot of users and a lot of queries, and so we need horizontal scalability. Here I describe the operational configuration of our main server,
because we have a lot of instances of this software: for example, an instance of the same system which is doing the prediction for the next week, or standalone systems that are used to study what happens in several months. Here, I am speaking about the operational system.
So, on our 28-core system, we have ten high-priority server processes that are handling the critical input: the flight plans, the radar positions, the external user queries. We have nine lower-priority server processes, each having four threads, which are handling lower-priority queries, such as find me a better route for flight Air France 123.
We have up to about 20 processes running on workstations, which are executing these batch jobs or background queries. For example, every hour, search a better route for all flights of aircraft operator British Airways departing in the next three hours. And that, of course, is quite a high load.
We also need vertical scalability. This is a little bit different from the case where you have a lot of users asking a lot of queries and you can distribute them, which is what we do. We also have some functionalities, like simulation, where one single user needs a lot of power, because of our internal flow controllers.
They sometimes have to take heavy actions: for example, an airport or an airspace control center calls and says, we have a technical problem, and we have to take maybe 1,000 flights and find how to best handle these 1,000 flights. All these actions will be done on the complete set of flights in the system,
and so we must provide our flow controllers with functionality that is fast for a lot of changes. So, we need vertical scalability, something very fast, for example to evaluate heavy actions such as:
we close an airspace, we close a country, and we have to spread, reroute, or delay all the impacted traffic. To give an example, starting a simulation implies cloning the whole traffic from the server to the workstation. It's a very fat client, and we need to recreate the in-memory indexes
which are needed to execute all these algorithms and so on. It's about 20 million in-memory indexes that we have to recreate. And in the release we are busy developing, we have spent quite some time optimizing the start of a simulation. We are now starting a simulation in less than four seconds,
including bringing the data from the server. And this is using multi-threading. We have one task that decodes the flight data received from a stream from the server, one task that creates the flight data structures once they are decoded, and six tasks which are recreating the indexes.
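To make the shape of that pipeline concrete, here is a minimal Ada sketch (a simplification under my own assumptions: the task names, the queue element type and the volumes are invented, not the actual ETFMS code). One task decodes the stream, one builds the flight structures, and six workers rebuild the indexes, connected by synchronized queues:

--  Minimal sketch of the start-simulation pipeline described above.
--  Queue contents and task bodies are placeholders, not the real ETFMS code.
with Ada.Containers.Synchronized_Queue_Interfaces;
with Ada.Containers.Unbounded_Synchronized_Queues;
with Ada.Text_IO;

procedure Start_Simulation_Sketch is

   type Flight_Data is record
      Id : Natural := 0;   --  stands in for the decoded flight attributes
   end record;

   package Queue_Ifc is
     new Ada.Containers.Synchronized_Queue_Interfaces (Flight_Data);
   package Queues is
     new Ada.Containers.Unbounded_Synchronized_Queues (Queue_Ifc);

   Decoded : Queues.Queue;   --  decoder -> structure builder
   Built   : Queues.Queue;   --  structure builder -> index builders

   Nb_Flights  : constant := 1_000;   --  placeholder volume
   Nb_Indexers : constant := 6;

   task Decoder;             --  decodes the flight stream from the server
   task Builder;             --  creates the flight data structures
   task type Indexer;        --  recreates a share of the in-memory indexes
   Indexers : array (1 .. Nb_Indexers) of Indexer;

   task body Decoder is
   begin
      for I in 1 .. Nb_Flights loop
         Decoded.Enqueue ((Id => I));
      end loop;
   end Decoder;

   task body Builder is
      F : Flight_Data;
   begin
      for I in 1 .. Nb_Flights loop
         Decoded.Dequeue (F);
         Built.Enqueue (F);   --  the real code builds the flight structure here
      end loop;
   end Builder;

   task body Indexer is
      F : Flight_Data;
   begin
      loop
         select
            Built.Dequeue (F);   --  the real code inserts F into the indexes here
         or
            delay 1.0;           --  crude termination: stop when idle
            exit;
         end select;
      end loop;
   end Indexer;

begin
   Ada.Text_IO.Put_Line ("pipeline started");
end Start_Simulation_Sketch;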
Okay, so, what we have seen now is that we have high performance requirements and we can't degrade, and so we have to track performance during development. One thing we can use for that is performance unit tests. Performance unit tests are useful to measure things such as basic data structures,
hash tables, binary trees, and so on. For example, here you see that we have a performance unit test which is checking the speed of an insert in a balanced binary tree. With this we can double check that, for example, the n log n behavior that you expect for n insertions is effectively respected by your implementation.
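A minimal sketch of that kind of performance unit test (my own illustration, not the real test code), using the standard ordered set, which is a balanced tree in GNAT: time N insertions for two sizes and check that the cost grows roughly like n log n rather than quadratically:

--  Minimal sketch of a balanced-tree insertion performance unit test
--  (my illustration, not the actual ETFMS test).
with Ada.Containers.Ordered_Sets;
with Ada.Real_Time; use Ada.Real_Time;
with Ada.Text_IO;   use Ada.Text_IO;

procedure Tree_Insert_Perf is

   package Int_Sets is new Ada.Containers.Ordered_Sets (Integer);

   function Insert_Time (N : Positive) return Duration is
      S     : Int_Sets.Set;
      Start : Time;
   begin
      Start := Clock;
      for I in 1 .. N loop
         S.Insert (I);        --  balanced-tree insertion being timed
      end loop;
      return To_Duration (Clock - Start);
   end Insert_Time;

   Small : constant Duration := Insert_Time (100_000);
   Large : constant Duration := Insert_Time (1_000_000);
   Ratio : constant Float    := Float (Large) / Float (Small);

begin
   Put_Line ("100k inserts:" & Duration'Image (Small) & " s");
   Put_Line ("1M   inserts:" & Duration'Image (Large) & " s");
   --  For n log n behavior, 10 times the elements should cost roughly
   --  11 to 12 times the time; flag anything far beyond that.
   if Ratio > 20.0 then
      Put_Line ("FAIL: insertion cost grows much faster than n log n");
   else
      Put_Line ("OK, ratio:" & Float'Image (Ratio));
   end if;
end Tree_Insert_Perf;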
So, we can use a performance unit test to check the performance of low-level primitives, such as pthread mutexes, Ada protected objects, et cetera. This is a performance unit test which verifies various things.
It checks the low-level pthread calls that we have available on Linux. We can compare with protected objects and higher-level Ada concepts for this kind of thing. We also have some timings here in this performance unit test for the system call clock_gettime:
CLOCK_MONOTONIC, about 40 nanoseconds for this one; clock_gettime for the thread CPU time, about 400 nanoseconds. It is interesting to remember these figures, and the difference between the two, because I will speak about it a little bit later.
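Here is a minimal sketch of such a timing unit test (my illustration, not the real one): it times a protected-object call both with the wall clock and with the per-task execution-time clock. Under GNAT on Linux these typically map to clock_gettime with CLOCK_MONOTONIC and to the thread CPU clock respectively, but that mapping is an assumption of this sketch:

--  Minimal sketch of a low-level primitive timing test (my illustration).
--  The mapping of these Ada clocks to clock_gettime is the usual GNAT/Linux
--  behaviour, assumed here, not guaranteed by the language.
with Ada.Real_Time;
with Ada.Execution_Time;
with Ada.Text_IO; use Ada.Text_IO;

procedure Primitive_Timing is

   protected Counter is
      procedure Increment;
      function Value return Natural;
   private
      Count : Natural := 0;
   end Counter;

   protected body Counter is
      procedure Increment is
      begin
         Count := Count + 1;
      end Increment;

      function Value return Natural is
      begin
         return Count;
      end Value;
   end Counter;

   Iterations : constant := 1_000_000;

   Wall_Before : constant Ada.Real_Time.Time          := Ada.Real_Time.Clock;
   Cpu_Before  : constant Ada.Execution_Time.CPU_Time := Ada.Execution_Time.Clock;

begin
   for I in 1 .. Iterations loop
      Counter.Increment;       --  the protected operation being measured
   end loop;

   declare
      use Ada.Real_Time;       --  for "-" and To_Duration
      use Ada.Execution_Time;  --  for "-" on CPU_Time
      Wall : constant Duration :=
        To_Duration (Ada.Real_Time.Clock - Wall_Before);
      Cpu  : constant Duration :=
        To_Duration (Ada.Execution_Time.Clock - Cpu_Before);
   begin
      Put_Line ("per call, wall clock :" & Duration'Image (Wall / Iterations));
      Put_Line ("per call, thread CPU :" & Duration'Image (Cpu / Iterations));
      Put_Line ("final count          :" & Natural'Image (Counter.Value));
   end;
end Primitive_Timing;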
So, we can also use performance unit tests to evaluate, measure, and make sure that we have the required performance for low-level libraries, for example malloc. In Ada, we don't use malloc directly, we use the Ada language allocators, but at least with GNAT, these are based on calls to the underlying malloc library.
So, performance unit tests have a lot of advantages. They are usually small, they are usually fast, and they are usually reproducible and precise. Remember, the one-percent objective: if there is a degradation, we want to detect it, and we want to detect it with one-percent precision.
Now, we have some pitfalls with performance unit tests, and I will describe a real-life example with malloc. We developed a performance unit test to compare the glibc malloc with tcmalloc and jemalloc. Seven years ago, we switched from the glibc malloc to tcmalloc,
because we had less fragmentation and it was faster with tcmalloc. All fine, all good. But when we parallelized the start simulation, where we had, for example, to recreate these 20 million in-memory indexes, we saw some incomprehensible 25-percent variation.
So, sometimes the start simulation was taking four seconds, and sometimes it was taking five seconds. And we saw that the performance difference was varying depending on linking a little bit more or less code into the executable, even though this code was not called. So a minimal change to the size of the executable was causing a difference.
So we said, let's analyze where this is coming from, and we started to analyze with Valgrind Callgrind to really see in detail the instructions executed, and we saw no difference. We then used the Linux perf tool to analyze what the behavior was,
not under the Valgrind Callgrind simulator, but the real thing. And perf showed effectively that the tcmalloc slow path was called a lot more when we had maybe 10 bytes more or 100 bytes less of executable code that was never called. We couldn't understand this mystery. We saw it was more often in the slow path, but we couldn't determine why.
So, we said, that's easy, now we will re-measure the malloc library. We made a malloc performance unit test with tasks, simulating our indexing tasks, each doing m million mallocs and then m million frees.
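A minimal sketch of that kind of allocation unit test (my illustration, not the real one): several tasks, standing in for the indexing tasks, allocate a large number of objects through the standard Ada allocator, which with GNAT ends up calling the underlying malloc, then free them, and the whole run is timed:

--  Minimal sketch of an allocation performance unit test (my illustration,
--  not the actual ETFMS test).  With GNAT, "new" and Unchecked_Deallocation
--  end up in the underlying malloc/free.
with Ada.Real_Time; use Ada.Real_Time;
with Ada.Text_IO;   use Ada.Text_IO;
with Ada.Unchecked_Deallocation;

procedure Malloc_Perf is

   type Payload is array (1 .. 8) of Long_Integer;   --  a 64-byte-ish object
   type Payload_Access is access Payload;
   procedure Free is new Ada.Unchecked_Deallocation (Payload, Payload_Access);

   Nb_Tasks  : constant := 6;           --  like the six indexing tasks
   Nb_Allocs : constant := 1_000_000;   --  allocations per task

   task type Allocator_Task;

   task body Allocator_Task is
      type Ptr_Array is array (1 .. Nb_Allocs) of Payload_Access;
      type Ptr_Array_Access is access Ptr_Array;
      procedure Free_All is
        new Ada.Unchecked_Deallocation (Ptr_Array, Ptr_Array_Access);
      Ptrs : Ptr_Array_Access := new Ptr_Array;
   begin
      for I in Ptrs'Range loop       --  allocation phase
         Ptrs (I) := new Payload;
      end loop;
      for I in Ptrs'Range loop       --  deallocation phase
         Free (Ptrs (I));
      end loop;
      Free_All (Ptrs);
   end Allocator_Task;

   Start : constant Time := Clock;

begin
   declare
      Workers : array (1 .. Nb_Tasks) of Allocator_Task;
   begin
      null;   --  the block completes only when all allocator tasks finish
   end;
   Put_Line ("elapsed:" & Duration'Image (To_Duration (Clock - Start)) & " s");
end Malloc_Perf;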
What did we see? We saw that glibc was slower but had consistent performance with this unit test. jemalloc, with this performance unit test, was significantly faster than tcmalloc. We were really happy. But when we used jemalloc with the real code, the real start simulation with the complete system
was slower with jemalloc. So, what was the conclusion? More work was needed on the unit test so that it better simulates what we have to do. So, we continued to work on the unit test. After improving the unit test to better reflect the start-simulation work, what did we see? tcmalloc was slower with many threads,
but became faster when the unit test was doing L loops of simulated start simulation, stop simulation. So, you run the unit test with one simulated start with tcmalloc, and once with jemalloc: ah, tcmalloc is slower. But when you make tcmalloc do start, stop, start, stop, start, stop, then it was faster.
So, our unit test was still not okay. With jemalloc, we observed that doing the m million frees in the main task was slower, and that is in fact what happens: when we stop a simulation, it's the main task that does the frees. The unit test also does not evaluate fragmentation. And I have not listed all the mysteries that we have seen with this unit test.
But still, based on what we measured, we obtained a very clear conclusion with this unit test about what to do about malloc. The very clear conclusion is that we cannot conclude from the malloc performance unit test. So, currently, we have decided to keep tcmalloc,
and we will re-evaluate with the newer glibc in Red Hat 8. We are currently on Red Hat 7, and on Red Hat 7 the glibc is quite old. Okay, so, pitfalls of performance unit tests. As we have seen, it's difficult to have a performance unit test which is representative of the real load. For malloc, we obtained no conclusion.
The pthread mutex timing that we have seen was a very simple measurement: in fact, we measured without contention. But what should we do? We should maybe also measure with contention, but what type of contention, and so on. And if we want a unit test representative of the real load,
what kind of contention do you have in the real load? And if you measure and see where you have contention and simulate this in your performance unit test, that might be valid for your current release, but if you change your code, the pattern of contention might change. So performance unit tests are nice, small, fast, but it's difficult to make them representative.
Even for hash tables, binary trees, and so on, the real behavior depends on the key types, on the hash function, on the compare function, on the distribution of the key values, etc. So if it is already difficult to have performance unit tests for such low-level algorithms,
what about performance unit tests for more complex algorithms? For example, how do you have a representative trajectory calculation performance unit test? You remember the picture at the beginning? How do you do a performance unit test for that? With which data? How many airports, routes, airspaces?
With what flights? A lot of short haul, a lot of long haul, flying where? You can have a lot of variation in the data. So, the conclusion on performance unit tests: they are somewhat useful, but largely insufficient. And so the solution is to complement them, in fact to do most of the performance tracking,
not with performance unit tests, but by measuring and tracking performance with the full system and real data. So we want to replay one day of operational data. Replaying operational data: the operational system, ETFMS, records all its external input. It records the messages that modify the state of the system,
the flight plans, the radar positions, etc. It records the query messages arriving in a front-end system, so ETFMS is recording them. For example, queries such as the flight list entering France between 10 and 12, which might be asked by France, for example by a French control center, are recorded.
And so we have a replay tool: ETFMS has a replay tool which can replay the input data. Of course, it means that with the new release we are preparing, we must be able to replay the somewhat recent old input formats, so this brings a bit of a constraint on the development. And with this, we have some difficulties.
We need several days of input to replay one day because you can have flights that are filed several days in advance. So if you have a flight today, the flight plan might have been filed two days in advance, and so we have to replay more days. Now, what is the elapsed time that we need to replay several days of operational data?
This is a problem, of course. What is the hardware needed to replay the full operational data? Well, we have seen that we have, let's say, a medium-sized server and workstations. If a developer wants to evaluate the performance impact of a change, or if we want to track daily, we would have to ask for quite a lot of hardware. So that's a problem with replaying the full operational data.
Also, remember our objective of one percent or better: how do we get a sufficiently deterministic replay in a multi-process, multi-threaded system? This is quite a challenge, and we will describe it a little bit later. Remember, one percent.
So, the volume of data to replay: replaying the full operational input is too heavy, and so the compromise is to replay all the data that changes the state of the system (flight plans, radar data, et cetera) but only a subset of the query load. We replay only one hour of the query load of the real system.
And even within this, we replay only a subset of the background and batch jobs. We also have the problem that replaying in real time is too slow. As I have said, we have to replay several days to get the result of one day, so if we replay in real time, it takes several days.
If you want to do daily tracking of the performance, you have a lot of replays that will be in parallel a little bit everywhere, and so we have to try to reduce the time needed to replay. But we can't just take all the input and replay it instantaneously, as fast as possible, without doing something, because an input must be replayed at the time it was received on ops.
If you have a flight plan at 9 o'clock and radar data at 9:30 and another one at 9:35, you can't replay the radar data together with the arrival of the flight plan; you have to wait until it is the correct time to process the radar data.
Many actions also happen on timed events, and so what we need is an accelerated, fast-time replay mode. What is it? The replay tool controls the clock value, and the clock value jumps over the time periods with no input and no event. So we are processing the data at the correct simulated clock time,
but when there is nothing to do, the clock jumps instead of waiting. So with the fast-time mode, with all these limitations on the data and with this time control, we can now replay the data needed for one day in about 30 hours on a fast Linux workstation.
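A minimal sketch of the fast-time idea (my own simplification, not the actual replay tool): the replay loop keeps a simulated clock, processes each recorded input at its recorded timestamp, and jumps the simulated clock forward over idle periods instead of waiting through them:

--  Minimal sketch of a fast-time replay clock (my simplification, not the
--  real replay tool): inputs are processed in timestamp order at their
--  recorded time, and the simulated clock jumps over idle periods instead
--  of waiting for them in real time.
with Ada.Text_IO; use Ada.Text_IO;

procedure Fast_Time_Replay is

   type Sim_Time is delta 0.001 range 0.0 .. 86_400.0;   --  seconds in a day

   type Recorded_Input is record
      Timestamp : Sim_Time;
      Kind      : Character;   --  'F' flight plan, 'R' radar, 'Q' query (toy)
   end record;

   --  A toy recording; the real tool reads days of recorded operational input.
   Recording : constant array (Positive range <>) of Recorded_Input :=
     ((32_400.0, 'F'), (34_200.0, 'R'), (34_500.0, 'R'), (36_000.0, 'Q'));

   Sim_Clock : Sim_Time := 0.0;

   procedure Process (Item : Recorded_Input) is
   begin
      Put_Line ("t=" & Sim_Time'Image (Sim_Clock) &
                "  processing input " & Item.Kind);
   end Process;

begin
   for Item of Recording loop
      if Item.Timestamp > Sim_Clock then
         --  Nothing to do between Sim_Clock and Item.Timestamp:
         --  jump instead of sleeping through the idle period.
         Sim_Clock := Item.Timestamp;
      end if;
      Process (Item);   --  handled at its correct (simulated) time
   end loop;
end Fast_Time_Replay;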
We don't need a big server anymore, and we can use a workstation which is not so huge. Still, we have to pay attention to the sources of non-deterministic results. One source of non-deterministic results is the network, NFS and so on.
If the database files or whatever are on the network, then the replay is not the only user of the network, and that can introduce quite a lot of variation, much bigger than the one percent we want to detect. The solution for this is very easy: we replay on isolated workstations. They have their local file system, their local database, and so on.
Now, another source of non-deterministic results is the system administrators. You might say, how can a system administrator cause non-deterministic results in a replay? Well, they are running jobs to audit, to see what's happening, whatever, and if these jobs suddenly run during a replay, they change the performance.
So the solution is to discuss with the system administrators so that we can disable their jobs on the replay workstations; that was not too difficult. Another source of non-determinism is the security officers. They absolutely want to be sure that there is no virus, no strange thing,
no disallowed root executable. So they also want to run audit jobs and scan jobs and so on, and here also the solution is to discuss with them, but it was a little bit more difficult.
We also see that non-deterministic results are obtained because of the input/output history. You start the replay from scratch, but because you use a file system and a database that were already used previously, even if you removed all the files and cleared the database, you still do not obtain exactly the same results,
and this was annoying with a one-percent objective. Removing the files and clearing the database was not good enough, so before each replay, we completely recreate the file system and the database. Even with that, it was not okay, because the operating system itself keeps history.
If you do the same thing twice, one run after the other on Linux, the second time the kernel memory or whatever might have changed, and so what we do is recreate the file system and the database, and reboot the workstation before each replay.
With this, we still have some remaining sources of non-deterministic results. For example, in the time-control tool, we serialize most of the input processing, most but not all, because if we serialized everything, it would slow down the replay a lot.
For example, the radar positions that arrived within the same second are not replayed one by one: we have several processes, and these processes will handle the radar data that arrived in the same second in parallel, and that can introduce some non-determinism. The replays are done on identical workstations,
same hardware, as I've said, file system recreated, database recreated, rebooted. Still, despite the same hardware, the same operating system, restarting from scratch and so on, we are observing differences between workstations, so not all workstations are born equal, that's our conclusion. Small differences in the CPU clock or whatever,
but we see some impact. With all these caveats and limitations and everything we did, we have finally achieved reasonably deterministic replay performance, with three levels of results. We have global tracking, where we track the elapsed time and the user and system CPU for a replay of the complete system.
We do per-process tracking, user and system CPU, and some perf stat recording, and we have detailed tracking: we run one hour of replay under Valgrind Callgrind. This we run on the side, because it takes quite a lot of time. It's very slow, it takes 26 hours, but it is very precise.
So, this is a drawing. These are all the baselines that we are building; we do continuous integration: every day, developers are integrating, and we are building and replaying. And this is a drawing showing the tracking
of the global performance of the release we are developing, where we have done this start-simulation optimization. In green, you see the total user CPU of all the processes that are replaying. In blue, the total system CPU of all the processes. And in red, the elapsed time. What you can see is that we have had a very gradual improvement
in performance during the development. And we have managed to track this gradual improvement, to see that the optimizations we were doing were effectively optimizing, and we were able to see performance degradations at a global level using this.
By the way, these points here are not because we suddenly got a quantum computer that did the replay in zero point something seconds; it's because we had a problem during the replay, of course. I'll discuss this part of the graph in more detail later. Remember the pattern a little bit.
So, that's the global tracking. We also have per-process tracking, where we record the user and system CPU, the heap status, how much was used and free, tcmalloc details, because we are using tcmalloc, and so on. That's the kind of thing that we record for each process, so that we can see what's happening. So, here we see four processes which are processing flights.
The one there is the one which is processing flight lists, counts and so on. And for each of these processes, we are recording data that allows us to understand, if we see a difference on the big global graph, which process has increased. And then the third level is when we have to analyze what has happened inside a process.
Then we have the one hour of replay under Valgrind Callgrind, and we use the excellent KCachegrind tool and the excellent Callgrind tool to record the call stacks, who has spent what. We can see the functions which have consumed the most,
and we can see the code, which jumps we have, which conditions were often true or false. And we can go down to the assembly language level. So, this is the main tool we are using when we want to optimize some specific algorithms
or when we see a degradation. By the way, I am also the organizer of the debugging tools devroom, and tomorrow there is a talk, at least one talk, about Valgrind. So, if you are interested... that was the advertisement. Another interesting thing to discuss is
what we measure: we want to avoid performance degradations, but we also want to see, if we believe we are doing an optimization, whether it really is an optimization. And here is a real-life example of what we believed was a missed optimization, something that we could optimize, that we tried to optimize,
and that then turned into a failed optimization. So, this is the slide with a little bit of Ada code; I promised Jean-Pierre that we would have some lines of Ada code, and here they are. What we see here is a little bit of code of an Ada task, and it has two rendezvous. This task is maintaining the automatic loading of data,
like when the airports are changing and so on. It synchronizes the access to and the loading of this data. While someone is accessing this data, it must be locked, and when it is not locked, the task will load new data into memory. So this task, among other things, has to maintain the number of locks,
and so it has accepts. It has a rendezvous called unlock: when a client says, I don't need a lock anymore, it calls unlock, and the task decrements the number of locks. And there is also a rendezvous called getLockCount, which returns the current lock count to the client.
What is it used for? When a process is activated to handle something, like a flight plan, it will take some locks, but when it has finished processing the flight plan, it's not supposed to hold any lock anymore. And so, at the top level of the processing, we are checking that there is no lock left by calling this
getLockCount. Okay, a rendezvous with an Ada task is something which is relatively costly, because it is a task "calling", in quotes, another task. There is a task switch, there are some system calls to synchronize, and so it's relatively expensive.
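Here is a minimal reconstruction of the kind of task being described (the identifiers and the structure are my guess at the slide, not the actual ETFMS code). Note that in this original version the unlock accept has no body, and the decrement is done just after it:

--  Minimal reconstruction of the data task described above (my own guess at
--  the slide, not the actual code).  The Unlock accept has no body, which
--  GNAT can optimize; the decrement happens after the rendezvous.
with Ada.Text_IO; use Ada.Text_IO;

procedure Lock_Count_Task_Sketch is

   task Data_Loader is
      entry Lock;
      entry Unlock;
      entry Get_Lock_Count (Count : out Natural);
   end Data_Loader;

   task body Data_Loader is
      Locks : Natural := 0;               --  number of locks currently held
   begin
      loop
         select
            accept Lock;                  --  client takes a lock
            Locks := Locks + 1;
         or
            accept Unlock;                --  rendezvous with no body:
            Locks := Locks - 1;           --  decrement done after the accept
         or
            accept Get_Lock_Count (Count : out Natural) do
               Count := Locks;            --  used to check "no lock left"
            end Get_Lock_Count;
         or
            terminate;
         end select;
         --  The real task also reloads data when no lock is held.
      end loop;
   end Data_Loader;

   Count : Natural;

begin
   Data_Loader.Lock;
   Data_Loader.Unlock;
   Data_Loader.Get_Lock_Count (Count);
   Put_Line ("locks still held:" & Natural'Image (Count));
end Lock_Count_Task_Sketch;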
So the optimization idea was to decrease the number of rendezvous by using lower-level synchronization based on Volatile. The idea was to keep the lock count not as a task-maintained variable, but as a global Volatile variable.
We have a function that returns the value, and inside the task, we now accept unlock and do the decrement in the accept body. Why do we have to do it in the body, when before, the decrement for unlock was done outside of the body?
Because imagine the task which is processing my flight plan: it finishes, it releases the last lock, and just after that it checks that there are zero locks. Here, with this solution, we must absolutely be sure that when the task that called unlock then reads the lock count,
the decrement has already been done. And so we must do it in the body, because otherwise we could possibly have a race condition in our check. So, this should be faster, because we will have the same number of unlock rendezvous, but a much faster getLockCount.
Accessing a Volatile variable is much faster than a rendezvous, and so this is supposed to be much, much faster. That was the idea. In reality, we detected with the performance tracking that this was a pessimization. The Ada compiler is quite efficient
and helps us build high-level synchronization algorithms in an efficient way. For example, if you have a rendezvous with no body, then the compiler will optimize it, and it is a lot less costly to have a no-body rendezvous than to have a body in the unlock accept. And so, because the unlock became heavier,
it was in fact a pessimization. And this was detected. This is an extract of the big drawing from the beginning. This was without the optimization; here we had a little bit of a problem; and here it was with the optimization. And you see that the system time slightly increased.
So, here we started to ask ourselves, what have we done? This went on for several baselines, and afterwards we said, we had better roll back. And so, here we have rolled back the optimization, so that we are back at the original system time performance. So, what can we conclude from that?
You need to track performance. You need to track performance because otherwise you can have problems. You have to track performance of your optimization because otherwise it might become a pessimization. And a third thing to remember is that you can believe you are maybe smarter than the compiler, but it's difficult. Difficult, but not impossible
because here you see we then did the optimization correctly, not using Volatile but using Atomic, with the decrement operation done outside of the task. And then it all went okay. So: we detect that we can optimize something, we optimize, but it is a pessimization.
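A minimal sketch of that final, correct shape (my reconstruction, using the Ada 2022 atomic operations package, which requires a recent GNAT; the real code and its exact pragmas may differ). Only the counter part is shown: the lock count is a global atomic counter, clients decrement it directly without any rendezvous, and the top-level check just reads it:

--  Minimal sketch of the corrected approach (my reconstruction, not the
--  actual code): an atomic global lock counter, decremented by clients
--  without a rendezvous, read directly by the top-level check.
--  Requires Ada 2022 (System.Atomic_Operations.Integer_Arithmetic).
with Ada.Text_IO; use Ada.Text_IO;
with System.Atomic_Operations.Integer_Arithmetic;

procedure Lock_Count_Atomic_Sketch is

   type Lock_Count is new Integer with Atomic;

   package Atomic_Counts is
     new System.Atomic_Operations.Integer_Arithmetic (Lock_Count);

   Locks : aliased Lock_Count := 0;   --  global atomic lock counter

   procedure Lock is
   begin
      Atomic_Counts.Atomic_Add (Locks, 1);
   end Lock;

   procedure Unlock is
   begin
      Atomic_Counts.Atomic_Subtract (Locks, 1);   --  no rendezvous needed
   end Unlock;

   function Get_Lock_Count return Lock_Count is (Locks);   --  atomic read

begin
   Lock;
   Unlock;
   --  Top-level check that no lock is left after processing:
   Put_Line ("locks still held:" & Lock_Count'Image (Get_Lock_Count));
end Lock_Count_Atomic_Sketch;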
We ask ourselves, why are we so stupid? We understand it, and then afterwards we become more intelligent. So, that's quite a positive evolution. So, performance tracking: a summary. We have good dev performance tracking using a mix of performance unit tests
and replay of operational data that is as deterministic as possible. The replay day, we have to change it, because the pattern of usage might change: we might have new companies appearing, airspace being restructured, new routes, old routes, and so on. So, we change the reference day that we compare against relatively frequently,
so as to match new usage patterns. We use various tools: Valgrind, Callgrind, KCachegrind, perf, top, and so on. But you have to take care about the blind spots of your tools. For example, Valgrind Callgrind with KCachegrind is very easy to use; it's really the main tool we are using to optimize. But it is very slow, and it serializes multi-threaded applications.
And so it measures nothing about contention: you never have any contention when you run something under Valgrind Callgrind. And it had limited system call measurement: it was measuring the number of system calls and the elapsed time in system calls, but not the system CPU.
And we saw that, in our case, the system CPU had to be measured. As I also happen to be a Valgrind developer, I changed Valgrind so that it also measures the system CPU spent in system calls. So, we need to have global indicators, and we have to zoom in on the details where needed. And as I have said, improvements are in the pipeline:
the next version of Valgrind Callgrind will measure system CPU. And we are also working on developing a Callgrind diff to help visualize differences, because currently comparing the KCachegrind graphs is a little bit difficult. So, it all looks wonderful.
We have a nice tracking system, nice graphs. We can measure from the top, from the global system, down to the details of a process. But can we be happy with that? Is that good enough to go operational? What if you are on call, or me, I'm on call from time to time,
and you are woken up Saturday at four o'clock because the users are complaining that the system is slow. Well, I can't say: I will replay the day tomorrow, Sunday, and on Monday I'll explain to you why we had big problems on Saturday. This is not acceptable. We have other questions, like:
is the reference day that we replay representative of what happens on ops? We need some indication of that. What about the evolution of the ops workload and capacity planning? For example, if we believe that our users will do a lot more queries to optimize their routes because we have improved,
or because we have new users who say they want to use our service: will the system cope, or do we have to change the hardware setup, upgrade the hardware, add more hardware? For this, we need something other than the replay; for example, what additional hardware capacity is needed to support ever more queries of a specific type?
The solution for this is to permanently activate response time monitoring and statistics. Here I'm speaking about the tactical response time package, tactical because it's mostly useful during tactical operations, but of course we also use it during the replays of the development.
The idea is that the application contains measurement code at critical points, such as the begin and end of every remote procedure call invocation. So, one process invokes something in another process and measures how long it took. On the side of the process that executes the remote procedure call,
we also measure how long it took to process and send back the reply. We measure the database access time, begin/end of the database access, and significant algorithm begin/ends, such as calculating a vertical trajectory, and so on. So, the measurements are typically nested.
For example, inside an RPC execution begin/end, we will have other begin/ends for the sub-operations used by it. This tactical response time package maintains a circular buffer with the last n measurements, the begin/end measurements, and for each begin/end measurement
it records the elapsed time, the thread CPU time, and optionally the full process CPU time. You remember that at the beginning I mentioned clock_gettime CLOCK_MONOTONIC and clock_gettime for the thread CPU: the elapsed time is measured with clock_gettime CLOCK_MONOTONIC, and the thread CPU time with the thread CPU clock, which is relatively heavy, costly.
The former is in fact a virtual system call, implemented, for those who know it, in the vDSO of Linux, while the latter really switches to the kernel to get the data. So, if there are kernel developers in the room, if you could improve clock_gettime for the thread CPU clock, that would be really nice. This package also maintains statistics:
how many measurements of which kind were done, a histogram of elapsed and thread CPU times, and details about the n worst cases. This gives a reasonable overhead: about 1.7% of the CPU is spent measuring what the application is doing rather than doing real work,
and for this reason we largely prefer to have it always activated, on ops and on our tests and replays and so on, because it is critical for us to understand what's happening in our system. So, this is an example: we have online access to this data structure. This is a screen which allows us to look interactively,
if you are called during the night, for example, at what's happening. And you see here a kind of tree of actions, where we have received a flight plan, we had a flight deviation, then we did several calculation phases for the trajectory, we had the flight in the database.
We calculated some other things. Finally, we distributed some data and, at the end, we committed the data in the database. So, we can track the details of what's happening in the last m measurements.
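To illustrate the idea, here is a minimal runnable sketch of such a self-measurement facility (my own illustration: the names, the buffer size and the measurement kinds are invented, not the real ETFMS package, and the statistics and histograms are omitted). Operations are bracketed with Begin_Measure and End_Measure, which may nest, and the last measurements, elapsed time plus thread CPU time, are kept in a circular buffer:

--  Minimal runnable sketch of the self-measurement idea (my illustration;
--  names and sizes are invented, not the real ETFMS package): bracket an
--  operation with Begin/End, store elapsed time and thread CPU time of the
--  last measurements in a circular buffer.
with Ada.Real_Time;
with Ada.Execution_Time;
with Ada.Text_IO; use Ada.Text_IO;

procedure Tact_Response_Time_Sketch is

   type Measure_Kind is (RPC_Execution, DB_Access, Trajectory_Calculation);

   type Measurement is record
      Kind    : Measure_Kind;
      Elapsed : Duration;
      Cpu     : Duration;
   end record;

   Buffer_Size : constant := 8;                       --  last N measurements
   Buffer : array (0 .. Buffer_Size - 1) of Measurement;
   Next   : Natural := 0;
   Stored : Natural := 0;

   type Open_Measure is record
      Kind          : Measure_Kind;
      Start_Elapsed : Ada.Real_Time.Time;
      Start_Cpu     : Ada.Execution_Time.CPU_Time;
   end record;

   function Begin_Measure (Kind : Measure_Kind) return Open_Measure is
   begin
      return (Kind, Ada.Real_Time.Clock, Ada.Execution_Time.Clock);
   end Begin_Measure;

   procedure End_Measure (M : Open_Measure) is
      use Ada.Real_Time;       --  for "-" and To_Duration
      use Ada.Execution_Time;  --  for "-" on CPU_Time
   begin
      Buffer (Next) :=
        (Kind    => M.Kind,
         Elapsed => To_Duration (Ada.Real_Time.Clock - M.Start_Elapsed),
         Cpu     => To_Duration (Ada.Execution_Time.Clock - M.Start_Cpu));
      Next   := (Next + 1) mod Buffer_Size;
      Stored := Natural'Min (Stored + 1, Buffer_Size);
   end End_Measure;

   Outer : constant Open_Measure := Begin_Measure (RPC_Execution);

begin
   --  Nested measurement, as for a database access inside an RPC execution.
   declare
      Inner : constant Open_Measure := Begin_Measure (DB_Access);
      Dummy : Long_Float := 0.0;
   begin
      for I in 1 .. 1_000_000 loop        --  stand-in workload
         Dummy := Dummy + Long_Float (I);
      end loop;
      End_Measure (Inner);
   end;
   End_Measure (Outer);

   for I in 0 .. Stored - 1 loop
      Put_Line (Measure_Kind'Image (Buffer (I).Kind) &
                "  elapsed:" & Duration'Image (Buffer (I).Elapsed) &
                "  cpu:"     & Duration'Image (Buffer (I).Cpu));
   end loop;
end Tact_Response_Time_Sketch;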
These are the statistics that are maintained. We maintain how many measurements we had, the total time spent, elapsed and thread CPU, the average, a histogram of the distribution, and the n worst cases. In case something is really abnormal,
it will appear in this data structure. So, this tactical response time, as I've indicated, is used from development to operations. In development, it helps to understand how the system works, to see the message exchanges between processes, the algorithms executed.
The statistics are used to analyze the performance replays. We can compare, using this tactical response time data, the profile of the replay day with the ops profile. We can measure the resource consumption of new functionalities, because we are recording the things that need to be recorded, and so on. On ops: online investigation of performance problems,
bug investigation in our system, in our code. The policy is that exceptions are used for bugs, not for normal behavior. And so, if we have an exception, we take a core dump; we can take a core dump without stopping the process. We take a core dump, we drop the input that created the problem, and we process the next message. And the core dump contains the full status
of what the process was doing, including, in the tactical response time buffer, the last m measurements of what the process did. Because possibly the bug might have been created by something that was done a little bit before. And so, with this, we can see what the process was doing and what it recently did.
We also use this for post-ops analysis and trend analysis, and as input for our capacity planning. So, performance tracking of a big application, a summary. We have reasonably deterministic performance tracking during development. It allows us to detect performance regressions on a daily basis.
We can verify that what we believe is an optimization has the desired effect. It allows us to plan capacity upgrades for demand growth and new functionality, etc. We are using a mix of various techniques and tools, such as performance unit tests, replaying real data, and application self-measurement. We have to take care to avoid blind spots
by using various tools: perf, Valgrind, Callgrind, top and strace. Each will teach you something about your application. We also use the tooling for other purposes. For example, the replay tool is also an automatic testing tool, because if we can replay 50,000 flight plans,
we can also inject a flight plan and verify that it has this characteristic and that it has received this delay, for example. So, the replay tool is also the test tool. It is also used by our users to analyze and optimize operational actions and procedures: if they did something and they think, maybe we should have done something else, they can replay, of course on offline systems,
and then they can decide to try other actions than what they did during operations. So, as I have indicated, your operational system needs to have performance tracking and statistics; this is not only for development. Voilà! That finishes my presentation.
Yes?
Yes, typically... Can you repeat the question? Otherwise... OK, so the question is, when we implement new features, do we inject some data in the replay in order to see how it behaves? Yes, the idea is that during the development,
when we implement a new feature, we are, of course, developing unit tests for this new feature, which are done with the replay tool, and whenever we have relevant data that we can use, we might have to discuss with our users or we might have to create them ourselves, we will measure how the code that we have developed behaves.
No, it's very difficult to know what to do exactly because the real usage depends a lot on the pattern that the real users will do, and sometimes users are a big source of non-determinism.
Yes? Most of the load, yes? OK, so the question is, I've said that it runs on one server, and so how is this highly reliable?
Yes, I said that when it is working, it's enough to have one server, but of course we have several servers, we have clusters of servers, and if one server has a hardware problem, for example, we move the system to another hardware server. So, operationally, it's good enough to have,
well, let's say, a medium-sized Linux server, but we can move the application to another place. In the… Yes?
No, we are just measuring… So, the question was,
how do we ensure precise measurements in a multi-process system? So, first, the measurements that I showed were performance unit tests; for those we typically just have a single executable, which is the performance unit test. In the real system, we just say: this process has consumed, sorry,
this task has consumed this much CPU to do this action in that elapsed time. What we want to measure is what is really happening. Let's imagine, for example, that we have somewhere a kind of critical section, and two processes are accessing some data and locking it. It might be that one process has to wait a long time,
and we will see that, because it says: I see that I have spent, let's say, 0.1 seconds elapsed and consumed only one millisecond of CPU. Either there was not enough CPU on the system, it was overloaded, or there is wait time. So we just measure what the kernel gives back via the clock_gettime system calls.
Other questions? No? Thank you. All speakers…