Statistics 101 for System Administrators
Formal Metadata
Part Number | 101
Number of Parts | 119
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/20026 (DOI)
Production Place | Berlin
EuroPython 2014
Transcript: English (auto-generated)
00:15
Welcome to the next session. It is now, I think, 10 o'clock, yes?
00:20
Next is Roberto Poli, talking about the 101 of systems administration, focusing for sure on Python and not on AWK, from what I heard. So, yeah, enjoy.
00:43
Hi everybody, I am Roberto Poli. I work at Babel, which is the proud sponsor of this talk and of my hotel bill. Today we will see how to learn and use some elements of statistics with Python.
01:00
This is not a statistics course. Before starting, I would also like to apologize for my English; I hope the English speakers can forgive me. Going on, we will look at a latency issue
01:23
that affected one of our customers, and how in a very few minutes we were able to understand what was happening and what was not happening.
01:40
We understood all of this by using correlation and by combining data. Then we produced a lot of nice plots, which also allowed our customer to say that what was happening wasn't his fault.
02:01
Everything was done with SciPy and Matplotlib. The customer's problem was episodic network latency issues. We had log traces with message sizes,
02:21
the number of peers in the communication, and the number of retransmissions and errors on the network. The customer asked us: do we need to scale? Are those latency issues related to some peak condition? We found a rapid answer using Python.
02:48
Why Python? Because Python provides basic statistics such as the mean, which we will denote with a bar over the x, and the standard deviation,
03:04
which is actually an indicator of how well the mean describes our data series. If the mean is a good descriptor, the standard deviation is low.
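A minimal sketch of the idea just described, using only the standard library `statistics` module rather than SciPy. The two latency series are made-up numbers chosen to share the same mean; the helper name `describe` is hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical latency samples in seconds (not the customer's data).
steady = [1.0, 1.1, 0.9, 1.0, 1.0]
spiky = [0.1, 0.1, 0.1, 0.1, 4.6]


def describe(series):
    """Return (min, max, mean, population standard deviation) of a series."""
    return min(series), max(series), mean(series), pstdev(series)


steady_stats = describe(steady)
spiky_stats = describe(spiky)

# Both series have a mean of about 1 second, but only for the steady one
# is the mean a good descriptor: its standard deviation is much lower.
print("steady:", steady_stats)
print("spiky: ", spiky_stats)
```

The two means are the same, so only the standard deviation reveals that the second series is dominated by a single slow sample.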
03:25
If the mean is not a good descriptor, the standard deviation is high. The T variable contains an extract of our data. There is a timestamp, a latency indicator in seconds,
03:43
the number of peers, and other indicators such as the message size and the number of retransmissions. You can see that getting a basic description of all those fields is really easy, it's just one line,
04:01
because Python provides max and min, and mean and standard deviation are built into SciPy. Now, data distribution. The second thing you do is to create a distribution: on the x-axis
04:24
you have got some time slots, for example. This one is a ping round-trip distribution. It says we have three pings returning between 158 and 159 milliseconds.
04:45
Four pings returned between 159 and 160, and so on. The fastest way to create a distribution with Python is to use matplotlib,
05:02
which is a plotting library. When we plot a histogram, for example a histogram of latencies (ping round-trip time is actually a network latency), we get two outputs. One output is the plot, the other output is a triple.
05:25
The interesting values this time are the frequencies, that is, how many pings returned (three is a frequency, four is a frequency, two is a frequency), and the bins.
05:40
The bins, or buckets, are the intervals on the x-axis: the 158 to 159 bin and so on. To get the distribution, just use zip, which ties together two lists or iterables.
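The talk builds the distribution from the values returned by matplotlib's histogram call. As a sketch that runs without any plotting library, the hypothetical `histogram` helper below mimics the two interesting outputs (frequencies and bin edges) with the standard library, and then ties them together with `zip`. The ping times are invented to reproduce the 3, 4, 2 frequencies mentioned above.

```python
from collections import Counter
import math

# Hypothetical ping round-trip times in milliseconds.
rtt_ms = [158.2, 158.7, 158.9, 159.1, 159.3, 159.5, 159.8, 160.4, 160.6]


def histogram(values, bin_width=1.0):
    """Count values per bin and return (frequencies, bin_edges)."""
    start = math.floor(min(values) / bin_width)
    stop = math.floor(max(values) / bin_width)
    counts = Counter(math.floor(v / bin_width) for v in values)
    frequencies = [counts[i] for i in range(start, stop + 1)]
    # One more edge than bins, as histogram functions usually return.
    bin_edges = [i * bin_width for i in range(start, stop + 2)]
    return frequencies, bin_edges


frequencies, bin_edges = histogram(rtt_ms)

# zip pairs each bin's left edge with its frequency; it stops at the
# shorter list, so the extra right-most edge is simply dropped.
distribution = list(zip(bin_edges, frequencies))
print(distribution)
```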
06:04
Now, correlation. We have got a description of our data. But now we ask: are two data series related? Is there a relation between the number of retries and the latency, or whatever?
06:23
If we use delta x for the difference between an item in the series and the mean, Mr. Pearson, who was a statistician, answered with this formula. It seems complicated if your high school days are far away,
06:50
but if you just think back to high school, it's actually quite easy. It just checks whether the values of the x
07:03
and the y series move together on the same line. If, for example, both x and y move together,
07:20
those differences start with negative values, so the product is positive. As they move on, if they reach the mean together, they will be zero together. And if they keep moving together, the product will still be positive.
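The reasoning about the products of the differences can be tried directly. This is a hand-written sketch of the Pearson coefficient (the sum of the delta products divided by the product of the delta magnitudes), not the SciPy implementation; the sample series are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Delta values: distance of every item from the mean of its series.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # When x and y move together, every product is positive (or zero).
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


x = [1, 2, 3, 4, 5]
together = pearson(x, [10, 20, 30, 40, 50])  # both series grow
opposite = pearson(x, [50, 40, 30, 20, 10])  # one grows, one decreases
print(together, opposite)
```

Series that move together on a line give a value close to plus one, and series that move in opposite directions give a value close to minus one.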
07:43
So if you try in your IPython console with some data sets, you will actually find that this formula is quite reasonable. So, rho tells us whether the values move together
08:03
on the same line. But anyway, you must plot. These are various scatter plots with their Pearson values. On the first line, we can see that
08:21
we have a correlation value of one, and then, as the data begin to be unrelated, that value goes to zero, and then it becomes negative
08:42
when the relation is not direct but inverse, that is, when one data set grows and the other decreases. But there are even non-linear cases where we have a zero correlation value.
09:04
But we can actually see that those data are related, or that there are some patterns in the data. So you should always plot. Probably the indicator.
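A small sketch of why plotting matters: the two made-up series below are perfectly related (y is the square of x), yet their linear correlation is zero. The `pearson` helper is a hand-written version of the formula, not a library call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))


# A parabola: y depends entirely on x, but not along a straight line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson(xs, ys)
print(r)  # no linear correlation, although the pattern is obvious in a plot
```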
09:21
SciPy provides a correlation function. This function returns two values. The first one is the correlation coefficient that we just described. Its values range between minus one,
09:42
as said before, when one data set grows and the other decreases, and plus one, when both data sets grow together. There is one other value, the probability indicator,
10:01
whose definition is quite tricky. But let's say that this value tells us how likely it is that such data sets are produced by uncorrelated systems.
10:21
So if the probability is high, the systems are probably not correlated. If the probability is low, then those values are unlikely to have been produced by uncorrelated systems.
10:44
So if you have got a Python shell, you can just try it and see what you get. The A and B values lie on a straight line,
11:01
and they have a correlation of one and a probability of zero. That is, it's unlikely that random data can produce a straight line. Taking two random data sets instead, we can see that the correlation is low.
11:22
I don't care whether it's positive or negative, but its absolute value is low. And the probability that those data are unrelated is quite high: it is about 70%.
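In SciPy the two values come from `scipy.stats.pearsonr`, which returns the coefficient and a p-value. To show what the probability indicator means without depending on SciPy, the sketch below swaps in a different technique, a permutation test: it shuffles one series many times and counts how often unrelated data reach the observed correlation. The function names, the number of rounds, the seeds and the data are all assumptions made for illustration, and the numbers will not match SciPy's analytic p-value exactly.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))


def permutation_probability(xs, ys, rounds=2000, seed=0):
    """Fraction of shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)  # break any real relation between the series
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return hits / rounds


# A straight line: correlation one, probability close to zero.
a = list(range(20))
b = [2 * v + 1 for v in a]
line_r, line_p = pearson(a, b), permutation_probability(a, b)

# Two independent random series: expect a low |r|; the probability
# varies with the sample and is often far from zero.
rng = random.Random(42)
noise_x = [rng.random() for _ in range(20)]
noise_y = [rng.random() for _ in range(20)]
noise_r = pearson(noise_x, noise_y)
noise_p = permutation_probability(noise_x, noise_y)

print("line:  r=%.3f p=%.3f" % (line_r, line_p))
print("noise: r=%.3f p=%.3f" % (noise_r, noise_p))
```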
11:41
Now, combinations: back to our original problem. We have got various data sets. We want to understand whether any of them are related, and which. When we have to do this kind of analysis,
12:05
the itertools module is a good place to look. Combinations are quite an intuitive concept.
12:20
They just find every possible way in which I can pick items from a set without repetitions. We use them to combine all table keys. So we will combine the latency with the errors,
12:44
the errors with the message size, and so on. And now, this is how we get our results. Simply, we use combinations to fish
13:03
for all possible correlation and probability values between all our data series. If the correlation is over a given threshold, we print something.
13:20
Or, if the probability is lower than a threshold, again, we print those values. This is just a starting point, but it lets us focus. Our customer wanted to know something quickly.
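The fishing loop described here can be sketched as follows. The table, its numbers and the threshold are hypothetical, and only the correlation threshold is checked; a probability threshold could be tested in the same loop. `itertools.combinations` yields every pair of keys exactly once.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical table: one data series per key (made-up numbers).
table = {
    "latency": [1, 2, 3, 4, 5, 6],
    "errors": [2, 4, 6, 8, 10, 12],
    "size": [5, 1, 4, 2, 6, 3],
}
THRESHOLD = 0.8  # assumed cut-off for "worth a closer look"

related = []
# Every pair of keys, without repetitions: 3 keys give 3 pairs.
for key_a, key_b in combinations(table, 2):
    r = pearson(table[key_a], table[key_b])
    if abs(r) > THRESHOLD:
        related.append((key_a, key_b, round(r, 3)))
        print("%s vs %s: r=%.3f" % (key_a, key_b, r))
```

With these numbers only the latency and errors pair passes the threshold, which is the kind of hint that tells you where to look first.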
13:44
We started by concentrating on what was more likely to be related to the latency. So, is the correlation between latency and errors high or not?
14:06
This is clear, I think? If you're acquainted with Python, it is. But, well, remember the slide before: linear correlation is not everything.
14:22
We should use our eyes. And actually, Matplotlib allows us to save the plots. So what will we do? We will save all the possible combinations
14:45
of our data sets, putting all the possible information on the plots: the correlation indicator, the probability indicator,
15:02
and the data series we mixed. And then we save. That could produce 30 or 40 graphs, but we can just look through them with whatever image viewer you like.
15:25
And, well, at that point you can easily check whether a plot tells you something. This is an example plot with the buffer size
15:41
and the CPU wait. Hey, there is a high correlation indicator and a zero probability indicator. Those data are probably related. We can see that when the CPU wait is low, the buffer is constant.
16:02
But when there is I/O, the buffer increases. So there is surely a relation, whether that relation is a straight line or something like this: the buffer stays at a constant level while the CPU wait moves,
16:25
and then, when the CPU wait stays at 40% for three or four seconds, the buffer starts to grow.
16:43
Well, this is a first step of the analysis. But if you're searching for something, this kind of plot is a good starting point for the investigation.
17:01
Then, what the previous graph lacked was colors and a time indicator. We did not plot time. So we actually don't know whether the right side
17:21
is the starting point and the left side is the end point (for example, because after the CPU work we flush the buffer), or whether the left part is the starting point and the right the end point,
17:41
and I stopped gathering data while the buffer was still working. Using colors, I can understand better what's happening. And, well, again, itertools.
18:02
cycle makes an endless iterable. So colors.next() returns R, G, B, and then again R, G, B. It's a simple case:
18:20
I just track morning, afternoon, and night: morning in red, afternoon in green, and night in blue. I just use list comprehension syntax to split the data sets into three chunks.
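A sketch of the coloring trick, without the plotting calls. The samples are invented, and the three chunks are simply three consecutive thirds of the series. Note that `colors.next()` is Python 2 syntax; in Python 3 the same step is written `next(colors)`.

```python
from itertools import cycle

# Hypothetical samples ordered by time: (hour, latency in seconds).
samples = [(hour, 0.5 + 0.1 * (hour % 3)) for hour in range(24)]

# Split the series into three equal chunks with a list comprehension.
chunk = len(samples) // 3
chunks = [samples[i * chunk:(i + 1) * chunk] for i in range(3)]

# cycle never runs out: r, g, b, r, g, b, ...
colors = cycle("rgb")

# Pair every chunk with the next color; each pair would become one
# scatter call with that color when plotting.
colored = [(next(colors), part) for part in chunks]
for color, part in colored:
    print(color, "hours", part[0][0], "to", part[-1][0])
```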
18:45
And then I plot the first one, the morning, labeling it with red. I could even add the Pearson and probability indicators
19:01
for each single chunk. And then, yes, always set the title, set up the plot, and so on. Boom, we are getting to the end. I was fast. So this is one simple plot with latency on the x-axis
19:24
and throughput on the y-axis. The color denotes the time of day. We can see clearly that if we look at the higher latencies,
19:42
above three seconds, it's not a matter of throughput or size, okay? The higher latencies match the lower throughput.
20:02
Moreover, with this plot we even have an indicator of the capacity of the system, of the speed of the system, because if we focus just on the first time slot, between zero and one second,
20:26
we can see that there is actually an influence of throughput on latency. But this influence ends after one and a half seconds.
20:42
And that line could be a sort of throughput limit, the speed of the system. We can see, moreover, that all the red points with a higher throughput
21:06
are in the same part of the day. So if, for example, we find that those data are wrong, or that there is a problem with that kind of data,
21:22
the plot points us to a precise part of the day to check. Another correlation: this is another scatter plot,
21:41
with the size of the packets and the retries. We can see that there is no relation. The latency problem was not related to the size of the packets. We can see, moreover, that a larger size
22:03
corresponds to a lower number of retries. So when the packet size is large, there are no problems. But the retry problems are concentrated in the green part of the day.
22:25
So we can check whether, in that part of the day, there could have been some problem on the network, for example, or some problem related to some of our clients.
22:43
And all those plots were produced in 30 seconds by the clock. So once you have the data, just paste those snippets, get your 40 plots, and they'll tell you almost everything.
23:01
So, yeah, again: latency wasn't related to packet size or system throughput. Errors were not related to packet size. We even discovered the system throughput, using the straight line capping the plots.
23:20
All this in 30 minutes. The rest of the time was just parsing the logs, which was the hardest part of the problem. Wrap up. Use statistics. It's easy, but don't abuse it.
23:46
Use plots. Plots, plots, and then, yes, continue to collect results. Okay, 24 minutes. That's all, folks. I hope you enjoyed it and that the snippets will be useful.
24:08
I don't know if there are questions, but that's it. We have some time for questions. Any questions? Please go to the microphones.
24:22
Okay. Oh, okay, there is one. I didn't understand why you are using combinations. Can you give three examples of the pairs of combinations you are trying, and why do you have to randomize that? Okay, as I didn't know
24:43
how the system worked, the first thing to do was to combine all those data series.
25:00
Using combinations will return latency and peers, latency and errors, peers and errors. Maybe the screen is too small.
26:10
Let's imagine that instead of A, B, C, I've got retries, latency, time.
26:37
Okay? So combinations lets me pair every possible value,
26:48
every possible data set, with another one. Maybe this is simpler.
27:02
Okay. Imagine A is latency, B is throughput, C is retries. I've got all the possible combinations of the data, and I can evaluate the correlation values
27:25
or plot every possible... What is?
27:47
Peers was the number of computers in the network. There's another question. Yeah, I have one quick question. I see that with these permutations you get a lot of different combinations.
28:00
Did you have any problems with spurious correlations with high significance levels? Could you detect them as false correlations, not really related, or didn't you have this problem? Actually, there is no false correlation,
28:22
because it's just a number, if I understood the question. The Pearson indicator is just applying a formula that tells you whether, plotting those data, you get something like a straight line.
28:43
So for this reason I always say you should plot, because obviously you should then learn how the system works. Those are indicators, because I made 40 plots.
29:04
I didn't know the system. So I needed something; I needed the haystack, to steal the words of Costanza. So I got the haystack.
29:21
Then, in the haystack, I started to look for the needle. Thanks. Yes, thank you very much, Roberto. It was a great talk. Thank you very much.