Statistics 101 for System Administrators
Video in TIB AVPortal:
Statistics 101 for System Administrators
Formal Metadata
Title 
Statistics 101 for System Administrators

Title of Series  
Part Number 
101

Number of Parts 
120

Author 
Roberto Polli

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2014

Language 
English

Production Place 
Berlin

Content Metadata
Subject Area  
Abstract 
Roberto Polli
Statistics 101 for System Administrators
Python allows every sysadmin to run (and learn) basic statistics on system data, replacing sed, awk, bc and gnuplot with a unique, reusable and interactive framework. The talk is a case study where Python allowed us to highlight some network performance points in minutes using itertools, scipy and matplotlib. The presentation includes code snippets and a brief plot discussion.
# Statistics 101 for System Administrators
## Agenda
* A latency issue
* Data distribution
* 30 seconds correlation with pearsonr
* Combining data
* Plotting and the power of color
## A use case
Network latency issues
Correlate latency with other events

Keywords  EuroPython Conference EP 2014 EuroPython 2014 
00:15
Welcome to the next session [unclear]
00:20
and it's Roberto Polli, talking about Statistics 101 for system administrators, focusing on Python rather than awk, sed and so on. Enjoy. Hi everybody.
00:45
I am Roberto Polli and I work at Babel, which is a proud sponsor of this talk. I hope that today you will see how to use a few elements of statistics; it's not a statistics course, it's about doing it with Python. Before starting I would also like to apologize for my English: I hope the English-speaking people can forgive me. The whole thing started from a latency issue
01:23
that affected one of our customers, and in a few minutes we were able to understand what was happening and what was not happening. We understood all of those things using correlation and combining data. Then we provided a lot of nice plots that allowed our customer to see what was happening, and that was useful. Everything was done with SciPy and matplotlib. So, the problem.
02:09
The customer's problem was network latency. We had log traces with the message size, the number of peers in the communication, the number of retransmissions and the errors in the network, and the customer was asking us: do we need to scale, or are the issues related to some other condition?
02:40
Well, we found the evidence using Python. But how? Python provides basic statistics like the mean, which we all know, and the standard deviation, which
03:04
is actually an indicator of whether the mean is a good descriptor of our data series. The mean is a good indicator if the standard deviation is low; the mean is not a good indicator if the standard deviation is high. That variable contains the structure of our data: there is a timestamp, a latency indicator in seconds, the number of peers, and other indicators like the message size and the number of retransmissions. You can see that
03:52
a basic description of all of those fields is really just one line, because Python provides max and min, and mean and standard deviation, I believe, in SciPy. Now, the distribution. The second thing that you do is to create a distribution: on the x axis you have got some time slots. For example, this one is the ping round-trip distribution.
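A minimal sketch of the one-line basic description mentioned above. The field names and sample values are invented for illustration, and the standard-library `statistics` module is swapped in here for whatever the talk actually used.

```python
from statistics import mean, pstdev

# Hypothetical sample data: field names and values are illustrative only.
table = {
    "latency": [0.21, 0.25, 0.19, 3.10, 0.22],
    "size": [512, 2048, 1024, 512, 4096],
    "retries": [0, 0, 1, 4, 0],
}


def describe(values):
    """Return (min, max, mean, standard deviation) for one data series."""
    return min(values), max(values), mean(values), pstdev(values)


# The "one line": a basic description of every field in the table.
summary = {name: describe(values) for name, values in table.items()}

for name, (lowest, highest, average, deviation) in summary.items():
    print(f"{name:8} min={lowest} max={highest} mean={average:.2f} std={deviation:.2f}")
```

A standard deviation that is large compared with the mean, as for the invented latency values here, is the hint that the mean alone describes the series poorly.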
04:34
It says we have three pings returned between 158 and 159 milliseconds, four pings returned between 159 and 160, and so on. A faster way to create a distribution with Python is using matplotlib, which is a plotting
05:05
library. When we plot the histogram, for example a histogram of latency (the ping round-trip time is actually a latency) [unclear], we get an output. This is the plot. The output is a triple, and the interesting values in it are the frequencies, that is how many pings returned (three is a frequency, four is a frequency, two is
05:38
a frequency), and the bins. The bins are just the buckets on the x axis: 158 to 159 and so on. To get a distribution, just use zip, which ties the two together.
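A sketch of the histogram-and-zip step, assuming matplotlib's `hist` is the plotting call meant here. The ping values are invented so that the counts match the three, four, two of the example.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical ping round-trip times in milliseconds.
rtt = [158.2, 158.5, 158.9, 159.1, 159.3, 159.6, 159.8, 160.4, 160.7]

# hist() returns a triple: counts per bin, bin edges, and the drawn patches.
frequencies, bins, _patches = plt.hist(rtt, bins=[158, 159, 160, 161])

# zip pairs each bin's left edge with its count; it stops at the shorter
# sequence, so the extra right-most edge is dropped.
distribution = list(zip(bins, frequencies))

for left_edge, count in distribution:
    print(f"{left_edge:.0f}-{left_edge + 1:.0f} ms: {int(count)} pings")
```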
05:59
Now, correlation. We have a good description of our data, but now we want to know whether the series are related. Is there a relation between the number of retries and the latency, or whatever? Here we introduce a new function that is just the difference between one item in the series and the mean of the series. And then a statistician
06:37
comes up with this formula, which looks incomprehensible if you will, but
06:46
if you just think back to your high school, it's actually quite easy. It just checks whether the values of the two datasets move together on the same line. For example, [unclear] if they move together, those differences both start with negative values, so the product is positive; then they move on, and if they reach the mean together they will
07:34
be zero together, and as they keep moving together the product is again positive. If you try it in a Python console with some datasets, you will actually find that this formula is quite reasonable, and it roughly tells you whether the values move together on the same line. But anyway, you must plot.
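The formula itself is not captured in this transcript. Assuming it is the standard Pearson coefficient, built from the difference-from-mean function described above, a plain-Python sketch could look like this (the names `delta` and `pearson` are illustrative):

```python
from math import sqrt
from statistics import mean


def delta(series):
    """Difference between each item and the mean of its series."""
    m = mean(series)
    return [x - m for x in series]


def pearson(xs, ys):
    """Sum of the delta products, normalised by the size of the deltas."""
    dx, dy = delta(xs), delta(ys)
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    return numerator / denominator


grows = [1, 2, 3, 4, 5]
print(pearson(grows, [10, 20, 30, 40, 50]))  # both series grow together
print(pearson(grows, [50, 40, 30, 20, 10]))  # one grows, the other decreases
```

When both series sit on the same side of their means the products are positive and add up; when they sit on opposite sides the products are negative, which is why the sign of the result follows the direction of the relation.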
08:12
These are scatter plots with their Pearson value. On the first line we can see that the first one has a correlation value of one; then, as the data begin to be unrelated, that value goes to zero, and then it starts to grow again, with a negative value, when the relation is not direct but inverse, that is when one dataset grows and the other decreases. But even in the
08:59
cases where we have a zero correlation, we could actually find that those data are related, at least for some part of them. So you should always plot. Now, the probability. SciPy provides a
09:25
correlation function. This function returns two values. The first one is the correlation coefficient that we just described: its value is obviously minus one when one dataset grows and the other decreases, and plus one when both datasets grow together. Then there is another value, the probability. Its definition is quite tricky, but let's say that this value tells us how likely it is that such kind of data is produced by an uncorrelated system. So if the probability is high, the
10:25
system is probably uncorrelated; if the probability is low, those values are unlikely to be produced by an uncorrelated system. So, if you have got a Python shell, you can just try it and check and
10:53
experiment with what you can get. If the values are just like a straight line, you get a correlation of one and a zero probability, that is, it's unlikely that random data can produce a straight line. Taking instead two random datasets, we can see that the correlation (I don't care if it's positive or negative) is low in absolute value, but the probability that those data are unrelated is quite high: it's about 70 percent.
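The two experiments just described can be tried in a Python shell roughly as follows. The data are made up, and the numbers printed for the random series depend on the seed, so they will not match the 70 percent quoted in the talk.

```python
import random

from scipy.stats import pearsonr

x = list(range(100))
line = [2 * v + 1 for v in x]  # points lying on a straight line

random.seed(0)  # fixed seed so that runs are repeatable
noise_a = [random.random() for _ in x]  # two unrelated random series
noise_b = [random.random() for _ in x]

# pearsonr returns the correlation coefficient and the probability (p-value).
r_line, p_line = pearsonr(x, line)
r_noise, p_noise = pearsonr(noise_a, noise_b)

print(f"straight line: r={r_line:.2f} p={p_line:.3g}")
print(f"random data:   r={r_noise:.2f} p={p_noise:.3g}")
```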
11:42
Now, combinations. We go back to our original problem: we have got various datasets and we want to understand
11:51
which of them are related, and whether they are really related. This is when we should do such kind of analysis. The itertools module is a good place to look. Combinations are quite an intuitive concept: they just find every possible way in which I can mix a set of items, without repetition. We use them to combine the whole table, so we combine the latency with the errors,
12:46
the peers with the message size, and so on. This is how we get our results.
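A small sketch of how `itertools.combinations` pairs up the fields of a table. The table layout (one list of samples per field) and its contents are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical table: one list of samples per field.
table = {
    "latency": [0.2, 0.3, 3.1],
    "peers": [10, 12, 40],
    "size": [512, 2048, 512],
    "retries": [0, 0, 4],
}

# Every unordered pair of field names, each pair exactly once.
pairs = list(combinations(table, 2))

for left, right in pairs:
    print(left, "vs", right)
```

With four fields this gives six pairs; the pair latency/peers appears once, and its mirror peers/latency does not.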
12:59
We use combinations to net-fish for all the possible correlation and probability values between one field and another.
13:13
If the correlation is high we print something, or if the probability is low [unclear] we print those values. This is just a starting point, but our customer wanted to know something quickly, so we started by
13:47
concentrating on what was more likely to have a relation with the latency: is the correlation between latency and peers high or not? [unclear] But remember, as I said before, linear correlation is not everything: we should use our eyes, and matplotlib allows us to save the plots. So what do we do?
14:40
We use all the possible combinations of our data and we put on the plots all the possible information: the correlation indicator, the probability indicator, the series that we mix. And then
15:08
that could produce 40 graphs in total, but we can just browse them [unclear], and you can easily find out whether a plot tells you something. This is an example plot.
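A hedged sketch of that loop: every pair of fields gets its correlation and probability computed, the promising pairs are printed, and a labelled scatter plot is saved for each pair. The sample data, the thresholds and the file naming are all assumptions made for the example, not the code shown in the talk.

```python
import os
import tempfile
from itertools import combinations

import matplotlib

matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Hypothetical samples; real data would come from the parsed logs.
table = {
    "latency": [0.2, 0.3, 0.4, 1.9, 3.1, 0.3],
    "peers": [10, 12, 15, 33, 40, 11],
    "size": [512, 2048, 1024, 512, 256, 4096],
}

outdir = tempfile.mkdtemp()  # the plots are saved here
results = {}

for a, b in combinations(table, 2):
    r, p = pearsonr(table[a], table[b])
    results[a, b] = (r, p)

    # Arbitrary illustrative thresholds for "worth a closer look".
    if abs(r) > 0.7 or p < 0.05:
        print(f"{a} vs {b}: r={r:.2f} p={p:.3f}")

    # Save one scatter plot per pair, labelled with both indicators.
    fig, ax = plt.subplots()
    ax.scatter(table[a], table[b])
    ax.set_xlabel(a)
    ax.set_ylabel(b)
    ax.set_title(f"{a} vs {b}: r={r:.2f} p={p:.3f}")
    fig.savefig(os.path.join(outdir, f"{a}_{b}.png"))
    plt.close(fig)

print("plots saved in", outdir)
```

Saving every pair, not only the ones above the thresholds, matches the advice to look at the plots: a low coefficient can still hide a relation that is visible by eye.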
15:40
We have the buffer size and the CPU wait. In this case there is a high correlation indicator and a zero probability, and indeed the data are probably
15:53
related. We can see that when the CPU wait is low the buffer is constant, but when it is high the buffer increases, so there is actually a relation. Is that relation a straight line? Or is the relation something like this: the CPU wait stays at a constant rate with the buffer size, and then, when the CPU wait starts to be at 30 or 40 % for 3 or 4 seconds, the buffer starts to grow? Well, this is another sort of analysis, but for example if you have [unclear] something that behaves in this kind of
16:53
way, you should know this is a good starting point [unclear]. Then,
17:03
[unclear] in the previous graph [unclear], and here we have lost the time. So actually I don't know if the right side is the starting point and the left side is the end point (for example because after this [unclear] flushing), or if the left part is the starting point and the right the end point. [unclear] I can't understand what's happening there [unclear]
18:07
[unclear] colors. [unclear] Again, it's a simple case: I just sliced the day into morning, afternoon and night, the morning in red, the afternoon in green and the night in blue. I just split those two datasets into three chunks, and then I plotted the
18:47
first one, the morning, in red [unclear]. I could even add the Pearson coefficient and the probability of the data on each single chunk [unclear] and so on, but we were
19:13
in a hurry, so I was fast. So this is one simple plot. There we have latency on the x axis and throughput on the y axis, and the color denotes the time of the day. We can see clearly that, if we look at the high latency, above 3 seconds, [unclear].
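A sketch of the coloring idea: the same scatter plot, with each point colored by the part of the day it belongs to. The samples, the hour boundaries of the three slots and the axis labels are assumptions made for the example.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical samples: (hour of day, latency in seconds, throughput).
samples = [
    (2, 0.3, 120), (9, 0.5, 300), (10, 3.2, 80),
    (15, 0.4, 280), (17, 0.8, 260), (23, 0.2, 100),
]


def color_for(hour):
    """Morning red, afternoon green, night blue (slot limits are assumed)."""
    if 6 <= hour < 12:
        return "red"
    if 12 <= hour < 20:
        return "green"
    return "blue"


colors = [color_for(hour) for hour, _, _ in samples]

fig, ax = plt.subplots()
ax.scatter(
    [latency for _, latency, _ in samples],
    [throughput for _, _, throughput in samples],
    c=colors,  # one color per point brings the lost time dimension back
)
ax.set_xlabel("latency (s)")
ax.set_ylabel("throughput")
```

Coloring by time slot adds a third dimension to a two-dimensional plot, so clusters that belong to one part of the day stand out without a separate chart.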
19:53
OK, the high latency [unclear]. Moreover, [unclear] is not really an indicator of the speed of the system, because we can see that, if we look just at the first time slot, between 0 and
20:23
1 second, there is actually an influence of throughput on latency, but this influence ends after one and a half seconds. [unclear] this line could be the throughput of the system [unclear]. Moreover, looking at the plot, all the red points have a high throughput and lie in the same part of the day. So if,
21:11
for example, we check [unclear] and there is a problem there, [unclear] those points belong to a precise part of the day that we have to check. Another correlation: this is another scatter plot, the size of the packets against the retries. We can see that there is no relation:
21:49
the latency problem was not related to the size of the packets, as we have seen. Moreover, the highest sizes correspond to
22:05
a lower number of retries, so large packet sizes are not a problem, but the retries are concentrated in one part of the day.
22:25
So we can check whether in that part of the day there could have been some problem on the network, for example, or whether the problem originates from a part of our clients. And all those plots were produced in 20 seconds [unclear].
22:49
So once you have the data, you pass it [unclear], you make the plots, and they tell you almost everything. So, yeah,
23:03
again: latency wasn't related to packet size [unclear]; we have understood the system throughput using the straight line in the plots [unclear], so it was the
23:28
hardest part of the problem. [unclear] So: plot, and then continue to collect results.
23:54
OK, [unclear] I hope you enjoyed it. I don't know if you have questions. We have got some time for questions. Questions? [unclear] the microphone is working. What I didn't understand is why you are using combinations. Can you give, like, three examples of what pairs of combinations you were trying? [unclear]
24:41
[unclear] the first thing to do was to combine all the data:
24:58
[unclear] using combinations will return latency and peers, peers and [unclear]. Maybe the screen is [unclear].
25:54
So, OK, let's imagine that instead of A, B, C I have got retries, latency, time, throughput. Combinations let me mix every possible data series with another one. [unclear] A is latency, B is throughput, C [unclear]. I have a look at all the possible combinations of the data, and then I can compute the correlation values or make plots for every possible pair, and
27:34
[unclear] was the number of the computers in the network. Another question? Yeah, a quick question: with these permutations you get a lot of different combinations. Did you have any problem with spurious correlations with high significance levels? Could you [unclear] false correlations? Not really, actually
28:18
[unclear] false correlations, because it's just one number [unclear]. The Pearson indicator just tells you whether, plotting this, you get something like a straight line. For this reason I always say that you should plot; you should study and learn how the system works [unclear]. I needed something at the very start [unclear] and then you start to refine. Thank you very much.