
A New Kind of Analytics


Formal Metadata

Title
A New Kind of Analytics
Subtitle
Actionable Performance Analysis
Part Number
79
Number of Parts
94
Author
Paola Moreto
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Language
English

Content Metadata

Abstract
Applications today are spidery and include thousands of possible optimization points. No matter how deep the performance testing data are, developers are still at a loss when asked to derive meaningful, actionable data that pinpoint bottlenecks in the application. You know things are slow, but you are left with the challenge of figuring out where to optimize. This presentation describes a new kind of analytics, called performance analytics, that provides tangible ways to root-cause performance problems in today's applications and clearly identify where and what to optimize.
Transcript: English (auto-generated)
Hello, everybody. I'm Paola Moreto. I'm the co-founder of a company called Nouvola. You can find me on Twitter at @paolamoreto3.
And so a little bit about me. I'm a developer turned entrepreneur. I've been in the high tech industry for a long time. I love solving hard technical problems. I come originally from Italy, but I've been in the US 20 years.
And if you don't find me writing code, I'm usually outdoors hiking. So this is about performance. And we heard it loud and clear here at RailsConf that faster is better. So we all know what performance is,
but it's good to understand really the impact of low performance. And when I talk about performance here, I really mean speed and responsiveness. The speed and responsiveness that your application delivers to your users. So there is a famous quote from Larry Page that says,
speed is product feature number one. So you really need to focus not only on your functional requirements, but also on the non-functional requirements, and speed is paramount for any web application today.
And there is a lot of research and data that backs this and shows what is the impact of low performance. So it impacts visibility, definitely affects your SEO ranking. It impacts your conversion rates. It impacts your brand and the perception that people have of your brand, your brand loyalty,
your brand advocacy. It impacts your cost and resources, because the tendency with low performance is usually to over-provision, and that's usually not the right answer. So speed today for web applications is paramount.
And then if you have a DevOps model, so if you move to a fully combined engineering model where development, QA, and sysadmin or ops are combined, and you have adopted continuous delivery and agile methodology, which is the standard today for web development, then it becomes even more critical. So performance today in the cloud, where you have a fully programmable and elastic infrastructure and you're adopting continuous delivery, becomes even more critical.
You need to be able to bless every build and make sure that not only does it work, but it works at the right speed. So then what? What do you do? How do we tackle this problem?
Well, the first thing is you need data. So this is a quote that I actually stole with pride from a talk yesterday, and I love it. In God we trust; everybody else, bring data.
So is this a good model, where you deploy and then hope for the best, and your customers or your users are essentially your QA department? It's not. I know of a company with an e-commerce application, and they say, oh, we know when we have a slowdown
because our users complain on Facebook. Well, that's not usually the best way to do it. So you need data, and you need a lot of data. So let's get started. So there are different types of data. So basically, on the right-hand side,
you have your deployments, your production, where you deploy. You have your live traffic, and that usually goes under the big umbrella of monitoring. So there you have all sorts of monitoring data and techniques. And then on the left-hand side, that's
your testing environment. Usually people have a pre-production environment or a staging environment. Sometimes you can also test on production. There you have your synthetic traffic. So you're simulating, you're creating your users, and you're doing performance testing.
So these are the two most typical sources of data today. So let's start with monitoring. You have many different types of monitoring. You monitor your stack. You monitor your infrastructure. You do some sort of log aggregation.
You monitor what the users are doing with your application, what the user behavior is, what the most typical user behaviors are, and what the corner cases are. And then you have what is called today streaming analytics or high-frequency metrics, where there are solutions that pump data out
of the platform at speed. And these are some of the examples of the solutions that exist today. We're not associated in any way with any of those, but it's just to give you an idea of the wide spectrum of monitoring and data instrumentation solutions that you can find.
All of these complement each other. There is not one piece that fits all, and it all depends on your application. And there is an interesting problem: today you get all of these nice dashboards, and how do you correlate all of this data and figure out exactly what's happening? But the first step is definitely monitoring.
As they say, you first instrument and then ask questions. However, monitoring is not enough. And why? Well, first of all, your live traffic is noisy. In your live traffic, you have all sorts of users doing all sorts of things. It's very hard to troubleshoot. If you have a scenario you're interested in, and it's perhaps problematic, at the same time you have other users doing other things, and as such the system responds in unexpected ways. The other problem with monitoring is that it's after the fact.
So monitoring doesn't help you predict, and it doesn't help you prevent problems that might occur with your application. So like a friend of mine says, monitoring is like calling AAA after the accident.
It's useful, but usually you want to prevent the accident instead. So that being said, monitoring is the first line of defense, the first thing you gotta do. So then what are you going to do? Then we're gonna pair up performance testing with monitoring. So the two complement each other really, really well.
And here's why. So we're gonna look at the left-hand side of our data sets and data sources. Here we're gonna look at synthetic traffic, so it's not your live traffic. You have the ability to create your traffic. And you're gonna do some performance testing. And it could be on a pre-production environment,
on a staging environment. Usually you don't want to mix your synthetic traffic with your live traffic, and you don't want the synthetic traffic to have an impact on your real users. So that's why you test on pre-production. But you could also test on production for specific applications or specific times of the day,
et cetera. So with performance testing, basically the users are not real but the traffic is absolutely real. You have total control over the amount of traffic and the user scenarios, the workflows because that's how you have designed your tests.
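As a toy illustration of that control, a hand-rolled linear ramp might look like the sketch below. This is not the tooling from the talk; the URLs, the workflow, and the ramp parameters are placeholder assumptions, and a real test would use proper load-testing tooling.

# Minimal, hand-rolled load-ramp sketch (illustrative only, not the talk's tooling).
# Virtual users start on a linear ramp and each one replays a fixed workflow of requests.
import threading
import time
import urllib.request

WORKFLOW = ["https://staging.example.com/", "https://staging.example.com/search?q=shoes"]  # placeholder URLs
results = []            # (elapsed_seconds, succeeded) per request
lock = threading.Lock()

def virtual_user():
    for url in WORKFLOW:
        start = time.time()
        try:
            urllib.request.urlopen(url, timeout=10).read()
            ok = True
        except Exception:
            ok = False
        with lock:
            results.append((time.time() - start, ok))

def linear_ramp(max_users=10, ramp_seconds=10):
    # Start virtual users at an even spacing so the load grows linearly.
    threads = []
    for _ in range(max_users):
        t = threading.Thread(target=virtual_user)
        t.start()
        threads.append(t)
        time.sleep(ramp_seconds / max_users)
    for t in threads:
        t.join()

if __name__ == "__main__":
    linear_ramp()
    print(f"{len(results)} requests recorded")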
So troubleshooting is simplified here because you have an easy way to reproduce specific scenarios that you thought were problematic. And number two, in terms of peeling the onion which is a typical troubleshooting approach, you have already controlled two variables,
the amount of traffic and what the users are doing. And then the other advantage of performance testing is that you get end-to-end user metrics. So you're measuring exactly what your users are experiencing. This is not about server metrics or database metrics
or application metrics or Ruby metrics. It's the true end-to-end. So we've seen some numbers where there was a factor of seven between the end-to-end user metrics under traffic and the server metrics. So the server appeared not to be suffering, but the users did not get good performance at all. So in order to have a good, complete view, you really need the end-to-end user metrics. And the other advantage is that you can test and create realistic scenarios, as close as possible to what your users are going to do, and the goal here is to figure out problems in advance, before they happen. So again, one of the problems with monitoring is that it's after the fact. Here we come before monitoring. We are doing things before they happen, so that you have time to optimize.
And you can't optimize unless and until you measure. So you want realistic scenarios. If you have mobile applications paired up with your web applications, then it's absolutely critical that you test your mobile traffic as well.
If you are around the world, if it's a global application, you need to test from different geos. And then you measure end to end, the KPIs for the end-to-end user experience. The type of metrics, this is all around time. Time goes by a variety of names: response time, or some people call it latency, but essentially it's time to complete transactions, time to complete specific requests, averages, distributions. You can get throughput, the number of successful requests per test or per specific time interval, and you can also get error rates. If you see some suffering on the server side, you can start seeing errors. And then again, the goal is to resolve issues before you deploy.
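As a rough illustration, the end-to-end metrics just listed, response time, throughput, and error rate, could be summarized from recorded requests like the sketch below; the field names and the p95 choice are my own assumptions, not the talk's exact reporting.

# Sketch: summarizing end-to-end metrics from recorded (latency_seconds, succeeded) pairs.
from statistics import mean

def summarize(requests, test_duration_s):
    latencies = sorted(lat for lat, ok in requests if ok)
    errors = sum(1 for _, ok in requests if not ok)
    return {
        "avg_response_time_s": mean(latencies) if latencies else None,
        "p95_response_time_s": latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))] if latencies else None,
        "throughput_rps": len(latencies) / test_duration_s,   # successful requests per second
        "error_rate": errors / len(requests) if requests else 0.0,
    }

# Example: four requests recorded over a 10-second window.
print(summarize([(0.4, True), (0.5, True), (0.45, True), (2.5, False)], test_duration_s=10))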
And then, when to test. Software changes all the time, and as such, it's important to understand whether a specific change is going to impact how your users interact with your platform. And it's not just important that the software does what it's expected to do, but that it does it at the right speed. The other point here is that even if you don't change anything, things change around you. Applications today are spidery, they have hundreds of possible optimization points,
they pull in plugins, you're sitting on a cloud infrastructure. So this is a complex problem, and the only way around it is to test often. So test for every change. Test if you're going into a peak of traffic; you don't want to go into that blind.
Test if you have any type of infrastructure change or changes to your deployment. There is a very good example where at some point, several years ago, Heroku changed something in their routing system, and that change was not publicized, or at least it was not openly publicized, and it only impacted a specific set of applications, but it impacted them greatly, and people realized it because they started taking measurements and they saw a big difference. So the applications did not change, but in that specific example, the cloud provider made a big change,
and the only way to identify this kind of thing is to measure. But guess what? This is still not enough. And why? Well, you can get results like this, where you say, wow, I have a lot of errors. Under traffic, I apply a linear ramp, that's the green bars, and I get a ton of errors. My response time increases dramatically. Then at some point it decreases, because the server doesn't even respond to requests anymore.
Or you could get things like, well, my tests are telling me that if I have 10,000 concurrent users, my response time deteriorates from 400 milliseconds to 2.5 seconds. So, okay, your tests are telling you that your system is slow or will be slow under specific traffic and scenarios, but it's still not actionable. You still don't know what to do. You just know that you're gonna have a problem. It's almost like I'm gonna tell you, well, when you have 50,000 users on your platform,
you're gonna have a fever, but there is no medicine. So what if we can extract some more information from this data and find a medicine? So stay with me. So if you look at the typical
performance troubleshooting process and where people spend time, ironically, the majority of the time is spent, number one, in reproducing the issue with the right data, and number two, in isolating the issue. And then once you have done that,
the actual fixing of the problem is relatively straightforward. On reproducing, I have a very good example here. There is a company that I know, and their client was a big bank in India, and they had performance problems with their applications. Between the time differences and the engineers being on two different continents, it took two weeks, with a whole team in a room and constant conference calls, before they were able to just reproduce the problem and have the data. So reproducing is partially addressed by performance testing, but then you're left with the issue of isolating the problem, and isolating a problem usually takes a lot of time
and a lot of effort, and developers are left doing a lot of correlations with data, and it turns out to be a manual and highly time-consuming process. But once we're done with isolating, the fixing becomes relatively straightforward. So what we want is actually the ability to go from,
if you go to the left, that's before testing, when you're oblivious. You don't even know that you're gonna have a problem. Then once you test, you're like, yeah, we're gonna have a problem. I found out that I will have a fever at 50,000 users. And then we want some help in localizing the bottlenecks, because we know that localizing is gonna take a long time, and then after that we can fix, and that leads to happiness. So then we're gonna add the third step here. So we talked about monitoring
and all the data instrumentation you can use to extract data from your application with your live traffic. We talked about performance testing and how you can use synthetic testing, creating the traffic you want, to see how the application responds. And now we're gonna extract another layer of information from our data to help us localize the problem. So how? What we want are leading indicators of performance issues. So again, we don't want after the fact; you want to figure out this problem beforehand, so that you have the time to fix and to optimize and deliver the performance you want. And we have found that if we localize, if we are able to pinpoint where in these spidery applications the problem resides, then we can accelerate the troubleshooting process, which is otherwise quite painful. And we want actionable data.
So in order to do that, we're gonna add something else here. What you have in the middle is our monitoring, where you have your live traffic, all the monitoring data, and your data instrumentation, and we already talked about how it pairs up really well with performance testing, so the two go together. And now we're adding another layer: some data mining and machine learning to extract another layer of information from this data and help us localize. So this is how we do it. This is an example of the prototype that we built.
You apply a linear ramp of traffic, so that's the synthetic testing. At the same time, you use the data instrumentation that is usually used for your live traffic, but in this case we're gonna use it over your synthetic traffic, so it could be on your test environment, and then we mix it all up together. If there is historical data for that application and that test, we use that too. And then there is data analysis that basically makes an attempt at clustering and identifying statistically meaningful variations in all of these timings, and whether these statistically meaningful variations are clustered around a specific component of the application. So this is essentially how it works. First you run a test, a performance test.
If your response time is good and you don't have any slowdown, then there is no problem at all. But if you have a slowdown, so we go back to the example where you had all of these reds, slowdowns and errors, then you're left with the problem of figuring out how to fix it. So the first thing we are doing is removing what we call network and external effects: we want to see if there is any correlation with data such as network time, DNS time, SSL time, and other data that are kind of external to our stack, and if we don't find any correlations with those, then those are excluded from the data analysis. Then, assuming that there is no correlation there, we look into the data set, and the data analysis identifies statistically meaningful differences
using clustering and longitudinal analysis, identifies whether these variations cluster around a specific sector, and then the results are displayed. So I think we already covered it. The whole point is that, out of the thousands and thousands of available metrics, we look at variations in real time and we attempt to cluster them across specific, what we call sectors, which are components of the application. This is all using specific data analysis techniques.
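As a rough sketch of that "exclude external effects" step: the talk does not name a specific statistic, so plain Pearson correlation is assumed here, and the metric names, values, and threshold are illustrative.

# Sketch of the "exclude external effects" step: check whether the end-to-end
# slowdown correlates with external timings (network, DNS, SSL). Metrics with
# low correlation are dropped from the later analysis. Illustrative only.
import numpy as np

def correlated_externals(response_times, externals, threshold=0.7):
    # Return the external metrics that track the end-to-end slowdown.
    suspects = {}
    for name, series in externals.items():
        r = np.corrcoef(response_times, series)[0, 1]   # Pearson correlation
        if abs(r) >= threshold:
            suspects[name] = round(float(r), 2)
    return suspects

response_time = [0.40, 0.65, 1.10, 2.30]                 # deteriorating under the ramp
externals = {
    "network_time": [0.050, 0.052, 0.049, 0.051],
    "dns_time":     [0.012, 0.010, 0.011, 0.010],
    "ssl_time":     [0.020, 0.021, 0.019, 0.022],
}
# An empty result means the slowdown is not explained by external factors,
# so these series are excluded and the analysis moves on to the app's own sectors.
print(correlated_externals(response_time, externals))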
So what we use is a mix of techniques. It's not only one. They all go under the umbrella of machine learning, or unsupervised machine learning, or data mining. Again, it's not just one technique, but we definitely use a lot of clustering and longitudinal analysis. So, ready to see some real data and a real-life example? I'll give you a couple of examples.
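As a stand-in for the clustering and longitudinal analysis itself, the core idea, flagging sectors whose timings drift away from their own baseline as the ramp progresses, can be sketched like this; the baseline window, the z-score threshold, and the sector and metric names are illustrative assumptions, not the actual algorithm.

# Rough stand-in for the clustering / longitudinal analysis described above.
# For each instrumented timing (keyed by sector), compare its behavior under
# load with its own low-traffic baseline and flag sectors that drift by more
# than a few baseline standard deviations.
import numpy as np
from collections import defaultdict

def flag_sectors(samples, baseline_window=10, z_threshold=3.0):
    # samples: {(sector, metric): [timing at each ramp step]} -> {sector: [drifting metrics]}
    flagged = defaultdict(list)
    for (sector, metric), series in samples.items():
        series = np.asarray(series, dtype=float)
        if len(series) <= baseline_window:
            continue
        base = series[:baseline_window]                  # low-traffic start of the ramp
        mu, sigma = base.mean(), base.std() or 1e-9
        z = (series[baseline_window:] - mu) / sigma      # drift relative to the baseline
        if np.max(np.abs(z)) >= z_threshold:
            flagged[sector].append(metric)
    return dict(flagged)

# Toy data: the "browser" timing drifts as traffic ramps up, the "database" one does not.
ramp = np.linspace(0, 1, 30)
samples = {
    ("browser",  "render_ms"): (120 + 400 * ramp**2).tolist(),
    ("database", "query_ms"):  (30 + 0.5 * np.cos(np.arange(30))).tolist(),
}
print(flag_sectors(samples))   # -> {'browser': ['render_ms']}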
So this is a typical web application. It's a real application, not a test application. First we ran some performance tests with a linear ramp up to a thousand users, so a thousand concurrent users. Usually we say there is a factor of a thousand, so that corresponds to roughly a million monthly visits; that's the type of peak you could expect if you have that traffic. And then we ran some performance tests
and we see that as we apply the linear ramp, the response time deteriorates. It's actually three times as high under traffic as it is without traffic. So this is definitely a case that's worth investigating. Then we go to the data instrumentation. The beauty of this model is that you can apply this method to pretty much any data instrumentation that you have or that you want to use. It's not married to one specific method or approach. In this case, we used a specific data source.
But again, you could use anything. And the way we look at data is that it's categorized under sectors, so the various components. For each sector, you have categories, and then you have classes, and you have methods. So you actually have a lot of data coming up for each one of these sectors.
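For illustration only, one instrumentation sample in that sector > category > class > method hierarchy could be shaped like the record below; the field names and example values are assumptions, not the actual agent format.

# Illustrative shape of one instrumentation sample in the sector > category >
# class > method hierarchy described above.
from dataclasses import dataclass

@dataclass
class TimingSample:
    sector: str             # e.g. "browser", "app stack", "server and software"
    category: str
    klass: str
    method: str
    elapsed_ms: float
    concurrent_users: int   # load level when the sample was taken

sample = TimingSample(sector="app stack", category="controller", klass="OrdersController",
                      method="create", elapsed_ms=85.0, concurrent_users=300)
# The agent streams records like this continuously while the test runs, and the
# analysis groups them by (sector, category, klass, method) and looks for drift.
print(sample)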
And this data comes from an agent: while the test is running, there is an agent that pumps this data constantly into our algorithm, and the algorithm works in real time to do this clustering analysis. And at the end, the clustering result, this is kind of an eyesore, but basically you identify the methods that show variations in timing at the same time as the test, as the response time starts increasing. So they correlate well with the performance testing results and with the end-to-end user metrics.
And so this is kind of the end result. As a reminder, what you see on the left are the sectors. The sectors are large groups of data. You can actually dig down into this data and see exactly which component of this group created the problem.
So what we see from here is that, for example, although this test ran successfully without errors and we put a load of 1,000 concurrent users at the end, the browser, so everything that goes under the browser component, starts suffering right before 200 users. So it starts suffering at the very beginning, and then it enters the yellow zone, what we call the T zone, a transition zone. That's where it's kind of deteriorating, but it's not too bad. And then it enters the red zone, which is way, way over where it's expected to be. The next one that starts is the app stack, and the app stack is essentially what's happening with your Ruby. That starts deteriorating right around 300 concurrent users and then enters the red zone later.
So you can see that even though at 1,000 users you see a tripled response time, things start deteriorating a lot sooner. Another very critical data point here is: what is the first component? Because sometimes you have a chain reaction effect; if one piece slows down, then the others slow down as well. So what is the first component that starts slowing down and slowing down the system? In this specific example, it's the browser. Now the browser, again, is a set of data which is represented here, and underneath it you have another hundred data points. So from here you can actually dig down and see exactly which components within the browser cause this slowdown. So again, the objective here is to identify proactively, so this is all before you actually have the 1,000 live users on your platform, to identify proactively, under a specific workflow or scenario, what is going to happen and which components of your application are actually the root cause of the problem.
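A toy sketch of reading those zones: classify each sector's timing against its own baseline and report the first sector to leave the green zone as concurrency grows. The 1.5x and 3x thresholds and the sample numbers are arbitrary illustrative choices, not the product's actual zone definitions.

# Toy sketch of the green / T (transition) / red zones described above, and of
# finding the first sector to degrade as the number of concurrent users grows.
def zone(timing_ms, baseline_ms, t_factor=1.5, red_factor=3.0):
    if timing_ms >= red_factor * baseline_ms:
        return "red"
    if timing_ms >= t_factor * baseline_ms:
        return "T"
    return "green"

def first_to_degrade(series_by_sector, baselines):
    # series_by_sector: {sector: [(concurrent_users, timing_ms), ...]} sorted by users.
    onsets = {}
    for sector, series in series_by_sector.items():
        for users, timing in series:
            if zone(timing, baselines[sector]) != "green":
                onsets[sector] = users
                break
    return min(onsets, key=onsets.get) if onsets else None

series = {
    "browser":   [(100, 210), (200, 390), (300, 520), (1000, 900)],
    "app stack": [(100, 120), (200, 150), (300, 260), (1000, 610)],
}
baselines = {"browser": 200, "app stack": 130}   # no-traffic timings in milliseconds
print(first_to_degrade(series, baselines))       # -> 'browser' (leaves green around 200 users)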
So here I'll give you another one. This is another application. The categories are the same just because we look at the same data. I don't have the raw data here, but you could dig down into all the methods that actually cause this. And here you have an interesting perspective. You still have the browser, you have the app stack that closely follows, but then you have what we call server and software, which goes from green to red. It doesn't even enter the T zone; there is almost a step function where the metrics go from really good to really bad.
So in summary, what we covered today: speed is product feature number one. Performance is paramount; faster is better. How do we tackle that? We tackle that as developers with data. We start with monitoring. Monitoring is a good start, the first line of defense, but it's not enough. Add performance testing, which complements monitoring techniques well. That's still not enough, because what you want is some help in localizing the problem. So here we have performance testing plus data instrumentation plus machine learning.
We have another layer that we can extract from our data, which we have called predictive performance analytics, and we got to see it in action in a couple of examples. So thank you. I think I can take some questions now. You can find me on Twitter at @paolamoreto3, and I'm happy to hear your questions and feedback.