Performance tuning Twitter services with Graal and Machine Learning

Video in TIB AV-Portal: Performance tuning Twitter services with Graal and Machine Learning

Formal Metadata

Title
Performance tuning Twitter services with Graal and Machine Learning

Title of Series

License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Release Date

Content Metadata

Subject Area
Running Twitter services on Graal has been very successful and has saved Twitter a lot of money in datacenter cost. But we would like to run even more efficiently, to reduce cost further. I mean, who doesn't? To do this we are using our machine learning framework, called Autotune, to tune Graal inlining parameters. This talk will show how much performance improvement we got by autotuning Graal.
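The talk tunes three Graal inlining options, which are normally set as `-Dgraal.*` system properties on the JVM command line, and feeds their search ranges to Autotune as a JSON file. A minimal sketch of both, where the JSON schema is purely illustrative (Autotune's real format is not public):

```python
import json

# Graal inlining defaults mentioned in the talk (sizes are node counts
# in the compiler graph, not bytecodes).
DEFAULTS = {
    "TrivialInliningSize": 10,              # inline anything smaller unconditionally
    "MaximumInliningSize": 300,             # never inline anything bigger
    "SmallCompiledLowLevelGraphSize": 300,  # same idea, for the low-level graph
}

def graal_flags(params):
    """Render parameter values as -Dgraal.* JVM flags."""
    return [f"-Dgraal.{name}={value}" for name, value in params.items()]

# Hypothetical Autotune-style config: one entry per parameter with a search
# range. The speaker notes the range barely matters; 1..1000 works, because
# the optimizer narrows it down by itself.
autotune_config = json.dumps(
    {name: {"min": 1, "max": 1000, "default": default}
     for name, default in DEFAULTS.items()},
    indent=2,
)

print(graal_flags(DEFAULTS))
```

The flag names match the Graal options named in the talk; everything about the config file layout is an assumption for illustration.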
My name is Chris, and I work for Twitter. This is a very long presentation, but I'm only going to show you the second half of it, because that's the interesting part. I'm assuming everyone in the room knows what Bayesian optimization is. Very good, then I don't have to explain it, and I can skip a lot of slides here.

You all know what Twitter is. We run on microservices, and we have a lot of them. This talk is about how we use Graal to run our mostly Scala-written services, and how by using Graal we save a lot of CPU.

We also have something called Autotune. It's basically a framework that uses Bayesian optimization as the machine learning method to tune JVM parameters. We tell Autotune "tune this parameter for me", and Autotune talks to the Bayesian optimization part, which is Whetlab; there's also an open-source version of it called Spearmint, if anyone has heard of it. Spearmint, or Whetlab, figures out the next value of the parameter to try, exploring the space to find the optimal configuration. Autotune is the driver that runs these experiments. You all know what Graal is, so I'll skip that part; you can watch my other talk on YouTube if you want.

These are the parameters that we tune. There is one called TrivialInliningSize; the default is 10, and if the compiler graph of an inlinee is smaller than 10 nodes, Graal just inlines it without looking at any other data. Then there is MaximumInliningSize, which is the other end: if the inlinee is bigger than 300 nodes, it is not inlined. And there is SmallCompiledLowLevelGraphSize, which is similar to the second one but applies to the low-level graph, so don't worry too much about what it does. These three parameters are the ones that affect inlining the most.

I did some previous work. I have another talk where I explain how cool Graal is and how much CPU we're saving, and these are two slides from that talk. Look at the blue and orange bars: just by using Graal instead of C2 to run our services, we reduced parallel GC cycles by about 4.2 percent. When I ran this about a year ago, I
manually tried changing the three parameters you just saw, to figure out whether we could get better performance out of them. By sitting down for an afternoon and trying values for two or three hours, I could reduce GC cycles by another 1.5 percent. I did the same for CPU time: just by running Graal we reduce CPU utilization by 13 percent, which is a lot and saves us a lot of money, and I could squeeze out another 2 percent by manually fiddling with these parameters. But you don't want to do this manually for every service; you want a machine learning framework to do it for you. That's exactly what this is.

This is the configuration, a JSON file that you pass in to Autotune. It names the parameter, and then you tell it the range over which to explore the space. You don't really have to specify the range; I later ran experiments where I just said 1 to 1000, because it doesn't matter, the framework will figure out what the right values are. I only used a narrower range here because I wanted it to work for the talk, but I'll probably rerun the experiments with 1 to 1000 and let it figure things out itself.

The test setup: I have dedicated machines with nothing else running on them, because crosstalk is a big issue when doing performance evaluations. All instances receive the exact same requests. That's important: not the same number of requests, the exact same requests, because a tweet can be anywhere from one character to 280, which affects memory allocation a lot and would change the outcome. We run this version of Graal in the default tiered setup, C1 plus Graal; we changed nothing there.

The first experiment is the tweet service. I have two experiments, but I'm not sure I can show you the second one, for time reasons. The tweet service basically reads and writes tweets. It's built on Finagle, an open-source framework that we developed, and you can get it on GitHub if you want. It's "an extensible RPC system for the JVM, used to construct high-concurrency servers", blah blah blah, I have no idea what that means, but the most important part is that it's 92 percent written in Scala, and Graal handles Scala very well: Scala allocates a lot of temporary objects, and Graal's inlining and escape analysis are just better than what C2 has. That's why we can reduce the memory allocation rate, reduce GC cycles, reduce CPU utilization, and so on.

OK, so you have to pass in an objective; in this case it's user CPU time, and since the Autotune framework looks for a maximum, we invert it to find the configuration that uses the least CPU. Then you can specify some constraints. We run on Aurora on Mesos, and there is this thing where you get throttled: if you use too much CPU, it kills you. That's our constraint, because I noticed while tuning manually that sometimes you pick values so wild that the service doesn't even come up, so we need a constraint to know when we went too far.

This is 24 hours of running an experiment. One run is only 30 minutes long and is called an evaluation. I picked 30 minutes because it's long enough for the tweet service to reach a steady state, and I wanted to have a lot of evaluations so we can see how Autotune really works. As you can see, this is requests per second, which is the same for the two instances, and this is user CPU time. The experiment is blue, and the control,
which doesn't change, is orange. If the blue line is below the orange one, we see an improvement; if it's above, it's worse. I have the same graph in a slightly different form here, which is a little easier to read: whenever blue is below, it's better, whenever it's above, it's worse.

The result, when this was done, looks like this. It's a web page that shows all the evaluations, and this one is the best. You see the objective is 1.0838, which means we could improve CPU utilization by over 8 percent. And these are the parameters: remember, the first one defaults to 10, and the other two default to 300. If you use these values instead, you get 8 percent less CPU. The bottom of the table looks like this: three runs violated the constraint, one was still in progress when I shut the experiment down, and, as you can see, some were worse, like this one that's almost 5 percent worse, and this one about 3 percent worse.

These are charts of the three parameters. Showing them this way is not perfect, because we're exploring a three-dimensional space (or n-dimensional, with however many parameters you have), so every data point here also depends on the two other parameters; keep that in mind. But if you squint a little, you can see there's actually a trend going up: if you increase TrivialInliningSize, you get a little faster, but at some point it's too much and it comes back down. MaximumInliningSize is kind of flat. And for SmallCompiledLowLevelGraphSize you don't really have to squint to see what's going on: its default is 300, so we'd be in this area, and we can improve by that much if we increase the value. I didn't check how much time I have left, with all the stuff in the beginning. OK.

What I did then to verify the result: I took the top parameters and ran a 24-hour verification experiment, running the tweet service with C2, Graal, and, in red, Graal with the Autotune parameters. This is again PS Scavenge cycles, because the tweet service uses the parallel GC. As you saw earlier, in this particular run Graal gave 3.4 percent fewer GCs, and the Autotune parameters added another 3.5, so in total we reduce GC cycles by roughly 7 percent. The funny thing is that Autotune actually squeezed more out of it than Graal did by default. The next graph is basically the same; it's allocated bytes per tweet, very flat over 24 hours, and you see the same roughly 7 percent improvement. And this is user CPU time: as you saw in the beginning, that's about 12 percent in this particular run (it varies a little), and with Autotune we bring it down another 6.2 percent, which gets us to 18 percent less CPU. We have our own datacenters and own our machines, but even in the cloud, if you can run your business on 18 percent fewer machines, that's a lot of money you don't have to spend. You also save electricity, cooling, and all that, so we're actually trying to save the world here.

Then this is latency, p99 latency for tweets. You can see Graal is certainly better, though it's a little hard to tell by how much; Autotune looks like this, certainly better, but again hard to quantify. So what I did was integrate over the 24 hours of data, and that's this graph: we can reduce p99 latency by 19 percent just by using Graal, and then another
8 percent by using the Autotune parameters. So in total you get your tweets roughly 28 percent faster, and you should tweet that.

I think I have five minutes left, so I'll skip this next part; it's basically the same experiment with a different service. Let's just go through it quickly. You've seen all of this before: 7.6 percent. These graphs look similar to the earlier ones. This run is interesting: 1.6 and 3.5, meaning Graal alone reduces GC cycles by only 1.6 percent, but Autotune gets another 3.5 percent out of it. And yes, the winning parameter values here, 23, 398, 646, are different from the ones before. Scroll, scroll: we've seen this, GC cycles, CPU time. This service is also built on top of Finagle, but it's clearly not as CPU-sensitive; I'm not sure why. We could only reduce CPU by 5.5 percent just by using Graal, compared to 12 for the tweet service, but with Autotune we could reduce it by 7.8 percent. I think the reason this gain is higher than Graal's alone is the same as in the GC graph: we reduce GC cycles by more, which automatically means we allocate less and collect less, and that reduces CPU.

OK, questions? Yes, there is one question that everyone has. Correct: did I do the same for C2? Of course I did; I couldn't come up here and not have done that. I picked these three C2 parameters, very similar to the Graal ones. MaxInlineLevel, which Graal doesn't have, is the inlining depth at which C2 stops; the default is 9, so if a call chain is deeper than nine methods, C2 just doesn't inline anymore. MaxInlineSize, default 35, is basically the same as the Graal parameter; the only difference is that Graal counts nodes in the compiler graph, while the 35 is bytecode size. There's a funny story about this: assert statements are actually counted in that 35. It's stupid; we never fixed it. The JSON configuration is much the same; for MaxInlineLevel I used a range of 5 to 20, because I wanted to see whether changing the inlining level actually changes anything. Same kind of experiment, same graphs, and this is the result. The best we could do is 5 percent, and I think that one is an outlier, because below it we see 3.8, 3.5, 3.3; I think that's more the real range, and if I ran the verification experiment, which I haven't done yet, we would see roughly a 3.5 percent improvement that Autotune can get out of C2. Nice, right? Autotune does a very good job, but compared to Graal it's just nothing: for the tweet service we had an 18 percent improvement using Graal with Autotune, and the most we can get here is, let's be nice and say, 4 percent. And no, it won't get better, because this is Scala, and C2 is not tuned for Scala. That's an interesting chart: MaxInlineLevel, you can see it goes up; 9 is the default, so we're around here, and honestly it should probably be 17. The other graphs are flat, so those parameters don't change much.

That was it. My summary is always very simple, and I always just ask people: please try Graal. As you saw, especially when you run Scala code, you should certainly try it; it can reduce the cost of whatever your business is doing. And I want people to try it, so that we find more bugs and can make Graal a
better compiler. Or try it on your pet project, or, I don't know, go to work on Monday and put it in production; that would be cool, we do it. If you get a crash, file a bug. If something doesn't work as expected, or is slower than C2, file a bug. If it's better, tweet about it and tell me; I would love to hear it. That was it, thank you very much. Any questions?

Q: Did you run these experiments only for 24 hours? A: Yes. Q: Because if you look at the parameter space with the ranges you showed, it could be over 3 million configurations. A: Possibly, yes. The problem is you didn't see the first half of the presentation. Q: But I know what Bayesian optimization is. A: If you know what it is, then you know how it works. Q: It can also be quite fragile, and there are other concerns, like: how do you batch this, can you batch it? There's a recent post by Facebook where they used the same technique for their Hack compiler, and quite a lot of information is needed. So I was wondering: is it feasible for anyone besides Twitter, with a production workload, to try this? A: Oh, absolutely. I did 40 iterations, and that's a very good size, I have to say. If you look at the results table, you'll see at the top that it explored the space enough to get a good result. We're not there yet, but the goal is to have this always on, for every service, so that the services tune themselves automatically, all the time. Then you can run 30-day experiments; you don't have to tune every day. Yes, the code is changing, everyone is deploying multiple times a week, but even retuning only once a month is still a hundred times more often than you'd tune manually. You only tune manually when someone gets upset, and then he tunes it. When I got to Twitter I asked: hey, when was the last time you tuned the parameters for the tweet service? Three years ago. That's basically what happens. So we want this always on, and then we can run 30-day experiments with longer evaluations.

Q: Is there any intention to make this Autotune framework open source? A: Yes, there is. The problem is the Whetlab Bayesian optimization part; we probably can't open source that, because we bought the company, and it's complicated. But there's the Spearmint framework, which is open source, and Autotune itself our team wrote ourselves, so we can open source it. It's just not very user-friendly at this point: I have to curl JSON files to some URL, and you can only kill all the experiments at once, not a single one. It's a work in progress, but we want to open source it.

Q: There are about a thousand -XX parameters. How many can you tune at once, and how do they interact? A: How much time do you have? You can do all of them if you want. Autotune was originally written to tune GC parameters, so you can tune as many as you want; the search space just gets a bit bigger.

OK, that was it. Ask me later. Thank you.
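As described in the talk, Autotune drives repeated half-hour evaluations: each one runs the service with candidate parameter values, takes inverted user CPU time as the objective (so higher is better), and discards configurations that get throttled. A toy sketch of such a driver loop, with random search standing in for the Whetlab/Spearmint Bayesian suggestion step; the simulated `evaluate` function and all thresholds are made up for illustration, not Twitter's actual code:

```python
import random

# Search space for the three Graal inlining parameters; the talk suggests
# 1..1000 is fine, since the optimizer narrows the range itself.
SPACE = {
    "TrivialInliningSize": (1, 1000),
    "MaximumInliningSize": (1, 1000),
    "SmallCompiledLowLevelGraphSize": (1, 1000),
}

def suggest(space, rng):
    """Stand-in for the Bayesian suggestion step: pick a random point."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in space.items()}

def evaluate(config):
    """Simulated 30-minute evaluation: returns (objective, throttled).

    Objective is inverted CPU time, so HIGHER is better; the constraint
    models getting throttled by Aurora/Mesos for using too much CPU.
    """
    cpu_time = (100.0
                + abs(config["TrivialInliningSize"] - 60) * 0.1
                + abs(config["SmallCompiledLowLevelGraphSize"] - 700) * 0.05)
    throttled = config["MaximumInliningSize"] > 900  # wild values kill the service
    return 1.0 / cpu_time, throttled

def autotune(n_evaluations=40, seed=1):
    """Run the experiment loop and return (best_objective, best_config)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_evaluations):       # the talk used ~40 evaluations
        config = suggest(SPACE, rng)
        objective, throttled = evaluate(config)
        if throttled:                    # constraint violated: discard the run
            continue
        if best is None or objective > best[0]:
            best = (objective, config)
    return best

print(autotune())
```

A real Bayesian optimizer replaces `suggest` with a model-guided choice, which is why 40 evaluations can be enough even when the full grid has millions of configurations.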

