
The Race to the Bottom: Low Latency in the age of the Transformer


Formal Metadata

Title
The Race to the Bottom: Low Latency in the age of the Transformer
Title of Series
Number of Parts
56
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
So you want to deploy a large language model, and keep your latency SLA? NLP adds enormous value to customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even to understand why it is so expensive. This talk will give an overview of the latency and throughput challenges, and how to solve them. We will give an overview of the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult-to-understand technology. The audience will walk away with the information they need to decide on the best direction for inference in their production platform.
Transcript: English (auto-generated)
Hi all, thanks for coming. This is The Race to the Bottom, Low Latency in the Age of the Transformer. Thanks to everybody at Berlin Buzzwords who invited me to give this talk today. It's a short talk. It's 20 minutes. I have a lot to cover. It's impossible
to cover everything in the 20-minute time span. I'll do my best. If you have any questions, please get in touch with me afterwards. So I'm Max Irwin. I'm the founder and chief everything officer at max.io, where
I make Mighty Inference Server, which is a scalable inference engine based on Rust and ONNX Runtime. I'm not going to read through all these bullet points. Just importantly, I've been coding for a long time. I've been doing search for a while, starting in around 2010, 2011, with keyword search and then semantic search in 2015. Then I started getting into the vector search stuff right around 2018, 2019, when I was working with OpenSource Connections. I participated in last year's big ANN vector competition with Dmitry Kan and some other folks, where I made BuddyPQ. I'm also contributing to a book called AI-Powered Search with Trey Grainger and Doug Turnbull, where I write about semantic search and question answering with some of the techniques I'll show today. Sometime it'll be printed; it's on MEAP right now, if you want to check it out. So what are we talking about when we talk about inference and low latency and all these fancy words? The solution context I'm going to talk about today is primarily around getting embeddings, or vectors, from a model, and then using
those vectors to index them in an approximate nearest neighbor search index. Then at query time you need vectors: when somebody types in a text query, for example, you inference the vectors — the embeddings — from the text query, and then you run a similarity search with the approximate nearest neighbor vector search engine. So there are a couple of examples. Alessandro Benedetti just gave a talk about vector search in Solr, and there's a starter kit, if you want to start playing around with that, that uses some of the tools I'm going to show today in this Neural Solr. And then there's a vector search engine called Qdrant, which is open source and written in Rust, and there's another starter kit that, again, uses Mighty and will let you use Mighty in coordination with Qdrant. There's also some other things you can do. There's extractive question answering, which is what I like to call really fancy highlighting, text classification, token classification. There's a whole bunch of other use cases when you talk about inference and inference latency. And these things tend to be quite slow and bulky. So that's why I'm here today.
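To make that solution context concrete, here is a minimal sketch of the index-and-query flow in Python. The model name, the toy documents, and the brute-force cosine similarity are illustrative assumptions — in a real deployment the document vectors would live in an approximate nearest neighbor engine such as Solr's vector search or Qdrant, not in a NumPy array.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative bi-encoder; any model that produces sentence embeddings works.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "how to tune Solr for dense vector search",
    "Qdrant is a vector search engine written in Rust",
    "a recipe for sourdough bread",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)   # indexed offline

# At query time: inference the query vector, then run a similarity search.
query_vec = model.encode("Rust vector database", normalize_embeddings=True)
scores = doc_vecs @ query_vec        # cosine similarity (vectors are normalized)
print(docs[int(np.argmax(scores))])
```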
So again, what is inference? We take some text, in this case "hello Berlin Buzzwords", and we pass the text into a model and we do inference against the model. So we have to tokenize, then take the tokenized IDs, provide those as inputs, and the inputs to the model will produce a result. And the result looks like this: it's basically a bunch of numbers. In this case we have an array of floating point numbers, which is just a big vector. And that's basically what we're talking about: how do you go from the text to getting these vectors?
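As a hedged illustration of those two steps — tokenize, then run the token IDs through the model to get an array of floats — here is a small Python sketch using the Hugging Face transformers library. The checkpoint name is an assumption for the example; the talk doesn't name a specific model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; substitute whatever bi-encoder you actually deploy.
name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

# Step 1: tokenize the text into token IDs.
enc = tokenizer("hello Berlin Buzzwords", return_tensors="pt")
print(enc["input_ids"])              # e.g. tensor([[ 101, 7592, ..., 102]])

# Step 2: pass the IDs through the model to get a big array of floats.
with torch.no_grad():
    out = model(**enc)
print(out.last_hidden_state.shape)   # (1, number_of_tokens, 384) for this model
```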
So I'm going to show a demo, and then — it's a little bit flipped on its head — I'm going to talk about what cost and hosting analysis looks like for something like this. Because when teams come in and they want to get started with approximate nearest neighbor search or model inference using a transformer, it's a cost that they're not used to and it's typically quite large, and it can be a little bit jarring. We'll then talk about how overhead occurs, how we reduce the overhead for inference, and how to make things faster and lighter and more efficient. And then the areas of optimization, specifically around different things like models, software, hardware, stuff like that. And then hopefully time for one or two questions at the end. So now I'm going to give a demo, and I always offer this USB converter to the demo gods as an offering. Please let my demo go well. Thank you demo gods.
Alright — on the left we have a client, and on the right, the server. It may take a moment for this to update, but you'll see a whole bunch of stuff flashing on the right terminal. Basically, I've started, on a 128-core machine, 128 independent inference cores. So we can look and we see a whole bunch of cores that are now running as processes, and I can just do a count and see that we have 128 cores. Now I'm going to show the top command quickly, and I'm just going to run a performance command from the client. And now you can see the CPU is starting to spike — it doesn't show all of the cores, it doesn't fit on the screen — but you can see that various different performance metrics will use different amounts of the CPU. So we're going to max out. And there's a load balancer that takes care of some of this; we're going to max out at about 3,300 or 3,400 requests per second. What I'm showing here is the requests per second, and then the latency histogram. So that's pretty much it — that's the end of the demo. What you were seeing was me just sending thousands of requests to the inference cores and getting back responses.
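The talk doesn't say which load-testing tool drives the demo, so here is a rough, hedged client-side sketch in Python that produces the same two numbers the demo dashboard shows — throughput and a latency percentile view. The endpoint URL and JSON payload are placeholders for whatever inference server you are benchmarking.

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/embed"           # placeholder endpoint
PAYLOAD = {"text": "hello Berlin Buzzwords"}  # placeholder request body

def one_request(_: int) -> float:
    """Send one request and return its latency in milliseconds."""
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=10)
    return (time.perf_counter() - start) * 1000.0

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    latencies = list(pool.map(one_request, range(2000)))
elapsed = time.perf_counter() - start

print(f"throughput: {len(latencies) / elapsed:.0f} req/s")
print(f"p50={statistics.median(latencies):.1f}ms  "
      f"p95={statistics.quantiles(latencies, n=20)[18]:.1f}ms")
```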
Okay, so now I'm done with the demo, and I'm going to go back to the slides. I'll give that a moment to catch up. Okay, I'm still waiting for the presentation to be shared again. Sorry for the delay. Thank you all for coming to the remote session. I'm sorry I'm not there in Berlin, I really wish I was there; I'm at home in the United States. And I'm still not... okay, and we're back to the slides. Thank you so much. So what we saw was various throughputs and latencies. This talk is about latency, but latency and throughput are very closely connected, depending on how you organize the resources on a machine. So you can see the throughput in some situations can be, you know, 880 requests per second, or queries per second, on a 32 vCPU machine, and on a 128 vCPU machine you can scale that linearly if you do things right — and that's what we're going to talk about. The latency is between about 20 and 60 milliseconds, and there are some outliers depending on what's happening, but for the most part you look at the histogram of latency; you don't just go on one number. So the cost for this thing — it can be quite expensive. What I was just showing you on Amazon is a lot, right? So you can
spend a lot of money on Amazon if you want to: a one-year reserved instance price is about $3.60 an hour, and that's in dollars. Or on a 32-thread machine — and these are compute optimized — it's about 88 or 90 cents. You can get things a lot cheaper. So if you use something, for example, like Hetzner Cloud — there's no affiliation, I'm just showing examples — you can take the cost of how many CPUs you need for how many queries per second you need, and then do a cost analysis to understand what it's going to cost you to host an inference server to meet your customer requirements. This calculation is quite easy if you have a solution that can scale linearly, like I just showed. Some solutions do not scale linearly — sometimes you're basically stuck with what you get, and I've seen quadratic extrapolation and things like that if you're overloading and trying to do things that you shouldn't. But for the most part, you spend about 40 euros a month and get about 100 queries per second, which is more than most teams need. That's roughly three million queries per day, assuming peak traffic falls in an eight-hour period. Just for fun, I went all the way to the bottom, where we have close to a billion requests in 24 hours — that's like 5,000 euro per month. If you need a billion requests in 24 hours — most people do not.
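To make that back-of-the-envelope math explicit, here is a tiny capacity-and-cost sketch in Python. The 100 queries per second and 40 euro per month per instance are the illustrative figures from the talk, and linear scaling across instances is assumed — verify both against your own benchmarks and prices.

```python
import math

# Illustrative figures from the talk (verify with your own benchmarks/prices).
QPS_PER_INSTANCE = 100   # sustained queries per second for one compute-optimized instance
EUR_PER_INSTANCE = 40    # monthly cost of that instance

def monthly_cost_eur(peak_qps: float) -> int:
    """Assumes throughput scales linearly with the number of instances."""
    instances = math.ceil(peak_qps / QPS_PER_INSTANCE)
    return instances * EUR_PER_INSTANCE

print(100 * 8 * 3600)            # ~2.9M queries in an eight-hour peak window
print(monthly_cost_eur(100))     # 40 EUR/month for ~100 QPS
print(monthly_cost_eur(12_000))  # ~12k QPS (~1B requests/day) -> 4,800 EUR/month
```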
So how do you get to this thing that I showed, where you can do a linear scale using a CPU — without a GPU or more exotic hardware — using something that's more specific and fits your needs? First, we have to look at what an inference request actually looks like. You take content and you pass it in to an inference server. The inference server is responsible for handling the request — that could be an HTTP request or a gRPC request or something else. Then you have to do pre-processing, because you're passing in text and you have to prepare the text to pass into the model. Then the model is hosted, typically in a runtime — in this case I was showing ONNX Runtime, which is wrapped in the application, but you can use LibTorch or TensorRT, and some other runtimes will wrap the model. And that's just the inference part; you also have the preparation that's required, and afterwards you have to post-process. If you're just taking vectors and sending them out, then you can just send the vectors out; but if you're doing something like sentence transformers, you have to do mean pooling, and for extractive question answering you have to align the probabilities with the tokens and the contexts, things like that. And then at the end you get the outputs, and you use the outputs further on in your solution.
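The post-processing step for sentence transformers mentioned above — mean pooling — can be illustrated with a short, hedged sketch. The tensor shapes below are dummies standing in for real encoder output; in a live pipeline, last_hidden_state and attention_mask come from the model and the tokenizer.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # avoid dividing by zero
    return summed / counts                             # one vector per input text

# Dummy shapes: batch of 2 texts, 6 tokens each, hidden size 384.
hidden = torch.randn(2, 6, 384)
mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
print(mean_pool(hidden, mask).shape)                   # torch.Size([2, 384])
```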
So the three things that people focus on when they try to make things scalable, faster, higher throughput: smaller models, better hardware, and software efficiency. Those are the three areas that people look at, and there's a lot of effort that goes into these things. I don't have enough time to dig into all of the details here — there's a whole field in this area — but I'll give an overview, so if things sound interesting you can take a look and dive deeper. So when we talk about models, the model overhead is the thing that you typically want to reduce. A model tends to be between 400 megabytes and a gigabyte and a half for text bi-encoder models, which is typically what I'm talking about, or question answering models, things like that. So how do you make those smaller? When we talk about smaller, we're typically talking about parameters: every calculation that has to be done when you're passing inputs through all the layers in the deep neural network. And this is pretty slow, right? You have billions of parameters, and every time you pass in a set of inputs it takes a while to get to the other side. Hardware is really fast these days, but still, it's relatively slow — we're talking 10 to 25 milliseconds on CPU, just for one query. So the techniques to make the models faster — there are many, but there are four that I'll just quickly overview. First there's distillation. Distillation is taking a large model and using what we call the teacher-student approach: the large model is the teacher, and then you train a student model, which is smaller, based on what the teacher model knows. So you can train a smaller model, and the technique is known as distillation; it's typically used to reduce the layers and the size of the model.
Then you have something called quantization. Quantization is just reducing the possibilities of the values inside of each parameter. In this case I'm showing 32-bit floating point: instead of using floating points, you use int8s, so you basically get a quarter of the size, because an 8-bit byte compared to a 32-bit floating point is pretty small. Pruning and sparsity is just removing areas of the network — connections, weights, and parameters that aren't used — and you can train something like that.
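As a hedged example of post-training dynamic quantization, here is how it can look with ONNX Runtime's quantization tooling, which fits the ONNX Runtime setup shown in the demo. The file paths are placeholders, and accuracy should be re-measured after quantizing, as discussed later in the Q&A.

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Placeholder paths: an exported FP32 ONNX model in, an int8-weight model out.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,   # store weights as 8-bit integers (~4x smaller)
)
```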
Mixed precision is when you mix floating-point formats — FP16s, FP32s, and int8s — depending on what the model needs.
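A common way to get automatic mixed precision at inference time is PyTorch's autocast. This hedged sketch assumes a CUDA GPU and an illustrative Hugging Face checkpoint, neither of which the talk prescribes.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; requires a CUDA-capable GPU for FP16 autocast.
name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval().to("cuda")

enc = tokenizer("hello Berlin Buzzwords", return_tensors="pt").to("cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(**enc)   # matmuls run in FP16; precision-sensitive ops stay in FP32
```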
Better hardware: well, you have to spend time on understanding how you optimize a CPU, for example. You can spend money on GPUs — you can use V100s, which are NVIDIA GPUs with tensor cores, and some of them are very expensive. You can use cloud-vendor-specific hardware, so TPUs on Google Cloud, or Amazon Inferentia. And you can do exotic stuff: you can map a deep neural network onto an FPGA, or use something like Graphcore, which builds specialized processors and has its own cloud, things like that. And then with software efficiency, you can take hardware, understand what your target is, and then optimize your software around
how you want to target the hardware that you know you're going to be using. So this is ONNX Runtime — you can go to onnxruntime.ai, play around with clicking the little boxes there, and see the solution that comes out. Then with software efficiency, people tend to focus on language performance. It's really tiny on the screen, but the red boxes at the bottom are Python, in the middle you have Go, and further up you have Rust. So just switching from Python to a faster, more efficient language typically buys you a lot in the pre-processing and post-processing. Then you focus on algorithms — again, pre- and post-processing. And then hardware synergy: optimized runtimes, SIMD, vectorization, and just making sure that you're getting everything out of the hardware as possible, using the software tools at your disposal.
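As a hedged example of that hardware-synergy point — configuring the runtime for the hardware you are targeting — this sketch shows ONNX Runtime session options for a CPU deployment in the style of the demo (one lightly-threaded session per worker process, graph optimizations enabled). The model path is a placeholder, and the right thread counts depend on your machine.

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1   # keep each worker light; scale by running many worker processes
opts.inter_op_num_threads = 1
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Placeholder model path; pick execution providers to match your target hardware.
session = ort.InferenceSession(
    "model.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```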
So, costs and trade-offs. You can try to target CPU like I showed today, but you have to spend a bunch of time to do that. You can use GPU instances, so you don't have to spend as much time, but it costs a lot more money to host. Or you can just say, I don't want to deal with it — if you have money to burn, you can use a third-party API and they'll handle it for you, but you're going to end up paying a lot more money than you'd like to. And that is all. I clicked the thing, but we're not at the next slide yet. But that's the presentation — it was really short, so we've got time. If we have time for questions, I'm not sure, but please let me know; otherwise, get in touch — my contact information is in the slides. Yeah, maybe we have one or two questions, quick. Yeah — you're second, then.
Thank you very much, it was really interesting. I was wondering if you have anything to say about sentence transformers on GPUs — what I usually see is that you get very low utilization of the GPU, if you have anything to say about that. Yeah, it depends on how you are loading the model and what you're doing. When you say GPU, it's hard to say, because again, there's a very close relationship between the software and the hardware. So you may be seeing low utilization if you're just trying to run something in Python using PyTorch and loading the model onto the device using PyTorch. You'd probably have better performance with something like TensorRT, if you're using a GPU that has tensor cores — so you may see something there. Also, remember what I said about the pre- and post-processing: you may see something slow if you are doing mean pooling, or if you have a lot of inputs and you're trying to get embeddings for a whole paragraph versus a query — that can impact things as well. And the relationship of how you are running threads, and passing the model from the CPU to the GPU when there isn't affinity — because of the RAM connectivity and requirements there — sometimes you end up doing more hops than you should, and that can slow things down. There are some interesting blog posts about that, which I'll try to dig up and post in the chat after the talk. Okay, quick last question.
Okay, thank you. My question is maybe simple, but the answer is complicated. Let's say you have some POC with some transformer-based models that work pretty well in terms of performance, in terms of quality, but it's too slow. Is there a general framework you can use to figure out where you should start? You mentioned distillation, you mentioned using different hardware, et cetera, but there are a lot of different solutions to look at, right? Is there a general first step to look into, or is it really problem-dependent? Do you have something to say about that? I would start with distillation. If you're using models like sentence transformers, there are distilled models that are used as the base models for the state-of-the-art sentence transformer models that are out there these days. If you're using something else, I definitely recommend starting with distillation, because that's typically the best thing that you can do: you can train a smaller model, and the number of parameters in the model will shrink — and that's the bulk of the overhead, I should say. In terms of accuracy, there will be some accuracy loss, so that's another challenge. If you're looking into something like quantization, there are different types of quantization where the accuracy is terrible after you quantize the model, but if you do quantization-aware training, or even automatic mixed precision, that can get you a lot of the way there as well. Also, interestingly, if you're doing something like quantization, sometimes it ends up being slower because of
the way the software is interacting with the hardware, which can be quite tricky — so you have to make sure that the hardware you're using will vectorize appropriately and use the quantization to its fullest. Thank you very much, Max Irwin. Give a warm applause for Max.