The Race to the Bottom: Low Latency in the age of the Transformer

Formal Metadata

Title
The Race to the Bottom: Low Latency in the age of the Transformer
Title of Series
Number of Parts
56
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
So you want to deploy a large language model and keep your latency SLA? NLP adds enormous value for customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may be finding it difficult to get the speed and throughput you need within your budget, or even to understand why it is so expensive. This talk will give an overview of the latency and throughput challenges and how to solve them. We will cover the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult-to-understand technology. The audience will walk away with the information they need to decide on the best direction for inference in their production platform.
Transcript: English (auto-generated)