The Race to the Bottom: Low Latency in the age of the Transformer

Formal Metadata

Title
The Race to the Bottom: Low Latency in the age of the Transformer
Series title
Number of parts
56
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt, copy, distribute, and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided you credit the author/rights holder in the manner they have specified.
Identifiers
Publisher
Year of publication
Language

Content Metadata

Subject area
Genre
Abstract
So you want to deploy a large language model and keep your latency SLA? NLP adds enormous value for customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other big neural network architectures make their way into your platform, you may find it difficult to get the speed and throughput you need within your budget, or even to understand why it is so expensive. This talk gives an overview of the latency and throughput challenges and how to solve them. We will cover the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult-to-understand technology. The audience will walk away with the information they need to decide on the best direction for inference in their production platform.