The Race to the Bottom: Low Latency in the age of the Transformer

Formal Metadata

Title
The Race to the Bottom: Low Latency in the age of the Transformer
Title of Series
Number of Parts
56
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
So you want to deploy a large language model and keep your latency SLA? NLP adds enormous value for customers, but getting it to work efficiently is fraught with uncertainty and high cost. As transformers and other large neural network architectures make their way into your platform, you may find it difficult to get the speed and throughput you need within your budget, or even to understand why inference is so expensive. This talk gives an overview of the latency and throughput challenges and how to solve them. We will cover the product and cost implications as well as the technical improvements that can be used to get things running fast. We will compare solutions and help make sense of difficult-to-understand technology. The audience will walk away with the information they need to decide on the best direction for inference in their production platform.