CachedAttention

Prerequisites

To understand this paper, you should have a solid understanding of the Attention mechanism.

Summary

This paper proposes a KV caching system that reduces redundant computation during multi-turn LLM inference.

Background

The paper splits LLM inference into two phases:

  1. Prefilling: build the KV cache for all prompt tokens and generate the first output token.

  2. Decoding: compute the KV entries for the newly generated token and generate the next token.

The inference engine runs the prefilling phase once, then repeats the decoding phase until the model emits an EOS token or reaches the maximum generation length.
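
As a rough illustration, here is a minimal sketch of the two-phase loop. The `model(tokens, kv_cache)` interface that returns logits together with the KV cache it has built is a hypothetical stand-in, not the paper's implementation.

```python
import torch

def generate(model, prompt_ids, eos_id, max_new_tokens):
    # Prefilling: a single forward pass over all prompt tokens builds their
    # KV cache and yields the first generated token.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    next_id = int(logits[-1].argmax())
    generated = [next_id]

    # Decoding: feed only the newest token; its K/V entries are appended to the cache.
    while next_id != eos_id and len(generated) < max_new_tokens:
        logits, kv_cache = model(torch.tensor([next_id]), kv_cache=kv_cache)
        next_id = int(logits[-1].argmax())
        generated.append(next_id)
    return generated
```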

Problem proposal

Most real-world workloads involve multi-turn conversations, and that is where the problem appears.

For every later turn (Turn 2, Turn 3, ...), the engine must run the prefilling phase over the whole conversation history again, which is purely duplicated computation. As the conversation grows longer, this recomputation in the prefilling phase can take up to 99% of the inference computation.
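
A back-of-the-envelope sketch of why this blows up (hypothetical numbers; only prompt tokens are counted for simplicity):

```python
def prefilled_tokens(turn_lengths, reuse_kv):
    total, history = 0, 0
    for new_tokens in turn_lengths:
        # Without KV reuse, every turn re-prefills the whole history as well.
        total += new_tokens if reuse_kv else history + new_tokens
        history += new_tokens
    return total

turns = [200] * 10                               # 10 turns, 200 new tokens each
print(prefilled_tokens(turns, reuse_kv=False))   # 11000 tokens prefilled
print(prefilled_tokens(turns, reuse_kv=True))    # 2000 tokens prefilled
```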

Contribution

The paper proposes CachedAttention as a solution to this problem:

  1. Cache the KV values from previous turns and reuse them in future turns.

  2. Overlap KV cache saving/loading with the Transformer computation (see the sketch after this list).

  3. Use a hierarchical KV cache placement together with a positional-encoding-decoupled KV cache scheme.
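
For item 2, a minimal sketch of layer-wise prefetching with CUDA streams; the helpers `load_kv_from_host(layer)` and `attention_layer(layer, x, kv)` are hypothetical stand-ins, not the paper's actual scheduler.

```python
import torch

copy_stream = torch.cuda.Stream()                # dedicated stream for host->GPU copies

def prefill_with_prefetch(x, num_layers, load_kv_from_host, attention_layer):
    # Kick off the copy of layer 0's cached KV before any compute starts.
    with torch.cuda.stream(copy_stream):
        kv_next = load_kv_from_host(0)           # async H2D copy (pinned host memory assumed)

    for layer in range(num_layers):
        # Make sure the KV for this layer has finished copying.
        torch.cuda.current_stream().wait_stream(copy_stream)
        kv = kv_next
        if layer + 1 < num_layers:
            # Prefetch the next layer's KV while this layer computes.
            with torch.cuda.stream(copy_stream):
                kv_next = load_kv_from_host(layer + 1)
        x = attention_layer(layer, x, kv)        # compute overlaps with the prefetch
    return x
```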

Positional encoding is decoupled because of token truncation. As the conversation grows, the token sequence eventually overflows the maximum context window, so the oldest tokens are truncated. If the cached KV values already contain positional encodings, the positions of the surviving tokens shift after truncation, and the whole KV cache has to be invalidated and recomputed from scratch. Since truncation happens frequently in long conversations, CachedAttention stores the KV cache without positional encodings and applies them at attention time, so the cache remains reusable even after truncation.
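
A minimal sketch of the idea, assuming rotary positional embeddings (RoPE) and hypothetical shapes: keys are cached without rotation, and the rotation is applied with the tokens' current positions at attention time, so a truncated cache stays valid.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, head_dim) with an even head_dim; standard rotary embedding.
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    ang = positions.to(torch.float32)[:, None] * inv_freq[None, :]   # (seq_len, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Keys are cached WITHOUT positional encoding.
k_cache = torch.randn(10, 64)                    # 10 cached tokens, head_dim = 64

# Suppose the context window overflows and the first 4 tokens are truncated:
# the surviving keys simply get fresh positions 0..5, with no recomputation.
k_alive = k_cache[4:]
new_positions = torch.arange(k_alive.shape[0])
k_ready = apply_rope(k_alive, new_positions)     # positions re-applied at attention time
```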

Results

In the paper's evaluation figures, RE denotes recomputation (the baseline) and CA denotes CachedAttention.

Across metrics such as TTFT (time to first token) and prefill throughput, CachedAttention significantly improves inference performance.

References

Please read the paper for more details.

[1] Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. https://arxiv.org/abs/2403.19708

[2] Attention Is All You Need. https://arxiv.org/abs/1706.03762
Figures from the paper (https://arxiv.org/pdf/2403.19708): multi-turn conversation with two-phase inference, the CachedAttention design, and TTFT / prefill-throughput results.