SGLang

Summary

SGLang is a language for programming LLM applications.

It consists of two parts: a frontend and a backend.

SGLang Frontend

The frontend lets programmers build LLM workflows easily. For example, to implement an essay-scoring program in SGLang, you can define it as follows:
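
Here is a minimal sketch, loosely following the essay-judge example from the SGLang paper; the prompt wording, dimension names, and max_tokens values are my own assumptions, and the frontend API may differ slightly across sglang versions:

import sglang as sgl

@sgl.function
def essay_judge(s, essay):
    # All forks share this prefix, so its KV cache is computed only once.
    s += "Read the essay below and judge it.\n" + essay + "\n"
    # fork(3) evaluates three dimensions in parallel.
    forks = s.fork(3)
    for f, dim in zip(forks, ["clarity", "originality", "evidence"]):
        f += "Judge the essay on " + dim + ".\n" + sgl.gen("judgment", max_tokens=128)
    forks.join()
    # Merge the parallel judgments back into the main state.
    s += "Combine the following judgments into a final 1-10 score.\n"
    for f in forks:
        s += f["judgment"] + "\n"
    s += sgl.gen("score", max_tokens=32)

# Requires a running backend; see the Hands-on section below.
state = essay_judge.run(essay="...your essay text...")
print(state["score"])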

The program uses several SGLang primitives to express its workflow. The fork primitive makes parallel execution possible, and the runtime lets multiple calls that share a prefix reuse the same KV cache.

SGLang Backend

The SGLang backend takes care of running the actual model and optimizes the runtime for better latency and throughput.

The backend is special because of three optimizations: RadixAttention, efficient constrained decoding, and API speculative execution.

RadixAttention

RadixAttention is a method for reusing the KV cache across requests. It builds a radix tree that maps token prefixes to their KV caches; the tree structure combined with an LRU eviction policy yields good cache performance.
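
To make the idea concrete, here is a toy sketch of my own, not SGLang's implementation: a per-token trie standing in for the radix tree, with timestamps for LRU eviction. A real radix tree compresses chains of single children into one edge and evicts least-recently-used leaves when GPU memory runs out.

import time

class RadixNode:
    def __init__(self):
        self.children = {}      # token id -> RadixNode
        self.kv_handle = None   # KV cache of the prefix ending here
        self.last_access = 0.0  # used for LRU eviction

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        # Walk down the tree and return the longest cached prefix.
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_access = time.monotonic()
            matched += 1
        return matched, node.kv_handle

    def insert(self, tokens, kv_handle):
        # Store the KV cache of a finished request under its token sequence.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.last_access = time.monotonic()
        node.kv_handle = kv_handle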

Efficient Constrained Decoding with Finite State Machine

Constrained decoding is needed in various situations. For example, if you want the LLM output to be valid JSON, the placement of commas (",") and curly braces ("{", "}") really matters. In such cases we can constrain the next-token probabilities so that only grammar-legal tokens can be sampled.
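
A common mechanism is logit masking driven by a finite state machine: the FSM tracks the position in the grammar and exposes which tokens are legal next. A minimal sketch with a toy FSM of my own (SGLang's real implementation is more involved):

import torch

class FSM:
    def __init__(self, transitions):
        # transitions[state][token_id] -> next state
        self.transitions = transitions

def constrained_step(logits, fsm, state):
    # logits: 1-D tensor over the vocabulary for the next token.
    # Forbidden tokens get -inf, so only grammar-legal tokens survive.
    allowed = list(fsm.transitions[state].keys())
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    token = int(torch.argmax(logits + mask))  # greedy, for simplicity
    return token, fsm.transitions[state][token]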

SGLang also makes decoding more efficient by building a compressed FSM. Previous systems decode token by token (one token at a time). SGLang instead compresses runs of transitions whose next token is fully determined into a single transition, so the whole run can be emitted in one step.

In the figure from the paper, the ordinary FSM (a) takes 13 decoding steps to produce the constrained prefix, while the compressed FSM (b) produces it in a single step.
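
Using the same toy FSM as above, the compression idea looks like this (again my own simplification): whenever a state has exactly one legal next token, no model call is needed, so a whole deterministic chain collapses into one step.

def compressed_step(fsm, state):
    # Emit every token along a deterministic chain without calling the model.
    emitted = []
    while len(fsm.transitions[state]) == 1:
        (token, nxt), = fsm.transitions[state].items()
        emitted.append(token)
        state = nxt
    return emitted, state  # resume per-token decoding from `state`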

API Speculative Execution

When we can only call a black-box API endpoint, it is hard to optimize the cost directly in the runtime. SGLang therefore provides an alternative way to reduce the cost of using such endpoints: it asks the endpoint to generate more tokens than requested and checks whether the extra tokens match the template.

For example, we can build a pipeline that generates a character's details using SGLang primitives:

s += context + "name:" + gen("name", stop="\n") + "job:" + gen("job", stop="\n")

A normal LLM application needs two API calls here. SGLang instead ignores the stop condition in gen("name", stop="\n") and lets the model generate extra tokens, which are stored alongside the result. If the extra tokens happen to start with "job:", SGLang reuses them to fill the second gen call. In that case the number of API calls drops from 2 to 1.
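
Roughly how the matching could work, as an illustrative sketch (call_api is a hypothetical helper, not SGLang's API):

def speculative_fill(call_api, context):
    # Single call that deliberately overshoots the first stop="\n".
    text = call_api(context + "name:", max_tokens=64)
    name, _, rest = text.partition("\n")
    if rest.startswith("job:"):
        # Extra tokens already match the template: one API call total.
        return name, rest[len("job:"):].partition("\n")[0]
    # Mismatch: fall back to the usual second call.
    rest = call_api(context + "name:" + name + "\njob:", max_tokens=32)
    return name, rest.partition("\n")[0]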

Hands-on SGLang

I made a simple tutorial on serving the Qwen-0.5B model: https://github.com/jinho-choi123/sglang. You can try it on Google Colab with a T4 GPU.
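
For a quick start, a minimal client sketch; the model id, port, and launch flags are assumptions and may differ from the tutorial:

# Start the backend first, e.g.:
#   python -m sglang.launch_server --model-path Qwen/Qwen2-0.5B-Instruct --port 30000
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def hello(s):
    s += "Say hello in one short sentence. " + sgl.gen("out", max_tokens=32)

state = hello.run()
print(state["out"])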

References

[1] https://arxiv.org/abs/2312.07104
[2] https://discuss.pytorch.kr/t/radixattention-sglang-llm-feat-lmsys/3318
Figures (from https://arxiv.org/abs/2312.07104):
  • Comparing a normal LLM application and SGLang
  • SGLang system architecture
  • Implementing the essay-judge program in SGLang
  • Compressed FSM