SGLang
SGLang is a language for programming LLM applications.
It consists of two parts: a frontend and a backend.
The frontend lets programmers build workflows easily. For example, to implement an essay-scoring AI program in SGLang, you can define a program as follows:
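Below is a minimal sketch in the style of the fork/join example from the SGLang paper; the function name and the scoring dimensions are illustrative assumptions, not the original listing.

```python
import sglang as sgl

@sgl.function
def essay_judge(s, essay):
    s += "Read the following essay:\n" + essay + "\n"
    # Fork the state: each branch scores one dimension in parallel,
    # and every branch reuses the KV cache of the shared prefix above.
    dims = ["clarity", "grammar", "structure"]
    forks = s.fork(len(dims))
    for f, dim in zip(forks, dims):
        f += "Score the essay's " + dim + " from 1 to 10: "
        f += sgl.gen("score", max_tokens=4)
    forks.join()
    s += "Now write a one-sentence overall verdict: "
    s += sgl.gen("verdict", max_tokens=64, stop="\n")
```

After a backend is set, `state = essay_judge.run(essay=...)` executes the program, and generations can be read back by name, e.g. `state["verdict"]`.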
It uses several SGLang primitives to build the program's workflow. The fork primitive enables parallel computation, as in the sketch above, and lets multiple branches of the workflow reuse the same KV cache.
The SGLang backend takes care of running the actual model, optimizing the runtime for better latency and throughput.
It stands out because of three optimizations: RadixAttention, efficient constrained decoding, and API speculative execution.
RadixAttention is a method for reusing the KV cache. It builds a radix tree that maps token prefixes to their KV-cache entries; the tree structure combined with an LRU eviction policy gives good caching performance across requests that share prefixes.
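A toy sketch of the idea, not SGLang's actual implementation: it uses a plain trie over token ids (a real radix tree also merges single-child chains into one edge), with `last_used` timestamps driving LRU eviction.

```python
import time

class RadixNode:
    def __init__(self):
        self.children = {}    # token id -> RadixNode
        self.kv = None        # placeholder for this prefix's KV-cache block
        self.last_used = 0.0  # timestamp driving LRU eviction

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Length of the longest cached prefix: those tokens skip prefill."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_used = time.monotonic()
            matched += 1
        return matched

    def insert(self, tokens, kv_blocks):
        """Store KV blocks along the path for `tokens`."""
        node = self.root
        for t, kv in zip(tokens, kv_blocks):
            node = node.children.setdefault(t, RadixNode())
            node.kv = kv
            node.last_used = time.monotonic()

    def evict_lru_leaf(self):
        """Drop the least-recently-used leaf when memory runs low."""
        leaves, stack = [], [(self.root, None, None)]
        while stack:
            node, parent, tok = stack.pop()
            if node.children:
                stack += [(c, node, t) for t, c in node.children.items()]
            elif parent is not None:
                leaves.append((node.last_used, tok, parent))
        if leaves:
            _, tok, parent = min(leaves, key=lambda x: x[0])
            del parent.children[tok]
```

For example, after `insert([1, 2, 3], ...)`, a new request starting with tokens `[1, 2, 5]` gets `match_prefix` of 2, so only the suffix needs to be prefilled.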
Constrained decoding is needed in various situations. For example, if you want the LLM's output to be valid JSON, the placement of commas (",") and curly braces ("{", "}") really matters. In such cases we can constrain the next-token probability distribution so that only format-preserving tokens remain.
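A sketch of that core step, assuming some oracle (in real systems, a grammar or regex engine) supplies the set of allowed token ids at the current position:

```python
import numpy as np

def constrained_step(logits, allowed_token_ids):
    """Mask every token that would break the format (set its logit
    to -inf), renormalize, and sample from what remains."""
    masked = np.full_like(logits, -np.inf)
    ids = list(allowed_token_ids)      # e.g. after '{"score":', only digits
    masked[ids] = logits[ids]
    probs = np.exp(masked - masked[ids].max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))
```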
SGLang also makes this decoding efficient by constructing a compressed FSM. Previous systems decode token by token (one token at a time), whereas SGLang compresses a run of fully determined transitions into a single one.
As the figure above shows, the uncompressed FSM (a) takes 13 decoding steps to emit the fixed prefix, while the compressed FSM (b) emits it in a single step.
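A sketch of the compression idea: whenever the FSM sits at a state with exactly one legal continuation, the next characters are fully determined, so the runtime can emit the whole run at once instead of decoding it token by token. The FSM encoding below is made up for illustration.

```python
# Toy FSM: each state maps a forced string to its successor state.
# States in FREE are where the model generates unconstrained text.
fsm = {
    0: {'{"summary": "': 1},  # deterministic prefix
    1: {},                    # free generation until the closing quote
    2: {'"}': 3},
    3: {},                    # accept state
}
FREE = {1}

def jump_forward(state):
    """Emit the longest deterministic run from `state` in one step."""
    out = []
    while state not in FREE and len(fsm[state]) == 1:
        text, state = next(iter(fsm[state].items()))
        out.append(text)
    return "".join(out), state

print(jump_forward(0))  # ('{"summary": "', 1): the whole prefix in one step
```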
When we can only call a black-box API endpoint, it is hard to optimize cost directly in the runtime. SGLang therefore provides an alternative way to reduce the cost of using an API endpoint: it asks the endpoint to generate more tokens than requested and checks whether the surplus matches the template.
For example, using SGLang primitives we can build a pipeline that generates a character's details:
```python
s += context + "name:" + gen("name", stop="\n") + "job:" + gen("job", stop="\n")
```
A normal LLM application would need two API calls here, one per gen.
SGLang instead ignores the stop point in gen("name", stop="\n")
and generates extra tokens, which are also stored in the result. If the extra generated text starts with "job:", it can be reused to fill the second gen.
In that case, the number of API calls drops from 2 to 1.
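A sketch of that matching step (`api_generate` is a hypothetical stand-in for the black-box endpoint, not SGLang's API):

```python
def speculative_fill(api_generate, prompt):
    """One API call instead of two, when the model keeps
    following the template on its own."""
    # Ignore stop="\n" and over-generate in a single call,
    # e.g. text == "Alice\njob: wizard\n..."
    text = api_generate(prompt, max_tokens=64)
    name, _, rest = text.partition("\n")
    if rest.lstrip().startswith("job:"):
        # Surplus matches the template: reuse it, no second call.
        job = rest.lstrip()[len("job:"):].split("\n")[0].strip()
        return name, job
    # Mismatch: fall back to a second call for the "job:" field.
    job = api_generate(prompt + name + "\njob:", max_tokens=16)
    return name, job.split("\n")[0].strip()
```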
I made a simple tutorial on serving the Qwen-0.5B model. You can try it on Google Colab with a T4 GPU.
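A minimal serving sketch along those lines; the model path and port are assumptions, so check the tutorial for the exact values.

```python
# Launch the server first, e.g.:
#   python -m sglang.launch_server --model-path Qwen/Qwen2-0.5B-Instruct --port 30000
import sglang as sgl

# Point the frontend at the local server.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def ask(s, question):
    s += question + " " + sgl.gen("answer", max_tokens=32)

state = ask.run(question="What is SGLang?")
print(state["answer"])
```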