Linear State Space Layer


Summary

LSSL (Linear State Space Layer) is a sequence-to-sequence model, like RNNs and Transformers. However, it is built on an SSM (State-Space Model), which is different from existing RNN or Transformer architectures.

LSSL can be viewed both as a convolution and as a recurrence. At training time it is treated as a convolution, which allows parallelization; at inference time it is treated as a recurrence, which allows online (step-by-step) inference.

The LSSL model involves a lot of math, and I will not go through all of it here. Please read The Annotated S4 blog post [2] for more details.

What is SSM?

An SSM (State-Space Model) maps an input sequence $u(t)$ to an output sequence $y(t)$. It uses an intermediate hidden state $x(t)$ to store the input history.

$$
\begin{aligned}
x'(t) &= \boldsymbol{A} x(t) + \boldsymbol{B} u(t) \\
y(t) &= \boldsymbol{C} x(t) + \boldsymbol{D} u(t)
\end{aligned}
$$

LSSL(Linear State Space Layer)

LSSL views the SSM in three ways: as a discrete-time system, as a recurrence, and as a convolution.

SSM as Discrete

Using a discretization method, LSSL turns the continuous-time system into a discrete-time one.

By default, LSSL uses the Generalized Bilinear Transform (GBT).
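For reference, this is the generalized bilinear transform as I understand it from the LSSL paper (a sketch in my notation, not an exact quote): with step size $\Delta$ and parameter $\alpha \in [0, 1]$,

$$
\bar{A} = (I - \alpha \Delta A)^{-1}\bigl(I + (1 - \alpha)\Delta A\bigr), \qquad
\bar{B} = (I - \alpha \Delta A)^{-1}\,\Delta B ,
$$

where $\alpha = \tfrac{1}{2}$ corresponds to the standard bilinear (Tustin) method.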

SSM as Recurrent

The discretized SSM takes the form of a recurrence.

$$
\begin{aligned}
x_t &= \bar{A} x_{t-1} + \bar{B} u_t \\
y_t &= C x_t + D u_t
\end{aligned}
$$
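As a minimal sketch (not the actual LSSL implementation), the recurrent view can be evaluated step by step like this, assuming a single-input single-output SSM with state size $N$:

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, D, u):
    """Step-by-step (online) evaluation of the discretized SSM.

    A_bar: (N, N), B_bar: (N,), C: (N,), D: scalar, u: (L,) input sequence.
    """
    x = np.zeros(A_bar.shape[0])      # initial state x_{-1} = 0
    ys = []
    for u_t in u:                     # one cheap state update per new input sample
        x = A_bar @ x + B_bar * u_t   # x_t = A_bar x_{t-1} + B_bar u_t
        ys.append(C @ x + D * u_t)    # y_t = C x_t + D u_t
    return np.array(ys)
```

This is exactly why the recurrent view is attractive for online inference: each new input only costs one state update.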

SSM as Convolution

I think this is the most interesting part.

If we set the initial state $x_{-1} = 0$, then we can express $y_k$ as follows:

$$
y_k = C \bar{A}^k \bar{B} u_0 + C \bar{A}^{k-1} \bar{B} u_1 + \cdots + C \bar{A} \bar{B} u_{k-1} + C \bar{B} u_k + D u_k .
$$

This is the same as applying a kernel $\mathcal{K}_L$ to the input sequence $\{u_i\}_{i=0,1,\ldots,k}$.

$$
\mathcal{K}_L(\bar{A}, \bar{B}, C) = \left(C \bar{A}^i \bar{B}\right)_{i \in [L]} \in \mathbb{R}^L = \left(C\bar{B},\ C\bar{A}\bar{B},\ \ldots,\ C\bar{A}^{L-1}\bar{B}\right).
$$

As a result, given the whole input sequence $\{u_i\}$, we can compute all of the outputs $y_k$ in parallel as one causal convolution.
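Below is a minimal sketch (random toy matrices, not a trained LSSL) that builds the kernel $\mathcal{K}_L$, applies it as a causal convolution, and checks that the result matches the step-by-step recurrence:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                                   # state size, sequence length
A_bar = 0.3 * rng.normal(size=(N, N))          # toy stand-ins for the discretized SSM matrices
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
D = rng.normal()
u = rng.normal(size=L)                         # input sequence u_0 ... u_{L-1}

# Convolutional view: K = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)
K = np.array([C @ np.linalg.matrix_power(A_bar, i) @ B_bar for i in range(L)])
y_conv = np.array([K[: k + 1][::-1] @ u[: k + 1] for k in range(L)]) + D * u

# Recurrent view: x_t = A_bar x_{t-1} + B_bar u_t,  y_t = C x_t + D u_t
x, y_rec = np.zeros(N), []
for u_t in u:
    x = A_bar @ x + B_bar * u_t
    y_rec.append(C @ x + D * u_t)

print(np.allclose(y_conv, np.array(y_rec)))    # True: both views give the same outputs
```

Because the kernel depends only on $(\bar{A}, \bar{B}, C)$, the whole output sequence can be produced at once (e.g. with an FFT-based convolution) instead of a sequential loop, which is what makes the convolutional view parallelizable during training.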

LSSL's uniqueness

Previously, I explained the SSM. So what makes LSSL special compared to the original SSM?

  1. It uses the HIPPO framework to express the latent state $x(t)$.

  2. It makes the $\mathcal{K}_L$ kernel computation (especially computing powers of $\bar{A}$) efficient.

HIPPO framework with LSSL

The $A$ matrix plays a key role in the SSM. Previous SSMs used a random $A$ matrix, which led to poor model performance.

Using the $A$ matrix from the HIPPO framework lets the model remember the past input history $\{u_i\}_{i=1,2,\ldots,k-1}$ and improves performance.
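As an illustration, here is a small sketch of how the HiPPO (LegS) transition matrix is typically constructed, following the make_HiPPO code in The Annotated S4 [2] (treat the exact scaling as an assumption of that reference):

```python
import numpy as np

def make_hippo(N):
    """HiPPO-LegS transition matrix A (N x N), used to initialize the SSM state matrix.

    Entries: -(n + 1) on the diagonal, -sqrt(2n + 1) * sqrt(2k + 1) below it, 0 above it.
    """
    p = np.sqrt(1 + 2 * np.arange(N))           # sqrt(2n + 1) for n = 0 .. N-1
    A = p[:, None] * p[None, :]                  # outer product sqrt(2n+1) * sqrt(2k+1)
    A = np.tril(A) - np.diag(np.arange(N))       # keep the lower triangle, fix diagonal to n + 1
    return -A

print(make_hippo(4))
```

Intuitively, this choice of $A$ makes the hidden state track the coefficients of a polynomial approximation of the input seen so far, which is what lets the model remember its history.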

Computing powers of $\bar{A}$ efficiently

The convolution kernel $\mathcal{K}_L$ requires computing powers of $\bar{A}$, which is expensive to do naively. The paper "Computing with Quasiseparable Matrices" shows that MVM (Matrix-Vector Multiplication) with a quasiseparable matrix can be computed efficiently, and the surprising fact is that the $A$ matrix from the HIPPO framework is quasiseparable! LSSL applies this theorem to make the kernel computation efficient.
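For context, here is my rough summary of the definition from [4] (please check the notes for the precise statement): an $n \times n$ matrix $A$ is $(p, q)$-quasiseparable if every submatrix taken entirely from below the diagonal has rank at most $p$, and every submatrix taken entirely from above the diagonal has rank at most $q$, i.e.

$$
\operatorname{rank} A[\,i{+}1:n,\ 1:i\,] \le p, \qquad
\operatorname{rank} A[\,1:i,\ i{+}1:n\,] \le q, \qquad i = 1, \ldots, n-1 .
$$

This low-rank structure away from the diagonal is what allows fast matrix-vector multiplication.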

References

[1] Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers (LSSL paper): https://arxiv.org/abs/2110.13985

[2] The Annotated S4 (wonderful blog post about S4): https://srush.github.io/annotated-s4/

[3] Structured State Space Models for Deep Sequence Modeling: https://velog.io/@euisuk-chung/Structured-State-Space-Models-for-Deep-Sequence-Modeling

[4] Explanation of quasiseparable matrices: https://www.mat.uniroma2.it/~tvmsscho/Rome-Moscow_School/2011/files/quasisep-fasino-notes.pdf