S4 (Structured State Space for Sequence Model)


Prerequisites

To fully understand S4, I recommend some reading materials.

Please read them before reading the paper (https://arxiv.org/abs/2111.00396) or the blog post (The Annotated S4, https://srush.github.io/annotated-s4/).

  1. FFT - https://www.youtube.com/watch?v=spUNpyF58BY&ab_channel=3Blue1Brown

  2. Generating Function - https://www.youtube.com/watch?v=bOXCLR3Wric&ab_channel=3Blue1Brown

  3. LSSL - https://app.gitbook.com/o/rQcI92Wl1ZpMF9SYbg9M/s/nHC3k6mrFavPwYubW6hY/~/changes/16/machine-learning-mamba/linear-state-space-layer/~/overview

What makes S4 special?

  1. Addressing long-range dependencies using the HIPPO framework.

  2. Instead of directly calculating powers of $\bar{A}$, S4 uses a truncated generating function and the IFFT (Inverse Fast Fourier Transform) to generate the convolution kernel $\bar{K}$.

  3. When the $A$ matrix is DPLR (Diagonal Plus Low-Rank), the Woodbury identity lets us compute the generating function $\hat{K}_L$ using an efficient Cauchy dot-product.

I will go through all the mathematical parts.

Applying HIPPO framework

Previous SSMs struggled because the hidden state $x$ couldn't store the past history. So the authors built the HIPPO framework, an SSM that models the past history using an orthonormal basis.

In the HIPPO framework, there are four matrices $A, B, C, D$, but the most important one is $A$ (called the HIPPO matrix). So S4 brings the HIPPO matrix over.

Using truncated generating function and IFFT to generate $\bar{K}$

Computing $\bar{K}$ directly takes huge computational resources:

$$\bar{K}=(\bar{C}^*\bar{A}^0\bar{B},\ \bar{C}^*\bar{A}^1\bar{B},\ \dots,\ \bar{C}^*\bar{A}^{L-1}\bar{B})$$

From $\bar{K}$, let's create a generating function $\hat{K}_L$:

$$\hat{K}_L(z;\bar{A},\bar{B},\bar{C})=\sum_{k=0}^{L-1}\bar{C}^*\bar{A}^k\bar{B}\,z^k$$

Now let's think of the roots-of-unity filter. The $L$-th roots of unity are

$$\omega_j=\exp\left(-i2\pi\frac{j}{L}\right),\quad j=0,1,2,\dots,L-1$$

If we substitute $z \to \omega_j$, we get the following equation:

$$\hat{K}_L(\omega_j)=\sum_{k=0}^{L-1}(\bar{C}^*\bar{A}^k\bar{B})\cdot\omega_j^k=\sum_{k=0}^{L-1}\bar{C}^*\bar{A}^k\bar{B}\cdot\exp\left(-i2\pi\frac{jk}{L}\right)$$

This is exactly the DFT (Discrete Fourier Transform): think of $j$ as frequency and $k$ as time.

This means that if we can get the generating function $\hat{K}_L(z)$, we can easily recover the convolution kernel $\bar{K}$ using the IFFT.
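To make the DFT/IFFT connection concrete, here is a minimal numpy sketch. The matrices are random stand-ins (not real HIPPO/S4 parameters): evaluating $\hat{K}_L$ at the $L$-th roots of unity is exactly the DFT of $\bar{K}$, so one inverse FFT recovers the kernel.

```python
import numpy as np

# Toy sizes: state dim N, kernel length L. All matrices are random stand-ins.
N, L = 4, 8
rng = np.random.default_rng(0)
Abar = 0.9 * np.eye(N) + 0.05 * rng.standard_normal((N, N))  # plays A-bar
Bbar = rng.standard_normal((N, 1))                           # plays B-bar
Cbar = rng.standard_normal((N, 1))                           # plays C-bar

# Direct (expensive) kernel: K[k] = C* A^k B
K_direct = np.array(
    [(Cbar.conj().T @ np.linalg.matrix_power(Abar, k) @ Bbar).item() for k in range(L)]
)

# Evaluate the generating function at the roots of unity w_j = exp(-i 2pi j / L).
# K_hat(w_j) = sum_k K[k] w_j^k, which is exactly the DFT of K.
omega = np.exp(-2j * np.pi * np.arange(L) / L)
K_hat = np.array([sum(K_direct[k] * w**k for k in range(L)) for w in omega])

# Recover the kernel with the inverse FFT.
K_recovered = np.fft.ifft(K_hat).real
print(np.allclose(K_direct, K_recovered))  # True
```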

How to get the generating function $\hat{K}_L(z)$

You may think that we need all the $\bar{C}^*\bar{A}^k\bar{B}$ terms to get the generating function $\hat{K}_L(z)$.

Well, actually we don't. Let's look at some tricks for computing $\hat{K}_L$ at low cost.

Converting into inverse form

$$\hat{K}_L(z;\bar{A},\bar{B},\bar{C})=\sum_{k=0}^{L-1}\bar{C}^*\bar{A}^k\bar{B}\,z^k=\bar{C}^*(I-\bar{A}^Lz^L)(I-\bar{A}z)^{-1}\bar{B}=\tilde{C}^*(I-\bar{A}z)^{-1}\bar{B}$$

Since we only evaluate at $z \in \{\exp(-i2\pi\frac{k}{L}) : k\in[L]\}$, $z^L$ is always $1$, so the truncation factor can be folded into $\tilde{C}^*=\bar{C}^*(I-\bar{A}^L)$.
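This step is easy to sanity-check numerically. A small sketch with random toy matrices, assuming the fold $\tilde{C}^* = \bar{C}^*(I-\bar{A}^L)$ above:

```python
import numpy as np

# Check the inverse form at a root of unity (z^L = 1). Toy matrices only.
N, L = 4, 8
rng = np.random.default_rng(1)
Abar = 0.9 * np.eye(N) + 0.05 * rng.standard_normal((N, N))
Bbar = rng.standard_normal((N, 1))
Cbar = rng.standard_normal((N, 1))

z = np.exp(-2j * np.pi * 3 / L)  # an arbitrary L-th root of unity (j = 3)
I = np.eye(N)

# Truncated sum: sum_{k<L} C* A^k B z^k
direct = sum(
    (Cbar.conj().T @ np.linalg.matrix_power(Abar, k) @ Bbar).item() * z**k
    for k in range(L)
)

# Inverse form, with the z^L = 1 trick folded into C-tilde.
C_tilde = (I - np.linalg.matrix_power(Abar, L)).conj().T @ Cbar
closed = (C_tilde.conj().T @ np.linalg.inv(I - Abar * z) @ Bbar).item()

print(np.allclose(direct, closed))  # True
```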

Convert $\bar{A}, \bar{B}$ into $A, B$

Previously, we discretized $A, B$ into $\bar{A}, \bar{B}$ with the bilinear transform. But now we are doing the reverse, substituting the discretization formulas back in:

$$\bar{A}=\left(I-\frac{\Delta}{2}A\right)^{-1}\left(I+\frac{\Delta}{2}A\right)$$

$$\bar{B}=\left(I-\frac{\Delta}{2}A\right)^{-1}\Delta B$$
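For reference, here is what that discretization looks like in numpy. A minimal sketch with a made-up toy $A$, not a real HIPPO matrix:

```python
import numpy as np

def discretize_bilinear(A, B, delta):
    """Bilinear discretization, exactly the formulas above:
    A_bar = (I - delta/2 A)^{-1} (I + delta/2 A)
    B_bar = (I - delta/2 A)^{-1} delta B
    """
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (delta / 2) * A)
    return inv @ (I + (delta / 2) * A), inv @ (delta * B)

# Toy usage with random continuous-time parameters.
rng = np.random.default_rng(2)
A = -np.eye(3) + 0.1 * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 1))
Abar, Bbar = discretize_bilinear(A, B, delta=0.1)
```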

We can convert $\hat{K}_L(z)$ as follows:

$$\begin{aligned}\hat{K}_L&=\tilde{C}^*(I-\bar{A}z)^{-1}\bar{B}\\&=\tilde{C}^*\left(I-\left(I-\tfrac{\Delta}{2}A\right)^{-1}\left(I+\tfrac{\Delta}{2}A\right)z\right)^{-1}\left(I-\tfrac{\Delta}{2}A\right)^{-1}\Delta B\\&=\tilde{C}^*\left(\left(I-\tfrac{\Delta}{2}A\right)^{-1}\left[\left(I-\tfrac{\Delta}{2}A\right)-\left(I+\tfrac{\Delta}{2}A\right)z\right]\right)^{-1}\left(I-\tfrac{\Delta}{2}A\right)^{-1}\Delta B\\&=\tilde{C}^*\left[I(1-z)-\tfrac{\Delta}{2}A(1+z)\right]^{-1}\Delta B\\&=\frac{2}{1+z}\tilde{C}^*\left[\frac{2(1-z)}{\Delta(1+z)}I-A\right]^{-1}B\end{aligned}$$

If we assume $A$ is DPLR (Diagonal Plus Low-Rank), we can write $A$ as follows (using the S4 paper's sign convention), where $\Lambda \in \mathbb{C}^{N\times N}$ is a diagonal matrix and $P, Q \in \mathbb{C}^{N\times 1}$:

$$A=\Lambda-PQ^*$$

The Woodbury identity converts the inverse of a DPLR matrix into a simpler form:

$$(\Lambda+PQ^*)^{-1}=\Lambda^{-1}-\Lambda^{-1}P(1+Q^*\Lambda^{-1}P)^{-1}Q^*\Lambda^{-1}$$
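The identity itself is easy to verify numerically. A small sketch with a random diagonal $\Lambda$ and random rank-1 vectors:

```python
import numpy as np

# Numeric sanity check of the Woodbury identity for a rank-1 DPLR matrix.
N = 5
rng = np.random.default_rng(3)
Lam = np.diag(rng.standard_normal(N) + 1j * rng.standard_normal(N))  # diagonal
P = rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))
Q = rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))

lhs = np.linalg.inv(Lam + P @ Q.conj().T)        # (Lambda + P Q*)^{-1}

Lam_inv = np.diag(1.0 / np.diag(Lam))            # inverting a diagonal is cheap
core = 1.0 + (Q.conj().T @ Lam_inv @ P).item()   # the scalar (1 + Q* L^{-1} P)
rhs = Lam_inv - (Lam_inv @ P) @ (Q.conj().T @ Lam_inv) / core

print(np.allclose(lhs, rhs))  # True
```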

By applying it (with $\Lambda \to R(z)$, since $\frac{2(1-z)}{\Delta(1+z)}I-A=R(z)+PQ^*$), we get the following equation for $\hat{K}_L(z)$. Note that $R(z)$ is a diagonal matrix!

$$\hat{K}_L=\frac{2}{1+z}\left[\tilde{C}^*R(z)^{-1}B-\tilde{C}^*R(z)^{-1}P(1+Q^*R(z)^{-1}P)^{-1}Q^*R(z)^{-1}B\right]$$

$$\text{where}\quad R(z)=\frac{2(1-z)}{\Delta(1+z)}I-\Lambda$$

Background: Cauchy matrix

A Cauchy matrix is defined as follows. With elements $\Omega=(\omega_i)\in\mathbb{C}^M$ and $\Lambda=(\lambda_j)\in\mathbb{C}^N$:

$$M(\Omega,\Lambda)=(M_{ij})_{i\in[M],\ j\in[N]}\in\mathbb{C}^{M\times N},\qquad M_{ij}=\frac{1}{\omega_i-\lambda_j}$$

Background: Cauchy kernel (Cauchy dot-product)

The Cauchy kernel is an efficient way to compute the following form, where $A\in\mathbb{C}^{M\times 1}$, $C\in\mathbb{C}^{N\times 1}$, and $B\in\mathbb{C}^{M\times N}$ is a Cauchy matrix:

$$A^TBC$$

You don't have to understand why this form can be computed efficiently. For simplicity, we will write the Cauchy dot-product as

$$A^TBC=k_{\Omega,\Lambda}(A,C)$$
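As a sketch of what is being computed (not the fast algorithm S4 actually relies on), here is the naive $O(MN)$ Cauchy dot-product with made-up nodes:

```python
import numpy as np

def cauchy_dot(A, C, omega, lam):
    """Naive Cauchy dot-product k_{Omega,Lambda}(A, C) = A^T M C,
    where M[i, j] = 1 / (omega[i] - lam[j]).
    S4 relies on fast near-linear-time Cauchy algorithms instead; this
    naive version just shows what quantity they compute.
    """
    M = 1.0 / (omega[:, None] - lam[None, :])  # the Cauchy matrix
    return A.T @ M @ C

# Toy usage.
rng = np.random.default_rng(4)
omega = np.exp(-2j * np.pi * np.arange(8) / 8)            # evaluation nodes
lam = rng.standard_normal(4) + 1j * rng.standard_normal(4)
a = rng.standard_normal(8)
c = rng.standard_normal(4)
print(cauchy_dot(a, c, omega, lam))
```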

Applying the Cauchy dot-product to the $\hat{K}_L$ calculation

Stacking the diagonal $R(z)^{-1}$ over all evaluation points $z$ gives exactly a Cauchy matrix, so we can apply the Cauchy dot-product!

$$\begin{aligned}\hat{K}_L&=\frac{2}{1+z}\left[\tilde{C}^*R(z)^{-1}B-\tilde{C}^*R(z)^{-1}P(1+Q^*R(z)^{-1}P)^{-1}Q^*R(z)^{-1}B\right]\\&=c(z)\left[k_{z,\Lambda}(\tilde{C},B)-k_{z,\Lambda}(\tilde{C},P)(1+k_{z,\Lambda}(Q,P))^{-1}k_{z,\Lambda}(Q,B)\right]\end{aligned}$$

where $c(z)=\frac{2}{1+z}$.

Since we can now calculate the generating function $\hat{K}_L$ at low cost, we no longer need huge computation to generate the convolution kernel $\bar{K}$.
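Putting it all together, here is a minimal end-to-end numpy sketch with made-up toy parameters (not the real HIPPO/S4 initialization), using the $A=\Lambda-PQ^*$ convention above. It builds the kernel once with explicit matrix powers and once through the generating function (Woodbury plus Cauchy sums plus IFFT), and checks that they match.

```python
import numpy as np

# Toy sizes and made-up DPLR parameters (not the real S4 initialization).
N, L, delta = 4, 16, 0.1
rng = np.random.default_rng(5)
lam = -0.5 + 1j * rng.standard_normal(N)                      # diagonal of Lambda
P = (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))) * 0.1
Q = (rng.standard_normal((N, 1)) + 1j * rng.standard_normal((N, 1))) * 0.1
B = rng.standard_normal((N, 1)) + 0j
C = rng.standard_normal((N, 1)) + 0j

A = np.diag(lam) - P @ Q.conj().T                             # A = Lambda - P Q*
I = np.eye(N)

# Bilinear discretization (same formulas as earlier).
inv = np.linalg.inv(I - (delta / 2) * A)
Abar = inv @ (I + (delta / 2) * A)
Bbar = inv @ (delta * B)

# 1) Direct kernel via matrix powers (the expensive way).
K_direct = np.array(
    [(C.conj().T @ np.linalg.matrix_power(Abar, k) @ Bbar).item() for k in range(L)]
)

# 2) Generating function at the roots of unity, via Woodbury + Cauchy sums.
C_tilde = (I - np.linalg.matrix_power(Abar, L)).conj().T @ C  # C~* = C*(I - Abar^L)
z = np.exp(-2j * np.pi * np.arange(L) / L)
g = 2.0 * (1 - z) / (delta * (1 + z))                         # R(z) = g(z) I - Lambda

def cauchy(v, w, g):
    # sum_i conj(v_i) w_i / (g_j - lam_i), for every evaluation point g_j
    return ((v.conj() * w)[:, 0][None, :] / (g[:, None] - lam[None, :])).sum(axis=1)

k00 = cauchy(C_tilde, B, g)                                   # C~* R^{-1} B
k01 = cauchy(C_tilde, P, g)                                   # C~* R^{-1} P
k10 = cauchy(Q, P, g)                                         # Q*  R^{-1} P
k11 = cauchy(Q, B, g)                                         # Q*  R^{-1} B
K_hat = (2.0 / (1 + z)) * (k00 - k01 * k11 / (1.0 + k10))

K_gen = np.fft.ifft(K_hat)
print(np.allclose(K_direct, K_gen))  # True
```

In a real implementation the four Cauchy sums are the only expensive pieces, and they can be evaluated with fast structured-matrix algorithms, which is where the asymptotic savings come from.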

Wrapup

This math journey is everything for S4. It was exciting to study the FFT and generating functions. I hope this post guides you toward the ultimate goal, Mamba.

References

[1] The Annotated S4, https://srush.github.io/annotated-s4/

[2] Albert Gu, Karan Goel, Christopher Ré. Efficiently Modeling Long Sequences with Structured State Spaces. https://arxiv.org/abs/2111.00396