S4(Structures Space for Sequence Model)
Last updated
Was this helpful?
Last updated
Was this helpful?
To fully understand S4, I recommend you some reading materials.
Please read it before reading the paper or blog post .
FFT -
Generating Function -
LSSL -
Addressing long-range dependencies using HIPPO framework
Instead of directly calculating power of , S4 use truncated generating function and IFFT(Inverse Fast Fourier Transform) to generate convolution kernel .
When matrix is DPLR, then by using Woodbury Identity, we can compute generating function using efficient Cauchy dot-product.
I will go through all the mathematical parts.
Previous SSMs struggled because hidden state couldn't store the past history. So they made a HIPPO framework, SSM that models past history using orthonormal basis.
In HIPPO framework, there are 4 matrix . But the most important matrix is (which is called HIPPO matrix). So they bring the HIPPO matrix to S4.
Let's think of Roots of Unity Filter
.
Since
Woodbury identity converts inverse of DPLR into simpler form:
Cauchy matrix is defined as follows:
You don't have to understand why this form can be computed efficiently.
For simplicity, We are going to write Cauchy dot-product as
This math-journey is everthing for S4. It was exciting to study FFT, Generating function. I hope this post helps you guide to the ultimate goal, Mamba.
Computing takes huge computation resource.
From , let's create a generating function
If we subsitute to , then we get the following equation.
This is exactly the same as DFT(Discrete Fourier Transform). Think as frequency, and as time.
This means that if we can get the generating function , we can easily calculate the convolution kernel using IFFT.
You may think that we need all terms to get generating function .
Well, actually we don't. Let's look at some tricks to get with low computation cost.
We are going to use z from . So is always 1
Previously, we discretized into . But we are doing the reverse.
We can convert as following:
If we assume is DPLR, we can write as following: is a diagonal matrix .
By applying it, we get the following equation for : is also a diagonal matrix!
With elements and ,
Cauchy kernel is a efficient way to compute following form: . is Cauchy matrix.
Since is a Cauchy-Matrix, we can apply Cauchy dot-product!
Since we can calculate the generating function with low computation cost, we don't need huge computation to generate the convolution kernel .
[1]
[2]