HIPPO Framework


Problems

Many real-world datasets are sequential. For example, sensor readings are time-series data.

Transformer and RNN models use the history (previous data) to predict the next data point. However, they suffer from a large computation cost (Transformers) or vanishing gradients (RNNs). The fundamental problem is the memory required to store the history.

The HIPPO framework gives a new approach that stores the history with much less memory.

HIPPO framework overview

Let's say there is a function $f(t)$ that we want to store. At any time $t'$, HIPPO approximates the function $f(t)|_{t \le t'}$.

(Figure: overall process of the HIPPO framework, from https://arxiv.org/abs/2008.07669)

This is done as follows:

  1. Define $N$ orthonormal basis functions $\{ g_n^t \}_{n=0, 1, \dots, N-1}$ of the function space.

  2. Project $f(t)$ onto each basis function and get the coefficients $c_n(t)$ (using the inner product).

  3. The evolution of the coefficients can be described as a dynamical system (an ODE).

  4. If we discretize the ODE, we can compute the coefficients at every timestep (see the code sketch after this list).

  5. These coefficients are the compressed information of $f(t)|_{t \le t'}$: the history is compressed into $N$ coefficients per timestep.
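Here is a rough NumPy sketch of these steps for the LegS (scaled Legendre) measure used later in the visualizations. The $A$, $B$ matrices and the simple bilinear discretization below follow my reading of the paper; treat it as an illustration, not the reference implementation (see the official repo in the references for that).

```python
import numpy as np

def hippo_legs_matrices(N):
    # HiPPO-LegS transition matrices (as written in the paper):
    #   A[n, k] = sqrt((2n+1)(2k+1))  if n > k
    #             n + 1               if n == k
    #             0                   if n < k
    #   B[n]    = sqrt(2n+1)
    q = np.sqrt(2 * np.arange(N) + 1)
    A = np.tril(np.outer(q, q), k=-1) + np.diag(np.arange(N) + 1)
    B = q.copy()
    return A, B

def hippo_legs_compress(f, N=64):
    # Online compression of a 1-D signal f[0..T-1] into N coefficients per step.
    # Continuous dynamics:  d/dt c(t) = -(1/t) A c(t) + (1/t) B f(t)
    # Discretized here with a bilinear (Tustin) step of size 1 at t = k:
    #   (I + A/(2k)) c_k = (I - A/(2k)) c_{k-1} + (1/k) B f_{k-1}
    A, B = hippo_legs_matrices(N)
    I = np.eye(N)
    c = np.zeros(N)
    coeffs = []
    for k, fk in enumerate(f, start=1):
        rhs = (I - A / (2 * k)) @ c + (B / k) * fk
        c = np.linalg.solve(I + A / (2 * k), rhs)
        coeffs.append(c.copy())
    return np.stack(coeffs)  # shape (T, N): N numbers summarize the history up to step k
```

The key point is in steps 3 and 4: we never touch the old samples again. Each new sample only updates the $N$ coefficients, which is exactly the memory saving the framework is about.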

General HIPPO Framework

I will not explain the full derivation of the HIPPO framework here; please read Appendix C of the original paper (or the references below), which shows the math of deriving the ODE for the coefficients $c(t)$.

In Appendix C, the paper shows that the following equation

$$
\begin{aligned}
\frac{d}{dt} c_n(t) = \; & \zeta(t)^{-\frac{1}{2}} \lambda_n \int f(x)\left(\frac{\partial}{\partial t} p_n(t, x)\right) \frac{\omega}{\chi}(t, x)\, \mathrm{d}x \\
& + \int f(x)\left(\zeta^{-\frac{1}{2}} \lambda_n p_n(t, x)\right)\left(\frac{\partial}{\partial t} \frac{\omega}{\chi}(t, x)\right) \mathrm{d}x
\end{aligned}
\quad (20)
$$

can be reduced to dynamics of the form

$$
\frac{d}{dt} c(t) = -A(t)\, c(t) + B(t)\, f(t).
$$
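As a concrete instance: for the LegS (scaled Legendre) measure used in the plots below, the paper derives time-invariant matrices (up to a $1/t$ scaling):

$$
\frac{d}{dt} c(t) = -\frac{1}{t} A\, c(t) + \frac{1}{t} B\, f(t),
\qquad
A_{nk} =
\begin{cases}
\sqrt{(2n+1)(2k+1)} & \text{if } n > k \\
n + 1 & \text{if } n = k \\
0 & \text{if } n < k
\end{cases},
\qquad
B_n = \sqrt{2n+1}
$$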

This reduction relies on a few insights.

Since $\frac{\partial}{\partial t} p_n(t, x)$ is a polynomial in $x$ of degree $n-1$, it can be expressed as a linear combination of $p_0, p_1, \dots, p_{n-1}$. So the first term in equation (20) can be expressed in terms of $c_0, c_1, \dots, c_{n-1}$.
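The underlying fact (stated here for the plain Legendre polynomials; the time scaling inside $p_n(t, x)$ only adds extra factors) is that a Legendre polynomial's derivative expands exactly into lower-degree Legendre polynomials:

$$
P_n'(x) = (2n-1)\, P_{n-1}(x) + (2n-5)\, P_{n-3}(x) + (2n-9)\, P_{n-5}(x) + \cdots
$$

so after integrating against $f(x)$, each of these lower-degree terms turns back into one of the existing coefficients $c_0, \dots, c_{n-1}$.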

Question

For many weight functions $\omega$, we can find a scaling function $\chi$ such that $\frac{\partial}{\partial t} \frac{\omega}{\chi}$ can be expressed using $\frac{\omega}{\chi}$.

But I still can't see why the second term in equation (20) can be expressed as a linear combination of $c_0, c_1, \dots, c_{N-1}$ and $f$.

If you know the answer, please leave comments!!

Visualization of HIPPO

As we use a higher $N$ (the maximum degree of the orthonormal polynomial basis), HIPPO approximates the input $u(t)$ more accurately.

HIPPO using N=64, measure='legs'

HIPPO using N=256, measure='legs'
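To reproduce plots like these, here is a sketch of the reconstruction, reusing `hippo_legs_compress` from the sketch in the overview section. The reconstruction formula $f(x) \approx \sum_n c_n(T)\,\sqrt{2n+1}\,P_n(2x/T - 1)$ is my reading of the LegS basis, so the normalization may differ slightly from the official implementation.

```python
import numpy as np
from numpy.polynomial import legendre

# assumes hippo_legs_compress(f, N) from the sketch in the overview section is in scope

# Toy input: a random-walk signal standing in for u(t).
T = 1000
rng = np.random.default_rng(0)
f = 0.1 * np.cumsum(rng.standard_normal(T))

for N in (64, 256):
    coeffs = hippo_legs_compress(f, N=N)   # (T, N) coefficients over time
    c_T = coeffs[-1]                       # coefficients at the final time T
    # Reconstruct the whole history on [0, T] from the N final coefficients:
    #   f(x) ≈ sum_n c_n * sqrt(2n+1) * P_n(2x/T - 1)
    x = np.arange(1, T + 1)
    y = 2 * x / T - 1                      # map [0, T] -> [-1, 1]
    scale = np.sqrt(2 * np.arange(N) + 1)
    f_hat = legendre.legval(y, c_T * scale)
    print(f"N={N:4d}  reconstruction MSE = {np.mean((f - f_hat) ** 2):.5f}")
```

With the larger $N$, the reconstruction error should drop, matching the plots above.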

References


[1] HiPPO: Recurrent Memory with Optimal Polynomial Projections. https://arxiv.org/abs/2008.07669

[2] https://hychiang.info/blog/2024/hippo_matrices/

[3] https://github.com/state-spaces/s4

[4] https://kyujinpy.tistory.com/146

[5] https://hazyresearch.stanford.edu/blog/2020-12-05-hippo