Memory Layers at Scale

Summary

Knowledge in language models is usually associative: Korea - Seoul, USA - Washington, OpenAI - ChatGPT, and so on.

Most of an LM's knowledge is stored in its FFN layers.

The paper proposes a replacement for the FFN, called the memory layer, which stores knowledge as key-value pairs in memory.

FFN in Transformer

The FFN in a Transformer stores knowledge. It takes embeddings as input and outputs embeddings enriched with that knowledge (the input and output shapes are the same).
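As a rough sketch (not from the paper), a position-wise FFN block might look like this in PyTorch; the hidden size `d_ff` and the ReLU non-linearity are illustrative choices:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand the embedding, apply a non-linearity,
    then project back, so input and output shapes match."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):   # x: (batch, seq_len, d_model)
        return self.net(x)  # same shape as x
```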

Memory Augmented Architectures

The paper proposes a complement to the FFN layer that uses a key-value memory.

Given a query, it computes the similarity against all keys, picks the top-K most similar key-value pairs, and returns the weighted sum of their values as the output.

Interestingly, the memory layer works much like the attention mechanism: it has keys, queries, and values, and computes the similarity between the keys and a given query.
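A minimal sketch of this lookup, assuming dot-product similarity and softmax weights over the selected entries (function and variable names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def memory_lookup(query, keys, values, top_k=4):
    """Naive memory-layer lookup: score every key against the query,
    keep the top-K, and return the softmax-weighted sum of their values.

    query: (d,)    keys: (N, d)    values: (N, d_v)
    """
    scores = keys @ query                   # (N,) similarity to every key -> O(N)
    top_scores, idx = torch.topk(scores, top_k)
    weights = F.softmax(top_scores, dim=0)  # attention-like weights over the top-K
    return weights @ values[idx]            # (d_v,) weighted sum of selected values
```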

Optimizing the lookup cost

In the key-value memory, the lookup cost (finding the top-K most similar keys) is $O(N)$, which grows linearly with the size of the key space. The paper adopts an optimization that splits the keys into two sets.

A key space $K \in \mathbb{R}^{N \times d_{model}}$ can be split into two key spaces $K_1, K_2 \in \mathbb{R}^{\sqrt{N} \times d_{model}/2}$. $K$ can then be formed by pairing each row of $K_1$ with each row of $K_2$.

For the lookup, we also split the query $q \in \mathbb{R}^{d_{model}}$ into $q_1, q_2 \in \mathbb{R}^{d_{model}/2}$ and compute the similarities against the separate key spaces, $\mathrm{sim}(q_1, K_1)$ and $\mathrm{sim}(q_2, K_2)$. The overall lookup cost then becomes $O(\sqrt{N})$, which scales well.
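A sketch of this product-key lookup under the same assumptions (dot-product similarity, softmax weights over the selected entries); the candidate-combination step and all names here are illustrative, not taken verbatim from the paper:

```python
import torch
import torch.nn.functional as F

def product_key_lookup(query, keys1, keys2, values, top_k=4):
    """Product-key lookup sketch: the full key space of N = n*n keys is the
    Cartesian product of two half-key tables of size n = sqrt(N). Each half
    of the query is scored against its own table (2 * sqrt(N) dot products),
    then only the combinations of the per-half top-K candidates are considered.

    query: (d,)    keys1, keys2: (n, d/2)    values: (n*n, d_v)
    """
    d = query.shape[0]
    q1, q2 = query[: d // 2], query[d // 2:]

    s1 = keys1 @ q1                        # (n,) scores for the first half
    s2 = keys2 @ q2                        # (n,) scores for the second half

    # The full key (i, j) scores s1[i] + s2[j]; search only the K*K
    # combinations of the per-half top-K candidates.
    t1, i1 = torch.topk(s1, top_k)
    t2, i2 = torch.topk(s2, top_k)
    combined = t1[:, None] + t2[None, :]   # (K, K) candidate scores
    top_scores, flat = torch.topk(combined.flatten(), top_k)

    n = keys2.shape[0]
    rows = i1[flat // top_k]               # index into keys1
    cols = i2[flat % top_k]                # index into keys2
    idx = rows * n + cols                  # index of the full (paired) key

    weights = F.softmax(top_scores, dim=0)
    return weights @ values[idx]           # (d_v,) weighted sum
```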

Experiment result summary

The memory layer shows a substantial performance improvement without increasing compute or model size.

References

[1] Memory Layers at Scale. https://arxiv.org/abs/2412.09764

[2] My take on the Memory Layer paper by Meta. https://dev.to/govindsb/my-take-on-the-memory-layer-paper-by-meta-noob-friendly-3hgo