Memory Layers at Scale

Summary

Knowledge in language models is usually associative: Korea - Seoul, USA - Washington, OpenAI - ChatGPT, and so on.

Most of an LM's knowledge is stored in its FFN layers.

The paper proposes a replacement for the FFN, called the memory layer, which stores knowledge as key-value pairs in memory.

FFN in Transformer

The FFN in a Transformer stores knowledge. It takes embeddings as input and outputs embeddings enriched with that knowledge (the input and output shapes are the same).
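As a rough sketch (not from the paper), a position-wise FFN block might look like this in PyTorch; the hidden size `d_ff` and the ReLU non-linearity are illustrative choices:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand the embedding, apply a non-linearity,
    then project back, so input and output shapes match."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):   # x: (batch, seq_len, d_model)
        return self.net(x)  # same shape as x
```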

Memory Augmented Architectures

The paper proposes a complement to the FFN layer that uses a key-value memory.

Given a query, it computes the similarity against all keys, picks the top-K most similar key-value pairs, and returns the weighted sum of their values as the output.

Interestingly, the memory layer works much like the attention mechanism: it has keys, queries, and values, and computes the similarity between the keys and a given query.
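A minimal sketch of this lookup, assuming dot-product similarity and softmax weights over the selected entries (function and variable names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def memory_lookup(query, keys, values, top_k=4):
    """Naive memory-layer lookup: score every key against the query,
    keep the top-K, and return the softmax-weighted sum of their values.

    query: (d,)    keys: (N, d)    values: (N, d_v)
    """
    scores = keys @ query                   # (N,) similarity to every key -> O(N)
    top_scores, idx = torch.topk(scores, top_k)
    weights = F.softmax(top_scores, dim=0)  # attention-like weights over the top-K
    return weights @ values[idx]            # (d_v,) weighted sum of selected values
```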

Optimizing the lookup cost

In the key-value memory, the lookup cost (finding the top-K most similar keys) is $O(N)$, which grows linearly with the size of the key space. The paper adopts an optimization that splits the keys into two sets.

A key space $K \in \mathbb{R}^{N \times d_{model}}$ can be split into two key spaces $K_1, K_2 \in \mathbb{R}^{\sqrt{N} \times d_{model}/2}$. $K$ can then be formed by pairing each row of $K_1$ with each row of $K_2$.

For the lookup, we also split the query $q \in \mathbb{R}^{d_{model}}$ into $q_1, q_2 \in \mathbb{R}^{d_{model}/2}$ and compute the similarities against the separate key spaces, $\mathrm{sim}(q_1, K_1)$ and $\mathrm{sim}(q_2, K_2)$. The overall lookup cost then becomes $O(\sqrt{N})$, which scales well.
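A sketch of this product-key lookup under the same assumptions (dot-product similarity, softmax weights over the selected entries); the candidate-combination step and all names here are illustrative, not taken verbatim from the paper:

```python
import torch
import torch.nn.functional as F

def product_key_lookup(query, keys1, keys2, values, top_k=4):
    """Product-key lookup sketch: the full key space of N = n*n keys is the
    Cartesian product of two half-key tables of size n = sqrt(N). Each half
    of the query is scored against its own table (2 * sqrt(N) dot products),
    then only the combinations of the per-half top-K candidates are considered.

    query: (d,)    keys1, keys2: (n, d/2)    values: (n*n, d_v)
    """
    d = query.shape[0]
    q1, q2 = query[: d // 2], query[d // 2:]

    s1 = keys1 @ q1                        # (n,) scores for the first half
    s2 = keys2 @ q2                        # (n,) scores for the second half

    # The full key (i, j) scores s1[i] + s2[j]; search only the K*K
    # combinations of the per-half top-K candidates.
    t1, i1 = torch.topk(s1, top_k)
    t2, i2 = torch.topk(s2, top_k)
    combined = t1[:, None] + t2[None, :]   # (K, K) candidate scores
    top_scores, flat = torch.topk(combined.flatten(), top_k)

    n = keys2.shape[0]
    rows = i1[flat // top_k]               # index into keys1
    cols = i2[flat % top_k]                # index into keys2
    idx = rows * n + cols                  # index of the full (paired) key

    weights = F.softmax(top_scores, dim=0)
    return weights @ values[idx]           # (d_v,) weighted sum
```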

Experiment result summary

The memory layer shows a substantial performance improvement without increasing compute or model size.

References

[1] Memory Layers at Scale. https://arxiv.org/abs/2412.09764

[2] My take on the Memory Layer paper by Meta. https://dev.to/govindsb/my-take-on-the-memory-layer-paper-by-meta-noob-friendly-3hgo