Memory Layers at Scale
Language model knowledge is usually associative. For example: Korea - Seoul, USA - Washington, OpenAI - ChatGPT, and so on.
Most of an LM's knowledge is stored in the FFN layers.
This paper proposes a replacement for the FFN, called the memory layer, that stores key-value knowledge in memory.
The FFN in a Transformer stores knowledge: it takes embeddings as input and outputs embeddings containing that knowledge (the input shape and output shape are the same).
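A minimal sketch of a standard Transformer FFN block, just to make the shape contract concrete (the dimensions `d_model` and `d_ff` are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Standard Transformer feed-forward block: input and output shapes are identical."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, seq_len, d_model)
        return self.down(torch.relu(self.up(x)))

x = torch.randn(2, 16, 512)
print(FFN()(x).shape)  # torch.Size([2, 16, 512])
```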
The paper proposes complementing the FFN layer with a key-value pair memory.
Given a query, it computes the similarity between the query and all keys, picks the top-k most similar key-value pairs, and returns the weighted sum of their values as the output.
It is striking how similar the memory layer is to the attention mechanism: it has keys, queries, and values, and computes the similarity between the keys and a given query.
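A minimal sketch of this naive key-value memory lookup (all sizes and names are illustrative assumptions; a softmax over the top-k scores is one common choice for the weights):

```python
import torch
import torch.nn.functional as F

def memory_lookup(query, keys, values, k=4):
    """Naive key-value memory lookup: score all keys, take top-k, weighted-sum the values."""
    # query: (d,), keys: (N, d), values: (N, d_v)
    scores = keys @ query                      # similarity to every key: (N,)
    topk_scores, topk_idx = scores.topk(k)     # top-k most similar keys
    weights = F.softmax(topk_scores, dim=-1)   # normalize the top-k scores
    return weights @ values[topk_idx]          # weighted sum of the selected values: (d_v,)

N, d = 1024, 64
query = torch.randn(d)
keys, values = torch.randn(N, d), torch.randn(N, d)
print(memory_lookup(query, keys, values).shape)  # torch.Size([64])
```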
In the key-value memory, the lookup cost (finding the top-k most similar keys) is $O(N \cdot d)$ for $N$ keys of dimension $d$, so it grows linearly with the size of the key space. The paper uses an optimization (product keys) that splits the keys into two sets.
A key space $K \in \mathbb{R}^{N \times d}$ can be split into two half-key spaces $K_1, K_2 \in \mathbb{R}^{\sqrt{N} \times d/2}$. $K$ can be formed by pairing (concatenating) each row in $K_1$ with each row in $K_2$, giving $\sqrt{N} \times \sqrt{N} = N$ full keys.
For lookup, we also split the query into two halves $q_1, q_2$. We compute the similarities against the separate key spaces, $q_1$ with $K_1$ and $q_2$ with $K_2$, take the top-$k$ from each side, and combine the $k \times k$ candidate pairs. The overall cost becomes $O((\sqrt{N} + k^2) \cdot d)$ instead of $O(N \cdot d)$, which is scalable.
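A minimal sketch of this product-key lookup, assuming $N = \sqrt{N} \times \sqrt{N}$ and that the full-key score is the sum of the two half-key scores (names and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def product_key_lookup(query, keys1, keys2, values, k=4):
    """Product-key lookup: two half-key tables of size sqrt(N) instead of one table of size N."""
    # query: (d,), keys1/keys2: (sqrt_N, d/2), values: (sqrt_N * sqrt_N, d_v)
    d = query.shape[0]
    q1, q2 = query[: d // 2], query[d // 2 :]

    s1, i1 = (keys1 @ q1).topk(k)              # top-k over the first half-key table
    s2, i2 = (keys2 @ q2).topk(k)              # top-k over the second half-key table

    # Combine the k x k candidate pairs; a full key's score is the sum of its half scores.
    cand_scores = (s1[:, None] + s2[None, :]).reshape(-1)                # (k*k,)
    cand_idx = (i1[:, None] * keys2.shape[0] + i2[None, :]).reshape(-1)  # index into N = sqrt_N^2 values

    top_scores, top = cand_scores.topk(k)      # final top-k among the candidates
    weights = F.softmax(top_scores, dim=-1)
    return weights @ values[cand_idx[top]]     # weighted sum of the selected values

sqrt_N, d = 32, 64                             # key space of N = 1024 entries
query = torch.randn(d)
keys1, keys2 = torch.randn(sqrt_N, d // 2), torch.randn(sqrt_N, d // 2)
values = torch.randn(sqrt_N * sqrt_N, d)
print(product_key_lookup(query, keys1, keys2, values).shape)  # torch.Size([64])
```

Only the two half-key tables are ever scored in full, so the number of score computations scales with $\sqrt{N}$ rather than $N$.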
The memory layer shows a large improvement in performance without increasing the compute cost, since only the top-k memory values are activated per token.