ball-blog
  • Welcome, I'm ball
  • Machine Learning - Basic
    • Entropy
    • Cross Entropy
    • KL-Divergence
    • Monte-Carlo Method
    • Variational Auto Encoder
    • SVM
    • Adam Optimizer
    • Batch Normalization
    • Tokenizer
    • Rotary Positional Encoding
    • Vector Quantized VAE(VQ-VAE)
    • DALL-E
    • Diffusion Model
    • Memory Layers at Scale
    • Chain-of-Thought
  • Einsum
  • Linear Algebra
    • Linear Transformation
    • Determinant
    • Eigen-Value Decomposition(EVD)
    • Singular-Value Decomposition(SVD)
  • AI Accelerator
    • CachedAttention
    • SGLang
    • CacheBlend
  • Reinforcement Learning
    • Markov
  • Policy-Improvement Algorithm
  • Machine Learning - Transformer
    • Attention is All you need
    • Why do we need a mask in Transformer
    • Linear Transformer
    • kr2en Translator using Tranformer
    • Segment Anything
    • MNIST, CIFAR10 Classifier using ViT
    • Finetuning PaliGemma using LoRA
    • LoRA: Low-Rank Adaptation
  • EGTR: Extracting Graph from Transformer for SGG
  • Machine Learning - Mamba
    • Function Space(Hilbert Space)
    • HIPPO Framework
    • Linear State Space Layer
    • S4(Structures Space for Sequence Model)
    • Parallel Scan Algorithm
    • Mamba Model
  • Computer System
    • Memory Ordering: Release/Acquire 1
    • Memory Ordering: Release/Acquire 2
    • BUDAlloc
    • Lock-free Hash Table
    • Address Sanitizer
  • App development
    • Bluetooth connection in linux
    • I use Bun!
    • Using Tanstack-query in Frontend
    • Deploying AI Service using EC2
  • Problem Solving
    • Bipartite Graph
    • Shortest Path Problem in Graph
    • Diameter of a Tree
  • Scribbles
Powered by GitBook
On this page
  • What is a scan?
  • Parallel Scan (Kogge-Stone Algorithm)
  • References

Was this helpful?

Edit on GitHub
  1. Machine Learning - Mamba

Parallel Scan Algorithm

What is a scan?

Let's say you have an following input sequence:

x0,x1,x2,...,xNx_0, x_1, x_2, ..., x_Nx0​,x1​,x2​,...,xN​

And you have an binary operator ⊗\otimes ⊗. ⊗\otimes ⊗ can be a multiplication, addition, subtraction or any other complex operations.

Then the meaning of scanis to calculate the following output for given sequence {xi}i=0,1,...,N\{x_i\}_{i=0, 1, ..., N}{xi​}i=0,1,...,N​ with binary operator ⊗\otimes⊗.

x0,x0⊗x1,x0⊗x1⊗x2,...,x0⊗x1⊗...⊗xN={x0⊗...⊗xi}i=0,1,...,Nx_0, x_0\otimes x_1, x_0 \otimes x_1 \otimes x_2, ..., x_0 \otimes x_1 \otimes ... \otimes x_N=\{ x_0 \otimes ... \otimes x_i\}_{i=0, 1, ..., N}x0​,x0​⊗x1​,x0​⊗x1​⊗x2​,...,x0​⊗x1​⊗...⊗xN​={x0​⊗...⊗xi​}i=0,1,...,N​

This can be easily done using for-loop. The following code is a simple implementation of scan function.

from typing import List, Callable

def operation(a, b):
    return a + b

def scan(inputs: List[int], op: Callable[[int, int], int]) -> List[int]:
    outputs = []
    for input_ in inputs:
        if len(outputs) == 0:
            outputs.append(input_)
        else:
            last_output = outputs[-1]
            outputs.append(op(last_output, input_))
    return outputs

def main():
    inputs = [1, 2, 3, 4, 5]
    outputs = scan(inputs, operation)
    print(outputs)

if __name__ == "__main__":
    main()

Parallel Scan (Kogge-Stone Algorithm)

There is a parallel algorithm for prefix-sum operation.

Let's say we have input {xi}i=0,1,...,N−1\{x_i\}_{i=0, 1, ..., N-1}{xi​}i=0,1,...,N−1​ and operation ⊗\otimes⊗.

The pseudo-code for parallel scan algorithm is as follows:

// Given inputs x[N] and ops
// Assume N = 2**k

Allocate y[N]
Initialize y[N] // y[i] = x[i] for i=0, 1, ..., N-1
stride = 1

while stride <= N:
    Allocate tmp_y[N]
    
    def thread_func(thread_num):
        tmp_y[thread_num+stride] = ops(y[thread_num], y[thread_num+stride])
        
    create_threads(N-stride, thread_func) // Generate (N-stride) threads
    
    y = tmp_y // Copy all the content of tmp_y to y
    
    Free tmp_y
    
    stride = stride * 2

As a result, the time-complexity will be O(log(N))O(log(N))O(log(N)) due to parallel computation. The number of threads won't be a botteneck if we operate this algorithm in GPU or Accelerators.

References

Last updated 3 months ago

Was this helpful?

[1]

[2]

https://junstar92.tistory.com/260
https://blog.hyelie.com/entry/%EC%9D%B4%EC%A2%85%EB%B3%91%EB%A0%AC%EC%BB%B4%ED%93%A8%ED%8C%85-Parallel-Patterns-Reduction-1
Figure of Kogge-Stone algorithm. From
https://blog.hyelie.com/entry/%EC%9D%B4%EC%A2%85%EB%B3%91%EB%A0%AC%EC%BB%B4%ED%93%A8%ED%8C%85-Parallel-Patterns-Reduction-1