This can be easily done using for-loop.
The following code is a simple implementation of scan function.
from typing import List, Callable
def operation(a, b):
return a + b
def scan(inputs: List[int], op: Callable[[int, int], int]) -> List[int]:
outputs = []
for input_ in inputs:
if len(outputs) == 0:
outputs.append(input_)
else:
last_output = outputs[-1]
outputs.append(op(last_output, input_))
return outputs
def main():
inputs = [1, 2, 3, 4, 5]
outputs = scan(inputs, operation)
print(outputs)
if __name__ == "__main__":
main()
Parallel Scan (Kogge-Stone Algorithm)
There is a parallel algorithm for prefix-sum operation.
Let's say we have input {xi}i=0,1,...,N−1 and operation ⊗.
The pseudo-code for parallel scan algorithm is as follows:
// Given inputs x[N] and ops
// Assume N = 2**k
Allocate y[N]
Initialize y[N] // y[i] = x[i] for i=0, 1, ..., N-1
stride = 1
while stride <= N:
Allocate tmp_y[N]
def thread_func(thread_num):
tmp_y[thread_num+stride] = ops(y[thread_num], y[thread_num+stride])
create_threads(N-stride, thread_func) // Generate (N-stride) threads
y = tmp_y // Copy all the content of tmp_y to y
Free tmp_y
stride = stride * 2
As a result, the time-complexity will be O(log(N)) due to parallel computation. The number of threads won't be a botteneck if we operate this algorithm in GPU or Accelerators.