Tokenizer


Summary

In an LLM (Large Language Model), the model understands the meaning of a corpus and generates an answer. Since deep learning models only understand vectors, we need a step that converts words (or sentences) into a vector representation (and vice versa). This step is called the tokenization step.

The overall scheme of tokenization is as follows:

  1. Define a vocab codebook

  2. Segment the sentence into tokens

  3. Convert tokens into token ids using the codebook (a toy sketch of these steps is shown below).
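
For illustration, a minimal sketch of these three steps could look like the following. The tiny codebook, the whitespace splitting, and the <unk> handling are made-up examples for this sketch, not a real tokenizer:

# 1. Define a vocab codebook (token -> integer id); this tiny vocab is made up.
codebook = {"<unk>": 0, "hello": 1, "world": 2, "!": 3}

# 2. Segment the sentence into tokens (here: naive whitespace splitting).
sentence = "hello world !"
tokens = sentence.split()

# 3. Convert tokens into token ids using the codebook (unknown tokens map to <unk>).
token_ids = [codebook.get(tok, codebook["<unk>"]) for tok in tokens]

print(tokens)     # ['hello', 'world', '!']
print(token_ids)  # [1, 2, 3]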

We are going to look at some vocab codebook generation algorithms.

Byte-Pair-Encoding(BPE)

BPE is a text compression algorithm that repeatedly merges the most frequent character pair into a single symbol.

For example, let's look at the sentence "hug huggingface":

  1. Divide the sentence into characters. ["h", "u", "g", "h", "u", "g", "g", "i", "n", "g", "f", "a", "c", "e"]

  2. Check the frequency of character-pairs. ("hu", 2), ("ug", 2), ("gg", 1), ...

  3. Merge the most frequent pair into a single symbol. If we merge "i" and "g", we set the new symbol to "ig". This is fine because we will eventually convert it to an integer anyway. ["hu", "g", "hu", "g", ...]

  4. Repeat the merging step until we reach the desired vocab size. ["hug", "hug", ...]

  5. Give an integer id to each token. "hug" -> 1, "g" -> 2, ... (a minimal sketch of this merge loop is shown below)
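
As a rough illustration, here is a minimal sketch of the merge loop, assuming we start from a character-level split and stop after a fixed number of merges. Real BPE trainers also respect word boundaries and record the learned merge rules, which this toy version does not:

from collections import Counter

def bpe_train(symbols, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(symbols)
    for _ in range(num_merges):
        # Count the frequency of every adjacent symbol pair.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing left worth merging
        # Merge every occurrence of the most frequent pair into one symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

chars = list("hug huggingface".replace(" ", ""))
print(bpe_train(chars, num_merges=2))
# e.g. ['hug', 'hug', 'g', 'i', 'n', 'g', 'f', 'a', 'c', 'e']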

Wordpiece Encoding

Wordpiece encoding is similar to BPE, but it uses a different merging criterion: it merges the character pair with the highest score.

score = \frac{P(pair)}{P(pair[0]) \, P(pair[1])}

The probabilities P(pair), P(pair[0]), and P(pair[1]) are the frequencies with which the pair and the individual characters appear in the corpus. A sketch of how these scores can be computed is shown below.
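
As a simplified illustration (not the full WordPiece training procedure, which also uses the ## continuation prefix and a proper word-frequency table), the pair scores could be computed like this:

from collections import Counter

def pair_scores(symbols):
    """Score each adjacent pair by P(pair) / (P(pair[0]) * P(pair[1]))."""
    symbol_freq = Counter(symbols)
    pair_freq = Counter(zip(symbols, symbols[1:]))
    total_symbols = sum(symbol_freq.values())
    total_pairs = sum(pair_freq.values())
    scores = {}
    for (a, b), f in pair_freq.items():
        p_pair = f / total_pairs
        p_a = symbol_freq[a] / total_symbols
        p_b = symbol_freq[b] / total_symbols
        scores[(a, b)] = p_pair / (p_a * p_b)
    return scores

symbols = list("hug huggingface".replace(" ", ""))
scores = pair_scores(symbols)
print(max(scores, key=scores.get))
# e.g. ('i', 'n') -- pairs made of rare characters get boosted relative to BPE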

Using a Tokenizer

In real-world NLP applications, we use various pretrained tokenizers. The following code loads the pretrained gpt2 tokenizer (BPE) and tokenizes a given prompt.

You can see that input_ids now holds the index of each token.

from transformers import AutoTokenizer

# Load the pretrained GPT-2 (BPE) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "Hello, this is an example playing with gpt2 tokenizer. My name is ball!"

# Convert the prompt into token ids, returned as a PyTorch tensor.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids)

# output is 
# tensor([[15496,    11,   428,   318,   281,  1672,  2712,   351,   308,   457,
#            17, 11241,  7509,    13,  2011,  1438,   318,  2613,     0]])
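
Going the other way, tokenizer.decode maps the ids back to text, and convert_ids_to_tokens shows the individual byte-level BPE tokens (the exact token strings below are illustrative; the Ġ prefix marks a leading space in GPT-2's vocabulary):

# Inspect the individual tokens and decode back to the original text.
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
print(tokens)   # e.g. ['Hello', ',', 'Ġthis', 'Ġis', 'Ġan', 'Ġexample', ...]

text = tokenizer.decode(input_ids[0])
print(text)     # Hello, this is an example playing with gpt2 tokenizer. My name is ball!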

References

[1] https://databoom.tistory.com/entry/NLP-%ED%86%A0%ED%81%AC%EB%82%98%EC%9D%B4%EC%A0%80-Tokenizer

[2] https://huggingface.co/learn/nlp-course/en/chapter6/5

[3] https://huggingface.co/learn/nlp-course/en/chapter6/6?fw=pt
Figure: Tokenization result using the GPT-4 tokenizer (https://huggingface.co/spaces/Xenova/the-tokenizer-playground).
Figure: Tokenization process.