Tokenizer


Summary

In an LLM (Large Language Model), the model understands the meaning of a corpus and generates an answer. Since deep learning models only understand vectors, we need a step that converts words (or sentences) into a vector representation (and vice versa). This step is called the tokenization step.

The overall scheme of tokenization is as follows:

  1. Define a vocab codebook

  2. Segment the sentence into tokens

  3. Convert tokens into token ids using the codebook (a toy sketch of these steps is shown below).
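
For illustration, a minimal sketch of these three steps could look like the following. The tiny codebook, the whitespace splitting, and the <unk> handling are made-up examples for this sketch, not a real tokenizer:

# 1. Define a vocab codebook (token -> integer id); this tiny vocab is made up.
codebook = {"<unk>": 0, "hello": 1, "world": 2, "!": 3}

# 2. Segment the sentence into tokens (here: naive whitespace splitting).
sentence = "hello world !"
tokens = sentence.split()

# 3. Convert tokens into token ids using the codebook (unknown tokens map to <unk>).
token_ids = [codebook.get(tok, codebook["<unk>"]) for tok in tokens]

print(tokens)     # ['hello', 'world', '!']
print(token_ids)  # [1, 2, 3]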

We are going to look at some vocab codebook generation algorithms.

Byte-Pair-Encoding(BPE)

BPE is a text compression algorithm that repeatedly merges the most frequent character pair into a single symbol.

For example, let's look at the sentence "hug huggingface":

  1. Divide the sentence into characters. ["h", "u", "g", "h", "u", "g", "g", "i", "n", "g", "f", "a", "c", "e"]

  2. Check the frequency of character-pairs. ("hu", 2), ("ug", 2), ("gg", 1), ...

  3. Merge the most frequent pair into a single symbol. If we merge "i" and "g", we set the new symbol to "ig". This is fine because we will eventually convert it to an integer anyway. ["hu", "g", "hu", "g", ...]

  4. Repeat the merging step until we reach the desired vocab size. ["hug", "hug", ...]

  5. Give an integer id to each token. "hug" -> 1, "g" -> 2, ... (a minimal sketch of this merge loop is shown below)
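
As a rough illustration, here is a minimal sketch of the merge loop, assuming we start from a character-level split and stop after a fixed number of merges. Real BPE trainers also respect word boundaries and record the learned merge rules, which this toy version does not:

from collections import Counter

def bpe_train(symbols, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(symbols)
    for _ in range(num_merges):
        # Count the frequency of every adjacent symbol pair.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # nothing left worth merging
        # Merge every occurrence of the most frequent pair into one symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

chars = list("hug huggingface".replace(" ", ""))
print(bpe_train(chars, num_merges=2))
# e.g. ['hug', 'hug', 'g', 'i', 'n', 'g', 'f', 'a', 'c', 'e']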

Wordpiece Encoding

Wordpiece encoding is similar to BPE, but it uses a different merging criterion: it merges the character pair with the highest score.

score = \frac{P(pair)}{P(pair[0]) \, P(pair[1])}

The probabilities P(pair), P(pair[0]), and P(pair[1]) are the frequencies with which the pair and the individual characters appear in the corpus. A sketch of how these scores can be computed is shown below.
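
As a simplified illustration (not the full WordPiece training procedure, which also uses the ## continuation prefix and a proper word-frequency table), the pair scores could be computed like this:

from collections import Counter

def pair_scores(symbols):
    """Score each adjacent pair by P(pair) / (P(pair[0]) * P(pair[1]))."""
    symbol_freq = Counter(symbols)
    pair_freq = Counter(zip(symbols, symbols[1:]))
    total_symbols = sum(symbol_freq.values())
    total_pairs = sum(pair_freq.values())
    scores = {}
    for (a, b), f in pair_freq.items():
        p_pair = f / total_pairs
        p_a = symbol_freq[a] / total_symbols
        p_b = symbol_freq[b] / total_symbols
        scores[(a, b)] = p_pair / (p_a * p_b)
    return scores

symbols = list("hug huggingface".replace(" ", ""))
scores = pair_scores(symbols)
print(max(scores, key=scores.get))
# e.g. ('i', 'n') -- pairs made of rare characters get boosted relative to BPE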

Using a Tokenizer

In real-world NLP applications, we use various pretrained tokenizers. The following code loads the pretrained gpt2 tokenizer (BPE) and tokenizes a given prompt.

You can see that input_ids now holds the index of each token.

from transformers import AutoTokenizer

# Load the pretrained GPT-2 (BPE) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "Hello, this is an example playing with gpt2 tokenizer. My name is ball!"

# Convert the prompt into token ids, returned as a PyTorch tensor.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids)

# output is 
# tensor([[15496,    11,   428,   318,   281,  1672,  2712,   351,   308,   457,
#            17, 11241,  7509,    13,  2011,  1438,   318,  2613,     0]])
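
Going the other way, tokenizer.decode maps the ids back to text, and convert_ids_to_tokens shows the individual byte-level BPE tokens (the exact token strings below are illustrative; the Ġ prefix marks a leading space in GPT-2's vocabulary):

# Inspect the individual tokens and decode back to the original text.
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
print(tokens)   # e.g. ['Hello', ',', 'Ġthis', 'Ġis', 'Ġan', 'Ġexample', ...]

text = tokenizer.decode(input_ids[0])
print(text)     # Hello, this is an example playing with gpt2 tokenizer. My name is ball!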

References

[1] https://databoom.tistory.com/entry/NLP-%ED%86%A0%ED%81%AC%EB%82%98%EC%9D%B4%EC%A0%80-Tokenizer

[2] https://huggingface.co/learn/nlp-course/en/chapter6/5

[3] https://huggingface.co/learn/nlp-course/en/chapter6/6?fw=pt
Figure: Tokenization result using the GPT-4 tokenizer (https://huggingface.co/spaces/Xenova/the-tokenizer-playground).
Figure: Tokenization process.