Tokenizer
In an LLM (Large Language Model), the model understands the meaning of a corpus and generates an answer. Since deep learning models only understand vectors, we need a step that converts words (or sentences) into vectors, and vice versa. This step is called tokenization.
The overall scheme of tokenization is as follows:
Define a vocab codebook
Segment the sentence into tokens
Convert tokens into token_ids using the codebook, as sketched below.
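As a rough illustration of these three steps, here is a toy sketch with a hand-made codebook. The vocabulary, the greedy longest-match segmentation, and the `<unk>` fallback are all assumptions made for this example; real tokenizers learn the codebook with algorithms like the ones below.

```python
# Toy vocab codebook: token string -> integer id (hand-made for illustration)
vocab = {"hug": 1, "ging": 2, "face": 3, "<unk>": 0}

def tokenize(sentence):
    """Greedy longest-match segmentation against the toy codebook."""
    tokens = []
    for word in sentence.split():
        i = 0
        while i < len(word):
            # Try the longest substring starting at i that is in the vocab.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append("<unk>")  # no match: emit the unknown token
                i += 1
    return tokens

tokens = tokenize("hug huggingface")
token_ids = [vocab[t] for t in tokens]
print(tokens)     # ['hug', 'hug', 'ging', 'face']
print(token_ids)  # [1, 1, 2, 3]
```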
We are going to look at some vocab codebook generation algorithms.
BPE (Byte-Pair Encoding) is a text compression algorithm that merges the most frequent character pairs into single symbols; a toy implementation is sketched after the steps below.
For example, let's look at the sentence "hug huggingface":
Divide the sentence into characters: ["h", "u", "g", "h", "u", "g", "g", "i", "n", "g", "f", "a", "c", "e"]
Count the frequency of each character pair: ("hu", 2), ("ug", 2), ("gg", 1), ...
Merge the most frequent pair into a single symbol. If we merge "i" and "g", the new symbol is "ig"; this is fine because every symbol is eventually converted to an integer. Merging "h" and "u" gives ["hu", "g", "hu", "g", ...]
Iterate the merging step until we reach the desired vocab size: ["hug", "hug", ...]
Give an integer ID to each token: "hug" -> 1, "g" -> 2, ...
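The merge loop above can be sketched in a few lines of Python. This is a toy illustration rather than a production BPE trainer; the word frequencies and the number of merges are assumptions for the example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # e.g. ("h", "u") -> "hu"
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus from the example, pre-split into words, each word as a tuple of characters.
words = {tuple("hug"): 1, tuple("huggingface"): 1}

for _ in range(5):  # 5 merges as a stand-in for "until the desired vocab size"
    counts = get_pair_counts(words)
    if not counts:
        break
    best = max(counts, key=counts.get)
    words = merge_pair(words, best)
    print(best, "->", "".join(best))
```

On this corpus the first merges reproduce the ["hu", "g", ...] to ["hug", ...] progression from the steps above (tie-breaking between equally frequent pairs is arbitrary here).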
WordPiece encoding is similar to BPE but uses a different merging criterion: instead of merging the most frequent pair, it merges the pair with the highest score.
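A common formulation of this score, used for example in the WordPiece training recipe described in the Hugging Face NLP course and assumed here, is:

$$\text{score}(ab) = \frac{P(ab)}{P(a) \times P(b)}$$

A pair therefore gets a high score when its two parts appear together more often than their individual frequencies alone would suggest.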
The probability here means the frequency of the pair/character appearing in the sentence.
In real-world NLP applications, we use various pretrained tokenizers. The following code uses the GPT-2 pretrained tokenizer (BPE) to tokenize a given prompt.
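A minimal sketch, assuming the Hugging Face transformers library and its standard AutoTokenizer API (the prompt string is just an example):

```python
from transformers import AutoTokenizer

# Load the GPT-2 pretrained BPE tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "hug huggingface"  # example prompt; any string works
encoded = tokenizer(prompt)

print(encoded["input_ids"])                                   # integer id of each token
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the tokens those ids map to
print(tokenizer.decode(encoded["input_ids"]))                 # ids converted back to text
```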
You can see that input_ids now holds the index number of each token.