kr2en Translator using Transformer
I implemented a kr2en Translator using the Transformer model, and I want to share some interesting key points about it.
A tokenizer converts sentences into sequences of numbers.
The tokenizer first chops the sentence into pieces. Then it looks each piece up in the vocab dictionary and converts it to its index number. The tokenized output is therefore a list of numbers.
In this implementation, we used the "Helsinki-NLP/opus-mt-ko-en" tokenizer from the Hugging Face Hub. This tokenizer handles both Korean and English.
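Below is a minimal sketch of this two-step process using the Hugging Face transformers API; the exact sub-word pieces and ids depend on the vocabulary, so the printed values are only illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en")

sentence = "내 이름은 최진호입니다."

# 1) chop the sentence into sub-word pieces
pieces = tokenizer.tokenize(sentence)

# 2) look each piece up in the vocab and convert it to its index
ids = tokenizer.convert_tokens_to_ids(pieces)

print(pieces)  # list of sub-word strings
print(ids)     # list of integers, one per piece
```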
Looking at the tokenizer output, the tokenizer not only chops the sentence but also adds some strange-looking tokens at the end: "<eos>" and "<pad>". These are special tokens with special meanings.
The EOS token means the sentence has ended, and everything after it carries no meaning. This lets the Transformer train only on the meaningful part.
The PAD token means there is nothing meaningful at that position. Since the input size of the Transformer has to be consistent, we add padding to reach the fixed length.
There is also a BOS token, which means Beginning Of Sentence. However, the "Helsinki-NLP/opus-mt-ko-en" tokenizer only uses the EOS and PAD tokens.
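You can check which special tokens a tokenizer defines directly on the tokenizer object; a quick sketch (attribute names are from the Hugging Face transformers API):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en")
print(tokenizer.eos_token, tokenizer.eos_token_id)  # the EOS string and its vocab id
print(tokenizer.pad_token, tokenizer.pad_token_id)  # the PAD string and its vocab id
print(tokenizer.bos_token)                          # None: no BOS token is defined
```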
During training, we still need something to indicate that the target sentence is beginning, so we concatenate the EOS token to the front of the target sentence (the "Helsinki-NLP/opus-mt-ko-en" tokenizer's BOS token is None).
Since we added the EOS token to the front during training, at inference time we can generate a whole sentence starting from just the EOS token (this EOS token works as a BOS token).
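Here is a minimal sketch of that step, assuming PyTorch tensors and a hypothetical `eos_id`; in practice you would use `tokenizer.eos_token_id` and real target ids.

```python
import torch

# Build the decoder input for training: concatenate the EOS id to the front of
# the target sequence so that it plays the role of a BOS token.
eos_id = 0                                  # hypothetical; use tokenizer.eos_token_id in practice
trg = torch.tensor([[45, 12, 7, eos_id]])   # (batch=1, trg_len=4), hypothetical target ids

bos_col = torch.full((trg.size(0), 1), eos_id)      # EOS reused as BOS
decoder_input = torch.cat([bos_col, trg], dim=1)    # [[0, 45, 12, 7, 0]]
```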
Let's take a look at the input/output dimensions of the core Transformer model (i.e., after the encoder and decoder embeddings have been applied).
Input Dimension: batch_size * max_len * d_model
Output Dimension: batch_size * trg_seq_len * d_model
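As a quick sanity check, here is a shape sketch using `torch.nn.Transformer` with hypothetical sizes; this is not the exact model used in the project, only an illustration of the dimensions above.

```python
import torch
import torch.nn as nn

batch_size, max_len, trg_seq_len, d_model = 32, 50, 40, 512  # hypothetical sizes

src_emb = torch.randn(batch_size, max_len, d_model)      # encoder input after embedding
trg_emb = torch.randn(batch_size, trg_seq_len, d_model)  # decoder input after embedding

transformer = nn.Transformer(d_model=d_model, batch_first=True)
out = transformer(src_emb, trg_emb)
print(out.shape)  # torch.Size([32, 40, 512]) -> batch_size * trg_seq_len * d_model
```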
The Transformer's purpose is to predict the next token: every token in the target sequence predicts its next token.
For example, let's say we have the following source and target:
source sentence: "내 이름은 최진호입니다."
target sentence: "My name is jinho choi."
If we feed in the whole source sentence and the partial target sentence "My name is", the Transformer guesses the following:
What would be the next token for "My" -> Transformer predicts "nombre"
What would be the next token for "name" -> Transformer predicts "are"
What would be the next token for "is" -> Transformer predicts "jinhochoi"
We know that this Transformer is not fully trained yet, but I hope this gives an end-to-end (E2E) understanding of the Transformer.
As mentioned in the E2E understanding of the Transformer above, the Transformer predicts the next token for each target token.
But we already know the ground-truth value of what comes next, so we just have to train the Transformer to make the right prediction.
In the Transformer, we use cross-entropy loss to match the predicted distribution to the ground-truth distribution.
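A sketch of this objective with hypothetical shapes and ids is shown below: the model outputs a distribution over the vocabulary at every target position, and cross-entropy compares it against the ground-truth next token, ignoring PAD positions.

```python
import torch
import torch.nn as nn

batch_size, trg_seq_len, vocab_size, pad_id = 32, 40, 65001, 65000  # hypothetical values

logits = torch.randn(batch_size, trg_seq_len, vocab_size)           # model predictions per position
labels = torch.randint(0, vocab_size, (batch_size, trg_seq_len))    # ground-truth next tokens

criterion = nn.CrossEntropyLoss(ignore_index=pad_id)                # PAD positions do not contribute
loss = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))
```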
In the Transformer model, we use a warm-up method during training. Warm-up is a learning-rate scheduling technique that gradually increases the learning rate and then decays it.
Since the initial learning rate is extremely low, warm-up keeps the early stage of training stable and helps the model converge faster.
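For reference, here is a sketch of the warm-up schedule from the original Transformer paper ("Attention Is All You Need"); the exact schedule used in this project may differ.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    # lr rises linearly for the first warmup_steps updates, then decays ~ 1/sqrt(step)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(noam_lr(100), noam_lr(4000), noam_lr(40000))
```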
After training the Transformer, I made a function that translates a Korean sentence into an English sentence.
It starts with the EOS token, which we used as the BOS token during training; this works as a seed for sentence generation.
We iteratively generate the next token until it reaches the max length or the model outputs the EOS token.
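A minimal greedy-decoding sketch follows. It assumes a hypothetical `model(src, trg)` interface that returns logits of shape (batch, trg_len, vocab_size); the actual translate function in the project may look different.

```python
import torch

@torch.no_grad()
def translate(model, tokenizer, sentence, max_len=50):
    src = torch.tensor([tokenizer.encode(sentence)])   # (1, src_len)
    eos_id = tokenizer.eos_token_id
    trg = torch.tensor([[eos_id]])                     # seed: EOS used as BOS
    for _ in range(max_len):
        logits = model(src, trg)                       # (1, trg_len, vocab_size), assumed interface
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        trg = torch.cat([trg, next_id], dim=1)         # append the most likely next token
        if next_id.item() == eos_id:                   # stop once EOS is generated again
            break
    return tokenizer.decode(trg[0, 1:], skip_special_tokens=True)
```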