KL-Divergence

What is Kullback-Leibler Divergence

KL-Divergence is defined as the entropy difference between two probability distributions. In other words, it measures how similar (or how different) two distributions are.

Cross-entropy gets smaller as the two distributions become more similar. Let's take a look at the definition of cross-entropy.

H(P, Q) = -\sum p \cdot \log_2(q)
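As a quick numerical illustration (a minimal NumPy sketch; the distributions p, q_far, and q_near below are made-up examples over the same three outcomes), cross-entropy is large when q is far from p and approaches the entropy H(P) as q approaches p:

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i * log2(q_i) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p      = np.array([0.5, 0.3, 0.2])    # ground-truth distribution P (toy example)
q_far  = np.array([0.1, 0.1, 0.8])    # distribution far from P
q_near = np.array([0.45, 0.35, 0.2])  # distribution close to P

print(cross_entropy(p, q_far))   # larger value
print(cross_entropy(p, q_near))  # smaller value, close to H(P) ≈ 1.485
```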

Definition of KL-Divergence

Using cross-entropy, we can express KL-Divergence:

KL(P||Q) = H(P, Q) - H(P) = -\sum p \cdot \log_2 \frac{q}{p}

In almost every case in ML, the distribution P is the ground-truth distribution, which means that H(P) is a constant. Subtracting it from the cross-entropy therefore does not change what we optimize: minimizing H(P, Q) is the same as minimizing KL(P||Q).
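To make the relation concrete, here is a small sketch (assuming strictly positive discrete distributions; the arrays p and q are arbitrary toy examples) that computes KL(P||Q) both as H(P, Q) - H(P) and directly from the definition; both routes give the same number:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_i p_i * log2(p_i)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i * log2(q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """KL(P||Q) = sum_i p_i * log2(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

print(cross_entropy(p, q) - entropy(p))  # H(P, Q) - H(P)
print(kl_divergence(p, q))               # direct definition, same value
```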

Characteristics of KL-Divergence

KL(P||Q) \ge 0

Since the cross-entropy H(P, Q) is always greater than or equal to the entropy H(P), the KL-divergence is always non-negative. This follows directly from the definition above.

KL(P||Q) \ne KL(Q||P)

KL-divergence is not symmetric: swapping P and Q generally gives a different value, so it is not a true distance metric.
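A quick numerical check of both properties, reusing the toy distributions and the kl_divergence helper sketched above (not a library function):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) = sum_i p_i * log2(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

print(kl_divergence(p, q))  # non-negative
print(kl_divergence(q, p))  # non-negative, but a different value: KL is not symmetric
print(kl_divergence(p, p))  # exactly 0 when the two distributions are identical
```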

Jensen-Shannon divergence

JS-divergence is also used to express the entropy difference between two probability distributions. It is defined as follows:

JSD(P||Q) = \frac{1}{2} KL(P||M) + \frac{1}{2} KL(Q||M), \quad \text{where } M = \frac{1}{2}(P+Q)

JS-divergence is expressed in terms of KL-divergence, but it is not used as often as KL-divergence.
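A short sketch of this definition with the same toy distributions, showing that JSD, unlike KL-divergence, is symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) = sum_i p_i * log2(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

def js_divergence(p, q):
    """JSD(P||Q) = 1/2 KL(P||M) + 1/2 KL(Q||M), where M = (P + Q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.4, 0.4])

print(js_divergence(p, q))  # symmetric:
print(js_divergence(q, p))  # same value as above
```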

References

[1] https://hyunw.kim/blog/2017/10/27/KL_divergence.html