Entropy

Information theory

Let's say I want to send the result of 10 coin flips using only 0s and 1s (each 0 or 1 is called a bit). We can do this by sending 1 if Head appears and 0 if Tail appears, so the 10 flips take 10 bits.

Another example is sending a number between 1 and 100.

Some might answer that we have to send 100 bits, one for each of the following 100 questions: is it 1? is it 2? is it 3? ... is it 100?

Clever people would answer $\log_2(100) \approx 6.64$, so 7 bits. This can be achieved by sending the answers to the following questions: is it bigger than 50? is it bigger than 25 (or 75, depending on the previous answer)? ...
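As a rough illustration of this binary-search argument, here is a small Python sketch (the function name encode is my own, not from the post) that turns a number between 1 and 100 into the yes/no answers of such questions and checks that 7 answers always suffice:

```python
import math

def encode(n, lo=1, hi=100):
    """Encode n in [lo, hi] as answers to 'is it bigger than the midpoint?'."""
    bits = []
    while lo < hi:
        mid = (lo + hi) // 2
        if n > mid:
            bits.append(1)  # yes, it is bigger than mid
            lo = mid + 1
        else:
            bits.append(0)  # no, it is not
            hi = mid
    return bits

print(math.log2(100))                              # ~6.64
print(max(len(encode(n)) for n in range(1, 101)))  # 7 questions always suffice
```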

Definition of Entropy

Entropy is the quantity of information. In other words, it is the number of bits required to represent the information.

Entropy with probability distribution

In the examples in Information theory, I only covered the case where every outcome is equally likely: the coin shows Head or Tail with 50% / 50% probability.

How can we express the information if the Head / Tail probability is 30% / 70%?

Would it be $\log_2(2) = 1$ bit? No. Since we know that Tail occurs more often, simply encoding the answer to "is it Tail?" isn't the optimal way to compact the information into bits.

Examples

Let's look at an example with two machines, Machine A and Machine B.

Machine A outputs the characters "A", "B", "C", "D" with equal probability, 0.25 each. Each character can be identified with two yes/no questions (first split {A, B} from {C, D}, then pick within the pair), so the amount of bits required can be calculated as

$0.25 \cdot 2 + 0.25 \cdot 2 + 0.25 \cdot 2 + 0.25 \cdot 2 = \sum p \cdot \log_2\frac{1}{p} = 2$

Machine B outputs the characters "A", "B", "C", "D" with probabilities 0.5, 0.125, 0.125, and 0.25 respectively.

Since the probabilities are different, asking the same questions as for Machine A is no longer optimal.

Instead, we can ask questions that cut the remaining probability in half each time: is it "A"? If not, is it "D"? If not, is it "B"? With this question tree, the amount of bits required can be calculated as

$0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = \sum p \cdot \log_2\frac{1}{p} = 1.75$
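To make the comparison concrete, here is a small Python sketch (my own illustration, not code from the post) that computes the expected number of questions under the trees above, next to the entropy of each machine's distribution:

```python
import math

def entropy(probs):
    """H = -sum(p * log2(p)): the minimum average number of bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# (probability, number of yes/no questions) assigned to each character
machine_a = {"A": (0.25, 2), "B": (0.25, 2), "C": (0.25, 2), "D": (0.25, 2)}
machine_b = {"A": (0.5, 1), "B": (0.125, 3), "C": (0.125, 3), "D": (0.25, 2)}

for name, machine in [("Machine A", machine_a), ("Machine B", machine_b)]:
    avg_questions = sum(p * depth for p, depth in machine.values())
    h = entropy([p for p, _ in machine.values()])
    print(f"{name}: avg questions = {avg_questions}, entropy = {h}")
# Machine A: avg questions = 2.0, entropy = 2.0
# Machine B: avg questions = 1.75, entropy = 1.75
```

The question trees are optimal here exactly because every probability is a power of 1/2, so the average number of questions matches the entropy.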

Defining Entropy in Probability Distribution

For a probability distribution, entropy can be expressed as follows:

$H = -\sum p \cdot \log_2(p)$

If the probability distribution is even (uniform), we need more bits (i.e., more questions) to express the data. However, if the probability distribution is not even, fewer bits are required to express the data.
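This also answers the biased-coin question from earlier. The entropy of a single coin flip, as a function of p, the probability of Head, peaks at p = 0.5 and shrinks as the coin gets more biased. A quick Python check (my own sketch):

```python
import math

def coin_entropy(p):
    """Entropy (in bits) of a single coin flip with P(Head) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(coin_entropy(0.5))  # 1.0    -> a fair coin needs a full bit per flip
print(coin_entropy(0.3))  # ~0.881 -> the 30% / 70% coin carries less than 1 bit
print(coin_entropy(0.9))  # ~0.469 -> the more biased, the less information
```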

References

[1] https://hyunw.kim/blog/2017/10/14/Entropy.html
[2] https://www.youtube.com/watch?v=PtmzfpV6CDE
[3] https://www.youtube.com/watch?v=2s3aJfRr9gE