SVM

What is SVM?

SVM (Support Vector Machine) is a linear classifier that maximizes the margin between two support vectors.

Here, the support vectors are the lines that bound the data points of each class; the data points lying on these lines are the points closest to the decision boundary.
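As a quick illustration (a minimal sketch using scikit-learn on a made-up toy dataset, not part of the original post), a linear SVM finds the maximum-margin separating line and exposes the support vectors it used:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy dataset: two linearly separable groups of 2D points
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],   # class +1
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A large C approximates a hard-margin SVM (no slack)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], ", b =", clf.intercept_[0])
```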

Solving SVM

To solve the SVM, we need to find the weight $w$ and bias $b$ that maximize the margin.

Support vector 1: $w^Tx + b = +1 \quad ...(1)$
Support vector 2: $w^Tx + b = -1 \quad ...(2)$

Let's assume that the right-hand side of each support vector equation is always $+1$ or $-1$. (We can always scale $w, b$ to satisfy this!)

We define a pair of points, $x^+$ on support vector 1 and $x^-$ on support vector 2, chosen so that the line $\overline{x^+x^-}$ is orthogonal to support vectors 1 and 2 (i.e., parallel to $w$).

Subtracting equation (2) from equation (1) at these two points gives the following:

$$w^T(x^+ - x^-) = 2 \quad ...(3)$$

$$\text{margin} = \|x^+ - x^-\| = \frac{2}{\|w\|} \quad ...(4)$$
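One way to see the step from (3) to (4): since $x^+ - x^-$ is parallel to $w$ (both are orthogonal to the support vectors), the dot product in (3) reduces to a product of norms:

$$w^T(x^+ - x^-) = \|w\|\,\|x^+ - x^-\| = 2 \;\;\Rightarrow\;\; \|x^+ - x^-\| = \frac{2}{\|w\|}$$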

There is an important constraint in SVM:

$$\forall i: \quad \text{if } y_i = +1,\; w^Tx_i + b \ge +1 \quad \text{or} \quad \text{if } y_i = -1,\; w^Tx_i + b \le -1 \quad ...(5)$$

Since $y_i \in \{+1, -1\}$, multiplying each inequality by $y_i$ lets us simplify equation (5) into a single condition:

$$\forall i: \quad y_i(w^Tx_i + b) \ge +1 \quad ...(6)$$

Objective of SVM

Our objective is to maximize the margin while satisfying this constraint. We can express this objective in mathematical form:

$$\text{Maximize } \frac{2}{\|w\|} \quad \text{s.t.} \quad \forall i,\; y_i(w^Tx_i + b) \ge +1$$

This objective is expressed as a maximization target with inequality constraints. Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}w^Tw$, so using the Lagrangian we can express it as a single minimization target:

$$L = \frac{1}{2}w^Tw - \sum_{i=1}^l \alpha_i y_i (w^Tx_i + b) + \sum_{i=1}^l \alpha_i, \quad \text{where } \forall i,\; \alpha_i \ge 0 \quad ...(7)$$

$$\arg\min_{w,b} L$$

Let's differentiate $L$ with respect to the weight and bias:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^l \alpha_i y_i x_i$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^l \alpha_i y_i$$

At the point where $L$ is minimized, both derivatives are zero:

$$w^* = \sum_{i=1}^l \alpha_i y_i x_i \quad ...(8)$$

$$\sum_{i=1}^l \alpha_i y_i = 0 \quad ...(9)$$

If we substitute (8) and (9) back into equation (7), we get

$$L = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^T x_j \quad ...(10)$$
Derivation of equation (10)
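A sketch of that derivation, plugging (8) and (9) into (7):

$$
\begin{aligned}
L &= \frac{1}{2}{w^*}^T w^* - \sum_{i=1}^l \alpha_i y_i \left({w^*}^T x_i + b\right) + \sum_{i=1}^l \alpha_i \\
  &= \frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^T x_j
     - \sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^T x_j
     - b\underbrace{\sum_{i=1}^l \alpha_i y_i}_{=\,0 \text{ by } (9)}
     + \sum_{i=1}^l \alpha_i \\
  &= \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^T x_j
\end{aligned}
$$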

Solving equation (10) (the dual problem: maximize it over $\alpha$ subject to $\alpha_i \ge 0$ and equation (9)) gives us all the $\alpha_i$'s; the weight then follows from (8), and the bias from any support vector, as in the sketch below.
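Below is a minimal sketch of solving the dual (10) numerically with scipy on the same made-up toy dataset as above (in practice a dedicated QP/SMO solver such as the one inside scikit-learn is used):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy dataset: linearly separable
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

K = X @ X.T  # Gram matrix x_i^T x_j of the linear kernel

def neg_dual(alpha):
    # Negative of equation (10): minimizing it maximizes the dual
    return 0.5 * np.sum(np.outer(alpha * y, alpha * y) * K) - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # equation (9)
bounds = [(0.0, None)] * len(y)                        # alpha_i >= 0

alpha = minimize(neg_dual, np.zeros(len(y)),
                 bounds=bounds, constraints=constraints).x

w = (alpha * y) @ X                   # equation (8)
sv = alpha > 1e-6                     # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)        # from y_i (w^T x_i + b) = 1 on the margin

print("w =", w, ", b =", b)
print("margins:", y * (X @ w + b))    # all >= 1, equal to 1 on support vectors
```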

Can SVM only do Linear Classification?

Since the support vectors are linear, does SVM only support linear classification?

No, SVM can also handle more complex, non-linear classification.

Kernel

In the example above, we directly used $x$ in the SVM calculation. How about using more complex vectors derived from $x$?

$$x = (x_0, x_1)$$

$$\phi(x) = (x_0, x_1, x_0^2, x_1^2, x_0 x_1)$$

The feature map $\phi(x)$ expands the dimension of $x$ from 2 to 5. We can apply SVM to these high-dimensional features!
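For instance (a small sketch using scikit-learn's built-in toy data, not anything from the original post), data on two concentric circles is not linearly separable in 2D, but a polynomial-kernel SVM separates it easily:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A degree-2 polynomial kernel corresponds to a quadratic feature map,
# in which the circles become linearly separable.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)

print("training accuracy:", clf.score(X, y))  # close to 1.0
```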

Overhead in kernel method

To solve the SVM with such a feature map, we need to solve the following (the feature products $\phi(x_i)^T\phi(x_j)$ replace $x_i^T x_j$ in (10)):

$$\phi: \mathbb{R}^n \to \mathbb{R}^d, \quad d \gg n$$

$$L = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j \,\phi(x_i)^T\phi(x_j) \quad ...(11)$$

$$\arg\min_{w,b} L$$

Each inner product $\phi(x_i)^T\phi(x_j)$ now costs $O(d)$ instead of $O(n)$, and for a polynomial feature map of degree $m$ the dimension $d$ itself grows like $O(n^m)$.

If $d$ is huge (say $d = 100{,}000{,}000$), the computational cost explodes. To resolve this, we use the kernel trick.

Kernel-trick

The kernel trick is to choose a $\phi(x)$ for which the inner product $\phi(x)^T\phi(z)$ is easy to compute directly:

$$K(x, z) = \phi(x)^T\phi(z)$$

So most of the time, we use a $\phi(x)$ whose kernel $K(x, z)$ is already known.

For example, the polynomial kernel is defined as follows:

$$K_{\text{polynomial}}(x, z) = (x^Tz + c)^m, \quad \text{where } c > 0$$

In most cases, we don't even need to know the exact function $\phi(x)$; we just use a known $K(x, z)$. This is because, if a certain condition is satisfied, we are guaranteed that a $\phi(x)$ exists for the kernel $K(x, z)$.
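As a sanity check (a sketch with a hand-constructed $\phi$, which is my own assumption rather than anything defined in the post), the degree-2 polynomial kernel on $\mathbb{R}^2$ does correspond to an explicit feature map, and the kernel computes the same inner product without ever building that map:

```python
import numpy as np

def phi(x, c=1.0):
    # Explicit feature map matching (x^T z + c)^2 for x in R^2 (hypothetical)
    x0, x1 = x
    return np.array([x0**2, x1**2,
                     np.sqrt(2.0) * x0 * x1,
                     np.sqrt(2.0 * c) * x0,
                     np.sqrt(2.0 * c) * x1,
                     c])

def poly_kernel(x, z, c=1.0, m=2):
    return (x @ z + c) ** m

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))    # explicit feature-space dot product -> 4.0
print(poly_kernel(x, z))  # kernel trick, O(n) instead of O(d)  -> 4.0
```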

Mercer's Theorem

A symmetric function $K(x, z)$ can be expressed as a dot product

$$K(x, z) = \langle \phi(x), \phi(z) \rangle \quad \text{for some } \phi$$

if and only if $K(x, z)$ is positive semi-definite, i.e., for every finite set of data points $\{x_1, \dots, x_l\}$ and every coefficient vector $c \in \mathbb{R}^l$:

$$\sum_{i=1}^l \sum_{j=1}^l c_i c_j K(x_i, x_j) \ge 0$$

(equivalently, the Gram matrix $[K(x_i, x_j)]_{ij}$ is positive semi-definite).
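A quick numerical illustration of this condition (my own sketch): build the Gram matrix of the polynomial kernel on some random points and check that its eigenvalues are non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # 20 random points in R^3

def poly_kernel(x, z, c=1.0, m=2):
    return (x @ z + c) ** m

# Gram matrix G[i, j] = K(x_i, x_j); for a Mercer kernel it is PSD
G = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(G)          # G is symmetric
print("min eigenvalue:", eigvals.min())  # >= 0 up to numerical tolerance
```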

[Figure: Support Vector Machine (SVM)]
[Figure: Kernel in SVM. Source: https://medium.com/@Suraj_Yadav/what-is-kernel-trick-in-svm-interview-questions-related-to-kernel-trick-97674401c48d]