What is SVM?
SVM (Support Vector Machine) is a linear classifier that maximizes the margin between two support vectors.
Here, the support vectors are the lines that bound the data points of each class.
Solving SVM
Solving the SVM means finding the weight $w$ and bias $b$ that maximize the margin.
Support vector 1: $w^Tx + b = +1 \ ...(1)$
Support vector 2: $w^Tx + b = -1 \ ...(2)$
Let's assume that the right-hand side of each support vector equation is always $+1$ or $-1$. (We can always scale $w, b$ to satisfy this!)
We define a point on each support vector: $x^+$ on support vector 1 and $x^-$ on support vector 2, chosen so that the line $\overline{x^+x^-}$ is orthogonal to support vectors 1 and 2.
Subtracting equation (2) (evaluated at $x^-$) from equation (1) (evaluated at $x^+$) gives the following:
$$w^T(x^+ - x^-) = 2 \ ...(3)$$

$$\text{margin} = ||x^+ - x^-|| = \frac{2}{||w||} \ ...(4)$$
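The step from (3) to (4) uses the fact that $\overline{x^+x^-}$ is orthogonal to both support vectors and therefore parallel to $w$:

$$w^T(x^+ - x^-) = ||w|| \, ||x^+ - x^-|| = 2 \ \Rightarrow \ ||x^+ - x^-|| = \frac{2}{||w||}$$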
There is an important condition for SVM:

$$\forall i: \quad \text{if} \ y_i = +1, \ w^Tx_i + b \ge +1 \quad \text{or} \quad \text{if} \ y_i = -1, \ w^Tx_i + b \le -1 \ ...(5)$$

If we simplify equation (5), we can write it as
$$\forall i: \quad y_i(w^Tx_i + b) \ge +1 \ ...(6)$$

Objective of SVM
Our objective is to maximize the margin while satisfying the condition. We can express this objective in mathematical form:
$$\text{Maximize} \ \frac{2}{||w||} \quad \text{s.t.} \quad \forall i, \ y_i(w^Tx_i + b) \ge +1$$

This objective is a maximization target with inequality constraints. Using a Lagrangian, we can express it as a single minimization target. (Note that maximizing $\frac{2}{||w||}$ is the same as minimizing $\frac{1}{2}w^Tw$.)
$$L = \frac{1}{2}w^Tw - \sum_{i=1}^l \alpha_i y_i(w^Tx_i + b) + \sum_{i=1}^l \alpha_i, \quad \text{where} \ \forall i, \ \alpha_i \ge 0 \ ...(7)$$

$$\underset{w, b}{\text{argmin}} \ L$$

Let's differentiate $L$ with respect to the weight and bias:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^l \alpha_i y_i x_i$$

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^l \alpha_i y_i$$

At the point where $L$ is minimized, both derivatives are zero:
$$w^* = \sum_{i=1}^l \alpha_i y_i x_i \ ...(8)$$

$$\sum_{i=1}^l \alpha_i y_i = 0 \ ...(9)$$

Substituting (8) and (9) back into equation (7), we get
$$L = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l \sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^T x_j \ ...(10)$$

Using equation (10), we can solve for all the $\alpha_i$'s and then compute the weight and bias.
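Equation (10) follows from substituting (8) and (9) into (7):

$$\begin{aligned} L &= \frac{1}{2}w^Tw - \sum_{i=1}^l \alpha_i y_i w^Tx_i - b\sum_{i=1}^l \alpha_i y_i + \sum_{i=1}^l \alpha_i \\ &= \frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^Tx_j - \sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^Tx_j - 0 + \sum_{i=1}^l \alpha_i \\ &= \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l\sum_{j=1}^l \alpha_i \alpha_j y_i y_j x_i^Tx_j \end{aligned}$$

where the $b$ term vanishes because of (9).

As a rough sketch of what this looks like in practice (assuming NumPy and scikit-learn are available; the toy data and the large `C` value below are illustrative choices, not part of the derivation), we can fit a linear SVM and recover $w$ from the $\alpha_i y_i$ values exactly as in equation (8):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two blobs labeled +1 and -1.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([+1] * 20 + [-1] * 20)

# A linear SVM; a very large C approximates the hard-margin problem above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

alpha_y = clf.dual_coef_.ravel()   # alpha_i * y_i for the support vectors
sv = clf.support_vectors_          # the x_i with alpha_i > 0

w = alpha_y @ sv                   # equation (8): w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]

print("w from (8):       ", w)
print("w from sklearn:   ", clf.coef_.ravel())  # should match
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
```

Only the data points lying exactly on the two support vectors end up with $\alpha_i > 0$; all other points contribute nothing to $w$.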
Can SVM only do Linear Classification?
Since the support vectors are linear, does SVM only support linear classification?
No. SVM can also handle non-linear classification.
Kernel
In the example above, we used $x$ directly in the SVM calculation. What if we instead use a more complex vector derived from $x$?
$$x = (x_0, x_1)$$

$$\phi(x) = (x_0, x_1, x_0^2, x_1^2, x_0x_1)$$

The feature map $\phi(x)$ expands the dimension of $x$ from 2 to 5. We can apply SVM to high-dimensional features!
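A minimal sketch of this feature map in NumPy (the helper name `phi` is just for illustration, not from any library):

```python
import numpy as np

def phi(x):
    """Explicit feature map: (x0, x1) -> (x0, x1, x0^2, x1^2, x0*x1)."""
    x0, x1 = x
    return np.array([x0, x1, x0**2, x1**2, x0 * x1])

x = np.array([2.0, 3.0])
print(x.shape, "->", phi(x).shape)   # (2,) -> (5,)
print(phi(x))                        # [2. 3. 4. 9. 6.]
```

A linear SVM trained on $\phi(x)$ then corresponds to a non-linear (here, quadratic) decision boundary in the original two-dimensional space.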
Overhead of the kernel method
To solve the SVM with the mapped features, we need to solve the following equation:
$$\phi: \mathbb{R}^n \to \mathbb{R}^d, \quad d \gg n$$

$$L = \sum_{i=1}^l \alpha_i - \frac{1}{2}\sum_{i=1}^l \sum_{j=1}^l \alpha_i \alpha_j y_i y_j \phi(x_i)^T\phi(x_j) \ ...(11)$$

$$\underset{w, b}{\text{argmin}} \ L$$

The computational cost of this equation grows with the feature dimension: each inner product $\phi(x_i)^T\phi(x_j)$ takes $O(d)$ operations once the features are computed.
If $d = 100{,}000{,}000$, the computational cost explodes. To resolve this, we use the kernel trick.
Kernel-trick
The kernel trick is to choose a $\phi(x)$ for which the inner product $\phi(x)^T\phi(z)$ is easy to compute directly:

$$K(x, z) = \phi(x)^T\phi(z)$$

So most of the time, we use a $\phi(x)$ for which $K(x, z)$ is already known.
For example, the polynomial kernel is defined as follows:
$$K_{\text{polynomial}}(x, z) = (x^Tz + c)^m, \quad \text{where} \ c > 0$$

In most cases, we don't need to know the exact feature map $\phi(x)$; we just use a known $K(x, z)$. This is because, if a certain condition is satisfied, a $\phi(x)$ is guaranteed to exist for the kernel $K(x, z)$.
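As a quick sanity check of the kernel trick (assuming NumPy), the snippet below compares the degree-2 polynomial kernel with one explicit feature map `phi2` that reproduces it; `phi2` is written out only for illustration, since in practice we never need to form it:

```python
import numpy as np

def poly_kernel(x, z, c=1.0, m=2):
    """Polynomial kernel K(x, z) = (x^T z + c)^m, computed directly in O(n)."""
    return (x @ z + c) ** m

def phi2(x, c=1.0):
    """An explicit feature map whose dot product equals (x^T z + c)^2."""
    x0, x1 = x
    return np.array([x0**2, x1**2,
                     np.sqrt(2) * x0 * x1,
                     np.sqrt(2 * c) * x0,
                     np.sqrt(2 * c) * x1,
                     c])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly_kernel(x, z))    # 4.0 -- kernel trick, no feature map needed
print(phi2(x) @ phi2(z))    # 4.0 -- explicit map gives the same value
```

For degree $m$ and $n$-dimensional inputs, the explicit map has on the order of $n^m$ components, while evaluating the kernel directly stays $O(n)$.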
Mercer's Theorem
A symmetric function $K(x, z)$ can be expressed as a dot product

$$K(x, z) = \langle \phi(x), \phi(z) \rangle \quad \text{for some} \ \phi$$

if and only if $K(x, z)$ is positive semi-definite, i.e.,

$$\forall \{x_1, \dots, x_l\}, \ \forall c \in \mathbb{R}^l: \quad \sum_{i=1}^l \sum_{j=1}^l c_i c_j K(x_i, x_j) \ge 0$$
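A small empirical check of this condition (assuming NumPy): build the Gram matrix of the polynomial kernel on random points and verify that its eigenvalues are non-negative up to numerical error. This only checks one sample of points, of course; it is not a proof.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 2)              # 50 random 2-D data points

# Gram matrix of the polynomial kernel K(x, z) = (x^T z + 1)^2
G = (X @ X.T + 1.0) ** 2

eigvals = np.linalg.eigvalsh(G)   # eigenvalues of the symmetric Gram matrix
print("smallest eigenvalue:", eigvals.min())
print("PSD up to numerical error:", eigvals.min() > -1e-8)
```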