I am taking Convex Optimization this semester (2021 Spring). Here is the first chapter of my course notes. The notes are based on my understanding of the course materials provided by Prof. Zhouchen Lin. The notes only offer concise definitions and formulas, with few proofs and examples.
Mathematical Background
Topology in $\mathbb{R}^n$
open, closed, bounded, compact
Open: $\forall x\in \mathcal{C}, \exists\epsilon>0, B_{\epsilon}(x)\subseteq \mathcal{C}$
Closed: $\mathcal{C}^c$ is open
Bounded: $\exists R>0, \|x\|<R, \forall x\in \mathcal{C}$
Compact: bounded+closed
Interior, closure, boundary
Interior: $\mathcal{C}^\circ = \{x\mid \exists \epsilon>0, B_{\epsilon}(x)\subseteq \mathcal{C}\}$
The interior consists of the points that satisfy the openness condition, so the interior of an open set $\mathcal{C}$ is $\mathcal{C}$ itself.
Closure: $\overline{\mathcal{C}} = \left((\mathcal{C}^c)^\circ\right)^c$
Boundary: $\partial \mathcal{C} = \overline{\mathcal{C}} \setminus \mathcal{C}^\circ$
Analysis in $\mathbb{R}^n$
Accumulation point
Def 1: $\forall \epsilon>0$, $B_\epsilon(x^*)$ contains $x_k$ for infinitely many $k$
Def 2: $x^*$ is the limit of a convergent subsequence of $\{x_k\}$
Bolzano-Weierstrass theorem
Any bounded sequence in $\mathbb{R}^n$ contains a convergent subsequence.
Proof sketch: the sequence is bounded, so it lies in a box. Bisect the box; at least one half contains infinitely many terms of the sequence. Repeating the bisection gives nested boxes shrinking to a point, which is the limit of a subsequence.
Convergence rate
Since an iterative algorithm generally needs infinitely many operations to reach the exact solution, we use the convergence rate to evaluate an optimization algorithm.
Q-linear
Assume $x_k\to x^{*}$, and define the sequence of errors $e_k = x_k-x^{*}$.
The sequence $\{x_k\}$ Q-converges to $x^{*}$ with rate $r$ and rate constant $C$ if
$$\lim_{k\to \infty}\frac{\|e_{k+1}\|}{\|e_k\|^r} = C,\quad C<\infty$$
Linear: $r=1,\ 0<C<1$
Sublinear: $r=1,\ C=1$
Superlinear: $r=1,\ C=0$
Quadratic: $r=2$
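A quick numerical sanity check of these rate definitions (a sketch; the error sequences below are my own illustrative choices, not from the course):

```python
# Estimate the ratios ||e_{k+1}|| / ||e_k||^r for a few model error sequences.
import numpy as np

def rate_constants(errors, r):
    """Return the ratios ||e_{k+1}|| / ||e_k||^r."""
    e = np.asarray(errors, dtype=float)
    return e[1:] / e[:-1] ** r

# Linear convergence: e_k = 0.5^k  ->  ratio is C = 0.5 for r = 1.
linear_errors = 0.5 ** np.arange(10)
print(rate_constants(linear_errors, r=1))      # all 0.5

# Quadratic convergence: e_{k+1} = e_k^2 (e.g. Newton's method near a root).
quad_errors = [0.5]
for _ in range(5):
    quad_errors.append(quad_errors[-1] ** 2)
print(rate_constants(quad_errors, r=2))        # all 1.0, so C = 1 with r = 2

# Sublinear convergence: e_k = 1/k  ->  ratio tends to 1 for r = 1.
sub_errors = 1.0 / np.arange(1, 11)
print(rate_constants(sub_errors, r=1))         # tends to 1
```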
R-linear
When $\lim_{k\to \infty}\frac{\|e_{k+1}\|}{\|e_k\|^r}$ does not exist, try the R-linear definition.
The sequence $\{x_k\}$ R-linearly converges to $x^{*}$ if $\|x_k-x^{*}\|\le e_k$ for some sequence $\{e_k\}$ that converges to $0$ Q-linearly.
Continuity
$f$ is continuous at $x \in \operatorname{dom} f$ if $\forall \epsilon>0, \exists \delta>0$ such that $\mathbf{y}\in \operatorname{dom} f, \|\mathbf{y}-\mathbf{x}\|_{2} \leq \delta \Rightarrow\|f(\mathbf{y})-f(\mathbf{x})\|_{2} \leq \epsilon$.
Minimum vs Minimal point $x^*$
Minimum: $f(x^*)\le f(x)$ for all $x\in\operatorname{dom} f$.
Minimal: $f(x^*)\le f(x)$ for some sufficiently small $\epsilon>0$ and all $x \in B_\epsilon (x^*)\cap \operatorname{dom} f$.
Closedness
$f$ is closed if for each $\alpha \in \mathbb{R}$, the sublevel set $\{\mathbf{x} \in \operatorname{dom} f \mid f(\mathbf{x}) \leq \alpha\}$ is closed.
Equivalently, $\operatorname{epi} f=\left\{(\mathbf{x}, t) \in \mathbb{R}^{n+1} \mid \mathbf{x} \in \operatorname{dom} f, f(\mathbf{x}) \leq t\right\}$ is closed.
Continuous + closed domain $\Rightarrow$ closed function.
Continuous + open domain: closed iff $f\to \infty$ along every sequence converging to a boundary point of the domain.
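For instance (an illustrative example of my own, not from the course materials): $f(x)=-\log x$ with the open domain $\operatorname{dom} f=(0,\infty)$ is continuous, and $f(x_k)\to\infty$ along every sequence $x_k\to 0$, the only boundary point of the domain, so $f$ is closed; indeed its sublevel sets $\{x \mid -\log x\le\alpha\}=[e^{-\alpha},\infty)$ are closed.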
Derivative
If $f:\mathbb{R}^n\to\mathbb{R}^m$, then $Df(\mathbf{x})= \mathbf{J}_{m\times n}$ when for all choices of sequences $\{\mathbf{z}\}\subset \operatorname{dom} f$,
$$\lim _{\mathbf{z} \in \operatorname{dom} f, \mathbf{z} \neq \mathbf{x}, \mathbf{z} \rightarrow \mathbf{x}} \frac{\|f(\mathbf{z})-f(\mathbf{x})-\mathbf{J}(\mathbf{z}-\mathbf{x})\|_{2}}{\|\mathbf{z}-\mathbf{x}\|_{2}}=0$$
Letting $\mathbf{z} = \mathbf{x}+t\mathbf{e}_i$ and $t\to 0$, the $i$-th column of $\mathbf{J}$ is $\frac{\partial f(\mathbf{x})}{\partial x_{i}}$, so
$$\mathbf{J}=\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}^{T}}=\left(\frac{\partial f_{i}(\mathbf{x})}{\partial x_{j}}\right),\quad 1\le i\le m,\ 1\le j\le n.$$
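As a quick sanity check of this definition (a sketch with a toy map of my own choosing, not from the course), the Jacobian can be compared against finite differences:

```python
# Compare the analytic Jacobian of f(x) = (x1*x2, sin(x1)) with finite differences.
import numpy as np

def f(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

def jacobian_analytic(x):
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 0.0]])

def jacobian_fd(f, x, h=1e-6):
    """Approximate J column by column: J[:, i] ~ (f(x + h e_i) - f(x)) / h."""
    x = np.asarray(x, dtype=float)
    J = np.zeros((f(x).size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        J[:, i] = (f(x + e) - f(x)) / h
    return J

x = np.array([0.7, -1.2])
print(np.allclose(jacobian_analytic(x), jacobian_fd(f, x), atol=1e-4))  # True
```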
Gradient
If $f:\mathbb{R}^n\to\mathbb{R}$, then $Df(\mathbf{x})$ is a $1\times n$ row vector. We call its transpose the gradient of the function:
$$\nabla f(\mathbf{x})=D f(\mathbf{x})^{\top}$$
The first-order approximation can thus be written as $f(\mathbf{x})+\langle\nabla f(\mathbf{x}), \mathbf{z}-\mathbf{x}\rangle$.
Note that only when $f$ is real-valued do we distinguish the gradient from the derivative.
If $f(\mathbf{X})$ where $\mathbf{X}$ is a matrix, then define $\frac{\partial f}{\partial \mathbf{X}} \triangleq\left(\frac{\partial f(\mathbf{X})}{\partial X_{i j}}\right)$.
More on matrix calculus.
First-order Approximation
An alternative method of finding the derivative is by deriving a first-order approximation:
$$f(\mathbf{z}) \approx f(\mathbf{x})+Df(\mathbf{x})(\mathbf{z}-\mathbf{x})$$
Example: $f(\mathbf{X})=\log \det \mathbf{X}$, $\operatorname{dom} f = \mathbb{S}^n_{++}$
Let $\mathbf{Z}$ be close to $\mathbf{X}$, $\Delta\mathbf{X} = \mathbf{Z}-\mathbf{X}$, and $\lambda_i$ be the $i$-th eigenvalue of $\mathbf{X}^{-1/2} \Delta \mathbf{X}\mathbf{X}^{-1/2}$. Since $\Delta\mathbf{X}$ is close to $\mathbf{0}$, the $\lambda_i$ are also small.
Lemma:
$$\det \mathbf{A}\mathbf{B} = \det\mathbf{B}\mathbf{A},\quad \operatorname{tr}\mathbf{A}\mathbf{B} = \operatorname{tr}\mathbf{B}\mathbf{A},\\ \det\mathbf{A} = \prod_{i=1}^n \lambda_i(\mathbf{A}),\quad \operatorname{tr}\mathbf{A} = \sum_{i=1}^n \lambda_i(\mathbf{A}),\\ \langle\mathbf{A},\mathbf{B}\rangle = \operatorname{tr}\mathbf{A}^\top\mathbf{B}$$
(Why $\det\mathbf{A} = \prod_{i=1}^n \lambda_i(\mathbf{A})$? Because each $\lambda_i$ is the scaling factor along the corresponding eigenvector direction, and $\det$ is the total volume scaling factor.)
$$\begin{aligned} \log \operatorname{det} \mathbf{Z} &=\log \operatorname{det}(\mathbf{X}+\Delta \mathbf{X}) \\ &=\log \operatorname{det}\left(\mathbf{X}^{1 / 2}\left(\mathbf{I}+\mathbf{X}^{-1 / 2} \Delta \mathbf{X} \mathbf{X}^{-1 / 2}\right) \mathbf{X}^{1 / 2}\right) \\ &=\log \operatorname{det} \mathbf{X}+\log \operatorname{det}\left(\mathbf{I}+\mathbf{X}^{-1 / 2} \Delta \mathbf{X} \mathbf{X}^{-1 / 2}\right) \\ &=\log \operatorname{det} \mathbf{X}+\sum_{i=1}^{n} \log \left(1+\lambda_{i}\right) \approx \log \operatorname{det} \mathbf{X}+\sum_{i=1}^{n} \lambda_{i} \\ &=\log \operatorname{det} \mathbf{X}+\operatorname{tr}\left(\mathbf{X}^{-1 / 2} \Delta \mathbf{X} \mathbf{X}^{-1 / 2}\right) \\ &=\log \operatorname{det} \mathbf{X}+\operatorname{tr}\left(\mathbf{X}^{-1} \Delta \mathbf{X}\right) \\ &=\log \operatorname{det} \mathbf{X}+\operatorname{tr}\left(\mathbf{X}^{-1}(\mathbf{Z}-\mathbf{X})\right) \end{aligned}$$
Since $\langle \mathbf{A},\mathbf{B}\rangle=\operatorname{tr}(\mathbf{A}^\top\mathbf{B})$, $\nabla f(\mathbf{X}) = \mathbf{X}^{-1}$. $\blacksquare$
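A minimal numerical check of this result (a sketch with a random positive definite matrix of my own, not part of the original notes):

```python
# Verify grad log det X = X^{-1} through the first-order expansion.
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)            # a symmetric positive definite X

dX = 1e-5 * rng.standard_normal((n, n))
dX = (dX + dX.T) / 2                   # small symmetric perturbation

lhs = np.linalg.slogdet(X + dX)[1] - np.linalg.slogdet(X)[1]
rhs = np.trace(np.linalg.inv(X) @ dX)  # <grad f(X), dX> = tr(X^{-1} dX)
print(lhs, rhs)                        # nearly equal
```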
Chain rule
For $h(\mathbf{x}) = g(f(\mathbf{x}))$:
$$D h(\mathbf{x})=D g(f(\mathbf{x})) D f(\mathbf{x})$$
Remember to take the transpose to get the gradient when the composite function is real-valued: $\nabla h(\mathbf{x}) = Df(\mathbf{x})^{\top}\nabla g(f(\mathbf{x}))$.
🌟Understand derivative as a mapping
The derivative is basically a linear mapping $\Delta x \to \Delta f(x)$. Therefore, the chain rule is just a composition of such mappings.
Consider $f(\mathbf{x}) = \log \det (\mathbf{F}_0+x_1\mathbf{F}_1+\cdots+x_n\mathbf{F}_n)$. If we fix all coordinates except $x_i$, then we have two mappings:
$x_i \to \mathbf{F}_0+x_1\mathbf{F}_1+\cdots+x_n\mathbf{F}_n$
Mapping of the derivative, $\mathbb{R}\to \mathbb{R}^{n\times n}$: $\Delta x_i\to \Delta x_i\,\mathbf{F}_i$
Here, $\mathbf{F}_i$ is both the matrix of the mapping and the derivative.
$\mathbf{M} \to \log \det \mathbf{M}$
Mapping of the derivative, $\mathbb{R}^{n\times n}\to \mathbb{R}$: $\mathbf{N} \to \operatorname{tr}(\mathbf{N} \cdot \nabla \log \det \mathbf{M}) = \operatorname{tr}(\mathbf{N}\mathbf{M}^{-1})$
Here, the derivative would have size $1\times(n\times n)$, so it cannot be written down as an ordinary matrix. However, we can still write out the mapping above.
By the chain rule, composing these two mappings gives:
$$\frac{\partial f(\mathbf{x})}{\partial x_i} = \operatorname{tr}(\mathbf{F}_i\cdot \nabla \log \det\mathbf{F}) = \operatorname{tr}(\mathbf{F}^{-1}\mathbf{F}_i), \quad \mathbf{F}=\mathbf{F}_{0}+x_{1} \mathbf{F}_{1}+\cdots+x_{n} \mathbf{F}_{n}\\ \Rightarrow \nabla f(\mathbf{x})=\left[\begin{array}{c}\operatorname{tr}\left(\mathbf{F}^{-1} \mathbf{F}_{1}\right) \\ \vdots \\ \operatorname{tr}\left(\mathbf{F}^{-1} \mathbf{F}_{n}\right)\end{array}\right]$$
Why do we know the second mapping is $\mathbf{N} \to \operatorname{tr}(\mathbf{N} \cdot \nabla \log \det \mathbf{M})$?
Go back to how we computed the gradient of $f(\mathbf{X})=\log \det \mathbf{X}$, $\operatorname{dom} f = \mathbb{S}^n_{++}$. We know by the first-order approximation that
$$f(\mathbf{Z}) \approx f(\mathbf{X})+Df(\mathbf{X})(\mathbf{Z}-\mathbf{X}) = f(\mathbf{X})+ \langle \nabla f(\mathbf{X}), \mathbf{Z}-\mathbf{X}\rangle =f(\mathbf{X})+\operatorname{tr}\big((\mathbf{Z}-\mathbf{X})^\top\nabla f(\mathbf{X})\big)$$
Hence, the derivative is actually the mapping $\mathbf{Z}-\mathbf{X} \to f(\mathbf{Z}) - f(\mathbf{X}) \approx \operatorname{tr}\big((\mathbf{Z}-\mathbf{X})^\top\nabla f(\mathbf{X})\big)$.
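As a sanity check of the resulting gradient formula (a sketch with random symmetric matrices of my own choosing, not from the course materials):

```python
# Check df/dx_i = tr(F^{-1} F_i) for f(x) = log det(F_0 + x_1 F_1 + ... + x_n F_n).
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3                                    # matrix size m, number of variables n

def sym(A):
    return (A + A.T) / 2

B = rng.standard_normal((m, m))
F0 = B @ B.T + np.eye(m)                       # symmetric positive definite F_0
Fs = [sym(rng.standard_normal((m, m))) for _ in range(n)]

def f(x):
    F = F0 + sum(xi * Fi for xi, Fi in zip(x, Fs))
    return np.linalg.slogdet(F)[1]             # log det F

x = 0.01 * rng.standard_normal(n)
F = F0 + sum(xi * Fi for xi, Fi in zip(x, Fs))
grad_analytic = np.array([np.trace(np.linalg.solve(F, Fi)) for Fi in Fs])  # tr(F^{-1} F_i)

h = 1e-6
grad_fd = np.array([(f(x + h * np.eye(n)[i]) - f(x)) / h for i in range(n)])
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))                      # True
```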
Second derivative
If $f:\mathbb{R}^n\to \mathbb{R}$, then $Df(\mathbf{x})\in\mathbb{R}^{1\times n}$ (so $Df: \mathbb{R}^n\to \mathbb{R}^n$ as a map of $\mathbf{x}$), and $D^2f: \mathbb{R}^n\to \mathbb{R}^{n\times n}$.
$$D^{2} f(\mathbf{x})=\nabla^{2} f(\mathbf{x})=\frac{\partial(D f(\mathbf{x}))^{T}}{\partial \mathbf{x}^{T}}=\left(\frac{\partial^{2} f(\mathbf{x})}{\partial x_{i} \partial x_{j}}\right)=\mathbf{H}$$
Second-order approximation
It needs to satisfy:
$$\lim _{\mathbf{z} \in \operatorname{dom} f, \mathbf{z} \neq \mathbf{x}, \mathbf{z} \rightarrow \mathbf{x}} \frac{|f(\mathbf{z})-\hat{f}(\mathbf{z})|}{\|\mathbf{z}-\mathbf{x}\|_{2}^{2}}=0$$
So we have:
$$\begin{aligned}\hat{f}(\mathbf{z})&=f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{z}-\mathbf{x})+\tfrac{1}{2}(\mathbf{z}-\mathbf{x})^{\top} \nabla^{2} f(\mathbf{x})(\mathbf{z}-\mathbf{x}) \\ &=f(\mathbf{x})+D f(\mathbf{x})(\mathbf{z}-\mathbf{x})+\tfrac{1}{2}\left\langle D^{2} f(\mathbf{x})(\mathbf{z}-\mathbf{x}), \mathbf{z}-\mathbf{x}\right\rangle\end{aligned}$$
Chain rule for second derivative
Scalar function: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, $g: \mathbb{R} \rightarrow \mathbb{R}$, $h(\mathbf{x}) = g(f(\mathbf{x}))$
$$\nabla^{2} h(\mathbf{x})=g^{\prime}(f(\mathbf{x})) \nabla^{2} f(\mathbf{x})+g^{\prime \prime}(f(\mathbf{x})) \nabla f(\mathbf{x}) \nabla f(\mathbf{x})^{T}$$
Affine function: $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, $g: \mathbb{R}^m \rightarrow \mathbb{R}$, $g(\mathbf{x}) = f(\mathbf{Ax+b})$, $\mathbf{A}\in \mathbb{R}^{n\times m}$, $\mathbf{b}\in \mathbb{R}^n$
$$\nabla^{2} g(\mathbf{x})=\mathbf{A}^{\top} \nabla^{2} f(\mathbf{A} \mathbf{x}+\mathbf{b}) \mathbf{A}$$
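A quick numerical check of the affine rule (a sketch with a toy $f$ of my own, not from the notes):

```python
# Check grad^2 g(x) = A^T grad^2 f(Ax + b) A with f(y) = sum_i y_i^4.
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3                                    # f: R^n -> R, g: R^m -> R
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)

hess_f = lambda y: np.diag(12 * y**2)          # Hessian of f(y) = sum(y^4)
grad_g = lambda x: A.T @ (4 * (A @ x + b)**3)  # gradient of g(x) = f(Ax + b)

x = rng.standard_normal(m)
H_formula = A.T @ hess_f(A @ x + b) @ A        # A^T grad^2 f(Ax+b) A

h = 1e-6
H_fd = np.column_stack([(grad_g(x + h * np.eye(m)[i]) - grad_g(x)) / h
                        for i in range(m)])    # finite-difference Hessian of g
print(np.allclose(H_formula, H_fd, atol=1e-3)) # True
```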
Neural Network
Back Propagation
The error function, layer computation, and weight gradients can be written as:
$$e(\mathbf{w}, \mathbf{v}, \mathbf{y})=V\left(\mathbf{y}, \mathbf{x}_{o}\right),\quad x_{i}=\sigma\Big(\sum_{j \in \mathrm{pa}(i)} w_{i j} x_{j}\Big) = \sigma(a_i),\quad \frac{\partial V}{\partial w_{i j}}=\frac{\partial V}{\partial a_{i}} \frac{\partial a_{i}}{\partial w_{i j}}=\delta_{i} x_{j}$$
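A minimal backprop sketch in this notation (my own toy network with sigmoid layers and a squared-error loss; the sizes and loss are illustrative assumptions, not from the course):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
x_in, y = rng.standard_normal(3), rng.standard_normal(2)

# forward pass: a_i are pre-activations, x_i = sigma(a_i)
a1 = W1 @ x_in;  x1 = sigmoid(a1)
a2 = W2 @ x1;    x_out = sigmoid(a2)
V = 0.5 * np.sum((x_out - y) ** 2)             # squared-error loss V(y, x_o)

# backward pass: delta_i = dV/da_i, dV/dw_ij = delta_i * x_j
delta2 = (x_out - y) * x_out * (1 - x_out)
dW2 = np.outer(delta2, x1)
delta1 = (W2.T @ delta2) * x1 * (1 - x1)
dW1 = np.outer(delta1, x_in)

# finite-difference check on a single weight
h = 1e-6
W1p = W1.copy(); W1p[0, 0] += h
Vp = 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1p @ x_in)) - y) ** 2)
print(dW1[0, 0], (Vp - V) / h)                 # nearly equal
```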
Automatic differentiation
Reparameterization trick
Case: Parameter of sampling distribution
Motivation: how to compute the gradient of $L(\theta)=\mathbb{E}_{z \sim p_{\theta}(z)}[f(z)]$? After sampling $z$, we no longer have $\theta$ in the expression, so back propagation cannot work.
(Note that $\theta$ does not appear in the function $f(z)$ itself; the dependence on $\theta$ is only through the sampling distribution $p_\theta(z)$.)
Choose a non-parameterized distribution $q(\epsilon)$ (e.g. standard normal distribution, uniform, ...) and sample an instance $\epsilon$.
Generate a sample of $z$ by $z=g_\theta(\epsilon)$. The choice of $g_\theta$ should make $z\sim p_\theta(z)$.
Now we have $L(\theta)=\mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\tilde{f}_{\theta}(\varepsilon)\right]$, $\tilde{f}_{\theta} = f(g_\theta)$, so the sampling will not wipe out $\theta$, and:
$$\frac{\partial L(\theta)}{\partial \theta}=\frac{\partial}{\partial \theta} \mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\tilde{f}_{\theta}(\varepsilon)\right]=\mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\frac{\partial}{\partial \theta} \tilde{f}_{\theta}(\varepsilon)\right]=\mathbb{E}_{\varepsilon \sim q(\varepsilon)}\left[\frac{\partial f}{\partial g} \frac{\partial g_{\theta}(\varepsilon)}{\partial \theta}\right]$$
Example: Normal distribution
Since $\mathcal{N}(z ; \mu, \sigma^{2})$ is the distribution of $\sigma \varepsilon+\mu$ with $\varepsilon\sim\mathcal{N}(0,1)$, we sample $z = \sigma\varepsilon+\mu$ and
$$\mathbb{E}_{z \sim \mathcal{N}\left(z ; \mu, \sigma^{2}\right)}[f(z)]=\mathbb{E}_{\varepsilon \sim \mathcal{N}(\varepsilon ; 0,1)}[f(\sigma \varepsilon+\mu)].$$
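A small Monte Carlo sketch of this estimator (the integrand $f(z)=z^2$ and sample size are my own illustrative choices, not from the course):

```python
# Estimate d/d(mu, sigma) of E_{z ~ N(mu, sigma^2)}[f(z)] with z = sigma*eps + mu.
import numpy as np

rng = np.random.default_rng(4)
f  = lambda z: z**2            # toy integrand; E[f(z)] = mu^2 + sigma^2
df = lambda z: 2 * z           # its derivative

mu, sigma = 1.5, 0.7
eps = rng.standard_normal(100_000)
z = sigma * eps + mu           # reparameterized samples z ~ N(mu, sigma^2)

# dL/dmu = E[f'(z) * dz/dmu], dL/dsigma = E[f'(z) * dz/dsigma]
grad_mu = np.mean(df(z) * 1.0)
grad_sigma = np.mean(df(z) * eps)
print(grad_mu, 2 * mu)         # both close to 3.0
print(grad_sigma, 2 * sigma)   # both close to 1.4
```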
Gumbel-Softmax Trick
Case: Taking max
Motivation: how to compute the gradient of $L(\theta)=\operatorname{argmax}_{i}\left\{x_{i}(\theta) \mid i=1, \cdots, K\right\}$?
Choose $x_i$ deterministically $\to$ choose (sample) $x_i$ according to $\pi_{i}=\frac{\exp x_{i}}{\sum_{k=1}^{K} \exp x_{k}}$.
Since $\pi$ is a probability distribution, we cannot simply take $\arg\max_i \pi_i$ as the result. Instead, we need to add some randomness:
$$z =\text{onehot}\Big( \arg\max_i[g_i+\log \pi_i] \Big) =\text{onehot}\Big( \arg\max_i x_i'\Big),\quad g_i\sim \text{Gumbel}(0,1),\\ g_i = -\log(-\log u_i),\quad u_i\sim \text{Unif}[0,1]$$
Since $\arg\max$ is not differentiable, we use softmax to approximate it:
$$\pi_{\tau, i}^{\prime}=\frac{\exp \left(x_{i}^{\prime} / \tau\right)}{\sum_{k=1}^{K} \exp \left(x_{k}^{\prime} / \tau\right)}$$
Here, $\tau$ is a hyper-parameter called the temperature. The smaller $\tau$ is, the closer the softmax is to a one-hot vector.
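A small sketch of the trick (the logits, sample count, and temperatures below are my own illustrative choices, not from the course):

```python
# Gumbel-max sampling and its softmax relaxation.
import numpy as np

rng = np.random.default_rng(5)
x = np.array([1.0, 2.0, 0.5])                  # x_i(theta), treated as logits
pi = np.exp(x) / np.exp(x).sum()

def sample_gumbel(shape):
    u = rng.uniform(size=shape)
    return -np.log(-np.log(u))                 # g_i ~ Gumbel(0, 1)

# Hard samples: argmax(log pi_i + g_i) is distributed according to pi.
hard = [np.argmax(np.log(pi) + sample_gumbel(3)) for _ in range(100_000)]
print(np.bincount(hard) / len(hard), pi)       # empirical frequencies ~ pi

def gumbel_softmax(tau):
    xp = (np.log(pi) + sample_gumbel(3)) / tau
    e = np.exp(xp - xp.max())
    return e / e.sum()

print(gumbel_softmax(tau=1.0))                 # soft sample
print(gumbel_softmax(tau=0.1))                 # smaller tau -> closer to one-hot
```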
Linear Algebra
Symmetric eigenvalue decomposition (EVD)
$\mathbf{A}\in \mathbb{S}^n$ is a real symmetric $n\times n$ matrix, then $\mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^{\top}$, where $\mathbf{Q}$ collects an orthonormal set of eigenvectors and satisfies $\mathbf{Q}^\top \mathbf{Q} =\mathbf{I}$, and $\mathbf{\Lambda}=\operatorname{diag}(\lambda_1,\cdots,\lambda_n)$.
$$\operatorname{det} \mathbf{A}=\prod_{i=1}^{n} \lambda_{i}, \quad \operatorname{tr} \mathbf{A}=\sum_{i=1}^{n} \lambda_{i},\quad\|\mathbf{A}\|_{2}={\max_i|\lambda_i|},\quad\|\mathbf{A}\|_{F}=\left(\sum_{i=1}^{n} \lambda_{i}^{2}\right)^{1 / 2}$$
Definiteness
By definition, the largest and smallest eigenvalues satisfy
$$\lambda_{\max }(\mathbf{A})=\sup _{\mathbf{x} \neq 0} \frac{\mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\mathbf{x}^{\top} \mathbf{x}}, \quad \lambda_{\min }(\mathbf{A})=\inf _{\mathbf{x} \neq 0} \frac{\mathbf{x}^{\top} \mathbf{A} \mathbf{x}}{\mathbf{x}^{\top} \mathbf{x}},\\ \lambda_{\min }(\mathbf{A})\,\mathbf{x}^{\top} \mathbf{x}\le\mathbf{x}^{\top} \mathbf{A} \mathbf{x} \le \lambda_{\max }(\mathbf{A})\,\mathbf{x}^{\top} \mathbf{x}$$
For $\mathbf{A}\in \mathbb{S}^n$, positive definite ($\mathbf{A} \succ \mathbf{0}$): $\forall \mathbf{x}\ne \mathbf{0}, \mathbf{x}^\top\mathbf{A}\mathbf{x}>0 \Leftrightarrow \lambda_{\min}(\mathbf{A})>0$.
Symmetric square root
$$\mathbf{A}^{1 / 2}=\mathbf{Q} \operatorname{diag}\left(\lambda_{1}^{1 / 2}, \ldots, \lambda_{n}^{1 / 2}\right) \mathbf{Q}^{\top}$$
$\mathbf{A}^{1/2}$ is the unique symmetric positive semidefinite solution of $\mathbf{X}^2 = \mathbf{A}$.
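A quick numerical illustration (a random positive definite matrix of my own, not from the notes):

```python
# Build the symmetric square root from the eigendecomposition and verify X^2 = A.
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                       # symmetric positive definite

lam, Q = np.linalg.eigh(A)                    # A = Q diag(lam) Q^T
A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T
print(np.allclose(A_half @ A_half, A))        # True
print(np.allclose(A_half, A_half.T))          # True: the root is symmetric
```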
SVD
$\mathbf{A}\in \mathbb{R}^{m\times n}$ with rank $r$ can be factored as $\mathbf{A}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$, $\mathbf{\Sigma}=\operatorname{diag}\{\sigma_1,\cdots,\sigma_r\}$.
For a symmetric matrix, the singular values $\sigma_i$ are the absolute values $|\lambda_i|$ of its nonzero eigenvalues, sorted in descending order.
Condition number
The condition number measures the sensitivity of the output to perturbations in the input.
For $\mathbf{Ax=b}$, add perturbations to get $\mathbf{A(x+\delta x)=b+\delta b}$. Since $\|\mathbf{b}\|\le \sigma_{\max}\|\mathbf{x}\|$ and $\|\delta \mathbf{b}\|\ge \sigma_{\min}\| \delta\mathbf{x}\|$, we have:
$$\frac{\|\delta \mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(\mathbf{A}) \frac{\|\delta \mathbf{b}\|}{\|\mathbf{b}\|}$$
A nonsingular (invertible) matrix $\mathbf{A}\in \mathbb{R}^{n\times n}$ has condition number:
$$\operatorname{cond}(\mathbf{A})=\|\mathbf{A}\|_{2}\left\|\mathbf{A}^{-1}\right\|_{2}=\sigma_{\max }(\mathbf{A}) / \sigma_{\min }(\mathbf{A})$$
Pseudo-inverse
$$\mathbf{A}^{\dagger}=\mathbf{V} \boldsymbol{\Sigma}^{-1} \mathbf{U}^{\top} \in \mathbb{R}^{n \times m}$$
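A small check of this formula via the thin SVD (a random matrix of my own, not from the notes):

```python
# Build the pseudo-inverse from the thin SVD and compare with np.linalg.pinv.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))      # True
```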
Adjoint operator
Given a linear operator $\mathcal{A}:\mathbb{R}^m\to \mathbb{R}^n$, its adjoint operator $\mathcal{A}^*$ satisfies:
$$\left\langle\mathcal{A}^{*}(\mathbf{x}), \mathbf{y}\right\rangle=\langle\mathbf{x}, \mathcal{A}(\mathbf{y})\rangle, \quad \forall \mathbf{x} \in \mathbb{R}^{n}, \mathbf{y} \in \mathbb{R}^{m}$$
Example: if $\mathcal{A}(\mathbf{x}) =\mathbf{Ax}$, then $\mathcal{A}^{*}(\mathbf{x}) =\mathbf{A}^\top \mathbf{x}$.
Von Neumann trace theorem
Suppose $\mathbf{A},\mathbf{B}\in \mathbb{R}^{m\times n}$, then:
$$|\langle\mathbf{A}, \mathbf{B}\rangle| \leq \sum_{i=1}^{\min (m, n)} \sigma_{i}(\mathbf{A}) \sigma_{i}(\mathbf{B})$$
When they are both p.s.d. matrices, we have:
$$\sum_{i=1}^{n} \sigma_{i}(\mathbf{A}) \sigma_{n-i+1}(\mathbf{B}) \leq \operatorname{tr}(\mathbf{A B}) \leq \sum_{i=1}^{n} \sigma_{i}(\mathbf{A}) \sigma_{i}(\mathbf{B})$$
Norms
$f:\mathbb{R}^n\to \mathbb{R}$ is called a norm if $f$ is nonnegative, definite ($f(\mathbf{x})=0 \Leftrightarrow \mathbf{x}=\mathbf{0}$), homogeneous ($f(t\mathbf{x})=|t|f(\mathbf{x})$), and satisfies the triangle inequality ($f(\mathbf{x+y})\le f(\mathbf{x})+f(\mathbf{y})$).
Inner product
$$\langle\mathbf{x}, \mathbf{y}\rangle = \mathbf{x}^\top \mathbf{y},\qquad \langle\mathbf{X}, \mathbf{Y}\rangle = \operatorname{tr}\left(\mathbf{X}^\top \mathbf{Y}\right)$$
Unit balls
If we have a norm $\|\cdot\|$, then its unit ball is $\mathcal{B} = \{\mathbf{x}\in \mathbb{R}^n\mid \|\mathbf{x}\|\le 1\}$. It is symmetric about the origin, convex, closed, bounded, and has nonempty interior.
Conversely, if a set $\mathcal{C}$ satisfies the above properties, it is the unit ball of the norm $\|\mathbf{x}\| = \left(\sup \{t\ge 0\mid t\mathbf{x}\in \mathcal{C}\}\right)^{-1}$.
Quadratic norm
For $\mathbf{P}\in \mathbb{S}_{++}^n$, $\|\mathbf{x}\|_\mathbf{P} = (\mathbf{x}^\top \mathbf{P}\mathbf{x})^{1/2} = \|\mathbf{P}^{1/2}\mathbf{x}\|_2$. Its unit ball is an ellipsoid.
Equivalence of norms
Any two norms $\|\cdot\|_a$ and $\|\cdot\|_b$ on $\mathbb{R}^n$ satisfy, for some constants $\alpha,\beta>0$,
$$\alpha \|\mathbf{x}\|_a\le \|\mathbf{x}\|_b\le \beta \|\mathbf{x}\|_a, \quad \forall \mathbf{x}\in \mathbb{R}^n$$
This means $\|\cdot\|_a$ and $\|\cdot\|_b$ are equivalent, i.e. they define the same set of open subsets and convergent sequences.
Operator/induced norms
We can define the operator norm of $\mathbf{X}\in \mathbb{R}^{m\times n}$ induced by $\|\cdot\|_a$ on $\mathbb{R}^m$ and $\|\cdot\|_b$ on $\mathbb{R}^n$:
$$\|\mathbf{X}\|_{a,b} = \sup \{\|\mathbf{Xu}\|_a\mid \|\mathbf{u}\|_b\le 1\}$$
Matrix norms
$$\|\mathbf{X}\|_F = \left(\operatorname{tr}\left(\mathbf{X}^\top \mathbf{X}\right)\right)^{1/2},\quad \|\mathbf{X}\|_{1}=\max _{j}\sum _{i}|x_{ij}|,\quad \|\mathbf{X}\|_{2}=\sigma _{\max }(\mathbf{X}),\quad \|\mathbf{X}\|_{\infty} = \max_i\sum_{j}|x_{ij}|$$
Nuclear norm
$$\|\mathbf{A}\|_* = \sum_i\sigma_i(\mathbf{A}),\qquad \|\mathbf{X}\|_* = \max_{\mathbf{U}^\top \mathbf{U}=\mathbf{I},\ \mathbf{V}^\top \mathbf{V}=\mathbf{I}} \operatorname{tr}(\mathbf{U}^\top\mathbf{X}\mathbf{V})$$
First, we can prove $\|\mathbf{X}\|_* \ge \operatorname{tr}(\mathbf{U}^\top\mathbf{X}\mathbf{V})$ by the von Neumann trace theorem:
$$\operatorname{tr}(\mathbf{U}^\top\mathbf{X}\mathbf{V}) = \langle \mathbf{X}, \mathbf{U}\mathbf{V}^\top\rangle \le \sum_i\sigma_i(\mathbf{X})\,\sigma_i(\mathbf{U}\mathbf{I}\mathbf{V}^\top) = \sum_i\sigma_i(\mathbf{X}) = \|\mathbf{X}\|_*$$
Next, when $\mathbf{X}=\mathbf{U}_x\mathbf{\Sigma}_x\mathbf{V}_x^\top$, take $\mathbf{U}=\mathbf{U}_x$, $\mathbf{V}=\mathbf{V}_x$, and we have equality:
$$\langle \mathbf{X}, \mathbf{U}\mathbf{V}^\top\rangle =\operatorname{tr}(\mathbf{U}_x^\top\mathbf{X}\mathbf{V}_x) =\operatorname{tr}(\mathbf{\Sigma}_x)= \|\mathbf{X}\|_*$$
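A quick numerical check of the two expressions (a random matrix of my own, not from the notes):

```python
# Nuclear norm: sum of singular values vs. the trace attained at U_x, V_x.
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s.sum())                              # ||X||_* as the sum of singular values
print(np.trace(U.T @ X @ Vt.T))             # tr(U_x^T X V_x) gives the same value
```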
$(p,q)$-norm
$$\|\mathbf{A}\|_{p, q}=\left(\sum_{i=1}^{n}\left\|\mathbf{A}_{i}\right\|_{p}^{q}\right)^{\frac{1}{q}}, \quad p,q\ge 1$$
Dual norm
If we have a linear function $f_\mathbf{z}(\mathbf{x}) = \mathbf{z}^\top \mathbf{x}$, how big can its value be relative to the norm of $\mathbf{x}$? The answer is $\sup_{\mathbf{x}\ne \mathbf{0}}\frac{\mathbf{z}^\top \mathbf{x}}{\|\mathbf{x}\|}$. Hence, $\|\mathbf{z}\|^*$ can be interpreted as a stretching factor, a linear version of the operator norm. (See more intuition here.)
For a norm $\|\cdot\|$, its dual norm is:
$$\|\mathbf{z}\|^* = \sup\{\mathbf{z}^\top \mathbf{x}\mid \|\mathbf{x}\|\le 1\}$$
Examples:
$\|z\|_2\Leftrightarrow \|z\|_2$: $\|z\|_2^* = \sup_x\{\langle z,x\rangle\mid \|x\|_2\le 1\}$. By the Cauchy–Schwarz inequality, $\langle z,x\rangle\le \|x\|_2\|z\|_2\le \|z\|_2$. When $x = \frac{z}{\|z\|_2}$, $\langle z,x\rangle= \|z\|_2$. Hence, $\|z\|_2^* = \|z\|_2$.
$\|z\|_p\Leftrightarrow \|z\|_q$: $\|z\|_p^* = \sup_x\{\langle z,x\rangle\mid \|x\|_p\le 1\}$. By Hölder's inequality, $\langle z,x\rangle\le \|z\|_q\|x\|_p$ with $\frac{1}{p}+\frac{1}{q}=1$.
$\|z\|_1\Leftrightarrow \|z\|_\infty$: $|\sum_i z_ix_i|\le \sum_i|z_i||x_i|\le \left(\sum_i|x_i|\right) \max_i |z_i| = \|x\|_1\|z\|_\infty$; equality is attained by taking $x = \operatorname{sgn}(z_{i^*})\,e_{i^*}$ with $i^*=\arg\max_i|z_i|$.
$\|Z\|_2\Leftrightarrow \|Z\|_*$: $\|Z\|_{2}^*=\sup\{\operatorname{tr}(Z^{\intercal}X)\mid \|X\|_{2}\leq 1\} = \sigma_{1}(Z)+\cdots+\sigma_{r}(Z)=\operatorname{tr}\big(\sqrt{Z^{\intercal}Z}\big)$